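Research on a Logical Structure Annotation Method for Re-flowable Documents for Machine Learning (面向机器学习的流式文档逻辑结构标注方法研究)
Cite this article:刘倩,李宁,田英爱.面向机器学习的流式文档逻辑结构标注方法研究[J].中文信息学报,2019,33(9):50.
Authors:刘倩 (LIU Qian)  李宁 (LI Ning)  田英爱 (TIAN Yingai)
Affiliation:School of Computer, Beijing Information Science & Technology University (北京信息科技大学 计算机学院), Beijing 100101
Funding:National Natural Science Foundation of China (61672105); National Key Research and Development Program of China (2018YFB1004100)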
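Abstract:Machine learning approaches to recognizing the structure of re-flowable documents suffer from scarce corpora and complex corpus annotation. To address this, and building on a study of the logical structure and editing-semantic features of documents, this paper establishes an annotation scheme for the logical structure of re-flowable documents and proposes a three-stage semi-automatic annotation method: in the first stage, document metadata is annotated separately through machine-assisted manual work; in the second stage, the logical structure is reconstructed automatically; in the third stage, the feature vectors are filled in automatically. Experimental results show that the proposed annotation method reduces manual cost and improves the precision and recall of machine learning algorithms for document structure recognition, with the F-score reaching 97.5%.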

Keywords:structure annotation  document structure recognition  machine learning

Annotation of Logical Structure in Re-flowable Document for Machine Learning
LIU Qian,LI Ning,TIAN Yingai.Annotation of Logical Structure in Re-flowable Document for Machine Learning[J].Journal of Chinese Information Processing,2019,33(9):50.
Authors:LIU Qian  LI Ning  TIAN Yingai
Affiliation:School of Computer, Beijing Information Science & Technology University, Beijing 100101, China
Abstract:To construct a corpus of logical structure in re-flowable documents for machine learning, this paper proposes a three-stage semi-automatic annotation method based on logical structure features and editing semantic features. In the first stage, document metadata is identified and annotated with machine assistance; in the second stage, the logical structure of the document is reconstructed automatically; in the third stage, the feature vectors are produced automatically. The experimental results show that the proposed method reduces manual cost, and that the resulting corpus improves document structure recognition by machine learning algorithms, with the F-score reaching 97.5%.
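For readers unfamiliar with the metric, the F-score reported in the abstract is conventionally the harmonic mean of precision and recall. The abstract does not say which variant was used, so the balanced F1 form (beta = 1) in the sketch below is an assumption; the function name `f_score` and the input values are illustrative only and are not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, a weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F1 score (assumed here; the abstract
    does not state which variant the authors used).
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Both inputs are zero: define the score as 0 to avoid division by zero.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Hypothetical inputs for illustration, not results from the paper.
example_precision = 0.9
example_recall = 0.8
print(f"F1 = {f_score(example_precision, example_recall):.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high F-score requires both precision and recall to be high.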
Keywords:structure annotation  document structure recognition  machine learning  