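Research of paper metadata extraction algorithm based on theory of evidence (基于证据理论的论文元数据抽取算法研究)
Cite this article:OU YANG Hui (欧阳辉), LU Le-bin (禄乐滨). 基于证据理论的论文元数据抽取算法研究[J]. International Electronic Elements (国外电子元器件), 2010, 0(4): 66-69
Authors:OU YANG Hui (欧阳辉)  LU Le-bin (禄乐滨)
Affiliation:Telecommunication Engineering Institute, Air Force Engineering University, Xi'an, Shaanxi 710077, China
Funding:Shaanxi Province Science and Technology Research and Development Program (2007K04-11)
Abstract:Given the characteristics of PDF files, the open-source pdfbox library is used to parse them, and the target information is obtained by removing the additional document-description information such as the file header, the cross-reference table, and the trailer. Building on a study of uncertainty theory, the paper defines how to compute the credibility of each feature of the initial evidence, derives the credibility of each piece of evidence through an inference network and the reasoning algorithm of evidence theory, and finally compares these credibilities to extract the paper metadata. Experiments show that recall is above 87% and precision above 92% for every type of metadata, and that accuracy is more than 10% higher than with the commonly used regular-expression method, which greatly improves working efficiency.

Keywords:metadata extraction  uncertainty  theory of evidence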

Research of paper metadata extraction algorithm based on theory of evidence
OU YANG Hui,LU Le-bin. Research of paper metadata extraction algorithm based on theory of evidence[J]. International Electronic Elements, 2010, 0(4): 66-69
Authors:OU YANG Hui  LU Le-bin
Affiliation:(The Telecommunication Engineering Institute, Air Force Engineering University, Xi'an 710077, China)
Abstract:Based on the characteristics of PDF files, the open-source pdfbox library is used to parse them, and the body of each file is obtained by removing the additional document-description information, such as the header, the cross-reference table, and the trailer. Metadata are then extracted with an uncertainty-reasoning algorithm based on the theory of evidence: the credibility calculation for the initial evidence is defined, and the credibility of each piece of evidence is derived from it. Test results show that recall reaches 87% and precision reaches 92% in paperopen. Accuracy is more than 10% higher than with the common regular-expression method, which greatly improves efficiency.
Keywords:metadata extraction  reasoning with uncertainty  theory of evidence
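
The abstract does not give the combination formulas, so the following is only a minimal sketch of how credibilities from several features might be fused, assuming the standard Dempster rule of combination from evidence theory. It is not the authors' inference network. The labels (`title`, `author`, `other`), the two features (`font_size`, `position`), and all mass values are hypothetical and chosen purely for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1];
    the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + x * y
        else:
            conflict += x * y  # mass falling on contradictory pairs
    if conflict >= 1.0:
        raise ValueError("the two bodies of evidence are totally conflicting")
    # Renormalise so the remaining masses sum to 1.
    return {h: v / (1.0 - conflict) for h, v in combined.items()}


def belief(m, hypothesis):
    """Belief in a hypothesis: total mass of all its subsets."""
    return sum(v for h, v in m.items() if h <= hypothesis)


# Hypothetical frame of discernment for one line of extracted text.
LABELS = ("title", "author", "other")
FRAME = frozenset(LABELS)

# Hypothetical evidence: a large font supports "title"; a position near the
# top of the first page supports "title or author". Mass left on FRAME
# expresses ignorance.
font_size = {frozenset({"title"}): 0.6, FRAME: 0.4}
position = {frozenset({"title", "author"}): 0.7, FRAME: 0.3}

fused = combine(font_size, position)

# Compare the credibility of each candidate label and keep the best one.
best_label = max(LABELS, key=lambda h: belief(fused, frozenset({h})))
print(best_label, round(belief(fused, frozenset({best_label})), 2))
```

Leaving part of the mass on the whole frame lets a weak feature contribute without forcing a decision, which is one reason evidence-theoretic fusion can be more tolerant of layout variation than a single regular expression.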
This article is indexed in databases including CNKI and VIP (维普).