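Extraction of People Information from the Web Based on Semantics and Layout
Cite this article: YAN Min (燕敏), YUE Ping (岳萍), DU Kai-feng (杜开峰). 基于语义和版式的网上人物信息提取 [J]. 微计算机信息 (Control & Automation), 2010(12).
Authors: YAN Min, YUE Ping, DU Kai-feng
Affiliations: Department of Computer Science, Sichuan TOP Vocational Institute of Information Technology; Department of Pharmaceutical and Biological Engineering, School of Chemical Engineering, Sichuan University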
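Abstract: Drawing on the idea of ontologies, this paper proposes an algorithm that combines rules with statistics to extract people information from the web, automating the extraction of semi-structured people information. A dictionary of 4,624 valid field names was built by programmatic statistics and is used to check whether an extracted field name is valid; the corresponding field value is extracted only when the field name is valid, which greatly improves extraction accuracy. Experimental results show that the algorithm is effective on semi-structured web pages about people, with an average precision of 97.6% and an average recall of 86.1%.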
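
The dictionary check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `VALID_FIELD_NAMES` set stands in for the 4,624-entry dictionary, and the "field name, colon, field value" line pattern is an assumed layout rule chosen only for illustration. The function name `extract_fields` is likewise hypothetical.

```python
import re

# Hypothetical stand-in for the paper's 4,624-entry field-name dictionary.
VALID_FIELD_NAMES = {"Name", "Gender", "Date of birth", "Title"}

# Assumed layout rule: "field name" + ASCII or full-width colon + "field value".
FIELD_PATTERN = re.compile(
    r"^\s*(?P<name>[^::]+?)\s*[::]\s*(?P<value>.+?)\s*$"
)


def extract_fields(lines, valid_names=VALID_FIELD_NAMES):
    """Return {field name: field value} for lines whose name is in the dictionary."""
    record = {}
    for line in lines:
        match = FIELD_PATTERN.match(line)
        if match is None:
            continue  # the line does not look like a "name: value" pair
        name = match.group("name")
        if name not in valid_names:
            continue  # reject candidates that are not known field names
        # The value is taken only after the field name has been validated.
        record[name] = match.group("value")
    return record


if __name__ == "__main__":
    page_lines = [
        "Name: Zhang San",
        "Gender:Male",
        "Copyright: Some Company",  # looks like a field, but not in the dictionary
        "free text without a separator",
    ]
    print(extract_fields(page_lines))
```

Filtering on the field name before reading the value is what keeps page furniture such as copyright lines out of the record, which matches the abstract's pattern of high precision with lower recall.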

Keywords: Web information extraction; extraction rules; semi-structured web pages; XML; layout analysis

Extraction of people information from web based on semantic and format
YAN Min, YUE Ping, DU Kai-feng. Extraction of people information from web based on semantic and format [J]. Control & Automation, 2010(12).
Authors: YAN Min, YUE Ping, DU Kai-feng
Affiliation: YAN Min, YUE Ping (Dept. of Computer, Sichuan TOP Vocational Institute of Information Technology, Chengdu 611743, China); DU Kai-feng (Dept. of Pharmaceutical & Biological Engineering, School of Chemical Engineering, Sichuan University, Chengdu 610065, China)
Abstract: This paper presents an algorithm for extracting people information from the web. It combines rules with statistics and draws on the idea of ontology to automatically extract information from semi-structured people pages. A field-name dictionary containing 4,624 valid field names was built by programmatic statistics and is used to check the validity of each extracted field name. The precision of the IE was greatly raised becau...
Keywords: Web IE; IE rules; semi-structured web pages; XML; web page format analysis
This article is indexed in CNKI and other databases.