
Extraction of people information from web based on semantic and format

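Cite this article:	YAN Min, YUE Ping, DU Kai-feng. Extraction of people information from web based on semantic and format [J]. 微计算机信息 (Control & Automation), 2010(12).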

Authors:	YAN Min, YUE Ping, DU Kai-feng

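Affiliations:	Computer Department, Sichuan TOP Vocational Institute of Information Technology; Department of Pharmaceutical and Biological Engineering, School of Chemical Engineering, Sichuan University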

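Abstract:	Drawing on the idea of ontology, this paper proposes an algorithm for extracting people information from the web that combines rule-based and statistical methods, achieving automatic extraction of semi-structured people information. A dictionary of 4,624 valid field names was built by programmatic statistics and is used to check whether an extracted field name is valid; the corresponding field value is extracted only when the field name is valid, which greatly improves extraction accuracy. Experimental results show that the algorithm is highly effective for extracting information from semi-structured web pages about people, with an average precision of 97.6% and an average recall of 86.1%.
Keywords:	Web information extraction; extraction rules; semi-structured web pages; XML; layout analysis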
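
The dictionary check described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tiny dictionary, the "name: value" line layout, the regular expression, the sample data, and the function name `extract_person_fields` are all assumptions made for illustration. It only shows the general idea of accepting a field value when its field name appears in a dictionary of valid names.

```python
import re

# Hypothetical stand-in for the paper's 4,624-entry field-name dictionary.
VALID_FIELD_NAMES = {"Name", "Gender", "Date of birth", "Title"}

# Assumed layout: field name, then an ASCII or full-width colon, then the value.
PAIR_PATTERN = re.compile(r"^\s*([^:：]{1,20}?)\s*[:：]\s*(.+?)\s*$")


def extract_person_fields(lines, valid_names=VALID_FIELD_NAMES):
    """Return {field name: field value} for lines whose name is in the dictionary."""
    record = {}
    for line in lines:
        match = PAIR_PATTERN.match(line)
        if not match:
            continue  # no "name: value" shape on this line
        name, value = match.group(1), match.group(2)
        if name in valid_names:  # extract the value only for a valid field name
            record[name] = value
    return record


# Made-up sample lines, as they might appear after stripping HTML tags.
sample = [
    "Name: Zhang San",
    "Gender：Male",
    "Copyright: Example Site",
    "Welcome to this site",
]
result = extract_person_fields(sample)
```

In this sketch the "Copyright" line is rejected because its name is not in the dictionary, which is the mechanism the abstract credits for the high precision; the lower recall would correspond to valid fields whose names are missing from the dictionary or whose layout does not match the assumed pattern.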
Extraction of people information from web based on semantic and format

YAN Min, YUE Ping, DU Kai-feng. Extraction of people information from web based on semantic and format [J]. Control & Automation, 2010(12).

Authors:	YAN Min, YUE Ping, DU Kai-feng

Affiliation:	YAN Min, YUE Ping (Dept. of Computer, Sichuan TOP Vocational Institute of Information Technology, Chengdu 611743, China); DU Kai-feng (Dept. of Pharmaceutical & Biological Engineering, School of Chemical Engineering, Sichuan University, Chengdu 610065, China)

Abstract:	This paper presents an algorithm for extracting people information from the web that combines rules and statistics and draws on the idea of ontology, in order to extract semi-structured people information automatically. A field-name dictionary containing 4,624 valid field names was built by programmatic statistics and is used to check the validity of each extracted field name. The precision of the IE was greatly raised becau...

Keywords:	Web IE; IE rules; semi-structured web pages; XML; web page format analysis
This article has been indexed by CNKI and other databases.
