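Content extraction of theme pages based on body features and page structure

Cite this article:	DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan. Content extraction of theme pages based on body features and page structure[J]. Computer Engineering and Applications (计算机工程与应用), 2012, 48(30): 151-156

Authors:	DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan

Affiliations:	1. School of Management Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024; 2. Department of Economics, Environmental Management College of China, Qinhuangdao, Hebei 066004

Funding:	Sub-project (No. 70890083) of a Major Program of the National Natural Science Foundation of China (No. 70890080); Humanities and Social Sciences Research Project of the Ministry of Education (No. 09YJA870005)

Abstract:	Extracting the body text of Web pages is the basis of Web information processing tasks such as information retrieval and text mining. Building on a statistical analysis of the body-text and structural features of theme pages, the paper proposes a body-text extraction method for theme pages that combines the features of the body text with the characteristics of HTML tags. After a Web page is parsed into a DOM tree, the body-text block is obtained from the structure of that tree; the characteristics of the noise inside the block are then analysed, and the in-block noise is removed. Experiments show that the method achieves good precision and recall.
Keywords:	body features; tag information; body-text extraction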
Content extraction of theme pages based on body feature and page structure

DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan. Content extraction of theme pages based on body feature and page structure[J]. Computer Engineering and Applications, 2012, 48(30): 151-156

Authors:	DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan

Affiliation:	1. School of Management, Dalian University of Technology, Dalian, Liaoning 116024, China; 2. Department of Economics, Environmental Management College of China, Qinhuangdao, Hebei 066004, China

Abstract:	Web text extraction underpins Web information processing tasks such as information retrieval and text mining. Based on a statistical analysis of the body-text features and structural characteristics of theme pages, this paper proposes a text extraction method for theme pages that combines the features of the page text with the characteristics of HTML tags. Each Web page is parsed into a DOM tree and the text content block is located from the tree structure; the characteristics of the noise inside that block are then analysed so that the noise can be removed. Experiments show that the method achieves good precision and recall.

Keywords:	body feature; tag information; content extraction
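
The two-stage pipeline described in the abstract (locate the text block in the DOM tree, then strip noise inside it) can be illustrated with a minimal sketch. The abstract does not state which body features or tag rules the authors use, so everything specific here is an assumption: the block score (non-link text directly owned by a block), the link-density noise test, the 0.5 threshold, and the names `extract_main_text`, `own_text_len` and `link_density` are hypothetical stand-ins, not the paper's method.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base", "col", "wbr"}
SKIP = {"script", "style"}                       # content never shown to readers
BLOCKS = {"div", "td", "article", "section", "body"}  # assumed block containers


class Node:
    """A minimal DOM node; text chunks are stored as '#text' children."""
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text, self.children = tag, parent, text, []
        if parent is not None:
            parent.children.append(self)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from (possibly sloppy) HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent           # stray end tags are ignored

    def handle_data(self, data):
        if data.strip() and self.current.tag not in SKIP:
            Node("#text", self.current, data.strip())


def text_len(node, links_only=False, in_link=False):
    """Characters of text under node; optionally only text inside <a>."""
    in_link = in_link or node.tag == "a"
    own = len(node.text) if (in_link or not links_only) else 0
    return own + sum(text_len(c, links_only, in_link) for c in node.children)


def link_density(node):
    total = text_len(node)
    return text_len(node, links_only=True) / total if total else 1.0


def own_text_len(node):
    """Non-link text whose nearest block-level ancestor is this node."""
    return sum(len(c.text) + own_text_len(c)
               for c in node.children if c.tag not in BLOCKS and c.tag != "a")


def extract_main_text(html, max_link_density=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = [n for n in builder.root.walk() if n.tag in BLOCKS]
    if not blocks:
        return ""
    best = max(blocks, key=own_text_len)         # stage 1: pick the text block
    parts = []

    def collect(node):                           # stage 2: drop in-block noise
        if node.tag == "#text":
            parts.append(node.text)
            return
        if node is not best and link_density(node) > max_link_density:
            return                               # link-heavy subtree = noise
        for child in node.children:
            collect(child)

    collect(best)
    return " ".join(parts)


SAMPLE_HTML = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
<p>Web text extraction is the foundation of Web information processing.</p>
<p>The method parses each page into a DOM tree.</p>
<div class="related"><a href="/a">Related story one</a></div>
</div>
<div id="footer"><a href="/about">About</a> | <a href="/contact">Contact</a></div>
</body></html>"""

result = extract_main_text(SAMPLE_HTML)
print(result)
```

On the sample page, the navigation and footer blocks lose at stage 1 because almost all of their text sits inside links, and the "related" list nested in the content block is dropped at stage 2 for the same reason. A real page would likely need further features (punctuation counts, tag-specific rules), which the abstract mentions only in general terms.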
This article is indexed in CNKI, Wanfang Data, and other databases.