
Web Page Information Extraction Based on HTML Structural Features

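Cite this article:	HU Yu, WANG Li-zhi. Web Page Information Extraction Based on HTML Structural Features[J]. Journal of Liaoning University of Petroleum & Chemical Technology, 2009, 29(3): 65-69. DOI: 10.3696/j.issn.1672-6952.2009.03.019

Authors:	HU Yu, WANG Li-zhi

Affiliations:	1. School of Computer Science and Technology, Tianjin University, Tianjin 300072; 2. School of Management, Tianjin University, Tianjin 300072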

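Abstract:	Much of the information on the Web is stored in HTML pages. The traditional approach to web data extraction uses a wrapper to extract the data of interest from a page. Acquiring the pattern-recognition knowledge that a wrapper needs is time-consuming, laborious work that demands a high level of intelligence. This paper avoids wrappers. Targeting the structural characteristics of news web pages, it partitions the page space into noise and information entities from a visual perspective and classifies them. It discusses a method for extracting the news body based on the hierarchical structure of news pages and on statistics for the nodes at each level. The traditional DOM model is improved by adding attributes such as level and style as the basis for noise judgment, and by attaching statistical information to its nodes. Using explicit features of news such as the title and the time, a method is proposed and implemented that combines forward direct extraction with reverse noise-reduction extraction to obtain structured data from news web pages. Experimental results show that the method is effective for extracting the main content of news web pages.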
Keywords:	information extraction; DOM; LA-DOM; HTML parsing; noise; tag
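
The idea in the abstract of annotating DOM nodes with a level and with text statistics, then discarding noisy nodes, can be illustrated with a minimal sketch. This is not the paper's LA-DOM: the class and function names, the link-density statistic, the 0.5 threshold, and the `min_level` cutoff are all assumptions chosen for illustration, and the parser assumes well-formed markup.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """DOM element annotated with its level and simple text statistics."""

    def __init__(self, tag, level, parent=None):
        self.tag, self.level, self.parent = tag, level, parent
        self.children = []  # child Nodes and text strings, in document order
        self.text_len = 0   # characters of text in the whole subtree
        self.link_len = 0   # characters of that text that sit inside <a>


class AnnotatedDomBuilder(HTMLParser):
    """Builds the annotated tree; assumes every non-void tag is closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", 0)
        self.current = self.root
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current.level + 1, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        # Ignore void tags and close tags that do not match the open element.
        if tag in VOID_TAGS or tag != self.current.tag:
            return
        if tag == "a":
            self.link_depth -= 1
        self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current.tag in ("script", "style"):
            return
        self.current.children.append(text)
        node = self.current
        while node is not None:  # propagate the statistics to every ancestor
            node.text_len += len(text)
            if self.link_depth:
                node.link_len += len(text)
            node = node.parent


def is_noise(node, max_link_density, min_level):
    """Assumed rule: a deep enough container made mostly of link text is noise."""
    if node.tag == "a" or node.level < min_level or node.text_len == 0:
        return False
    return node.link_len / node.text_len > max_link_density


def kept_text(node, max_link_density, min_level):
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif not is_noise(child, max_link_density, min_level):
            parts.extend(kept_text(child, max_link_density, min_level))
    return parts


def extract_main_text(html, max_link_density=0.5, min_level=3):
    builder = AnnotatedDomBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(kept_text(builder.root, max_link_density, min_level))
```

Calling `extract_main_text(page_html)` on a page whose navigation bar and footer consist only of links drops those blocks and keeps the headline and paragraphs. The `min_level` cutoff keeps top-level wrappers such as `html` and `body` from being pruned; a real system would also need the title and time cues that the abstract mentions.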
Page Information Extraction Based on the Structure of the HTML

HU Yu,WANG Li-zhi. Page Information Extraction Based on the Structure of the HTML[J]. Journal of Liaoning University of Petroleum & Chemical Technology, 2009, 29(3): 65-69. DOI: 10.3696/j.issn.1672-6952.2009.03.019

Authors:	HU Yu, WANG Li-zhi

Affiliation:	1. Department of Computer Science and Technology, Tianjin University, Tianjin 300072, P.R. China; 2. Department of Management, Tianjin University, Tianjin 300072, P.R. China

Abstract:	A large amount of information on the Web is stored as HTML documents. The traditional web page data extraction method uses a wrapper to collect the data of interest. A wrapper requires acquiring pattern recognition knowledge, which is time-consuming, laborious work that demands a high level of intelligence. Based on the structural features of news web pages, and from a visual perspective, the web page's space structure was partitioned into noise and information entities. A method of extracting the principal part of news web pages was d...

Keywords:	DOM LA-DOM
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
