
Web信息抽取技术综述* (Survey of Web information extraction technologies)

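Cite this article:	CHEN Zhao, ZHANG Dong-mei. Survey of Web information extraction technologies*[J]. Application Research of Computers, 2010, 27(12): 4401-4405.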

Authors:	CHEN Zhao (陈钊), ZHANG Dong-mei (张冬梅)

Affiliation:	School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China

Funding:	Supported by the Fundamental Research Funds for the Central Universities (BLYX200928)

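Abstract:	The need to obtain the topical content of Web pages quickly and efficiently has made Web information extraction a research hotspot in information technology. Existing Web information extraction techniques can be roughly grouped into statistics-based, vision-feature-based, DOM-tree-based, and template-based categories. Because Web page text is itself tree-structured and pages share a degree of similarity, DOM-tree-based and template-based extraction techniques have developed rapidly and are already widely applied. This paper discusses the research progress of each category in recent years, and compares their strengths and weaknesses from three angles: degree of automation, scope of application, and complexity.
Keywords:	Web information extraction; Web page noise; URL clustering; DSE algorithm; RoadRunner system; MDR; vision feature; template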
Survey of Web information extraction technologies

CHEN Zhao, ZHANG Dong-mei. Survey of Web information extraction technologies[J]. Application Research of Computers, 2010, 27(12): 4401-4405.

Authors:	CHEN Zhao, ZHANG Dong-mei

Affiliation:	(School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China)

Abstract:	The need to obtain the topical content of Web pages quickly and efficiently has made Web information extraction a research focus in information technology. Existing techniques in this field can be broadly classified into four categories: statistics-based, vision-based, DOM-tree-based, and template-based. Because Web pages are inherently tree-structured and share a degree of similarity, DOM-tree-based and template-based techniques have developed rapidly and are widely used. This paper surveys recent progress in these four categories and compares their advantages and disadvantages in terms of degree of automation, scope of application, and complexity.

Keywords:	Web information extraction; Web page noise; URL clustering; DSE algorithm; RoadRunner system; MDR algorithm; vision feature; template
This article is indexed in databases including CNKI and Wanfang Data.