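Survey of Web information extraction technologies*
Cite this article: Chen Zhao, Zhang Dongmei. Survey of Web information extraction technologies*[J]. Application Research of Computers, 2010, 27(12): 4401-4405.
Authors: Chen Zhao  Zhang Dongmei
Affiliation: School of Information Science & Technology, Beijing Forestry University, Beijing 100083
Funding: Fundamental Research Funds for the Central Universities (BLYX200928)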
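Abstract: The need to obtain the topical information of Web pages quickly and efficiently has made Web information extraction a research hotspot in information technology. Existing Web information extraction techniques fall roughly into four categories: those based on statistical theory, on visual features, on DOM tree structure, and on templates. Because Web page text itself has a tree structure and pages share a certain similarity, DOM-tree-based and template-based extraction techniques have developed rapidly and are already widely applied. The paper reviews the research progress of each category in recent years and compares their strengths and weaknesses from three angles: degree of automation, scope of application, and complexity.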

Keywords: Web information extraction    Web page noise    URL clustering    DSE algorithm    RoadRunner system    MDR    visual features    template

Survey of Web information extraction technologies
CHEN Zhao,ZHANG Dong-mei.Survey of Web information extraction technologies[J].Application Research of Computers,2010,27(12):4401-4405.
Authors:CHEN Zhao  ZHANG Dong-mei
Affiliation:(School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China)
Abstract: The need to obtain the topic content of Web pages quickly and efficiently has made Web information extraction a research focus in information technology. Existing technologies in this field can be classified into four categories: statistics-based, vision-based, DOM-tree-based, and template-based. Because Web pages have a tree structure and share a degree of similarity, DOM-tree-based and template-based technologies have developed rapidly and are widely used. This paper surveys recent research progress in these four categories and compares their advantages and disadvantages in terms of automation, application field, and complexity.
Keywords:Web information extraction  Web page noise  URL clustering  DSE algorithm  RoadRunner system  MDR algorithm  vision feature  template
This article is indexed in CNKI, Wanfang Data, and other databases.