Fully Automatic Wrapper Generation for Web Information Extraction

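Cite this article:	MEI Xue, CHENG Xue-qi, GUO Yan, ZHANG Gang, DING Guo-dong. Fully Automatic Wrapper Generation for Web Information Extraction [J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29.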

Authors:	MEI Xue, CHENG Xue-qi, GUO Yan, ZHANG Gang, DING Guo-dong

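Affiliation:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China; 2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China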

Funding:	National High Technology Research and Development Program of China (863 Program)

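Abstract:	Web information extraction has attracted wide attention in recent years. How to obtain the main data from large numbers of Web pages as quickly and accurately as possible has become a research focus in this field. This paper proposes a method for fully automatic generation of wrappers for Web information extraction. The method exploits the structured, hierarchical nature of Web page design templates: it applies a Web page link classification algorithm and a Web page structure separation algorithm to extract the individual information units in a page and to output a corresponding wrapper. The wrapper can then be used to extract information automatically from pages of the same type. Experimental results show that the method achieves automatic extraction of both strictly structured and loosely structured information in Web pages, with very high accuracy.
Keywords:	computer application; Chinese information processing; Web information extraction; Web page structure separation; wrapper
Article ID:	1003-0077(2008)01-0022-08
Received:	2007-05-21
Revised:	2007-12-03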
Fully Automatic Wrapper Generation for Web Information Extraction

MEI Xue, CHENG Xue-qi, GUO Yan, ZHANG Gang, DING Guo-dong. Fully Automatic Wrapper Generation for Web Information Extraction [J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29.

Authors:	MEI Xue, CHENG Xue-qi, GUO Yan, ZHANG Gang, DING Guo-dong

Affiliation:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China; 2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:	Web information extraction has been a hot topic in recent years. The challenge is how to extract important information from a large number of Web pages as quickly and accurately as possible. This paper proposes a novel method for fully automatic wrapper generation for Web information extraction. The method makes extensive use of the structure of Web page templates. It uses a Web Page Link Sort algorithm and a Web Page Structure Separator algorithm to extract information from Web pages and to output a corresponding wrapper. Experimental results show that the method performs well on both rigidly and loosely structured records in Web pages.

Keywords:	computer application; Chinese information processing; Web information extraction; Web structure separator; wrapper
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.