Implementation of Web information extraction system based on similar pages


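Citation:	GONG Zheng-xian, ZHU Qiao-ming, LI Pei-feng. Implementation of Web information extraction system based on similar pages[J]. Journal of Computer Applications, 2006, 26(8): 1983-1986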

Authors:	GONG Zheng-xian, ZHU Qiao-ming, LI Pei-feng

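Affiliation:	School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Key Laboratory of Computer Information Processing Technology of Jiangsu Province, Suzhou, Jiangsu 215006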

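Funding:	Jiangsu Province High-Tech Research and Development Program; Natural Science Foundation of the Jiangsu Provincial Department of Education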

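Abstract:	The core algorithm of RoadRunner is analyzed. To address RoadRunner's shortcomings, and drawing on research results from both the automatic and semi-automatic extraction stages, a Web information extraction system based on similar pages was designed and implemented. The system architecture and the key implementation techniques are described, including how similar pages are obtained, reliable noise handling, and the algorithm for automatically inducing extraction rules.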
Keywords:	Web page; RoadRunner; similar pages; information extraction
Article ID:	1001-9081(2006)08-1983-04
Received:	2006-02-13
Revised:	2006-03-17
Implementation of Web information extraction system based on similar pages

GONG Zheng-xian, ZHU Qiao-ming, LI Pei-feng. Implementation of Web information extraction system based on similar pages[J]. Journal of Computer Applications, 2006, 26(8): 1983-1986

Authors:	GONG Zheng-xian, ZHU Qiao-ming, LI Pei-feng

Affiliation:	1. School of Computer Science and Technology of Soochow University, Suzhou Jiangsu 215006, China; 2. Key Laboratory of Computer Information Processing Technology of Jiangsu Province, Suzhou Jiangsu 215006, China

Abstract:	The core algorithm of RoadRunner was analyzed. To address the deficiencies of RoadRunner, a Web information extraction system based on similar pages was designed and implemented. The system architecture was introduced, followed by the key techniques: obtaining similar Web pages, reliably handling noisy Web blocks, and automatically inducing rules for extracting data items.

Keywords:	Web page; RoadRunner; similar pages; information extraction
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.