
A Method for Autonomous Extraction of Web Information

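Cite this article:	Xu Jianchao, Hou Kun. A Method for Autonomous Extraction of Web Information[J]. Computer Engineering and Applications, 2005, 41(14): 185-189, 198.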

Author names:	Xu Jianchao, Hou Kun

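Author affiliations:	1. School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012; Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130026 2. School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012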

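Abstract:	This paper proposes a method for autonomously extracting information from Web pages based on table structures and list structures. According to the user's information needs, the method autonomously extracts information from relevant pages, reorganizes the extracted information under the relational model, and stores it in a database. For table-structured information sources, annotating a single page is enough to obtain the extraction knowledge; through self-learning the method adapts well to dynamic changes in page content, so that extraction becomes automatic. For list-structured information sources, the method analyzes the DOM tree structure to dynamically obtain the path of the information block within the DOM hierarchy, and then obtains the values of the information objects from basic extraction knowledge about those objects. Self-learning is again used to adapt to dynamic changes in page content.
Keywords:	Web; semi-structured data; information extraction; Wrapper
Article ID:	1002-8331-(2005)14-0185-05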
Autonomous Extract Information from Web Pages

XU Jianchao, HOU Kun. Autonomous Extract Information from Web Pages[J]. Computer Engineering and Applications, 2005, 41(14): 185-189, 198.

Authors:	XU Jianchao HOU Kun

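Affiliation:	Xu Jianchao (1, 2), Hou Kun (1); 1. School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012; 2. Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun 130026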

Abstract:	The paper presents a method for autonomous information extraction from Web pages based on table and list structures. The method extracts information from relevant pages autonomously according to the user's demands, restructures the extracted information under the relational model, and stores it in a database. For table-structured sources, labelling a single page is enough to obtain the extraction knowledge; the wrapper adapts to page changes through self-learning, which makes the extraction automatic. For list-structured sources, the wrapper analyses the DOM structure to automatically obtain the path of the information block in the DOM hierarchy, and obtains the value of each information object from the extraction knowledge. The wrapper adapts to dynamic changes in Web pages through self-learning.
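
The list-structure step described in the abstract (locating the information block's path in the DOM hierarchy, then reading object values from it) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' wrapper: the heuristic "the block is the element with the most same-tag children" and all names (`Node`, `TreeBuilder`, `find_list_block`, `extract_list`, `SAMPLE`) are hypothetical, and the self-learning part is not modelled.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, non-exhaustive set).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def path_of(node):
    """Tag path from the document root down to node, e.g. 'html/body/div/ul'."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))


def text_of(node):
    parts = [text_of(i) if isinstance(i, Node) else i for i in node.items]
    return " ".join(p for p in parts if p)


def find_list_block(root):
    """Assumed heuristic: pick the element with the most same-tag children."""
    best, best_count, best_tag = None, 1, None
    for node in walk(root):
        counts = Counter(c.tag for c in node.children)
        if counts:
            tag, count = counts.most_common(1)[0]
            if count > best_count:
                best, best_count, best_tag = node, count, tag
    return best, best_tag


def extract_list(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    block, item_tag = find_list_block(builder.root)
    if block is None:
        return None, []
    values = [text_of(c) for c in block.children if c.tag == item_tag]
    return path_of(block), values


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><ul>
<li><b>Title A</b> 12.50</li>
<li><b>Title B</b> 8.00</li>
<li><b>Title C</b> 20.00</li>
</ul></div>
</body></html>"""

path, values = extract_list(SAMPLE)
print(path)
print(values)
```

Because the path is recomputed from the page each time rather than hard-coded, a layout change that moves the list elsewhere in the tree would still be picked up by this heuristic, which is one simple way to read the abstract's "dynamically obtain the path".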

Keywords:	Web; semi-structured data; information extraction; Wrapper
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
