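Web Information Extraction Based on DOM and Web Page Templates

Cite this article:	WANG Li, TANG Jian-xiong. Web Information Extraction Based on DOM and Web Page Templates [J]. Digital Community & Smart Home, 2007(18).

Author names:	WANG Li, TANG Jian-xiong

Author affiliation:	School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei 430063 (both authors)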

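Summary:	This article proposes a method for extracting Web information based on the DOM (Document Object Model) and web page templates. Following the DOM definition, the structure of a web page is described by constructing an HTML parse tree. Before a page is extracted, web page templates are induced in order to filter out the noise in the page. Information is then extracted using extraction rules based on relative paths. Finally, the article gives experimental results for both template induction and information extraction. These results show that the proposed template induction and information extraction methods are correct and efficient.
Key terms:	information extraction; Document Object Model; web page template; extraction rule; relative path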
Information Extraction for the Web Sources Based on DOM and Web Template

WANG Li, TANG Jian-xiong. Information Extraction for the Web Sources Based on DOM and Web Template [J]. Digital Community & Smart Home, 2007(18).

Authors:	WANG Li, TANG Jian-xiong

Abstract:	This paper studies information extraction based on the DOM (Document Object Model) and web templates. Following the definition of the DOM, the paper describes the structure of web pages by constructing an HTML parse tree. Before information extraction, the noise in the web pages is filtered out by inducing a web template. The paper then uses extraction rules based on relative paths to extract information from the web pages. Finally, the paper presents the results of inducing web templates and extracting web pages. The results indicate that the proposed methods for inducing web templates and for extracting web pages are correct and effective.

Keywords:	Information Extraction; DOM; Web Template; Extraction Rule; Relative Path
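
The three steps named in the abstract (building an HTML parse tree, inducing a template to filter noise, and extracting with a relative-path rule) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe in detail. It assumes that a template is the set of (path, text) pairs shared by all sample pages, and that a relative path is a path from a labelled anchor node to the target node. All names (`Node`, `TreeBuilder`, `induce_template`, `extract`, `PAGE_A`, `PAGE_B`) are hypothetical, and the input is assumed to be well-formed HTML without unclosed void tags.

```python
from html.parser import HTMLParser


class Node:
    """One element of the parse tree (hypothetical helper type)."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree; assumes well-formed, fully closed tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_nodes(node, path=""):
    """Yield (absolute path, text) for every node that carries text."""
    if node.text:
        yield path, node.text
    counts = {}
    for child in node.children:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        yield from text_nodes(child, f"{path}/{child.tag}[{counts[child.tag]}]")


def induce_template(pages):
    """Assumption: content repeated at the same path on every page is template noise."""
    common = None
    for html in pages:
        items = set(text_nodes(parse(html)))
        common = items if common is None else common & items
    return common or set()


def find_by_text(node, text):
    if node.text == text:
        return node
    for child in node.children:
        found = find_by_text(child, text)
        if found is not None:
            return found
    return None


def resolve(node, rel_path):
    """Follow steps such as '..' (parent) or 'td[2]' (second td child)."""
    for step in rel_path.split("/"):
        if step == "..":
            node = node.parent
        else:
            tag, _, index = step.rstrip("]").partition("[")
            matches = [c for c in node.children if c.tag == tag]
            n = int(index or 1)
            node = matches[n - 1] if len(matches) >= n else None
        if node is None:
            return None
    return node


def extract(html, anchor_text, rel_path):
    """Apply one rule: locate the anchor by its label, then walk the relative path."""
    anchor = find_by_text(parse(html), anchor_text)
    target = resolve(anchor, rel_path) if anchor else None
    return target.text if target else None


# Two made-up pages that share a layout and differ only in the data cell.
PAGE_A = ("<html><body><div>Site menu</div><table><tr><td>Title:</td>"
          "<td>Book A</td></tr></table><div>Copyright</div></body></html>")
PAGE_B = PAGE_A.replace("Book A", "Book B")

noise = induce_template([PAGE_A, PAGE_B])
content = [t for p, t in text_nodes(parse(PAGE_A)) if (p, t) not in noise]
print(content)
print(extract(PAGE_A, "Title:", "../td[2]"))
```

In this sketch the rule is anchored on a template node (the "Title:" label), so it depends only on the local structure around the label and not on the full path from the document root.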
This article is indexed in CNKI, VIP, and other databases.
