DOM-Based Automatic Extraction of Topical Information from Web Pages


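Cite this article:	WANG Qi, TANG Shi Wei, YANG Dong Qing, WANG Teng Jiao. DOM-Based Automatic Extraction of Topical Information from Web Pages[J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792.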

Author names:	WANG Qi, TANG Shi Wei, YANG Dong Qing, WANG Teng Jiao

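Author affiliations:	1. National Key Laboratory of Visual and Auditory Information Processing, Peking University, Beijing 100871; 2. National Key Laboratory of Visual and Auditory Information Processing, Peking University, Beijing 100871, and Department of Computer Science and Technology, Peking University, Beijing 100871; 3. Department of Computer Science and Technology, Peking University, Beijing 100871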

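Funding:	National "973" Key Basic Research Development Program (G1999032705); National "863" High-Tech Research and Development Program, major special project on database management systems and their applications (2002AA4Z3440)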

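Abstract:	The main information a Web page conveys is usually buried in a large amount of irrelevant structure and text, so users cannot quickly reach the topical information, which limits the usability of the Web. Information extraction helps to solve this problem. Building on the DOM specification, and addressing both the semi-structured nature of HTML and its lack of semantic description, the paper proposes the STU-DOM tree model, which carries semantic information. Converting an HTML document into an STU-DOM tree and then applying structure-based filtering and semantics-based pruning extracts the topical information accurately. The method does not depend on the information source and does not change the structure or content of the source page; it is automatic, reliable, and general. It has considerable practical value and can be applied to Web browsing on PDAs and mobile phones and to information retrieval systems.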
Keywords:	DOM; information extraction; partitioning; STU; STU tree; STU-DOM tree; correlativity
DOM-Based Automatic Extraction of Topical Information from Web Pages

WANG Qi, TANG Shi Wei, YANG Dong Qing, and WANG Teng Jiao. DOM-Based Automatic Extraction of Topical Information from Web Pages[J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792.

Authors:	WANG Qi, TANG Shi Wei, YANG Dong Qing, and WANG Teng Jiao

Affiliation:	WANG Qi 1, TANG Shi Wei 1,2, YANG Dong Qing 2, and WANG Teng Jiao 2

Abstract:	The Web is a vast resource of information, but the way that information is presented limits its usability: the main information in a web page is often hidden among unimportant features such as unnecessary images and extraneous links, which makes it difficult for users to acquire the topical information. Information extraction can help users locate the information of interest. A new extraction methodology based on DOM is proposed, in which DOM trees are transformed into STU-DOM trees that are then processed by a set of algorithms. An STU-DOM tree can be viewed as a DOM tree with some semantic contextual attributes. The key algorithm filters and prunes the STU-DOM tree, and it can automatically and accurately extract the useful and relevant content from HTML documents. The approach is a universal method that is independent of document structures and domains. Unlike most approaches, it also maintains the structure and content of the source page. Hence the approach is significant and reliable. It can be widely applied to web browsing on handheld devices, such as PDAs and mobile phones, and to retrieval systems.

Keywords:	DOM; information extraction; partition; STU; STU tree; STU-DOM tree; correlativity
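
As a rough illustration of the filter-then-prune idea described in the abstract, the following Python sketch builds a small DOM-like tree, annotates each node with two simple attributes, drops structurally noisy elements, and prunes link-heavy blocks. It is not the paper's STU-DOM algorithm: the abstract does not define the STU attributes or the relevance measure, so the `Node` fields, the `NOISE_TAGS` and `BLOCK_TAGS` sets, the link-density heuristic, and the `max_link_ratio` threshold are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumptions for illustration only; the abstract gives no concrete rules.
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "tr", "td", "section", "aside", "p"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """DOM-like node carrying two hypothetical annotations."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text      # non-empty only for "#text" nodes
        self.children = []
        self.text_len = 0     # characters of text in the subtree
        self.link_len = 0     # characters of text inside <a> in the subtree


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.current.children.append(Node("#text", self.current, text))


def annotate(node, in_link=False):
    """Fill text_len and link_len bottom-up."""
    in_link = in_link or node.tag == "a"
    node.text_len = len(node.text)
    node.link_len = node.text_len if in_link else 0
    for child in node.children:
        annotate(child, in_link)
        node.text_len += child.text_len
        node.link_len += child.link_len


def filter_and_prune(node, max_link_ratio=0.5):
    """Drop noise tags (structure) and link-heavy blocks (heuristic relevance)."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        if (child.tag in BLOCK_TAGS and child.text_len
                and child.link_len / child.text_len > max_link_ratio):
            continue
        filter_and_prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    """Return the remaining text fragments in document order."""
    if node.tag == "#text":
        return [node.text]
    parts = []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic_text(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    filter_and_prune(builder.root, max_link_ratio)
    return " ".join(collect_text(builder.root))
```

Calling `extract_topic_text(page_html)` on a page whose navigation bar and footer consist mostly of links should return the article text without those blocks. Because the tree is pruned in place rather than flattened, the remaining nodes keep their original nesting, loosely mirroring the abstract's point that the structure of the page is preserved.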
This article is indexed in databases including CNKI, VIP, and Wanfang Data.
