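DOM-Based Automatic Extraction of Topical Information from Web Pages (基于DOM的网页主题信息自动提取)
Cite this article: WANG Qi, TANG Shi Wei, YANG Dong Qing, WANG Teng Jiao. DOM-Based Automatic Extraction of Topical Information from Web Pages[J]. Journal of Computer Research and Development (计算机研究与发展), 2004, 41(10): 1786-1792.
Authors: WANG Qi (王琦), TANG Shi Wei (唐世渭), YANG Dong Qing (杨冬青), WANG Teng Jiao (王腾蛟)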
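Affiliations: 1. State Key Laboratory of Visual and Auditory Information Processing, Peking University, Beijing 100871
2. State Key Laboratory of Visual and Auditory Information Processing, Peking University, Beijing 100871; Department of Computer Science and Technology, Peking University, Beijing 100871
3. Department of Computer Science and Technology, Peking University, Beijing 100871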
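Funding: National "973" Key Basic Research Development Program (G1999032705); National "863" High-Tech Research and Development Program, major special project on database management systems and their applications (2002AA4Z3440)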
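Abstract (translated from the Chinese): The main information conveyed by a Web page is usually buried in a large amount of irrelevant structure and text. This prevents users from quickly obtaining the topical information and limits the usability of the Web; information extraction helps address the problem. Building on the DOM specification, and targeting both the semi-structured nature of HTML and its lack of semantic description, the paper proposes the STU-DOM tree model, which carries semantic information. An HTML document is converted into an STU-DOM tree, which then undergoes structure-based filtering and semantics-based pruning so that the topical information can be extracted accurately. The method does not depend on the information source and does not change the structure or content of the source page, making it automatic, reliable, and general. It has considerable practical value and can be applied to Web browsing on PDAs and mobile phones as well as to information retrieval systems.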

Keywords (translated from the Chinese): DOM; information extraction; partitioning; STU; STU tree; STU-DOM tree; relevance

DOM-Based Automatic Extraction of Topical Information from Web Pages
WANG Qi, TANG Shi Wei, YANG Dong Qing, and WANG Teng Jiao. DOM-Based Automatic Extraction of Topical Information from Web Pages[J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792.
Authors: WANG Qi, TANG Shi Wei, YANG Dong Qing, and WANG Teng Jiao
Affiliation: WANG Qi (1), TANG Shi Wei (1, 2), YANG Dong Qing (2), and WANG Teng Jiao (2)
Abstract: The Web is a vast resource of information, but its presentation limits its usability: the main information in a web page is usually hidden among unimportant features such as unnecessary images and extraneous links, which makes it difficult for users to acquire the topical information. Information extraction can help users locate the information of interest. A new DOM-based extraction methodology is proposed, in which DOM trees are transformed into STU-DOM trees that are then processed by a set of algorithms. An STU-DOM tree can be viewed as a DOM tree with some semantic contextual attributes. The key algorithm filters and prunes the STU-DOM tree, and it can automatically and accurately extract the useful and relevant content from HTML documents. The approach is a general method that is independent of document structures and domains. Unlike most approaches, it also preserves the structure and content of the source page. Hence the approach is reliable and of practical significance. It can be widely applied to web browsing on handheld devices, such as PDAs and mobile phones, and to retrieval systems.
Keywords:DOM  information extraction  partition  STU  STU tree  STU-DOM tree  correlativity  
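
The pipeline outlined in the abstract (parse the HTML document into a tree, filter it by structure, then prune it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' STU-DOM algorithm: the abstract does not define the STU attributes or the relevance measure, so the sketch substitutes a simple link-text ratio as a stand-in. The names `Node`, `NOISE_TAGS`, `BLOCK_TAGS`, `extract_topic`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Optional

# Assumptions for illustration only; none of these sets come from the paper.
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
BLOCK_TAGS = {"div", "ul", "ol", "li", "table", "tr", "td", "section", "aside", "p"}


@dataclass(eq=False)
class Node:
    tag: str                      # element name, or "#text" for text nodes
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    text: str = ""                # only used by "#text" nodes
    text_len: int = 0             # hypothetical attribute: total text length
    link_len: int = 0             # hypothetical attribute: text length inside <a>


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        piece = " ".join(data.split())  # normalize whitespace
        if piece:
            self.current.children.append(Node("#text", self.current, text=piece))


def filter_structure(node):
    """Structure-based filtering: drop subtrees rooted at noise tags."""
    node.children = [c for c in node.children if c.tag not in NOISE_TAGS]
    for child in node.children:
        filter_structure(child)


def annotate(node, inside_link=False):
    """Attach text_len and link_len to every node, bottom-up."""
    inside_link = inside_link or node.tag == "a"
    node.text_len = len(node.text)
    node.link_len = len(node.text) if inside_link else 0
    for child in node.children:
        annotate(child, inside_link)
        node.text_len += child.text_len
        node.link_len += child.link_len


def prune(node, max_link_ratio=0.5):
    """Pruning stand-in: drop block subtrees that are empty or mostly link text."""
    kept = []
    for child in node.children:
        if child.tag in BLOCK_TAGS and (
            child.text_len == 0 or child.link_len / child.text_len > max_link_ratio
        ):
            continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    """Return the remaining text pieces in document order."""
    parts = [node.text] if node.tag == "#text" else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic(html):
    builder = TreeBuilder()
    builder.feed(html)
    filter_structure(builder.root)
    annotate(builder.root)
    prune(builder.root)
    return " ".join(collect_text(builder.root))


SAMPLE = """
<html><body>
<div id="menu"><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div id="main"><h1>DOM extraction</h1>
<p>The main content of a page is often surrounded by <a href="/x">navigation</a> and ads.</p></div>
<script>var tracking = 1;</script>
<div id="footer"><a href="/about">About us</a> <a href="/legal">Legal</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_topic(SAMPLE))
```

On the sample page, the menu and footer blocks consist entirely of link text and are pruned, the script is removed by the structural filter, and the heading and paragraph survive with the inline link intact. Pruning whole subtrees rather than rewriting nodes keeps the structure of the retained content unchanged, in the spirit of the abstract's claim; the actual relevance computation in the paper may differ substantially.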
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.