
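DOM-Based Extraction of Topical Information from Web Pages
Cite as: Liu Jun, Zhang Jing. DOM-Based Extraction of Topical Information from Web Pages [J]. Computer Applications and Software (计算机应用与软件), 2010, 27(5): 188-190.
Authors: Liu Jun, Zhang Jing
Affiliation: College of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei 430070
Abstract: With the growth of the Internet, the volume and density of information on Web pages keep increasing. However, the topical information of a Web page is usually not clearly marked, which makes it difficult to extract. To address this problem, an algorithm is proposed. It first builds a Document Object Model (DOM) tree. Then, to compensate for the weak semi-structured nature of HTML, it annotates the DOM with display and semantic attributes (number of links, amount of non-link text, height, width) and applies a proposed clustering rule to partition the page into blocks. Finally, it prunes the tree to delete useless information and extracts the topical information. Experiments show that the method extracts topical information accurately.

Keywords: DOM; topic; information extraction; partitioning; pruning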
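
The pipeline outlined in the abstract (build a DOM tree, annotate nodes with link and text statistics, prune uninformative blocks) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the clustering rule, so a simple link-text ratio threshold stands in for it. The names `BLOCK_TAGS`, `max_link_ratio`, and `extract_topic_text` are hypothetical, and the height and width attributes are omitted because they require a rendering engine.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}
# Assumption: only these container tags are treated as candidate blocks.
BLOCK_TAGS = {"div", "ul", "ol", "table", "tr", "td", "p", "nav", "section"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""        # non-empty only for "#text" nodes
        self.link_count = 0   # <a> elements in this subtree
        self.link_chars = 0   # text characters inside links
        self.text_chars = 0   # text characters outside links


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree; text becomes '#text' child nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy HTML.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self.current.tag not in SKIP_TAGS:
            node = Node("#text", self.current)
            node.text = stripped
            self.current.children.append(node)


def annotate(node, inside_link=False):
    """Fill in link_count, link_chars and text_chars bottom-up."""
    inside_link = inside_link or node.tag == "a"
    own = len(node.text)
    node.link_count = 1 if node.tag == "a" else 0
    node.link_chars = own if inside_link else 0
    node.text_chars = 0 if inside_link else own
    for child in node.children:
        annotate(child, inside_link)
        node.link_count += child.link_count
        node.link_chars += child.link_chars
        node.text_chars += child.text_chars


def prune(node, max_link_ratio=0.5):
    """Drop block elements that are empty or dominated by link text.

    The threshold rule is a stand-in assumption, not the paper's rule.
    """
    kept = []
    for child in node.children:
        total = child.link_chars + child.text_chars
        if child.tag in BLOCK_TAGS and (
            total == 0 or child.link_chars / total > max_link_ratio
        ):
            continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic_text(html, max_link_ratio=0.5):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root, max_link_ratio)
    return " ".join(collect_text(builder.root))
```

Calling `extract_topic_text(page_html)` on a page whose navigation bar and footer consist mostly of links would be expected to return mainly the body text. Inline links inside a paragraph are kept because only the tags in `BLOCK_TAGS` are candidates for pruning.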

DOM BASED EXTRACTION OF TOPICAL INFORMATION FROM WEB PAGES
Liu Jun, Zhang Jing. DOM BASED EXTRACTION OF TOPICAL INFORMATION FROM WEB PAGES [J]. Computer Applications and Software, 2010, 27(5): 188-190.
Authors: Liu Jun, Zhang Jing
Affiliation: College of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China
Abstract: With the development of the Internet, both the amount and the density of information on Web pages increase day by day. However, the topical information of a page is usually not clearly marked, which makes it difficult to extract. A new extraction algorithm is proposed to solve this issue by constructing the DOM tree and then adding attributes to it such as display and semantics (number of links, number of non-link words, height, width, etc.), as well as presenting a clustering rule for ...
Keywords: DOM; topic; information extraction; partition; prune
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.