DOM-Based Extraction of Topical Information from Web Pages


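Citation:	Liu Jun, Zhang Jing. DOM-based extraction of topical information from Web pages [J]. Computer Applications and Software, 2010, 27(5): 188-190.

Authors:	Liu Jun, Zhang Jing

Affiliation:	College of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei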

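Abstract:	With the growth of the Internet, Web pages carry more information at ever higher density. The topical information of a Web page, however, is usually not clearly marked, which makes it hard to extract. To address this problem, an algorithm is proposed: a Document Object Model (DOM) tree is built; then, to compensate for the weak semi-structured features of HTML, display and semantic attributes (link count, non-link text count, height, width) are added to the DOM; a clustering rule is proposed to partition the tree into blocks; finally, the tree is pruned to delete useless information and extract the topical information. Experiments show that the method extracts topical information accurately.
Keywords:	DOM; topical information extraction; partitioning; pruning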
DOM BASED EXTRACTION OF TOPICAL INFORMATION FROM WEB PAGES

Liu Jun, Zhang Jing. DOM BASED EXTRACTION OF TOPICAL INFORMATION FROM WEB PAGES [J]. Computer Applications and Software, 2010, 27(5): 188-190.

Authors:	Liu Jun, Zhang Jing

Affiliation:	College of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, Hubei, China

Abstract:	With the development of the Internet, both the amount and the density of information on Web pages increase day by day. However, the topical information is usually not presented clearly enough, which makes it difficult to extract. A new extraction algorithm is proposed to solve this issue by constructing the DOM tree and then adding attributes to it such as display and semantics (link number, unlinked words number, height and width, etc.), as well as presenting a clustering rule for ...

Keywords:	DOM; Topical information extraction; Partition; Pruning
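
The pipeline outlined in the abstract (build a DOM tree, attach link and text counts to each node, then prune uninformative blocks) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state the clustering rule, so a simple link-density threshold stands in for it here. The display attributes (height, width) require a rendering engine and are omitted. All names (`Node`, `TreeBuilder`, `annotate`, `prune`, `extract_topic`, `min_text_per_link`) are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM tree, with the semantic attributes."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = ""    # text found directly inside this element
        self.link_count = 0   # number of <a> elements in the subtree
        self.text_len = 0     # non-link, non-whitespace characters in the subtree


class TreeBuilder(HTMLParser):
    """Builds the Node tree; tolerant of unclosed or mismatched tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += data


def annotate(node, inside_link=False):
    """Fill link_count and text_len bottom-up for every node."""
    inside_link = inside_link or node.tag == "a"
    node.link_count = 1 if node.tag == "a" else 0
    if inside_link or node.tag in SKIP_TAGS:
        node.text_len = 0
    else:
        node.text_len = len("".join(node.own_text.split()))
    for child in node.children:
        annotate(child, inside_link)
        node.link_count += child.link_count
        node.text_len += child.text_len


def prune(node, min_text_per_link=10):
    """Drop blocks dominated by links (assumed rule, not the paper's)."""
    kept = []
    for child in node.children:
        if child.tag in SKIP_TAGS:
            continue
        # Anchors themselves are never pruned, so inline links inside a
        # retained block keep their text.
        link_heavy = child.text_len < min_text_per_link * child.link_count
        if child.tag != "a" and child.link_count and link_heavy:
            continue
        prune(child, min_text_per_link)
        kept.append(child)
    node.children = kept


def collect_text(node):
    parts = [" ".join(node.own_text.split())]
    parts += [collect_text(child) for child in node.children]
    return " ".join(p for p in parts if p)


def extract_topic(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    prune(builder.root)
    return collect_text(builder.root)
```

Calling `extract_topic(page_html)` on a page with a navigation list, an article paragraph, and a footer of links should return the paragraph text and drop the two link-only blocks. The threshold `min_text_per_link` is an arbitrary starting value and would need tuning against real pages.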
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
