Research and Implementation of Extracting Topical Information from Web Page

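Cite this article:	Liu Yanmin, Liu Biao, Feng Huamin, Song Guosen, Fang Yong. Research and Implementation of Extracting Topical Information from Web Page[J]. Computer Engineering and Applications, 2006, 42(21): 146-148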

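Authors:	Liu Yanmin, Liu Biao, Feng Huamin, Song Guosen, Fang Yong

Affiliations:	School of Information Engineering, Yanshan University, Qinhuangdao, Hebei 066004; School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; Key Laboratory for Security and Secrecy of Information, Beijing Electronic Science and Technology Institute, Beijing 100070; School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; Key Laboratory for Security and Secrecy of Information, Beijing Electronic Science and Technology Institute, Beijing 100070

Funding:	National High Technology Research and Development Program of China (863 Program)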

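Abstract:	The main information on a Web page is usually buried among a large number of irrelevant features, such as unimportant images and unrelated links, which prevents users from quickly obtaining the topical information and limits the usability of the Web. This paper proposes a method for extracting the topical content of Web pages, together with the corresponding algorithm, and tests and evaluates it by manual judgment on 5,000 pages from 120 websites. The experimental results show that the method is feasible and achieves an accuracy of 91.35%.
Keywords:	HTML; information extraction; page structure analysis; tag statistics
Article ID:	1002-8331-(2006)21-0146-03
Received:	2006-01-01
Revised:	2006-01-01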
Research and Implementation of Extracting Topical Information from Web Page

Liu Yanmin, Liu Biao, Feng Huamin, Song Guosen, Fang Yong. Research and Implementation of Extracting Topical Information from Web Page[J]. Computer Engineering and Applications, 2006, 42(21): 146-148

Authors:	Liu Yanmin, Liu Biao, Feng Huamin, Song Guosen, Fang Yong

Affiliation:	1 School of Information Engineering, Yanshan University, Qinhuangdao, Hebei 066004; 2 School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; 3 Key Laboratory for Security and Secrecy of Information, Beijing Electronic Science and Technology Institute, Beijing 100070

Abstract:	The main information in a web page is often hidden among unimportant features such as unnecessary images and irrelevant links. This makes it difficult for users to acquire the topical information and limits the usability of the Web. In this paper, we propose a novel approach to extracting topical information from web pages and present the corresponding algorithms. Experiments on a set of 5,000 web pages from 120 different sites show that the method is practical and achieves an accuracy of 91.35%.

Keywords:	HTML
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
