
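Webpage Content Extraction via Text Block Density and Tag Path Coverage
Cite this article: Liu Pengcheng, Hu Jun, Wu Gongqing. Webpage Content Extraction via Text Block Density and Tag Path Coverage[J]. Application Research of Computers, 2018, 35(6).
Authors: Liu Pengcheng (刘鹏程), Hu Jun (胡骏), Wu Gongqing (吴共庆)
Affiliation: School of Computer and Information, Hefei University of Technology (all three authors)
Funding: National Key Research and Development Program; National Natural Science Foundation of China; Ministry of Education Innovative Research Team Development Program; China Scholarship Fund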
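Abstract: Besides the main content, most web pages also contain noise such as navigation, advertisements and disclaimers. To improve the accuracy of web page content extraction, this paper proposes an extraction method based on text block density and tag path coverage (CETD-TPC). The method combines the strengths of the text block density feature and the tag path feature into a new fused feature, uses that feature to select the best text block in the page, and finally extracts the main content from that block. It effectively handles both the filtering of noise blocks and the extraction of short text, which is otherwise hard to extract, and it requires neither training nor manual processing. Experiments on the CleanEval dataset and on a dataset of news pages randomly selected from well-known websites show that CETD-TPC applies well across data sources and outperforms algorithms such as CETR, CETD and CEPR.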

Keywords: content extraction; text block density; tag path coverage; feature fusion
Received: 2017/1/13
Revised: 2018/4/30

Webpage Content Extraction via Text Block Density and Tag Path Coverage
Liu Pengcheng, Hu Jun and Wu Gongqing. Webpage Content Extraction via Text Block Density and Tag Path Coverage[J]. Application Research of Computers, 2018, 35(6).
Authors: Liu Pengcheng, Hu Jun and Wu Gongqing
Affiliation: School of Computer and Information, Hefei University of Technology
Abstract: Most webpages contain not only the main content but also noise such as navigation, advertisements and disclaimer notices. To address this problem and improve the accuracy of webpage content extraction, this paper proposes a webpage content extraction method via text block density and tag path coverage (CETD-TPC). Combining the advantages of the text block density feature and the tag path feature, we design a new feature, TDTPC, that fuses the two. We then use the TDTPC feature to select the best text block in a webpage and, finally, extract the content from that block. CETD-TPC requires neither manual processing nor training, and it effectively addresses both noise block filtering and short text extraction. Experimental results on the CleanEval dataset and on news pages randomly selected from well-known websites show that CETD-TPC applies well to different data sets and outperforms CETR, CETD and CEPR.
Keywords:Content Extraction  Text Block Density  Tag Path Coverage  Feature Fusion
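
The pipeline summarized in the abstract (score each block with a fused feature, pick the best block, read its text) can be illustrated with a minimal Python sketch. The abstract does not give the TDTPC formula, so the scoring below is an assumption made purely for illustration: a density term (the block's own text plus characters-per-element of each child) multiplied by a coverage term (the share of page text lying on tag paths that occur inside the block). The names `Node`, `TreeBuilder`, `best_block`, `block_text` and `SAMPLE_HTML` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag
SKIP = {"script", "style"}                           # text here is never content

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.own_text = [], []
        self.chars = 0  # characters in the whole subtree
        self.tags = 1   # elements in the whole subtree, including this one

class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; not a full HTML5 tree builder."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur  # close the nearest open element with this name, if any
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.cur.tag not in SKIP:
            self.cur.own_text.append(text)

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def own_chars(node):
    return sum(len(t) for t in node.own_text)

def aggregate(node):
    node.chars, node.tags = own_chars(node), 1
    for child in node.children:
        aggregate(child)
        node.chars += child.chars
        node.tags += child.tags

def tag_path(node):
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))

def best_block(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    aggregate(root)
    if root.chars == 0:
        return None
    path_chars = {}  # page-wide characters of direct text per tag path
    for n in walk(root):
        if n.own_text:
            p = tag_path(n)
            path_chars[p] = path_chars.get(p, 0) + own_chars(n)
    best, best_score = None, -1.0
    for n in walk(root):
        if n is root or n.chars == 0:
            continue
        # ASSUMED fusion, not the paper's TDTPC definition.
        density = own_chars(n) + sum(c.chars / c.tags for c in n.children)
        paths = {tag_path(m) for m in walk(n) if m.own_text}
        coverage = sum(path_chars[p] for p in paths) / root.chars
        if density * coverage > best_score:
            best, best_score = n, density * coverage
    return best

def block_text(node):
    return " ".join(t for m in walk(node) for t in m.own_text)

SAMPLE_HTML = """<html><body>
<div><a href="/">Home</a><a href="/n">News</a><a href="/a">About</a></div>
<div><p>Web pages usually mix the main article with menus, adverts and legal notices.</p>
<p>A content extractor has to keep the article and drop everything else.</p></div>
<div><a href="/d">Disclaimer</a></div>
</body></html>"""

if __name__ == "__main__":
    print(block_text(best_block(SAMPLE_HTML)))
```

On the sample page the link-heavy navigation and footer blocks score low on both terms, so the block holding the two paragraphs is selected. A real evaluation would need the feature definitions from the full paper rather than this stand-in.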