
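Content Extraction Method Combining Web Page Structure and Text Feature
Cite this article: XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man. Content Extraction Method Combining Web Page Structure and Text Feature[J]. Computer Engineering, 2013(12):200-203,210.
Authors: XIONG Zhong-yang  LIN Xian-qiang  ZHANG Yu-fang  YA Man
Affiliation: College of Computer Science, Chongqing University, Chongqing 400044
Funding: National Natural Science Foundation of China (71102065)
Abstract: Web pages contain main-text content as well as information unrelated to it, and the unrelated information adversely affects the classification, storage, and retrieval of Web pages. To reduce this effect, a content extraction method that combines the structural features and text features of Web pages is proposed. Regular expressions remove irrelevant elements from the page as an initial filter. The page is then segmented linearly into blocks according to its structural features, each block is classified as a link block or a text block according to its text features, and the consecutive occurrence of noise blocks is used to locate the main-text region and obtain the page's main content. Experimental results show that the method extracts the main content of Web pages quickly and accurately.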

Keywords: content extraction  Web page denoising  Web page segmentation  topical crawling  information retrieval  Web mining
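
The pipeline summarized in the abstract (regex pre-filtering, linear blocking, link/text block classification, and locating the body from runs of noise blocks) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no thresholds or exact rules, so the tag lists, the constants `LINK_DENSITY_MAX` and `MIN_TEXT_LEN`, the `max_gap` parameter, and all function names below are hypothetical choices made only for illustration.

```python
import re

# Hypothetical thresholds; the abstract does not state the values the paper uses.
LINK_DENSITY_MAX = 0.5   # a block with a larger share of anchor text is a link block
MIN_TEXT_LEN = 20        # shorter blocks are treated as noise

# Step 1: elements assumed to be irrelevant to the main text.
NOISE_ELEMENTS = re.compile(
    r"<(script|style|noscript|iframe)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
# Step 2: tags assumed to mark block boundaries for linear segmentation.
BLOCK_BOUNDARY = re.compile(
    r"</?(?:div|p|table|tr|td|ul|ol|li|h[1-6]|br|section|article)\b[^>]*>", re.I)
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove remaining tags and collapse whitespace."""
    return re.sub(r"\s+", " ", TAG.sub("", fragment)).strip()


def split_blocks(html):
    """Return (text, link_density) pairs in document order."""
    cleaned = NOISE_ELEMENTS.sub("", html)
    blocks = []
    for chunk in BLOCK_BOUNDARY.split(cleaned):
        text = strip_tags(chunk)
        if not text:
            continue
        link_text = "".join(strip_tags(a) for a in ANCHOR.findall(chunk))
        blocks.append((text, len(link_text) / len(text)))
    return blocks


def is_text_block(text, link_density):
    """Step 3: classify a block as text (True) or link/noise (False)."""
    return link_density <= LINK_DENSITY_MAX and len(text) >= MIN_TEXT_LEN


def extract_main_text(html, max_gap=1):
    """Step 4: keep the largest region of text blocks.

    A region ends once more than `max_gap` noise blocks occur in a row,
    which is one simple way to exploit the observation that noise blocks
    tend to appear consecutively.
    """
    best, best_len = [], 0
    current, current_len, gap = [], 0, 0
    for text, density in split_blocks(html):
        if is_text_block(text, density):
            current.append(text)
            current_len += len(text)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                current, current_len = [], 0
        if current_len > best_len:
            best, best_len = list(current), current_len
    return "\n".join(best)
```

Calling `extract_main_text(html)` on a page string returns the retained text blocks joined by newlines. Regex-based tag handling is used here only because the abstract mentions regular expressions; it is fragile on malformed markup, and a real system would likely need tuned thresholds and a more robust parser.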

Content Extraction Method Combining Web Page Structure and Text Feature
XIONG Zhong-yang,LIN Xian-qiang,ZHANG Yu-fang,YA Man.Content Extraction Method Combining Web Page Structure and Text Feature[J].Computer Engineering,2013(12):200-203,210.
Authors:XIONG Zhong-yang  LIN Xian-qiang  ZHANG Yu-fang  YA Man
Affiliation:(College of Computer Science, Chongqing University, Chongqing 400044, China)
Abstract: A Web page contains both relevant and irrelevant information, and the irrelevant information negatively affects the classification, storage, and retrieval of Web pages. To reduce this influence, this paper proposes a method, aimed at theme-related Web pages, that extracts page content based on text and structural features. It removes unrelated tags from the page with regular expressions and segments the page into blocks according to its structure and text information. By analyzing the text blocks and link blocks of the page, it retains only the main content and deletes the noisy parts. Experimental results show that the method is feasible and achieves high accuracy in page cleaning and content extraction.
Keywords: content extraction  Web page denoising  Web page segmentation  subject crawling  information retrieval  Web mining
This article is indexed in VIP (维普) and other databases.