
Content Extraction Method Combining Web Page Structure and Text Feature

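Cite this article:	XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man. Content Extraction Method Combining Web Page Structure and Text Feature[J]. Computer Engineering, 2013(12): 200-203, 210.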

Authors:	XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man

Affiliation:	College of Computer Science, Chongqing University, Chongqing 400044, China

Funding:	National Natural Science Foundation of China (71102065)

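Abstract:	Web pages contain main-content information together with information unrelated to the content, and the unrelated information negatively affects the classification, storage, and retrieval of Web pages. To reduce the impact of irrelevant information, this paper proposes a content extraction method that combines the structural features and the text features of a Web page. Regular expressions remove irrelevant elements from the page, completing an initial filtering pass. The page is then segmented linearly into blocks according to its structural features, each block is classified as a link block or a text block according to its text features, and the observation that noise blocks occur consecutively is used to locate the main-content region and obtain the main text of the page. Experimental results show that the method extracts the main content of Web pages quickly and accurately.
Keywords:	content extraction; Web page denoising; Web page segmentation; topic-focused crawling; information retrieval; Web mining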
Content Extraction Method Combining Web Page Structure and Text Feature

XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man. Content Extraction Method Combining Web Page Structure and Text Feature[J]. Computer Engineering, 2013(12): 200-203, 210.

Authors:	XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man

Affiliation:	(College of Computer Science, Chongqing University, Chongqing 400044, China)

Abstract:	A Web page contains both relevant and irrelevant information, and the irrelevant information negatively affects the classification, storage, and retrieval of pages. To reduce this influence, this paper proposes a method for theme-related Web pages that extracts the page content based on text and structural features. It removes unrelated tags from the page with regular expressions and segments the page into blocks according to the page structure and the text information. By analyzing the text blocks and link blocks, it retains only the main content of the page and deletes the noisy parts. Experimental results show that the method is feasible and achieves high accuracy in page cleaning and content extraction.

Keywords:	content extraction; Web page denoising; Web page segmentation; topic-focused crawling; information retrieval; Web mining
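
The pipeline outlined in the abstract (regex filtering, linear blocking, link/text block classification, locating the content region) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no patterns, thresholds, or block boundaries, so the noise patterns, the set of block-level tags, the `max_link_ratio` and `min_text_len` parameters, and the rule of keeping the longest run of consecutive text blocks are all assumptions made for illustration.

```python
import re

# Assumed noise patterns for the initial filter; the abstract does not list them.
NOISE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.S),
    re.compile(r"<(script|style|noscript|iframe)\b.*?</\1\s*>", re.S | re.I),
]
# Assumed block-level tags used as split points for linear blocking.
BLOCK_SPLIT = re.compile(
    r"</?(?:div|p|td|tr|table|ul|ol|li|h[1-6]|br)\b[^>]*>", re.I
)
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove any remaining tags and collapse whitespace."""
    return re.sub(r"\s+", " ", TAG.sub("", fragment)).strip()


def split_blocks(html):
    """Apply the regex filter, then split linearly at block-level tags.

    Returns a list of (text, link_chars) pairs, one per non-empty block.
    """
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    blocks = []
    for fragment in BLOCK_SPLIT.split(html):
        text = strip_tags(fragment)
        if text:
            link_chars = sum(len(strip_tags(a)) for a in ANCHOR.findall(fragment))
            blocks.append((text, link_chars))
    return blocks


def is_link_block(text, link_chars, max_link_ratio=0.5, min_text_len=20):
    """Assumed rule: a block is noise if it is very short or mostly anchor text."""
    return len(text) < min_text_len or link_chars / len(text) > max_link_ratio


def extract_content(html):
    """Keep the run of consecutive text blocks that carries the most text."""
    runs, current = [], []
    for text, link_chars in split_blocks(html):
        if is_link_block(text, link_chars):
            # A noise block ends the current run of text blocks.
            if current:
                runs.append(current)
            current = []
        else:
            current.append(text)
    if current:
        runs.append(current)
    if not runs:
        return ""
    best = max(runs, key=lambda run: sum(len(t) for t in run))
    return "\n".join(best)
```

Calling `extract_content(page_html)` on a page whose navigation bar and related-links list surround the article paragraphs would return only the paragraph text, one block per line. Both thresholds would need tuning per site and per language, since Chinese text is much denser per character than English.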
This article is indexed in VIP and other databases.
