Research on Statistics-Based Content Extraction from Chinese Web Pages
Affiliation: School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, Hubei, China
Abstract: Information extraction is a data mining technique widely applied on the Internet. Its purpose is to extract meaningful and valuable data and information from the vast amount of data on the Internet, so that Internet resources can be used more effectively. This paper adopts a method based on statistics of web page features to extract the main text from Chinese web pages. The method first represents a web page as an XML-based DOM tree, then uses statistical node information to filter noise nodes out of the tree, and finally selects the content nodes. Compared with traditional wrapper-based extraction methods, this method is simple and practical. Experimental results show that its extraction accuracy exceeds 90%, giving it good practical value.

Keywords: Chinese information processing  information extraction  content extraction

Content Extraction from Chinese Web Page Based on Statistics
Authors: ZHAO Wen  TANG Jian-Xiong  GAO Qing-Feng
Abstract: Information extraction is a data mining technology widely used on the Internet. Its purpose is to extract meaningful and valuable information from the huge amount of data on the Internet in order to make full use of Internet resources. This article extracts the text content from Chinese web pages with a statistical approach. The method uses an XML-based DOM tree to represent a web page according to its HTML tags, then removes noise nodes using statistical node data, and finally selects the nodes that contain the text content. Compared with traditional wrapper-based methods, this method is simpler and more practical. Experimental results show that the extraction precision is higher than 90% and that the method has good practical value.
Keywords: Chinese information processing  information extraction  content extraction
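
The pipeline described in the abstract (parse the page into a DOM tree, filter out noise nodes using node-level statistics, then select the content nodes) can be illustrated with a short sketch. The code below is not the authors' implementation: it assumes the BeautifulSoup library for DOM parsing, and it substitutes text length and link-text density for the paper's statistical node features; the thresholds are illustrative only.

# A minimal sketch of statistics-based content extraction, not the paper's
# implementation. Assumes BeautifulSoup for DOM parsing; text length and
# link-text density stand in for the statistical node features, and the
# thresholds below are illustrative.
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]

def extract_content(html: str, min_text_len: int = 50, max_link_ratio: float = 0.4) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: remove nodes that are almost always noise (scripts, menus, ...).
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()

    # Step 2: score candidate block nodes by simple statistics:
    # total text length and the share of that text that sits inside links.
    best_node, best_score = None, 0.0
    for node in soup.find_all(["div", "td", "p", "article", "section"]):
        text = node.get_text(" ", strip=True)
        if len(text) < min_text_len:
            continue  # too little text to be the main content
        link_text = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
        link_ratio = link_text / len(text)
        if link_ratio > max_link_ratio:
            continue  # mostly links: likely navigation or a related-article list
        score = len(text) * (1.0 - link_ratio)
        if score > best_score:
            best_node, best_score = node, score

    # Step 3: return the text of the highest-scoring node as the page body.
    return best_node.get_text("\n", strip=True) if best_node else ""

if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a href='/'>Home</a></div>"
            "<div id='main'><p>" + "正文内容 " * 30 + "</p></div></body></html>")
    print(extract_content(page))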
This document is indexed in databases including CNKI.