Research on Statistics-Based Content Extraction from Chinese Web Pages
Affiliation: School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, Hubei, China
Abstract: Information extraction is a data mining technique widely applied on the Internet. Its purpose is to extract meaningful and valuable data and information from the vast amount of data on the Internet, so that Internet resources can be used more effectively. This paper adopts a method based on statistics of web page features to extract the main text from Chinese web pages. The method first represents a web page as an XML-based DOM tree, then uses statistical node information to filter noise nodes out of the tree, and finally selects the content nodes. Compared with traditional wrapper-based extraction methods, this method is simple and practical. Experimental results show that its extraction accuracy exceeds 90%, giving it good practical value.

Keywords: Chinese information processing  information extraction  content extraction

Content Extraction from Chinese Web Page Based on Statistics
Authors: ZHAO Wen  TANG Jian-Xiong  GAO Qing-Feng
Abstract: Information extraction is a data mining technology widely used on the Internet. Its purpose is to extract meaningful and valuable information from the huge amount of data on the Internet in order to make full use of Internet resources. This article extracts the text content from Chinese web pages with a statistical approach. The method uses an XML-based DOM tree to represent a web page according to its HTML tags, then removes noise nodes using statistical node data, and finally selects the nodes that contain the text content. Compared with traditional wrapper-based methods, this method is simpler and more practical. Experimental results show that the extraction precision is higher than 90% and that the method has good practical value.
Keywords: Chinese information processing  information extraction  content extraction
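
The pipeline described in the abstract (parse the page into a DOM tree, filter out noise nodes using node-level statistics, then select the content nodes) can be illustrated with a short sketch. The code below is not the authors' implementation: it assumes the BeautifulSoup library for DOM parsing, and it substitutes text length and link-text density for the paper's statistical node features; the thresholds are illustrative only.

# A minimal sketch of statistics-based content extraction, not the paper's
# implementation. Assumes BeautifulSoup for DOM parsing; text length and
# link-text density stand in for the statistical node features, and the
# thresholds below are illustrative.
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]

def extract_content(html: str, min_text_len: int = 50, max_link_ratio: float = 0.4) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: remove nodes that are almost always noise (scripts, menus, ...).
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()

    # Step 2: score candidate block nodes by simple statistics:
    # total text length and the share of that text that sits inside links.
    best_node, best_score = None, 0.0
    for node in soup.find_all(["div", "td", "p", "article", "section"]):
        text = node.get_text(" ", strip=True)
        if len(text) < min_text_len:
            continue  # too little text to be the main content
        link_text = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
        link_ratio = link_text / len(text)
        if link_ratio > max_link_ratio:
            continue  # mostly links: likely navigation or a related-article list
        score = len(text) * (1.0 - link_ratio)
        if score > best_score:
            best_node, best_score = node, score

    # Step 3: return the text of the highest-scoring node as the page body.
    return best_node.get_text("\n", strip=True) if best_node else ""

if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a href='/'>Home</a></div>"
            "<div id='main'><p>" + "正文内容 " * 30 + "</p></div></body></html>")
    print(extract_content(page))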
This document is indexed in databases including CNKI.