Research on a Statistics-Based Method for Extracting Main-Text Content from Web Pages (基于统计的网页正文信息抽取方法的研究)

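Cite this article:	SUN Cheng-jie, GUAN Yi. 基于统计的网页正文信息抽取方法的研究 [J]. 中文信息学报 (Journal of Chinese Information Processing), 2004, 18(5): 18-23.

Authors:	SUN Cheng-jie (孙承杰), GUAN Yi (关毅)

Affiliation:	School of Computer Science, Harbin Institute of Technology

Funding:	National High Technology Research and Development Program of China (863 Program)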

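Abstract:	To apply natural language processing techniques effectively to web documents, this paper proposes a method that relies on statistical information to extract the main text from Chinese news web pages. The method first represents a web page as a tree according to its HTML tags, and then uses the number of Chinese characters contained in each node of the tree to select the node that holds the main text. It avoids the drawback of traditional web content extraction methods, which require a different wrapper for each data source, and it is simple and accurate. Experiments show that its extraction accuracy exceeds 95%. A web text extraction tool built on this method currently supplies corpus data to a question answering system for the travel domain and meets that system's needs well.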
Keywords:	computer application; Chinese information processing; web page data extraction; wrapper
Article ID:	1003-0077(2004)05-0017-06
Revision date:	April 22, 2004
A Statistical Approach for Content Extraction from Web Page

SUN Cheng-jie, GUAN Yi. A Statistical Approach for Content Extraction from Web Page [J]. Journal of Chinese Information Processing, 2004, 18(5): 18-23.

Authors:	SUN Cheng-jie, GUAN Yi

Affiliation:	Dept. of Computer Science and Technology, Harbin Institute of Technology

Abstract:	This paper proposes a statistical approach for extracting text content from Chinese news web pages, so that natural language processing technologies can be applied effectively to web page documents. The method represents a web page as a tree according to its HTML tags, and then selects the node that contains the text content by using the number of Chinese characters in each node of the tree. Unlike traditional methods, it does not need a different wrapper for each data source. It is simple, accurate, and easy to implement. Experimental results show that the extraction precision is higher than 95%. The method has been adopted to provide web text data for a question answering system in the travel domain.

Keywords:	computer application; Chinese information processing; web data extraction; wrapper
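
The tree-and-character-count idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact node-selection rule, so the descent rule used here (move into the child that holds at least a fraction `ratio` of its parent's Chinese characters, otherwise stop) is an assumption, as are all names (`Node`, `TreeBuilder`, `count_chinese`, `select_content_node`, `extract_main_text`) and the choice to count only the basic CJK Unified Ideographs range.

```python
import re
from html.parser import HTMLParser

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # text inside these tags is never page content


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []   # text chunks (str) and child Nodes, in document order
        self.count = 0    # Chinese characters in the whole subtree

    @property
    def children(self):
        return [item for item in self.items if isinstance(item, Node)]


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; tolerant of unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements have no content and no end tag
        node = Node(tag, self.cur)
        self.cur.items.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP:
            self.cur.items.append(data)


def count_chinese(node):
    """Stores and returns the number of Chinese characters under node."""
    node.count = 0
    for item in node.items:
        if isinstance(item, Node):
            node.count += count_chinese(item)
        else:
            node.count += len(CJK.findall(item))
    return node.count


def select_content_node(root, ratio=0.7):
    """Hypothetical rule: descend while one child keeps most of the text."""
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.count)
        if best.count == 0 or best.count < ratio * node.count:
            break
        node = best
    return node


def node_text(node):
    return "".join(node_text(item) if isinstance(item, Node) else item
                   for item in node.items)


def extract_main_text(html, ratio=0.7):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    count_chinese(builder.root)
    return node_text(select_content_node(builder.root, ratio)).strip()
```

Calling `extract_main_text(html)` on a page whose article body sits in one container returns that container's text and leaves out navigation bars and footers, because those hold only a small share of the page's Chinese characters. The `ratio` threshold is a tuning knob of this sketch, not a value from the paper; a page with one dominant paragraph may need a higher value so the descent stops at the enclosing container.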
This article is indexed in CNKI, Wanfang Data, and other databases.