
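Precise Content Extraction from News Web Pages Based on Two-Layer Decisions (基于双层决策的新闻网页正文精确抽取)

Citation:	HU Guo-ping, ZHANG Wei, WANG Ren-hua. Precise Content Extraction from News Web Pages Based on Two-Layer Decisions [J]. Journal of Chinese Information Processing (中文信息学报), 2006, 20(6): 3-9, 103.

Authors:	HU Guo-ping, ZHANG Wei, WANG Ren-hua

Affiliation:	iFly Speech Lab, Department of Electronic Engineering and Information Science, University of Science and Technology of China

Abstract:	This paper proposes an algorithm for precisely extracting the main text of news web pages based on two-layer decisions. The two layers are a global scope decision on the region of the page where the main text lies, and a local content decision on whether each passage of text within that region really is main text. We first give a strict definition of the main text of a news web page according to the needs of practical applications, then analyze the characteristics of news web pages and their main text. On this basis we propose a main text extraction strategy based on two-layer decisions, model both decisions using feature vector extraction and decision tree learning, and carry out model training and testing on 1687 news pages from 10 major news websites in China. The experimental results show that the two-layer method extracts the main text of news web pages precisely: the proportion of pages whose extracted text is not completely identical to the manual annotation is only 18.14%, a relative reduction of 29.85% compared with the method that uses local content decisions alone, and the proportion of pages with an extraction error rate above 10% is only 7.11%, which meets the needs of practical applications.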
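
As a rough check on how the two reported percentages relate, the mismatch rate of the local-decision-only baseline can be backed out, assuming the 29.85% figure is a reduction relative to that baseline. The baseline rate is not stated in the abstract; the value below is implied by the two reported numbers, not reported by the authors.

$$
r_{\text{local}} \approx \frac{18.14\%}{1 - 0.2985} \approx 25.86\%,
\qquad
r_{\text{local}} - 18.14\% \approx 7.72 \text{ percentage points}
$$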
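Keywords:	computer application; Chinese information processing; information extraction; feature vector; decision tree; main text extraction
Article ID:	1003-0077(2006)06-0001-09
Received:	2005-10-09
Revised:	2005-10-09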
Precise Content Extraction from News Web Page Based on Decisions of Two Layers

HU Guo-ping, ZHANG Wei, WANG Ren-hua. Precise Content Extraction from News Web Page Based on Decisions of Two Layers [J]. Journal of Chinese Information Processing, 2006, 20(6): 3-9, 103.

Authors:	HU Guo-ping, ZHANG Wei, WANG Ren-hua

Affiliation:	iFly Speech Lab, University of Science and Technology of China

Abstract:	This paper addresses content extraction from news web pages based on two layers of decisions. The first layer predicts the scope of the content within a web page, and the second judges whether each paragraph within the predicted scope is content. We first present a strict definition of web page content oriented toward practical applications, then analyze the characteristics of news web pages and their content. Based on this analysis, we propose a content extraction method built on the two layers of decisions and carry out experiments on a corpus of 1867 HTML pages collected from 10 major news websites in China. The experimental results show that the method predicts the content of news web pages well: only 18.14% of pages have any mismatch in the extracted content, a relative reduction of 29.85% compared with using the second-layer prediction alone, and only 7.11% of pages have more than 10% mismatch, indicating that the method is suitable for practical applications.

Keywords:	computer application; Chinese information processing; information extraction; feature vector; decision tree; content extraction
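
The two-layer idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `Block` structure, the link-based features, and the hand-written thresholds are hypothetical stand-ins, whereas the paper learns both decisions with decision trees over feature vectors that the abstract does not list. The sketch only shows how a global scope decision and a local per-block decision combine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One text block of a page, in document order (hypothetical structure)."""
    text: str
    link_chars: int = 0  # characters of `text` that sit inside hyperlinks


def scope_score(block: Block) -> int:
    # Hypothetical layer-1 feature: plain-text characters count for a block,
    # link characters count against it (weighted twice as heavily).
    return (len(block.text) - block.link_chars) - 2 * block.link_chars


def predict_scope(blocks: List[Block]) -> Tuple[int, int]:
    """Layer 1 (global): return the [start, end) range with the highest total score.

    A maximum-sum contiguous range stands in for the learned scope model.
    """
    best, best_range = 0, (0, 0)
    running, start = 0, 0
    for i, block in enumerate(blocks):
        if running <= 0:  # a non-positive prefix never helps; restart here
            running, start = 0, i
        running += scope_score(block)
        if running > best:
            best, best_range = running, (start, i + 1)
    return best_range


def is_content(block: Block) -> bool:
    """Layer 2 (local): hand-written rules standing in for a learned decision tree."""
    length = len(block.text)
    if length == 0:
        return False
    if block.link_chars / length > 0.5:  # mostly link text: treat as navigation
        return False
    has_sentence_punct = any(ch in block.text for ch in ".,;!?。，；！？")
    return length >= 20 and has_sentence_punct


def extract_content(blocks: List[Block]) -> List[str]:
    """Apply the global scope decision, then the local decision inside that scope."""
    start, end = predict_scope(blocks)
    return [b.text for b in blocks[start:end] if is_content(b)]


page = [
    Block("Home | World | Sports | Finance", link_chars=22),
    Block("The city council approved the new transit budget on Monday, officials said."),
    Block("Related: more transit stories", link_chars=20),
    Block("Construction is expected to begin next spring, according to the plan."),
    Block("Top stories | Most read | Photo galleries | Video | Weather | Classifieds | Jobs",
          link_chars=62),
    Block("Copyright 2006 Example News. All rights reserved."),
]
print(extract_content(page))
```

On this toy page the copyright line passes the local rule on its own, but it falls outside the scope chosen by the first layer, so it is not extracted. The link-heavy "Related" line falls inside the scope and is removed by the second layer.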
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.