A Content Extraction Method for Short Web Pages

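Cite this article:	XI Jiazhen, GUO Yan, LI Qiang, ZHAO Ling, LIU Yue, YU Xiaoming, CHENG Xueqi. A Content Extraction Method for Short Web Pages[J]. Journal of Chinese Information Processing, 2016, 30(1): 8-16.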

Authors:	XI Jiazhen, GUO Yan, LI Qiang, ZHAO Ling, LIU Yue, YU Xiaoming, CHENG Xueqi

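Affiliation:	1. CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100080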

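Funding:	National Basic Research Program of China (973 Program) (2014CB340401, 2013CB329602); Key Program of the National Natural Science Foundation of China (61232010); National Science and Technology Support Program (2012BAH39B04)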

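Abstract:	As the Internet develops, web page layouts are becoming increasingly varied. Web pages with short main content are increasingly common, and traditional automatic content extraction methods perform poorly on them. To address this problem, this paper proposes an automatic content extraction method for single-record web pages (news, blogs, etc.) with short main content. The method first classifies web pages with a short-content page classification algorithm and then, for the short-content pages, extracts the main content with an extraction algorithm based on page depth and text density.
Keywords:	short content; content extraction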
A Content Extraction Method for Short Web Pages

XI Jiazhen, GUO Yan, LI Qiang, ZHAO Ling, LIU Yue, YU Xiaoming, CHENG Xueqi. A Content Extraction Method for Short Web Pages[J]. Journal of Chinese Information Processing, 2016, 30(1): 8-16.

Authors:	XI Jiazhen, GUO Yan, LI Qiang, ZHAO Ling, LIU Yue, YU Xiaoming, CHENG Xueqi

Affiliation:	1. CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100080, China

Abstract:	To handle the growing number of web pages with short content, this paper proposes first classifying web pages into two types: short-content pages and long-content pages. An algorithm that combines DOM tree depth and text density is then designed to extract the content from short-content pages.

Keywords:	short content; content extraction
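
The abstract names the two signals used in the extraction step (DOM tree depth and text density) but gives no formulas. The following is a minimal sketch of how such signals could be combined, not the authors' algorithm. The scoring rule (`density * log(1 + depth)`), the helper names (`Node`, `TreeBuilder`, `score`, `extract_main_text`), and the toy page `PAGE` are all assumptions made for illustration. The page classification step is not sketched, since the abstract does not describe it.

```python
import math
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    """One DOM element; `items` holds child nodes and text chunks in order."""

    def __init__(self, tag, depth):
        self.tag, self.depth, self.items = tag, depth, []

    def nodes(self):
        return [i for i in self.items if isinstance(i, Node)]

    def tag_count(self):
        # Number of elements in this subtree, including the node itself.
        return 1 + sum(child.tag_count() for child in self.nodes())

    def get_text(self):
        parts = [i if isinstance(i, str) else i.get_text() for i in self.items]
        return " ".join(" ".join(parts).split())


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree while recording each element's depth."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", 0)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, len(self.stack))
        self.stack[-1].items.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag not in SKIP and data.strip():
            self.stack[-1].items.append(data.strip())


def score(node):
    # Assumed scoring rule: characters per element, weighted by depth.
    density = len(node.get_text()) / node.tag_count()
    return density * math.log(1 + node.depth)


def walk(node):
    for child in node.nodes():
        yield child
        yield from walk(child)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(walk(builder.root), key=score, default=None)
    return best.get_text() if best else ""


# Hypothetical short-content page: navigation, one short paragraph, footer links.
PAGE = """
<html><head><title>Demo</title><style>p{margin:0}</style></head>
<body>
  <ul id="nav"><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>
  <div id="main"><div class="post">
    <h1>Weather notice</h1>
    <p>Heavy rain is expected tonight, so the school sports day is postponed to Friday.</p>
  </div></div>
  <div id="footer"><a href="/about">About</a> | <a href="/contact">Contact</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(PAGE))
```

On this toy page the single `<p>` element scores highest, because it is both deep in the tree and dense in text relative to the navigation and footer links. This scoring favors one dense leaf node, which suits a one-paragraph page; content spread over several sibling paragraphs would need some form of aggregation over the parent node.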
