Method of Web Page Text Extraction Based on Text Feature and Page Structure

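Cite this article:	HU Lu-Lu, LIU Xiao-Qin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230-235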

Authors:	HU Lu-Lu, LIU Xiao-Qin, SUN Kai

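Affiliations:	(1. Key Laboratory of Atmospheric Composition and Optical Radiation, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, Anhui, China; 2. Department of Automation, University of Science and Technology of China, Hefei 230026, Anhui, China)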

Funding:	Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB05040300)

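Abstract:	Web information extraction has long been a research focus in information technology, and in recent years the DIV+CSS layout method has become widely used in web page design. On this basis, a relatively simple and practical method is proposed for extracting the main text of news web pages using text features and page structure. The method first identifies and extracts the text content block of the page, then uses regular expressions to filter out the HTML tags in that block and extract the main text. Experimental results show that the method achieves high generality and accuracy in main text extraction.
Keywords:	information extraction; text features; page structure; text content block; regular expressions
Received:	2016-01-12
Revised:	2016-02-17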
Method of Web Page Text Extraction Based on Text Feature and Page Structure

HU Lu-Lu,LIU Xiao-Qin,SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230-235

Authors:	HU Lu-Lu, LIU Xiao-Qin, SUN Kai

Affiliation:	(1. Key Laboratory of Atmospheric Composition and Optical Radiation, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, Anhui, China; 2. Department of Automation, University of Science and Technology of China, Hefei 230026, Anhui, China)

Abstract:	Web information extraction has long been an active research topic in information technology, and in recent years the DIV+CSS page layout method has come into common use in web design. Building on this, a simple and practical method for extracting the main text of news web pages, based on text features and page structure, is presented. The text content block of the page is first identified and extracted; regular expressions are then used to filter the HTML tags out of the content block, yielding the main text of the page. Experimental results show that the method offers high generality and accuracy in text extraction.

Keywords:	information extraction; text features; page structure; text content block; regular expressions
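
The two steps named in the abstract (locate the text content block, then strip HTML tags with regular expressions) can be illustrated with a minimal Python sketch. The abstract does not say which text features the authors use, so the scoring below (punctuation count discounted by link density) is only an assumed stand-in, and the names `div_blocks`, `score`, `strip_tags`, and `extract_main_text` are hypothetical. It also assumes well-formed, DIV-based markup; it is not the authors' implementation.

```python
import html
import re

# Patterns used by this sketch (all assumptions, not taken from the paper).
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")
PUNCT = re.compile(r"[,.;!?，。；！？]")


def strip_tags(fragment):
    """Remove scripts, styles, comments, and tags; return collapsed plain text."""
    fragment = SCRIPT_STYLE.sub(" ", fragment)
    fragment = COMMENT.sub(" ", fragment)
    text = html.unescape(ANY_TAG.sub(" ", fragment))
    return re.sub(r"\s+", " ", text).strip()


def div_blocks(page):
    """Return the inner HTML of every <div>, pairing open and close tags with a stack."""
    stack, blocks = [], []
    for match in DIV_TAG.finditer(page):
        if match.group(1) == "":           # opening <div ...>
            stack.append(match.end())
        elif stack:                        # closing </div> with a matching opener
            blocks.append(page[stack.pop():match.start()])
    return blocks


def score(block):
    """Assumed text feature: punctuation count, discounted by the share of link text."""
    text = strip_tags(block)
    if not text:
        return 0.0
    link_chars = sum(len(strip_tags(inner)) for inner in ANCHOR.findall(block))
    link_density = min(link_chars / len(text), 1.0)
    return len(PUNCT.findall(text)) * (1.0 - link_density)


def extract_main_text(page):
    """Pick the best-scoring <div> block and return its text with tags removed."""
    blocks = div_blocks(page)
    if not blocks:
        return strip_tags(page)            # no DIV layout: fall back to the whole page
    return strip_tags(max(blocks, key=score))
```

Because inner blocks close before their wrappers, a tie between a content block and its enclosing block resolves to the inner one, while a wrapper that also holds navigation links is penalized by its higher link density.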
