A Web Page Information Extraction Method Based on SVM and Text Density Features

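Cite this article:	Zhou Yanping, Li Jinpeng, Song Qunbao. A Web Page Information Extraction Method Based on SVM and Text Density Features[J]. Computer Applications and Software, 2019, 36(10): 251-255, 261.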

Authors:	Zhou Yanping, Li Jinpeng, Song Qunbao

Affiliation:	School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong (all three authors)

Funding:	National Natural Science Foundation of China; Higher Education Institutions Science and Technology Program

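Abstract:	To address the increasing diversity, complexity, and non-standardization of web pages, this paper proposes a web page information extraction method based on SVM and text density features. The method first parses the whole web page into a DOM tree, then defines five web page density features based on the page structure, analyzes density ratios with a mathematical model, and trains on the sample data using a Gaussian kernel function (RBF). The trained model can accurately remove noise such as web page advertisements, navigation, and copyright information while retaining the main-text blocks; as a final step, noise is removed within the main-text blocks. Experiments show that the method is both highly accurate and broadly applicable.
Keywords:	SVM; main-text extraction; DOM tree; text density features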
A WEB PAGE INFORMATION EXTRACTION METHOD BASED ON SVM AND TEXT DENSITY FEATURES

Zhou Yanping, Li Jinpeng, Song Qunbao. A WEB PAGE INFORMATION EXTRACTION METHOD BASED ON SVM AND TEXT DENSITY FEATURES[J]. Computer Applications and Software, 2019, 36(10): 251-255, 261.

Authors:	Zhou Yanping, Li Jinpeng, Song Qunbao

Affiliation:	(School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong, China)

Abstract:	To address the growing diversity, complexity, and non-standardization of web pages, this paper proposes a web page information extraction method based on SVM and text density features. The method parses the whole web page into a DOM tree and defines five web page density features according to the page structure. It then uses a mathematical model to analyze density ratios and trains on the sample data with a Gaussian kernel function. The trained model can accurately remove noise such as web page advertisements, navigation, and copyright information while retaining the body information blocks, after which noise is removed within those blocks. Experiments show that the method achieves high accuracy and is broadly applicable.

Keywords:	SVM; Text extraction; DOM tree; Text density features
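
The abstract describes parsing a page into a DOM tree and computing density features per block before classification. The sketch below illustrates that general idea only; it is not the authors' implementation. The two features shown (text characters per tag, and the share of text inside links) are common illustrative choices and are assumptions here, since the abstract does not list the paper's five features. The names `Node`, `TreeBuilder`, `subtree_counts`, and `block_features` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TEXT = {"script", "style"}  # text inside these is not visible content


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_chars = 0  # visible text characters directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a minimal tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TEXT:
            self.current.own_chars += len(data.strip())


def subtree_counts(node):
    """Return (text_chars, tag_count, link_chars) for the subtree at node."""
    chars, tags, link_chars = node.own_chars, 1, 0
    for child in node.children:
        c, t, l = subtree_counts(child)
        chars, tags, link_chars = chars + c, tags + t, link_chars + l
    if node.tag == "a":
        link_chars = chars  # all text under a link counts as link text
    return chars, tags, link_chars


def block_features(html):
    """Return (tag, text_density, link_ratio) for each top-level block."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    features = []
    for block in builder.root.children:
        chars, tags, link_chars = subtree_counts(block)
        text_density = chars / tags                # assumed feature 1
        link_ratio = link_chars / max(chars, 1)    # assumed feature 2
        features.append((block.tag, text_density, link_ratio))
    return features


SAMPLE = (
    '<div id="nav"><a href="/">Home</a><a href="/n">News</a></div>'
    '<div id="body"><p>This is a long paragraph of article text.</p>'
    '<p>More body text here.</p></div>'
)

for tag, density, ratio in block_features(SAMPLE):
    print(tag, round(density, 2), round(ratio, 2))
```

In this toy page the navigation block has low text density and consists entirely of link text, while the body block has higher density and no link text. Feature vectors of this kind, once labeled, could be passed to an RBF-kernel SVM such as scikit-learn's `SVC(kernel="rbf")`; that pairing is an assumption about tooling, not something the abstract specifies.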
This article is indexed in databases including VIP and Wanfang Data.
