A Method of Web Information Extraction Based on Classification Algorithm (一种基于分类算法的网页信息提取方法)

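Cite this article:	WANG Jian-Wei, YANG Dong-Qing, GAO Jun, WANG Teng-Jiao. A Method of Web Information Extraction Based on Classification Algorithm[J]. Computer Science (计算机科学), 2008, 35(3): 91-93.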

Authors:	WANG Jian-Wei (汪建伟), YANG Dong-Qing (杨冬青), GAO Jun (高军), WANG Teng-Jiao (王腾蛟)

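Affiliations:	1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Military Traffic College, Tianjin 300161 2. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871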

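Abstract:	Many current Web information extraction techniques are based on HTML structure. Because HTML structure changes frequently, extraction templates must be updated often, and updating them requires considerable domain knowledge. This paper proposes a Web information extraction method based on a classification algorithm: the text of a Web page is grouped according to its display attributes, the text is then classified on the basis of its display attribute values, and the text of interest is obtained, which completes the extraction of information from the page. The method is simple to operate, easy to implement, and depends little on page structure.
Keywords:	information extraction; attribute vector; Wrapper; display attributes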
A Method of Web Information Extraction Based on Classification Algorithm

WANG Jian-Wei, YANG Dong-Qing, GAO Jun, WANG Teng-Jiao. A Method of Web Information Extraction Based on Classification Algorithm[J]. Computer Science, 2008, 35(3): 91-93.

Authors:	WANG Jian-Wei, YANG Dong-Qing, GAO Jun, WANG Teng-Jiao

Affiliation:	WANG Jian-Wei (1, 2), YANG Dong-Qing (1), GAO Jun (1), WANG Teng-Jiao (1). 1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; 2. Military Traffic College, CPLA, Tianjin 300161

Abstract:	Most existing Web information extraction algorithms are based on HTML structure. Because the structure of HTML files changes frequently, wrappers must be updated accordingly, and updating a wrapper requires a lot of domain knowledge. This paper presents a new Web information extraction method based on a classification algorithm, which groups Web text by its HTML display attributes. The information extraction of Web pages is finished by classifying the Web text with different va...

Keywords:	Web information extraction; Attribute vector; Wrapper; Display attributes
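
The idea summarized in the abstract (group page text by display attributes, then classify each group by its attribute values) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three attributes tracked (`bold`, `size`, `color`), the tag set in `TRACKED`, the helper names, and the 1-nearest-neighbour stand-in classifier. The abstract does not say which attributes or which classification algorithm the authors use.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical choices for this sketch (not from the paper).
DEFAULTS = {"bold": 0, "size": 3, "color": "black"}
TRACKED = {"b", "strong", "font"}  # tags assumed to change display state


class DisplayAttrParser(HTMLParser):
    """Collect (attribute_vector, text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.stack = [dict(DEFAULTS)]  # inherited display state
        self.items = []                # list of (vector, text)

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED:
            return
        state = dict(self.stack[-1])
        attrs = dict(attrs)
        if tag in ("b", "strong"):
            state["bold"] = 1
        size = attrs.get("size") or ""
        if size.isdigit():
            state["size"] = int(size)
        if attrs.get("color"):
            state["color"] = attrs["color"].lower()
        self.stack.append(state)

    def handle_endtag(self, tag):
        # Assumes tracked tags are properly nested.
        if tag in TRACKED and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            s = self.stack[-1]
            self.items.append(((s["bold"], s["size"], s["color"]), text))


def group_by_display(html):
    """Group text fragments that share the same display attribute vector."""
    parser = DisplayAttrParser()
    parser.feed(html)
    parser.close()
    groups = defaultdict(list)
    for vector, text in parser.items:
        groups[vector].append(text)
    return dict(groups)


def classify(vector, labeled):
    """Stand-in classifier: label of the example sharing the most attributes."""
    best = max(labeled, key=lambda ex: sum(a == b for a, b in zip(vector, ex[0])))
    return best[1]


def extract(html, labeled, target):
    """Return all text whose display group is classified as `target`."""
    return [text
            for vector, texts in group_by_display(html).items()
            if classify(vector, labeled) == target
            for text in texts]


if __name__ == "__main__":
    page = ('<b><font size="5" color="red">Breaking news</font></b>'
            '<font size="2">Copyright 2008</font><p>Body text<br>more</p>')
    examples = [((1, 5, "red"), "title"),
                ((0, 3, "black"), "body"),
                ((0, 2, "black"), "footer")]
    print(extract(page, examples, "title"))
```

Because the sketch reads only display attributes and never the position of a node in the document tree, moving the same styled text to a different place in the page does not change the result, which is the reduced dependence on page structure that the abstract claims for the method.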
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.