
A Method for Extracting Web Page Topic Information Based on Characteristic Symbols

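Cite this article:	WANG Shu, ZHU Min, ZHANG Ming, NIU Hao, ZHAO Yu. A method for extracting Web page topic information based on characteristic symbols [J]. Application Research of Computers, 2009, 26(12): 4539-4541.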

Authors:	WANG Shu, ZHU Min, ZHANG Ming, NIU Hao, ZHAO Yu

Affiliations:	1. College of Computer Science, Sichuan University, Chengdu 610064; 2. Sichuan Institute of Computer Sciences, Chengdu 610041

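Abstract:	With the growing popularity of the Internet, the massive volume of data on the Web poses greater challenges for text mining, and for Web page topic extraction in particular. Existing text extraction methods achieve high accuracy but cannot provide the generality that Web mining requires. By studying the structure of Web pages, this paper improves the Web page document tree model, identifies general rules of page structure, and proposes CECS (content extraction characteristic symbols), an extraction method based on characteristic symbols that is combined with relevance to extract the topic content of a page. Experiments show that the proposed algorithm is highly accurate and general.
Keywords:	document tree model; characteristic symbols; relevance; topic extraction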
Content extraction of Web pages based on characteristic symbols

WANG Shu, ZHU Min, ZHANG Ming, NIU Hao, ZHAO Yu. Content extraction of Web pages based on characteristic symbols [J]. Application Research of Computers, 2009, 26(12): 4539-4541.

Authors:	WANG Shu, ZHU Min, ZHANG Ming, NIU Hao, ZHAO Yu

Affiliation:	(1. College of Computer Science, Sichuan University, Chengdu 610064; 2. Sichuan Institute of Computer Sciences, Chengdu 610041, China)

Abstract:	With the growing popularity of the Internet, the large amount of data on the Web poses many challenges for data mining techniques, especially for content extraction from Web pages. Existing methods achieve high accuracy but cannot guarantee the generality that Web mining approaches require. By studying the internal structure of Web pages, this paper proposes an improved document tree model and identifies general rules of page structure. It then proposes CECS (content extraction characteristic symbols), a method based on characteristic symbols that is combined with relevance to extract the main content of Web pages. The experimental results show that the proposed method is both accurate and general.

Keywords:	document tree model; characteristic symbols; relevance; content extraction
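
The abstract names the ingredients of the approach (a document tree, characteristic symbols, and a relevance measure) but does not spell out the algorithm. The following is only a rough sketch of how such a pipeline could look, not the authors' CECS implementation. It assumes that "characteristic symbols" are sentence punctuation marks, that candidate blocks are the text of block-level elements, and that relevance is word overlap with the page title; the names `BlockCollector`, `relevance`, `extract_content`, and both thresholds are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: characteristic symbols are sentence punctuation, which is
# frequent in body text and rare in menus and link lists.
SYMBOLS = set("，。！？；：、,.!?;:")
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the page title and the text of each block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished text blocks, in document order
        self.title = ""
        self._buf = []          # text pieces of the block being read
        self._skip = 0          # nesting depth inside <script>/<style>
        self._in_title = False

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def relevance(block, title):
    """Fraction of title words that also occur in the block (assumed measure)."""
    title_words = set(re.findall(r"\w+", title.lower()))
    if not title_words:
        return 0.0
    block_words = set(re.findall(r"\w+", block.lower()))
    return len(title_words & block_words) / len(title_words)


def extract_content(html, min_symbols=2, min_relevance=0.0):
    """Keeps blocks with enough characteristic symbols and title relevance."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        n_symbols = sum(ch in SYMBOLS for ch in block)
        if n_symbols >= min_symbols and relevance(block, parser.title) >= min_relevance:
            kept.append(block)
    return kept
```

Calling `extract_content(page_html, min_symbols=2, min_relevance=0.3)` would return the text blocks that pass both filters. The word-overlap measure only suits space-delimited text; Chinese pages would need a word segmenter or character n-grams instead.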
This article is indexed in Wanfang Data and other databases.