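A Web Page Topic Information Extraction Method Based on Characteristic Symbols (一种基于特征符号的网页主题信息抽取方法)
Cite this article: 王舒, 朱敏, 张明, 牛颢, 赵瑜. 一种基于特征符号的网页主题信息抽取方法[J]. 计算机应用研究, 2009, 26(12): 4539-4541.
Authors: 王舒  朱敏  张明  牛颢  赵瑜
Affiliations: 1. College of Computer Science, Sichuan University, Chengdu 610064
2. Sichuan Institute of Computer Sciences, Chengdu 610041
Abstract: As the Internet becomes increasingly widespread, the massive volume of data on the Web poses further challenges for text mining, and for web page topic extraction in particular. Existing text extraction methods achieve high accuracy but cannot at the same time provide the generality that Web mining requires. By studying the structure of Web pages, this paper improves the web page document tree model, identifies general rules of page structure, and proposes a characteristic-symbol-based extraction method, CECS (content extraction characteristic symbols), which uses relevance to extract the topic content of a page. Experiments show that the proposed algorithm is highly accurate and general.

Keywords: document tree model    characteristic symbols    relevance    topic extraction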

Content extraction of Web pages based on characteristic symbols
WANG Shu, ZHU Min, ZHANG Ming, NIU Hao, ZHAO Yu. Content extraction of Web pages based on characteristic symbols[J]. Application Research of Computers, 2009, 26(12): 4539-4541.
Authors: WANG Shu  ZHU Min  ZHANG Ming  NIU Hao  ZHAO Yu
Affiliation: (1. College of Computer Science, Sichuan University, Chengdu 610064; 2. Sichuan Institute of Computer Sciences, Chengdu 610041, China)
Abstract: With the popularity of the Internet, the large amount of data on the Web poses many challenges for data mining techniques, especially for content extraction from Web pages. Existing methods cannot guarantee both the generality and the effectiveness of Web mining approaches. By studying the internal structure of Web pages, this paper proposes an improved document tree model and identifies general rules for analyzing it. It then extracts content from Web pages based on characteristic symbols. The experimental results show that the proposed method is both accurate and general.
Keywords: document tree model  characteristic symbols  relevance  content extraction
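
The abstract does not say which symbols count as "characteristic" or how relevance is computed, so the following is only a minimal sketch of the general idea, not the CECS algorithm itself. It assumes that characteristic symbols are sentence punctuation marks, that relevance is the fraction of title words found in a text block, and that a flat list of block-level elements is an acceptable stand-in for the document tree. All names and thresholds (`SYMBOLS`, `BlockCollector`, `extract_content`, `min_symbols`, `min_relevance`) are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: "characteristic symbols" are sentence punctuation marks.
SYMBOLS = set(",.;!?,。;!?")
# Assumption: these tags delimit candidate text blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the page title and the text of each block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished text blocks, in document order
        self.title = ""
        self._buf = []          # text fragments of the block being read
        self._skip = 0          # nesting depth inside script/style
        self._in_title = False

    def flush_block(self):
        # Normalise whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._buf.append(data)


def symbol_count(text):
    """Number of characteristic symbols in a block."""
    return sum(ch in SYMBOLS for ch in text)


def relevance(text, title):
    """Assumed measure: fraction of title words that occur in the block."""
    words = set(title.lower().split())
    if not words:
        return 0.0
    lowered = text.lower()
    return sum(w in lowered for w in words) / len(words)


def extract_content(html, min_symbols=2, min_relevance=0.5):
    """Keep blocks that are rich in symbols and related to the title."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()  # store any text left after the last block tag
    return [
        block for block in parser.blocks
        if symbol_count(block) >= min_symbols
        and relevance(block, parser.title) >= min_relevance
    ]
```

Calling `extract_content(page_html)` returns the retained blocks in document order. Navigation bars and copyright lines usually contain little sentence punctuation, which is why a symbol count can separate them from body text. The whitespace-based word split in `relevance` would need to be replaced by a word segmenter for Chinese pages.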
This article is indexed in 万方数据 (Wanfang Data) and other databases.