
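Research on a Fast Feature-String-Based Deduplication Algorithm for Large-Scale Chinese Web Pages
Cite this article: WU Ping-bo, CHEN Qun-xiu, MA Liang. Research on a Fast Feature-String-Based Deduplication Algorithm for Large-Scale Chinese Web Pages[J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36.
Authors: WU Ping-bo, CHEN Qun-xiu, MA Liang
Affiliation: State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
Abstract: Search results often contain redundant pages with identical content, many of them produced by reprinting between websites. Such pages waste storage and inconvenience users during retrieval. Based on the characteristics of redundant web pages, this paper introduces the idea of fuzzy matching and, using the content and structural information of web page text, proposes a fast feature-string-based deduplication algorithm for Chinese web pages, together with optimizations of the algorithm. Experimental results show that the algorithm is effective: in large-scale open testing, the recall of duplicate pages reaches 97.3% and the deduplication precision reaches 99.5%.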

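Keywords: computer applications; Chinese information processing; feature string; fuzzy matching; deduplication algorithm; redundant web pages
Article ID: 1003-0077(2003)02-0028-08
Revised manuscript received: September 13, 2002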

The Study on Large Scale Duplicated Web Pages of Chinese Fast Deletion Algorithm Based on String of Feature Code
WU Ping-bo, CHEN Qun-xiu, MA Liang. The Study on Large Scale Duplicated Web Pages of Chinese Fast Deletion Algorithm Based on String of Feature Code[J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36.
Authors: WU Ping-bo, CHEN Qun-xiu, MA Liang
Affiliation:The State Key Laboratory of Intelligent Technology and System, Department of Computer Science and Technology, Tsinghua University
Abstract: Reprinting of information between websites produces a great deal of redundant web pages, which not only waste storage resources but also burden users during retrieval and reading. In this paper, an algorithm based on a string of feature code is developed to remove duplicated web pages after analyzing the features of redundant web pages. The idea of fuzzy matching and the content and structure information of web page text are introduced into the algorithm, and the efficiency of the algorithm is optimized. The experimental results show that the algorithm is effective: the recall rate of duplicated web pages reaches 97.3%, and the precision rate of duplication removal reaches 99.5% in large-scale testing.
Keywords:computer application  Chinese information processing  String of Feature Code  Fuzzy Matching  Duplicate Removal Algorithm  Redundant Web Pages  
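
The abstract does not say how a feature string is built or how fuzzy matching is scored, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that the characters next to sentence-final punctuation serve as the feature string and that two pages count as duplicates when their feature strings are similar enough. The names `feature_string`, `is_duplicate`, and `deduplicate`, the window size `k`, and the threshold `0.8` are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: sentence-final marks act as anchors for feature extraction.
ANCHORS = "。！？"


def feature_string(text: str, k: int = 2) -> str:
    """Concatenate up to k characters on each side of every anchor mark."""
    # Drop whitespace so layout differences between sites do not matter.
    compact = re.sub(r"\s+", "", text)
    parts = []
    for i, ch in enumerate(compact):
        if ch in ANCHORS:
            left = compact[max(0, i - k):i]
            right = compact[i + 1:i + 1 + k]
            parts.append(left + right)
    return "".join(parts)


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy match: compare feature strings instead of full page text."""
    fa, fb = feature_string(a), feature_string(b)
    if not fa or not fb:
        return False  # no anchors found, so nothing to compare
    # Hypothetical similarity measure; the paper's own measure is not given.
    return SequenceMatcher(None, fa, fb).ratio() >= threshold


def deduplicate(pages: list[str]) -> list[str]:
    """Keep the first page of each group of near-duplicates (quadratic scan)."""
    kept: list[str] = []
    for page in pages:
        if not any(is_duplicate(page, other) for other in kept):
            kept.append(page)
    return kept


if __name__ == "__main__":
    original = "今天天气很好。我们去公园散步。孩子们玩得很开心。晚上回家吃饭。"
    reprint = "本文转载自网络。" + original  # a reprint with an added notice
    other = "股市今日大幅上涨。投资者情绪乐观。分析师建议谨慎。成交量创下新高。"
    for page in deduplicate([original, reprint, other]):
        print(page)
```

Comparing short feature strings rather than whole pages keeps each comparison cheap, and the similarity threshold lets a reprint with an added notice still match its source. The pairwise scan here is quadratic in the number of pages; a large-scale system would need an index over feature strings, which this sketch does not attempt.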
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.