Research on a Fast Feature-String-Based Deduplication Algorithm for Large-Scale Chinese Web Pages

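Cite this article:	WU Ping-bo, CHEN Qun-xiu, MA Liang. Research on a Fast Feature-String-Based Deduplication Algorithm for Large-Scale Chinese Web Pages[J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36.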

Authors:	WU Ping-bo, CHEN Qun-xiu, MA Liang

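Affiliation:	The State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University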

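Abstract:	In web search results, users often receive redundant pages with identical content, many of which result from reposting between websites. These pages not only waste storage resources but also inconvenience users during retrieval. Based on the characteristics of redundant web pages, this paper introduces the idea of fuzzy matching and, using the content and structural information of web page text, proposes a fast feature-string-based deduplication algorithm for Chinese web pages, together with optimizations to the algorithm. Experimental results show that the algorithm is effective: in a large-scale open test, the recall of duplicate web pages reaches 97.3% and the deduplication precision reaches 99.5%.
Keywords:	computer application; Chinese information processing; feature string; fuzzy matching; deduplication algorithm; redundant web pages
Article ID:	1003-0077(2003)02-0028-08
Revision date:	September 13, 2002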
The Study on Large Scale Duplicated Web Pages of Chinese Fast Deletion Algorithm Based on String of Feature Code

WU Ping-bo, CHEN Qun-xiu, MA Liang. The Study on Large Scale Duplicated Web Pages of Chinese Fast Deletion Algorithm Based on String of Feature Code[J]. Journal of Chinese Information Processing, 2003, 17(2): 29-36.

Authors:	WU Ping-bo, CHEN Qun-xiu, MA Liang

Affiliation:	The State Key Laboratory of Intelligent Technology and System, Department of Computer Science and Technology, Tsinghua University

Abstract:	Reposting of content between websites produces a large number of redundant web pages, which not only waste storage resources but also burden users during retrieval and reading. After analyzing the characteristics of redundant web pages, this paper develops an algorithm based on strings of feature code to remove duplicate web pages. The algorithm introduces the idea of fuzzy matching and uses the content and structural information of web page text, and its efficiency is optimized. Experimental results show that the algorithm is effective: in large-scale testing, the recall of duplicate web pages reaches 97.3% and the precision of duplicate removal reaches 99.5%.

Keywords:	computer application; Chinese information processing; string of feature code; fuzzy matching; duplicate removal algorithm; redundant web pages
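
The abstract does not say how a feature string is built or how fuzzy matching is applied, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes, purely for illustration, that a page's feature string is formed from the few characters on either side of each Chinese full stop, and it uses Python's `difflib.SequenceMatcher` as a stand-in for fuzzy matching. The names `feature_string` and `is_duplicate`, the `width` parameter, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: the Chinese full stop is used as the anchor for feature codes.
SENTENCE_END = "。"


def feature_string(text: str, width: int = 2) -> str:
    """Concatenate up to `width` characters on each side of every full stop."""
    codes = []
    for i, ch in enumerate(text):
        if ch == SENTENCE_END:
            left = text[max(0, i - width):i]
            right = text[i + 1:i + 1 + width]
            codes.append(left + right)
    return "".join(codes)


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two pages as duplicates if their feature strings are similar enough.

    The threshold is a hypothetical value chosen for this sketch.
    """
    fa, fb = feature_string(a), feature_string(b)
    if not fa or not fb:
        # No full stops found, so there is nothing to compare.
        return False
    return SequenceMatcher(None, fa, fb).ratio() >= threshold


original = "今天天气很好。我们去公园玩。公园里有很多花。大家都很开心。"
reposted = "【转载】" + original + "来源：某网站"
unrelated = "明天要下雨。记得带伞。"

print(is_duplicate(original, reposted))
print(is_duplicate(original, unrelated))
```

Comparing short feature strings rather than full page text keeps each comparison cheap, and a similarity threshold rather than exact equality tolerates the headers and source notes that reposting sites typically add.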
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.