A systematic study on parameter correlations in large-scale duplicate document detection

Authors: Shaozhi Ye, Ji-Rong Wen, Wei-Ying Ma

Affiliations: (1) Department of Computer Science, University of California, Davis, CA 95616, USA; (2) Microsoft Research Asia, Beijing, P.R. China

Abstract: Although much work has been done on duplicate document detection (DDD) and its applications, a systematic study of the performance and scalability of large-scale DDD algorithms has been absent. It remains unclear how various DDD parameters correlate with one another, such as the similarity threshold, the precision/recall requirement, the sampling ratio, and document size. This paper explores the correlations among the most important parameters in DDD; the impact of the sampling ratio is of greatest interest, since it heavily affects both the accuracy and the scalability of DDD algorithms. We conduct an empirical analysis on one million HTML documents from the TREC .GOV collection. Experimental results show that even with the same sampling ratio, the precision of DDD varies greatly across documents of different sizes. Based on this observation, we propose an adaptive sampling strategy for DDD, which minimizes the sampling ratio subject to a given precision requirement. We believe the insights from our analysis are helpful for guiding future large-scale DDD work. A preliminary version of this paper appeared in the Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, April 2006. This work was conducted while Shaozhi Ye was visiting Microsoft Research Asia.

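The shingling-with-sampling scheme whose parameters the paper studies can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: documents are shingled into k-word windows, shingles are sampled by keeping only those whose hash is 0 modulo `m` (sampling ratio 1/m), and the resemblance of two documents is estimated as the Jaccard coefficient of their sampled shingle sets. All function names here are ours.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of all k-word shingles (contiguous word windows) of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def sample_shingles(shingle_set, m):
    """Keep a shingle iff its hash is 0 mod m, i.e. a sampling ratio of 1/m."""
    def h(s):
        return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
    return {s for s in shingle_set if h(s) % m == 0}


def estimated_similarity(doc_a, doc_b, k=4, m=2):
    """Estimate Jaccard resemblance of two documents from their sampled shingles."""
    sa = sample_shingles(shingles(doc_a, k), m)
    sb = sample_shingles(shingles(doc_b, k), m)
    union = sa | sb
    if not union:
        return 0.0  # nothing survived sampling; no evidence of overlap
    return len(sa & sb) / len(union)
```

With m=1 (no sampling) this computes the exact shingle Jaccard similarity; larger m shrinks the index but makes the estimate noisier, and, as the paper observes, the resulting precision depends strongly on document size, since short documents may retain very few (or zero) sampled shingles.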
Keywords: Duplicate document detection; Clustering; Sampling; Shingling