String techniques for detecting duplicates in document databases

Authors:	Daniel P. Lopresti

Affiliation:	(1) Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211, USA;(2) Department of Electrical Engineering, University of Arkansas, Fayetteville, AR 72701, USA;(3) Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA

Abstract:	Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a "duplicate". To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude.
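The abstract refers to algorithms adapted from approximate string matching but does not spell them out. As a generic illustration only, and not a reconstruction of any of the paper's four models, the sketch below computes the classic unit-cost edit distance between two text strings (for example, OCR output from two scanned pages) and flags them as likely duplicates when the distance is small relative to their length. The function names and the `max_error_rate` threshold are assumptions introduced here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def looks_like_duplicate(a: str, b: str, max_error_rate: float = 0.1) -> bool:
    """Hypothetical decision rule: normalized edit distance under a threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are trivially identical
    return edit_distance(a, b) / longest <= max_error_rate
```

This comparison takes time proportional to the product of the two string lengths, which is one reason speed-up heuristics of the kind the abstract mentions matter when a whole database has to be searched.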

This article is indexed in SpringerLink and other databases.