String techniques for detecting duplicates in document databases |
| |
Authors: | Daniel P. Lopresti |
| |
Affiliation: | (1) Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211, USA; (2) Department of Electrical Engineering, University of Arkansas, Fayetteville, AR 72701, USA; (3) Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA |
| |
Abstract: | Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a "duplicate". To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|
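The abstract says the paper's algorithms are adapted from approximate string matching but does not spell out the four models. As a rough illustration of that general family only, the sketch below computes the classic unit-cost edit distance between two text strings (for example, OCR output from two scanned pages) and flags a pair as a probable duplicate when the distance is small relative to the longer string. The function names and the `max_error_rate` threshold are hypothetical choices for this example and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def is_probable_duplicate(a: str, b: str, max_error_rate: float = 0.2) -> bool:
    """Hypothetical duplicate test: edit distance relative to the longer string.

    max_error_rate is an assumed tuning knob, not a value from the paper.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are trivially identical
    return edit_distance(a, b) / longest <= max_error_rate


if __name__ == "__main__":
    clean = "duplicate detection"
    noisy = "dup1icate detecti0n"  # two simulated OCR substitutions
    print(edit_distance(clean, noisy))
    print(is_probable_duplicate(clean, noisy))
```

This quadratic-time dynamic program is the simplest member of the family; the speed-up heuristics mentioned in the abstract are not represented here.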