String techniques for detecting duplicates in document databases
Authors: Daniel P. Lopresti
Affiliation: (1) Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211, USA; (2) Department of Electrical Engineering, University of Arkansas, Fayetteville, AR 72701, USA; (3) Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA
Abstract: Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a "duplicate". To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude.
This article is indexed in SpringerLink and other databases.