An n-gram-based approach for detecting approximately duplicate database records |
| |
Authors: | Zengping Tian Hongjun Lu Wenyun Ji Aoying Zhou Zhong Tian |
| |
Affiliation: | (1) Department of Computer Science, Fudan University, Shanghai, 200433, P.R. China; E-mail: {zptian, wyji, ayzhou}@fudan.edu.cn, CN; (2) Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, P.R. China; E-mail: luhj@cs.ust.hk, CN; (3) IBM China Research Laboratory, Beijing, P.R. China; E-mail: tianz@cn.ibm.com, CN |
| |
Abstract: | Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not
as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word
switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate
records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values.
The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record
comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it
does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy.
Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared
with previous methods, the algorithm is more time efficient.
Published online: 22 August 2001 |
| |
Keywords: | Duplicate elimination – N-gram – Edit distance – Data quality |
This article is indexed in SpringerLink and other databases. |
|
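
The three stages named in the abstract (map records to numbers via n-grams, cluster the numbers, compare records within each cluster) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, since the abstract does not give the paper's actual formulas: the key function (sum of character codes over all bigrams), the gap-based clustering with its `max_gap` threshold, and the use of Levenshtein distance with a `max_edits` cutoff for the final comparison. The sketch also sorts the integer keys for simplicity, which is a shortcut of this example and not something the abstract describes.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased string."""
    s = text.lower()
    if len(s) <= n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def record_key(record, n=2):
    """Map a record (tuple of field strings) to one integer.

    Assumption: each n-gram contributes the sum of its character codes,
    so small typos shift the key only slightly. This is NOT the paper's
    mapping, which the abstract does not specify.
    """
    return sum(ord(ch)
               for field in record
               for gram in ngrams(field, n)
               for ch in gram)


def cluster_by_key(keys, max_gap):
    """Group record indices whose keys lie within max_gap of a neighbor."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    clusters = []
    for i in order:
        if clusters and keys[i] - keys[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_duplicates(records, n=2, max_gap=50, max_edits=2):
    """Return sorted index pairs (i, j) judged to be duplicates."""
    keys = [record_key(r, n) for r in records]
    pairs = []
    for cluster in cluster_by_key(keys, max_gap):
        # Detailed comparison happens only inside a cluster.
        for i, j in combinations(sorted(cluster), 2):
            a = " ".join(records[i]).lower()
            b = " ".join(records[j]).lower()
            if edit_distance(a, b) <= max_edits:
                pairs.append((i, j))
    return sorted(pairs)


records = [
    ("john smith", "boston"),
    ("jonh smith", "boston"),   # transposition error
    ("mary jones", "denver"),
]
duplicates = find_duplicates(records)
```

With these hypothetical records, the transposed name yields the same key as the original, so both fall into one cluster and the edit-distance check confirms the pair. The thresholds are illustrative and would need tuning on real data; plain edit distance also does not handle word switching, which the abstract lists among the error types.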