
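A Method for Detecting Approximately Duplicate Records Based on Conditional Probability Distribution
Cite this article: MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su. A Method for Detecting Approximately Duplicate Records Based on Conditional Probability Distribution [J]. Mini-micro Systems, 2004, 25(12): 2164-2168.
Authors: MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su
Affiliations: 1. School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073
2. School of Humanities and Management, National University of Defense Technology, Changsha, Hunan 410073
Funding: Supported by the National Natural Science Foundation of China (60103009)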
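Abstract: Data integration often produces approximately duplicate records, and detecting duplicate information is an active topic in data quality research. This paper proposes an efficient dynamic clustering algorithm, based on conditional probability distributions, for detecting approximately duplicate records. When assessing whether two records are approximately equivalent, earlier algorithms ignored the structural characteristics of sequences; this method addresses that problem by defining the distance between records in terms of conditional probability distributions. A criterion function for evaluating the quality of clustering results is selected according to the nearest-neighbor function criterion, and a dynamic clustering algorithm is used to cluster the sequence data set. Clustering experiments on simulated data with this method all produced fairly good clustering results.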

Keywords: information integration; approximately duplicate records; dynamic clustering; probabilistic suffix tree
Article ID: 1000-1220(2004)12-2164-05

Algorithm for Detecting Approximately Duplicate Database Records Based on Conditional Probability Distribution
MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su. Algorithm for Detecting Approximately Duplicate Database Records Based on Conditional Probability Distribution [J]. Mini-micro Systems, 2004, 25(12): 2164-2168.
Authors: MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su
Affiliation: MIAO Jia-jia (1), WU Gang (2), MAO Hang-dong (1), YANG Qiang (1), DENG Su (1)
Abstract: Detecting database records that are approximate, but not exact, duplicates is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, non-standardized abbreviations, or differences in the detailed schemas of records from multiple databases, among other reasons. This paper investigates the problem of detecting duplicates based on their structural features and presents an efficient and effective algorithm for recognizing clusters of approximately duplicate records. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize each sequence record and to support the distance measure. A variation of the suffix tree, the probabilistic suffix tree, is employed to organize the CPD concisely. Based on the nearest-neighbor rule, a criterion function is selected to evaluate the clustering results. Finally, a dynamic clustering algorithm is employed to cluster the data set. Comprehensive experiments on synthetic database records confirm the effectiveness of the new algorithm.
Keywords: information integration; approximately duplicate records; dynamic clustering; probabilistic suffix tree
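
The abstract does not give the paper's distance formula, so the following is only an illustrative sketch of the general idea: each record is characterized by the conditional distribution of its next symbol given a short preceding segment, and two records are compared through those distributions. Everything specific here is an assumption made for illustration: the function names, the `max_context` parameter, the use of a flat dictionary in place of a probabilistic suffix tree, the longest-suffix backoff, and the choice of averaged total variation distance as the comparison.

```python
from collections import defaultdict


def conditional_distribution(record, max_context=2):
    """Estimate P(next symbol | preceding segment) for segments of length 0..max_context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(record):
        for k in range(max_context + 1):
            if i - k < 0:
                break
            context = record[i - k:i]  # the k symbols preceding position i
            counts[context][symbol] += 1
    # Normalize the counts of each context into a probability distribution.
    return {
        ctx: {s: c / sum(nxt.values()) for s, c in nxt.items()}
        for ctx, nxt in counts.items()
    }


def lookup(model, context):
    """Back off to the longest suffix of `context` that the model has seen."""
    while context and context not in model:
        context = context[1:]
    return model.get(context, {})


def distance(a, b, max_context=2):
    """Average total variation distance between the two records' distributions.

    This measure is an assumption for illustration, not the paper's definition.
    """
    model_a = conditional_distribution(a, max_context)
    model_b = conditional_distribution(b, max_context)
    contexts = set(model_a) | set(model_b)
    if not contexts:
        return 0.0
    total = 0.0
    for ctx in contexts:
        pa, pb = lookup(model_a, ctx), lookup(model_b, ctx)
        symbols = set(pa) | set(pb)
        total += 0.5 * sum(abs(pa.get(s, 0.0) - pb.get(s, 0.0)) for s in symbols)
    return total / len(contexts)


if __name__ == "__main__":
    print(distance("john smith", "jon smith"))
    print(distance("john smith", "XYZZY-QUUX"))
```

Under these assumptions the distance lies between 0 and 1, is 0 for identical records, and reaches 1 when the two records share no symbols. A clustering step such as the dynamic clustering the abstract mentions could then operate on the pairwise distances.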
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.