
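A Method for Detecting Approximately Duplicate Records Based on Conditional Probability Distribution (一种基于条件概率分布的近似重复记录检测方法)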

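Cite this article:	MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su. A Method for Detecting Approximately Duplicate Records Based on Conditional Probability Distribution [J]. Mini-micro Systems (小型微型计算机系统), 2004, 25(12): 2164-2168.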

Authors:	MIAO Jia-jia (缪嘉嘉), WU Gang (吴刚), MAO Hang-dong (毛捍东), YANG Qiang (杨强), DENG Su (邓苏)

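Affiliations:	1. School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073; 2. School of Humanities and Management, National University of Defense Technology, Changsha, Hunan 410073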

Funding:	Supported by the National Natural Science Foundation of China (60103009)

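Abstract:	Data integration often produces approximately duplicate records, and detecting such duplicate information is an active topic in data quality research. This paper proposes an efficient dynamic clustering algorithm, based on conditional probability distributions, for detecting approximately duplicate records. When judging whether two records are approximately equivalent, the method addresses a weakness of earlier algorithms, which ignore the structural characteristics of sequences, by defining the distance between records in terms of conditional probability distributions. A criterion function for evaluating the quality of a clustering result is chosen according to the nearest-neighbor function criterion, and a dynamic clustering algorithm is then used to cluster the sequence data set. Clustering experiments on simulated data using this method all produced fairly good results.
Keywords:	information integration; approximately duplicate records; dynamic clustering; probabilistic suffix tree
Article ID:	1000-1220(2004)12-2164-05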
Algorithm for Detecting Approximately Duplicate Database Records Based on Conditional Probability Distribution

MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su. Algorithm for Detecting Approximately Duplicate Database Records Based on Conditional Probability Distribution [J]. Mini-micro Systems, 2004, 25(12): 2164-2168.

Authors:	MIAO Jia-jia, WU Gang, MAO Hang-dong, YANG Qiang, DENG Su

Affiliation:	MIAO Jia-jia (1), WU Gang (2), MAO Hang-dong (1), YANG Qiang (1), DENG Su (1)

Abstract:	Detecting database records that are approximate, but not exact, duplicates is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, non-standardized abbreviations, or differences in the detailed schemas of records drawn from multiple databases, among other reasons. This paper investigates the problem of detecting duplicates based on their structural features and presents an efficient and effective algorithm for recognizing clusters of approximately duplicate records. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used both to characterize sequence records and to support the distance measure. A variation of the suffix tree, the probabilistic suffix tree, is employed to organize the CPD concisely. Based on the nearest-neighbor function criterion, a criterion function is selected to evaluate the quality of the clustering results. Finally, a dynamic clustering algorithm is employed to cluster the data set. Experiments on synthetic database records confirm the effectiveness of the new algorithm.
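
The abstract does not give the paper's formulas, so the following Python fragment is only a rough sketch of the general idea: estimate the CPD of the next symbol given a short preceding segment, then score each record under the other record's model. Everything in it is an assumption made for illustration: the function names, the maximum context length `max_order`, the additive smoothing (`alpha`, `alphabet_size`), and the averaged log-loss used as a dissimilarity. A plain dictionary keyed by context stands in for the probabilistic suffix tree, and the clustering step is not shown.

```python
import math
from collections import defaultdict


def build_cpd(record, max_order=2):
    """Count next-symbol occurrences after every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, sym in enumerate(record):
        for k in range(min(i, max_order) + 1):
            counts[record[i - k:i]][sym] += 1
    return {ctx: dict(nxt) for ctx, nxt in counts.items()}


def cond_prob(cpd, context, sym, alpha=0.1, alphabet_size=256):
    """Smoothed P(sym | longest suffix of `context` present in the model)."""
    while context and context not in cpd:
        context = context[1:]  # back off to a shorter suffix
    nxt = cpd.get(context, {})
    total = sum(nxt.values())
    # Additive smoothing (an assumption) keeps unseen symbols at nonzero probability.
    return (nxt.get(sym, 0) + alpha) / (total + alpha * alphabet_size)


def avg_log_loss(cpd, record, max_order=2):
    """Mean negative log-probability of `record` under a CPD model."""
    if not record:
        return 0.0
    loss = 0.0
    for i, sym in enumerate(record):
        context = record[max(0, i - max_order):i]
        loss -= math.log(cond_prob(cpd, context, sym))
    return loss / len(record)


def cpd_distance(a, b, max_order=2):
    """Symmetric dissimilarity: each record is scored under the other's model."""
    return 0.5 * (avg_log_loss(build_cpd(a, max_order), b, max_order)
                  + avg_log_loss(build_cpd(b, max_order), a, max_order))
```

Under these assumptions, a pair such as `"John Smith, 12 Main Street"` and `"Jon Smith, 12 Main St."` scores lower than a pair with no shared symbols. The value is a dissimilarity score rather than a true metric: it is not zero for identical records, so it would typically be compared against a threshold or fed to a clustering step.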

Keywords:	information integration; approximately duplicated records; dynamic clustering; probabilistic suffix tree
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
