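Partition-Based Detection of Similar Duplicate Records in Massive Data
Cite this article: LI Li, ZHANG Xiao-Wen. Partition-based detection of similar duplicate records in massive data[J]. Computer Systems & Applications, 2019, 28(3): 172-178.
Authors: LI Li, ZHANG Xiao-Wen
Affiliation: School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China (both authors)
Abstract: To address the data quality problems of redundancy and low query efficiency in the massive data held in social engineering databases, this paper proposes an effective partition-based sorted neighborhood algorithm. Social engineering data collected through different channels and stored in different ways are first integrated into a massive data set that can be stored as two-dimensional tables. Following the partitioning idea, the large data set is split into clusters. An improved sorted neighborhood algorithm is then applied to the small data set in each cluster to obtain the final similar duplicate record detection results. Experiments and comparative analysis show that combining partitioning with the sorted neighborhood algorithm improves the time efficiency of similar duplicate record detection on massive data and also raises detection accuracy.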

Keywords: data quality  data cleaning  similar duplicate records  partition  SNM algorithm
Received: 2018/10/4
Revised: 2018/10/23

Similar Duplicate Record Detection of Massive Data Based on Partition
LI Li and ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition[J]. Computer Systems & Applications, 2019, 28(3): 172-178.
Authors: LI Li and ZHANG Xiao-Wen
Affiliation: School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China (both authors)
Abstract: To address data redundancy and low query efficiency in the massive data held in social engineering databases, this study proposes an effective partition-based sorted neighborhood algorithm. Social engineering data collected through different channels and stored in different ways are integrated into a massive data set that can be stored as two-dimensional tables. The data set is partitioned into clusters, and an improved sorted neighborhood method (SNM) is applied to each cluster to obtain the final similar duplicate record detection results. Experimental and comparative analysis shows that combining partitioning with SNM improves both the time efficiency and the detection accuracy of similar duplicate record detection on massive data.
Keywords: data quality  data cleaning  similar duplicate records  partition  SNM algorithm
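
The two-stage idea in the abstract (partition the data set into clusters, then run a sorted neighborhood pass inside each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not describe their partitioning rule, their improvements to SNM, or their similarity measure. The sketch assumes a plain fixed-window SNM, a partition keyed on one normalized field, and `difflib.SequenceMatcher` as a stand-in similarity. All function names, the sample records, the window size, and the threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean per-field string similarity in [0, 1] (assumed stand-in measure)."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def partition(records, key_func):
    """Group record indices into clusters by a partition key (assumed rule)."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[key_func(rec)].append(idx)
    return clusters


def snm(records, indices, sort_key, window, threshold):
    """Basic sorted neighborhood pass over one cluster.

    Records are sorted by sort_key; each record is compared only with the
    next (window - 1) records in that order.
    """
    ordered = sorted(indices, key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, key_func, sort_key, window=3, threshold=0.85):
    """Partition first, then run SNM independently inside each cluster."""
    pairs = set()
    for indices in partition(records, key_func).values():
        pairs |= snm(records, indices, sort_key, window, threshold)
    return pairs


# Hypothetical sample data: (name, email, city).
records = [
    ("Zhang Wei", "zhangwei@example.com", "Zhenjiang"),
    ("Zhang Wei", "zhang.wei@example.com", "Zhenjiang"),
    ("Li Na", "lina@example.com", "Nanjing"),
    ("Wang Fang", "wangfang@example.com", "Zhenjiang"),
    ("Li Na", "lina@example.com", "Nanjing "),
]

result = detect_duplicates(
    records,
    key_func=lambda r: r[2].strip().lower(),  # partition on normalized city
    sort_key=lambda r: r[0].lower(),          # sort each cluster by name
)
print(sorted(result))
```

Because comparisons never cross cluster boundaries, the number of pairwise comparisons depends on cluster sizes and the window rather than on the size of the whole data set. The trade-off is that duplicates assigned to different clusters are never compared, so the choice of partition key matters.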
This article is indexed in Wanfang Data and other databases.