Similar Duplicate Record Detection of Massive Data Based on Partition

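Citation:	LI Li, ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition[J]. Computer Systems & Applications, 2019, 28(3): 172-178.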

Authors:	LI Li, ZHANG Xiao-Wen

Affiliation:	School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

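Abstract:	To address the quality problems of data redundancy and low query efficiency in the massive data currently stored in social engineering databases (社工库), this paper proposes an effective partition-based sorted neighborhood algorithm. Social engineering data collected through different channels and stored in different ways are integrated into a massive data set that can be stored as two-dimensional tables. Following the partitioning idea, the large data set is split into clusters; an improved sorted neighborhood algorithm is then applied to the small data set in each cluster to obtain the final similar duplicate record detection results. Experiments and comparative analysis show that combining partitioning with the sorted neighborhood algorithm improves not only the time efficiency of similar duplicate record detection on massive data but also the detection accuracy.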
Keywords:	data quality; data cleaning; similar duplicate records; partition; SNM algorithm
Received:	2018-10-04
Revised:	2018-10-23
Similar Duplicate Record Detection of Massive Data Based on Partition

LI Li and ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition[J]. Computer Systems & Applications, 2019, 28(3): 172-178.

Authors:	LI Li and ZHANG Xiao-Wen

Affiliation:	School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Abstract:	To address the data redundancy and low query efficiency of the massive data stored in social engineering databases, this study proposes an effective partition-based sorted neighborhood algorithm. Social engineering data collected through different channels and stored in different ways are integrated into a massive data set that can be stored as two-dimensional tables. The partitioning idea is used to split the massive data set into clusters; an improved sorted neighborhood algorithm is then applied to each cluster to obtain the final similar duplicate record detection results. Experimental and comparative analysis results show that combining partitioning with the sorted neighborhood algorithm improves not only the time efficiency of similar duplicate record detection on massive data but also the detection accuracy.

Keywords:	data quality; data cleaning; similar duplicate records; partition; SNM algorithm
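
The two-stage pipeline outlined in the abstract (partition the data set into clusters, then run a sorted neighborhood pass inside each cluster) can be illustrated with a minimal Python sketch. This is the classic sorted neighborhood method, not the authors' improved variant, whose details the abstract does not give. The partitioning rule (first character of a sort key), the similarity measure (`difflib.SequenceMatcher`), the window size, the threshold, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def record_key(record):
    # Hypothetical sort key: all field values, trimmed, lowercased, and joined.
    return " ".join(str(v).strip().lower() for v in record)


def partition(records, prefix_len=1):
    # Assumed partitioning rule: group record indices by a prefix of the key.
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[record_key(rec)[:prefix_len]].append(idx)
    return clusters


def similarity(a, b):
    # Assumed similarity measure: character-level ratio in [0, 1].
    return SequenceMatcher(None, record_key(a), record_key(b)).ratio()


def snm(records, indices, window=3, threshold=0.85):
    # Classic sorted neighborhood: sort by key, then compare each record
    # only with the next (window - 1) records in sorted order.
    ordered = sorted(indices, key=lambda i: record_key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, window=3, threshold=0.85):
    # Run the windowed comparison independently inside every cluster.
    pairs = set()
    for indices in partition(records).values():
        pairs |= snm(records, indices, window, threshold)
    return pairs


# Made-up sample records; indices 0 and 2 differ only by an extra space.
records = [
    ("Zhang Wei", "zhangwei@example.com"),
    ("Li Na", "lina@example.com"),
    ("Zhang  Wei", "zhangwei@example.com"),
    ("Wang Fang", "wangfang@example.com"),
]
print(sorted(detect_duplicates(records)))
```

Because comparisons never cross cluster boundaries, the number of pairwise similarity checks grows with the cluster sizes rather than with the full data set, which is the intuition behind the time savings the abstract reports. A prefix-based partition can also separate true duplicates whose keys differ in the first character, so the choice of partitioning rule matters in practice.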
This article is indexed in Wanfang Data and other databases.