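An improved SNM algorithm based on length filtering and dynamic fault tolerance
Cite this article:Liu Yasi. An improved SNM algorithm based on length filtering and dynamic fault tolerance[J]. Application Research of Computers, 2017, 34(1).
Author:Liu Yasi
Affiliation:Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Funding:Xinjiang Uygur Autonomous Region Young Science and Technology Innovation Talent Training Project (2014721033); Urumqi High-tech Zone Development Support Fund (2013038)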
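Abstract:Cleaning similar duplicate records in a data warehouse has a large effect on data quality, yet the traditional basic Sorted-Neighborhood Method (SNM) is neither time-efficient nor accurate. To address these shortcomings, an improved SNM algorithm based on length filtering and dynamic fault tolerance is proposed. Using the length ratio and the missing attributes of two records, the algorithm first excludes record pairs that cannot be similar duplicates, which reduces the number of comparisons and improves detection efficiency. It then applies a dynamic fault-tolerance method that calibrates the field-similarity results, which resolves misjudgments caused by missing attributes and improves accuracy. Experiments on real datasets show that, in the same computing environment, the improved algorithm has a clear advantage in both accuracy and time efficiency.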

Keywords:data mining  data cleaning  similar duplicate records  SNM algorithm  dynamic fault tolerance  field matching
Received:2015/11/26
Revised:2016/12/2

An improved SNM algorithm based on length filtering and dynamic fault-tolerance
Liu Yasi.An improved SNM algorithm based on length filtering and dynamic fault-tolerance[J].Application Research of Computers,2017,34(1).
Authors:Liu Yasi
Affiliation:Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Abstract:In data warehouse systems, cleaning similar and duplicated records strongly affects data quality. The traditional SNM (Sorted-Neighborhood Method) performs poorly in both time efficiency and accuracy. To improve its performance, an enhanced SNM algorithm based on length filtering and dynamic fault-tolerance (LF-SNM) is proposed. First, it improves detection efficiency by excluding record pairs that cannot be duplicates, judged by the length ratio and the missing attributes of the two records. Then, it calibrates the field-similarity results using a dynamic fault-tolerance method, which preserves accuracy even when some attributes are missing. Experimental results on real datasets indicate that LF-SNM is clearly more accurate and more time-efficient than the traditional SNM under the same experimental conditions.
Keywords:data cleaning  similar and duplicated records  SNM algorithm  dynamic fault-tolerance  string match  
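
The abstract names two steps but gives no formulas, so the following is only a minimal Python sketch of how they might fit into a sliding-window SNM pass, not the paper's implementation. Everything specific here is an assumption made for illustration: the function names, the `window`, `min_length_ratio` and `threshold` parameters, the field weights, the use of `difflib.SequenceMatcher` as the field-similarity measure, and the choice to model "dynamic fault tolerance" as dropping the weights of fields missing from either record and renormalising the rest.

```python
from difflib import SequenceMatcher


def length_ratio(a, b):
    """Shorter total text length divided by the longer one (missing fields count as 0)."""
    la = sum(len(v) for v in a if v)
    lb = sum(len(v) for v in b if v)
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)


def record_similarity(a, b, weights):
    """Weighted field similarity over the fields present in BOTH records.

    Assumed form of the fault-tolerance step: weights of fields that are
    missing in either record are dropped and the remainder renormalised,
    so a missing attribute does not count as a mismatch.
    """
    total = 0.0
    weight_sum = 0.0
    for x, y, w in zip(a, b, weights):
        if not x or not y:
            continue  # attribute missing: skip it instead of scoring it 0
        total += w * SequenceMatcher(None, x, y).ratio()
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def lf_snm(records, key, weights, window=3, min_length_ratio=0.5, threshold=0.8):
    """Sort by key, slide a window, length-filter each pair, then compare fields."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Each record is compared with the next (window - 1) records in sort order.
        for j in order[pos + 1 : pos + window]:
            a, b = records[i], records[j]
            if length_ratio(a, b) < min_length_ratio:
                continue  # length filter: skip the costly field comparison
            if record_similarity(a, b, weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical records: (name, address, phone); empty string means missing.
records = [
    ("John Smith", "12 Oak Street", "555-0101"),
    ("Jon Smith", "12 Oak Street", ""),
    ("Mary Jones", "7 Elm Road", "555-0202"),
    ("Al", "", ""),
]
weights = (0.4, 0.4, 0.2)
print(lf_snm(records, key=lambda r: r[0], weights=weights))  # [(0, 1)]
```

In this toy data, the record "Al" is rejected by the length filter before any field comparison, and records 0 and 1 are matched even though one has no phone number. Scoring the missing phone as a mismatch would pull their weighted similarity below the 0.8 threshold, which is the kind of misjudgment the abstract attributes to missing attributes.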