SNM algorithm optimization based on field filtering and scaling window
Cite this article: ZHOU Shi-jie, LOU Yuan-sheng. SNM algorithm optimization based on field filtering and scaling window[J]. Computer Engineering & Science (计算机工程与科学), 2022, 44(4): 699-706.
Authors: ZHOU Shi-jie (周世杰), LOU Yuan-sheng (娄渊胜)
Affiliation: College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China
Abstract: Problem data in a data warehouse significantly degrades data quality. Finding and removing such data starts with handling similar duplicate records, and the algorithm most widely used for duplicate elimination today is the basic sorted-neighborhood method (SNM). After analyzing the shortcomings of SNM, this paper proposes an improved algorithm, ISNM. Attribute weights are computed with an attribute discrimination method, which avoids the problems caused by subjective, manually assigned weights. A field filtering algorithm is used to compute the similarity of two records, which reduces the number of attribute comparisons between records in the window and speeds up detection. A variable-size window replaces the fixed-size window, which prevents missed matches and reduces useless record comparisons. Experimental results show that ISNM has clear advantages in recall, precision, and running time.
Keywords: data quality; data cleaning; similar duplicate records; SNM algorithm
Received: 2020-12-18
Revised: 2021-01-26
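
For orientation, the baseline that the abstract starts from can be sketched as follows. This is a minimal illustration of the basic sorted-neighborhood method only (sort by a key, slide a fixed-size window, compare records inside the window using a weighted field similarity). It is not the authors' ISNM: the abstract does not give the attribute discrimination formula, the field filtering rule, or the window resizing rule, so none of them is reproduced here. The function names, the sort key, the weights, the window size, the threshold, and the sample records are all assumptions made for this example, and `difflib.SequenceMatcher` stands in for whatever field similarity measure the paper uses.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities; the weights here are hand-picked."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def snm(records, key, weights, window=3, threshold=0.85):
    """Basic SNM: sort by key, then compare each record with the
    previous (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: records 0 and 2 differ only by a typo in "city".
records = [
    {"name": "Zhang Wei", "city": "Nanjing"},
    {"name": "Li Na", "city": "Beijing"},
    {"name": "Zhang Wei", "city": "Nanjng"},
    {"name": "Wang Fang", "city": "Shanghai"},
]
weights = {"name": 0.6, "city": 0.4}


def sort_key(r):
    return (r["name"].lower(), r["city"].lower())


print(snm(records, sort_key, weights, window=3, threshold=0.85))  # [(0, 2)]
```

The sketch makes the two costs named in the abstract visible: every pair inside the window triggers a comparison of all weighted fields, and a fixed `window` either misses duplicates that sort far apart (too small) or compares many unrelated records (too large).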
This article is indexed in databases including Wanfang Data (万方数据).