
Data cleaning technology based on N-Gram algorithm
Citation: MA Ping-quan, SONG Kai, JI Jian-wei. Data cleaning technology based on N-Gram algorithm[J]. Journal of Shenyang University of Technology, 2017, 39(1): 67-72. DOI: 10.7688/j.issn.1000-1646.2017.01.13
Authors: MA Ping-quan, SONG Kai, JI Jian-wei
Affiliation: 1. College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; 2. School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
Funding: Scientific Research Project of the Education Department of Liaoning Province (LG201610)
Abstract: To address the large number of approximately duplicate records in databases, the attribute structure of such records and the causes of their occurrence were analyzed. The N-Gram algorithm was applied to each data record to compute a key value, the N-Gram value, that represents the record's attributes. The records were then sorted by this key to build an ordered database, and the similarity between the records in it was calculated. The approximately duplicate records identified in this way were cleaned using a sort-and-merge approach. Experimental results show that the N-Gram algorithm effectively improves both the recall and the precision of approximately duplicate record detection.
Keywords: similarity; approximately duplicate record; attribute; sorting; merging; data cleaning; recall; precision
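
The abstract outlines a pipeline (compute an N-Gram key per record, sort by key, compare nearby records, merge duplicates) but does not give the formulas. The following is a minimal Python sketch of that general idea, not the authors' implementation. Everything specific here is an assumption made for illustration: character bigrams, a key defined as the mean corpus frequency of a record's n-grams, the Dice coefficient as the similarity measure, the sliding window, the 0.8 threshold, and all function names.

```python
from collections import Counter


def ngrams(text, n=2):
    """Character n-grams of text, lowercased with whitespace removed."""
    s = "".join(text.lower().split())
    if len(s) < n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_key(record, freq, n=2):
    """Hypothetical N-Gram key: mean corpus frequency of the record's n-grams."""
    grams = ngrams(" ".join(record), n)
    if not grams:
        return 0.0
    return sum(freq[g] for g in grams) / len(grams)


def dice(a, b, n=2):
    """Dice coefficient over n-gram sets (an assumed similarity measure)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def find_duplicates(records, window=3, threshold=0.8, n=2):
    """Sort records by key, then compare each record with the next
    window - 1 records in sorted order. Returns index pairs (i, j), i < j."""
    freq = Counter(g for r in records for g in ngrams(" ".join(r), n))
    order = sorted(range(len(records)),
                   key=lambda i: ngram_key(records[i], freq, n))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if dice(" ".join(records[i]), " ".join(records[j]), n) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def clean(records, pairs):
    """Simplified merge step: keep the lower-indexed record of each pair."""
    drop = {j for _, j in pairs}
    return [r for i, r in enumerate(records) if i not in drop]


def precision_recall(found, truth):
    """Precision and recall of detected duplicate pairs against known pairs."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example records: (name, city)
    records = [
        ("Zhang Wei", "Shenyang"),
        ("Zhang Wei", "Shenyan"),   # typo variant of the first record
        ("Li Na", "Beijing"),
    ]
    pairs = find_duplicates(records)
    print(pairs)
    print(clean(records, pairs))
    print(precision_recall(pairs, {(0, 1)}))
```

Sorting by key keeps likely duplicates close together, so only records within the window need to be compared rather than every pair. Whether the paper's key and similarity definitions behave like the ones assumed here cannot be determined from the abstract alone.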
This article is indexed in CNKI and other databases.