
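Research on Related Algorithms in Data Analysis and Cleaning (数据分析和清理中相关算法研究)
Cite this article: 冯玉才, 桂浩, 李华旸, 李又奎. 数据分析和清理中相关算法研究[J]. 小型微型计算机系统, 2005, 26(6): 1018-1022
Authors: 冯玉才, 桂浩, 李华旸, 李又奎
Affiliation: School of Computer Science, Huazhong University of Science and Technology, Wuhan, Hubei 430074
Funding: Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20030487032).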
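Abstract: A principal task of data cleaning is identifying duplicate records. This paper combines the filter algorithm with the heuristic pruning (cut-off) algorithm to obtain an improved heuristic pruning algorithm. It then proposes a length constraint, derived from the characteristics of duplicate records, that effectively speeds up execution when the compared fields differ in length. Databases often contain abbreviations in many different forms, and algorithms such as heuristic pruning cannot identify duplicate records when abbreviations are present. The paper therefore proposes an abbreviation discovery algorithm based on dynamic programming, which can be used both to discover abbreviations and to identify duplicate records in the presence of abbreviations. In addition, deciding which of a set of duplicate records is correct currently requires manual handling: in the traditional approach users must browse and analyze the records one by one, which is long, tedious work that easily introduces new data quality risks. The authors propose a clustering-cleaning scheme and a cluster closure algorithm that display duplicate records grouped into clusters, so that the user can process a whole duplicate cluster at once, which improves speed and is more convenient for the user.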
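The abstract does not state the exact form of the length constraint, so the following is only a minimal sketch under an assumption: that string similarity is measured by edit distance with a threshold `k`. Under that assumption, every insertion or deletion changes the length by one, so two field values whose lengths differ by more than `k` cannot be within distance `k` and the costly comparison can be skipped. The names `edit_distance`, `within_distance`, and `k` are hypothetical and not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_distance(a: str, b: str, k: int) -> bool:
    """Return True if a and b are within edit distance k."""
    # Length constraint (assumed form): a length gap larger than k
    # already forces the edit distance above k, so skip the DP.
    if abs(len(a) - len(b)) > k:
        return False
    return edit_distance(a, b) <= k
```

In this form the check costs constant time per pair, so it helps most when many candidate pairs have field values of clearly different lengths.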

Keywords: data cleaning; approximate string matching; abbreviation algorithm; clustering cleaning
Article ID: 1000-1220(2005)06-1018-05

Research on Related Algorithms in Data Analysing and Cleaning
FENG Yu-cai,GUI Hao,LI Hua-yang,LI You-kui. Research on Related Algorithms in Data Analysing and Cleaning[J]. Mini-micro Systems, 2005, 26(6): 1018-1022
Authors:FENG Yu-cai  GUI Hao  LI Hua-yang  LI You-kui
Abstract: This paper studies algorithms for identifying duplicate records in data cleaning, introduces approximate string matching into data cleaning, and puts forward several new algorithms. First, an improved heuristic cut-off algorithm, which combines the filter algorithm with the heuristic cut-off algorithm, speeds up execution. Second, a length constraint condition is introduced, which effectively reduces the number of comparisons when the values of the key field differ in length. Third, an abbreviation-finding algorithm based on dynamic programming is presented to handle the identification of duplicate records that arise from abbreviated forms. Finally, because manual handling is still needed to pick out the correct data among duplicate records, the authors put forward a clustering-cleaning method and a clustering-cleaning closure algorithm, which compute the closure of duplicate records and display them in clusters. Users can then handle a whole duplicate cluster at one time, which greatly speeds up the manual handling of duplicate records.
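The abstract does not describe how the closure algorithm works internally. One common way to obtain such clusters is to take the transitive closure of the pairwise matches: if record A matches B and B matches C, all three are shown together. The sketch below uses a union-find structure as a stand-in for the paper's algorithm; `cluster_duplicates` and its parameters are hypothetical names, and the matched pairs are assumed to come from an earlier comparison step.

```python
from collections import defaultdict


def cluster_duplicates(n_records: int, matched_pairs):
    """Group record indices 0..n_records-1 into clusters of duplicates."""
    parent = list(range(n_records))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the sets of every pair that the matcher flagged as duplicates.
    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in range(n_records):
        groups[find(i)].append(i)
    # Only clusters with at least two records need user review.
    return sorted(g for g in groups.values() if len(g) > 1)
```

For example, with six records and matches `(0, 1)`, `(1, 2)`, and `(4, 5)`, the call `cluster_duplicates(6, [(0, 1), (1, 2), (4, 5)])` groups records 0, 1, 2 into one cluster and 4, 5 into another, while record 3 has no duplicate and is left out.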
Keywords:data cleaning  string approximate matching  abbreviation algorithm  clustering cleaning
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.