
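Editing Rule Mining Algorithm Based on Input Samples and Master Data
Cite this article: YANG Hui, YU Shou-Jian, CHEN Shao-Zong. Editing Rule Mining Algorithm Based on Input Samples and Master Data[J]. Computer Systems & Applications, 2017, 26(4): 162-168.
Authors: YANG Hui  YU Shou-Jian  CHEN Shao-Zong
Affiliation: School of Computer Science and Technology, Donghua University, Shanghai 201620, China (all three authors)
Abstract: Data repair techniques based on editing rules and master data can repair inconsistent data automatically and with certainty, but at present editing rules are obtained mainly through manual definition by domain experts. To make data cleaning fully automatic, techniques for mining data rules have become a research focus in recent years; the main mining algorithms proposed for conditional functional dependencies are CFDMiner, CTANE, and FastCFD. Building on this work, the paper extends the definition of conditional functional dependencies (CFDs) and, under the definition of editing rules, proposes an editing rule mining algorithm based on input samples and master data. The main idea is to mine CFDs from the input samples and then, using the similarity between the attribute domains of the input samples and those of the master data, find the master-data attribute that corresponds to each input-sample attribute, thereby forming editing rules with pattern groups. The algorithm mines editing rules effectively, and the mined rules, applied according to editing rule semantics, repair data effectively.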

Keywords: editing rules  conditional functional dependency  data cleaning  equivalence class partitioning
Received: 2016/7/17
Revised: 2016/9/13

Method for Discovering Editing Rules From Sample Inputs and Master Data
YANG Hui, YU Shou-Jian and CHEN Shao-Zong. Method for Discovering Editing Rules From Sample Inputs and Master Data[J]. Computer Systems & Applications, 2017, 26(4): 162-168.
Authors: YANG Hui, YU Shou-Jian and CHEN Shao-Zong
Affiliation: School of Computer Science and Technology, Donghua University, Shanghai 201620, China (all three authors)
Abstract: Data repairing based on editing rules and master data can fix inconsistent data automatically and with certainty, but at present editing rules are mainly defined by hand by domain experts. To fully automate data cleaning, techniques for discovering data rules have become a hot research topic in recent years. The main algorithms for mining conditional functional dependencies (CFDs) are CFDMiner, CTANE, and FastCFD. Building on these techniques, we extend the definition of CFDs and, under the definition of editing rules, propose an algorithm for mining editing rules from sample inputs and master data. The main idea is as follows: first, mine CFDs from the sample inputs; then, using the domain similarity between attributes of the input samples and attributes of the master data, find the master-data attribute that corresponds to each input attribute, forming editing rules with pattern groups. The algorithm discovers editing rules effectively, and the mined rules repair data effectively in accordance with editing rule semantics.
Keywords: editing rules  conditional functional dependency  data cleaning  equivalence class partitioning
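
The second step described in the abstract, pairing input attributes with master-data attributes by domain similarity and attaching a pattern, can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: the abstract does not name a similarity measure, so Jaccard similarity over observed values is swapped in; the function names, the 0.5 threshold, the dictionary layout of the resulting rule, and the sample data are all hypothetical; and the CFD is taken as already mined (for example by CTANE) rather than discovered here.

```python
def jaccard(a, b):
    """Jaccard similarity of two value sets (assumed measure, not from the paper)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_attributes(sample_rows, master_rows, threshold=0.5):
    """Map each input attribute to the master attribute with the most similar domain.

    Attributes whose best similarity is below the (hypothetical) threshold stay unmapped.
    """
    mapping = {}
    for s in sample_rows[0].keys():
        s_dom = {row[s] for row in sample_rows}
        best_attr, best_score = None, 0.0
        for m in master_rows[0].keys():
            m_dom = {row[m] for row in master_rows}
            score = jaccard(s_dom, m_dom)
            if score > best_score:
                best_attr, best_score = m, score
        if best_attr is not None and best_score >= threshold:
            mapping[s] = best_attr
    return mapping


def to_editing_rule(cfd, mapping):
    """Turn a CFD (lhs, rhs, pattern) over input attributes into a rule-like dict.

    Returns None when some attribute has no master counterpart.
    """
    lhs, rhs, pattern = cfd
    if any(a not in mapping for a in list(lhs) + [rhs]):
        return None
    return {
        "match": [(a, mapping[a]) for a in lhs],   # (input attr, master attr) pairs
        "update": (rhs, mapping[rhs]),             # attribute to fix and its master source
        "pattern": dict(pattern),                  # constants the input tuple must satisfy
    }


# Hypothetical input sample and master data.
sample = [
    {"zip": "201620", "city": "Shanghai", "ac": "021"},
    {"zip": "100084", "city": "Beijing", "ac": "010"},
    {"zip": "201620", "city": "Shanghai", "ac": "021"},
]
master = [
    {"Zip": "201620", "City": "Shanghai", "AreaCode": "021"},
    {"Zip": "100084", "City": "Beijing", "AreaCode": "010"},
    {"Zip": "510000", "City": "Guangzhou", "AreaCode": "020"},
]

mapping = match_attributes(sample, master)
# A CFD assumed to have been mined already: [ac, zip] -> city, with ac fixed to "021".
cfd = (("ac", "zip"), "city", {"ac": "021"})
rule = to_editing_rule(cfd, mapping)
```

In this toy data every input attribute shares two of three distinct values with its master counterpart, so each pair clears the threshold; a stricter threshold would leave attributes unmapped and no rule would be formed.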