
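Data cleaning method based on dynamic configurable rules
Cite this article: ZHU Huijuan, JIANG Tonghai, ZHOU Xi, CHENG Li, ZHAO Fan, MA Bo. Data cleaning method based on dynamic configurable rules[J]. Journal of Computer Applications, 2017, 37(4): 1014-1020. DOI: 10.11772/j.issn.1001-9081.2017.04.1014
Authors: ZHU Huijuan  JIANG Tonghai  ZHOU Xi  CHENG Li  ZHAO Fan  MA Bo
Affiliations: 1. Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011; 2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011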
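Funding: High-Tech Research and Development Program of Xinjiang Uygur Autonomous Region (201512103); West Light Talent Training Program of the Chinese Academy of Sciences (XBBS201313); Young Science and Technology Innovation Talent Training Program of Xinjiang Uygur Autonomous Region (2014721033).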
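Abstract: Traditional data cleaning methods hard-code business logic, which leaves systems with poor reusability, extensibility, and flexibility. To address this, a data cleaning method based on dynamic configurable rules, DRDCM, is proposed. The method supports complex logical operations among multiple rule types and multiple repair behaviors for dirty data; it integrates data detection, data repair, and data transformation, and is cross-domain, reusable, configurable, and extensible. First, the concepts, implementation steps, and algorithms of data detection and data repair in DRDCM are described. Second, the rule types supported by DRDCM and their configuration are explained. Finally, DRDCM is implemented and evaluated on data sets from real projects: for dirty data repair, the discard behavior achieves high accuracy, reaching 100% for attributes that must follow statutory coding rules (for example, ID card numbers). The experimental results show that the implemented system can integrate dynamic configurable rules seamlessly across multiple data sources and application domains, and that its performance does not drop sharply as the number of rules grows, which further confirms that DRDCM is practical in real environments.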

Keywords: big data  data quality  data cleaning  dynamic configurable rules  data preprocessing
Received: 2016-09-20
Revised: 2016-12-22

Data cleaning method based on dynamic configurable rules
ZHU Huijuan,JIANG Tonghai,ZHOU Xi,CHENG Li,ZHAO Fan,MA Bo. Data cleaning method based on dynamic configurable rules[J]. Journal of Computer Applications, 2017, 37(4): 1014-1020. DOI: 10.11772/j.issn.1001-9081.2017.04.1014
Authors:ZHU Huijuan  JIANG Tonghai  ZHOU Xi  CHENG Li  ZHAO Fan  MA Bo
Affiliation:1. Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi Xinjiang 830011, China;2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi Xinjiang 830011, China
Abstract: Traditional data cleaning approaches usually hard-code the cleaning rules specified by business requirements, which leads to well-known problems with reusability, extensibility and flexibility. To address these problems, a new Dynamic Rule-based Data Cleaning Method (DRDCM) is proposed. It supports complex logical operations between rules of various types and three kinds of repair behavior for dirty data. It integrates data detection, error correction and data transformation in one system and is domain-independent, reusable and configurable. The formal concepts and terms of data detection and correction are defined, and the necessary procedures and algorithms are introduced. In particular, the rule types supported by DRDCM and their configuration are presented in detail. Finally, DRDCM is implemented. Experimental results on real-life data sets show that the implemented system achieves high accuracy for the discard repair behavior; for attributes that must comply with statutory coding rules (such as ID card numbers), the accuracy reaches 100%. The results also indicate that this reference implementation of DRDCM can support multiple data sources in cross-domain scenarios, and that its performance does not decrease sharply as the number of rules increases. These results further validate that the proposed DRDCM is practical in real-world scenarios.
Keywords:big data   data quality   data cleaning   dynamic configurable rules   data preprocessing
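
The abstract gives neither DRDCM's rule syntax nor the names of its repair behaviors other than discarding, so the following is only a minimal sketch of the general idea: rules are kept as data rather than code, can be combined with logical operators, and each carries a repair action. The rule type names (`and`, `or`, `not_empty`, `regex`, `id_checksum`), the `set_default` action, and the functions `build_predicate` and `clean` are all hypothetical and are not taken from the paper. The ID check uses the public checksum of the 18-character resident ID standard (GB 11643-1999).

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record], bool]

# Weights and check characters defined by GB 11643-1999.
_ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_ID_CHECK_CHARS = "10X98765432"


def valid_id_number(value: str) -> bool:
    """Return True if value is 17 digits plus a correct check character."""
    if not re.fullmatch(r"\d{17}[\dX]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], _ID_WEIGHTS))
    return _ID_CHECK_CHARS[total % 11] == value[17]


def build_predicate(spec: dict) -> Predicate:
    """Compile a declarative rule spec into a predicate (True = clean)."""
    kind = spec["type"]
    if kind in ("and", "or"):
        parts = [build_predicate(s) for s in spec["rules"]]
        combine = all if kind == "and" else any
        return lambda rec: combine(p(rec) for p in parts)
    field = spec["field"]
    if kind == "not_empty":
        return lambda rec: bool(rec.get(field, "").strip())
    if kind == "regex":
        pattern = re.compile(spec["pattern"])
        return lambda rec: pattern.fullmatch(rec.get(field, "")) is not None
    if kind == "id_checksum":
        return lambda rec: valid_id_number(rec.get(field, ""))
    raise ValueError(f"unknown rule type: {kind}")


def clean(records: List[Record], rules: List[dict]) -> Tuple[List[Record], List[Record]]:
    """Detect violations and apply each rule's repair action.

    Hypothetical actions: 'discard' drops the record, 'set_default'
    overwrites the offending field. Returns (kept, discarded).
    """
    compiled = [(build_predicate(r["check"]), r) for r in rules]
    kept, discarded = [], []
    for original in records:
        rec = dict(original)  # never mutate the caller's data
        drop = False
        for is_clean, rule in compiled:
            if is_clean(rec):
                continue
            if rule["repair"] == "discard":
                drop = True
                break
            if rule["repair"] == "set_default":
                rec[rule["field"]] = rule["default"]
        (discarded if drop else kept).append(rec)
    return kept, discarded


# Rules live in configuration, so they can change without touching the code.
RULES = [
    {"check": {"type": "and", "rules": [
        {"type": "not_empty", "field": "id"},
        {"type": "id_checksum", "field": "id"}]},
     "repair": "discard"},
    {"check": {"type": "regex", "field": "phone", "pattern": r"\d{11}"},
     "repair": "set_default", "field": "phone", "default": ""},
]
```

Because a checksum rule of this kind is fully determined by the coding standard, a record that fails it is certainly dirty, which is consistent with the abstract's report that discarding is most reliable for attributes bound by statutory coding rules.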