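Comprehensive Detection Method for Multiple Error Types Based on Multiple Views
Cite this article:PENG Jin-Feng,SHEN De-Rong,KOU Yue,NIE Tie-Zheng.Comprehensive Detection Method for Multiple Error Types Based on Multiple Views[J].Journal of Software (软件学报),2023,34(3):1049-1064.
Authors:PENG Jin-Feng (彭锦峰)  SHEN De-Rong (申德荣)  KOU Yue (寇月)  NIE Tie-Zheng (聂铁铮)
Affiliation:School of Computer Science and Engineering, Northeastern University, Shenyang 110819, Liaoning, China
Funding:National Natural Science Foundation of China (62172082, 62072084, 62072086); Fundamental Research Funds for the Central Universities (N2116008)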
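Abstract:As the information society develops, data keep growing in scale and in variety. Data have become an important strategic resource for countries and enterprises and an essential basis for scientific management. However, as social life generates ever more data, large amounts of dirty data arrive with them, and data quality problems follow. Detecting the erroneous data in a dataset accurately and comprehensively has long been a pain point in data science. Many traditional methods, such as detection based on constraints and statistics, are widely used across industries, but they usually require rich prior knowledge and carry high labor and time costs. Because of these limits, they often fail to detect errors accurately and comprehensively. In recent years, many new error detection methods have applied deep learning, through time-series inference, text parsing, and similar techniques, and achieved better detection results. These methods, however, usually apply only to a specific domain or a specific error type, and they generalize poorly to the complex situations found in practice. Based on these observations, and combining the strengths of traditional methods and deep learning, this paper proposes CEDM, a multi-view model for comprehensive detection of multiple error types. First, from the pattern view, it combines existing constraints with multi-dimensional statistical analysis at the attribute, cell, and tuple levels to construct basic detection rules. Then, it captures data semantics through word embedding and, from the semantic view, analyzes attribute relevance, cell dependency, and tuple similarity, using these semantic relations to update and extend the basic rules along multiple dimensions. Finally, combining multiple views...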

Keywords:Data quality  Error detection  Multiple views  Data semantics
Received:2022/5/15
Revised:2022/7/29

Comprehensive Error Detection Method for Multiple Types Errors Based on Multiple Views
PENG Jin-Feng,SHEN De-Rong,KOU Yue,NIE Tie-Zheng.Comprehensive Error Detection Method for Multiple Types Errors Based on Multiple Views[J].Journal of Software,2023,34(3):1049-1064.
Authors:PENG Jin-Feng  SHEN De-Rong  KOU Yue  NIE Tie-Zheng
Affiliation:School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
Abstract:With the development of the information society, data have grown in scale and become more varied. Data are now an important strategic resource and a vital basis for scientific management in countries and enterprises. However, as social life generates more data, a large amount of dirty data comes with it, and data quality problems follow. Detecting errors accurately and comprehensively has long been a pain point in data science. Although many traditional methods based on constraints or statistics are widely used, they are usually limited by the prior knowledge and labor cost they require. Recently, some novel methods have used deep learning models to run inference over time-series data or to analyze context data, and have achieved better performance. However, these methods tend to apply only to specific areas or specific types of errors, so they are not general enough for complex real-world cases. Based on these observations, this paper combines the advantages of traditional methods and deep learning models to propose a comprehensive error detection method (CEDM) that handles multiple types of errors from multiple views. First, under the pattern view, basic detection rules are constructed from statistical analysis with constraints across multiple dimensions, including attributes, cells, and tuples. Next, under the semantic view, data semantics are captured by word embedding, and attribute relevance, cell dependency, and tuple similarity are analyzed. The basic rules are then extended and updated according to these semantic relations in the different dimensions. Finally, errors of multiple types are detected comprehensively and accurately across multiple views. Extensive experiments on real and synthetic datasets demonstrate that the method outperforms state-of-the-art error detection methods and generalizes better, applying to multiple areas and multiple error types.
Keywords:Data quality  Error detection  Multiple views  Data semantics
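
The pattern-view step in the abstract (statistical analysis at the attribute and cell levels to derive basic detection rules) can be illustrated with a small sketch. This is not the CEDM implementation, which the abstract does not specify; it is a generic frequency-based pattern check, and the function names `generalize` and `flag_rare_patterns`, the `min_support` threshold, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter


def generalize(value: str) -> str:
    """Map a cell value to a coarse pattern: digits -> '9', letters -> 'A'.

    Runs of the same symbol are collapsed, so "110819" becomes "9".
    """
    out = []
    for ch in value:
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "A"
        else:
            sym = ch  # keep punctuation and whitespace as-is
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def flag_rare_patterns(rows, min_support=0.2):
    """Return (row_index, attribute) pairs whose pattern is rare in its column.

    min_support is a hypothetical threshold: a pattern shared by fewer than
    this fraction of the rows is treated as suspicious.
    """
    flagged = []
    if not rows:
        return flagged
    for attr in rows[0]:
        patterns = [generalize(str(row[attr])) for row in rows]
        counts = Counter(patterns)
        for i, pattern in enumerate(patterns):
            if counts[pattern] / len(rows) < min_support:
                flagged.append((i, attr))
    return flagged


# Toy data (made up for illustration); the last row has a digit typed into
# the city name and a letter typed into the zip code.
rows = [
    {"city": "Shenyang", "zip": "110819"},
    {"city": "Shenyang", "zip": "110004"},
    {"city": "Dalian", "zip": "116000"},
    {"city": "Anshan", "zip": "114000"},
    {"city": "Fushun", "zip": "113008"},
    {"city": "Sheny4ng", "zip": "11O819"},
]
suspects = flag_rare_patterns(rows)
print(suspects)
```

A check like this only sees the syntactic shape of each value, so a wrong but well-formed value would pass. That gap is what the semantic view described in the abstract is meant to cover.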