
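HY-COCA: A New Method for Correlation Detection under Two-Dimensional Hybrid Data Distributions
Cite this article: CAO Wei, WANG Qiu-yue, QIN Xiong-pai, WANG Shan. HY-COCA: A New Method for Correlation Detection under Two-Dimensional Hybrid Data Distributions[J]. Computer Science, 2015, 42(6): 193-203
Authors: CAO Wei  WANG Qiu-yue  QIN Xiong-pai  WANG Shan
Affiliation: School of Information, Renmin University of China, Beijing 100872
Funding: This work was supported by the National Natural Science Foundation of China (61202331,3) and the Open Research Fund of the State Key Laboratory of Software Engineering (SKLSE2012-09-33)
Abstract: A hybrid data distribution is one in which different regions of the data follow different distributions. For example, between the attributes sales amount and region, the two are approximately independent in the range of low sales amounts, whereas in the range of high sales amounts they exhibit an approximate functional dependency. Existing work on detecting data correlation focuses on producing a single overall measure of two-dimensional correlation and cannot detect the distinct correlations of sub-regions. In statistical analysis, sub-regions with such distinct correlations carry richer statistical meaning and deserve attention. This paper studies and proposes HY-COCA, a new method for detecting data correlation in the presence of such hybrid data distributions. Building on the entropy correlation coefficient, the method narrows the search space of sub-regions and thereby reduces complexity compared with the Naive method; HY-COCA also addresses how to judge differences in correlation between sub-regions and how to present the results. Experiments on generated data and on benchmark data verify the effectiveness of the method.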

Keywords: Data distribution  Hybrid data distribution  Correlation  Data distribution regions  Correlation difference score

HY-COCA:A Hybrid-data-distribution-aware Way to Detect Correlation over Bi-dimensional Data Space
CAO Wei,WANG Qiu-yue,QIN Xiong-pai and WANG Shan. HY-COCA:A Hybrid-data-distribution-aware Way to Detect Correlation over Bi-dimensional Data Space[J]. Computer Science, 2015, 42(6): 193-203
Authors:CAO Wei  WANG Qiu-yue  QIN Xiong-pai  WANG Shan
Affiliation: School of Information, Renmin University of China, Beijing 100872, China
Abstract: Hybrid data distribution between two attributes means that different data sub-regions exhibit different associations. For example, in a distribution between sales amounts and cities, the two attributes are nearly independent at lower sales amounts, whereas at higher sales amounts they exhibit a soft functional dependency. Previous research on automatic detection of association has focused on deriving an overall measure of association over two-dimensional distributions and cannot address the hybrid data distribution problem. In statistical analysis, sub-regions with particular data associations are worth attention. This paper proposes a new method, HY-COCA, that detects data associations both globally and locally and finds the sub-regions with special data associations. Experiments on both synthetic and benchmark data verify the effectiveness of HY-COCA.
Keywords:Data distribution  Hybrid data distribution  Data association  Sub-regions in data distribution  Differentiating score of association
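
The motivating example in the abstract (near independence at low sales amounts, near functional dependency at high ones) can be illustrated with a small sketch. The code below is not HY-COCA itself: the abstract does not give the paper's definition of its association measure or its sub-region search, so this uses a normalized mutual information, 2·I(X;Y)/(H(X)+H(Y)), as an assumed stand-in, and it splits the data at a boundary that is known in advance. The toy data and the names `entropy`, `normalized_mi`, `low`, and `high` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete, hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_mi(pairs):
    """Return 2*I(X;Y) / (H(X)+H(Y)), a score in [0, 1].

    Assumed stand-in for an entropy-based association measure; returns 0.0
    when both attributes are constant (zero total entropy).
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    hx, hy, hxy = entropy(xs), entropy(ys), entropy(pairs)
    if hx + hy == 0:
        return 0.0
    return 2 * (hx + hy - hxy) / (hx + hy)


# Hypothetical toy data: (sales_bucket, region_id) pairs.
# Low sales buckets 0..3: every region appears once per bucket (independent).
low = [(s, r) for s in range(4) for r in range(4)]
# High sales buckets 4..7: the region is fully determined by the bucket.
high = [(s, s - 4) for s in range(4, 8) for _ in range(4)]
data = low + high

score_all = normalized_mi(data)
score_low = normalized_mi(low)
score_high = normalized_mi(high)
print(round(score_all, 3), round(score_low, 3), round(score_high, 3))
# prints: 0.4 0.0 1.0
```

On this toy data the single global score (0.4) hides the fact that one sub-region is fully independent (0.0) and the other fully dependent (1.0), which is the situation the abstract says a global measure cannot reveal. Finding the sub-region boundaries automatically, rather than assuming them as done here, is the part the paper addresses.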
This article is indexed in Wanfang Data and other databases.