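An Algorithm Based on Partition for Outlier Detection
Cite this article: SUN Huan-Liang, BAO Yu-Bin, YU Ge, ZHAO Fa-Xin, WANG Da-Ling. An Algorithm Based on Partition for Outlier Detection[J]. Journal of Software, 2006, 17(5): 1009-1016.
Authors: SUN Huan-Liang  BAO Yu-Bin  YU Ge  ZHAO Fa-Xin  WANG Da-Ling
Affiliations: 1. School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110006; School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning 110015
2. School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110006
Funding: Chinese Academy of Sciences funded project; Ministry of Education funding program for outstanding young teachers; Natural Science Foundation of Liaoning Province; Science and Technology Key Project of the Education Department of Liaoning Province
Abstract: Outliers are data objects that do not share the general characteristics of the data. Partition-based methods divide the space over which the points of a dataset are distributed into a set of disjoint hyper-rectangular cells, assign each data object to a cell, and then use per-cell statistics to find outliers. Because most real datasets are highly skewed, partitioning produces a large number of empty cells, which degrades algorithm performance. To address this, a new index structure, the CD-Tree (cell dimension tree), is proposed for indexing the non-empty cells. To optimize the CD-Tree structure and to guide the partitioning of the data, the concept of partition-based skew of data (SOD) is introduced. A new outlier detection algorithm is designed on the basis of the CD-Tree and SOD. Experimental results show that, compared with the cell-based algorithm, the new algorithm is significantly better in both efficiency and the number of dimensions it can handle effectively.

Keywords: data mining  outlier detection  partition  cell-based algorithm
Received: 2004-06-26
Revised: 2005-05-23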
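
The abstract's point about empty cells can be illustrated with a small sketch. It assumes data scaled to the unit hypercube and an equal-width grid with `k` intervals per dimension; the helper names (`cell_of`, `nonempty_fraction`, `clustered_points`) and the Gaussian test data are hypothetical choices for illustration. The occupancy ratio computed here is not the paper's SOD measure, whose definition the abstract does not give.

```python
import random


def cell_of(point, k):
    """Map a point in [0, 1)^d to its cell index on a grid with k intervals per dimension."""
    return tuple(min(int(x * k), k - 1) for x in point)


def nonempty_fraction(points, k):
    """Fraction of the k**d grid cells that contain at least one point."""
    d = len(points[0])
    occupied = {cell_of(p, k) for p in points}
    return len(occupied) / (k ** d)


def clustered_points(n, d, seed=0):
    """Hypothetical skewed data: one tight Gaussian cluster, clipped to [0, 1)."""
    rng = random.Random(seed)
    return [
        tuple(min(max(rng.gauss(0.5, 0.05), 0.0), 0.999999) for _ in range(d))
        for _ in range(n)
    ]


if __name__ == "__main__":
    for d in (2, 3, 4):
        frac = nonempty_fraction(clustered_points(1000, d), k=10)
        print(f"d={d}: {frac:.4%} of cells are non-empty")
```

Since `n` points can occupy at most `n` cells, the non-empty fraction is bounded by `n / k**d` and shrinks quickly as the dimension grows, which is why storing every cell becomes wasteful.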

An Algorithm Based on Partition for Outlier Detection
SUN Huan-Liang, BAO Yu-Bin, YU Ge, ZHAO Fa-Xin and WANG Da-Ling. An Algorithm Based on Partition for Outlier Detection[J]. Journal of Software, 2006, 17(5): 1009-1016.
Authors:SUN Huan-Liang  BAO Yu-Bin  YU Ge  ZHAO Fa-Xin and WANG Da-Ling
Affiliation:1.School of Information Science and Engineering, Northeastern University, Shenyang 110006, China; 2.School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang 110015, China
Abstract: Outliers are objects that do not comply with the general behavior of the data. The partition method divides the data space into a set of non-overlapping rectangular cells by splitting every dimension into equal-length intervals. Statistical information about the cells is then used to discover knowledge in the dataset. Real-life datasets are often heavily skewed, so partitioning produces many empty cells, which hurts the efficiency of the algorithms. An efficient index structure called CD-Tree (cell dimension tree) is designed for indexing cells. Moreover, to guide the partitioning and to optimize the structure of the CD-Tree, the concept of SOD (skew of data) is proposed to measure the degree of data skew. Finally, an outlier detection algorithm is designed on top of the CD-Tree and SOD. Experimental results on real-life datasets show that, compared with the Cell-based algorithm, the CD-Tree-based algorithm is markedly more efficient and can process a larger number of dimensions.
Keywords: CD-Tree (cell dimension tree)
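
The general idea of cell-based detection over non-empty cells can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: a plain Python dict stands in for the CD-Tree (whose structure the abstract does not describe), and the outlier rule (a point is flagged when its cell and the adjacent cells together hold fewer than `min_pts` points) is a hypothetical stand-in for the statistics the authors use. The names `build_cell_index`, `neighbourhood_count`, and `detect_outliers` are likewise invented for this example.

```python
from collections import defaultdict
from itertools import product


def build_cell_index(points, cell_width):
    """Group points by grid cell; only non-empty cells get an entry."""
    index = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        index[key].append(p)
    return dict(index)


def neighbourhood_count(index, key):
    """Count the points in a cell and in all cells adjacent to it."""
    total = 0
    for offset in product((-1, 0, 1), repeat=len(key)):
        neighbour = tuple(k + o for k, o in zip(key, offset))
        total += len(index.get(neighbour, ()))
    return total


def detect_outliers(points, cell_width, min_pts):
    """Flag points whose cell neighbourhood holds fewer than min_pts points."""
    index = build_cell_index(points, cell_width)
    outliers = []
    for key, members in index.items():
        if neighbourhood_count(index, key) < min_pts:
            outliers.extend(members)
    return outliers


if __name__ == "__main__":
    cluster = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
    data = cluster + [(9.0, 9.0)]
    print(detect_outliers(data, cell_width=1.0, min_pts=3))
```

Because the dict holds entries only for occupied cells, memory grows with the number of non-empty cells rather than with the full grid. The neighbour scan still costs `3**d` lookups per cell, so this sketch does not address high dimensionality the way a dedicated index might.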
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.