首页 | 本学科首页   官方微博 | 高级检索  
     

基于密度的Top-n局部异常点快速检测算法
引用本文:刘芳,齐建鹏,于彦伟,曹磊,赵金东.基于密度的Top-n局部异常点快速检测算法[J].自动化学报,2019,45(9):1756-1771.
作者姓名:刘芳  齐建鹏  于彦伟  曹磊  赵金东
作者单位:1.烟台大学计算机与控制工程学院 烟台 264005 中国
基金项目:山东省自然科学基金ZR2016FM42国家自然科学基金61403328国家自然科学基金61703360国家自然科学基金61773331山东省高等学校科技计划J17KA091
摘    要:局部异常检测(Local outlier factor,LOF)能够有效解决数据倾斜分布下的异常检测问题,在很多应用领域具有较好的异常检测效果.本文面向大数据异常检测,提出了一种快速的Top-n局部异常点检测算法MTLOF(Multi-granularity upper bound pruning based top-n LOF detection),融合索引结构和多层LOF上界设计了多粒度的剪枝策略,以快速发现Top-n局部异常点.首先,提出了四个更接近真实LOF值的上界,以避免直接计算LOF值,并对它们的计算复杂度进行了理论分析;其次,结合索引结构和UB1、UB2上界,提出了两层的Cell剪枝策略,不仅采用全局Cell剪枝策略,还引入了基于Cell内部数据对象分布的局部剪枝策略,有效解决了高密度区域的剪枝问题;再次,利用所提的UB3和UB4上界,提出了两个更加合理有效的数据对象剪枝策略,UB3和UB4上界更加接近于真实LOF值,有利于剪枝更多数据对象,而基于计算复用的上界计算方法,大大降低了计算成本;最后,优化了初始Top-n局部异常点的选择方法,利用区域划分和建立的索引结构,在数据稀疏区域选择初始局部异常点,有利于将LOF值较大的数据对象选为初始局部异常点,有效提升初始剪枝临界值,使得初始阶段剪枝掉更多的数据对象,进一步提高检测效率.在六个真实数据集上的综合实验评估验证MTLOF算法的高效性和可扩展性,相比最新的TOLF(Top-n LOF)算法,时间效率提升可高达3.5倍.

关 键 词:异常检测    局部异常检测    Top-n    剪枝策略
收稿时间:2018-06-13

A Fast Algorithm for Density-based Top-n Local Outlier Detection
Affiliation:1.School of Computer and Control Engineering, Yantai University, Yantai 264005, China2.CSAIL, Massachusetts Institute of Technology, Cambridge MA 02139, USA
Abstract:Local outlier factor (LOF) effectively addresses the problem of outlier detection in skewed datasets, which has been shown remarkable detection performance in variety of applications. In this paper, we propose an efficient Top-n local outlier detection algorithm, called MTLOF (Multi-granularity upper bound pruning based top-n LOF detection), for fast detecting top-n local outliers in large-scale datasets. First, we propose four LOF upper bounds that are closer to the real LOF value to avoid the computations of LOF values, and analyze their computational complexity theoretically. Second, by combining with index structure and the upper bounds UB1 and UB2, we propose a two-layer Cell pruning strategy, not only adopting the global Cell pruning strategy, but also introducing a local pruning strategy based on the internal data objects of Cells, which effectively prunes the high-density regions. Third, we propose two more reasonable and effective data object pruning strategies using the proposed upper bounds UB3 and UB4. UB3 and UB4 are closer to the real LOF value, which benefits to pruning more data objects. On the other hand, the upper bound calculation method based on computation reuse greatly reduces the computational cost. Finally, we optimize the selecting method of initial Top-n local outliers leveraging the established index structure. Specifically, we select the initial Top-n local outliers in sparse regions, which is conducive to selecting the data objects with a larger LOF value as the initial local outliers. Experimental study on six real-world datasets demonstrates the efficiency and scalability of our proposed MTLOF-up to 3.5 times faster than the state-of-the-art TOLF (Top-n LOF) method.
Keywords:
点击此处可从《自动化学报》浏览原始摘要信息
点击此处可从《自动化学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号