Top-k outlier detection from uncertain data |
| |
Authors: | Salman Ahmed Shaikh, Hiroyuki Kitagawa |
| |
Affiliation: | Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan |
| |
Abstract: | Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include measurement limitations, noise, inconsistent supply voltage, and delay or loss of data in transfer. To manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop. This approach is very costly because the distance function between two uncertain objects is expensive to evaluate. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency and scalability of the proposed algorithms. |
| |
Keywords: | Top-k distance-based outlier detection; uncertain data; Gaussian uncertainty; cell-based approach; PC-list based approach |
Indexed in: | CNKI, VIP, SpringerLink and other databases |
Journal: | International Journal of Automation and Computing |
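The naive nested-loop baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm and does not implement the PC-list. It assumes one-dimensional Gaussian objects with a strictly positive standard deviation, and it assumes that objects are ranked by their expected number of neighbours within a distance `radius`, with the fewest neighbours ranked first. The names `Gaussian`, `neighbour_probability` and `top_k_outliers` are hypothetical and chosen only for this illustration.

```python
import heapq
import math
from typing import List, Tuple

# Assumption: a 1-D uncertain object is (mean, standard deviation), with sd > 0.
Gaussian = Tuple[float, float]


def _phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def neighbour_probability(a: Gaussian, b: Gaussian, radius: float) -> float:
    """Pr(|A - B| <= radius) for independent 1-D Gaussians A and B.

    A - B is Gaussian with mean (mu_a - mu_b) and variance (sd_a^2 + sd_b^2).
    """
    mean = a[0] - b[0]
    sd = math.sqrt(a[1] ** 2 + b[1] ** 2)
    return _phi((radius - mean) / sd) - _phi((-radius - mean) / sd)


def top_k_outliers(objects: List[Gaussian], radius: float, k: int) -> List[int]:
    """Naive nested loop: return indices of the k objects with the fewest
    expected neighbours within `radius` (an assumed outlier score)."""
    scores = []
    for i, a in enumerate(objects):
        # Inner loop: every other object contributes its neighbour probability.
        expected = sum(
            neighbour_probability(a, b, radius)
            for j, b in enumerate(objects)
            if j != i
        )
        scores.append((expected, i))
    return [i for _, i in heapq.nsmallest(k, scores)]


if __name__ == "__main__":
    data = [(0.0, 0.1), (0.1, 0.1), (0.2, 0.1), (5.0, 0.1), (0.05, 0.1)]
    print(top_k_outliers(data, radius=1.0, k=1))
```

Every object is compared with every other object, so the number of probability evaluations grows quadratically with the dataset size. This is the cost that the abstract says the PC-list approach is designed to reduce.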