Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Statistical outlier detection using direct density ratio estimation   (Cited by: 2; self-citations: 2; citations by others: 0)
We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
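The density-ratio score described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes Gaussian kernels centered on every training point, a fixed kernel width `sigma` and regularization parameter `lam` (the abstract says these are chosen by cross-validation, which is omitted here), and a small hand-written linear solver so that only the standard library is needed. The names `fit_ulsif`, `gaussian_kernel` and `solve_linear_system` are hypothetical.

```python
import math

def gaussian_kernel(x, c, sigma):
    """Gaussian kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_ulsif(train, test, sigma=1.0, lam=0.1):
    """Return a function w(x) approximating p_train(x) / p_test(x)."""
    centers = train  # assumption: every training inlier is a kernel center
    b = len(centers)
    phi_test = [[gaussian_kernel(x, c, sigma) for c in centers] for x in test]
    phi_train = [[gaussian_kernel(x, c, sigma) for c in centers] for x in train]
    # H: second moment of the basis over test samples; h: mean over training samples.
    H = [[sum(p[i] * p[j] for p in phi_test) / len(test) for j in range(b)]
         for i in range(b)]
    h = [sum(p[i] for p in phi_train) / len(train) for i in range(b)]
    for i in range(b):
        H[i][i] += lam  # ridge regularization
    # Closed-form solution, with negative coefficients clipped to zero.
    alpha = [max(0.0, a) for a in solve_linear_system(H, h)]

    def ratio(x):
        return sum(a * gaussian_kernel(x, c, sigma) for a, c in zip(alpha, centers))

    return ratio

train = [(-1.0 + 0.25 * i,) for i in range(9)]  # inliers spread over [-1, 1]
test = train + [(5.0,)]                         # same points plus one far-away point
ratio = fit_ulsif(train, test)
scores = [ratio(x) for x in test]
most_suspicious = test[scores.index(min(scores))]
```

A low ratio means the point is far more likely under the test density than under the inlier-only training density, so the smallest score marks the most suspicious test point.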

2.
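赵峰, 秦锋. 《计算机工程》 (Computer Engineering), 2009, 35(19): 78-80
This paper studies a cell-based outlier detection algorithm and gives algorithms for partitioning the data space into cells and assigning data objects to them. To address shortcomings in how the algorithm sets its threshold M, the algorithm is improved and applied to the analysis of tax-payment behavior. Comparison with other outlier detection algorithms shows that the algorithm not only mines outliers in tax-payment behavior effectively but also determines where the outliers are located, which aids the analysis of tax-payment behavior.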
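The general cell-based idea can be sketched as follows. This is a simplified illustration and not the improved algorithm of the paper: it assumes 2-D data, a hypothetical fixed `cell_size`, and a rule that flags a cell when it and its eight adjacent cells together hold at most `m` objects (standing in for the threshold M mentioned in the abstract).

```python
from collections import defaultdict

def assign_to_cells(points, cell_size):
    """Map each 2-D point to the integer grid cell that contains it."""
    cells = defaultdict(list)
    for p in points:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[key].append(p)
    return cells

def cell_based_outliers(points, cell_size, m):
    """Return {cell: points} for cells whose 3x3 neighborhood holds at most m points."""
    cells = assign_to_cells(points, cell_size)
    outliers = {}
    for (cx, cy), members in cells.items():
        # Count objects in the cell itself and in its eight adjacent cells.
        nearby = sum(
            len(cells.get((cx + dx, cy + dy), ()))
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
        )
        if nearby <= m:
            outliers[(cx, cy)] = members
    return outliers

# Nine clustered records plus one isolated record.
records = [(1.0 + 0.1 * i, 1.0 + 0.1 * j) for i in range(3) for j in range(3)]
records.append((8.5, 8.5))
flagged = cell_based_outliers(records, cell_size=1.0, m=2)
```

Because the result is keyed by cell, it reports where each outlier lies as well as which records are outliers.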

3.
Principal component analysis (PCA) is well recognized in dimensionality reduction, and kernel PCA (KPCA) has also been proposed in statistical data analysis. However, KPCA fails to detect the nonlinear structure of data well when outliers exist. To mitigate this problem, this paper presents a novel algorithm, named iterative robust KPCA (IRKPCA). IRKPCA works well in dealing with outliers, and can be carried out in an iterative manner, which makes it suitable for processing incremental input data. As in the traditional robust PCA (RPCA), a binary field is employed for characterizing the outlier process, and the optimization problem is formulated as maximizing the marginal distribution of a Gibbs distribution. In this paper, this optimization problem is solved by stochastic gradient descent techniques. In IRKPCA, the outlier process is in a high-dimensional feature space, and therefore the kernel trick is used. IRKPCA can be regarded as a kernelized version of RPCA and a robust form of the kernel Hebbian algorithm. Experimental results on synthetic data demonstrate the effectiveness of IRKPCA.

4.
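周璨, 李伯阳, 黄斌, 刘刘. 《计算机工程》 (Computer Engineering), 2008, 34(8): 184-186
By analyzing the shortcomings of existing intrusion detection techniques and discussing the advantages of intrusion detection based on outlier mining, this paper proposes an intrusion detection method based on kernel density estimation. The method uses kernel density estimation to obtain an approximate set of outliers, then filters this approximate set to obtain the final outlier set, thereby detecting intrusion records. A concrete implementation scheme is described, and simulation experiments verify the feasibility of the method.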
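The two-stage idea in this abstract (an approximate outlier set from kernel density estimation, then a filtering step) might look roughly like the sketch below. The abstract does not say how the filtering is done, so the second stage here is purely an assumption: candidates are kept only if their density is below a fraction of the median density. The 1-D data, the bandwidth and all function and parameter names are hypothetical.

```python
import math

def kde_density(x, data, bandwidth):
    """Gaussian kernel density estimate at x for 1-D data."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

def detect_outliers(data, bandwidth=1.0, candidate_fraction=0.3, density_ratio=0.5):
    """Return records judged to be outliers by a two-stage density test."""
    densities = [kde_density(x, data, bandwidth) for x in data]
    # Stage 1: the lowest-density records form the approximate outlier set.
    order = sorted(range(len(data)), key=lambda i: densities[i])
    n_candidates = max(1, int(len(data) * candidate_fraction))
    candidates = order[:n_candidates]
    # Stage 2 (assumed filter): keep candidates far below the median density.
    median = sorted(densities)[len(densities) // 2]
    return [data[i] for i in candidates if densities[i] < density_ratio * median]

# Hypothetical feature values for ten connection records; one is far from the rest.
records = [10.0, 11.0, 12.0, 10.5, 11.5, 12.5, 11.0, 10.0, 12.0, 50.0]
suspected = detect_outliers(records)
```

The first stage deliberately over-selects (three candidates here), and the filter then discards the candidates that merely sit on the edge of the dense region.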

5.
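李云, 袁运浩, 陈峻. 《计算机工程》 (Computer Engineering), 2008, 34(19): 44-46
Outlier mining is an important research direction in data mining; its goal is to discover data objects that lack the general characteristics of the data set. Traditional outlier mining algorithms are usually based on itemset attributes and are not suited to multi-objective decision making and comprehensive evaluation. This paper proposes OMGRA, an outlier detection algorithm based on grey relational analysis, which mines the outlier set through an overall evaluation judgment number (总评价判断数) and thus avoids manually setting a threshold. A case study shows that the algorithm effectively detects the outliers in a data set and that the outliers it finds agree with the actual situation.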

6.
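Ontology mapping optimization method based on similarity computation   (Cited by: 2; self-citations: 1; citations by others: 2)
谷志锋, 刘勇, 郭跟成. 《计算机工程》 (Computer Engineering), 2008, 34(19): 56-57,6
In ontology mapping based on similarity computation, the main reason for the heavy similarity computation load is that there are too many concepts to be mapped and too many attributes to be computed. This paper adopts a filtering strategy that uses a candidate mapping strategy and an information gain strategy to reduce the number of concepts to be mapped and attributes to be computed. The filtering strategy makes full use of the characteristics of ontologies and of data mining ideas, filtering out concepts and attributes that are not worth computing and thereby reducing the similarity computation load. Experimental results show that the filtered-out concepts and attributes have very little effect on mapping quality.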

7.
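To improve the efficiency of outlier mining in high-dimensional data sets, an outlier clustering algorithm is proposed on the basis of an analysis of the strengths and weaknesses of traditional outlier mining algorithms. The algorithm combines the kernel method with the PP principal component transform inside an outlier clustering algorithm and uses the kernel-based PP principal component transform to reduce the dimensionality of the data. The corresponding nonlinear vectors are obtained through the data transformation matrix, and each vector is assigned a dynamic weight. By optimizing the iterative objective function of classical FCM fuzzy clustering, the weight of each data item is finally obtained, and the outliers in the data set are identified according to the magnitude of the weights. The convergence of the algorithm is proved theoretically, and simulation results show that the method effectively finds outliers in high-dimensional data sets.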

8.
Mining class outliers: concepts, algorithms and applications in CRM   (Cited by: 4; self-citations: 0; citations by others: 4)
Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is important for many applications and has attracted much attention from the data mining research community recently. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem ‘given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels’. By generalizing two pioneering contributions [Proc WAIM02 (2002); Proc SSTD03] in this field, we develop the notion of class outlier and propose practical solutions by extending existing outlier detection algorithms to this case. Furthermore, its potential applications in CRM (customer relationship management) are also discussed. Finally, experiments on real datasets show that our method can find interesting outliers and is of practical use.

9.
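To improve the efficiency of outlier mining in high-dimensional data sets, this paper analyzes traditional outlier mining algorithms and proposes an outlier detection algorithm. The algorithm transforms the nonlinear problem into a linear problem in a high-dimensional feature space, uses kernel principal component analysis for dimensionality reduction, and scans the projection components of the data objects one by one to decide whether each data point is an outlier. It is applicable to detecting outliers in both linearly separable and linearly non-separable data sets. Experiments demonstrate the advantages of the algorithm.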

10.
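Research on local outlier mining algorithms   (Cited by: 14; self-citations: 0; citations by others: 14)
Outliers can be divided into global outliers and local outliers. In many cases, mining local outliers is more meaningful than mining global outliers. Existing outlier mining algorithms based on the local outlier degree have limitations such as detection accuracy that depends on user-specified parameters and high computational complexity. This paper proposes dividing object attributes into intrinsic attributes and environmental attributes, using the environmental attributes to determine an object's neighborhood and the intrinsic attributes to compute its outlier degree, which overcomes these limitations. Taking spatial data as an example, it separates spatial from non-spatial attributes, uses the spatial attributes to determine the spatial neighborhood and the non-spatial attributes to compute the spatial outlier degree, and designs a spatial outlier mining algorithm. Experimental results show that the proposed algorithm depends little on the user and offers high detection accuracy, good scalability and high computational efficiency.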

11.
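Local outlier factor (LOF) detection effectively handles outlier detection under skewed data distributions and performs well in many application domains. Targeting outlier detection on big data, this paper proposes MTLOF (Multi-granularity upper bound pruning based top-n LOF detection), a fast top-n local outlier detection algorithm that combines an index structure with multi-level LOF upper bounds into a multi-granularity pruning strategy for quickly finding the top-n local outliers. First, four upper bounds that lie closer to the true LOF value are proposed to avoid computing LOF values directly, and their computational complexity is analyzed theoretically. Second, combining the index structure with the UB1 and UB2 upper bounds, a two-level cell pruning strategy is proposed: in addition to a global cell pruning strategy, it introduces a local pruning strategy based on the distribution of data objects inside each cell, which effectively solves the pruning problem in high-density regions. Third, using the proposed UB3 and UB4 upper bounds, two more reasonable and effective data object pruning strategies are proposed; because UB3 and UB4 are closer to the true LOF value, more data objects can be pruned, and computing the upper bounds with reuse of earlier computations greatly reduces the computational cost. Finally, the selection of the initial top-n local outliers is optimized: using region partitioning and the index structure, the initial local outliers are selected in sparse regions of the data, which favors choosing data objects with large LOF values, raises the initial pruning threshold so that more data objects are pruned in the initial stage, and further improves detection efficiency. A comprehensive experimental evaluation on six real data sets verifies the efficiency and scalability of MTLOF; compared with the recent TOLF (Top-n LOF) algorithm, time efficiency improves by up to 3.5 times.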

12.
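Statistical monitoring models based on principal component analysis (PCA) are strongly affected by outliers in the historical data, and outliers in the historical modeling data degrade process monitoring. This paper analyzes the principles and shortcomings of commonly used robust outlier detection algorithms and proposes combining the closest distance to center (CDC) method with the ellipsoidal multivariate trimming (MVT) method into a combined CDC-MVT outlier detection algorithm based on robust scaling, which detects outliers more accurately. Applied to an industrial fermentation process and compared with the CDC and MVT methods used alone, the algorithm effectively removes outliers from the modeling data.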

13.
The forecasting process of real-world time series has to deal with highly unexpected values, commonly known as outliers. Outliers in time series can lead to unreliable modeling and poor forecasts. Therefore, the identification of future outlier occurrence is an essential task in time series analysis to reduce the average forecasting error. The main goal of this work is to predict the occurrence of outliers in time series, based on the discovery of motifs. In this sense, motifs will be those pattern sequences preceding certain data marked as anomalous by the proposed metaheuristic in a training set. Once the motifs are discovered, if data to be predicted are preceded by any of them, such data are identified as outliers, and treated separately from the rest of regular data. The forecasting of outlier occurrence has been added as an additional step in an existing time series forecasting algorithm (PSF), which was based on pattern sequence similarities. Robust statistical methods have been used to evaluate the accuracy of the proposed approach regarding the forecasting of both occurrence of outliers and their corresponding values. Finally, the methodology has been tested on six electricity-related time series, in which most of the outliers were properly found and forecasted.

14.
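Spatiotemporal data outlier detection method   (Cited by: 1; self-citations: 0; citations by others: 1)
Based on the "k times the standard deviation" criterion, a spatiotemporal outlier detection method based on double deviation of the thematic attribute is proposed. Within the spatial neighborhood of each feature, the criterion is applied to detect spatially anomalous data at each time step; within the temporal neighborhood of each spatially anomalous datum, the criterion is applied again to decide whether the feature is also a temporal anomaly. Data that are anomalous in both the spatial and the temporal neighborhood are defined as spatiotemporal outliers. Experimental results show that the method is effective and feasible.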
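The double-deviation test described above can be sketched as follows. This is a minimal illustration under stated assumptions: each site has one thematic attribute observed at regular time steps, spatial neighbors are given explicitly in a dictionary, the temporal neighborhood is a symmetric window of `window` steps, and the population standard deviation is used. The names `deviates` and `spatiotemporal_outliers` and the sample data are hypothetical.

```python
from statistics import mean, pstdev

def deviates(value, neighbors, k):
    """k-sigma rule: value is more than k standard deviations from the neighbor mean."""
    return abs(value - mean(neighbors)) > k * pstdev(neighbors)

def spatiotemporal_outliers(series, spatial_neighbors, k=2.0, window=2):
    """series[s][t] is the thematic attribute of site s at time step t."""
    found = []
    for site, values in series.items():
        for t, value in enumerate(values):
            # First test: deviation from the spatial neighbors at the same time step.
            spatial = [series[n][t] for n in spatial_neighbors[site]]
            if not deviates(value, spatial, k):
                continue
            # Second test: deviation from the same site's values at nearby time steps.
            lo, hi = max(0, t - window), min(len(values), t + window + 1)
            temporal = [values[u] for u in range(lo, hi) if u != t]
            if deviates(value, temporal, k):
                found.append((site, t))
    return found

series = {
    "A": [10.0, 12.0, 10.0, 30.0, 10.0, 12.0],
    "B": [12.0, 10.0, 12.0, 10.0, 12.0, 10.0],
    "C": [10.0, 12.0, 12.0, 12.0, 10.0, 12.0],
    "D": [12.0, 10.0, 10.0, 12.0, 12.0, 10.0],
}
spatial_neighbors = {s: [n for n in series if n != s] for s in series}
anomalies = spatiotemporal_outliers(series, spatial_neighbors)
```

Only readings that fail both tests are reported, so a value that is unusual among its neighbors but normal for its own recent history is not flagged.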

15.
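李光强, 郑茂仪, 邓敏. 《计算机工程》 (Computer Engineering), 2010, 36(5): 35-36,39
Based on the "k times the standard deviation" criterion, a spatiotemporal outlier detection method based on double deviation of the thematic attribute is proposed. Within the spatial neighborhood of each feature, the criterion is applied to detect spatially anomalous data at each time step; within the temporal neighborhood of each spatially anomalous datum, the criterion is applied again to decide whether the feature is also a temporal anomaly. Data that are anomalous in both the spatial and the temporal neighborhood are defined as spatiotemporal outliers. Experimental results show that the method is effective and feasible.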

16.
This paper focuses on the development of an effective cluster validity measure with outlier detection and cluster merging algorithms for support vector clustering (SVC). Since SVC is a kernel-based clustering approach, the parameter of kernel functions and the soft-margin constants in Lagrangian functions play a crucial role in the clustering results. The major contribution of this paper is that our proposed validity measure and algorithms are capable of identifying ideal parameters for SVC to reveal a suitable cluster configuration for a given data set. A validity measure, which is based on a ratio of cluster compactness to separation with outlier detection and a cluster-merging mechanism, has been developed to automatically determine ideal parameters for the kernel functions and soft-margin constants as well. With these parameters, the SVC algorithm is capable of identifying the optimal number of clusters with compact and smooth arbitrary-shaped cluster contours for the given data set and increasing robustness to outliers and noise. Several simulations, including artificial and benchmark data sets, have been conducted to demonstrate the effectiveness of the proposed cluster validity measure for the SVC algorithm.

17.
A fuzzy index for detecting spatiotemporal outliers   (Cited by: 1; self-citations: 1; citations by others: 0)
The detection of spatial outliers helps extract important and valuable information from large spatial datasets. Most of the existing work in outlier detection views the condition of being an outlier as a binary property. However, for many scenarios, it is more meaningful to assign a degree of being an outlier to each object. The temporal dimension should also be taken into consideration. In this paper, we formally introduce a new notion of spatial outliers. We discuss the spatiotemporal outlier detection problem, and we design a methodology to discover these outliers effectively. We introduce a new index called the fuzzy outlier index, FoI, which expresses the degree to which a spatial object belongs to a spatiotemporal neighbourhood. The proposed outlier detection method can be applied to phenomena evolving over time, such as moving objects, pedestrian modelling or credit card fraud.

18.
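Density-based local outlier detection algorithm   (Cited by: 1; self-citations: 0; citations by others: 1)
Both statistics-based and distance-based outlier detection depend on the global distribution of the given set of data points, yet data are usually not uniformly distributed. When the data contain regions of widely differing density, density-based local outlier detection identifies local outliers well, but its time complexity is high. This paper proposes an improved algorithm that lowers the time complexity while still detecting local outliers effectively.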
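For orientation, the classic density-based local outlier factor (LOF) that work of this kind builds on can be sketched as below. This is the standard brute-force formulation, not the improved algorithm proposed in the paper, and it makes two simplifying assumptions: exactly `k` nearest neighbors are used (distance ties are broken by index rather than widening the neighborhood), and the data contain no duplicate points.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def local_outlier_factors(points, k=3):
    """Return the LOF score of every point; scores well above 1 suggest outliers."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [math.dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to point j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# A 3x3 grid of dense points plus one isolated point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 10.0))
lof = local_outlier_factors(points, k=3)
```

The pairwise distance computation is quadratic in the number of points, which is the kind of cost that improved algorithms aim to reduce.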

19.
In this study, we propose a novel local outlier detection approach, called LOMA, for mining local outliers in high-dimensional data sets. To improve the efficiency of outlier detection, LOMA prunes irrelevant attributes and objects in the data set by analyzing attribute relevance with a sparse factor threshold. Such a pruning technique substantially reduces the size of data sets. The core of LOMA is searching sparse subspaces, which applies the particle swarm optimization method to the reduced data sets. In the process of searching sparse subspaces, we introduce the sparse coefficient threshold to represent sparse degrees of data objects in a subspace, where the data objects are considered as local outliers. The attribute relevance analysis provides guidance for experts and users to identify useless attributes for detecting outliers. In addition, our sparse-subspace-based outlier algorithm is a novel technique for local-outlier detection in a wide variety of applications. Experimental results driven by both synthetic and UCI data sets validate the effectiveness and accuracy of our LOMA. In particular, LOMA achieves high mining efficiency and accuracy when the sparse factor threshold is set to a small value.

20.
When scanning an object using a 3D laser scanner, the collected scanned point cloud is usually contaminated by numerous measurement outliers. These outliers can be sparse outliers, isolated or non-isolated outlier clusters. The non-isolated outlier clusters pose a great challenge to the development of an automatic outlier detection method since such outliers are attached to the scanned data points from the object surface and are difficult to distinguish from these valid surface measurement points. This paper presents an effective outlier detection method based on the principle of majority voting. The method is able to detect non-isolated outlier clusters as well as the other types of outliers in a scanned point cloud. The key component is a majority voting scheme that can cut the connection between non-isolated outlier clusters and the scanned surface so that non-isolated outliers become isolated. An expandable boundary criterion is also proposed to remove isolated outliers and preserve valid point clusters more reliably than a simple cluster size threshold. The effectiveness of the proposed method has been validated by comparing with several existing methods using a variety of scanned point clouds.
