Similar literature
1.
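An outlier mining algorithm for high-dimensional mixed-attribute data   Total citations: 2, self-citations: 0, citations by others: 2
Li Qinghua (李庆华), Li Xin (李新), Jiang Shengyi (蒋盛益). 《计算机应用》 (Journal of Computer Applications), 2005, 25(6): 1353-1356
Outlier detection is one of the most fundamental problems in data mining, with wide application in fraud detection, weather forecasting, customer segmentation, and intrusion detection. To meet the needs of network intrusion detection, this paper proposes a new outlier mining algorithm based on clustering of mixed-attribute data and, building on the observation that outliers are by nature the rare points in a data set, gives new definitions of data similarity and degree of outlierness. The proposed algorithm has linear time complexity; experiments on the KDDCUP99 and Wisconsin Prognosis Breast Cancer data sets show that it delivers near-linear time complexity and good scalability while detecting the outliers in a data set well.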
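The abstract does not spell out its similarity definition, so the following is only a generic illustration of how a dissimilarity over mixed numeric and categorical attributes can be built. It is a Gower-style measure, not the authors' definition; the function name `mixed_distance` and the `numeric_ranges` parameter are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity in [0, 1] between two records.

    numeric_ranges maps an attribute position to its (min, max) range;
    every other position is treated as categorical.
    This is an illustrative assumption, not the paper's definition.
    """
    total = 0.0
    for idx, (x, y) in enumerate(zip(a, b)):
        if idx in numeric_ranges:
            lo, hi = numeric_ranges[idx]
            # Range-normalised absolute difference for numeric attributes.
            total += abs(x - y) / (hi - lo) if hi > lo else 0.0
        else:
            # Simple mismatch for categorical attributes.
            total += 0.0 if x == y else 1.0
    return total / len(a)
```

For example, two connection records that differ in protocol but share the same duration would score 0.5 under this sketch, since one of the two attributes mismatches.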

2.
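This paper proposes a new method for outlier detection in high-dimensional spaces: rough-set attribute reduction is used to reduce the dimensionality, and density-based outlier mining is then performed on the data set within each association-rule subspace, making outlier mining in high-dimensional spaces more practical. Data analysis shows that the algorithm effectively finds outliers in high-dimensional data sets.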

3.
Mining class outliers: concepts, algorithms and applications in CRM   Total citations: 4, self-citations: 0, citations by others: 4
Outliers, commonly referred to as exceptional cases, exist in many real-world databases. Detection of such outliers is important for many applications and has attracted much attention from the data mining research community recently. However, most existing methods are designed for mining outliers from a single dataset without considering the class labels of data objects. In this paper, we consider the class outlier detection problem ‘given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels’. By generalizing two pioneer contributions [Proc WAIM02 (2002); Proc SSTD03] in this field, we develop the notion of class outlier and propose practical solutions by extending existing outlier detection algorithms to this case. Furthermore, its potential applications in CRM (customer relationship management) are also discussed. Finally, experiments on real datasets show that our method can find interesting outliers and is of practical use.

4.
Outlier or anomaly detection is a fundamental data mining task with the aim to identify data points, events, transactions which deviate from the norm. The identification of outliers in data can provide insights about the underlying data generating process. In general, outliers can be of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to data points in their local neighbourhood. While several approaches have been proposed to scale up the process of global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into the use of high-dimensionality-specific indexing such as spatial approximate sample hierarchy (SASH) shows that our novel technique holds benefits over even these types of highly efficient indexing. 
We cement the practical applications of our novel technique with insights into what it means to find local outliers in real data including image and text data, and include potential applications for this knowledge.
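Since the abstract builds on the local outlier factor (LOF), a minimal brute-force sketch of the standard LOF definition may help make the density measurement concrete. It is not PINN: there is no projection or indexing, the neighbourhood is simplified to exactly k points (ties are cut off), and the points are assumed to be distinct. The helper names are hypothetical.

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Brute-force LOF; quadratic in the number of points."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neigh]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=3)
```

In this toy data set the five clustered points score close to 1, while the isolated point at (10, 10) scores far higher. The neighbour search is the quadratic step that the abstract's approach aims to speed up.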

5.
In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers - objects that behave in an unexpected way or have abnormal properties. The identification of outliers is important for many applications, such as intrusion detection, credit card fraud, criminal activities in electronic commerce, medical diagnosis and anti-terrorism. In this paper, we propose a hybrid approach to outlier detection, which combines the opinions from boundary-based and distance-based methods for outlier detection ([Jiang et al., 2005], [Jiang et al., 2009] and [Knorr and Ng, 1998]). We give a novel definition of outliers - BD (boundary and distance)-based outliers, by virtue of the notion of boundary region in rough set theory and the definitions of distance-based outliers. An algorithm to find such outliers is also given, and the effectiveness of our method for outlier detection is demonstrated on two publicly available databases.

6.
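Outlier detection in high-dimensional spaces   Total citations: 35, self-citations: 2, citations by others: 33
In many KDD (knowledge discovery in databases) applications, such as fraud monitoring in e-commerce, discovering exceptions or outliers is more meaningful than discovering regular knowledge. Most existing outlier detection methods target numeric attributes, and they can only find outliers without explaining what they mean. This paper proposes a definition of outlier based on a hypergraph model; the definition captures the notion of "local" and also explains the meaning of an outlier well. It also presents the HOT (hypergraph-based outlier test) algorithm, which detects outliers by computing each point's support, membership degree, and size deviation. The algorithm handles both numeric and categorical attributes. Analysis shows that it effectively finds outliers in high-dimensional data.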

7.
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
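To make the distance-based notion concrete, here is a naive sketch that ranks points by the distance to their k-th nearest neighbour, one common distance-based outlier score. It is the quadratic baseline that fast algorithms aim to beat, not RBRP itself; the function names are hypothetical.

```python
import math

def kth_nn_distance(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-th NN distance."""
    scores = kth_nn_distance(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(top_outliers(points, k=2, n=1))
```

Every point is compared with every other point, so the cost grows quadratically with the number of points, which is why log-linear scaling matters on large datasets.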

8.
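Existing outlier detection algorithms remain imperfect in generality, effectiveness, user-friendliness, and performance on large high-dimensional data sets. This paper therefore proposes a fast and effective global outlier detection method based on hierarchical clustering. Working from the hierarchical clustering result, the method uses the clustering tree and the distance matrix to judge visually how isolated the data are and to determine the number of outliers. Outliers are then removed without supervision, proceeding top-down through the clustering tree. Simulation experiments verify that the method identifies global outliers quickly and effectively, is user-friendly, suits data sets of different shapes, and can be applied to outlier detection on large high-dimensional data sets.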

9.
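The main purpose of outlier detection is to find anomalous data within massive data sets. It has two benefits: first, as a preprocessing step, it reduces the influence of noise points on a model; second, detecting anomalies in a specific scenario and mining the anomalous phenomena themselves is also very valuable. Mainstream methods at present, such as LOF, KNN, and ORCA, cannot handle complex scenarios in which global outliers, local outliers, and outlier clusters coexist. To address this, a new outlier detection model is proposed. To detect global outliers, local outliers, and outlier clusters as comprehensively as possible, the model builds on the high sensitivity of iForest, LOF, and DBSCAN to global outliers, local outliers, and outlier clusters respectively: these three are chosen as base classifiers, their objective functions are changed, the way the framework computes the error rate is revised, and the three are fused to form the new outlier detection model ILD-BOOST. Experimental results show that the model fully covers the detection of global outliers, local outliers, and outlier clusters, and that it outperforms current mainstream outlier detection methods.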

10.
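Outlier detection is one of the important research directions in data mining; it finds, within large amounts of data, the small number of data objects that differ markedly from the majority. In detection applications such as network intrusion and abnormal events in wireless sensor networks, outlier detection is a technique of high practical value. To improve detection accuracy, this paper makes several improvements to the local outlier measure (SLOM) algorithm and proposes ESLOM, a density-based local outlier detection algorithm. Information entropy is introduced to determine the outlying attributes of data objects, and a weighted distance is used for the distance between objects, in order to improve detection accuracy. Theoretical analysis and experiments show that the algorithm is feasible and effective.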
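The abstract does not say how the entropy values feed into the weighted distance, so the sketch below only illustrates the two ingredients under an assumed weighting scheme: Shannon entropy per attribute, normalised into weights, and a weighted Euclidean distance. It is not the ESLOM algorithm; the function names and the normalisation are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete attribute column."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def entropy_weights(columns):
    """Assumed scheme: weight each attribute by its share of the total entropy."""
    ents = [entropy(col) for col in columns]
    total = sum(ents)
    if total == 0:
        # All attributes are constant, so fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [e / total for e in ents]

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two numeric records."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))
```

Under this assumption a constant attribute has zero entropy and therefore contributes nothing to the distance, while attributes whose values are more evenly spread weigh more.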
