Similar Documents
20 similar documents found (search time: 31 ms)
1.
To address the low efficiency of existing distance-based outlier detection algorithms on large-scale data, a distributed outlier detection algorithm based on clustering and indexing (DODCI) is proposed. First, a clustering method partitions the large dataset into clusters; then, an index is created for each cluster in parallel at the nodes of a distributed environment; finally, outlier detection is performed iteratively at each node using two optimization strategies and two pruning rules. Experimental results on synthetic datasets and a cleaned KDD CUP dataset show that, for large data volumes, the algorithm is nearly an order of magnitude faster than the Orca and iDOoR algorithms. Theoretical and experimental analysis indicates that the algorithm effectively improves the efficiency of outlier detection on large-scale data.
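The distance-based notion that DODCI accelerates can be stated in a few lines. The sketch below is the classical nested-loop test (a point is an outlier if too few points lie within a given radius), not the paper's distributed algorithm; the `radius` and `min_frac` parameters are illustrative choices, not values from the paper:

```python
import math

def distance_based_outliers(points, radius, min_frac):
    """Classical distance-based outlier test: a point is an outlier if
    fewer than min_frac of the other points lie within `radius` of it.
    This O(n^2) nested-loop baseline is what clustering-plus-indexing
    schemes such as DODCI are designed to speed up."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(1 for j, q in enumerate(points)
                         if i != j and math.dist(p, q) <= radius)
        if neighbours / (n - 1) < min_frac:
            outliers.append(i)
    return outliers

data = [(0, 0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5, 5)]
print(distance_based_outliers(data, radius=1.0, min_frac=0.5))  # → [4]
```

The pairwise scan is the bottleneck: partitioning the data into clusters first lets most distance computations be pruned away.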

2.
3.
Traditional clustering is an unsupervised learning process whose accuracy is affected by the choice of similarity measure and by outliers in the dataset; such algorithms also make little use of prior knowledge and cannot reflect user requirements. An outlier detection and semi-supervised clustering algorithm based on shared nearest neighbors is therefore proposed. The algorithm uses shared nearest neighbors as the similarity measure, judges whether a point is an outlier from the number of its nearest neighbors, and performs semi-supervised clustering on the dataset after the outliers have been removed. Extended prior knowledge is incorporated into the semi-supervised clustering process, and the dataset is clustered according to graph-partitioning principles. Simulations on real datasets show that the proposed algorithm effectively detects outliers and achieves good clustering results.
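A minimal sketch of the two ingredients named in the abstract: shared-nearest-neighbor (SNN) similarity, and flagging outliers from nearest-neighbor counts. The specific isolation criterion below (a point that appears in no other point's k-NN set) is one simple reading of the abstract, not the paper's exact rule:

```python
import math

def knn_sets(points, k):
    """Each point's set of k nearest neighbour indices (Euclidean)."""
    return [set(sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))[:k])
            for i in range(len(points))]

def snn_similarity(neigh, i, j):
    """Shared-nearest-neighbour similarity: the size of the overlap of
    two points' k-NN sets."""
    return len(neigh[i] & neigh[j])

def snn_outliers(points, k):
    """Flag points that appear in no other point's k-NN set -- a simple
    reading of 'judging outliers by the number of nearest neighbours'."""
    neigh = knn_sets(points, k)
    return [i for i in range(len(points))
            if not any(i in neigh[j] for j in range(len(points)) if j != i)]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(snn_outliers(data, k=3))                      # → [5]
print(snn_similarity(knn_sets(data, 3), 0, 3))      # → 3
```

Close points in the dense cluster share all three of their neighbours, while the isolated point is in nobody's neighbour list, which is exactly what the SNN view of density exploits.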

4.
A Two-Stage Anomaly Detection Method   (total citations: 4; self-citations: 0; others: 4)
New definitions of distance and of an object's outlier factor are proposed, and on this basis a two-stage anomaly detection method, TOD, is presented. In the first stage, a new clustering algorithm clusters the data; in the second stage, anomalies are detected using the objects' outlier factors. TOD's time complexity is linear in dataset size and approximately linear in the number of attributes, so the algorithm scales well and is suitable for large datasets. Theoretical analysis and experimental results show that TOD is robust and practical.

5.
An Efficient Anomaly Detection Method   (total citations: 3; self-citations: 0; others: 3)
Drawing on the idea of universal gravitation, a dissimilarity measure and a measure of cluster deviation are proposed, and on this basis a clustering-based anomaly detection method is presented. The method has approximately linear time complexity in both dataset size and the number of attributes, making it suitable for large-scale datasets. Theoretical analysis and experimental results on real datasets show that the method is effective, robust and practical.

6.
Outliers are objects whose attributes differ from those of normal points; outlier detection has important applications across industries, such as maintaining data purity and safeguarding operational security. Most existing algorithms detect outliers using traditional distance- or density-based criteria. The proposed algorithm instead assigns each object an "isolation degree" — how isolated the point is relative to its neighbours — and identifies outliers by ranking, which is more efficient than traditional approaches. Building on and optimizing the affinity propagation (AP) clustering algorithm, an outlier detection algorithm named APO (outlier detection algorithm based on affinity propagation) is proposed. An isolation module computes each sample point's isolation information, and an amplification factor is introduced to make the difference between outliers and normal points more pronounced, increasing the algorithm's sensitivity to outliers and thus its accuracy. Comparative experiments on synthetic and real datasets show that the algorithm is more sensitive to outliers than AP and, unlike other detection algorithms, can cluster the data while detecting outliers.
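The abstract does not give APO's formulas, so the following is only a hypothetical reading of an "isolation degree" with an amplification factor: a point's k-NN distance relative to the mean k-NN distance of its neighbours, raised to a power to widen the gap between outliers and normal points. Every formula here is an assumption for illustration:

```python
import math

def isolation_degrees(points, k=2, amplify=2.0):
    """Hypothetical isolation degree: ratio of a point's k-th-NN distance
    to the mean k-th-NN distance of its k nearest neighbours, raised to an
    amplification exponent (assumed form, not APO's published definition)."""
    n = len(points)
    def knn_idx(i):
        return sorted((j for j in range(n) if j != i),
                      key=lambda j: math.dist(points[i], points[j]))[:k]
    def knn_dist(i):
        return math.dist(points[i], points[knn_idx(i)[-1]])
    degrees = []
    for i in range(n):
        base = sum(knn_dist(j) for j in knn_idx(i)) / k
        degrees.append((knn_dist(i) / base) ** amplify)
    return degrees

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
deg = isolation_degrees(data)
print(max(range(len(data)), key=lambda i: deg[i]))  # → 4
```

Points inside the cluster score near 1 while the isolated point scores far higher, and raising the ratio to a power amplifies that contrast, mirroring the role the abstract describes for the amplification factor.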

7.
Existing outlier detection algorithms remain imperfect in generality, effectiveness, user-friendliness and performance on high-dimensional large datasets. A fast and effective global outlier detection method based on hierarchical clustering is therefore proposed. Based on the hierarchical clustering result, the degree of isolation of the data is judged visually from the cluster tree and the distance matrix, and the number of outliers is determined. Outliers are then removed, without supervision, from the top of the cluster tree downward. Simulation experiments verify that the method quickly and effectively identifies global outliers, is user-friendly, is applicable to datasets of different shapes, and can be used for outlier detection in large high-dimensional datasets.

8.
Uncertain data management, querying and mining have become important because much real-world data is now accompanied by uncertainty. Uncertainty in data is often caused by deficiencies in the underlying data-collection equipment, or is sometimes introduced deliberately to preserve data privacy. This work discusses the problem of distance-based outlier detection on uncertain datasets of Gaussian distribution. The naive approach to distance-based outliers on uncertain data is usually infeasible due to the expensive distance function, so a cell-based approach is proposed to quickly identify the outliers. The infinite support of the Gaussian distribution prevents the design of effective pruning techniques, so an approximate approach using a bounded Gaussian distribution is also proposed; approximating the Gaussian distribution by a bounded one enables an approximate but more efficient cell-based outlier detection approach. An extensive empirical study on synthetic and real datasets shows that the proposed approaches are effective, efficient and scalable.

9.
Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop and is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset's objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets demonstrates the accuracy, efficiency and scalability of the proposed algorithms.
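For certain (non-probabilistic) points, the top-k distance-based scheme reduces to ranking objects by their distance to their k-th nearest neighbour. The sketch below shows that certain-point simplification only; it ignores the Gaussian uncertainty model and the PC-list pruning that are the paper's actual contributions:

```python
import heapq
import math

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Rank objects by k-th-NN distance and return the n highest scorers.
    The PC-list approach prunes most of these distance computations; this
    sketch simply evaluates them all."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    return [i for _, i in heapq.nlargest(n, scores)]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (20, 20)]
print(top_n_outliers(data, k=2, n=2))  # → [5, 4]
```

The two isolated points get the largest k-th-NN distances and are returned in score order, which is the ranking the uncertain-data algorithms compute more cheaply via cell-level bounds.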

10.
One of the common endeavours in engineering applications is outlier detection, which aims to identify inconsistent records in large amounts of data. Although outlier detection schemes from the data mining discipline are acknowledged as a viable solution for efficiently identifying anomalies in these data repositories, current outlier mining algorithms require domain parameters as input. These parameters are often unknown, difficult to determine, and vary across datasets with different cluster features. This paper presents a novel resolution-based outlier notion and a nonparametric outlier-mining algorithm, which can efficiently identify and rank the top-listed outliers from a wide variety of datasets. The algorithm generates reasonable outlier results by taking both local and global features of a dataset into account. Experiments are conducted using both synthetic datasets and a real-life construction equipment dataset from a large road-building contractor. Comparison with current outlier mining algorithms indicates that the proposed algorithm is more effective and can be integrated into a decision support system to serve as a universal detector of potentially inconsistent records.

11.
Identification of attacks by a network intrusion detection system (NIDS) is an important task. In signature- or rule-based detection, previously encountered attacks are modeled and signatures/rules are extracted; these rules are then used to detect such attacks in the future. In an anomaly or outlier detection system, the normal network traffic is modeled instead, and any deviation from the normal model is deemed an outlier/attack. Data mining and machine learning techniques are widely used in offline NIDS. Unsupervised and supervised learning differ in how the NIDS dataset is treated: unsupervised learning finds patterns in the data and detects outliers, while supervised learning determines a learned function of the input features that generalizes over data instances. The intuition is that combining these two techniques may yield better performance. Hence, in this paper the advantages of unsupervised and supervised techniques are inherited in the proposed hierarchical model, devised in three stages to detect attacks in an NIDS dataset. The NIDS dataset is clustered using Dirichlet process (DP) clustering based on the underlying data distribution. Iteratively on each cluster, locally denser areas are identified using the local outlier factor (LOF), which in turn is discretized into four bins of separation based on LOF score. Further, in each bin the normal data instances are modeled using a one-class classifier (OCC). A combination of a density estimation method, a reconstruction method and boundary methods is used for the OCC model; a product-rule combination of the three methods takes into consideration the strengths of each in building a stronger OCC model. Any deviation from this model is considered an attack. Experiments are conducted on the KDD CUP'99 and SSENet-2011 datasets. The results show that the proposed model is able to identify attacks with a higher detection rate and low false alarms.

12.
Semi-supervised outlier detection based on fuzzy rough C-means clustering   (total citations: 1; self-citations: 0; others: 1)
This paper presents a fuzzy rough semi-supervised outlier detection (FRSSOD) approach that uses a set of labeled samples together with fuzzy rough C-means clustering. The method introduces an objective function that minimizes the sum squared error of the clustering results, the deviation from the known labeled examples, and the number of outliers. Using fuzzy rough C-means clustering, each cluster is represented by a center, a crisp lower approximation and a fuzzy boundary, and only points located in a boundary are further examined for possible reassignment as outliers. As a result, the method obtains better clustering results for normal points and better accuracy for outlier detection. Experimental results show that, on average, the proposed method maintains or improves detection precision and reduces the false alarm rate, as well as reducing the number of candidate outliers to be examined.

13.
The goal of outlier detection is to identify individuals in a dataset that differ markedly from the other samples, so as to detect anomalies or abnormal states in the data. Existing methods struggle to handle complex, nonlinearly distributed data effectively and face problems of parameter sensitivity and the diversity of data distributions. A new graph structure, the adaptive neighbour graph, is therefore proposed: it is edge-oriented, extracts features from the data iteratively, and identifies outliers by computing nearest-neighbour reachability, which reduces the influence of parameters while remaining applicable to data of different distribution types. To fully validate its performance, the method was compared with other methods on multiple synthetic and real datasets. The experimental results show that the method ranks first on average across all 19 datasets, demonstrating stability while maintaining high accuracy.

14.
The presence of outlier data makes data-mining results inaccurate or even wrong. Existing outlier detection algorithms remain imperfect in generality, effectiveness, user-friendliness and performance on high-dimensional large datasets. An effective global outlier detection method is therefore proposed: agglomerative hierarchical clustering is performed, the degree of isolation of the data is judged visually from the cluster tree and the distance matrix, and the number of outliers is determined. Outlying data points are then removed, without supervision, from the top of the cluster tree downward. Simulation results on several datasets show that the method effectively identifies the top n most isolated global outliers, is applicable to datasets of different shapes, is efficient and user-friendly, and is suitable for outlier detection in large high-dimensional datasets.

15.
This paper focuses on the development of an effective cluster validity measure with outlier detection and cluster-merging algorithms for support vector clustering (SVC). Since SVC is a kernel-based clustering approach, the kernel function parameter and the soft-margin constants in the Lagrangian functions play a crucial role in the clustering results. The major contribution of this paper is that the proposed validity measure and algorithms are capable of identifying ideal parameters for SVC to reveal a suitable cluster configuration for a given data set. A validity measure, based on a ratio of cluster compactness to separation with outlier detection and a cluster-merging mechanism, has been developed to automatically determine ideal parameters for the kernel functions and the soft-margin constants. With these parameters, the SVC algorithm is capable of identifying the optimal number of clusters with compact and smooth arbitrary-shaped cluster contours for the given data set, while increasing robustness to outliers and noise. Several simulations, including artificial and benchmark data sets, have been conducted to demonstrate the effectiveness of the proposed cluster validity measure for the SVC algorithm.

16.
Predicting the fault-proneness labels of software program modules is an emerging software quality assurance activity, and the quality of datasets collected from previous software versions affects the performance of fault prediction models. In this paper, we propose an outlier detection approach that uses metrics thresholds and class labels to identify class outliers. We evaluate our approach on public NASA datasets from the PROMISE repository. Experiments reveal that this novel outlier detection method improves the performance of robust software fault prediction models based on the Naive Bayes and Random Forests machine learning algorithms.

17.
To identify outliers in datasets with mixed attributes, an outlier detection algorithm based on shared nearest neighbours is proposed. By computing the shared-nearest-neighbour similarity between the clusters produced by incremental clustering, it can not only discover clusters of arbitrary shape but also detect global outliers in datasets of varying density. The algorithm's time complexity is approximately linear in dataset size and the number of attributes. Experimental results on synthetic and real datasets show that the proposed algorithm effectively detects the outliers in a dataset.

18.
Outlier detection algorithms are very widely applied in fields such as network intrusion detection and computer-aided medical diagnosis. To address the long execution times and low detection rates of the LDOF, CBOF and LOF algorithms on large-scale and high-dimensional datasets, an outlier detection algorithm based on random walks on a graph (BGRW) is proposed. First, the number of iterations, the damping factor and the outlier score of every object in the dataset are initialized; next, the walker's transition probabilities between objects are derived from their Euclidean distances; then, each object's outlier score is computed iteratively; finally, the objects with the highest outlier scores are reported as outliers. Experiments on real UCI datasets and synthetic datasets with complex distributions compare BGRW with LDOF, CBOF and LOF in terms of execution time, detection rate and false alarm rate. The results show that BGRW effectively reduces execution time and outperforms the compared algorithms on detection rate and false alarm rate.
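The random-walk idea can be sketched in the spirit of BGRW: transition probabilities favour nearby objects, so a damped walk visits isolated objects rarely and their stationary probability is low. The inverse-distance similarity kernel and the exact scoring below are assumptions for illustration, not the paper's published formulas:

```python
import math

def bgrw_scores(points, iterations=50, damping=0.85):
    """Damped random-walk (PageRank-style) visit probabilities over a
    complete graph whose edge weights are inverse Euclidean distances
    (assumed kernel). Low stationary probability indicates an outlier."""
    n = len(points)
    sim = [[0.0 if i == j else 1.0 / (math.dist(points[i], points[j]) + 1e-12)
            for j in range(n)] for i in range(n)]
    # row-normalize similarities into transition probabilities
    trans = [[sim[i][j] / sum(sim[i]) for j in range(n)] for i in range(n)]
    prob = [1.0 / n] * n
    for _ in range(iterations):
        prob = [(1 - damping) / n +
                damping * sum(prob[i] * trans[i][j] for i in range(n))
                for j in range(n)]
    return prob  # lowest values correspond to outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = bgrw_scores(data)
print(min(range(len(data)), key=lambda i: scores[i]))  # → 4
```

Because every cluster point routes almost all of its transition mass to the other cluster points, the isolated point receives little inbound probability and ends up with the lowest score.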

19.
To address the high time complexity and poor detection quality of existing outlier detection algorithms, a new outlier detection algorithm based on improved OPTICS clustering and LOPW is proposed. First, the improved OPTICS clustering algorithm preprocesses the raw dataset, and a preliminary outlier set is obtained by filtering the reachability plot produced by the clustering. Then, the outlier degree of the objects in the preliminary set is computed with a newly defined P-weighted local outlier factor (LOPW); when computing distances, a leave-one-out partition information-entropy increment is introduced to determine the attribute weights, improving detection accuracy. Experimental results show that the improved algorithm increases both computational efficiency and the precision of outlier detection.

20.
Distance-based outlier detection algorithms are constrained by a global threshold and can detect only global outliers. A two-stage outlier detection algorithm based on cluster partitioning is therefore proposed to mine local outliers. First, agglomerative hierarchical clustering is used to iteratively derive the k required by K-means, and K-means then partitions the dataset into a number of micro-clusters. Next, to improve mining efficiency, an information-entropy-based cluster filtering mechanism is proposed to decide whether a micro-cluster contains outliers. Finally, the corresponding local outliers are mined from the outlier-containing micro-clusters with a distance-based method. Experimental results show that the algorithm is efficient, has high detection precision and has low time complexity.
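The two-stage shape of this approach can be sketched with a plain K-means partition followed by a distance test inside each micro-cluster. The paper derives k from hierarchical clustering and filters micro-clusters by information entropy; here k is passed in directly, the first k points seed the centres, and a simple per-cluster radius test stands in for the entropy filter, so all thresholds are illustrative:

```python
import math

def kmeans(points, k, iters=20):
    """Plain K-means used as the micro-cluster stage (centres seeded from
    the first k points for determinism -- an assumption for brevity)."""
    centers = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[c]
                   for c, g in enumerate(groups)]
    return groups

def local_outliers(points, k, radius):
    """Second stage: within each micro-cluster, flag points farther than
    `radius` from every other member. A plain distance test standing in
    for the paper's entropy-based filtering of micro-clusters."""
    found = []
    for g in kmeans(points, k):
        for p in g:
            if len(g) > 1 and all(math.dist(p, q) > radius for q in g if q != p):
                found.append(p)
    return found

data = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3),
        (10, 10), (10, 11), (11, 10), (11, 11), (14, 14)]
print(local_outliers(data, k=2, radius=2.5))  # → [(3, 3), (14, 14)]
```

Neither flagged point is extreme globally, but each sits at the fringe of its own micro-cluster, which is exactly the local-outlier case a single global threshold misses.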


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号