Similar literature
 Found 20 similar documents (search time: 46 ms)
1.
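To address outlier mining in data streams, this paper proposes DOKM, an outlier detection algorithm that builds on K-means clustering and uses a distance-based criterion to decide whether a data point is an outlier. The sliding-window size is adjusted adaptively according to the results of concept drift detection on the stream, which enables outlier detection over the data stream. A series of experiments and comparisons with other outlier detection algorithms shows that DOKM detects outliers effectively on both synthetic and real data sets.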
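
The distance-based criterion mentioned in this abstract can be illustrated with a minimal sketch. This is not the DOKM algorithm itself: the centroids, the fixed `threshold`, and the function names are hypothetical, and the adaptive sliding window driven by drift detection is not modeled. The sketch only assumes that a point counts as an outlier when it lies farther than a threshold from every K-means centroid.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Return the Euclidean distance from point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def flag_outliers(window, centroids, threshold):
    """Return the points in the window whose nearest-centroid distance exceeds threshold."""
    return [p for p in window if nearest_centroid_distance(p, centroids) > threshold]

# Hypothetical centroids, e.g. produced by an earlier K-means run on the window.
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical contents of the current sliding window.
window = [(0.5, -0.2), (9.8, 10.1), (5.0, 5.0), (0.1, 0.3)]

outliers = flag_outliers(window, centroids, threshold=3.0)
print(outliers)
```

Only the point midway between the two centroids exceeds the threshold here; in a streaming setting the centroids and threshold would have to be refreshed as the window moves.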

2.
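Shannon's information entropy is widely used in rough sets. This paper uses rough entropy from rough set theory to detect outliers, proposes a rough entropy-based outlier detection method, and applies it to unsupervised intrusion detection. First, a new definition of outliers based on rough entropy is given and a corresponding detection algorithm, rough entropy-based outlier detection (REOD), is designed. Second, by treating intrusions as outliers, REOD is applied to intrusion detection, yielding a new unsupervised intrusion detection method. Experiments on several data sets show that REOD has good outlier detection performance. In addition, compared with existing intrusion detection methods, REOD achieves a higher detection rate and a lower false alarm rate; in particular, its computational cost is low, which makes it suitable for detecting intrusions in massive high-dimensional data.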

3.
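Sliding-window-based anomaly detection is an important topic in data stream mining. In many applications, data streams are transmitted over a distributed network, and distributed computing techniques are often used to obtain high-quality results in real time. This paper formally states the problem of continuous anomaly detection over distributed evolving data streams, proposes two anomaly definitions and detection algorithms based on kernel density estimation, and shows through extensive experiments on real data sets that the algorithms are efficient and scalable and meet the requirements of data stream applications.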

4.

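To address the large information loss and inaccuracy of traditional data stream clustering algorithms, a data stream clustering algorithm based on per-dimension maximum entropy is proposed. A dynamic data histogram is used to partition the data dimensions into dimension groups; the maximum entropy of each dimension is computed to partition the dimension space into clusters; data in the same dimension cluster are aggregated into micro-clusters; and anomalies in the stream are detected by comparing the information entropy of the micro-clusters and their distribution. The method speeds up clustering and overcomes the information loss of traditional data stream clustering algorithms. Experimental results show that the proposed algorithm improves the accuracy and effectiveness of anomaly detection in data streams.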


5.

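Information entropy is one of the important tools for measuring uncertain information in granular computing theory. Existing anomalous data mining algorithms mainly target deterministic data, and there are few reports of work that uses information entropy to measure uncertain data for anomaly mining. In view of this, after introducing the concept of information entropy, an entropy-based anomaly degree is defined to measure how anomalous data are relative to one another, and an information entropy-based anomalous data mining algorithm is proposed that can mine anomalous data effectively. Theoretical analysis and experimental results show that the proposed algorithm is effective and feasible.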


6.

Recently, sequence anomaly detection has been widely used in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. It is a challenge to design an anomaly detection method for a multi-dimensional sequence over the data stream to satisfy the requirements of accuracy and high speed. It is because: (1) Redundant dimensions in sequence data and large state space lead to a poor ability for sequence modeling; (2) Anomaly detection cannot adapt to the high-speed nature of the data stream, especially when concept drift occurs, and it will reduce the detection rate. On one hand, most existing methods of sequence anomaly detection focus on the single-dimension sequence. On the other hand, some studies concerning multi-dimensional sequence concentrate mainly on the static database rather than the data stream. To improve the performance of anomaly detection for a multi-dimensional sequence over the data stream, we propose a novel unsupervised fast and accurate anomaly detection (FAAD) method which includes three algorithms. First, a method called “information calculation and minimum spanning tree cluster” is adopted to reduce redundant dimensions. Second, to speed up model construction and ensure the detection rate for the sequence over the data stream, we propose a method called “random sampling and subsequence partitioning based on the index probabilistic suffix tree.” Last, the method called “anomaly buffer based on model dynamic adjustment” dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect multi-dimensional log audit data. Compared with the existing anomaly detection methods, FAAD has a good performance in detection rate and speed without being affected by concept drift.


7.
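Most existing concept drift detection methods focus on single-label data streams and can hardly meet the needs of multi-label data streams. This paper therefore proposes a concept drift detection algorithm for multi-label data streams based on hierarchical verification. The algorithm has a detection layer and a verification layer: the detection layer decides whether concept drift has occurred by detecting changes in the data distribution, and the verification layer confirms whether drift has truly occurred by judging the degree of change in the label confusion matrix. Experiments on real and synthetic multi-label data sets show that the algorithm performs better, detects concept drift effectively, and improves classification performance.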

8.

Data points situated near a cluster boundary are called boundary points and they can represent useful information about the process generating this data. The existing methods of boundary points detection cannot differentiate boundary points from outliers as they are affected by the presence of outliers as well as by the size and density of clusters in the dataset. Also, they require tuning of one or more parameters and prior knowledge of the number of outliers in the dataset for tuning. In this research, a boundary points detection method called BPF is proposed which can effectively differentiate boundary points from outliers and core points. BPF combines the well-known outlier detection method Local Outlier Factor (LOF) with Gravity value to calculate the BPF score. Our proposed algorithm StaticBPF can detect the top-m boundary points in the given dataset. Importantly, StaticBPF requires tuning of only one parameter i.e. the number of nearest neighbors \((k)\) and can employ the same \(k\) used by LOF for outlier detection. This paper also extends BPF for streaming data and proposes StreamBPF. StreamBPF employs a grid structure for improving k-nearest neighbor computation and an incremental method of calculating BPF scores of a subset of data points in a sliding window over data streams. In evaluation, the accuracy of StaticBPF and the runtime efficiency of StreamBPF are evaluated on synthetic and real data where they generally performed better than their competitors.


9.
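Anomaly detection in trajectory big data: research progress and system framework   Total citations: 1, self-citations: 0, citations by others: 1
The rapid development of positioning technology and pervasive computing has given rise to trajectory big data, which takes the form of large-scale, high-speed data streams generated by positioning devices. Analyzing and processing trajectory big data in stream form promptly and effectively can reveal anomalies hidden in the trajectory data and thereby serve applications such as urban planning, traffic management, and security control. Because trajectory big data is inherently uncertain, unbounded, time-varying and evolving, sparse, and skewed in distribution, traditional anomaly detection techniques cannot be applied to it directly. Anomaly detection methods for static trajectory data sets usually assume that the data distribution is known a priori and ignore the temporal characteristics of trajectory data, so they cannot evaluate dynamically evolving anomalous behavior in trajectory big data. Given the poor data quality and fast updates of trajectory big data, the core research tasks are to handle the concept drift caused by temporal change with limited system resources, detect diverse trajectory anomalies in real time, analyze the causal relations among trajectory anomalies, and then identify evolving, correlated trajectory anomalies over larger spatio-temporal regions. In addition, fusing multi-source heterogeneous data related to location-based service applications and analyzing the causes of anomalous trajectories and the anomalous events they imply is also a pressing problem. To address these problems, this paper classifies and summarizes research results on trajectory anomaly detection techniques. In view of the limitations of existing methods, it proposes a system architecture for anomaly detection in trajectory big data. Finally, it discusses future work on online anomaly detection for trajectory streams, evolution analysis of trajectory anomalies, benchmarking of trajectory anomaly detection systems, data fusion for semantic analysis of detection results, and visualization techniques for trajectory anomaly detection.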

10.
Statistical depth functions provide a center-outward ordering of multidimensional data, starting from the deepest point. In this sense, depth functions can measure the extremeness or outlyingness of a data point with respect to a given data set. Hence, they can detect outliers, that is, observations that appear extreme relative to the rest of the observations. Of the various statistical depths, the spatial depth is especially appealing because of its computational efficiency and mathematical tractability. In this article, we propose a novel statistical depth, the kernelized spatial depth (KSD), which generalizes the spatial depth via positive definite kernels. By choosing a proper kernel, the KSD can capture the local structure of a data set where the spatial depth fails. We demonstrate this on the half-moon data and the ring-shaped data. Based on the KSD, we propose a novel outlier detection algorithm, by which an observation with a depth value less than a threshold is declared an outlier. The proposed algorithm is simple in structure: the threshold is the only parameter for a given kernel. It applies to a one-class learning setting, in which normal observations are given as the training data, as well as to a missing label scenario, where the training set consists of a mixture of normal observations and outliers with unknown labels. We give upper bounds on the false alarm probability of a depth-based detector. These upper bounds can be used to determine the threshold. We perform extensive experiments on synthetic data and data sets from real applications. The proposed outlier detector is compared with existing methods. The KSD outlier detector demonstrates a competitive performance.
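
As a rough illustration of the depth-thresholding idea in this abstract, the sketch below computes the ordinary (non-kernelized) sample spatial depth, taken here as one minus the norm of the average unit vector pointing from the sample points to the query point, and flags points whose depth falls below a threshold. The kernelized variant (KSD) is not reproduced; the function names, the `threshold` value, and the toy sample are assumptions made for this example.

```python
import math

def spatial_depth(x, sample):
    """Sample spatial depth: 1 - norm of the mean unit vector from sample points to x."""
    dim = len(x)
    total = [0.0] * dim
    for y in sample:
        diff = [xi - yi for xi, yi in zip(x, y)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # a coincident sample point contributes the zero vector
        for j in range(dim):
            total[j] += diff[j] / norm
    mean_norm = math.sqrt(sum(t * t for t in total)) / len(sample)
    return 1.0 - mean_norm

def depth_outliers(points, sample, threshold):
    """Return the points whose spatial depth with respect to sample is below threshold."""
    return [p for p in points if spatial_depth(p, sample) < threshold]

# Toy training sample: four points placed symmetrically around the origin.
sample = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center_depth = spatial_depth((0.0, 0.0), sample)   # deepest point of this sample
far_depth = spatial_depth((100.0, 0.0), sample)    # far outside the sample
flagged = depth_outliers([(0.0, 0.0), (100.0, 0.0)], sample, threshold=0.1)
print(center_depth, far_depth, flagged)
```

Depth is close to 1 at the center of the sample and approaches 0 for points far outside it, so a single threshold on the depth value acts as the detector.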

11.
Traditional outlier mining methods identify outliers from a global point of view. These methods are inefficient at finding locally biased data points (outliers) in low-dimensional subspaces. Constrained concept lattices can be used as an effective formal tool for data analysis because they offer high construction efficiency, practicability, and pertinence. In this paper, we propose an outlier mining algorithm that treats the intent of any constrained concept lattice node as a subspace. We introduce sparsity and density coefficients to measure outliers in low-dimensional subspaces. The intent of any constrained concept lattice node is regarded as a subspace, and sparsity subspaces are searched by traversing the constrained concept lattice according to a sparsity coefficient threshold. If the intent of any parent node of the sparsity subspace is a density subspace according to a density coefficient threshold, then the objects contained in the extent of the sparsity subspace node are considered biased data points or outliers. Our experimental results show that the proposed algorithm performs very well on high red-shift spectral data sets.

12.
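To detect local outliers in data streams promptly, this paper analyzes existing outlier mining algorithms for data streams and proposes an outlier detection algorithm based on wavelet density estimation. Exploiting the multi-scale and multi-granularity nature of wavelet density estimation, it uses a wavelet probability threshold to decide whether each data point in the current sliding window is an outlier, and it discusses the outlier detection process over data streams. Simulation results show that the algorithm achieves higher detection efficiency and accuracy than kernel density estimation algorithms.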

13.
When scanning an object using a 3D laser scanner, the collected scanned point cloud is usually contaminated by numerous measurement outliers. These outliers can be sparse outliers or isolated or non-isolated outlier clusters. The non-isolated outlier clusters pose a great challenge to the development of an automatic outlier detection method, since such outliers are attached to the scanned data points from the object surface and are difficult to distinguish from these valid surface measurement points. This paper presents an effective outlier detection method based on the principle of majority voting. The method is able to detect non-isolated outlier clusters as well as the other types of outliers in a scanned point cloud. The key component is a majority voting scheme that can cut the connection between non-isolated outlier clusters and the scanned surface so that non-isolated outliers become isolated. An expandable boundary criterion is also proposed to remove isolated outliers and preserve valid point clusters more reliably than a simple cluster size threshold. The effectiveness of the proposed method has been validated by comparison with several existing methods on a variety of scanned point clouds.

14.
Outlier detection is an important data mining task with many contemporary applications. Clustering-based methods for outlier detection try to identify the data objects that deviate from the normal data. However, the uncertainty regarding the cluster membership of an outlier object has to be handled appropriately during the clustering process. Additionally, carrying out the clustering process on data described using categorical attributes is challenging, due to the difficulty in defining requisite methods and measures dealing with such data. Addressing these issues, a novel algorithm for clustering categorical data aimed at outlier detection is proposed here by modifying the standard \(k\)-modes algorithm. The uncertainty regarding the clustering process is addressed by considering a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets. The efficacy of the proposed rough \(k\)-modes clustering algorithm for outlier detection is demonstrated using various benchmark categorical data sets.

15.
It is challenging to use traditional data mining techniques to deal with real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach offers great flexibility and robustness in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.

16.
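The local outlier factor (LOF) handles anomaly detection under skewed data distributions effectively and performs well in many application domains. Targeting anomaly detection on big data, this paper proposes MTLOF (multi-granularity upper bound pruning based top-n LOF detection), a fast algorithm for detecting the top-n local outliers that combines an index structure with multi-level LOF upper bounds to design a multi-granularity pruning strategy. First, four upper bounds that are closer to the true LOF value are proposed to avoid computing LOF values directly, and their computational complexity is analyzed theoretically. Second, combining the index structure with the UB1 and UB2 bounds, a two-level cell pruning strategy is proposed: in addition to global cell pruning, it introduces local pruning based on the distribution of data objects within a cell, which effectively solves the pruning problem in high-density regions. Third, using the proposed UB3 and UB4 bounds, two more reasonable and effective data object pruning strategies are proposed; UB3 and UB4 are closer to the true LOF value, which helps prune more data objects, and the upper bound computation based on computation reuse greatly reduces the cost. Finally, the selection of the initial top-n local outliers is optimized: using region partitioning and the index structure, initial local outliers are chosen in sparse regions, which favors selecting data objects with large LOF values, effectively raises the initial pruning threshold so that more data objects are pruned in the initial phase, and further improves detection efficiency. A comprehensive experimental evaluation on six real data sets confirms the efficiency and scalability of MTLOF; compared with the state-of-the-art TOLF (Top-n LOF) algorithm, it is up to 3.5 times faster.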
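
For orientation, the quantity that MTLOF tries to avoid computing exhaustively can be sketched as a brute-force top-n LOF baseline. The sketch below is not MTLOF: it has no index structure, no upper bounds, and no pruning, and it runs in quadratic time. It also assumes distinct points and ignores ties at the k-th neighbor distance. The function names and the toy data are made up for this example.

```python
import math

def knn(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], points[j]), j) for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Compute the LOF score of every point by brute force (assumes distinct points)."""
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to the k-th neighbor
    lrd = []
    for i in range(n):
        # reachability distance of i from neighbor j: max(k_dist(j), d(i, j))
        reach_sum = sum(max(k_dist[j], d) for d, j in neighbors[i])
        lrd.append(len(neighbors[i]) / reach_sum)  # local reachability density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

def top_n_lof(points, k, n_top):
    """Return the n_top (index, score) pairs with the largest LOF scores."""
    scores = lof_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n_top]]

# Toy data: a 3x3 grid (indices 0..8) plus one distant point (index 9).
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
top = top_n_lof(points, k=3, n_top=1)
print(top)
```

Points inside the grid get scores near 1, while the distant point gets a much larger score and is returned as the top-1 local outlier. A pruning-based method aims to reach the same ranking without scoring every point.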

17.
Classifying streaming data requires the development of methods which are computationally efficient and able to cope with changes in the underlying distribution of the stream, a phenomenon known in the literature as concept drift. We propose a new method for detecting concept drift which uses an exponentially weighted moving average (EWMA) chart to monitor the misclassification rate of a streaming classifier. Our approach is modular and can hence be run in parallel with any underlying classifier to provide an additional layer of concept drift detection. Moreover, our method is computationally efficient with overhead O(1) and works in a fully online manner with no need to store data points in memory. Unlike many existing approaches to concept drift detection, our method allows the rate of false positive detections to be controlled and kept constant over time.
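
The EWMA monitoring idea can be sketched as follows, under stated assumptions. The monitor keeps a running estimate of the error rate and an exponentially weighted average of the 0/1 error indicators, and flags drift when the average exceeds the estimate by a multiple of its standard deviation. The class name, the fixed `control_limit`, the smoothing weight `lam`, and the `warmup` period are hypothetical choices for this example; the abstract's method additionally tunes the limit to hold the false positive rate constant, which is not reproduced here.

```python
import math

class EwmaDriftMonitor:
    """O(1)-memory monitor of a stream of 0/1 misclassification indicators."""

    def __init__(self, lam=0.2, control_limit=3.0, warmup=30):
        self.lam = lam                      # EWMA smoothing weight (assumed value)
        self.control_limit = control_limit  # fixed multiplier L (assumption)
        self.warmup = warmup                # observations to see before flagging
        self.t = 0
        self.p_hat = 0.0                    # running estimate of the error rate
        self.z = 0.0                        # EWMA of the error indicators

    def update(self, error):
        """Feed one indicator (1 = misclassified); return True if drift is flagged."""
        self.t += 1
        self.p_hat += (error - self.p_hat) / self.t
        if self.t == 1:
            self.z = float(error)
        else:
            self.z = (1.0 - self.lam) * self.z + self.lam * error
        # Standard deviation of the EWMA statistic for Bernoulli(p_hat) inputs.
        var_x = self.p_hat * (1.0 - self.p_hat)
        factor = self.lam / (2.0 - self.lam) * (1.0 - (1.0 - self.lam) ** (2 * self.t))
        sigma_z = math.sqrt(var_x * factor)
        return self.t > self.warmup and self.z > self.p_hat + self.control_limit * sigma_z

# Synthetic stream: 10% errors for 200 steps, then every prediction is wrong.
stream = [1 if i % 10 == 9 else 0 for i in range(200)] + [1] * 50
monitor = EwmaDriftMonitor()
detected_at = None
for error in stream:
    if monitor.update(error):
        detected_at = monitor.t
        break
print(detected_at)
```

Because the monitor only consumes error indicators, it can sit beside any classifier; after a detection, a real system would typically reset the monitor and retrain the model.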

18.
In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers, that is, objects that behave in an unexpected way or have abnormal properties. The identification of outliers is important for many applications such as intrusion detection, credit card fraud, criminal activities in electronic commerce, medical diagnosis, and anti-terrorism. In this paper, we propose a hybrid approach to outlier detection, which combines the opinions from boundary-based and distance-based methods for outlier detection ([Jiang et al., 2005], [Jiang et al., 2009] and [Knorr and Ng, 1998]). We give a novel definition of outliers, BD (boundary and distance)-based outliers, by virtue of the notion of boundary region in rough set theory and the definitions of distance-based outliers. An algorithm to find such outliers is also given. The effectiveness of our method for outlier detection is demonstrated on two publicly available databases.

19.
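To overcome the impact of concept drift on classification in data streams, a diversity and accuracy weighting ensemble classification algorithm (DAWE) is proposed. Unlike existing ensemble methods, DAWE considers both diversity and accuracy: it linearly weights a classifier's accuracy on the most recent data chunk and its diversity within the ensemble to measure the classifier's value to the current ensemble, and it uses this value measure in the base classifier replacement strategy. DAWE was compared with recent algorithms in MOA on real and synthetic data. The experiments show that the method is effective, that its average accuracy over all data sets is better than that of the other algorithms, and that it handles concept drift in data stream mining effectively.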
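
The linear weighting of accuracy and diversity described in this abstract can be sketched as below. The weight `alpha`, the way accuracy and diversity scores are obtained, and the rule "replace the lowest-value member only if the candidate scores higher" are all assumptions made for illustration; the abstract does not give DAWE's actual formula or replacement rule.

```python
def classifier_value(accuracy, diversity, alpha=0.5):
    """Linear combination of accuracy and diversity; alpha is a hypothetical weight."""
    return alpha * accuracy + (1.0 - alpha) * diversity

def replace_weakest(ensemble, candidate, alpha=0.5):
    """Return a new ensemble in which the lowest-value member is swapped for the
    candidate, provided the candidate's value is strictly higher."""
    def value(member):
        return classifier_value(member["accuracy"], member["diversity"], alpha)

    weakest = min(range(len(ensemble)), key=lambda i: value(ensemble[i]))
    if value(candidate) > value(ensemble[weakest]):
        return ensemble[:weakest] + [candidate] + ensemble[weakest + 1:]
    return list(ensemble)

# Hypothetical scores measured on the most recent data chunk.
ensemble = [
    {"name": "A", "accuracy": 0.9, "diversity": 0.2},
    {"name": "B", "accuracy": 0.6, "diversity": 0.3},
    {"name": "C", "accuracy": 0.7, "diversity": 0.8},
]
candidate = {"name": "D", "accuracy": 0.8, "diversity": 0.5}

updated = replace_weakest(ensemble, candidate)
names = [member["name"] for member in updated]
print(names)
```

Here member B has the lowest combined value and is replaced by the candidate; with `alpha` closer to 1 the rule would favor accuracy alone, and closer to 0 it would favor diversity.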

20.
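Granular computing theory provides a new and effective way to handle uncertain, incomplete, and inconsistent knowledge. Knowledge granularity is one of the important tools for measuring uncertain information in granular computing theory. Existing anomalous data mining algorithms mainly target deterministic data, and no work has yet been reported that uses knowledge granularity to measure uncertain data for anomaly mining. Therefore, after introducing the concept of knowledge granularity, relative knowledge granularity and an anomaly degree are defined to measure how anomalous data are relative to one another, and a knowledge granularity-based anomalous data mining algorithm is proposed that can mine anomalous data effectively. An example verifies the effectiveness of the algorithm.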
