Similar Documents
20 similar documents retrieved (search time: 15 ms)
1.
The rapid evolution of technology has led to the generation of high dimensional data streams in a wide range of fields, such as genomics, signal processing, and finance. The combination of the streaming scenario and high dimensionality is particularly challenging, especially for the outlier detection task. This is due to special characteristics of data streams, such as concept drift and limited time and space budgets, in addition to the well-known curse of dimensionality in high dimensional space. To the best of our knowledge, few studies have addressed these challenges simultaneously, and therefore detecting anomalies in this context requires a great deal of attention. The main objective of this work is to survey the main approaches in the literature and to identify a set of comparison criteria, such as computational cost and the interpretability of outliers, that help reveal the challenges and open research directions associated with this problem. The study closes with a summary of the main limitations identified and details the research directions related to this issue in order to promote research in this community.

2.
Semi-supervised outlier detection based on fuzzy rough C-means clustering
This paper presents a fuzzy rough semi-supervised outlier detection (FRSSOD) approach that exploits a small number of labeled samples together with fuzzy rough C-means clustering. The method introduces an objective function that minimizes the sum of squared clustering errors, the deviation from the known labeled examples, and the number of outliers. With fuzzy rough C-means, each cluster is represented by a center, a crisp lower approximation, and a fuzzy boundary, and only points located in a boundary region are further examined for reassignment as outliers. As a result, the method obtains better clustering results for normal points and better accuracy for outlier detection. Experimental results show that, on average, the proposed method keeps or improves detection precision, reduces the false alarm rate, and reduces the number of candidate outliers to be examined.
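A minimal sketch of the boundary-candidate idea: plain fuzzy C-means stands in for the full fuzzy rough objective (the label and outlier terms are omitted), and only low-membership boundary points that also lie far from every center are flagged. All names and thresholds below are illustrative.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    # Plain fuzzy C-means; FRSSOD adds labeled-sample and outlier
    # terms to this objective, which the sketch leaves out.
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(1, keepdims=True)
    return U, centers

def boundary_outliers(X, c=3, low=0.6, dist_q=95):
    U, centers = fcm(X, c)
    # Crisp lower approximation: points with a dominant membership
    # stay put; only ambiguous boundary points are outlier candidates.
    boundary = U.max(axis=1) < low
    dmin = np.linalg.norm(X[:, None] - centers[None], axis=2).min(1)
    return np.flatnonzero(boundary & (dmin > np.percentile(dmin, dist_q)))
```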

3.
Existing outlier detection algorithms are still imperfect in terms of generality, effectiveness, user friendliness, and performance on large high-dimensional datasets. To address this, a fast and effective hierarchical-clustering-based method for global outlier detection is proposed. Based on the hierarchical clustering result, the method visually assesses the degree of isolation of the data from the cluster tree and the distance matrix and determines the number of outliers, then removes outliers in an unsupervised top-down pass over the cluster tree. Simulation experiments verify that the method identifies global outliers quickly and effectively, is user friendly, works on datasets of different shapes, and scales to outlier detection in large high-dimensional datasets.
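A hedged sketch of the top-down idea using SciPy's hierarchical clustering: cutting the dendrogram just below its top merges peels off the points that join the tree last, which are treated as global outliers. The single-linkage choice and the cluster-count heuristic are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hclust_outliers(X, n_outliers, method="single"):
    # Cutting the dendrogram into n_outliers + 1 groups peels off the
    # points that merge into the tree last, i.e. the most isolated
    # ones; everything outside the largest group is reported.
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=n_outliers + 1, criterion="maxclust")
    main = np.argmax(np.bincount(labels))
    return np.flatnonzero(labels != main)
```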

4.
We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique treats objects (or points) as atomic units, so the size requirement on cells is waived without losing clustering accuracy. For efficiency, a new partitioning scheme makes the number of cells smoothly adjustable; the concept of ith-order neighbors avoids considering an exponential number of neighboring cells; and a novel density compensation improves clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves clustering accuracy and quality.
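For orientation, a baseline grid-density clusterer (bin, threshold, merge adjacent dense cells); the paper's adjustable partitioning, ith-order neighbors, and density compensation refine each of these steps. All parameters below are illustrative.

```python
import numpy as np
from collections import defaultdict

def grid_density_clusters(X, n_cells=20, density_min=5):
    # Bin every point into a grid cell per dimension.
    lo, hi = X.min(0), X.max(0)
    cells = np.minimum(((X - lo) / (hi - lo + 1e-12) * n_cells).astype(int),
                       n_cells - 1)
    counts = defaultdict(int)
    for c in map(tuple, cells):
        counts[c] += 1
    dense = {c for c, n in counts.items() if n >= density_min}
    # Flood fill: face-adjacent dense cells form one cluster.
    label, cell_label = 0, {}
    for c in dense:
        if c in cell_label:
            continue
        stack = [c]
        while stack:
            cur = stack.pop()
            if cur in cell_label:
                continue
            cell_label[cur] = label
            for d in range(len(cur)):
                for step in (-1, 1):
                    nb = cur[:d] + (cur[d] + step,) + cur[d + 1:]
                    if nb in dense and nb not in cell_label:
                        stack.append(nb)
        label += 1
    return np.array([cell_label.get(tuple(c), -1) for c in cells])
```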

5.
A topological view of density-based spatial clustering and its variants is proposed. A definition of the topological structure of clustering is given, in which clusters are defined as various kinds of topologically connected sets. A typical algorithm is then reworked with this topological idea, yielding a new topological clustering algorithm. Examples demonstrate that the algorithm is effective.
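One way to make "cluster = topologically connected set" concrete: link points within ε and read clusters off as connected components of the neighborhood graph. This only illustrates the connectivity notion, not the paper's algorithm.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def topological_clusters(X, eps=1.0):
    # Adjacency: an edge between every pair of points within eps;
    # each connected component of this graph is one cluster.
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    _, labels = connected_components(csr_matrix(d <= eps), directed=False)
    return labels
```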

6.
7.
An effective and efficient algorithm for high-dimensional outlier detection
The outlier detection problem has important applications in fraud detection, network robustness analysis, and intrusion detection. Many of these applications arise in high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms for outlier detection use notions of proximity to find outliers based on their relationship to the other points in the data. In high-dimensional space, however, the data are sparse and proximity-based notions lose their effectiveness. Indeed, the sparsity of high-dimensional data implies that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection that find outliers by studying the behavior of projections of the data set.

Received: 19 November 2002, Accepted: 6 February 2004, Published online: 19 August 2004. Edited by R. Ng.
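A brute-force illustration of the projection idea in this line of work: discretize each attribute into φ equi-depth ranges and score every 2-D cell with a sparsity coefficient S(D) = (n(D) − N·f^k) / √(N·f^k·(1−f^k)), where f = 1/φ; strongly negative cells hold abnormally sparse projections. The paper searches such projections with an evolutionary method rather than exhaustively, so this sketch only shows the scoring.

```python
import numpy as np
from itertools import combinations

def sparse_cells(X, phi=5, top=3):
    # Equi-depth discretisation: each attribute is split into phi
    # ranges holding roughly N/phi points each, so f = 1/phi.
    N, dim = X.shape
    k, f = 2, 1.0 / phi                     # examine 2-D projections
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    bins = (ranks * phi) // N               # per-attribute range index
    expected = N * f**k
    sd = np.sqrt(N * f**k * (1 - f**k))
    scored = []
    for a, b in combinations(range(dim), k):
        counts = np.bincount(bins[:, a] * phi + bins[:, b],
                             minlength=phi * phi)
        for cell, n_d in enumerate(counts):
            # Very negative coefficient => far fewer points than
            # expected => the cell's occupants are outlier candidates.
            scored.append(((n_d - expected) / sd, (a, b), cell))
    scored.sort()
    return scored[:top]
```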

8.
To detect anomalous data in aircraft Quick Access Recorder (QAR) data and predict potential aircraft faults, and considering that QAR data are voluminous while flight-parameter values are relatively stable, an outlier detection algorithm tailored to QAR data is proposed. In the first stage, K-means clustering is applied to partitions of the QAR data stream to generate mean reference points; in the second stage, a least-squares fit is computed over these reference points, and possible outliers are identified by the distance from each mean reference point to the fitted flight-parameter curve. Experimental results show that the algorithm accurately uncovers fault data and effectively solves the outlier detection problem for a class of aircraft faults.
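A sketch of the two-stage procedure under stated assumptions (windowed stream, a single flight parameter, a polynomial curve): K-means centers per window become the mean reference points, and a least-squares fit over them exposes reference points far from the parameter curve.

```python
import numpy as np
from sklearn.cluster import KMeans

def qar_outliers(t, x, window=50, k=3, degree=2, thresh=3.0):
    # Stage 1: K-means on each partition of the parameter stream;
    # the cluster centres become "mean reference points".
    ref_t, ref_x = [], []
    for s in range(0, len(x) - window + 1, window):
        km = KMeans(n_clusters=k, n_init=10).fit(x[s:s + window].reshape(-1, 1))
        ref_t.extend([t[s + window // 2]] * k)
        ref_x.extend(km.cluster_centers_.ravel())
    ref_t, ref_x = np.asarray(ref_t), np.asarray(ref_x)
    # Stage 2: least-squares fit through the reference points; points
    # far from the fitted parameter curve are possible outliers.
    resid = ref_x - np.polyval(np.polyfit(ref_t, ref_x, degree), ref_t)
    flag = np.abs(resid) > thresh * resid.std()
    return ref_t[flag], ref_x[flag]
```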

9.
To remedy the high time complexity and mediocre detection quality of existing outlier detection algorithms, a new algorithm based on improved OPTICS clustering and LOPW is proposed. First, an improved OPTICS clustering algorithm preprocesses the original dataset, and a preliminary outlier set is obtained by screening the reachability plot produced by the clustering. Then the newly defined P-weighted local outlier factor LOPW measures the outlierness of the objects in the preliminary set; when computing distances, the leave-one-out partition information entropy increment is introduced to determine attribute weights, improving detection accuracy. Experimental results show that the improved algorithm raises both computational efficiency and the precision of outlier detection.
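A simplified two-stage sketch with stock scikit-learn parts: OPTICS noise forms the preliminary outlier set and plain LOF ranks it; the paper's LOPW additionally weights attributes by a leave-one-out partition entropy increment, which is omitted here.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.neighbors import LocalOutlierFactor

def optics_then_lof(X, min_samples=10, top=10):
    # Stage 1: OPTICS as a cheap filter -- points left unclustered in
    # the reachability analysis form the preliminary outlier set.
    labels = OPTICS(min_samples=min_samples).fit_predict(X)
    cand = np.flatnonzero(labels == -1)
    # Stage 2: rank only the candidates with an LOF-style score
    # (LOPW would add entropy-based attribute weights on top).
    lof = LocalOutlierFactor(n_neighbors=min_samples).fit(X)
    scores = -lof.negative_outlier_factor_[cand]
    return cand[np.argsort(scores)[::-1][:top]]
```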

10.
This article addresses some problems in outlier detection and variable selection in linear regression models. First, in outlier detection there are problems known as smearing and masking. Smearing means that one outlier makes another, non-outlier observation appear as an outlier, and masking that one outlier prevents another one from being detected. Detecting outliers one by one may therefore give misleading results. In this article a genetic algorithm is presented which considers different possible groupings of the data into outlier and non-outlier observations. In this way all outliers are detected at the same time. Second, it is known that outlier detection and variable selection can influence each other, and that different results may be obtained, depending on the order in which these two tasks are performed. It may therefore be useful to consider these tasks simultaneously, and a genetic algorithm for a simultaneous outlier detection and variable selection is suggested. Two real data sets are used to illustrate the algorithms, which are shown to work well. In addition, the scalability of the algorithms is considered with an experiment using generated data.

I would like to thank Dr Tero Aittokallio and an anonymous referee for useful comments.
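A compact sketch of the simultaneous search: one chromosome carries both an outlier mask and a variable mask, and a BIC-style fitness scores the regression on the kept rows and columns. Population sizes, rates, and the fitness penalty are illustrative choices, not the article's settings.

```python
import numpy as np

def fitness(X, y, out_mask, var_mask, penalty=2.0):
    # OLS on kept rows/columns; penalise model size and the number of
    # declared outliers so "everything is an outlier" cannot win.
    keep = ~out_mask
    cols = np.flatnonzero(var_mask)
    if keep.sum() <= len(cols) + 1 or len(cols) == 0:
        return -np.inf
    A = np.column_stack([np.ones(keep.sum()), X[keep][:, cols]])
    beta, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
    rss = np.sum((y[keep] - A @ beta) ** 2)
    n = keep.sum()
    return -(n * np.log(rss / n) + penalty * (len(cols) + out_mask.sum()))

def ga(X, y, pop=60, gens=120, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = n + d                    # chromosome: outlier flags + variable flags
    P = rng.random((pop, L)) < 0.1
    for _ in range(gens):
        scores = np.array([fitness(X, y, c[:n], c[n:]) for c in P])
        elite = P[np.argsort(scores)[::-1][:pop // 2]]
        # One-point crossover plus bit-flip mutation on the elite half.
        pairs = rng.integers(0, len(elite), (pop - len(elite), 2))
        cut = rng.integers(1, L, pop - len(elite))
        kids = np.array([np.concatenate([elite[i][:c], elite[j][c:]])
                         for (i, j), c in zip(pairs, cut)])
        kids ^= rng.random(kids.shape) < 0.01
        P = np.vstack([elite, kids])
    best = P[np.argmax([fitness(X, y, c[:n], c[n:]) for c in P])]
    return best[:n], best[n:]    # outlier mask, variable mask
```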

11.
The well known clustering algorithm DBSCAN is founded on the density notion of clustering. However, its use of a global density parameter, the ε-distance, makes DBSCAN unsuitable for datasets of varying density, and guessing a good value for ε is not straightforward. In this paper, we generalise the algorithm in two ways. First, we adaptively determine the key input parameter ε-distance, which makes DBSCAN independent of domain knowledge and satisfies the unsupervised notion of clustering. Second, deriving the ε-distance by checking the data distribution of each dimension makes the approach suitable for subspace clustering, which detects clusters enclosed in various subspaces of high dimensional data. Experimental results illustrate that our approach efficiently finds clusters of varying sizes and shapes as well as varying densities.
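A common concrete form of "adaptively determine ε": read it off the knee of the sorted k-distance curve and hand it to DBSCAN. The paper instead derives ε from each dimension's distribution to support subspace clustering; this sketch shows only the unsupervised-parameter idea.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def adaptive_dbscan(X, min_pts=5):
    # k-distance curve: each point's distance to its min_pts-th
    # neighbour, sorted; the largest jump separates the dense mass
    # from sparse points and serves as an unsupervised eps.
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
    kdist = np.sort(nn.kneighbors(X)[0][:, -1])
    eps = kdist[np.argmax(np.diff(kdist))]
    return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
```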

12.
Clustering of mixed-type data has attracted wide attention in recent years. The K-prototype algorithm, an effective method for mixed-type data, usually initializes its cluster centers by random selection, a strategy that in many practical applications cannot guarantee the quality of the clustering result. To address this, an outlier-detection-based strategy is adopted to choose initial centers for K-prototype, and a new initialization algorithm for mixed-type data clustering is proposed (initialization of K-prototype clustering based on outlier detection and density, IKP-ODD). Given a candidate object, IKP-ODD decides whether it should become an initial center by computing its distance outlier factor, its weighted density, and its weighted distance to the initial centers already chosen. The distance outlier factor and the weighted density together prevent outliers from being selected as initial centers. When computing weighted densities and weighted distances, the granular neighborhood entropy from neighborhood rough sets measures the importance of each attribute, and attributes are weighted according to their importance, effectively reflecting the differences between attributes. Experiments on several UCI datasets show that IKP-ODD solves the initialization problem of K-prototype clustering better than existing initialization methods.
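A numeric-only sketch of the selection rule (dense, non-outlying, far from already chosen centers); the paper's granular-neighborhood-entropy attribute weights and mixed-type distances are omitted, and the 90th-percentile cutoff is an illustrative stand-in for the distance outlier factor test.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pick_initial_centers(X, k, n_neighbors=10):
    # The same kNN distances serve both roles here: their mean is the
    # distance outlier factor, its inverse plays the density.
    d, _ = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)
    dof = d[:, 1:].mean(axis=1)
    density = 1.0 / (dof + 1e-12)
    ok = dof < np.percentile(dof, 90)      # never pick likely outliers
    centers = [int(np.argmax(density * ok))]
    for _ in range(k - 1):
        dist = np.min(np.linalg.norm(X[:, None] - X[centers][None], axis=2),
                      axis=1)
        score = density * dist * ok        # dense AND far from chosen centres
        centers.append(int(np.argmax(score)))
    return X[centers]
```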

13.
Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, documents are hard to analyze directly because document data are not suitable for statistical and machine learning methods of analysis, so we have to transform them into structured data for analytical purposes. For this process, we use text mining techniques. The resulting structured data are very sparse and hence difficult to analyze. This study proposes a new method to overcome the sparsity problem in document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and the Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
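A hedged sketch of the pipeline, with TruncatedSVD standing in for the paper's support-vector-clustering-based reduction; the Silhouette measure picks the cluster count, as in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_documents(docs, k_range=range(2, 10), n_components=50):
    # Text mining step: sparse TF-IDF vectors from raw documents.
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    # Dimension reduction tackles the sparsity the abstract describes.
    X = TruncatedSVD(n_components=min(n_components, X.shape[1] - 1)) \
        .fit_transform(X)
    # Pick the cluster count by the Silhouette measure.
    best = max(
        ((KMeans(n_clusters=k, n_init=10).fit(X), k) for k in k_range),
        key=lambda mk: silhouette_score(X, mk[0].labels_),
    )
    return best[0].labels_, best[1]
```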

14.
The control of the blast furnace ironmaking process requires a model of the process dynamics accurate enough to support control strategies. However, data sets collected from a blast furnace contain a considerable number of missing values and outliers. These values can significantly affect subsequent statistical analysis and thus the identification of the whole process, so it is important to deal with them. This paper considers a data processing procedure comprising missing value imputation and outlier detection, and examines the impact of this processing on the identification of the blast furnace ironmaking process. Missing values are imputed with a decision tree algorithm, and outliers are identified and discarded through a set of multivariate outlier detection methods. The data sets before and after processing are then used for identification. Two classic identification methods, N4SID (numerical algorithms for state space subspace system identification) and PEM (prediction error method), are considered and a comparative study is presented.
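A minimal sketch of one such processing pass: per-column decision-tree imputation followed by a single multivariate screen (squared Mahalanobis distance against a chi-square cutoff). The paper combines several detectors and feeds the cleaned data to N4SID/PEM; this shows only the cleaning.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.tree import DecisionTreeRegressor

def impute_and_clean(X, alpha=0.975):
    # X: float array with np.nan marking the missing entries.
    X = X.copy()
    filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if miss.any():
            # Predict column j's missing entries from the other
            # columns (mean-filled for this single pass).
            other = np.delete(filled, j, axis=1)
            tree = DecisionTreeRegressor(max_depth=5)
            tree.fit(other[~miss], X[~miss, j])
            X[miss, j] = tree.predict(other[miss])
    # Multivariate screen: squared Mahalanobis distance against a
    # chi-square cutoff with one degree of freedom per variable.
    centred = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centred, inv_cov, centred)
    return X[d2 < chi2.ppf(alpha, df=X.shape[1])]
```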

15.
Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementations lay a foundation for important applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, and discovering computer intrusions. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of densities comparable to the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density as an outlier. We compare density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features where each is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power, and implementation-free metrics.

Jian Tang received an MS degree from the University of Iowa in 1983, and a PhD from the Pennsylvania State University in 1988, both from the Department of Computer Science. He joined the Department of Computer Science, Memorial University of Newfoundland, Canada, in 1988, where he is currently a professor. He has visited a number of research institutions to conduct research on a variety of topics relating to the theory and practice of database management and systems. His current research interests include data mining, e-commerce, XML, and bioinformatics. Zhixiang Chen is an associate professor in the Computer Science Department, University of Texas-Pan American. He received his PhD in computer science from Boston University in January 1996, and BS and MS degrees in software engineering from Huazhong University of Science and Technology. He also studied at the University of Illinois at Chicago. He taught at Southwest State University from Fall 1995 to September 1997, and at Huazhong University of Science and Technology from 1982 to 1990. His research interests include computational learning theory, algorithms and complexity, intelligent Web search, information retrieval, and data mining. Ada Waichee Fu received her BSc degree in computer science from the Chinese University of Hong Kong in 1983, and MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1986 and 1990, respectively. She worked at Bell Northern Research in Ottawa, Canada, from 1989 to 1993 on a wide-area distributed database project, and joined the Chinese University of Hong Kong in 1993. Her research interests are XML data, time series databases, data mining, content-based retrieval in multimedia databases, and parallel and distributed systems. David Wai-lok Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was a member of scientific staff at Bell Northern Research, Canada. Since 1994, he has been a faculty member of the Department of Computer Science at the University of Hong Kong. He is also the Director of the Center for E-Commerce Infrastructure Development. His research interests include data mining, data warehousing, XML technology for e-commerce, and bioinformatics. Dr. Cheung was the Program Committee Chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001) and Program Co-Chair of the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005). Dr. Cheung is a member of the ACM and the IEEE Computer Society.
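The density-based scheme is easy to demonstrate with scikit-learn's LOF; the connectivity-based scheme (COF) replaces LOF's density ratio with a ratio of chaining distances and has no stock implementation, so it is only noted in the comments. The data below are synthetic and illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Density-based scoring (LOF). COF, the connectivity-based scheme,
# would substitute average chaining distances for the density ratio,
# which helps when a pattern's density resembles the outlier's.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),     # dense pattern
               rng.normal(6, 0.2, (50, 2)),    # low-variance pattern
               [[3.0, 3.0]]])                  # an isolated point
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 marks outliers
scores = -lof.negative_outlier_factor_         # larger = more outlying
print(np.flatnonzero(labels == -1), scores.max())
```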

16.
Classic DBSCAN suffers from two problems: its globally optimal parameters are hard to determine, and it misjudges outliers. To choose optimal parameters, the proposed algorithm first generates candidate lists of Eps and MinPts from the distribution characteristics of the dataset, takes the full combination of the two lists, clusters with each parameter combination in turn, and selects the combination with the highest accuracy. For the outliers, it then combines three-way decision ideas with the LOF outlier detection algorithm. Comparisons with several clustering algorithms show that the algorithm selects globally optimal parameters fully automatically and improves clustering accuracy.
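A sketch of the full-combination search, with silhouette score standing in for the paper's accuracy criterion and a plain LOF pass standing in for its three-way-decision treatment of DBSCAN's noise points.

```python
import numpy as np
from itertools import product
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import LocalOutlierFactor

def auto_dbscan(X, eps_list, minpts_list):
    # Cluster with every (Eps, MinPts) combination and keep the best.
    best, best_score = None, -1.0
    for eps, mp in product(eps_list, minpts_list):
        labels = DBSCAN(eps=eps, min_samples=mp).fit_predict(X)
        clustered = labels != -1
        if len(set(labels[clustered])) < 2:
            continue                    # silhouette needs >= 2 clusters
        score = silhouette_score(X[clustered], labels[clustered])
        if score > best_score:
            best, best_score = (eps, mp, labels.copy()), score
    if best is None:
        raise ValueError("no parameter pair produced >= 2 clusters")
    eps, mp, labels = best
    # Re-examine DBSCAN's noise with LOF rather than trusting it:
    # noise points LOF deems normal are deferred (label -2), a
    # stand-in for the paper's three-way-decision step.
    noise = np.flatnonzero(labels == -1)
    if len(noise) > 1:
        verdict = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
        labels[noise[verdict[noise] == 1]] = -2
    return labels, eps, mp
```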

17.
Density-based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. However, DBSCAN's time requirement is O(n²), where n is the size of the dataset, which makes it unsuitable for large datasets. The solution proposed in the paper is to first apply the leaders clustering method to derive prototypes, called leaders, from the dataset, which preserve the density information along with the prototypes, and then to use these leaders to derive the density-based clusters. The proposed hybrid clustering technique, called rough-DBSCAN, has a time complexity of only O(n) and is analyzed using rough set theory. Experimental studies on both synthetic and real-world datasets compare rough-DBSCAN with DBSCAN. For large datasets, rough-DBSCAN finds a clustering similar to that of DBSCAN but is consistently faster. Some properties of the leaders as prototypes are also formally established.
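A sketch of the hybrid under illustrative parameters: a one-pass leaders scan produces weighted prototypes, and DBSCAN then runs on the leaders only, with follower counts as sample weights so the density information survives. This shows the structure, not the paper's rough-set analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def leaders(X, tau):
    # One database scan: a point follows the nearest leader within
    # tau, otherwise it becomes a new leader; counts keep density.
    L, count, member = [], [], []
    for i, x in enumerate(X):
        if L:
            d = np.linalg.norm(np.asarray(L) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= tau:
                count[j] += 1
                member[j].append(i)
                continue
        L.append(x)
        count.append(1)
        member.append([i])
    return np.asarray(L), np.asarray(count), member

def rough_dbscan(X, tau=0.5, eps=1.0, min_pts=20):
    L, count, member = leaders(X, tau)
    # DBSCAN on the (few) leaders; follower counts act as sample
    # weights so heavily-followed leaders still qualify as core.
    leader_labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(
        L, sample_weight=count)
    labels = np.empty(len(X), dtype=int)
    for j, members in enumerate(member):
        labels[members] = leader_labels[j]   # followers inherit labels
    return labels
```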

18.
The success of a case-based reasoning system depends on the quality of its case data and on the speed of the retrieval process, which can be costly in time, especially when the number of cases grows large. To guarantee the system's quality, maintaining the contents of a case base (CB) becomes unavoidable. In this paper, we propose a novel case base maintenance policy named WCOID-DG: Weighting, Clustering, Outliers and Internal cases Detection based on Dbscan and Gaussian means. In addition to feature weighting and outlier detection methods, our WCOID-DG policy uses a new efficient clustering technique, named DBSCAN-GM (DG), which combines the DBSCAN and Gaussian-Means algorithms. The purpose of WCOID-DG is to reduce both storage requirements and search time and to balance case retrieval efficiency and competence for the CB. WCOID-DG is mainly based on the idea that a large CB with weighted features is transformed into a small CB of improved quality. We support our approach with an empirical evaluation on different benchmark data sets, showing its competence in shrinking the size of the CB and the retrieval time while maintaining satisfactory classification accuracy.
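A reduced sketch of the pruning steps only (outlier removal via DBSCAN noise, internal-case removal inside each cluster); the paper's feature weighting and the Gaussian-means parameter estimation behind DBSCAN-GM are omitted, and keep_frac is an illustrative knob.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def maintain_case_base(X, eps=0.5, min_pts=5, keep_frac=0.3):
    # Outlier detection: DBSCAN noise is dropped from the case base.
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    keep = []
    for c in set(labels) - {-1}:
        idx = np.flatnonzero(labels == c)
        centre = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centre, axis=1)
        # Internal-case removal: cases deep inside a cluster add
        # little competence, so keep only the outermost fraction.
        k = max(1, int(keep_frac * len(idx)))
        keep.extend(idx[np.argsort(d)[::-1][:k]])
    return np.sort(np.array(keep))    # indices of the retained cases
```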

19.
In this paper, a new approach for fault detection and isolation based on the possibilistic clustering algorithm is proposed. Fault detection and isolation (FDI) is shown here to be a pattern classification problem, which can be solved using clustering and classification techniques. A possibilistic clustering based approach is proposed to address some of the shortcomings of the fuzzy c-means (FCM) algorithm: the probabilistic constraint imposed on the membership value in FCM is relaxed in the possibilistic clustering algorithm. Because of this relaxation, the possibilistic approach is shown in this paper to give more consistent results in the context of FDI tasks. The possibilistic clustering approach has also been used to detect novel fault scenarios, for which no data were available during training. Fault signatures that change as a function of the fault intensities are represented as fault lines, which are shown to be useful for classifying faults that can manifest with different intensities. The proposed approach has been validated here through simulations involving a benchmark quadruple tank process and also through experimental case studies on the same setup. For large scale systems, it is proposed to use the possibilistic clustering based approach in the lower dimensional approximations generated by algorithms such as PCA. Towards this end, finally, we also demonstrate the key merits of the algorithm for a plant-wide monitoring study using a simulation of the benchmark Tennessee Eastman problem.
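The key mechanism is easy to show: possibilistic typicality does not normalize across clusters, so a sample far from every known fault signature scores low everywhere and can be declared a novel fault. The centers and scale parameters below are made up for illustration; in practice they come from training possibilistic c-means on labeled fault data.

```python
import numpy as np

def typicality(x, centers, eta, m=2.0):
    # Possibilistic membership: unlike FCM it does NOT sum to 1 over
    # clusters, so a sample far from every fault centre scores low
    # everywhere -- exactly what novel-fault detection needs.
    d2 = np.sum((centers - x) ** 2, axis=1)
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

centers = np.array([[0.0, 0.0], [5.0, 5.0]])  # two known fault signatures
eta = np.array([1.0, 1.0])                    # per-cluster scale parameters
sample = np.array([10.0, -3.0])
t = typicality(sample, centers, eta)
if t.max() < 0.1:
    print("novel fault: typicality to every known class is", t)
```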

20.
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent a tremendous computational overload in large and diverse document collections such as pages downloaded from the World Wide Web. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, prefixed number of clusters.

We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work share two features: (a) the number of clusters is not restricted to some relatively small prefixed number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector with multiple appearance n in the training set is counted as n distinct vectors rather than a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, detecting anomalous documents downloaded from the internet by users with abnormal information interests.
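A minimal sketch of feature (a): an incoming vector that fails the cosine-similarity threshold against every existing center starts a new cluster, so the cluster count is never prefixed; the running-mean update reflects feature (b), since a vector appearing n times pulls its center n times. At monitoring time, a vector whose best similarity falls below the threshold would be reported as abnormal rather than merged. The threshold and update rule are illustrative.

```python
import numpy as np

def incremental_cosine_clustering(vectors, sim_threshold=0.6):
    # Feature (a): no prefixed cluster count -- a vector that is not
    # similar enough to any existing centre opens a new cluster.
    centers, counts, labels = [], [], []
    for v in vectors:
        v = v / (np.linalg.norm(v) + 1e-12)
        if centers:
            sims = np.asarray(centers) @ v
            j = int(np.argmax(sims))
            if sims[j] >= sim_threshold:
                # Feature (b): a running mean, so a vector appearing
                # n times pulls the centre n times, not once.
                c = centers[j] * counts[j] + v
                centers[j] = c / (np.linalg.norm(c) + 1e-12)
                counts[j] += 1
                labels.append(j)
                continue
        centers.append(v)
        counts.append(1)
        labels.append(len(centers) - 1)
    return labels, np.asarray(centers)
```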
