Similar Documents (20 results)
1.
Traditional clustering methods assume that there is no measurement error, or uncertainty, associated with data. Often, however, real-world applications require treatment of data that have such errors. In the presence of measurement errors, well-known clustering methods like k-means and hierarchical clustering may not produce satisfactory results. In this article, we develop a statistical model and algorithms for clustering data in the presence of errors. We assume that the errors associated with data follow a multivariate Gaussian distribution and are independent between data points. The model uses the maximum likelihood principle and provides us with a new metric for clustering. This metric is used to develop two algorithms for error-based clustering, hError and kError, that are generalizations of Ward's hierarchical and k-means clustering algorithms, respectively. We discuss types of clustering problems where error information associated with the data to be clustered is readily available and where error-based clustering is likely to be superior to clustering methods that ignore error. We focus on clustering derived data (typically parameter estimates) obtained by fitting statistical models to the observed data. We show that, for Gaussian-distributed observed data, the optimal error-based clusters of derived data are the same as the maximum likelihood clusters of the observed data. We also report briefly on two applications with real-world data and a series of simulation studies using four statistical models: (1) sample averaging, (2) multiple linear regression, (3) ARIMA models for time series, and (4) Markov chains, where error-based clustering performed significantly better than traditional clustering methods.
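A minimal sketch of what the error-weighted analogue of a k-means step could look like, assuming each derived data point carries a known error covariance matrix; the function name and update details are illustrative assumptions, not the published kError implementation.

```python
import numpy as np

def kerror_step(X, covs, centers):
    """One assignment + update pass of an error-weighted k-means variant.

    X       : (n, d) array of derived data points (e.g. parameter estimates)
    covs    : (n, d, d) array, error covariance of each point
    centers : (k, d) current cluster centers
    Returns updated labels and centers.
    """
    n, d = X.shape
    k = centers.shape[0]
    inv_covs = np.linalg.inv(covs)                      # per-point precision matrices

    # Assignment: error-weighted (Mahalanobis-like) distance to each center
    labels = np.empty(n, dtype=int)
    for i in range(n):
        diffs = centers - X[i]                          # (k, d)
        dists = np.einsum('kd,de,ke->k', diffs, inv_covs[i], diffs)
        labels[i] = np.argmin(dists)

    # Update: precision-weighted mean of the members of each cluster
    new_centers = centers.copy()
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) == 0:
            continue
        P = inv_covs[members].sum(axis=0)               # pooled precision
        b = np.einsum('nde,ne->d', inv_covs[members], X[members])
        new_centers[j] = np.linalg.solve(P, b)
    return labels, new_centers
```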

3.
4.
Clustering algorithms are widely used in data analysis, data mining, and many other fields. This paper explores QPSO-based data clustering and its application to image segmentation. It proposes a clustering algorithm with a new distance measure, develops a QPSO clustering algorithm building on an analysis of PSO-based clustering, and presents the corresponding experimental results and a discussion of the algorithm.

5.
We propose a method of clustering images that combines algorithmic and human input. An algorithm provides us with pairwise image similarities. We then actively obtain selected, more accurate pairwise similarities from humans. A novel method is developed to choose the most useful pairs to show a person, obtaining constraints that improve clustering. In a clustering assignment, elements in each data pair are either in the same cluster or in different clusters. We simulate inverting these pairwise relations and see how that affects the overall clustering. We choose a pair that maximizes the expected change in the clustering. The proposed algorithm has high time complexity, so we also propose a version of this algorithm that is much faster and exactly replicates our original algorithm. We further improve runtime by adding two heuristics, and show that these do not significantly impact the effectiveness of our method. We have run experiments in three different domains, namely leaf, face and scene images, and show that the proposed method improves clustering performance significantly.
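A hedged sketch of the brute-force version of the pair-selection idea described above: invert one pairwise relation, recluster, and keep the pair that changes the partition most. The use of scikit-learn's AgglomerativeClustering as the base clusterer and 1 minus the adjusted Rand index as the change score are assumptions of this sketch, not the paper's actual components; the fast exact replica and the two heuristics are not reproduced.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def most_informative_pair(sim, n_clusters, candidate_pairs):
    """Pick the pair whose inverted relation would change the clustering most.

    sim : (n, n) symmetric similarity matrix produced by the base algorithm.
    For each candidate pair we invert its similarity, recluster, and score
    how different the new partition is from the current one.
    """
    def cluster(s):
        model = AgglomerativeClustering(n_clusters=n_clusters,
                                        metric='precomputed', linkage='average')
        return model.fit_predict(1.0 - s)            # distances = 1 - similarities

    base = cluster(sim)
    best_pair, best_change = None, -1.0
    for i, j in candidate_pairs:
        s = sim.copy()
        s[i, j] = s[j, i] = 1.0 - s[i, j]            # simulate inverting the relation
        change = 1.0 - adjusted_rand_score(base, cluster(s))
        if change > best_change:
            best_pair, best_change = (i, j), change
    return best_pair
```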

6.
Research on differentially private k-means clustering methods
This paper studies privacy-preserving k-means clustering based on differential privacy. It first reviews the state of the art in privacy-preserving data mining and privacy-preserving cluster analysis, and briefly introduces the basic principles and mechanisms of differential privacy. To address the poor utility of the clustering results produced by differentially private k-means, a new IDP k-means clustering method is proposed and proven to satisfy ε-differential privacy. Simulation experiments show that, at the same privacy protection level, IDP k-means achieves substantially better clustering utility than the standard differentially private k-means method.
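For context, a minimal sketch of one iteration of a generic Laplace-mechanism differentially private k-means, i.e. the baseline that IDP k-means improves on; the data scaling, sensitivity bounds, per-query budget use, and function name are assumptions of this sketch, and the paper's IDP refinements are not reproduced.

```python
import numpy as np

def dp_kmeans_iteration(X, centers, epsilon, rng):
    """One Laplace-mechanism iteration of differentially private k-means.

    Assumes every feature of X has been scaled to [0, 1], so the L1
    sensitivity of a per-cluster sum is d and of a count is 1.  Each of the
    two noisy queries here spends a budget of epsilon; composition across
    queries and iterations is not tracked in this sketch.
    """
    n, d = X.shape
    k = centers.shape[0]

    # Non-private assignment to the nearest current center
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    new_centers = centers.copy()
    for j in range(k):
        members = X[labels == j]
        # Perturb the sufficient statistics (sum and count) with Laplace noise
        noisy_sum = members.sum(axis=0) + rng.laplace(scale=d / epsilon, size=d)
        noisy_count = len(members) + rng.laplace(scale=1.0 / epsilon)
        if noisy_count > 1.0:
            new_centers[j] = np.clip(noisy_sum / noisy_count, 0.0, 1.0)
    return labels, new_centers

# Example: rng = np.random.default_rng(0); labels, centers = dp_kmeans_iteration(X, centers, 1.0, rng)
```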

7.
In this paper, we propose a generalized fuzzy clustering regularization (GFCR) model and then study its theoretical properties. GFCR unifies several fuzzy clustering algorithms, such as fuzzy c-means (FCM), maximum entropy clustering (MEC), fuzzy clustering based on Fermi-Dirac entropy, and the fuzzy bidirectional associative clustering network. The proposed GFCR becomes an alternative model of the generalized FCM (GFCM) that was recently proposed by Yu and Yang. To advance the theoretical study, we have the following three considerations. 1) We give an optimality test to monitor whether GFCR converges to a local minimum. 2) We relate the GFCR optimality tests to Occam's razor principle, and then analyze the model complexity for fuzzy clustering algorithms. 3) We offer a general theoretical method to evaluate the performance of fuzzy clustering algorithms. Finally, some numerical experiments are used to demonstrate the validity of our theoretical results and complexity analysis.
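One update step of standard fuzzy c-means, the best-known special case that GFCR unifies; this sketch shows only the classical FCM membership and center updates, not the GFCR regularization term itself.

```python
import numpy as np

def fcm_step(X, centers, m=2.0, eps=1e-12):
    """One fuzzy c-means update (membership matrix U, then centers).

    X : (n, d) data, centers : (c, d), m : fuzzifier (> 1).
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps   # (n, c)
    # Membership update: u_ik proportional to d_ik^(-2/(m-1)), rows sum to 1
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)
    # Center update: fuzzily weighted means of the data
    W = U ** m
    new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, new_centers
```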

8.
Robust clustering methods: a unified view
Clustering methods need to be robust if they are to be useful in practice. In this paper, we analyze several popular robust clustering methods and show that they have much in common. We also establish a connection between fuzzy set theory and robust statistics, and point out the similarities between robust clustering methods and statistical methods such as the weighted least-squares technique, the M estimator, the minimum volume ellipsoid algorithm, cooperative robust estimation, minimization of probability of randomness, and the epsilon contamination model. By gleaning the common principles upon which the methods proposed in the literature are based, we arrive at a unified view of robust clustering methods. We define several general concepts that are useful in robust clustering, state the robust clustering problem in terms of the defined concepts, and propose generic algorithms and guidelines for clustering noisy data. We also discuss why the generalized Hough transform is a suboptimal solution to the robust clustering problem.

9.
The traditional approach to clustering is to fit a model (partition or prototypes) to the given data. We propose a completely opposite approach by fitting the data into a given clustering model that is optimal for similar pathological data of equal size and dimensions. We then perform an inverse transform from this pathological data back to the original data while refining the optimal clustering structure during the process. The key idea is that we do not need to find an optimal global allocation of the prototypes. Instead, we only need to perform local fine-tuning of the clustering prototypes during the transformation in order to preserve the already optimal clustering structure.

10.
Traditional spectral clustering is sensitive to initialization. To address this weakness, the Canopy algorithm is introduced to perform a coarse clustering of the samples and obtain initial cluster centers, which are then used as the input to the K-Means step. The resulting algorithm, Canopy-SC, fuses Canopy with spectral clustering, reduces the arbitrariness of initial center selection in traditional spectral clustering, and is applied to face image clustering. Compared with traditional spectral clustering, Canopy-SC obtains better cluster centers and clustering results and achieves higher clustering accuracy. Experimental results demonstrate the effectiveness and feasibility of the algorithm.
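A hedged sketch of a Canopy pass used for seeding: the thresholds, the random seed choice, and the function name are illustrative assumptions, and the spectral-embedding part of Canopy-SC is not shown. The returned centers would replace the blind random initialization of the subsequent K-Means step.

```python
import numpy as np

def canopy_centers(X, t1, t2, rng):
    """Coarse Canopy pass that returns candidate initial centers.

    t1 > t2 are distance thresholds: points within the tight threshold t2 of
    a canopy center are removed from further consideration, while points
    within the loose threshold t1 merely join the canopy.
    """
    assert t1 > t2
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        idx = remaining[rng.integers(len(remaining))]   # pick a random seed point
        center = X[idx]
        centers.append(center)
        dists = np.linalg.norm(X[remaining] - center, axis=1)
        # Drop every point (including the seed) that falls within t2
        remaining = [p for p, dd in zip(remaining, dists) if dd > t2]
    return np.array(centers)
```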

11.
The majority of the algorithms in the software clustering literature utilize structural information to decompose large software systems. Approaches using other attributes, such as file names or ownership information, have also demonstrated merit. At the same time, existing algorithms commonly deem all attributes of the software artifacts being clustered as equally important, a rather simplistic assumption. Moreover, no method that can assess the usefulness of a particular attribute for clustering purposes has been presented in the literature. In this paper, we present an approach that applies information theoretic techniques in the context of software clustering. Our approach allows for weighting schemes that reflect the importance of various attributes to be applied. We introduce LIMBO, a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system. We also present a method that can assess the usefulness of any nonstructural attribute in a software clustering context. We applied LIMBO to three large software systems in a number of experiments. The results indicate that this approach produces clusterings that come close to decompositions prepared by system experts. Experimental results were also used to validate our usefulness assessment method. Finally, we experimented with well-established weighting schemes from information retrieval, Web search, and data clustering. We report results as to which weighting schemes show merit in the decomposition of software systems.

12.
In the above paper by Mao-Jain (ibid., vol. 7 (1996)), the Mahalanobis distance is used instead of the Euclidean distance as the distance measure in order to obtain hyperellipsoidal clustering. We prove that the clustering cost function is a constant under this condition, so hyperellipsoidal clustering cannot be realized. We also explain why the clustering algorithm developed in the above paper can still produce some good hyperellipsoidal clustering results. In reply, Mao-Jain state that Wang-Xia failed to point out that their HEC clustering algorithm used a regularized Mahalanobis distance instead of the standard Mahalanobis distance. It is the regularized Mahalanobis distance which plays an important role in realizing hyperellipsoidal clusters. In conclusion, the comments made by Wang-Xia together with this response provide some new insights into the behavior of their HEC clustering algorithm. It further confirms that the HEC algorithm is a useful tool for understanding the structure of multidimensional data.
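To make the distinction concrete, here is a hedged sketch of a Mahalanobis-type distance computed with a shrinkage-regularized covariance; the particular regularization used here (blending toward a scaled identity) and the parameter lam are illustrative assumptions, not necessarily the exact regularization used in the HEC algorithm.

```python
import numpy as np

def regularized_mahalanobis(x, mu, cov, lam=0.1):
    """Squared Mahalanobis-type distance with a shrinkage-regularized covariance.

    lam blends the cluster covariance with a scaled identity, which keeps the
    metric from degenerating (the collapse discussed in the comment above).
    """
    d = len(x)
    cov_reg = (1.0 - lam) * cov + lam * (np.trace(cov) / d) * np.eye(d)
    diff = x - mu
    return float(diff @ np.linalg.solve(cov_reg, diff))
```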

13.
To deliver effective personalization for digital library users, it is necessary to identify which human factors are most relevant in determining the behavior and perception of these users. This paper examines three key human factors: cognitive styles, levels of expertise and gender differences, and utilizes three individual clustering techniques: k-means, hierarchical clustering and fuzzy clustering to understand user behavior and perception. Moreover, robust clustering, capable of correcting the bias of individual clustering techniques, is used to obtain a deeper understanding. The robust clustering approach produced results that highlighted the relevance of cognitive style for user behavior, i.e., cognitive style dominates and justifies each of the robust clusters created. We also found that perception was mainly determined by the level of expertise of a user. We conclude that robust clustering is an effective technique to analyze user behavior and perception.

14.
On clustering massive text and categorical data streams
In this paper, we study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. We therefore propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach for stream clustering which summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. Thus, this provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.
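A hedged sketch of what a cluster droplet could look like as an additive summary structure for text; the specific fields and the similarity function are assumptions for illustration and do not reproduce the paper's exact droplet definition or its decay mechanism.

```python
from collections import Counter

class ClusterDroplet:
    """Additive summary of one fine-grained text cluster in a stream.

    Keeps summed term frequencies, a document count, and the last update
    time, so droplets can be updated online and merged when queried.
    """
    def __init__(self):
        self.word_sums = Counter()   # summed term frequencies of member documents
        self.count = 0               # number of documents absorbed
        self.last_update = 0.0       # timestamp of the most recent addition

    def add(self, term_freqs, timestamp):
        self.word_sums.update(term_freqs)
        self.count += 1
        self.last_update = timestamp

    def merge(self, other):
        self.word_sums.update(other.word_sums)
        self.count += other.count
        self.last_update = max(self.last_update, other.last_update)

    def similarity(self, term_freqs):
        """Cosine-style similarity between a new document and the droplet summary."""
        dot = sum(self.word_sums[w] * f for w, f in term_freqs.items())
        norm_d = sum(v * v for v in self.word_sums.values()) ** 0.5
        norm_x = sum(f * f for f in term_freqs.values()) ** 0.5
        return dot / (norm_d * norm_x) if norm_d and norm_x else 0.0
```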

15.
Combining multiple clusterings using evidence accumulation
We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble (a set of object partitions) is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: 1) applying different clustering algorithms and 2) applying the same clustering algorithm with different values of parameters or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization; individual data partitions are combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm on this matrix. We have developed a theoretical framework for the analysis of the proposed clustering combination strategy and its evaluation, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split-and-merge strategy based on the k-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.
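A compact sketch of the voting and co-association step described above, followed by an agglomerative cut; average linkage and the fixed-cluster-count cut are assumptions of this sketch, and the paper's split-and-merge k-means ensemble generation and lifetime-based criterion are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def evidence_accumulation(partitions, n_clusters):
    """Combine an ensemble of labelings via a co-association matrix.

    partitions : list of length-n label arrays, one per ensemble member.
    Returns a consensus labeling with n_clusters clusters.
    """
    partitions = [np.asarray(p) for p in partitions]
    n = len(partitions[0])
    coassoc = np.zeros((n, n))
    for labels in partitions:
        # Voting: +1 whenever two objects land in the same cluster
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(partitions)

    # Turn similarities into distances and apply agglomerative clustering
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```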

16.
Recent experimental studies have revealed that a large percentage of wireless links are lossy and unreliable for data delivery in wireless sensor networks (WSNs). Such findings raise new challenges for the design of clustering algorithms in WSNs in terms of data reliability and energy efficiency. In this paper, we propose distributed clustering algorithms for lossy WSNs with a mobile collector, where the mobile collector moves close to each cluster head to receive data directly and then uploads collected data to the base station. We first consider constructing one-hop clusters in lossy WSNs where all cluster members are within the direct communication range of their cluster heads. We formulate the problem into an integer program, aiming at maximizing the network lifetime, which is defined as the number of rounds of data collection until the first node dies. We then prove that the problem is NP-hard. After that, we propose a metric-based distributed clustering algorithm to solve the problem. We adopt a metric called selection weight for each sensor node that indicates both link qualities around the node and its capability of being a cluster head. We further extend the algorithm to multi-hop clustering to achieve better scalability. We have found out that the performance of the one-hop clustering algorithm in small WSNs is very close to the optimal results obtained by mathematical tools. We have conducted extensive simulations for large WSNs and the results demonstrate that the proposed clustering algorithms can significantly improve the data reception ratio, reduce the total energy consumption in the network and prolong network lifetime compared to a typical distributed clustering algorithm, HEED, that does not consider lossy links.

17.
Adaptive evolutionary clustering
In many practical applications of clustering, the objects to be clustered evolve over time, and a clustering result is desired at each time step. In such applications, evolutionary clustering typically outperforms traditional static clustering by producing clustering results that reflect long-term trends while being robust to short-term variations. Several evolutionary clustering algorithms have recently been proposed, often by adding a temporal smoothness penalty to the cost function of a static clustering method. In this paper, we introduce a different approach to evolutionary clustering by accurately tracking the time-varying proximities between objects followed by static clustering. We present an evolutionary clustering framework that adaptively estimates the optimal smoothing parameter using shrinkage estimation, a statistical approach that improves a naïve estimate using additional information. The proposed framework can be used to extend a variety of static clustering algorithms, including hierarchical, k-means, and spectral clustering, into evolutionary clustering algorithms. Experiments on synthetic and real data sets indicate that the proposed framework outperforms static clustering and existing evolutionary clustering algorithms in many scenarios.
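A minimal sketch of the temporal smoothing of proximities that precedes the static clustering step; the fixed smoothing weight alpha is an illustrative assumption, whereas the framework described above estimates it adaptively via shrinkage estimation.

```python
def evolutionary_smoothing(prox_sequence, alpha=0.5):
    """Temporally smooth a sequence of proximity matrices before clustering.

    prox_sequence : iterable of (n, n) proximity matrices (e.g. numpy arrays),
                    one per time step.
    alpha         : smoothing weight in [0, 1]; fixed here for illustration.
    Yields the smoothed proximity matrix at each time step; any static
    clustering algorithm (hierarchical, k-means, spectral) can then be run
    on the smoothed matrix.
    """
    smoothed = None
    for prox_t in prox_sequence:
        if smoothed is None:
            smoothed = prox_t                       # nothing to smooth at t = 0
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * prox_t
        yield smoothed
```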

18.
Unsupervised clustering is an important tool to analyze video data. Selection of an appropriate clustering scheme is governed by the suitability of the clusters it produces. It is difficult to formulate cluster suitability criteria for a domain where different feature attributes have different meanings. We propose a novel clustering strategy, tailored towards the specific requirements of clustering in video data. Our clustering methodology decouples clustering along different feature components. Our scheme chooses the clustering model so as to meet the requirements of clustering in video data. The clusters obtained from our scheme reasonably model the homogeneous color regions in a video scene in both space and time. The space-time clusters obtained by our clustering methodology can be subsequently grouped together to compose meaningful objects. Experimental comparison of our results with existing clustering techniques clearly shows that our scheme takes care of many of the problems with traditional clustering schemes applied to the heterogeneous feature space of video.

19.
Randomised Local Search Algorithm for the Clustering Problem
We consider clustering as a combinatorial optimisation problem. Local search provides a simple and effective approach to many other combinatorial optimisation problems. It is therefore surprising how seldom it has been applied to the clustering problem. Instead, the best clustering results have been obtained by more complex techniques such as tabu search and genetic algorithms at the cost of high run time. We introduce a new randomised local search algorithm for the clustering problem. The algorithm is easy to implement, sufficiently fast, and competitive with the best clustering methods. The ease of implementation makes it possible to tailor the algorithm for various clustering applications with different distance metrics and evaluation criteria.
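A hedged sketch of a random-swap local search for the sum-of-squared-errors clustering objective, in the spirit of the algorithm described above; the swap count, the two refinement iterations, and the acceptance rule are illustrative choices rather than the published algorithm's exact schedule.

```python
import numpy as np

def random_swap_clustering(X, k, iterations=200, seed=0):
    """Randomised local search for the sum-of-squared-errors clustering problem.

    Repeatedly replaces one randomly chosen centroid with a randomly chosen
    data point, fine-tunes with two k-means iterations, and keeps the swap
    only if the total squared error improves.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign_and_cost(C):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        return labels, d[np.arange(len(X)), labels].sum()

    def kmeans_refine(C, steps=2):
        C = C.copy()
        for _ in range(steps):
            labels, _ = assign_and_cost(C)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    C[j] = members.mean(axis=0)
        return C

    labels, best_cost = assign_and_cost(centers)
    for _ in range(iterations):
        trial = centers.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]   # the random swap
        trial = kmeans_refine(trial)
        trial_labels, trial_cost = assign_and_cost(trial)
        if trial_cost < best_cost:
            centers, labels, best_cost = trial, trial_labels, trial_cost
    return labels, centers
```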

20.
Given a clustering algorithm, how can we adapt it to find multiple, nonredundant, high-quality clusterings? We focus on algorithms based on vector quantization and describe a framework for automatic ‘alternatization’ of such algorithms. Our framework works in both simultaneous and sequential learning formulations and can mine an arbitrary number of alternative clusterings. We demonstrate its applicability to various clustering algorithms—k-means, spectral clustering, constrained clustering, and co-clustering—and effectiveness in mining a variety of datasets.
