Similar Documents
A total of 20 similar documents were retrieved (search time: 15 ms).
1.
Assessment of clustering tendency is an important first step in cluster analysis. One tool for assessing cluster tendency is the Visual Assessment of Tendency (VAT) algorithm. VAT produces an image matrix that can be used for visual assessment of cluster tendency in either relational or object data. However, VAT becomes intractable for large data sets. The revised VAT (reVAT) algorithm reduces the number of computations done by VAT, and replaces the image matrix with a set of profile graphs that are used for the visual assessment step. Thus, reVAT overcomes the large data set problem which encumbers VAT, but presents a new problem: interpretation of the set of reVAT profile graphs becomes very difficult when the number of clusters is large, or there is significant overlap between groups of objects in the data. In this paper, we propose a new algorithm called bigVAT which (i) solves the large data problem suffered by VAT, and (ii) solves the interpretation problem suffered by reVAT. bigVAT combines the quasi-ordering technique used by reVAT with an image display of the set of profile graphs, presenting the clustering tendency information as a VAT-like image. Several numerical examples are given to illustrate and support the new technique.
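The reordering at the heart of the VAT family is simple enough to sketch. Below is a minimal Python illustration of the basic VAT ordering (a Prim-like nearest-neighbour reordering of a dissimilarity matrix), not of bigVAT or reVAT themselves; the function name vat_order and the toy data are illustrative assumptions.

    import numpy as np

    def vat_order(D):
        """Return the VAT ordering of a symmetric dissimilarity matrix D.

        A sketch of the basic VAT reordering only; bigVAT/reVAT add
        sampling and profile graphs on top of this step.
        """
        n = D.shape[0]
        # Start from one end of the largest dissimilarity in D.
        i, _ = np.unravel_index(np.argmax(D), D.shape)
        order, remaining = [i], set(range(n)) - {i}
        while remaining:
            rem = np.array(sorted(remaining))
            sub = D[np.ix_(order, rem)]      # distances from the ordered set to the rest
            _, col = np.unravel_index(np.argmin(sub), sub.shape)
            nxt = rem[col]
            order.append(nxt)
            remaining.remove(nxt)
        return np.array(order)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two well-separated 2-D clusters: the reordered matrix shows two dark blocks.
        X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        p = vat_order(D)
        D_star = D[np.ix_(p, p)]   # display as a grey-scale image to assess cluster tendency
        print(D_star.shape)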

2.
Several fast algorithms for clustering very large data sets have been proposed in the literature, including CLARA, CLARANS, GAC-R3, and GAC-RARw. CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R3 and GAC-RARw exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics described by data size, number of clusters, cluster distinctness, cluster asymmetry, and data randomness. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R3 and GAC-RARw can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct and symmetric.  相似文献   

3.
We present two scalable model-based clustering systems based on a Gaussian mixture model with independent attributes within clusters. They first summarize data into sub-clusters, and then generate Gaussian mixtures from their clustering features using a new algorithm—EMACF. EMACF approximates the aggregate behavior of each sub-cluster of data items in the Gaussian mixture model, and it provably converges. The experiments show that our clustering systems run one to two orders of magnitude faster than the traditional EM algorithm with little loss of accuracy.
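To make the two-phase idea concrete, here is a rough scikit-learn sketch: the data are first compressed into sub-clusters and a Gaussian mixture is then fitted to the summaries. It is only a crude stand-in for the approach described above; unlike EMACF it ignores the size and spread of each sub-cluster, and all names and parameter values are illustrative.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.5, (50_000, 2)) for m in ((0, 0), (4, 0), (2, 3))])

    # Phase 1: compress 150,000 points into a few hundred sub-cluster centres.
    summariser = MiniBatchKMeans(n_clusters=200, n_init=3, random_state=1).fit(X)
    centres = summariser.cluster_centers_

    # Phase 2: fit the mixture to the sub-cluster summaries instead of to X itself.
    gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=1).fit(centres)
    print(gmm.means_)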

4.
A measure of cluster quality is often needed for DNA microarray data analysis. In this paper, we introduce a new cluster validity index that measures geometrical features of the data. The essential idea of this index is to evaluate the ratio of the squared total length of the data eigen-axes to the between-cluster separation. We show that this cluster validity index works well for data containing closely spaced clusters or clusters of different sizes. We verify the method using three simulated data sets, two real-world data sets and two microarray data sets. The experimental results show that the proposed index is superior to five other cluster validity indices: the partition coefficient (PC), the general silhouette index (GS), Dunn’s index (DI), the CH index and the I-index. We also give a theorem characterising the situations in which the proposed index works well.
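As a rough illustration of a validity index built from eigen-axis lengths and between-cluster separation, the sketch below computes a ratio in the same spirit; it is not the paper's exact index, and the function name axis_to_separation_ratio is a hypothetical choice of ours.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def axis_to_separation_ratio(X, labels):
        """Summed squared cluster eigen-axis lengths divided by the squared
        distance between the two closest cluster centres (smaller = more
        compact, better separated). Illustrative only, not the paper's index."""
        centroids, axis_sq = [], 0.0
        for c in np.unique(labels):
            pts = X[labels == c]
            centroids.append(pts.mean(axis=0))
            # eigenvalues of the cluster covariance = squared lengths of its eigen-axes
            axis_sq += np.linalg.eigvalsh(np.cov(pts, rowvar=False)).sum()
        centroids = np.array(centroids)
        dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
        separation = dists[dists > 0].min()
        return axis_sq / separation ** 2

    X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.7, random_state=2)
    for k in (2, 3, 4, 5):
        labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
        print(k, round(axis_to_separation_ratio(X, labels), 3))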

5.
In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Given the dearth of research into techniques to express the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously-proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
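Of the measures named above, Normalised Mutual Information (and, for contrast, the adjusted Rand index) is readily available in scikit-learn; the snippet below shows how two labelings of the same objects are compared. MoC and the other specialised measures discussed in the paper are not part of standard libraries and are not shown.

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    # Two labelings of the same ten objects; cluster IDs are arbitrary symbols.
    a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
    b = [1, 1, 0, 0, 0, 0, 2, 2, 2, 1]

    print(normalized_mutual_info_score(a, b))  # 1.0 only for identical partitions (up to relabelling)
    print(adjusted_rand_score(a, b))           # chance-corrected pair-counting measure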

6.
Clustering is a popular non-directed learning data mining technique for partitioning a dataset into a set of clusters (i.e. a segmentation). Although there are many clustering algorithms, none is superior on all datasets, and so it is never clear which algorithm and which parameter settings are the most appropriate for a given dataset. This suggests that an appropriate approach to clustering should involve the application of multiple clustering algorithms with different parameter settings and a non-taxing approach for comparing the various segmentations that would be generated by these algorithms. In this paper we are concerned with the situation where a domain expert has to evaluate several segmentations in order to determine the most appropriate segmentation (set of clusters) based on his/her specified objective(s). We illustrate how a data mining process model could be applied to address this problem.  相似文献   

7.
Data mining is crucial in many areas and there are ongoing efforts to improve its effectiveness in both the scientific and the business world. There is an obvious need to improve the outcomes of mining techniques such as clustering and other classifiers without abandoning the standard mining tools that are popular with researchers and practitioners alike. Currently, however, standard tools do not have the flexibility to control similarity relations between attribute values, a critical feature in improving mining-clustering results. The study presented here introduces the Similarity Adjustment Model (SAM) where adjusted Fuzzy Similarity Functions (FSF) control similarity relations between attribute values and hence ameliorate clustering results obtained with standard data mining tools such as SPSS and SAS. The SAM draws on principles of binary database representation models and employs FSF adjusted via an iterative learning process that yields improved segmentation regardless of the choice of mining-clustering algorithm. The SAM model is illustrated and evaluated on three common datasets with the standard SPSS package. The datasets were run with several clustering algorithms. Comparison of “Naïve” runs (which used original data) and “Fuzzy” runs (which used SAM) shows that the SAM improves segmentation in all cases.  相似文献   

8.
Stability in cluster analysis is strongly dependent on the data set, especially on how well separated and how homogeneous the clusters are. In the same clustering, some clusters may be very stable and others may be extremely unstable. The Jaccard coefficient, a similarity measure between sets, is used as a cluster-wise measure of cluster stability, which is assessed by the bootstrap distribution of the Jaccard coefficient for every single cluster of a clustering compared to the most similar cluster in the bootstrapped data sets. This can be applied to very general cluster analysis methods. Some alternative resampling methods are investigated as well, namely subsetting, jittering the data points and replacing some data points by artificial noise points. The different methods are compared by means of a simulation study. A data example illustrates the use of the cluster-wise stability assessment to distinguish between meaningful stable and spurious clusters, but it is also shown that clusters are sometimes only stable because of the inflexibility of certain clustering methods.
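A minimal sketch of this cluster-wise bootstrap assessment, assuming K-means as the base method: each original cluster is matched to the most similar cluster found in every bootstrap resample and its Jaccard scores are averaged. The helper names are ours, and the resampling details are simplified relative to the original method (which also has an R implementation, fpc::clusterboot).

    import numpy as np
    from sklearn.cluster import KMeans

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def clusterwise_stability(X, k, n_boot=50, seed=0):
        """Average bootstrap Jaccard similarity for each cluster of the base clustering."""
        rng = np.random.default_rng(seed)
        base = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        base_clusters = [np.where(base == c)[0] for c in range(k)]
        scores = np.zeros(k)
        for _ in range(n_boot):
            idx = np.unique(rng.integers(0, len(X), len(X)))   # points present in this resample
            boot = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            boot_clusters = [idx[boot == c] for c in range(k)]
            for c, members in enumerate(base_clusters):
                present = np.intersect1d(members, idx)         # restrict to resampled points
                scores[c] += max(jaccard(present, bc) for bc in boot_clusters)
        return scores / n_boot

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
        print(clusterwise_stability(X, k=2))   # values near 1 indicate stable clusters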

9.
Data replication, as an essential service for MANETs, is used to increase data availability by creating local or nearly located copies of frequently used items, reduce communication overhead, achieve fault-tolerance and load balancing. Data replication protocols proposed for MANETs are often prone to scalability problems due to their definitions or underlying routing protocols they are based on. In particular, they exhibit poor performance when the network size is scaled up. However, scalability is an important criterion for several MANET applications. We propose a scalable and reactive data replication approach, named SCALAR, combined with a low-cost data lookup protocol. SCALAR is a virtual backbone based solution, in which the network nodes construct a connected dominating set based on network topology graph. To the best of our knowledge, SCALAR is the first work applying virtual backbone structure to operate a data lookup and replication process in MANETs. Theoretical message-complexity analysis of the proposed protocols is given. Extensive simulations are performed to analyze and compare the behavior of SCALAR, and it is shown to outperform the other solutions in terms of data accessibility, message overhead and query deepness. It is also demonstrated as an efficient solution for high-density, high-load, large-scale mobile ad hoc networks.  相似文献   

10.
In this paper, new measures—called clustering performance measures (CPMs)—for assessing the reliability of a clustering algorithm are proposed. These CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used for studying the stability of the clustering solutions and has the ability to estimate the correct number of clusters in a dataset. These proposed CPMs can be used to evaluate clustering algorithms that have a structural bias towards certain types of data distribution as well as those that have no such bias. Additionally, we propose a novel cluster validity index, the VI index, which is able to handle non-spherical clusters. Five clustering algorithms are evaluated on different types of real-world and synthetic data. The first dataset type is a communications signal dataset representing one modulation scheme under a variety of noise conditions, the second comprises two breast cancer datasets, while the third type comprises different synthetic datasets with arbitrarily shaped clusters. Additionally, comparisons with other methods for estimating the number of clusters indicate the applicability and reliability of the proposed cluster validity VI index and repeatability measure for correct estimation of the number of clusters.

Sameh A. Salem   graduated with BSc and MSc degrees in Communications and Electronics Engineering, both from Helwan University, Cairo, Egypt, in May 1998 and October 2003, respectively. He is currently pursuing a PhD degree in the Signal Processing and Communications Group, Department of Electrical Engineering and Electronics, The University of Liverpool, UK. His research interests include clustering algorithms, machine learning, and parallel computing. Asoke K. Nandi   received his PhD degree from the University of Cambridge (Trinity College), Cambridge, UK, in 1979. He held several research positions at the Rutherford Appleton Laboratory (UK), the European Organisation for Nuclear Research (Switzerland), the Department of Physics, Queen Mary College (London, UK) and the Department of Nuclear Physics (Oxford, UK). In 1987, he joined Imperial College, London, UK, as the Solartron Lecturer in the Signal Processing Section of the Electrical Engineering Department. In 1991, he joined the Signal Processing Division of the Electronic and Electrical Engineering Department at the University of Strathclyde, Glasgow, UK, as a Senior Lecturer; subsequently, he was appointed as a Reader in 1995 and a Professor in 1998. In March 1999, he moved to the University of Liverpool, Liverpool, UK, to take up the David Jardine Chair of Signal Processing in the Department of Electrical Engineering and Electronics. In 1983, he was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognised by the Nobel Committee for Physics in 1984. Currently, he is the Head of the Signal Processing and Communications Research Group with interests in the areas of non-Gaussian signal processing, communications, and machine learning research. With his group he has been carrying out research in machine condition monitoring, signal modelling, system identification, communication signal processing, biomedical signals, ultrasonics, blind source separation, and blind deconvolution. He has authored or co-authored over 350 technical publications, including two books, “Automatic Modulation Recognition of Communications Signals” (Kluwer Academic, Boston, MA, 1996) and “Blind Estimation Using Higher-Order Statistics” (Kluwer Academic, Boston, MA, 1999), and over 140 journal papers. Professor Nandi was awarded the Mountbatten Premium, a Division Award of the Electronics and Communications Division of the Institution of Electrical Engineers of the UK, in 1998 and the Water Arbitration Prize of the Institution of Mechanical Engineers of the UK in 1999. He is a Fellow of the Cambridge Philosophical Society, the Institution of Engineering and Technology, the Institute of Mathematics and its Applications, the Institute of Physics, the Royal Society of Arts, the Institution of Mechanical Engineers, and the British Computer Society.

11.
There is an interest in the problem of identifying different partitions of a given set of units obtained according to different subsets of the observed variables (multiple cluster structures). A model-based procedure has been previously developed for detecting multiple cluster structures from independent subsets of variables. The method relies on model-based clustering methods and on a comparison among mixture models using the Bayesian Information Criterion. A generalization of this method which allows the use of any model-selection criterion is considered. A new approach combining the generalized model-based procedure with variable-clustering methods is proposed. The usefulness of the new method is shown using simulated and real examples. Monte Carlo methods are employed to evaluate the performance of various approaches. Data matrices with two cluster structures are analyzed taking into account the separation of clusters, the heterogeneity within clusters and the dependence of cluster structures.  相似文献   

12.
The concept of the cosine of the angle between vectors is extended to data with mixed attributes, and a similarity-based clustering method, CABMS, is proposed, together with a simple and effective strategy for computing the clustering threshold. CABMS has approximately linear time complexity in both the size of the database and the number of attributes, which gives it good scalability. Experimental results show that CABMS produces high-quality clustering results.
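The abstract above does not give the CABMS formula, but one common way to extend cosine similarity to mixed attributes is to one-hot encode the categorical part, as in the hypothetical sketch below; the function mixed_cosine and the example records are ours, not the paper's definition.

    import numpy as np

    def mixed_cosine(x_num, x_cat, y_num, y_cat):
        """Cosine similarity for records mixing numeric and categorical attributes.

        Hypothetical illustration, not the CABMS formula: each categorical
        attribute is one-hot encoded, so equal categories contribute 1 to the
        dot product and unequal ones contribute 0.
        """
        vec_x, vec_y = list(x_num), list(y_num)
        for a, b in zip(x_cat, y_cat):
            domain = sorted({a, b})
            vec_x += [1.0 if a == v else 0.0 for v in domain]
            vec_y += [1.0 if b == v else 0.0 for v in domain]
        u, v = np.array(vec_x), np.array(vec_y)
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Numeric part (scaled age, scaled income) plus two categorical attributes.
    print(mixed_cosine([0.35, 0.60], ["red", "suv"], [0.40, 0.55], ["red", "sedan"]))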

13.
The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed-memory Beowulf cluster are examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on a communication-intensive transpose operation involving P processors. Communication is based upon the standard portable Message Passing Interface (MPI). We show that 1/P scaling of execution time at fixed problem size N³ (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged as non-overlapping pairwise exchanges between processors, thus eliminating blocking when standard Fast Ethernet interconnects are employed. This method provides the basis for implementing scalable and efficient spectral-method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.
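The algebra behind the 1D (slab) decomposition and its transpose can be verified in a few lines of NumPy. The sketch below simulates the P slabs inside a single process; a real Beowulf implementation would replace the in-memory transpose with an MPI all-to-all exchange and would use the real-to-complex transform to halve one axis.

    import numpy as np

    N, P = 16, 4
    x = np.random.default_rng(0).standard_normal((N, N, N))

    # Stage 1: each "process" FFTs axes 1 and 2 of its own slab of N/P planes.
    slabs = [np.fft.fftn(x[p * N // P:(p + 1) * N // P], axes=(1, 2)) for p in range(P)]
    stage1 = np.concatenate(slabs, axis=0)

    # Stage 2: global transpose (axis 0 <-> axis 1) so the remaining axis becomes
    # local to each slab, then FFT it and transpose back.
    transposed = stage1.transpose(1, 0, 2)
    slabs = [np.fft.fft(transposed[p * N // P:(p + 1) * N // P], axis=1) for p in range(P)]
    result = np.concatenate(slabs, axis=0).transpose(1, 0, 2)

    print(np.allclose(result, np.fft.fftn(x)))   # True: the decomposition is exact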

14.
Clustering is an important data mining problem. However, most earlier work on clustering focused on numeric attributes, which have a natural ordering to their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received more attention. A common issue in cluster analysis is that there is no single correct answer to the number of clusters, since cluster analysis involves human subjective judgement. Interactive visualization is one method by which users can decide on proper clustering parameters. In this paper, a new clustering approach called CDCS (Categorical Data Clustering with Subjective factors) is introduced, together with a visualization tool for clustered categorical data in which the result of adjusting parameters is instantly reflected. The experiments show that CDCS generates high-quality clusters compared to other typical algorithms.

15.
Many validity measures have been proposed for evaluating clustering results. Most of these popular validity measures do not work well for clusters with different densities and/or sizes. They usually have a tendency of ignoring clusters with low densities. In this paper, we propose a new validity measure that can deal with this situation. In addition, we also propose a modified K-means algorithm that can assign more cluster centres to areas with low densities of data than the conventional K-means algorithm does. First, several artificial data sets are used to test the performance of the proposed measure. Then the proposed measure and the modified K-means algorithm are applied to reduce the edge degradation in vector quantisation of image compression.  相似文献   

16.
Clustering is a classical data mining technique that has been widely applied in pattern recognition, machine learning, artificial intelligence and many other fields. Through cluster analysis, the underlying structure of a target data set can be effectively uncovered. As a widely used partitional clustering algorithm, K-means has the advantages of being simple to implement and able to handle large data sets. However, owing to its convergence rule, the K-means algorithm is still very sensitive to the selection of the initial cluster centres and cannot well...
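The sensitivity to initial centres mentioned above is easy to demonstrate with scikit-learn; the sketch below contrasts a single random initialisation with k-means++ seeding plus restarts. It only illustrates the problem and is not the improvement proposed in the (truncated) abstract; the data set and parameter values are arbitrary.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=7)

    # One random initialisation can converge to a poor local optimum ...
    single = KMeans(n_clusters=4, init="random", n_init=1, random_state=3).fit(X)
    # ... while k-means++ seeding with several restarts is usually much more robust.
    multi = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=3).fit(X)

    print("one random start, inertia: ", round(single.inertia_, 1))
    print("k-means++ with 10 restarts:", round(multi.inertia_, 1))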

17.
The combination of multiple clustering results (clustering ensemble) has emerged as an important procedure for improving the quality of clustering solutions. In this paper we propose a new cluster ensemble method based on kernel functions, which introduces the Partition Relevance Analysis step. This step has the goal of analyzing the set of partitions in the cluster ensemble and extracting valuable information that can improve the quality of the combination process. In addition, we propose a new similarity measure between partitions and prove that it is a kernel function. A new consensus function is introduced using this similarity measure, based on the idea of finding the median partition. Related to this consensus function, some theoretical results that endorse the suitability of our methods are proven. Finally, we conduct numerical experiments to show the behavior of our method on several databases, comparing it with simple clustering algorithms as well as with other cluster ensemble methods.
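For readers new to cluster ensembles, the sketch below shows a common baseline consensus function (evidence accumulation over a co-association matrix). It is not the kernel-based, median-partition consensus proposed in the paper; the ensemble settings are arbitrary, and older scikit-learn versions spell the precomputed-distance argument affinity rather than metric.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

    # 1. Generate an ensemble of base partitions with perturbed settings.
    ensemble = [KMeans(n_clusters=k, n_init=5, random_state=s).fit_predict(X)
                for s, k in enumerate([2, 3, 3, 4, 5])]

    # 2. Co-association matrix: fraction of partitions placing each pair together.
    co = np.mean([(lab[:, None] == lab[None, :]).astype(float) for lab in ensemble], axis=0)

    # 3. Consensus: cluster the co-association matrix, read as a distance, with average linkage.
    consensus = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                        linkage="average").fit_predict(1.0 - co)
    print(np.bincount(consensus))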

18.
With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale data. We present a system that combines the computational power of large clusters for enabling large-scale reasoning and data access with an efficient data structure for storing and querying the accessed data on a traditional personal computer or other resource-constrained device. We present results of using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract an “interesting” subset of the data using a large cluster, and further analyze the extracted data using a personal computer, all in the order of tens of minutes.  相似文献   

19.
Publishing unprocessed data can lead to identity disclosure and sensitive-attribute disclosure. Generalising quasi-identifiers achieves privacy protection, but with excessive information loss. To address this problem, a clustering-based (k,l)-diversity data publishing model is proposed and an algorithm is designed to implement it. Data utility is improved by using a joint probability distribution to measure the similarity of both the discrete and the continuous attributes of data objects. Strategies for merging, adjusting and generalising clusters are described in detail; combining the parameters k and l, the concept of a privacy-protection degree is introduced; the optimal clustering-based (k,l)-diversity problem is shown to be NP-hard, and the complexity of the algorithm is analysed. Theoretical analysis and experimental results show that the method effectively reduces execution time and information loss and improves query precision.
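As a toy illustration of the constraint the published groups must satisfy, the sketch below checks k-anonymity and l-diversity on already-formed groups; forming those groups by clustering while minimising information loss is the hard part addressed above, and the function name and example values are hypothetical.

    from collections import Counter

    def satisfies_k_l(groups, k, l):
        """True if every group has at least k records and at least l distinct
        sensitive values (a toy (k,l)-diversity check, not the paper's algorithm)."""
        return all(len(g) >= k and len(Counter(g)) >= l for g in groups)

    # Each inner list holds the sensitive attribute values of one published group.
    groups = [["flu", "flu", "hiv", "cancer"], ["flu", "cancer", "cancer", "hiv"]]
    print(satisfies_k_l(groups, k=4, l=3))   # True: 4 records and 3 distinct values per group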

20.
Based on an analysis of current and future demand for network services, a cluster-based architecture for scalable network services, the Linux Virtual Server (LVS), is proposed. It consists of three tiers: a load balancer, a pool of servers and back-end storage. The load balancer uses IP load-balancing techniques and content-based request dispatching. An LVS cluster provides load balancing, scalability and high availability, and can be used to build many kinds of scalable network services. A geographically distributed LVS cluster system is further proposed, which saves network bandwidth, improves the quality of network services and provides good disaster tolerance.
