首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, new measures—called clustering performance measures (CPMs)—for assessing the reliability of a clustering algorithm are proposed. These CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used for studying the stability of the clustering solutions and has the ability to estimate the correct number of clusters in a dataset. These proposed CPMs can be used to evaluate clustering algorithms that have a structure bias to certain types of data distribution as well as those that have no structure biases. Additionally, we propose a novel cluster validity index, V I index, which is able to handle non-spherical clusters. Five clustering algorithms on different types of real-world data and synthetic data are evaluated. The first dataset type refers to a communications signal dataset representing one modulation scheme under a variety of noise conditions, the second represents two breast cancer datasets, while the third type represents different synthetic datasets with arbitrarily shaped clusters. Additionally, comparisons with other methods for estimating the number of clusters indicate the applicability and reliability of the proposed cluster validity V I index and repeatability measure for correct estimation of the number of clusters.
Asoke K. NandiEmail:

Sameh A. Salem   graduated with a BSc degree in Communications and Electronics Engineering and an MSc in Communications and Electronics Engineering, both from Helwan University, Cairo, Egypt, in May 1998 and October 2003, respectively. He is currently pursuing PhD degree in the Signal Processing and Communications Group, Department of Electrical Engineering and Electronics, The University of Liverpool, UK. His research interests include clustering algorithms, machine learning, and parallel computing. Asoke K. Nandi   received PhD degree from the University of Cambridge (Trinity College), Cambridge, UK, in 1979. He held several research positions in Rutherford Appleton Laboratory (UK), European Organisation for Nuclear Research (Switzerland), Department of Physics, Queen Mary College (London, UK) and Department of Nuclear Physics (Oxford, UK). In 1987, he joined the Imperial College, London, UK, as the Solartron Lecturer in the Signal Processing Section of the Electrical Engineering Department. In 1991, he joined the Signal Processing Division of the Electronic and Electrical Engineering Department in the University of Strathclyde, Glasgow, UK, as a Senior Lecturer; subsequently, he was appointed as a Reader in 1995 and a Professor in 1998. In March 1999, he moved to the University of Liverpool, Liverpool, UK to take up his appointment with David Jardine, Chair of Signal Processing in the Department of Electrical Engineering and Electronics. In 1983, he was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W and Z0 providing the evidence for the unification of the electromagnetic and weak forces, which was recognised by the Nobel Committee for Physics in 1984. Currently, he is the Head of the Signal Processing and Communications Research Group with interests in the areas of non-Gaussian signal processing, communications, and machine learning research. With his group he has been carrying out research in machine condition monitoring, signal modelling, system identification, communication signal processing, biomedical signals, ultrasonics, blind source separation, and blind deconvolution. He has authored or co-authored over 350 technical publications, including two books “Automatic Modulation Recognition of Communications Signals” (Kluwer Academic, Boston, MA, 1996) and “Blind Estimation Using Higher-Order Statistics” (Kluwer Academic, Boston, MA, 1999) and over 140 journal papers. Professor Nandi was awarded the Mounbatten Premium, Division Award of the Electronics and Communications Division, of the Institution of Electrical Engineers of the UK in 1998 and the Water Arbitration Prize of the Institution of Mechanical Engineers of the UK in 1999. He is a Fellow of the Cambridge Philosophical Society, the Institution of Engineering and Technology, the Institute of Mathematics and its applications, the Institute of Physics, the Royal Society for Arts, the Institution of Mechanical Engineers, and the British Computer Society.   相似文献   

2.
Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. No initial assumptions about the data set are requested by the method. The number of clusters and the partition that best fits the data set, are selected according to the optimal clustering tendency index value.  相似文献   

3.
Clustering is a classic problem in the machine learning and pattern recognition area, however a few complications arise when we try to transfer proposed solutions in the data stream model. Recently there have been proposed new algorithms for the basic clustering problem for massive data sets that produce an approximate solution using efficiently the memory, which is the most critical resource for streaming computation. In this paper, based on these solutions, we present a new model for clustering clickstream data which applies three different phases in the data processing, and is validated through a set of experiments.  相似文献   

4.
A two-step procedure is developed for the exploratory mining of real-valued vector (multivariate) time series using partition-based clustering methods. The proposed procedure was tested with model-generated data, multiple sensor-based process data, as well as simulation data. The test results indicate that the proposed procedure is quite effective in producing better clustering results than a hidden Markov model (HMM)-based clustering method if there is a priori knowledge about the number of clusters in the data. Two existing validity indices were tested and found ineffective in determining the actual number of clusters. Determining the appropriate number of clusters in the case that there is no a priori knowledge is a known unresolved research issue not only for our proposed procedure but also for the HMM-based clustering method and further development is necessary.  相似文献   

5.
Network clustering algorithms are typically based only on the topology information of the network. In this paper, we introduce traffic as a quantity representing the intensity of the relationship among nodes in the network, regardless of their connectivity, and propose an evolutionary clustering algorithm, based on the application of genetic operators and capable of exploiting the traffic information. In a comparative evaluation based on synthetic instances and two real world datasets, we show that our approach outperforms a selection of well established evolutionary and non-evolutionary clustering algorithms.  相似文献   

6.
In this paper, we show how one can take advantage of the stability and effectiveness of object data clustering algorithms when the data to be clustered are available in the form of mutual numerical relationships between pairs of objects. More precisely, we propose a new fuzzy relational algorithm, based on the popular fuzzy C-means (FCM) algorithm, which does not require any particular restriction on the relation matrix. We describe the application of the algorithm to four real and four synthetic data sets, and show that our algorithm performs better than well-known fuzzy relational clustering algorithms on all these sets.  相似文献   

7.
Thermal problem solution using a surrogate model clustering technique   总被引:1,自引:0,他引:1  
The thermal problem defined for the validation challenge workshop involves a simple one-dimensional slab geometry with a defined heat flux at the front face, adiabatic conditions at the rear face, and a provided baseline predictive simulation model to be used to simulate the time-dependent heatup of the slab. This paper will discuss a clustering methodology using a surrogate heat transfer algorithm that allows propagation of the uncertainties in the model parameters using a very limited series of full simulations. This clustering methodology can be used when the predictive model to be run is very expensive, and only a few simulation runs are possible. A series of time-dependent statistical comparisons designed to validate the model against experimental data provided in the problem formulation will also be presented, and limitations of the approach discussed. The purpose of this paper is to represent methods of propagation of uncertainty with limited computer runs, validation with uncertain data, and decision-making under uncertainty. The final results of the analysis indicate that the there is approximately 95% confidence that the regulatory criteria under consideration would be failed given the high level of physical data provided.  相似文献   

8.
The goal of clustering is to identify subsets called clusters which usually correspond to objects that are more similar to each other than they are to objects from other clusters. We have proposed the MACLAW method, a cooperative coevolution algorithm for data clustering, which has shown good results (Blansché and Gançarski, Pattern Recognit. Lett. 27(11), 1299–1306, 2006). However the complexity of the algorithm increases rapidly with the number of clusters to find. We propose in this article a parallelization of MACLAW, based on a message-passing paradigm, as well as the analysis of the application performances with experiment results. We show that we reach near optimal speedups when searching for 16 clusters, a typical problem instance for which the sequential execution duration is an obstacle to the MACLAW method. Further, our approach is original because we use the P2P-MP1 grid middleware (Genaud and Rattanapoka, Lecture Notes in Comput. Sci., vol. 3666, pp. 276–284, 2005) which both provides the message passing library and infrastructure services to discover computing resources. We also put forward that the application can be tightly coupled with the middleware to make the parallel execution nearly transparent for the user.  相似文献   

9.
Studying the communities of microbial species is highly important since many natural and artificial processes are mediated by groups of microbes rather than by single entities. One way of studying them is the search of common metabolic characteristics among microbial species, which is not only a potential measure for the differentiation and classification of closely-related organisms but also their study allows the finding of common functional properties that may describe the way of life of entire organisms or species. In this work we propose an expert system (ES), making the main contribution, to cluster a complex data set of 365 prokaryotic species by 114 metabolic features, information which may be incomplete for some species. Inspired on the human expert reasoning and based on hierarchical clustering strategies, our proposed ES estimates the optimal number of clusters adequate to divide the dataset and afterwards it starts an iterative process of clustering, based on the Self-organizing Maps (SOM) approach, where it finds relevant clusters at different steps by means of a new validity index inspired on the well-known Davies Bouldin (DB) index. In order to monitor the process and assess the behavior of the ES the partition obtained at each step is validated with the DB validity index. The resulting clusters prove that the use of metabolic features combined with the ES is able to handle a complex dataset that can help in the extraction of underlying information, gaining advantage over other existing approaches, that may relate metabolism with phenotypic, environmental or evolutionary characteristics in prokaryotic species.  相似文献   

10.
In the era of globalization, traditional theories and models of social systems are shifting their focus from isolation and independence to networks and connectedness. Analyzing these new complex social models is a growing, and computationally demanding area of research. In this study, we investigate the integration of genetic algorithms (GAs) with a random-walk-based distance measure to find subgroups in social networks. We test our approach by synthetically generating realistic social network data sets. Our clustering experiments using random-walk-based distances reveal exceptionally accurate results compared with the experiments using Euclidean distances.  相似文献   

11.
Traditional intrusion detection methods lack extensibility in face of changing network configurations as well as adaptability in face of unknown attack types. Meanwhile, current machine-learning algorithms need labeled data for training first, so they are computational expensive and sometimes misled by artificial data. In this paper, a new detection algorithm, the Intrusion Detection Based on Genetic Clustering (IDBGC) algorithm, is proposed. It can automatically establish clusters and detect intruders by labeling normal and abnormal groups. Computer simulations show that this algorithm is effective for intrusion detection.  相似文献   

12.
Clustering is the task of classifying patterns or observations into clusters or groups. Generally, clustering in high-dimensional feature spaces has a lot of complications such as: the unidentified or unknown data shape which is typically non-Gaussian and follows different distributions; the unknown number of clusters in the case of unsupervised learning; and the existence of noisy, redundant, or uninformative features which normally compromise modeling capabilities and speed. Therefore, high-dimensional data clustering has been a subject of extensive research in data mining, pattern recognition, image processing, computer vision, and other areas for several decades. However, most of existing researches tackle one or two problems at a time which is unrealistic because all problems are connected and should be tackled simultaneously. Thus, in this paper, we propose two novel inference frameworks for unsupervised non-Gaussian feature selection, in the context of finite asymmetric generalized Gaussian (AGG) mixture-based clustering. The choice of the AGG distribution is mainly due to its ability not only to approximate a large class of statistical distributions (e.g. impulsive, Laplacian, Gaussian and uniform distributions) but also to include the asymmetry. In addition, the two frameworks simultaneously perform model parameters estimation as well as model complexity (i.e., both model and feature selection) determination in the same step. This was done by incorporating a minimum message length (MML) penalty in the model learning step and by fading out the redundant densities in the mixture using the rival penalized EM (RPEM) algorithm, for first and second frameworks, respectively. Furthermore, for both algorithms, we tackle the problem of noisy and uninformative features by determining a set of relevant features for each data cluster. The efficiencies of the proposed algorithms are validated by applying them to real challenging problems namely action and facial expression recognition.  相似文献   

13.
One of the main problems in cluster analysis is the weighting of attributes so as to discover structures that may be present. By using weighted dissimilarity measures for objects, a new approach is developed, which allows the use of the k-means-type paradigm to efficiently cluster large data sets. The optimization algorithm is presented and the effectiveness of the algorithm is demonstrated with both synthetic and real data sets.  相似文献   

14.
Agglomerative clustering generates the partition hierarchically by a sequence of merge operations. We propose an alternative to the merge-based approach by removing the clusters iteratively one by one until the desired number of clusters is reached. We apply local optimization strategy by always removing the cluster that increases the distortion the least. Data structures and their update strategies are considered. The proposed algorithm is applied as a crossover method in a genetic algorithm, and compared against the best existing clustering algorithms. The proposed method provides best performance in terms of minimizing intra-cluster variance.  相似文献   

15.
Recent experimental studies have revealed that a large percentage of wireless links are lossy and unreliable for data delivery in wireless sensor networks (WSNs). Such findings raise new challenges for the design of clustering algorithms in WSNs in terms of data reliability and energy efficiency. In this paper, we propose distributed clustering algorithms for lossy WSNs with a mobile collector, where the mobile collector moves close to each cluster head to receive data directly and then uploads collected data to the base station. We first consider constructing one-hop clusters in lossy WSNs where all cluster members are within the direct communication range of their cluster heads. We formulate the problem into an integer program, aiming at maximizing the network lifetime, which is defined as the number of rounds of data collection until the first node dies. We then prove that the problem is NP-hard. After that, we propose a metric-based distributed clustering algorithm to solve the problem. We adopt a metric called selection weight for each sensor node that indicates both link qualities around the node and its capability of being a cluster head. We further extend the algorithm to multi-hop clustering to achieve better scalability. We have found out that the performance of the one-hop clustering algorithm in small WSNs is very close to the optimal results obtained by mathematical tools. We have conducted extensive simulations for large WSNs and the results demonstrate that the proposed clustering algorithms can significantly improve the data reception ratio, reduce the total energy consumption in the network and prolong network lifetime compared to a typical distributed clustering algorithm, HEED, that does not consider lossy links.  相似文献   

16.
In this paper the notion of convexity of clusterings for the given ordering of units is introduced. In the case when at least one (optimal) solution of the clustering problem is convex, dynamic programming leads to a polynomial algorithm with complexityO(kn 3). We prove that, for several criterion functions, convex optimal clusterings exist when dissimilarity is pyramidal for a given ordering of units.This research was supported in part by the Research Council of Slovenia.  相似文献   

17.
This article presents a multi-objective genetic algorithm which considers the problem of data clustering. A given dataset is automatically assigned into a number of groups in appropriate fuzzy partitions through the fuzzy c-means method. This work has tried to exploit the advantage of fuzzy properties which provide capability to handle overlapping clusters. However, most fuzzy methods are based on compactness and/or separation measures which use only centroid information. The calculation from centroid information only may not be sufficient to differentiate the geometric structures of clusters. The overlap-separation measure using an aggregation operation of fuzzy membership degrees is better equipped to handle this drawback. For another key consideration, we need a mechanism to identify appropriate fuzzy clusters without prior knowledge on the number of clusters. From this requirement, an optimization with single criterion may not be feasible for different cluster shapes. A multi-objective genetic algorithm is therefore appropriate to search for fuzzy partitions in this situation. Apart from the overlap-separation measure, the well-known fuzzy Jm index is also optimized through genetic operations. The algorithm simultaneously optimizes the two criteria to search for optimal clustering solutions. A string of real-coded values is encoded to represent cluster centers. A number of strings with different lengths varied over a range correspond to variable numbers of clusters. These real-coded values are optimized and the Pareto solutions corresponding to a tradeoff between the two objectives are finally produced. As shown in the experiments, the approach provides promising solutions in well-separated, hyperspherical and overlapping clusters from synthetic and real-life data sets. This is demonstrated by the comparison with existing single-objective and multi-objective clustering techniques.  相似文献   

18.
Traditional minimum spanning tree-based clustering algorithms only make use of information about edges contained in the tree to partition a data set. As a result, with limited information about the structure underlying a data set, these algorithms are vulnerable to outliers. To address this issue, this paper presents a simple while efficient MST-inspired clustering algorithm. It works by finding a local density factor for each data point during the construction of an MST and discarding outliers, i.e., those whose local density factor is larger than a threshold, to increase the separation between clusters. This algorithm is easy to implement, requiring an implementation of iDistance as the only k-nearest neighbor search structure. Experiments performed on both small low-dimensional data sets and large high-dimensional data sets demonstrate the efficacy of our method.  相似文献   

19.
Sixteen sea surface temperature (SST) images obtained over the coastal ocean of Portugal during the period September 1992-September 2003 were used aiming to identify automatically the areas covered by upwelling waters. Suitable high resolution colour scales were applied to the SST images in order to enhance the thermal patterns and easily identify the waters with a coastal upwelling origin. The automatic identification of the areas covered by upwelling waters was developed by the authors in a previous work, through the application of fuzzy clustering and validation indexes, and here is explored as an oceanographic application to the Portuguese coastal upwelling. The fuzzy c-means (FCM) algorithm showed to be able to find partitions that closely defined the upwelling areas and the visualization of the fuzzy c-partitions was achieved through the application of a colour scale. The Xie-Beni validation index was used to select the c-partition that best represented the stage of the upwelling event and showed an agreement with the oceanographic interpretation in 10 of the 14 SST segmented images used in this work. Two SST images without upwelling were also used in order to check the response of the algorithm to the absence of the phenomena. The computation of the matching rate between a c-partition and the two areas split by the hand-contoured upwelling boundary also allowed the evaluation of how closely the obtained segmentation reproduced the shape of the areas covered by upwelling waters. This method successfully identified the upwelling boundary regions in 10 of the 14 SST images. The values obtained for the matching rate were higher than 0.77, thus indicating the good quality of the fuzzy partitions. The segmented images with 3 or 4 clusters were the most suitable ones to reproduce the areas covered by upwelling waters, but it was also shown that, for some cases, the upwelling areas could be reasonably well reproduced by the FCM 2-partition images. While in the latter, the area covered with upwelling waters was coincident with the first cluster, in the former, the segmented image showed two clusters within the upwelling area: the first cluster coincided with the area occupied by the most recently upwelled waters near the coast, while the second cluster was coincident with the area occupied by the “older” upwelling waters with its extensions offshore, the so-called cold filaments. The FCM algorithm revealed to be a promising technique in the automatic identification of upwelling areas on SST images.  相似文献   

20.
With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices, three internal validity indices and seven external validity indices, were used. The internal validity indices were used for evaluating the within-cluster scatter and between cluster separations. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the “ground truth” results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the received solution will be feasible.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号