Similar literature
 20 similar documents found; search time 0 ms
1.
A generalized form of the Possibilistic Fuzzy C-Means (PFCM) algorithm (GPFCM) is presented for clustering noisy data. A function of distance is used instead of the distance itself to damp noise contributions. It is shown that when the data are highly noisy, GPFCM finds accurate cluster centers, whereas the FCM (Fuzzy C-Means), PCM (Possibilistic C-Means), and PFCM algorithms fail. FCM, PCM, and PFCM yield inaccurate cluster centers when clusters are not of the same size or when the covariance norm is used, whereas GPFCM performs well in both cases even when the data are noisy. It is shown that generalized forms of FCM and PCM (GFCM and GPCM) are also more accurate than FCM and PCM. A measure is defined to evaluate the performance of the clustering algorithms. It shows that the average errors of GPFCM and its simplified forms are about 80% smaller than those of FCM, PCM, and PFCM. However, GPFCM incurs higher computational costs due to its nonlinear updating equations. Three cluster validity indices are introduced to determine the number of clusters in clean and noisy datasets. The first considers the compactness of the clusters, the second considers their separation, and the third considers both separation and compactness. The performance of these indices is confirmed to be satisfactory using various examples of noisy datasets.
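For orientation, the baseline that GPFCM generalizes can be sketched as follows. This is a minimal sketch of standard fuzzy c-means only, not of GPFCM itself (the abstract does not give GPFCM's distance function or update equations). The function name `fcm`, its parameters, and the optional `init` argument are assumptions made for illustration.

```python
import math
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, init=None, seed=0):
    """Standard fuzzy c-means (illustrative sketch, not GPFCM).

    points: list of equal-length numeric tuples; c: number of clusters;
    m: fuzzifier (> 1); init: optional list of c starting centers.
    Returns (centers, memberships), memberships[k][i] for point k, cluster i.
    """
    dim = len(points[0])
    if init is None:
        # Assumption: start from c distinct data points chosen at random.
        init = random.Random(seed).sample(points, c)
    centers = [list(v) for v in init]
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iters):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1)).
        u = []
        for p in points:
            d = [max(math.dist(p, v), 1e-12) for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** power for j in range(c))
                      for i in range(c)])
        # Center update: mean of the points weighted by u_ki^m.
        new_centers = []
        for i in range(c):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centers.append([sum(wk * p[a] for wk, p in zip(w, points)) / total
                                for a in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u
```

Because every point's memberships sum to one, a distant noise point still pulls on the centers. That sensitivity is what the possibilistic and generalized variants in the abstract aim to reduce.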

2.
With the increasing availability of modern mobile devices and location acquisition technologies, massive trajectory data of moving objects are collected continuously in a streaming manner. Clustering streaming trajectories facilitates finding the representative paths or common moving trends shared by different objects in real time. Although data stream clustering has been studied extensively in the past decade, little effort has been devoted to dealing with streaming trajectories. The main challenge lies in the strict space and time complexities of processing the continuously arriving trajectory data, combined with the difficulty of concept drift. To address this issue, we present two novel synopsis structures to extract the clustering characteristics of trajectories, and develop an incremental algorithm for the online clustering of streaming trajectories (called OCluST). It contains a micro-clustering component to cluster and summarize the most recent sets of trajectory line segments at each time instant, and a macro-clustering component to build large macro-clusters based on micro-clusters over a specified time horizon. Finally, we conduct extensive experiments on four real data sets to evaluate the effectiveness and efficiency of OCluST, and compare it with other congeneric algorithms. Experimental results show that OCluST can achieve superior performance in clustering streaming trajectories.

3.
Unsupervised clustering methods such as K-means, hierarchical clustering and fuzzy c-means have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Recent studies have suggested that the incorporation of biological information into validation methods to assess the quality of clustering results might be useful in facilitating biological and biomedical knowledge discoveries. In this study, we generalize two bio-validity indices, the biological homogeneity index and the biological stability index, to quantify the abilities of soft clustering algorithms such as fuzzy c-means and model-based clustering. The results of an evaluation of several existing soft clustering algorithms using simulated and real data sets indicate that the soft versions of the indices provide both better precision and better accuracy than the classical ones. The significance of the proposed indices is also discussed.

4.
For streaming data that arrive continuously, such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when conducting cluster merging, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the within-cluster variance, the intra-cluster density and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity.
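The claim that a variance can be computed exactly in a single pass can be illustrated with Welford's online algorithm. This is a minimal sketch of that general technique, not the paper's own scheme, which the abstract does not describe. The class name `RunningStats` is an assumption, and the example is one-dimensional for brevity.

```python
class RunningStats:
    """Single-pass mean and variance via Welford's algorithm (illustrative)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is seen exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the values seen so far.
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Only three numbers are stored per cluster, so memory use does not grow with the length of the stream.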

5.
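An improved fuzzy clustering validity index   Total citations: 1, self-citations: 0, citations by others: 1
A cluster validity index can be used both to evaluate the validity of a clustering result and to determine the optimal number of clusters. Based on the basic properties of fuzzy clustering, a new fuzzy clustering validity index is proposed. The index combines two important factors, the distribution characteristics of the data set and the membership degrees of the data, to evaluate clustering results, which improves the accuracy of the assessment. Experiments show that the index evaluates fuzzy clustering results correctly and obtains the optimal number of clusters automatically, and in particular that it makes accurate judgments when clusters overlap.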

6.
A cluster validity index for fuzzy clustering   Total citations: 1, self-citations: 0, citations by others: 1
A new cluster validity index is proposed for the validation of partitions of object data produced by the fuzzy c-means algorithm. The proposed validity index uses a variation measure and a separation measure between two fuzzy clusters. A good fuzzy partition is expected to have a low degree of variation and a large separation distance. Testing of the proposed index and nine previously formulated indices on well-known data sets shows the superior effectiveness and reliability of the proposed index in comparison to other indices and the robustness of the proposed index in noisy environments.

7.
Identifying the correct number of clusters and the appropriate partitioning technique are important considerations in clustering, for which several cluster validity indices, primarily utilizing the Euclidean distance, have been used in the literature. In this paper a new measure of connectivity is incorporated in the definitions of seven cluster validity indices, namely the DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices which are able to automatically detect clusters of any shape, size or convexity as long as they are well-separated. Here connectivity is measured using a novel approach following the concept of the relative neighborhood graph. It is empirically established that incorporating the property of connectivity significantly improves the capabilities of these indices in identifying the appropriate number of clusters. Two well-known clustering techniques, single linkage and K-means, are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that the connectivity-based Dunn-index performs better than the other six indices. Comparisons are made with the original versions of these seven cluster validity indices.

8.
Discovering interesting patterns or substructures in data streams is an important challenge in data mining. Clustering algorithms are very often applied to identify single substructures although they are designed to partition a data set. Another problem of clustering algorithms is that most of them are not designed for data streams. This paper discusses a recently introduced procedure that deals with both problems. The procedure explores ideas from cluster analysis, but was designed to identify single clusters without the necessity to partition the whole data set into clusters. The new extended version of the algorithm is an incremental clustering approach applicable to stream data. It identifies new clusters formed by the incoming data and updates the data space partition. Clustering of artificial and real data sets illustrates the abilities of the proposed method.

9.
There has been an important emergence of applications in which data arrive in an online, time-varying fashion (e.g. computer network traffic, sensor data, web searches, ATM transactions) and it is not feasible to exchange or store all the arriving data in traditional database systems in order to operate on them. For this kind of application, as for traditional static database schemes, density estimation is a fundamental building block for data analysis. A novel online approach for probability density estimation based on wavelet bases, suitable for applications involving rapidly changing streaming data, is presented. The proposed approach is based on a recursive formulation of the wavelet-based orthogonal estimator using a sliding window and includes an optimised procedure for reevaluating only the relevant scaling and wavelet functions each time new data items arrive. The algorithm is tested and compared using both simulated and real-world data.

10.
Automatically identifying the number of clusters in a dataset is an important task in unsupervised classification. To address this issue, this paper focuses on the symmetry property of clusters. Point and line symmetry are two important attributes of data partitions. We propose line-symmetry versions of eight well-known validity indices (the XB, PBM, FCM, PS, FS, K, SV, and DB indices) so that they can identify the correct number of partitions in data sets containing clusters with the line-symmetry property. The global optimality of two of these newly developed indices is established mathematically. Eight artificially generated data sets of varying dimensions, containing clusters of different convexities and shapes, and three real-life data sets are used in the experiments. Initially, to obtain different partitions, an existing genetic clustering technique that uses the line-symmetry property (GALS clustering) is applied to the data sets while varying the number of clusters. We have also provided a comparative study of our proposed line-symmetry-based cluster validity indices with their point-symmetry-based versions and original versions based on Euclidean distance. The experimental results reveal that most of the line-symmetry-distance-based cluster validity indices perform better than their point-symmetry and Euclidean-distance-based versions.

11.
An online clustering method based on time-varying quadratic programming is proposed which can precisely detect the clustering structure of streaming data without any assumption about the shape and density of the data classes. In the proposed method, online clustering is achieved by simulating dynamical equations which yield the optimum solution of the time-varying quadratic program over time. A new framework is also proposed which efficiently permits streaming data clustering based on a relatively small and renewable dataset. This framework reduces the memory needed to store incoming data to a small amount that is independent of the original data size. The performance of the proposed method is evaluated through experiments using synthetic data as well as the KDD Cup 99 dataset. The results show that the proposed method outperforms a range of benchmark methods.

12.
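Three load-balancing states of resource load are proposed, and the degree of balance of each state is analyzed. On this basis, an adaptive replica placement algorithm is proposed and successfully applied to a cluster VOD system, flexibly resolving the conflict between load balancing and back-end storage bandwidth. Simulation shows that the algorithm achieves good load balance and excellent overall performance under different data volumes.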

13.
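Zhu Qiang, Sun Yuqiang. Journal of Computer Applications (《计算机应用》), 2014, 34(9): 2505-2509
Sensor nodes have limited resources, and high communication overhead consumes a large amount of power. To reduce the communication overhead of distributed stream-data classification algorithms, an efficient distributed stream-data clustering algorithm is proposed. The algorithm has two phases: online local clustering and offline global collaborative clustering. The online local clustering algorithm clusters each stream-data source locally and sends the result to a coordinator node using serialization; the coordinator node performs global clustering once it has received the local clustering information from the different stream-data sources. Experiments show that as the window size increases, the time spent sending data stays constant while the clustering time and the total time grow linearly; that is, the execution time of the proposed algorithm is not affected by the sliding-window width or the number of clusters. The accuracy of the algorithm is close to that of the centralized algorithm, and its communication overhead is far lower than that of related distributed algorithms. The experimental results indicate that the algorithm scales well and can be applied to cluster analysis of large-scale distributed stream-data sources.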

14.
Performance evaluation of some clustering algorithms and validity indices   Total citations: 16, self-citations: 0, citations by others: 16
In this article, we evaluate the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and a recently developed index I. Based on a relation between the index I and Dunn's index, a lower bound on the value of the former is theoretically estimated in order to get a unique hard K-partition when the data set has distinct substructures. The effectiveness of the different validity indices and clustering methods in automatically evolving the appropriate number of clusters is demonstrated experimentally for both artificial and real-life data sets with the number of clusters varying from two to ten. Once the appropriate number of clusters is determined, the SA-based clustering technique is used for proper partitioning of the data into the said number of clusters.
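As a concrete instance of one of the indices named above, Dunn's index in its classical form is the smallest distance between points of different clusters divided by the largest cluster diameter, so larger values indicate compact, well-separated partitions. This is a minimal sketch under that definition only. The function name `dunn_index` and the handling of all-singleton partitions are assumptions, and the other three indices are not shown.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Classical Dunn index for a hard partition (illustrative sketch).

    clusters: list of clusters, each a list of equal-length numeric tuples.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Separation: smallest distance between points in different clusters.
    min_sep = min(math.dist(p, q)
                  for a, b in combinations(clusters, 2)
                  for p in a for q in b)
    # Compactness: largest distance between two points in the same cluster.
    max_diam = max((math.dist(p, q)
                    for c in clusters
                    for p, q in combinations(c, 2)), default=0.0)
    # Assumption: a partition with zero diameter is treated as ideal.
    return math.inf if max_diam == 0.0 else min_sep / max_diam


good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]   # two tight, distant pairs
bad = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]    # same points, poor grouping
```

To choose the number of clusters, the index is evaluated on the partition obtained for each candidate number and the candidate with the largest value is kept.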

15.
Information Fusion, 2005, 6(2): 143-151
Categorical data clustering (CDC) and cluster ensemble (CE) have long been considered as separate research and application areas. The main focus of this paper is to investigate the commonalities between these two problems and the uses of these commonalities for the creation of new clustering algorithms for categorical data based on cross-fertilization between the two disjoint research fields. More precisely, we formally define the CDC problem as an optimization problem from the viewpoint of CE, and apply CE approach for clustering categorical data. Experimental results on real datasets show that CE based clustering method is competitive with existing CDC algorithms with respect to clustering accuracy.

16.
Clustering is an explanatory procedure which helps to understand data with complex structure and multivariate relationships, and is a very useful method to extract knowledge and information especially from large datasets. When such datasets are aggregated into categories (as driven by scientific questions underlying the analysis), the resulting observations will perforce be expressed as so-called symbolic data (though symbolic data can occur “naturally” in any sized datasets). The focus of this work is to provide a divisive polythetic algorithm to establish clusters for p-dimensional histogram-valued data. In addition, two cluster validity indexes for use in establishing the optimal number of clusters are also developed. Finally, the proposed procedure is applied to a large forestry cover type dataset.

17.
18.
This paper focuses on the development of an effective cluster validity measure with outlier detection and cluster merging algorithms for support vector clustering (SVC). Since SVC is a kernel-based clustering approach, the parameter of kernel functions and the soft-margin constants in Lagrangian functions play a crucial role in the clustering results. The major contribution of this paper is that our proposed validity measure and algorithms are capable of identifying ideal parameters for SVC to reveal a suitable cluster configuration for a given data set. A validity measure, which is based on a ratio of cluster compactness to separation with outlier detection and a cluster-merging mechanism, has been developed to automatically determine ideal parameters for the kernel functions and soft-margin constants as well. With these parameters, the SVC algorithm is capable of identifying the optimal number of clusters with compact and smooth arbitrary-shaped cluster contours for the given data set and increasing robustness to outliers and noise. Several simulations, including artificial and benchmark data sets, have been conducted to demonstrate the effectiveness of the proposed cluster validity measure for the SVC algorithm.

19.
The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the performance of the k-modes clustering algorithm, which can converge to numerous local minima, depends strongly on the initial cluster centers. Currently, most methods for initializing cluster centers are designed for numerical data. Because categorical data lack geometry, these methods are not applicable to categorical data. This paper proposes a novel initialization method for categorical data, which is applied to the k-modes algorithm. The method integrates distance and density to select initial cluster centers and overcomes shortcomings of the existing initialization methods for categorical data. Experimental results show that the proposed initialization method is effective and, owing to its linear time complexity with respect to the number of data objects, can be applied to large data sets.
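The two building blocks of standard k-modes, which replace the Euclidean distance and the mean that categorical data lack, can be sketched as follows. This is a minimal sketch of the standard simple-matching dissimilarity and per-attribute mode only. It does not reproduce the distance-and-density initialization proposed in the paper, which the abstract does not specify. The function names are assumptions.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Cluster 'center' for categorical data: most frequent value per attribute."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*records))


records = [("red", "small"), ("red", "large"), ("blue", "large")]
center = cluster_mode(records)   # per-attribute most frequent values
distances = [matching_dissimilarity(r, center) for r in records]
```

Both operations take time linear in the number of records and attributes, so a k-modes iteration stays cheap on large data sets.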

20.
Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach, a cluster tree, to determine such cluster structure and understand hidden information present in data sets of nested clusters or clusters of multi-density. We embed the agglomerative k-means algorithm in the generation of the cluster tree to detect such clusters. Experimental results on both synthetic data sets and real data sets are presented to illustrate the effectiveness of the proposed method. Compared with some existing clustering algorithms (DBSCAN, X-means, BIRCH, CURE, NBC, OPTICS, Neural Gas, Tree-SOM, EnDBSCAN and LDBSCAN), our proposed cluster tree approach performs better.
