Similar literature
20 similar documents retrieved.
1.
K-means is one of the most widely used clustering algorithms across disciplines, especially for large datasets. However, the method is known to be highly sensitive to the initial selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems, though (such as when classifying georeferenced data for mapping applications), standardization of the clustering methodology, specifically the ability to arrive at the same cluster assignment on every run of the method, i.e., replicability, may matter more than any perceived measure of accuracy, especially when the solution is known to be non-unique, as in k-means clustering. Here we propose a simple initial seed selection algorithm for k-means clustering along one attribute that draws initial cluster boundaries along the "deepest valleys", i.e., the greatest gaps in the dataset. It thus incorporates a measure that maximizes the distance between consecutive cluster centers, augmenting the conventional k-means objective of minimizing the distance between each cluster center and its members. Unlike existing initialization methods, no additional parameters or degrees of freedom are introduced into the clustering algorithm. This improves the replicability of cluster assignments by as much as 100% over k-means and k-means++, reducing the variance over different runs to virtually zero. Further, the proposed method is more computationally efficient than k-means++ and, in some cases, more accurate.
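A minimal sketch of the gap-based seeding idea, assuming (as the abstract describes) that the k-1 widest gaps between consecutive sorted values define the initial cluster boundaries and that each segment's mean seeds one cluster; the function name is illustrative:

```python
import numpy as np

def gap_seeds(x, k):
    """Deterministic 1-D seeding: cut at the k-1 widest gaps between
    consecutive sorted values, then seed each segment with its mean."""
    assert k >= 2
    xs = np.sort(np.asarray(x, dtype=float))
    gaps = np.diff(xs)                           # gap between neighbours
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])  # the k-1 "deepest valleys"
    segments = np.split(xs, cuts + 1)            # k contiguous segments
    return np.array([seg.mean() for seg in segments])

x = np.concatenate([np.random.normal(m, 1.0, 50) for m in (0, 10, 20)])
print(gap_seeds(x, 3))   # identical seeds on every rerun with the same data
```

Because the seeds depend only on the sorted data, every run of k-means started from them produces the same assignment, which is the replicability property the abstract emphasizes.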

2.
Feature selection is a crucial dimensionality-reduction step when processing high-dimensional data. The low-rank representation model can reveal the global structure of the data and has some discriminative power, while the sparse representation model can reveal the essential structure of the data through a small number of connections. By introducing a sparsity constraint into the low-rank representation model, a low-rank sparse representation model is constructed to learn a low-rank sparse similarity matrix between data points; based on this matrix, a low-rank sparse scoring scheme is proposed for unsupervised feature selection. Clustering and classification experiments on the selected features over different databases, compared against traditional feature selection algorithms, demonstrate the effectiveness of the low-rank feature selection algorithm.

3.
With the sharp increase in information volume, analyzing and retrieving this vast amount of data is more essential than ever. One of the main techniques for this is clustering, which aims to group objects so that all objects within a cluster share similar features while objects in different clusters are as distinct as possible. One of the most widely used clustering algorithms, with well-established performance across applications, is the k-means algorithm. Its main weakness is that its performance is directly affected by the selection of the initial clusters; neglecting this issue leads to consequences such as empty clusters and slower convergence, while appropriate initial seeds can reduce cluster inconsistency. In this paper, we present a new method to determine the initial seeds of the k-means algorithm, improving accuracy and decreasing the number of iterations. For this purpose, the proposed method uses the average distance between objects to determine the initial seeds, aiming at a proper tradeoff between the accuracy and the speed of the clustering algorithm. Experimental results show that the proposed approach outperforms the method of Chithra by 1.7% and 2.1% in clustering accuracy on the Wine and Abalone datasets, respectively. Furthermore, the results indicate that the proposed method converges faster than the Reverse Nearest Neighbor (RNN) search approach.
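The paper's exact seeding rule is not reproduced here; the sketch below is one plausible reading of "using the average distance between objects", in which each new seed must lie farther than the mean pairwise distance from all seeds chosen so far (the function name and the fallback rule are assumptions):

```python
import numpy as np

def avg_distance_seeds(X, k, seed=0):
    """Greedy seeding: each new seed must be farther than the mean
    pairwise distance from every seed already chosen (assumed rule)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    avg = d.sum() / (n * (n - 1))               # mean pairwise distance
    seeds = [int(rng.integers(n))]
    while len(seeds) < k:
        dist_to_seeds = d[:, seeds].min(axis=1)
        ok = np.flatnonzero(dist_to_seeds > avg)
        # take a qualifying point, else fall back to the farthest point
        seeds.append(int(ok[0]) if ok.size else int(dist_to_seeds.argmax()))
    return X[seeds]
```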

4.
In this article, a cluster validity index and its fuzzification are described, which provide a measure of the goodness of clustering on different partitions of a data set. The maximum value of this index, called the PBM-index, across the hierarchy provides the best partitioning. The index is defined as the product of three factors, whose maximization ensures the formation of a small number of compact clusters with large separation between at least two of them. We use both the k-means and the expectation maximization algorithms as the underlying crisp clustering techniques, and the well-known fuzzy c-means algorithm for fuzzy clustering. Results demonstrating the superiority of the PBM-index in determining the appropriate number of clusters, compared with three other well-known measures (the Davies-Bouldin index, Dunn's index, and the Xie-Beni index), are provided for several artificial and real-life data sets.
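For reference, the crisp PBM-index is the squared product of three factors; a small NumPy sketch of the published formula, where E1 is the scatter about the grand mean, EK the within-cluster scatter, and DK the largest center separation:

```python
import numpy as np

def pbm_index(X, labels, centers):
    """PBM(K) = ((1/K) * (E1/EK) * DK)^2 -- larger is better (K >= 2)."""
    K = len(centers)
    e1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()      # E_1
    ek = sum(np.linalg.norm(X[labels == j] - centers[j], axis=1).sum()
             for j in range(K))                                # E_K
    dk = max(np.linalg.norm(centers[i] - centers[j])           # D_K
             for i in range(K) for j in range(i + 1, K))
    return ((e1 / ek) * dk / K) ** 2

# choose the K whose partition (e.g., from k-means) maximizes pbm_index
```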

5.
Traditional clustering methods such as k-means and fuzzy c-means usually do not distinguish the different contributions or importance of individual features to clustering, and therefore often yield poor performance on high-dimensional data; this is attributable to ignoring the strong correlation or redundancy among high-dimensional features. Introducing a weight for each feature and optimizing the clustering objective not only yields the weights automatically but also improves clustering performance. Even so, feature weights obtained in an unsupervised manner may not match the relative importance (or preference) among features that the user expects. This work therefore designs a clustering method that reflects user-specified feature preferences: it extends existing globally weighted preference clustering methods, which are independent of individual clusters, to a cluster-dependent, locally feature-weighted method, thereby remedying the shortcomings of the former and improving the performance of preference-based clustering.

6.
Selecting a subset of salient features for clustering with a clustering learning algorithm has been explored extensively in many real-world applications. To select salient features during training, the filter model evaluates the intrinsic characteristics of each individual feature but is not permitted to use a clustering learning algorithm that would provide clustered information to train the features. In particular, the filter model aims to predict unobservable clusters and to measure how well the features yield satisfactory within-cluster and between-cluster scatters, the prerequisites of good clustering quality. It is, however, generally difficult to achieve both scatters in the filter model. For example, a random variable with a large variance may raise only the between-cluster scatter, whereas a variable following a uniform distribution may raise only the within-cluster scatter. In this paper, we present a new filter-based method to quantify features that considers both feature compactness and separability, ensuring that both scatters are raised. Moreover, our method adopts a new search strategy to locate the best feature salience vector instead of visiting the space of all possible feature subsets. Experiments on benchmark data sets indicate that our method selects feature subsets for clustering better than many benchmark filter-based methods.

7.
Data clustering has proven to be an effective method for discovering structure in medical datasets. The majority of clustering algorithms produce exclusive clusters, meaning that each sample can belong to only one cluster. However, most real-world medical datasets contain inherently overlapping information, which is best explained by overlapping clustering methods that allow a sample to belong to more than one cluster. One of the simplest and most efficient overlapping clustering methods is overlapping k-means (OKM), an extension of the traditional k-means algorithm. Being such an extension, OKM also suffers from sensitivity to the initial cluster centroids. In this paper, we propose a hybrid method that combines the k-harmonic means and overlapping k-means algorithms (KHM-OKM) to overcome this limitation. The main idea behind KHM-OKM is to use the output of the KHM method to initialize the cluster centers of OKM. We tested the proposed method using the FBCubed metric, which has been shown to be the most effective measure for evaluating overlapping clustering algorithms with respect to homogeneity, completeness, rag bag, and the cluster size-quantity tradeoff. Results on ten publicly available medical datasets show that KHM-OKM outperforms the original OKM algorithm and can serve as an efficient method for clustering medical datasets.
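A compact sketch of the KHM-OKM idea: run k-harmonic means with its standard soft-membership updates (p > 2) and hand the converged centers to the OKM routine as initial centroids; `overlapping_kmeans` is a placeholder, not an API from the paper:

```python
import numpy as np

def khm_centers(X, k, p=3.5, iters=50, eps=1e-8, seed=0):
    """K-harmonic means: soft memberships make the converged centers far
    less sensitive to the random start than plain k-means."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + eps
        m = d ** (-p - 2)
        m /= m.sum(axis=1, keepdims=True)                # memberships
        w = (d ** (-p - 2)).sum(axis=1) / (d ** -p).sum(axis=1) ** 2
        mw = m * w[:, None]
        C = (mw.T @ X) / mw.sum(axis=0)[:, None]         # weighted means
    return C

# seed OKM with the KHM output (the OKM routine itself is not shown):
# overlapping_kmeans(X, init=khm_centers(X, k))
```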

8.
To address the classification of massive electric-utility data, an improved k-means classification method is proposed. Building on the k-means algorithm, PCA is applied for dimensionality reduction, and the canopy algorithm is used to optimize the number of clusters and the initial cluster centers. The improved k-means algorithm is then applied to cluster residential electricity consumption; finally, an LSTM prediction model is built on the clustering results. The LSTM prediction model is then used to ...
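A hedged sketch of the canopy pre-clustering step described above for choosing k and the initial centers; the thresholds T1 > T2 are user-set here (the paper's settings are not reproduced), and only the tight threshold T2 matters for seeding, since canopy membership under T1 does not affect the centers:

```python
import numpy as np

def canopy_centers(X, t1, t2):
    """Canopy pre-clustering: the number of canopies fixes k and their
    centers seed k-means. Requires t1 > t2; t1 only governs (loose)
    canopy membership, which this seeding-only sketch does not need."""
    assert t1 > t2
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        c = X[remaining[0]]                     # next candidate center
        centers.append(c)
        dist = np.linalg.norm(X[remaining] - c, axis=1)
        # drop points tightly bound to this canopy (within t2)
        remaining = [i for i, di in zip(remaining, dist) if di > t2]
    return np.array(centers)

# usage: centers = canopy_centers(X_pca, t1=5.0, t2=2.0); k = len(centers)
```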

9.
One main task for domain experts analysing their nD data is to detect and interpret class/cluster separations and outliers. An important question is which features/dimensions separate classes best or allow a cluster-based classification of the data. Common approaches rely on projections from nD to 2D, which come with several challenges: the space of projections contains infinitely many candidates (how to find the right one?); projections suffer from distortions and misleading effects (how far can the projected class/cluster separation be trusted?); and projections involve the complete set of dimensions/features (how to identify irrelevant dimensions?). To address these challenges, we introduce a visual analytics concept for feature selection based on linear discriminative star coordinates (DSC), which generate optimal cluster-separating views, in a linear sense, for both labeled and unlabeled data. This way the user can explore how each dimension contributes to the clustering. To support the exploration of relations between clusters and data dimensions, we provide a set of cluster-aware interactions that allow the user to iterate through subspaces of both records and features in a guided manner. We demonstrate our feature selection approach for optimal cluster/class separation analysis with several experiments on real-life benchmark high-dimensional data sets.

10.
We study a general algorithm that improves clustering accuracy by employing the James-Stein shrinkage effect in k-means clustering. We shrink the centroids of clusters toward the overall mean of all data using a James-Stein-type adjustment, and the James-Stein shrinkage estimators then act as the new centroids in the next clustering iteration, until convergence. We compare the shrinkage results to the traditional k-means method. A Monte Carlo simulation shows that the magnitude of the improvement depends on the within-cluster variance and especially on the effective dimension of the covariance matrix. Using the Rand index, we demonstrate that accuracy increases significantly on simulated data and in a real-data example.
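A sketch of one iteration's shrinkage step, assuming a positive-part James-Stein adjustment with the noise variance pooled from within-cluster variances (the paper's exact estimator may differ):

```python
import numpy as np

def js_shrink_centroids(X, labels, centroids):
    """Positive-part James-Stein shrinkage of every centroid toward the
    grand mean; the shrunk centroids feed the next Lloyd iteration.
    Assumes d > 2 and non-empty clusters."""
    g = X.mean(axis=0)                          # grand mean of all data
    d = X.shape[1]
    out = np.empty_like(centroids)
    for j, m in enumerate(centroids):
        members = X[labels == j]
        sigma2 = members.var(axis=0).mean()     # pooled within-cluster var
        dist2 = np.sum((m - g) ** 2) + 1e-12
        shrink = max(0.0, 1.0 - (d - 2) * sigma2 / (len(members) * dist2))
        out[j] = g + shrink * (m - g)           # move centroid toward g
    return out
```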

11.
张宜浩, 金澎, 孙锐. 计算机应用 (Journal of Computer Applications), 2012, 32(5): 1332-1334
Polysemy is pervasive in Chinese; word sense induction groups occurrences of a word that share the same meaning across different contexts, and is essentially a clustering problem. Unsupervised clustering is currently the dominant approach to word sense induction. An improved k-means algorithm is proposed that modifies both the selection of initial cluster centers and the computation of cluster means, which to some extent overcomes the sensitivity of k-means to noise and outliers. For the feature representation, category codes from the Tongyici Cilin thesaurus are used to reduce the feature dimensionality. Experiments show that the improved k-means algorithm delivers a substantial performance gain, reaching an F-score of 75.8%.

12.
An image clustering algorithm based on information-entropy feature selection and the information bottleneck method is proposed. First, Gabor wavelet texture features and gray-level co-occurrence matrix texture features are extracted from the images, and an information-entropy feature selection method is applied for dimensionality reduction. Among the many image clustering methods, the typical k-means algorithm depends heavily on the choice of distance function and cluster centers; the information bottleneck algorithm is therefore adopted for clustering instead. It requires no distance function and models the relationship between samples and features, compressing the sample information while preserving the feature information. Experimental results show that the proposed method achieves good clustering quality.

13.
The success rates of expert or intelligent systems depend on the selection of correct data clusters. The k-means algorithm is a well-known method for solving data clustering problems, but it suffers not only from a high dependency on its initial solution but also from the distance function used. A number of algorithms have been proposed to address the centroid initialization problem, yet the produced solutions do not yield optimal clusters. This paper proposes three algorithms: (i) C-LCA, a search algorithm that improves the League Championship Algorithm (LCA); (ii) a search clustering algorithm using C-LCA (SC-LCA); and (iii) a hybrid clustering algorithm, the hybrid of k-means and the Chaotic League Championship Algorithm (KSC-LCA), which has two computation stages. C-LCA employs chaotic adaptation for the retreat and approach parameters, rather than constants, which enhances the search capability. Furthermore, to overcome the limitation of the original k-means algorithm, whose Euclidean distance cannot handle categorical attributes properly, we adopt the Gower distance and a mechanism for handling the discrete values of categorical attributes. The proposed algorithms can handle not only purely numeric data but also mixed-type data, and can find the best centroids containing categorical values. Experiments were conducted on 14 datasets from the UCI repository. SC-LCA and KSC-LCA competed with 16 established algorithms, including k-means, k-means++, and global k-means, four search clustering algorithms, and nine hybrids of k-means with several state-of-the-art evolutionary algorithms. The experimental results show that SC-LCA produces the clusters with the highest F-measure on the purely categorical dataset, and KSC-LCA does so on the purely numeric and mixed-type datasets tested. Out of 14 datasets, SC-LCA produced 13 sets of centroids with better F-measures than the k-means algorithm. On the Tic-Tac-Toe dataset, which contains only categorical attributes, SC-LCA achieves an F-measure of 66.61, which is 21.74 points above that of k-means (44.87). KSC-LCA produced better centroids than k-means on all 14 datasets, with a maximum F-measure improvement of 11.59 points. In terms of computational cost, SC-LCA and KSC-LCA take more function evaluations (NFEs) than k-means and its variants, but KSC-LCA ranks first and SC-LCA fourth among the hybrid and search clustering algorithms we tested. SC-LCA and KSC-LCA are therefore general and effective clustering algorithms suitable for expert or intelligent systems that require accurate, high-speed cluster selection.
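For illustration, the standard Gower distance the abstract adopts for mixed-type records: categorical attributes contribute 0/1 mismatches and numeric attributes a range-normalized absolute difference, averaged over all attributes:

```python
import numpy as np

def gower_distance(x, y, is_cat, ranges):
    """Gower distance for a pair of mixed-type records: categorical
    attributes score a 0/1 mismatch, numeric ones |diff| / range."""
    total = 0.0
    for xi, yi, cat, r in zip(x, y, is_cat, ranges):
        if cat:
            total += float(xi != yi)
        else:
            total += abs(xi - yi) / r if r > 0 else 0.0
    return total / len(x)

# e.g. one categorical plus two numeric attributes:
# is_cat = [True, False, False]
# ranges = [None, hi2 - lo2, hi3 - lo3]   # per-attribute numeric ranges
```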

14.
In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, showing that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
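The full filtering algorithm builds a kd-tree on the data points and prunes candidate centers node by node; as a simpler illustration of the same Lloyd skeleton, the sketch below merely routes the nearest-center queries through a kd-tree built on the current centers:

```python
import numpy as np
from scipy.spatial import cKDTree

def lloyd_kdtree(X, centers, iters=100, tol=1e-6):
    """Lloyd's iteration with nearest-center queries answered by a kd-tree
    over the current centers (the paper's filtering variant instead prunes
    candidate centers over a kd-tree built once on the data)."""
    for _ in range(iters):
        labels = cKDTree(centers).query(X)[1]   # nearest-center indices
        new = np.array([X[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(len(centers))])
        if np.linalg.norm(new - centers) < tol:
            return new, labels
        centers = new
    return centers, labels
```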

15.
An improved initial cluster center selection algorithm for k-means
In the traditional k-means algorithm, the clustering result fluctuates with the choice of initial cluster centers. To address this weakness, an algorithm for optimizing the initial cluster centers is proposed: it computes a density parameter for each data object and then selects k points lying in high-density regions as the initial centers. Experiments on standard UCI datasets show that, for a given number of clusters, k-means initialized with the proposed method achieves higher accuracy and stability than k-means with randomly chosen initial centers.
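A minimal sketch of density-based seeding, assuming the density parameter counts neighbours within a radius and that each chosen seed suppresses its neighbourhood so the k seeds spread across distinct high-density regions (the paper's exact density definition is not reproduced):

```python
import numpy as np

def density_seeds(X, k, radius):
    """Pick k initial centers from high-density regions: density = number
    of neighbours within `radius`; greedily take the densest live point,
    then suppress its neighbourhood so seeds land in distinct regions."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    density = (d < radius).sum(axis=1)
    alive = np.ones(len(X), dtype=bool)
    seeds = []
    for _ in range(k):
        i = int(np.where(alive, density, -1).argmax())
        seeds.append(i)
        alive &= d[i] > radius   # assumes enough points stay alive
    return X[seeds]
```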

16.
When the clustering by fast search and find of density peaks (CFSFDP) algorithm is applied to datasets with multiple density peaks, manually selecting the cluster centers easily leads to misassigned clusters. An improved density peaks clustering algorithm combined with genetic k-means is proposed: among the candidate cluster centers found by CFSFDP, the global search capability of a genetic k-means with variable-length chromosome encoding automatically locates the optimal cluster centers, while the crossover probability of the genetic k-means is determined adaptively to avoid premature convergence. Experimental results on UCI datasets show that the improved algorithm achieves better clustering quality with fewer iterations, validating its feasibility and effectiveness.

17.
A simple and fast multi-channel Gabor filtering technique for color texture segmentation is proposed. First, multi-channel Gabor filtering is applied to the color texture via DRBFT and IDRBFT; PCA then reduces the dimensionality of the resulting feature vectors, which are clustered with k-means. The clustered regions are smoothed with mean shift, and edge detection on the smoothed regions yields the boundaries between the different textures. An experimental comparison with several segmentation algorithms shows that the proposed method is effective for segmenting color textures.

18.
Feature selection is an important preprocessing step for dealing with high-dimensional data. In this paper, we propose a novel unsupervised feature selection method that embeds a subspace learning regularization, namely principal component analysis (PCA), into the sparse feature selection framework. Specifically, we select informative features via the sparse learning framework while simultaneously preserving the principal components (i.e., the maximal variance) of the data, thereby improving the interpretability of the feature selection model. Furthermore, we propose an effective optimization algorithm for the proposed objective function that achieves a stable optimal result with fast convergence. Compared with five state-of-the-art unsupervised feature selection methods on six benchmark and real-world datasets, our proposed method achieves the best classification performance.

19.
Unsupervised feature selection is an important problem, especially for high-dimensional data, yet it has been studied only scarcely, and existing algorithms do not provide satisfying performance. In this paper, we therefore propose a new unsupervised feature selection algorithm using similarity-based feature clustering, Feature Selection-based Feature Clustering (FSFC). FSFC removes redundant features according to the results of a feature clustering based on feature similarity. First, it clusters the features according to their similarity, using a new feature clustering algorithm that overcomes the shortcomings of k-means. Second, it selects a representative feature from each cluster, one that carries most of the interesting information of the features in that cluster. The efficiency and effectiveness of FSFC are tested on real-world data sets and compared with two representative unsupervised feature selection algorithms, Feature Selection Using Similarity (FSUS) and Multi-Cluster-based Feature Selection (MCFS), in terms of runtime, feature compression ratio, and the clustering results of k-means. The results show that FSFC not only reduces the feature space in less time but also significantly improves the clustering performance of k-means.
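FSFC's own feature clustering algorithm is not reproduced here; the sketch below substitutes absolute Pearson correlation for the feature similarity and a greedy one-pass grouping, keeping the most central feature of each group as its representative:

```python
import numpy as np

def fsfc_sketch(X, threshold=0.8):
    """Greedy similarity-based feature clustering: features with
    |corr| > threshold share a cluster; keep the most central feature
    of each cluster as its representative (assumes no constant feature)."""
    sim = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature |corr|
    unassigned, keep = set(range(X.shape[1])), []
    while unassigned:
        f = unassigned.pop()
        cluster = [f] + [g for g in unassigned if sim[f, g] > threshold]
        unassigned -= set(cluster)
        # representative = feature most similar to the rest of its cluster
        keep.append(max(cluster, key=lambda g: sim[g, cluster].mean()))
    return sorted(keep)                          # indices of kept features
```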

20.
Li Min, Yang Chao, Sun Qiao, Ma Wen-Jing, Cao Wen-Long, Ao Yu-Long. Journal of Computer Science and Technology, 2019, 34(1): 77-93

With the advent of the big data era, the amounts of sampled data and the dimensions of data features are growing rapidly. Fast and efficient clustering of unlabeled samples based on feature similarities is therefore highly desired. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention today. To achieve high-performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that achieves high performance by conducting the computation on a distance matrix and, at the same time, improves memory reuse by fusing the distance-matrix computation with the nearest-centroids reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, the major horsepower of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint, and a register blocking strategy to increase data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Discussions on block-size tuning and performance modeling are also presented. We show by experiments on both randomly generated and real-world datasets that our parallel implementation of k-means on SW26010 sustains a double-precision performance of over 348.1 Gflops, which is 46.9% of the peak performance and 84% of the theoretical upper bound on a single core group, and achieves nearly ideal scalability on the whole four-core-group SW26010 processor. Performance comparisons with the previous state of the art on both CPU and GPU further show the superiority of our optimized k-means kernel.
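The matrix formulation being fused: since ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x·c, the distance matrix reduces to one GEMM plus rank-1 corrections, after which the nearest-centroid reduction follows; a NumPy sketch of the math (the fusion and register blocking are SW26010-specific and not shown):

```python
import numpy as np

def nearest_centroids(X, C):
    """Matrix-formulated assignment: the n-by-k squared-distance matrix is
    one GEMM (X @ C.T) plus rank-1 corrections, then an argmin reduction."""
    x2 = (X * X).sum(axis=1, keepdims=True)   # ||x||^2 as a column
    c2 = (C * C).sum(axis=1)                  # ||c||^2 as a row
    D = x2 + c2 - 2.0 * (X @ C.T)             # squared Euclidean distances
    return D.argmin(axis=1)                   # index of nearest centroid
```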

