期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Determining the number of clusters using information entropy for mixed data

Jiye Liang Xingwang Zhao Deyu Li Fuyuan Cao Chuangyin Dang 《Pattern recognition》2012,45(6):2251-2265

In cluster analysis, one of the most challenging and difficult problems is the determination of the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been proposed for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical attributes and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy together. The mechanism is able to uniformly characterize within-cluster entropy and between-cluster entropy and to identify the worst cluster in a mixed data set. In order to evaluate the clustering results for mixed data, an effective cluster validity index is also defined in this paper. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real world data sets. The comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results. 相似文献

2.

Density Based Cluster Growing via Dominant Sets

Jian Hou Xu E Weixue Liu 《Neural Processing Letters》2018,48(2):933-954

Although there are a lot of clustering algorithms available in the literature, existing algorithms are usually afflicted by practical problems of one form or another, including parameter dependence and the inability to generate clusters of arbitrary shapes. In this paper we aim to solve these two problems by merging the merits of dominant sets and density based clustering algorithms. We firstly apply histogram equalization to eliminate the parameter dependence problem of the dominant sets algorithm. Noticing that the obtained clusters are usually smaller than the real ones, a density threshold based cluster growing step is then used to improve the clustering results, where the involved parameters are determined based on the initial clusters. This is followed by the second cluster growing step which makes use of the density relationship between neighboring data. Data clustering experiments and comparison with other algorithms validate the effectiveness of the proposed algorithm. 相似文献

3.

A k-mean clustering algorithm for mixed numeric and categorical data

《Data & Knowledge Engineering》2007,63(2):503-527

相似文献

4.

On cluster tree for nested and multi-density data clustering

Xutao Li Author Vitae 《Pattern recognition》2010,43(9):3130-3143

Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach—a cluster tree to determine such cluster structure and understand hidden information present in data sets of nested clusters or clusters of multi-density. We embed the agglomerative k-means algorithm in the generation of cluster tree to detect such clusters. Experimental results on both synthetic data sets and real data sets are presented to illustrate the effectiveness of the proposed method. Compared with some existing clustering algorithms (DBSCAN, X-means, BIRCH, CURE, NBC, OPTICS, Neural Gas, Tree-SOM, EnDBSAN and LDBSCAN), our proposed cluster tree approach performs better than these methods. 相似文献

5.

Clustering of periodic multichannel timeseries data with application to plasma fluctuations

S.R. Haskey B.D. BlackwellD.G. Pretty 《Computer Physics Communications》2014

A periodic datamining algorithm has been developed and used to extract distinct plasma fluctuations in multichannel oscillatory timeseries data. The technique uses the Expectation Maximisation algorithm to solve for the maximum likelihood estimates and cluster assignments of a mixture of multivariate independent von Mises distributions (EM-VMM). The performance of the algorithm shows significant benefits when compared to a periodic k-means algorithm and clustering using non-periodic techniques on several artificial datasets and real experimental data. Additionally, a new technique for identifying interesting features in multichannel oscillatory timeseries data is described (STFT-clustering). STFT-clustering identifies the coincidence of spectral features over most channels of a multi-channel array using the averaged short time Fourier transform of the signals. These features are filtered using clustering to remove noise. This method is particularly good at identifying weaker features and complements existing methods of feature extraction. Results from applying the STFT-clustering and EM-VMM algorithm to the extraction and clustering of plasma wave modes in the time series data from a helical magnetic probe array on the H-1NF heliac are presented. 相似文献

6.

A semi-supervised regression model for mixed numerical and categorical variables

Michael K. Ng Author Vitae Elaine Y. Chan Wai-Ki Ching 《Pattern recognition》2007,40(6):1745-1752

In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised learning methods) and k-mode clustering algorithms for categorical variables (unsupervised learning methods). The estimates of regression models and k-mode parameters can be obtained simultaneously by minimizing a function which is the weighted sum of the least-square errors in the multivariate regression models and the dissimilarity measures among the categorical variables. Both synthetic and real data sets are presented to demonstrate the effectiveness of the proposed method. 相似文献

7.

具有用户特征约束的多关系聚类

下载免费PDF全文

王志超张磊《计算机工程与应用》2011,47(23):124-129

多数聚类算法都是针对数据本身,往往忽略了用户聚类目的以及聚类过程中用户的参与指导,这样从数据本身出发的聚类结果准确性往往不太理想。针对这个问题,提出具有用户特征约束的多关系聚类算法。在多关系关联数据中进行用户参与的特征选择,用Must特征集和Can’t特征集描述用户聚类目的,通过领域本体进行特征集合扩充,得到聚类特征集合进行聚类。实验表明,该算法能较好地描述用户聚类目的,实现用户参与的聚类指导,获得了较好的聚类结果。相似文献

8.

Modified FDP cluster algorithm and its application in protein conformation clustering analysis

《Digital Signal Processing》2019

We present a modified find density peaks (MFDP) clustering algorithm. In the MFDP, a critical parameter, dc, is auto-defined by minimizing the entropy of all points. By considering both the point density, ρ, and large distance from points with higher densities, δ, the high-dimensional points are transformed into a 2D space. The halo points of the original FDP cluster algorithm are redefined, and a definition of boundary points is introduced to illustrate the intersection region between clusters. To demonstrate the clustering ability, the distance-based K-means clustering and density-based algorithms DBSCAN, original FDP are employed respectively. Four criteria are introduced to evaluate the clustering algorithms quantitatively. For most of the cases, the MFDP provides a superior clustering result than both of the typical clustering algorithms, and FDP in 20 commonly used benchmark datasets, particularly in clearly depicting the intersection region between clusters. Finally, we evaluate the performance of the MFDP in the cluster analysis of conformations in molecular dynamics (MD). In the MD clustering process, eight typical cluster center conformations are selected in six collective variable spaces. Moreover, it is in strong agreement with the experiment results. The clustering results demonstrate the potential for generalized applications of the modified algorithm to similar problems. 相似文献

9.

A novel cluster center fast determination clustering algorithm

《Applied Soft Computing》2017

As one of the most important techniques in data mining, cluster analysis has attracted more and more attentions in this big data era. Most clustering algorithms have encountered with challenges including cluster centers determination difficulty, low clustering accuracy, uneven clustering efficiency of different data sets and sensible parameter dependence. Aiming at clustering center determination difficulty and parameter dependence, a novel cluster center fast determination clustering algorithm was proposed in this paper. It is supposed that clustering centers are those data points with higher density and larger distance from other data points of higher density. Normal distribution curves are designed to fit the density distribution curve of density distance product. And the singular points outside the confidence interval by setting the confidence interval are proved to be clustering centers by theory analysis and simulations. Finally, according to these clustering centers, a time scan clustering is designed for the rest of the points by density to complete the clustering. Density radius is a sensible parameter in calculating density for each data point, mountain climbing algorithm is thus used to realize self-adaptive density radius. Abundant typical benchmark data sets are testified to evaluate the performance of the brought up algorithms compared with other clustering algorithms in both aspects of clustering quality and time complexity. 相似文献

10.

K步稳定的鞋印花纹图像自动聚类

下载免费PDF全文

王新年舒莹莹《中国图象图形学报》2016,21(5):574-587

目的鞋印是刑事侦查的重要物证之一,如何对积累的大量鞋印花纹图像进行自动归类管理是刑事技术迫切需要解决的问题之一。与其他类图像不同,鞋印花纹图像具有种类多但数目未知、同类花纹分布不均匀且同类花纹数目少的特点。基于鞋印花纹图像的这些特点,用目前典型的聚类算法对鞋印花纹图像集进行聚类,并不能取得很好的效果。在对鞋印花纹图像进行分析的基础上,提出一种K步稳定的鞋印花纹图像自动聚类算法。方法对已标记的鞋印花纹图像进行统计发现,各类鞋印花纹之间在特征空间上存在互不相交的区域(本文称为隔离带)。算法的核心思想是寻找各类鞋印花纹之间的隔离带,来将各类分开。过程为:以单调递增或递减的方式调整特征空间中判定两点为一类的阈值,得到数据集的多次划分;若在连续K次划分的过程中,某一类的成员不发生变化,则说明这K次调整是在隔离带中进行的,即聚出一类,并从数据集中删除已标记的数据;选择下一个阈值对剩余的数据集进行划分,输出K步不变的类;依此类推,直到剩余数据集为空,聚类完成。结果在两类公开测试数据集和实际鞋印花纹数据集上进行实验,本文算法的主要性能指标都超过典型算法,其中在包含5792枚实际鞋印花纹数据集上的聚类准确率和F-Measure值分别达到了99.68%和95.99%。结论针对鞋印花纹图像特点,提出了一种通过寻找各类之间的隔离带进行自动聚类的算法,并在实际应用中取得了很好的效果。且算法性能受参数的变化以及类的形状影响较小。本文算法同样适用于具有类似特点的其他数据集的自动聚类。相似文献

11.

一种量子行为粒子群优化动态聚类算法*

陈伟陈璟孙俊须文波《计算机应用研究》2011,28(7):2432-2435

为了改善量子行为粒子群优化算法的收敛性能,避免粒子早熟问题,提出了一种基于完全学习策略的量子行为粒子群优化算法。由此设计了一种新的数据聚类算法,新的聚类算法通过特殊的粒子编码方式在聚类过程中能够自动确定最佳的聚类数目。在五个测试数据集上与其他两种动态聚类算法进行聚类实验比较,实验结果表明,基于完全学习策略的量子行为粒子群优化动态聚类算法能够获得较好的聚类结果,有着良好的应用前景。相似文献

12.

Time series forecasting with a hybrid clustering scheme and pattern recognition

Sfetsos A. Siriopoulos C. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2004,34(3):399-405

This paper presents the development of a novel clustering algorithm and its application in time series forecasting. The common use of clustering algorithms in time series is to discover to groups sets of data with common characteristic their proximity. This property is used by several hybrid forecasting algorithms that additionally employ a function approximation technique to model interactions within each cluster. The proposed hybrid clustering algorithm (HCA) is a data analysis oriented clustering based on an iterative procedure that creates groups of data whose common property is that they are best described by the same linear relationship. A complementary pattern recognition scheme is employed to assist its implementation in time series forecasting. In this paper the HCA methodology is tested on the benchmark sunspots series, the daily closing values of the Dow Jones Index and hourly surface ozone concentrations. It exhibited a reduction of the forecasting error, in excess of 9%, when compared to other approaches met in the literature. 相似文献

13.

Supervised clustering of label ranking data using label preference information

Mihajlo Grbovic Nemanja Djuric Shengbo Guo Slobodan Vucetic 《Machine Learning》2013,93(2-3):191-225

This paper studies supervised clustering in the context of label ranking data. The goal is to partition the feature space into K clusters, such that they are compact in both the feature and label ranking space. This type of clustering has many potential applications. For example, in target marketing we might want to come up with K different offers or marketing strategies for our target audience. Thus, we aim at clustering the customers’ feature space into K clusters by leveraging the revealed or stated, potentially incomplete customer preferences over products, such that the preferences of customers within one cluster are more similar to each other than to those of customers in other clusters. We establish several baseline algorithms and propose two principled algorithms for supervised clustering. In the first baseline, the clusters are created in an unsupervised manner, followed by assigning a representative label ranking to each cluster. In the second baseline, the label ranking space is clustered first, followed by partitioning the feature space based on the central rankings. In the third baseline, clustering is applied on a new feature space consisting of both features and label rankings, followed by mapping back to the original feature and ranking space. The RankTree principled approach is based on a Ranking Tree algorithm previously proposed for label ranking prediction. Our modification starts with K random label rankings and iteratively splits the feature space to minimize the ranking loss, followed by re-calculation of the K rankings based on cluster assignments. The MM-PL approach is a multi-prototype supervised clustering algorithm based on the Plackett-Luce (PL) probabilistic ranking model. It represents each cluster with a union of Voronoi cells that are defined by a set of prototypes, and assign each cluster with a set of PL label scores that determine the cluster central ranking. Cluster membership and ranking prediction for a new instance are determined by cluster membership of its nearest prototype. The unknown cluster PL parameters and prototype positions are learned by minimizing the ranking loss, based on two variants of the expectation-maximization algorithm. Evaluation of the proposed algorithms was conducted on synthetic and real-life label ranking data by considering several measures of cluster goodness: (1) cluster compactness in feature space, (2) cluster compactness in label ranking space and (3) label ranking prediction loss. Experimental results demonstrate that the proposed MM-PL and RankTree models are superior to the baseline models. Further, MM-PL is has shown to be much better than other algorithms at handling situations with significant fraction of missing label preferences. 相似文献

14.

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data

Liang Bai Jiye Liang Chuangyin Dang 《Knowledge》2011,24(6):785-795

The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, in the k-modes-type algorithms, the performance of their clustering depends on initial cluster centers and the number of clusters needs be known or given in advance. This paper proposes a novel initialization method for categorical data which is implemented to the k-modes-type algorithms. The proposed method can not only obtain the good initial cluster centers but also provide a criterion to find candidates for the number of clusters. The performance and scalability of the proposed method has been studied on real data sets. The experimental results illustrate that the proposed method is effective and can be applied to large data sets for its linear time complexity with respect to the number of data points. 相似文献

15.

基于层次聚类改进SMOTE的过采样方法

王圆方《软件》2020,(2):201-204

针对SMOTE算法在合成少数类新样本时存在的不足,提出了一种基于层次聚类算法改进的SMOTE过采样法H-SMOTE。该算法首先对少数类样本进行层次聚类,其次根据提出的簇密度分布函数,计算各个簇的簇密度,最后在各个簇中利用改进的SMOTE算法进行过采样,提高合成样本的多样性,得到新的平衡数据集。通过对UCI数据集的实验表明,H-SMOTE算法的分类效果得到明显的提升。相似文献

16.

Validity index for crisp and fuzzy clusters

Malay K. Pakhira Author Vitae Sanghamitra Bandyopadhyay^{Author Vitae} 《Pattern recognition》2004,37(3):487-501

In this article, a cluster validity index and its fuzzification is described, which can provide a measure of goodness of clustering on different partitions of a data set. The maximum value of this index, called the PBM-index, across the hierarchy provides the best partitioning. The index is defined as a product of three factors, maximization of which ensures the formation of a small number of compact clusters with large separation between at least two clusters. We have used both the k-means and the expectation maximization algorithms as underlying crisp clustering techniques. For fuzzy clustering, we have utilized the well-known fuzzy c-means algorithm. Results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters, as compared to three other well-known measures, the Davies-Bouldin index, Dunn's index and the Xie-Beni index, are provided for several artificial and real-life data sets. 相似文献

17.

基于混合蛙跳与阴影集优化的粗糙模糊聚类算法

蒙祖强胡玉兰蒋亮常红岩《控制与决策》2015,30(10):1766-1772

针对粗糙模糊聚类算法对初值敏感、易陷入局部最优和聚类性能依赖阈值选择等问题, 提出一种混合蛙跳与阴影集优化的粗糙模糊聚类算法(SFLA-SRFCM). 通过设置自适应调节因子, 以增加混合蛙跳算法的局部搜索能力; 利用类簇上、下近似集的模糊类内紧密度和模糊类间分离度构造新的适应度函数; 采用阴影集自适应获取类簇阈值. 实验结果表明, SFLA-SRFCM 算法是有效的, 并且具有更好的聚类精度和有效性指标.

相似文献

18.

基于自编码器的深度聚类算法综述

下载免费PDF全文

陶文彬钱育蓉张伊扬马恒志冷洪勇马梦楠《计算机工程与应用》2022,58(18):16-25

聚类分析作为一种常见的分析方法,广泛应用于各种场景。随着机器学习技术的发展,深度聚类算法也成了当下研究的热点,基于自编码器的深度聚类算法是其中的代表算法。为了及时了解掌握基于自编码器的深度聚类算法的发展,介绍了四种自编码器的模型,对近些年代表性的算法依照自编码器的结构进行了分类。在MNIST、USPS、Fashion-MNIST数据集上,针对传统聚类算法和基于自编码器的深度聚类算法进行了实验对比、分析,最后对基于自编码器的深度聚类算法目前存在的问题进行了总结,展望了深度聚类算法的研究方向。相似文献

19.

Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values 总被引：76，自引：0，他引：76

Zhexue Huang 《Data mining and knowledge discovery》1998,2(3):283-304

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications. 相似文献

20.

Distributed data clustering in sensor networks

Ittay Eyal Idit Keidar Raphael Rom 《Distributed Computing》2011,24(5):207-222

Low overhead analysis of large distributed data sets is necessary for current data centers and for future sensor networks. In such systems, each node holds some data value, e.g., a local sensor read, and a concise picture of the global system state needs to be obtained. In resource-constrained environments like sensor networks, this needs to be done without collecting all the data at any location, i.e., in a distributed manner. To this end, we address the distributed clustering problem, in which numerous interconnected nodes compute a clustering of their data, i.e., partition these values into multiple clusters, and describe each cluster concisely. We present a generic algorithm that solves the distributed clustering problem and may be implemented in various topologies, using different clustering types. For example, the generic algorithm can be instantiated to cluster values according to distance, targeting the same problem as the famous k-means clustering algorithm. However, the distance criterion is often not sufficient to provide good clustering results. We present an instantiation of the generic algorithm that describes the values as a Gaussian Mixture (a set of weighted normal distributions), and uses machine learning tools for clustering decisions. Simulations show the robustness, speed and scalability of this algorithm. We prove that any implementation of the generic algorithm converges over any connected topology, clustering criterion and cluster representation, in fully asynchronous settings. 相似文献