共查询到20条相似文献,搜索用时 0 毫秒
1.
Clustering consists in partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and they also have been utilized in many scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. However, nowadays commercial or scientific databases usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure which is capable to deal with tree structured categorical data. Thus, it can be used for extending the various versions of the very popular k-means clustering algorithm to deal with such data. We discuss how such an extension can be achieved. Moreover, we empirically prove that the proposed dissimilarity measure is accurate, compared to other well-known (dis)similarity measures for categorical data. 相似文献
2.
Keke Chen Ling Liu 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(6):1241-1260
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few have proposed to monitor the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HE-Tree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K by combining our previously developed BKPlot method. The HE-Tree can efficiently summarize the entropy property of a categorical data stream and allow us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop the time-decaying HE-Tree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental result shows that with the combination of the HE-Tree and the BKPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams. 相似文献
3.
Tao Chen Nevin L. Zhang Tengfei Liu Kin Man Poon Yi Wang 《Artificial Intelligence》2012,176(1):2246-2269
Existing models for cluster analysis typically consist of a number of attributes that describe the objects to be partitioned and one single latent variable that represents the clusters to be identified. When one analyzes data using such a model, one is looking for one way to cluster data that is jointly defined by all the attributes. In other words, one performs unidimensional clustering. This is not always appropriate. For complex data with many attributes, it is more reasonable to consider multidimensional clustering, i.e., to partition data along multiple dimensions. In this paper, we present a method for performing multidimensional clustering on categorical data and show its superiority over unidimensional clustering. 相似文献
4.
针对类属型数据聚类中对象间距离函数定义的困难问题,提出一种基于贝叶斯概率估计的类属数据聚类算法。首先,提出一种属性加权的概率模型,在这个模型中每个类属属性被赋予一个反映其重要性的权重;其次,经过贝叶斯公式的变换,定义了基于最大似然估计的聚类优化目标函数,并提出了一种基于划分的聚类算法,该算法不再依赖于对象间的距离,而是根据对象与数据集划分间的加权似然进行聚类;第三,推导了计算属性权重的表达式,得出了类属型属性权重与其符号分布的信息熵成反比的结论。在实际数据和合成数据集上进行了实验,结果表明,与基于距离的现有聚类算法相比,所提算法提高了聚类精度,特别是在生物信息学数据上取得了5%~48%的提升幅度,并可以获得有实际意义的属性加权结果。 相似文献
5.
无重叠子空间分类聚类算法 总被引:1,自引:0,他引:1
传统的聚类算法主要是对数值型的数据进行聚类,而随着对数据的发展需求,建立在分类数据上的算法也越来越多,由于分类数据没有直接意义上的距离,传统算法不能解决这个问题.同时,现有子空间上的分类聚类研究不是很多.引用熵的概念来选择确定划分的类别和类的最优中心点,同时提出了一种新的目标函数来得到每个类上的相关子空间集,并根据目标函数的最小值来优化聚类的划分.实验结果表明,该方法是可行的,同时也能够了解每个类中的数据结构特点. 相似文献
6.
7.
Can Gao Witold Pedrycz Duoqian Miao 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2013,17(9):1643-1658
Clustering categorical data arising as an important problem of data mining has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is first studied. Based on the theory of rough sets, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed to deal with categorical data. The algorithm employs some of rough subspaces with high quality to cluster the data and yields a robust and stable solution by exploiting the resulting partitions. We also introduce a cluster index to evaluate the solution of clustering algorithm for categorical data. Experimental results for selected UCI data sets show that the proposed method produces better results than those obtained by other methods when being evaluated in terms of cluster validity indexes. 相似文献
8.
Cheqing Jin Jeffrey Xu Yu Aoying Zhou Feng Cao 《Knowledge and Information Systems》2014,40(3):509-539
Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples arriving at high speed and the difficulty that arises from handling uncertain data. The prior work on clustering data streams focuses on devising complicated synopsis data structures to summarize data streams into a small number of micro-clusters so that important statistics can be computed conveniently, such as Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty, while existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, Uncertain Feature (UF), to summarize data streams with both kinds of uncertainties: UF is space-efficient, has additive and subtractive properties, and can compute complicated statistics easily. Our first attempt aims at enhancing the previous streaming approaches to handle the sliding-window model by using UF instead of old synopses, inclusive of CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008). We show that such methods cannot achieve high efficiency. Our second attempt aims at devising a novel algorithm, cluUS , to handle the sliding-window model by using UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method. 相似文献
9.
Outlier detection is an important data mining task with many contemporary applications. Clustering based methods for outlier detection try to identify the data objects that deviate from the normal data. However, the uncertainty regarding the cluster membership of an outlier object has to be handled appropriately during the clustering process. Additionally, carrying out the clustering process on data described using categorical attributes is challenging, due to the difficulty in defining requisite methods and measures dealing with such data. Addressing these issues, a novel algorithm for clustering categorical data aimed at outlier detection is proposed here by modifying the standard \(k\)-modes algorithm. The uncertainty regarding the clustering process is addressed by considering a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets. The efficacy of the proposed rough \(k\)-modes clustering algorithm for outlier detection is demonstrated using various benchmark categorical data sets. 相似文献
10.
Tengke Xiong Shengrui Wang André Mayers Ernest Monga 《Data mining and knowledge discovery》2012,24(1):103-135
Clustering categorical data poses two challenges defining an inherently meaningful similarity measure, and effectively dealing with clusters which are often embedded in different
subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC.
We view the task of clustering categorical data from an optimization perspective, and propose effective procedures to initialize
and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA).
We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First,
due to its hierarchical nature, our algorithm yields a dendrogram representing nested groupings of patterns and similarity
levels at different granularities. Second, it is parameter-free, fully automatic and, in particular, requires no assumption
regarding the number of clusters. Third, it is independent of the order in which the data is processed. Fourth, it is scalable
to large data sets. And finally, our algorithm is capable of seamlessly discovering clusters embedded in subspaces, thanks
to its use of a novel data representation and Chi-square dissimilarity measures. Experiments on both synthetic and real data
demonstrate the superior performance of our algorithm. 相似文献
11.
A fuzzy k-modes algorithm for clustering categorical data 总被引:12,自引:0,他引:12
This correspondence describes extensions to the fuzzy k-means algorithm for clustering categorical data. By using a simple matching dissimilarity measure for categorical objects and modes instead of means for clusters, a new approach is developed, which allows the use of the k-means paradigm to efficiently cluster large categorical data sets. A fuzzy k-modes algorithm is presented and the effectiveness of the algorithm is demonstrated with experimental results 相似文献
12.
子空间聚类是高维数据聚类的一种有效手段,子空间聚类的原理就是在最大限度地保留原始数据信息的同时用尽可能小的子空间对数据聚类。在研究了现有的子空间聚类的基础上,引入了一种新的子空间的搜索方式,它结合簇类大小和信息熵计算子空间维的权重,进一步用子空间的特征向量计算簇类的相似度。该算法采用类似层次聚类中凝聚层次聚类的思想进行聚类,克服了单用信息熵或传统相似度的缺点。通过在Zoo、Votes、Soybean三个典型分类型数据集上进行测试发现:与其他算法相比,该算法不仅提高了聚类精度,而且具有很高的稳定性。 相似文献
13.
In clustering algorithms, choosing a subset of representative examples is very important in data set. Such “exemplars” can be found by randomly choosing an initial subset of data objects and then iteratively refining it, but this works well only if that initial choice is close to a good solution. In this paper, based on the frequency of attribute values, the average density of an object is defined. Furthermore, a novel initialization method for categorical data is proposed, in which the distance between objects and the density of the object is considered. We also apply the proposed initialization method to k-modes algorithm and fuzzy k-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to random initialization method and can be applied to large data sets for its linear time complexity with respect to the number of data objects. 相似文献
14.
15.
16.
This paper introduces a self-organizing map dedicated to clustering, analysis and visualization of categorical data. Usually,
when dealing with categorical data, topological maps use an encoding stage: categorical data are changed into numerical vectors
and traditional numerical algorithms (SOM) are run. In the present paper, we propose a novel probabilistic formalism of Kohonen
map dedicated to categorical data where neurons are represented by probability tables. We do not need to use any coding to
encode variables. We evaluate the effectiveness of our model in four examples using real data. Our experiments show that our
model provides a good quality of results when dealing with categorical data. 相似文献
17.
为了提高分类型数据集聚类的准确性和对广泛数据集聚类的适应性,引入3种核函数,再利用基于山方法的核K-means作分类型的数据聚类,核函数把分类型数据映射到高维特征空间,从而给缺乏测度的分类型数据引入了数值型数据的测度.改进后用多个公开数据集对这些方法进行了实验评测,结果显示这些方法对分类型数据的聚类是有效的. 相似文献
18.
Liang Bai Jiye Liang Chuangyin Dang Fuyuan Cao 《Expert systems with applications》2012,39(9):8022-8029
The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the performance of the k-modes clustering algorithm which converges to numerous local minima strongly depends on initial cluster centers. Currently, most methods of initialization cluster centers are mainly for numerical data. Due to lack of geometry for the categorical data, these methods used in cluster centers initialization for numerical data are not applicable to categorical data. This paper proposes a novel initialization method for categorical data which is implemented to the k-modes algorithm. The method integrates the distance and the density together to select initial cluster centers and overcomes shortcomings of the existing initialization methods for categorical data. Experimental results illustrate the proposed initialization method is effective and can be applied to large data sets for its linear time complexity with respect to the number of data objects. 相似文献
19.
In this paper, the conventional k-modes-type algorithms for clustering categorical data are extended by representing the clusters of categorical data with k-populations instead of the hard-type centroids used in the conventional algorithms. Use of a population-based centroid representation makes it possible to preserve the uncertainty inherent in data sets as long as possible before actual decisions are made. The k-populations algorithm was found to give markedly better clustering results through various experiments. 相似文献
20.
基于Storm的海量数据实时聚类 总被引:1,自引:0,他引:1
针对现有平台处理海量数据实时响应能力普遍较差的问题,引入Storm分布式实时计算平台进行大规模数据的聚类分析,设计了基于Storm框架的DBSCAN算法。该算法将整个过程分为数据接入、聚类分析、结果输出等阶段,在框架预定义的组件中分别编程实现,各组件通过数据流连通形成任务实体,提交到集群运行完成。通过对比分析和性能监测,验证了所提方案具有低延迟和高吞吐量的优势,集群运行状况良好,负载均衡。实验结果表明Storm平台处理海量数据实时性较高,能够胜任大数据背景下的数据挖掘任务。 相似文献