Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
 
Authors:  Zhexue Huang 
 
Institution:  ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT, 2601, Australia
 
Abstract:  The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values
prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms
which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes
algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with
modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function.
With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The
k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes
algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean
disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on
two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large
data sets, which is critical to data mining applications.
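
The three ingredients named in the abstract (simple matching dissimilarity, frequency-based mode updates, and a combined dissimilarity for mixed data) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names, the batch (Lloyd-style) iteration in `k_modes`, the use of squared Euclidean distance for the numeric part, and the weight `gamma` are all assumptions made for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def update_mode(cluster):
    """Frequency-based update: take the most frequent category per attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Combined measure for mixed data (assumed form): squared Euclidean
    distance on numeric attributes plus gamma times the mismatch count."""
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    return numeric + gamma * matching_dissimilarity(a_cat, b_cat)


def k_modes(objects, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign every object to its nearest mode, then
    recompute each mode, until the assignments stop changing."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in objects
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        for j in range(len(modes)):
            members = [o for o, lab in zip(objects, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = update_mode(members)
    return labels, modes


# Toy data: six objects with two categorical attributes each.
objects = [("a", "x"), ("a", "y"), ("a", "x"),
           ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = k_modes(objects, initial_modes=[("a", "x"), ("b", "z")])
```

On this toy input the first three objects end up in one cluster and the last three in the other. Replacing `matching_dissimilarity` with `mixed_dissimilarity`, and computing means for the numeric attributes alongside the modes, would give a k-prototypes-style variant under the same assumptions.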
 
Keywords:  data mining; cluster analysis; clustering algorithms; categorical data
This article is indexed in SpringerLink and other databases.
