期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A cluster centers initialization method for clustering categorical data

Liang Bai Jiye Liang Chuangyin Dang Fuyuan Cao 《Expert systems with applications》2012,39(9):8022-8029

The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the performance of the k-modes clustering algorithm which converges to numerous local minima strongly depends on initial cluster centers. Currently, most methods of initialization cluster centers are mainly for numerical data. Due to lack of geometry for the categorical data, these methods used in cluster centers initialization for numerical data are not applicable to categorical data. This paper proposes a novel initialization method for categorical data which is implemented to the k-modes algorithm. The method integrates the distance and the density together to select initial cluster centers and overcomes shortcomings of the existing initialization methods for categorical data. Experimental results illustrate the proposed initialization method is effective and can be applied to large data sets for its linear time complexity with respect to the number of data objects. 相似文献

2.

A new initialization method for categorical data clustering

Fuyuan Cao Jiye Liang Liang Bai 《Expert systems with applications》2009,36(7):10223-10228

In clustering algorithms, choosing a subset of representative examples is very important in data set. Such “exemplars” can be found by randomly choosing an initial subset of data objects and then iteratively refining it, but this works well only if that initial choice is close to a good solution. In this paper, based on the frequency of attribute values, the average density of an object is defined. Furthermore, a novel initialization method for categorical data is proposed, in which the distance between objects and the density of the object is considered. We also apply the proposed initialization method to k-modes algorithm and fuzzy k-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to random initialization method and can be applied to large data sets for its linear time complexity with respect to the number of data objects. 相似文献

3.

基于山方法的分类型数据核聚类

朱映辉杨圣云袁德辉《计算机工程与设计》2008,29(11):2915-2917

为了提高分类型数据集聚类的准确性和对广泛数据集聚类的适应性,引入3种核函数,再利用基于山方法的核K-means作分类型的数据聚类,核函数把分类型数据映射到高维特征空间,从而给缺乏测度的分类型数据引入了数值型数据的测度.改进后用多个公开数据集对这些方法进行了实验评测,结果显示这些方法对分类型数据的聚类是有效的. 相似文献

4.

Identifying cluster number for subspace projected functional data clustering

Pai-Ling Li Jeng-Min Chiou 《Computational statistics & data analysis》2011,55(6):2090-2103

We propose a new approach, the forward functional testing (FFT) procedure, to cluster number selection for functional data clustering. We present a framework of subspace projected functional data clustering based on the functional multiplicative random-effects model, and propose to perform functional hypothesis tests on equivalence of cluster structures to identify the number of clusters. The aim is to find the maximum number of distinctive clusters while retaining significant differences between cluster structures. The null hypotheses comprise equalities between the cluster mean functions and between the sets of cluster eigenfunctions of the covariance kernels. Bootstrap resampling methods are developed to construct reference distributions of the derived test statistics. We compare several other cluster number selection criteria, extended from methods of multivariate data, with the proposed FFT procedure. The performance of the proposed approaches is examined by simulation studies, with applications to clustering gene expression profiles. 相似文献

5.

An effective dissimilarity measure for clustering of high-dimensional categorical data

Jeonghoon Lee Yoon-Joon Lee 《Knowledge and Information Systems》2014,38(3):743-757

Clustering is to group similar data and find out hidden information about the characteristics of dataset for the further analysis. The concept of dissimilarity of objects is a decisive factor for good quality of results in clustering. When attributes of data are not just numerical but categorical and high dimensional, it is not simple to discriminate the dissimilarity of objects which have synonymous values or unimportant attributes. We suggest a method to quantify the level of difference between categorical values and to weigh the implicit influence of each attribute on constructing a particular cluster. Our method exploits distributional information of data correlated with each categorical value so that intrinsic relationship of values can be discovered. In addition, it measures significance of each attribute in constructing respective cluster dynamically. Experiments on real datasets show the propriety and effectiveness of the method, which improves the results considerably even with simple clustering algorithms. Our approach does not couple with a clustering algorithm tightly and can also be applied to various algorithms flexibly. 相似文献

6.

An optimal method for data clustering

Linsen Xie Chengbo Lu Ying Mei Hong Du Zhihong Man 《Neural computing & applications》2016,27(2):283-289

An algorithm for optimizing data clustering in feature space is studied in this work. Using graph Laplacian and extreme learning machine (ELM) mapping technique, we develop an optimal weight matrix W for feature mapping. This work explicitly performs a mapping of the original data for clustering into an optimal feature space, which can further increase the separability of original data in the feature space, and the patterns points in same cluster are still closely clustered. Our method, which can be easily implemented, gets better clustering results than some popular clustering algorithms, like k-means on the original data, kernel clustering method, spectral clustering method, and ELM k-means on data include three UCI real data benchmarks (IRIS data, Wisconsin breast cancer database, and Wine database). 相似文献

7.

MMR: An algorithm for clustering categorical data using Rough Set Theory

Darshit Teresa Jennifer 《Data & Knowledge Engineering》2007,63(3):879-893

A variety of cluster analysis techniques exist to group objects having similar characteristics. However, the implementation of many of these techniques is challenging due to the fact that much of the data contained in today’s databases is categorical in nature. While there have been recent advances in algorithms for clustering categorical data, some are unable to handle uncertainty in the clustering process while others have stability issues. This research proposes a new algorithm for clustering categorical data, termed Min–Min-Roughness (MMR), based on Rough Set Theory (RST), which has the ability to handle the uncertainty in the clustering process. 相似文献

8.

An improved cluster labeling method for support vector clustering 总被引：5，自引：0，他引：5

Lee J Lee D 《IEEE transactions on pattern analysis and machine intelligence》2005,27(3):461-464

The support vector clustering (SVC) algorithm is a recently emerged unsupervised learning method inspired by support vector machines. One key step involved in the SVC algorithm is the cluster assignment of each data point. A new cluster labeling method for SVC is developed based on some invariant topological properties of a trained kernel radius function. Benchmark results show that the proposed method outperforms previously reported labeling techniques. 相似文献

9.

A k-mean clustering algorithm for mixed numeric and categorical data

《Data & Knowledge Engineering》2007,63(2):503-527

相似文献

10.

Optimal clustering based outlier detection and cluster center initialization algorithm for effective tone mapping

Neelima N. Ravi Kumar Yada 《Multimedia Tools and Applications》2019,78(22):31057-31075

Multimedia Tools and Applications - The high dynamic range (HDR) imaging and displaying a wide range of imaging levels in the imaging industry is found in the world using devices with limited... 相似文献

11.

An incremental nested partition method for data clustering

Jyrko Correa-Morris Author Vitae Dustin L. Espinosa-Isidrón^{Author Vitae} 《Pattern recognition》2010,43(7):2439-2455

Clustering methods are a powerful tool for discovering patterns in a given data set through an organization of data into subsets of objects that share common features. Motivated by the independent use of some different partitions criteria and the theoretical and empirical analysis of some of its properties, in this paper, we introduce an incremental nested partition method which combines these partitions criteria for finding the inner structure of static and dynamic datasets. For this, we proved that there are relationships of nesting between partitions obtained, respectively, from these partition criteria, and besides that the sensitivity when a new object arrives to the dataset is rigorously studied. Our algorithm exploits all of these mathematical properties for obtaining the hierarchy of clusterings. Moreover, we realize a theoretical and experimental comparative study of our method with classical hierarchical clustering methods such as single-link and complete-link and other more recently introduced methods. The experimental results over databases of UCI repository and the AFP and TDT2 news collections show the usefulness and capability of our method to reveal different levels of information hidden in datasets. 相似文献

12.

SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index

Ibai Gurrutxaga^{Author Vitae} Iñaki Albisua Author Vitae Author Vitae José I. Martín Author Vitae Author Vitae Jesús M. Pérez Author Vitae Author Vitae 《Pattern recognition》2010,43(10):3364-2170

Hierarchical clustering algorithms provide a set of nested partitions called a cluster hierarchy. Since the hierarchy is usually too complex it is reduced to a single partition by using cluster validity indices. We show that the classical method is often not useful and we propose SEP, a new method that efficiently searches in an extended partition set. Furthermore, we propose a new cluster validity index, COP, since many of the commonly used indices cannot be used with SEP. Experiments performed with 80 synthetic and 7 real datasets confirm that SEP/COP is superior to the method currently used and furthermore, it is less sensitive to noise. 相似文献

13.

Visualization and clustering of categorical data with probabilistic self-organizing map

Mustapha Lebbah Khalid Benabdeslem 《Neural computing & applications》2010,19(3):393-404

This paper introduces a self-organizing map dedicated to clustering, analysis and visualization of categorical data. Usually, when dealing with categorical data, topological maps use an encoding stage: categorical data are changed into numerical vectors and traditional numerical algorithms (SOM) are run. In the present paper, we propose a novel probabilistic formalism of Kohonen map dedicated to categorical data where neurons are represented by probability tables. We do not need to use any coding to encode variables. We evaluate the effectiveness of our model in four examples using real data. Our experiments show that our model provides a good quality of results when dealing with categorical data. 相似文献

14.

On cluster tree for nested and multi-density data clustering

Xutao Li Author Vitae 《Pattern recognition》2010,43(9):3130-3143

Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach—a cluster tree to determine such cluster structure and understand hidden information present in data sets of nested clusters or clusters of multi-density. We embed the agglomerative k-means algorithm in the generation of cluster tree to detect such clusters. Experimental results on both synthetic data sets and real data sets are presented to illustrate the effectiveness of the proposed method. Compared with some existing clustering algorithms (DBSCAN, X-means, BIRCH, CURE, NBC, OPTICS, Neural Gas, Tree-SOM, EnDBSAN and LDBSCAN), our proposed cluster tree approach performs better than these methods. 相似文献

15.

A population initialization method for evolutionary algorithms based on clustering and Cauchy deviates

《Expert systems with applications》2016

The initial population of an evolutionary algorithm is an important factor which affects the convergence rate and ultimately its ability to find high quality solutions or satisfactory solutions for that matter. If composed of good individuals it may bias the search towards promising regions of the search space right from the beginning. Although, if no knowledge about the problem at hand is available, the initial population is most often generated completely random, thus no such behavior can be expected. This paper proposes a method for initializing the population that attempts to identify i.e., to get close to promising parts of the search space and to generate (relatively) good solutions in their proximity. The method is based on clustering and a simple Cauchy mutation. The results obtained on a broad set of standard benchmark functions suggest that the proposed method succeeds in the aforementioned which is most noticeable as an increase in convergence rate compared to the usual initialization approach and a method from the literature. Also, insight into the usefulness of advanced initialization methods in higher-dimensional search spaces is provided, at least to some degree, by the results obtained on higher-dimensional problem instances—the proposed method is beneficial in such spaces as well. Moreover, results on several very high-dimensional problem instances suggest that the proposed method is able to provide a good starting position for the search. 相似文献

16.

An algorithm to find the number of parallel stations for optimal cell scheduling

Gursel A. Suer 《Computers & Industrial Engineering》1992,23(1-4):81-84

This paper focuses on cell scheduling problem in AVON Lomalinda Inc., a jewelry manufacturing company. However, the proposed algorithm can be used in other industries with similar operating conditions as well. The most important feature of the problem under study is that there are more operators allocated to the cells than the number of operations thus making it inevitable to have parallel stations. An optimizing algorithm to find the number of parallel stations with preassigned operations is suggested and later used to find the best common cell size for the entire plant. 相似文献

17.

Maxmin distance sort heuristic-based initial centroid method of partitional clustering for big data mining

Pandey Kamlesh Kumar Shukla Diwakar 《Pattern Analysis & Applications》2022,25(1):139-156

Pattern Analysis and Applications - The revolution of digital and communication technologies is producing an enormous amount of data. Therefore, the nature of classical data changes into big data,... 相似文献

18.

An interval number distance- and ranking-based method for remotely sensed image fuzzy clustering

Jifa Guo Guangxiong Peng 《International journal of remote sensing》2013,34(23):8591-8614

ABSTRACT

Fuzzy c-means clustering is an important non-supervised classification method for remote-sensing images and is based on type-1 fuzzy set theory. Type-1 fuzzy sets use singleton values to express the membership grade; therefore, such sets cannot describe the uncertainty of the membership grade. Interval type-2 fuzzy c-means (IT2FCM) clustering and relevant methods are based on interval type-2 fuzzy sets. Real vectors are used to describe the clustering centres, and the average values of the upper and lower membership grades are used to determine the classification of each pixel. Thus, the width information for interval clustering centres and interval membership grades are ignored. The main contribution of this article is to propose an improved IT2FCM* algorithm by adopting interval number distance (IND) and ranking methods, which use the width information of interval clustering centres and interval membership grades, thus distinguishing this method from existing fuzzy clustering methods. Three different IND definitions are tested, and the distance definition proposed by Li shows the best performance. The second contribution of this work is that two fuzzy cluster validity indices, FS- and XB-, are improved using the IND. Three types of multi/hyperspectral remote-sensing data sets are used to test this algorithm, and the experimental results show that the IT2FCM* algorithm based on the IND proposed by Li performs better than the IT2FCM algorithm using four cluster validity indices, the confusion matrix, and the kappa coefficient (κ). Additionally, the improved FS- index has more indicative ability than the original FS- index. 相似文献

19.

Determining number of clusters and prototype locations via multi-scale clustering

Eiji Nakamura Nasser Kehtarnavaz 《Pattern recognition letters》1998,19(14):1265-1283

In clustering algorithms, it is usually assumed that the number of clusters is known or given. In the absence of such a priori information, a procedure is needed to find an appropriate number of clusters. This paper presents a clustering algorithm that incorporates a mechanism for finding the appropriate number of clusters as well as the locations of cluster prototypes. This algorithm, called multi-scale clustering, is based on scale-space theory by considering that any prominent data structure ought to survive over many scales. The number of clusters as well as the locations of cluster prototypes are found in an objective manner by defining and using lifetime and drift speed clustering criteria. The outcome of this algorithm does not depend on the initial prototype locations that affect the outcome of many clustering algorithms. As an application of this algorithm, it is used to enhance the Hough transform technique. 相似文献

20.

On cluster validity index for estimation of the optimal number of fuzzy clusters

Dae-Won Kim Author Vitae Kwang H. Lee Author Vitae Doheon Lee Author Vitae 《Pattern recognition》2004,37(10):2009-2025

A new cluster validity index is proposed that determines the optimal partition and optimal number of clusters for fuzzy partitions obtained from the fuzzy c-means algorithm. The proposed validity index exploits an overlap measure and a separation measure between clusters. The overlap measure, which indicates the degree of overlap between fuzzy clusters, is obtained by computing an inter-cluster overlap. The separation measure, which indicates the isolation distance between fuzzy clusters, is obtained by computing a distance between fuzzy clusters. A good fuzzy partition is expected to have a low degree of overlap and a larger separation distance. Testing of the proposed index and nine previously formulated indexes on well-known data sets showed the superior effectiveness and reliability of the proposed index in comparison to other indexes. 相似文献