Similar Literature
20 similar documents found.
1.
The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, the k-modes algorithm converges to numerous local minima, so its performance strongly depends on the initial cluster centers. Most existing cluster-center initialization methods are designed for numerical data, and because categorical data lack the geometry of numerical data, those methods are not applicable to categorical data. This paper proposes a novel initialization method for categorical data that is incorporated into the k-modes algorithm. The method combines distance and density to select initial cluster centers and overcomes shortcomings of existing initialization methods for categorical data. Experimental results show that the proposed initialization method is effective and, owing to its linear time complexity with respect to the number of data objects, can be applied to large data sets.
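As a rough illustration of the distance-plus-density idea, the sketch below picks the densest object as the first center and then greedily picks objects that are both dense and far from the centers already chosen, using the simple-matching distance for categorical attributes. The scoring rule used here (density times distance to the nearest chosen center) is an assumption for illustration, not the authors' exact formula.

```python
import numpy as np

def matching_distance(a, b):
    # Simple matching dissimilarity: number of attributes that differ.
    return np.sum(a != b)

def select_initial_modes(X, k):
    """Pick k initial modes by combining density and distance (illustrative only)."""
    n = X.shape[0]
    dist = np.array([[matching_distance(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Density of an object: inverse of its average distance to all other objects.
    density = 1.0 / (1.0 + dist.mean(axis=1))
    centers = [int(np.argmax(density))]          # densest object becomes the first center
    for _ in range(1, k):
        # Score remaining objects by density times distance to the nearest chosen center.
        d_to_centers = dist[:, centers].min(axis=1)
        score = density * d_to_centers
        score[centers] = -np.inf                 # never re-pick a chosen center
        centers.append(int(np.argmax(score)))
    return X[centers]

# Toy categorical data encoded as integer codes per attribute.
X = np.array([[0, 1, 2], [0, 1, 1], [2, 0, 0], [2, 0, 1], [1, 2, 2], [1, 2, 0]])
print(select_initial_modes(X, k=3))
```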

2.
In fuzzy k-modes clustering, each individual has only a single membership degree per cluster, which may not be sufficient to model the ambiguity of the data precisely. Multivariate thinking, by contrast, exposes the inherent structure and meaning within the set of classified variables. In this paper, a multivariate approach to membership degrees is presented to better handle ambiguous data that share properties of different clusters. The method is compared with other fuzzy k-modes methods from the literature using a multivariate internal index that is also proposed in this paper. Synthetic and real categorical data sets are considered in the study.

3.
Bank failures threaten the economic system as a whole, so predicting bank financial failures is crucial to preventing or lessening their negative effects on the economic system. At its core this is a classification problem: categorizing banks as healthy or unhealthy. This study applies various neural network techniques, support vector machines and multivariate statistical methods to the bank failure prediction problem in a Turkish case, and presents a comprehensive computational comparison of the classification performance of the techniques tested. Twenty financial ratios in six feature groups (capital adequacy, asset quality, management quality, earnings, liquidity and sensitivity to market risk: CAMELS) are selected as predictor variables. Four data sets with different characteristics are developed from official financial data to improve prediction performance, and each is divided into training and validation sets. Among neural networks, four architectures are employed: multi-layer perceptron, competitive learning, self-organizing map and learning vector quantization. The multivariate statistical methods tested are multivariate discriminant analysis, k-means cluster analysis and logistic regression analysis. Experimental results are evaluated with respect to classification accuracy, and show that the multi-layer perceptron and learning vector quantization can be considered the most successful models for predicting the financial failure of banks.

4.
Categorical data clustering is a difficult and challenging task due to a special characteristic of categorical attributes: they have no natural order. This study therefore proposes a two-stage method, the partition-and-merge based fuzzy genetic clustering algorithm (PM-FGCA), for categorical data. In the first stage, PM-FGCA uses a fuzzy genetic clustering algorithm to partition the dataset into a maximum number of clusters. In the merge stage, two of the clusters generated in the first stage are selected based on their inter-cluster distance and merged into one cluster; this is repeated until the number of clusters equals the predetermined number. Thereafter, particular instances in each cluster may be reassigned to other clusters based on intra-cluster distances. PM-FGCA is evaluated on ten categorical datasets from the UCI machine learning repository and compared with existing methods such as the k-modes algorithm, the fuzzy k-modes algorithm, the genetic fuzzy k-modes algorithm, and a non-dominated sorting genetic algorithm using fuzzy membership chromosomes. The Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Davies–Bouldin (DB) index are used as validation indices, covering both external (ARI and NMI) and internal (DB) criteria. The experimental results show that PM-FGCA outperforms the benchmark methods on the tested indices.
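The merge stage can be pictured as a small agglomerative loop. The sketch below assumes the inter-cluster distance is the average pairwise simple-matching distance between two clusters (the paper's exact distance and the fuzzy genetic first stage are not reproduced) and merges the two closest clusters until the target number of clusters is reached.

```python
import numpy as np

def simple_matching(a, b):
    return np.mean(a != b)          # fraction of attributes that differ

def intercluster_distance(X, members_i, members_j):
    # Assumption: average pairwise simple-matching distance between the two clusters.
    return np.mean([simple_matching(X[p], X[q]) for p in members_i for q in members_j])

def merge_stage(X, labels, k_target):
    """Repeatedly merge the two closest clusters until k_target clusters remain."""
    clusters = {c: list(np.where(labels == c)[0]) for c in np.unique(labels)}
    while len(clusters) > k_target:
        ids = list(clusters)
        best = None
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = intercluster_distance(X, clusters[ids[a]], clusters[ids[b]])
                if best is None or d < best[0]:
                    best = (d, ids[a], ids[b])
        _, ci, cj = best
        clusters[ci].extend(clusters.pop(cj))     # merge cluster cj into cluster ci
    new_labels = np.empty(len(X), dtype=int)
    for new_id, members in enumerate(clusters.values()):
        new_labels[members] = new_id
    return new_labels

# Toy run: start from singleton clusters and merge down to three.
X = np.array([[0, 1, 2], [0, 1, 1], [2, 0, 0], [2, 0, 1], [1, 2, 2], [1, 2, 0]])
print(merge_stage(X, np.array([0, 1, 2, 3, 4, 5]), k_target=3))
```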

5.
The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, in k-modes-type algorithms the clustering performance depends on the initial cluster centers, and the number of clusters must be known or given in advance. This paper proposes a novel initialization method for categorical data that is incorporated into k-modes-type algorithms. The proposed method not only obtains good initial cluster centers but also provides a criterion for finding candidate numbers of clusters. The performance and scalability of the proposed method have been studied on real data sets. The experimental results show that the proposed method is effective and, owing to its linear time complexity with respect to the number of data points, can be applied to large data sets.

6.
In this paper, conventional k-modes-type algorithms for clustering categorical data are extended by representing clusters with k populations instead of the hard centroids used in the conventional algorithms. A population-based centroid representation makes it possible to preserve the uncertainty inherent in the data for as long as possible before actual decisions are made. In various experiments, the k-populations algorithm was found to give markedly better clustering results.
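One way to read a population-based centroid is as a per-attribute table of category frequencies rather than a single mode; an object's dissimilarity to such a centroid can then be an expected mismatch count. The sketch below is only that reading, not necessarily the paper's exact formulation.

```python
import numpy as np
from collections import Counter

def population_centroid(cluster):
    """Represent a cluster as per-attribute category frequencies instead of a single mode."""
    cluster = np.asarray(cluster)
    n, m = cluster.shape
    return [{cat: cnt / n for cat, cnt in Counter(cluster[:, j]).items()} for j in range(m)]

def dissimilarity(x, centroid):
    # Expected mismatch under the population: 1 - frequency of x's category, per attribute.
    return sum(1.0 - freqs.get(x[j], 0.0) for j, freqs in enumerate(centroid))

cluster = [["red", "small"], ["red", "large"], ["blue", "small"]]
c = population_centroid(cluster)
print(dissimilarity(["red", "small"], c))    # low: both values are frequent in the cluster
print(dissimilarity(["green", "large"], c))  # high: one unseen value, one rare value
```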

7.
We describe an approach to machine learning from numerical data that combines qualitative and numerical learning. The approach proceeds in two stages: (1) induction of a qualitative model from numerical examples of the behaviour of a physical system, and (2) induction of a numerical regression function that both respects the qualitative constraints and fits the training data numerically. We call this approach Q2 learning, which stands for Qualitatively faithful Quantitative learning. The induced numerical models are “qualitatively faithful” in the sense that they respect qualitative trends in the learning data. Advantages of Q2 learning are that the induced qualitative model enables a (possibly causal) explanation of the relations among the variables in the modelled system, and that numerical predictions are guaranteed to be qualitatively consistent with the qualitative model, which eases the interpretation of the predictions. Moreover, as we show experimentally, the qualitative model's guidance of the quantitative modelling process leads to predictions that can be considerably more accurate than those obtained by state-of-the-art numerical learning methods. The experiments include an application of Q2 learning to the identification of a car wheel suspension system, a complex and industrially relevant mechanical system.

8.
In cluster analysis, one of the most challenging problems is determining the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. Many algorithms have been proposed to solve this problem for either numerical or categorical data sets, but they are not very effective for mixed data sets containing both numerical and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy. The mechanism uniformly characterizes within-cluster and between-cluster entropy and identifies the worst cluster in a mixed data set. To evaluate clustering results for mixed data, an effective cluster validity index is also defined. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. Its performance has been studied on several synthetic and real-world data sets. Comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results.
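For reference, the two entropies being combined have standard forms; the paper's specific within-cluster/between-cluster integration and its new k-prototypes dissimilarity are not reproduced here. For a discrete distribution $p_1,\dots,p_m$,

\[
H_\alpha(p) \;=\; \frac{1}{1-\alpha}\,\log\!\Big(\sum_{i=1}^{m} p_i^{\alpha}\Big), \quad \alpha>0,\ \alpha\neq 1,
\qquad
E(p) \;=\; \sum_{i=1}^{m} p_i\,(1-p_i),
\]

where $H_\alpha$ is the Rényi entropy (usually applied to the numerical attributes) and $E$ is the complement entropy commonly used for categorical attributes in rough-set theory.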

9.
10.
Basing cluster analysis on mixture models has become a classical and powerful approach. Until now, this approach, which explains some classic clustering criteria such as the well-known k-means criterion and allows more general criteria to be proposed, has been developed to classify a set of objects measured on a set of variables. While most clustering procedures for such data construct an optimal partition of the objects or, sometimes, of the variables, other methods, called block clustering methods, consider the two sets simultaneously and organize the data into homogeneous blocks. In this work, a new mixture model, the block mixture model, is proposed to handle this situation. It embeds the simultaneous clustering of objects and variables in a mixture approach. We first consider the probabilistic model in a general setting and develop a new simultaneous-partitioning algorithm based on the CEM algorithm. We then focus on binary data and show that our approach extends a block clustering method previously proposed for that case. Simplicity, fast convergence and the ability to process large data sets are the major advantages of the proposed approach.

11.
Cluster analysis is sensitive to noise variables intrinsically contained within high-dimensional data sets. As data sets grow, clustering techniques robust to noise variables must be identified. This investigation gauges the capabilities of recent clustering algorithms applied to two real data sets increasingly perturbed by superfluous noise variables. The techniques include mixture models of factor analysers and auto-associative multivariate regression trees. Statistical techniques are integrated to create two approaches useful for clustering noisy data: multivariate regression trees with principal component scores and multivariate regression trees with factor scores. The tree-based techniques produce the best clustering results.

12.
Benchmarking pattern recognition, machine learning and data mining methods commonly relies on real-world data sets. However, real-world data have some disadvantages. On one hand, collecting real-world data can be difficult or impossible for various reasons; on the other hand, real-world variables are hard to control, even in the problem domain, and in the feature domain, where most statistical learning methods operate, exercising control is even harder and hence rarely attempted. This is at odds with scientific experimentation guidelines mandating the use of variables that are as directly controllable and observable as possible. Synthetic data therefore possesses certain advantages over real-world data sets. In this paper we propose a method that produces synthetic data with guaranteed global and class-specific statistical properties. The method is based on overlapping class densities placed at the corners of a regular k-simplex. The generator can be used for algorithm testing and fair performance evaluation of statistical learning methods. Because of its strong properties, researchers can reproduce each other's experiments simply by sharing the parameters used, instead of transmitting large data sets.
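The simplex-corner construction is easy to sketch: the k unit basis vectors of R^k are mutually equidistant, so they form the corners of a regular (k-1)-simplex, and one class density can be centred on each corner. The isotropic Gaussian densities and single spread parameter below are illustrative assumptions; the paper's generator exposes more controls over global and class-specific properties.

```python
import numpy as np

def simplex_corner_data(n_per_class=100, k=3, spread=0.6, seed=0):
    """Toy generator: one isotropic Gaussian class density at each corner of a regular simplex.
    The k unit basis vectors of R^k are pairwise sqrt(2) apart, i.e. a regular (k-1)-simplex."""
    rng = np.random.default_rng(seed)
    corners = np.eye(k)
    X, y = [], []
    for c in range(k):
        X.append(rng.normal(loc=corners[c], scale=spread, size=(n_per_class, k)))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

X, y = simplex_corner_data()
print(X.shape, np.bincount(y))   # (300, 3) [100 100 100]; larger `spread` means more class overlap
```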

13.
《Information Fusion》2008,9(2):223-233
Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way similar to the popular k-means algorithm, and the goodness of the clustering at each step is evaluated using a mutual-information-based criterion, namely average normalized mutual information (ANMI), borrowed from cluster ensembles. The algorithm is easy to implement, requiring multiple hash tables as its only major data structure. Experimental results on real datasets show that k-ANMI is competitive with state-of-the-art categorical data clustering algorithms in terms of clustering accuracy.
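The ANMI objective itself is easy to state: treat each categorical attribute as a partition of the data and average the normalized mutual information between a candidate clustering and each attribute partition. The sketch below computes that score with scikit-learn; the paper's hash-table-based, k-means-style optimization of this objective is not reproduced.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(labels, X_categorical):
    """Average NMI between a candidate clustering and the partitions induced by each
    categorical attribute (the cluster-ensemble ANMI objective)."""
    m = X_categorical.shape[1]
    return np.mean([normalized_mutual_info_score(labels, X_categorical[:, j]) for j in range(m)])

# Toy comparison: a clustering that tracks the first attribute scores higher than an arbitrary one.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [2, 0], [2, 1]])
print(anmi(np.array([0, 0, 1, 1, 2, 2]), X))
print(anmi(np.array([0, 1, 0, 1, 0, 1]), X))
```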

14.
This article proposes an optimized instance-based learning approach for predicting the compressive strength of high performance concrete from mix data, such as water to binder ratio, water content, super-plasticizer content, fly ash content, etc. The base algorithm used in this study is the k nearest neighbor algorithm, an instance-based machine learning algorithm. Five different models were developed and analyzed to investigate the effects of the number of neighbors, the distance function and the attribute weights on model performance. For each model, a modified version of the differential evolution algorithm was used to find the optimal model parameters. Two further models based on a generalized regression neural network and stepwise regression were also developed. The performance of the models was evaluated on a set of high strength concrete mix data. The results indicate that the optimized models outperform those derived from the standard k nearest neighbor algorithm, and that the proposed models perform better than the generalized regression neural network, stepwise regression and modular neural network models.
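A minimal sketch of the attribute-weighted k-nearest-neighbour predictor is given below, with fixed illustrative weights, k and distance function; in the paper these are tuned by a modified differential evolution search, which is omitted here, and the variable names in the comments are only examples of concrete mix features.

```python
import numpy as np

def weighted_knn_predict(x_query, X_train, y_train, weights, k=5):
    """Attribute-weighted kNN regression: weighted Euclidean distance, prediction is the
    mean target value of the k nearest neighbours."""
    d = np.sqrt(((X_train - x_query) ** 2 * weights).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
X_train = rng.random((50, 4))                              # e.g. w/b ratio, water, superplasticizer, fly ash
y_train = 30 + 40 * X_train[:, 0] + rng.normal(0, 2, 50)   # synthetic "strength" values for the demo
weights = np.array([2.0, 1.0, 0.5, 0.5])                   # assumed attribute weights, not the optimized ones
print(weighted_knn_predict(X_train[0], X_train, y_train, weights, k=5))
```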

15.
Clustering consists in partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. Nowadays, however, commercial and scientific databases usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure capable of dealing with tree-structured categorical data, so it can be used to extend the various versions of the very popular k-means clustering algorithm to such data. We discuss how this extension can be achieved. Moreover, we show empirically that the proposed dissimilarity measure is accurate compared to other well-known (dis)similarity measures for categorical data.

16.
Many financial organizations such as banks and retailers rely heavily on computational credit risk analysis (CRA) tools due to recent financial crises and stricter regulations. This strategy enables them to manage their financial and operational risks within the pool of financial institutions. Machine learning algorithms, especially binary classifiers, are very popular for that purpose. In real-life applications such as CRA, feature selection algorithms are used to decrease data acquisition cost and to increase the interpretability of the decision process. Applying feature selection methods directly to CRA data sets may not help because of categorical variables such as marital status: such features are usually converted into binary features using 1-of-k encoding, and eliminating only a subset of features from a group helps neither data collection cost nor interpretability. In this study, we propose to use the probit classifier with a proper prior structure and multiple kernel learning with a proper kernel construction procedure to perform group-wise feature selection (i.e., eliminating a group of features together if they are not helpful). Experiments on two standard CRA data sets show the validity and effectiveness of the proposed binary classification algorithm variants.
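The group structure the abstract refers to is simply the set of binary columns produced by 1-of-k encoding a single categorical variable. The sketch below records that mapping so a feature can only be kept or dropped as a whole group; the probit prior structure and multiple kernel learning machinery that perform the actual selection are not shown, and the marital_status example column is hypothetical.

```python
import pandas as pd

def one_hot_with_groups(df, categorical_cols):
    """1-of-k encode each categorical column and remember which binary columns belong to
    which original feature, so selection can keep or drop a whole group together."""
    encoded = pd.get_dummies(df, columns=categorical_cols)
    groups = {col: [c for c in encoded.columns if c.startswith(col + "_")] for col in categorical_cols}
    return encoded, groups

df = pd.DataFrame({"income": [30, 55, 42], "marital_status": ["single", "married", "divorced"]})
X, groups = one_hot_with_groups(df, ["marital_status"])
print(groups["marital_status"])                          # binary columns forming one selectable group
X_reduced = X.drop(columns=groups["marital_status"])     # group-wise elimination of the whole feature
print(X_reduced.columns.tolist())                        # ['income']
```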

17.
Motivated by the stochastic representation of the univariate zero-inflated Poisson (ZIP) random variable, the authors propose a multivariate ZIP distribution, called the Type I multivariate ZIP distribution, to model correlated multivariate count data with excess zeros. The distributional theory and associated properties are developed. Maximum likelihood estimates of the parameters of interest are obtained by Fisher's scoring algorithm and the expectation–maximization (EM) algorithm, respectively. Asymptotic and bootstrap confidence intervals for the parameters are provided. A likelihood ratio test and a score test are derived and compared via simulation studies. Bayesian methods are also presented for the case where prior information on the parameters is available. Two real data sets are used to illustrate the proposed methods. Under both AIC and BIC, our analysis of the two data sets supports the Type I multivariate zero-inflated Poisson model as a much less complex yet feasible alternative to the existing multivariate ZIP models proposed by Li et al. (Technometrics, 1999, Vol 41, 29–38).
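The univariate stochastic representation the construction starts from is standard: a Bernoulli switch that zeroes out a Poisson count. With zero-inflation probability $\phi$ and Poisson mean $\lambda$ (the multivariate extension itself is not reproduced here),

\[
Y = Z\,X,\qquad Z \sim \mathrm{Bernoulli}(1-\phi),\quad X \sim \mathrm{Poisson}(\lambda),\quad Z \perp X,
\]

so that

\[
\Pr(Y=0) = \phi + (1-\phi)e^{-\lambda}, \qquad
\Pr(Y=y) = (1-\phi)\,\frac{\lambda^{y}e^{-\lambda}}{y!},\quad y = 1,2,\dots
\]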

18.
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To address this issue effectively, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data is developed that calculates two weights for each attribute (or dimension) in each cluster and uses the weight values to identify the subsets of important attributes that characterize the different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experiments show that the proposed algorithm is effective in clustering categorical data and scales to large data sets owing to its linear time complexity with respect to the number of data objects, attributes or clusters.
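As a stand-in for the paper's two-weight scheme, which the abstract does not spell out, the sketch below derives a single weight per attribute per cluster from how concentrated the attribute's values are on the cluster mode, and uses the weights in a weighted simple-matching dissimilarity.

```python
import numpy as np
from collections import Counter

def attribute_weights(cluster):
    """Per-attribute weights for one cluster: attributes whose values concentrate on the cluster
    mode get larger weights. (Illustrative stand-in for the paper's two-weight scheme.)"""
    cluster = np.asarray(cluster)
    n, m = cluster.shape
    purity = np.array([Counter(cluster[:, j]).most_common(1)[0][1] / n for j in range(m)])
    return purity / purity.sum()              # normalize so the weights sum to 1

def weighted_dissimilarity(x, mode, weights):
    # Weighted simple matching: mismatches on important attributes cost more.
    return float(np.sum(weights * (np.asarray(x) != np.asarray(mode))))

cluster = np.array([["a", "x"], ["a", "y"], ["a", "z"]])   # attribute 0 is the discriminative one
w = attribute_weights(cluster)
print(w)                                                    # attribute 0 receives the larger weight
print(weighted_dissimilarity(["b", "x"], ["a", "x"], w))
```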

19.
Although there have been several works on the analysis of the association between genotype and phenotype, little exists on association analysis between a haplotype, or sets of haplotypes, and multivariate quantitative responses. For example, QTLmarc can analyse multivariate responses, but it cannot be applied to stochastic diplotype configurations or complex genetic models. The present paper proposes a method of association analysis between diplotype configuration and multivariate quantitative responses under the dominant, recessive and additive models. A comparative study between the proposed method and QTLmarc is performed on numerical examples and small simulated data sets, with actual genotype information taken from the HapMap project data and artificial quantitative phenotype data following multivariate normal distributions. The results show that the proposed method is superior to QTLmarc in finding the assumed association.

20.
Many pattern classification algorithms such as Support Vector Machines (SVMs), Multi-Layer Perceptrons (MLPs), and K-Nearest Neighbors (KNNs) require data to consist of purely numerical variables. However, many real-world data sets consist of both categorical and numerical variables. In this paper we propose an effective method of converting mixed categorical and numerical data into purely numerical data for binary classification. Because the method is based on the theory of learning Bayesian Network Classifiers (BNCs), it is computationally efficient and robust to noise and data losses. The method is also expected to extract sufficient information for estimating a minimum-error-rate (MER) classifier. Simulations on artificial and real-world data sets demonstrate the competitiveness of the method when the number of values per categorical variable is large and BNCs model the data accurately.
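The abstract does not spell out the BNC-based conversion, so the sketch below shows only the simplest member of that family as an assumed illustration: each categorical value is replaced by a smoothed naive-Bayes log-likelihood ratio between the two classes, turning the column into a single numerical feature.

```python
import numpy as np
import pandas as pd

def loglik_ratio_encoding(values, labels, alpha=1.0):
    """Replace each categorical value v with log P(v | y=1) - log P(v | y=0), estimated with
    Laplace smoothing. This is a naive-Bayes-style encoding; the paper's BNC-based conversion
    may use a richer network structure."""
    values, labels = pd.Series(values), np.asarray(labels)
    cats = values.unique()
    table = {}
    for c in cats:
        p1 = (np.sum((values == c) & (labels == 1)) + alpha) / (np.sum(labels == 1) + alpha * len(cats))
        p0 = (np.sum((values == c) & (labels == 0)) + alpha) / (np.sum(labels == 0) + alpha * len(cats))
        table[c] = np.log(p1) - np.log(p0)
    return values.map(table).to_numpy(), table

vals = ["single", "married", "married", "single", "divorced", "married"]
y    = [1, 0, 0, 1, 1, 0]
encoded, table = loglik_ratio_encoding(vals, y)
print(table)       # each category now maps to a single numerical score
```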
