期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

On cluster tree for nested and multi-density data clustering

Xutao Li Author Vitae 《Pattern recognition》2010,43(9):3130-3143

Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach—a cluster tree to determine such cluster structure and understand hidden information present in data sets of nested clusters or clusters of multi-density. We embed the agglomerative k-means algorithm in the generation of cluster tree to detect such clusters. Experimental results on both synthetic data sets and real data sets are presented to illustrate the effectiveness of the proposed method. Compared with some existing clustering algorithms (DBSCAN, X-means, BIRCH, CURE, NBC, OPTICS, Neural Gas, Tree-SOM, EnDBSAN and LDBSCAN), our proposed cluster tree approach performs better than these methods. 相似文献

2.

Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing 总被引：1，自引：6，他引：1

Hisashi Koga Tetsuo Ishibashi Toshinori Watanabe 《Knowledge and Information Systems》2007,12(1):25-53

The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. This algorithm regards each point as a single cluster initially. In the agglomeration step, it connects a pair of clusters such that the distance between the nearest members is the shortest. This step is repeated until only one cluster remains. The single linkage method can efficiently detect clusters in arbitrary shapes. However, a drawback of this method is a large time complexity of O(n ²), where n represents the number of data points. This time complexity makes this method infeasible for large data. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by rapidly finding the near clusters to be connected by Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search. Here, B represents the maximum number of points going into a single hash entry and it practically diminishes to a small constant as compared to n for sufficiently large hash tables. Experimentally, we show that (1) the proposed algorithm obtains clustering results similar to those obtained by the single linkage method and (2) it runs faster for large data than the single linkage method. Hisashi Koga received the M.S. and Ph.D. degree in information science in 1995 and 2002, respectively, from the University of Tokyo. From 1995 to 2003, he worked as a researcher at Fujitsu Laboratories Ltd. Since 2003, he has been a faculty member at the University of Electro-Communications, Tokyo (Japan). Currently, he is an associate professor at the Graduate School of Information Systems, University of Electro-Communications. His research interest includes various kinds of algorithms such as clustering algorithms, on-line algorithms, and algorithms in network communications. Tetsuo Ishibashi received the M.E. degree in information systems design from the Graduate School of Information Systems at the University of Electro-Communications in 2004. Presently, he is a system engineer at Fujitsu Broad Solution & Consulting Inc. Toshinori Watanabe received the B.E. degree in aeronautical engineering in 1971 and the D.E. degree in 1985, both from the University of Tokyo. In 1971, he worked at Hitachi as a researcher in the field of information systems design. His experience includes demand forecasting, inventory and production management, VLSI design automation, knowledge-based nonlinear optimizer, and a case-based evolutionary learning system nicknamed TAMPOPO. He also engaged in FGCS (Fifth Generation Computer System) project of Japan and developed a new hierarchical message-passing parallel cooperative VLSI layout problem solver that ran on PIM (Parallel Inference Machine) in 1991. Since 1992, he has been a professor at the Graduate School of Information Systems, University of Electro-Communications, Tokyo, Japan. His areas of interest include media analysis, learning intelligence, and the semantics of information systems. He is a member of the IEEE. 相似文献

3.

A k-populations algorithm for clustering categorical data

Dae-Won Kim KiYoung Lee Kwang H. Lee 《Pattern recognition》2005,38(7):1131-1134

In this paper, the conventional k-modes-type algorithms for clustering categorical data are extended by representing the clusters of categorical data with k-populations instead of the hard-type centroids used in the conventional algorithms. Use of a population-based centroid representation makes it possible to preserve the uncertainty inherent in data sets as long as possible before actual decisions are made. The k-populations algorithm was found to give markedly better clustering results through various experiments. 相似文献

4.

Density-based hierarchical clustering for streaming data

Q. Tu B. Yuan 《Pattern recognition letters》2012,33(5):641-645

For streaming data that arrive continuously such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when conducting cluster merging, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the variance within cluster, the intra-cluster density and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity. 相似文献

5.

Hierarchical clustering of mixed data based on distance hierarchy 总被引：1，自引：0，他引：1

Chung-Chian Hsu Chin-Long Chen Yu-Wei Su 《Information Sciences》2007,177(20):4474-4492

Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular, to integrate with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degree of similarity. 相似文献

6.

On-line hierarchical clustering 总被引：1，自引：0，他引：1

Yasser El-Sonbaty M.A Ismail 《Pattern recognition letters》1998,19(14):1285-1291

Most of the techniques used in the literature for hierarchical clustering are based on off-line operation. The main contribution of this paper is to propose a new algorithm for on-line hierarchical clustering by finding the nearest k objects to each introduced object so far and these nearest k objects are continuously updated by the arrival of a new object. By final object, we have the objects and their nearest k objects which are sorted to produce the hierarchical dendogram. The results of the application of the new algorithm on real and synthetic data and also using simulation experiments, show that the new technique is quite efficient and, in many respects, superior to traditional off-line hierarchical methods. 相似文献

7.

A time-efficient pattern reduction algorithm for k-means clustering

Ming-Chao Chiang Author Vitae Chun-Wei Tsai Author Vitae Chu-Sing Yang Author Vitae 《Information Sciences》2011,181(4):716-3410

This paper presents an efficient algorithm, called pattern reduction (PR), for reducing the computation time of k-means and k-means-based clustering algorithms. The proposed algorithm works by compressing and removing at each iteration patterns that are unlikely to change their membership thereafter. Not only is the proposed algorithm simple and easy to implement, but it can also be applied to many other iterative clustering algorithms such as kernel-based and population-based clustering algorithms. Our experiments—from 2 to 1000 dimensions and 150 to 10,000,000 patterns—indicate that with a small loss of quality, the proposed algorithm can significantly reduce the computation time of all state-of-the-art clustering algorithms evaluated in this paper, especially for large and high-dimensional data sets. 相似文献

8.

Genetic operators for hierarchical graph clustering

Stefano Rizzi 《Pattern recognition letters》1998,19(14):1293-1300

In this paper we propose an encoding scheme and ad hoc operators for a genetic approach to hierarchical graph clustering. Given a connected graph whose vertices correspond to points within a Euclidean space and a fitness function, a hierarchy of graphs in which each vertex corresponds to a connected subgraph of the graph below is generated. Both the number of clustering levels and the number of clusters on each level are not fixed a priori and are subject to optimization. 相似文献

9.

Data clustering by minimizing disconnectivity 总被引：1，自引：0，他引：1

Jong-Seok Lee Sigurdur Olafsson 《Information Sciences》2011,181(4):732-3410

Identifying clusters of arbitrary shapes remains a challenge in the field of data clustering. We propose a new measure of cluster quality based on minimizing the penalty of disconnection between objects that would be ideally clustered together. This disconnectivity is based on analysis of nearest neighbors and the principle that an object should be in the same cluster as its nearest neighbors. An algorithm called MinDisconnect is proposed that heuristically minimizes disconnectivity and numerical results are presented that indicate that the new algorithm can effectively identify clusters of complex shapes and is robust in finding clusters of arbitrary shapes. 相似文献

10.

MPM: a hierarchical clustering algorithm using matrix partitioning method for non-numeric data

Hewijin Christine Jiau Yi-Jen Su Yeou-Min Lin Shang-Rong Tsai 《Journal of Intelligent Information Systems》2006,26(2):185-207

Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. When performing data mining, traditional clustering algorithms which use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause some valuable information embedded in the data attributes to be lost, and hence low quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement which has the ability to capture the “characterized properties” of information, and (2) the application of matrix permutation and matrix participation partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, the Heuristic_MPM, to reduce the processing times required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time. An erratum to this article is available at . 相似文献

11.

Improvement of the k-means clustering filtering algorithm

Jim Z.C. Lai Author Vitae Author Vitae 《Pattern recognition》2008,41(12):3677-3681

In this paper, we present a modified filtering algorithm (MFA) by making use of center variations to speed up clustering process. Our method first divides clusters into static and active groups. We use the information of cluster displacements to reject unlikely cluster centers for all nodes in the kd-tree. We reduce the computational complexity of filtering algorithm (FA) through finding candidates for each node mainly from the set of active cluster centers. Two conditions for determining the set of candidate cluster centers for each node from active clusters are developed. Our approach is different from the major available algorithm, which passes no information from one stage of iteration to the next. Theoretical analysis shows that our method can reduce the computational complexity, in terms of the number of distance calculations, of FA at each stage of iteration by a factor of FC/AC, where FC and AC are the numbers of total clusters and active clusters, respectively. Compared with the FA, our algorithm can effectively reduce the computing time and number of distance calculations. It is noted that our proposed algorithm can generate the same clusters as that produced by hard k-means clustering. The superiority of our method is more remarkable when a larger data set with higher dimension is used. 相似文献

12.

Dynamic rough clustering and its applications

Georg Peters Richard Weber René Nowatzke 《Applied Soft Computing》2012,12(10):3193-3207

Dynamic data mining has gained increasing attention in the last decade. It addresses changing data structures which can be observed in many real-life applications, e.g. buying behavior of customers. As opposed to classical, i.e. static data mining where the challenge is to discover pattern inherent in given data sets, in dynamic data mining the challenge is to understand – and in some cases even predict – how such pattern will change over time. Since changes in general lead to uncertainty, the appropriate approaches for uncertainty modeling are needed in order to capture, model, and predict the respective phenomena considered in dynamic environments. As a consequence, the combination of dynamic data mining and soft computing is a very promising research area. The proposed algorithm consists of a dynamic clustering cycle when the data set will be refreshed from time to time. Within this cycle criteria check if the newly arrived data have structurally changed in comparison to the data already analyzed. If yes, appropriate actions are triggered, in particular an update of the initial settings of the cluster algorithm. As we will show, rough clustering offers strong tools to detect such changing data structures. To evaluate the proposed dynamic rough clustering algorithm it has been applied to synthetic as well as to real-world data sets where it provides new insights regarding the underlying dynamic phenomena. 相似文献

13.

Stability of a hierarchical clustering

Stephen P. Smith Richard Dubes 《Pattern recognition》1980,12(3):177-187

Clustering algorithms have the annoying habit of finding clusters even when the data are generated randomly. Verifying that potential clusterings are real in some objective sense is receiving more attention as the number of new clustering algorithms and their applications grow. We consider one aspect of this question and study the stability of a hierarchical structure with a variation on a measure of stability proposed in the literature.^(1,2)Our measure of stability is appropriate for proximity matrices whose entries are on an ordinal scale. We randomly split the data set, cluster the two halves, and compare the two hierarchical clusterings with the clustering achieved on the entire data set. Two stability statistics, based on the Goodman-Kruskal rank correlation coefficient, are defined. The distributions of these statistics are estimated with Monte Carlo techniques for two clustering methods (single-link and complete-link) and under two conditions (randomly selected proximity matrices and proximity matrices with good hierarchical structure). The stability measures are applied to some real data sets. 相似文献

14.

An optimal hierarchical clustering algorithm for gene expression data

Sudip Seal 《Information Processing Letters》2005,93(3):143-147

Microarrays are used for measuring expression levels of thousands of genes simultaneously. Clustering algorithms are used on gene expression data to find co-regulated genes. An often used clustering strategy is the Pearson correlation coefficient based hierarchical clustering algorithm presented in [Proc. Nat. Acad. Sci. 95 (25) (1998) 14863-14868], which takes O(N³) time. We note that this run time can be reduced to O(N²) by applying known hierarchical clustering algorithms [Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 619-628] to this problem. In this paper, we present an algorithm which runs in O(NlogN) time using a geometrical reduction and show that it is optimal. 相似文献

15.

Fast global k-means clustering using cluster membership and inequality

Jim Z.C. Lai Author Vitae Author Vitae 《Pattern recognition》2010,43(5):1954-1963

In this paper, we present a fast global k-means clustering algorithm by making use of the cluster membership and geometrical information of a data point. This algorithm is referred to as MFGKM. The algorithm uses a set of inequalities developed in this paper to determine a starting point for the jth cluster center of global k-means clustering. Adopting multiple cluster center selection (MCS) for MFGKM, we also develop another clustering algorithm called MFGKM+MCS. MCS determines more than one starting point for each step of cluster split; while the available fast and modified global k-means clustering algorithms select one starting point for each cluster split. Our proposed method MFGKM can obtain the least distortion; while MFGKM+MCS may give the least computing time. Compared to the modified global k-means clustering algorithm, our method MFGKM can reduce the computing time and number of distance calculations by a factor of 3.78-5.55 and 21.13-31.41, respectively, with the average distortion reduction of 5,487 for the Statlog data set. Compared to the fast global k-means clustering algorithm, our method MFGKM+MCS can reduce the computing time by a factor of 5.78-8.70 with the average reduction of distortion of 30,564 using the same data set. The performances of our proposed methods are more remarkable when a data set with higher dimension is divided into more clusters. 相似文献

16.

The MinMax k-Means clustering algorithm

Grigorios Tzortzis Aristidis Likas 《Pattern recognition》2014

Applying k-Means to minimize the sum of the intra-cluster variances is the most popular clustering approach. However, after a bad initialization, poor local optima can be easily obtained. To tackle the initialization problem of k-Means, we propose the MinMax k-Means algorithm, a method that assigns weights to the clusters relative to their variance and optimizes a weighted version of the k-Means objective. Weights are learned together with the cluster assignments, through an iterative procedure. The proposed weighting scheme limits the emergence of large variance clusters and allows high quality solutions to be systematically uncovered, irrespective of the initialization. Experiments verify the effectiveness of our approach and its robustness over bad initializations, as it compares favorably to both k-Means and other methods from the literature that consider the k-Means initialization problem. 相似文献

17.

Equally contributory privacy-preserving k-means clustering over vertically partitioned data

Xun Yi Yanchun Zhang 《Information Systems》2013

In recent years, there have been numerous attempts to extend the k-means clustering protocol for single database to a distributed multiple database setting and meanwhile keep privacy of each data site. Current solutions for (whether two or more) multiparty k-means clustering, built on one or more secure two-party computation algorithms, are not equally contributory, in other words, each party does not equally contribute to k-means clustering. This may lead a perfidious attack where a party who learns the outcome prior to other parties tells a lie of the outcome to other parties. In this paper, we present an equally contributory multiparty k-means clustering protocol for vertically partitioned data, in which each party equally contributes to k-means clustering. Our protocol is built on ElGamal's encryption scheme, Jakobsson and Juels's plaintext equivalence test protocol, and mix networks, and protects privacy in terms that each iteration of k-means clustering can be performed without revealing the intermediate values. 相似文献

18.

Evaluation of the performance of clustering algorithms in kernel-induced feature space

Dae-Won Kim Ki Young Lee Kwang H. Lee 《Pattern recognition》2005,38(4):607-611

By using a kernel function, data that are not easily separable in the original space can be clustered into homogeneous groups in the implicitly transformed high-dimensional feature space. Kernel k-means algorithms have recently been shown to perform better than conventional k-means algorithms in unsupervised classification. However, few reports have examined the benefits of using a kernel function and the relative merits of the various kernel clustering algorithms with regard to the data distribution. In this study, we reformulated four representative clustering algorithms based on a kernel function and evaluated their performances for various data sets. The results indicate that each kernel clustering algorithm gives markedly better performance than its conventional counterpart for almost all data sets. Of the kernel clustering algorithms studied in the present work, the kernel average linkage algorithm gives the most accurate clustering results. 相似文献

19.

基于层次分析法的模糊分类优选模型 总被引：1，自引：0，他引：1

李春生王耀南陈光辉蒋宏锋《控制与决策》2009,24(12)

不同的模糊分类算法在同一个数据集合上常会产生不同的模糊分类．究竟哪种方法最能揭示数据的真实结构,对此,以模糊分类有效性指标为评价指标,应用层次分析法对各模糊分类进行综合评价,建立了一个模糊分类优选模型．大量实验表明,该优选模型所选出的最优模糊分类,其模式识别率高,能揭示数据的真实结构．相似文献

20.

The global k-means clustering algorithm

Aristidis LikasAuthor Vitae Nikos VlassisAuthor VitaeJakob J. VerbeekAuthor Vitae 《Pattern recognition》2003,36(2):451-461

We present the global k-means algorithm which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts. 相似文献