1.

Fast and exact out-of-core and distributed k-means clustering (total citations: 1; self-citations: 2; citations by others: 1)

Ruoming Jin Anjan Goswami Gagan Agrawal 《Knowledge and Information Systems》2006,10(1):17-40

Clustering has been one of the most widely studied topics in data mining, and k-means clustering has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes over the entire dataset.

In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means.

This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are downloading all data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.

Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received a BE and an ME degree in computer engineering from Beihang University (BUAA), China in 1996 and 1999, respectively. He earned his MS degree in computer science from University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD.

Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he became interested in studying computer science. He then obtained a master's degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he wrote a master's thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan's favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer.

Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech degree from Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D. degrees from University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.
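
The abstract's remark that k-means "requires several passes on the entire dataset" can be illustrated with a minimal sketch of the standard (Lloyd) iteration. This is not FEKM or DFEKM, whose details the abstract does not give; the function names `squared_distance` and `kmeans` and the `max_passes` parameter are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centres, max_passes=100):
    """Plain Lloyd-style k-means (illustrative sketch, not FEKM).

    Returns (centres, passes), where passes counts how many times
    the whole dataset was scanned before the centres stopped moving.
    """
    centres = [tuple(c) for c in initial_centres]
    dim = len(centres[0])
    passes = 0
    while passes < max_passes:
        passes += 1
        # One full pass: assign each point to its nearest centre
        # and accumulate per-cluster coordinate sums and counts.
        sums = [[0.0] * dim for _ in centres]
        counts = [0] * len(centres)
        for p in points:
            j = min(range(len(centres)),
                    key=lambda i: squared_distance(p, centres[i]))
            counts[j] += 1
            for d, value in enumerate(p):
                sums[j][d] += value
        # Recompute each centre as the mean of its assigned points;
        # an empty cluster keeps its previous centre.
        new_centres = []
        for j in range(len(centres)):
            if counts[j]:
                new_centres.append(tuple(v / counts[j] for v in sums[j]))
            else:
                new_centres.append(centres[j])
        if new_centres == centres:
            break  # converged: this pass changed nothing
        centres = new_centres
    return centres, passes
```

Even on four one-dimensional points, `kmeans([(0.0,), (1.0,), (10.0,), (11.0,)], [(0.0,), (1.0,)])` scans the data three times before converging, which is the per-iteration cost that becomes expensive when the dataset lives on disk.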

2.

Rough k-means clustering algorithm with adaptive parameters

Zhou Tao 《Computer Engineering and Applications》2010,46(26):7-10

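Rough clustering is an effective approach among clustering algorithms that handle uncertainty. By analyzing the rough k-means algorithm, this paper points out the shortcomings in how its three parameters w_l, w_u and ε are set, and proposes an adaptive rough k-means clustering algorithm. The algorithm further improves the clustering quality of rough k-means and reduces its sensitivity to "noise". Experiments verify the effectiveness of the algorithm.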
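
To make the roles of w_l, w_u and ε concrete, here is a minimal sketch of one common formulation of (non-adaptive) rough k-means. It is not the adaptive method of this paper, which the abstract does not specify. The assumptions are: w_l weights the lower approximation and w_u the boundary region with w_l + w_u = 1, and ε is a threshold on the difference between a point's distance to a cluster and its distance to the nearest cluster. The function names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def rough_centroid(lower, boundary, w_l, w_u):
    """Centroid of a rough cluster (assumed formulation, w_l + w_u = 1).

    lower:    points that certainly belong to the cluster
    boundary: points in the upper approximation but not the lower one
    """
    if not boundary:
        return mean(lower)
    if not lower:
        return mean(boundary)
    m_low, m_bnd = mean(lower), mean(boundary)
    return [w_l * a + w_u * b for a, b in zip(m_low, m_bnd)]


def rough_assign(distances, eps):
    """Assign one point given its distances to every centroid.

    Returns (lower, upper): upper lists every cluster whose distance is
    within eps of the nearest one; lower is the nearest cluster when it
    is the only candidate, otherwise None (the point is a boundary point).
    """
    nearest = min(range(len(distances)), key=lambda i: distances[i])
    upper = [i for i, d in enumerate(distances)
             if d - distances[nearest] <= eps]
    lower = nearest if len(upper) == 1 else None
    return lower, upper
```

In this sketch a larger ε pushes more points into boundary regions, while w_u controls how strongly those ambiguous points pull the centroid, which is why fixed settings of the three parameters can matter for noisy data.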

3.

Text document clustering based on neighbors

Congnan Yanjun Soon M. 《Data & Knowledge Engineering》2009,68(11):1271

Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported to perform well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered neighbors of each other, and the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link bring global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.
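
The neighbor and link definitions in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the similarity threshold `theta` is a hypothetical parameter name, documents are assumed to be dense term-weight vectors, and each document is counted as its own neighbor (an assumption made here for simplicity).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_matrix(docs, theta):
    """0/1 matrix: entry [i][j] is 1 when documents i and j are neighbors,
    i.e. their cosine similarity is at least theta (or i == j)."""
    n = len(docs)
    return [[1 if i == j or cosine(docs[i], docs[j]) >= theta else 0
             for j in range(n)]
            for i in range(n)]


def link(neighbors, i, j):
    """Number of common neighbors of documents i and j."""
    return sum(a * b for a, b in zip(neighbors[i], neighbors[j]))
```

For three documents `[[1, 0], [1, 1], [0, 1]]` with `theta = 0.7`, the first and third have cosine similarity 0 and are not neighbors, yet their link is 1 because both are neighbors of the second document. That is the sense in which link adds information beyond pairwise similarity.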

4.

Parallel bisecting k-means with prediction clustering algorithm

Yanjun Li Soon M. Chung 《The Journal of supercomputing》2007,39(1):19-37

In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters. This research was supported in part by AFRL/Wright Brothers Institute (WBI).

5.

A supervised clustering algorithm for computer intrusion detection (total citations: 2; self-citations: 1; citations by others: 1)

Xiangyang Li Nong Ye 《Knowledge and Information Systems》2005,8(4):498-509

We previously developed a clustering and classification algorithm – supervised (CCAS) to learn patterns of normal and intrusive activities and to classify observed system activities. Here we further enhance the robustness of CCAS to the presentation order of the training data and to noise in the training data. This robust CCAS adds data redistribution, a supervised hierarchical grouping of clusters, and removal of outliers as postprocessing steps.

6.

A niching genetic k-means algorithm and its applications to gene expression data

Weiguo Sheng Allan Tucker Xiaohui Liu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2010,14(1):9-19

Partitional clustering is a common approach to cluster analysis. Although many algorithms have been proposed, partitional clustering remains a challenging problem with respect to the reliability and efficiency of recovering high quality solutions in terms of its criterion functions. In this paper, we propose a niching genetic k-means algorithm (NGKA) for partitional clustering, which aims at reliably and efficiently identifying high quality solutions in terms of the sum of squared errors criterion. Within the NGKA, we design a niching method, which encourages mating among similar clustering solutions while allowing for some competition among dissimilar solutions, and integrate it into a genetic algorithm to prevent premature convergence during the evolutionary clustering search. Further, we incorporate one step of the k-means operation into the regeneration steps of the resulting niching genetic algorithm to improve its computational efficiency. The proposed algorithm was applied to cluster both simulated data and gene expression data and compared with previous work. Experimental results clearly show that the NGKA is an effective clustering algorithm and outperforms two other genetic algorithm based clustering methods implemented for comparison.

7.

Ensemble classification based on supervised clustering for credit scoring

《Applied Soft Computing》2016

Credit scoring aims to assess the risk associated with lending to individual consumers. Recently, ensemble classification methodology has become popular in this field. However, most studies use random sampling to generate training subsets for constructing the base classifiers. Therefore, the diversity of the base classifiers is not guaranteed, which may lead to a degradation of overall classification performance. In this paper, we propose an ensemble classification approach based on supervised clustering for credit scoring. In the proposed approach, supervised clustering is employed to partition the data samples of each class into a number of clusters. Clusters from different classes are then pairwise combined to form a number of training subsets. In each training subset, a specific base classifier is constructed. For a sample whose class label needs to be predicted, the outputs of these base classifiers are combined by weighted voting. The weight associated with a base classifier is determined by its classification performance in the neighborhood of the sample. In the experimental study, two benchmark credit data sets are adopted for performance evaluation, and an industrial case study is conducted. The results show that compared to other ensemble classification methods, the proposed approach is able to generate base classifiers with higher diversity and local accuracy, and improve the accuracy of credit scoring.

8.

Evaluation of the performance of clustering algorithms in kernel-induced feature space

Dae-Won Kim Ki Young Lee Kwang H. Lee 《Pattern recognition》2005,38(4):607-611

By using a kernel function, data that are not easily separable in the original space can be clustered into homogeneous groups in the implicitly transformed high-dimensional feature space. Kernel k-means algorithms have recently been shown to perform better than conventional k-means algorithms in unsupervised classification. However, few reports have examined the benefits of using a kernel function and the relative merits of the various kernel clustering algorithms with regard to the data distribution. In this study, we reformulated four representative clustering algorithms based on a kernel function and evaluated their performances for various data sets. The results indicate that each kernel clustering algorithm gives markedly better performance than its conventional counterpart for almost all data sets. Of the kernel clustering algorithms studied in the present work, the kernel average linkage algorithm gives the most accurate clustering results.
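
The phrase "implicitly transformed" can be made concrete with the standard kernel trick for kernel k-means: the squared distance from a point to a cluster mean in feature space can be computed from kernel values alone, without ever constructing the feature map. The sketch below uses the well-known expansion K(x,x) - (2/n) Σ K(x,y) + (1/n²) ΣΣ K(y,z); the function names and the `gamma` parameter are illustrative choices, not taken from the paper.

```python
import math


def kernel_distance_sq(x, cluster, kernel):
    """Squared feature-space distance between phi(x) and the mean of
    phi(y) over y in cluster, computed only through kernel evaluations."""
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, y) for y in cluster) / n
    cluster_term = sum(kernel(y, z) for y in cluster for z in cluster) / (n * n)
    return self_term - 2.0 * cross_term + cluster_term


def linear_kernel(a, b):
    """Plain dot product; recovers ordinary Euclidean k-means distances."""
    return sum(p * q for p, q in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative width parameter."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))
```

With `linear_kernel` the result equals the usual squared Euclidean distance to the centroid, so conventional k-means is the special case; swapping in `rbf_kernel` changes only the kernel argument, not the clustering loop around it.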

9.

Semi-supervised graph clustering: a kernel approach (total citations: 6; self-citations: 0; citations by others: 6)

Brian Kulis Sugato Basu Inderjit Dhillon Raymond Mooney 《Machine Learning》2009,74(1):1-22

Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

10.

A time-efficient pattern reduction algorithm for k-means clustering

Ming-Chao Chiang Chun-Wei Tsai Chu-Sing Yang 《Information Sciences》2011,181(4):716-3410

This paper presents an efficient algorithm, called pattern reduction (PR), for reducing the computation time of k-means and k-means-based clustering algorithms. The proposed algorithm works by compressing and removing, at each iteration, patterns that are unlikely to change their membership thereafter. Not only is the proposed algorithm simple and easy to implement, but it can also be applied to many other iterative clustering algorithms such as kernel-based and population-based clustering algorithms. Our experiments, ranging from 2 to 1000 dimensions and from 150 to 10,000,000 patterns, indicate that with a small loss of quality, the proposed algorithm can significantly reduce the computation time of all state-of-the-art clustering algorithms evaluated in this paper, especially for large and high-dimensional data sets.

11.

Fuzzy C-means based clustering for linearly and nonlinearly separable data

Du-Ming Tsai Chung-Chan Lin 《Pattern recognition》2011,44(8):1750-1760

In this paper we present a new distance metric that incorporates the distance variation in a cluster to regularize the distance between a data point and the cluster centroid. It is then applied to the conventional fuzzy C-means (FCM) clustering in data space and the kernel fuzzy C-means (KFCM) clustering in a high-dimensional feature space. Experiments on two-dimensional artificial data sets, real data sets from public data libraries and color image segmentation have shown that the proposed FCM and KFCM with the new distance metric generally have better performance on non-spherically distributed data with uneven density for linear and nonlinear separation.

12.

An approach of clustering data with noisy or imprecise feature measurement

B.B Chaudhuri P.R Bhowmik 《Pattern recognition letters》1998,19(14):1307-1317

In this paper we consider the clustering of data corrupted by noise or suffering from imprecision due to the finite resolution of the feature measuring device. Our work is motivated by the fact that no measurement can be made perfect and that added noise is not uncommon in telemetric data. We show how the classical k-means algorithm should be modified to take care of the noise or imprecision. Experimental results on Fisher's Iris data and a nutrition data set are presented.

13.

A classification-based semi-supervised clustering method (total citations: 1; self-citations: 0; citations by others: 1)

Li Shan, Zhang Huaxiang 《Computer Engineering and Applications》2011,47(3):132-134

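A classification-based semi-supervised clustering algorithm is proposed. It makes full use of the small number of labeled objects in the data set to coarsely classify the original data, and extends the way cluster centers are selected in the traditional k-means algorithm. The k-meansGuider method is used to coarsely cluster the data set, and the coarse clustering results are then combined into an ensemble. Experiments on several standard UCI data sets show that the proposed algorithm effectively improves clustering quality.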

14.

Improving the performance of k-means for color quantization

M. Emre Celebi 《Image and vision computing》2011,29(4):260-271

Color quantization is an important operation with many applications in graphics and image processing. Most quantization methods are essentially based on data clustering algorithms. However, despite its popularity as a general purpose clustering algorithm, k-means has not received much respect in the color quantization literature because of its high computational requirements and sensitivity to initialization. In this paper, we investigate the performance of k-means as a color quantizer. We implement fast and exact variants of k-means with several initialization schemes and then compare the resulting quantizers to some of the most popular quantizers in the literature. Experiments on a diverse set of images demonstrate that an efficient implementation of k-means with an appropriate initialization strategy can in fact serve as a very effective color quantizer.

15.

A genetic-fuzzy mining approach for items with multiple minimum supports

Chun-Hao Chen Tzung-Pei Hong Vincent S. Tseng Chang-Shing Lee 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2009,13(5):521-533

Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for specific purposes. Mining association rules from transaction data is the most commonly seen of the mining techniques. Most of the previous mining approaches set a single minimum support threshold for all the items and identify the relationships among transactions using binary values. In the past, we proposed a genetic-fuzzy data-mining algorithm for extracting both association rules and membership functions from quantitative transactions under a single minimum support. In real applications, different items may have different criteria to judge their importance. In this paper, we thus propose an algorithm which combines clustering, fuzzy and genetic concepts for extracting reasonable multiple minimum support values, membership functions and fuzzy association rules from quantitative transactions. It first uses the k-means clustering approach to gather similar items into groups. All items in the same cluster are considered to have similar characteristics and are assigned similar values for initializing a better population. Each chromosome is then evaluated by the criteria of requirement satisfaction and suitability of membership functions to estimate its fitness value. Experimental results show the effectiveness and the efficiency of the proposed approach.

16.

A planar curve fairing method based on wavelet decomposition of the curvature plot*

Zhang Lining, Zhang Dinghua, Liu Yuanpeng, Zhao Xifeng 《Application Research of Computers》2005,22(12):250-252

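A simulated annealing genetic algorithm is applied to cluster analysis. By encoding the cluster centers, defining a fitness function, applying selection, crossover and mutation operations, and using simulated annealing, a new clustering algorithm based on the simulated annealing genetic algorithm is presented. Experimental results show that the method outperforms the basic genetic algorithm.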

17.

Supervised adaptive clustering: A hybrid neural network clustering algorithm

M. F. Augusteijn U. J. Steck 《Neural computing & applications》1998,7(1):78-89

A neural network architecture is introduced which implements a supervised clustering algorithm for the classification of feature vectors. The network is self-organising, and is able to adapt to the shape of the underlying pattern distribution as well as detect novel input vectors during training. It is also capable of determining the relative importance of the feature components for classification. The architecture is a hybrid of supervised and unsupervised networks, and combines the strengths of three well-known architectures: learning vector quantisation, back-propagation and adaptive resonance theory. Network performance is compared to that of learning vector quantisation, back-propagation and cascade-correlation. It is found that performance is generally as good as or better than the performance of these other architectures, while training time is considerably shorter. However, the main advantage of the hybrid architecture is its ability to gain insight into the feature pattern space.

Nomenclature
O_j  The output value of the jth unit
I_i  The ith component of the input pattern
W_ij  The weight of the cluster connection between the ith input and the jth unit
B_ij  The weight of the shape connection between the ith input and the jth unit
N  The dimension of the input patterns
v_j  The vigilance parameter of the jth unit
v_init  The initial vigilance parameter value
v_rate  The change in the vigilance parameter value
X_i  The ith direction in an N-dimensional coordinate system
T_k  The classification tag of the kth unit
C  The classification tag of the current input vector
(p)  The learning rate at the pth epoch for the cluster weights
p  The current epoch
P  The total number of epochs
E_k  The error associated with the kth unit
The constant learning rate for the shape weights
a_j  The age in epochs of the jth unit

18.

The MinMax k-Means clustering algorithm

Grigorios Tzortzis Aristidis Likas 《Pattern recognition》2014

Applying k-Means to minimize the sum of the intra-cluster variances is the most popular clustering approach. However, after a bad initialization, poor local optima can be easily obtained. To tackle the initialization problem of k-Means, we propose the MinMax k-Means algorithm, a method that assigns weights to the clusters relative to their variance and optimizes a weighted version of the k-Means objective. Weights are learned together with the cluster assignments, through an iterative procedure. The proposed weighting scheme limits the emergence of large variance clusters and allows high quality solutions to be systematically uncovered, irrespective of the initialization. Experiments verify the effectiveness of our approach and its robustness over bad initializations, as it compares favorably to both k-Means and other methods from the literature that consider the k-Means initialization problem.

19.

On an approach to clustering of network traffic

L. E. Kerimova 《Automatic Control and Computer Sciences》2007,41(2):107-113

The problem of clustering with a new generalized performance criterion is considered, the concept of the “center of the cluster” is introduced, and it is shown that this concept is well defined. Necessary conditions for minimization of the functional are derived in a theorem which encompasses both fuzzy and crisp partitions into clusters. The k-means algorithm, which is based on this necessary condition, finds the optimal cluster iteratively.

20.

A k-populations algorithm for clustering categorical data

Dae-Won Kim KiYoung Lee Kwang H. Lee 《Pattern recognition》2005,38(7):1131-1134

In this paper, the conventional k-modes-type algorithms for clustering categorical data are extended by representing the clusters of categorical data with k-populations instead of the hard-type centroids used in the conventional algorithms. Use of a population-based centroid representation makes it possible to preserve the uncertainty inherent in data sets as long as possible before actual decisions are made. The k-populations algorithm was found to give markedly better clustering results through various experiments.