首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters. This research was supported in part by AFRL/Wright Brothers Institute (WBI).  相似文献   

2.
This paper introduces a novel pairwise-adaptive dissimilarity measure for large high dimensional document datasets that improves the unsupervised clustering quality and speed compared to the original cosine dissimilarity measure. This measure dynamically selects a number of important features of the compared pair of document vectors. Two approaches for selecting the number of features in the application of the measure are discussed. The proposed feature selection process makes this dissimilarity measure especially applicable in large, high dimensional document collections. Its performance is validated on several test sets originating from standardized datasets. The dissimilarity measure is compared to the well-known cosine dissimilarity measure using the average F-measures of the hierarchical agglomerative clustering result. This new dissimilarity measure results in an improved clustering result obtained with a lower required computational time.  相似文献   

3.
Harmony K-means algorithm for document clustering   总被引:2,自引:0,他引:2  
Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.  相似文献   

4.
Document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences and needs to support personalization in document categorization. Most existing document-clustering techniques, generally anchoring in pure content-based analysis, generate a single set of clusters for all individuals without tailoring to individuals' preferences and thus are unable to support personalization. The partial-clustering-based personalized document-clustering approach, incorporating a target individual's partial clustering into the document-clustering process, has been proposed to facilitate personalized document clustering. However, given a collection of documents to be clustered, the individual might have categorized only a small subset of the collection into his or her personal folders. In this case, the small partial clustering would degrade the effectiveness of the existing personalized document-clustering approach for this particular individual. In response, we extend this approach and propose the collaborative-filtering-based personalized document-clustering (CFC) technique that expands the size of an individual's partial clustering by considering those of other users with similar categorization preferences. Our empirical evaluation results suggest that when given a small-sized partial clustering established by an individual, the proposed CFC technique generally achieves better clustering effectiveness for the individual than does the partial-clustering-based personalized document-clustering technique.  相似文献   

5.
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge repository, knowledge maps, often created by document clustering techniques, represent an appealing and promising approach. Various document clustering techniques have been proposed in the literature, but most deal with monolingual documents (i.e., written in the same language). However, as a result of increased globalization and advances in Internet technology, an organization often maintains documents in different languages in its knowledge repositories, which necessitates multilingual document clustering (MLDC) to create organizational knowledge maps. Motivated by the significance of this demand, this study designs a Latent Semantic Indexing (LSI)-based MLDC technique capable of generating knowledge maps (i.e., document clusters) from multilingual documents. The empirical evaluation results show that the proposed LSI-based MLDC technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision, and is capable of maintaining a good balance between monolingual and cross-lingual clustering effectiveness when clustering a multilingual document corpus.  相似文献   

6.
Fast and exact out-of-core and distributed k-means clustering   总被引:1,自引:2,他引:1  
Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset.In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means.This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance. Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received a BE and a ME degree in computer engineering from Beihang University (BUAA), China in 1996 and 1999, respectively. He earned his MS degree in computer science from University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD. Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he was interested in studying computer science. He then obtained a masters degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he did a Masters thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan' favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer. Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech degree from Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D degrees from University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.  相似文献   

7.
Semi-supervised graph clustering: a kernel approach   总被引:6,自引:0,他引:6  
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.  相似文献   

8.
In this paper we considered clustering of data corrupted by noise or suffering from imprecision due to finite resolution of the feature measuring device. Our work is motivated by the fact that no measurement can be made perfect and addition of noise is not an uncommon phenomenon for telemetric data. Here we tried to show how the classical k-means algorithm should be modified to take care of the noise/imprecision. Experimental results on Fisher's Iris data and a Nutrition data are demonstrated.  相似文献   

9.
Dynamic data mining has gained increasing attention in the last decade. It addresses changing data structures which can be observed in many real-life applications, e.g. buying behavior of customers. As opposed to classical, i.e. static data mining where the challenge is to discover pattern inherent in given data sets, in dynamic data mining the challenge is to understand – and in some cases even predict – how such pattern will change over time. Since changes in general lead to uncertainty, the appropriate approaches for uncertainty modeling are needed in order to capture, model, and predict the respective phenomena considered in dynamic environments. As a consequence, the combination of dynamic data mining and soft computing is a very promising research area. The proposed algorithm consists of a dynamic clustering cycle when the data set will be refreshed from time to time. Within this cycle criteria check if the newly arrived data have structurally changed in comparison to the data already analyzed. If yes, appropriate actions are triggered, in particular an update of the initial settings of the cluster algorithm. As we will show, rough clustering offers strong tools to detect such changing data structures. To evaluate the proposed dynamic rough clustering algorithm it has been applied to synthetic as well as to real-world data sets where it provides new insights regarding the underlying dynamic phenomena.  相似文献   

10.
In recent years, there have been numerous attempts to extend the k-means clustering protocol for single database to a distributed multiple database setting and meanwhile keep privacy of each data site. Current solutions for (whether two or more) multiparty k-means clustering, built on one or more secure two-party computation algorithms, are not equally contributory, in other words, each party does not equally contribute to k-means clustering. This may lead a perfidious attack where a party who learns the outcome prior to other parties tells a lie of the outcome to other parties. In this paper, we present an equally contributory multiparty k-means clustering protocol for vertically partitioned data, in which each party equally contributes to k-means clustering. Our protocol is built on ElGamal's encryption scheme, Jakobsson and Juels's plaintext equivalence test protocol, and mix networks, and protects privacy in terms that each iteration of k-means clustering can be performed without revealing the intermediate values.  相似文献   

11.
12.
Document clustering using synthetic cluster prototypes   总被引:3,自引:0,他引:3  
The use of centroids as prototypes for clustering text documents with the k-means family of methods is not always the best choice for representing text clusters due to the high dimensionality, sparsity, and low quality of text data. Especially for the cases where we seek clusters with small number of objects, the use of centroids may lead to poor solutions near the bad initial conditions. To overcome this problem, we propose the idea of synthetic cluster prototype that is computed by first selecting a subset of cluster objects (instances), then computing the representative of these objects and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype that favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach especially for small datasets and clusters overlapping in many dimensions and its superior performance against traditional and subspace clustering methods.  相似文献   

13.
This paper presents an efficient algorithm, called pattern reduction (PR), for reducing the computation time of k-means and k-means-based clustering algorithms. The proposed algorithm works by compressing and removing at each iteration patterns that are unlikely to change their membership thereafter. Not only is the proposed algorithm simple and easy to implement, but it can also be applied to many other iterative clustering algorithms such as kernel-based and population-based clustering algorithms. Our experiments—from 2 to 1000 dimensions and 150 to 10,000,000 patterns—indicate that with a small loss of quality, the proposed algorithm can significantly reduce the computation time of all state-of-the-art clustering algorithms evaluated in this paper, especially for large and high-dimensional data sets.  相似文献   

14.
Generative model-based document clustering: a comparative study   总被引:7,自引:2,他引:7  
This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.  相似文献   

15.
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data mining domain. In this paper, we propose a new method of XML document clustering by a global criterion function, considering the weight of common structures. Our approach initially extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. Then, we perform clustering of an XML document by the weight of common structures, without a measure of pairwise similarity, assuming that an XML document is a transaction and frequent structures extracted from documents are items of the transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.  相似文献   

16.
Intrusion detection is a necessary step to identify unusual access or attacks to secure internal networks. In general, intrusion detection can be approached by machine learning techniques. In literature, advanced techniques by hybrid learning or ensemble methods have been considered, and related work has shown that they are superior to the models using single machine learning techniques. This paper proposes a hybrid learning model based on the triangle area based nearest neighbors (TANN) in order to detect attacks more effectively. In TANN, the k-means clustering is firstly used to obtain cluster centers corresponding to the attack classes, respectively. Then, the triangle area by two cluster centers with one data from the given dataset is calculated and formed a new feature signature of the data. Finally, the k-NN classifier is used to classify similar attacks based on the new feature represented by triangle areas. By using KDD-Cup ’99 as the simulation dataset, the experimental results show that TANN can effectively detect intrusion attacks and provide higher accuracy and detection rates, and the lower false alarm rate than three baseline models based on support vector machines, k-NN, and the hybrid centroid-based classification model by combining k-means and k-NN.  相似文献   

17.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is thousands or even tens of thousands of dimensions in feature space which is typical for textual data. Because the most straightforward and popular approach represents texts with the vector space model (VSM), that is, each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of GAL approach over conventional GA applied in VSM model is demonstrated by providing good Reuter document clustering results.  相似文献   

18.
For the past few decades the mainstream data clustering technologies have been fundamentally based on centralized operation; data sets were of small manageable sizes, and usually resided on one site that belonged to one organization. Today, data is of enormous sizes and is usually located on distributed sites; the primary example being the Web. This created a need for performing clustering in distributed environments. Distributed clustering solves two problems: infeasibility of collecting data at a central site, due to either technical and/or privacy limitations, and intractability of traditional clustering algorithms on huge data sets. In this paper we propose a distributed collaborative clustering approach for clustering Web documents in distributed environments. We adopt a peer-to-peer model, where the main objective is to allow nodes in a network to first form independent opinions of local document clusterings, then collaborate with peers to enhance the local clusterings. Information exchanged between peers is minimized through the use of cluster summaries in the form of keyphrases extracted from the clusters. This summarized view of peer data enables nodes to request merging of remote data selectively to enhance local clusters. Initial clustering, as well as merging peer data with local clusters, utilizes a clustering method, called similarity histogram-based clustering, based on keeping a tight similarity distribution within clusters. This approach achieves significant improvement in local clustering solutions without the cost of centralized clustering, while maintaining the initial local clustering structure. Results show that larger networks exhibit larger improvements, up to 15% improvement in clustering quality, albeit lower absolute clustering quality than smaller networks.  相似文献   

19.
In this paper, we present a fast global k-means clustering algorithm by making use of the cluster membership and geometrical information of a data point. This algorithm is referred to as MFGKM. The algorithm uses a set of inequalities developed in this paper to determine a starting point for the jth cluster center of global k-means clustering. Adopting multiple cluster center selection (MCS) for MFGKM, we also develop another clustering algorithm called MFGKM+MCS. MCS determines more than one starting point for each step of cluster split; while the available fast and modified global k-means clustering algorithms select one starting point for each cluster split. Our proposed method MFGKM can obtain the least distortion; while MFGKM+MCS may give the least computing time. Compared to the modified global k-means clustering algorithm, our method MFGKM can reduce the computing time and number of distance calculations by a factor of 3.78-5.55 and 21.13-31.41, respectively, with the average distortion reduction of 5,487 for the Statlog data set. Compared to the fast global k-means clustering algorithm, our method MFGKM+MCS can reduce the computing time by a factor of 5.78-8.70 with the average reduction of distortion of 30,564 using the same data set. The performances of our proposed methods are more remarkable when a data set with higher dimension is divided into more clusters.  相似文献   

20.
为解决在互联网文本信息爆炸性增长的前提下,在大规模文本数据中如何发现隐含的、有价值的潜在知识的问题,提出基于多层次文本聚类的文本知识挖掘方法,针对不同规模的文本数据进行不同粒度的聚类,实现不同层次知识的挖掘。针对最广义层次的文本知识挖掘可实现各主题事务划分,针对子级分类数据的文本知识挖掘可发现下一层次主题分类,针对自定义层次的文本知识挖掘可发现该事件中存在的具体细节。对诉求实际数据的分析结果表明,该方法可在所有诉求数据中挖掘出各种诉求主题,精确挖掘出其中的细节问题,为管理者提供数据和决策支持,提高服务效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号