首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到10条相似文献,搜索用时 125 毫秒
1.
Non-negative matrix factorization for semi-supervised data clustering   总被引:9,自引:6,他引:3  
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering. In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF, users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moveover, we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance of SS-NMF for clustering.
Ming DongEmail:
  相似文献   

2.
3.
Clustering of time series subsequence data commonly produces results that are unspecific to the data set. This paper introduces a clustering algorithm, that creates clusters exclusively from those subsequences that occur more frequently in a data set than would be expected by random chance. As such, it partially adopts a pattern mining perspective into clustering. When subsequences are being labeled based on such clusters, they may remain without label. In fact, if the clustering was done on an unrelated time series it is expected that the subsequences should not receive a label. We show that pattern-based clusters are indeed specific to the data set for 7 out of 10 real-world sets we tested, and for window-lengths up to 128 time points. While kernel-density-based clustering can be used to find clusters with similar properties for window sizes of 8–16 time points, its performance degrades fast for increasing window sizes.
Dietmar H. DorrEmail:
  相似文献   

4.
In this paper, we present efficient, scalable, and portable parallel algorithms for the off-line clustering, the on-line retrieval and the update phases of the Text Retrieval (TR) problem based on the vector space model and using clustering to organize and handle a dynamic document collection. The algorithms are running on the Coarse-Grained Multicomputer (CGM) and/or the Bulk Synchronous Parallel (BSP) model which are two models that capture within a few parameters the characteristics of the parallel machine. To the best of our knowledge, our parallel retrieval algorithms are the first ones analyzed under these specific parallel models. For all the phases of the proposed algorithms, we analytically determine the relevant communication and computation cost thereby formally proving the efficiency of the proposed solutions. In addition, we prove that our technique for the on-line retrieval phase performs very well in comparison to other possible alternatives in the typical case of a multiuser information retrieval (IR) system where a number of user queries are concurrently submitted to an IR system. Finally, we discuss external memory issues and show how our techniques can be adapted to the case when processors have limited main memory but sufficient disk capacity for holding their local data.
Damianos GavalasEmail:
  相似文献   

5.
Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amount of data, while protecting at the same time sensitive information. Several data mining techniques, incorporating privacy protection mechanisms, have been developed that allow one to hide sensitive itemsets or patterns, before the data mining process is executed. Privacy preserving classification methods, instead, prevent a miner from building a classifier which is able to predict sensitive data. Additionally, privacy preserving clustering techniques have been recently proposed, which distort sensitive numerical attributes, while preserving general features for clustering analysis. A crucial issue is to determine which ones among these privacy-preserving techniques better protect sensitive information. However, this is not the only criteria with respect to which these algorithms can be evaluated. It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms. There is thus the need of identifying a comprehensive set of criteria with respect to which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements. In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. Then, we apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations about future work and promising directions in the context of privacy preservation in data mining are discussed. *The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the Sponsors of CERIAS. Editor:  Geoff Webb
Elisa Bertino (Corresponding author)Email:
Igor Nai FovinoEmail:
Loredana Parasiliti ProvenzaEmail:
  相似文献   

6.
Email overload is a recent problem that there is increasingly difficulty that people have to process the large number of emails received daily. Currently, this problem becomes more and more serious and it has already affected the normal usage of email as a knowledge management tool. It has been recognized that categorizing emails into meaningful groups can greatly save cognitive load to process emails, and thus this is an effective way to manage the email overload problem. However, most current approaches still require significant human input for categorizing emails. In this paper, we develop an automatic email clustering system, underpinned by a new nonparametric text clustering algorithm. This system does not require any predefined input parameters and can automatically generate meaningful email clusters. The evaluation shows our new algorithm outperforms existing text clustering algorithms with higher efficiency and quality in terms of computational time and clustering quality measured by different gauges. The experimental results also well match the labeled human clustering results.
Yang XiangEmail:
  相似文献   

7.
Optimal virtual cluster-based multiprocessor scheduling   总被引:1,自引:1,他引:0  
Scheduling of constrained deadline sporadic task systems on multiprocessor platforms is an area which has received much attention in the recent past. It is widely believed that finding an optimal scheduler is hard, and therefore most studies have focused on developing algorithms with good processor utilization bounds. These algorithms can be broadly classified into two categories: partitioned scheduling in which tasks are statically assigned to individual processors, and global scheduling in which each task is allowed to execute on any processor in the platform. In this paper we consider a third, more general, approach called cluster-based scheduling. In this approach each task is statically assigned to a processor cluster, tasks in each cluster are globally scheduled among themselves, and clusters in turn are scheduled on the multiprocessor platform. We develop techniques to support such cluster-based scheduling algorithms, and also consider properties that minimize total processor utilization of individual clusters. In the last part of this paper, we develop new virtual cluster-based scheduling algorithms. For implicit deadline sporadic task systems, we develop an optimal scheduling algorithm that is neither Pfair nor ERfair. We also show that the processor utilization bound of us-edf{m/(2m−1)} can be improved by using virtual clustering. Since neither partitioned nor global strategies dominate over the other, cluster-based scheduling is a natural direction for research towards achieving improved processor utilization bounds.
Insup LeeEmail:
  相似文献   

8.
Association Rule Mining is one of the important data mining activities and has received substantial attention in the literature. Association rule mining is a computationally and I/O intensive task. In this paper, we propose a solution approach for mining optimized fuzzy association rules of different orders. We also propose an approach to define membership functions for all the continuous attributes in a database by using clustering techniques. Although single objective genetic algorithms are used extensively, they degenerate the solution. In our approach, extraction and optimization of fuzzy association rules are done together using multi-objective genetic algorithm by considering the objectives such as fuzzy support, fuzzy confidence and rule length. The effectiveness of the proposed approach is tested using computer activity dataset to analyze the performance of a multi processor system and network audit data to detect anomaly based intrusions. Experiments show that the proposed method is efficient in many scenarios.
V. S. AnanthanarayanaEmail:
  相似文献   

9.
Subspace sums for extracting non-random data from massive noise   总被引:1,自引:1,他引:0  
An algorithm is introduced that distinguishes relevant data points from randomly distributed noise. The algorithm is related to subspace clustering based on axis-parallel projections, but considers membership in any projected cluster of a given side length, as opposed to a particular cluster. An aggregate measure is introduced that is based on the total number of points that are close to the given point in all possible 2 d projections of a d-dimensional hypercube. No explicit summation over subspaces is required for evaluating this measure. Attribute values are normalized based on rank order to avoid making assumptions on the distribution of random data. Effectiveness of the algorithm is demonstrated through comparison with conventional outlier detection on a real microarray data set as well as on time series subsequence data.
Anne M. DentonEmail:
  相似文献   

10.
K-means clustering is a very popular clustering technique, which is used in numerous applications. In the k-means clustering algorithm, each point in the dataset is assigned to the nearest cluster by calculating the distances from each point to the cluster centers. The computation of these distances is a very time-consuming task, particularly for large dataset and large number of clusters. In order to achieve high performance, we need to reduce the number of the distance calculations for each point efficiently. In this paper, we describe an FPGA implementation of k-means clustering for color images based on the filtering algorithm. In our implementation, when calculating the distances for each point, clusters which are apparently not closer to the point than other clusters are filtered out using kd-trees which are dynamically generated on the FPGA in each iteration of k-means clustering. The performance of our system for 512 × 512 and 640 × 480  pixel images (24-bit full color RGB) is more than 30 fps, and 20–30 fps for 756 × 512 pixel images in average when dividing to 256 clusters.
Tsutomu Maruyama (Corresponding author)Email:
  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号