Found 10 similar articles; search took 125 ms.
1.
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts
is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering.
In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF,
users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether
they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the
data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moreover,
we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as
special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance
of SS-NMF for clustering.
Ming Dong
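The abstract above does not say how the "must" and "cannot" constraints enter the similarity matrix. One common way to picture it, sketched below purely as an assumption, is to raise the similarity of must-link pairs and lower that of cannot-link pairs before factorizing. The function name `apply_pairwise_constraints` and the `reward` and `penalty` parameters are hypothetical, and the factorization step itself is not shown.

```python
def apply_pairwise_constraints(similarity, must_link, cannot_link,
                               reward=1.0, penalty=1.0):
    """Return a constraint-adjusted copy of a symmetric similarity matrix.

    Assumption for illustration only: must-link pairs gain `reward`,
    cannot-link pairs lose `penalty`. Values are clamped at zero so the
    matrix stays non-negative.
    """
    adjusted = [row[:] for row in similarity]  # leave the input untouched
    for i, j in must_link:
        adjusted[i][j] += reward
        adjusted[j][i] += reward
    for i, j in cannot_link:
        adjusted[i][j] = max(0.0, adjusted[i][j] - penalty)
        adjusted[j][i] = max(0.0, adjusted[j][i] - penalty)
    return adjusted


# Three objects with uniform off-diagonal similarity.
similarity = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
adjusted = apply_pairwise_constraints(
    similarity, must_link=[(0, 1)], cannot_link=[(1, 2)]
)
```

In this toy case objects 0 and 1 become more similar and objects 1 and 2 become fully dissimilar, while the unconstrained pair (0, 2) is unchanged.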
2.
Characterizing pattern preserving clustering (total citations: 4; self-citations: 4; citations by others: 0)
Hui Xiong Michael Steinbach Arifin Ruslim Vipin Kumar 《Knowledge and Information Systems》2009,19(3):311-336
3.
Pattern-based time-series subsequence clustering using radial distribution functions (total citations: 2; self-citations: 2; citations by others: 0)
Anne M. Denton Christopher A. Besemann Dietmar H. Dorr 《Knowledge and Information Systems》2009,18(1):1-27
Clustering of time series subsequence data commonly produces results that are unspecific to the data set. This paper introduces
a clustering algorithm that creates clusters exclusively from those subsequences that occur more frequently in a data set
than would be expected by random chance. As such, it partially adopts a pattern mining perspective into clustering. When subsequences
are being labeled based on such clusters, they may remain without a label. In fact, if the clustering was done on an unrelated
time series it is expected that the subsequences should not receive a label. We show that pattern-based clusters are indeed
specific to the data set for 7 out of 10 real-world sets we tested, and for window-lengths up to 128 time points. While kernel-density-based
clustering can be used to find clusters with similar properties for window sizes of 8–16 time points, its performance degrades
fast for increasing window sizes.
4.
Charalampos Konstantopoulos Basilis Mamalis Grammati Pantziou Damianos Gavalas 《The Journal of supercomputing》2009,48(3):286-318
In this paper, we present efficient, scalable, and portable parallel algorithms for the off-line clustering, the on-line retrieval
and the update phases of the Text Retrieval (TR) problem based on the vector space model and using clustering to organize
and handle a dynamic document collection. The algorithms run on the Coarse-Grained Multicomputer (CGM) and/or the Bulk Synchronous Parallel (BSP) model, two models that capture the characteristics of the parallel machine within a few parameters. To the best
of our knowledge, our parallel retrieval algorithms are the first ones analyzed under these specific parallel models. For
all the phases of the proposed algorithms, we analytically determine the relevant communication and computation cost thereby
formally proving the efficiency of the proposed solutions. In addition, we prove that our technique for the on-line retrieval
phase performs very well in comparison to other possible alternatives in the typical case of a multiuser information retrieval
(IR) system where a number of user queries are concurrently submitted to an IR system. Finally, we discuss external memory
issues and show how our techniques can be adapted to the case when processors have limited main memory but sufficient disk
capacity for holding their local data.
5.
Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these
algorithms is the extraction of relevant knowledge from large amounts of data, while at the same time protecting sensitive
information. Several data mining techniques, incorporating privacy protection mechanisms, have been developed that allow one
to hide sensitive itemsets or patterns, before the data mining process is executed. Privacy preserving classification methods,
instead, prevent a miner from building a classifier which is able to predict sensitive data. Additionally, privacy preserving
clustering techniques have been recently proposed, which distort sensitive numerical attributes, while preserving general
features for clustering analysis. A crucial issue is to determine which ones among these privacy-preserving techniques better
protect sensitive information. However, this is not the only criterion with respect to which these algorithms can be evaluated.
It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well
as the performance of the algorithms. There is thus a need to identify a comprehensive set of criteria with respect to
which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements.
In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. Then,
we apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations
about future work and promising directions in the context of privacy preservation in data mining are discussed.
*The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the Sponsors of
CERIAS.
Editor: Geoff Webb
Elisa Bertino (Corresponding author)
Igor Nai Fovino
Loredana Parasiliti Provenza
6.
Yang Xiang 《The Journal of supercomputing》2009,48(3):227-242
Email overload is a recent problem: people find it increasingly difficult to process the large number of emails
received daily. This problem is becoming more serious and has already affected the normal use of email
as a knowledge management tool. It has been recognized that categorizing emails into meaningful groups can greatly reduce the cognitive
load of processing emails, and is thus an effective way to manage the email overload problem. However, most current approaches
still require significant human input for categorizing emails. In this paper, we develop an automatic email clustering system,
underpinned by a new nonparametric text clustering algorithm. This system does not require any predefined input parameters
and can automatically generate meaningful email clusters. The evaluation shows our new algorithm outperforms existing text
clustering algorithms with higher efficiency and quality in terms of computational time and clustering quality measured by
different gauges. The experimental results also closely match the human-labeled clustering results.
7.
Optimal virtual cluster-based multiprocessor scheduling (total citations: 1; self-citations: 1; citations by others: 0)
Scheduling of constrained deadline sporadic task systems on multiprocessor platforms is an area which has received much attention
in the recent past. It is widely believed that finding an optimal scheduler is hard, and therefore most studies have focused
on developing algorithms with good processor utilization bounds. These algorithms can be broadly classified into two categories:
partitioned scheduling in which tasks are statically assigned to individual processors, and global scheduling in which each
task is allowed to execute on any processor in the platform. In this paper we consider a third, more general, approach called
cluster-based scheduling. In this approach each task is statically assigned to a processor cluster, tasks in each cluster
are globally scheduled among themselves, and clusters in turn are scheduled on the multiprocessor platform. We develop techniques
to support such cluster-based scheduling algorithms, and also consider properties that minimize total processor utilization
of individual clusters. In the last part of this paper, we develop new virtual cluster-based scheduling algorithms. For implicit
deadline sporadic task systems, we develop an optimal scheduling algorithm that is neither Pfair nor ERfair. We also show
that the processor utilization bound of US-EDF{m/(2m−1)} can be improved by using virtual clustering. Since neither partitioned nor global strategies dominate over the other,
cluster-based scheduling is a natural direction for research towards achieving improved processor utilization bounds.
Insup Lee
8.
Association Rule Mining is one of the important data mining activities and has received substantial attention in the literature.
Association rule mining is a computationally and I/O intensive task. In this paper, we propose a solution approach for mining optimized fuzzy association rules of different orders.
We also propose an approach to define membership functions for all the continuous attributes in a database by using clustering
techniques. Although single objective genetic algorithms are used extensively, they degenerate the solution. In our approach,
extraction and optimization of fuzzy association rules are done together using multi-objective genetic algorithm by considering
the objectives such as fuzzy support, fuzzy confidence and rule length. The effectiveness of the proposed approach is tested
using a computer activity dataset to analyze the performance of a multiprocessor system and network audit data to detect anomaly-based
intrusions. Experiments show that the proposed method is efficient in many scenarios.
V. S. Ananthanarayana
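The objectives named in the abstract above (fuzzy support and fuzzy confidence) can be illustrated with a small sketch. The abstract gives no formulas, so the definitions below are an assumption: they follow one common convention in which a record supports an itemset to the degree of its minimum membership value. The function names and the toy membership data are hypothetical, and the genetic optimization itself is not shown.

```python
def fuzzy_support(records, items):
    """Mean over records of the minimum membership degree among `items`.

    Assumed definition (min as the fuzzy AND); other t-norms are possible.
    """
    total = sum(min(record[item] for item in items) for record in records)
    return total / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """Support of antecedent and consequent together, over antecedent alone."""
    base = fuzzy_support(records, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return fuzzy_support(records, antecedent + consequent) / base


# Hypothetical membership degrees for two fuzzy items in four records.
records = [
    {"cpu_high": 1.0, "io_high": 0.5},
    {"cpu_high": 0.5, "io_high": 0.5},
    {"cpu_high": 0.0, "io_high": 1.0},
    {"cpu_high": 0.5, "io_high": 0.0},
]
support = fuzzy_support(records, ["cpu_high", "io_high"])
confidence = fuzzy_confidence(records, ["cpu_high"], ["io_high"])
```

Here the rule "cpu_high implies io_high" has fuzzy support 0.25 and fuzzy confidence 0.5 under the assumed definitions. A multi-objective search would trade these values off against rule length.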
9.
Subspace sums for extracting non-random data from massive noise (total citations: 1; self-citations: 1; citations by others: 0)
Anne M. Denton 《Knowledge and Information Systems》2009,20(1):35-62
An algorithm is introduced that distinguishes relevant data points from randomly distributed noise. The algorithm is related
to subspace clustering based on axis-parallel projections, but considers membership in any projected cluster of a given side
length, as opposed to a particular cluster. An aggregate measure is introduced that is based on the total number of points
that are close to the given point in all possible 2^d projections of a d-dimensional hypercube. No explicit summation over subspaces is required for evaluating this measure. Attribute values are
normalized based on rank order to avoid making assumptions on the distribution of random data. Effectiveness of the algorithm
is demonstrated through comparison with conventional outlier detection on a real microarray data set as well as on time series
subsequence data.
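The rank-order normalization mentioned in the abstract above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's procedure: ranks are scaled to the interval [0, 1) and ties are broken by original position, and the function name `rank_normalize` is hypothetical.

```python
def rank_normalize(values):
    """Map each value to rank / n, so the result is uniform on [0, 1).

    Only the ordering of the values matters, so no assumption is made
    about their distribution. Ties are broken by original position.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort
    normalized = [0.0] * n
    for rank, index in enumerate(order):
        normalized[index] = rank / n
    return normalized


# The extreme value 1000.0 ends up no further from its neighbour
# than any other pair of consecutive ranks.
normalized = rank_normalize([10.0, 1000.0, 12.0, 11.0])
```

After this transformation every attribute has the same spread regardless of its original scale, which is why a fixed side length can be applied to all dimensions.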
10.
K-means clustering is a very popular clustering technique, which is used in numerous applications. In the k-means clustering
algorithm, each point in the dataset is assigned to the nearest cluster by calculating the distances from each point to the
cluster centers. The computation of these distances is a very time-consuming task, particularly for large datasets and a large
number of clusters. In order to achieve high performance, we need to reduce the number of the distance calculations for each
point efficiently. In this paper, we describe an FPGA implementation of k-means clustering for color images based on the filtering
algorithm. In our implementation, when calculating the distances for each point, clusters which are apparently not closer
to the point than other clusters are filtered out using kd-trees which are dynamically generated on the FPGA in each iteration
of k-means clustering. The performance of our system for 512 × 512 and 640 × 480 pixel images (24-bit full color RGB) is
more than 30 fps, and 20–30 fps for 756 × 512 pixel images on average when dividing into 256 clusters.
Tsutomu Maruyama (Corresponding author)
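The assignment step described in the abstract above, where each point goes to the nearest cluster center, can be sketched in a few lines. This is the plain brute-force version that computes every distance. It is not the kd-tree filtering algorithm or the FPGA design from the paper, and the function names and sample pixels are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_to_nearest(points, centers):
    """Return, for each point, the index of its nearest center.

    Costs len(points) * len(centers) distance computations per pass,
    which is the cost a filtering scheme aims to reduce.
    """
    return [
        min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))
        for point in points
    ]


def update_centers(points, labels, centers):
    """Move each center to the mean of its assigned points."""
    updated = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old_center)  # keep an empty cluster in place
            continue
        dims = range(len(old_center))
        updated.append(tuple(sum(p[d] for p in members) / len(members) for d in dims))
    return updated


# Four RGB pixels and two initial centers (pure red, pure blue).
pixels = [(250, 10, 10), (240, 20, 0), (10, 10, 250), (0, 30, 240)]
centers = [(255, 0, 0), (0, 0, 255)]
labels = assign_to_nearest(pixels, centers)
centers = update_centers(pixels, labels, centers)
```

With 256 clusters and a 512 × 512 image, one pass of this brute-force loop needs about 67 million distance evaluations (262,144 pixels × 256 centers), which gives a sense of why pruning candidate centers matters.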