Similar Documents
20 similar documents found (search time: 31 ms)
1.
曲武  王莉军  韩晓光 《计算机科学》2014,41(11):195-202
In recent years, the wide application of computer and information-processing technology in industrial production, information processing, and other fields has continuously generated large volumes of time-evolving sequential data, forming time-series data streams, e.g., Internet news corpus analysis, network intrusion detection, stock market analysis, and sensor network data analysis. Real-time clustering of data streams is a hot topic in current data stream mining research. Although single-pass algorithms meet the demands of high-speed, large-scale, real-time stream analysis, their effectiveness and scalability are limited by the lack of efficient clustering algorithms for identifying and distinguishing patterns. To address these problems, this paper proposes DLCStream, a distributed data stream clustering algorithm based on locality-sensitive hashing (LSH) for cloud environments. By introducing the Map-Reduce framework and an LSH mechanism, DLCStream can quickly find clustering patterns in data streams. Detailed theoretical analysis and experiments show that, compared with the traditional data stream clustering framework CluStream, DLCStream offers advantages in efficient parallel processing, scalability, and clustering quality.
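DLCStream's internals are not reproduced in this abstract, but the locality-sensitive hashing idea it builds on can be sketched. The sign-projection (cosine) LSH below hashes each stream point to a short bit signature, so nearby points tend to fall into the same bucket and candidate patterns can be found without all-pairs comparison. All names and parameter values are illustrative, not the paper's.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Random hyperplanes for signed-projection (cosine) LSH."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(point, planes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(int(sum(p * w for p, w in zip(point, plane)) >= 0)
                 for plane in planes)

planes = make_hyperplanes(dim=2, n_bits=4, seed=42)
buckets = {}
stream = [(1.0, 1.1), (1.05, 0.95), (-3.0, -2.9), (-2.8, -3.1)]
for pt in stream:
    buckets.setdefault(lsh_signature(pt, planes), []).append(pt)
```

In a distributed setting the signature doubles as a natural Map-Reduce key: mappers emit (signature, point) pairs and reducers cluster within each bucket.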

2.
Although the problem of clustering numerical time-evolving data is well explored, clustering categorical time-evolving data remains a challenging issue. In this paper, we propose a generalized clustering framework that utilizes existing clustering algorithms and adopts a sliding-window technique to detect whether a concept drift occurs in the incoming window. The framework is composed of two algorithms: the Drifting Concept Detecting (DCD) algorithm, which detects changes in cluster distributions between the current sliding window and the last clustering result, and the Cluster Relationship Analysis (CRA) algorithm, which analyzes the relationship between clustering results at different times. In DCD, the concept is said to drift if a large number of outliers are found in the current sliding window, or if the data-point ratios of a large number of clusters change. When drift is detected, the window is re-clustered to capture the recent concept. In CRA, a visualization method is devised to facilitate observation of the evolving clustering results. The framework is validated on real and synthetic data sets and is shown not only to detect drifting concepts accurately but also to attain clustering results of better quality.
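As a rough illustration of the DCD outlier criterion (simplified to one numeric dimension, whereas the paper targets categorical data): drift is declared when the fraction of window points that fit no existing cluster exceeds a threshold. The radius test and threshold values below are assumptions for the sketch, not the paper's exact criteria.

```python
def is_outlier(point, centroids, radius):
    """A point is an outlier if it is far from every existing centroid."""
    return all(abs(point - c) > radius for c in centroids)

def concept_drifted(window, centroids, radius=1.0, outlier_ratio=0.3):
    """DCD-style check: drift if too many window points fall outside
    every cluster of the last clustering result."""
    outliers = sum(is_outlier(p, centroids, radius) for p in window)
    return outliers / len(window) > outlier_ratio

centroids = [0.0, 10.0]                      # last clustering result
stable_window  = [0.2, -0.4, 9.8, 10.3, 0.1]  # points near old clusters
drifted_window = [5.1, 4.8, 5.3, 0.2, 9.9]    # a new mode appears at ~5
```

Once `concept_drifted` returns true, the framework re-clusters the current window rather than updating the stale clusters.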

3.
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept is intuitive, and it allows the weight of each item in a cluster to change dynamically with the occurrences of items. Second, we develop a clustering algorithm based on the weighted coverage density measure: a fast, memory-efficient, and scalable algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific evaluation metrics are critical for capturing transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods, which automatically tune the two important parameters of our clustering algorithms: (1) candidate values for the best number K of clusters; and (2) the application of the two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
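The abstract does not spell out the weighted coverage density formula, so the sketch below is one hedged reading of it: plain coverage density is total item occurrences divided by (transactions x distinct items), and the weighted variant additionally weights each item by its share of all occurrences so that frequent items count more. A dense cluster of near-identical transactions should score higher than a loose one.

```python
from collections import Counter

def weighted_coverage_density(cluster):
    """Hedged reading of SCALE's measure, not a verbatim reimplementation:
    each item's count is weighted by its fraction of all occurrences,
    then normalized by (num transactions x num distinct items)."""
    counts = Counter(item for t in cluster for item in t)
    n, m = len(cluster), len(counts)
    total = sum(counts.values())
    return sum(c * (c / total) for c in counts.values()) / (n * m)

tight = [{"a", "b"}, {"a", "b"}, {"a", "b"}]   # identical transactions
loose = [{"a", "b"}, {"c", "d"}, {"e", "f"}]   # nothing shared
```

Under this reading, adding a transaction that shares items with the cluster raises the score, which is the property a clustering criterion for transactions needs.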

4.
Identifying genes that are differentially expressed under different experimental conditions is a fundamental task in microarray studies. However, different ranking methods generate very different gene lists, and this can profoundly impact follow-up analyses and biological interpretation; developing improved ranking methods is therefore critical in microarray data analysis. We developed a new algorithm, the probabilistic fold change (PFC), which ranks genes based on a confidence interval estimate of fold change. We performed extensive testing using multiple benchmark data sources, including the MicroArray Quality Control (MAQC) data sets, and corroborated our observations on the MAQC data sets using qRT-PCR data sets and Latin square spike-in data sets. Along with PFC, we tested six other popular ranking algorithms: Mean Fold Change (FC), SAM, t-statistic (T), Bayesian-t (BAYT), Intensity-Conditional Fold Change (CFC), and Rank Product (RP). PFC achieved reproducibility and accuracy that were consistently among the best of the seven ranking algorithms, while the other algorithms showed weaknesses in some cases. Contrary to common belief, our results demonstrated that statistical accuracy does not necessarily translate into biological reproducibility, and therefore both quality aspects need to be evaluated.
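PFC's exact probabilistic model is not given in the abstract; the sketch below only captures the underlying intuition of ranking by a confidence bound on the fold change rather than the raw mean difference, using a simple z-based lower bound as a stand-in. A gene with a consistent fold change then outranks one with a similar mean difference but high variance.

```python
import math
import statistics

def fold_change_lower_bound(group_a, group_b, z=1.645):
    """Conservative rank score in the spirit of PFC (simplified stand-in):
    |mean log2 fold change| minus z standard errors, so noisy fold
    changes are shrunk toward zero."""
    la = [math.log2(x) for x in group_a]
    lb = [math.log2(x) for x in group_b]
    diff = statistics.mean(la) - statistics.mean(lb)
    se = math.sqrt(statistics.variance(la) / len(la) +
                   statistics.variance(lb) / len(lb))
    return abs(diff) - z * se

consistent = fold_change_lower_bound([8.0, 8.2, 7.9], [2.0, 2.1, 1.9])
noisy      = fold_change_lower_bound([16.0, 1.0, 9.0], [2.0, 2.1, 1.9])
```

The plain mean fold change would rank both genes similarly; the confidence-bound score demotes the noisy one, which is the behavior the paper argues improves reproducibility.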

5.
Task Scheduling Based on Multidimensional Performance Clustering of Grid Service Resources (total citations: 1; self: 0; others: 1)
陈志刚  杨博 《软件学报》2009,20(10):2766-2775
Grid computing is an important current research area, and task scheduling is a fundamental component of it whose performance directly affects grid quality of service. To shorten scheduling completion time and improve scheduling performance, this paper proposes MPCGSR, a task scheduling algorithm based on multidimensional performance clustering of grid service resources. Given the huge number, heterogeneity, and diversity of service resources in grid environments, the algorithm first builds a hypergraph model of grid service resources, clusters the resources by multidimensional performance with reference to small-world theory, and then matches tasks to the clustered resources for scheduling. Simulation results show that the algorithm outperforms comparable algorithms and is an effective grid task scheduling algorithm.

6.
The weapon-target assignment (WTA) problem is crucial for strategic planning in military decision-making operations. It defines the best way to assign defensive resources against threats in combat scenarios. This is an NP-complete problem for which no exact solution is available to deal with all possible scenarios. A critical issue in modeling the WTA problem is the time performance of the developed algorithms, a subject contemplated only recently in related publications. This paper presents a hybrid approach that combines ant colony optimization with a greedy algorithm, called the Greedy Ant Colony System (GACS), in which a multi-colony parallel strategy was also implemented to improve the results. Simulations aimed at large-scale air combat scenarios and controlling the algorithm's time performance were executed, achieving good-quality results.
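The greedy half of a GACS-style solver can be sketched on its own (the ant-colony layer, which perturbs and learns from these greedy constructions, is omitted): each weapon is assigned to the target where it removes the most expected surviving value. The kill-probability matrix and values below are illustrative.

```python
def greedy_wta(kill_prob, target_value):
    """Greedy WTA construction: give each weapon in turn to the target
    where it destroys the most expected surviving value."""
    survival = list(target_value)             # expected surviving value per target
    assignment = []
    for probs in kill_prob:                   # one row of kill probabilities per weapon
        gains = [survival[t] * probs[t] for t in range(len(survival))]
        best = max(range(len(gains)), key=gains.__getitem__)
        assignment.append(best)
        survival[best] *= (1 - probs[best])   # target partially neutralized
    return assignment, survival

kill_prob = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]]
values = [10.0, 5.0]
assignment, survival = greedy_wta(kill_prob, values)
```

Because each step is O(targets), the construction scales to the large air-combat scenarios the paper targets; the ant colony then searches around such solutions.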

7.
A Clustering Algorithm for Data Streams with Mixed Attributes (total citations: 5; self: 0; others: 5)
杨春宇  周杰 《计算机学报》2007,30(8):1364-1371
Clustering is an important problem in data stream mining. Real-world data streams often contain both continuous and nominal attributes, but existing algorithms are limited to handling only one kind and simply discard the other; so far no algorithm handles mixed attributes at the algorithmic level. This paper proposes a clustering algorithm for mixed-attribute data streams: it models stream arrivals as a Poisson process, describes nominal attributes with frequency histograms, and gives algorithms for creating, updating, merging, and deleting micro-clusters under mixed attributes. Experiments on public data sets show that the proposed algorithm performs robustly.
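A micro-cluster for mixed attributes, in the spirit of the abstract, can keep linear and squared sums for numeric attributes (so mean and variance are recoverable) and a frequency histogram for nominal attributes. Field names in this sketch are illustrative, not the paper's.

```python
class MicroCluster:
    """Mixed-attribute micro-cluster: additive numeric summaries plus
    a frequency histogram over the nominal attribute."""
    def __init__(self, n_numeric):
        self.n = 0
        self.ls = [0.0] * n_numeric          # linear sums
        self.ss = [0.0] * n_numeric          # squared sums
        self.hist = {}                       # nominal value -> count

    def insert(self, numeric, nominal):
        self.n += 1
        for i, x in enumerate(numeric):
            self.ls[i] += x
            self.ss[i] += x * x
        self.hist[nominal] = self.hist.get(nominal, 0) + 1

    def mean(self):
        return [s / self.n for s in self.ls]

mc = MicroCluster(n_numeric=2)
for num, cat in [((1.0, 2.0), "tcp"), ((3.0, 4.0), "udp"), ((2.0, 3.0), "tcp")]:
    mc.insert(num, cat)
```

Because every field is additive, two micro-clusters merge by summing their fields, which is what makes the update/merge/delete operations the paper describes cheap.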

8.
An ensemble of clustering solutions or partitions may be generated for a number of reasons. If the data set is very large, clustering may be done on disjoint subsets of tractable size. The data may be distributed at different sites, for which a distributed clustering solution with a final merging of partitions is a natural fit. In this paper, two new approaches to combining partitions, represented by sets of cluster centers, are introduced. The advantage of these approaches is that they provide a final partition of data comparable to the best existing approaches, yet scale to extremely large data sets: they can be 100,000 times faster while using much less memory. The new algorithms are compared against the best existing cluster-ensemble merging approaches, against clustering all the data at once, and against a clustering algorithm designed for very large data sets. The comparison is done for fuzzy and hard k-means based clustering algorithms. It is shown that the centroid-based ensemble merging algorithms presented here generate partitions of quality comparable to the best label-vector approach or to clustering all the data at once, while providing very large speedups.
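The reason centroid-based merging is so much cheaper than label-vector approaches is that it works only on the handful of centers each partition produces, never on the raw points. The one-dimensional sketch below pools the centers from every partition and greedily merges centers that fall within a tolerance, averaging as it goes; the paper's algorithms instead cluster the centers properly, so this is a simplified stand-in.

```python
def merge_center_ensembles(center_sets, tol=1.0):
    """Greedy centroid merging (simplified 1-D stand-in for the paper's
    approach): pool all centers, fold each into the first merged center
    within `tol`, weighting by how many centers it already absorbed."""
    merged = []                               # list of [center, weight]
    for centers in center_sets:
        for c in centers:
            for m in merged:
                if abs(c - m[0]) <= tol:
                    m[0] = (m[0] * m[1] + c) / (m[1] + 1)
                    m[1] += 1
                    break
            else:
                merged.append([c, 1])
    return [m[0] for m in merged]

# Two partitions of the same data, each summarized only by its centers:
final = merge_center_ensembles([[0.1, 10.2], [-0.1, 9.8]])
```

With k centers per partition and p partitions, the work is O((kp)^2) regardless of the number of data points, which is where the very large speedups come from.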

9.
Computational Grids and peer-to-peer (P2P) networks enable the sharing, selection, and aggregation of geographically distributed resources for solving large-scale problems in science, engineering, and commerce. The management and composition of resources and services for scheduling applications, however, becomes a complex undertaking. We have proposed a computational economy framework for regulating the supply of and demand for resources and allocating them to applications based on the users' quality-of-service requirements. The framework requires economy-driven deadline- and budget-constrained (DBC) scheduling algorithms for allocating resources to application jobs in such a way that the users' requirements are met. In this paper, we propose a new scheduling algorithm, called the DBC cost-time optimization scheduling algorithm, that aims to optimize not only cost but also time when possible. The performance of the cost-time optimization scheduling algorithm has been evaluated through extensive simulation and empirical studies for deploying parameter sweep applications on global Grids. Copyright © 2005 John Wiley & Sons, Ltd.

10.
Clustering aggregation, known as clustering ensembles, has emerged as a powerful technique for combining different clustering results to obtain a single better clustering. Existing clustering aggregation algorithms are applied directly to data points, in what is referred to as the point-based approach. The algorithms are inefficient if the number of data points is large. We define an efficient approach for clustering aggregation based on data fragments. In this fragment-based approach, a data fragment is any subset of the data that is not split by any of the clustering results. To establish the theoretical bases of the proposed approach, we prove that clustering aggregation can be performed directly on data fragments under two widely used goodness measures for clustering aggregation taken from the literature. Three new clustering aggregation algorithms are described. The experimental results obtained using several public data sets show that the new algorithms have lower computational complexity than three well-known existing point-based clustering aggregation algorithms (Agglomerative, Furthest, and LocalSearch); nevertheless, the new algorithms do not sacrifice the accuracy.
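The fragment construction itself is simple to state: a fragment is a maximal set of points that receives the same label vector (one label per input clustering) across all clusterings, i.e. a set no clustering splits. A minimal sketch:

```python
def data_fragments(labelings):
    """Group point indices by their label vector across all clusterings;
    each group is a data fragment no input clustering splits."""
    fragments = {}
    n = len(labelings[0])
    for i in range(n):
        key = tuple(lab[i] for lab in labelings)
        fragments.setdefault(key, []).append(i)
    return list(fragments.values())

clustering1 = [0, 0, 0, 1, 1]   # labels of 5 points under clustering 1
clustering2 = [0, 0, 1, 1, 1]   # labels under clustering 2
frags = data_fragments([clustering1, clustering2])
```

Aggregation then operates on the (usually far fewer) fragments instead of the points, which is the source of the complexity reduction the abstract reports.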

11.
Recent advancements in microarray technology permit simultaneous monitoring of the expression levels of a large set of genes across a number of time points. Extracting knowledge from such a huge volume of microarray gene expression data requires computational analysis, and clustering is one of the important data mining tools for grouping similar genes. Researchers have proposed a number of clustering algorithms for this purpose. In this article, an attempt has been made to improve the performance of fuzzy clustering by combining it with a support vector machine (SVM) classifier. A recently proposed real-coded variable-string-length genetic-algorithm-based clustering technique and an iterated version of fuzzy C-means clustering have been utilized for this purpose. The performance of the proposed clustering scheme has been compared with that of some well-known existing clustering algorithms and their SVM-boosted versions on one simulated and six real-life gene expression data sets. A statistical significance test based on analysis of variance (ANOVA), followed by a posteriori Tukey-Kramer multiple comparison test, has been conducted to establish the statistical significance of the superior performance of the proposed clustering scheme. Moreover, the biological significance of the clustering solutions has been established.

12.
In the era of distributed computing and in-memory processing, Spark, a distributed framework based on in-memory computation, has received unprecedented attention and application. This paper focuses on the design and implementation of a parallel BIRCH algorithm on Spark. Theoretical performance analysis identifies the Spark transformation operations that consume the most time during parallelization, and the directed acyclic graph (DAG) of the parallel BIRCH algorithm is used to reduce the frequency of shuffles and disk I/O for performance optimization. Finally, the parallelized BIRCH algorithm is compared experimentally with single-machine BIRCH and with the K-Means clustering algorithm in MLlib. The results show that parallelizing BIRCH on Spark causes no significant loss of clustering quality while achieving favorable running time and speedup.
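The property that makes BIRCH amenable to this kind of parallelization is that its clustering feature (CF) entries, the triples (N, LS, SS) of count, linear sum, and squared sum, are additive: partial CFs built on different partitions combine by simple addition in a reduce step. A minimal one-dimensional sketch (field names illustrative):

```python
class CF:
    """BIRCH clustering feature (N, LS, SS) for one numeric dimension."""
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    def add_point(self, x):
        return CF(self.n + 1, self.ls + x, self.ss + x * x)

    def merge(self, other):
        # Additivity is the key: a reduce step just sums the fields.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

left = CF()                      # partial CF from partition 1
for x in [1.0, 2.0, 3.0]:
    left = left.add_point(x)
right = CF()                     # partial CF from partition 2
for x in [5.0, 7.0]:
    right = right.add_point(x)
combined = left.merge(right)
```

In a Spark job, each partition would build CFs like `left` and `right` in a map stage, and a single reduce (one shuffle) would merge them, which is consistent with the paper's goal of minimizing shuffle frequency.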

13.
Over the last several years, many clustering algorithms have been applied to gene expression data. However, most clustering algorithms force the user into a single set of clusters, resulting in a restrictive biological interpretation of gene function; it is difficult to interpret complex biological regulatory mechanisms and genetic interactions from such a restrictive reading of microarray expression data. The software package SignatureClust allows users to select a group of functionally related genes (called ‘Landmark Genes’) and to project the gene expression data onto these genes. Compared to existing algorithms and software in this domain, our package offers two unique benefits. First, by selecting different sets of landmark genes, it enables the user to cluster the microarray data from multiple biological perspectives, which encourages data exploration and the discovery of new gene associations. Second, most clustering packages provide internal validation measures, whereas our package validates the biological significance of the new clusters by retrieving significant ontology and pathway terms associated with them. SignatureClust is a free software tool that gives biologists multiple views of the microarray data and highlights gene associations that were not found using a traditional clustering algorithm. The software package ‘SignatureClust’ and the user manual can be downloaded from .

14.
CASS is a task management system that provides facilities for automatic grain-size optimization and task scheduling of parallel programs on distributed memory parallel architectures. The heart of CASS is a clustering module that partitions the tasks of a program into clusters that match the granularity of the target machine. This paper describes the clustering algorithms used by CASS and compares them with the best known algorithms reported in the literature, namely the PY algorithm (for clustering with task duplication) and the DSC algorithm (for clustering with no task duplication). It is shown that the clustering algorithms used by CASS outperform both the PY and DSC algorithms in terms of both speed and solution quality.

15.
Cluster analysis for gene expression data: a survey (total citations: 16; self: 0; others: 16)
DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have also been proposed recently that aim specifically at gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
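The partitioning step the survey describes can be illustrated with the prototypical algorithm, Lloyd's k-means, applied to toy expression profiles (the survey itself covers many more specialized algorithms; the data and starting centers here are illustrative):

```python
def kmeans(points, centers, iters=10):
    """Minimal Lloyd's k-means: alternate assigning each profile to its
    nearest center and recomputing centers as cluster means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g))
                   for g in groups if g]      # drop empty clusters
    return centers, groups

# Each tuple is a gene's expression profile across two conditions:
profiles = [(1.0, 2.0), (1.2, 1.8), (8.0, 9.0), (8.2, 9.1)]
centers, groups = kmeans(profiles, centers=[(0.0, 0.0), (10.0, 10.0)])
```

For gene expression work, Euclidean distance is often replaced by correlation-based distances so that genes with the same shape of profile cluster together; the alternation itself is unchanged.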

16.
Grid task scheduling is an important current research area. Grid environments are dynamic and heterogeneous, and both the processing performance and the stability of grid resources are important factors in the successful completion of task scheduling. To achieve shorter task completion times, this paper builds a hypergraph model of grid resources that reflects the characteristics of the grid environment, clusters the resources by performance on the basis of this model, and proposes GRHTS, a trusted task scheduling algorithm. Simulation results show that this trusted task scheduling algorithm based on the grid resource hypergraph model outperforms comparable algorithms and is an effective grid task scheduling algorithm.

17.
With the continuous development of network technology, distributed systems have been widely studied and applied. However, because network bandwidth in distributed systems is limited and the number of critical resources is fixed, designing distributed mutual exclusion algorithms with light network load and high critical-resource utilization is of great significance. This paper first introduces several classical mutual exclusion algorithms and compares their performance; building on this analysis, it then proposes a new token-based algorithm and describes its design and data structures in detail. The algorithm's main feature is that it introduces priorities and the concept of election algorithms into distributed mutual exclusion, which can effectively improve inter-process communication efficiency.

18.
In the rapidly evolving field of genomics, many clustering and classification methods have been developed and employed to explore patterns in gene expression data. Biologists face the choice of which clustering algorithm(s) to use and how to interpret differing results from various clustering algorithms. No clear objective criteria have been developed to assess the agreement between, and compare the results of, different clustering methods. We describe two generally applicable objective measures to quantify agreement between different clustering methods. These two measures are referred to as the local agreement measure, which is defined for each gene/subject, and the global agreement measure, which is defined for the whole gene expression experiment. The agreement measures are based on a probabilistic weighting scheme applied to the number of concordant and discordant pairs from two clustering methods. In the comparison and assessment process, newly developed concepts are implemented under the framework of the reliability of a cluster. The algorithms are illustrated by simulations and then applied to a yeast sporulation gene expression microarray data set. Analysis of the sporulation data identified ∼5% (23 of 477) of genes that were not consistently clustered using a neural net algorithm and K-means or pam. The two agreement measures provide objective criteria for concluding whether or not two clustering methods agree with each other. Using the local agreement measure, genes of unknown function that cluster consistently can more confidently be assigned functions based on co-regulation.
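The concordant/discordant-pair idea behind the global measure can be illustrated with its unweighted special case, which reduces to the Rand index; the paper's measures add a probabilistic reliability weighting on top, which is omitted in this sketch:

```python
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Unweighted pair agreement (the Rand index): a pair of genes is
    concordant if both clusterings put it together, or both apart."""
    concordant = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        concordant += same_a == same_b
        total += 1
    return concordant / total

identical = pair_agreement([0, 0, 1, 1], [1, 1, 0, 0])  # same partition, relabeled
crossed   = pair_agreement([0, 0, 1, 1], [0, 1, 0, 1])  # partitions cut across each other
```

Because it compares pairs rather than labels, the measure is invariant to relabeling of clusters, which is why two methods can be compared without aligning their cluster names first.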

19.
H. B. Zhou 《Parallel Computing》1993,19(12):1359-1373
This paper presents a group of multiple-way graph (with weighted nodes and edges) partitioning algorithms based on a two-stage constructive-and-refinement mechanism. The graph partitions can be used to control the allocation of program units to distributed processors so as to minimize completion time, and for design-automation applications. In the constructive stage, four clustering algorithms are used to construct raw partitions; the refinement stage first adjusts the cluster number to the processor number and then iteratively improves the partitioning cost using a Kernighan-Lin based heuristic. This approach represents several extensions to the state-of-the-art methods. A performance comparison of the proposed algorithms is given, based on experimental results.

20.
Advances in Cluster Analysis of Gene Expression Data (total citations: 3; self: 1; others: 3)
The explosive growth of gene expression data urgently demands automatic and effective analysis tools, and cluster analysis has become a powerful means of extracting biological information from such data. To mine gene expression data more effectively, many improved traditional clustering algorithms and new clustering algorithms have been proposed in recent years. This paper first briefly introduces how gene expression data are obtained and represented, then systematically surveys the clustering algorithms applied to gene expression data analysis in recent years. According to the clustering target, the algorithms are divided into gene-based clustering, sample-based clustering, and two-way clustering; for each category, the biological meaning and the difficulties are introduced, and the basic principles, advantages, and disadvantages of the algorithms are discussed in detail. Finally, current clustering methods for gene expression data are summarized and development trends are discussed.
