Similar literature
 20 similar documents found (search time: 31 ms)
1.
Generating high-quality gene clusters and identifying the underlying biological mechanism of the gene clusters are important goals of gene expression clustering analysis. To get high-quality cluster results, most current approaches rely on choosing the best clustering algorithm, one whose design biases and assumptions match the underlying distribution of the dataset. There are two issues with this approach: 1) the underlying data distribution of gene expression datasets is usually unknown, and 2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of the gene clusters, the most explored approach is the extractive approach, which essentially builds upon techniques borrowed from information retrieval, where the objective is to provide terms for query expansion and not to act as a stand-alone summary for the entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and are studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, to address these challenging issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and to provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization based algorithm can automatically identify subtopics and extract the most probable terms for each topic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.

2.
Many existing clustering algorithms have been used to identify coexpressed genes in gene expression data. These algorithms are used mainly to partition data in the sense that each gene is allowed to belong to only one cluster. Since proteins typically interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge to gene expression data clustering, as overlapping clusters need to be discovered in a noisy environment. For this task, we propose in this paper an effective information-theoretic approach, which consists of an initial clustering phase and a second reclustering phase. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performance of existing clustering algorithms and is able to effectively uncover interesting patterns in noisy gene expression data so that, based on these patterns, overlapping clusters can be discovered.

3.
MapReduce has emerged as a popular computing model used in datacenters to process large datasets. In the map phase, hash partitioning is employed to distribute data sharing the same key across datacenter-scale cluster nodes. However, we observe that this approach can lead to uneven data distribution, which can result in skewed loads among reduce tasks and thus hamper the performance of MapReduce systems. Moreover, worker nodes in MapReduce systems may differ in computing capability due to (1) multiple generations of hardware in non-virtualized data centers, or (2) co-location of virtual machines in virtualized data centers. The heterogeneity among cluster nodes exacerbates the negative effects of uneven data distribution. To improve MapReduce performance in heterogeneous clusters, we propose a novel load balancing approach in the reduce phase. This approach consists of two components: (1) performance prediction for reducers that run on heterogeneous nodes, based on support vector machine models, and (2) heterogeneity-aware partitioning (HAP), which balances skewed data for reduce tasks. We implement this approach as a plug-in in a current MapReduce system. Experimental results demonstrate that our proposed approach distributes work evenly among reduce tasks and improves MapReduce performance with little overhead.
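The skew described in this abstract can be illustrated with a small sketch. It is not the paper's HAP method; it only shows, for an assumed workload with one hot key, how plain hash partitioning can overload a single reducer. The function names, the workload, and the choice of CRC32 as the hash are illustrative assumptions.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index with a deterministic hash.

    CRC32 is used here (an assumption for illustration) because Python's
    built-in hash() for strings is salted per process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer receives."""
    loads = [0] * num_reducers
    for key in keys:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


# Hypothetical skewed workload: 900 records share one hot key,
# the remaining 100 records have distinct keys.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
loads = reducer_loads(keys, num_reducers=4)
print(loads)
```

Because every record with the same key goes to the same reducer, the reducer that receives the hot key handles at least 900 of the 1000 records regardless of how the other keys are spread.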

4.
Mammography is considered the most effective method for early detection of breast cancers. However, it is difficult for radiologists to detect microcalcification clusters. Therefore, we have developed a computerized scheme for detecting early-stage microcalcification clusters in mammograms. We first developed a novel filter bank based on the concept of the Hessian matrix for classifying nodular structures and linear structures. The mammogram images were decomposed by this filter bank into several subimages for the second difference at scales from 1 to 4. The subimages for the nodular component (NC) and the subimages for the nodular and linear component (NLC) were then obtained from analysis of the Hessian matrix. Many regions of interest (ROIs) were selected from the mammogram image. In each ROI, eight features were determined from the subimages for NC at scales from 1 to 4 and the subimages for NLC at scales from 1 to 4. The Bayes discriminant function was employed for distinguishing among abnormal ROIs with a microcalcification cluster and two different types of normal ROIs without a microcalcification cluster. We evaluated the detection performance by using 600 mammograms. Our computerized scheme was shown to have the potential to detect microcalcification clusters with a clinically acceptable sensitivity and few false positives.

5.
Cluster analysis of gene expression data from a cDNA microarray is useful for identifying biologically relevant groups of genes. However, finding the natural clusters in the data and estimating the correct number of clusters are still two largely unsolved problems. In this paper, we propose a new clustering framework that is able to address both of these problems. By using the one-prototype-take-one-cluster (OPTOC) competitive learning paradigm, the proposed algorithm can find natural clusters in the input data, and the clustering solution is not sensitive to initialization. In order to estimate the number of distinct clusters in the data, we propose a cluster splitting and merging strategy. We have applied the new algorithm to simulated gene expression data for which the correct distribution of genes over clusters is known a priori. The results show that the proposed algorithm can find natural clusters and give the correct number of clusters. The algorithm has also been tested on real gene expression changes during the yeast cell cycle, for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies. Comparative studies with several clustering algorithms illustrate the effectiveness of our method.

6.
Nonparametric genetic clustering: comparison of validity indices   Cited by: 3 (self-citations: 0, citations by others: 3)
A variable-string-length genetic algorithm (GA) is used to develop a novel nonparametric clustering technique for the case where the number of clusters is not fixed a priori. Chromosomes in the same population may now have different lengths, since they encode different numbers of clusters. The crossover operator is redefined to handle variable string lengths. A cluster validity index is used as the measure of the fitness of a chromosome. The performance of several cluster validity indices in appropriately partitioning a data set is compared, namely the Davies-Bouldin (1979) index, Dunn's (1973) index, two of its generalized versions, and a recently developed index.
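As a rough illustration of how a validity index can score a candidate partition, the sketch below computes the Davies-Bouldin index in its common form (mean distance to the centroid as within-cluster scatter, Euclidean distance between centroids). Whether the paper uses exactly this variant is an assumption, and the function names are hypothetical. Lower values indicate more compact, better separated clusters, so a GA could use, for example, the reciprocal as fitness.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a partition given as a list of point lists.

    Assumes at least two non-empty clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean Euclidean distance to the centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two ways of partitioning the same four points.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

The partition that groups nearby points gets a much smaller index than the one that splits them, which is the behaviour a fitness function built on this index relies on.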

7.
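The hydrogen molecule is the structurally simplest molecule in the universe, and its clusters serve as a model system for molecular cluster science. The structure and dynamics of hydrogen molecular clusters remain one of the active topics in basic research. Experimental and theoretical studies of hydrogen molecular clusters focus on three areas: the formation conditions and size distribution of the clusters; their ionization energies and appearance potentials; and quantum chemical calculations. We mainly review the current state of experimental and theoretical research on hydrogen molecular clusters abroad, and also report our own progress on the photoionization of hydrogen clusters, in the hope of providing reference information for future fundamental and applied research on hydrogen molecular clusters.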

8.
This paper proposes a dynamic-model-based method for selecting significantly expressed (SE) genes from their time-course expression profiles. A gene is considered to be SE if its time-course expression profile is more likely to be time-dependent than random. The proposed method describes a time-dependent gene expression profile by a nonzero-order autoregressive (AR) model, and a time-independent gene expression profile by a zero-order AR model. The Akaike information criterion (AIC) is used to compare the models and subsequently determine whether a time-course gene expression profile is time-independent or time-dependent. The performance of the proposed method is investigated on both a synthetic dataset and a real-life biological dataset in terms of the false discovery rate (FDR) and the false nondiscovery rate (FNR). The results show that the proposed method is valid for selecting SE genes from their time-course expression profiles.
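The model comparison in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: only a first-order AR model is tried as the nonzero-order candidate, both models are fitted by ordinary least squares on the same observations, and the AIC is taken in the Gaussian least-squares form n·ln(RSS/n) + 2k. All function names are hypothetical.

```python
import math


def aic(rss, n, k):
    """Gaussian least-squares AIC; rss is floored to avoid log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + 2 * k


def ar0_rss(x):
    """Residual sum of squares of a constant-mean (zero-order) model.

    Fitted on x[1:] so that both models use the same observations.
    """
    y = x[1:]
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def ar1_rss(x):
    """Residual sum of squares of x[t] = c + a * x[t-1], fitted by OLS."""
    prev, cur = x[:-1], x[1:]
    n = len(cur)
    mp, mc = sum(prev) / n, sum(cur) / n
    sxx = sum((p - mp) ** 2 for p in prev)
    sxy = sum((p - mp) * (c - mc) for p, c in zip(prev, cur))
    a = sxy / sxx if sxx > 0 else 0.0
    c0 = mc - a * mp
    return sum((c - (c0 + a * p)) ** 2 for p, c in zip(prev, cur))


def is_time_dependent(x):
    """True if the AR(1) model has the lower AIC (1 vs. 2 mean parameters)."""
    n = len(x) - 1
    return aic(ar1_rss(x), n, 2) < aic(ar0_rss(x), n, 1)


trend = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]        # strongly depends on previous value
no_lag1 = [1, 1, -1, -1, 1, 1, -1, -1, 1]      # zero lag-1 correlation
print(is_time_dependent(trend), is_time_dependent(no_lag1))
```

The second series has no lag-1 correlation, so the extra AR(1) coefficient does not reduce the residual and the AIC penalty favours the zero-order model; a higher-order candidate would be needed to detect its period-4 structure.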

9.
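Reliability, maintainability, and supportability (RMS) are important factors affecting the mission effectiveness of a fleet system. Analytical models of fleet system effectiveness cannot capture the influence of factors such as the specific mission profile and spare-part support delays, whereas the Simlox simulation method can simulate and analyze the mission effectiveness of a fleet system under specific support conditions. The mission effectiveness of a fleet system performing continuous missions is simulated and analyzed, and the influence of the RMS parameters on mission effectiveness is obtained.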

10.
Designing energy-efficient and durable wireless sensor networks (WSNs) is a key challenge, as they embody potential and reactive functionalities in harsh, hostile environments where wired system deployment is completely infeasible. The majority of the clustering mechanisms in the literature concentrate on extending network lifetime and energy stability. However, the energy consumption incurred by cluster heads (CHs) is high, which reduces network lifetime and forces frequent CH selection. In this paper, a modified whale-dragonfly optimization algorithm and self-adaptive cuckoo search-based clustering strategy (MWIDOA-SACS) is proposed for sustaining energy stability and extending network lifetime. Specifically, MWIDOA-SACS exploits fitness values to determine two optimal nodes, which are selected as the CH and cluster router (CR) nodes in the network. In MWIDOA, the search behaviour of the dragonflies is updated through the whale optimization algorithm (WOA) to prevent load imbalance at the CHs. It minimizes the overhead of the CH by using both CHs and CRs to collect information from cluster members and to transmit the aggregated data from the CHs to the base station (BS). It includes self-adaptive cuckoo search (SACS) to achieve sink mobility using radius, energy stability, received signal strength, and throughput, so that data transmission is optimal after the network has been partitioned into unequal clusters. Simulation experiments with the proposed MWIDOA-SACS show improvements of 21.28% in total residual energy and 26.32% in network lifetime compared with competing CH selection strategies.

11.
The possibility of the existence of single-wall carbon nanotubes (SWNTs) in organic solvents in the form of clusters is discussed. A theory is developed based on a bundlet model for clusters, which enables describing the distribution function of clusters by size. Comparison of the calculated values of solubility with experiments would permit obtaining energetic parameters characterizing the interaction of an SWNT with its surroundings, in a solid or in solution. Fullerenes and SWNTs are unique objects, whose behaviour in many physical situations is characterized by remarkable peculiarities. Peculiarities in solutions show up first in that fullerenes and SWNTs represent the only soluble forms of carbon, which is related to the originality of the molecular structure of fullerenes and SWNTs. The fullerene molecule is a virtually uniform closed spherical or spheroidal surface, and an SWNT is a smooth cylindrical unit. Both structures give rise to the relatively weak interaction between neighbouring molecules in a crystal and promote interaction of the molecules with those of a solvent. Another peculiarity in solutions is related to their tendency to form clusters consisting of a number of fullerene molecules or SWNTs. The energy of interaction of a fullerene molecule or SWNT with solvent molecules is proportional to the surface of the former molecule and roughly independent of the orientation of the solvent molecules. All these phenomena have a unified explanation in the framework of the bundlet model of a cluster, according to which the free energy of an SWNT involved in a cluster is composed of two components, viz. a volume one proportional to the number of molecules n in a cluster, and a surface one proportional to n^(1/2). Algorithms for classification are proposed based on the criteria of information entropy and its production. Many classification algorithms are based on information entropy. When applying these procedures to sets of moderate size, an excessive number of results appear compatible with the data, and this number suffers a combinatorial explosion. However, after the equipartition conjecture, one has a selection criterion between different variants resulting from classification between hierarchical trees. According to this conjecture, for a given charge or duty, the best configuration of a flowsheet is the one in which the entropy production is most uniformly distributed. Information entropy, cluster, and principal component analyses agree.

12.
Computer vision and machine learning tools offer an exciting new way for automatically analyzing and categorizing information from complex computer simulations. Here we design an ensemble machine learning framework that can independently and robustly categorize and dissect simulation data output contents of turbulent flow patterns into distinct structure catalogs. The segmentation is performed using an unsupervised clustering algorithm, which segments physical structures by grouping together similar pixels in simulation images. The accuracy and robustness of the resulting segment region boundaries are enhanced by combining information from multiple simultaneously-evaluated clustering operations. The stacking of object segmentation evaluations is performed using image mask combination operations. This statistically-combined ensemble (SCE) of different cluster masks allows us to construct cluster reliability metrics for each pixel and for the associated segments without any prior user input. By comparing the similarity of different cluster occurrences in the ensemble, we can also assess the optimal number of clusters needed to describe the data. Furthermore, by relying on ensemble-averaged spatial segment region boundaries, the SCE method enables reconstruction of more accurate and robust region of interest (ROI) boundaries for the different image data clusters. We apply the SCE algorithm to 2-dimensional simulation data snapshots of magnetically-dominated fully-kinetic turbulent plasma flows where accurate ROI boundaries are needed for geometrical measurements of intermittent flow structures known as current sheets.

13.
The radar signal sorting method based on the traditional support vector clustering (SVC) algorithm has high time complexity, and the traditional validity index cannot efficiently indicate the best sorting result. To solve these problems, we study a new sorting method based on the cone cluster labeling (CCL) method. The CCL method relies on the theory of approximate coverings in both feature space and data space. A new cluster validity index, similitude entropy (SE), is also proposed. It can be used to evaluate the compactness and separation of clusters using information entropy theory. Simulations, including a performance comparison between the proposed method and conventional methods, are presented. Results show that the proposed method effectively reduces the computational complexity of sorting the signals while maintaining the sorting accuracy.

14.
Vehicular ad-hoc networks play several roles in disseminating alert messages between vehicles in danger; the most important is to provide helpful information to drivers (e.g., the road traffic state). However, performance improvements are frequently needed in terms of routing. Hence, several clustering approaches have been proposed to optimize the network services. These approaches are based on increasing data delivery, reducing data congestion, and dividing the traffic into clusters. However, a stable clustering algorithm is always required in order to ensure data dissemination in a dense, mobile, or large-scale environment. Therefore, in this paper, we propose a stable routing protocol based on a fuzzy logic system, which can deliver alert messages with minimum delay and improve the stability of the cluster structure by generating only a small number of clusters in the network. In this work, the fuzzy logic system is used to create the clusters and select a cluster head for each cluster. We used the network simulator (NS2) to generate the results. As a result, we could reduce the number of cluster head changes and increase the cluster member lifetime compared with recent approaches.

15.
In Fifth Generation (5G) Heterogeneous Mobile Networks (HetNets), deploying dense small cell networks makes user association more challenging. The process of collecting cell load information from the User Equipments (UEs) and broadcasting the feedback message involves significant overhead and time complexity. Moreover, the UEs may not know the optimum cell to reselect that satisfies their data rate requirements. To overcome these drawbacks, in this paper we propose a Hierarchical and Hybrid Cell Load Balancing (HHCLB) technique using Selective Handoff. In this technique, the UEs of each cell are grouped into clusters depending on their proximity. Each cluster contains a cluster controller (CC), which is in charge of determining the intra-cell load and redirecting the cell-reselection request of a UE. If the data rate of any UE in a cluster falls below its required rate, the cell reselection process is performed. Simulation results show that load balancing can be done proactively (implicitly) by the CCs when the load is unbalanced, or on demand (explicitly) when a UE sends a request for cell reselection. In the case of Macro cells, HHCLB attains 71% higher throughput for the low load scenario and 59% higher throughput for the high load scenario. Similarly, in the case of Femto cells, HHCLB attains 19% higher throughput for the low load scenario and 27% higher throughput for the high load scenario.

16.
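To address the dependence of traditional clustering algorithms on cluster-representative features and the incompleteness of their definitions, a remote sensing image segmentation method is proposed that combines multivariate information clustering with spatial constraints, taking into account the spatial relationships in remote sensing image data. For a given cluster, all of its data points are traversed as combinations of several data points (multivariate tuples), and the mutual information of these combinations is defined to characterize intra-cluster similarity; inter-cluster dissimilarity is characterized by computing the mutual information between pixels outside the cluster and the multivariate combinations inside it. On this basis, intra-cluster similarity and inter-cluster dissimilarity are established, an objective function is built from the balance between the two, and the Potts model is incorporated into the objective function to add spatial constraints. Image segmentation is finally achieved by maximizing the objective function. Qualitative and quantitative analyses of segmentation results on simulated and real panchromatic remote sensing images show that the method avoids having to define cluster-representative features, fundamentally eliminates their influence on image segmentation, and fully accounts for the spatial relationships in remote sensing data.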

17.
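Compared with the classical K-means clustering algorithm, the fuzzy C-means (FCM) clustering algorithm introduces a fuzzy factor to account for the relationships between different clusters and obtains clustering results with better separability. However, the fuzzy factor makes every sample point fuzzy, which leaves FCM highly susceptible to noise and outliers and gives clustering results that generalize poorly. This paper therefore proposes a robust FCM algorithm with between-cluster separability (RBI-FCM). RBI-FCM exploits the sparsity of the memberships in the K-means algorithm to reduce the interaction between different clusters and to emphasize separability in the regions where clusters adjoin. In addition, while minimizing within-cluster scatter, RBI-FCM takes the separability between clusters into account, which can improve the generalization of the clustering model. An effective iterative algorithm for solving the model is designed. Experimental results show that RBI-FCM improves the robustness of FCM, effectively reduces its sensitivity to differences in cluster distributions and to imbalanced sampling, and obtains satisfactory clustering results.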
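To make the role of the fuzzy factor concrete, the sketch below computes the membership update of standard FCM, not of the RBI-FCM variant proposed in the paper, whose exact update rule is not given in the abstract. The function name and the example centers are hypothetical; `m` is the usual fuzzifier, with m = 2 as a common default.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with Euclidean distances.
    Requires m > 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs only to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)
print(u)  # memberships sum to 1; the nearer center gets the larger share
```

Every point receives a nonzero membership in every cluster unless it sits exactly on a center, which is why a distant outlier still pulls on all centers; K-means, by contrast, assigns memberships of exactly 0 or 1, the sparsity that the abstract says RBI-FCM draws on.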

18.
ITERATE: a conceptual clustering algorithm for data mining   Cited by: 2 (self-citations: 0, citations by others: 2)
The data exploration task can be divided into three interrelated subtasks: 1) feature selection, 2) discovery, and 3) interpretation. This paper describes an unsupervised discovery method with biases geared toward partitioning objects into clusters that improve interpretability. The algorithm ITERATE employs: 1) a data ordering scheme and 2) an iterative redistribution operator to produce maximally cohesive and distinct clusters. Cohesion or intraclass similarity is measured in terms of the match between individual objects and their assigned cluster prototype. Distinctness or interclass dissimilarity is measured by an average of the variance of the distribution match between clusters. The authors demonstrate that interpretability, from a problem-solving viewpoint, is addressed by the intraclass and interclass measures. Empirical results demonstrate the properties of the discovery algorithm and its applications to problem solving.

19.
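For the FAT file system, most current recovery software has difficulty recovering permanently deleted files that were stored in physically non-contiguous clusters. This paper proposes a data recovery algorithm based on multi-structure information. By combining the cluster allocation information in the FAT with the starting cluster, timestamps, and other information in the directory table, it recovers some physically non-contiguous files well and improves the success rate of data recovery.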

20.
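Research on K-median-based clustering algorithms over data streams   Cited by: 1 (self-citations: 0, citations by others: 1)
This article studies and analyzes K-median clustering techniques over data streams, including: (1) the stream model and the definition of the K-median problem; (2) the basic decisions and underlying mechanisms of stream-based K-median clustering; and (3) streaming algorithms with theoretical performance guarantees. For each feature, the technique can determine cluster points effectively without actually retaining any data stream objects. It does so by generating cluster points dynamically, either splitting one cluster block in two or merging adjacent cluster blocks into one. As a result, the cluster points it determines are more accurate than those of other conventional methods. In a data stream environment, the technique runs very efficiently while producing high-quality clustering results.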

