Similar articles
20 similar articles found (search time: 125 ms)
1.
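Microarray technology is the main tool for functional genomics research in the post-genomic era. Cluster analysis of gene expression profile data is important for studying gene function and gene regulatory mechanisms. Clustering algorithms typically require the number of clusters to be fixed in advance, are sensitive to noise, and scale poorly. To address these problems, a new nearest-neighbor-absorbed-first clustering algorithm is proposed that draws on the distinct characteristics of the density-based clustering algorithm DBSCAN and Shared Nearest Neighbors (SNN). The algorithm is applied to a public yeast cell-cycle data set, and the FOM evaluation method is used to compare its clustering results with those of K-means. The results show that the proposed algorithm outperforms other clustering algorithms, that its clusters are biologically meaningful, and that it can predict and assess the number of classes in the data reasonably well.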
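
The abstract above does not give the details of the proposed algorithm, so the following is only a minimal sketch of one building block it mentions: shared-nearest-neighbor (SNN) similarity. The function names `knn_lists` and `snn_similarity` and the parameter `k` are assumptions made for illustration, and the mutual-neighbor rule used here is the common Jarvis-Patrick convention, not necessarily the paper's.

```python
from math import dist


def knn_lists(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def snn_similarity(points, k):
    """Shared-nearest-neighbour similarity matrix (illustrative sketch).

    Two points get a non-zero similarity only if each appears in the other's
    k-nearest-neighbour list; the similarity is then the number of neighbours
    the two lists have in common.
    """
    nn = knn_lists(points, k)
    n = len(points)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j in nn[i] and i in nn[j]:
                sim[i][j] = sim[j][i] = len(nn[i] & nn[j])
    return sim
```

A density-based method such as DBSCAN could then be run on this similarity instead of raw distances, which is one common way the two ideas are combined.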

2.
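A cluster analysis method for gene expression data based on representative entropy   Cited by: 1 (self-citations: 0, citations by others: 1)
Gene expression data have few samples and high dimensionality, and prior knowledge for sample classification is often lacking. For this setting, a two-way clustering algorithm based on representative entropy is proposed that builds on the strengths of self-organizing feature maps. The algorithm first clusters genes with a self-organizing feature map (SOM) network and selects feature genes according to a fluctuation coefficient. It then judges the quality of the gene clustering by the value of the representative entropy and determines the number of neurons in the network. Finally, the FCM (Fuzzy C-Means) clustering algorithm is used to classify samples on the selected feature gene set. Applied to two public gene expression data sets, the algorithm achieves high clustering accuracy while reducing the feature dimensionality.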

3.
王晓明  印莹 《计算机科学》2007,34(8):171-176
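DNA microarray technology makes it possible to monitor the expression levels of thousands of genes simultaneously. Applying traditional clustering algorithms directly to high-dimensional gene expression data suffers from the curse of dimensionality. Feature transformation and feature selection are two common approaches to dimensionality reduction, but the new features produced by the former are hard to interpret with the original domain knowledge, and the latter usually loses information. In addition, traditional clustering algorithms usually rely on user-specified parameters, and different settings strongly affect the clustering result. To address these problems, this paper proposes CIS, a new microarray data clustering algorithm based on iterative expansion. It uses neither feature transformation nor feature selection, and it determines the clustering parameters automatically. CIS repeatedly uses the most recent sample clusters to derive new gene clusters, then re-clusters the samples using the new gene clusters as features, refining the result step by step. The final result is easy to interpret and avoids information loss. The method reduces the experimental error caused by users' lack of domain knowledge. CIS is applied to two real microarray data sets, and the experimental results confirm its effectiveness.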

4.
沈宁敏  李静  周培云  庄毅 《计算机科学》2015,42(Z6):453-458
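Clustering has become a leading-edge method for analyzing gene expression data; partitioning genes into classes makes it possible to find diseased cells fairly quickly and thus to diagnose disease. However, because the data are high-dimensional with small sample sizes, raw gene expression data contain a large amount of redundant and interfering information, and clustering them directly leads to long running times and low accuracy. Principal component analysis is a classic dimensionality reduction method that maps high-dimensional data to a low-dimensional space while preserving maximum variance. However, because the loadings are nonzero, the principal components are not strongly interpretable. A sparse principal component analysis method based on truncated power is proposed for feature extraction from gene expression data, combined with K-means to cluster the sparsely extracted feature gene data. Finally, experiments on three public gene data sets verify that the proposed feature extraction method improves the accuracy and efficiency of gene expression data clustering.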
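
To make the "truncated power" idea concrete, here is a minimal sketch of a truncated power iteration for a single sparse principal component. It assumes the generic scheme of multiplying by the covariance matrix, keeping only the `k` largest-magnitude entries, and renormalizing; the function name, the starting vector, and the fixed iteration count are illustrative choices, not details taken from the paper.

```python
from math import sqrt


def truncated_power_iteration(cov, k, iters=100):
    """Approximate a leading eigenvector of `cov` with at most k non-zeros.

    `cov` is a symmetric positive semi-definite matrix given as a list of
    lists. Each step multiplies by `cov`, zeroes all but the k entries of
    largest magnitude, and rescales to unit length.
    """
    n = len(cov)
    # Assumed initialisation: unit vector on the highest-variance coordinate.
    start = max(range(n), key=lambda i: cov[i][i])
    x = [0.0] * n
    x[start] = 1.0
    for _ in range(iters):
        y = [sum(cov[i][j] * x[j] for j in range(n)) for i in range(n)]
        keep = set(sorted(range(n), key=lambda i: abs(y[i]), reverse=True)[:k])
        y = [y[i] if i in keep else 0.0 for i in range(n)]
        norm = sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break
        x = [v / norm for v in y]
    return x
```

The non-zero positions of the returned vector indicate which genes contribute to the component, which is what makes the sparse component easier to interpret than a dense loading vector. The selected features could then be passed to K-means.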

5.
颜文胜 《计算机工程》2011,37(5):202-203,206
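Based on the characteristics of gene expression data, a visual clustering method for gene expression data based on a spring model is proposed. It maps gene expression data from a multidimensional space into two-dimensional space while largely preserving the spatiotemporal similarity among the original multidimensional data. Experimental results show that the method can discover the cluster structure hidden in gene expression data sets as well as co-expressed gene patterns.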

6.
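Experimental comparison of methods for estimating the number of clusters in cluster analysis   Cited by: 4 (self-citations: 0, citations by others: 4)
In exploratory cluster analysis of gene expression data, determining the number of clusters is a key factor in clustering quality. Many cluster validity indices and methods can be used with the PAM clustering algorithm. This paper discusses seven commonly used indices and methods suited to PAM and compares their performance experimentally on four gene expression data sets with different cluster structures. The results show that the system evolution method and the stability method estimate the number of clusters best, with accuracies of 100% and 90%, respectively.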

7.
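Clustering gene expression data is an important method for discovering gene function and establishing gene regulatory networks, and the application of computational intelligence in this field offers a new way to analyze large volumes of gene data. Based on the characteristics of gene expression data, this paper identifies the key problems in gene expression data clustering, discusses a basic framework for computational-intelligence-based clustering of gene expression data, reviews the current state of computational intelligence applications in gene data clustering, and finally points out future directions for computational intelligence methods in this field.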

8.
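As problem-scenario data grow rapidly in scale and become more complex in composition, accurately estimating the number of clusters and the cluster centers is a major challenge for clustering algorithms that process and analyze complex large-scale data. Accurate estimates of the number of clusters and their centers are critical for some parametric clustering algorithms, for measuring the overall complexity of a data set, and for simplified data representation. Building on an in-depth analysis of I-nice, this paper proposes a multi-observation-point I-nice clustering algorithm based on candidate center fusion. In the original multi-observation...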

9.
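In the life sciences, species and genes need to be classified in order to understand the inherent structure of populations. Data clustering methods can effectively identify patterns in gene expression data and classify them, grouping items with high feature similarity into one class and items with high dissimilarity into different classes. This is important for studying the structure and function of genes and the relationships between different kinds of genes. A graph-theoretic method is used to produce an initial clustering of gene expression data in molecular biology, which is then refined with another algorithm, such as the K-nearest-neighbor self-learning clustering algorithm or the center-point-based self-learning clustering algorithm. For certain clustering criteria, the method can produce globally optimal clusters. Finally, the algorithm is analyzed and discussed, and validated experimentally on simulated data.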

10.
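This paper proposes a two-layer clustering algorithm for gene expression data. Because gene expression data are large in volume and few genes have known functions, the algorithm divides clustering into two layers: a fast analysis layer and a precise clustering layer. The clustering results are evaluated using an information entropy method. Experimental results show that the method is very effective for clustering gene expression data.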

11.
In recent years, the problem of clustering in microarray data has been gaining significant attention. However, most clustering methods attempt to find groups of genes when the number of clusters is known a priori. This fact motivated us to develop a new real-coded improved differential evolution based automatic fuzzy clustering algorithm, which automatically evolves the number of clusters as well as the proper partitioning of a gene expression data set. To improve the result further, the clustering method is integrated with a support vector machine (SVM), a well-known technique for supervised learning. A fraction of the gene expression data points, selected from different clusters based on their proximity to the respective centers, is used for training the SVM. The clustering assignments of the remaining gene expression data points are thereafter determined using the trained classifier. The performance of the proposed clustering technique has been demonstrated on five gene expression data sets by comparing it with differential evolution based automatic fuzzy clustering, variable-length genetic algorithm based fuzzy clustering, and the well-known Fuzzy C-Means algorithm. A statistical significance test has been carried out to establish the statistical superiority of the proposed clustering approach. A biological significance test has also been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of genes. The processed data sets and the MATLAB version of the software are available at http://bio.icm.edu.pl/~darman/IDEAFC-SVM/.

12.
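Because traditional gene selection methods select many redundant genes and therefore yield low sample prediction accuracy, a gene selection algorithm based on clustering and particle swarm optimization is proposed. First, a clustering algorithm partitions the genes into a fixed number of clusters. Then an extreme learning machine is used as the classifier to evaluate the classification performance of the feature genes in each cluster, yielding a candidate gene pool. Finally, a wrapper method based on particle swarm optimization and the extreme learning machine selects from the pool the gene subset with the highest classification rate and the fewest genes. The selected genes have good classification performance. Experiments on two public microarray data sets show that, compared with some classic methods, the new method achieves higher classification performance with fewer genes.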

13.
Gene expression data generated by DNA microarray experiments provide a vast resource for medical diagnosis and disease understanding. Unfortunately, the large amount of data makes it hard, sometimes even impossible, to understand the correct behavior of genes. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is evaluated automatically from the data using the information entropy as a validity measure. In the second step, we select from each computed cluster the most representative genes and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) will be used to predict the function of new and previously unknown genes. Experimental results using real-world data sets reveal good performance and high prediction accuracy of our model.

14.
Clustering is a popular technique for analyzing microarray data sets with n genes and m experimental conditions. As explored by biologists, there is a real need to identify coregulated gene clusters, which include both positive and negative regulated gene clusters. The existing pattern-based and tendency-based clustering approaches cannot be applied directly to find such coregulated gene clusters, because they are designed for finding positive regulated gene clusters. In this paper, in order to cluster coregulated genes, we propose a coding scheme that allows us to place two genes in the same cluster if they have the same code, where two genes that have the same code can be either positive or negative regulated. Based on the coding scheme, we propose a new algorithm for finding maximal subspace coregulated gene clusters with new pruning techniques. A maximal subspace coregulated gene cluster groups a set of genes on a condition sequence such that the cluster is not included in any other subspace coregulated gene cluster. We conduct extensive experimental studies. Our approach can effectively and efficiently find maximal subspace coregulated gene clusters. In addition, our approach outperforms the existing approaches for finding positive regulated gene clusters.
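
The abstract does not spell out its coding scheme, so the sketch below is a purely hypothetical illustration of the general idea that positively and negatively regulated genes can share one code. It encodes each profile as a string of up/down/flat symbols between consecutive conditions and then canonicalizes the string against its mirror image. The names `trend_code`, `coregulation_code`, `group_by_code`, and the tolerance `tol` are all assumptions.

```python
def trend_code(profile, tol=0.0):
    """Encode a profile as 'u' (up), 'd' (down), or 'f' (flat) per step."""
    symbols = []
    for a, b in zip(profile, profile[1:]):
        diff = b - a
        symbols.append("u" if diff > tol else "d" if diff < -tol else "f")
    return "".join(symbols)


def coregulation_code(profile, tol=0.0):
    """Hypothetical canonical code shared by a trend and its mirror image.

    A gene that rises where another falls (and vice versa) receives the same
    code, because the lexicographically smaller of the trend string and its
    up/down-swapped version is returned.
    """
    code = trend_code(profile, tol)
    mirrored = code.translate(str.maketrans("ud", "du"))
    return min(code, mirrored)


def group_by_code(profiles, tol=0.0):
    """Group gene names (dict keys) by their canonical code."""
    groups = {}
    for name, profile in profiles.items():
        groups.setdefault(coregulation_code(profile, tol), []).append(name)
    return groups
```

For example, the profiles `[1, 2, 3, 2]` and `[5, 4, 3, 4]` move in opposite directions at every step and therefore receive the same code. This sketch works on the full condition sequence only; it does not search subspaces or apply the pruning techniques the abstract refers to.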

15.
Gene clustering is one of the most important problems in bioinformatics. In sequential data clustering, hidden Markov models (HMMs) have been widely used to find similarity between sequences, due to their capability of handling sequence patterns of various lengths. In this paper, a novel gene clustering scheme based on HMMs optimized by a particle swarm optimization algorithm is introduced. In this approach, each gene sequence is described by a specific HMM, and then, for each model, its probability of generating each individual sequence is evaluated. A hierarchical clustering algorithm based on a new definition of a distance measure is applied to find the best clusters. Experiments carried out on a data set of lung cancer-related genes show that the proposed approach can be successfully utilized for gene clustering.

16.
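A clustering algorithm for gene expression data using minimum spanning trees   Cited by: 6 (self-citations: 0, citations by others: 6)
In biological research, plants, animals, and genes need to be classified in order to understand the inherent structure of populations. Using cluster analysis to effectively identify patterns in gene expression data and group them into classes of similar objects is important for studying the structure and function of genes and the relationships between different kinds of genes. This paper introduces minimum spanning tree theory from graph theory into cluster analysis of gene expression data in molecular biology, designs a representation of the spanning tree and a clustering algorithm based on the minimum spanning tree, proves that the method can produce globally optimal clusters for some criterion functions, and discusses and evaluates the algorithm based on experimental results.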
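
As a rough illustration of minimum-spanning-tree clustering in general (not the specific tree representation or criterion functions of the paper), the sketch below builds the tree with Kruskal's algorithm and stops once `k` components remain, which is equivalent to deleting the `k - 1` longest tree edges. The function name `mst_clusters` and the use of Euclidean distance are assumptions.

```python
from math import dist


def mst_clusters(points, k):
    """Split points into k clusters by cutting the longest MST edges.

    Runs Kruskal's algorithm over all pairwise Euclidean distances and stops
    merging when k connected components are left. Returns one integer label
    per point, numbered in order of first appearance.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Calling `mst_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)], 2)` separates the three points near the origin from the two points near (10, 10). Sorting all pairs costs O(n² log n), which is acceptable for a sketch but would need a sparser graph for large gene sets.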

17.
An interactive approach to mining gene expression data   Cited by: 1 (self-citations: 0, citations by others: 1)
Effective identification of coexpressed genes and coherent patterns in gene expression data is an important task in bioinformatics research and biomedical applications. Several clustering methods have recently been proposed to identify coexpressed genes that share similar coherent patterns. However, there is no objective standard for groups of coexpressed genes. The interpretation of coexpression heavily depends on domain knowledge. Furthermore, groups of coexpressed genes in gene expression data are often highly connected through a large number of "intermediate" genes. There may be no clear boundaries to separate clusters. Clustering gene expression data also faces the challenges of satisfying biological domain requirements and addressing the high connectivity of the data sets. In this paper, we propose an interactive framework for exploring coherent patterns in gene expression data. A novel coherent pattern index is proposed to give users highly confident indications of the existence of coherent patterns. To derive a coherent pattern index and facilitate clustering, we devise an attraction tree structure that summarizes the coherence information among genes in the data set. We present efficient and scalable algorithms for constructing attraction trees and coherent pattern indices from gene expression data sets. Our experimental results show that our approach is effective in mining gene expression data and is scalable for mining large data sets.

18.
Based on molecular kinetic theory, a molecular dynamics-like data clustering approach is proposed in this paper. Clusters are extracted after data points fuse in the iterating space through a dynamical mechanism similar to the way molecules interact through molecular forces. The approach aims to find possible natural clusters without pre-specifying the number of clusters. In the experiments, it found better clusters than three other clustering methods (trimmed k-means, the JP algorithm, and another gravitational-model-based method).

19.
Recently, Fourier Transform Infrared (FTIR) spectroscopic imaging has been used as a tool to detect changes in cellular composition that may reflect the onset of a disease. This approach has been investigated as a means of monitoring changes in the biochemical composition of cells and providing a diagnostic tool for various human cancers and other diseases. The discrimination between different types of tissue based upon spectroscopic data is often achieved using various multivariate clustering techniques. However, the number of clusters is commonly unknown for clustering methods such as hierarchical cluster analysis, k-means, and fuzzy c-means (FCM). In this study, we apply an FCM-based clustering algorithm to obtain the best number of clusters, as given by the minimum validity index value. This often results in an excessive number of clusters being created due to the complexity of this biochemical system. A novel method to automatically merge clusters was developed to address this problem. Three lymph node tissue sections were examined to evaluate our new method. The results showed that this approach can merge clusters that have similar biochemistry. Consequently, the overall algorithm automatically identifies clusters that accurately match the main tissue types that are independently determined by the clinician.
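
For readers unfamiliar with fuzzy c-means, the following minimal sketch shows only the standard membership update for fixed cluster centers. It does not implement the validity index or the cluster-merging method described in the abstract; the function name `fcm_memberships` and the default fuzzifier `m=2.0` are assumptions.

```python
from math import dist


def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for fixed centres.

    For each point, the membership in cluster i is
    1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance to
    centre i and m > 1 is the fuzzifier. Each row sums to 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if 0.0 in d:
            # The point coincides with one or more centres: share full
            # membership among them to avoid dividing by zero.
            hits = [1.0 if v == 0.0 else 0.0 for v in d]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in d) for d_i in d]
        )
    return memberships
```

With centers at (0, 0) and (4, 0) and `m=2.0`, the point (1, 0) receives memberships of about 0.9 and 0.1. A full FCM run alternates this update with recomputing the centers as membership-weighted means.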

20.
Bagging for path-based clustering   Cited by: 3 (self-citations: 0, citations by others: 3)
A resampling scheme for clustering, similar to bootstrap aggregation (bagging), is presented. Bagging is used to improve the quality of path-based clustering, a data clustering method that can extract elongated structures from data in a noise-robust way. The results of an agglomerative optimization method are influenced by small fluctuations of the input data. To increase the reliability of clustering solutions, a stochastic resampling method is developed to infer consensus clusters. A related reliability measure allows us to estimate the number of clusters, based on the stability of an optimized cluster solution under resampling. The quality of path-based clustering with resampling is evaluated on a large image data set of human segmentations.
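
The path-based dissimilarity that underlies this kind of clustering is commonly defined as the smallest possible "largest hop" over all paths connecting two points. The sketch below computes it with a Floyd-Warshall-style update. It covers only that dissimilarity, not the agglomerative optimization or the resampling scheme of the paper, and the function name `path_based_distances` is an assumption.

```python
from math import dist


def path_based_distances(points):
    """Minimax path distance between every pair of points.

    d[i][j] is the minimum, over all paths from i to j through the data
    points, of the longest single hop on the path. Two points joined by a
    chain of close neighbours are therefore near each other even when their
    direct Euclidean distance is large.
    """
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                # Best path through m costs the larger of its two halves.
                via = max(d[i][m], d[m][j])
                if via < d[i][j]:
                    d[i][j] = via
    return d
```

For points at x = 0, 1, 2, 3, and 10 on a line, the path-based distance from the first to the fourth point is 1.0 (every hop along the chain has length 1), while the distance from the first to the last is 7.0, the size of the gap that any path must cross. This is why the measure can follow elongated structures.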
