Similar literature
1.
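Research progress in cluster analysis of gene expression data   Total citations: 4 (self-citations: 1, citations by others: 3)
The explosive growth of gene expression data creates an urgent need for automatic, effective data analysis tools. Cluster analysis has become a powerful tool for extracting biological information from gene expression data. To mine such data more effectively, many improved versions of traditional clustering algorithms, as well as new clustering algorithms, have been proposed in recent years. This paper first briefly introduces how gene expression data are acquired and represented, and then systematically reviews the clustering algorithms applied to gene expression data analysis in recent years. According to the clustering objective, the algorithms are divided into gene-based clustering, sample-based clustering, and two-way clustering. For each category, the biological meaning and the main difficulties are described, and the basic principles, advantages, and disadvantages of the individual algorithms are discussed in detail. Finally, current cluster analysis methods for gene expression data are summarized and future trends are discussed.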

2.
cDNA microarrays permit massively parallel gene expression analysis and have spawned a new paradigm in the study of molecular biology. One of the significant challenges in this genomic revolution is to develop sophisticated approaches to facilitate the visualization, analysis, and interpretation of the vast amounts of multi-dimensional gene expression data. We have applied the self-organizing map (SOM) to meet these challenges. In essence, we use the U-matrix and component planes for microarray data visualization and introduce a general procedure for assessing the significance of a cluster detected from the U-matrix. Our case studies consist of two data sets. First, we have analyzed a data set containing 13,824 genes in 14 breast cancer cell lines. In the second case, we show an example of the SOM applied to drug treatment of prostate cancer cells. Our results indicate that (1) the SOM is capable of helping to find certain biologically meaningful clusters, (2) clustering algorithms could be used for finding a set of potential predictor genes for classification purposes, and (3) comparison and visualization of the effects of different drugs is straightforward with the SOM. In summary, the SOM provides an excellent format for visualization and analysis of gene microarray data, and is likely to facilitate extraction of biologically and medically useful information.

3.
Over the last several years, many clustering algorithms have been applied to gene expression data. However, most clustering algorithms force the user into having one set of clusters, resulting in a restrictive biological interpretation of gene function. It would be difficult to interpret the complex biological regulatory mechanisms and genetic interactions from this restrictive interpretation of microarray expression data. The software package SignatureClust allows users to select a group of functionally related genes (called ‘Landmark Genes’), and to project the gene expression data onto these genes. Compared to existing algorithms and software in this domain, our software package offers two unique benefits. First, by selecting different sets of landmark genes, it enables the user to cluster the microarray data from multiple biological perspectives. This encourages data exploration and discovery of new gene associations. Second, most packages associated with clustering provide internal validation measures, whereas our package validates the biological significance of the new clusters by retrieving significant ontology and pathway terms associated with the new clusters. SignatureClust is a free software tool that enables biologists to get multiple views of the microarray data. It highlights new gene associations that were not found using a traditional clustering algorithm. The software package ‘SignatureClust’ and the user manual can be downloaded from .

4.
Recent advances in microarray technology permit monitoring of the expression levels of a large set of genes across a number of time points simultaneously. Extracting knowledge from such a huge volume of microarray gene expression data requires computational analysis. Clustering is one of the important data mining tools for analyzing such microarray data to group similar genes into clusters. Researchers have proposed a number of clustering algorithms for this purpose. In this article, an attempt has been made to improve the performance of fuzzy clustering by combining it with a support vector machine (SVM) classifier. A recently proposed real-coded variable string length genetic algorithm based clustering technique and an iterated version of fuzzy C-means clustering have been utilized for this purpose. The performance of the proposed clustering scheme has been compared with that of some well-known existing clustering algorithms and their SVM-boosted versions for one simulated and six real-life gene expression data sets. A statistical significance test based on analysis of variance (ANOVA), followed by an a posteriori Tukey-Kramer multiple comparison test, has been conducted to establish the statistical significance of the superior performance of the proposed clustering scheme. Moreover, the biological significance of the clustering solutions has been established.

5.
Microarray technology has made it possible to monitor the expression levels of many genes simultaneously across a number of experimental conditions. Fuzzy clustering is an important tool for analyzing microarray gene expression data. In this article, a real-coded Simulated Annealing (VSA) based fuzzy clustering method with variable length configuration is developed and combined with a popular Artificial Neural Network (ANN) based classifier. The idea is to refine the clustering produced by VSA using an ANN classifier to obtain improved clustering performance. The proposed technique is used to cluster three publicly available real-life microarray data sets. The superior performance of the proposed technique has been demonstrated by comparing it with some widely used existing clustering algorithms. A statistical significance test has also been conducted to establish the statistical significance of the superior performance of the proposed clustering algorithm. Finally, the biological relevance of the clustering solutions is established.

6.
Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on this encoding scheme, it makes use of a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts is able to distinguish how relevant a feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.

7.
Clustering analysis of temporal gene expression data is widely used to study dynamic biological systems, for example to identify sets of genes that are regulated by the same mechanism. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which pose challenges for traditional clustering methods in extracting meaningful information. In this paper, we introduce an improved clustering approach based on regularized spline regression and an energy-based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion, for which the spline coefficients are estimated by a regularized least squares scheme on the observed data. To compensate for the inadequate information in noisy and short gene expression data, we use the correlated genes of each gene as the test set to choose the optimal number of basis functions and the regularization parameter. We show that this treatment can help to avoid over-fitting. After fitting the continuous representations of gene expression profiles, we use an energy-based similarity measure for clustering. The energy-based measure can include the temporal information and relative changes of the time series by using the first and second derivatives of the time series. We demonstrate that our method is robust to noise and can produce meaningful clustering results.

8.
Cluster analysis of DNA microarray data is an important but difficult task in knowledge discovery processes. Many clustering methods are applied to the analysis of gene expression data, but none of them fully addresses the challenges that this technology raises. Because of this, many applications have been developed for visually representing clustering algorithm results on DNA microarray data, usually providing dendrogram and heat map visualizations. Most of these applications focus only on these visualizations and do not offer further visualization components to validate the clustering methods or to validate them against one another. This paper proposes using a visual analytics framework in cluster analysis of gene expression data. Additionally, it presents a new method for finding cluster boundaries based on properties of metric spaces. Our approach presents a set of visualization components able to interact with each other, namely parallel coordinates, cluster boundary genes, 3D cluster surfaces, and DNA microarray visualizations as heat maps. Experimental results have shown that our framework can be very useful in the process of more fully understanding DNA microarray data. The software has been implemented in Java, and the framework is publicly available at http://www.analiticavisual.com/jcastellanos/3DVisualCluster/3D-VisualCluster.

9.
This paper proposes a new hierarchical clustering method using genetic algorithms for the analysis of gene expression data. This method is based on the mathematical proof of several results, showing its effectiveness with regard to other clustering methods. Genetic algorithms applied to cluster analysis have yielded good results on biological data, and many studies have been carried out in this direction, although most of them focus on partitional clustering methods. Although a few studies attempt to use genetic algorithms for building hierarchical clusterings, they do not include constraints that reduce the complexity of the problem. These approaches therefore become intractable for large data sets. On the other hand, deterministic hierarchical clustering methods generally face the problem of convergence towards local optima due to their greedy strategy. The method introduced here is an alternative that addresses some of the problems existing methods face. The results of the experiments have shown that our approach can be very effective in cluster analysis of DNA microarray data.

10.
An interactive approach to mining gene expression data   Total citations: 1 (self-citations: 0, citations by others: 1)
Effective identification of coexpressed genes and coherent patterns in gene expression data is an important task in bioinformatics research and biomedical applications. Several clustering methods have recently been proposed to identify coexpressed genes that share similar coherent patterns. However, there is no objective standard for groups of coexpressed genes. The interpretation of co-expression heavily depends on domain knowledge. Furthermore, groups of coexpressed genes in gene expression data are often highly connected through a large number of "intermediate" genes. There may be no clear boundaries to separate clusters. Clustering gene expression data also faces the challenges of satisfying biological domain requirements and addressing the high connectivity of the data sets. In this paper, we propose an interactive framework for exploring coherent patterns in gene expression data. A novel coherent pattern index is proposed to give users highly confident indications of the existence of coherent patterns. To derive a coherent pattern index and facilitate clustering, we devise an attraction tree structure that summarizes the coherence information among genes in the data set. We present efficient and scalable algorithms for constructing attraction trees and coherent pattern indices from gene expression data sets. Our experimental results show that our approach is effective in mining gene expression data and is scalable for mining large data sets.

11.
Microarrays have transformed biotechnological research in the past decade. Deciphering the hidden patterns in gene expression data offers a great opportunity to strengthen the understanding of functional genomics. The complexity of biological networks, together with the large number of genes, also increases the challenge of comprehending and interpreting the resulting mass of data. Clustering, which is essential in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data, addresses these challenges. The clustering of gene expression data has proven useful for revealing the natural structure inherent in the data and for understanding gene functions, cellular processes, and molecular functions. Clustering techniques are used to examine gene expression data and extract groups of genes from the tested samples based on a similarity criterion. Subspace clustering broadens traditional clustering by extracting groups of genes that are highly correlated in different subspaces of the dataset. Mining temporal patterns in high-dimensional data is computationally demanding, and thus normalization is needed. In this work, normalization using fuzzy logic is applied to the data before clustering. Multi-objective cuckoo search optimization is implemented to extract co-expressed genes over different subspaces. The proposed methods are applied to real-life temporal gene expression datasets, where they group the genes responsible for the disease into the same cluster. The experimental results show that applying fuzzy normalization to the dataset improves the clustering.

12.
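Xie Juanying, Ding Lijuan, Wang Mingzhao. Journal of Software (《软件学报》), 2020, 31(4): 1009-1024
Gene expression data are high-dimensional with small sample sizes and contain a large number of genes unrelated to disease, so the first step in analyzing such data is feature selection. Common feature selection methods require labeled data, but sample labels are often difficult to obtain. To address feature selection for gene expression data, an unsupervised feature selection approach based on spectral clustering, FSSC (feature selection by spectral clustering), is proposed. FSSC applies spectral clustering to all features so that highly similar features are grouped into the same cluster, defines the discernibility and the independence of each feature, measures feature importance as the product of the two, and selects representative features from each feature cluster to construct a feature subset. Depending on the spectral clustering algorithm used, three unsupervised feature selection algorithms are obtained: FSSC-SD (FSSC based on standard deviation), FSSC-MD (FSSC based on mean distance), and FSSC-ST (FSSC based on self-tuning). Experiments were conducted on 10 gene expression data sets using SVMs (support vector machines) and KNN (K-nearest neighbours) as classifiers. The results show that FSSC-SD, FSSC-MD, and FSSC-ST can all select feature subsets with strong classification ability.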

13.
Microarray technologies have become a widespread research technique that lets biomedical researchers assess tens of thousands of gene expression values simultaneously in a single experiment. Microarray data analysis for biological discovery requires computational tools. In this research, a novel two-dimensional hierarchical clustering technique is presented. A review of previous work shows that the clustering methods applied to gene expression data assign each gene to only one cluster, which leads to biological complexity. This is mainly because of the nature of proteins and their interactions. Since proteins normally interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are expected to co-express with more than one group of genes. This implies that, in microarray gene expression data, a gene may be present in more than one cluster. In this research, multi-level microarray clustering, performed in two dimensions by the proposed two-dimensional hierarchical clustering technique, is used to represent the presence of genes in one or more clusters, consistent with the nature of each gene and its attributes, and to prevent biological complexities.

14.
Unsupervised clustering methods such as K-means, hierarchical clustering and fuzzy c-means have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Recent studies have suggested that the incorporation of biological information into validation methods to assess the quality of clustering results might be useful in facilitating biological and biomedical knowledge discoveries. In this study, we generalize two bio-validity indices, the biological homogeneity index and the biological stability index, to quantify the abilities of soft clustering algorithms such as fuzzy c-means and model-based clustering. The results of an evaluation of several existing soft clustering algorithms using simulated and real data sets indicate that the soft versions of the indices provide both better precision and better accuracy than the classical ones. The significance of the proposed indices is also discussed.
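
For readers unfamiliar with what "soft" clustering means here, the following is a minimal sketch of the standard fuzzy c-means membership update, in which every gene receives a degree of membership in every cluster instead of a single label. It is not the bio-validity indices from the abstract. The function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the toy coordinates are assumptions made for illustration.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update (hypothetical helper).

    Returns u where u[i][k] is the membership of point i in cluster k;
    each row sums to 1. `m` is the fuzzifier and must be greater than 1.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point sits exactly on one or more centers:
            # share full membership among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))
        row = [1.0 / sum((dk / dj) ** power for dj in dists) for dk in dists]
        memberships.append(row)
    return memberships


# Toy example: one expression profile and two cluster centers (made-up values).
u = fcm_memberships([(0.0, 0.0)], [(1.0, 0.0), (3.0, 0.0)])
print(u)
```

A soft validity index would then weight each gene's contribution by these memberships instead of by a hard 0/1 assignment; how the cited study does this is not stated in the abstract.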

15.
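The gene chip is a typical representative of microarray technology; it offers high throughput and the ability to measure the expression levels of all genes in a genome simultaneously. One main purpose of applying microarray chips is the discovery of gene expression patterns, that is, finding at the genome level clusters of genes with similar functions and related biological processes, or classifying samples to discover their various subtypes, for example classifying cancer samples by gene expression level to discover the molecular subtypes of a disease. Non-negative matrix factorization (NMF) is an unsupervised, non-orthogonal matrix factorization method based on local (parts-based) representation. In recent years this method has been increasingly applied to classification analysis and cluster discovery in microarray data. This paper systematically introduces the principles, algorithms, and applications of NMF, the biological interpretation of the factorization results, the quality assessment of classification results, and NMF-based classification software. It also summarizes and evaluates the performance of NMF in classification and cluster discovery for microarray data.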
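
As an illustration of the factorization this review discusses, here is a minimal sketch of NMF using the widely known Lee-Seung multiplicative updates for the squared-error objective, written with plain Python lists so it runs without third-party packages. The review does not say which update rule it covers, so the choice of rule, the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`), and the toy matrix are all assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (genes x samples) as W (genes x rank) times H (rank x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


# Toy matrix with two blocks of co-expressed genes (made-up values).
V = [[5, 5, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]
W, H = nmf(V, rank=2)
# One common convention: assign sample j to the factor with the largest H[k][j].
labels = [max(range(2), key=lambda k: H[k][j]) for j in range(len(V[0]))]
print(labels, frobenius_error(V, W, H))
```

Because both factors stay non-negative, each column of W can be read as a group of genes that are expressed together, which is the "local representation" property the review refers to.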

16.
In a DNA microarray dataset, gene expression data often have a huge number of features (referred to as genes) but only a small number of samples. With the development of DNA microarray technology, the number of dimensions increases even faster than before, which can lead to the curse of dimensionality. To get good classification performance, it is necessary to preprocess the gene expression data. Support vector machine recursive feature elimination (SVM-RFE) is a classical method for gene selection. However, SVM-RFE suffers from high computational complexity. To remedy this, this paper enhances SVM-RFE for gene selection by incorporating feature clustering; the resulting method is called feature clustering SVM-RFE (FCSVM-RFE). The proposed method first performs gene selection roughly and then ranks the selected genes. First, a clustering algorithm is used to cluster genes into gene groups, in each of which the genes have similar expression profiles. Then, a representative gene is chosen to represent each gene group, which yields a representative gene set. Finally, SVM-RFE is applied to rank these representative genes. FCSVM-RFE can reduce the computational complexity and the redundancy among genes. Experiments on seven public gene expression datasets show that FCSVM-RFE can achieve better classification performance and lower computational complexity than state-of-the-art methods such as SVM-RFE.

17.
We develop an approach to analyze time-course microarray data that are obtained from a single sample at multiple time points and to identify which genes are cell-cycle regulated. Since some genes have similar gene expression patterns, to reduce the amount of hypothesis testing, we first perform a clustering analysis to group genes into classes with similar cell-cycle patterns, including a class with no cell-cycle phenomena at all. Then we build a statistical model and an inference function assuming that genes within a cluster share the same mean model. A varying-coefficient nonparametric approach is employed for greater flexibility in fitting the time-course data. In order to incorporate the correlation of longitudinal measurements, the quadratic inference function method is applied to obtain more efficient estimators and more powerful tests. Furthermore, this method allows us to perform chi-squared tests to determine whether certain genes are cell-cycle regulated. A data example on cell-cycle microarray data and simulations are presented.

18.
Microarray technology has been widely applied to measure gene expression levels for thousands of genes simultaneously. With this technology, gene cluster analysis is useful for discovering the functions of genes because co-expressed genes are likely to share the same biological function. Many clustering algorithms have been used in the field of gene clustering. This paper proposes a new scheme for clustering gene expression datasets based on a modified version of the Quantum-behaved Particle Swarm Optimization (QPSO) algorithm, known as the Multi-Elitist QPSO (MEQPSO) model. The proposed clustering method also employs a one-step K-means operator to accelerate the convergence of the algorithm. The MEQPSO algorithm is tested and compared with some other recently proposed PSO and QPSO variants on a suite of benchmark functions. Based on the computer simulations, some empirical guidelines are provided for selecting suitable parameters for MEQPSO clustering. The performance of the MEQPSO clustering algorithm has been extensively compared with several optimization-based algorithms and classical clustering algorithms on several artificial and real gene expression datasets. Our results indicate that the MEQPSO clustering algorithm is a promising technique and can be widely used for gene clustering.
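
The "one-step K-means operator" named in this abstract is not specified further, so the following is only a sketch of what a single K-means pass usually looks like: assign every point to its nearest centroid, then move each centroid to the mean of its members. The function name `kmeans_step`, the rule for empty clusters, and the assumption that such a pass would be applied to the centroids encoded by one particle are illustrative guesses, not details from the paper.

```python
import math


def kmeans_step(points, centroids):
    """One assignment-and-update pass of K-means (hypothetical helper).

    Clusters that receive no points keep their previous centroid.
    """
    k = len(centroids)
    groups = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        groups[nearest].append(p)
    updated = []
    for old, members in zip(centroids, groups):
        if members:
            # Column-wise mean of the member points.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(tuple(old))
    return updated


# Toy data: two well-separated groups of two-dimensional profiles (made-up values).
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(kmeans_step(points, [(1, 1), (9, 1)]))
```

In a hybrid scheme of this kind, a pass like this would presumably pull each candidate solution toward a nearby local optimum before the swarm update continues.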

19.
In the rapidly evolving field of genomics, many clustering and classification methods have been developed and employed to explore patterns in gene expression data. Biologists face the choice of which clustering algorithm(s) to use and how to interpret different results from various clustering algorithms. No clear objective criteria have been developed to assess the agreement and compare the results from different clustering methods. We describe two generally applicable objective measures to quantify agreement between different clustering methods. These two measures are referred to as the local agreement measure, which is defined for each gene/subject, and the global agreement measure, which is defined for the whole gene expression experiment. The agreement measures are based on a probabilistic weighting scheme applied to the number of concordant and discordant pairs from two clustering methods. In the comparison and assessment process, newly developed concepts are implemented under the framework of reliability of a cluster. The algorithms are illustrated by simulations and then applied to a yeast sporulation gene expression microarray data set. Analysis of the sporulation data identified ∼5% (23 of 477) of genes which were not consistently clustered using a neural net algorithm and K-means or pam. The two agreement measures provide objective criteria to conclude whether or not two clustering methods agree with each other. Using the local agreement measure, genes of unknown function which cluster consistently can more confidently be assigned functions based on co-regulation.
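
To make the idea of concordant and discordant pairs concrete, here is a minimal sketch that counts them without any weighting. A pair of genes is concordant when both clusterings agree on whether the two genes belong together. The global function below is the plain Rand index and the local one is its per-gene analogue; the probabilistic weighting scheme described in the abstract is not reproduced, and the names `global_agreement` and `local_agreement` are assumptions.

```python
from itertools import combinations


def _concordant(labels_a, labels_b, i, j):
    """True when both clusterings agree on whether items i and j share a cluster."""
    return (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])


def global_agreement(labels_a, labels_b):
    """Unweighted fraction of concordant pairs over all items (the Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    return sum(_concordant(labels_a, labels_b, i, j) for i, j in pairs) / len(pairs)


def local_agreement(labels_a, labels_b, i):
    """Unweighted fraction of concordant pairs among the pairs that involve item i."""
    others = [j for j in range(len(labels_a)) if j != i]
    if not others:
        return 1.0
    return sum(_concordant(labels_a, labels_b, i, j) for j in others) / len(others)


# Toy example: two clusterings of four genes (made-up labels).
a = [0, 0, 1, 1]
b = [0, 1, 1, 1]
print(global_agreement(a, b))                        # 0.5
print([local_agreement(a, b, i) for i in range(4)])
```

Cluster label values themselves do not matter, only which genes are grouped together, so two clusterings that differ only by a renaming of the clusters score 1.0.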

20.
The identification of coexpressed genes from microarray data is a challenging problem in bioinformatics and computational biology. The objective of this study is to obtain knowledge about the most important genes and clusters related to production outputs of real-world time-series microarray data in the industrial microbiology area. Each sample in the microarray data experiment is complemented with the measurement of the corresponding production and growth values. A novel aspect of this research is that it considers the relation of coexpression patterns to the measured outputs to guide the biological interpretation of results. Shape-based clustering models are developed using the pattern of gene expression values over time and further incorporating knowledge about the correlation between the change in the gene expression level and the output value. Experiments are performed on time-series microarray data from bacteria, and an analysis from a biological perspective is carried out. The obtained results confirm the existence of relationships between output variables and gene expressions. Moreover, the shape-based clustering methods show promising results, being able to guide metabolic engineering actions with the identification of potential targets.
