Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
In a recent review [R. Giancarlo, D. Scaturro, F. Utro, Textual data compression in computational biology: a synopsis, Bioinformatics 25 (2009) 1575–1586], the first systematic organization and presentation of the impact of textual data compression on the analysis of biological data was given. Its main focus was a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used, together with a technical presentation of how well-known notions from information theory have been adapted to work successfully on biological data. Rather surprisingly, the use of data compression is pervasive in computational biology. Building on that review, this companion review focuses on the computational methods involved in the use of data compression in computational biology. Indeed, although one would expect ad hoc adaptation of compression techniques to work on biological data, unifying and homogeneous algorithmic approaches are emerging. Moreover, given that experiments based on parallel sequencing are the future of biological research, data compression techniques are among a handful of candidates that seem able to deal successfully with the deluge of sequence data they produce, although so far only in terms of storage and indexing, with analysis still a challenge. Therefore, the two reviews, which complement each other, are intended as a useful starting point for computer scientists to get acquainted with many of the computational challenges arising in computational biology, where core ideas of the information sciences are already having a substantial impact.

2.
K-means is one of the most widely used clustering algorithms in various disciplines, especially for large datasets. However, the method is known to be highly sensitive to the initial seed selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems, though, such as classifying georeferenced data for mapping applications, standardization of the clustering methodology, specifically the ability to arrive at the same cluster assignment on every run (i.e., replicability), may be of greater significance than any perceived measure of accuracy, especially when the solution is known to be non-unique, as in the case of k-means clustering. Here we propose a simple initial seed selection algorithm for k-means clustering along one attribute that draws initial cluster boundaries along the "deepest valleys", or greatest gaps, in the dataset. It thus incorporates a measure that maximizes the distance between consecutive cluster centers, which augments the conventional k-means optimization for minimum distance between a cluster center and its cluster members. Unlike existing initialization methods, it introduces no additional parameters or degrees of freedom to the clustering algorithm. This improves the replicability of cluster assignments by as much as 100% over k-means and k-means++, virtually reducing the variance over different runs to zero. Further, the proposed method is more computationally efficient than k-means++ and, in some cases, more accurate.
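The abstract does not spell out the seeding procedure, so the following is only a minimal sketch of one way to read "draw initial cluster boundaries along the greatest gaps" for a single attribute. The function name `gap_seeds`, the use of segment means as seeds, and the leftmost-first tie-breaking rule are all assumptions made for illustration, not the authors' algorithm.

```python
def gap_seeds(values, k):
    """Return k deterministic 1-D seeds by cutting the sorted data at its k-1 widest gaps.

    Hypothetical helper: the seed of each segment is taken to be its mean (an assumption).
    """
    xs = sorted(values)
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and the number of values")
    # Gap i lies between xs[i] and xs[i + 1].
    gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
    # Pick the k-1 widest gaps; ties go to the leftmost gap so every run agrees.
    widest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: k - 1]
    cuts = sorted(i for _, i in widest)
    # Split the sorted values at the cuts and seed each cluster with its segment mean.
    seeds, start = [], 0
    for cut in cuts + [len(xs) - 1]:
        segment = xs[start : cut + 1]
        seeds.append(sum(segment) / len(segment))
        start = cut + 1
    return seeds


print(gap_seeds([1, 2, 3, 10, 11, 12, 30, 31], 3))
```

Because the procedure contains no random draw, repeated calls on the same data return the same seeds, which is the replicability property the abstract emphasizes.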

3.
Information Fusion, 2009, 10(3): 233-241
The analysis of non-coding DNA regulatory regions is one of the most challenging open problems in computational biology. In this paper we investigate whether we can predict functional information about genes by using information extracted from their sequences together with expression data. We formalize this problem as a classification problem, and we apply Support Vector Machines (SVMs) with non-linear kernels to predict classes of co-expressed genes obtained from clustering procedures. SVMs are trained using information about selected motifs extracted from DNA regulatory regions through combinatorial and statistical methods. In our experiments, we show that functional classes of genes can be predicted from biological sequence data in Saccharomyces cerevisiae, achieving results competitive with those recently presented in the literature.

4.
Pairwise clustering methods have shown great promise for many real-world applications. However, the computational demands of these methods make them impractical for use with large data sets. The contribution of this paper is a simple but efficient method, called eSPEC, that makes clustering feasible for problems involving large data sets. Our solution adopts a "sampling, clustering plus extension" strategy. The methodology starts by selecting a small number of representative samples from the relational pairwise data using a selective sampling scheme; the chosen samples are then grouped using a pairwise clustering algorithm combined with local scaling; finally, the label assignments of the remaining instances are extended as a classification problem in a low-dimensional space, which is explicitly learned from the labeled samples using a cluster-preserving graph embedding technique. Extensive experimental results on several synthetic and real-world data sets demonstrate both that approximately clustering large data sets is feasible and that our method accelerates clustering of loadable data sets.

5.
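Research and Progress on High-Dimensional Data Mining Algorithms   Total citations: 1, self-citations: 1, citations by others: 1
The rapid development of bioinformatics and e-commerce applications has accumulated large amounts of high-dimensional data, making the mining of such data increasingly important. General data mining methods run into the curse of dimensionality when processing high-dimensional data, and traditional similarity measures also become meaningless in high-dimensional spaces. This article surveys the state of the latest high-dimensional data mining algorithms from three perspectives (frequent itemset mining, clustering, and classification) and examines how these algorithms address the problems of high-dimensional data mining.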

6.
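Clustering gene expression data is an important method for discovering gene function and establishing gene regulatory networks, and the application of computational intelligence in this field offers a new way to analyze large volumes of gene data. Based on the characteristics of gene expression data, this paper identifies the key problems in gene expression data clustering, discusses a basic framework for gene expression data clustering based on computational intelligence, surveys the current state of computational intelligence applications in gene data clustering, and finally points out future directions for computational intelligence methods in this field.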

7.
Microarray technologies have become a widespread research technique that lets biomedical researchers assess tens of thousands of gene expression values simultaneously in a single experiment. Microarray data analysis for biological discovery requires computational tools. In this research, a novel two-dimensional hierarchical clustering technique is presented. A review of previous work shows that the clustering methods applied to gene expression data assign each gene to only one cluster, which leads to biological complexity. This is mainly because of the nature of proteins and their interactions. Since proteins normally interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are expected to co-express with more than one group of genes. This implies that, in microarray gene expression data, a gene may be present in more than one cluster. In this research, multi-level microarray clustering, performed in two dimensions by the proposed two-dimensional hierarchical clustering technique, can represent the presence of a gene in one or more clusters, consistent with the nature of the gene and its attributes, and so prevents biological complexities.

8.
Microarrays have transformed biotechnological research in the past decade. Deciphering the hidden patterns in gene expression data offers a major opportunity to strengthen the understanding of functional genomics. The complexity of biological networks, together with the large number of genes, increases the challenge of comprehending and interpreting the resulting mass of data. Clustering, an essential step in the data mining process, addresses these challenges by revealing natural structures and identifying interesting patterns in the underlying data. The clustering of gene expression data has proven useful for uncovering the natural structure inherent in the data and for understanding gene functions, cellular processes, and molecular functions. Clustering techniques are used to examine gene expression data and extract groups of genes from the tested samples based on a similarity criterion. Subspace clustering broadens traditional clustering by extracting groups of genes that are highly correlated in different subspaces within the dataset. Mining temporal patterns in high-dimensional data is computationally demanding, and thus normalization is needed. In this work, normalization using fuzzy logic is applied to the data before clustering. Multi-objective cuckoo search optimization is implemented to extract co-expressed genes over different subspaces. The proposed methods are applied to real-life temporal gene expression datasets, in which they group the genes responsible for the disease in the same cluster. The experimental results show that fuzzy normalization of the dataset improves the clustering.

9.
One of the central problems in information retrieval, data mining, computational biology, statistical analysis, computer vision, geographic analysis, pattern recognition, and distributed protocols is the question of classification of data according to some clustering rule. Often the data is noisy, and even approximate classification is of extreme importance. The difficulty of such classification stems from the fact that the data usually has many incomparable attributes, which often leads to clustering problems in high-dimensional spaces. Since they require measuring the distance between every pair of data points, standard algorithms for computing the exact clustering solutions use quadratic or "nearly quadratic" running time, i.e., $O(d n^{2-\alpha(d)})$ time, where $n$ is the number of data points, $d$ is the dimension of the space, and $\alpha(d)$ approaches 0 as $d$ grows. In this paper, we show (for three fairly natural clustering rules) that computing an approximate solution can be done much more efficiently. More specifically, for agglomerative clustering (used, for example, in the Alta Vista search engine), for the clustering defined by sparse partitions, and for a clustering based on minimum spanning trees, we derive randomized $(1 + \epsilon)$-approximation algorithms with running times $\tilde{O}(d^2 n^{2-\gamma})$, where $\gamma > 0$ depends only on the approximation parameter $\epsilon$ and is independent of the dimension $d$.
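To make the quadratic cost of the exact approach concrete, here is a minimal sketch of one common minimum-spanning-tree clustering rule: build the MST over all pairwise Euclidean distances, then delete the heaviest edges until k components remain. This is the exact baseline, not the paper's randomized approximation, and the abstract does not define its MST rule precisely, so the rule and the name `mst_clusters` are assumptions.

```python
import math


def mst_clusters(points, k):
    """Label points by cutting the k-1 heaviest edges of the Euclidean MST (exact, O(d * n^2))."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Prim's algorithm on the complete graph: every pair of points is examined once.
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest edge weight connecting each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    # Keep only the n-k lightest MST edges and merge their endpoints with union-find.
    edges.sort()
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for _, a, b in edges[: n - k]:
        root[find(a)] = find(b)
    # Relabel components as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


print(mst_clusters([(0, 0), (0, 1), (5, 5), (5, 6), (20, 0)], 3))
```

The nested loops over all pairs of points are what produce the $n^2$ factor that the approximation algorithms in the abstract aim to beat.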

10.
The identification of coexpressed genes from microarray data is a challenging problem in bioinformatics and computational biology. The objective of this study is to obtain knowledge about the most important genes and clusters related to production outputs of real-world time-series microarray data in the industrial microbiology area. Each sample in the microarray experiment is complemented with measurements of the corresponding production and growth values. A novel aspect of this research is that it considers the relation of coexpression patterns to the measured outputs in order to guide the biological interpretation of results. Shape-based clustering models are developed using the pattern of gene expression values over time and further incorporating knowledge about the correlation between the change in gene expression level and the output value. Experiments are performed on time-series microarray data from bacteria, and an analysis from a biological perspective is carried out. The results confirm the existence of relationships between output variables and gene expression. Moreover, the shape-based clustering methods show promising results and are able to guide metabolic engineering actions by identifying potential targets.

11.
Traditional clustering methods assume that there is no measurement error, or uncertainty, associated with data. Often, however, real-world applications require treatment of data that have such errors. In the presence of measurement errors, well-known clustering methods like k-means and hierarchical clustering may not produce satisfactory results. In this article, we develop a statistical model and algorithms for clustering data in the presence of errors. We assume that the errors associated with data follow a multivariate Gaussian distribution and are independent between data points. The model uses the maximum likelihood principle and provides us with a new metric for clustering. This metric is used to develop two algorithms for error-based clustering, hError and kError, that are generalizations of Ward's hierarchical and k-means clustering algorithms, respectively. We discuss types of clustering problems where error information associated with the data to be clustered is readily available and where error-based clustering is likely to be superior to clustering methods that ignore error. We focus on clustering derived data (typically parameter estimates) obtained by fitting statistical models to the observed data. We show that, for Gaussian distributed observed data, the optimal error-based clusters of derived data are the same as the maximum likelihood clusters of the observed data. We also report briefly on two applications with real-world data and a series of simulation studies using four statistical models: (1) sample averaging, (2) multiple linear regression, (3) ARIMA models for time series, and (4) Markov chains, where error-based clustering performed significantly better than traditional clustering methods.
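The abstract does not give the metric itself, so the sketch below shows only the standard maximum likelihood building blocks that such a model implies when each point has independent Gaussian error with a known diagonal covariance (a simplifying assumption made here). The function names `ml_center`, `error_distance`, and `assign` are hypothetical and are not the paper's hError or kError.

```python
def ml_center(points, variances):
    """Maximum likelihood cluster center: inverse-variance weighted mean per coordinate."""
    center = []
    for d in range(len(points[0])):
        weights = [1.0 / var[d] for var in variances]
        weighted_sum = sum(w * p[d] for w, p in zip(weights, points))
        center.append(weighted_sum / sum(weights))
    return center


def error_distance(point, variance, center):
    """Squared distance with each coordinate scaled by the point's own error variance."""
    return sum((x - c) ** 2 / v for x, v, c in zip(point, variance, center))


def assign(points, variances, centers):
    """Assign each point to the center with the smallest error-scaled distance."""
    return [
        min(range(len(centers)), key=lambda j: error_distance(p, v, centers[j]))
        for p, v in zip(points, variances)
    ]


# A precise point (variance 1) pulls the center harder than a noisy one (variance 4).
print(ml_center([(0.0, 0.0), (10.0, 4.0)], [(1.0, 1.0), (4.0, 1.0)]))
# A point that is very uncertain in its second coordinate is assigned by its first.
print(assign([(1.0, 0.0)], [(1.0, 100.0)], [(4.0, 0.0), (1.0, 10.0)]))
```

In the second call the point is closer to the first center in plain Euclidean distance (3 versus 10), yet it is assigned to the second center because its large error in the second coordinate discounts that difference.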

12.
To unravel the controlling mechanisms of gene regulation, this paper applies sophisticated soft computing methods to an important problem in bioinformatics: inferring gene regulatory networks (GRN) from time-series gene expression microarray data. The main questions addressed are: (a) What knowledge can be derived from different models? (b) Would an integrated approach be more suitable for revealing the controls of gene regulation? To reduce the number of genes, in addition to applying appropriate clustering methods, we also considered inputs from biological experiments. To infer the GRN, we applied three computational intelligence methods: Least Angle Regression (LARS), Expectation Maximization (EM) with Kalman Filter (KF), and an Evolving Fuzzy Neural Network (EFuNN). The methods are applied to time-series microarray data of Schizosaccharomyces pombe yeast cell-cycle genes. Each method reveals some new aspects of the problem, and we conclude that an integrative approach such as ours is more suitable for inferring the GRN and understanding the processes behind gene regulation. Through it, some new knowledge is discovered. Using LARS, we hypothesize, first, that the exoglucanase gene exg1 is tied to MCB cluster regulation and, second, that a mannosidase is associated with histone-linked mannoses. A new quantitative prediction is that the time delay of the interaction between two genes is approximately 30 min, or 0.17 cell cycles. Using EM with KF, 25 cell cycle-regulated key genes were successfully clustered into three functionally co-regulated groups. We have also identified two genes, Cdc22 and Suc22, that interact with each other and are potential candidates for controlling ribonucleotide reductase (RNR) activity.
Based on the EFuNN results and integrating knowledge from the EM-KF method, we hypothesize that the interaction between Suc22, Cdc22, and Mrc1 may be mediated by two other genes, Cds1 and Spd1. The methods discussed and applied here can be used to analyze any kind of short time series of many interacting variables in order to infer the regulatory network. Researchers should take such integrative computational intelligence approaches seriously in order to understand the complex phenomenon of gene regulation and thus to simulate the development of the cell.

13.
Optimal clustering of co-regulated genes is critical for reliable inference of the underlying biological processes in gene expression analysis, for which the K-means algorithm has been widely employed because of its efficiency. However, given that the solution space is large and multimodal, which is typical of gene expression data, K-means is prone to produce inconsistent and sub-optimal cluster solutions that may be unreliable and misleading for biological interpretation. This paper applies a novel global clustering method called the greedy elimination method (GEM) to alleviate these problems. GEM is simple to implement, yet very effective in improving the global optimality of the solutions. Experiments on two sets of gene expression data show that GEM achieves significantly lower clustering errors than standard K-means and the greedy incremental method.

14.
Unlike traditional clustering analysis, the biclustering algorithm works simultaneously on the two dimensions of samples (rows) and variables (columns). In recent years, biclustering methods have developed rapidly and have been widely applied in biological data analysis, text clustering, recommendation systems, and other fields. Traditional clustering algorithms are not well suited to processing high-dimensional or large-scale data. At present, most biclustering algorithms are designed for large differential-expression biological data sets. However, there is little discussion of biclustering for binary data such as miRNA-targeted gene data. Here, we propose a novel biclustering method for miRNA-targeted gene data based on a graph autoencoder, named GAEBic. GAEBic applies a graph autoencoder to capture the similarity of sample sets or variable sets, and adopts a new irregular clustering strategy to mine biclusters with excellent generalization. Based on the miRNA-targeted gene data of soybean, we benchmark several different types of biclustering algorithms and find that GAEBic performs better than Bimax, Bibit, and the Spectral Biclustering algorithm in terms of target gene enrichment. This biclustering method achieves comparable performance on the high-throughput miRNA data of soybean, and it can also be used for other species.

15.
Gene expression data is represented as a matrix in which each row corresponds to a gene and each column to a condition. Microarrays are used in the lab to detect the expression of thousands of genes at a time. Genes encode proteins, which in turn dictate cell function. The production of messenger RNA and its subsequent processing are the two main stages of gene expression. The complexity of biological networks, combined with large volumes of data containing imprecision and outliers, increases the challenge of dealing with such data. Clustering methods are hence essential for identifying the patterns present in massive gene data. Available techniques include hierarchical, partitioning, grid-based, density-based, model-based, and soft clustering approaches. Understanding gene regulation and extracting other useful information from this data is possible only through effective clustering algorithms. Though many methods are discussed in the literature, we concentrate on providing a soft clustering approach for analyzing gene expression data. The population elements are grouped based on the fuzziness principle, and a degree of membership is assigned to every element. This work proposes an improved Fuzzy clustering by Local Approximation of Memberships (FLAME), which overcomes the limitations of the other approaches in dealing with non-linear relationships and provides better segregation of biological functions.
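To illustrate what a "degree of membership" looks like in soft clustering, here is a minimal sketch of the membership formula from standard fuzzy c-means. It is deliberately not FLAME, whose local-approximation scheme the abstract does not describe; the function name `fuzzy_memberships` and the default fuzzifier `m=2.0` are assumptions.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (fuzzifier m > 1).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance to center i.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers shares full membership among them.
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        return [1.0 / len(on_center) if i in on_center else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# A gene profile three times closer to the first center than to the second.
print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)]))
```

The memberships of a point always sum to 1, so a gene can belong partly to several clusters instead of being forced into one.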

16.
17.
Microarray technology allows for the monitoring of thousands of gene expressions in various biological conditions, but most of these genes are irrelevant for classifying these conditions. Feature selection is consequently needed to help reduce the dimension of the variable space. Starting from the application of the stochastic meta-algorithm "Optimal Feature Weighting" (OFW) for selecting features in various classification problems, the focus here is on the multiclass problem, which wrapper methods rarely handle. From a computational point of view, one of the main difficulties comes from the unbalanced class sizes commonly encountered in microarray data. From a theoretical point of view, very few methods have been developed so far to minimize the classification error made on the minority classes. The OFW approach is developed to handle multiclass problems using CART and one-vs-one SVM classifiers. Comparisons are made with other multiclass selection algorithms, such as Random Forests and the F-test filter method, on five public microarray data sets of varying complexity. The statistical relevance of the gene selections is assessed by computing the performance and stability of these different approaches, and the results show that the two proposed approaches are competitive and relevant for selecting genes that classify the minority classes. An application to a pig folliculogenesis study follows, and a detailed interpretation of the selected genes shows that the OFW approach answers the biological question.

18.
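To address the low accuracy of traditional clustering methods for gene expression data, this paper studies the application of the support vector data description (SVDD) algorithm to gene expression data clustering; the method clusters a data set effectively by finding the optimal classification hypersphere. Between-class information is incorporated into the cluster validity criterion, simulated annealing is used to search for the optimal kernel parameter and penalty factor of the SVDD algorithm, and non-sample data are introduced during training to improve computational efficiency. Simulation experiments on a yeast cell-cycle gene expression data set show that parameter optimization under the new cluster validity criterion finds the best parameters faster and more reliably, and that the algorithm offers both high clustering accuracy and fast computation.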

19.
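In the life sciences, species and genes need to be classified in order to understand the inherent structure of populations. Data clustering methods are used to effectively identify patterns in gene expression data and classify them, so that items with highly similar features are grouped into one class and items with highly dissimilar features into different classes. This is of great importance for studying the structure and function of genes and the relationships among different kinds of genes. Graph-theoretic methods are used to perform an initial clustering of gene expression data from molecular biology, and the result is then refined in combination with other algorithms, such as a K-nearest-neighbor self-learning clustering algorithm or a centroid-based self-learning clustering algorithm. For a given clustering criterion, the approach can produce globally optimal clusters. Finally, the algorithm is analyzed and discussed, and verified experimentally on simulated data.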

20.
Unsupervised clustering methods such as K-means, hierarchical clustering, and fuzzy c-means have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Recent studies have suggested that incorporating biological information into validation methods for assessing the quality of clustering results might facilitate biological and biomedical knowledge discovery. In this study, we generalize two bio-validity indices, the biological homogeneity index and the biological stability index, to quantify the abilities of soft clustering algorithms such as fuzzy c-means and model-based clustering. The results of an evaluation of several existing soft clustering algorithms using simulated and real data sets indicate that the soft versions of the indices provide both better precision and better accuracy than the classical ones. The significance of the proposed indices is also discussed.
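As a point of reference for the classical (hard-clustering) index that the abstract generalizes, the following is a minimal sketch of a biological homogeneity index as it is commonly described: for each cluster, the fraction of annotated gene pairs that share at least one functional class, averaged over clusters. The soft version proposed in the paper is not given in the abstract and is not attempted here. The function name, the dictionary-of-sets annotation format, and the choice to skip clusters with fewer than two annotated genes are assumptions.

```python
from itertools import combinations


def biological_homogeneity_index(clusters, annotations):
    """Average, over clusters, of the share of annotated gene pairs with a common functional class.

    clusters: list of lists of gene names (a hard partition).
    annotations: dict mapping gene name -> set of functional class labels.
    """
    scores = []
    for genes in clusters:
        annotated = [g for g in genes if annotations.get(g)]
        if len(annotated) < 2:
            continue  # assumption: clusters without an annotated pair are left out
        pairs = list(combinations(annotated, 2))
        agree = sum(1 for a, b in pairs if annotations[a] & annotations[b])
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


clusters = [["g1", "g2"], ["g3", "g4", "g5"]]
annotations = {"g1": {"A"}, "g2": {"A", "B"}, "g3": {"A"}, "g4": {"B"}}  # g5 is unannotated
print(biological_homogeneity_index(clusters, annotations))
```

In this toy input the first cluster is fully homogeneous and the second is not, so the index lands halfway between 0 and 1.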
