Similar literature
1.
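Independent component analysis (ICA) is a statistical method for blind source separation. ICA is used to decompose gene microarray expression profile data into a linear profile model made up of gene model profiles and their corresponding coefficients, and genes are then classified on that basis. Because a single ICA model profile cannot fully represent a biologically meaningful class, and because the model profiles are not mutually orthogonal, this linear model cannot represent gene data effectively. The paper therefore proposes the concept of an ICA-based pattern expression space, reconstructs the representation of gene data in that pattern space, and uses the new representation for gene classification. Experimental results show that this classification method achieves higher accuracy than gene classification under the linear profile model.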

2.
Recently, biology has been confronted with large multidimensional gene expression data sets where the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method, which assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents, and a corpus including those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), to search for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate subjects that are unrelated to each other.

3.
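Gene expression data have few samples, high gene dimensionality, and nonlinear structure. To handle such data effectively, a smooth neighbor representation subspace clustering algorithm is proposed. The linear representation of each data point by its neighbors captures the nonlinear structure of the data set, and a smoothness constraint on the neighbor representation embeds the distances between a data point and its neighbors into that point's reconstruction. Experiments on gene expression data show that the proposed method outperforms several existing methods, indicating that it is effective for clustering gene expression data.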

4.
Jiang Tao, Li Zhanhuai, Shang Xuequn, Chen Bolin, Li Weibang. 《计算机科学》 (Computer Science), 2016, 43(7): 191-196, 223
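Gene expression data analysis is generally carried out by mining local patterns. The order-preserving submatrix (OPSM) is a classic model in local pattern mining; it captures a group of genes that show a consistent trend under a number of conditions. Advances in high-throughput gene microarray technology have produced massive volumes of gene expression data, creating an urgent need for high-performance analysis algorithms. Most existing methods analyze the data by batch mining, and even the methods that obtain exact results through queries leave room for improvement in comprehensiveness and performance. To improve the efficiency and accuracy of data analysis, the paper first proposes gIndex, a prefix-tree-based index for gene expression data, and then presents GEQc, an OPSM analysis method based on column keyword queries. GEQc needs no batch mining: it only builds the index and then uses keywords to query patterns such as positive correlation, negative correlation, and time lag. Experimental results show that, compared with existing methods, the proposed algorithm offers good analysis efficiency and scalability.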
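As a rough illustration of the order-preserving submatrix idea only, the sketch below scans every gene and keeps those whose expression strictly rises along a given column order. It is a naive linear scan, not the gIndex prefix tree or the GEQc query method from the abstract; the function names and the toy data are hypothetical.

```python
def follows_order(values, order):
    """True if the values strictly increase along the given column order."""
    return all(values[a] < values[b] for a, b in zip(order, order[1:]))


def query_opsm(expr, order):
    """Naive linear scan (assumption: no index is used).

    expr:  dict mapping gene name -> list of expression values, one per column
    order: list of column indices; a gene matches if its values rise along it
    """
    return [gene for gene, values in expr.items() if follows_order(values, order)]


# Hypothetical toy matrix: 3 genes measured under 4 conditions.
expr = {
    "g1": [0.2, 0.9, 0.5, 1.4],
    "g2": [1.0, 3.1, 2.0, 4.2],
    "g3": [2.0, 1.0, 3.0, 0.5],
}

# Genes whose expression rises in the column order 0 -> 2 -> 1 -> 3.
rows = query_opsm(expr, order=[0, 2, 1, 3])
print(rows)
```

Here g1 and g2 have very different magnitudes but share the same ordering over the four columns, so together with those columns they form an order-preserving submatrix, while g3 does not.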

5.
A gene selection method based on tolerance relations   Total citations: 1, self-citations: 0, citations by others: 1
Jiao Na, Miao Duoqian. 《计算机科学》 (Computer Science), 2010, 37(10): 217-220
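Effective gene selection is an important part of gene expression data analysis. Rough set theory, as a soft computing method, can reduce attributes while preserving the classification ability of the data set. Because gene expression data are continuous, and to avoid the information loss caused by the discretization that classical rough set methods require, tolerance rough sets are applied to gene feature selection and a gene selection method based on tolerance relations is proposed. First, the gene expression data are ranked with a t-test and a number of top-scoring genes are selected; then these genes are further reduced with tolerance rough sets. Experiments on two benchmark gene expression data sets show that the method is feasible and effective.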
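The first stage described above, ranking genes by a t-test and keeping the top scorers, might look roughly like the following sketch. The abstract does not say which t-test variant is used, so Welch's t-statistic is an assumption here, as are the function names and the toy data. The second stage (reduction with tolerance rough sets) is not shown.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples.

    Assumes each sample has at least two values and that the two
    sample variances are not both zero.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes_by_t(expr, labels, top_k):
    """Return the top_k genes with the largest absolute t-statistic.

    expr:   dict mapping gene name -> list of expression values, one per sample
    labels: list of 0/1 class labels in the same sample order
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: 3 genes, 6 samples (3 per class).
expr = {
    "geneA": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # strongly separates the classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
    "geneC": [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # weakly separates the classes
}
labels = [0, 0, 0, 1, 1, 1]

top = rank_genes_by_t(expr, labels, top_k=2)
print(top)
```

Ranking by the absolute statistic treats up-regulated and down-regulated genes alike, which is one reasonable choice for a filter step that only has to shrink the candidate set before a more expensive reduction.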

6.
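Clustering gene expression data is an important way to discover gene function and establish gene regulatory networks, and computational intelligence offers a new route to analyzing large volumes of gene data. Based on the characteristics of gene expression data, this paper identifies the key problems in gene expression data clustering, discusses a basic framework for clustering based on computational intelligence, surveys current applications of computational intelligence to gene data clustering, and finally points out future directions for computational intelligence methods in this field.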

7.
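The modified ridge regression model (MRRLM) has been proven to discover a subset of the Markov boundary of a target variable under certain conditions. However, because the model introduces the covariance matrix, it cannot be solved on data sets with collinear variables. To overcome this defect and find a suitable replacement model, the relationship between the Markov boundary discovery efficiency of MRRLM and that of other regularized linear models is studied empirically in combination with permutation tests, and the applicability of a new variant ridge regression model (NVRRLM) is examined. The experimental results show that on low-dimensional continuous data sets, MRRLM discovers Markov boundary subsets far more efficiently than the ridge regression model but about as efficiently as the lasso and elastic net models; on low-dimensional binary discrete data sets, MRRLM, ridge regression, the lasso, and the elastic net are about equally efficient; and NVRRLM can be used to discover Markov boundary subsets on data sets with collinear variables. These results provide a basis for choosing a suitable replacement for MRRLM on low-dimensional data sets with collinear variables.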

8.
Gaussian graphical models are promising tools for analysing genetic networks. In many applications, biologists have some knowledge of the genetic network and may want to assess the quality of their model using gene expression data. The paper therefore introduces a novel procedure for testing the neighborhoods of a Gaussian graphical model. It is based on the connection between the local Markov property and conditional regression of a Gaussian random variable. By adapting recent results on tests for high-dimensional Gaussian linear models, the authors prove that the testing procedure inherits appealing theoretical properties. It is also applicable and computationally feasible in a high-dimensional setting, where the number of nodes may be much larger than the number of observations. A large part of the study is devoted to illustrating and discussing applications to simulated data and to biological data.

9.
To infer the structure of a gene regulatory network (GRN), it is important to identify, for each gene in the GRN, which other genes can affect its expression and how they can affect it. For this purpose, many algorithms have been developed to generate hypotheses about the presence or absence of interactions between genes. These algorithms, however, cannot be used to determine whether a gene activates or inhibits another. To obtain such information and better infer GRN structures, we propose a fuzzy data mining technique. By transforming quantitative expression values into linguistic terms, it defines a measure of fuzzy dependency among genes. Using this measure, the technique is able to discover interesting fuzzy dependency relationships in noisy, high-dimensional time series expression data, so that it can determine not only whether a gene is dependent on another but also whether a gene is supposed to be activated or inhibited. In addition, the technique can predict how a gene in an unseen sample (i.e., expression data that are not in the original database) would be affected by other genes in it, which makes statistical verification of the reliability of the discovered gene interactions easier. For evaluation, the proposed technique has been tested using real expression data, and experimental results show that the use of a fuzzy-logic-based technique in gene expression data analysis can be quite effective.
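The step of turning quantitative expression values into linguistic terms can be pictured with simple membership functions. The sketch below is illustrative only: the abstract does not give the membership functions or the dependency measure, so the three terms, the breakpoints on a value normalized to [0, 1], and the function names are all assumptions.

```python
def fuzzify(x):
    """Degrees of membership of a normalized expression value x in [0, 1].

    Assumed shapes: "low" falls linearly from 1 at x=0 to 0 at x=0.5,
    "medium" is a triangle peaking at x=0.5, and "high" rises linearly
    from 0 at x=0.5 to 1 at x=1.
    """
    return {
        "low": max(0.0, min(1.0, (0.5 - x) / 0.5)),
        "medium": max(0.0, 1.0 - abs(x - 0.5) / 0.5),
        "high": max(0.0, min(1.0, (x - 0.5) / 0.5)),
    }


def strongest_term(x):
    """Linguistic term with the highest membership degree for x."""
    degrees = fuzzify(x)
    return max(degrees, key=degrees.get)


print(fuzzify(0.25))        # partly "low", partly "medium"
print(strongest_term(0.9))  # dominated by "high"
```

Because a value can belong partly to two neighboring terms, small measurement noise shifts membership degrees gradually instead of flipping a hard category, which is the usual motivation for fuzzy rather than crisp discretization of noisy expression data.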

10.
11.
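The DNA microarray is a major breakthrough in experimental molecular biology. It lets researchers monitor changes in the expression levels of many genes under many experimental conditions at once, and so provides technical support for discovering gene co-expression networks, developing drugs, and preventing disease. Researchers have proposed a large number of clustering algorithms for analyzing gene expression data, but standard (one-way) clustering algorithms can uncover only limited knowledge, because genes are unlikely to be co-expressed under all experimental conditions or to show the same expression level, yet may take part in several genetic pathways. Biclustering methods emerged in response. They shift gene expression data analysis from global patterns to local patterns, moving beyond clustering that relies on all objects or all attributes of the data. This survey reviews the definition of local patterns, their types and criteria, and their mining and querying. It introduces the current state of and progress in local pattern mining for gene expression data, summarizes in detail the quantitative and qualitative criteria for local pattern mining and the related mining systems, analyzes open problems, and discusses future research directions in depth.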

12.
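An ensemble feature gene selection method based on recursive classification trees   Total citations: 15, self-citations: 1, citations by others: 14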
Li Xia, Zhang Tianwen, Guo Zheng. 《计算机学报》 (Chinese Journal of Computers), 2004, 27(5): 675-682
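Identifying disease-related genes from DNA chip gene expression profiles is of great practical importance for the classification, diagnosis, and pathology of cancer and other diseases. This paper proposes EFST (Ensemble Feature Selection based on Recursive Partition-Tree), an ensemble method for selecting feature genes based on recursive classification trees. EFST selects multiple groups of feature genes based on different sample distribution structures, combines them with the multi-classifier ensemble decision technique from supervised machine learning, and uses proposed measures of feature gene stability and significance to integrate the groups and select the final feature genes. An analysis of experimental expression profile data for 2000 colon cancer genes shows that EFST can find disease-related genes and strongly compress data dimensionality, and four pattern classification methods, including support vector machines (SVM), confirm that EFST markedly improves the accuracy of disease classification.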

13.
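DNA microarray technology has produced large amounts of time-series gene expression data, and clustering these data is an important way to extract the biomolecular information they contain. A hierarchical clustering method based on hidden Markov models (HMMs) is proposed. The time-series gene expression data are first preprocessed (standardized and discretized) according to their statistical properties; HMMs are then fitted to the preprocessed data to exploit the correlation between time points; finally, the fitted models are clustered hierarchically. Experimental results show that the method not only produces good clusters but can also determine the optimal number of clusters.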

14.
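Gene sequence data often contain many non-coding and missing sequences. Most existing gene sequence representations rely on manual feature extraction from high-dimensional sequences, which is time-consuming, and successful prediction depends heavily on the correct use of biological knowledge. Building on virus transmission networks, the paper constructs a gene sequence representation method based on graph context information: after the virus sequence of a target node is encoded, an attention mechanism aggregates the sequence information of its neighboring nodes to obtain a new low-dimensional representation of the target node's virus sequence. The representation model is then optimized using the property that adjacent nodes in a virus transmission network have more similar gene sequences than non-adjacent nodes. The representation obtained after training expresses the features of gene sequences effectively, greatly reduces sequence dimensionality, and improves computational efficiency. The model is trained on a simulated virus transmission network and on transmission network data for the novel coronavirus and HIV, and is then used to find unsampled infected individuals on the corresponding networks. The experimental results validate the effectiveness of the model, and comparisons with other methods demonstrate its efficiency. The model can find unsampled infected individuals on virus transmission networks effectively, which is also of practical value for epidemiological investigation.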

15.
Microarrays can detect the expression levels of thousands of genes simultaneously, so gene expression data from DNA microarrays are characterized by many measured variables (genes) on only a few samples. One important application of gene expression data is sample classification. In statistical terms, the very large number of predictors compared to the small number of samples makes most classical “class prediction” methods unusable. Generally, this problem can be avoided by selecting only the relevant features or by extracting, from the original data, new features that contain the maximal information about the class label. In this paper, a new method for gene selection based on independent variable group analysis (IVGA) is proposed. We first use the t-statistic to select a subset of genes from the original data. We then use IVGA to select, from that subset, the key genes for tumor classification. Finally, we use an SVM to classify tumors based on the key genes selected by IVGA. To validate its efficiency, the proposed method is applied to classify three different DNA microarray data sets. The prediction results show that our method is efficient and feasible.

16.
Yan Wensheng. 《计算机工程》 (Computer Engineering), 2011, 37(5): 202-203, 206
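Based on the characteristics of gene expression data, a visual clustering method based on a spring model is proposed. It maps gene expression data from a multidimensional space to a two-dimensional space while largely preserving the spatio-temporal similarity among the original multidimensional data. Experimental results show that the method can reveal the cluster structure and co-expressed gene patterns hidden in gene expression data sets.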

17.
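Traditional linear models for predicting gene expression cannot cope with the high dimensionality, small sample sizes, and nonlinearity of gene expression profile data. To address this, a gene expression regression prediction model (DCIO-DNN_GM) is proposed, based on a deep neural network with directly connected input and output (DCIO-DNN) and on transfer learning. A new network structure is proposed that can model both the linear and the nonlinear mappings between landmark and target genes, and a transfer learning strategy and regularization techniques are introduced to train the model on small data sets. Experimental results show that the model scores higher on every evaluation metric.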

18.
A decomposition-based gene selection algorithm   Total citations: 1, self-citations: 0, citations by others: 1
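Gene expression data consist of thousands of genes and only a few dozen samples, so effective gene selection algorithms are an important topic in gene expression research. Rough sets are an effective tool for removing redundant features. However, for gene expression data with thousands of features and a few dozen samples, existing rough-set-based feature selection algorithms become very inefficient. To address this, decomposition is applied to feature selection and a decomposition-based feature selection algorithm is proposed. The algorithm splits a complex table into a master table and sub-tables that are simpler and easier to process, and then joins their results to solve the problem for the original table. Experimental results show that the algorithm markedly improves computational efficiency while maintaining classification accuracy.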

19.
Variational learning for switching state-space models   Total citations: 6, self-citations: 0, citations by others: 6
We introduce a new statistical model for time series that iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time-series models -- hidden Markov models and linear dynamical systems -- and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact expectation maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log-likelihood and makes use of both the forward and backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm on artificial data sets and a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

20.
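Gene expression data are high-dimensional, small-sample, noisy, and highly redundant, which makes traditional clustering methods inefficient. Subspace segmentation is an effective means of clustering high-dimensional data, but applying it directly to gene expression data degrades clustering performance. For more effective clustering, this paper proposes a low-rank projection least squares regression subspace segmentation method. First, an improved low-rank method projects the data onto a latent subspace to remove possible corruptions and obtain a cleaner data dictionary. Then least squares regression is used to obtain a low-dimensional representation of the data and construct an affinity matrix, which is used to perform the clustering. Experiments on six public gene expression data sets demonstrate the effectiveness of the method.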
