Similar articles
1.
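Predicting intrinsically disordered proteins by incorporating central amino acid composition   Cited by: 1 (self-citations: 0, by others: 1)
Short disordered regions are hard to characterize when predicting the structure of intrinsically disordered proteins, so this paper proposes a prediction method that incorporates the composition of the central amino acid. A sliding window is used to compute the frequency of each of the 20 amino acids within the window, and a sub-predictor is built on these frequencies. The statistical probability that the amino acid at the window center forms a disordered structure is computed and used as a new feature. The sub-predictor output and the new feature are each assigned a coefficient and combined by weighting, which yields a combined-model predictor for intrinsically disordered protein structure. Experimental results show that the predictor significantly improves prediction accuracy for short disordered regions while maintaining high accuracy for long disordered regions.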
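The sliding-window composition and the weighted combination described in this abstract can be sketched roughly as follows. This is only an illustration: the function names, the window half-width, and the coefficients `alpha` and `beta` are assumptions, not values taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def window_composition(seq, center, half_width):
    """Frequency of each of the 20 amino acids in the window around `center`.

    The window is truncated at the sequence ends (an assumption; the paper
    may pad instead).
    """
    start = max(0, center - half_width)
    end = min(len(seq), center + half_width + 1)
    window = seq[start:end]
    counts = Counter(window)
    return [counts.get(aa, 0) / len(window) for aa in AMINO_ACIDS]

def combined_score(sub_score, center_propensity, alpha=0.7, beta=0.3):
    """Weighted sum of the sub-predictor output and the center-residue
    disorder propensity. alpha and beta are hypothetical coefficients."""
    return alpha * sub_score + beta * center_propensity
```

In this sketch the composition vector would feed whatever sub-predictor is chosen, and `combined_score` stands in for the final weighted combination.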

2.
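Accurate prediction of protein function helps advance biomedicine. The rapid development of high-throughput sequencing has accelerated the acquisition of protein sequences, producing a large number of unannotated proteins, and newly sequenced proteins lack structural and other biological information. To address this problem, a protein function prediction model based on sequences and combined graph convolutional networks is proposed (Protein Function Prediction using Sequences and Combined Graph Convolutional Networks, PFP-SCGCN). First, deep learning is used to capture multi-dimensional feature information from the protein sequence. Next, evolutionary coupling information and amino acid residue communities are extracted from the sequence through multiple sequence alignment, and these are used to generate two adjacency matrices with different degrees of connectivity among the amino acids of the sequence. The two adjacency matrices, together with the sequence features, are fed to the combined graph convolutional network for information fusion. Finally, several fully connected layers yield the protein function classes. The paper also analyzes specific network layers of PFP-SCGCN to identify protein functional sites, which can help infer the important amino acids in new sequences. Results show that PFP-SCGCN achieves far higher function prediction accuracy than the compared methods, is robust, and identifies functional sites fairly accurately.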

3.
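With the arrival of the post-genomic era, the volume of protein data has grown sharply. To predict protein structure accurately, a deep learning method is proposed for the protein secondary structure classification problem. Features are extracted from protein sequences with a pseudo-amino acid composition method made up of approximate entropy, hydrophobicity patterns, and image features. The prediction model uses a 5-layer deep Boltzmann machine (DBM) plus a classification layer; the 5 DBM layers form 4 RBMs, and the classification layer is a softmax classifier. Both unsupervised and supervised learning are used as training strategies. The proposed method achieves higher accuracy than the better-performing existing approaches, support vector machines (SVM) and artificial neural networks (ANN). Experimental results show that the improved method is feasible and effective.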

4.
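Amino acids are the basic units of proteins, and for protein classification and prediction problems the method used to extract features from the amino acid sequence is a very important factor. This paper describes and briefly evaluates feature extraction algorithms based on amino acid composition and position, such as entropy density and n-order coupled composition, and methods based on amino acid properties, such as autocorrelation functions and pseudo-amino acid composition. Composition-based methods are simple to implement, computationally cheap, and applicable to all amino acid sequences, but they lose the order of the amino acids and the interactions between them. Methods based on positional information or physicochemical properties are computationally very expensive. Researchers can choose a feature extraction method according to their requirements for the protein at hand.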

5.
罗林波  陈绮 《微机发展》2010,(2):206-208,212
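Amino acids are the basic units of proteins, and for protein classification and prediction problems the method used to extract features from the amino acid sequence is a very important factor. This paper describes and briefly evaluates feature extraction algorithms based on amino acid composition and position, such as entropy density and n-order coupled composition, and methods based on amino acid properties, such as autocorrelation functions and pseudo-amino acid composition. Composition-based methods are simple to implement, computationally cheap, and applicable to all amino acid sequences, but they lose the order of the amino acids and the interactions between them. Methods based on positional information or physicochemical properties are computationally very expensive. Researchers can choose a feature extraction method according to their requirements for the protein at hand.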

6.
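Constructing the feature vector is a key problem in protein secondary structure prediction. Existing approaches usually generate the PSSM using only the BLOSUM62 evolutionary (substitution) matrix and give little consideration to the amino acid residue mutations that occur during protein evolution. This paper proposes constructing protein feature vectors from multiple evolutionary matrices, fusing PSSMs for different evolutionary times, so that the features reflect both the positional information of amino acids in the sequence and the effect of mutations at amino acid sites during sequence evolution. Feature vectors are built by combining matrices of different evolutionary degrees; logistic regression, random forest, and multi-class support vector machines are used as prediction tools; parameters are optimized by grid search and cross-validation; and several groups of experiments are run on the public datasets RS126, CB513, and 25PDB. Comparative results show that the proposed feature vector construction based on multiple evolutionary matrices effectively improves the accuracy of protein secondary structure prediction.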

7.
周程  张培林 《计算机应用》2012,32(9):2628-2630
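Choosing the weighting strategy is a difficult issue in combined forecasting models of logistics freight volume. Building on the grey model, the cubic polynomial trend extrapolation model (PTEM), and the triple exponential smoothing model (TESM), the relational area method is introduced to determine the combination weights, and a combined forecasting model for logistics freight volume is constructed. A case study comparing it with the equal-weight method, the entropy-weight method, and the mean absolute error method shows that the relational area method reflects both the correlation between model forecasts and the actual time series and the fitting error, improves forecasting performance and accuracy, and is an effective combination weighting strategy.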
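To make the idea of combination weights concrete, here is a minimal sketch of one of the baseline strategies the abstract names, weighting by mean absolute error. It does not implement the relational area method, whose formula is not given here; the inverse-MAE normalization and the function names are assumptions made for illustration.

```python
def inverse_mae_weights(actual, forecasts):
    """Weight each model in proportion to 1 / MAE, normalized to sum to 1.

    `forecasts` is a list of forecast series, one per model.
    Assumes every model has a non-zero error.
    """
    maes = [
        sum(abs(a - f) for a, f in zip(actual, fc)) / len(actual)
        for fc in forecasts
    ]
    inverse = [1.0 / m for m in maes]
    total = sum(inverse)
    return [w / total for w in inverse]

def combine(forecasts, weights):
    """Weighted sum of the individual model forecasts at each time step."""
    steps = len(forecasts[0])
    return [
        sum(w * fc[t] for w, fc in zip(weights, forecasts))
        for t in range(steps)
    ]
```

Setting every weight to `1 / len(forecasts)` instead would give the equal-weight baseline.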

8.
于琼  田宪 《计算机工程与科学》2021,43(10):1817-1825
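To address the low construction efficiency and low accuracy of prediction models for nonlinear time series in complex systems, a HURST-EMD prediction algorithm based on a combined model is proposed. The EMD algorithm decomposes the nonlinear time series into IMFs that represent the features of the original series; the Hurst exponent is then introduced to merge IMFs of the same class into new components; finally, an LS-SVR-ARIMA model is used for combined prediction. The algorithm includes steps such as classifying and merging the series, which reduces the modeling computation and yields an efficient and accurate prediction model. To validate the model, the public Shanghai Composite Index dataset and real traffic flow data are used. Experimental results show that the improved combined-model HURST-EMD algorithm achieves better prediction accuracy while also improving prediction efficiency.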

9.
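Because interactions between different types of amino acids affect protein structure prediction differently, this paper combines convolutional neural network and long short-term memory network models into a convolutional long short-term memory network and applies it to 8-class protein secondary structure prediction. The protein sequence is first represented using the class information of the amino acid sequence and the evolutionary information of the amino acid structure, and convolution extracts local correlation features between amino acid residues. A bidirectional long short-term memory network then extracts the long-range interactions between residues within the protein sequence. Finally, the extracted local correlation features and long-range interactions are used to predict the 8-class secondary structure. Experiments show that, compared with baseline methods, the model improves 8-class secondary structure prediction accuracy and scales well.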

10.
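Feature description of an amino acid sequence means selecting relevant feature information from the sequence and describing it mathematically so that it correctly reflects the relationship between sequence and structure or function. In problems such as predicting a protein's structural class or subcellular location from its amino acid sequence, the feature description directly affects prediction quality. Comparing how different descriptions affect the results can also help us understand the relationship between sequence and structure or between sequence and function. This paper introduces several feature description methods for amino acid sequences and, using the FDOD equation as the discriminant function, compares the effect of several of them on protein structural class prediction. It finds that all-α and all-β proteins, whose secondary structure is homogeneous, are sensitive to amino acid composition, whereas for mixed proteins, i.e. α+β and α/β proteins, taking the order of amino acid residues into account significantly improves prediction results.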

11.
《Computers & chemistry》1994,18(3):255-258
The identification and characterization of local residue patterns or conserved segments shared by a set of biopolymers has provided a number of insights in molecular biology. Biopolymer sequences are observations from macromolecules that share common structural or functional features. The approach taken here rests on the notion that information may be most efficiently extracted from these observations through the use of a model that faithfully represents macromolecular characteristics. Accordingly, our efforts are focused on statistical models which attempt to capture central features of protein structure, function, and change. Here the assumptions that underlie two new methods for the analysis of protein sequence data are explicitly delineated. (1) Threading of a sequence through structural motifs seeks to determine if a protein sequence fits a known protein structure. The assumptions delineated here also generally apply to other contact-based threading methods that have been recently described. (2) Multiple sequence alignment via the Gibbs sampling algorithm seeks to identify position-specific empirical free energy models for residue sites in common motifs and simultaneously to align the sequence observations from these motifs.

12.
13.
Scrutineer is an interactive, user-friendly program designed to search for motifs, patterns and profiles in the Swissprot, Protein Identification Resource (PIR) or SeqDb protein sequence databases. Basic capabilities include (i) searches for strings of amino acids with multiple choices at a given position; (ii) searches for strings including variable-length segments and delocalized constraints; (iii) searches over subsets of a database or particular regions within each sequence (e.g. N-terminal one-third); (iv) searches involving secondary structure predictions, physicochemical characteristics, and the like; and (v) searches using aligned sequences as targets with various optional weighting schemes. The various search criteria and hits can be combined and complex targets located. Once the data are loaded into virtual memory, all occurrences in PIR release 22.0 (3.7 × 10^6 amino acids) of a given short string of amino acids (e.g. a hexamer) are found in approximately 36 s. Scrutineer can also describe the entire database, user-specified hits, user-defined regions of sequence and all hits. The source code and accompanying manual are being freely distributed.
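Capabilities (i) to (iii) above can be approximated with ordinary regular expressions. The sketch below is not Scrutineer's own syntax or code; the helper names and the example pattern are hypothetical and only illustrate multiple choices at a position (`[HK]`), a variable-length segment (`.{2,4}`), and a search restricted to the N-terminal third.

```python
import re

def find_motif(seq, pattern, region=None):
    """Return (start, matched_text) for each non-overlapping hit of `pattern`.

    `region` is an optional (lo, hi) slice of the sequence to search;
    reported start positions refer to the full sequence.
    """
    lo, hi = region if region else (0, len(seq))
    return [(m.start() + lo, m.group()) for m in re.finditer(pattern, seq[lo:hi])]

def n_terminal_third(seq):
    """Slice bounds covering the first third of the sequence."""
    return (0, len(seq) // 3)

# Hypothetical pattern: C, then 2 to 4 arbitrary residues, then H or K.
PATTERN = r"C.{2,4}[HK]"
hits = find_motif("MKCAAHGSTCPLLHWW", PATTERN)
```

Note that `re.finditer` reports non-overlapping matches only; finding overlapping hits would need a lookahead-based pattern.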

14.
As many structures of protein–DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One of the reasons is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein–DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3% and an MCC of 0.324. The other SVM model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and an MCC of 0.329. Both in cross-validation and independent testing, the second SVM model, which used both DNA and protein sequence data, showed better performance than the first model, which used DNA sequence data only. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone.
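For readers unfamiliar with the three reported measures, the sketch below computes accuracy, F-measure, and MCC from binary confusion-matrix counts using their standard definitions. The function name and the zero-denominator conventions are assumptions; the abstract does not say how its figures were computed or averaged.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, f_measure, mcc) from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0, which is a
    common convention but an assumption here.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f_measure, mcc
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why it is reported alongside accuracy when classes may be imbalanced.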

15.
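Analyzing protein families effectively is an important challenge in bioinformatics, and clustering has become one of the main approaches to it. When similarity between protein sequences is defined with traditional sequence alignment, adjacency conservation among homologous segments is assumed, which conflicts with genetic recombination. To identify protein families better, a protein sequence family mining algorithm, ProFaM, is proposed. ProFaM first mines patterns that characterize protein sequences using a prefix-projection strategy, then constructs a similarity measure from the patterns and their weights, and uses the shared nearest neighbor method to cluster protein sequences into families. This addresses the shortcomings of earlier methods in protein pattern mining and similarity design. Experiments on the protein family database Pfam confirm that ProFaM performs well in protein family analysis.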

16.
The veracity present in molecular data available in biological databases poses new challenges for data analytics. The analysis of molecular data of various diseases can provide vital information for developing a better understanding of the molecular mechanism of a disease. In this paper, an attempt has been made to propose a model that addresses the issue of veracity in data analytics for amino acid association patterns in protein sequences of Swine Influenza Virus. The veracity is caused by intra-sequential and inter-sequential biases present in the sequences due to varying degrees of relationships among amino acids. A complete dataset of 63,682 protein sequences is downloaded from NCBI and refined. The refined dataset consists of 26,594 sequences, which are employed in the present study. The type I fuzzy set is employed to explore amino acid association patterns in the dataset. The type I fuzzy support is refined to partially remove the inter-sequential biases causing veracity in data. The remaining inter-sequential biases present in the refined fuzzy support are evaluated and eliminated using the type II fuzzy set. Hence, it is concluded that a combination of the type II fuzzy and refined fuzzy approaches is the optimal approach for extracting a better picture of amino acid association patterns in the molecular dataset.

17.
The elucidation of protein function by sequence motif analysis   Cited by: 7 (self-citations: 0, by others: 7)
Protein sequence motifs are acquiring increasing prominence in the area of sequence analysis. This review describes the current methods of their construction and their use in the determination of protein function, and offers guidelines on interpreting the data obtained. An appendix is attached which refers to 200 motifs of various kinds.

18.
In this paper, a modified particle swarm optimisation algorithm is proposed for protein sequence motif discovery. Protein sequences are represented as a chain of symbols, and a protein sequence motif is a short sequence that exists in most of the sequences of a protein family. Protein sequence symbols are converted into numbers using a one-to-one amino acid translation table. The simulation uses the EGF protein and C2H2 Zinc Finger protein families obtained from the PROSITE database. Simulation results show that the modified particle swarm optimisation algorithm is effective in obtaining global optimum sequence patterns, achieving classification accuracies of 96.9 and 99.5, respectively, for the EGF and C2H2 Zinc Finger protein families. A better true positive hit result is achieved when compared to the motifs published in the PROSITE database.
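The one-to-one amino acid translation table mentioned above can be pictured with a short sketch. The particular numbering (alphabetical one-letter codes mapped to 1 through 20) is an assumption; the abstract does not give the table actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes

# Hypothetical table: each residue maps to a unique integer 1..20.
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
INT_TO_AA = {i: aa for aa, i in AA_TO_INT.items()}

def encode(seq):
    """Convert a protein sequence into a list of integers."""
    return [AA_TO_INT[aa] for aa in seq]

def decode(codes):
    """Convert a list of integers back into a protein sequence."""
    return "".join(INT_TO_AA[c] for c in codes)
```

Because the mapping is one-to-one, `decode(encode(seq))` returns the original sequence, so a numeric optimiser can work on the integer form and report motifs as residue strings.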

19.
Two statistical models in probabilistic biostratigraphy have been applied to the same dataset: 295 Cenozoic and Mesozoic calcareous nannofossil taxa occurring at 55 geologic sections recovered by the Deep Sea Drilling Project. Model 1 results in a ranked optimum sequence that is significant at a lower limit of probability. Model 2 results in both a ranked sequence and a scaled sequence tied to statistical significance. The ranked solutions are similar in their overall trends, but important differences are apparent. These differences can be attributed to specific assumptions and algorithms explicit for each model, and include the initial evaluation of sampling density for the dataset, the handling of coeval events, and the assumed frequency distribution of an event occurrence in time. An optimum solution for the dataset can be obtained using the ranking approach of Model 1 and the scaling approach of Model 2.

20.
施晋  毛嘉莉  金澈清 《软件学报》2019,30(3):770-783
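Travel time prediction for urban roads is crucial for route planning and traffic management. Although travel time prediction is affected by road segment dependencies, spatio-temporal correlation, and other factors, existing methods do not consider how to incorporate external factors into the model. They may therefore introduce erroneous information or ignore the dependencies between upstream and downstream segments when modeling road segments, which results in poor prediction accuracy. A two-stage travel time prediction framework is therefore proposed. First, a Skip-Gram model encodes the road segment sequences obtained by map-matching the trajectory data, mapping them to low-dimensional vectors; this encoding avoids introducing erroneous information while preserving the upstream-downstream dependencies between segments. Then, building on the segment encoding, external factors such as weather and date are integrated, and a travel time prediction model for urban roads based on a deep neural network is designed. Comparative experiments on a real taxi trajectory dataset show that the proposed method achieves higher prediction accuracy than the compared algorithms.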
