Similar literature
 20 similar articles found (search time: 472 ms)
1.
The PRECISE database was developed by our laboratory to allow for the systematic study of the ligand interactions common to a set of functionally related enzymes, where an interaction site is defined broadly as any residue(s) that interact with a ligand. During the construction of PRECISE, enzyme chains are extracted from the Protein Data Bank (PDB) and clustered according to functional homology as defined by the Enzyme Commission (EC) nomenclature system. A sequence representative is chosen from each cluster based on the criterion set forth by the non-redundant PDB set, and pair-wise alignments of each cluster member to the representative are performed. Atom-based residue–ligand interactions are calculated for each cluster member, and the summation of ligand interactions for all cluster members at each aligned position is determined. Although we were able to successfully align most clusters using a simple dynamic programming algorithm, several of the clusters created exhibited poor pair-wise alignments of each cluster member to its sequence representative. We hypothesized that the observed alignment problems were, in most cases, due to the incorrect separation and alignment of different domains in multi-domain proteins, a mistake that frequently causes error proliferation in functional annotation. Here we present the results of generating primary sequence patterns for each poorly aligned cluster in PRECISE to assess the extent to which incorrectly aligned multi-domain proteins contribute to poor pair-wise alignments of each cluster member to its representative. This requires the use of an iterative locally optimal pair-wise alignment algorithm to build a hierarchical similarity-based sequence pattern for a set of functionally related enzymes. 
Our results show that poor alignments in PRECISE are caused most frequently by the misalignment of multi-domain proteins, and that the generation of primary sequence patterns for the assignment of sequence family membership yields better alignments for the functionally related enzyme clusters in PRECISE than our original alignment algorithm.

2.
The PRECISE database was developed by our laboratory to allow for the systematic study of the ligand interactions common to a set of functionally related enzymes, where an interaction site is defined broadly as any residue(s) that interact with a ligand. During the construction of PRECISE, enzyme chains are extracted from the Protein Data Bank (PDB) and clustered according to functional homology as defined by the Enzyme Commission (EC) nomenclature system. A sequence representative is chosen from each cluster based on the criterion set forth by the non-redundant PDB set, and pair-wise alignments of each cluster member to the representative are performed. Atom-based residue–ligand interactions are calculated for each cluster member, and the summation of ligand interactions for all cluster members at each aligned position is determined. Although we were able to successfully align most clusters using a simple dynamic programming algorithm, several of the clusters created exhibited poor pair-wise alignments of each cluster member to its sequence representative. We hypothesized that the observed alignment problems were, in most cases, due to the incorrect separation and alignment of different domains in multi-domain proteins, a mistake that frequently causes error proliferation in functional annotation. Here we present the results of generating primary sequence patterns for each poorly aligned cluster in PRECISE to assess the extent to which incorrectly aligned multi-domain proteins contribute to poor pair-wise alignments of each cluster member to its representative. This requires the use of an iterative locally optimal pair-wise alignment algorithm to build a hierarchical similarity-based sequence pattern for a set of functionally related enzymes. 
Our results show that poor alignments in PRECISE are caused most frequently by the misalignment of multi-domain proteins, and that the generation of primary sequence patterns for the assignment of sequence family membership yields better alignments for the functionally related enzyme clusters in PRECISE than our original alignment algorithm.

3.
This paper describes three weighting schemes for improving the accuracy of progressive multiple sequence alignment methods: (1) global profile pre-processing, to capture for each sequence information about other sequences in a profile before the actual multiple alignment takes place; (2) local pre-processing, which incorporates a new protocol to use only non-overlapping local sequence regions to construct the pre-processed profiles; and (3) local-global alignment, a weighting scheme based on the double dynamic programming (DDP) technique to softly bias global alignment to local sequence motifs. The first two schemes allow the compilation of residue-specific multiple alignment reliability indices, which can be used in an iterative fashion. The schemes have been implemented with associated iterative modes in the PRALINE multiple sequence alignment method, and have been evaluated using the BAliBASE benchmark alignment database. These tests indicate that PRALINE is a toolbox able to build alignments of very high quality. We found that local profile pre-processing raises the alignment quality by 5.5% compared to PRALINE alignments generated under default conditions. Iteration enhances the quality by a further percentage point. The implications of multiple alignment scoring functions and iteration in relation to alignment quality and benchmarking are discussed.

4.
Multiple sequence alignment is of central importance to bioinformatics and computational biology. Although a large number of algorithms for computing a multiple sequence alignment have been designed, the efficient computation of highly accurate and statistically significant multiple alignments is still a challenge. In this paper, we propose an efficient method that uses a multi-objective genetic algorithm (MSAGMOGA) to discover optimal alignments with affine gaps in multiple sequence data. The main advantage of our approach is that a large number of tradeoff (i.e., non-dominated) alignments can be obtained in a single run with respect to conflicting objectives: affine gap penalty minimization, and similarity and support maximization. To the best of our knowledge, this is the first effort with three objectives in this direction. The proposed method can be applied to any data set with a sequential character. Furthermore, it allows any choice of similarity measure for finding alignments. By analyzing the obtained optimal alignments, the decision maker can understand the tradeoff between the objectives. We compared our method with three well-known multiple sequence alignment methods: MUSCLE, SAGA and MSA-GA. The first of them is a progressive method, and the other two are based on evolutionary algorithms. Experiments on the BAliBASE 2.0 database were conducted, and the results confirm that MSAGMOGA obtains results with better accuracy and statistical significance than the three well-known methods in multiple sequence alignment with affine gaps. The proposed method also finds solutions faster than the other evolutionary approaches mentioned above.
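
The two ideas named in this abstract, an affine gap penalty and a set of non-dominated (tradeoff) alignments, can be illustrated with a minimal Python sketch. It is not the MSAGMOGA implementation: the function names, the default penalties `gap_open=10.0` and `gap_extend=0.5`, and the representation of a candidate as a `(gap_penalty, similarity, support)` tuple are all assumptions made for illustration, and the abstract does not say how similarity or support are computed.

```python
def affine_gap_penalty(row, gap_open=10.0, gap_extend=0.5):
    """Affine penalty of one aligned row.

    Each run of '-' costs gap_open for its first position and
    gap_extend for every further position (assumed convention).
    """
    penalty, in_gap = 0.0, False
    for ch in row:
        if ch == "-":
            penalty += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return penalty


def non_dominated(candidates):
    """Keep the candidates that no other candidate dominates.

    Each candidate is a hypothetical (gap_penalty, similarity, support)
    tuple: the first value is minimized, the other two are maximized.
    """
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] >= b[2]
        better = a[0] < b[0] or a[1] > b[1] or a[2] > b[2]
        return no_worse and better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

A candidate survives the filter when no other candidate is at least as good on all three objectives and strictly better on one, which is the usual definition of a Pareto front.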

5.
Applied Soft Computing, 2007, 7(3): 1121-1130
We describe a new method for pairwise nucleic acid sequence alignment that can also be used for pattern searching and tandem repeat searching within a nucleic acid sequence. The method is broadly a hybrid algorithm employing ant colony optimization (ACO) and the simple genetic algorithm. The method first employs ACO to obtain a set of alignments, which are then further processed by an elitist genetic algorithm that employs primitive selection and a novel multipoint crossover-mutation operator to generate accurate alignments. The resulting alignments show a fair amount of accuracy for small and medium-size sequences. Furthermore, this algorithm can be used rather quickly and efficiently for aligning shorter sequences and also for pattern searching in both nucleic acid and amino acid sequences. It can also be used as an effective local alignment method or as a global alignment tool. Once its accuracy is improved, this method can be extended for use in multiple sequence alignment.

6.
Over the past several decades, biologists have conducted numerous studies examining both general and specific functions of proteins. Generally, if similarities in either the structure or sequence of amino acids exist for two proteins, then a common biological function is expected. Protein function is determined primarily by the structure rather than the sequence of amino acids. Protein structure alignment algorithms are therefore an essential research tool. The quality of such an algorithm depends on the quality of the similarity measure that is used, and the similarity measure is an objective function used to determine the best alignment. However, none of the existing similarity measures has become a gold standard, because each has its own strengths and weaknesses. They require excessive filtering to find a single alignment. In this paper, we introduce a new strategy that finds not a single alignment, but multiple alignments with different lengths. This method has the obvious benefit of high-quality alignments. However, it leads to a new problem: its running time is considerably longer than that of methods that find only a single alignment. To address this problem, we propose algorithms that can locate a common region (CORE) of multiple alignment candidates, and can then extend the CORE into multiple alignments. Because the CORE can be defined from a final alignment, we introduce CORE*, which is similar to CORE, and propose an algorithm to identify the CORE*. By adopting CORE* and dynamic programming, our proposed method produces multiple alignments of various lengths with higher accuracy than previous methods. In the experiments, the alignments identified by our algorithm are longer than those obtained by TM-align by 17% and 15.48%, on average, when the comparison is conducted at the level of super-family and fold, respectively.

7.
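Sequence alignment is a basic information-processing method in bioinformatics and is of great significance for discovering functional, structural, and evolutionary information in biological sequences. This paper describes and evaluates the typical pairwise alignment algorithms Smith-Waterman, FASTA, and BLAST, as well as the multiple sequence alignment algorithm CLUSTAL. In view of the shortcomings common to current sequence alignment algorithms, it briefly introduces the definition and processing procedure of applying KDD techniques to sequence similarity discovery.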
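
For readers unfamiliar with the first algorithm named above, the following is a minimal sketch of Smith-Waterman local alignment scoring in Python. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are illustrative assumptions, not values taken from the paper, and the sketch returns only the best score rather than the aligned region.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment restart at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the table are kept because the score alone is needed; recovering the alignment itself would require the full table and a traceback.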

8.
The accuracy of multiple sequence alignments (MSA) is of great significance for such important biological applications as evolution and phylogenetic analysis, and homology and domain structure prediction. In such analyses, alignment accuracy is crucial. In this paper, we investigate a combined scoring function capable of obtaining a good approximation to the biological quality of the alignment. The algorithm uses the information obtained from the different quality scores in order to improve the accuracy. The results show that the combined score is able to evaluate alignments better than the isolated scores.

9.
Multiple sequence alignment, known to be an NP-complete problem, is among the most important and challenging tasks in computational biology. It is difficult to solve this type of problem directly, and doing so always results in exponential complexity. In this paper, we present a novel algorithm that combines a genetic algorithm with ant colony optimization for multiple sequence alignment. The proposed GA-ACO algorithm enhances the performance of the genetic algorithm (GA) by incorporating a local search, ant colony optimization (ACO). In the proposed GA-ACO algorithm, the genetic algorithm is run to provide diversity among alignments. Thereafter, ant colony optimization is performed to move out of local optima. Simulation results show that the proposed GA-ACO algorithm has superior performance compared to other existing algorithms.

10.
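Multiple sequence alignment (MSA) is an NP problem. To obtain a good alignment, progressive and iterative methods are commonly used, but progressive methods cannot correct early errors, and iterative methods face the problem of escaping local optima. This paper proposes a new refinement method based on an extremal genetic algorithm and a mining strategy. The extremal genetic algorithm is based on extremal combination elements, which reduces the search space and makes the global optimum easier to find. In the implementation, a mining algorithm first extracts the poorly aligned sequence blocks from a given alignment, and all of these blocks are then realigned with the extremal genetic algorithm. When the initial sequences have been aligned with a progressive algorithm, the new refinement method can correct some of the early errors, fully combining the advantages of progressive and iterative algorithms. Finally, the algorithm is validated on data from the BAliBASE database.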

11.
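A DNA multiple sequence alignment method based on the maximum weighted path algorithm   Cited by: 1 (self-citations: 0, citations by others: 1)
Huo Hongwei (霍红卫), Xiao Zhiwei (肖智伟). Journal of Software (《软件学报》), 2007, 18(2): 185-195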
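For the multiple sequence alignment problem in biological sequence analysis, many heuristic algorithms have been proposed to improve computation speed and alignment quality when the input is large. This paper proposes a method for global DNA multiple sequence alignment: MWPAlign (maximum weighted path alignment). The algorithm represents sequence information as a de Bruijn graph and records the information of the input sequences on the edges of the graph. This transforms the problem of finding the consensus sequence into the problem of finding a maximum-weight path in the graph, and reduces the time complexity of multiple sequence alignment to almost linear. Experimental results show that MWPAlign is a feasible multiple sequence alignment algorithm; in particular, for large amounts of sequence data with a mutation rate below 5.2%, it achieves better alignment results and running performance than CLUSTALW (cluster alignments weight), T-Coffee, and HMMT (hidden Markov model training).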
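
The de Bruijn graph representation mentioned in this abstract can be sketched as follows. This is only an illustration of the data structure, not MWPAlign itself: the abstract does not say what information is stored on each edge, so the sketch assumes the simplest choice, an edge weight equal to the number of k-mers from the input sequences that support the edge. The function name and `k=3` are hypothetical.

```python
from collections import Counter


def de_bruijn_edges(sequences, k=3):
    """Build weighted de Bruijn edges from a list of sequences.

    Each k-mer contributes one edge from its (k-1)-length prefix to its
    (k-1)-length suffix; the weight counts how often that k-mer occurs
    across all input sequences (an assumed weighting).
    """
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges
```

For the inputs `"ACGT"` and `"ACGA"` with `k=3`, the shared k-mer `ACG` gives the edge `("AC", "CG")` a weight of 2, while the two diverging k-mers each give an edge of weight 1. A heavily weighted path through such a graph corresponds to sequence content that many inputs agree on.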

12.
The rigorous alignment of multiple protein sequences becomes impractical even with a modest number of sequences, since computer memory and time requirements increase as the product of the lengths of the sequences. We have devised a strategy to approach such an optimal alignment, which modifies the intensive computer storage and time requirements of dynamic programming. Our algorithm randomly divides a group of unaligned sequences into two subgroups, between which an optimal alignment is then obtained by a Needleman-Wunsch style of algorithm. Our algorithm uses a matrix with dimensions corresponding to the lengths of the two aligned sequence subgroups. The pairwise alignment process is repeated using different random divisions of the whole group into two subgroups. Compared with the rigorous approach of solving the n-dimensional lattice by dynamic programming, our iterative algorithm results in alignments that match or are close to the optimal solution, on a limited set of test problems. We have implemented this algorithm in a computer program that runs on the IBM PC class of machines, together with a user-friendly environment for interactively selecting sequences or groups of sequences to be aligned either simultaneously or progressively.
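
The Needleman-Wunsch style of dynamic programming referred to above can be sketched for the simplest case of two single sequences. The abstract aligns two subgroups of sequences, which would need column-to-column scores; this sketch does not attempt that. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # Row 0: aligning the empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```

The conceptual table has one row per position of `a` and one column per position of `b`, which is why the cost of the rigorous multi-sequence version grows as the product of all sequence lengths. Only two rows are stored here because the sketch returns the score without a traceback.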

13.
Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping of reads from multiple reference genomes is not possible using a pairwise mapping algorithm. Existing multiple sequence alignment (MSA) methods cannot be used to align the reads with respect to each other and the reference genomes, because they do not take into account the position of these short reads with respect to the genome and are highly inefficient for a large number of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such a large number of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns the erroneous reads, and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing number of processors.

14.
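In conformance checking for business process discovery, existing multi-perspective alignment methods between event logs and process models can obtain the optimal alignment of only one trace with the process model at a time; moreover, the heuristic function used in computing optimal alignments is complex, so optimal alignments are computed inefficiently. To address this, a multi-perspective alignment method between batches of traces in an event log and a process model, based on the minimum edit distance of traces, is proposed. First, multiple traces in the event log are selected to form a batch, and a process mining algorithm is used to obtain a log model of the batch; the product model of the log model and the process model, together with its transition system, is then obtained, which forms the search space for the batch. Next, a heuristic function based on the minimum edit distance between the set of Petri net transition sequences and the remaining trace is designed to speed up the A* algorithm. Finally, a multi-perspective cost function with adjustable weights for the data and resource perspectives is designed, and a method for the multi-perspective optimal alignment of each trace in the batch with the process model is proposed on the transition system of the product model. Simulation results show that, compared with existing work, the proposed method uses less memory and less running time when computing multi-perspective alignments between batches of traces and the process model. The method speeds up the computation of the heuristic function for optimal alignment and can obtain all optimal alignments of a batch of traces at once, thereby improving the efficiency of multi-perspective alignment between event logs and process models.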
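
The minimum edit distance on which the heuristic above is built can be sketched for two traces, each given as a sequence of activity labels. The abstract does not state which edit operations or costs the authors use, so this sketch assumes the standard Levenshtein distance with unit-cost insertion, deletion, and substitution; the function name is hypothetical.

```python
def edit_distance(trace_a, trace_b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn trace_a into trace_b (unit costs assumed)."""
    prev = list(range(len(trace_b) + 1))  # distances from the empty prefix
    for i, x in enumerate(trace_a, start=1):
        curr = [i]
        for j, y in enumerate(trace_b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # keep or substitute
        prev = curr
    return prev[-1]
```

For example, the traces `["a", "b", "c"]` and `["a", "c"]` are at distance 1, since deleting `"b"` makes them equal.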

15.
Optimizing railway alignments is a quite complex and time-consuming engineering problem. The huge continuous search space, complex constraints, implicit objective function and infinite potential alternatives of this problem pose many challenges. Especially in mountainous regions, finding a near-optimal alignment for extremely complex terrain and constraints is a most arduous task, which cannot be solved satisfactorily with most existing methods. In this study, a stepwise & hybrid particle swarm-genetic algorithm is developed for railway alignment optimization in mountainous regions. It is a continuous search method suitable for railway alignment design. A stepwise horizontal–vertical–integral approach, which defines the horizontal and vertical alignments as two kinds of particles, is proposed to solve the three-dimensional railway alignment optimization problem. To enhance the initial diversity and momentum, butterfly-shaped areas are preset on a path generated with a bidirectional distance transform for initializing horizontal particles. For the solution method, specific genetic operators, including roulette wheel selection, four crossovers and two mutations, are integrated into the stepwise particle swarm method to address parameter-dependent performance and avoid premature convergence. In addition, a cubic polynomial weight update strategy is employed for thoroughly searching the problem space. This synthesis method has been applied to a real-world case in a very mountainous region. The detailed data analyses demonstrate that it can offer more promising solutions compared with alternatives designed by experienced designers and those generated with a genetic algorithm or non-stepwise particle swarm algorithm.

16.
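A stereo image matching method based on a pairwise sequence alignment algorithm*   Cited by: 1 (self-citations: 1, citations by others: 0)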
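Based on an analysis of existing stereo matching methods, a stereo image matching method based on a pairwise sequence alignment algorithm is proposed. The pixel gray values on corresponding epipolar lines of a stereo image pair are treated as a pair of character sequences, and a pairwise sequence alignment algorithm based on dynamic programming is used to match these sequences and obtain the stereo disparity. To verify the feasibility and applicability of the method, experiments were carried out on stereo image pairs of human faces. The results show that stereo matching with this method yields smooth, dense disparity maps. The dynamic-programming-based pairwise sequence alignment algorithm can effectively solve the stereo image matching problem, thus providing a practical and effective method for stereo matching of images.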

17.
Biological sequences (e.g. DNA sequences) can be treated as strings over some fixed alphabet of characters (a, c, t and g)[1]. Sequence alignment is a procedure for comparing two or more sequences by searching for a series of individual characters that are in the same order in the sequences. Two-sequence alignment, or pair-wise alignment, is a way of stacking one sequence above the other and matching characters from the two sequences that lie in the same column: identical characters are placed in …

18.
Over the last decade there has been an increasing interest in semi-supervised clustering. Several studies have suggested that even a small amount of supervised information can significantly improve the results of unsupervised learning. One popular method of incorporating partial supervised information is through pair-wise constraints indicating whether a certain pair of patterns should belong to the same (Must-link) or different (Dont-link) clusters. In this study we propose a novel semi-supervised fuzzy clustering algorithm (SSFCA). The supervised information is incorporated via a method quantifying Must-link and/or Dont-link constraints. Additionally, we present an extension of SSFCA that allows the algorithm to automatically detect the number of clusters in the data. We apply SSFCA to the intrinsic problem of clustering gene expression profiles. The advantageous properties of fuzzy logic, inherited by SSFCA, allow genes to belong to more than one group, thereby revealing more profound information about their multiple functional roles. Finally, we investigate the incorporation of prior biological knowledge derived from Gene Ontology in the process of selecting pair-wise constraints. Simulations on artificial and real-life datasets showed that the proposed SSFCA significantly outperformed other standard and semi-supervised clustering methods.

19.
When developing personal DNA databases, there must be an appropriate guarantee of anonymity, which means that the data cannot be related back to individuals. DNA lattice anonymization (DNALA) is a successful method for making personal DNA sequences anonymous. However, it uses time-consuming multiple sequence alignment and a low-accuracy greedy clustering algorithm. Furthermore, DNALA is not an online algorithm, and so it cannot quickly return results when the database is updated. This study improves the DNALA method. Specifically, we replaced the multiple sequence alignment in DNALA with global pairwise sequence alignment to save time, and we designed a hybrid clustering algorithm comprised of a maximum weight matching (MWM)-based algorithm and an online algorithm. The MWM-based algorithm is more accurate than the greedy algorithm in DNALA and has the same time complexity. The online algorithm can process data quickly when the database is updated.

20.
The multiple alignment of the sequences of DNA and proteins is applicable to various important fields in molecular biology. Although the approach based on Dynamic Programming is well-known for this problem, it requires enormous time and space to obtain the optimal alignment. On the other hand, this problem corresponds to the shortest path problem and the A* algorithm, which can efficiently find the shortest path with an estimator, is usable.

First, this paper directly applies the A* algorithm to the multiple sequence alignment problem with a more powerful estimator for the case of more than two dimensions, and discusses extensions of this approach that utilize an upper bound on the shortest path length and a modification of the network structure. An algorithm to provide the upper bound is also proposed in this paper. The basic part of these results was originally shown in Ikeda and Imai [11]. This part is similar to the branch-and-bound techniques implemented in the MSA program in Gupta et al. [6]. Our framework is based on the edge length transformation to reduce the problem to the shortest path problem, which is better suited to generalizations such as enumerating suboptimal alignments and parametric analysis, as done in Shibuya and Imai [15–17]. With this enhanced A* algorithm, optimal multiple alignments of several long sequences can be computed in practice, as shown by computational results.

Second, this paper proposes a k-group alignment algorithm for multiple alignment as a practical method for much larger problems, say multiple alignments of 50–100 sequences. A basic part of these results was originally presented in Imai and Ikeda [13]. In existing iterative improvement methods for multiple alignment, the so-called group-to-group two-dimensional dynamic programming has been used, and in this respect our proposal is to extend the ordinary two-group dynamic programming to a k-group alignment programming. This extension is conceptually straightforward, and here our contribution is to demonstrate that the k-group alignment can be implemented so as to run in reasonable time and space under standard computing environments. This is established by generalizing the above A* search approach. The k-group alignment method can be directly incorporated in existing methods such as iterative improvement algorithms [2, 5] and tree-based (iterative) algorithms [9]. This paper performs computational experiments by applying the k-group method to iterative improvement algorithms, and shows that our approach can find better alignments in reasonable time. For example, through larger-scale computational experiments here, 34 protein sequences with very high homology can be optimally 10-group aligned, and 64 sequences with high homology can be optimally 5-group aligned.
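
The reduction of alignment to a shortest path problem solved by A* can be sketched for the two-sequence case. This is a simplified illustration, not the authors' method: the paper works in more than two dimensions with a more powerful estimator, whereas this sketch assumes unit mismatch and gap costs and uses a deliberately simple lower bound (the gaps forced by the length difference of the remaining suffixes). All names and default costs are hypothetical.

```python
import heapq


def astar_alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum alignment cost of a and b, found by A* over prefix pairs (i, j)."""
    goal = (len(a), len(b))

    def h(i, j):
        # Lower bound: the remaining suffixes differ in length by this
        # many symbols, and each of them must be paired with a gap.
        return gap * abs((len(a) - i) - (len(b) - j))

    best = {(0, 0): 0}
    frontier = [(h(0, 0), 0, 0, 0)]  # entries are (g + h, g, i, j)
    while frontier:
        _, g, i, j = heapq.heappop(frontier)
        if (i, j) == goal:
            return g
        if g > best.get((i, j), float("inf")):
            continue  # stale entry: a cheaper path to (i, j) was found later
        moves = []
        if i < len(a) and j < len(b):
            moves.append((i + 1, j + 1, 0 if a[i] == b[j] else mismatch))
        if i < len(a):
            moves.append((i + 1, j, gap))  # a[i] aligned to a gap
        if j < len(b):
            moves.append((i, j + 1, gap))  # b[j] aligned to a gap
        for ni, nj, cost in moves:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + h(ni, nj), ng, ni, nj))
```

With the default unit costs, the returned value equals the Levenshtein distance of the two strings. The estimator never overestimates the remaining cost, so the first time the goal node is removed from the queue its cost is optimal.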

