Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Methods for calculating the probabilities of finding patterns in sequences   Cited by: 12 (self-citations: 0; by others: 12)
This paper describes the use of probability-generating functions for calculating the probabilities of finding motifs in nucleic acid and protein sequences. Equations and algorithms are given for calculating the probabilities associated with nine different ways of defining motifs. Comparisons are made with searches of random sequences. A higher-level structure, the pattern, is defined as a list of motifs. A pattern also specifies the permitted ranges of spacing between its constituent motifs. Equations for calculating the expected numbers of matches to patterns are given.
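The idea of an expected number of matches can be illustrated with a much simpler model than the probability-generating functions used in the paper. The sketch below assumes independent positions with fixed background frequencies and a motif given as a list of allowed-residue sets. The function names and the example motif are hypothetical and are not taken from the paper.

```python
from math import prod

def motif_match_probability(motif, freqs):
    """Probability that one random window matches the motif.

    motif: list of strings; each string lists the residues allowed at that position.
    freqs: dict mapping residue -> background frequency (positions assumed independent).
    """
    return prod(sum(freqs[r] for r in set(allowed)) for allowed in motif)

def expected_matches(motif, freqs, seq_len):
    """Expected number of matching windows in a random sequence of length seq_len."""
    windows = max(seq_len - len(motif) + 1, 0)
    return windows * motif_match_probability(motif, freqs)

# Hypothetical example: a six-position DNA motif with one ambiguous position.
uniform_dna = {base: 0.25 for base in "ACGT"}
example_motif = ["T", "A", "T", "A", "AT", "A"]
p_window = motif_match_probability(example_motif, uniform_dna)
n_expected = expected_matches(example_motif, uniform_dna, 1000)
```

The expectation is linear in the number of windows, so it holds even though overlapping windows are not independent. The full probability of at least one match is harder and is where generating functions become useful.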

2.
Text-indexing structures provide significant advantages in the solution of many problems related to string analysis and comparison, and are now widely used in the analysis of biological sequences. In this paper, we present some applications of affix trees to problems of exact and approximate pattern matching and discovery in RNA sequences. By allowing bidirectional search for symmetric patterns, affix trees make it possible to discover and locate patterns that describe not only sequence regions but also the secondary structure that a given region could form, with improvements in theoretical and practical efficiency over existing methods. The search can be either exact or approximate, and the approximation can be defined simultaneously for the sequence and the structure of patterns. The approach presented in this paper could provide significant help in the analysis of RNA sequences, where functional motifs often involve structural as well as sequence constraints.

3.
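In protein sequence alignment studies, proteins that share similar patterns often have similar functions. Known protein sequence patterns make it convenient to study and confirm the function and structure of new protein sequences, so the discovery of protein sequence patterns has become a topic of real interest. The pattern-driven Pratt algorithm is improved to raise its efficiency: a fuzzy query method is introduced on top of the original algorithm, so that the most representative protein patterns can be found more quickly in a set of unrelated protein sequences.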

4.
Applied Soft Computing, 2007, 7(3): 1121-1130
We describe a new method for pairwise nucleic acid sequence alignment that can also be used for pattern searching and tandem repeat searching within a nucleic acid sequence. The method is a hybrid algorithm that combines ant colony optimization (ACO) with the simple genetic algorithm. It first employs ACO to obtain a set of alignments, which are then further processed by an elitist genetic algorithm that uses primitive selection and a novel multipoint crossover-mutation operator to generate accurate alignments. The resulting alignments show a fair amount of accuracy for small and medium-sized sequences. The algorithm can be used quickly and efficiently for aligning shorter sequences and for pattern searching in both nucleic acid and amino acid sequences. It can also be used as a local alignment method or as a global alignment tool. With improved accuracy, the method could be extended to multiple sequence alignment.

5.
In this paper, we study the problem of motif discovery in unaligned DNA and protein sequences. The problem of motif identification in DNA and protein sequences has been studied for many years in the literature. Major hurdles at this point include computational complexity and reliability of the search algorithms. We propose a self-organizing neural network structure for solving the problem of motif identification in DNA and protein sequences. Our network contains several layers, with each layer performing classifications at different levels. The top layer divides the input space into a small number of regions and the bottom layer classifies all input patterns into motif and nonmotif patterns. Depending on the number of input patterns to be classified, several layers between the top layer and the bottom layer are needed to perform intermediate classifications. We maintain a low computational complexity through the use of the layered structure, so that each pattern's classification is performed with respect to a small subspace of the whole input space. Our self-organizing neural network will grow as needed (e.g., when more motif patterns are classified). It will give the same amount of attention to each input pattern and will not omit any potential motif patterns. Finally, simulation results show that our algorithm outperforms existing algorithms in certain aspects. In particular, simulation results show that our algorithm can identify motifs with more mutations than existing algorithms. Our algorithm also works well for long DNA sequences.

6.
Information about the three-dimensional structure or function of a newly determined protein sequence can be obtained if the protein is found to contain a characterized motif or pattern of residues. Recently a database (PROSITE) has been established that contains 337 known motifs encoded as a list of allowed residue types at specific positions along the sequence. PROMOT is a FORTRAN computer program that takes a protein sequence and checks whether it contains any of the motifs in PROSITE. The program also extends the definitions of patterns beyond those used in PROSITE to provide a simple, yet flexible, method to scan either a PROSITE or a user-defined pattern against a protein sequence database.
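Scanning a sequence for a motif written as allowed residue types per position can be sketched by translating a PROSITE-style pattern into a regular expression. This is a minimal illustration in Python, not PROMOT's FORTRAN implementation. The function names and the example pattern are assumptions, and the sketch handles only the basic syntax: `x`, `[...]`, `{...}`, `(n)`, `(n,m)`, and leading `<` or trailing `>` anchors.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # {ED} means "any residue except E or D".
        element = element.replace("{", "[^").replace("}", "]")
        # (4) or (2,4) are repetition counts.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x is the wildcard; residues are uppercase.
        element = element.replace("x", ".")
        parts.append(prefix + element + suffix)
    return "".join(parts)

def find_matches(pattern, sequence):
    """Return start positions of non-overlapping matches of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]

# Illustrative pattern: A or C, any residue, V, any four residues, not E or D.
example_regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
hits = find_matches("[AC]-x-V-x(4)-{ED}.", "AGVKKKKQ")
```

Because `re.finditer` reports non-overlapping matches, a scanner that must report every overlapping occurrence would need a lookahead or a manual loop over start positions.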

7.
The rapid increase of available DNA, protein, and other biological sequences has made the problem of discovering meaningful patterns from sequences an important task for bioinformatics research. Among all types of patterns defined in the literature, the most challenging to find are repeating patterns with gap constraints. In this article, we identify a new research problem: mining approximate repeating patterns (ARPs) with gap constraints, where the appearance of a pattern is subject to an approximate match, which is very common in biological sequences. To solve the problem, we propose an ArpGap (ARP mining with Gap constraints) algorithm with three major components for ARP mining: (1) a data-driven pattern generation approach to avoid generating unnecessary candidates for validation; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under user-specified gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences show that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns in protein sequences, where the sequence length is usually several hundred to one thousand and the pattern length is relatively short.

8.
Methods for discovering novel motifs in nucleic acid sequences   Cited by: 6 (self-citations: 0; by others: 6)
We describe a computer tool to aid the discovery of new motifs in nucleic acid sequences. A typical use would be to analyse a set of upstream regions from a family of related genes in order to find possible control sequences. The heart of the method is the creation of dictionaries of related subsequences. These dictionaries can then be analysed to look for the commonest or best-defined subsequences, those that occur in the highest number of different sequences, or those in equivalent positions within the family. We show the application of the method to a set of E. coli promoter sequences.
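A dictionary of subsequences keyed by the sequences they occur in can be sketched as follows. This is a simplified illustration that groups only exact k-mers, whereas the tool described above builds dictionaries of related subsequences. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict

def kmer_dictionary(sequences, k):
    """Map each k-mer to the set of sequence indices in which it occurs."""
    seen_in = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            seen_in[seq[i:i + k]].add(idx)
    return seen_in

def most_widespread(sequences, k, top=3):
    """Rank k-mers by the number of different sequences that contain them."""
    dictionary = kmer_dictionary(sequences, k)
    # Sort by descending sequence count, then alphabetically for a stable order.
    ranked = sorted(dictionary.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(word, len(ids)) for word, ids in ranked[:top]]

# Toy upstream regions (made up for illustration).
toy_regions = ["AATTGACA", "CCTTGACG", "TTGACTTT"]
best = most_widespread(toy_regions, k=5, top=1)
```

Counting distinct sequences rather than total occurrences keeps one repeat-rich sequence from dominating the ranking, which matches the "highest number of different sequences" criterion mentioned in the abstract.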

9.
Computers & Chemistry, 1993, 17(2): 219-227
A neural network classification method has been developed as an alternative approach to the search/organization problem of large molecular databases. Two artificial neural systems have been implemented on a Cray for rapid protein/nucleic acid classification of unknown sequences. The system employs an n-gram hashing function for sequence encoding and modular back-propagation networks for classification. The protein system, which classifies proteins into PIR (Protein Identification Resource) superfamilies, has achieved 82–100% sensitivity at a speed that is about an order of magnitude faster than other search methods. The pilot nucleic acid system showed a 91–97% classification accuracy. The software tool could be used as a filter program to reduce the database search time and help organize the molecular sequence databases. The tool is generally applicable to any databases that are organized according to family relationships.
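The general idea of n-gram hashing, which turns a variable-length sequence into a fixed-length input vector for a network, can be sketched as below. The abstract does not specify the hashing function used in the paper, so the use of CRC32, the vector size, and the function name here are all assumptions.

```python
import zlib

def ngram_hash_vector(seq, n=2, size=64):
    """Encode a sequence as normalized n-gram counts hashed into `size` buckets."""
    vec = [0.0] * size
    total = len(seq) - n + 1
    if total <= 0:
        return vec                      # sequence shorter than n: all zeros
    for i in range(total):
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(seq[i:i + n].encode("ascii")) % size
        vec[bucket] += 1.0
    return [count / total for count in vec]

encoded = ngram_hash_vector("ACGTACGT", n=2, size=16)
```

Every sequence maps to a vector of the same length regardless of its own length, which is what lets a fixed-size back-propagation network take it as input. Distinct n-grams may collide in one bucket when `size` is small.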

10.
11.
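Analyzing protein families effectively is an important challenge in bioinformatics, and clustering has become one of the main approaches to it. Defining similarity between protein sequences by traditional sequence alignment assumes that the adjacency of homologous segments is conserved, which conflicts with genetic recombination. To identify protein families better, a protein sequence family mining algorithm, ProFaM, is proposed. ProFaM first mines patterns that characterize protein sequences using a prefix-projection strategy, then constructs a similarity measure from the patterns and their weights, and finally clusters the sequences into families with a shared-nearest-neighbor method. This addresses shortcomings of earlier methods in protein pattern mining and similarity design. Experiments on the Pfam protein family database confirm that ProFaM performs well for protein family analysis.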

12.
Computer programs that can be used for the design of synthetic genes and that are run on an Apple Macintosh computer are described. These programs determine nucleic acid sequences encoding amino acid sequences. They select DNA sequences based on codon usage as specified by the user, and determine the placement of base changes that can be used to create restriction enzyme sites without altering the amino acid sequence. A new algorithm for finding restriction sites by translating the restriction endonuclease target sequence in all three reading frames and then searching the given peptide or protein amino acid sequence with these short restriction enzyme peptide sequences is described. Examples are given for the creation of synthetic DNA sequences for the bovine prethrombin-2 and ribonuclease A genes.

13.
14.
PRONUC is a menu-driven software package from which a molecular biologist may gain access to a variety of tools for the analysis of protein and nucleic acid sequences. Features include various algorithms for sequence comparisons, secondary structure prediction, sequence manipulation (translation, complementation, etc.) and finding restriction enzyme cut-sites. The sequences under study can be retrieved from several databases of published sequences, or a user's sequence(s) can be entered by means of a sequence editor or retrieved from a database constructed by the user. PRONUC comes with a comprehensive manual and on-line help, which reflect several years of user feedback, and is available for Digital VAX computer systems running the VMS or micro-VMS operating system.

15.
The Watson-Crick double helix is perhaps the most predictable and programmable of all intermolecular interactions. In addition to its biological role in the cell, double helical DNA is used for DNA-based computation and for DNA nanotechnology. The success of these applications has been based on the reliability of Watson-Crick base pairing, and, in the latter case, circumventing the linearity of the double helix. We survey some of the alternative base pairing structures that can be found in synthetic systems, indicating motifs that can be propagated and giving examples of mispairing that can occur within the double helical context. We discuss some of the more common covalent modifications of nucleic acids. We also indicate the structural interplay of special sequences and negative supercoiling. In addition to the caveats that we present involving unexpected results in nucleic acid systems, we show that the process of reciprocal exchange and the generalization of complementarity can be used to generate branched DNA motifs for use in DNA nanotechnology or DNA-based computation. We show that these motifs can be used for the directed construction of DNA objects, for the generation of specific designed patterns in two-dimensional lattices, for computation by self-assembly, and for the fabrication of DNA-based nanodevices. The use of non-Watson-Crick DNA leads inherently both to errors and to thrilling possibilities. Successfully juggling these aspects of generalized nucleic acid structure is an exciting challenge to investigators in the area.

16.
Fast and sensitive multiple sequence alignments on a microcomputer   Cited by: 24 (self-citations: 0; by others: 24)
A strategy is described for the rapid alignment of many long nucleic acid or protein sequences on a microcomputer. The program described can handle up to 100 sequences of 1200 residues each. The approach is based on progressively aligning sequences according to the branching order in an initial phylogenetic tree. The results obtained using the package appear to be as sensitive as those from any other available method.

17.
In this article we present ConQueSt, a constraint-based querying system able to support the intrinsically exploratory (i.e., human-guided, interactive and iterative) nature of pattern discovery. Following the inductive database vision, our framework provides users with an expressive constraint-based query language, which allows the discovery process to be effectively driven toward potentially interesting patterns. Such constraints are also exploited to reduce the cost of pattern mining computation. ConQueSt is a comprehensive mining system that can access real-world relational databases from which to extract data. Through a friendly graphical user interface (GUI), the user can define complex mining queries with a few clicks. After a pre-processing step, mining queries are answered by an efficient and robust pattern mining engine that incorporates state-of-the-art data and search-space reduction techniques. Resulting patterns are then presented to the user in a pattern browsing window, and possibly stored back in the underlying database as relations.

18.
Two programs, MOTIF and PATTERN, that scan sequences for matches to user-defined motifs and patterns of motifs based on identity and set membership are described. The programs use a simple and logical notation to define motifs, and may be used either interactively or by using command line parameters (suitable for batch processing). The two programs described also incorporate a simple, yet reliable, algorithm that automatically detects in which of six possible formats the sequence entry is written.

19.
A fixed-point alignment analysis technique is presented which is designed to locate common sequence motifs in collections of proteins or nucleic acids. Initially a program aligns a collection of sequences by a common sequence pattern or known biological feature. The common pattern or feature (the fixed-point) may be a user-specified sequence string or a known sequence position such as an mRNA start site, which may be taken directly from the annotated feature table of GenBank. Once all alignment markers are located, the sequences are scanned for occurrences of given oligomers within a specified span both upstream and downstream of the fixed-point. The occurrences may then be plotted as a function of the position relative to the fixed-point, displayed as an actual sequence alignment or selectively summarized via various program options. Applications of the technique are discussed.
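The scanning step described above, counting oligomer occurrences by their position relative to the fixed-point, can be sketched as follows. This is a minimal illustration in which the fixed-point is taken as the first occurrence of a user-specified string. The function name, parameters, and toy sequences are assumptions rather than the program's actual interface.

```python
from collections import Counter

def oligomer_offsets(sequences, anchor, oligo, span):
    """Count occurrences of `oligo` by offset from the first `anchor` in each sequence.

    Offsets are start positions of the oligomer relative to the anchor start;
    negative values are upstream, positive values downstream.
    """
    counts = Counter()
    for seq in sequences:
        fixed = seq.find(anchor)
        if fixed < 0:
            continue                    # no fixed-point in this sequence
        lo = max(fixed - span, 0)
        hi = min(fixed + span, len(seq) - len(oligo))
        for pos in range(lo, hi + 1):
            if seq.startswith(oligo, pos):
                counts[pos - fixed] += 1
    return counts

# Toy sequences (made up): count TATA within 10 bases of the first ATG.
toy_seqs = ["CCTATAGGATGCC", "TATAAAATGTT"]
profile = oligomer_offsets(toy_seqs, anchor="ATG", oligo="TATA", span=10)
```

The resulting counter maps each relative offset to the number of sequences showing the oligomer there, which is the quantity that would be plotted against position relative to the fixed-point.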

20.
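Masquerade attack detection based on shell commands and mining of multiple behavior patterns   Cited by: 3 (self-citations: 0; by others: 3)
A masquerade attack is one in which an unauthorized user poses as a legitimate user to gain access to critical data or higher-level access privileges. In recent years, masquerade attack detection has played an increasingly important role in network information security. This paper proposes a new method for detecting user masquerade attacks. Compared with typical existing detection methods, the method improves the representation of user behavior patterns in the training phase: it builds the normal behavior profile of a legitimate user by selecting user behavior features appropriately and using stepwise sequential-pattern support, which improves the accuracy of the behavior description and the adaptability to different types of users. Taking full account of the temporal characteristics of shell command audit data, and in view of the complex and variable nature of masquerade behavior, the paper proposes a detection model based on parallel mining of multiple behavior patterns and joint decision with multiple thresholds, and determines the optimal threshold parameters by cross-validation and an equal-increment iterative approximation method. This overcomes the weaknesses of single-sequence-pattern detection models in performance stability and fault tolerance, and substantially improves detection accuracy without a significant increase in computational cost. The method has been applied in a real detection system and has shown good detection performance.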
