Similar literature
20 similar articles found.
1.
《Computers & chemistry》1993,17(2):185-190
Three topics are treated in this paper. (1) Can the polymorphism evident in the length of many simple sequence repeats (SSRs), or microsatellites, be explained as a result of unequal mitotic crossing over? We conclude that although this mechanism may be a reasonable explanation for polymorphisms of minisatellite sequences, it is less attractive for SSRs, because they are considerably shorter and some of the rates of generation of new length alleles are extremely high. A more likely mechanism is some form of slipped strand mispairing, whether occurring during normal DNA replication or during replication accompanying recombination. (2) Some results are presented on the number of mono- and di-nucleotide repeats in the human genome. For each high-scoring locus, an optimal alignment is made of the actual with an ideal SSR; from such alignments, the relative numbers of insertions/deletions (indels), transitions and transversions are obtained for each class of SSR. (3) An elementary derivation of the number of equivalence classes of SSRs of any word length, n, is given.

2.
《Computers & chemistry》1994,18(3):233-243
Many protein sequences contain motifs that display similarity. The similarities between the repeats are a result of gene duplication and/or gene fusion. The evolutionary role of repeats within protein sequences is considered, and some repeat examples are given, ranging from tandem repeats to multiple types of repeats that are sequentially interspersed. Existing computer methods to delineate repeats in individual protein sequences are discussed, and a novel sensitive repeat recognition method is introduced.

3.
We apply the Minimal Length Encoding Principle to formalize inference about the evolution of macromolecular sequences. The Principle is shown to imply a combination of Weighted Parsimony and Compatibility methods that have long been used by biologists because of their good practical performance. The background assumptions are expressed as an encoding scheme for the observed data and as heuristic rules for selection of diagnostic positions in the sequences. The Principle was applied to discover new subfamilies of Alu sequences, the most numerous family of repetitive DNA sequences in the human genome.

4.
Clustering is one of the major operations used to analyse genome sequence data. Sophisticated sequencing technologies generate huge volumes of DNA sequence data; consequently, the complexity of analysing sequences has also increased, and there is an enormous need for faster sequence analysis algorithms. Most existing tools focus on alignment-based approaches, which are slow for sequence comparison. Alignment-free approaches are more successful for fast clustering. The state-of-the-art methods have been applied to cluster small genome sequences of various species; however, they are sensitive to large sequences. To overcome this limitation, we propose a novel alignment-free method called DNA sequence clustering with map-reduce (DCMR). Initially, the MapReduce paradigm is used to speed up the process of extracting eight different types of repeats. Then, the frequency of each type of repeat in a sequence is considered as a feature for clustering. Finally, K-means (DCMR-Kmeans) and K-median (DCMR-Kmedian) algorithms are used to cluster large DNA sequences by using the extracted features. The two variants of the proposed method are evaluated by clustering large genome sequences of 21 different species, and the results show that the sequences are very well clustered. Our method is tested on different benchmark data sets, including viral genome, influenza A virus, mtDNA, and COXI data sets. The proposed method is compared with MeshClust, UCLUST, STARS, and ClustalW. DCMR-Kmeans outperforms MeshClust, UCLUST, and DCMR-Kmedian with respect to purity and NMI on virus data sets. The computational time of DCMR-Kmeans is less than that of STARS and DCMR-Kmedian, and much less than that of UCLUST, on the COXI data set.

5.
Various research efforts would benefit from the ability to exchange and share information (traces with packet payloads, or other detailed system logs) to enable more data-driven research. Protection of the sensitive content is crucial for extensive information sharing. We present results of Kencl and Loebl (2009) [41] and Blamey et al. (in preparation) [4] about a technique of information concealing, based on introduction and maintenance of families of repeats. The structure of repeats in DNA constitutes an important obstacle for its reconstruction by hybridisation. A large proportion of eukaryotic genomes is composed of DNA segments that are repeated either precisely or in variant form more than once. As yet, no function has been associated with many of the repeats. In the paper by Blamey et al. (in preparation) [4], the authors propose that in eukaryotes the cells have DNA as a depositary of concealed genetic information and the genome achieves the self-concealing by accumulation and maintenance of repeats. The protected information may be shared and this is useful for the development of intercellular communication and in the development of multicellular organisms. The results presented here are protected by Czech patent number 301799 and by US Patent Application number 12/670.

6.
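This paper studies algorithms for identifying repeat-sequence information in DNA fragment assembly. Reconstructing DNA sequences that contain a large amount of repeated information is one of the practical difficulties facing large-scale DNA fragment assembly. Because most current assembly algorithms handle repeat segments with inefficient repeated-iteration procedures, we propose a repeat analysis method based on k-mer substrings that takes full account of possible split points in the assembly, and we design and analyze an efficient algorithm that identifies repeat sequences and improves sequence consistency.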
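The k-mer idea mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function name `repeated_kmers` and the choice to report start positions are assumptions made here for illustration only. It indexes every length-k substring and keeps those that occur more than once.

```python
from collections import defaultdict


def repeated_kmers(seq, k):
    """Map each k-mer that occurs more than once to its start positions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    positions = defaultdict(list)
    # Slide a window of width k over the sequence and record each start index.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only k-mers seen at two or more positions (candidate repeats).
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


if __name__ == "__main__":
    print(repeated_kmers("ACGTACGT", 4))
```

In an assembly setting, reads sharing such k-mers would be candidates for repeat-induced ambiguity; how the paper resolves them is not stated in the abstract.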

7.
A new method for the search of local repeats in long DNA sequences, such as complete genomes, is presented. It detects a large variety of repeats varying in length from one to several hundred bases, which may contain many mutations. By mutations we mean substitutions, insertions or deletions of one or more bases. The method is based on counting occurrences of short words (3-12 bases) in sequence fragments called windows. A score is computed for each window, based on calculating exact word occurrence probabilities for all the words of a given length in the window. The probabilities are defined using a Bernoulli model (independent letters) for the sequence, using the actual letter frequencies from each window. A plot of the probabilities along the sequence for high-scoring windows facilitates the identification of the repeated patterns. We applied the method to the 1.87 Mb sequence of chromosome 4 of Arabidopsis thaliana and to the complete genome of Bacillus subtilis (4.2 Mb). The repeats that we found were classified according to their size, number of occurrences, distance between occurrences, and location with respect to genes. The method proves particularly useful in detecting long, inexact repeats that are local, but not necessarily tandem. The method is implemented as a C program called EXCEP, which is available on request from the authors.
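The window-based word counting described above can be sketched roughly as follows. This is a simplification and not the EXCEP program: instead of exact word occurrence probabilities, it compares each word's observed count with its expected count under a Bernoulli (independent letters) model built from the window's own letter frequencies. The names `window_word_stats` and `sliding_windows` are hypothetical.

```python
from collections import Counter


def window_word_stats(window, k):
    """Return {word: (observed, expected)} for each k-letter word in window.

    The expected count assumes independent letters whose frequencies are
    estimated from this window only (a simplification for illustration).
    """
    n_positions = len(window) - k + 1
    if n_positions <= 0:
        return {}
    letter_freq = {c: n / len(window) for c, n in Counter(window).items()}
    observed = Counter(window[i:i + k] for i in range(n_positions))
    stats = {}
    for word, count in observed.items():
        p = 1.0
        for letter in word:
            p *= letter_freq[letter]  # probability of the word at one position
        stats[word] = (count, n_positions * p)
    return stats


def sliding_windows(seq, size, step):
    """Yield (start, window) pairs covering seq with the given step."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    for start, win in sliding_windows("ACACACGTTGCA", 6, 3):
        print(start, window_word_stats(win, 2))
```

Words whose observed count greatly exceeds the expected count mark windows that may contain repeats; a real scoring scheme would need to account for overlapping occurrences, which this sketch ignores.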

8.
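DNA sequence analysis is one of the important topics in bioinformatics. Gene-related and extragenic regions of the genome contain large numbers of repeat sequences; although the function of most repeats has not yet been established, they already play an important role in genetic analysis. Mining DNA repeat sequences has therefore become a key part of DNA sequence analysis. Bottom-up mining algorithms generate many short, even single-character, patterns in intermediate steps, which reduces mining efficiency. On the other hand, current sequential pattern mining algorithms are efficient for multi-sequence mining, but the limitation of the single-support definition prevents them from also finding repeats within a single DNA sequence during mining, so they are not well suited to mining DNA repeats. Based on a new multi-support sequential pattern mining framework, this paper proposes DnaReSM, a new algorithm for mining DNA repeat sequences that combines bottom-up and top-down strategies; its results provide a basis for related biological experiments. Experimental results show that DnaReSM can mine DNA repeat sequences effectively.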

9.
In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an encoding designed to take advantage of the possible presence of approximate repeats. Our approach leads to an algorithm which is an order of magnitude faster than any other algorithm and achieves a compression ratio very close to the best DNA compressors. Another important feature of our algorithm is its small space occupancy which makes it possible to compress sequences hundreds of megabytes long, well beyond the range of any previous DNA compressor. Copyright © 2004 John Wiley & Sons, Ltd.

10.
Gao J., Cao Y., Qi Y., Hu J. 《Intelligent Systems, IEEE》2005,20(6):34-39
Indices that can discriminate DNA sequences' coding and noncoding regions are crucial elements of a successful gene identification algorithm. Multiscale analysis of various species' genome sequences facilitates construction of novel codon indices. We've developed two novel DNA codon indices, one based on recurrence time and one based on fractal deviation. Because both work well on short DNA sequences, they both hold the promise of being integrated into and thus improving existing gene identification algorithms.

11.
《Computers & chemistry》1994,18(3):259-267
Non-homogeneous Markov chain models can represent biologically important regions of DNA sequences. The statistical pattern that is described by these models is usually weak and was found primarily because of strong biological indications. The general method for extracting similar patterns is presented in the current paper. The algorithm incorporates cluster analysis, multiple alignment and entropy minimization. The method was first tested using the set of DNA sequences produced by Markov chain generators. It was shown that artificial gene sequences, which initially have been randomly set up along the multiple alignment panels, are aligned according to the hidden triplet phase. Then the method was applied to real protein-coding sequences and the resulting alignment clearly indicated the triplet phase and produced the parameters of the optimal 3-periodic non-homogeneous Markov chain model. These Markov models were already employed in the GeneMark gene prediction algorithm, which is used in genome sequencing projects. The algorithm can also handle the case in which the sequences to be aligned reveal different statistical patterns, such as Escherichia coli protein-coding sequences belonging to Class II and Class III. The algorithm accepts a random mix of sequences from different classes, and is able to separate them into two groups (clusters), align each cluster separately, and define a non-homogeneous Markov chain model for each sequence cluster.

12.
Feitelson G., Treinin M. 《Computer》2002,35(7):34-40
One of the greatest scientific discoveries of the twentieth century is the structure of DNA and how it encodes proteins. Current genome projects, especially the Human Genome Project, have sparked interest in the information encoded in DNA, which is often referred to as "the blueprint for life", implying that it contains all the information needed to create life. This interpretation ignores the complex interactions between DNA and its cellular environment, interactions that regulate and control the spatial and temporal patterns of gene expression. Moreover, the particulars of many cellular structures seem not to be encoded in DNA, and they are never created from scratch; rather, each cell inherits templates for these structures from its parent cell. Thus, it is not clear that DNA directly or indirectly encodes all life processes, casting doubt on the belief that we can understand them solely by studying DNA sequences. The paper discusses DNA encoding and computer programming.

13.
Determination of the nucleotide sequences of hundreds of organisms (in the first place, the human genome) is a significant technical achievement of modern biology. The next stage of studying the genome is to determine the functions of each gene and the corresponding protein: the so-called genome annotation. The existing methods of classifying the biological roles of proteins on the basis of the amino acid sequence are restricted to searching for similar sequences in a database and, as a result, have limited applicability. In this paper, a formalism is introduced for studying this problem in the framework of the algebraic approach and the solvability, locality, and regularity of the problem and monotonicity of the condition of solvability are considered. The proposed formalism enables one to study systematically the hypothesis of locality of various biological roles of proteins.

14.
Sequence complexity for biological sequence analysis (cited by: 2 total; 0 self-citations, 2 by others)
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward- and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence, and it is shown how to compute the total probability of the data given the model. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view and it is argued that this is a good way to tackle intelligent sequence analysis in general.

15.
Recently, a new approach to analyzing genome evolution was proposed that is based on comparison of gene orders rather than the traditional comparison of DNA sequences (Sankoff et al. 1992). The approach is based on global rearrangements (e.g., inversions and transpositions of fragments). Analysis of genomes evolving by inversions and transpositions leads to a combinatorial problem of sorting by reversals and transpositions, i.e., sorting of a permutation using reversals and transpositions of arbitrary fragments. We study sorting of signed permutations by reversals and transpositions, a problem which adequately models genome rearrangements, as the genes in DNA are oriented. We establish a lower bound and give a 2-approximation algorithm for the problem.
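To make the sorting problem above concrete, here is a small Python sketch that counts breakpoints in a signed permutation. The abstract does not say which lower bound the authors prove, so the bound computed here is only the standard crude argument, stated as an assumption: a reversal cuts the permutation at two places and a transposition at three, so no single operation removes more than three breakpoints. The function names are hypothetical.

```python
import math


def breakpoints(perm):
    """Count breakpoints of a signed permutation of 1..n.

    The permutation is framed by 0 and n + 1; a consecutive pair (a, b)
    is an adjacency when b - a == 1 and a breakpoint otherwise.
    """
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)


def crude_lower_bound(perm):
    """Lower bound on the number of reversals and transpositions needed.

    Assumes each operation removes at most three breakpoints; this is a
    generic counting argument, not necessarily the paper's bound.
    """
    return math.ceil(breakpoints(perm) / 3)


if __name__ == "__main__":
    for p in ([1, 2, 3], [-2, -1, 3], [3, 1, 2]):
        print(p, breakpoints(p), crude_lower_bound(p))
```

For example, `[-2, -1, 3]` has two breakpoints and is sorted by one reversal of its first two elements, while `[3, 1, 2]` has three breakpoints and is sorted by one transposition.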

16.
M.A.H.  Ankush  R.C. 《Pattern recognition》2006,39(12):2312-2322
The tree representation of evolutionary relationships oversimplifies the view of the process of evolution, as it cannot take into account events such as horizontal gene transfer, hybridization, homoplasy and genetic recombination. Several algorithms exist for constructing phylogenetic networks which result from events such as horizontal gene transfer, hybridization and homoplasy. Very little work has been published on the algorithmic detail of phylogenetic networks with constrained recombination. The problem of minimizing the number of recombinations in a phylogenetic network, constructed using binary DNA sequences, is NP-hard. In this paper, we propose a pattern recognition-based O(n^2) time approach for constructing the phylogenetic network, where n is the number of nodes or sequences in the input data. The network is constructed with the restriction that no two cycles in the network share a common node.

17.
A microcomputer program for the identification of tRNA genes (cited by: 1 total; 0 self-citations, 1 by others)
A microcomputer program which locates tRNA genes within long DNA sequences is described. The search is performed either by identifying tRNA-like secondary structures or by locating eukaryotic RNA polymerase III promoter consensus sequences. The program is also useful in finding inverted repeats allowing the formation of stem-loop secondary structures in tRNA. The program has been developed in BASIC and 6502 Assembler and runs on the Apple II plus and IIe microcomputers. The execution is quite fast; all the operations are carried out in 1-90 s, depending on the required task and on the sequence length.
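The inverted-repeat search mentioned above can be illustrated with a brief Python sketch. It is unrelated to the original BASIC and 6502 Assembler program; the function names, the exact-match requirement and the parameters `stem`, `min_loop` and `max_loop` are all assumptions chosen for illustration. It reports pairs of positions where a stem of fixed length is followed, after a loop of permitted size, by its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]


def inverted_repeats(seq, stem, min_loop, max_loop):
    """Find exact inverted repeats that could form a stem-loop.

    Returns (left_start, right_start) pairs where seq[right_start:] begins
    with the reverse complement of the stem starting at left_start.
    """
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop  # start of the candidate right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # GGGC ... AAAA ... GCCC: a 4-base stem closed by a 4-base loop.
    print(inverted_repeats("GGGCAAAAGCCC", stem=4, min_loop=3, max_loop=5))
```

A practical tRNA search would also need to tolerate mismatches and G-U pairing, which this exact-match sketch does not attempt.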

18.
Pop M., Salzberg S.L., Shumway M. 《Computer》2002,35(7):47-54
Ultimately, genome sequencing seeks to provide an organism's complete DNA sequence. Automation of DNA sequencing allowed scientists to decode entire genomes and gave birth to genomics, the analytic and comparative study of genomes. Although genomes can include billions of nucleotides, the chemical reactions researchers use to decode the DNA are accurate for only about 600 to 700 nucleotides at a time. The DNA reads that sequencing produces must then be assembled into a complete picture of the genome. Errors and certain DNA characteristics complicate assembly. Resolving these problems entails an additional and costly finishing phase that involves extensive human intervention. Assembly programs can dramatically reduce this cost by taking into account additional information obtained during finishing. The paper considers how algorithms that can assemble millions of DNA fragments into gene sequences underlie the current revolution in biotechnology, helping researchers build the growing database of complete genomes.

19.
Recursive segmentation is a procedure that partitions a DNA sequence into domains with a homogeneous composition of the four nucleotides A, C, G and T. This procedure can also be applied to any sequence converted from a DNA sequence, such as to a binary strong(G + C)/weak(A + T) sequence, to a binary sequence indicating the presence or absence of the dinucleotide CpG, or to a sequence indicating both the base and the codon position information. We apply various conversion schemes in order to address the following five DNA sequence analysis problems: isochore mapping, CpG island detection, locating the origin and terminus of replication in bacterial genomes, finding complex repeats in telomere sequences, and delineating coding and noncoding regions. We find that the recursive segmentation procedure can successfully detect isochore borders, CpG islands, and the origin and terminus of replication, but it needs improvement for detecting complex repeats as well as borders between coding and noncoding regions.
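A rough Python sketch of recursive segmentation follows. The abstract does not state the splitting criterion, so this sketch assumes one common choice, the Jensen-Shannon divergence between the compositions of the two halves, and uses a fixed threshold as the stopping rule rather than a significance test. The names `best_split` and `segment` and the threshold value are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (bits) of a symbol count table."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_split(seq):
    """Return (index, divergence) of the split maximising the
    Jensen-Shannon divergence between left and right compositions."""
    n = len(seq)
    total = Counter(seq)
    h_all = entropy(total, n)
    left = Counter()
    best_i, best_d = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps positive counts only
        d = (h_all
             - (i / n) * entropy(left, i)
             - ((n - i) / n) * entropy(right, n - i))
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, threshold, offset=0):
    """Recursively split seq; return sorted boundary positions."""
    if len(seq) < 2:
        return []
    i, d = best_split(seq)
    if i is None or d < threshold:
        return []  # segment is homogeneous enough; stop recursing
    return (segment(seq[:i], threshold, offset)
            + [offset + i]
            + segment(seq[i:], threshold, offset + i))


if __name__ == "__main__":
    print(segment("AAAAAAAACCCCCCCC", threshold=0.5))
```

The same functions work on converted sequences, for instance a string of `S` and `W` symbols for strong and weak bases, since they only depend on symbol counts.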

20.
The family of stichotrichous ciliates has received a great deal of study due to the presence of scrambled genes in their genomes. The mechanism by which these genes are descrambled is of interest both as a biological process and as a model of natural computation. Several formal models of this process have been proposed, the most recent of which involves the recombination of DNA strands based on template guides. We generalize this template-guided DNA recombination model proposed by Prescott, Ehrenfeucht and Rozenberg to an operation on strings and languages. We then proceed to investigate the properties of this operation with the intention of viewing ciliate gene descrambling as a computational process.
