Similar literature
 20 similar documents found (search time: 31 ms)
1.
Computers & Chemistry, 1992, 16(2): 135-143
Repetitive sequences are ubiquitous in the DNA of eukaryotes, some as tandem arrays and others interspersed widely in the genome. Repetitive sequences have special roles in genome evolution, which increasingly detailed sequence information is helping to elucidate. Processes, including meiotic crossing over (equal and unequal), unequal mitotic sister chromatid exchange, gene conversion and transposition, with or without multiplication, can foster homogeneity of the members of a repeat family (concerted evolution) and turnover of the whole genome. Some examples are considered. Tandem repeats, satellite and minisatellite sequences are considered as well as telomeric repeats. For a minisatellite locus, in which the frequency of length mutations has been measured, comparisons are made with expectations due to unequal sister chromatid exchange, and qualitative agreement is found. Interspersed repeats, of which Alu sequences are discussed as an important example, can through unequal recombination lead to a loss or duplication of the DNA between the recombination sites and hence to genetic disease or to gene duplication. It is argued that the rate of unequal Alu recombination may be quite high in the human genome.

2.
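Huo Hongwei (霍红卫), Bai Fan (白帆). 《计算机学报》 (Chinese Journal of Computers), 2008, 31(2): 214-219
Most current repeat-identification algorithms either rely on a database of already annotated repeats or define a repeat as a pair of maximal-length similar sequences, and lack a rigorous definition that balances repeat length against frequency. To address these problems, this paper proposes RepeatSearcher, a fast repeat-identification algorithm that is based on a variant of the local sequence alignment algorithm BLAST and supports gaps. The algorithm identifies repeats by defining their exact boundaries and progressively extending a consensus sequence. It was tested on the C. briggsae genome sequence and compared with RECON, a widely used repeat-identification algorithm, and the more recent RepeatScout. The results show that RepeatSearcher gives every repeat sequence precise boundaries and, relative to the other algorithms, shortens running time without loss of accuracy.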

3.
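This paper studies algorithms for identifying repeat-sequence information in DNA fragment assembly. Reconstructing DNA sequences that contain a large amount of repeated information is one of the practical difficulties facing large-scale DNA fragment assembly. Because most current assembly algorithms handle repeated segments with inefficient repeated iteration, the paper proposes a repeat analysis method based on k-mer substrings that fully accounts for possible split points during assembly, and designs and analyzes an efficient algorithm that identifies repeat sequences and improves sequence consistency.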
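
The k-mer idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not specify; it only shows the common first step of indexing every length-k substring and keeping those that occur more than once. The function name `repeated_kmers` and the example sequence are hypothetical.

```python
from collections import defaultdict


def repeated_kmers(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer that occurs more than once to its start positions.

    Illustrative only: an assumed, simplified stand-in for k-mer based
    repeat analysis, not the method of the paper summarized above.
    """
    positions: dict[str, list[int]] = defaultdict(list)
    # Slide a window of width k over the sequence and record where
    # each distinct k-mer starts.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only k-mers seen at two or more positions (candidate repeats).
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


if __name__ == "__main__":
    print(repeated_kmers("ACGTACGTTT", 4))
```

In an assembly setting, positions sharing a repeated k-mer would mark regions where reads may be ambiguous to place; how those candidates are then resolved is beyond what the abstract states.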

4.
A lot of evidence suggests that many proteins with symmetric structures have evolved by internal duplication and fusion, and many internal sequence repeats correspond to functional and structural units. Because these proteins have internal structural symmetry, their sequences should be made up of identical repeats. However, many of these repeat signals can so far be seen only at the structural level. We have developed a de novo algorithm, modified recurrence correlation analysis, to detect the symmetries in the primary sequences of immunoglobulin folds (Ig folds), which adopt highly symmetrical tertiary structures while their sequences appear nearly random. Using this method, we show that the internal repetitions of the immunoglobulin folds can be identified directly at the sequence level. These results may help in studying hypotheses about the origin of Ig folds by duplication of simpler fragments, and in understanding the relationship between the sequences and their tertiary structures.

5.
Several interactive Pascal programs have been written for the analysis and display of structural information in nucleic acid sequences. Layout procedures were developed to display the homology and repeat matrices of a sequence and to predict and display the secondary structure of RNA/DNA molecules free of overlap and to predict and display internal repeats. No special plotting devices are required because the output is adapted to line printers. Sequences from several DNA database systems can be used as input. These programs are part of a general nucleic acid sequence analysis package.

6.
We recently introduced evolutive tandem repeats with jump (using Hamming distance) (Proc. MFCS’02: the 27th Internat. Symp. Mathematical Foundations of Computer Science, Warszawa, Otwock, Poland, August 2002, Lecture Notes in Computer Science, Vol. 2420, Springer, Berlin, pp. 292–304), which consist of a series of almost contiguous copies having the following property: the Hamming distance between two consecutive copies is always smaller than a given parameter e. In this article, we present a significant improvement that speeds up the detection of evolutive tandem repeats. It is based on the progressive computation of distances between candidate copies participating in the evolutive tandem repeat. It leads to a new algorithm, still quadratic in the worst case, but much more efficient on average, allowing larger sequences to be processed.
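
The defining property stated in this abstract (each pair of consecutive copies differs by a Hamming distance smaller than e) can be checked with a short sketch. This is only the membership test, under the simplifying assumption that the copies have already been cut out and have equal length; it ignores the "jump" between copies and is not the detection algorithm of the paper. The names `hamming` and `is_evolutive_chain` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def is_evolutive_chain(copies: list[str], e: int) -> bool:
    """True if every pair of consecutive copies is at distance < e.

    Assumption: the copies are given explicitly; gaps ("jumps") between
    them in the original sequence are not modeled here.
    """
    return all(hamming(a, b) < e for a, b in zip(copies, copies[1:]))


if __name__ == "__main__":
    copies = ["ACGT", "ACGA", "TCGA"]
    print(is_evolutive_chain(copies, 2))   # consecutive copies stay close
    print(hamming(copies[0], copies[-1]))  # first and last copies can drift apart
```

The example shows why such repeats are called evolutive: only neighbouring copies are constrained, so the first and last copies may differ by more than e.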

7.
Clustering is one of the major operations used to analyse genome sequence data. Sophisticated sequencing technologies generate huge amounts of DNA sequence data; consequently, the complexity of analysing sequences has also increased, and there is an enormous need for faster sequence analysis algorithms. Most of the existing tools focus on alignment-based approaches, which are slow for sequence comparison. Alignment-free approaches are more successful for fast clustering. The state-of-the-art methods have been applied to cluster small genome sequences of various species; however, they are sensitive to large sequences. To overcome this limitation, we propose a novel alignment-free method called DNA sequence clustering with map-reduce (DCMR). Initially, the MapReduce paradigm is used to speed up the process of extracting eight different types of repeats. Then, the frequency of each type of repeat in a sequence is considered as a feature for clustering. Finally, K-means (DCMR-Kmeans) and K-median (DCMR-Kmedian) algorithms are used to cluster large DNA sequences by using the extracted features. The two variants of the proposed method are evaluated by clustering large genome sequences of 21 different species, and the results show that the sequences are very well clustered. Our method is tested on different benchmark data sets such as viral genome, influenza A virus, mtDNA, and COXI data sets. The proposed method is compared with MeshClust, UCLUST, STARS, and ClustalW. DCMR-Kmeans outperforms MeshClust, UCLUST, and DCMR-Kmedian with respect to purity and NMI on virus data sets. The computational time of DCMR-Kmeans is less than that of STARS and DCMR-Kmedian, and much less than that of UCLUST, on the COXI data set.

8.
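DNA sequence analysis is one of the important areas of bioinformatics. Gene-related and extragenic regions of the genome contain large numbers of repeat sequences; although the function of most of them has not yet been established, they already play an important role in genetic analysis. Mining DNA repeat sequences has therefore become key to DNA sequence analysis. Bottom-up mining algorithms generate many short, even single-character, patterns in intermediate steps, which lowers mining efficiency. On the other hand, current sequential pattern mining algorithms are efficient for multi-sequence mining, but the limitation of a single support definition means they cannot simultaneously find repeats within an individual DNA sequence during mining, so they are not well suited to DNA repeat mining. Based on a new multi-support sequential pattern mining framework, this paper proposes DnaReSM, a new algorithm that combines bottom-up and top-down strategies to mine DNA repeat sequences and whose results provide a basis for related biological experiments. Experimental results show that the DnaReSM detection algorithm can mine DNA repeat sequences effectively.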

9.
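Huang Yajia (黄亚佳), Ni Lei (倪磊), Jin Fan (金帆), Yang Guang (杨光). 《集成技术》 (Journal of Integration Technology), 2019, 8(6): 31-38
Direct repeat sequences are widespread in the genomes of eukaryotic and prokaryotic cells and are associated with a variety of diseases (such as hereditary neuromuscular neurodegenerative diseases), so quantifying the deletion of repeat sequences has become very important. Combining high-throughput microscopic imaging and analysis techniques, this paper designs a method based on a three-color fluorescent reporter system to quantify the occurrence of repeat-sequence deletion. The results show that, in Pseudomonas aeruginosa, the deletion frequency of repeat sequences is markedly lower in a recA deletion mutant, whereas loss of the RadA and UvrD proteins raises the deletion frequency, and that repeat deletion is independent of factors such as bacterial growth rate and promoter. The study helps deepen understanding of problems related to direct repeats and provides a new method for quantifying direct-repeat deletion.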

10.
This paper aims at repeat clip mining and knowledge discovery from video data. A unified approach is proposed to detect both unknown video repeats and known video clips of arbitrary length. Two detectors in a cascade structure are employed to achieve fast and accurate detection, and a reinforcement learning approach is adopted to efficiently maximize detection accuracy. In this approach, very short video repeats (<1 s) and long ones can be detected by a single process, while overall accuracy remains high. Since video segmentation is essential for repeat detection, performance analysis is also conducted for several segmentation methods. Furthermore, we propose a method to analyze video syntactical structure based on the detection of short video repeats. Experimental results on news videos demonstrate that identifying short video repeats is an effective way to achieve video structure discovery and syntactical segmentation.

11.
In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an encoding designed to take advantage of the possible presence of approximate repeats. Our approach leads to an algorithm which is an order of magnitude faster than any other algorithm and achieves a compression ratio very close to the best DNA compressors. Another important feature of our algorithm is its small space occupancy which makes it possible to compress sequences hundreds of megabytes long, well beyond the range of any previous DNA compressor. Copyright © 2004 John Wiley & Sons, Ltd.

12.
Sequence complexity for biological sequence analysis
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward- and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence and how to compute the total probability of the data given the model is shown. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view and it is argued that this is a good way to tackle intelligent sequence analysis in general.

13.
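Information extraction from semi-structured pages with automatic granularity selection
Data records in a semi-structured page are structurally similar, which shows up as recurring patterns in the tag sequence produced by a pre-order traversal of the DOM tree; these patterns can be mined with a suffix tree. Because the tag sequence can be presented at two levels, block granularity and text granularity, and the best extraction patterns produced at the different granularities give uncertain extraction quality, an information extraction method for semi-structured pages with automatic granularity selection is proposed. From the repeated patterns obtained from the suffix tree, the algorithm selects maximal repeats and tandem repeats to form the candidate pattern set, determines the best pattern set for each of the two granularities using feature parameters, and finally introduces a regularity parameter for the extraction results and performs a comprehensive evaluation to determine the extraction pattern and complete the automatic extraction of data records from the semi-structured page.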

14.
An artificial short term memory, the binary kernel function, is presented to facilitate the learning of complex sequences of integers by Neural Networks, requiring far fewer weights than are usually needed. This is achieved by using only a single weight to encode repeat occurrences of an integer in a sequence. The coding used allows a complex sequence to be learned in only one presentation. The kernel's exponential complexity growth is overcome with hierarchical architectures which chunk the sequences to be learnt. Architectures are introduced for recognition and reproduction of complex sequences.

15.
The task of studying repeated segments in data sequences is considered in terms of genetic sequences. A principle for detecting repeats is proposed, based on comparing the spectra obtained by decomposing signals over classical orthogonal polynomials. The proposed approach can be applied in the search for extensive inexact repeats in different signals.

16.
A tandem repeat (or square) is a string αα, where α is a non-empty string. We present an O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint in T(S) of every tandem repeat that occurs in S. This decorated suffix tree implicitly represents all occurrences of tandem repeats in S, and can be used to efficiently solve many questions concerning tandem repeats and tandem arrays in S. This improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.
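
The definition of a tandem repeat αα given in this abstract can be made concrete with a brute-force sketch. This is deliberately not the O(|S|) suffix-tree algorithm the abstract describes; it is a naive enumeration (cubic time in the worst case) that only illustrates what is being searched for. The function name `tandem_repeats` is hypothetical.

```python
def tandem_repeats(s: str) -> list[tuple[int, int]]:
    """Return (start, period) for every occurrence of a square αα in s.

    Naive illustration of the definition only; far slower than the
    linear-time suffix-tree method summarized above.
    """
    hits = []
    n = len(s)
    for start in range(n):
        # α has length `period`; αα must fit inside s from `start`.
        for period in range(1, (n - start) // 2 + 1):
            first = s[start:start + period]
            second = s[start + period:start + 2 * period]
            if first == second:
                hits.append((start, period))
    return hits


if __name__ == "__main__":
    print(tandem_repeats("abab"))
    print(tandem_repeats("aaa"))
```

Each reported pair identifies one occurrence: the square starts at `start` and has total length `2 * period`.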

17.
The software commonly used for assembly of shotgun sequence data has several limitations. One such limitation becomes obvious when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessments of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. None of the currently available fragment assembly programs are able to correctly assemble highly similar repetitive data, and we, therefore, present a novel shotgun assembly program, Tandem Repeat Assembly Program (TRAP). The main feature of this program is the ability to separate long repetitive regions from each other by distinguishing single base substitutions as well as insertions/deletions from sequencing errors. This is accomplished by using a novel multiple-alignment based analysis method. Since repeats are a common complication in most sequencing projects, this software should be of use for the whole sequencing community.

18.
A new method for the search of local repeats in long DNA sequences, such as complete genomes, is presented. It detects a large variety of repeats varying in length from one to several hundred bases, which may contain many mutations. By mutations we mean substitutions, insertions or deletions of one or more bases. The method is based on counting occurrences of short words (3-12 bases) in sequence fragments called windows. A score is computed for each window, based on calculating exact word occurrence probabilities for all the words of a given length in the window. The probabilities are defined using a Bernoulli model (independent letters) for the sequence, using the actual letter frequencies from each window. A plot of the probabilities along the sequence for high-scoring windows facilitates the identification of the repeated patterns. We applied the method to the 1.87 Mb sequence of chromosome 4 of Arabidopsis thaliana and to the complete genome of Bacillus subtilis (4.2 Mb). The repeats that we found were classified according to their size, number of occurrences, distance between occurrences, and location with respect to genes. The method proves particularly useful in detecting long, inexact repeats that are local, but not necessarily tandem. The method is implemented as a C program called EXCEP, which is available on request from the authors.
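
The word-counting step described in this abstract can be sketched as follows. The sketch is a simplification under stated assumptions: it compares each word's observed count in a window with its expected count under a Bernoulli (independent letters) model built from that window's own letter frequencies. It does not compute the exact word occurrence probabilities or the window score used by EXCEP, whose details the abstract does not give. The function name `window_word_stats` is hypothetical.

```python
from collections import Counter
from math import prod


def window_word_stats(window: str, w: int) -> dict[str, tuple[int, float]]:
    """For each length-w word in the window, return (observed, expected).

    Assumption: `expected` is the simple Bernoulli expectation
    (number of word positions times the product of letter frequencies),
    used here as a stand-in for the exact probabilities in the paper.
    """
    n_words = len(window) - w + 1
    # Letter frequencies are estimated from this window only.
    freq = {c: cnt / len(window) for c, cnt in Counter(window).items()}
    counts = Counter(window[i:i + w] for i in range(n_words))
    return {
        word: (obs, n_words * prod(freq[c] for c in word))
        for word, obs in counts.items()
    }


if __name__ == "__main__":
    for word, (obs, exp) in window_word_stats("ACACACAC", 2).items():
        print(word, obs, exp)
```

Words whose observed count greatly exceeds the expectation are the kind of signal a window-based repeat detector would flag; turning that into a calibrated score requires the exact probabilities the authors describe.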

19.
Computers & Chemistry, 1996, 20(1): 41-48
Simple Sequence Repeats (SSRs) are common and frequently polymorphic in eukaryote DNA. Many are subject to high rates of length mutation in which a gain or loss of one repeat unit is most often observed. Can the observed abundances and their length distributions be explained as the result of an unbiased random walk, starting from some initial repeat length? In order to address this question, we have considered two models for an unbiased random walk on the integers, n (n ≥ n0). The first is a continuous time process (Birth and Death Model or BDM) in which the probability of a transition to n + 1 or n − 1 is λk, with k = n − n0 + 1, per unit time. The second is a discrete time model (Random Walk Model or RWM), in which a transition is made at each time step, either to n − 1 or to n + 1. In each case the walks start at length n0, with new walks being generated at a steady rate, S, the source rate, determined by a base substitution rate of mutation from neighboring sequences. Each walk terminates whenever n reaches n0 − 1 or at some time, T, which reflects the contamination of pure repeat sequences by other mutations that remove them from consideration, either because they fail to satisfy the criteria for repeat selection from some database or because they can no longer undergo efficient length mutations. For infinite T, the results are particularly simple for N(k), the expected number of repeats of length n = k + n0 − 1, being, for BDM, N(k) = S/, and for RWM, N(k) = 2S. In each case, there is a cut-off value of k for finite T, namely k = ln2 for BDM and k = 0.57√T for RWM; for larger values of k, N(k) becomes rapidly smaller than the infinite time limit. We argue that these results may be compared with SSR length distributions averaged over many loci, but not for a particular locus, for which founder effects are important.
For the data of Beckmann & Weber [(1992), Genomics 12, 627] on GT·AC repeats in the human, each model gives a reasonable fit to the data, with the source at two repeat units (n0 = 2). Both the absolute number of loci and their length distribution are well represented.

20.
Scientific progress in recent years has led to the generation of huge amounts of biological data, most of which remains unanalyzed. Mining the data may provide insights into various realms of biology, such as finding co-occurring biosequences, which are essential for biological data mining and analysis. Data mining techniques like sequential pattern mining may reveal implicitly meaningful patterns among the DNA or protein sequences. If biologists hope to unlock the potential of sequential pattern mining in their field, it is necessary to move away from traditional sequential pattern mining algorithms, because they have difficulty handling a small number of items and long sequences in biological data, such as gene and protein sequences. To address the problem, we propose an approach called Depth-First SPelling (DFSP) algorithm for mining sequential patterns in biological sequences. The algorithm’s processing speed is faster than that of PrefixSpan, its leading competitor, and it is superior to other sequential pattern mining algorithms for biological sequences.
