首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 677 毫秒
1.
With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.  相似文献   

2.
Suffix trees are the fundamental data structure of combinatorial pattern matching on words. Suffix trees have been used in order to give optimal solutions to a great variety of problems on static words, but for practical situations, such as in a text editor, where the incremental changes of the text make dynamic updating of the corresponding suffix trees necessary, this data structure alone has not been used with success. We prove that, for dynamic modifications of order O(1) of words of length n, any suffix tree updating algorithm, such as the ones proposed by McCreight, requires O(n) worst-case running time, as for the full reconstruction of the suffix tree. Consequently, we argue that this data structure alone is not appropriate for the solution of combinatorial problems on words that change dynamically.  相似文献   

3.
The two-dimensional suffix tree of an n × n square matrix A is a compacted trie that represents all square submatrices of A[11]. For the off-line case, i.e., A is given in advance to the algorithm, it is known how to build it in optimal time, for any type of alphabet size [11],[18]. Motivated by applications in Image Compression[22[, Giancarlo and Guaiana [14] considered the on-line version of the two-dimensional suffix tree and presented an O(n2 log2 n)-time algorithm, which we refer to as GG. That algorithm is a nontrivial generalization of Ukkonen's on-line algorithm for standard suffix trees [23]. The main contribution in this paper is an O(log n) factor improvement in the time complexity of the GG algorithm, making it optimal for unbounded alphabets [9]. Moreover, the ideas presented here also lead to a major simplification of the GG algorithm. Technically, we are able to preserve most of the structure of the original GG algorithm, by reducing a computational bottleneck to a primitive operation, i.e., comparison of Lcharacters, which is here implemented in constant time rather than O(log n) time as in GG. However, preserving that structure comes at a price. Indeed, in order to make everything work, we need a careful reorganization of another fundamental algorithm: Weiner's algorithm for the construction of standard suffix trees [24]. Specifically, here we provide a version of that algorithm which takes linear time and works on-line and concurrently over a set of strings.  相似文献   

4.
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier.  相似文献   

5.
Several new number representations based on a residue number system are presented which use the smallest prime numbers as moduli and are suited for parallel computations on a reconfigurable mesh architecture. The bit model of linear reconfigurable mesh with exclusive write and unit-time delay for broadcasting on a subbus is assumed. It is shown how to convert in O(1) time any integer, ranging between 0 and n-1, from any commonly used representation to any new representation proposed in this paper (and vice versa) using an n/spl times/O(log/sup 2/ n/log log n) reconfigurable mesh. In particular, some of the previously known conversion techniques are improved. Moreover, as a byproduct, it is shown how to compute in O(1) time the Prefix Sums of n bits by a reconfigurable mesh having the above mentioned size, thus improving previously known results. Applications to the Prefix Sums of n h-bit integers and to Approximate String Matching with /spl alpha/ mismatches are also considered. The Summation and the Prefix Sums can be computed in O(1) time using O(h log N+log/sup 2/ N/log log N)/spl times/Nh and O(h/sup 2/+log/sup 2/ N/log(h+log N))/spl times/O(N(h+log N)) reconfigurable meshes, respectively. Moreover, it is shown for the first time how to find in O(1) time all the occurrences of a pattern of length m in a text of length n, allowing less than /spl alpha/ mismatches, using a reconfigurable mesh of size O(m log|/spl Sigma/|)/spl times/O (n(log|/spl Sigma/|+log/sup 2/ /spl alpha//log log /spl alpha/)), where the pattern and the text are strings over a finite alphabet /spl Sigma/ and /spl alpha/相似文献   

6.
Several papers describe linear time algorithms to preprocess a tree, in order to answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. Whereas previous algorithms produce a linear space data structure, in this paper we address the problem of distributing the data structure into short labels associated with the nodes. Localized data structures received much attention recently as they play an important role for distributed applications such as routing. We conclude our survey with a new simple algorithm that labels in O(n) time all the nodes of an n-node rooted tree such that from the labels of any two nodes alone one can compute in constant time the label of their nearest common ancestor. The labels assigned by our algorithm are of size O(log n) bits.  相似文献   

7.
We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, P, of total length n, in O(n) time and space for integer alphabets. Processing a text of size m over an alphabet Σ with the automaton costs O(mlog|Σ|+k), where there are k occurrences of patterns in the text.A new, efficient implementation of nodes in the Aho Corasick automaton is introduced, which works for suffix trees as well.  相似文献   

8.
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.  相似文献   

9.
Constructing suffix tree for gigabyte sequences with megabyte memory   总被引:7,自引:0,他引:7  
Mammalian genomes are typically 3 Gbps (gibabase pairs) in size. The largest public database NCBI (National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)) of DNA contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure to support exact/approximate sequence matching queries as well as repetitive structure finding efficiently when they can reside in main memory. But, it has been shown as difficult to handle long DNA sequences using suffix trees due to the so-called memory bottleneck problems. The most space efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB memory space to index the human genome [S. Kurtz (1999)]. We show that suffix trees for long DNA sequences can be efficiently constructed on disk using small bounded main memory space and, therefore, all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) to construct a diskbase suffix-tree without suffix links and 2) rebuild suffix links upon the suffix-tree being constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(nlogn) experimental behavior regarding CPU cost and linearity for I/O cost. DynaCluster needs 16 MB main memory only to construct more than 200 Mbps DNA sequences and significantly outperforms the existing disk-based suffix-tree construction algorithms using prepartitioning techniques in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.  相似文献   

10.
杨智应  朱洪  宋建涛 《软件学报》2004,15(5):650-659
算法的复杂度平滑分析是对许多算法在实际应用中很有效但其最坏情况复杂度却很糟这一矛盾给出的更合理的解释.高性能计算机被广泛用于求解大规模线性系统及大规模矩阵的分解.求解线性系统的最简单且容易实现的算法是高斯消元算法(高斯算法).用高斯算法求解n个方程n个变量的线性系统所需要的算术运算次数为O(n3).如果这些方程中的系数用m位表示,则最坏情况下需要机器位数mn位来运行高斯算法.这是因为在消元过程中可能产生异常大的中间项.但大量的数值实验表明,在实际应用中,需要如此高的精度是罕见的.异常大的矩阵条件数和增长因子是导致矩阵A病态,继而导致解的误差偏大的主要根源.设-A为任意矩阵,A是-A受到微小幅度的高斯随机扰动所得到的随机矩阵,方差σ2≤1.Sankar等人对矩阵A的条件数及增长因子进行平滑分析,证明了Pr[K(A)≥α]≤(3.64n(1+4√log(α)))/ασ.在此基础上证明了运行高斯算法输出具有m位精度的解所需机器位数的平滑复杂度为m+71og2(n)+3log2(1/σ)+log2log2n+7.在上述结果的证明过程中存在错误,将其纠正后得到以下结果:m+71og2n+3log2(1/σ)+4√2+log2n+log2(1/σ)+7.367.通过构造两个分别关于矩阵范数和随机变量乘积的不等式,将关于矩阵条件数的平滑分析结果简化到Pr[K(A)≥α]≤(6√2n2)/α·σ.部分地解决了Sankar等人提出的猜想:Pr[K(A)≥α]≤O(n/α·σ).并将运行高斯算法输出具有m位精度的解所需机器位数的平滑复杂度降低到m+81og2n+3log2(1/σ)+7.实验结果表明,所得到的平滑复杂度更好.  相似文献   

11.
Suffix arrays are a key data structure for solving a run of problems on texts and sequences, from data compression and information retrieval to biological sequence analysis and pattern discovery. In their simplest version, they can just be seen as a permutation of the elements in {1,2,…,n}, encoding the sorted sequence of suffixes from a given text of length n, under the lexicographic order. Yet, they are on a par with ubiquitous and sophisticated suffix trees. Over the years, many interesting combinatorial properties have been devised for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well characterized subset of the n! permutations. This paper gives a short tutorial on suffix arrays and their compressed version to explore and review some of their algorithmic features, discussing the space issues related to their usage in text indexing, combinatorial pattern matching, and data compression.  相似文献   

12.
In this paper, we study the packet classification problem and the filter conflict resolution problem, both motivated by applications in the routing of packets in an IP network. For the first problem, we focus on the static 2-dimensional conflict-free (i.e., nested) filters. We design a linear space data structure with O(T w (n)+(log log n)2) query time on a RAM with word size O(w) bits where n is the number of filters in the router, w is the number of bits in an IP address and
This is the first optimal space data structure with poly-logarithmic query time for this problem. In practice, network filters often contain very few conflicts but are not completely conflict-free. Fortunately, conflicts can be resolved by adding conflict-resolving filters. Moreover, practical filters often possess another slightly different nesting property which we called 1-nestedness. We present an algorithm to resolve conflicts in a set of 1-nested filters in O(nT w (n)+k) time and linear space, where k is the number of filter pairs in conflict. Furthermore, we show that our data structure for the first problem can be adapted to apply on conflict-resolved 1-nested filters with the same query and space complexities. This research was fully supported by a grant from the Research Grants Council of the Hong Kong SAR, China (City U 1164/04E). A preliminary version appeared in ISPAN’04.  相似文献   

13.
We consider the problem of routing in an asynchronous dynamically changing ring of processors using schemes that minimize the storage space for the routing information. In general, applying static techniques to a dynamic network would require significant re-computation. Moreover, the known dynamic techniques applied to the ring lead to inefficient schemes. In this paper we introduce a new technique, Dynamic Interval Routing, and we show tradeoffs between the stretch factor, the adaptation cost, and the size of the update messages used by routing schemes based upon it. We give three algorithms for rings of maximum size N: the first two are deterministic, one with adaptation cost zero but worst case stretch factor \lfloor N/2 \rfloor, the other with worst case adaptation cost O(N) update messages of O(log N) bits and stretch factor 1. The third algorithm is randomized, uses update messages of size O(k log N), has adaptation cost O(k), and expected stretch factor 1+1/k, for any integer k 3. All schemes require O(log N) bits per node for the routing information and all messages headers are of O(log N) bits.  相似文献   

14.
后缀树的重要性可以为多年来学术界对它总是有新的发现而印证.它的结构简单,但可以在线性的时间里解决许多复杂的问题,被大量的使用在字符串及树的模式匹配中,对于XML标准,有很多基于关系库和对象库的索引技术和查询方案被提出来,我们试图给出一种基于后缀树进行路径导航的查询机制:用后缀树构造XML路径字典加速路径查询评价速度,我们提出可以在线地建立一个trie树的后缀树,讨论了XML路径字典中的后缀树建树算法,阐述了整个索引方案和查询机制,并探讨了包括RPE在内的它所支持的各种查询操作,XML路径字典被用于加快路径查询的评价速度.  相似文献   

15.
This paper investigates threshold based neural networks for periodic symmetric Boolean functions and some related operations. It is shown that any n-input variable periodic symmetric Boolean function can be implemented with a feedforward linear threshold-based neural network with size of O(log n) and depth also of O(log n), both measured in terms of neurons. The maximum weight and fan-in values are in the order of O(n). Under the same assumptions on weight and fan-in values, an asymptotic bound of O(log n) for both size and depth of the network is also derived for symmetric Boolean functions that can be decomposed into a constant number of periodic symmetric Boolean subfunctions. Based on this results neural networks for serial binary addition and multiplication of n-bit operands are also proposed. It is shown that the serial addition can be computed with polynomially bounded weights and a maximum fan-in in the order of O(log n) in O(n/log n) serial cycles. Finally, it is shown that the serial multiplication can be computed in O(n) serial cycles with O(log n) size neural gate network, and with O(n log n) latches.  相似文献   

16.
Suffix tree, suffix array, and directed acyclic word graph (DAWG) are data-structures for indexing a text. Although they enable efficient pattern matching, their data-structures require O(nlogn) bits, which make them impractical to index long text like human genome. Recently, the development of compressed data-structures allow us to simulate suffix tree and suffix array using O(n) bits. However, there is still no O(n)-bit data-structure for DAWG with full functionality. This work introduces an $n(H_{k}(\overline{S})+ 2 H_{0}^{*}(\mathcal {T}_{\overline{S}}))+o(n)$ -bit compressed data-structure for simulating DAWG (where $H_{k}(\overline{S})$ and $H_{0}^{*}(\mathcal{T}_{\overline{S}})$ are the empirical entropies of the reversed sequence and the reversed suffix tree topology, respectively.) Besides, we also propose an application of DAWG to improve the time complexity for the local alignment problem. In this application, the previously proposed solutions using BWT (a version of compressed suffix array) run in O(n 2 m) worst case time and O(n 0.628 m) average case time where n and m are the lengths of the database and the query, respectively. Using compressed DAWG proposed in this paper, the problem can be solved in O(nm) worst case time and the same average case time.  相似文献   

17.
We define and study the complexity of robust polynomials for Boolean functions and the related fault-tolerant quantum decision trees, where input bits are perturbed by noise. We compare several different possible definitions. Our main results are: *For every n-bit Boolean function f there is an n-variate polynomial p of degree O(n) that robustly approximates it, in the sense that p(x) remains close to f(x) if we slightly vary each of the n inputs of the polynomial. *There is an O(n)-query quantum algorithm that robustly recovers n noisy input bits. Hence every n-bit function can be quantum computed with O(n) queries in the presence of noise. This contrasts with the classical model of Feige et al., where functions such as parity need Θ(n log n) queries. We give several extensions and applications of these results.  相似文献   

18.
Faster Approximate String Matching   总被引:11,自引:0,他引:11  
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω (log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)) , where m is the pattern length and k<m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m=O(log n) . Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with less errors using the simple automaton, to obtain an average cost close to . Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near average complexity (σ is the alphabet size). We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology. Received November 22, 1996; revised October 13 and December 5, 1997.  相似文献   

19.
A new randomized Byzantine agreement algorithm is presented. This algorithm operates in a synchronous system of n processors, at most t of which can fail. The algorithm reaches agreement in 0(t/log n) expected rounds and O(n2tf/log n) expected message bits independent of the distribution of processor failures. This performance is further improved to a constant expected number of rounds and O(n2) message bits if the distribution of processor failures is assumed to be uniform. In either event, the algorithm improves on the known lower bound on rounds for deterministic algorithms. Some other advantages of the algorithm are that it requires no cryptographic techniques, that the amount of local computation is small, and that the expected number of random bits used per processor is only one. It is argued that in many practical applications of Byzantine agreement, the randomized algorithm of this paper achieves superior performance.  相似文献   

20.
后缀树的并行构造算法   总被引:1,自引:0,他引:1  
后缀树是一种非常重要的数据结构,它在与字符串处理相关的各种领域里有着非常广泛的应用。构造后缀树是应用后缀树解决问题的前提和关键。虽然很多现有的后缀树构造算法都是线性时间和空间的,但是,当被索引的字符串的长度很长时,构造其后缀树所消耗的时间和空间仍将非常巨大,这极大地限制了后缀树的实际应用。而并行技术是解决这一问题的很好途径,因此人们提出了后缀树的并行构造算法。本文对后缀树的三种并行构造算法进行了综述,通过系统的比较和分析,总结出当前存在的问题,并指明了下一步的研究方向。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号