首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Practical methods for constructing suffix trees   总被引:7,自引:0,他引:7  
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.  相似文献   

2.
We discuss an intrinsic generalization of the suffix tree, designed to index a string of length n which has a natural partitioning into m multicharacter substrings or words . This word suffix tree represents only the m suffixes that start at word boundaries. These boundaries are determined by delimiters , whose definition depends on the application. Since traditional suffix tree construction algorithms rely heavily on the fact that all suffixes are inserted, construction of a word suffix tree is nontrivial, in particular when only O(m) construction space is allowed. We solve this problem, presenting an algorithm with O(n) expected running time. In general, construction cost is Ω(n) due to the need of scanning the entire input. In applications that require strict node ordering, an additional cost of sorting O(m') characters arises, where m' is the number of distinct words. In either case, this is a significant improvement over previously known solutions. Furthermore, when the alphabet is small, we may assume that the n characters in the input string occupy o(n) machine words. We illustrate that this can allow a word suffix tree to be built in sublinear time. Received September 2, 1997; revised December 10, 1997.  相似文献   

3.
A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. As suffix trees are larger than the input sequences and quickly outgrow the main memory, the first attempts at building large suffix trees focused on algorithms which avoid massive random access to the trees being built. However, all the existing practical algorithms perform random access to the input string, thus requiring in essence that the input be small enough to be kept in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings.  相似文献   

4.
Suffix trees and suffix arrays are fundamental full-text index data structures to solve problems occurring in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(mlog |Σ|) time where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports the pattern search in O(mlog |Σ|) time. In a different point of view, it can be considered a practical implementation of the suffix tree supporting O(mlog |Σ|)-time pattern search. In addition, we also present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.  相似文献   

5.
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications. Received: November 1, 2001 / Accepted: March 2, 2002 Published online: September 25, 2002  相似文献   

6.
后缀树的并行构造算法   总被引:1,自引:0,他引:1  
后缀树是一种非常重要的数据结构,它在与字符串处理相关的各种领域里有着非常广泛的应用。构造后缀树是应用后缀树解决问题的前提和关键。虽然很多现有的后缀树构造算法都是线性时间和空间的,但是,当被索引的字符串的长度很长时,构造其后缀树所消耗的时间和空间仍将非常巨大,这极大地限制了后缀树的实际应用。而并行技术是解决这一问题的很好途径,因此人们提出了后缀树的并行构造算法。本文对后缀树的三种并行构造算法进行了综述,通过系统的比较和分析,总结出当前存在的问题,并指明了下一步的研究方向。  相似文献   

7.
张毅超  车玫  马骏 《计算机仿真》2007,24(12):97-100,116
高效求解2个字符串的最长公共子串(Longest Common Substring)是实现很多字符串算法的关键.文中首先给出了求解LCP问题的动态规划算法,广义后缀树算法,研究并分析了这两种算法,得出动态规划算法易于理解,但时间复杂度较高;广义后缀树算法的时间复杂度较低,但实现较为复杂并且广义后缀树占用的空间也较多.最后提出了一个新算法,该算法使用2个字符串的广义后缀数组,在保持和广义后缀树时间复杂度相等的基础上,可以简单地实现并且占用较少的空间.  相似文献   

8.
Theapproximate string matching problem is, given a text string, a pattern string, and an integerk, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at mostk from the pattern. We give a newO(kn) algorithm for this problem, wheren is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.  相似文献   

9.
In molecular biology, it is said that two biological sequences tend to have similar properties if they have similar three-dimensional structures. Hence, it is very important to find not only similar sequences in the string sense, but also structurally similar sequences from databases. In this paper we propose a new data structure that is a generalization of a parameterized suffix tree (p-suffix tree for short) introduced by Baker. We call it the structural suffix tree or s-suffix tree for short. The s-suffix tree can be used for finding structurally related patterns of RNA or single-stranded DNA. Furthermore, we propose an O(n(log|| + log||)) on-line algorithm for constructing it, where n is the sequence length, || is the size of the normal alphabet, and || is that of the alphabet called parameter, which is related to the structure of the sequence. Our algorithm achieves linear time when it is used to analyze RNA and DNA sequences. Furthermore, as an algorithm for constructing the p-suffix tree, it is the first on-line algorithm, though the computing bound of our algorithm is the same as that of Kosarajus best-known algorithm. The results of computational experiments using actual RNA and DNA sequences are also given to demonstrate our algorithms practicality.  相似文献   

10.
The suffix tree is a key data structure for biological sequence analysis, since it permits efficient solutions to many string-based problems. Constructing large suffix trees is challenging because of high memory overheads and poor memory locality. Even though efficient suffix tree construction algorithms exist, their run-time is still very high for long DNA sequences such as whole human chromosomes. In this paper, we are using a hierarchical grid system as a computational platform in order to reduce this run-time significantly. To achieve an efficient mapping onto this type of architecture we introduce a parallel suffix tree construction algorithm that makes use of a new data structure called the common prefix suffix tree. Using this algorithm together with a dynamic load balancing strategy we show that our distributed grid implementation leads to significant run-time savings.  相似文献   

11.
We present a new and simple algorithm to reconstruct suffix links in suffix trees and suffix arrays. The algorithm is based on observations regarding suffix tree construction algorithms. With our algorithm we bring suffix arrays even closer to the ease of use and implementation of suffix trees.  相似文献   

12.
Suffix trees are the fundamental data structure of combinatorial pattern matching on words. Suffix trees have been used in order to give optimal solutions to a great variety of problems on static words, but for practical situations, such as in a text editor, where the incremental changes of the text make dynamic updating of the corresponding suffix trees necessary, this data structure alone has not been used with success. We prove that, for dynamic modifications of order O(1) of words of length n, any suffix tree updating algorithm, such as the ones proposed by McCreight, requires O(n) worst-case running time, as for the full reconstruction of the suffix tree. Consequently, we argue that this data structure alone is not appropriate for the solution of combinatorial problems on words that change dynamically.  相似文献   

13.
Engineering a Lightweight Suffix Array Construction Algorithm   总被引:6,自引:0,他引:6  
In this paper we describe a new algorithm for building the suffix array of a string. This task is equivalent to the problem of lexicographically sorting all the suffixes of the input string. Our algorithm is based on a new approach called deep–shallow sorting: we use a shallow sorter for the suffixes with a short common prefix, and a deep sorter for the suffixes with a long common prefix. All the known algorithms for building the suffix array either require a large amount of space or are inefficient when the input string contains many repeated substrings. Our algorithm has been designed to overcome this dichotomy. Our algorithm is lightweight in the sense that it uses very small space in addition to the space required by the suffix array itself. At the same time our algorithm is fast even when the input contains many repetitions: this has been shown by extensive experiments with inputs of size up to 110 Mb. The source code of our algorithm, as well as a C library providing a simple API, is available under the GNU GPL.  相似文献   

14.
Simple and flexible detection of contiguous repeats using a suffix tree   总被引:1,自引:0,他引:1  
We study the problem of detecting all occurrences of (primitive) tandem repeats and tandem arrays in a string. We first give a simple time- and space-optimal algorithm to find all tandem repeats, and then modify it to become a time and space-optimal algorithm for finding only the primitive tandem repeats. Both of these algorithms are then extended to handle tandem arrays. The contribution of this paper is both pedagogical and practical, giving simple algorithms and implementations based on a suffix tree, using only standard tree traversal techniques.  相似文献   

15.
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier.  相似文献   

16.
We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.  相似文献   

17.
Maa? 《Algorithmica》2008,37(1):43-74
Affix trees are a generalization of suffix trees that are based on the inherent duality of suffix trees induced by the suffix links. An algorithm is presented that constructs affix trees on-line by expanding the underlying string in both directions and that is the first known algorithm to do this with linear time complexity.  相似文献   

18.
多模式串匹配算法是网络内容过滤系统的核心技术之一.自动机的存储空间大小和Cache性能是影响多模式串匹配算法速度的关键因素.随着模式串规模的扩大,自动机的巨大存储开销导致现有的串匹配算法性能大幅度下降.从压缩存储空间以提高Cache命中率的思想出发,提出了一种对经典SBOM算法的优化策略,它用Suffix Tree代替SBOM算法中的Factor Oracle结构,同时用剪枝的方法将Suffix Tree降低为近似线性的空间复杂度,然后用双数组Trie表示之,以压缩存储空间.与SBOM算法相比,改进算法不仅能够有效地节省存储空间,而且显著地提高了串匹配的速度,非常适合于在线高速匹配的应用环境.  相似文献   

19.
Finding motifs in biological sequences is one of the most intriguing problems for string algorithm designers due to, on the one hand, the numerous applications of this problem in molecular biology and, on the other hand, the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion), and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely exponential in the approximation degree). This often makes the output unreadable, as well as slowing down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has been observed many times that only a subset of these motifs, namely the maximal motifs, could be enough to provide the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotical complexity.  相似文献   

20.
The Longest Previous non-overlapping Factor table (LPnF) stores for each position of a string the maximal length of factors occurring both there and in the preceding part of the string. The notion is a slight variant of the LPF table described before and used for text compression. The LPnF table is an essential element for the design of efficient algorithms on strings as it is related to a certain type of Ziv-Lempel factorisation used for this purpose.We show how to compute the LPnF table in linear time from the suffix array of the string when it is drawn from an integer alphabet. The algorithm is a non-immediate extension of the LPF computation and it does not require any other sophisticated data structure than the suffix array of the input string.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号