共查询到20条相似文献,搜索用时 15 毫秒
1.
Suffix trees are the fundamental data structure of combinatorial pattern
matching on words. Suffix trees have been used in order to give optimal
solutions to a great variety of problems on static words, but for practical
situations, such as in a text editor, where the incremental changes of
the text make dynamic updating of the corresponding suffix trees necessary, this
data structure alone has not been used with success. We prove that, for dynamic
modifications of order O(1) of words of length n, any suffix tree updating
algorithm, such as the ones proposed by McCreight, requires O(n) worst-case
running time, as for the full reconstruction of the suffix tree. Consequently,
we argue that this data structure alone is not appropriate for the solution
of combinatorial problems on words that change dynamically. 相似文献
2.
Practical methods for constructing suffix trees 总被引:7,自引:0,他引:7
Yuanyuan Tian Sandeep Tata Richard A. Hankins Jignesh M. Patel 《The VLDB Journal The International Journal on Very Large Data Bases》2005,14(3):281-299
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation
in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of
queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However,
methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not
fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior
of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large
number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios
are not well characterized.
In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show
that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction.
For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this
problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have
been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second,
we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing
very large suffix trees with very little resources, this algorithm is more efficient than TDD. 相似文献
3.
Gabriele Fici 《Theoretical computer science》2011,412(29):3604-3615
The suffix automaton (resp. factor automaton) of a finite word w is the minimal deterministic automaton recognizing the set of suffixes (resp. factors) of w. We study the relationships between the structure of the suffix and factor automata and classical combinatorial parameters related to the special factors of w. We derive formulae for the number of states of these automata. We also characterize the languages LSA and LFA of words having respectively suffix automaton and factor automaton with the minimal possible number of states. 相似文献
4.
Moritz G. Maaß 《Information Processing Letters》2007,101(6):250-254
We present a new and simple algorithm to reconstruct suffix links in suffix trees and suffix arrays. The algorithm is based on observations regarding suffix tree construction algorithms. With our algorithm we bring suffix arrays even closer to the ease of use and implementation of suffix trees. 相似文献
5.
S.Srinivasa Rao 《Information Processing Letters》2002,82(6):307-311
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier. 相似文献
6.
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest. 相似文献
7.
Taehyung Lee 《Information Processing Letters》2011,111(5):201-207
We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem. 相似文献
8.
Theapproximate string matching problem is, given a text string, a pattern string, and an integerk, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at mostk from the pattern. We give a newO(kn) algorithm for this problem, wheren is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported. 相似文献
9.
Roberto Grossi 《Theoretical computer science》2011,412(27):2964-2973
Suffix arrays are a key data structure for solving a run of problems on texts and sequences, from data compression and information retrieval to biological sequence analysis and pattern discovery. In their simplest version, they can just be seen as a permutation of the elements in {1,2,…,n}, encoding the sorted sequence of suffixes from a given text of length n, under the lexicographic order. Yet, they are on a par with ubiquitous and sophisticated suffix trees. Over the years, many interesting combinatorial properties have been devised for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well characterized subset of the n! permutations. This paper gives a short tutorial on suffix arrays and their compressed version to explore and review some of their algorithmic features, discussing the space issues related to their usage in text indexing, combinatorial pattern matching, and data compression. 相似文献
10.
We use automata-theoretic approach to analyze properties of Fibonacci words. The directed acyclic subword graph (dawg) is a useful deterministic automaton accepting all suffixes of the word. We show that dawg's of Fibonacci words have particularly simple structure. Our main result is a unifying framework for a large collection of relatively simple properties of Fibonacci words. The simple structure of dawgs of Fibonacci words gives in many cases simplified alternative proofs and new interpretation of several well-known properties of Fibonacci words. In particular, the structure of lengths of paths corresponds to a number-theoretic characterization of occurrences of any subword. Using the structural properties of dawg's it can be easily shown that for a string w we can check if w is a subword of a Fibonacci word in time O(|w|) and O(1) space. Compact dawg's of Fibonacci words show a very regular structure of their suffix trees and show how the suffix tree for the Fibonacci word grows (extending the leaves in a very simple way) into the suffix tree for the next Fibonacci word. 相似文献
11.
Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q−2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖−1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578–589], where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs. 相似文献
12.
A suffix tree approach to anti-spam email filtering 总被引:1,自引:0,他引:1
We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using
the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the
character level representation of emails and classes facilitated by the suffix tree can significantly improve classification
accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the
classification of documents in other domains.
Editor: Tom Fawcett 相似文献
13.
We study the problem of detecting all occurrences of (primitive) tandem repeats and tandem arrays in a string. We first give a simple time- and space-optimal algorithm to find all tandem repeats, and then modify it to become a time and space-optimal algorithm for finding only the primitive tandem repeats. Both of these algorithms are then extended to handle tandem arrays. The contribution of this paper is both pedagogical and practical, giving simple algorithms and implementations based on a suffix tree, using only standard tree traversal techniques. 相似文献
14.
We propose a fast and memory-efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has important applications in data compression as well as string matching. 相似文献
15.
We present a linear algorithm which generates randomly and with uniform probability many kinds of trees: binary trees, ternary
trees, arbitrary trees, forests ofp k-ary trees,.... The algorithm is based on the definition of generic trees which can be coded as words. These words, in turn,
are generated in linear time. 相似文献
16.
J. F. Weng 《Algorithmica》1997,19(3):318-330
A Steiner tree T on a given set of points A is called linear if all Steiner points, including those collapsing into their adjacent given points, lie on one path referred
to as its trunk. Suppose A is a simple polygonal line. Roughly speaking, T is similar to A if its trunk turns right or left when A does. In this paper we prove that A can be expanded to another polygonal line, and T can be constructed in linear time using this expansion method.
Received January 15, 1995; revised November 19, 1995, and February 3, 1996. 相似文献
17.
18.
Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string ofn symbols inO(logn) time withn processors. The algorithm requires (n
2) space. However, the space needed can be reduced toO(n
1+) for any 0< 1, with a corresponding slow-down proportional to 1/. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.The results of this paper have been achieved independently and simultaneously in [AI-86] and [LSV-86]. The research of U. Vishkin was supported by NSF Grant NSF-CCR-8615337, ONR Grant N00014-85-K-0046, and Foundation for Research in Electronics, Computers, and Communication, administered by the Israeli Academy of Sciences and Humanities. The research of A. Apostolico was carried out in part while visiting at the Istituto di Analisi dei Sistemi e Informatica, Rome, with support from the Italian National Research Council. The research of G. M. Landau, B. Schieber, and U. Vishkin was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contract DE-AC02-76ER03077. 相似文献
19.
Christine Largeron-Letno 《Pattern recognition letters》2003,24(16):3153-3164
This paper presents a statistical test and algorithms for patterns extraction and supervised classification of sequential data. First it defines the notion of prediction suffix tree (PST). This type of tree can be used to efficiently describe variable order chain. It performs better than the Markov chain of order L and at a lower storage cost. We propose an improvement of this model, based on a statistical test. This test enables us to control the risk of encountering different patterns in the model of the sequence to classify and in the model of its class. Applications to biological sequences are presented to illustrate this procedure. We compare the results obtained with different models (Markov chain of order L, Variable order model and the statistical test, with or without smoothing). We set out to show how the choice of the parameters of the models influences performance in these applications. Obviously these algorithms can be used in other fields in which the data are naturally ordered. 相似文献
20.
A popular way to describe and build the DAWG or Directed Acyclic Word Graph of a string is by transformation of the corresponding subword tree. This transformation, which is not difficult to reverse, is easy to grasp and almost trivial to implement except for the assumed implication of a standard tree isomorphism algorithm. Here we point out a simple property of subword trees that makes checking tree isomorphism in this context a straightforward process, thereby simplifying the transformation significantly. Subword trees and DAWGs arise rather ubiquitously in applications of string processing, where they often play complementary roles. Efficient conversions are thus especially desirable. 相似文献