期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A Linear Time Lower Bound on McCreight and General Updating Algorithms for Suffix Trees

Mauricio? Ayala-Rincón Email author Paulo D.?Conejo Email author 《Algorithmica》2003,37(3):233-241

Suffix trees are the fundamental data structure of combinatorial pattern matching on words. Suffix trees have been used in order to give optimal solutions to a great variety of problems on static words, but for practical situations, such as in a text editor, where the incremental changes of the text make dynamic updating of the corresponding suffix trees necessary, this data structure alone has not been used with success. We prove that, for dynamic modifications of order O(1) of words of length n, any suffix tree updating algorithm, such as the ones proposed by McCreight, requires O(n) worst-case running time, as for the full reconstruction of the suffix tree. Consequently, we argue that this data structure alone is not appropriate for the solution of combinatorial problems on words that change dynamically. 相似文献

2.

Practical methods for constructing suffix trees 总被引：7，自引：0，他引：7

Yuanyuan Tian Sandeep Tata Richard A. Hankins Jignesh M. Patel 《The VLDB Journal The International Journal on Very Large Data Bases》2005,14(3):281-299

Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n²) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD. 相似文献

3.

Special factors and the combinatorics of suffix and factor automata

Gabriele Fici 《Theoretical computer science》2011,412(29):3604-3615

The suffix automaton (resp. factor automaton) of a finite word w is the minimal deterministic automaton recognizing the set of suffixes (resp. factors) of w. We study the relationships between the structure of the suffix and factor automata and classical combinatorial parameters related to the special factors of w. We derive formulae for the number of states of these automata. We also characterize the languages L_SA and L_FA of words having respectively suffix automaton and factor automaton with the minimal possible number of states. 相似文献

4.

Computing suffix links for suffix trees and arrays

Moritz G. Maaß 《Information Processing Letters》2007,101(6):250-254

We present a new and simple algorithm to reconstruct suffix links in suffix trees and suffix arrays. The algorithm is based on observations regarding suffix tree construction algorithms. With our algorithm we bring suffix arrays even closer to the ease of use and implementation of suffix trees. 相似文献

5.

Time-space trade-offs for compressed suffix arrays

S.Srinivasa Rao 《Information Processing Letters》2002,82(6):307-311

Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)^1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier. 相似文献

6.

Faster entropy-bounded compressed suffix trees

Johannes Fischer Veli Mäkinen Gonzalo Navarro 《Theoretical computer science》2009

Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest. 相似文献

7.

On-line construction of parameterized suffix trees for large alphabets

Taehyung Lee 《Information Processing Letters》2011,111(5):201-207

We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem. 相似文献

8.

Approximate string matching with suffix automata

Esko Ukkonen Derick Wood 《Algorithmica》1993,10(5):353-364

Theapproximate string matching problem is, given a text string, a pattern string, and an integerk, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at mostk from the pattern. We give a newO(kn) algorithm for this problem, wheren is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported. 相似文献

9.

A quick tour on suffix arrays and compressed suffix arrays

Roberto Grossi 《Theoretical computer science》2011,412(27):2964-2973

Suffix arrays are a key data structure for solving a run of problems on texts and sequences, from data compression and information retrieval to biological sequence analysis and pattern discovery. In their simplest version, they can just be seen as a permutation of the elements in {1,2,…,n}, encoding the sorted sequence of suffixes from a given text of length n, under the lexicographic order. Yet, they are on a par with ubiquitous and sophisticated suffix trees. Over the years, many interesting combinatorial properties have been devised for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well characterized subset of the n! permutations. This paper gives a short tutorial on suffix arrays and their compressed version to explore and review some of their algorithmic features, discussing the space issues related to their usage in text indexing, combinatorial pattern matching, and data compression. 相似文献

10.

The structure of subword graphs and suffix trees of Fibonacci words

Wojciech Rytter 《Theoretical computer science》2006

We use automata-theoretic approach to analyze properties of Fibonacci words. The directed acyclic subword graph (dawg) is a useful deterministic automaton accepting all suffixes of the word. We show that dawg's of Fibonacci words have particularly simple structure. Our main result is a unifying framework for a large collection of relatively simple properties of Fibonacci words. The simple structure of dawgs of Fibonacci words gives in many cases simplified alternative proofs and new interpretation of several well-known properties of Fibonacci words. In particular, the structure of lengths of paths corresponds to a number-theoretic characterization of occurrences of any subword. Using the structural properties of dawg's it can be easily shown that for a string w

w

we can check if w

w

is a subword of a Fibonacci word in time O(|w|)

O (| w |)

and O(1)

O (1)

space. Compact dawg's of Fibonacci words show a very regular structure of their suffix trees and show how the suffix tree for the Fibonacci word grows (extending the leaves in a very simple way) into the suffix tree for the next Fibonacci word. 相似文献

11.

General suffix automaton construction algorithm and space bounds

Mehryar Mohri Pedro Moreno Eugene Weinstein 《Theoretical computer science》2009

Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U

U

has at most 2Q−2

2 Q - 2

states, where Q

Q

is the number of nodes of a prefix-tree representing the strings in U

U

. This bound significantly improves over 2‖U‖−1

2 ‖ U ‖ - 1

, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578–589], where ‖U‖

‖ U ‖

is the sum of the lengths of all strings in U

U

. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S

S

or factor automaton F

F

of U

U

in time O(|S|)

O (| S |)

. Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs. 相似文献

12.

A suffix tree approach to anti-spam email filtering 总被引：1，自引：0，他引：1

Rajesh Pampapathi Boris Mirkin Mark Levene 《Machine Learning》2006,65(1):309-338

We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the classification of documents in other domains. Editor: Tom Fawcett 相似文献

13.

Simple and flexible detection of contiguous repeats using a suffix tree 总被引：1，自引：0，他引：1

Jens Stoye Dan Gusfield 《Theoretical computer science》2002,270(1-2):843-856

We study the problem of detecting all occurrences of (primitive) tandem repeats and tandem arrays in a string. We first give a simple time- and space-optimal algorithm to find all tandem repeats, and then modify it to become a time and space-optimal algorithm for finding only the primitive tandem repeats. Both of these algorithms are then extended to handle tandem arrays. The contribution of this paper is both pedagogical and practical, giving simple algorithms and implementations based on a suffix tree, using only standard tree traversal techniques. 相似文献

14.

Faster suffix sorting

N. Jesper Larsson Kunihiko Sadakane 《Theoretical computer science》2007

We propose a fast and memory-efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has important applications in data compression as well as string matching. 相似文献

15.

A linear-time algorithm for the generation of trees

L. Alonso J. L. Rémy R. Schott 《Algorithmica》1997,17(2):162-182

We present a linear algorithm which generates randomly and with uniform probability many kinds of trees: binary trees, ternary trees, arbitrary trees, forests ofp k-ary trees,.... The algorithm is based on the definition of generic trees which can be coded as words. These words, in turn, are generated in linear time. 相似文献

16.

Expansion of Linear Steiner Trees

J. F. Weng 《Algorithmica》1997,19(3):318-330

A Steiner tree T on a given set of points A is called linear if all Steiner points, including those collapsing into their adjacent given points, lie on one path referred to as its trunk. Suppose A is a simple polygonal line. Roughly speaking, T is similar to A if its trunk turns right or left when A does. In this paper we prove that A can be expanded to another polygonal line, and T can be constructed in linear time using this expansion method. Received January 15, 1995; revised November 19, 1995, and February 3, 1996. 相似文献

17.

Fast BWT in small space by blockwise suffix sorting

Juha Kärkkäinen 《Theoretical computer science》2007

相似文献

18.

Parallel construction of a suffix tree with applications

A. Apostolico C. Iliopoulos G. M. Landau B. Schieber U. Vishkin 《Algorithmica》1988,3(1):347-365

Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string ofn symbols inO(logn) time withn processors. The algorithm requires (n ²) space. However, the space needed can be reduced toO(n ¹⁺) for any 0< 1, with a corresponding slow-down proportional to 1/. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.The results of this paper have been achieved independently and simultaneously in [AI-86] and [LSV-86]. The research of U. Vishkin was supported by NSF Grant NSF-CCR-8615337, ONR Grant N00014-85-K-0046, and Foundation for Research in Electronics, Computers, and Communication, administered by the Israeli Academy of Sciences and Humanities. The research of A. Apostolico was carried out in part while visiting at the Istituto di Analisi dei Sistemi e Informatica, Rome, with support from the Italian National Research Council. The research of G. M. Landau, B. Schieber, and U. Vishkin was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contract DE-AC02-76ER03077. 相似文献

19.

Prediction suffix trees for supervised classification of sequences

Christine Largeron-Letno 《Pattern recognition letters》2003,24(16):3153-3164

This paper presents a statistical test and algorithms for patterns extraction and supervised classification of sequential data. First it defines the notion of prediction suffix tree (PST). This type of tree can be used to efficiently describe variable order chain. It performs better than the Markov chain of order L and at a lower storage cost. We propose an improvement of this model, based on a statistical test. This test enables us to control the risk of encountering different patterns in the model of the sequence to classify and in the model of its class. Applications to biological sequences are presented to illustrate this procedure. We compare the results obtained with different models (Markov chain of order L, Variable order model and the statistical test, with or without smoothing). We set out to show how the choice of the parameters of the models influences performance in these applications. Obviously these algorithms can be used in other fields in which the data are naturally ordered. 相似文献

20.

A speed-up for the commute between subword trees and DAWGs

Alberto Apostolico Stefano Lonardi 《Information Processing Letters》2002,83(3):159-161

A popular way to describe and build the DAWG or Directed Acyclic Word Graph of a string is by transformation of the corresponding subword tree. This transformation, which is not difficult to reverse, is easy to grasp and almost trivial to implement except for the assumed implication of a standard tree isomorphism algorithm. Here we point out a simple property of subword trees that makes checking tree isomorphism in this context a straightforward process, thereby simplifying the transformation significantly. Subword trees and DAWGs arise rather ubiquitously in applications of string processing, where they often play complementary roles. Efficient conversions are thus especially desirable. 相似文献