首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
The two-dimensional (2-D) suffix tree of an n×n square matrix A is a compacted trie that represents all square submatrices of A. We consider constructing 2-D suffix trees on-line, which means, instead of giving the whole matrix A in advance, A is separated and each part of A is given at different time as algorithms proceed. In general, developing an on-line algorithm is more difficult than developing an off-line algorithm. Moreover, the smaller the input grain size is, the harder it is to develop an on-line algorithm. In the case of 2-D suffix tree construction, dealing with a character at a time is harder than dealing with a row or a column at a time.In this paper we propose a randomized linear-time algorithm for constructing 2-D suffix trees on-line. This algorithm is superior to previous algorithms in two ways: (1) This is the first linear-time algorithm for constructing 2-D suffix trees on-line. Although there have been some linear-time algorithms for off-line construction, there were no linear-time algorithms for on-line construction. (2) We deal with the most fine-grain on-line case, i.e., our algorithm can construct a 2-D suffix tree even though only one character of A is given at a time, while previous on-line algorithms require at least a row and/or a column at a time.  相似文献   

2.
We discuss an intrinsic generalization of the suffix tree, designed to index a string of length n which has a natural partitioning into m multicharacter substrings or words . This word suffix tree represents only the m suffixes that start at word boundaries. These boundaries are determined by delimiters , whose definition depends on the application. Since traditional suffix tree construction algorithms rely heavily on the fact that all suffixes are inserted, construction of a word suffix tree is nontrivial, in particular when only O(m) construction space is allowed. We solve this problem, presenting an algorithm with O(n) expected running time. In general, construction cost is Ω(n) due to the need of scanning the entire input. In applications that require strict node ordering, an additional cost of sorting O(m') characters arises, where m' is the number of distinct words. In either case, this is a significant improvement over previously known solutions. Furthermore, when the alphabet is small, we may assume that the n characters in the input string occupy o(n) machine words. We illustrate that this can allow a word suffix tree to be built in sublinear time. Received September 2, 1997; revised December 10, 1997.  相似文献   

3.
Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string ofn symbols inO(logn) time withn processors. The algorithm requires Θ(n 2) space. However, the space needed can be reduced toO(n 1+?) for any 0< ? ≤1, with a corresponding slow-down proportional to 1/?. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.  相似文献   

4.
A novel index structure based on the generalized suffix tree (PIGST) is proposed. Combined with post lists, PIGST can answer both structural and content queries. The distinct paths in an XML collection are mapped into strings. The construction algorithm of the PIGST for the path strings is presented based on the modification and improvement of a well-known suffix tree construction algorithm that only requires linear time and space complexity. The query process merely needs m character comparisons for direct containment queries, where m is the length of a query string. An efficient processing method for the indirect containment queries that avoids the inefficient tree traversal operation is also presented. Experiments show that PIGST outperforms earlier approaches.  相似文献   

5.
The classical pattern matching paradigm is that of seeking occurrences of one string in another, where both strings are drawn from an alphabet set Σ. In the parameterized pattern matching model, a consistent renaming of symbols from Σ is allowed in a match. The parameterized matching paradigm has proven useful in problems in software engineering, computer vision, and other applications. In classical pattern matching, both the text and pattern are strings. Applications such as searching in xml or searching in hypertext require searching strings in non-linear structures such as trees or graphs. There has been work in the literature on exact and approximate parameterized matching, as well as work on exact and approximate string matching on non-linear structures. In this paper we explore parameterized matching in non-linear structures. We prove that exact parameterized matching on trees can be computed in linear time for alphabets in an O(n)-size integer range, and in time O(nlogm) in general, where n is the tree size and m the pattern length. These bounds are optimal in the comparison model. We also show that exact parameterized matching on directed acyclic graphs (DAGs) is NP-complete.  相似文献   

6.
We present an algorithm to approximate edit distance between two ordered and rooted trees of bounded degree. In this algorithm, each input tree is transformed into a string by computing the Euler string, where labels of some edges in the input trees are modified so that structures of small subtrees are reflected to the labels. We show that the edit distance between trees is at least 1/6 and at most O(n 3/4) of the edit distance between the transformed strings, where n is the maximum size of two input trees and we assume unit cost edit operations for both trees and strings. The algorithm works in O(n 2) time since computation of edit distance and reconstruction of tree mapping from string alignment takes O(n 2) time though transformation itself can be done in O(n) time.  相似文献   

7.
This paper deals with a new method of coding unlabeled trees, the resulting code being a string of integers. For a tree onn vertices both the coding and the decoding algorithm work in time 0(n). Givenn the “mean length” of code strings is shorter than with other existing integer string codes. The new coding method is used to derive an algorithm for generating all unlabeled trees of a given size without repetition.  相似文献   

8.
The length-constrained heaviest path (LCHP) in a weighted tree T, where each edge is assigned a weight and a length, is the path P in T with maximum total path weight and total path length bounded by a given value B. This paper presents an O(nlogn) time LCHP algorithm which utilizes a data structure constructed from the spine decomposition of the input tree. This is an improvement over the existing algorithm by Wu et al. (1999), which runs in O(nlog2n) time. Our method also improves on a previous O(nlogn) time algorithm by Kim (2005) for the special case of finding a longest nonnegative path in a constant degree tree in that we can handle trees of arbitrary degree within the same time bounds.  相似文献   

9.
We derive an efficient parallel algorithm to find all occurrences of a pattern string in a subject string in O(logn) time, where n is the length of the subject string. The number of processors employed is of the order of the product of the two string lengths. The theory of powerlists [J. Kornerup, PhD Thesis, 1997; J. Misra, ACM Trans. Programming Languages Systems 16 (6) (1994) 1737-1740] is central to the development of the algorithm and its algebraic manipulations.  相似文献   

10.
A tandem repeat (or square) is a string αα, where α is a non-empty string. We present an O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint in T(S) of every tandem repeat that occurs in S. This decorated suffix tree implicitly represents all occurrences of tandem repeats in S, and can be used to efficiently solve many questions concerning tandem repeats and tandem arrays in S. This improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.  相似文献   

11.
Suffix trees and suffix arrays are fundamental full-text index data structures to solve problems occurring in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(mlog |Σ|) time where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports the pattern search in O(mlog |Σ|) time. In a different point of view, it can be considered a practical implementation of the suffix tree supporting O(mlog |Σ|)-time pattern search. In addition, we also present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.  相似文献   

12.
On-line construction of suffix trees   总被引:47,自引:0,他引:47  
E. Ukkonen 《Algorithmica》1995,14(3):249-260
An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffixtries. Regardless of its quadratic worst case this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).This research was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany).  相似文献   

13.
Practical methods for constructing suffix trees   总被引:7,自引:0,他引:7  
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.  相似文献   

14.
On approximation algorithms for the terminal Steiner tree problem   总被引:1,自引:0,他引:1  
The terminal Steiner tree problem is a special version of the Steiner tree problem, where a Steiner minimum tree has to be found in which all terminals are leaves. We prove that no polynomial time approximation algorithm for the terminal Steiner tree problem can achieve an approximation ratio less than (1−o(1))lnn unless NP has slightly superpolynomial time algorithms. Moreover, we present a polynomial time approximation algorithm for the metric version of this problem with a performance ratio of 2ρ, where ρ denotes the best known approximation ratio for the Steiner tree problem. This improves the previously best known approximation ratio for the metric terminal Steiner tree problem of ρ+2.  相似文献   

15.
We present an O(n3)-time approximation algorithm for the maximum traveling salesman problem whose approximation ratio is asymptotically , where n is the number of vertices in the input complete edge-weighted (undirected) graph. We also present an O(n3)-time approximation algorithm for the metric case of the problem whose approximation ratio is asymptotically . Both algorithms improve on the previous bests.  相似文献   

16.
17.
In this paper, we revisit the Property Matching problem studied by Amir et al. [Property Matching and Weighted Matching, CPM 2006] and present a better indexing scheme for the problem. In particular, the data structure by Amir et al., namely PST, requires O(nlog|Σ|+nloglogn) construction time and O(mlog|Σ|+K) query time, where n and m are the length of, respectively, the text and the pattern, Σ is the alphabet and K is the output size. On the other hand, the construction time of our data structure, namely IDS_PIP, is dominated by the suffix tree construction time and hence is O(n) time for alphabets that are natural numbers from 1 to a polynomial in n and O(nlogσ) time otherwise, where σ=min(n,|Σ|). The query time is same as that of PST. Also, IDS_PIP has the advantage that it can be built on either a suffix tree or a suffix array and additionally, it retains the capability of answering normal pattern matching queries.  相似文献   

18.
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier.  相似文献   

19.
20.
In molecular biology, it is said that two biological sequences tend to have similar properties if they have similar three-dimensional structures. Hence, it is very important to find not only similar sequences in the string sense, but also structurally similar sequences from databases. In this paper we propose a new data structure that is a generalization of a parameterized suffix tree (p-suffix tree for short) introduced by Baker. We call it the structural suffix tree or s-suffix tree for short. The s-suffix tree can be used for finding structurally related patterns of RNA or single-stranded DNA. Furthermore, we propose an O(n(log|| + log||)) on-line algorithm for constructing it, where n is the sequence length, || is the size of the normal alphabet, and || is that of the alphabet called parameter, which is related to the structure of the sequence. Our algorithm achieves linear time when it is used to analyze RNA and DNA sequences. Furthermore, as an algorithm for constructing the p-suffix tree, it is the first on-line algorithm, though the computing bound of our algorithm is the same as that of Kosarajus best-known algorithm. The results of computational experiments using actual RNA and DNA sequences are also given to demonstrate our algorithms practicality.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号