首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Parallel construction of a suffix tree with applications   总被引:1,自引:1,他引:0  
Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string ofn symbols inO(logn) time withn processors. The algorithm requires (n 2) space. However, the space needed can be reduced toO(n 1+) for any 0< 1, with a corresponding slow-down proportional to 1/. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.The results of this paper have been achieved independently and simultaneously in [AI-86] and [LSV-86]. The research of U. Vishkin was supported by NSF Grant NSF-CCR-8615337, ONR Grant N00014-85-K-0046, and Foundation for Research in Electronics, Computers, and Communication, administered by the Israeli Academy of Sciences and Humanities. The research of A. Apostolico was carried out in part while visiting at the Istituto di Analisi dei Sistemi e Informatica, Rome, with support from the Italian National Research Council. The research of G. M. Landau, B. Schieber, and U. Vishkin was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contract DE-AC02-76ER03077.  相似文献   

2.
We present an algorithm to approximate edit distance between two ordered and rooted trees of bounded degree. In this algorithm, each input tree is transformed into a string by computing the Euler string, where labels of some edges in the input trees are modified so that structures of small subtrees are reflected to the labels. We show that the edit distance between trees is at least 1/6 and at most O(n 3/4) of the edit distance between the transformed strings, where n is the maximum size of two input trees and we assume unit cost edit operations for both trees and strings. The algorithm works in O(n 2) time since computation of edit distance and reconstruction of tree mapping from string alignment takes O(n 2) time though transformation itself can be done in O(n) time.  相似文献   

3.
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier.  相似文献   

4.
We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.  相似文献   

5.
Practical methods for constructing suffix trees   总被引:7,自引:0,他引:7  
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.  相似文献   

6.
We revisit various string indexing problems with range reporting features, namely, position-restricted substring searching, indexing substrings with gaps, and indexing substrings with intervals. We obtain the following main results.
  • We give efficient reductions for each of the above problems to a new problem, which we call substring range reporting. Hence, we unify the previous work by showing that we may restrict our attention to a single problem rather than studying each of the above problems individually.
  • We show how to solve substring range reporting with optimal query time and little space. Combined with our reductions this leads to significantly improved time-space trade-offs for the above problems. In particular, for each problem we obtain the first solutions with optimal time query and O(nlog O(1) n) space, where n is the length of the indexed string.
  • We show that our techniques for substring range reporting generalize to substring range counting and substring range emptiness variants. We also obtain non-trivial time-space trade-offs for these problems.
Our bounds for substring range reporting are based on a novel combination of suffix trees and range reporting data structures. The reductions are simple and general and may apply to other combinations of string indexing with range reporting.  相似文献   

7.
We present the design and analysis of a nearly-linear work parallel algorithm for solving symmetric diagonally dominant (SDD) linear systems. On input an SDD n-by-n matrix A with m nonzero entries and a vector b, our algorithm computes a vector \(\tilde{x}\) such that \(\|\tilde{x} - A^{+}b\|_{A} \leq\varepsilon\cdot\|{A^{+}b}\|_{A}\) in \(O(m\log^{O(1)}{n}\log {\frac{1}{\varepsilon}})\) work and \(O(m^{1/3+\theta}\log\frac{1}{\varepsilon})\) depth for any θ>0, where A + denotes the Moore-Penrose pseudoinverse of A. The algorithm relies on a parallel algorithm for generating low-stretch spanning trees or spanning subgraphs. To this end, we first develop a parallel decomposition algorithm that in O(mlog O(1) n) work and polylogarithmic depth, partitions a graph with n nodes and m edges into components with polylogarithmic diameter such that only a small fraction of the original edges are between the components. This can be used to generate low-stretch spanning trees with average stretch O(n α ) in O(mlog O(1) n) work and O(n α ) depth for any α>0. Alternatively, it can be used to generate spanning subgraphs with polylogarithmic average stretch in O(mlog O(1) n) work and polylogarithmic depth. We apply this subgraph construction to derive a parallel linear solver. By using this solver in known applications, our results imply improved parallel randomized algorithms for several problems, including single-source shortest paths, maximum flow, minimum-cost flow, and approximate maximum flow.  相似文献   

8.
We consider the following problem. For a binary tree T = (V, E) where V = {1, 2, ..., n}, given its inorder traversal and either its preorder or its postorder traversal, reconstruct the binary tree. We present a new parallel algorithm for this problem. Our algorithm requires O(n) space. The main idea of our algorithm is to reduce the reconstruction process to merging two sorted sequences. With the best parallel merging algorithms, our algorithm can be implemented in O(log log n) time using O(n/log log n) processors on the CREW PRAM (or in O(log n) time using O(n/log n) processors on the EREW PRAM). Our result provides one more example of a fundamental problem which can be solved by optimal parallel algorithms in O(log log n)time on the CREW PRAM.  相似文献   

9.
Suffix trees and suffix arrays are fundamental full-text index data structures to solve problems occurring in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(mlog |Σ|) time where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports the pattern search in O(mlog |Σ|) time. In a different point of view, it can be considered a practical implementation of the suffix tree supporting O(mlog |Σ|)-time pattern search. In addition, we also present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.  相似文献   

10.
In this paper we consider the problem of finding two parallel rectangles in arbitrary orientation for covering a given set of n points in a plane, such that the area of the larger rectangle is minimized. We propose an algorithm that solves the problem in O(n3) time using O(n2) space. Without altering the complexity, our approach can be used to solve another optimization problem namely, minimize the sum of the areas of two arbitrarily oriented parallel rectangles covering a given set of points in a plane.  相似文献   

11.
This paper determines upper bounds on the expected time complexity for a variety of parallel algorithms for undirected and directed random graph problems. For connectivity, biconnectivity, transitive closure, minimum spanning trees, and all pairs minimum cost paths, we prove the expected time to beO(log logn) for the CRCW PRAM (this parallel RAM machine allows resolution of write conflicts) andO(logn · log logn) for the CREW PRAM (which allows simultaneous reads but not simultaneous writes). We also show that the problem of graph isomorphism has expected parallel timeO(log logn) for the CRCW PRAM andO(logn) for the CREW PRAM. Most of these results follow because of upper bounds on the mean depth of a graph, derived in this paper, for more general graphs than was known before. For undirected connectivity especially, we present a new probabilistic algorithm which runs on a randomized input and has an expected running time ofO(log logn) on the CRCW PRAM, withO(n) expected number of processors only. Our results also improve known upper bounds on the expected space required for sequential graph algorithms. For example, we show that the problems of finding connected components, transitive closure, minimum spanning trees, and minimum cost paths have expected sequential spaceO(logn · log logn) on a deterministic Turing Machine. We use a simulation of the CRCW PRAM to get these expected sequential space bounds.  相似文献   

12.
A stringw isprimitive if it is not a power of another string (i.e., writingw =v k impliesk = 1. Conversely,w is asquare ifw =vv, withv a primitive string. A stringx issquare-free if it has no nonempty substring of the formww. It is shown that the square-freedom of a string ofn symbols over an arbitrary alphabet can be tested by a CRCW PRAM withn processors inO(logn) time and linear auxiliary space. If the cardinality of the input alphabet is bounded by a constant independent of the input size, then the number of processors can be reduced ton/logn without affecting the time complexity of this strategy. The fastest sequential algorithms solve this problemO(n logn) orO(n) time, depending on whether the cardinality of the input alphabet is unbounded or bounded, and either performance is known to be optimal within its class. More elaborate constructions lead to a CRCW PRAM algorithm for detecting, within the samen-processors bounds, all positioned squares inx in timeO(logn) and using linear auxiliary space. The fastest sequential algorithms solve this problem inO(n logn) time, and such a performance is known to be optimal.  相似文献   

13.
We provide a uniform framework for the study of index data structures for a two-dimensional matrixTEXT[1:n, 1:n] whose entries are drawn from an ordered alphabetΣ. An index forTEXTcan be informally seen as the two-dimensional analog of the suffix tree for a string. It allows on-line searches and statistics to be performed onTEXTby representing compactly theΘ(n3) square submatrices ofTEXTin optimalO(n2) space. We identify 4n−1families of indices forTEXT, each containing ∏ni=1 (2i−1)! isomorphic data structures. We also develop techniques leading to a single algorithm that efficiently builds any index in any family inO(n2 log n) time andO(n2) space. Such an algorithm improves in various respects the algorithms for the construction of the PAT tree and the Lsuffix tree. The framework and the algorithm easily generalize tod>2 dimensions. Moreover, as part of our algorithm, we provide new algorithmic tools that yield a space-efficient implementation of the “naming scheme” of R. Karpet al.(in“Proceedings, Fourth Symposium on Theory of Computing,” pp. 125–136) for strings and matrices.  相似文献   

14.
The problem of outputting all parse trees of a string accepted by a context-free grammar is considered. A systolic algorithms is presented that operates inO(m·n) time, wherem is the number of distinct parse trees andn is the length of the input. The systolic array usesn 2 processors, each of which requires at mostO(logn) bits of storage. This is much more space-efficient that a previously reported systolic algorithm for the same problem, which requiredO(n logn) space per processor. The algorithm also extends previous algorithms that only output a single parse tree of the input.Research squpported in part by NSF Grant DCR-8420935 and DCR-8604603.  相似文献   

15.
We consider the following problem: Given an unsorted array of n elements, and a sequence of intervals in the array, compute the median in each of the subarrays defined by the intervals. We describe a simple algorithm which needs O(nlogk+klogn) time to answer k such median queries. This improves previous algorithms by a logarithmic factor and matches a comparison lower bound for k=O(n). The space complexity of our simple algorithm is O(nlogn) in the pointer machine model, and O(n) in the RAM model. In the latter model, a more involved O(n) space data structure can be constructed in O(nlogn) time where the time per query is reduced to O(logn/loglogn). We also give efficient dynamic variants of both data structures, achieving O(log2n) query time using O(nlogn) space in the comparison model and O((logn/loglogn)2) query time using O(nlogn/loglogn) space in the RAM model, and show that in the cell-probe model, any data structure which supports updates in O(logO(1)n) time must have Ω(logn/loglogn) query time.Our approach naturally generalizes to higher-dimensional range median problems, where element positions and query ranges are multidimensional—it reduces a range median query to a logarithmic number of range counting queries.  相似文献   

16.
李肯立  赵欢  李仁发  李庆华 《软件学报》2007,18(6):1319-1327
将串行动态二表算法应用于并行三表算法的设计中,提出一种求解背包、精确的可满足性和集覆盖等背包类NP完全问题的并行三表六子表算法.基于EREW-PRAM模型,该算法可使用O(2n/8)的处理机在O(27n/16)的时间和O(213n/48)的空间求解n维背包类问题,其时间-空间-处理机折衷为O(25n/6).与现有文献的性能对比分析表明,该算法极大地提高了并行求解背包类问题的时间-空间-处理机折衷性能.由于该算法能够破解更高维数的背包类公钥和数字水印系统,其结论在密钥分析领域具有一定的理论和实际意义.  相似文献   

17.
String matching is the problem of finding all the occurrences of a pattern P in a text T, where P and T are strings over a finite alphabet Σ. A variable length don′t care is a special character, not belonging to Σ, which can match any string in Σ*. The string-matching problem with variable length don′t cares is an extension of the classical string-matching problem in which the pattern P may contain an arbitrary number of don′t cares. An efficient parallel algorithm is given for solving the string-matching problem with variable length don′t cares. The EREW PRAM model of parallel computer with scan operations is used to obtain an O(log n) running time using O(mn/log n) processors, where m and n are, respectively, the lengths of P and T. The proposed parallel algorithm has an Ω(1/log n) processor utilization, since the fastest serial algorithm known so far has an O(mn/log n) running time.  相似文献   

18.
Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications. Received: November 1, 2001 / Accepted: March 2, 2002 Published online: September 25, 2002  相似文献   

19.
We present an algorithm for computingL 1 shortest paths among polygonal obstacles in the plane. Our algorithm employs the “continuous Dijkstra” technique of propagating a “wavefront” and runs in timeO(E logn) and spaceO(E), wheren is the number of vertices of the obstacles andE is the number of “events.” By using bounds on the density of certain sparse binary matrices, we show thatE =O(n logn), implying that our algorithm is nearly optimal. We conjecture thatE =O(n), which would imply our algorithm to be optimal. Previous bounds for our problem were quadratic in time and space. Our algorithm generalizes to the case of fixed orientation metrics, yielding anO(n??1/2 log2 n) time andO(n??1/2) space approximation algorithm for finding Euclidean shortest paths among obstacles. The algorithm further generalizes to the case of many sources, allowing us to compute anL 1 Voronoi diagram for source points that lie among a collection of polygonal obstacles.  相似文献   

20.
A tandem repeat (or square) is a string αα, where α is a non-empty string. We present an O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint in T(S) of every tandem repeat that occurs in S. This decorated suffix tree implicitly represents all occurrences of tandem repeats in S, and can be used to efficiently solve many questions concerning tandem repeats and tandem arrays in S. This improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号