Similar Articles
20 similar articles found (search time: 62 ms)
1.
M. Farach, M. Thorup. Algorithmica, 1998, 20(4): 388-404
String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel–Ziv (LZ77) algorithm. Intuitively, once a string has been compressed, and its repetitive nature thereby elucidated, one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per-bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per-bit entropy, while it was recently shown that LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log²(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare with the naïve "decompression" algorithm, which takes time Θ(u + p) to decide if P occurs in T. Writing u + p as n·(u/n) + p, we see that we have improved the complexity, replacing the compression factor u/n by a factor log²(u/n). Our algorithm is competitive in the sense that O(n log²(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log²(u/n) + p) = o(u + p) if n = o(u) and p = o(u). Received December 20, 1995; revised October 29, 1996, and February 6, 1997.
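To make the baseline in this abstract concrete, here is a minimal Python sketch of the naïve Θ(u + p) decompress-then-search approach; the (offset, length, next-character) triple representation of the LZ77 parse is an assumption for illustration, not the paper's encoding.

```python
# Minimal sketch of the naive "decompress-then-search" baseline (Theta(u + p)).
# The parse Z is assumed to be a list of (offset, length, next_char) triples,
# a common textbook representation of LZ77 output.

def lz77_decompress(Z):
    """Rebuild the text T from an LZ77 parse given as (offset, length, char) triples."""
    out = []
    for offset, length, char in Z:
        start = len(out) - offset
        for i in range(length):          # copies may self-overlap, so copy one symbol at a time
            out.append(out[start + i])
        if char is not None:
            out.append(char)
    return "".join(out)

def occurs_naive(P, Z):
    """Decide whether pattern P occurs in T by fully decompressing first."""
    return P in lz77_decompress(Z)

# Example: Z encodes T = "abababab"
Z = [(0, 0, "a"), (0, 0, "b"), (2, 5, "b")]
assert lz77_decompress(Z) == "abababab"
assert occurs_naive("baba", Z)
```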

2.
Wei Jun Liu, Ge Nong, Wai Hong Chan, Yi Wu. Software, 2016, 46(9): 1201-1217
Computing the Lempel–Ziv factorization (LZ77) of a string is a key step in many applications. However, at the same time, it constitutes a bottleneck of the entire computation. The investigation of time- and space-efficient computation of the LZ77 has become an important topic. In this paper, we present a lightweight linear-time algorithm called LZone for computing the LZ77, designed by improving the existing linear-time, space-efficient LZ77 algorithm BGone for greater speed. For an input string T[1..n] over an alphabet of constant size O(1), LZone requires only n words of workspace in addition to the input string and the output factorization, with ⌈log n⌉ bits per word. This is the same space requirement as for the algorithm BGone. LZone has two versions, LZoneT and LZoneSA, corresponding to BGoneT and BGoneSA, respectively. Our experimental results show that for computing the LZ77 from an input string T, LZoneT and LZoneSA run around 26% and 57% faster, respectively, than their counterparts in BGone. Moreover, for computing the LZ77 from the suffix array of T, the speed of LZoneSA is on average twice that of BGoneSA. Copyright © 2015 John Wiley & Sons, Ltd.
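For reference only, a brute-force (quadratic-time) computation of the LZ77 factorization might look like the sketch below; it illustrates the kind of output LZone/BGone produce, not their lightweight linear-time construction, and the literal/(position, length) factor format is an assumption.

```python
def lz77_factorize(T):
    """Brute-force LZ77 factorization: each factor is either a literal character
    or the longest prefix of the remaining suffix that already occurs earlier in T.
    Quadratic time -- for illustrating the output only, not the LZone/BGone approach."""
    factors, i, n = [], 0, len(T)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):                        # candidate earlier start positions
            l = 0
            while i + l < n and T[j + l] == T[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        if best_len == 0:
            factors.append(T[i])                  # literal factor
            i += 1
        else:
            factors.append((best_pos, best_len))  # (earlier position, length)
            i += best_len
    return factors

print(lz77_factorize("abababab"))   # ['a', 'b', (0, 6)]
```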

3.
A classical measure of similarity between strings is the length of the longest common subsequence (LCS) between the two given strings. The search for efficient algorithms for finding the LCS has been going on for more than three decades. To date, all known algorithms may take quadratic time (shaved by logarithmic factors) to find large LCS. In this paper, the problem of approximating LCS is studied, while focusing on the hard inputs for this problem, namely, approximating LCS of near-linear size in strings over a relatively large alphabet (of size at least n^ε for some constant ε > 0, where n is the length of the string). We show that any given string over a relatively large alphabet can be embedded into a locally non-repetitive string. This embedding has a negligible additive distortion for strings that are not too dissimilar in terms of the edit distance. We also show that LCS can be efficiently approximated in locally non-repetitive strings. Our new method (the embedding together with the approximation algorithm) gives a strictly sub-quadratic time algorithm (i.e., of complexity O(n^(2−ε)) for some constant ε > 0) which can find common subsequences of linear (and near-linear) size that cannot be detected efficiently by the existing tools.

4.
In parameterized string matching the pattern P matches a substring t of the text T if there exists a bijective mapping from the symbols of P to the symbols of t. We give simple and practical algorithms for finding all such pattern occurrences in sublinear time on average. The algorithms work for both single and multiple patterns.
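A parameterized match can be checked directly from the definition; the following naive Python sketch (not the sublinear average-time algorithms of the paper) illustrates the bijection test.

```python
def p_match(P, t):
    """Return True if P parameterized-matches t, i.e. some bijection maps
    the symbols of P onto the symbols of t position by position."""
    if len(P) != len(t):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(P, t):
        if fwd.setdefault(a, b) != b:     # a already mapped to a different symbol
            return False
        if bwd.setdefault(b, a) != a:     # b already the image of a different symbol
            return False
    return True

def p_occurrences(P, T):
    """All positions where P parameterized-matches a substring of T (naive scan)."""
    return [i for i in range(len(T) - len(P) + 1) if p_match(P, T[i:i + len(P)])]

print(p_occurrences("aba", "xyxzy"))   # [0], via the bijection x->a, y->b
```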

5.
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.
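The quadratic dynamic program referred to above is the textbook one; a compact sketch with unit costs (the paper handles general rational scoring functions and compressed inputs):

```python
def edit_distance(a, b):
    """Classic O(|a|*|b|) dynamic program with unit insert/delete/substitute costs."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # distances against the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete a[i-1]
                         cur[j - 1] + 1,     # insert b[j-1]
                         prev[j - 1] + cost) # substitute / match
        prev = cur
    return prev[n]

assert edit_distance("kitten", "sitting") == 3
```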

6.
A General-Purpose Data Compression Algorithm Based on String Matching   (Cited by: 1; self-citations: 0; other citations: 1)
This paper presents the principles of a data compression algorithm based on string matching that improves on the well-known LZ77 algorithm in several respects. The working buffer used by our algorithm is a circular history table, and the look-ahead buffer for input symbols is discarded. As a result, matching proceeds as input is received, without waiting for a block of input data to fill a look-ahead buffer, and the maximum match length is no longer limited by the size of a look-ahead buffer; moreover, the large amount of work spent shifting the working buffer is no longer needed. Further improvements include replacing fixed-length compression codes with variable-length codes and introducing match…

7.
We study the parallel complexity of a bounded size dictionary version (LRU deletion heuristic) of the LZ2 compression algorithm. The unbounded version was shown to be P-complete. When the size of the dictionary is O(log^k n), the problem of computing the LZ2 compression is shown to be hard for the class of problems solvable simultaneously in polynomial time and O(log^k n) space (that is, SC^k). We also introduce a variation of this heuristic that turns out to be an SC^k-complete problem (the original heuristic belongs to SC^(k+1)). By virtue of these results, we argue that there are no practical parallel algorithms for LZ2 compression with the LRU deletion heuristic or any other heuristic that deletes dictionary elements in a continuous way. For simpler heuristics (SWAP, RESTART, FREEZE), practical parallel algorithms are given.

8.
A General-Purpose Lossless Compression Algorithm Based on an Adaptive Dictionary   (Cited by: 6; self-citations: 2; other citations: 4)
This paper examines the LZ77 and LZ78 algorithms in depth and proposes an improved LZ algorithm, LZI (LZ Improved), a hybrid algorithm based on LZ78 and LZ77. LZI has computational and storage complexity similar to that of LZ78 and LZ77. Experimental results show that LZI offers better global and local adaptivity and higher compression efficiency.

9.
Approximating points by piecewise linear functions is an intensively researched topic in computational geometry. In this paper, we study, based on the uniform error metric, an array of variations of this problem in 2-D and 3-D, including points with weights, approximation with violations, using step functions or more generally piecewise linear functions. We consider both the min-# (i.e., given an error tolerance ε, minimizing the size k of the approximating function) and min-ε (i.e., given a size k of the approximating function, minimizing the error tolerance ε) versions of the problems. Our algorithms either improve on the previously best-known solutions or are the first known results for the respective problems. Our approaches are based on interesting geometric observations and algorithmic techniques. Some data structures we develop are of independent interest and may find other applications.
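As one concrete instance of the min-# flavour, unweighted 2-D points approximated by a step function under the uniform metric can be handled by a natural greedy sweep; the sketch below illustrates only that special case, not the weighted, violation-tolerant, or 3-D variants studied in the paper, and the (x_start, x_end, level) output format is an assumption.

```python
def min_steps(points, eps):
    """Greedy min-# sketch for approximating 2-D points (sorted by x) with a step
    function under the uniform (L-infinity) error metric: extend the current step
    while the y-range of the covered points stays within 2*eps, then start a new
    step. Returns a list of (x_start, x_end, level) pieces."""
    pieces, i, n = [], 0, len(points)
    while i < n:
        lo = hi = points[i][1]
        j = i
        while j + 1 < n:
            y = points[j + 1][1]
            if max(hi, y) - min(lo, y) <= 2 * eps:
                lo, hi = min(lo, y), max(hi, y)
                j += 1
            else:
                break
        pieces.append((points[i][0], points[j][0], (lo + hi) / 2))  # midrange level
        i = j + 1
    return pieces

pts = [(0, 1.0), (1, 1.3), (2, 0.9), (3, 5.0), (4, 5.2)]
print(min_steps(pts, 0.25))   # two pieces: one near y = 1.1, one near y = 5.1
```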

10.
Combined Use of LZSS and LZW in Text Compression   (Cited by: 3; self-citations: 0; other citations: 3)
This paper analyzes the respective strengths and weaknesses of LZ77 and LZ78 in text compression. Building on LZSSCH and LZWCH, Chinese-text variants of their practical implementations LZSS and LZW, it designs the LZSWCH algorithm, which combines the principles of LZ77 and LZ78. The algorithm is general and suitable for real-time use; on nine sample text files of various lengths it achieves compression ratios higher than both LZSS and LZW, by margins of 6-19%. The algorithm requires no preprocessing and can also be used to compress text files in other languages.

11.
Gonzalo Navarro, Jorma Tarhio. Software, 2005, 35(12): 1107-1130
We present a Boyer–Moore (BM) approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the BM approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, adapting the basic technique so that the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern searches. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress-then-search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep-like capabilities directly searching files compressed using Unix's Compress, an LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress-then-search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them. Copyright © 2005 John Wiley & Sons, Ltd.

12.
Parallel Computing, 2007, 33(10-11): 720-740
The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library.

List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.

Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.
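Independently of the Cell/B.E. port, the zlib kernels named above (LZ77 longest-match search followed by Huffman coding) are what the standard zlib API wraps; a minimal round-trip sketch using Python's built-in zlib module:

```python
import zlib

data = b"abracadabra " * 1000           # repetitive input, so LZ77 matching pays off

compressed = zlib.compress(data, 6)     # DEFLATE = LZ77 matching + Huffman coding, level 6
restored = zlib.decompress(compressed)

assert restored == data
print(len(data), "->", len(compressed), "bytes")
```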

13.
Many efficient string matching algorithms make use of q-grams and process the text in windows which are read backward. In this paper we provide a framework for analyzing the average case complexity of these algorithms taking into account the statistical dependencies between overlapping q-grams. We apply this to the q-gram Boyer–Moore–Horspool algorithm adapted to various string matching problems and show that the algorithm is optimal on average.
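For reference, the plain (1-gram) Boyer–Moore–Horspool skip loop that the q-gram analysis generalizes can be sketched as follows; this is the textbook algorithm, not the adapted variants analyzed in the paper.

```python
def horspool(P, T):
    """Boyer-Moore-Horspool: slide a window of length |P| over T and, after each
    attempt, shift by the precomputed skip of the character aligned with the
    window's end. (The paper analyzes the variant whose skip table is indexed
    by q-grams rather than single characters.)"""
    m, n = len(P), len(T)
    if m == 0 or m > n:
        return []
    skip = {P[i]: m - 1 - i for i in range(m - 1)}   # last occurrence in P[0..m-2] wins
    occ, i = [], 0
    while i <= n - m:
        if T[i:i + m] == P:
            occ.append(i)
        # characters not occurring in P[0..m-2] allow a full shift of m
        i += skip.get(T[i + m - 1], m)
    return occ

print(horspool("abab", "abcababab"))   # [3, 5]
```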

14.
We consider schemes for enacting task share changes (a process called reweighting) on real-time multiprocessor platforms. Our particular focus is reweighting schemes that are deployed in environments in which tasks may frequently request significant share changes. Prior work has shown that fair scheduling algorithms are capable of reweighting tasks with minimal allocation error and that partitioning-based scheduling algorithms can reweight tasks with better average-case performance, but greater error. However, preemption and migration overheads can be high in fair schemes. In this paper, we consider the question of whether non-fair, earliest-deadline-first (EDF) global scheduling techniques can improve the accuracy of reweighting relative to partitioning-based schemes and provide improved average-case performance relative to fair-scheduled systems. Our conclusion is that, for soft real-time systems, global EDF schemes provide a good mix of accuracy and average-case performance.
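As background, the selection step of a global EDF scheduler simply runs the pending jobs with the earliest deadlines; the sketch below shows only that step, with illustrative job tuples, and does not reproduce the paper's reweighting rules.

```python
import heapq

def edf_pick(jobs, m):
    """Global EDF selection step: from the pending jobs, choose the m with the
    earliest deadlines to run on the m processors. Each job is (deadline, name)."""
    return heapq.nsmallest(m, jobs)      # ties broken by name, arbitrarily

pending = [(12, "video"), (5, "audio"), (9, "sensor"), (30, "logging")]
print(edf_pick(pending, 2))              # [(5, 'audio'), (9, 'sensor')]
```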

15.
The LZ2 compression method is hardly parallelizable since it is known to be P-complete. In spite of this negative result, we show in this paper that the decoding process can be parallelized efficiently on an EREW PRAM model of computation with O(n/log n) processors and O(log² n) time, where n is the length of the output string.
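The sequential decoder being parallelized is straightforward, since each LZ2/LZ78 codeword names an earlier dictionary phrase plus one fresh character; a sketch of that sequential version (the (index, character) codeword format is the textbook one, and the EREW PRAM construction itself is not reproduced here):

```python
def lz78_decode(codewords):
    """Sequential LZ78/LZ2 decoding: codeword (i, c) stands for 'phrase i followed
    by character c', where phrase 0 is the empty string. Each decoded phrase is
    appended to the dictionary so later codewords can reference it."""
    phrases = [""]                      # phrase 0 = empty string
    out = []
    for i, c in codewords:
        phrase = phrases[i] + c
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

print(lz78_decode([(0, "a"), (0, "b"), (1, "b"), (3, "a")]))   # "abababa"
```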

16.
17.
We consider the damped second-order system Mu'' + Cu' + Ku = F(t) (M, C, K symmetric, positive definite n × n matrices) by conversion to an equivalent first-order system

    [ I  0 ]  d  [ u  ]     [  0   I ] [ u  ]     [  0   ]
    [ 0  M ]  dt [ u' ]  =  [ -K  -C ] [ u' ]  +  [ F(t) ]

We demonstrate that an algorithm proposed by Fairweather for the implementation of the (2, 2) Padé approximation of the exponential matrix for approximating the solution of homogeneous first-order systems extends advantageously to this case, yielding an unconditionally stable fourth-order scheme with the feature that the approximating equations decouple. As a result we are required only to solve one symmetric complex n × n system of linear algebraic equations at each time step, with a fixed matrix which may be decomposed into triangular factors at the outset. We also consider iterative schemes involving only real, positive definite, symmetric n × n matrices. Numerical results are included.
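A small numerical sketch of that conversion, using arbitrary example matrices rather than any problem from the paper; the explicit Euler step at the end is only to show the first-order system in use and is not the paper's Padé-based scheme.

```python
import numpy as np

# Example symmetric positive definite M, C, K (2 degrees of freedom), chosen arbitrarily.
M = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[0.3, 0.1], [0.1, 0.2]])
K = np.array([[4.0, -1.0], [-1.0, 3.0]])
n = M.shape[0]
I, Z = np.eye(n), np.zeros((n, n))

# First-order form:  E y' = A y + g(t),  with y = [u, u'] stacked.
E = np.block([[I, Z], [Z, M]])
A = np.block([[Z, I], [-K, -C]])

def F(t):
    return np.array([np.sin(t), 0.0])   # illustrative forcing term

def g(t):
    return np.concatenate([np.zeros(n), F(t)])

# One explicit Euler step, just to exercise the converted system.
y, t, dt = np.zeros(2 * n), 0.0, 0.01
y = y + dt * np.linalg.solve(E, A @ y + g(t))
print(y)
```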

18.
In this paper, we present and analyze a new set of low-rank recovery algorithms for linear inverse problems within the class of hard thresholding methods. We provide strategies on how to set up these algorithms via basic ingredients for different configurations to achieve complexity vs. accuracy tradeoffs. Moreover, we study acceleration schemes via memory-based techniques and randomized, ε-approximate matrix projections to decrease the computational costs in the recovery process. For most of the configurations, we present theoretical analysis that guarantees convergence under mild problem conditions. Simulation results demonstrate notable performance improvements as compared to state-of-the-art algorithms both in terms of reconstruction accuracy and computational complexity.
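The shared ingredient of hard thresholding methods for low-rank recovery is the projection onto rank-r matrices via a truncated SVD; a minimal numpy sketch of that step only (the acceleration, memory-based, and ε-approximate projection variants discussed above are not shown):

```python
import numpy as np

def hard_threshold_rank(X, r):
    """Project X onto the set of matrices of rank at most r by keeping only the
    r largest singular values (the 'hard thresholding' step of IHT-type methods)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt          # U * s scales the columns of U by the kept singular values

rng = np.random.default_rng(0)
L = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 20))   # exact rank-4 matrix
noisy = L + 0.01 * rng.standard_normal(L.shape)
print(np.linalg.norm(hard_threshold_rank(noisy, 4) - L))           # small residual
```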

19.
In various information processing tasks, obtaining regularized versions of noisy or corrupted image data is often a prerequisite for successful use of classical image analysis algorithms. Image restoration and decomposition methods need to be robust if they are to be useful in practice. In particular, this property has to be verified in engineering and scientific applications. By robustness, we mean that the performance of an algorithm should not be affected significantly by small deviations from the assumed model. In image processing, total variation (TV) is a powerful tool to increase robustness. In this paper, we define several concepts that are useful in robust restoration and robust decomposition. We propose two extended total variation models, weighted total variation (WTV) and extended total variation (ETV), and state generic approaches. The idea is to replace the TV penalty term with more general terms. The motivation is to increase the robustness of the ROF (Rudin, Osher, Fatemi) model and to prevent the staircasing effect caused by this method. Moreover, by rewriting the non-convex sublinear regularizing terms as WTV, we provide a new approach to performing minimization via the well-known Chambolle algorithm. The implementation is then more straightforward than the half-quadratic algorithm. The behavior of image decomposition methods is also a challenging problem, which is closely related to anisotropic diffusion. ETV leads to an anisotropic decomposition close to edges, improving robustness. It makes it possible to respect desired geometric properties during restoration and to control the regularization process more precisely. We also discuss why compression algorithms can be an objective way to evaluate image decomposition quality.
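For reference, the discrete isotropic total variation term underlying ROF and the weighted variant above can be computed with forward differences; in this sketch the optional per-pixel weight map is a simple stand-in for the WTV idea, not the paper's exact formulation.

```python
import numpy as np

def total_variation(u, w=None):
    """Discrete isotropic TV of image u using forward differences; an optional
    per-pixel weight map w turns this into a weighted-TV-style term."""
    dx = np.diff(u, axis=1, append=u[:, -1:])   # horizontal forward differences
    dy = np.diff(u, axis=0, append=u[-1:, :])   # vertical forward differences
    grad_mag = np.sqrt(dx ** 2 + dy ** 2)
    if w is not None:
        grad_mag = w * grad_mag
    return grad_mag.sum()

img = np.zeros((8, 8))
img[:, 4:] = 1.0                                # a single vertical edge
print(total_variation(img))                     # 8.0: edge length times jump height
```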

20.
Finite-state automata are important components in information retrieval and natural language processing software. A recursive automaton is the most compact representation of acyclic deterministic finite-state automata. It is based on merging not only the equivalent states but also the identical substructures in an automaton. The LZ trie variant is the state of the art in automata compression with respect to space, but the time needed for its construction was, until now, quadratic, which has made it impractical for large inputs. In this paper, we present the first algorithm for LZ trie construction that runs in effectively linear time, thereby making it an attractive choice for finite-state automata implementation. We achieve this goal by adding a new functionality to the enhanced suffix array data structure. We present two variants of the construction procedure: an optimal method regarding the final size and a method that sacrifices some compression for low intermediate memory usage. We have made the implementation of our algorithms available in an open source software package, LzLex. Copyright © 2014 John Wiley & Sons, Ltd.
