首页 | 本学科首页   官方微博 | 高级检索  
     


Stronger Lempel-Ziv Based Compressed Text Indexing
Authors:Diego Arroyuelo  Gonzalo Navarro  Kunihiko Sadakane
Affiliation:1. Yahoo! Research Latin America, Chile, Blanco Encalada 2120, Santiago, Chile
2. Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile
3. Principles of Informatics Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Abstract:Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH k (T)+o(ulog?σ) bits of space, where H k (T) denotes the k-th order empirical entropy of T, for any k=o(log? σ u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m 3log?σ+(m+occ)log?u) worst-case time. Although this index has proven very competitive in practice, the O(m 3log?σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+ε)uH k (T)+o(ulog?σ) bits of space, for any constant ε>0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m 2+(m+occ)log?u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length ? in optimal O(?/log? σ u) time. In addition, we show how the space can be squeezed to (1+ε)uH k (T)+o(ulog?σ) to obtain a structure with O(m 2) average search time for m≥2log? σ u. Alternatively, the search time of LZ-indices can be improved to O((m+occ)log?u) with (3+ε)uH k (T)+o(ulog?σ) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号