Self-indexing Based on LZ77

Kreft, Sebastian; Navarro, Gonzalo

doi:10.1007/978-3-642-21458-5_6


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6661)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching


Abstract

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1–2 million characters of the text per second, and finds patterns at a rate of 10–50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
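To make the role of LZ77 concrete, the following is a minimal sketch of a textbook LZ77 factorization, not the authors' index or their implementation. The function names `lz77_parse` and `lz77_decode` are illustrative assumptions, and the naive search is far too slow for real collections. It only shows why repetitive text yields very few phrases, which is the source of compressibility the abstract refers to.

```python
def lz77_parse(text):
    """Naive LZ77 factorization (illustration only, not efficient).

    Each phrase is (source_start, length, next_char): copy `length`
    characters starting at `source_start` in the text seen so far
    (the copy may overlap the phrase itself), then append `next_char`.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        for src in range(i):
            length = 0
            # Stop one character early so a trailing literal always remains.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # character-by-character copy allows overlap
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "abc" * 50
    phrases = lz77_parse(repetitive)
    print(len(repetitive), "characters ->", len(phrases), "phrases")
    assert lz77_decode(phrases) == repetitive
```

In this sketch a 150-character string made of repeated copies of "abc" collapses into a handful of phrases, whereas a text without repetitions would need roughly one phrase per few characters.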

Partially funded by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile, and, for the first author, by Conicyt’s Master Scholarship.



Author information

Authors and Affiliations

Dept. of Computer Science, University of Chile, Santiago, Chile
Sebastian Kreft & Gonzalo Navarro


Editor information

Editors and Affiliations

Department of Mathematics, Università degli Studi di Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Department of Computer Science, University of ’Piemonte Orientale’, Viale T. Michel 11, 15121, Alessandria, Italy
Giovanni Manzini


Cite this paper

Kreft, S., Navarro, G. (2011). Self-indexing Based on LZ77. In: Giancarlo, R., Manzini, G. (eds) Combinatorial Pattern Matching. CPM 2011. Lecture Notes in Computer Science, vol 6661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21458-5_6


DOI: https://doi.org/10.1007/978-3-642-21458-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21457-8
Online ISBN: 978-3-642-21458-5
eBook Packages: Computer Science, Computer Science (R0)
