Self-indexing Based on LZ77

  • Sebastian Kreft
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6661)

Abstract

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible, but classical self-indexes fail to capture that source of compressibility. In practice, our self-index takes a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1–2 million characters of the text per second, and finds patterns at a rate of 10–50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
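
To illustrate why highly repetitive collections compress so well under LZ77, the following is a minimal sketch of a textbook greedy LZ77 parse, where each phrase copies the longest earlier match and then adds one explicit character. It is not the authors' implementation or index: the function names `lz77_parse` and `lz77_decode` are hypothetical, and the quadratic search is chosen only for readability.

```python
def lz77_parse(text):
    """Greedy LZ77-style parse (illustrative only, quadratic time).

    Returns a list of phrases (source_pos, length, next_char): copy
    `length` characters starting at `source_pos`, then append `next_char`.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, 0
        # Try every earlier starting position; a copy may overlap position i.
        for src in range(i):
            length = 0
            # Stop one character early so an explicit trailing symbol remains.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # valid even when source and target overlap
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "abc" * 100 + "$"
    parsed = lz77_parse(repetitive)
    print(len(repetitive), "characters ->", len(parsed), "phrases")
    assert lz77_decode(parsed) == repetitive
```

Under these assumptions, a text made of 100 copies of "abc" followed by a terminator parses into only 4 phrases, because a single phrase can copy almost the whole repetition from the start of the text. The number of phrases, rather than the text length, is the quantity that the abstract's space comparison against LZ77-compressed text refers to.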

Keywords

Binary Search; Text Collection; Phrase Boundary; Software Repository; Reverse Trie
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sebastian Kreft (1)
  • Gonzalo Navarro (1)
  1. Dept. of Computer Science, University of Chile, Santiago, Chile
