Lightweight Lempel-Ziv Parsing

  • Juha Kärkkäinen
  • Dominik Kempa
  • Simon J. Puglisi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7933)

Abstract

We introduce a new approach to LZ77 factorization that uses \(O(n/d)\) words of working space and \(O(dn)\) time for any \(d \ge 1\) (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior, and particularly so at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
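
For readers unfamiliar with the output being computed, the following is a minimal sketch of LZ77 factorization written directly from the textbook definition. It is not the algorithm introduced in this paper: it is a naive scan with worst-case time well above linear, and it makes no attempt at the \(O(n/d)\)-space behaviour described above. The function names `lz77_factorize` and `lz77_decode` and the phrase encoding (a `(character, 0)` pair for a first occurrence, otherwise a `(source position, length)` pair) are assumptions chosen for illustration.

```python
def lz77_factorize(text):
    """Naive LZ77 factorization from the definition (illustration only).

    Each phrase starting at position i is either a single character that has
    not occurred earlier, encoded as (char, 0), or the longest prefix of
    text[i:] that also starts at some earlier position j < i, encoded as
    (j, length). The earlier occurrence may overlap the phrase itself.
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position and keep the longest match.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append((text[i], 0))   # first occurrence of this character
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_factorize."""
    out = []
    for first, length in phrases:
        if length == 0:
            out.append(first)
        else:
            # Copy one character at a time so self-overlapping sources work.
            for k in range(length):
                out.append(out[first + k])
    return "".join(out)


if __name__ == "__main__":
    print(lz77_factorize("zzzzzapzap"))
```

On the input `"zzzzzapzap"` this sketch yields the phrases `z`, `zzzz`, `a`, `p`, `zap`. The second phrase copies from position 0 and overlaps itself, which is why the decoder copies character by character.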

Keywords

Peak Memory; Left Extension; Large Data Structure; Increase Scanning Speed; Repetitive Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Juha Kärkkäinen (1)
  • Dominik Kempa (1)
  • Simon J. Puglisi (1)

  1. Department of Computer Science, University of Helsinki, Helsinki, Finland