Algorithmica, Volume 80, Issue 7, pp 2012–2047

Lempel–Ziv-78 Compressed String Dictionaries

  • Julian Arz
  • Johannes Fischer
Part of the following topical collections:
  1. Special Issue on Compact Data Structures

Abstract

String dictionaries store a collection \((s_i)_{0 \le i < m}\) of m variable-length keys (strings) over an alphabet \(\varSigma\) and support the operations lookup (given a string \(s \in \varSigma^*\), decide whether \(s_i = s\) for some i, and if so return this i) and access (given an integer \(0 \le i < m\), return the string \(s_i\)). We show how to modify the Lempel–Ziv-78 data compression algorithm to store the strings space-efficiently and support the operations lookup and access in optimal time. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. The compression ratios we achieve often outperform the existing alternatives, especially on dictionaries containing many repeated substrings, while our query times remain competitive.
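
To make the two operations and the underlying compressor concrete, the following is a minimal sketch and not the data structure of the article. It assumes an uncompressed baseline (a sorted list searched with binary search) for lookup and access, and a textbook LZ78 factorization into (phrase id, character) pairs. The names `NaiveStringDictionary`, `lz78_parse`, and `lz78_decode` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


class NaiveStringDictionary:
    """Uncompressed baseline (an assumption, not the paper's structure)."""

    def __init__(self, keys):
        # Store the distinct keys in sorted order; index i identifies s_i.
        self.keys = sorted(set(keys))

    def lookup(self, s):
        """Return i with s_i == s, or None if s is not stored."""
        i = bisect_left(self.keys, s)
        if i < len(self.keys) and self.keys[i] == s:
            return i
        return None

    def access(self, i):
        """Return the string s_i for 0 <= i < m."""
        return self.keys[i]


def lz78_parse(text):
    """Textbook LZ78 factorization into (phrase_id, char) pairs.

    Phrase id 0 denotes the empty phrase; phrase k (k >= 1) is the
    phrase with id phrase_id extended by char.
    """
    trie = {}      # (parent phrase id, char) -> phrase id
    phrases = []
    node = 0       # current position in the phrase trie (0 = root)
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the match
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append((node, ch))        # emit a new phrase
            node = 0
    if node != 0:
        # The text ended inside a known phrase; emit it with no new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the (phrase_id, char) pairs."""
    decoded = []   # decoded[k - 1] is the string of phrase k
    for parent, ch in phrases:
        prefix = decoded[parent - 1] if parent > 0 else ""
        decoded.append(prefix + ch)
    return "".join(decoded)
```

In this sketch, text with many repeated substrings is factorized into fewer phrases than it has characters, which is the kind of redundancy the abstract refers to. How the article combines the LZ78 phrases with lookup and access is not shown here.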

Keywords

Data structures, Compression, Strings, Dictionaries, Searching

Notes

Acknowledgements

Many people helped to improve this article in different ways. First, we thank Giuseppe Ottaviano for providing his data sets, and Francisco Claude and Miguel Ángel Martínez-Prieto for the source code of their implementations. Second, we thank Paweł Gawrychowski for interesting discussions on this topic, and Giuseppe Ottaviano, Rossano Venturini, and Gonzalo Navarro for pointing out the work by Russo and Oliveira [31] during the Dagstuhl Seminar 13232 “Indexes and Computation over Compressed Structured Data” [24]. Gonzalo Navarro also brought Lemma 2.3 from Kosaraju and Manzini [22] to our attention. We further thank Simon Gog for bringing [36] to our attention, and the anonymous reviewers for their comments that helped to improve this article.

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
  2. Arroyuelo, D., Navarro, G.: Space-efficient construction of Lempel–Ziv compressed text indexes. Inf. Comput. 209(7), 1070–1102 (2011)
  3. Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel–Ziv based compressed text indexing. Algorithmica 62(1–2), 54–101 (2012)
  4. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: Proceedings of the DCC, pp. 322–331. IEEE Press (2014)
  5. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), 31 (2015)
  6. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Pract. Exp. 34(8), 711–726 (2004)
  7. Böttcher, S., Lohrey, M., Maneth, S., Rytter, W. (eds.): Abstracts collection—structure-based compression of complex massive data. No. 08261 in Dagstuhl Seminar Proceedings, Schloss Dagstuhl—Leibniz-Zentrum für Informatik, Germany (2008)
  8. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Proceedings of the 10th International Symposium on Experimental Algorithms (SEA 2011), Springer, Lecture Notes in Computer Science, vol. 6630, pp. 136–147 (2011)
  9. Clark, D.R.: Compact Pat Trees. PhD thesis, Waterloo, ON, Canada (1998)
  10. Ferragina, P., Venturini, R.: Compressed permuterm index. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, pp. 535–542 (2007)
  11. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 160–171 (2015)
  12. Fischer, J., I, T., Köppl, D.: Lempel Ziv computation in small space (LZ-CISS). In: Proceedings of the CPM, Springer, LNCS, vol. 9133, pp. 172–184 (2015)
  13. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)
  14. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Proceedings of the SEA, Springer, LNCS, vol. 8504, pp. 326–337 (2014)
  15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19, 3–4 (2014)
  16. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), ACM/SIAM, pp. 841–850 (2003)
  17. Hu, T.C., Tucker, A.C.: Optimal computer search trees and variable-length alphabetical codes. SIAM J. Appl. Math. 21(4), 514–532 (1971)
  18. Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)
  19. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS 1989), IEEE Computer Society, pp. 549–554 (1989)
  20. Jansson, J., Sadakane, K., Sung, W.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)
  21. Knuth, D.E.: Sorting and Searching, The Art of Computer Programming, vol. 3, 2nd edn. Addison Wesley, Reading (1998)
  22. Kosaraju, S.R., Manzini, G.: Compression of low entropy strings with Lempel–Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (1999)
  23. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proceedings of the Data Compression Conference (DCC 1999), IEEE Computer Society, pp. 296–305 (1999)
  24. Maneth, S., Navarro, G.: Indexes and computation over compressed structured data (Dagstuhl Seminar 13232). Dagstuhl Rep. 3(6), 22–37 (2013)
  25. Martínez-Prieto, M.A., Brisaboa, N.R., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56, 73–108 (2016)
  26. Mehlhorn, K., Sanders, P.: Algorithms and Data Structures: The Basic Toolbox. Springer, Berlin (2008)
  27. Müller, I., Ratsch, C., Färber, F.: Adaptive string dictionary compression in in-memory column-store database systems. In: Proceedings of the 17th International Conference on Extending Database Technology (EDBT), OpenProceedings.org, pp. 283–294 (2014)
  28. Munro, J.I.: Tables. In: Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 1996), Springer, Lecture Notes in Computer Science, vol. 1180, pp. 37–42 (1996)
  29. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007)
  30. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Proceedings of the SEA, Springer, LNCS, vol. 7276, pp. 295–306 (2012)
  31. Russo, L.M.S., Oliveira, A.L.: A compressed self-index using a Ziv–Lempel dictionary. Inf. Retr. 11(4), 359–388 (2008)
  32. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)
  33. Vigna, S.: Broadword implementation of rank/select queries. In: Proceedings of the 7th International Workshop on Experimental Algorithms (WEA 2008), Springer, Lecture Notes in Computer Science, vol. 5038, pp. 154–168 (2008)
  34. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)
  35. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco (1999)
  36. Yata, S.: Dictionary compression using nested prefix/Patricia tries (in Japanese). In: Proceedings of the 17th Annual Meeting on Natural Language Processing (NLP2011), pp. 576–578 (2011). http://www.anlp.jp/proceedings/annual_meeting/2011/pdf_dir/F2-6.pdf
  37. Zhou, D., Andersen, D.G., Kaminsky, M.: Space-efficient, high-performance rank and select structures on uncompressed bit sequences. In: Proceedings of the SEA, Springer, LNCS, vol. 7933, pp. 151–163 (2013)
  38. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Department of Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
  2. Department of Computer Science, TU Dortmund, Dortmund, Germany
