Compressed String Dictionaries

  • Conference paper
Experimental Algorithms (SEA 2011)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6630))

Abstract

The problem of storing a set of strings, known as a string dictionary, in compact form arises naturally in many settings. Classically, the dictionary has represented a small part of the whole data to be processed (e.g., in natural language processing or when indexing text collections). Recent applications in Web engines, RDF graphs, Bioinformatics, and many others, however, handle very large string dictionaries whose size is a significant fraction of the whole data. Thus, efficient approaches to compress them are necessary. In this paper we empirically compare the time and space performance of some existing alternatives, as well as new ones we propose. We show that the space can be reduced to as little as 20% of the original size of the strings while supporting dictionary searches within a few microseconds, and to as little as 10% with searches taking a few tens or hundreds of microseconds.
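
To make the notion of a compressed string dictionary concrete, the following is a minimal sketch of one classic technique, bucketed front coding of a sorted string set. It is an illustration only and is not taken from the abstract, which does not name the methods compared. The class name `FrontCodedDictionary`, the methods `locate` and `extract`, and the `bucket_size` parameter are hypothetical names chosen for this example.

```python
from bisect import bisect_right


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class FrontCodedDictionary:
    """Illustrative sketch (hypothetical names, not the paper's code).

    Strings are sorted and split into buckets. The first string of each
    bucket is stored in full; every other string is stored as
    (shared prefix length with its predecessor, remaining suffix).
    """

    def __init__(self, strings, bucket_size=4):
        words = sorted(set(strings))
        self.size = len(words)
        self.bucket_size = bucket_size
        self.headers = []   # first string of each bucket, uncompressed
        self.buckets = []   # front-coded remainder of each bucket
        for start in range(0, len(words), bucket_size):
            block = words[start:start + bucket_size]
            self.headers.append(block[0])
            encoded, prev = [], block[0]
            for word in block[1:]:
                k = common_prefix_len(prev, word)
                encoded.append((k, word[k:]))
                prev = word
            self.buckets.append(encoded)

    def _decode_bucket(self, b):
        """Yield the strings of bucket b in sorted order."""
        word = self.headers[b]
        yield word
        for k, suffix in self.buckets[b]:
            word = word[:k] + suffix
            yield word

    def locate(self, s):
        """Return the 0-based ID of s in sorted order, or -1 if absent."""
        b = bisect_right(self.headers, s) - 1  # binary search on headers
        if b < 0:
            return -1
        for offset, word in enumerate(self._decode_bucket(b)):
            if word == s:
                return b * self.bucket_size + offset
        return -1

    def extract(self, i):
        """Return the string whose ID is i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        b, offset = divmod(i, self.bucket_size)
        for j, word in enumerate(self._decode_bucket(b)):
            if j == offset:
                return word
```

In this sketch, a larger `bucket_size` stores fewer full strings but decodes more entries per search, which mirrors the kind of space/time trade-off the abstract reports.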




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G. (2011). Compressed String Dictionaries. In: Pardalos, P.M., Rebennack, S. (eds) Experimental Algorithms. SEA 2011. Lecture Notes in Computer Science, vol 6630. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20662-7_12

  • DOI: https://doi.org/10.1007/978-3-642-20662-7_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20661-0

  • Online ISBN: 978-3-642-20662-7

  • eBook Packages: Computer Science, Computer Science (R0)
