Algorithmica, Volume 69, Issue 1, pp 232–268

Efficient Fully-Compressed Sequence Representations

  • Jérémy Barbay
  • Francisco Claude
  • Travis Gagie
  • Gonzalo Navarro
  • Yakov Nekrich

Abstract

We present a data structure that stores a sequence \(s[1..n]\) over alphabet \([1..\sigma]\) in \(n\mathcal{H}_{0}(s) + o(n)(\mathcal{H}_{0}(s)+1)\) bits, where \(\mathcal{H}_{0}(s)\) is the zero-order entropy of \(s\). This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time \(\mathcal{O}(\lg\lg\sigma)\) and average time \(\mathcal{O}(\lg\mathcal{H}_{0}(s))\). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using \(n\mathcal{H}_{0}(s)+o(n\lg\sigma)\) bits. On highly compressible sequences the \(o(n\lg\sigma)\) bits of the redundancy may be significant compared to the \(n\mathcal{H}_{0}(s)\) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.
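For readers unfamiliar with the three queries and with zero-order entropy, the following is a minimal reference sketch of their usual definitions on a plain Python list. It is not the compressed structure described here: every query is a linear scan, and the function names and the convention of returning `None` when fewer than `j` occurrences exist are assumptions made for illustration.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """H0(s) in bits per symbol: sum over symbols c of (n_c/n) * log2(n/n_c)."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def access(s, i):
    """Return s[i], using 1-based positions as in s[1..n]."""
    return s[i - 1]


def rank(s, c, i):
    """Number of occurrences of symbol c in s[1..i]."""
    return sum(1 for x in s[:i] if x == c)


def select(s, c, j):
    """1-based position of the j-th occurrence of c in s, or None if absent."""
    count = 0
    for pos, x in enumerate(s, start=1):
        if x == c:
            count += 1
            if count == j:
                return pos
    return None
```

With these definitions, storing s in about \(n\mathcal{H}_{0}(s)\) bits means spending `len(s) * zero_order_entropy(s)` bits on the data itself, before any redundancy terms are counted.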

Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.
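The partitioning idea can be sketched as follows. This is an illustrative toy under stated assumptions, not the structure from the article: the grouping rule (class = floor of log2(n / frequency)), the class name `PartitionedSequence`, and the helpers `_rank` and `_select` are all hypothetical, and plain lists with linear scans stand in for the fast representations the abstract refers to. The sketch keeps a sequence `t` of class identifiers plus one subsequence per class, and answers each query by composing a query on `t` with a query on the relevant subsequence.

```python
from collections import Counter


def _rank(seq, x, i):
    """Occurrences of x in seq[1..i] (1-based, inclusive); naive linear scan."""
    return sum(1 for y in seq[:i] if y == x)


def _select(seq, x, j):
    """1-based position of the j-th x in seq, or None; naive linear scan."""
    if j < 1:
        return None
    seen = 0
    for pos, y in enumerate(seq, start=1):
        if y == x:
            seen += 1
            if seen == j:
                return pos
    return None


class PartitionedSequence:
    """Toy alphabet partitioning: symbols of similar frequency share a class."""

    def __init__(self, s):
        n = len(s)
        freq = Counter(s)
        # Assumed grouping rule: class = floor(log2(n / frequency)),
        # computed with integers to avoid floating-point rounding.
        self.cls = {c: (n // f).bit_length() - 1 for c, f in freq.items()}
        self.t = [self.cls[c] for c in s]      # class of each position
        self.sub = {}                          # one subsequence per class
        for c in s:
            self.sub.setdefault(self.cls[c], []).append(c)

    def access(self, i):
        k = self.t[i - 1]
        return self.sub[k][_rank(self.t, k, i) - 1]

    def rank(self, c, i):
        k = self.cls.get(c)
        if k is None:
            return 0
        return _rank(self.sub[k], c, _rank(self.t, k, i))

    def select(self, c, j):
        k = self.cls.get(c)
        if k is None:
            return None
        p = _select(self.sub[k], c, j)
        return None if p is None else _select(self.t, k, p)
```

For example, on `list("abracadabra")` this rule places `a` alone in one class, `b` and `r` together in a second, and `c` and `d` together in a third, so each subsequence only has to distinguish symbols that occur about equally often.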

The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations \(\pi\) with times for \(\pi()\) and \(\pi^{-1}()\) improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.

Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.

Keywords

  • Compressed sequence representations
  • Rank and select on sequences
  • Compact data structures
  • Entropy-bounded structures
  • Compressed text indexing

Notes

Acknowledgements

We thank Djamal Belazzougui for helpful comments on a draft of this paper, and Meg Gagie for righting our grammar.

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Jérémy Barbay (1)
  • Francisco Claude (2)
  • Travis Gagie (3)
  • Gonzalo Navarro (1)
  • Yakov Nekrich (4)

  1. Department of Computer Science, University of Chile, Santiago, Chile
  2. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
  3. Department of Computer Science, Aalto University, Helsinki, Finland
  4. Department of Electrical Engineering & Computer Science, University of Kansas, Kansas, USA
