Adaptive Succinctness

  • Conference paper
String Processing and Information Retrieval (SPIRE 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11811)

Abstract

Representing a static set of integers S, \(|S| = n\) from a finite universe \(U = [1{.}{.}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.
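As a point of reference for the two operations, the following minimal sketch implements them on a plain sorted list. It assumes the common convention that \(\mathsf {rank}(x)\) counts the elements of S that are at most x and that \(\mathsf {select}(i)\) returns the i-th smallest element (1-based); the abstract does not state the exact convention, and the function names and the example set are illustrative only. This baseline is uncompressed and answers \(\mathsf {rank}\) by binary search, so it is not one of the structures proposed in the paper.

```python
from bisect import bisect_right


def rank(sorted_s, x):
    """Number of elements of S that are <= x (assumed convention)."""
    return bisect_right(sorted_s, x)


def select(sorted_s, i):
    """The i-th smallest element of S, for 1 <= i <= n (assumed convention)."""
    if not 1 <= i <= len(sorted_s):
        raise IndexError("select index out of range")
    return sorted_s[i - 1]


# Hypothetical example set over the universe [1..16].
S = [3, 4, 5, 9, 10, 14]
print(rank(S, 8))    # counts 3, 4 and 5
print(select(S, 4))  # fourth smallest element of S
```

Under these conventions, \(\mathsf {rank}(\mathsf {select}(i)) = i\) for every valid i.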

Although there is an information-theoretic lower bound of \(\mathcal {B}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where elements of S contain non-trivial runs of consecutive elements, one that occurs in many practical situations.

Let \(\mathcal {C}_n\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{.}{.}u]\). Let also \(\mathcal {C}^{n}_{g} \subset \mathcal {C}_{n}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \(\mathcal {C}^{n}_{g, r}\subset \mathcal {C}^{n}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements.
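To make the parameters g and r concrete, here is a small sketch that splits a sorted set into runs of consecutive elements and counts how many runs have at least 2 elements. It assumes that runs are maximal (two adjacent runs are separated by at least one missing element); the helper names `runs` and `run_stats` are hypothetical and not taken from the paper.

```python
def runs(sorted_s):
    """Split a sorted list of distinct integers into maximal runs (start, length)."""
    out = []
    for x in sorted_s:
        if out and out[-1][0] + out[-1][1] == x:
            # x extends the current run by one element.
            out[-1] = (out[-1][0], out[-1][1] + 1)
        else:
            # x starts a new run of length 1.
            out.append((x, 1))
    return out


def run_stats(sorted_s):
    """Return (g, r): number of runs, and number of runs of length >= 2."""
    rs = runs(sorted_s)
    g = len(rs)
    r = sum(1 for _, length in rs if length >= 2)
    return g, r


S = [3, 4, 5, 9, 10, 14]
print(runs(S))       # runs as (start, length) pairs
print(run_stats(S))  # (g, r) for this set
```

For this example set, the runs are {3, 4, 5}, {9, 10} and {14}, so it belongs to \(\mathcal {C}^{n}_{g, r}\) with n = 6, g = 3 and r = 2.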

  • We introduce new compressibility measures for sets, including:

    • \(\mathcal {L}_1 = \lg {|\mathcal {C}^{n}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\) and

    • \(\mathcal {L}_2 = \lg {|\mathcal {C}^{n}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\)

    We show that \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\).

  • We give data structures that use space close to bounds \(\mathcal {L}_1\) and \(\mathcal {L}_2\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in O(1) time.

  • We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.
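The closed forms for \(\mathcal {L}_1\) and \(\mathcal {L}_2\) can be checked on a small instance by brute-force enumeration. The sketch below assumes maximal runs and \(1 \le r \le g < n\), so that every binomial coefficient in the formulas is well defined; the function names and the parameters n = 5, u = 10, g = 3, r = 2 are chosen only for illustration. It is a sanity check of the counting formulas, not an implementation of the paper's data structures.

```python
from itertools import combinations
from math import comb, log2


def count_g(n, u, g):
    """|C^n_g| according to the closed form behind L_1."""
    return comb(u - n + 1, g) * comb(n - 1, g - 1)


def count_gr(n, u, g, r):
    """|C^n_{g,r}| according to the closed form behind L_2 (assumes 1 <= r <= g < n)."""
    return comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)


def run_lengths(sorted_s):
    """Lengths of the maximal runs of consecutive integers in a sorted sequence."""
    lengths = []
    prev = None
    for x in sorted_s:
        if prev is not None and x == prev + 1:
            lengths[-1] += 1
        else:
            lengths.append(1)
        prev = x
    return lengths


def brute_force(n, u, g, r=None):
    """Count n-subsets of [1..u] with g runs and, if r is given, exactly r runs of length >= 2."""
    total = 0
    for s in combinations(range(1, u + 1), n):
        lengths = run_lengths(s)
        if len(lengths) != g:
            continue
        if r is not None and sum(length >= 2 for length in lengths) != r:
            continue
        total += 1
    return total


n, u, g, r = 5, 10, 3, 2
B = log2(comb(u, n))
L1 = log2(count_g(n, u, g))
L2 = log2(count_gr(n, u, g, r))
print(brute_force(n, u, g) == count_g(n, u, g))
print(brute_force(n, u, g, r) == count_gr(n, u, g, r))
print(L2 <= L1 <= B)
```

On this instance the enumeration finds 120 sets with g = 3 runs and 60 of them with r = 2 runs of length at least 2, out of \({10\atopwithdelims ()5} = 252\) sets in total, which agrees with the closed forms and with \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\).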

Funded by the Millennium Institute for Foundational Research on Data (IMFD).

Notes

  1. For example, if we choose every element in \(U \) to be in \(S \) with probability 0.5, then , less than the Shannon lower bound for \(S \).

  2. Since and are not achievable, this statement is imprecise.

  3. \([k\not \in \hat{L}]\) is Iverson bracket notation, which equals 1 iff \(k\not \in \hat{L}\) is true, 0 otherwise.

Author information

Correspondence to Diego Arroyuelo.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Arroyuelo, D., Raman, R. (2019). Adaptive Succinctness. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_33

  • DOI: https://doi.org/10.1007/978-3-030-32686-9_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32685-2

  • Online ISBN: 978-3-030-32686-9

  • eBook Packages: Computer Science, Computer Science (R0)
