Adaptive Succinctness

Arroyuelo, Diego; Raman, Rajeev

doi:10.1007/978-3-030-32686-9_33

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11811)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Representing a static set of integers S, with \(|S| = n\), from a finite universe \(U = [1{.}{.}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.
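As a point of reference for the two operations, the following is a minimal sketch of a non-succinct baseline. It assumes one common convention, which the abstract itself does not spell out: \(\mathsf {rank}(x)\) counts the elements of S that are at most x, and \(\mathsf {select}(i)\) returns the i-th smallest element (1-based). The class name `SortedSetRankSelect` is hypothetical, and the structure uses a plain sorted list with binary search, so it achieves neither the space bounds nor the O(1) query time discussed here.

```python
from bisect import bisect_right


class SortedSetRankSelect:
    """Naive baseline: stores S as a sorted list (illustrative, not succinct)."""

    def __init__(self, elements):
        # Deduplicate and sort once; S is static afterwards.
        self.items = sorted(set(elements))

    def rank(self, x):
        """Number of elements of S that are <= x (assumed convention)."""
        return bisect_right(self.items, x)

    def select(self, i):
        """The i-th smallest element of S, with i counted from 1."""
        if not 1 <= i <= len(self.items):
            raise IndexError("select index out of range")
        return self.items[i - 1]


if __name__ == "__main__":
    s = SortedSetRankSelect([3, 4, 5, 9, 10])
    print(s.rank(5), s.rank(8), s.select(4))
```

Under this convention, `rank(select(i))` equals `i` for every valid `i`, which is a convenient sanity check for any implementation.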

Although there is an information-theoretic lower bound of \(\mathcal {B}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where elements of S contain non-trivial runs of consecutive elements, one that occurs in many practical situations.

Let \(\mathcal {C}_n\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{.}{.}u]\). Let also \(\mathcal {C}^{n}_{g} \subset \mathcal {C}_{n}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \(\mathcal {C}^{n}_{g, r}\subset \mathcal {C}^{n}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements.

We introduce new compressibility measures for sets, including:
- \(\mathcal {L}_1 = \lg {|\mathcal {C}^{n}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\) and
- \(\mathcal {L}_2 = \lg {|\mathcal {C}^{n}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\)
We show that \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\).
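The counting formulas behind \(\mathcal {L}_1\) and \(\mathcal {L}_2\) can be checked by brute force on a small universe. The sketch below is illustrative only: the function names (`runs`, `count_by_formula`, `brute_force_counts`, `measures`) are hypothetical, logarithms are taken base 2, and the \(\mathcal {L}_2\) formula is evaluated only for \(r \ge 1\) and \(n - g \ge r\), since the binomial \({n-g-1 \atopwithdelims ()r-1}\) is not meaningful outside that range.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def runs(sorted_elems):
    """Lengths of the maximal runs of consecutive integers in a sorted list."""
    lengths = []
    prev = None
    for x in sorted_elems:
        if prev is not None and x == prev + 1:
            lengths[-1] += 1      # x extends the current run
        else:
            lengths.append(1)     # x starts a new run
        prev = x
    return lengths


def count_by_formula(u, n, g, r=None):
    """|C^n_g| when r is None, else |C^n_{g,r}| (assumes r >= 1 and n - g >= r)."""
    gap_choices = comb(u - n + 1, g)
    if r is None:
        return gap_choices * comb(n - 1, g - 1)
    return gap_choices * comb(n - g - 1, r - 1) * comb(g, r)


def brute_force_counts(u, n):
    """Enumerate all n-subsets of [1..u]; tally them by g and by (g, r)."""
    by_g, by_gr = Counter(), Counter()
    for subset in combinations(range(1, u + 1), n):
        lengths = runs(list(subset))
        g = len(lengths)
        r = sum(1 for length in lengths if length >= 2)
        by_g[g] += 1
        by_gr[(g, r)] += 1
    return by_g, by_gr


def measures(u, n, g, r):
    """Return (L2, L1, B) in bits for the given parameters."""
    return (log2(count_by_formula(u, n, g, r)),
            log2(count_by_formula(u, n, g)),
            log2(comb(u, n)))


if __name__ == "__main__":
    u, n = 8, 4
    by_g, by_gr = brute_force_counts(u, n)
    for g, count in sorted(by_g.items()):
        print(g, count, count_by_formula(u, n, g))
    print(measures(u, n, 2, 1))
```

For \(u = 8\) and \(n = 4\) the enumerated counts agree with both formulas, and the three measures come out ordered as \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\), as expected from the inclusions \(\mathcal {C}^{n}_{g,r} \subset \mathcal {C}^{n}_{g} \subset \mathcal {C}_n\).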
We give data structures that use space close to bounds \(\mathcal {L}_1\) and \(\mathcal {L}_2\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in O(1) time.
We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.

Funded by the Millennium Institute for Foundational Research on Data (IMFD).

Notes

1.
For example, if we choose every element in \(U \) to be in \(S \) with probability 0.5, then , less than the Shannon lower bound for \(S \).
2.
Since and are not achievable, this statement is imprecise.
3.
\([k\not \in \hat{L}]\) is Iverson bracket notation: it equals 1 if \(k\not \in \hat{L}\) holds and 0 otherwise.

Author information

Authors and Affiliations

Millennium Institute for Foundational Research on Data (IMFD), Santiago, Chile
Diego Arroyuelo
Department of Informatics, Universidad Técnica Federico Santa María, Santiago, Chile
Diego Arroyuelo
Department of Informatics, University of Leicester, University Road, Leicester, LE1 7RH, UK
Rajeev Raman

Corresponding author

Correspondence to Diego Arroyuelo.

Editor information

Editors and Affiliations

University of A Coruña, A Coruña, Spain
Nieves R. Brisaboa
University of Helsinki, Helsinki, Finland
Simon J. Puglisi

About this paper

Cite this paper

Arroyuelo, D., Raman, R. (2019). Adaptive Succinctness. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_33

DOI: https://doi.org/10.1007/978-3-030-32686-9_33
Published: 03 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32685-2
Online ISBN: 978-3-030-32686-9
eBook Packages: Computer Science, Computer Science (R0)
