Abstract
Representing a static set of integers S, with \(|S| = n\), drawn from a finite universe \(U = [1{.}{.}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.
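As a point of reference for the two operations, the following is a minimal, non-succinct sketch. It assumes the common convention that \(\mathsf {rank}(x)\) counts the elements of S that are at most x and that \(\mathsf {select}(i)\) returns the i-th smallest element; the abstract does not spell out its conventions, and the class name `SortedSet` is hypothetical.

```python
from bisect import bisect_right


class SortedSet:
    """Naive static set over [1..u], stored as a sorted list (not succinct)."""

    def __init__(self, elements):
        self.items = sorted(set(elements))

    def rank(self, x):
        """Number of elements of S that are <= x (assumed convention)."""
        return bisect_right(self.items, x)

    def select(self, i):
        """The i-th smallest element of S, for 1 <= i <= n (assumed convention)."""
        if not 1 <= i <= len(self.items):
            raise IndexError("select index out of range")
        return self.items[i - 1]


s = SortedSet([3, 4, 5, 9, 10, 14])
print(s.rank(8), s.select(4))
```

This baseline spends a full machine word per element and answers \(\mathsf {rank}\) by binary search, which is the space and time cost that succinct representations aim to improve on.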
Although there is an information-theoretic lower bound of \(\mathcal {B}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this bound applies to worst-case (random) sets S, whereas sets found in practical applications are compressible. We focus on the case where S contains non-trivial runs of consecutive elements, a case that occurs in many practical situations.
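To illustrate why runs help, here is a small sketch that stores one entry per run instead of one per element. It is an illustration under stated assumptions and not the structure from the paper: it answers queries by binary search over the runs rather than in O(1) time, and the name `RunSet` and the rank/select conventions (count of elements at most x, i-th smallest element) are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate


class RunSet:
    """Static set stored as maximal runs: run starts plus prefix sums of lengths."""

    def __init__(self, elements):
        self.starts = []  # first element of each maximal run
        lengths = []      # number of elements in each run
        for x in sorted(set(elements)):
            if self.starts and x == self.starts[-1] + lengths[-1]:
                lengths[-1] += 1  # x extends the current run
            else:
                self.starts.append(x)
                lengths.append(1)
        # before[j] = number of elements contained in runs 0..j-1
        self.before = [0] + list(accumulate(lengths))

    def rank(self, x):
        """Number of elements <= x."""
        j = bisect_right(self.starts, x) - 1  # last run starting at or before x
        if j < 0:
            return 0
        run_len = self.before[j + 1] - self.before[j]
        return self.before[j] + min(x - self.starts[j] + 1, run_len)

    def select(self, i):
        """The i-th smallest element, for 1 <= i <= n."""
        if not 1 <= i <= self.before[-1]:
            raise IndexError("select index out of range")
        j = bisect_right(self.before, i - 1) - 1  # run holding the i-th element
        return self.starts[j] + (i - 1 - self.before[j])


rs = RunSet([3, 4, 5, 9, 10, 14])
print(rs.starts, rs.rank(9), rs.select(6))
```

The six-element example is stored as three runs, so the space depends on the number of runs rather than on n.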
Let \(\mathcal {C}_n\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{.}{.}u]\). Let also \(\mathcal {C}^{n}_{g} \subset \mathcal {C}_{n}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \(\mathcal {C}^{n}_{g, r}\subset \mathcal {C}^{n}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements.
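The parameters n, g and r of a concrete set can be read off by splitting it into maximal runs. The sketch below does this; the function names `runs` and `run_parameters` are hypothetical, and r is taken to be the number of runs with at least 2 elements.

```python
def runs(elements):
    """Return the maximal runs of consecutive integers as (start, length) pairs."""
    out = []
    for x in sorted(set(elements)):
        if out and x == out[-1][0] + out[-1][1]:
            out[-1] = (out[-1][0], out[-1][1] + 1)  # x extends the last run
        else:
            out.append((x, 1))  # x starts a new run
    return out


def run_parameters(elements):
    """Return (n, g, r): set size, number of runs, runs with at least 2 elements."""
    rs = runs(elements)
    n = sum(length for _, length in rs)
    g = len(rs)
    r = sum(1 for _, length in rs if length >= 2)
    return n, g, r


example = [3, 4, 5, 9, 10, 14]
print(runs(example), run_parameters(example))
```

For this example the runs are {3,4,5}, {9,10} and {14}, so the set has n = 6, g = 3 and r = 2.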
- We introduce new compressibility measures for sets, including:
  - \(\mathcal {L}_1 = \lg {|\mathcal {C}^{n}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\) and
  - \(\mathcal {L}_2 = \lg {|\mathcal {C}^{n}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\).
  We show that \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\).
- We give data structures that use space close to bounds \(\mathcal {L}_1\) and \(\mathcal {L}_2\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in O(1) time.
- We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.
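The closed forms for \(\mathcal {L}_1\) and \(\mathcal {L}_2\) can be sanity-checked on a small universe by enumerating every n-element subset and counting runs directly. The sketch below is only a numerical check and not part of the paper; the function names are hypothetical, the parameters u = 12, n = 6, g = 3, r = 2 are arbitrary, and the closed form for \(\mathcal {L}_2\) is assumed to be used with \(r \ge 1\), counting sets in which exactly r runs have at least 2 elements.

```python
from itertools import combinations
from math import comb, log2


def count_runs(s):
    """Return (g, r) for a sorted tuple s: runs, and runs with >= 2 elements."""
    lengths = []
    for i, x in enumerate(s):
        if i > 0 and x == s[i - 1] + 1:
            lengths[-1] += 1
        else:
            lengths.append(1)
    return len(lengths), sum(1 for length in lengths if length >= 2)


def class_sizes(u, n, g, r):
    """Closed-form sizes of C_n, C^n_g and C^n_{g,r} (assumes r >= 1)."""
    total = comb(u, n)
    c_g = comb(u - n + 1, g) * comb(n - 1, g - 1)
    c_gr = comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)
    return total, c_g, c_gr


def brute_force(u, n, g, r):
    """The same three sizes, obtained by enumerating all n-subsets of [1..u]."""
    c_g = c_gr = 0
    for s in combinations(range(1, u + 1), n):
        gg, rr = count_runs(s)
        if gg == g:
            c_g += 1
            if rr == r:
                c_gr += 1
    return comb(u, n), c_g, c_gr


u, n, g, r = 12, 6, 3, 2
sizes = class_sizes(u, n, g, r)      # (924, 350, 210)
B, L1, L2 = (log2(x) for x in sizes)  # the three measures, in bits
print(sizes, sizes == brute_force(u, n, g, r), L2 <= L1 <= B)
```

For these parameters the enumeration agrees with the closed forms, and the three class sizes 210, 350 and 924 reflect the ordering \(\mathcal {L}_2 \le \mathcal {L}_1 \le \mathcal {B}(n, u)\).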
Funded by the Millennium Institute for Foundational Research on Data (IMFD).
Notes
1. For example, if we choose every element in \(U\) to be in \(S\) with probability 0.5, then , less than the Shannon lower bound for \(S\).
2. Since and are not achievable, this statement is imprecise.
3. \([k\not \in \hat{L}]\) is Iverson bracket notation: it equals 1 if \(k\not \in \hat{L}\) holds and 0 otherwise.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Arroyuelo, D., Raman, R. (2019). Adaptive Succinctness. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science(), vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_33
DOI: https://doi.org/10.1007/978-3-030-32686-9_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32685-2
Online ISBN: 978-3-030-32686-9
eBook Packages: Computer Science, Computer Science (R0)