Range Majorities and Minorities in Arrays

Abstract

The problem of parameterized range majority asks us to preprocess a string of length \(n\) such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold \(\tau\). This is a more tractable version of the classical problem of finding the range mode, which is unlikely to be solvable in polylogarithmic time and linear space. In this paper we give the first linear-space solution with optimal \(\mathcal {O}\!\left( {1 / \tau } \right)\) query time, even when \(\tau\) can be specified with the query. We then consider data structures whose space is bounded by the entropy of the distribution of the symbols in the sequence. For the case when the alphabet size \(\sigma\) is polynomial in the computer word size, we retain the optimal time within optimally compressed space (i.e., with sublinear redundancy). Otherwise, either the compressed space is increased by an arbitrarily small constant factor or the time rises to any function in \((1/\tau )\cdot \omega (1)\). We obtain the same results on the complementary problem of parameterized range minority.

Notes

  1. In that paper they find the predecessor of x, which is the largest \(x_i \le x\), but the problem is analogous.

  2. M. Pǎtraşcu, personal communication, 2009.

  3. This could have been simply \(X \leftarrow (Y-X)~\textsc {and}~Y\) if there were an unused highest bit set to zero in the \((\lg \sigma ')\)-length fields of X. Instead, we have to use this more complex formula that first zeroes the highest bit of the fields and later considers them separately.

References

  1. Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)

  2. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Theory and practice of monotone minimal perfect hashing. ACM J. Exp. Algorithm. 16(3), 2 (2011)

  3. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Linear-time string indexing and analysis in small space. ACM Trans. Algorithm. 16(2), 17:1-17:54 (2020)

  4. Belazzougui, D., Gagie, T., Navarro, G.: Better space bounds for parameterized range majority and minority. In: Proceedings of the 12th Annual Workshop on Algorithms and Data Structures (WADS), LNCS 8037, pp. 121–132 (2013)

  5. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithm. 10(4), 23 (2014)

  6. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithm. 11(4), 31 (2015)

  7. Bose, P., Kranakis, E., Morin, P., Tang, Y.: Approximate range mode and range median queries. In: Proceedings of the 22nd Symposium on Theoretical Aspects of Computer Science (STACS), pp. 377–388 (2005)

  8. Boyer, R.S., Moore, J.S.: MJRTY–a fast majority vote algorithm. In: Automated Reasoning, pp. 105–117. Springer, Berlin (1991)

  9. Chan, T.M., Durocher, S., Larsen, K.G., Morrison, J., Wilkinson, B.T.: Linear-space data structures for range mode query in arrays. Theory Comput. Syst. 55(4), 719–741 (2014)

  10. Chan, T.M., Durocher, S., Skala, M., Wilkinson, B.T.: Linear-space data structures for range minority query in arrays. Algorithmica 72(4), 901–913 (2015)

  11. Cormode, G.: Misra-Gries summaries. In: Encyclopedia of Algorithms, pp. 1334–1337. Springer, Berlin (2016)

  12. Cormode, G., Muthukrishnan, S.: Data stream methods. http://www.cs.rutgers.edu/~muthu/198-3.pdf, 2003. Lecture 3 of the Rutgers 198:671 Seminar on Processing Massive Data Sets

  13. Demaine, E.D., López-Ortiz, A., Munro, J. I.: Frequency estimation of internet packet streams with limited space. In: Proceedings of the 10th European Symposium on Algorithms (ESA), pp. 348–360 (2002)

  14. Durocher, S., He, M., Munro, J.I., Nicholson, P.K., Skala, M.: Range majority in constant time and linear space. Inf. Comput. 222, 169–179 (2013)

  15. Durocher, S., Shah, R., Skala, M., Thankachan, S.V.: Linear-space data structures for range frequency queries on arrays and trees. Algorithmica 74(1), 344–366 (2016)

  16. Elmasry, A., He, M., Munro, J.I., Nicholson, P.K.: Dynamic range majority data structures. Theoret. Comput. Sci. 647, 59–73 (2016)

  17. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithm. 3(2), 1–22 (2007)

  18. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. 371(1), 115–121 (2007)

  19. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  20. Gagie, T., He, M., Munro, J.I., Nicholson, P.K.: Finding frequent elements in compressed 2D arrays and strings. In: Proceedings of the 18th Symposium on String Processing and Information Retrieval (SPIRE), pp. 295–300 (2011)

  21. Gagie, T., He, M., Navarro, G.: Compressed dynamic range majority and minority data structures. Algorithmica 82(7), 2063–2086 (2020)

  22. Gawrychowski, P., Nicholson, P.K.: Optimal query time for encoding range majority. In: Proceedings of the 15th International Symposium on Algorithms and Data Structures (WADS), pp. 409–420 (2017). Extended version in CoRR abs/1704.06149

  23. Greve, M., Jørgensen, A.G., Larsen, K.D., Truelsen, J.: Cell probe lower bounds and approximations for range mode. In: Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP), pp. 605–616 (2010)

  24. Grossi, R., Orlandi, A., Raman, R., Srinivasa Rao, S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)

  25. Hon, W.-K., Shah, R., Vitter, J.: Space-efficient framework for top-\(k\) string retrieval problems. In: Proceedings of the 50th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 713–722 (2009)

  26. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28(1), 51–55 (2003)

  27. Karpinski, M., Nekrich, Y.: Searching for frequent colors in rectangles. In: Proceedings of the 20th Canadian Conference on Computational Geometry (CCCG), pp. 11–14 (2008)

  28. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J. Comput. 29(3), 893–911 (2000)

  29. Krizanc, D., Morin, P., Smid, M.H.M.: Range mode and range median queries on lists and trees. Nordic J. Comput. 12(1), 1–17 (2005)

  30. Lai, Y.K., Poon, C.K., Shi, B.: Approximate colored range and point enclosure queries. J. Discrete Algorithm. 6(3), 420–432 (2008)

  31. Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2(2), 143–152 (1982)

  32. Munro, J.I., Navarro, G., Nekrich, Y.: Fast compressed self-indexes with deterministic linear-time construction. Algorithmica 82(2), 316–337 (2020)

  33. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th Symposium on Discrete Algorithms (SODA), pp. 657–666 (2002)

  34. Navarro, G., Thankachan, S.V.: Optimal encodings for range majority queries. Algorithmica 74(3), 1082–1098 (2016)

  35. Petersen, H.: Improved bounds for range mode and range median queries. In: Proceedings of the 34th Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), pp. 418–423 (2008)

  36. Petersen, H., Grabowski, S.: Range mode and range median queries in constant time and sub-quadratic space. Inf. Process. Lett. 109(4), 225–228 (2009)

  37. Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)

  38. Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithm. 5(1), 12–22 (2007)

  39. Wei, Z., Yi, K.: Beyond simple aggregates: indexing for summary queries. In: Proceedings of the 30th Symposium on Principles of Database Systems (PODS), pp. 117–128 (2011)

Acknowledgements

Many thanks to Patrick Nicholson for helpful comments.

Author information

Corresponding author

Correspondence to Gonzalo Navarro.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Funded in part by Academy of Finland Grant 268324, by NSERC Grant RGPIN-07185-2020, by ANID – Millennium Science Initiative Program – Code ICN17_002, and by Fondecyt Grant 1-200038, Chile. An early partial version of this article appeared in Proc. WADS 2013.

Appendix: Finding \(\tau '\)-Majorities on Tiny Alphabets

We show how to find \(\tau '\)-majorities in time \(\mathcal {O}\!\left( {1/\tau '} \right)\) on ranges of length \(\mathcal {O}\!\left( {(1/\tau ')w^\beta } \right)\), over alphabet \([0..\sigma '-1]\), with \(\sigma '=w^\beta\), in the case \(1/\tau ' < \sigma '\). We will compute an array of \(\sigma '\) counters with the frequency of the symbols in the range, and then report those exceeding the threshold. The maximum size of the range is \((4/\tau ')w^\beta /4 = (1/\tau ')w^\beta \le \sigma ' w^\beta =w^{2\beta }\), and thus \(2\beta \lg w\) bits suffice to represent each counter. The \(\sigma '\) counters then require \(2\beta w^\beta \lg w\) bits and can be maintained in a computer word (although we will store them somewhat spaced for technical reasons). We can read the elements in \(S_v\) in chunks of \(w^\beta\) symbols, and compute in constant time the corresponding counters for those symbols. Then we sum the current counters and the counters for the chunk, all in constant time because they are fields in a single computer word. Since the range consists of \(\mathcal {O}\!\left( {1/\tau '} \right)\) chunks, it is processed in time \(\mathcal {O}\!\left( {1/\tau '} \right)\).
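As a reference point for the plan above, the following minimal sketch computes the same answer without any word-level parallelism: it walks the range chunk by chunk, keeps one counter per symbol, and reports the symbols whose count exceeds \(\tau '\) times the range length. The function name `tau_majorities_plain` and its parameters are hypothetical, introduced only for illustration; the per-chunk work here takes time proportional to the chunk length, not the constant time of the packed version.

```python
from fractions import Fraction


def tau_majorities_plain(seq, i, j, tau, chunk_len, sigma):
    """Report the symbols occurring more than tau * (j - i + 1) times in seq[i..j].

    Plain (non-bit-parallel) baseline: one counter per symbol in [0, sigma),
    updated chunk by chunk, then compared against the threshold.
    """
    counts = [0] * sigma
    for start in range(i, j + 1, chunk_len):
        chunk = seq[start:min(start + chunk_len, j + 1)]
        for symbol in chunk:
            counts[symbol] += 1
    threshold = tau * (j - i + 1)
    return [s for s in range(sigma) if counts[s] > threshold]


seq = [0, 1, 1, 3, 2, 2, 0, 0, 1, 3]
print(tau_majorities_plain(seq, 0, 9, Fraction(1, 4), 4, 4))  # [0, 1]
```

Using `Fraction` for the threshold avoids floating-point rounding when comparing counts against \(\tau '(j-i+1)\).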

To compute the counters corresponding to \(w^\beta\) symbols, we extend the popcounting algorithm of Belazzougui and Navarro [6, Sec. 4.1]; assume we extract the \(w^\beta\) symbols from \(S_v\) and have them packed in the lowest \(k\ell\) bits of a computer word X, where \(k=w^\beta\) is the number of symbols and \(\ell =\lg \sigma '\) is the number of bits used per symbol. We first create \(\sigma '\) copies of the sequence at distance \(2k\ell\) of each other: \(X \leftarrow X \cdot (0^{2k\ell -1}1)^{\sigma '}\). In each copy we will count the occurrences of a different symbol. To have the \((i+1)\)th copy count the occurrences of symbol i, for \(0 \le i < \sigma '\), we perform

$$\begin{aligned} X ~\leftarrow ~ X~~\textsc {xor}~~0^{k\ell } ((\sigma '-1)_\ell )^k \ldots 0^{k\ell } (2_\ell )^k ~ 0^{k\ell } (1_\ell )^k ~ 0^{k\ell } (0_\ell )^k, \end{aligned}$$

where \(i_\ell\) is number i written in \(\ell\) bits. Thus in the \((i+1)\)th copy the symbols equal to i become zero and the others nonzero. We then set a 1 at the highest bit of the symbols equal to i in the \((i+1)\)th copy, with

$$\begin{aligned} X ~\leftarrow ~ (Y - (X~\textsc {and~not}~Y))~\textsc {and}~Y~\textsc {and~not}~X, \end{aligned}$$

where \(Y=(0^{k\ell } (10^{\ell -1})^k)^{\sigma '}\) (see Note 3). Now we add all the 1s in each copy with \(X \leftarrow X \cdot 0^{k\ell (2\sigma '-1)} (0^{\ell -1}1)^k\). This spreads several sums across the \(2k\ell\) bits of each copy, and in particular the \(k\)th sum adds up all the 1s of the copy. Each sum requires \(\lg k\) bits, which is precisely the \(\ell\) bits we have allocated per field. Finally, we isolate the desired counters using \(X \leftarrow X~\textsc {and}~(0^{k\ell }1^\ell 0^{(k-1)\ell })^{\sigma '}\). The \(\sigma '\) counters are not contiguous in the computer word, but we still can afford to store them spaced: we use \(2k\ell \sigma ' = 2 \beta w^{2\beta }\lg w\) bits, which, since \(\beta \le 1/4\), is always less than \(w\).
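The chunk-counting steps can be simulated with Python's arbitrary-precision integers standing in for the computer word. The sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names (`field_repeat`, `chunk_counters`, `read_counter`) are hypothetical; the marker bits are shifted to the lowest bit of each field before the multiplication, a choice made here so that the sums line up with the isolating mask; and the chunk length is restricted to \(k < 2^\ell\) so that a count of \(k\) fits in an \(\ell\)-bit field.

```python
def field_repeat(value, width, count):
    """Pack `count` copies of `value`, one per field of `width` bits."""
    word = 0
    for j in range(count):
        word |= value << (j * width)
    return word


def chunk_counters(symbols, ell):
    """Return a word holding, for each symbol i, its count in `symbols`.

    Copy i occupies bits [2*k*ell*i, 2*k*ell*(i+1)); its counter sits in the
    ell bits starting at offset (k-1)*ell inside the copy.
    """
    sigma = 1 << ell
    k = len(symbols)
    assert 0 < k < sigma, "sketch assumption: a count of k must fit in ell bits"
    assert all(0 <= s < sigma for s in symbols)
    stride = 2 * k * ell

    # Pack the chunk into the lowest k*ell bits.
    x = 0
    for j, s in enumerate(symbols):
        x |= s << (j * ell)

    # sigma copies of the chunk, 2*k*ell bits apart.
    x *= field_repeat(1, stride, sigma)

    # XOR copy i with symbol i repeated k times: matching fields become zero.
    pattern = 0
    for i in range(sigma):
        pattern |= field_repeat(i, ell, k) << (i * stride)
    x ^= pattern

    # Set the highest bit of exactly those fields that are zero.
    y = field_repeat(field_repeat(1 << (ell - 1), ell, k), stride, sigma)
    x = (y - (x & ~y)) & y & ~x

    # Sketch-specific step: move each marker to the lowest bit of its field.
    x >>= ell - 1

    # Multiplying by (0^{ell-1} 1)^k puts prefix sums in consecutive fields;
    # field k-1 of each copy receives the total number of markers.
    x *= field_repeat(1, ell, k)

    # Keep only the counter field of each copy.
    mask = field_repeat(((1 << ell) - 1) << ((k - 1) * ell), stride, sigma)
    return x & mask


def read_counter(word, i, k, ell):
    """Extract the counter of symbol i from a word built by chunk_counters."""
    return (word >> (2 * k * ell * i + (k - 1) * ell)) & ((1 << ell) - 1)


chunk = [3, 1, 3, 0, 7, 3, 1]
word = chunk_counters(chunk, 3)
print([read_counter(word, i, len(chunk), 3) for i in range(8)])
```

On a real word-RAM each of these steps is a constant number of arithmetic and bitwise operations on one word; the loops in `field_repeat` only build constants that would be precomputed.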

Since the range is of length at most \(w^{2\beta }\), the cumulative counters need \(\lg (w^{2\beta }) = 2\ell\) bits. We will store them in a computer word A separated by \(2k\ell\) bits so that we can directly add the resulting word X after processing a chunk of \(w^\beta\) symbols of the range in \(S_v\): \(A \leftarrow A+X\). If the last chunk is of length \(l < w^\beta\), we complete it with zeros and then subtract those spurious \(w^\beta -l\) occurrences from the first counter, \(A \leftarrow A - (w^\beta -l) \cdot 2^{(k-1)\ell }\).
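A minimal sketch of the accumulation step follows, again with Python integers in place of the word \(A\). To stay self-contained it builds each chunk's counter word directly with a hypothetical helper `pack_counts` instead of the bit-parallel computation; the names and the example data are assumptions for illustration only. It shows the addition \(A \leftarrow A+X\) and the correction for the zero padding of a short last chunk.

```python
def pack_counts(counts, k, ell):
    """Place counts[i] at bit offset 2*k*ell*i + (k-1)*ell, as in the word X."""
    stride = 2 * k * ell
    word = 0
    for i, c in enumerate(counts):
        word |= c << (i * stride + (k - 1) * ell)
    return word


def accumulate(chunks, k, ell):
    """Sum per-chunk counter words into A; short chunks are zero-padded."""
    sigma = 1 << ell
    a = 0
    for chunk in chunks:
        pad = k - len(chunk)
        padded = list(chunk) + [0] * pad
        x = pack_counts([padded.count(s) for s in range(sigma)], k, ell)
        a += x                        # A <- A + X: all counters at once
        a -= pad << ((k - 1) * ell)   # remove spurious zeros from counter 0
    return a


def read_cumulative(a, i, k, ell):
    """Extract the 2*ell-bit cumulative counter of symbol i from A."""
    return (a >> (2 * k * ell * i + (k - 1) * ell)) & ((1 << (2 * ell)) - 1)


chunks = [[0, 1, 1, 3], [2, 2, 0, 0], [1, 3]]
a = accumulate(chunks, k=4, ell=2)
print([read_cumulative(a, i, 4, 2) for i in range(4)])  # [3, 3, 2, 2]
```

Because each cumulative counter has \(2\ell\) bits and the range has fewer than \(2^{2\ell}\) symbols in this example, no counter can carry into a neighbouring field.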

The last challenge is to output the counters that are at least \(y = \lfloor \tau ' (j-i+1) \rfloor +1\) after processing the range. We use

$$\begin{aligned} A \leftarrow A + (2^{2\ell }-y) \cdot (0^{k\ell +\ell -1} 1 0^{(k-1)\ell })^{\sigma '} \end{aligned}$$

so that the counters reaching y will overflow to the next bit. We isolate those overflow bits with \(A \leftarrow A~\textsc {and}~(0^{(k-1)\ell -1} 1 0^{(k+1)\ell })^{\sigma '}\), so that we have to report the symbol i if and only if \(A~\textsc {and}~0^{(k(2\sigma '-2i-1)-1)\ell -1} 1 0^{(k(2i+1)+1)\ell } \not = 0\). We then repeatedly isolate the lowest bit of A with

$$\begin{aligned} D \leftarrow (A~\textsc {xor}~(A-1))~\textsc {and}~A, \end{aligned}$$

and then remove it with \(A \leftarrow A~\textsc {and}~(A-1)\), until \(A=0\). Once we have a position isolated in D, we find the position in constant time by using a monotone minimal perfect hash function over the set \(\{ 2^{(k(2i+1)+1)\ell },~0 \le i < \sigma '\}\), which uses \(\mathcal {O}\!\left( {\sigma ' \lg w} \right) =o(w)\) bits [2]. Only one such data structure is needed for all the sequences, and it takes less space than a single systemwide pointer.
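The reporting step can be sketched in the same simulated setting. The code below adds \(2^{2\ell }-y\) to every counter so that counters reaching \(y\) overflow into bit \((k+1)\ell\) of their copy, masks those overflow bits, and then walks over the set bits. Two substitutions are made and should be read as assumptions of this sketch: the lowest set bit is isolated with the common `a & -a` idiom, and its position is found with Python's `int.bit_length` in place of the monotone minimal perfect hash function. The helper names are hypothetical.

```python
from fractions import Fraction
from math import floor


def pack_counts(counts, k, ell):
    """Place counts[i] at bit offset 2*k*ell*i + (k-1)*ell, as in the word A."""
    stride = 2 * k * ell
    word = 0
    for i, c in enumerate(counts):
        word |= c << (i * stride + (k - 1) * ell)
    return word


def report_majorities(a, range_len, tau, k, ell):
    """Return the symbols whose counter in A is at least floor(tau*range_len)+1."""
    sigma = 1 << ell
    stride = 2 * k * ell
    y = floor(tau * range_len) + 1

    # One 1 at the lowest bit of every counter field.
    unit = sum(1 << (i * stride + (k - 1) * ell) for i in range(sigma))

    # Counters >= y overflow into the bit just above their 2*ell-bit field.
    a += ((1 << (2 * ell)) - y) * unit
    a &= unit << (2 * ell)            # keep only the overflow bits

    result = []
    while a:
        d = a & -a                    # lowest set bit (standard idiom)
        pos = d.bit_length() - 1      # stands in for the perfect hash lookup
        result.append((pos - (k + 1) * ell) // stride)
        a &= a - 1                    # remove that bit
    return result


a = pack_counts([3, 3, 2, 2], k=4, ell=2)
print(report_majorities(a, 10, Fraction(1, 4), 4, 2))  # [0, 1]
```

With range length 10 and \(\tau '=1/4\), the threshold is \(y=3\), so only the two symbols with count 3 are reported.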

About this article

Cite this article

Belazzougui, D., Gagie, T., Munro, J.I. et al. Range Majorities and Minorities in Arrays. Algorithmica 83, 1707–1733 (2021). https://doi.org/10.1007/s00453-021-00799-7

  • DOI: https://doi.org/10.1007/s00453-021-00799-7

Keywords

  • Compressed data structures
  • Range majority and minority
  • Range mode