## Abstract

The problem of parameterized range majority asks us to preprocess a string of length *n* such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold \(\tau\). This is a more tractable version of the classical problem of finding the range mode, which is unlikely to be solvable in polylogarithmic time and linear space. In this paper we give the first linear-space solution with optimal \(\mathcal {O}\!\left( {1 / \tau } \right)\) query time, even when \(\tau\) can be specified with the query. We then consider data structures whose space is bounded by the entropy of the distribution of the symbols in the sequence. For the case when the alphabet size \(\sigma\) is polynomial in the computer word size, we retain the optimal time within optimally compressed space (i.e., with sublinear redundancy). Otherwise, either the compressed space is increased by an arbitrarily small constant factor or the time rises to any function in \((1/\tau )\cdot \omega (1)\). We obtain the same results on the complementary problem of parameterized range minority.
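For orientation, the query can be stated as a brute-force baseline. The following Python sketch is not the data structure of this paper: it scans the whole range, so its time is linear in the range length rather than \(\mathcal {O}\!\left( {1 / \tau } \right)\). The function name `range_majorities` and the 0-based inclusive endpoints are assumptions made here for illustration.

```python
from collections import Counter


def range_majorities(seq, i, j, tau):
    """Distinct elements of seq[i..j] (inclusive) with relative frequency > tau.

    Brute-force reference only: it counts every symbol in the range.
    """
    length = j - i + 1
    counts = Counter(seq[i:j + 1])
    # An element is a tau-majority if it occurs more than tau * length times.
    return sorted(sym for sym, c in counts.items() if c > tau * length)


if __name__ == "__main__":
    # In "abracadabra": a occurs 5 times, b and r twice, c and d once.
    print(range_majorities("abracadabra", 0, 10, 0.25))
    print(range_majorities("abracadabra", 0, 10, 0.15))
```

A baseline of this kind is mainly useful for checking the answers of a faster structure on small inputs.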

## Notes

1. In that paper they find the predecessor of *x*, which is the largest \(x_i \le x\), but the problem is analogous.

2. M. Pǎtraşcu, personal communication, 2009.

3. This could have been simply \(X \leftarrow (Y-X)~\textsc {and}~Y\) if there was an unused highest bit set to zero in the \((\lg \sigma ')\)-length fields of *X*. Instead, we have to use this more complex formula that first zeroes the highest bit of the fields and later considers them separately.

## References

1. Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica **69**(1), 232–268 (2014)

2. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Theory and practice of monotone minimal perfect hashing. ACM J. Exp. Algorithm. **16**(3), 2 (2011)

3. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Linear-time string indexing and analysis in small space. ACM Trans. Algorithm. **16**(2), 17:1–17:54 (2020)

4. Belazzougui, D., Gagie, T., Navarro, G.: Better space bounds for parameterized range majority and minority. In: Proceedings of the 12th Annual Workshop on Algorithms and Data Structures (WADS), LNCS 8037, pp. 121–132 (2013)

5. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithm. **10**(4), 23 (2014)

6. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithm. **11**(4), 31 (2015)

7. Bose, P., Kranakis, E., Morin, P., Tang, Y.: Approximate range mode and range median queries. In: Proceedings of the 22nd Symposium on Theoretical Aspects of Computer Science (STACS), pp. 377–388 (2005)

8. Boyer, R.S., Moore, J.S.: MJRTY–a fast majority vote algorithm. In: Automated Reasoning, pp. 105–117. Springer, Berlin (1991)

9. Chan, T.M., Durocher, S., Larsen, K.G., Morrison, J., Wilkinson, B.T.: Linear-space data structures for range mode query in arrays. Theory Comput. Syst. **55**(4), 719–741 (2014)

10. Chan, T.M., Durocher, S., Skala, M., Wilkinson, B.T.: Linear-space data structures for range minority query in arrays. Algorithmica **72**(4), 901–913 (2015)

11. Cormode, G.: Misra-Gries summaries. In: Encyclopedia of Algorithms, pp. 1334–1337. Springer, Berlin (2016)

12. Cormode, G., Muthukrishnan, S.: Data stream methods. http://www.cs.rutgers.edu/~muthu/198-3.pdf, 2003. Lecture 3 of Rutgers 198:671 Seminar on Processing Massive Data Sets

13. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: Proceedings of the 10th European Symposium on Algorithms (ESA), pp. 348–360 (2002)

14. Durocher, S., He, M., Munro, J.I., Nicholson, P.K., Skala, M.: Range majority in constant time and linear space. Inf. Comput. **222**, 169–179 (2013)

15. Durocher, S., Shah, R., Skala, M., Thankachan, S.V.: Linear-space data structures for range frequency queries on arrays and trees. Algorithmica **74**(1), 344–366 (2016)

16. Elmasry, A., He, M., Munro, J.I., Nicholson, P.K.: Dynamic range majority data structures. Theoret. Comput. Sci. **647**, 59–73 (2016)

17. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithm. **3**(2), 1–22 (2007)

18. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. **371**(1), 115–121 (2007)

19. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. **40**(2), 465–492 (2011)

20. Gagie, T., He, M., Munro, J.I., Nicholson, P.K.: Finding frequent elements in compressed 2D arrays and strings. In: Proceedings of the 18th Symposium on String Processing and Information Retrieval (SPIRE), pp. 295–300 (2011)

21. Gagie, T., He, M., Navarro, G.: Compressed dynamic range majority and minority data structures. Algorithmica **82**(7), 2063–2086 (2020)

22. Gawrychowski, P., Nicholson, P.K.: Optimal query time for encoding range majority. In: Proceedings of the 15th International Symposium on Algorithms and Data Structures (WADS), pp. 409–420 (2017). Extended version in CoRR abs/1704.06149

23. Greve, M., Jørgensen, A.G., Larsen, K.D., Truelsen, J.: Cell probe lower bounds and approximations for range mode. In: Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP), pp. 605–616 (2010)

24. Grossi, R., Orlandi, A., Raman, R., Srinivasa Rao, S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)

25. Hon, W.-K., Shah, R., Vitter, J.: Space-efficient framework for top-\(k\) string retrieval problems. In: Proceedings of the 50th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 713–722 (2009)

26. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. **28**(1), 51–55 (2003)

27. Karpinski, M., Nekrich, Y.: Searching for frequent colors in rectangles. In: Proceedings of the 20th Canadian Conference on Computational Geometry (CCCG), pp. 11–14 (2008)

28. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J. Comput. **29**(3), 893–911 (2000)

29. Krizanc, D., Morin, P., Smid, M.H.M.: Range mode and range median queries on lists and trees. Nordic J. Comput. **12**(1), 1–17 (2005)

30. Lai, Y.K., Poon, C.K., Shi, B.: Approximate colored range and point enclosure queries. J. Discrete Algorithm. **6**(3), 420–432 (2008)

31. Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. **2**(2), 143–152 (1982)

32. Munro, J.I., Navarro, G., Nekrich, Y.: Fast compressed self-indexes with deterministic linear-time construction. Algorithmica **82**(2), 316–337 (2020)

33. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th Symposium on Discrete Algorithms (SODA), pp. 657–666 (2002)

34. Navarro, G., Thankachan, S.V.: Optimal encodings for range majority queries. Algorithmica **74**(3), 1082–1098 (2016)

35. Petersen, H.: Improved bounds for range mode and range median queries. In: Proceedings of the 34th Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), pp. 418–423 (2008)

36. Petersen, H., Grabowski, S.: Range mode and range median queries in constant time and sub-quadratic space. Inf. Process. Lett. **109**(4), 225–228 (2009)

37. Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)

38. Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithm. **5**(1), 12–22 (2007)

39. Wei, Z., Yi, K.: Beyond simple aggregates: indexing for summary queries. In: Proceedings of the 30th Symposium on Principles of Database Systems (PODS), pp. 117–128 (2011)

## Acknowledgements

Many thanks to Patrick Nicholson for helpful comments.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Funded in part by Academy of Finland Grant 268324, by NSERC Grant RGPIN-07185-2020, by ANID – Millennium Science Initiative Program – Code ICN17_002, and by Fondecyt Grant 1-200038, Chile. An early partial version of this article appeared in *Proc. WADS 2013*.

## Appendix: Finding \(\tau '\)-Majorities on Tiny Alphabets

We show how to find \(\tau '\)-majorities in time \(\mathcal {O}\!\left( {1/\tau '} \right)\) on ranges of length \(\mathcal {O}\!\left( {(1/\tau ')w^\beta } \right)\), over alphabet \([0..\sigma '-1]\), with \(\sigma '=w^\beta\), in the case \(1/\tau ' < \sigma '\). We will compute an array of \(\sigma '\) counters with the frequency of the symbols in the range, and then report those exceeding the threshold. The maximum size of the range is \((4/\tau ')w^\beta /4 \le \sigma ' w^\beta =w^{2\beta }\), and thus \(2\beta \lg w\) bits suffice to represent each counter. The \(\sigma '\) counters then require \(2\beta w^\beta \lg w\) bits and can be maintained in a computer word (although we will store them somewhat spaced for technical reasons). We can read the elements in \(S_v\) by chunks of \(w^\beta\) symbols, and compute in constant time the corresponding counters for those symbols. Then we sum the current counters and the counters for the chunk, all in constant time because they are fields in a single computer word. The range is then processed in time \(\mathcal {O}\!\left( {1/\tau '} \right)\).
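As a sanity check on these bit budgets, consider the hypothetical values \(w = 2^{16}\) and \(\beta = 1/4\). They are chosen here only so that \(w^\beta\) is an integer and are not taken from the paper.

\[
\begin{aligned}
\sigma ' = w^\beta &= 16, \qquad \lg \sigma ' = 4,\\
\text{maximum range length: } w^{2\beta } &= 256,\\
\text{bits per counter: } 2\beta \lg w &= 8 = \lg 256,\\
\text{bits for all } \sigma ' \text{ counters: } 2\beta w^\beta \lg w &= 128 \le w = 65536.
\end{aligned}
\]

Under these values the packed counters occupy a small fraction of the word, which leaves room for the spacing mentioned above.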

To compute the counters corresponding to \(w^\beta\) symbols, we extend the popcounting algorithm of Belazzougui and Navarro [6, Sec. 4.1]. Assume we extract the \(w^\beta\) symbols from \(S_v\) and have them packed in the lowest \(k\ell\) bits of a computer word *X*, where \(k=w^\beta\) is the number of symbols and \(\ell =\lg \sigma '\) is the number of bits used per symbol. We first create \(\sigma '\) copies of the sequence at distance \(2k\ell\) of each other: \(X \leftarrow X \cdot (0^{2k\ell -1}1)^{\sigma '}\). In each copy we will count the occurrences of a different symbol. To have the \((i+1)\)th copy count the occurrences of symbol *i*, for \(0 \le i < \sigma '\), we perform

\[X \leftarrow X~\textsc {xor}~0^{k\ell }((\sigma '-1)_\ell )^k \cdots 0^{k\ell }(1_\ell )^k\, 0^{k\ell }(0_\ell )^k,\]

where \(i_\ell\) is the number *i* written in \(\ell\) bits. Thus in the \((i+1)\)th copy the symbols equal to *i* become zero and the others nonzero. We then set a 1 at the highest bit of the symbols equal to *i* in the \((i+1)\)th copy, with

\[X \leftarrow (Y - (X~\textsc {and}~\overline{Y}))~\textsc {and}~\overline{X}~\textsc {and}~Y,\]

where \(Y=(0^{k\ell } (10^{\ell -1})^k)^{\sigma '}\) (see Note 3). Now we add all the 1s in each copy with \(X \leftarrow X \cdot 0^{k\ell (2\sigma '-1)} (0^{\ell -1}1)^k\). This spreads several sums across the \(2k\ell\) bits of each copy, and in particular the *k*th sum adds up all the 1s of the copy. Each sum requires \(\lg k\) bits, which is precisely the \(\ell\) bits we have allocated per field. Finally, we isolate the desired counters using \(X \leftarrow X~\textsc {and}~(0^{k\ell }1^\ell 0^{(k-1)\ell })^{\sigma '}\). The \(\sigma '\) counters are not contiguous in the computer word, but we can still afford to store them spaced: we use \(2k\ell \sigma ' = 2 \beta w^{2\beta }\lg w\) bits, which, since \(\beta \le 1/4\), is always less than *w*.
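The word-parallel counting of one chunk can be tried out on a toy instance. The sketch below uses Python integers as the machine word and is an illustration under stated assumptions, not the authors' implementation. Two choices are specific to this sketch: the chunk holds \(k = 2^\ell - 1\) symbols (rather than \(k = \sigma '\)) so that a count of *k* still fits in an \(\ell\)-bit field, and the marker bits are shifted to the lowest bit of each field before the summing multiplication so that the *k*th sum lines up with the isolation mask. All names (`chunk_counters`, `rep`, `unpack`) are hypothetical.

```python
ELL = 2                # bits per symbol (ell = lg sigma')
SIGMA = 1 << ELL       # toy alphabet size sigma' = 4
K = SIGMA - 1          # symbols per chunk (sketch assumption: k < 2^ell)
SLOT = 2 * K * ELL     # distance between consecutive copies (2*k*ell bits)
FIELD = (1 << ELL) - 1


def rep(pattern, width, times):
    """Concatenate `times` copies of a `width`-bit pattern."""
    out = 0
    for t in range(times):
        out |= pattern << (t * width)
    return out


def chunk_counters(symbols):
    """Return a word holding, in copy i, the number of occurrences of symbol i."""
    assert len(symbols) == K and all(0 <= s < SIGMA for s in symbols)
    x = 0
    for pos, s in enumerate(symbols):          # pack the chunk, first symbol lowest
        x |= s << (pos * ELL)
    x *= rep(1, SLOT, SIGMA)                   # sigma' copies at distance 2*k*ell
    xor_mask = 0
    for i in range(SIGMA):                     # copy i is XORed with i repeated k times
        xor_mask |= rep(i, ELL, K) << (i * SLOT)
    x ^= xor_mask                              # fields equal to i become zero in copy i
    y = rep(rep(1 << (ELL - 1), ELL, K), SLOT, SIGMA)
    x = (y - (x & ~y)) & ~x & y                # 1 at the highest bit of each zero field
    x >>= ELL - 1                              # sketch choice: move markers to lowest bit
    x *= rep(1, ELL, K)                        # field k-1 of each copy now holds the total
    x &= rep(FIELD << ((K - 1) * ELL), SLOT, SIGMA)
    return x


def unpack(x):
    """Read the sigma' counters back as a plain list (for inspection only)."""
    return [(x >> (i * SLOT + (K - 1) * ELL)) & FIELD for i in range(SIGMA)]


if __name__ == "__main__":
    print(unpack(chunk_counters([2, 0, 2])))
```

With these toy parameters the whole computation uses \(2k\ell \sigma ' = 48\) bits, so it would fit in a 64-bit word.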

Since the range is of length at most \(w^{2\beta }\), the cumulative counters need \(\lg (w^{2\beta }) = 2\ell\) bits. We will store them in a computer word *A* separated by \(2k\ell\) bits so that we can directly add the resulting word *X* after processing a chunk of \(w^\beta\) symbols of the range in \(S_v\): \(A \leftarrow A+X\). If the last chunk is of length \(l < w^\beta\), we complete it with zeros and then subtract those spurious \(w^\beta -l\) occurrences from the first counter, \(A \leftarrow A - (w^\beta -l) \cdot 2^{(k-1)\ell }\).
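The accumulation and the correction for the padded last chunk can be illustrated as follows. This is a minimal sketch with toy parameters (\(\ell = 2\), \(\sigma ' = 4\), \(k = 3\), all assumptions of this example). The per-chunk word *X* is built here by a plain reference routine, `pack_counts`, which stands in for the constant-time bit manipulation described earlier; only the layout of the counters is the same.

```python
ELL, SIGMA, K = 2, 4, 3        # toy parameters (assumptions of this sketch)
SLOT = 2 * K * ELL             # counters are 2*k*ell bits apart
OFFSET = (K - 1) * ELL         # position of a counter inside its slot
WIDE = (1 << (2 * ELL)) - 1    # cumulative counters use 2*ell bits


def pack_counts(counts):
    """Reference stand-in for X: place counter i at bit i*SLOT + OFFSET."""
    x = 0
    for i, c in enumerate(counts):
        x |= c << (i * SLOT + OFFSET)
    return x


def accumulate(symbols):
    """Process a range chunk by chunk, returning the word A of cumulative counters."""
    assert len(symbols) <= WIDE            # counters must fit in 2*ell bits
    a = 0
    for start in range(0, len(symbols), K):
        chunk = list(symbols[start:start + K])
        pad = K - len(chunk)
        chunk += [0] * pad                 # complete a short last chunk with zeros
        x = pack_counts([chunk.count(s) for s in range(SIGMA)])
        a += x                             # A <- A + X, all counters at once
        a -= pad << OFFSET                 # remove the spurious zeros from counter 0
    return a


def unpack(a):
    return [(a >> (i * SLOT + OFFSET)) & WIDE for i in range(SIGMA)]


if __name__ == "__main__":
    print(unpack(accumulate([0, 1, 1, 3, 1, 0, 2, 1])))
```

The single addition `a += x` updates every counter because the fields are spaced widely enough that no sum carries into a neighbouring counter.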

The last challenge is to output the counters that are at least \(y = \lfloor \tau ' (j-i+1) \rfloor +1\) after processing the range. We use

\[A \leftarrow A + (2^{2\ell }-y) \cdot (0^{(k+1)\ell -1} 1\, 0^{(k-1)\ell })^{\sigma '},\]

so that the counters reaching *y* will overflow to the next bit. We isolate those overflow bits with \(A \leftarrow A~\textsc {and}~(0^{(k-1)\ell -1} 1 0^{(k+1)\ell })^{\sigma '}\), so that we have to report the symbol *i* if and only if \(A~\textsc {and}~0^{(k(2\sigma '-2i-1)-1)\ell -1} 1 0^{(k(2i+1)+1)\ell } \not = 0\). We then repeatedly isolate the lowest bit of *A* with

\[D \leftarrow (A~\textsc {xor}~(A-1))~\textsc {and}~A,\]

and then remove it with \(A \leftarrow A~\textsc {and}~(A-1)\), until \(A=0\). Once we have a position isolated in *D*, we find the position in constant time by using a monotone minimal perfect hash function over the set \(\{ 2^{(k(2i+1)+1)\ell },~0 \le i < \sigma '\}\), which uses \(\mathcal {O}\!\left( {\sigma ' \lg w} \right) =o(w)\) bits [2]. Only one such data structure is needed for all the sequences, and it takes less space than a single systemwide pointer.
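The reporting step can be sketched on the same kind of toy layout. In the Python example below, the parameters (\(\ell = 2\), \(\sigma ' = 4\), \(k = 3\)) and all names are assumptions of this illustration, and an ordinary dictionary stands in for the monotone minimal perfect hash function; it plays the same role of mapping an isolated bit to its symbol, but it is not that data structure.

```python
ELL, SIGMA, K = 2, 4, 3        # toy parameters (assumptions of this sketch)
SLOT = 2 * K * ELL
OFFSET = (K - 1) * ELL         # lowest bit of each 2*ell-bit cumulative counter
OVERFLOW = (K + 1) * ELL       # bit just above each cumulative counter


def pack_counts(counts):
    """Build the word A directly from plain counters (for illustration only)."""
    a = 0
    for i, c in enumerate(counts):
        a |= c << (i * SLOT + OFFSET)
    return a


# Stand-in for the hash function: isolated overflow bit -> symbol.
SYMBOL_OF = {1 << (i * SLOT + OVERFLOW): i for i in range(SIGMA)}


def report(a, y):
    """Return the symbols whose counter in A is at least y, smallest symbol first."""
    assert 1 <= y <= 1 << (2 * ELL)
    ones = sum(1 << (i * SLOT + OFFSET) for i in range(SIGMA))
    a += ((1 << (2 * ELL)) - y) * ones     # counters >= y carry into the overflow bit
    a &= sum(1 << (i * SLOT + OVERFLOW) for i in range(SIGMA))
    found = []
    while a:
        d = (a ^ (a - 1)) & a              # isolate the lowest set bit
        found.append(SYMBOL_OF[d])
        a &= a - 1                         # remove that bit
    return found


if __name__ == "__main__":
    a = pack_counts([2, 4, 1, 1])          # counters of a range of length 8
    print(report(a, 3))                    # y = floor(0.25 * 8) + 1
    print(report(a, 2))                    # y = floor(0.2 * 8) + 1
```

The loop runs once per reported symbol, so the number of iterations is bounded by the number of majorities in the range.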

## About this article

### Cite this article

Belazzougui, D., Gagie, T., Munro, J.I. *et al.* Range Majorities and Minorities in Arrays.
*Algorithmica* **83**, 1707–1733 (2021). https://doi.org/10.1007/s00453-021-00799-7

DOI: https://doi.org/10.1007/s00453-021-00799-7

### Keywords

- Compressed data structures
- Range majority and minority
- Range mode