Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Huang, Zengfeng; Yi, Ke; Zhang, Qin

doi:10.1007/s00453-018-00531-y

Published: 11 December 2018

Volume 81, pages 2222–2243, (2019)

Algorithmica

A Correction to this article was published on 28 July 2020

This article has been updated

Abstract

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k players, each holding a counter $n_i$ that gets incremented over time, and the goal is to track an $\varepsilon $-approximation of their sum $n=\sum _i n_i$ continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is ${\varTheta }(k/\varepsilon \cdot \log N)$, where N is the final value of n when the tracking finishes, we show that with randomization, the communication cost can be reduced to ${\varTheta }(\sqrt{k}/\varepsilon \cdot \log N)$. Our algorithm is simple and uses only O(1) space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems: frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.
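To get a feel for the gap between the two bounds stated above, the following sketch evaluates $k/\varepsilon \cdot \log N$ and $\sqrt{k}/\varepsilon \cdot \log N$ with all hidden constants set to 1. The function names, the base of the logarithm, and the sample parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def deterministic_cost(k, eps, N):
    # Theta(k/eps * log N) with the hidden constant assumed to be 1
    return k / eps * math.log2(N)

def randomized_cost(k, eps, N):
    # Theta(sqrt(k)/eps * log N) with the hidden constant assumed to be 1
    return math.sqrt(k) / eps * math.log2(N)

# Hypothetical parameters: 100 players, 1% error, final count of 10^9
k, eps, N = 100, 0.01, 10**9
det = deterministic_cost(k, eps, N)
rnd = randomized_cost(k, eps, N)
print(f"deterministic ~ {det:.3g}, randomized ~ {rnd:.3g}, ratio = {det / rnd:.1f}")
```

Under these assumptions the ratio of the two expressions is exactly $\sqrt{k}$, independent of $\varepsilon $ and N.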

Change history

28 July 2020
After publication of the article [1], the authors noticed that the funding information was not published in the online and print versions of the article. The omitted funding acknowledgement is given below.

Notes

We sometimes omit “(t)” when the context is clear.
A more careful analysis leads to a slightly better bound of $O(k/\varepsilon \cdot \log (\varepsilon N/k))$, but we will assume that N is sufficiently large, compared to k and $1/\varepsilon $, to simplify the bounds.
The lower bound in [29] was stated for the heavy hitters tracking problem, but essentially the same proof works for count-tracking.
The random sampling algorithm [10, 24] actually achieves a slightly better bound of $O(k\log N / \log (k\varepsilon ^2))$.
In Feller’s book [11] the following is proved. Let $p \in (0,1)$ be some constant and $q = 1-p$. The population size is N and the sample size is n, so that $n < N$ and Np, Nq are both integers. The hypergeometric distribution is $p(k; n, N) = \binom{Np}{k}\binom{Nq}{n-k}/\binom{N}{n}$ for $0 \le k \le n$. Theorem 8 [11]: If $N \rightarrow \infty , n \rightarrow \infty $ so that $n/N \rightarrow t\in (0,1)$ and $x_k := (k - np)/\sqrt{npq} \rightarrow x$, then
$$\begin{aligned} p(k;n,N) \sim \frac{e^{-x^2/(2(1-t))}}{\sqrt{2\pi npq(1-t)}}. \end{aligned}$$
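The approximation in this note can be checked numerically. The sketch below compares the exact hypergeometric probability with the normal-type expression from the theorem; the function names and the particular values of N, n, and p are assumptions chosen only for illustration.

```python
import math

def hypergeom_pmf(k, n, N, p):
    # Exact p(k; n, N): k marked items in a sample of size n drawn
    # without replacement from N items, of which Np are marked.
    marked = round(N * p)
    return math.comb(marked, k) * math.comb(N - marked, n - k) / math.comb(N, n)

def normal_approx(k, n, N, p):
    # Right-hand side of the theorem, with t taken to be n/N.
    q = 1.0 - p
    t = n / N
    x = (k - n * p) / math.sqrt(n * p * q)
    return math.exp(-x * x / (2 * (1 - t))) / math.sqrt(2 * math.pi * n * p * q * (1 - t))

# Hypothetical parameters: population 2000, sample 500, half the items marked
N, n, p = 2000, 500, 0.5
for k in (240, 250, 260):
    exact = hypergeom_pmf(k, n, N, p)
    approx = normal_approx(k, n, N, p)
    print(k, exact, approx, exact / approx)
```

For values of k near the mean np, the ratio of the two quantities should be close to 1, and it approaches 1 as N and n grow with n/N fixed.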

References

Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J.M., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the ACM Symposium on Principles of Database Systems (2012)
Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2009)
Babcock, B., Olston, C.: Distributed top-k monitoring. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2003)
Bar-Yossef, Z.: The complexity of massive data set computations. PhD thesis, University of California at Berkeley (2002)
Chan, H.-L., Lam, T.W., Lee, L.-K., Ting, H.-F.: Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica 62(3–4), 1088–1111 (2011)
Cormode, G.: The continuous distributed monitoring model. ACM SIGMOD Rec. 42(1), 5–14 (2013)
Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Data Bases (2008)
Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. ACM Trans. Algorithms 7(2), Article 21 (2011). (Preliminary version in SODA’08)
Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10 (2012). (Preliminary version in PODS’10)
Feller, W.: An Introduction to Probability Theory and Its Applications. Wiley, New York (1968)
Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (2001)
Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)
Huang, Z., Wang, L., Yi, K., Liu, Y.: Sampling based algorithms for quantile computation in sensor networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2011)
Huang, Z., Yi, K., Liu, Y., Chen, G.: Optimal sampling algorithms for frequency estimation in distributed data. In: IEEE INFOCOM (2011)
Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2006)
Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: Proceedings of the IEEE International Conference on Data Engineering (2005)
Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the International Conference on Very Large Data Bases (2002)
Metwally, A., Agrawal, D., Abbadi, A.: An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst. 31(3), 1095–1133 (2006)
Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982)
Munro, J.I., Paterson, M.S.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)
Patt-Shamir, B., Shafrir, A.: Approximate distributed top-k queries. Distrib. Comput. 21(1), 1–22 (2008)
Suri, S., Toth, C., Zhou, Y.: Range counting over multidimensional data streams. Discrete Comput. Geom. 36, 633–655 (2006)
Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: Proceedings of the International Symposium on Distributed Computing (2011)
Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
Woodruff, D.P.: Efficient and Private Distance Approximation in the Communication and Streaming Models. PhD thesis, Massachusetts Institute of Technology (2007)
Woodruff, D.P., Zhang, Q.: Tight bounds for distributed functional monitoring. In: Proceedings of the ACM Symposium on Theory of Computing (2012)
Yao, A.C.: Probabilistic computations: towards a unified measure of complexity. In: Proceedings of the IEEE Symposium on Foundations of Computer Science (1977)
Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. In: Proceedings of the ACM Symposium on Principles of Database Systems (2009)

Author information

Authors and Affiliations

School of Data Science, Fudan University, Shanghai, China
Zengfeng Huang
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Ke Yi
Indiana University Bloomington, Bloomington, USA
Qin Zhang

Corresponding author

Correspondence to Zengfeng Huang.

Appendix: Lower Bound for the Sampling Problem

Claim

To solve the sampling problem, the coordinator needs to probe ${\varOmega }(k)$ sites.

Proof

Suppose that the coordinator only samples $z = o(k)$ sites. Let X be the number of sampled sites that hold bit 1. Then X follows the hypergeometric distribution with probability mass function (pmf) $\mathsf {Pr}[X = x] = \binom{s'}{x}\binom{k' - s'}{z - x}/\binom{k'}{z}$. The expected value of X is $\frac{z}{k'} \cdot s'$, which is $\frac{z}{k'}\left( \frac{k}{2} - y + \sqrt{k}\right) $ or $\frac{z}{k'}\left( \frac{k}{2} - y - \sqrt{k}\right) $, depending on the value of $s'$. Let $p = \left( \frac{k}{2} - y\right) /k' = \frac{1}{2} \pm o(1)$ and $\alpha = \sqrt{k}/k' = 1/\sqrt{k} \pm o(1/\sqrt{k})$. To avoid tedious calculation, we assume that X is picked randomly from one of the two normal distributions $\mathcal {N}_1(\mu _1, \sigma _1^2)$ and $\mathcal {N}_2(\mu _2, \sigma _2^2)$ with equal probability, where $\mu _1 = z(p-\alpha ), \mu _2 = z(p+\alpha ), \sigma _1, \sigma _2 = {\varTheta }(\sqrt{zp(1-p)}) = {\varTheta }(\sqrt{z})$. In Feller [11] it is shown that the normal distribution approximates the hypergeometric distribution very well when z is large and $p \pm \alpha $ are constants in (0, 1) (see the note on Feller's Theorem 8 in the Notes section). Now our task is to decide, based on the value of X, from which of the two distributions X is drawn, with success probability at least 0.7.

Let $f_1(x; \mu _1, \sigma _1^2)$ and $f_2(x; \mu _2, \sigma _2^2)$ be the pdfs of the two normal distributions $\mathcal {N}_1, \mathcal {N}_2$, respectively. It is easy to see that the best deterministic algorithm for distinguishing the two distributions based on the value of a sample X does the following.

If $X > x_0$, then X is chosen from $\mathcal {N}_2$, otherwise X is chosen from $\mathcal {N}_1$, where $x_0$ is the value such that $f_1(x_0; \mu _1, \sigma _1^2) = f_2(x_0; \mu _2, \sigma _2^2)$ (thus $\mu _1< x_0 < \mu _2$).

Indeed, if $X > x_0$ and the algorithm decides that “X is chosen from $\mathcal {N}_1$”, we can always flip this decision and improve the success probability of the algorithm.

The error comes from two sources: (1) $X > x_0$ but X is actually drawn from $\mathcal {N}_1$; (2) $X \le x_0$ but X is actually drawn from $\mathcal {N}_2$. The total error is

$$\begin{aligned} 1/2 \cdot ({\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2)), \end{aligned}$$

where $\ell _1 = x_0 - \mu _1$ and $\ell _2 = \mu _2 - x_0$ (thus $\ell _1 + \ell _2 = \mu _2 - \mu _1 = 2 \alpha z$), and ${\varPhi }(\cdot )$ is the cumulative distribution function (cdf) of the standard normal distribution. See Fig. 2.

Finally, note that $\ell _1/\sigma _1 = O(\alpha z / \sqrt{z}) = O(\sqrt{z/k}) = o(1)$ and likewise $\ell _2/\sigma _2 = O(\alpha z / \sqrt{z}) = o(1)$. Since ${\varPhi }$ is continuous with ${\varPhi }(0) = 1/2$, for sufficiently large k we have ${\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2) > 0.99$. Therefore, the failure probability is at least $0.99/2 > 0.49$, contradicting the required success probability of 0.7. Thus we must have $z = {\varOmega }(k)$. $\square $
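The dependence of the error on the ratio z/k can be illustrated numerically. The sketch below evaluates the total error $1/2 \cdot ({\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2))$ under simplifying assumptions that are ours and not part of the proof: $p = 1/2$ and $\alpha = 1/\sqrt{k}$ exactly, and $\sigma _1 = \sigma _2 = \sqrt{zp(1-p)}$, so that $x_0$ is the midpoint of $\mu _1$ and $\mu _2$ and $\ell _1 = \ell _2 = \alpha z$. The function names are hypothetical.

```python
import math

def phi(x):
    # Cdf of the standard normal distribution
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def threshold_error(z, k):
    # Simplifying assumptions: p = 1/2, alpha = 1/sqrt(k),
    # sigma_1 = sigma_2 = sqrt(z * p * (1 - p)), hence ell_1 = ell_2 = alpha * z.
    alpha = 1.0 / math.sqrt(k)
    sigma = math.sqrt(z * 0.25)
    ell = alpha * z
    return 0.5 * (phi(-ell / sigma) + phi(-ell / sigma))

k = 10**6
for z in (k // 10000, k // 100, k // 10, k):
    print(f"z/k = {z / k:g}: error = {threshold_error(z, k):.4f}")
```

Under these assumptions $\ell /\sigma = 2\sqrt{z/k}$, so the error stays close to 1/2 while z is a vanishing fraction of k and falls below 0.3 only once z is a constant fraction of k, in line with the claim.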

About this article

Cite this article

Huang, Z., Yi, K. & Zhang, Q. Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks. Algorithmica 81, 2222–2243 (2019). https://doi.org/10.1007/s00453-018-00531-y

Received: 27 October 2016
Accepted: 23 November 2018
Published: 11 December 2018
Issue Date: 01 June 2019
