Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Algorithmica

A Correction to this article was published on 28 July 2020

This article has been updated

Abstract

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k players, each holding a counter \(n_i\) that gets incremented over time, and the goal is to track an \(\varepsilon \)-approximation of their sum \(n=\sum _i n_i\) continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is \({\varTheta }(k/\varepsilon \cdot \log N)\), where N is the final value of n when the tracking finishes, we show that with randomization, the communication cost can be reduced to \({\varTheta }(\sqrt{k}/\varepsilon \cdot \log N)\). Our algorithm is simple and uses only O(1) space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems: frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.

Change history

  • 28 July 2020

    After publication of the article [1], the authors noticed that the funding information was not published in the online and print versions of the article. The omitted funding acknowledgement is given below.

Notes

  1. We sometimes omit “(t)” when the context is clear.

  2. A more careful analysis leads to a slightly better bound of \(O(k/\varepsilon \cdot \log (\varepsilon N/k))\), but we will assume that N is sufficiently large, compared to k and \(1/\varepsilon \), to simplify the bounds.

  3. The lower bound in [29] was stated for the heavy hitters tracking problem, but essentially the same proof works for count-tracking.

  4. The bound of the random sampling algorithm [10, 24] is actually slightly better, namely \(O(k\log N / \log (k\varepsilon ^2))\).

  5. In Feller’s book [11] the following is proved. Let \(p \in (0,1)\) be some constant and \(q = 1-p\). The population size is N and the sample size is n, so that \(n < N\) and Np, Nq are both integers. The hypergeometric distribution is \(p(k; n, N) = {Np \atopwithdelims ()k}{Nq \atopwithdelims ()n-k}/{N \atopwithdelims ()n}\) for \(0 \le k \le n\). Theorem 8 [11]. If \(N \rightarrow \infty , n \rightarrow \infty \) so that \(n/N \rightarrow t\in (0,1)\) and \(x_k := (k - np)/\sqrt{npq} \rightarrow x\), then

    $$\begin{aligned} p(k;n,N) \sim \frac{e^{-x^2/(2(1-t))}}{\sqrt{2\pi npq(1-t)}} \end{aligned}$$
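As a numerical sanity check on the approximation in Note 5, the following minimal sketch compares the exact hypergeometric probability with the normal expression from the theorem. The function names and the parameter values (N = 10000, Np = 5000, n = 2000) are illustrative assumptions, not taken from the article.

```python
import math

def hypergeom_pmf(k, n, N, Np):
    """Exact p(k; n, N): k marked items in a sample of n drawn from N items, Np of them marked."""
    Nq = N - Np
    return math.comb(Np, k) * math.comb(Nq, n - k) / math.comb(N, n)

def normal_approx(k, n, N, Np):
    """Normal expression from the theorem in Note 5, using t = n/N."""
    p = Np / N
    q = 1.0 - p
    t = n / N
    x = (k - n * p) / math.sqrt(n * p * q)
    return math.exp(-x * x / (2.0 * (1.0 - t))) / math.sqrt(2.0 * math.pi * n * p * q * (1.0 - t))

# Illustrative parameters (assumptions): p = 1/2 and t = 0.2.
N, Np, n = 10_000, 5_000, 2_000
for k in (960, 1000, 1040):
    exact = hypergeom_pmf(k, n, N, Np)
    approx = normal_approx(k, n, N, Np)
    print(k, exact, approx, approx / exact)
```

For these parameters the ratio of the two values stays close to 1 near the mean, which is the regime the appendix relies on.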

References

  1. Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J.M., Wei, Z., Yi, K.: Mergeable summaries. In: Proceedings of the ACM Symposium on Principles of Database Systems (2012)

  2. Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2009)

  3. Babcock, B., Olston, C.: Distributed top-k monitoring. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2003)

  4. Bar-Yossef, Z.: The complexity of massive data set computations. PhD thesis, University of California at Berkeley (2002)

  5. Chan, H.-L., Lam, T.W., Lee, L.-K., Ting, H.-F.: Continuous monitoring of distributed data streams over a time-based sliding window. Algorithmica 62(3–4), 1088–1111 (2011)

  6. Cormode, G.: The continuous distributed monitoring model. ACM SIGMOD Rec. 42(1), 5–14 (2013)

  7. Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)

  8. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Data Bases (2008)

  9. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. ACM Trans. Algorithms 7(2), Article 21 (2011). (Preliminary version in SODA’08)

  10. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. J. ACM 59(2), 10 (2012). (Preliminary version in PODS’10)

  11. Feller, W.: An Introduction to Probability Theory and Its Applications. Wiley, New York (1968)

  12. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (2001)

  13. Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)

  14. Huang, Z., Wang, L., Yi, K., Liu, Y.: Sampling based algorithms for quantile computation in sensor networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2011)

  15. Huang, Z., Yi, K., Liu, Y., Chen, G.: Optimal sampling algorithms for frequency estimation in distributed data. In: IEEE INFOCOM (2011)

  16. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2006)

  17. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: Proceedings of the IEEE International Conference on Data Engineering (2005)

  18. Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the International Conference on Very Large Data Bases (2002)

  19. Metwally, A., Agrawal, D., Abbadi, A.: An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst. 31(3), 1095–1133 (2006)

  20. Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982)

  21. Munro, J.I., Paterson, M.S.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)

  22. Patt-Shamir, B., Shafrir, A.: Approximate distributed top-k queries. Distrib. Comput. 21(1), 1–22 (2008)

  23. Suri, S., Toth, C., Zhou, Y.: Range counting over multidimensional data streams. Discrete Comput. Geom. 36, 633–655 (2006)

  24. Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: Proceedings of the International Symposium on Distributed Computing (2011)

  25. Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)

  26. Woodruff, D.P.: Efficient and Private Distance Approximation in the Communication and Streaming Models. PhD thesis, Massachusetts Institute of Technology (2007)

  27. Woodruff, D.P., Zhang, Q.: Tight bounds for distributed functional monitoring. In: Proceedings of the ACM Symposium on Theory of Computing (2012)

  28. Yao, A.C.: Probabilistic computations: towards a unified measure of complexity. In: Proceedings of the IEEE Symposium on Foundations of Computer Science (1977)

  29. Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. In: Proceedings of the ACM Symposium on Principles of Database Systems (2009)

Author information

Corresponding author

Correspondence to Zengfeng Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Lower Bound for the Sampling Problem

Claim

To solve the sampling problem we need to probe at least \({\varOmega }(k)\) sites.

Proof

Suppose that the coordinator only samples \(z = o(k)\) sites. Let X be the number of sites that are sampled with bit 1. Then X is chosen from the hypergeometric distribution with probability mass function (pmf) \(\mathsf {Pr}[X = x] = {s' \atopwithdelims ()x}{k' - s' \atopwithdelims ()z - x}/{k' \atopwithdelims ()z}\). The expected value of X is \(\frac{z}{k'} \cdot s'\), which is \(\frac{z}{k'}\left( \frac{k}{2} - y + \sqrt{k}\right) \) or \(\frac{z}{k'}\left( \frac{k}{2} - y - \sqrt{k}\right) \), depending on the value of \(s'\). Let \(p = \left( \frac{k}{2} - y\right) /k' = \frac{1}{2} \pm o(1)\) and \(\alpha = \sqrt{k}/k' = 1/\sqrt{k} \pm o(1/\sqrt{k})\). To avoid tedious calculation, we assume that X is picked randomly from one of the two normal distributions \(\mathcal {N}_1(\mu _1, \sigma _1^2)\) and \(\mathcal {N}_2(\mu _2, \sigma _2^2)\) with equal probability, where \(\mu _1 = z(p-\alpha ), \mu _2 = z(p+\alpha ), \sigma _1, \sigma _2 = {\varTheta }(\sqrt{zp(1-p)}) = {\varTheta }(\sqrt{z})\). In Feller [11] it is shown that the normal distribution approximates the hypergeometric distribution very well when z is large and \(p \pm \alpha \) are constants in (0, 1) (see Note 5). Now our task is to decide from which of the two distributions X is drawn, based on the value of X, with success probability at least 0.7.

Fig. 2 Differentiating two distributions

Let \(f_1(x; \mu _1, \sigma _1^2)\) and \(f_2(x; \mu _2, \sigma _2^2)\) be the pdfs of the two normal distributions \(\mathcal {N}_1, \mathcal {N}_2\), respectively. It is easy to see that the best deterministic algorithm for differentiating the two distributions based on the value of a sample X will do the following.

  • If \(X > x_0\), declare that X is chosen from \(\mathcal {N}_2\); otherwise declare that X is chosen from \(\mathcal {N}_1\), where \(x_0\) is the value such that \(f_1(x_0; \mu _1, \sigma _1^2) = f_2(x_0; \mu _2, \sigma _2^2)\) (thus \(\mu _1< x_0 < \mu _2\)).

Indeed, if \(X > x_0\) and the algorithm decides that “X is chosen from \(\mathcal {N}_1\)”, we can always flip this decision and improve the success probability of the algorithm.
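The threshold rule can be sketched as follows. This is only an illustration under stated assumptions: the helper names `crossing_point` and `decide` are hypothetical, bisection is used as a convenient way to locate \(x_0\), and the example values of z, p and \(\alpha \) are chosen for readability rather than taken from the article.

```python
import math

def log_pdf(x, mu, sigma):
    """Logarithm of the normal density with mean mu and standard deviation sigma."""
    return -((x - mu) ** 2) / (2.0 * sigma ** 2) - math.log(sigma * math.sqrt(2.0 * math.pi))

def crossing_point(mu1, sigma1, mu2, sigma2, iters=100):
    """Bisection for x0 in (mu1, mu2) with f1(x0) = f2(x0); requires mu1 < mu2."""
    def g(x):
        return log_pdf(x, mu1, sigma1) - log_pdf(x, mu2, sigma2)
    lo, hi = mu1, mu2
    if not (g(lo) > 0.0 > g(hi)):
        # Can happen only when the two standard deviations differ a lot.
        raise ValueError("no sign change of f1 - f2 on (mu1, mu2)")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if g(mid) > 0.0:
            lo = mid   # f1 still larger, so the crossing lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

def decide(x, x0):
    """Return 2 if the sample is attributed to N_2, and 1 otherwise."""
    return 2 if x > x0 else 1

# Illustrative values (assumptions): z = 400, p = 0.45, alpha = 0.01.
z, p, alpha = 400, 0.45, 0.01
mu1, mu2 = z * (p - alpha), z * (p + alpha)
sigma1 = math.sqrt(z * (p - alpha) * (1 - p + alpha))
sigma2 = math.sqrt(z * (p + alpha) * (1 - p - alpha))
x0 = crossing_point(mu1, sigma1, mu2, sigma2)
print(mu1, x0, mu2, decide(182.0, x0))
```

When the two standard deviations are equal, \(x_0\) is simply the midpoint \((\mu _1 + \mu _2)/2\).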

The error comes from two sources: (1) \(X > x_0\) but X is actually drawn from \(\mathcal {N}_1\); (2) \(X \le x_0\) but X is actually drawn from \(\mathcal {N}_2\). The total error is

$$\begin{aligned} 1/2 \cdot ({\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2)), \end{aligned}$$

where \(\ell _1 = x_0 - \mu _1\) and \(\ell _2 = \mu _2 - x_0\) (thus \(\ell _1 + \ell _2 = \mu _2 - \mu _1 = 2 \alpha z\)), and \({\varPhi }(\cdot )\) is the cumulative distribution function (cdf) of the standard normal distribution. See Fig. 2.

Finally note that \(\ell _1/\sigma _1 = O(\alpha z / \sqrt{z}) = O(\sqrt{z/k}) = o(1)\) and \(\ell _2/\sigma _2 = O(\alpha z / \sqrt{z}) = o(1)\). Since \({\varPhi }\) is continuous with \({\varPhi }(0) = 1/2\), each term is \(1/2 - o(1)\), so \({\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2) > 0.99\) for sufficiently large k. Therefore, the failure probability is at least 0.49, contradicting our success probability guarantee of 0.7. Thus we must have \(z = {\varOmega }(k)\). \(\square \)
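To see the final step numerically, the sketch below evaluates the error expression \(1/2 \cdot ({\varPhi }(-\ell _1/\sigma _1) + {\varPhi }(-\ell _2/\sigma _2))\) in a simplified symmetric setting. The assumptions are ours, not the article's: \(p = 1/2\), \(\alpha = 1/\sqrt{k}\), \(\sigma _1 = \sigma _2 = \sqrt{zp(1-p)}\) (so \(\ell _1 = \ell _2 = \alpha z\)), and the finite-population correction is ignored.

```python
import math

def std_normal_cdf(u):
    """Cdf of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def failure_probability(z, k):
    """Error 1/2 * (Phi(-l1/s1) + Phi(-l2/s2)) under the symmetric assumptions above."""
    alpha = 1.0 / math.sqrt(k)
    sigma = math.sqrt(z * 0.25)   # sqrt(z p (1 - p)) with p = 1/2
    ell = alpha * z               # l1 = l2 = alpha * z when sigma1 = sigma2
    return 0.5 * (std_normal_cdf(-ell / sigma) + std_normal_cdf(-ell / sigma))

k = 10 ** 6
for z in (10 ** 2, 10 ** 4, 10 ** 5, 10 ** 6):
    print(z, failure_probability(z, k))
```

In this simplified model the failure probability is close to 1/2 when z is a vanishing fraction of k, and it drops below the allowed 0.3 only once z is a constant fraction of k, in line with the claim.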

Cite this article

Huang, Z., Yi, K. & Zhang, Q. Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks. Algorithmica 81, 2222–2243 (2019). https://doi.org/10.1007/s00453-018-00531-y

  • DOI: https://doi.org/10.1007/s00453-018-00531-y
