Optimal Tracking of Distributed Heavy Hitters and Quantiles

Abstract

We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. Heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements drawn from the universe U={1,…,u}. For a given 0≤ϕ≤1, the ϕ-heavy hitters are those elements of A whose frequency in A is at least ϕ|A|; the ϕ-quantile of A is an element x of U such that at most ϕ|A| elements of A are smaller than x and at most (1−ϕ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each site has a two-way communication channel to a designated coordinator, whose goal is to track the set of ϕ-heavy hitters and the ϕ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ϵ ⋅ log n) for both problems, where n is the total number of items in A and ϵ is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication cost for both problems, showing that our algorithms are optimal. Finally, we consider a more general version of the problem in which we simultaneously track the ϕ-quantiles for all 0≤ϕ≤1.
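To make the two definitions concrete, the following minimal Python sketch computes the exact ϕ-heavy hitters and a ϕ-quantile of a small multiset held in one place. It is only an illustration of the definitions, not the distributed tracking algorithm of the article. The function names (`heavy_hitters`, `is_phi_quantile`, `phi_quantile`) are hypothetical, and the sketch assumes the quantile is searched among the elements of A rather than over the whole universe U.

```python
from collections import Counter


def heavy_hitters(A, phi):
    """Elements of A whose frequency in A is at least phi * |A|."""
    counts = Counter(A)
    threshold = phi * len(A)
    return {x for x, c in counts.items() if c >= threshold}


def is_phi_quantile(A, phi, x):
    """Check the definition: at most phi*|A| elements are smaller than x
    and at most (1 - phi)*|A| elements are greater than x."""
    smaller = sum(1 for a in A if a < x)
    greater = sum(1 for a in A if a > x)
    return smaller <= phi * len(A) and greater <= (1 - phi) * len(A)


def phi_quantile(A, phi):
    """Smallest element of A that satisfies the phi-quantile definition.

    Assumption for this sketch: only elements of A are considered as
    candidates, although the definition allows any x in the universe U.
    """
    for x in sorted(set(A)):
        if is_phi_quantile(A, phi, x):
            return x
    return None  # only reachable for an empty multiset


A = [1, 1, 1, 2, 3, 3, 4, 5, 5, 5]
print(heavy_hitters(A, 0.25))  # elements occurring at least 2.5 times
print(phi_quantile(A, 0.5))    # a median of A under the definition above
```

In the tracking setting described above, the coordinator never sees A in one place like this; it would maintain answers that satisfy these definitions only approximately, up to the error ϵ.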

Author information

Corresponding author

Correspondence to Qin Zhang.

Additional information

A preliminary version of this article was presented at the ACM Symposium on Principles of Database Systems (PODS), 2009.

Ke Yi was supported in part by a DAG and an RPC grant from HKUST, and a Google Faculty Research Award.

Most of this work was done while Qin Zhang was a Ph.D. student at HKUST.

About this article

Cite this article

Yi, K., Zhang, Q. Optimal Tracking of Distributed Heavy Hitters and Quantiles. Algorithmica 65, 206–223 (2013). https://doi.org/10.1007/s00453-011-9584-4

  • DOI: https://doi.org/10.1007/s00453-011-9584-4

Keywords

  • Communication Cost
  • Tracking Algorithm
  • Deterministic Algorithm
  • Tracking Problem
  • Tracking Period