
Algorithmica, Volume 74, Issue 2, pp 787–811

Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

  • Andrew McGregor
  • A. Pavan
  • Srikanta Tirthapura (email author)
  • David P. Woodruff

Abstract

In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
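To illustrate why normalization alone can fail, consider the second frequency moment F2 (the sum of squared item frequencies). The sketch below assumes Bernoulli sampling, where each element is kept independently with probability p; the abstract does not state the sampling model, so this is an assumption, and the function names are hypothetical. Under that assumption, an item with true frequency f has sampled frequency g ~ Binomial(f, p), so E[g^2] = p^2 f^2 + p(1-p) f. Scaling the sample's F2 by 1/p^2 therefore overshoots by ((1-p)/p) times the stream length. Subtracting (1-p) times the sample length before scaling removes that bias. This is a textbook correction shown for illustration, not the algorithm of the paper, and it stores the whole sample rather than working in small space.

```python
import random
from collections import Counter


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (assumed model)."""
    return [x for x in stream if rng.random() < p]


def f2(stream):
    """Second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())


def naive_f2(sample, p):
    """Normalize the sample's F2 by 1/p^2. Biased upward when p < 1."""
    return f2(sample) / (p * p)


def corrected_f2(sample, p):
    """Remove the binomial term p(1-p)f per item before normalizing.

    E[F2(sample)] = p^2 * F2 + p(1-p) * F1 and E[len(sample)] = p * F1,
    so F2(sample) - (1-p) * len(sample) has expectation p^2 * F2.
    """
    return (f2(sample) - (1 - p) * len(sample)) / (p * p)


if __name__ == "__main__":
    # Hypothetical stream: 200 distinct items, each appearing 5 times.
    stream = [item for item in range(200) for _ in range(5)]
    rng = random.Random(0)
    p, trials = 0.1, 400
    naive_avg = corrected_avg = 0.0
    for _ in range(trials):
        sample = bernoulli_sample(stream, p, rng)
        naive_avg += naive_f2(sample, p) / trials
        corrected_avg += corrected_f2(sample, p) / trials
    print("true F2:", f2(stream))
    print("naive average:", round(naive_avg))
    print("corrected average:", round(corrected_avg))
```

For this stream the true F2 is 5000 and the stream length is 1000, so with p = 0.1 the naive estimate has expectation 5000 + 9 × 1000 = 14000, while the corrected estimate has expectation 5000. Individual runs fluctuate around these values.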

Keywords

Data streams · Frequency moments · Sub-sampling

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Andrew McGregor (1)
  • A. Pavan (2)
  • Srikanta Tirthapura (2), email author
  • David P. Woodruff (3)

  1. University of Massachusetts, Amherst, USA
  2. Iowa State University, Ames, USA
  3. IBM Almaden, San Jose, USA
