An accurate estimation algorithm for big data streams

Distributed and Parallel Databases

Abstract

A sketch is a memory-efficient data structure used to store and query the frequency of any item in a given multiset. Because sketches support fast queries and updates, they have been applied in various fields, and different sketches offer different trade-offs. Sketches were originally proposed for estimating flow sizes in network measurement, where the key factors are insertion speed and accuracy. In this paper, we propose a new sketch that significantly improves insertion speed while also improving accuracy. Our key methods include on-chip/off-chip separation and a partial update algorithm. Extensive experimental results show that our sketch significantly outperforms the state of the art in both accuracy and speed.
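To make the notion of a frequency sketch concrete, the following is a minimal illustration of the classic Count-Min sketch in Python. It is not the sketch proposed in this paper: the on-chip/off-chip separation and the partial update algorithm are not described in this preview and are not modelled here. The class name, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters.

    This is a generic baseline, not the structure proposed in the paper.
    """

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the
        # row number (an assumption; any independent hash family would do).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def insert(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def query(self, item: str) -> int:
        # Hash collisions can only inflate a counter, so the minimum over
        # the rows is the tightest available estimate.
        return min(
            self.rows[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for flow in ["a", "b", "a", "a"]:
    sketch.insert(flow)
estimate_a = sketch.query("a")
estimate_b = sketch.query("b")
```

With non-negative insertions, the estimate for an item is never below its true frequency and never above the total number of insertions. Every insertion touches one counter per row, which is the cost that faster sketches aim to reduce.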

Author information

Corresponding author

Correspondence to Qin Xin.

About this article

Cite this article

Xin, Q., Wu, J. An accurate estimation algorithm for big data streams. Distrib Parallel Databases 36, 461–483 (2018). https://doi.org/10.1007/s10619-018-7225-5

  • DOI: https://doi.org/10.1007/s10619-018-7225-5
