FID-sketch: an accurate sketch to store frequencies in data streams

World Wide Web

Abstract

Sketches are widely used in real-world applications to estimate the frequencies of data items. Because the volume of Internet data is growing far faster than the size of on-chip memories, existing sketches increasingly struggle to keep frequency estimates acceptably accurate. In this paper, we design a new sketch, called FID-sketch, that achieves significantly higher accuracy with a much smaller on-chip memory footprint than existing sketches. The key intuition behind FID-sketch is that, unlike prior sketches, it first estimates the item's current frequency as stored in the sketch before inserting the item, and then increments as few counters as possible rather than a predetermined, fixed number of counters. We carried out extensive experiments to evaluate and compare the performance of FID-sketch with existing sketches on multi-core CPU and GPU platforms. Our experimental results show that FID-sketch significantly outperforms the state of the art, with 36.7 times lower relative error. We have released the source code of our proposed sketch and of the other sketches we implemented on GitHub [21].
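
The abstract does not spell out the FID-sketch data structure, so the following is only a minimal sketch of a related, well-known idea: a Count-Min sketch [11] with conservative update, which likewise estimates an item's stored frequency first and then increments only the counters that need it. It is not the authors' algorithm; the class name `ConservativeCountMin`, the `width` and `depth` parameters, and the salted BLAKE2b hashing are all assumptions made for illustration.

```python
import hashlib


class ConservativeCountMin:
    """Count-Min sketch with conservative update (illustrative, NOT FID-sketch)."""

    def __init__(self, width=1024, depth=4):
        # width and depth are hypothetical defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def query(self, item):
        # The estimate is the smallest counter the item maps to.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def insert(self, item):
        # Step 1: estimate the frequency currently stored for this item.
        positions = [self._index(item, r) for r in range(self.depth)]
        estimate = min(self.rows[r][p] for r, p in enumerate(positions))
        # Step 2: increment only the counters equal to that estimate,
        # instead of incrementing all `depth` counters unconditionally.
        touched = 0
        for r, p in enumerate(positions):
            if self.rows[r][p] == estimate:
                self.rows[r][p] += 1
                touched += 1
        return touched  # number of counters actually incremented


sketch = ConservativeCountMin(width=64, depth=3)
for _ in range(5):
    sketch.insert("flow-a")
print(sketch.query("flow-a"))  # 5: a single distinct item cannot collide with others
```

Counters that already exceed the estimate are left untouched, so collisions with other items inflate the stored values less than they would under a fixed-count update, while the estimate still never falls below the true frequency.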

References

  1. Aguilar-Saborit, J., Trancoso, P., Muntes-Mulero, V., Larriba-Pey, J.-L.: Dynamic count filters. ACM SIGMOD Record, pp. 26–32 (2006)

  2. Barman, D., Satapathy, P., Ciardo, G.: Detecting attacks in routers using sketches. In: Proceedings of the High Performance Switching and Routing (2007)

  3. Bu, T., Cao, J., Chen, A., Lee, P.P.: A fast and compact method for unveiling significant patterns in high speed networks. In: Proceedings of the IEEE INFOCOM, pp. 1893–1901 (2007)

  4. Callegari, C., Cyprus, N.: Statistical approaches for network anomaly detection. In: Proceedings of the ICIMP (2009)

  5. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB 10(2-3), 199–223 (2000)

  6. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Automata, Languages and Programming (2002)

  7. Chen, A., Jin, Y., Cao, J., Li, L.E.: Tracking long duration flows in network traffic. In: Proceedings of the IEEE INFOCOM (2010)

  8. Cisco visual networking index: Forecast and methodology, 2015–2020. CISCO White paper

  9. Cohen, S., Matias, Y.: Spectral Bloom filters. In: Proceedings of the ACM SIGMOD, pp. 241–252 (2003)

  10. Cormode, G., Garofalakis, M.: Sketching streams through the net: Distributed approximate query tracking. In: Proceedings of the VLDB (2005)

  11. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithm. 55(1), 58–75 (2005)

  12. Cormode, G., Johnson, T., et al.: Holistic UDAFs at streaming speeds. In: Proceedings of the SIGMOD (2004)

  13. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB 1(2), 1530–1541 (2008)

  14. Estan, C., Varghese, G.: New directions in traffic measurement and accounting. Proc. ACM SIGCOMM 32(4), 323–338 (2002)

  15. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: A scalable wide-area web cache sharing protocol. In: Proceedings of the ACM SIGCOMM (1998)

  16. Kollios, G., Byers, J.W., Considine, J., Hadjieleftheriou, M., Li, F.: Robust aggregation in sensor networks. IEEE Data Eng. Bull. 28(1), 26–32 (2005)

  17. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with Sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)

  18. Pitel, G., Fouquier, G.: Count-min-log sketch: Approximately counting with approximate counters. arXiv:1502.04885 (2015)

  19. Potti, N., Patel, J.M.: Daq: a new paradigm for approximate query processing. In: Proceedings of the VLDB (2015)

  20. Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of the EMNLP-CoNLL. Association for Computational Linguistics (1998)

  21. Source code of FID sketches with CUDA implementation. https://github.com/papers2016/FID-sketch.git

  22. Yang, T., Xie, G., Li, Y., et al.: Guarantee IP lookup performance with FIB explosion. In: Proceedings of the SIGCOMM (2014)

  23. Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A shifting Bloom filter framework for set queries. In: Proceedings of the VLDB (2016)

  24. Yang, T., Liu, A.X., Shahzad, M., Yang, D., Fu, Q., Xie, G., Li, X.: A shifting framework for set queries. IEEE/ACM Transactions on Networking (ToN) (2017)

  25. Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid sketch: a sketch framework for frequency estimation of data streams. In: Proceedings of the VLDB (2017)

  26. Zhang, Y., Singh, S., Sen, S., Duffield, N., Lund, C.: Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of the ACM IMC (2004)

  27. Zhao, Q.G., Ogihara, M., Wang, H., Xu, J.J.: Finding global icebergs over distributed data sets. In: Proceedings of the ACM PODS. ACM (2006)

  28. Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: a meta-framework for faster and more accurate stream processing. In: Proceedings of the SIGMOD (2018)

Acknowledgments

This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), the Open Project Funding of CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, and the National Science Foundation (CNS 1616317, CNS 1616273).

Author information

Corresponding author

Correspondence to Xue Liu.

Additional information

Tong Yang and Haowei Zhang are the co-primary authors. Haowei Zhang and Hao Wang finished this work under the guidance of their supervisor: Tong Yang.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

Cite this article

Yang, T., Zhang, H., Wang, H. et al. FID-sketch: an accurate sketch to store frequencies in data streams. World Wide Web 22, 2675–2696 (2019). https://doi.org/10.1007/s11280-018-0546-5

  • DOI: https://doi.org/10.1007/s11280-018-0546-5