Abstract
Sketches are widely used in real-world applications to estimate the frequencies of data items. Because the volume of Internet data is growing at an unprecedented rate while on-chip memory sizes grow comparatively slowly, existing sketches are increasingly unable to keep the accuracy of their frequency estimates at an acceptable level. In this paper, we design a new sketch, called FID-sketch, that achieves significantly higher accuracy with a much smaller on-chip memory footprint than existing sketches. The key intuition behind FID-sketch is that, unlike prior sketches, before inserting an item it first estimates the frequency currently stored in the sketch for that item, and then increments as few counters as possible instead of a pre-determined fixed number of counters. We carried out extensive experiments to evaluate FID-sketch and compare it with existing sketches on multi-core CPU and GPU platforms. Our experimental results show that FID-sketch significantly outperforms the state of the art, with 36.7 times lower relative error. We have released the source code of FID-sketch and of the other sketches that we implemented on GitHub [21].
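The "estimate first, then increment as few counters as possible" intuition can be illustrated with a minimal sketch. The code below is not the FID-sketch algorithm, whose details are not given in this abstract. It is a Count-Min sketch with the well-known conservative-update rule, used here only as a stand-in for the idea. The class name, the parameters `depth` and `width`, and the salted BLAKE2b hashing are all assumptions made for illustration.

```python
import hashlib


class ConservativeCountMin:
    """Count-Min sketch with conservative update (illustrative stand-in, not FID-sketch)."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth  # number of counter rows (assumed parameter)
        self.width = width  # counters per row (assumed parameter)
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Hypothetical hash choice: BLAKE2b salted with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def query(self, item):
        # The estimate is the minimum over the item's counters.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def insert(self, item):
        # Step 1: estimate the frequency currently stored for this item.
        idx = [self._index(item, r) for r in range(self.depth)]
        current = min(self.rows[r][i] for r, i in enumerate(idx))
        # Step 2: increment only the counters equal to that estimate,
        # rather than all `depth` counters.
        touched = 0
        for r, i in enumerate(idx):
            if self.rows[r][i] == current:
                self.rows[r][i] += 1
                touched += 1
        return touched  # number of counters actually incremented
```

Under this rule an estimate never falls below the true count, and counters that already exceed the current estimate are left alone, which limits the overestimation caused by hash collisions.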
References
Aguilar-Saborit, J., Trancoso, P., Muntes-Mulero, V., Larriba-Pey, J.-L.: Dynamic count filters. ACM SIGMOD Record, pp. 26–32 (2006)
Barman, D., Satapathy, P., Ciardo, G.: Detecting attacks in routers using sketches. In: Proceedings of the High Performance Switching and Routing (2007)
Bu, T., Cao, J., Chen, A., Lee, P.P.: A fast and compact method for unveiling significant patterns in high speed networks. In: Proceedings of the IEEE INFOCOM, pp. 1893–1901 (2007)
Callegari, C., Cyprus, N.: Statistical approaches for network anomaly detection. In: Proceedings of the ICIMP (2009)
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB 10(2-3), 199–223 (2000)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Automata, Languages and Programming (2002)
Chen, A., Jin, Y., Cao, J., Li, L.E.: Tracking long duration flows in network traffic. In: Proceedings of the IEEE INFOCOM (2010)
Cisco visual networking index: Forecast and methodology, 2015–2020. CISCO White paper
Cohen, S., Matias, Y.: Spectral Bloom filters. In: Proceedings of the ACM SIGMOD, pp. 241–252 (2003)
Cormode, G., Garofalakis, M.: Sketching streams through the net: Distributed approximate query tracking. In: Proceedings of the VLDB (2005)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithm. 55(1), 58–75 (2005)
Cormode, G., Johnson, T., et al.: Holistic UDAFs at streaming speeds. In: Proceedings of the SIGMOD (2004)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB 1(2), 1530–1541 (2008)
Estan, C., Varghese, G.: New directions in traffic measurement and accounting. Proc. ACM SIGCOMM 32(4), 323–338 (2002)
Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: A scalable wide-area web cache sharing protocol. In: Proceedings of the ACM SIGCOMM (1998)
Kollios, G., Byers, J.W., Considine, J., Hadjieleftheriou, M., Li, F.: Robust aggregation in sensor networks. IEEE Data Eng. Bull. 28(1), 26–32 (2005)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with Sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)
Pitel, G., Fouquier, G.: Count-min-log sketch: Approximately counting with approximate counters. arXiv:1502.04885 (2015)
Potti, N., Patel, J.M.: Daq: a new paradigm for approximate query processing. In: Proceedings of the VLDB (2015)
Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of the EMNLP-CoNLL. Association for Computational Linguistics (1998)
Source code of FID sketches with CUDA implementation. https://github.com/papers2016/FID-sketch.git
Yang, T., Xie, G., Li, Y., et al.: Guarantee IP lookup performance with FIB explosion. In: Proceedings of the SIGCOMM (2014)
Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A Shifting Bloom Filter Framework for Set Queries. In: Proceedings of the VLDB (2016)
Yang, T., Liu, A.X., Shahzad, M., Yang, D., Fu, Q., Xie, G., Li, X.: A Shifting Framework for Set Queries. IEEE/ACM Transactions on Networking (ToN) (2017)
Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams. In: Proceedings of the VLDB (2017)
Zhang, Y., Singh, S., Sen, S., Duffield, N., Lund, C.: Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of the ACM IMC (2004)
Zhao, Q.G., Ogihara, M., Wang, H., Xu, J.J.: Finding global icebergs over distributed data sets. In: Proceedings of the ACM PODS. ACM (2006)
Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing. In: Proceedings of the SIGMOD (2018)
Acknowledgments
This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), the Open Project Funding of CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, and the National Science Foundation (CNS 1616317, CNS 1616273).
Additional information
Tong Yang and Haowei Zhang are co-primary authors. Haowei Zhang and Hao Wang completed this work under the guidance of their supervisor, Tong Yang.
This article belongs to the Topical Collection: Special Issue on Web and Big Data
Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao
About this article
Cite this article
Yang, T., Zhang, H., Wang, H. et al. FID-sketch: an accurate sketch to store frequencies in data streams. World Wide Web 22, 2675–2696 (2019). https://doi.org/10.1007/s11280-018-0546-5
DOI: https://doi.org/10.1007/s11280-018-0546-5