Abstract
Sketch is a probabilistic data structure designed for the estimation of item frequencies in a multiset, which is extensively used in data stream processing. The key metrics of sketches for data streams are accuracy, speed, and memory usage. There are various sketches in the literature, but most of them cannot achieve high accuracy, high speed and using limited memory at the same time for skewed datasets. Recently, two new sketches, the Pyramid sketch [1] and the OM sketch [2], have been proposed to tackle the problem. In this paper, we look closely at five different but important aspects of these two solutions and discuss the details on conditions and limits of their methods. Three of them, memory utilization, isolation and neutralization are related to accuracy; the other two: memory access and hash calculation are related to speed. We found that the new techniques proposed: automatic enlargement and hierarchy for accuracy, word acceleration and hash bit technique for speed play the central role in the improvement, but they also have limitations and side-effects. Other properties of working sketches such as deletion and generality are also discussed. Our discussions are supported by extensive experimental results, and we believe they can help in future development for better sketches.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid sketch: a sketch framework for frequency estimation of data streams. Proc. VLDB Endow. 10(11), 1442–1453 (2017)
Zhou, Y., Liu, P., Jin, H., Yang, T., Dang, S., Li, X.: One memory access sketch: a more accurate and faster sketch for per-flow measurement. In: IEEE GLOBECOM (2017)
Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)
Cormode, G., Johnson, T., Korn, F., Muthukrishnan, S., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD, pp. 35–46. ACM (2004)
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4(1–3), 1–294 (2012)
Roy, P., Khan, A., Alonso, G.: Augmented sketch: faster and more accurate stream processing. In: ACM SIGMOD, pp. 1449–1463. ACM (2016)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Chen, A., Jin, Y., Cao, J., Li, L.E.: Tracking long duration flows in network traffic. In: IEEE INFOCOM, pp. 1–5. IEEE (2010)
Liu, Z., Manousis, A., Vorsanger, G., Sekar, V., Braverman, V.: One sketch to rule them all: rethinking network flow monitoring with UnivMon. In: ACM SIGCOMM, pp. 101–114. ACM (2016)
Gilbert, A.C., Strauss, M.J., Tropp, J.A., Vershynin, R.: One sketch for all: fast algorithms for compressed sensing. In: ACM STOC, pp. 237–246. ACM (2007)
Durme, B.V., Lall, A.: Probabilistic counting with randomized storage. In: IJCAI, pp. 1574–1579. Morgan Kaufmann Publishers Inc. (2009)
Polyzotis, N., Garofalakis, M., Ioannidis, Y.: Approximate XML query answers. In: ACM SIGMOD, pp. 263–274. ACM (2004)
Estan, C., Varghese, G.: New directions in traffic measurement and accounting. ACM Trans. Comput. Syst. 21(3), 270–313 (2002)
Powers, D.M.W.: Applications and explanations of Zipf’s law. Adv. Neural. Inf. Process. Syst. 5(4), 595–599 (1998)
Adamic, L.A., Huberman, B.A., Barabási, A.L., Albert, R., Jeong, H., Bianconi, G.: Power-law distribution of the World Wide Web. Science 287(5461), 2115 (2000)
Yang, T., Liu, L., Yan, Y., Shahzad, M., Shen, Y., Li, X., Cui, B., Xie, G.: SF-sketch: a fast, accurate, and memory efficient data structure to store frequencies of data items. In: IEEE ICDE. IEEE (2017)
Graham, C.: Sketch techniques for approximate query processing. Found. Trends Databases (2011)
Qiao, Y., Li, T., Chen, S.: One memory access bloom filters and their generalization. Proc. IEEE INFOCOM 28(6), 1745–1753 (2011)
Acknowledgements
This work was supported by Shenzhen Basic Research Program (JCYJ20160525 154348175), the Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing) and Shenzhen Key Lab Project (ZDSYS20170303140513705).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Sun, S., Li, D. (2018). Discussion on Fast and Accurate Sketches for Skewed Data Streams: A Case Study. In: Cai, Y., Ishikawa, Y., Xu, J. (eds) Web and Big Data. APWeb-WAIM 2018. Lecture Notes in Computer Science(), vol 10988. Springer, Cham. https://doi.org/10.1007/978-3-319-96893-3_7
Download citation
DOI: https://doi.org/10.1007/978-3-319-96893-3_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96892-6
Online ISBN: 978-3-319-96893-3
eBook Packages: Computer ScienceComputer Science (R0)