Fast and accurate stream processing by filtering the cold

  • Tong Yang
  • Jie Jiang
  • Yang Zhou
  • Long He
  • Jinyang Li
  • Bin Cui
  • Steve Uhlig
  • Xiaoming Li
Regular Paper

Abstract

Approximate stream processing algorithms, such as the Count-Min sketch and Space-Saving, support numerous applications in areas such as databases, storage systems, and networking. However, the unbalanced distribution of real data streams is challenging for existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter, that enables faster and more accurate stream processing. Unlike existing filters, which mainly focus on hot (frequent) items, our filter captures cold (infrequent) items in the first stage and hot items in the second stage. Existing filters also require two-direction communication, with frequent exchanges between the two stages; our filter, by contrast, is one-direction: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, which makes it generic enough to apply to many stream processing tasks. To illustrate its benefits, we deploy it on four typical stream processing tasks. Experimental results show speed improvements of up to 4.7 times and accuracy improvements of up to 51 times.
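
The one-direction, two-stage idea can be illustrated with a minimal Python sketch. This is not the authors' implementation; it is a simplified illustration under stated assumptions: stage 1 is a single array of small saturating counters with a conservative update, stage 2 is a plain dictionary standing in for whatever stream algorithm sits behind the filter, and the class name, the parameters (width, depth, threshold), and the hashing scheme are all hypothetical.

```python
import hashlib


class ColdFilterSketch:
    """Illustrative one-direction two-stage counter (hypothetical, simplified)."""

    def __init__(self, width=1024, depth=3, threshold=15):
        self.width = width
        self.depth = depth
        self.threshold = threshold  # stage-1 counters stop growing at this value
        self.stage1 = [0] * width   # small counters that absorb cold items
        self.stage2 = {}            # stand-in for any downstream stream algorithm

    def _slots(self, item):
        # Derive `depth` counter indexes from one digest (assumed scheme).
        digest = hashlib.blake2b(str(item).encode(),
                                 digest_size=4 * self.depth).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.width
                for i in range(self.depth)]

    def insert(self, item):
        slots = self._slots(item)
        low = min(self.stage1[s] for s in slots)
        if low < self.threshold:
            # Still cold: conservative update raises only the smallest counters.
            for s in slots:
                if self.stage1[s] == low:
                    self.stage1[s] += 1
        else:
            # Looks hot: this arrival goes to stage 2 only, never back to stage 1.
            self.stage2[item] = self.stage2.get(item, 0) + 1

    def estimate(self, item):
        low = min(self.stage1[s] for s in self._slots(item))
        if low < self.threshold:
            return low
        return self.threshold + self.stage2.get(item, 0)


if __name__ == "__main__":
    sketch = ColdFilterSketch()
    for _ in range(3):
        sketch.insert("cold-item")
    for _ in range(100):
        sketch.insert("hot-item")
    print(sketch.estimate("cold-item"), sketch.estimate("hot-item"))
```

In this sketch every arriving item is handled by exactly one stage, so there is no exchange between the stages, and estimates can overestimate (through hash collisions) but never underestimate the true count.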

Keywords

Data streams · Sketch · Frequency estimation · Top-k hot items · Heavy changes · Persistent items

Notes

Acknowledgements

This work is supported by the National Key Research and Development Program of China (2018YFB1004403, 2016YFB1000304) and NSFC (61672061, 61832001, and 61572039).

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science and Technology & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing, China
  2. Queen Mary University of London, London, UK