Lightweight Metric Computation for Distributed Massive Data Streams

  • Emmanuelle Anceaume
  • Yann Busnel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10430)


The real-time analysis of massive data streams is of utmost importance in data-intensive applications that need to detect, as quickly and as efficiently as possible (in terms of computation and memory space), any correlation between their inputs or any deviation from some expected nominal behavior. The IoT infrastructure can be used to monitor any events or changes in structural conditions that can compromise safety and increase risk. It is thus a recurrent and crucial issue to determine whether huge data streams, received at monitored devices, are correlated, as correlation may reveal the presence of attacks. We propose a metric, called codeviation, that makes it possible to evaluate the correlation between distributed massive streams. This metric is inspired by a classical metric in statistics and probability theory, and as such makes it possible to understand how observed quantities change together, and in what proportion. We then propose to estimate the codeviation in the data stream model. In this model, functions are estimated on a huge sequence of data items, in an online fashion, and with a very small amount of memory with respect to both the size of the input stream and the domain from which data items are drawn. We then generalize our approach by presenting a new metric, the Sketch-\(\star \) metric, which allows us to define a distance between updatable summaries of large data streams. An important feature of the Sketch-\(\star \) metric is that, given a measure on the entire initial data streams, it preserves the axioms of that measure on the sketch. We finally present the results of extensive experiments conducted on both synthetic traces and real data sets, which validate the robustness and accuracy of our metrics.
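As a rough illustration of the idea, the sketch below computes an exact, non-streaming quantity of the kind the abstract describes. The abstract does not give the formal definition, so this is an assumption: the codeviation is taken here to be the covariance of the two streams' item-frequency vectors over a shared domain. The function name `codeviation` and the toy streams are hypothetical, and the memory-efficient estimation in the data stream model is not reproduced.

```python
from collections import Counter


def codeviation(stream_a, stream_b, domain):
    """Covariance of the item-frequency vectors of two streams.

    Assumption: this mirrors the classical covariance, applied to the
    number of occurrences of each domain item in each stream.
    """
    n = len(domain)
    freq_a = Counter(stream_a)  # occurrences of each item in stream A
    freq_b = Counter(stream_b)  # occurrences of each item in stream B
    mean_a = sum(freq_a[v] for v in domain) / n
    mean_b = sum(freq_b[v] for v in domain) / n
    # Average product of the deviations from each stream's mean frequency.
    return sum(
        (freq_a[v] - mean_a) * (freq_b[v] - mean_b) for v in domain
    ) / n


# Toy streams over a four-item domain (illustrative data only).
domain = ["a", "b", "c", "d"]
same = codeviation("aaab", "aaab", domain)      # identical frequency profiles
opposite = codeviation("aaab", "cccd", domain)  # mass on different items
print(same, opposite)
```

A positive value indicates that the same items tend to be frequent in both streams, and a negative value that they tend to be frequent in one stream and rare in the other. This exact version stores one counter per distinct item, which is precisely the cost that the data stream model aims to avoid.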


Data stream model · Correlation metric · Statistical metric · Distributed approximation algorithm




Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. IRISA/CNRS, Rennes, France
  2. IMT Atlantique/IRISA, Rennes, France
