The VLDB Journal, Volume 24, Issue 3, pp 395–414

Conditional heavy hitters: detecting interesting correlations in data streams

  • Katsiaryna Mirylenka
  • Graham Cormode
  • Themis Palpanas
  • Divesh Srivastava
Regular Paper

Abstract

The notion of heavy hitters—items that make up a large fraction of the population—has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.
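To make the notion of a conditionally frequent item concrete, the following is a minimal sketch that counts (parent, child) pairs exactly and reports the children that account for a large share of their parent's occurrences. It is an illustrative offline baseline, not one of the streaming algorithms proposed in the paper: the function name, the threshold `phi`, and the `min_parent_count` filter are assumptions introduced here, and the exact counters use memory proportional to the number of distinct pairs.

```python
from collections import Counter


def conditional_heavy_hitters(pairs, phi, min_parent_count=1):
    """Return {(parent, child): conditional frequency} for every pair whose
    child makes up at least a `phi` fraction of the parent's occurrences.

    Hypothetical helper for illustration only: it keeps exact counts,
    whereas a streaming setting would need bounded-memory summaries.
    """
    pair_counts = Counter()
    parent_counts = Counter()
    for parent, child in pairs:
        pair_counts[(parent, child)] += 1
        parent_counts[parent] += 1

    result = {}
    for (parent, child), n in pair_counts.items():
        total = parent_counts[parent]
        # Skip parents seen too rarely to be informative (assumed filter).
        if total >= min_parent_count and n / total >= phi:
            result[(parent, child)] = n / total
    return result


# Toy input: each element is a (parent, child) pair.
stream = [
    ("a", "x"), ("a", "x"), ("a", "x"), ("a", "y"),
    ("b", "z"),
    ("c", "x"), ("c", "y"),
]

print(conditional_heavy_hitters(stream, phi=0.75, min_parent_count=2))
# → {('a', 'x'): 0.75}
```

In the toy input, "x" is the most common child overall, yet it is conditionally frequent only under parent "a" (3 of 4 occurrences), while under parent "c" it accounts for just half. Parent "b" is ignored because it appears only once.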

Keywords

Streaming data, Online algorithms, Heavy hitters

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Katsiaryna Mirylenka (1)
  • Graham Cormode (2)
  • Themis Palpanas (3)
  • Divesh Srivastava (4)

  1. The University of Trento, Trento, Italy
  2. The University of Warwick, Coventry, UK
  3. Paris Descartes University, Paris, France
  4. AT&T Labs, Bedminster, USA