Conditional heavy hitters: detecting interesting correlations in data streams

Abstract

The notion of heavy hitters—items that make up a large fraction of the population—has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Notes

  1. 1.

    Note that when restricting output to have size exactly \(\tau \), precision and recall are identical, so we do not duplicate this measurement.

  2. 2.

    http://ita.ee.lbl.gov/html/contrib/WorldCup.html

References

  1. 1.

    Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: ACM SIGMOD International Conference on Management of Data, pp. 207–216 (1993)

  2. 2.

    Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: ACM Symposium on Theory of Computing, pp. 20–29 (1996)

  3. 3.

    Arasu, A., Manku, G.S.: Approximate counts and quantiles over sliding windows. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 286–296. ACM (2004)

  4. 4.

    Baum, L.E., Petrie, T.: Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37(6), 1554–1563 (1966)

    Article  MATH  MathSciNet  Google Scholar 

  5. 5.

    Boyer, B., Moore, J.: A fast majority vote algorithm. Tech. Rep. ICSCA-CMP-32. Institute for Computer Science, University of Texas (1981)

  6. 6.

    Broder, A., Mitzenmacher, M.: Network applications of bloom filters: a survey. Internet Math. 1(4), 485–509 (2005)

    Article  MathSciNet  Google Scholar 

  7. 7.

    Budak, C., Georgiou, T., Agrawal, D., El Abbadi, A.: Geoscope: online detection of geo-correlated information trends in social networks. PVLDB 7(4), 229–240 (2013)

    Google Scholar 

  8. 8.

    Chang, J.H., Lee, W.S.: Finding recent frequent itemsets adaptively over online data streams. In: KDD, pp. 487–492 (2003)

  9. 9.

    Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002)

  10. 10.

    Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases (2008)

  11. 11.

    Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Finding hierarchical heavy hitters in data streams. In: International Conference on Very Large Data Bases, pp. 464–475 (2003)

  12. 12.

    Cormode, G., Korn, F., Tirthapura, S.: Time decaying aggregates in out-of-order streams. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 89–98. ACM (2008)

  13. 13.

    Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithm. 55(1), 58–75 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  14. 14.

    Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: return to the basics. PVLDB 5(11), 1662–1673 (2012)

    Google Scholar 

  15. 15.

    Dallachiesa, M., Palpanas, T.: Identifying streaming frequent items in ad hoc time windows. Data Knowl. Eng. 87, 66–90 (2013)

    Article  Google Scholar 

  16. 16.

    Demaine, E., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: European Symposium on Algorithms (ESA) (2002)

  17. 17.

    Duong, T., Goud, B., Schauer, K.: Closed-form density-based framework for automatic detection of cellular morphology changes. Proc. Natl. Acad. Sci. 109(22), 8382–8387 (2012)

  18. 18.

    Durme, B.V., Lall, A.: Streaming pointwise mutual information. In: Advances in Neural Information Processing Systems, pp. 1892–1900 (2009)

  19. 19.

    Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: ACM SIGMOD International Conference on Management of Data, pp. 13–24 (2001)

  20. 20.

    Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.S.: Mining frequent patterns in data streams at multiple time granularities. In: Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y. (eds.) Next Generation Data Mining, pp. 191–212 (2003)

  21. 21.

    Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: SIGMOD Conference, pp. 1–12 (2000)

  22. 22.

    Lahiri, B., Tirthapura, S.: Finding correlated heavy-hitters over data streams. In: IEEE 28th International Conference on Performance Computing and Communications (IPCCC), pp. 307–314. IEEE (2009)

  23. 23.

    Lee, L-K., Ting, H.F.: A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 290–297. ACM (2006)

  24. 24.

    Letchner, J., Ré, C., Balazinska, M., Philipose, M.: Approximation trade-offs in Markovian stream processing: an empirical study. In: IEEE 26th International Conference on Data Engineering (ICDE), pp. 936–939. IEEE (2010)

  25. 25.

    Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)

    Article  Google Scholar 

  26. 26.

    Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: International Conference on Very Large Data Bases, pp. 346–357 (2002)

  27. 27.

    Metwally, A., Agrawal, D., Abbadi, A.E.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory (2005)

  28. 28.

    Mirylenka, K., Cormode, G., Palpanas, T., Srivastava, D.: Finding interesting correlations with conditional heavy hitters. In: International Conference on Data Engineering (ICDE) (2013)

  29. 29.

    Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982)

    Article  MATH  MathSciNet  Google Scholar 

  30. 30.

    Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., Los Altos (1988)

    Google Scholar 

  31. 31.

    Rabinovich, M., Spatschek, O.: Web Caching and Replication. Addison-Wesley Longman Publishing Co., Inc, Boston (2002)

    Google Scholar 

  32. 32.

    Raftery, A.E.: A model of high-order Markov chains. J. R. Stat. Soc. Series B Methodol. 47(3), 528–539 (1985)

    MATH  MathSciNet  Google Scholar 

  33. 33.

    Rubner, Y., Tomasi, C., Guibas, L.: The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40(2), 99–121 (2000)

    Article  MATH  Google Scholar 

  34. 34.

    Tantono, F.I., Manerikar, N., Palpanas, T.: Efficiently discovering recent frequent items in data streams. In: Scientific and Statistical Database Management, pp. 222–239. Springer, Berlin, Heidelberg (2008)

  35. 35.

    Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New streaming algorithms for fast detection of superspreaders. In: Network and Distributed System Security Symposium NDSS (2005)

  36. 36.

    Wang, P., Wang, H., Wang, W.: Finding semantics in time series. In: ACM SIGMOD International Conference on Management of Data, pp. 385–396 (2011)

  37. 37.

    Welch, B.L.: The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34(1/2), 28–35 (1947)

    Article  MATH  MathSciNet  Google Scholar 

  38. 38.

    Yu, P.S., Chi, Y.: Association rule mining on streams. In: Encyclopedia of Database Systems, pp. 136–139. Springer-Verlag (2009)

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Katsiaryna Mirylenka.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Mirylenka, K., Cormode, G., Palpanas, T. et al. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal 24, 395–414 (2015). https://doi.org/10.1007/s00778-015-0382-5

Download citation

Keywords

  • Streaming data
  • Online algorithms
  • Heavy hitters