Abstract
The notion of heavy hitters—items that make up a large fraction of the population—has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.
This is a preview of subscription content,
to check access.














Similar content being viewed by others
Notes
Note that when restricting output to have size exactly \(\tau \), precision and recall are identical, so we do not duplicate this measurement.
References
Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: ACM SIGMOD International Conference on Management of Data, pp. 207–216 (1993)
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: ACM Symposium on Theory of Computing, pp. 20–29 (1996)
Arasu, A., Manku, G.S.: Approximate counts and quantiles over sliding windows. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 286–296. ACM (2004)
Baum, L.E., Petrie, T.: Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37(6), 1554–1563 (1966)
Boyer, B., Moore, J.: A fast majority vote algorithm. Tech. Rep. ICSCA-CMP-32. Institute for Computer Science, University of Texas (1981)
Broder, A., Mitzenmacher, M.: Network applications of bloom filters: a survey. Internet Math. 1(4), 485–509 (2005)
Budak, C., Georgiou, T., Agrawal, D., El Abbadi, A.: Geoscope: online detection of geo-correlated information trends in social networks. PVLDB 7(4), 229–240 (2013)
Chang, J.H., Lee, W.S.: Finding recent frequent itemsets adaptively over online data streams. In: KDD, pp. 487–492 (2003)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases (2008)
Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Finding hierarchical heavy hitters in data streams. In: International Conference on Very Large Data Bases, pp. 464–475 (2003)
Cormode, G., Korn, F., Tirthapura, S.: Time decaying aggregates in out-of-order streams. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 89–98. ACM (2008)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithm. 55(1), 58–75 (2005)
Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: return to the basics. PVLDB 5(11), 1662–1673 (2012)
Dallachiesa, M., Palpanas, T.: Identifying streaming frequent items in ad hoc time windows. Data Knowl. Eng. 87, 66–90 (2013)
Demaine, E., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: European Symposium on Algorithms (ESA) (2002)
Duong, T., Goud, B., Schauer, K.: Closed-form density-based framework for automatic detection of cellular morphology changes. Proc. Natl. Acad. Sci. 109(22), 8382–8387 (2012)
Durme, B.V., Lall, A.: Streaming pointwise mutual information. In: Advances in Neural Information Processing Systems, pp. 1892–1900 (2009)
Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: ACM SIGMOD International Conference on Management of Data, pp. 13–24 (2001)
Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.S.: Mining frequent patterns in data streams at multiple time granularities. In: Kargupta, H., Joshi, A., Sivakumar, K., Yesha, Y. (eds.) Next Generation Data Mining, pp. 191–212 (2003)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: SIGMOD Conference, pp. 1–12 (2000)
Lahiri, B., Tirthapura, S.: Finding correlated heavy-hitters over data streams. In: IEEE 28th International Conference on Performance Computing and Communications (IPCCC), pp. 307–314. IEEE (2009)
Lee, L-K., Ting, H.F.: A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 290–297. ACM (2006)
Letchner, J., Ré, C., Balazinska, M., Philipose, M.: Approximation trade-offs in Markovian stream processing: an empirical study. In: IEEE 26th International Conference on Data Engineering (ICDE), pp. 936–939. IEEE (2010)
Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)
Manku, G., Motwani, R.: Approximate frequency counts over data streams. In: International Conference on Very Large Data Bases, pp. 346–357 (2002)
Metwally, A., Agrawal, D., Abbadi, A.E.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory (2005)
Mirylenka, K., Cormode, G., Palpanas, T., Srivastava, D.: Finding interesting correlations with conditional heavy hitters. In: International Conference on Data Engineering (ICDE) (2013)
Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152 (1982)
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., Los Altos (1988)
Rabinovich, M., Spatschek, O.: Web Caching and Replication. Addison-Wesley Longman Publishing Co., Inc, Boston (2002)
Raftery, A.E.: A model of high-order Markov chains. J. R. Stat. Soc. Series B Methodol. 47(3), 528–539 (1985)
Rubner, Y., Tomasi, C., Guibas, L.: The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40(2), 99–121 (2000)
Tantono, F.I., Manerikar, N., Palpanas, T.: Efficiently discovering recent frequent items in data streams. In: Scientific and Statistical Database Management, pp. 222–239. Springer, Berlin, Heidelberg (2008)
Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New streaming algorithms for fast detection of superspreaders. In: Network and Distributed System Security Symposium NDSS (2005)
Wang, P., Wang, H., Wang, W.: Finding semantics in time series. In: ACM SIGMOD International Conference on Management of Data, pp. 385–396 (2011)
Welch, B.L.: The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34(1/2), 28–35 (1947)
Yu, P.S., Chi, Y.: Association rule mining on streams. In: Encyclopedia of Database Systems, pp. 136–139. Springer-Verlag (2009)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Mirylenka, K., Cormode, G., Palpanas, T. et al. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal 24, 395–414 (2015). https://doi.org/10.1007/s00778-015-0382-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-015-0382-5