Fast and accurate mining of correlated heavy hitters

Epicoco, Italo; Cafaro, Massimo; Pulimeno, Marco

doi:10.1007/s10618-017-0526-x

Fast and accurate mining of correlated heavy hitters

Published: 05 July 2017

Volume 32, pages 162–186, (2018)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

844 Accesses
11 Citations
1 Altmetric
Explore all metrics

Abstract

The problem of mining correlated heavy hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra–Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and show, through extensive experimental results, that our algorithm outperforms the Misra–Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

http://ita.ee.lbl.gov/html/contrib/WorldCup.html

References

Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. ACM SIGMOD Rec 22(2):207–216
Article Google Scholar
Ananthakrishna R, Das A, Gehrke J, Korn F, Muthukrishnan S, Srivastava D (2003) Efficient approximation of correlated sums on data streams. IEEE Trans Knowl Data Eng 15(3):569–572
Article Google Scholar
Cafaro M, Pulimeno M (2016) Merging frequent summaries. In: Proceedings of the 17th Italian conference on theoretical computer science (ICTCS 2016), vol 1720, CEUR proceedings, pp 280–285
Cafaro M, Tempesta P (2011) Finding frequent items in parallel. Concurr Comput Pract Exp 23(15):1774–1788. doi:10.1002/cpe.1761
Article Google Scholar
Cafaro M, Pulimeno M, Epicoco I, Aloisio G (2016a) Mining frequent items in the time fading model. Inf Sci 370–371:221–238. doi:10.1016/j.ins.2016.07.077
Cafaro M, Pulimeno M, Tempesta P (2016b) A parallel space saving algorithm for frequent items and the hurwitz zeta distribution. Inf Sci 329:1–19. doi:10.1016/j.ins.2015.09.003, http://www.sciencedirect.com/science/article/pii/S002002551500657X
Cafaro M, Pulimeno M, Epicoco I, Aloisio G (2017) Parallel space saving on multi- and many-core processors. Concurr Comput Pract Exp. doi:10.1002/cpe.4160
Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: ICALP ’02: proceedings of the 29th international colloquium on automata. Languages and programming. Springer-Verlag, pp 693–703
Chen L, Mei Q (2014) Mining frequent items in data stream using time fading model. Inf Sci 257:54–69. doi:10.1016/j.ins.2013.09.007, http://www.sciencedirect.com/science/article/pii/S0020025513006403
Cheng J, Ke Y, Ng W (2008) A survey on algorithms for mining frequent itemsets over data streams. Knowl Inf Syst 16(1):1–27. doi:10.1007/s10115-007-0092-4
Article Google Scholar
Cormode G, Hadjieleftheriou M (2009) Finding the frequent items in streams of data. Commun ACM 52(10):97–105. doi:10.1145/1562764.1562789
Article Google Scholar
Cormode G, Muthukrishnan S (2005a) An improved data stream summary: the count-min sketch and its applications. J Algorithms 55(1):58–75. doi:10.1016/j.jalgor.2003.12.001
Article MathSciNet MATH Google Scholar
Cormode G, Muthukrishnan S (2005b) What’s hot and what’s not: tracking most frequent items dynamically. ACM Trans Database Syst 30(1):249–278. doi:10.1145/1061318.1061325
Article Google Scholar
Cormode G, Korn F, Tirthapura S (2008) Exponentially decayed aggregates on data streams. In: IEEE 24th international conference on data engineering, 2008, ICDE 2008, pp 1379–1381, doi:10.1109/ICDE.2008.4497562
Cormode G, Tirthapura S, Xu B (2009) Time-decayed correlated aggregates over data streams. Stat Anal Data Min 2(5–6):294–310
Article MathSciNet Google Scholar
Das S, Antony S, Agrawal D, El Abbadi A (2009) Thread cooperation in multicore architectures for frequency counting over multiple data streams. Proc VLDB Endow 2(1):217–228. doi:10.14778/1687627.1687653
Article Google Scholar
Datar M, Gionis A, Indyk P, Motwani R (2002) Maintaining stream statistics over sliding windows: (extended abstract). In: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’02. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 635–644
Demaine ED, López-Ortiz A, Munro JI (2002) Frequency estimation of internet packet streams with limited space. In: ESA, pp 348–360
Erra U, Frola B (2012) Frequent items mining acceleration exploiting fast parallel sorting on the gpu. Proc Comput Sci 9(0):86–95, doi:10.1016/j.procs.2012.04.010, http://www.sciencedirect.com/science/article/pii/S1877050912001317, proceedings of the international conference on computational science, ICCS 2012
Gehrke J, Korn F, Srivastava D (2001) On computing correlated aggregates over continual data streams. SIGMOD Rec 30(2):13–24. doi:10.1145/376284.375665
Article Google Scholar
Govindaraju NK, Raghuvanshi N, Manocha D (2005) Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, SIGMOD ’05, ACM, pp 611–622, doi:10.1145/1066157.1066227
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. ACM SIGMOD Rec 29(2):1–12
Article Google Scholar
Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: Proceedings of CIKM, ACM Press, pp 287–294
Karp RM, Shenker S, Papadimitriou CH (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28(1):51–55. doi:10.1145/762471.762473
Article Google Scholar
Lahiri B, Mukherjee AP, Tirthapura S (2016) Identifying correlated heavy-hitters in a two-dimensional data stream. Data Min Knowl Discov 30(4):797–818. doi:10.1007/s10618-015-0438-6
Article MathSciNet Google Scholar
Manerikar N, Palpanas T (2009) Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl Eng 68(4):415–430. doi:10.1016/j.datak.2008.11.001
Article Google Scholar
Manjhi A, Shkapenyuk V, Dhamdhere K, Olston C (2005) Finding (recently) frequent items in distributed data streams. In: Proceedings of 21st international conference on data engineering, 2005, ICDE 2005, pp 767–778, doi:10.1109/ICDE.2005.68
Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: VLDB, pp 346–357
Metwally A, Agrawal D, Abbadi AE (2006) An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans Database Syst 31(3):1095–1133. doi:10.1145/1166074.1166084
Article Google Scholar
Mirylenka K, Cormode G, Palpanas T, Srivastava D (2015) Conditional heavy hitters: detecting interesting correlations in data streams. VLDB J 24(3):395–414. doi:10.1007/s00778-015-0382-5
Article Google Scholar
Misra J, Gries D (1982) Finding repeated elements. Sci Comput Progr 2(2):143–152
Article MathSciNet MATH Google Scholar
Roy P, Teubner J, Alonso G (2012) Efficient frequent item counting in multi-core hardware. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12, ACM, pp 1451–1459, doi:10.1145/2339530.2339757
Tangwongsan K, Tirthapura S, Wu KL (2014) Parallel streaming frequency-based aggregates. In: Proceedings of the 26th ACM symposium on parallelism in algorithms and architectures, SPAA ’14, ACM, pp 236–245, doi:10.1145/2612669.2612695
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
Article Google Scholar
Zhang Y (2012) Parallelizing the weighted lossy counting algorithm in high-speed network monitoring. In: Second international conference on instrumentation, measurement, computer, communication and control (IMCCC), pp 757–761, doi:10.1109/IMCCC.2012.183
Zhang Y, Singh S, Sen S, Duffield N, Lund C (2004) Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, ACM, pp 101–114
Zhang Y, Sun Y, Zhang J, Xu J, Wu Y (2014) An efficient framework for parallel and continuous frequent item monitoring. Concurr Comput Pract Exp 26(18):2856–2879. doi:10.1002/cpe.3182
Article Google Scholar

Download references

Author information

Authors and Affiliations

University of Salento, Lecce, Italy
Italo Epicoco, Massimo Cafaro & Marco Pulimeno
Euro-Mediterranean Center on Climate Change, CMCC Foundation, Lecce, Italy
Italo Epicoco & Massimo Cafaro

Authors

Italo Epicoco
View author publications
You can also search for this author in PubMed Google Scholar
Massimo Cafaro
View author publications
You can also search for this author in PubMed Google Scholar
Marco Pulimeno
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Massimo Cafaro.

Additional information

Responsible editor: Toon Calders.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Epicoco, I., Cafaro, M. & Pulimeno, M. Fast and accurate mining of correlated heavy hitters. Data Min Knowl Disc 32, 162–186 (2018). https://doi.org/10.1007/s10618-017-0526-x

Download citation

Received: 15 November 2016
Accepted: 22 June 2017
Published: 05 July 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s10618-017-0526-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast and accurate mining of correlated heavy hitters

Abstract

Access this article

Similar content being viewed by others

Identifying correlated heavy-hitters in a two-dimensional data stream

Parallel Mining of Correlated Heavy Hitters

Conditional heavy hitters: detecting interesting correlations in data streams

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Fast and accurate mining of correlated heavy hitters

Abstract

Access this article

Similar content being viewed by others

Identifying correlated heavy-hitters in a two-dimensional data stream

Parallel Mining of Correlated Heavy Hitters

Conditional heavy hitters: detecting interesting correlations in data streams

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation