Identifying correlated heavy-hitters in a two-dimensional data stream

Lahiri, Bibudh; Mukherjee, Arko Provo; Tirthapura, Srikanta

doi:10.1007/s10618-015-0438-6

Published: 24 October 2015

Volume 30, pages 797–818 (2016)

Data Mining and Knowledge Discovery

Bibudh Lahiri¹,
Arko Provo Mukherjee² &
Srikanta Tirthapura²


Abstract

We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
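
To make the query in the abstract concrete, the following is a minimal sketch of an exact, brute-force CHH computation. It is not the small-space streaming algorithm the abstract refers to: it stores every distinct (x, y) pair and serves only to illustrate the query semantics. The function name `exact_chh` and the thresholds `phi1` (fraction of the stream an x value must exceed to count as a heavy-hitter) and `phi2` (fraction of x's occurrences a y value must exceed to count as frequent with x) are assumptions introduced for this illustration and do not appear in the abstract.

```python
from collections import Counter, defaultdict


def exact_chh(stream, phi1, phi2):
    """Exact correlated heavy-hitters over a finite stream of (x, y) tuples.

    Hypothetical thresholds (not taken from the abstract):
      phi1: x is a heavy-hitter if it occurs in more than phi1 * N tuples.
      phi2: y is frequent with x if (x, y) occurs in more than
            phi2 * (number of tuples containing x).
    Returns a dict mapping each heavy-hitter x to {y: count of (x, y)}.
    """
    x_counts = Counter()                # occurrences of each x
    pair_counts = defaultdict(Counter)  # occurrences of each y, per x
    n = 0                               # stream length N

    for x, y in stream:
        n += 1
        x_counts[x] += 1
        pair_counts[x][y] += 1

    result = {}
    for x, fx in x_counts.items():
        if fx > phi1 * n:  # x is a heavy-hitter along the primary dimension
            result[x] = {
                y: c for y, c in pair_counts[x].items() if c > phi2 * fx
            }
    return result


# Toy stream: "a" dominates the primary dimension, and y = 1 dominates
# the substream of tuples whose x value is "a".
stream = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("a", 1), ("c", 4)]
chh = exact_chh(stream, phi1=0.5, phi2=0.5)
print(chh)
```

This baseline needs space proportional to the number of distinct pairs, which is the cost an approximate, small-workspace formulation is meant to avoid.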



Acknowledgments

The authors were supported in part by the National Science Foundation through Grants NSF CNS-0834743 and CNS-0831903.

Author information

Authors and Affiliations

Impetus Technologies, Los Gatos, CA, 95032, USA
Bibudh Lahiri
Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, 50011, USA
Arko Provo Mukherjee & Srikanta Tirthapura

Corresponding author

Correspondence to Srikanta Tirthapura.

Additional information

Responsible editor: Bart Goethals.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (docx 84 KB)

About this article

Cite this article

Lahiri, B., Mukherjee, A.P. & Tirthapura, S. Identifying correlated heavy-hitters in a two-dimensional data stream. Data Min Knowl Disc 30, 797–818 (2016). https://doi.org/10.1007/s10618-015-0438-6

Received: 02 October 2013
Accepted: 15 September 2015
Published: 24 October 2015
Issue Date: July 2016
DOI: https://doi.org/10.1007/s10618-015-0438-6
