
Identifying correlated heavy-hitters in a two-dimensional data stream

Data Mining and Knowledge Discovery

Abstract

We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
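To make the query form above concrete, the following is a minimal sketch under stated assumptions, not the algorithm of the article, which the abstract does not describe. The thresholds `phi1` and `phi2`, the capacities `s1` and `s2`, and all function and class names are hypothetical. `exact_chh` is a brute-force baseline that stores every count. `NestedMisraGries` swaps in a nested version of the well-known Misra-Gries frequent-items summary as one plausible small-space approach. As a simplification, it leaves the inner summaries untouched when the outer counters are decremented, so no error guarantee is claimed for it.

```python
from collections import Counter


def exact_chh(stream, phi1, phi2):
    """Exact baseline: stores every count, so it is NOT small-space."""
    pairs = list(stream)
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    pair_counts = Counter(pairs)
    result = {}
    for x, fx in x_counts.items():
        if fx > phi1 * n:  # x is a heavy-hitter along the primary dimension
            result[x] = {
                y
                for (px, y), fxy in pair_counts.items()
                if px == x and fxy > phi2 * fx  # y is frequent within x's substream
            }
    return result


class NestedMisraGries:
    """Illustrative small-space tracker (assumes s1 >= 1 and s2 >= 1)."""

    def __init__(self, s1, s2):
        self.s1 = s1        # maximum number of tracked x values
        self.s2 = s2        # maximum number of tracked y values per x
        self.n = 0          # number of tuples seen so far
        self.x_count = {}   # x -> estimated count of x
        self.y_count = {}   # x -> {y -> estimated count of (x, y)}

    def update(self, x, y):
        self.n += 1
        if x in self.x_count:
            self.x_count[x] += 1
            self._update_inner(self.y_count[x], y)
        elif len(self.x_count) < self.s1:
            self.x_count[x] = 1
            self.y_count[x] = {y: 1}
        else:
            # Misra-Gries step: decrement every tracked x and drop zeros.
            for k in list(self.x_count):
                self.x_count[k] -= 1
                if self.x_count[k] == 0:
                    del self.x_count[k]
                    del self.y_count[k]

    def _update_inner(self, inner, y):
        if y in inner:
            inner[y] += 1
        elif len(inner) < self.s2:
            inner[y] = 1
        else:
            # Same Misra-Gries step, applied to the y values of one x.
            for k in list(inner):
                inner[k] -= 1
                if inner[k] == 0:
                    del inner[k]

    def query(self, phi1, phi2):
        """Report tracked x values and, for each, the y values that look frequent."""
        result = {}
        for x, cx in self.x_count.items():
            if cx > phi1 * self.n:
                result[x] = {
                    y for y, cy in self.y_count[x].items() if cy > phi2 * cx
                }
        return result


# Small made-up stream: "a" dominates x, and "p" dominates among a's partners.
stream = [("a", "p")] * 6 + [("a", "q")] * 2 + [("b", "p")] * 3 + [("c", "r")]
sketch = NestedMisraGries(s1=2, s2=2)
for x, y in stream:
    sketch.update(x, y)
exact_answer = exact_chh(stream, 0.5, 0.5)
sketch_answer = sketch.query(0.5, 0.5)
```

On this toy stream both approaches report the same pairs, while the tracker keeps at most `s1` counters for x values and at most `s2` counters per tracked x. The counts it holds are underestimates, so on other inputs its answer can differ from the exact one.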



Acknowledgments

The authors were supported in part by the National Science Foundation through Grants NSF CNS-0834743 and CNS-0831903.

Author information


Corresponding author

Correspondence to Srikanta Tirthapura.

Additional information

Responsible editor: Bart Goethals.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (docx 84 KB)


Cite this article

Lahiri, B., Mukherjee, A.P. & Tirthapura, S. Identifying correlated heavy-hitters in a two-dimensional data stream. Data Min Knowl Disc 30, 797–818 (2016). https://doi.org/10.1007/s10618-015-0438-6


  • DOI: https://doi.org/10.1007/s10618-015-0438-6
