Abstract
We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of the items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
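To make the query form concrete, the following is a minimal sketch of an exact, offline baseline. It is not the small-space streaming algorithm of the paper. It stores every distinct (x, y) pair, so its memory grows with the stream. The function name `exact_chh` and the two thresholds `phi1` and `phi2` are assumptions introduced here for illustration: an x value is treated as a heavy-hitter if it accounts for at least a `phi1` fraction of all tuples, and a y value is reported for x if it accounts for at least a `phi2` fraction of the tuples whose first coordinate is x.

```python
from collections import Counter, defaultdict


def exact_chh(stream, phi1, phi2):
    """Exact baseline for the CHH query (illustrative, not space-efficient).

    stream: iterable of (x, y) tuples.
    phi1:   assumed threshold; x is heavy if count(x) >= phi1 * N.
    phi2:   assumed threshold; y is reported for a heavy x if
            count(x, y) >= phi2 * count(x).
    Returns {x: {y: count(x, y)}} restricted to heavy x and frequent y.
    """
    x_counts = Counter()                 # count(x) over the whole stream
    pair_counts = defaultdict(Counter)   # count(x, y), grouped by x
    n = 0                                # stream length N

    for x, y in stream:
        n += 1
        x_counts[x] += 1
        pair_counts[x][y] += 1

    result = {}
    for x, cx in x_counts.items():
        if cx >= phi1 * n:               # x is a heavy-hitter on the primary dimension
            result[x] = {
                y: c
                for y, c in pair_counts[x].items()
                if c >= phi2 * cx        # y is frequent within the substream for x
            }
    return result


# Toy stream: "a" appears in 5 of 8 tuples, mostly paired with y = 1.
stream = [("a", 1), ("a", 1), ("a", 2), ("b", 3),
          ("a", 1), ("c", 4), ("b", 3), ("a", 2)]
chh = exact_chh(stream, phi1=0.5, phi2=0.5)
print(chh)  # {'a': {1: 3}}
```

A streaming algorithm for this problem has to approximate the same answer without keeping all of these counters, which is where the space-accuracy trade-off mentioned above comes from.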
Acknowledgments
The authors were supported in part by the National Science Foundation through Grants NSF CNS-0834743 and CNS-0831903.
Additional information
Responsible editor: Bart Goethals.
Cite this article
Lahiri, B., Mukherjee, A.P. & Tirthapura, S. Identifying correlated heavy-hitters in a two-dimensional data stream. Data Min Knowl Disc 30, 797–818 (2016). https://doi.org/10.1007/s10618-015-0438-6
DOI: https://doi.org/10.1007/s10618-015-0438-6