Data Mining and Knowledge Discovery, Volume 30, Issue 4, pp 797–818

Identifying correlated heavy-hitters in a two-dimensional data stream

  • Bibudh Lahiri
  • Arko Provo Mukherjee
  • Srikanta Tirthapura (Email author)
Article

Abstract

We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has focused almost exclusively on single-dimensional streams, and it yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as which other items appear frequently along with a heavy-hitter, or what the frequency distribution is of the items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
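
To make the query form above concrete, the following is a minimal sketch of an exact, non-streaming reference computation. It is not the algorithm of the paper: it stores every count, whereas the paper's algorithm is approximate and uses workspace much smaller than the stream. The function name `exact_chh` and the threshold fractions `phi1` and `phi2` are assumptions introduced here for illustration; the abstract does not name them.

```python
from collections import Counter, defaultdict


def exact_chh(stream, phi1, phi2):
    """Exact reference for the CHH query (illustrative, not the paper's method).

    stream: iterable of (x, y) tuples.
    phi1:   hypothetical threshold; x is a heavy-hitter if its count
            exceeds phi1 * (stream length).
    phi2:   hypothetical threshold; y is reported for a heavy x if the
            count of (x, y) exceeds phi2 * (count of x).
    """
    stream = list(stream)
    n = len(stream)

    # Frequency of each x along the primary dimension.
    x_counts = Counter(x for x, _ in stream)

    # Frequency of each y within the substream selected by a given x.
    pair_counts = defaultdict(Counter)
    for x, y in stream:
        pair_counts[x][y] += 1

    result = {}
    for x, fx in x_counts.items():
        if fx > phi1 * n:  # x is a heavy-hitter
            result[x] = {
                y: fxy
                for y, fxy in pair_counts[x].items()
                if fxy > phi2 * fx  # y occurs frequently with x
            }
    return result


stream = [("a", 1), ("a", 1), ("a", 2), ("b", 1), ("a", 1), ("c", 3)]
print(exact_chh(stream, phi1=0.5, phi2=0.5))  # prints {'a': {1: 3}}
```

In this toy stream, "a" is the only x value that appears in more than half of the tuples, and y = 1 is the only value that appears in more than half of the tuples with x = "a". A reference of this kind is mainly useful as ground truth when measuring the error of a small-space approximation.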

Keywords

Data stream mining Correlation Heavy-hitters 

Notes

Acknowledgments

The authors were supported in part by the National Science Foundation through Grants NSF CNS-0834743 and CNS-0831903.

Supplementary material

10618_2015_438_MOESM1_ESM.docx (85 kb)
Supplementary material 1 (docx 84 KB)

Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Bibudh Lahiri (1)
  • Arko Provo Mukherjee (2)
  • Srikanta Tirthapura (2), Email author
  1. Impetus Technologies, Los Gatos, USA
  2. Department of Electrical and Computer Engineering, Iowa State University, Ames, USA