Finding Frequent Items in Data Streams

Charikar, Moses; Chen, Kevin; Farach-Colton, Martin

doi:10.1007/3-540-45465-9_59


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2380)

Included in the following conference series:

International Colloquium on Automata, Languages, and Programming


Abstract

We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this problem has not been previously studied in the literature.
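The abstract does not spell out the data structure, so the following is only a minimal sketch of a count sketch as it is commonly described: several rows of counters, where each row hashes an item to one bucket and to a sign of +1 or -1, and the frequency estimate is the median of the signed counters across rows. The class name `CountSketch`, the parameters `depth` and `width`, and the use of salted BLAKE2b as a stand-in for the hash functions are assumptions made for illustration and are not taken from the paper.

```python
import hashlib
from statistics import median


class CountSketch:
    """Illustrative count sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256):
        # An odd depth keeps the median equal to one of the row estimates.
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, item):
        # Assumption: a salted BLAKE2b digest stands in for the per-row
        # bucket hash and sign hash; the row index is used as the salt.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1          # lowest bit picks the sign
        bucket = (value >> 1) % self.width     # remaining bits pick the bucket
        return bucket, sign

    def add(self, item, count=1):
        # One pass over the stream: each arrival touches one counter per row.
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # Each row gives a noisy estimate; the median damps collisions.
        row_estimates = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, item)
            row_estimates.append(sign * self.table[row][bucket])
        return median(row_estimates)


if __name__ == "__main__":
    sketch = CountSketch(depth=5, width=64)
    for word in ["a", "b", "a", "c", "a"]:
        sketch.add(word)
    print(sketch.estimate("a"))  # approximate count of "a"
```

Because every update is a signed addition, calling `add(item, count=-1)` for the items of a second stream would leave the table summarising frequency differences between the two streams. This is one plausible way to read how the change-detection result mentioned in the abstract could build on the same structure.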

This work was done while the author was at Google Inc.



Author information

Authors and Affiliations

Princeton University, USA
Moses Charikar
UC Berkeley, USA
Kevin Chen
Rutgers University and Google Inc., USA
Martin Farach-Colton


Editor information

Editors and Affiliations

Institute of Theoretical Computer Science, ETH Zentrum, ETH Zürich, 8092, Zürich, Switzerland
Peter Widmayer & Stephan Eidenbenz
Department of Languages and Sciences of the Computation E.T.S. de Ingeniería Informática, University of Málaga, Campus de Teatinos, 29071, Málaga, Spain
Francisco Triguero, Rafael Morales & Ricardo Conejo
School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, BN1 9QN, UK
Matthew Hennessy


About this paper

Cite this paper

Charikar, M., Chen, K., Farach-Colton, M. (2002). Finding Frequent Items in Data Streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds) Automata, Languages and Programming. ICALP 2002. Lecture Notes in Computer Science, vol 2380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45465-9_59


DOI: https://doi.org/10.1007/3-540-45465-9_59
Published: 25 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43864-9
Online ISBN: 978-3-540-45465-6
eBook Packages: Springer Book Archive
