
Efficient Computation of Frequent and Top-k Elements in Data Streams

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3363)

Abstract

We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′ ≈ k, which are guaranteed to be the top-k′ elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.

This work was supported in part by NSF under grants EIA 00-80134, NSF 02-09112, and CNF 04-23336.
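
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of one way a single fixed-size summary can answer both top-k and frequent-element queries. It assumes a counter-replacement scheme in the style usually called Space-Saving; the class name CounterSummary, its method names, and the capacity parameter are hypothetical and are not taken from the paper.

```python
from typing import Dict, Hashable, List, Tuple


class CounterSummary:
    """Fixed-size counter summary (hypothetical name, assumed scheme)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}  # estimated count per monitored item
        self.errors: Dict[Hashable, int] = {}  # maximum overestimation per item
        self.n = 0                             # stream length seen so far

    def offer(self, item: Hashable) -> None:
        """Process one stream element."""
        self.n += 1
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest estimate; the newcomer
            # inherits that estimate as its possible overcount.
            # A linear scan keeps the sketch short; a real implementation
            # would use a structure with cheaper minimum lookup.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top_k(self, k: int) -> List[Tuple[Hashable, int]]:
        """Return the k monitored items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]

    def frequent(self, phi: float) -> List[Tuple[Hashable, int, bool]]:
        """Return (item, estimate, guaranteed) for estimates above phi * n.

        guaranteed is True when even the lower bound (estimate - error)
        exceeds the threshold.
        """
        threshold = phi * self.n
        return [
            (item, count, count - self.errors[item] > threshold)
            for item, count in self.counts.items()
            if count > threshold
        ]


summary = CounterSummary(capacity=3)
for element in "aaaaabbbcd":
    summary.offer(element)
print(summary.top_k(2))
print(summary.frequent(0.25))
```

In this sketch each estimate can only overcount, and by at most the recorded error, which is what allows a query to flag some answers as guaranteed and others as merely possible.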




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Metwally, A., Agrawal, D., El Abbadi, A. (2004). Efficient Computation of Frequent and Top-k Elements in Data Streams. In: Eiter, T., Libkin, L. (eds) Database Theory - ICDT 2005. ICDT 2005. Lecture Notes in Computer Science, vol 3363. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30570-5_27


  • DOI: https://doi.org/10.1007/978-3-540-30570-5_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24288-8

  • Online ISBN: 978-3-540-30570-5

  • eBook Packages: Computer Science, Computer Science (R0)
