Efficient Computation of Frequent and Top-k Elements in Data Streams

  • Ahmed Metwally
  • Divyakant Agrawal
  • Amr El Abbadi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3363)

Abstract

We propose an integrated approach to two problems: finding the k most popular elements, and finding the frequent elements, in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large-alphabet case, our solution is space-efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′ ≈ k, which are guaranteed to be the top-k′ elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.
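The abstract does not spell out the algorithm, so the following is only a minimal sketch of one counter-based summary that behaves as described: it is exact when the alphabet fits in the available counters, and otherwise keeps an overestimate plus an error bound per monitored element. The class name `CounterSummary`, its methods, and the `capacity` parameter are hypothetical names chosen for illustration. The replacement policy (evict the minimum counter and inherit its count as the error) is an assumption and should not be read as the paper's exact method.

```python
class CounterSummary:
    """Bounded-size frequency summary (hypothetical illustration)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # maximum number of monitored elements
        self.counts = {}          # element -> estimated count (never an underestimate)
        self.errors = {}          # element -> maximum possible overestimation
        self.n = 0                # number of stream elements seen so far

    def add(self, item):
        self.n += 1
        if item in self.counts:
            # Already monitored: the count grows by exactly one.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free counter available: start an exact count.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Assumed policy: evict the element with the smallest count and
            # let the newcomer inherit that count as its error bound.
            # A linear scan keeps the sketch short; it costs O(capacity).
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top_k(self, k):
        """Return up to k (element, estimated count) pairs, largest first."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]

    def frequent(self, phi):
        """Return {element: (estimate, guaranteed)} for estimates above phi * n.

        `guaranteed` is True when even the lower bound (estimate - error)
        exceeds the threshold.
        """
        threshold = phi * self.n
        return {
            e: (c, c - self.errors[e] > threshold)
            for e, c in self.counts.items()
            if c > threshold
        }


if __name__ == "__main__":
    summary = CounterSummary(capacity=3)
    for symbol in "aaaabbbc":
        summary.add(symbol)
    print(summary.top_k(2))
    print(summary.frequent(0.3))
```

With three counters and a three-symbol stream, every count in this sketch is exact, which mirrors the small-alphabet case in the abstract. With fewer counters than distinct symbols, the `errors` map records how far each estimate may overshoot, so a caller can separate guaranteed answers from candidates.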

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Ahmed Metwally (1)
  • Divyakant Agrawal (1)
  • Amr El Abbadi (1)
  1. Department of Computer Science, University of California, Santa Barbara
