Efficient Computation of Frequent and Top-k Elements in Data Streams

Metwally, Ahmed; Agrawal, Divyakant; El Abbadi, Amr

doi:10.1007/978-3-540-30570-5_27

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3363)

Abstract

We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′ ≈ k, which are guaranteed to be the top-k′ elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.

This work was supported in part by NSF under grants EIA 00-80134, NSF 02-09112, and CNF 04-23336.
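
As a rough illustration of the two queries named in the abstract (top-k and frequent elements over a stream), the sketch below keeps a fixed number of counters. When an unmonitored element arrives and every counter is in use, it evicts the element with the smallest count and lets the newcomer inherit that count plus one. This is a minimal sketch of a counter-based summary in the style generally known as Space-Saving; the abstract itself does not spell out the algorithm, so the class name `StreamSummary`, the `capacity` parameter, the `phi` threshold, and the linear-scan eviction are all assumptions made here for illustration, not the authors' implementation.

```python
from typing import Dict, Hashable, List, Tuple


class StreamSummary:
    """Fixed-size counter summary (hypothetical name, illustration only)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}  # estimated count per monitored element
        self.errors: Dict[Hashable, int] = {}  # largest possible overestimate per element
        self.total = 0                         # number of stream elements seen so far

    def offer(self, item: Hashable) -> None:
        """Process one stream element."""
        self.total += 1
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the element with the smallest estimated count; the
            # newcomer inherits that count, recorded as its possible error.
            # (A linear scan is used here for brevity, an assumption of this sketch.)
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top_k(self, k: int) -> List[Tuple[Hashable, int]]:
        """Return up to k monitored elements, highest estimated count first."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
        return ranked[:k]

    def frequent(self, phi: float) -> Dict[Hashable, Tuple[int, bool]]:
        """Return elements whose estimate exceeds phi * total.

        Each value is (estimated count, guaranteed), where guaranteed is True
        when even the lower bound (estimate minus error) exceeds the threshold.
        """
        threshold = phi * self.total
        return {
            item: (count, count - self.errors[item] > threshold)
            for item, count in self.counts.items()
            if count > threshold
        }


if __name__ == "__main__":
    summary = StreamSummary(capacity=3)
    for symbol in "abacabadab":
        summary.offer(symbol)
    print(summary.top_k(2))        # [('a', 5), ('b', 3)]
    print(summary.frequent(0.25))  # {'a': (5, True), 'b': (3, True)}
```

In this sketch the estimated counts always sum to the stream length, and each estimate can exceed the true count by at most the stored error, which is what lets a query separate guaranteed answers from merely possible ones.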


Author information

Authors and Affiliations

Department of Computer Science, University of California, Santa Barbara
Ahmed Metwally, Divyakant Agrawal & Amr El Abbadi

Editor information

Editors and Affiliations

Institut für Informationssysteme, Technische Universität Wien, Favoritenstraße 9-11, A-1040, Vienna, Austria
Thomas Eiter
School of Informatics, University of Edinburgh
Leonid Libkin

Cite this paper

Metwally, A., Agrawal, D., El Abbadi, A. (2004). Efficient Computation of Frequent and Top-k Elements in Data Streams. In: Eiter, T., Libkin, L. (eds) Database Theory - ICDT 2005. ICDT 2005. Lecture Notes in Computer Science, vol 3363. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30570-5_27

DOI: https://doi.org/10.1007/978-3-540-30570-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24288-8
Online ISBN: 978-3-540-30570-5
eBook Packages: Computer Science, Computer Science (R0)
