Abstract
We propose an integrated approach to two problems: finding the k most popular elements, and finding the frequent elements, in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large-alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k′ elements, where k′ ≈ k, which are guaranteed to be the top-k′ elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.
This work was supported in part by NSF under grants EIA 00-80134, NSF 02-09112, and CNF 04-23336.
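The abstract does not spell out the mechanism, so the following is only a minimal sketch of one counter-based stream summary that fits its description: at most m counters are kept, counts are exact while the alphabet fits in m, and otherwise each reported count carries an explicit error bound. This style of summary is commonly called Space-Saving in later literature. The names space_saving and top_k, the dictionary layout, and the linear scan for the minimum counter are illustrative assumptions, not the authors' data structure.

```python
def space_saving(stream, m):
    """Summarise a stream with at most m counters.

    Returns {element: [count, error]}, where count may overestimate the
    true frequency by at most error (hypothetical layout for illustration).
    """
    counters = {}
    for x in stream:
        if x in counters:
            # Monitored element: its counter is simply incremented.
            counters[x][0] += 1
        elif len(counters) < m:
            # Free slot: start an exact counter with zero error.
            counters[x] = [1, 0]
        else:
            # No free slot: replace the element with the smallest count.
            # The newcomer inherits that count as its possible overestimate.
            victim = min(counters, key=lambda e: counters[e][0])
            min_count = counters.pop(victim)[0]
            counters[x] = [min_count + 1, min_count]
    return counters


def top_k(counters, k):
    """Return the k monitored elements with the largest estimated counts."""
    ranked = sorted(counters.items(), key=lambda kv: -kv[1][0])
    return [(elem, count, err) for elem, (count, err) in ranked[:k]]


if __name__ == "__main__":
    summary = space_saving("abracadabra", m=3)
    for elem, count, err in top_k(summary, 2):
        # The true frequency lies between count - err and count.
        print(elem, count, err)
```

For every monitored element, the true frequency lies between count - error and count, and the counts always sum to the stream length. A production version would replace the linear scan for the minimum with a structure that finds it in constant time.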
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Metwally, A., Agrawal, D., El Abbadi, A. (2004). Efficient Computation of Frequent and Top-k Elements in Data Streams. In: Eiter, T., Libkin, L. (eds) Database Theory - ICDT 2005. ICDT 2005. Lecture Notes in Computer Science, vol 3363. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30570-5_27
DOI: https://doi.org/10.1007/978-3-540-30570-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24288-8
Online ISBN: 978-3-540-30570-5
eBook Packages: Computer Science (R0)