LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams

Martínez, Conrado; Solera-Pardo, Gonzalo

doi:10.1007/978-3-031-22105-7_3

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13595))

Included in the following conference series:

International Computing and Combinatorics Conference

750 Accesses
1 Citations

Abstract

We propose a new randomized count-based algorithm to solve the Heavy Hitters and Top-k problems in data streams. This algorithm, called LotterySampling, uses the intuitive concept of “lottery tickets”, to decide which elements to sample. We prove that LotterySampling is inside the \((\delta ,\epsilon )\)-deficient framework for the Heavy Hitters problem and that it has a similar performance to the well known StickySampling algorithm, although they are very different in nature. More importantly, we define a similar \((\delta ,\epsilon )\)-deficient framework for the harder Top-k problem and we prove that LotterySampling is inside it. Hence, LotterySampling can be used, without any previous assumption of the data distribution, as a probabilistic approximation scheme to find the k most frequent elements and to approximate the frequencies of the reported elements by a factor of \(1 - \epsilon \). To the best of our knowledge, this is the first algorithm that gives theoretical guarantees for the Top-k problem for unknown and arbitrary streams, which is the most important contribution of this paper. Its memory usage is adaptive to the distribution of the stream, and it will increase or decrease depending on whether the stream becomes less or more skewed. More precisely, the sample size depends, at any given moment, on the unknown \(k^{\text {th}}\) highest frequency but it is independent of the length of the stream and of the number of distinct elements. The user just needs to provide two parameters that determine the quality of the answers, independently of the stream. We compare LotterySampling with other existing probabilistic and deterministic algorithms showing its strengths and weaknesses.

This work has been supported by funds from the MOTION Project (Project PID2020-112581GB-C21) of the Spanish Ministry of Science & Innovation MCIN/AEI/10.13039/501100011033, while the second author was a postgraduate student at Universitat Politècnica de Catalunya.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 69.99; Price excludes VAT (USA)

Softcover Book: USD 89.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Efficient Sampling of Non-strict Turnstile Data Streams

An Effective Single-Pass Approach for Estimating the Φ-quantile in Data Streams

Consistent Subset Sampling

References

Bandi, N., Metwally, A., Agrawal, D., El Abbadi, A.: Fast data stream algorithms using associative memories. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 2007), pp. 247–256 (2007)
Google Scholar
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theoret. Comput. Sci. 312(1), 3–15 (2004)
Article MathSciNet MATH Google Scholar
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Article Google Scholar
Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 1998), pp. 331–342 (1998)
Google Scholar
Lam, H., Calders, T.: Mining top-k frequent items in a data stream with flexible sliding windows. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), pp. 283–292 (2010)
Google Scholar
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pp. 346–357. Morgan Kaufmann (2002)
Google Scholar
Martínez, C., Solera-Pardo, G.: LotterySampling: a randomized algorithm for the heavy hitters and top-\(k\) problems in data streams (2022). https://www.researchgate.net/publication/362910274_LotterySampling_A_Randomized_Algorithm_for_the_Heavy_Hitters_and_Top-k_Problems_in_Data_Streams. Full version of the conference paper. Available on-line in ResearchGate
Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 398–412. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30570-5_27
Chapter Google Scholar
Muthukrishnan, S.: Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2), 117–236 (2005)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Conrado Martínez & Gonzalo Solera-Pardo
Google Research Labs, Zürich, Switzerland
Gonzalo Solera-Pardo

Authors

Conrado Martínez
View author publications
You can also search for this author in PubMed Google Scholar
Gonzalo Solera-Pardo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Conrado Martínez .

Editor information

Editors and Affiliations

Shenzhen Institute of Advanced Technology, Shenzhen, China
Yong Zhang
Harbin Institute of Technology, Harbin, China
Dongjing Miao
Technical University of Berlin, Berlin, Germany
Rolf Möhring

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Martínez, C., Solera-Pardo, G. (2022). LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams. In: Zhang, Y., Miao, D., Möhring, R. (eds) Computing and Combinatorics. COCOON 2022. Lecture Notes in Computer Science, vol 13595. Springer, Cham. https://doi.org/10.1007/978-3-031-22105-7_3

Download citation

DOI: https://doi.org/10.1007/978-3-031-22105-7_3
Published: 01 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22104-0
Online ISBN: 978-3-031-22105-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams

Abstract

Access this chapter

Similar content being viewed by others

Efficient Sampling of Non-strict Turnstile Data Streams

An Effective Single-Pass Approach for Estimating the Φ-quantile in Data Streams

Consistent Subset Sampling

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams

Abstract

Access this chapter

Similar content being viewed by others

Efficient Sampling of Non-strict Turnstile Data Streams

An Effective Single-Pass Approach for Estimating the Φ-quantile in Data Streams

Consistent Subset Sampling

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation