Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Count-Min Sketch

  • Graham Cormode
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_87

Synonyms

CM Sketch

Definition

The Count-Min (CM) Sketch is a compact summary data structure that represents a high-dimensional vector and answers queries on it, in particular point queries and dot product queries, with strong accuracy guarantees. Such queries are at the core of many computations, so the structure can also be used to answer a variety of other queries, such as frequent items (heavy hitters), quantile finding, and join size estimation. Because the data structure easily processes updates in the form of additions to or subtractions from dimensions of the vector (which may correspond to insertions, deletions, or other transactions), it can work over streams of updates at high rates.
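
To make "strong accuracy guarantees" concrete, the point-query guarantee is commonly stated as follows. This is a sketch under assumptions not spelled out in this entry: the vector a has non-negative entries, \hat{a}_i denotes the estimate returned for entry a_i, and the accuracy parameter \varepsilon and failure probability \delta are symbols introduced here for illustration.

```latex
\[
  a_i \;\le\; \hat{a}_i
  \qquad\text{and}\qquad
  \Pr\!\left[\, \hat{a}_i > a_i + \varepsilon \,\lVert a \rVert_1 \,\right] \;\le\; \delta ,
\]
\[
  \text{using } O\!\left(\tfrac{1}{\varepsilon}\,\log\tfrac{1}{\delta}\right) \text{ counters.}
\]
```

In words, the estimate never undercounts, and it overcounts by more than an \varepsilon fraction of the total weight of the vector only with probability at most \delta.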

The data structure maintains the linear projection of the vector with a number of other random vectors. These vectors are defined implicitly by simple hash functions. Increasing the range of the hash functions increases the accuracy of the summary, and...
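
The description so far can be illustrated with a minimal Python sketch. It is not the reference implementation: the class name `CountMinSketch`, the parameters `width`, `depth`, and `seed`, and the particular hash family are assumptions made for this example. Items are assumed to be non-negative integers smaller than 2^61 - 1, and the minimum-based estimators assume that all vector entries stay non-negative.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters each."""

    # Mersenne prime used by the hash family; items are assumed to be smaller.
    _PRIME = (1 << 61) - 1

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One hash h(x) = ((a*x + b) mod p) mod width per row.
        self.coeffs = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.coeffs[row]
        return ((a * item + b) % self._PRIME) % self.width

    def update(self, item, delta=1):
        """Add `delta` (possibly negative) to dimension `item` of the vector."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += delta

    def point_query(self, item):
        """Estimate entry `item`; never an underestimate for non-negative vectors."""
        return min(
            self.table[row][self._bucket(row, item)] for row in range(self.depth)
        )

    def dot_product(self, other):
        """Estimate the dot product of two vectors sketched with identical hashes."""
        if self.coeffs != other.coeffs or self.width != other.width:
            raise ValueError("sketches must share width, depth, and hash functions")
        return min(
            sum(x * y for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(self.table, other.table)
        )
```

A brief usage example: two sketches built with the same `seed` share hash functions, so they can be compared with `dot_product`. A larger `width` means a larger hash range and fewer collisions per counter.

```python
s = CountMinSketch(width=64, depth=4, seed=1)
s.update(7, 5)        # five insertions of item 7
s.update(7, -2)       # two deletions of item 7
estimate = s.point_query(7)   # at least 3, the true count
```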



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Computer Science, University of Warwick, Warwick, UK

Section editors and affiliations

  • Divesh Srivastava (1)
  1. AT&T Labs-Research, Bedminster, USA