Part of the book series: Unsupervised and Semi-Supervised Learning ((UNSESUL))

Abstract

In optimization or machine learning problems, we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems the input is a set of points in a metric space, and a common goal is to compute a set of centers in some other space (points, lines) that minimizes the sum of distances from the input points to their nearest centers. In database queries, we may need to compute such a sum for a specific query set of k centers.
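The query described above can be sketched in a few lines. This is a minimal illustration, not from the chapter itself; the point set and query centers are made-up examples.

```python
import math

def query_cost(points, centers):
    """Sum of distances from each point to its nearest center:
    the (k-median style) objective for a fixed query set of centers."""
    return sum(
        min(math.dist(p, c) for c in centers)
        for p in points
    )

# Two tight clusters on a line, queried with k = 2 centers.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
print(query_cost(points, centers))  # each point is 0.5 from a center -> 2.0
```

Answering many such queries, or minimizing this cost over all candidate center sets, is where the full input becomes the bottleneck that coresets address.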

However, traditional algorithms cannot handle modern systems that require parallel, real-time computation on infinite distributed streams from sensors such as GPS, audio, or video that arrive at a cloud, or on networks of weaker devices such as smartphones or robots.

A core-set is a “small data” summarization of the input “big data,” where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance for streaming, distributed, and dynamic data. Traditional algorithms can then be applied to these coresets to maintain approximately optimal solutions.
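The generic streaming technique alluded to above is the merge-and-reduce tree: coresets of input blocks are merged pairwise and re-compressed, so only logarithmically many small summaries are kept in memory. The sketch below is a toy version under a strong simplifying assumption: the “reduce” step is plain uniform re-sampling with weight rescaling, whereas provable constructions use importance (sensitivity) sampling or deterministic compression.

```python
import random

def reduce_(points_weights, m):
    """Toy 'reduce': shrink a weighted set to m points by uniform
    sampling, rescaling weights so the total weight is preserved.
    (Provable coresets replace this with sensitivity sampling.)"""
    total = sum(w for _, w in points_weights)
    sample = random.sample(points_weights, m)
    sample_total = sum(w for _, w in sample)
    return [(p, w * total / sample_total) for p, w in sample]

def stream_coreset(stream, m):
    """Merge-and-reduce tree: bucket i holds a summary of 2^i input
    blocks; two equal-level summaries are merged and re-reduced."""
    buckets = []  # buckets[i] is None or a level-i summary
    block = []
    for p in stream:
        block.append((p, 1.0))
        if len(block) == m:
            cur, level = reduce_(block, m // 2), 0
            block = []
            # Carry-propagation, as in binary counting.
            while level < len(buckets) and buckets[level] is not None:
                cur = reduce_(buckets[level] + cur, m // 2)
                buckets[level] = None
                level += 1
            if level == len(buckets):
                buckets.append(cur)
            else:
                buckets[level] = cur
    leftovers = [c for c in buckets if c] + ([block] if block else [])
    return [pw for c in leftovers for pw in c]

random.seed(1)
stream = [(random.random(),) for _ in range(4096)]
coreset = stream_coreset(stream, m=64)
# Memory stays O(m log n) while total weight (~4096) is preserved.
print(len(coreset), sum(w for _, w in coreset))
```

Because merge and reduce each only roughly preserve query answers, the approximation error grows with the height of the tree; controlling this growth is part of the provable constructions surveyed in the chapter.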

The challenge is to design coresets with a provable trade-off between their size and approximation error. This survey summarizes such constructions in a retrospective way that aims to unify and simplify the state of the art.
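The size-versus-error trade-off can be seen empirically even with the weakest possible construction. The sketch below, a made-up example not taken from the chapter, reweights a uniform sample by n/m and compares the weighted query cost against the exact one; the error shrinks roughly like 1/√m. Uniform sampling only behaves this well on benign inputs (a single outlier can break it), which is exactly why provable coresets use sensitivity-based sampling instead.

```python
import random

def cost(points, weights, center):
    """Weighted sum of distances to a single query center (1-D)."""
    return sum(w * abs(p - center) for p, w in zip(points, weights))

random.seed(2)
n = 20_000
points = [random.gauss(0.0, 1.0) for _ in range(n)]
exact = cost(points, [1.0] * n, 0.5)

# Larger samples -> smaller error, at the price of a larger summary.
for m in (50, 200, 800, 3200):
    sample = random.sample(points, m)
    approx = cost(sample, [n / m] * m, 0.5)  # each point weighted n/m
    print(m, abs(exact - approx) / exact)
```

A provable coreset construction makes this curve a guarantee: for a target error ε, it bounds the summary size needed so that *every* query (not just one center) is answered within a 1 ± ε factor.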

Figures by Ibrahim Jubran.





Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Feldman, D. (2020). Core-Sets: Updated Survey. In: Ros, F., Guillaume, S. (eds) Sampling Techniques for Supervised or Unsupervised Tasks. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-29349-9_2

  • DOI: https://doi.org/10.1007/978-3-030-29349-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29348-2

  • Online ISBN: 978-3-030-29349-9
