Abstract
In optimization and machine learning problems, we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (e.g., points or lines) that minimizes the sum of distances to the input points. In database queries, we may need to compute such a sum for a specific query set of k centers.
However, traditional algorithms cannot handle modern systems that require parallel, real-time computation over infinite distributed streams from sensors such as GPS, audio, or video, which arrive at a cloud or at networks of weaker devices such as smartphones or robots.
A core-set (coreset) is a “small data” summarization of the input “big data,” where every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance over streaming, distributed, and dynamic data. Traditional algorithms can then be applied to these coresets to maintain approximately optimal solutions.
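The query-preservation property above can be made concrete with a toy sketch that is not from the chapter itself: for the k-means cost query cost(P, C) = Σ_{p∈P} min_{c∈C} dist(p, c)², a uniformly sampled, reweighted subset of a 1-dimensional point set answers the query approximately. All names, sizes, and the choice of uniform sampling are illustrative assumptions.

```python
import random

# Illustrative sketch (assumption, not the chapter's construction):
# a weighted sample as a coreset for the k-means cost query on 1-d points.

def cost(points, centers, weights=None):
    """Weighted sum of squared distances from each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min((p - c) ** 2 for c in centers)
               for p, w in zip(points, weights))

def uniform_coreset(points, size, seed=0):
    """Uniform sample of `size` points, each weighted n/size, so the
    expected coreset cost equals the cost on the full data."""
    rng = random.Random(seed)
    sample = rng.sample(points, size)
    return sample, [len(points) / size] * size

points = [float(i) for i in range(1000)]
core, weights = uniform_coreset(points, 100)
query = [250.0, 750.0]          # a query set of k = 2 centers
full = cost(points, query)      # answer on the "big data"
approx = cost(core, query, weights)  # answer on the "small data"
```

Uniform sampling works on this well-spread input but fails when a few far-away points dominate the cost; non-uniform (sensitivity-based) sampling schemes, which the survey covers, fix this by sampling each point with probability proportional to its worst-case contribution.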
The challenge is to design coresets with provable trade-offs between their size and approximation error. This survey summarizes such constructions in a retrospective way that aims to unify and simplify the state of the art.
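The streaming maintenance mentioned above is typically done with a merge-and-reduce tree in the style of Bentley and Saxe: buffer incoming points, compress full buffers into coresets, and repeatedly merge same-sized coresets and compress again, so that only O(log n) coresets are held at any time. The sketch below uses a placeholder reduction (a reweighted subsample that preserves total weight); in practice any offline coreset construction with the desired guarantee is plugged in. All class and parameter names are illustrative.

```python
import random

def reduce_to_coreset(weighted_points, size, rng):
    """Placeholder reduction (assumption): a weighted subsample rescaled
    so that the total weight of the input is preserved exactly."""
    if len(weighted_points) <= size:
        return list(weighted_points)
    total = sum(w for _, w in weighted_points)
    sample = rng.sample(weighted_points, size)
    scale = total / sum(w for _, w in sample)
    return [(p, w * scale) for p, w in sample]

class StreamingCoreset:
    """Maintains a coreset of a stream via a binary merge-and-reduce tree."""
    def __init__(self, leaf_size=64, seed=0):
        self.leaf_size = leaf_size
        self.rng = random.Random(seed)
        self.buffer = []   # raw points not yet compressed
        self.levels = []   # levels[i] holds at most one coreset of 2^i leaves

    def add(self, point):
        self.buffer.append((point, 1.0))
        if len(self.buffer) == self.leaf_size:
            carry = reduce_to_coreset(self.buffer, self.leaf_size, self.rng)
            self.buffer = []
            i = 0
            # Like binary addition: merge equal-level coresets and carry up.
            while i < len(self.levels) and self.levels[i] is not None:
                carry = reduce_to_coreset(self.levels[i] + carry,
                                          self.leaf_size, self.rng)
                self.levels[i] = None
                i += 1
            if i == len(self.levels):
                self.levels.append(carry)
            else:
                self.levels[i] = carry

    def coreset(self):
        """Union of the buffer and all level coresets: a coreset of the stream."""
        out = list(self.buffer)
        for level in self.levels:
            if level is not None:
                out.extend(level)
        return out
```

Because each point participates in at most O(log n) reductions, the approximation error compounds only logarithmically, which is why the size–error trade-off of the offline construction is the key quantity the survey tracks.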
Figures by Ibrahim Jubran.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Feldman, D. (2020). Core-Sets: Updated Survey. In: Ros, F., Guillaume, S. (eds) Sampling Techniques for Supervised or Unsupervised Tasks. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-29349-9_2
Print ISBN: 978-3-030-29348-2
Online ISBN: 978-3-030-29349-9