Abstract
A coreset is a compact subset of a dataset such that models trained on the coreset fit nearly as well as models trained on the full dataset. Using coresets, we can scale a large dataset down to a tiny one and thereby reduce the computational cost of a machine learning problem. In recent years, researchers have investigated a variety of techniques for constructing coresets, especially for clustering large datasets. In this paper, we compare four state-of-the-art algorithms: ProTraS by Ros and Guillaume with improvements, Lightweight Coreset by Bachem et al., Adaptive Sampling Coreset by Feldman et al., and a naive Farthest-First-Traversal-based coreset construction. We briefly introduce these four algorithms and compare their performance to identify the benefits and drawbacks of each.
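To make the idea concrete, the lightweight coreset construction of Bachem et al. can be sketched in a few lines: each point is sampled with probability that mixes a uniform term with its squared distance to the dataset mean, and sampled points carry importance weights so that weighted costs remain unbiased. The function name and interface below are illustrative, not taken from the authors' code.

```python
import numpy as np

def lightweight_coreset(X, m, seed=None):
    """Sample an m-point weighted coreset from X (n x d array).

    Sampling distribution: q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum d^2).
    Each sampled point gets weight 1/(m * q(x)) so that weighted sums
    are unbiased estimates of sums over the full dataset.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    mean = X.mean(axis=0)
    sq_dist = ((X - mean) ** 2).sum(axis=1)
    # Mix uniform and distance-proportional sampling, each with mass 1/2.
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

The uniform term guarantees every point has nonzero sampling probability, which is what lets the construction avoid the expensive bicriteria approximation step used by earlier importance-sampling coresets.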
References
Agarwal PK, Procopiuc CM, Varadarajan KR. Approximating extent measures of points. J ACM. 2004;51(4):606–35.
Agarwal PK, Procopiuc CM, Varadarajan KR. Geometric approximation via coresets. Comb Comput Geom. 2005;52:1–30.
Arora S. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J Assoc Comput Mach. 1998;45(5):753–82.
Arthur D, Vassilvitskii S. k-Means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, 2007, pp. 1027–1035.
Bachem O, Lucic M, Krause A. Coresets for nonparametric estimation—the case of DP-means. In: International Conference on Machine Learning (ICML), 2015.
Bachem O, Lucic M, Lattanzi S. One-shot coresets: the case of k-clustering. In: International conference on artificial intelligence and statistics (AISTATS), 2018.
Bachem O, Lucic M, Krause A. Scalable and distributed clustering via lightweight coresets. In: International conference on knowledge discovery and data mining (KDD), 2018.
Bachem O, Lucic M, Krause A. Practical coreset constructions for machine learning. arXiv preprint, 2017.
Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, 2003, pp. 30–39.
Dang TK, Tran KTK. The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur Commun Netw. 2019;2019:1–18.
Frahling G, Sohler C. Coresets in dynamic geometric data streams. In: Proceedings of the thirty-seventh annual ACM symposium on theory of computing (STOC), 2005, pp. 209–217. https://doi.org/10.1145/1060590.1060622
Feldman D, Monemizadeh M, Sohler C. A PTAS for k-means clustering based on weak coresets. In: Symposium on computational geometry (SoCG), ACM, 2007, pp. 11–18.
Feldman D, Monemizadeh M, Sohler C, Woodruff DP. Coresets and sketches for high dimensional subspace approximation problems. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, 2010, pp. 630–649.
Feldman D, Faulkner M, Krause A. Scalable training of mixture models via coresets. In: Advances in neural information processing systems (NIPS), 2011, pp. 2142–2150.
Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, 2013, pp. 1434–1453.
Gonzalez TF. Clustering to minimize the maximum inter-cluster distance. Theor Comput Sci. 1985;38:293–306.
Har-Peled S. Geometric approximation algorithms, vol. 173. Providence: American Mathematical Society; 2011.
Har-Peled S, Kushal A. Smaller coresets for k-median and k-means clustering. In: Symposium on computational geometry (SoCG), ACM, 2005, pp. 126–134.
Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In: Symposium on theory of computing (STOC), ACM, 2004, pp. 291–300.
Hoang NL, Dang TK, Trang LH. A comparative study of the use of coresets for clustering large datasets. In: Future data and security engineering (FDSE), LNCS 11814, 2019, pp. 45–55. https://doi.org/10.1007/978-3-030-35653-8
Lloyd SP. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28:129–37.
Lucic M, Bachem O, Krause A. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: International conference on artificial intelligence and statistics (AISTATS), 2016, pp. 1–9.
Lucic M, Faulkner M, Krause A. Training mixture models at scale via coresets. J Mach Learn. 2017;18:1–25.
Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the 10th annual symposium on computational geometry, 1994, pp. 332–339.
Matousek J. On approximate geometric k-clustering. Discrete Comput Geometry. 2000;24:61–84.
Phan TN, Dang TK. A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce. SN Comput Sci. 2020;1(1). https://doi.org/10.1007/s42979-019-0007-y.
Ros F, Guillaume S. DENDIS: a new density-based sampling for clustering algorithm. Expert Syst Appl. 2016;56:349–59.
Ros F, Guillaume S. DIDES: a fast and effective sampling for clustering algorithm. Knowl Inf Syst. 2017;50:543–68.
Ros F, Guillaume S. ProTraS: a probabilistic traversing sampling algorithm. Expert Syst Appl. 2018;105:65–76.
Rosenkrantz DJ, Stearns RE, Lewis PM II. An analysis of several Heuristics for the traveling salesman problem. SIAM J Comput. 1977;6:563–81.
Trang LH, Ngoan PV, Duc NV. A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. Future Data Secur Eng LNCS. 2018;11251:145–57.
Trang LH, Hoang NL, Dang TK. A farthest first traversal based sampling algorithm for k-clustering. In: 2020 14th international conference on ubiquitous information management and communication (IMCOM), Taichung, Taiwan, 2020, pp. 1–6. https://doi.org/10.1109/IMCOM48794.2020.9001738
de la Vega WF, Karpinski M, Kenyon C, Rabani Y. Approximation schemes for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, 2003, pp. 50–58.
https://cs.joensuu.fi/sipu/datasets. Accessed Jan 2020.
https://github.com/deric/clustering-benchmark. Accessed Jan 2020.
Acknowledgements
This research is funded by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019). We would like to thank the FDSE 2019 organizing committee and the paper reviewers for suggesting corrections and improvements. The audience at our presentation session at the FDSE 2019 conference also offered constructive feedback on the paper.
Ethics declarations
Conflict of Interest
The authors report no conflicts of interest.
Additional information
This article is part of the topical collection “Future Data and Security Engineering 2019” guest edited by Tran Khanh Dang.
Cite this article
Le Hoang, N., Trang, L.H. & Dang, T.K. A Comparative Study of the Some Methods Used in Constructing Coresets for Clustering Large Datasets. SN COMPUT. SCI. 1, 215 (2020). https://doi.org/10.1007/s42979-020-00227-7