
A Comparative Study of Some Methods Used in Constructing Coresets for Clustering Large Datasets

  • Original Research
  • Published in SN Computer Science

Abstract

A coreset is a compact subset of a dataset such that models trained on the coreset fit nearly as well as models trained on the full dataset. Using coresets, we can scale a big dataset down to a tiny one to reduce the computational cost of a machine learning problem. In recent years, data scientists have investigated a variety of techniques and approaches for constructing coresets, especially for the problem of clustering large datasets. In this paper, we compare four state-of-the-art algorithms: ProTraS by Ros and Guillaume with improvements, Lightweight Coreset by Bachem et al., Adaptive Sampling Coreset by Feldman et al., and a naive Farthest-First-Traversal-based coreset construction. We briefly introduce these four algorithms and compare their performance to identify the benefits and drawbacks of each.
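To make the idea concrete, the lightweight-coreset scheme of Bachem et al. can be sketched in a few lines: points are sampled half uniformly and half proportionally to their squared distance from the data mean, and each sampled point receives an importance weight so that weighted costs estimate full-data costs. This is a minimal illustrative sketch, not the paper's implementation; the function name and parameters are our own.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sketch of lightweight-coreset sampling (Bachem et al., KDD 2018).

    X   : (n, d) data matrix
    m   : number of coreset points to sample
    Returns the sampled points and their importance weights.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    mu = X.mean(axis=0)
    sq_dists = ((X - mu) ** 2).sum(axis=1)
    # Sampling distribution: half uniform, half proportional to
    # squared distance from the mean (sums to 1 by construction).
    q = 0.5 / n + 0.5 * sq_dists / sq_dists.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Importance weights make the weighted coreset an unbiased
    # estimator of the full-data clustering cost.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

Running k-means on the returned points with these sample weights then approximates clustering the full dataset at a fraction of the cost.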


References

  1. Agarwal PK, Procopiuc CM, Varadarajan KR. Approximating extent measures of points. J ACM. 2004;51(4):606–35.


  2. Agarwal PK, Procopiuc CM, Varadarajan KR. Geometric approximation via coresets. Comb Comput Geom. 2005;52:1–30.


  3. Arora S. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J Assoc Comput Mach. 1998;45(5):753–82.


  4. Arthur D, Vassilvitskii S. k-Means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, 2007, pp. 1027–1035.

  5. Bachem O, Lucic M, Krause A. Coresets for nonparametric estimation—the case of DP-means. In: International Conference on Machine Learning (ICML), 2015.

  6. Bachem O, Lucic M, Lattanzi S. One-shot coresets: the case of k-clustering. In: International conference on artificial intelligence and statistics (AISTATS), 2018.

  7. Bachem O, Lucic M, Krause A. Scalable and distributed clustering via lightweight coresets. In: International conference on knowledge discovery and data mining (KDD), 2018.

  8. Bachem O, Lucic M, Krause A. Practical coreset constructions for machine learning. arXiv preprint, 2017.

  9. Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, 2003, pp. 30–39.

  10. Dang TK, Tran KTK. The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur Commun Netw. 2019;2019:1–18.


  11. Frahling G, Sohler C. Coresets in dynamic geometric data streams. In: Proceedings of the thirty-seventh annual ACM symposium on theory of computing, pp. 209–217, STOC 2005. https://doi.org/10.1145/1060590.1060622

  12. Feldman D, Monemizadeh M, Sohler C. A PTAS for k-means clustering based on weak coresets. In: Symposium on computational geometry (SoCG), ACM, 2007, pp. 11–18.

  13. Feldman D, Monemizadeh M, Sohler C, Woodruff DP. Coresets and sketches for high dimensional subspace approximation problems. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 630–649, 2010.

  14. Feldman D, Faulkner M, Krause A. Scalable training of mixture models via coresets. In: Advances in neural information processing systems (NIPS), 2011, pp. 2142–2150.

  15. Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 1434–1453, 2013.

  16. Gonzalez TF. Clustering to minimize the maximum inter-cluster distance. Theor Comput Sci. 1985;38:293–306.


  17. Har-Peled S. Geometric approximation algorithms, vol. 173. Providence: American Mathematical Society; 2011.


  18. Har-Peled S, Kushal A. Smaller coresets for k-median and k-means clustering. In: Symposium on computational geometry (SoCG), ACM, pp. 126–134, 2005.

  19. Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In: Symposium on theory of computing (STOC), ACM, pp. 291–300, 2004.

  20. Hoang NL, Dang TK, Trang LH. A comparative study of the use of coresets for clustering large datasets. In: Future data and security engineering, LNCS 11814, pp. 45–55, 2019. https://doi.org/10.1007/978-3-030-35653-8

  21. Lloyd SP. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28:129–37.


  22. Lucic M, Bachem O, Krause A. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: International conference on artificial intelligence and statistics (AISTATS), pp. 1–9, 2016.

  23. Lucic M, Faulkner M, Krause A. Training mixture models at scale via coresets. J Mach Learn. 2017;18:1–25.


  24. Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the 10th annual symposium on computational geometry, pp. 332–339, 1994.

  25. Matousek J. On approximate geometric k-clustering. Discrete Comput Geometry. 2000;24:61–84.


  26. Phan TN, Dang TK. A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce. SN Comput Sci. 2020;1(1). https://doi.org/10.1007/s42979-019-0007-y.

  27. Ros F, Guillaume S. DENDIS: a new density-based sampling for clustering algorithm. Expert Syst Appl. 2016;56:349–59.


  28. Ros F, Guillaume S. DIDES: a fast and effective sampling for clustering algorithm. Knowl Inf Syst. 2017;50:543–68.


  29. Ros F, Guillaume S. ProTraS: a probabilistic traversing sampling algorithm. Expert Syst Appl. 2018;105:65–76.


  30. Rosenkrantz DJ, Stearns RE, Lewis PM II. An analysis of several heuristics for the traveling salesman problem. SIAM J Comput. 1977;6:563–81.


  31. Trang LH, Ngoan PV, Duc NV. A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. Future Data Secur Eng LNCS. 2018;11251:145–57.


  32. Trang LH, Hoang NL, Dang TK. A farthest first traversal based sampling algorithm for k-clustering. In: 2020 14th international conference on ubiquitous information management and communication (IMCOM), Taichung, Taiwan, 2020, pp. 1–6. https://doi.org/10.1109/IMCOM48794.2020.9001738

  33. de la Vega WF, Karpinski M, Kenyon C, Rabani Y. Approximation schemes for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, pp. 50–58, 2003.

  34. https://cs.joensuu.fi/sipu/datasets. Accessed Jan 2020.

  35. https://github.com/deric/clustering-benchmark. Accessed Jan 2020.


Acknowledgements

This research is funded by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019). We would like to thank the FDSE 2019 organizing committee and paper reviewers for suggesting corrections and improvements. The audience at our presentation session at the FDSE 2019 conference also offered constructive feedback on the paper.

Author information

Corresponding author

Correspondence to Tran Khanh Dang.

Ethics declarations

Conflict of Interest

The authors report no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Future Data and Security Engineering 2019” guest edited by Tran Khanh Dang.


About this article


Cite this article

Le Hoang, N., Trang, L.H. & Dang, T.K. A Comparative Study of the Some Methods Used in Constructing Coresets for Clustering Large Datasets. SN COMPUT. SCI. 1, 215 (2020). https://doi.org/10.1007/s42979-020-00227-7

