
A Comparative Study of Some Methods Used in Constructing Coresets for Clustering Large Datasets

  • Original Research
  • Published in SN Computer Science

Abstract

A coreset is a compact subset of a dataset such that models trained on the coreset fit nearly as well as models trained on the full dataset. Using coresets, we can scale a big dataset down to a tiny one to reduce the computational cost of a machine learning problem. In recent years, data scientists have investigated a variety of techniques and approaches for constructing coresets, especially for the problem of clustering large datasets. In this paper, we compare four state-of-the-art algorithms: ProTraS by Ros and Guillaume with improvements, Lightweight Coreset by Bachem et al., Adaptive Sampling Coreset by Feldman et al., and a naive Farthest-First-Traversal-based coreset construction. We briefly introduce these four algorithms and compare their performance to identify the benefits and drawbacks of each.
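To make the idea concrete, the lightweight-coreset scheme of Bachem et al. can be sketched in a few lines: points are sampled half uniformly and half proportionally to their squared distance from the data mean, and each sampled point receives an importance weight so that weighted costs estimate full-data costs. This is a minimal illustrative sketch, not the paper's implementation; the function name and parameters are our own.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sketch of lightweight-coreset sampling (Bachem et al., KDD 2018).

    X   : (n, d) data matrix
    m   : number of coreset points to sample
    Returns the sampled points and their importance weights.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    mu = X.mean(axis=0)
    sq_dists = ((X - mu) ** 2).sum(axis=1)
    # Sampling distribution: half uniform, half proportional to
    # squared distance from the mean (sums to 1 by construction).
    q = 0.5 / n + 0.5 * sq_dists / sq_dists.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Importance weights make the weighted coreset an unbiased
    # estimator of the full-data clustering cost.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

Running k-means on the returned points with these sample weights then approximates clustering the full dataset at a fraction of the cost.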


References

  1. Agarwal PK, Procopiuc CM, Varadarajan KR. Approximating extent measures of points. J ACM. 2004;51(4):606–35.


  2. Agarwal PK, Procopiuc CM, Varadarajan KR. Geometric approximation via coresets. Comb Comput Geom. 2005;52:1–30.


  3. Arora S. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J Assoc Comput Mach. 1998;45(5):753–82.


  4. Arthur D, Vassilvitskii S. k-Means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, 2007, pp. 1027–1035.

  5. Bachem O, Lucic M, Krause A. Coresets for nonparametric estimation—the case of DP-means. In: International Conference on Machine Learning (ICML), 2015.

  6. Bachem O, Lucic M, Lattanzi S. One-shot coresets: the case of k-clustering. In: International conference on artificial intelligence and statistics (AISTATS), 2018.

  7. Bachem O, Lucic M, Krause A. Scalable and distributed clustering via lightweight coresets. In: International conference on knowledge discovery and data mining (KDD), 2018.

  8. Bachem O, Lucic M, Krause A. Practical coreset constructions for machine learning. arXiv preprint, 2017.

  9. Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, 2003, pp. 30–39.

  10. Dang TK, Tran KTK. The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur Commun Netw. 2019;2019:1–18.


  11. Frahling G, Sohler C. Coresets in dynamic geometric data streams. In: Proceedings of the thirty-seventh annual ACM symposium on theory of computing, pp. 209–217, STOC 2005. https://doi.org/10.1145/1060590.1060622

  12. Feldman D, Monemizadeh M, Sohler C. A PTAS for k-means clustering based on weak coresets. In: Symposium on computational geometry (SoCG), ACM, 2007, pp. 11–18.

  13. Feldman D, Monemizadeh M, Sohler C, Woodruff DP. Coresets and sketches for high dimensional subspace approximation problems. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 630–649, 2010.

  14. Feldman D, Faulkner M, Krause A. Scalable training of mixture models via coresets. In: Advances in neural information processing systems (NIPS), 2011, pp. 2142–2150.

  15. Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, pp. 1434–1453, 2013.

  16. Gonzalez TF. Clustering to minimize the maximum inter-cluster distance. Theor Comput Sci. 1985;38:293–306.


  17. Har-Peled S. Geometric approximation algorithms, vol. 173. Providence: American Mathematical Society; 2011.


  18. Har-Peled S, Kushal A. Smaller coresets for k-median and k-means clustering. In: Symposium on computational geometry (SoCG), ACM, pp. 126–134, 2005.

  19. Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In: Symposium on theory of computing (STOC), ACM, pp. 291–300, 2004.

  20. Hoang NL, Dang TK, Trang LH. A comparative study of the use of coresets for clustering large datasets. In: Future data and security engineering, LNCS 11814, pp. 45–55, 2019. https://doi.org/10.1007/978-3-030-35653-8

  21. Lloyd SP. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28:129–37.


  22. Lucic M, Bachem O, Krause A. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: International conference on artificial intelligence and statistics (AISTATS), pp. 1–9, 2016.

  23. Lucic M, Faulkner M, Krause A. Training mixture models at scale via coresets. J Mach Learn. 2017;18:1–25.


  24. Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the 10th annual symposium on computational geometry, pp. 332–339, 1994.

  25. Matousek J. On approximate geometric k-clustering. Discrete Comput Geometry. 2000;24:61–84.


  26. Phan TN, Dang TK. A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce. SN Comput Sci. 2020;1(1). https://doi.org/10.1007/s42979-019-0007-y.

  27. Ros F, Guillaume S. DENDIS: a new density-based sampling for clustering algorithm. Expert Syst Appl. 2016;56:349–59.


  28. Ros F, Guillaume S. DIDES: a fast and effective sampling for clustering algorithm. Knowl Inf Syst. 2017;50:543–68.


  29. Ros F, Guillaume S. ProTraS: a probabilistic traversing sampling algorithm. Expert Syst Appl. 2018;105:65–76.


  30. Rosenkrantz DJ, Stearns RE, Lewis PM II. An analysis of several heuristics for the traveling salesman problem. SIAM J Comput. 1977;6:563–81.


  31. Trang LH, Ngoan PV, Duc NV. A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. Future Data Secur Eng LNCS. 2018;11251:145–57.


  32. Trang LH, Hoang NL, Dang TK. A farthest first traversal based sampling algorithm for k-clustering. In: 2020 14th international conference on ubiquitous information management and communication (IMCOM), Taichung, Taiwan, 2020, pp. 1–6. https://doi.org/10.1109/IMCOM48794.2020.9001738

  33. de la Vega WF, Karpinski M, Kenyon C, Rabani Y. Approximation schemes for clustering problems. In: Proceedings of the 35th annual ACM symposium on theory of computing, pp. 50–58, 2003.

  34. https://cs.joensuu.fi/sipu/datasets. Accessed Jan 2020.

  35. https://github.com/deric/clustering-benchmark. Accessed Jan 2020.


Acknowledgements

This research is funded by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019). We would like to thank the FDSE 2019 organizing committee and paper reviewers for suggesting corrections and improvements. The audience at our presentation session at the FDSE 2019 conference also offered constructive feedback on the paper.

Author information

Corresponding author

Correspondence to Tran Khanh Dang.

Ethics declarations

Conflict of Interest

The authors report no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Future Data and Security Engineering 2019” guest edited by Tran Khanh Dang.


About this article


Cite this article

Le Hoang, N., Trang, L.H. & Dang, T.K. A Comparative Study of the Some Methods Used in Constructing Coresets for Clustering Large Datasets. SN COMPUT. SCI. 1, 215 (2020). https://doi.org/10.1007/s42979-020-00227-7

