An empirical comparison and characterisation of nine popular clustering methods

  • Regular Article
Advances in Data Analysis and Classification

This article has been updated

Abstract

Nine popular clustering methods are applied to 42 real data sets. The aim is to characterise the methods in detail by means of several cluster validation indexes, each measuring an individual aspect of the resulting clusters, such as small within-cluster distances, separation of clusters, or closeness to a Gaussian distribution, as introduced in Hennig (2019). Thirty of the data sets come with a “true” clustering. On these data sets, the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which is unobservable in real clustering problems. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into the properties of clusterings that can be expected from the methods, which is crucial for choosing a method in a real situation where no “true” clustering is given.
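
The abstract does not name the measure used for the similarity to the “true” clustering. As a minimal sketch, assuming the adjusted Rand index of Hubert and Arabie (1985, listed in the references) as that measure, the comparison for a single data set could look roughly as follows. The function name and the toy label vectors are hypothetical and are not taken from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations.

    Assumption: this is an illustrative stand-in for the similarity measure;
    the abstract itself does not say which measure the study uses.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must have the same length")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        # Fewer than two observations: the partitions trivially agree.
        return 1.0

    # Pairs of observations placed together in both partitions, in the
    # first partition only, and in the second partition only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreement expected under random labelling with fixed cluster sizes.
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy data: a "true" clustering and the output of one method.
true_labels = [0, 0, 0, 1, 1, 1]
method_labels = [1, 1, 0, 0, 0, 0]
similarity = adjusted_rand_index(true_labels, method_labels)
```

The index is invariant to how the clusters are numbered, so a method's output can be compared with the “true” clustering without first matching cluster labels.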

Change history

  • 02 June 2023

    The previously missing ESM material has been added to the original article.

References

  • Ackerman M, Ben-David S (2008) Measures of clustering quality: a working set of axioms for clustering. Adv Neural Inf Process Syst NIPS 22:121–128

  • Ackerman M, Ben-David S, Branzei S, Loker D (2012) Weighted clustering. In: Proceedings of the 26th AAAI conference on artificial intelligence, pp 858–863

  • Ackerman M, Ben-David S, Loker D (2010) Towards property-based classification of clustering paradigms. In: Advances in neural information processing systems (NIPS), pp 10–18

  • Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26

  • Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. Stat Comput 30(5):1523–1544

  • Amigo E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12:461–486

  • Anderlucci L, Hennig C (2014) Clustering of categorical data: a comparison of a model-based and a distance-based approach. Commun Stat Theory Methods 43:704–721

  • Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat Comput 22(5):1021–1029

  • Andrews JL, Wickins JR, Boers NM, McNicholas PD (2018) teigen: an R package for model-based clustering and classification via the multivariate t distribution. J Stat Softw 83(7):1–32

  • Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recognit 46(1):243–256

  • Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th annual meeting of the association for computational linguistics and the 17th international conference on computational linguistics (COLING-ACL 98). ACL, Stroudsburg, PA, pp 79–85

  • Boulesteix AL, Hatz M (2017) Benchmarking for clustering methods based on real data: a statistical view. In: Data science: innovative developments in data analysis and clustering. Springer, Berlin, pp 73–82

  • Boulesteix AL (2015) Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11:e1004191

  • Boulesteix AL, Lauer S, Eugster MJA (2013) A plea for neutral comparison studies in computational sciences. PLoS ONE 8:e61562

  • Brusco MJ, Steinley D (2007) A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72:583–600

  • Coretto P, Hennig C (2016) Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. J Am Stat Assoc 111:1648–1659

  • Correa-Morris J (2013) An indication of unification for different clustering approaches. Pattern Recognit 46:2548–2561

  • de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinform 9:497

  • Dimitriadou E, Barth M, Windischberger C, Hornik K, Moser E (2004) A quantitative comparison of functional MRI cluster analysis. Artif Intell Med 31:57–71

  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) KDD 96: proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 226–231

  • Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, New York

  • Fisher L, Van Ness J (1971) Admissible clustering procedures. Biometrika 58:91–104

  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631

  • Halkidi M, Vazirgiannis M, Hennig C (2015) Method-independent indices for cluster validation and estimating the number of clusters. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 595–618

  • Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Appl Stat 28:100–108

  • Hennig C (2020) fpc: flexible procedures for clustering. R package version 2.2-8

  • Hennig C (2015) Clustering strategy and method selection. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 703–730

  • Hennig C (2015) What are the true clusters? Pattern Recognit Lett 64:53–62

  • Hennig C (2018) Some thoughts on simulation studies to compare clustering methods. Arch Data Sci Ser A 5(1):1–21

  • Hennig C (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In: Skiadas CH, Bozeman JR (eds) Data analysis and applications 1: clustering and regression, modeling—estimating, forecasting and data mining. ISTE Ltd., London, pp 1–24

  • Hennig C, Meila M (2015) Cluster analysis: an overview. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 1–19

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(2):193–218

  • Hubert LJ, Schultz J (1976) Quadratic assignment as a general data analysis strategy. Br J Math Stat Psychol 29:190–241

  • Jain AK, Topchy A, Law MHC, Buhmann JM (2004) Landscape of clustering algorithms. In: Proceedings of the 17th international conference on pattern recognition (ICPR04). IEEE Computer Society, Washington, vol 1, pp 260–263

  • Jardine N, Sibson R (1971) Mathematical taxonomy. Wiley, London

  • Javed A, Lee BS, Rizzo DM (2020) A benchmark study on time series clustering. Mach Learn Appl 1:100001

  • Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11(9):1–20

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, New York

  • Kleinberg J (2002) An impossibility theorem for clustering. Adv Neural Inf Process Syst NIPS 15:463–470

  • Kou G, Peng Y, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12

  • Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew t-distributions. Adv Data Anal Classif 7:241–266

  • Liu X, Song W, Wong BY, Zhang T, Yu S, Lin GN, Di X (2019) A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biol 20:297

  • Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2019) cluster: cluster analysis basics and extensions. R package version 2.1.0

  • Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654

  • McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York

  • Meila M (2007) Comparing clusterings—an information based distance. J Multivar Anal 98(5):873–895

  • Meila M (2015) Criteria for comparing clusterings. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 619–635

  • Meila M, Heckerman D (2001) An experimental comparison of model-based clustering methods. Mach Learn 42:9–29

  • Milligan GW (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45:325–342

  • Milligan GW (1981) A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46:187–199

  • Milligan GW (1996) Clustering validation: results and implications for applied analyses. In: Arabie P, Hubert LJ, Soete GD (eds) Clustering and classification. World Scientific, Singapore, pp 341–375

  • Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich T, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (NIPS 2001). NIPS, pp 1–8

  • Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-PLUS. Springer, New York

  • Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa L, Rodrigues FA (2019) Clustering algorithms: a comparative approach. PLoS ONE 14:e0210236

  • Saracli S, Dogan N, Dogan I (2013) Comparison of hierarchical cluster analysis methods by cophenetic correlation. J Inequal Appl 203:89

  • Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317

  • Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423

  • Steinley D, Brusco MJ (2011) Evaluating the performance of model-based clustering: recommendations and cautions. Psychol Methods 16:63–79

  • Van Mechelen I, Boulesteix AL, Dangl R, Dean N, Guyon I, Hennig C, Leisch F, Steinley D (2018) Benchmarking in cluster analysis: a white paper. arXiv:1809.10496 [stat]

  • von Luxburg U, Williamson R, Guyon I (2012) Clustering: science or art? JMLR Workshop Conf Proc 27:65–79

  • Wang K, Ng A, McLachlan G (2018) EMMIXskew: the EM algorithm and skew mixture distribution. R package version 1.0.3

Author information

Corresponding author

Correspondence to Christian Hennig.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 162 KB)

Supplementary file 2 (zip 4055 KB)

About this article

Cite this article

Hennig, C. An empirical comparison and characterisation of nine popular clustering methods. Adv Data Anal Classif 16, 201–229 (2022). https://doi.org/10.1007/s11634-021-00478-z

  • DOI: https://doi.org/10.1007/s11634-021-00478-z
