An empirical comparison and characterisation of nine popular clustering methods

Hennig, Christian

doi:10.1007/s11634-021-00478-z

Regular Article
Published: 09 January 2022

Volume 16, pages 201–229, (2022)
Advances in Data Analysis and Classification

Christian Hennig ORCID: orcid.org/0000-0003-1550-5637¹

This article has been updated

Abstract

Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters, such as small within-cluster distances, separation of clusters, and closeness to a Gaussian distribution, as introduced in Hennig (2019). Thirty of the data sets come with a “true” clustering. On these data sets, the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given “true” clustering.
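
As an illustration of what comparing a clustering with a given “true” clustering can look like, the following minimal sketch computes the adjusted Rand index of Hubert and Arabie (1985), which appears in the reference list. The abstract does not state which similarity measure the study uses, so the choice of this index is an assumption, as are the function name `adjusted_rand_index` and the toy label vectors.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects.

    Illustrative helper (hypothetical name), not taken from the article.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        # No pairs of objects to compare; treat the partitions as identical.
        return 1.0

    # Contingency counts and the cluster sizes of each partition.
    joint = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    # Pairs of objects placed together in both partitions, in A, and in B.
    pairs_joint = sum(comb(c, 2) for c in joint.values())
    pairs_a = sum(comb(c, 2) for c in sizes_a.values())
    pairs_b = sum(comb(c, 2) for c in sizes_b.values())
    total_pairs = comb(n, 2)

    # Expected agreement under random labelling with fixed cluster sizes.
    expected = pairs_a * pairs_b / total_pairs
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (pairs_joint - expected) / (maximum - expected)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    true_labels = [0, 0, 0, 1, 1, 1]
    found_labels = [1, 1, 0, 0, 0, 0]
    print(adjusted_rand_index(true_labels, found_labels))
```

The index equals 1 when the two partitions agree up to a relabelling of the clusters and is close to 0 when the agreement is no better than expected by chance, which makes values comparable across data sets with different numbers of clusters.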

Change history

02 June 2023
The missing ESM material has been updated in the original article.

References

Ackerman M, Ben-David S (2008) Measures of clustering quality: a working set of axioms for clustering. Adv Neural Inf Process Syst NIPS 22:121–128
Ackerman M, Ben-David S, Branzei S, Loker D (2012) Weighted clustering. In: Proceedings of the 26th AAAI conference on artificial intelligence, pp 858–863
Ackerman M, Ben-David S, Loker D (2010) Towards property-based classification of clustering paradigms. In: Advances in neural information processing systems (NIPS), pp 10–18
Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26
Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. Stat Comput 30(5):1523–1544
Amigo E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12:461–486
Anderlucci L, Hennig C (2014) Clustering of categorical data: a comparison of a model-based and a distance-based approach. Commun Stat Theory Methods 43:704–721
Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat Comput 22(5):1021–1029
Andrews JL, Wickins JR, Boers NM, McNicholas PD (2018) teigen: an R package for model-based clustering and classification via the multivariate t distribution. J Stat Softw 83(7):1–32
Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recognit 46(1):243–256
Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th annual meeting of the association for computational linguistics and the 17th international conference on computational linguistics (COLING-ACL 98). ACL, Stroudsburg PA, pp 79–85
Boulesteix AL, Hatz M (2017) Benchmarking for clustering methods based on real data: a statistical view. In: Data science: innovative developments in data analysis and clustering. Springer, Berlin, pp 73–82
Boulesteix AL (2015) Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11:e1004191
Boulesteix AL, Lauer S, Eugster MJA (2013) A plea for neutral comparison studies in computational sciences. PLoS ONE 8:e61562
Brusco MJ, Steinley D (2007) A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72:583–600
Coretto P, Hennig C (2016) Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. J Am Stat Assoc 111:1648–1659
Correa-Morris J (2013) An indication of unification for different clustering approaches. Pattern Recognit 46:2548–2561
de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinform 9:497
Dimitriadou E, Barth M, Windischberger C, Hornik K, Moser E (2004) A quantitative comparison of functional MRI cluster analysis. Artif Intell Med 31:57–71
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) KDD 96: proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 226–231
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, New York
Fisher L, Van Ness J (1971) Admissible clustering procedures. Biometrika 58:91–104
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631
Halkidi M, Vazirgiannis M, Hennig C (2015) Method-independent indices for cluster validation and estimating the number of clusters. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 595–618
Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Appl Stat 28:100–108
Hennig C (2020) fpc: flexible procedures for clustering. R package version 2.2-8
Hennig C (2015) Clustering strategy and method selection. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 703–730
Hennig C (2015) What are the true clusters? Pattern Recognit Lett 64:53–62
Hennig C (2018) Some thoughts on simulation studies to compare clustering methods. Arch Data Sci Ser A 5(1):1–21
Hennig C (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In: Skiadas CH, Bozeman JR (eds) Data analysis and applications 1: clustering and regression, modeling—estimating, forecasting and data mining. ISTE Ltd., London, pp 1–24
Hennig C, Meila M (2015) Cluster analysis: an overview. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 1–19
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(2):193–218
Hubert LJ, Schultz J (1976) Quadratic assignment as a general data analysis strategy. Br J Math Stat Psychol 29:190–241
Jain AK, Topchy A, Law MHC, Buhmann JM (2004) Landscape of clustering algorithms. In: Proceedings of the 17th international conference on pattern recognition (ICPR04). IEEE Computer Society Washington, vol 1, pp 260–263
Jardine N, Sibson R (1971) Mathematical taxonomy. Wiley, London
Javed A, Lee BS, Rizzo DM (2020) A benchmark study on time series clustering. Mach Learn Appl 1:100001
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11(9):1–20
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, New York
Kleinberg J (2002) An impossibility theorem for clustering. Adv Neural Inf Process Syst NIPS 15:463–470
Kou G, Peng Y, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12
Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew t-distributions. Adv Data Anal Classif 7:241–266
Liu X, Song W, Wong BY, Zhang T, Yu S, Lin GN, Di X (2019) A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biol 20:297
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2019) cluster: cluster analysis basics and extensions. R package version 2.1.0
Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
Meila M (2007) Comparing clusterings—an information based distance. J Multivar Anal 98(5):873–895
Meila M (2015) Criteria for comparing clusterings. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 619–635
Meila M, Heckerman D (2001) An experimental comparison of model-based clustering methods. Mach Learn 42:9–29
Milligan GW (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45:325–342
Milligan GW (1981) A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46:187–199
Milligan GW (1996) Clustering validation: results and implications for applied analyses. In: Arabie P, Hubert LJ, Soete GD (eds) Clustering and classification. World Scientific, Singapore, pp 341–375
Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich T, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (NIPS 2001). NIPS, pp 1–8
Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-PLUS. Springer, New York
Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa L, Rodrigues FA (2019) Clustering algorithms: a comparative approach. PLoS ONE 14:e0210236
Saracli S, Dogan N, Dogan I (2013) Comparison of hierarchical cluster analysis methods by cophenetic correlation. J Inequal Appl 203:89
Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423
Steinley D, Brusco MJ (2011) Evaluating the performance of model-based clustering: recommendations and cautions. Psychol Methods 16:63–79
Van Mechelen I, Boulesteix AL, Dangl R, Dean N, Guyon I, Hennig C, Leisch F, Steinley D (2018) Benchmarking in cluster analysis: a white paper. arXiv:1809.10496 [stat]
von Luxburg U, Williamson R, Guyon I (2012) Clustering: science or art? JMLR Workshop Conf Proc 27:65–79
Wang K, Ng A, McLachlan G (2018) EMMIXskew: the EM algorithm and skew mixture distribution. R package version 1.0.3

Author information

Authors and Affiliations

Dipartimento di Scienze Statistiche “Paolo Fortunati”, Università di Bologna, Bologna, Italy
Christian Hennig

Corresponding author

Correspondence to Christian Hennig.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 162 KB)

Supplementary file 2 (zip 4055 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hennig, C. An empirical comparison and characterisation of nine popular clustering methods. Adv Data Anal Classif 16, 201–229 (2022). https://doi.org/10.1007/s11634-021-00478-z

Received: 05 February 2021
Revised: 18 June 2021
Accepted: 16 October 2021
Published: 09 January 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11634-021-00478-z

Keywords

Mathematics Subject Classification

62H30
