The area under the ROC curve as a measure of clustering quality


The area under the receiver operating characteristics (ROC) curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation considered here, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to the effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective.
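As an illustration of the idea summarized above, the following is a minimal sketch, not the authors' implementation. It assumes that every pair of objects is treated as one instance, that the "class" of a pair is whether both objects share a cluster, and that a smaller Euclidean distance counts as a higher score. The function names `aucc` and `gamma_from_aucc` are hypothetical, and the mapping Gamma = 2·AUCC − 1 is the usual relation between AUC and Goodman-Kruskal's gamma when no dissimilarities are tied; the abstract itself only states that the relation is linear.

```python
from itertools import combinations, groupby
from math import dist


def aucc(points, labels):
    """Sketch of AUCC: AUC over object pairs (assumed construction).

    Returns the probability that a within-cluster pair is closer than a
    between-cluster pair, counting tied distances as one half.
    """
    # One (distance, same_cluster) record per pair, sorted by distance.
    pairs = sorted(
        (dist(points[i], points[j]), labels[i] == labels[j])
        for i, j in combinations(range(len(points)), 2)
    )
    n_within = sum(same for _, same in pairs)
    n_between = len(pairs) - n_within
    if n_within == 0 or n_between == 0:
        raise ValueError("need both within- and between-cluster pairs")

    concordant = 0.0
    within_seen = 0  # within-cluster pairs with strictly smaller distance
    for _, group in groupby(pairs, key=lambda p: p[0]):
        flags = [same for _, same in group]
        w = sum(flags)
        b = len(flags) - w
        concordant += b * within_seen + 0.5 * b * w
        within_seen += w
    return concordant / (n_within * n_between)


def gamma_from_aucc(value):
    """Assumed linear map to Gamma (exact only without tied distances)."""
    return 2.0 * value - 1.0


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(aucc(pts, [0, 0, 1, 1]))  # 1.0: clusters match the two groups
    print(aucc(pts, [0, 1, 0, 1]))  # 0.5: clusters cut across the groups
```

Sorting the pairs once and sweeping over them is what keeps this cheaper than comparing every within-cluster pair against every between-cluster pair directly, which is the naive way to count concordant and discordant pairs for Gamma.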


  1. In fact, non-random classifiers can also exhibit such a performance (Flach 2010).

  2. A preliminary version of this result was first described by Jaskowiak (2015). An equivalent result, relating AUC to Goodman and Kruskal’s (1954) rank correlation, was recently rediscovered by Higham and Higham (2019) in an unrelated context involving measures of resolution in meta-cognitive studies.

  3. Assuming that (a) all dissimilarities \(||\cdot ||\) are given in advance (otherwise an additional dissimilarity cost would be required – \(O(n^2d)\) in case of Euclidean distance, where d is the dimension of the data space), and (b) cluster sizes are balanced (all proportional to n/k, possibly differing by a constant factor) (Vendramin et al. 2010).

  4. Apart from the cost to obtain the dissimilarity matrix, \({\mathbf {D}}\), which is also required by Gamma.

  5. Note that \(C_m\) is not necessarily different from \(C_l\); they may or may not be the same cluster in partition \({\mathcal {C}}_k\).

  6. This dataset consists of 9 clusters, with 50 objects each, obtained from normal distributions with variance equal to 4.5, centered at (0, 0), (0, 20), (0, 40), (20, 0), (20, 20), (20, 40), (40, 0), (40, 20), and (40, 40).
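The dataset described in note 6 can be regenerated approximately with the sketch below. It is an assumption-laden reconstruction: the note gives only the variance and the centers, so the code assumes isotropic Gaussians with independent coordinates, and the function name `make_grid_dataset` and the seed are hypothetical. The exact points used in the paper cannot be recovered this way.

```python
import random
from math import sqrt


def make_grid_dataset(seed=0, per_cluster=50, variance=4.5):
    """Draw 9 Gaussian clusters centered on a 3 x 3 grid (assumed isotropic)."""
    rng = random.Random(seed)
    sigma = sqrt(variance)  # random.gauss expects a standard deviation
    centers = [(x, y) for x in (0, 20, 40) for y in (0, 20, 40)]
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(per_cluster):
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            labels.append(label)
    return points, labels


if __name__ == "__main__":
    points, labels = make_grid_dataset()
    print(len(points), len(set(labels)))  # 450 9
```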


  • Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(5):613

  • Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recognit 46(1):243–256

  • Baker FB, Hubert LJ (1975) Measuring the power of hierarchical cluster analysis. J Am Stat Assoc 70(349):31–38

  • Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst, Man Cybern, Part B 28(3):301–315

  • Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159

  • Brock G, Pihur V, Datta S, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25(4):1–22

  • Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27

  • Ceriani L, Verme P (2012) The origins of the Gini index: extracts from Variabilità e Mutabilità (1912) by Corrado Gini. J Econ Inequal 10(3):421–443

  • Charrad M, Ghazzali N, Boiteau V, Niknafs A (2014) NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw 61(6):1–36

  • Davies D, Bouldin D (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227

  • Desgraupes B (2016) clusterCrit: clustering indices. R package version 1.2.7

  • Dunn J (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104

  • Everitt B (1974) Cluster analysis. Heinemann Educational for the Social Science Research Council, London

  • Färber I, Günnemann S, Kriegel H-P, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings, Washington, DC

  • Fawcett T (2004) ROC graphs: notes and practical considerations for researchers. Technical report

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874

  • Flach P, Hernández-Orallo J, Ferri C (2011) A coherent interpretation of AUC as a measure of aggregated classification performance. In: International Conference on Machine Learning — ICML

  • Flach PA (2010) ROC analysis. In: Encyclopedia of machine learning, pp. 869–875. Springer US, Boston, MA

  • Giancarlo R, Lo Bosco G, Pinello L, Utro F (2013) A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinformatics 14(Suppl 1):S6

  • Gini C (1912) Variabilità e mutabilità. Tipogr. di P. Cuppini

  • Goodman L, Kruskal W (1954) Measures of association for cross-classifications. J Am Stat Assoc 49:732–764

  • Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2–3):107–145

  • Halkidi M, Vazirgiannis M (2008) A density-based cluster validity approach using multi-representatives. Pattern Recognit Lett 29:773–786

  • Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186

  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

  • Hennig C (2015) What are the true clusters? Pattern Recognit Lett 64:53–62

  • Hennig C, Meila M, Murtagh F, Rocci R (2015) Handbook of cluster analysis. CRC Press

  • Hernández-Orallo J, Flach P, Ferri C (2013) ROC curves in cost space. Mach Learn 93(1):71–91

  • Higham PA, Higham DP (2019) New improved gamma: enhancing the accuracy of Goodman-Kruskal’s gamma using ROC curves. Behav Res Methods 51(1):108–125

  • Hill RS (1980) A stopping rule for partitioning dendrograms. Botanical Gazette 141:321–324

  • Hruschka ER, Campello RJGB, Castro LN (2004) Improving the efficiency of a clustering genetic algorithm. In: Ibero-American conference on artificial intelligence – IBERAMIA 3315: 861–870

  • Hruschka ER, Campello RJGB, de Castro LN (2006) Evolving clusters in gene-expression data. Inf Sci 176(13):1898–1927

  • Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychol Bull 10:1072–1080

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall

  • Jaskowiak PA (2015) On the evaluation of clustering results: measures, ensembles, and gene expression data analysis. Ph.D. thesis, University of São Paulo, Brazil

  • Jaskowiak PA, Campello RJGB, Costa IG (2012) Evaluating correlation coefficients for clustering gene expression profiles of cancer. In: 7th Brazilian symposium on bioinformatics (BSB2012), Volume 7409 of LNCS, pp. 120–131. Springer, Berlin Heidelberg

  • Jaskowiak PA, Campello RJGB, Costa IG (2014) On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics 15(Suppl 2):S2

  • Jaskowiak PA, Campello RJGB, Costa Filho IG (2013) Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis. IEEE/ACM Trans Comput Biol Bioinf 10(4):845–857

  • Jaskowiak PA, Moulavi D, Furtado ACS, Campello RJGB, Zimek A, Sander J (2016) On strategies for building effective ensembles of relative clustering validity criteria. Knowl Inf Syst 47(2):329–354

  • Kim B, Lee H, Kang P (2018) Integrating cluster validity indices based on data envelopment analysis. Appl Soft Comput 64:94–108

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: 5th Berkeley symposium on mathematical statistics and probability 1:281–297

  • Majnik M, Bosnić Z (2013) ROC analysis of classifiers in machine learning: a survey. Intell Data Anal 17(3):531–558

  • Mason SJ, Graham NE (2002) Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q J Royal Meteorol Soc 128(584):2145–2166

  • Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal and Mach Intell 24(12):1650–1654

  • Milligan GW (1981) A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2):187–199

  • Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

  • Moulavi D, Jaskowiak PA, Campello RJGB, Zimek A, Sander J (2014) Density-based clustering validation. In: Proceedings of the 14th SIAM international conference on data mining (SDM), Philadelphia, PA, pp. 839–847

  • Nguyen T, Viehman J, Yeboah D, Olbricht GR, Obafemi-Ajayi T (2020) Statistical comparative analysis and evaluation of validation indices for clustering optimization. In: 2020 IEEE symposium series on computational intelligence (SSCI), pp. 3081–3090

  • Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recognit 37:487–501

  • Pearson K (1895) Contributions to the mathematical theory of evolution. III. Regression, heredity, and panmixia. Proc Royal Soc London 59:69–71

  • Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of the third international conference on knowledge discovery and data mining, pp. 43–48. AAAI Press

  • Provost FJ, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing induction algorithms. In: Proceedings of the fifteenth international conference on machine learning, ICML ’98, San Francisco, CA, USA, pp. 445–453. Morgan Kaufmann Publishers Inc

  • Ratkowsky DA, Lance GN (1978) A criterion for determining the number of groups in a classification. Aust Comput J 10:115–117

  • Romano S, Vinh NX, Bailey J, Verspoor K (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(1):4635–4666

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Spackman KA (1989) Signal detection theory: Valuable tools for evaluating inductive learning. In: Proceedings of the sixth international workshop on machine learning, San Francisco, CA, USA, pp. 160–163. Morgan Kaufmann Publishers Inc

  • Vendramin L, Campello RJGB, Hruschka ER (2009) On the comparison of relative clustering validation criteria. In: Proceedings of the 9th SIAM international conference on data mining (SDM), Sparks, NV, pp. 733–744

  • Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 3(4):209–235

  • Vendramin L, Jaskowiak PA, Campello RJGB (2013) On the combination of relative clustering validity criteria. In: Proceedings of the 25th International conference on scientific and statistical database management (SSDBM), Baltimore, MD, pp. 4:1–12

  • Xu R, Wunsch D, Wunsch D II (2009) Clustering. IEEE Press

  • Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987

  • Zhou S, Liu F, Song W (2021) Estimating the optimal number of clusters via internal validity index. Neural Process Lett 53(2):1013–1034


This project was partially funded by Brazilian research agencies FAPESP (Process #2011/04247-5) and CNPq (#302161/2017-1). Ivan G. Costa was supported by the Interdisciplinary Center for Clinical Research (IZKF) Faculty of Medicine at the RWTH Aachen.

Author information

Corresponding author

Correspondence to Pablo A. Jaskowiak.

Additional information

Responsible editor: Albrecht Zimmermann and Peggy Cellier.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jaskowiak, P.A., Costa, I.G. & Campello, R.J.G.B. The area under the ROC curve as a measure of clustering quality. Data Min Knowl Disc 36, 1219–1245 (2022).


  • Clustering validation
  • Area under the curve
  • Receiver operating characteristics
  • Area under the curve for clustering
  • Qualitative/visual clustering evaluation