Hierarchical clustering is one of the most widespread analytical approaches to classification problems, mainly because of the visual power of its associated graphical representation, the dendrogram. Nevertheless, choosing an appropriate number of clusters remains the main difficulty for the final user. We introduce DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach), a novel approach that exploits permutation tests to automatically detect a partition among those embedded in a dendrogram. Unlike the traditional approach, DESPOTA also includes in the search space partitions that do not correspond to horizontal cuts of the dendrogram. Applications to both real and synthetic datasets show the effectiveness of our proposal.
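To make the search-space claim concrete, the following minimal sketch enumerates every partition embedded in a small dendrogram and separates those reachable by a horizontal cut from those that are not. It is an illustration only and does not implement DESPOTA or its permutation test: the toy dendrogram, its merge heights, and all names (`Node`, `merge`, `horizontal_cut`, `all_partitions`, `canon`) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A dendrogram node; leaves have height 0 and no children."""
    height: float
    leaves: tuple
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(name):
    return Node(0.0, (name,))


def merge(a, b, height):
    return Node(height, a.leaves + b.leaves, a, b)


def horizontal_cut(node, h):
    """Clusters obtained by cutting the whole dendrogram at height h."""
    if node.left is None or node.height <= h:
        return [node.leaves]
    return horizontal_cut(node.left, h) + horizontal_cut(node.right, h)


def all_partitions(node):
    """Every embedded partition: each node is kept whole or split into its children."""
    if node.left is None:
        return [[node.leaves]]
    parts = [[node.leaves]]
    for lp in all_partitions(node.left):
        for rp in all_partitions(node.right):
            parts.append(lp + rp)
    return parts


def internal_heights(node):
    """Merge heights of all internal nodes."""
    if node.left is None:
        return []
    return internal_heights(node.left) + internal_heights(node.right) + [node.height]


def canon(partition):
    """Order-independent representation, so partitions can be compared as sets."""
    return frozenset(frozenset(cluster) for cluster in partition)


# Hypothetical toy dendrogram on five objects (heights are made up).
ab = merge(leaf("a"), leaf("b"), 1.0)
cd = merge(leaf("c"), leaf("d"), 2.0)
cde = merge(cd, leaf("e"), 3.0)
root = merge(ab, cde, 5.0)

embedded = {canon(p) for p in all_partitions(root)}
horizontal = {canon(horizontal_cut(root, h)) for h in [0.0] + internal_heights(root)}
non_horizontal = embedded - horizontal

print(len(embedded), len(horizontal), len(non_horizontal))  # prints: 7 5 2
```

In this toy case the two partitions missed by horizontal cuts are {a}, {b}, {c, d, e} and {a}, {b}, {c, d}, {e}: both split the low branch (a, b) while keeping a branch that merges at a greater height intact, which no single cut height can do.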
BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model Based Gaussian and Non Gaussian Clustering”, Biometrics, 49, 803–821.
CALINSKI, R.B., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3, 1–27.
CHARRAD, M., GHAZZALI, N., BOITEAU, V., HUBERT, M., and NIKNAFS, A. (2013), An Examination of Indices for Determining the Number of Clusters: NbClust Package, R Package Version 1.3.
DUDA, R.O., and HART, P.E. (1973), Pattern Classification and Scene Analysis, New York: Wiley.
EVERITT, B., LANDAU, M., and LEESE, M. (2001), Cluster Analysis (4th ed.), London: Arnold.
GOOD, P.I. (1994), Permutations Tests for Testing Hypotheses, New York: Springer-Verlag.
GURRUTXAGA, I., ALBISUA, I., ARBELAITZ, O., MARTÍN, J.I., MUGUERZA, J., PÉREZ, J.M., and PERONA, I. (2010), “SEP/COP: An Efficient Method to Find the Best Partition in Hierarchical Clustering Based on a New Cluster Validity Index”, Pattern Recognition, 43(10), 3364–3373.
HOCHBERG, Y. (1988), “A Sharper Bonferroni Procedure for Multiple Tests of Significance”, Biometrika, 75, 800–802.
HOLM, S. (1979), “A Simple Sequentially Rejective Multiple Testing Procedure”, Scandinavian Journal of Statistics, 6, 65–70.
HORTON, P., and NAKAI, K. (1996), “A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins”, Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 4, 109–115.
HUBERT, L.J., and LEVIN, J.R. (1976), “A General Statistical Framework for Assessing Categorical Clustering in Free Recall”, Psychological Bulletin, 83, 1072–1080.
HUBERT, L., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.
JOHNSON, R.A., and WICHERN, D.W. (1982), Applied Multivariate Statistical Analysis, Upper Saddle River, NJ: Prentice Hall.
KIM, M., and RAMAKRISHNA, R.S. (2005), “New Indices for Cluster Validity Assessment”, Pattern Recognition Letters, 26(15), 2353–2363.
KUIPER, K.K., and FISHER, L. (1975), “A Monte Carlo Comparison of Six Clustering Procedures”, Biometrics, 31, 777–783.
LAGO-FERNÁNDEZ, L.F., and CORBACHO, F. (2010), “Normality-Based Validation for Crisp Clustering”, Pattern Recognition, 43, 782–795.
LIU, Y., HAYES, D.N., NOBEL, A., and MARRON, J.S. (2008), “Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data”, Journal of the American Statistical Association, 103(483), 1281–1293.
MAECHLER, M., ROUSSEEUW, P., STRUYF, A., HUBERT, M., and HORNIK, K. (2011), Cluster: Cluster Analysis Basics and Extensions, R Package Version 1.14.1.
MILLIGAN, G.W. (1981), “A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis”, Psychometrika, 46(2), 187–199.
MILLIGAN, G.W., and COOPER, M.C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Dataset”, Psychometrika, 50(2), 159–179.
PARK, P.J., MANJOURIDES, J., BONETTI, M., and PAGANO, M. (2009), “A Permutation Test for Determining Significance of Clusters with Applications to Spatial and Gene Expression Data”, Computational Statistics and Data Analysis, 53(12), 4290–4300.
PESARIN, F., and SALMASO, L. (2010), Permutation Tests for Complex Data. Theory, Applications and Software, Chichester: John Wiley and Sons.
QIU, W.L., and JOE, H. (2006), “Separation Index and Partial Membership for Clustering”, Computational Statistics and Data Analysis, 50, 585–603.
QIU, W.L., and JOE, H. (2006), “Generation of Random Clusters with Specified Degree of Separation”, Journal of Classification, 23(2), 315–334.
QIU, W.L., and JOE, H. (2009), ClusterGeneration: Random Cluster Generation (with Specified Degree of Separation), R Package Version 1.2.7.
R DEVELOPMENT CORE TEAM (2010), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, http://www.R-project.org.
ROMANO, J.P., SHAIKH, A.M., and WOLF, M. (2008), “Formalized Data Snooping Based on Generalized Error Rates”, Econometric Theory, 24, 404–447.
RYOTA, S., and SHIMODAIRA, H. (2011), pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling, R Package Version 1.2-2, http://CRAN.R-project.org/package=pvclust.
SHIMODAIRA, H. (2004), “Approximately Unbiased Tests of Regions Using Multistep-Multiscale Bootstrap Resampling”, Annals of Statistics, 32, 2616–2641.
STEINLEY, D. (2004), “Properties of the Hubert-Arabie Adjusted Rand Index”, Psychological Methods, 9(3), 386–396.
TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic”, Journal of the Royal Statistical Society B, 63(2), 411–423.
WARRENS, M.J. (2008), “On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index”, Journal of Classification, 25, 177–183.
WICKHAM, H. (2009), ggplot2: Elegant Graphics for Data Analysis, New York: Springer.
WISHART, D. (1969), “An Algorithm for Hierarchical Classification”, Biometrics, 25, 165–170.
WU, K.-L., YANG, M.-S., and HSIEH, J.-N. (2009), “Robust Cluster Validity Indexes”, Pattern Recognition, 42(11), 2541–2550.
The authors wish to thank Professor Ibai Gurrutxaga and his colleagues for kindly providing the data and the R code used in their paper: this allowed us to make a worthwhile comparison of the two methods. The authors are also grateful to Professor Jaromir Antoch for helpful comments on a previous draft of the paper and the three anonymous referees for their valuable suggestions which helped us to improve the final version of this paper.
All computation and graphics were done in the R language (R Development Core Team 2010) using the basic packages and the additional cluster (Maechler et al. 2011), ggplot2 (Wickham 2009) and NbClust (Charrad et al. 2013) packages.
About this article
Cite this article
Bruzzese, D., Vistocco, D. DESPOTA: DEndrogram Slicing through a PermutatiOn Test Approach. J Classif 32, 285–304 (2015). https://doi.org/10.1007/s00357-015-9179-x
- Hierarchical clustering
- Cluster detection
- Permutation tests