DESPOTA: DEndrogram Slicing through a PermutatiOn Test Approach

Abstract

Hierarchical clustering is one of the most widely used approaches to classification problems, largely because of the visual power of its graphical representation, the dendrogram. Choosing an appropriate number of clusters nevertheless remains the main difficulty for the end user. We introduce DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach), a novel approach that uses permutation tests to automatically detect a partition among those embedded in a dendrogram. Unlike the traditional approach, DESPOTA also includes in its search space partitions that do not correspond to horizontal cuts of the dendrogram. Applications to both real and synthetic datasets show the effectiveness of our proposal.
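
To make the notion of partitions embedded in a dendrogram concrete, the following minimal Python sketch enumerates them for a toy tree and separates those reachable by a horizontal cut from those that are not. It is an illustration only, not the DESPOTA procedure: it performs no permutation test, and the `Node` class, the helper functions, the five leaf labels and the merge heights are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A dendrogram node (hypothetical structure for this sketch)."""
    height: float                 # merge height; 0.0 for a leaf
    leaves: frozenset             # labels of the observations below this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(name):
    return Node(0.0, frozenset([name]))


def merge(a, b, height):
    return Node(height, a.leaves | b.leaves, a, b)


def embedded_partitions(node):
    """Every partition obtained by keeping a node whole or splitting it recursively."""
    result = [frozenset([node.leaves])]
    if node.left is not None:
        for p in embedded_partitions(node.left):
            for q in embedded_partitions(node.right):
                result.append(p | q)
    return result


def horizontal_cut(node, h):
    """Partition produced by one horizontal cut: nodes merged at height <= h stay whole."""
    if node.left is None or node.height <= h:
        return frozenset([node.leaves])
    return horizontal_cut(node.left, h) | horizontal_cut(node.right, h)


def internal_heights(node):
    if node.left is None:
        return set()
    return {node.height} | internal_heights(node.left) | internal_heights(node.right)


# Toy dendrogram with made-up merge heights.
ab = merge(leaf("a"), leaf("b"), 1.0)
cd = merge(leaf("c"), leaf("d"), 2.0)
cde = merge(cd, leaf("e"), 3.0)
root = merge(ab, cde, 5.0)

all_parts = set(embedded_partitions(root))
# One cut below the first merge plus one cut at each merge height.
horizontal = {horizontal_cut(root, h) for h in {0.0} | internal_heights(root)}
non_horizontal = all_parts - horizontal

for part in sorted(non_horizontal, key=len):
    print(sorted("".join(sorted(cluster)) for cluster in part))
```

On this toy tree the enumeration yields seven embedded partitions, of which five are horizontal cuts; the remaining two split the pair {a, b} while leaving c, d and e wholly or partly merged, which is the kind of partition a single horizontal cut cannot produce.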

References

  1. BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model-Based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.

  2. CALINSKI, T., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3, 1–27.

  3. CHARRAD, M., GHAZZALI, N., BOITEAU, V., HUBERT, M., and NIKNAFS, A. (2013), An Examination of Indices for Determining the Number of Clusters: NbClust Package, R Package Version 1.3.

  4. DUDA, R.O., and HART, P.E. (1973), Pattern Classification and Scene Analysis, New York: Wiley.

  5. EVERITT, B., LANDAU, S., and LEESE, M. (2001), Cluster Analysis (4th ed.), London: Arnold.

  6. GOOD, P.I. (1994), Permutations Tests for Testing Hypotheses, New York: Springer-Verlag.

  7. GURRUTXAGA, I., ALBISUA, I., ARBELAITZ, O., MARTÍN, J.I., MUGUERZA, J., PÉREZ, J.M., and PERONA, I. (2010), “SEP/COP: An Efficient Method to Find the Best Partition in Hierarchical Clustering Based on a New Cluster Validity Index”, Pattern Recognition, 43(10), 3364–3373.

  8. HOCHBERG, Y. (1988), “A Sharper Bonferroni Procedure for Multiple Tests of Significance”, Biometrika, 75, 800–802.

  9. HOLM, S. (1979), “A Simple Sequentially Rejective Multiple Testing Procedure”, Scandinavian Journal of Statistics, 6, 65–70.

  10. HORTON, P., and NAKAI, K. (1996), “A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins”, Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 4, 109–115.

  11. HUBERT, L.J., and LEVIN, J.R. (1976), “A General Statistical Framework for Assessing Categorical Clustering in Free Recall”, Psychological Bulletin, 83, 1072–1080.

  12. HUBERT, L., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.

  13. JOHNSON, R.A., and WICHERN, D.W. (1982), Applied Multivariate Statistical Analysis, Upper Saddle River, NJ: Prentice Hall.

  14. KIM, M., and RAMAKRISHNA, R.S. (2005), “New Indices for Cluster Validity Assessment”, Pattern Recognition Letters, 26(15), 2353–2363.

  15. KUIPER, F.K., and FISHER, L. (1975), “A Monte Carlo Comparison of Six Clustering Procedures”, Biometrics, 31, 777–783.

  16. LAGO-FERNÁNDEZ, L.F., and CORBACHO, F. (2010), “Normality-Based Validation for Crisp Clustering”, Pattern Recognition, 43, 782–795.

  17. LIU, Y., HAYES, D.N., NOBEL, A., and MARRON, J.S. (2008), “Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data”, Journal of the American Statistical Association, 103(483), 1281–1293.

  18. MAECHLER, M., ROUSSEEUW, P., STRUYF, A., HUBERT, M., and HORNIK, K. (2011), Cluster: Cluster Analysis Basics and Extensions, R Package Version 1.14.1.

  19. MILLIGAN, G.W. (1981), “A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis”, Psychometrika, 46(2), 187–199.

  20. MILLIGAN, G.W., and COOPER, M.C. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Dataset”, Psychometrika, 50(2), 159–179.

  21. PARK, P.J., MANJOURIDES, J., BONETTI, M., and PAGANO, M. (2009), “A Permutation Test for Determining Significance of Clusters with Applications to Spatial and Gene Expression Data”, Computational Statistics and Data Analysis, 53(12), 4290–4300.

  22. PESARIN, F., and SALMASO, L. (2010), Permutation Tests for Complex Data. Theory, Applications and Software, Chichester: John Wiley and Sons.

  23. QIU, W.L., and JOE, H. (2006), “Separation Index and Partial Membership for Clustering”, Computational Statistics and Data Analysis, 50, 585–603.

  24. QIU, W.L., and JOE, H. (2006), “Generation of Random Clusters with Specified Degree of Separation”, Journal of Classification, 23(2), 315–334.

  25. QIU, W.L., and JOE, H. (2009), ClusterGeneration: Random Cluster Generation (with Specified Degree of Separation), R Package Version 1.2.7.

  26. R DEVELOPMENT CORE TEAM (2010), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, http://www.R-project.org.

  27. ROMANO, J.P., SHAIKH, A.M., and WOLF, M. (2008), “Formalized Data Snooping Based on Generalized Error Rates”, Econometric Theory, 24, 404–447.

  28. RYOTA, S., and SHIMODAIRA, H. (2011), pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling, R Package Version 1.2-2, http://CRAN.R-project.org/package=pvclust.

  29. SHIMODAIRA, H. (2004), “Approximately Unbiased Tests of Regions Using Multistep-Multiscale Bootstrap Resampling”, Annals of Statistics, 32, 2616–2641.

  30. STEINLEY, D. (2004), “Properties of the Hubert-Arabie Adjusted Rand Index”, Psychological Methods, 9(3), 386–396.

  31. TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic”, Journal of the Royal Statistical Society B, 63(2), 411–423.

  32. WARRENS, M.J. (2008), “On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index”, Journal of Classification, 25, 177–183.

  33. WICKHAM, H. (2009), ggplot2: Elegant Graphics for Data Analysis, New York: Springer.

  34. WISHART, D. (1969), “An Algorithm for Hierarchical Classification”, Biometrics, 25, 165–170.

  35. WU, K.-L., YANG, M.-S., and HSIEH, J.-N. (2009), “Robust Cluster Validity Indexes”, Pattern Recognition, 42(11), 2541–2550.

Author information

Corresponding author

Correspondence to Dario Bruzzese.

Additional information

The authors wish to thank Professor Ibai Gurrutxaga and his colleagues for kindly providing the data and the R code used in their paper: this allowed us to make a worthwhile comparison of the two methods. The authors are also grateful to Professor Jaromir Antoch for helpful comments on a previous draft of the paper and the three anonymous referees for their valuable suggestions which helped us to improve the final version of this paper.

All computation and graphics were done in the R language (R Development Core Team 2010) using the basic packages and the additional cluster (Maechler et al. 2011), ggplot2 (Wickham 2009) and NbClust (Charrad et al. 2013) packages.

About this article

Cite this article

Bruzzese, D., Vistocco, D. DESPOTA: DEndrogram Slicing through a PemutatiOn Test Approach. J Classif 32, 285–304 (2015). https://doi.org/10.1007/s00357-015-9179-x

Keywords

  • Hierarchical clustering
  • Cluster detection
  • Permutation tests