Journal of Classification, Volume 32, Issue 2, pp 285–304

DESPOTA: DEndrogram Slicing through a PermutatiOn Test Approach

  • Dario Bruzzese
  • Domenico Vistocco
Article

Abstract

Hierarchical clustering is one of the most widely used approaches to classification problems, largely because of the visual power of its graphical representation, the dendrogram. Choosing an appropriate number of clusters nevertheless remains the main difficulty for the end user. We introduce DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach), a novel approach that uses permutation tests to automatically detect a partition among those embedded in a dendrogram. Unlike the traditional approach, DESPOTA also includes in its search space partitions that do not correspond to horizontal cuts of the dendrogram. Applications to both real and synthetic datasets show the effectiveness of the proposal.
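The distinction the abstract draws between horizontal cuts and the wider set of partitions embedded in a dendrogram can be illustrated with a toy sketch. The five-leaf tree, its merge heights, and the helper names (`Node`, `horizontal_cut`, `embedded_partitions`) are hypothetical and are not taken from the paper. The sketch does not implement DESPOTA's permutation test; it only enumerates the two search spaces for comparison.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set

Cluster = FrozenSet[str]
Partition = FrozenSet[Cluster]


@dataclass(frozen=True)
class Node:
    """A dendrogram node: a leaf (label set) or a merge of two subtrees."""
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None

    def leaves(self) -> Cluster:
        if self.label is not None:
            return frozenset([self.label])
        return self.left.leaves() | self.right.leaves()


def horizontal_cut(node: Node, h: float) -> Partition:
    """Cut at height h: split every merge that happens above h."""
    if node.label is not None or node.height <= h:
        return frozenset([node.leaves()])
    return horizontal_cut(node.left, h) | horizontal_cut(node.right, h)


def merge_heights(node: Node) -> Set[float]:
    """Heights of all internal (merge) nodes."""
    if node.label is not None:
        return set()
    return {node.height} | merge_heights(node.left) | merge_heights(node.right)


def horizontal_partitions(root: Node) -> Set[Partition]:
    """Every distinct partition reachable by a single horizontal cut."""
    heights = sorted(merge_heights(root))
    thresholds = [heights[0] - 1.0] + heights  # one cut below all merges
    return {horizontal_cut(root, h) for h in thresholds}


def embedded_partitions(node: Node) -> Set[Partition]:
    """Every partition embedded in the tree: keep a node whole, or
    combine any embedded partition of its left and right subtrees."""
    result = {frozenset([node.leaves()])}
    if node.label is None:
        for p in embedded_partitions(node.left):
            for q in embedded_partitions(node.right):
                result.add(p | q)
    return result


# Hypothetical toy dendrogram: ((a, b), ((c, d), e))
a, b, c, d, e = (Node(label=x) for x in "abcde")
ab = Node(height=1.0, left=a, right=b)
cd = Node(height=2.0, left=c, right=d)
cde = Node(height=3.0, left=cd, right=e)
root = Node(height=4.0, left=ab, right=cde)

all_parts = embedded_partitions(root)
horiz = horizontal_partitions(root)
non_horizontal = all_parts - horiz

print(len(all_parts), len(horiz), len(non_horizontal))
for part in non_horizontal:
    print(sorted("".join(sorted(cluster)) for cluster in part))
```

In this toy tree, five of the seven embedded partitions are horizontal cuts. The remaining two, such as {a}, {b}, {c, d, e}, split the low branch while keeping a higher one whole, which no single horizontal cut can produce.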

Keywords

Hierarchical clustering; Cluster detection; Permutation tests



Copyright information

© Classification Society of North America 2015

Authors and Affiliations

  1. Department of Public Health, University of Naples “Federico II”, Naples, Italy
  2. Department of Economics and Law, University of Cassino, Cassino, Italy
