Bootstrapping estimates of stability for clusters, observations and model selection

  • Han Yu
  • Brian Chapman
  • Arianna Di Florio
  • Ellen Eischen
  • David Gotz
  • Mathews Jacob
  • Rachael Hageman Blair
Original Paper

Abstract

Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used with two possible bootstrap approaches for stability. The first approach, scheme 1, assesses confidence (stability) around the clustering of the original dataset based on bootstrap replications. The second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions, which can be motivated by an investigator's trust (or lack thereof) in the original data and by additional computational considerations. We propose a hierarchical visualization extrapolated from the stability profiles that gives insight into the separation of groups, and projected visualizations for inspecting the stability of individual observations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network (CRAN).
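
The idea behind scheme 1 can be illustrated with a minimal sketch. It is not the authors' algorithm and does not use the bootcluster API. It assumes one-dimensional data, a plain k-means with quantile-based starting centers, and a per-cluster score defined as the average best-match Jaccard coefficient between each original cluster and the clusters obtained from bootstrap replicates. All function names (`kmeans_1d`, `assign`, `jaccard`, `bootstrap_stability`) are hypothetical and chosen for illustration only.

```python
import random


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on 1-D data, started from evenly spaced quantiles."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centers)].append(x)
        # An empty group keeps its previous center.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers


def assign(xs, centers):
    """Partition observation indices by nearest center."""
    clusters = [set() for _ in centers]
    for i, x in enumerate(xs):
        clusters[nearest(x, centers)].add(i)
    return clusters


def jaccard(a, b):
    """Jaccard coefficient of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_stability(xs, k, n_boot=100, seed=0):
    """Assumed stability score: mean best-match Jaccard per original cluster."""
    rng = random.Random(seed)
    reference = assign(xs, kmeans_1d(xs, k))  # clustering of the original data
    totals = [0.0] * k
    for _ in range(n_boot):
        sample = [rng.choice(xs) for _ in xs]  # resample with replacement
        # Fit on the bootstrap sample, then label every original observation.
        boot = assign(xs, kmeans_1d(sample, k))
        for j, ref_cluster in enumerate(reference):
            totals[j] += max(jaccard(ref_cluster, b) for b in boot)
    return [t / n_boot for t in totals]


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
    print(bootstrap_stability(data, k=2))
```

Scores near 1 indicate that a cluster found in the original data reappears almost unchanged across bootstrap replicates. A scheme 2 style search would instead compare the bootstrap clusterings with one another and keep the most stable one, which this sketch does not attempt.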

Keywords

Ensemble, k-means, Jaccard coefficient, Clustering, Visualization

Supplementary material

180_2018_830_MOESM1_ESM.pdf (261 kb)
Supplementary material 1 (pdf 260 KB)

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Han Yu (1)
  • Brian Chapman (2)
  • Arianna Di Florio (3, 4)
  • Ellen Eischen (5)
  • David Gotz (6)
  • Mathews Jacob (7)
  • Rachael Hageman Blair (8)

  1. Department of Biostatistics, State University of New York at Buffalo, Buffalo, USA
  2. Department of Radiology and Imaging Science, University of Utah, Salt Lake City, USA
  3. Institute of Psychological Medicine and Clinical Neurosciences, Cardiff University School of Medicine, Cardiff, UK
  4. Department of Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, USA
  5. Department of Mathematics, University of Oregon, Eugene, USA
  6. School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, USA
  7. Department of Electrical and Computer Engineering, University of Iowa, Iowa City, USA
  8. Department of Biostatistics, State University of New York at Buffalo, Buffalo, USA
