Multi-objective selection for collecting cluster alternatives

Abstract

Grouping objects into categories is a basic means of cognition. In machine learning and statistics, this task is addressed by cluster analysis. Yet how to assess the reliability and quality of clusterings remains a matter of debate. In particular, it is hard to determine the optimal number of clusters inherent in the underlying data. Running different cluster algorithms and cluster validation methods usually yields different optimal clusterings. In fact, several clusterings with different numbers of clusters are plausible in many situations, because different methods are specialized for different structural properties. To account for the possibility of multiple plausible clusterings, we employ a multi-objective approach for collecting cluster alternatives (MOCCA) from a combination of cluster algorithms and validation measures. In an application to artificial data as well as microarray data sets, we demonstrate that exploring a Pareto set of optimal partitions rather than a single solution can identify alternative solutions that are overlooked by conventional clustering strategies. Competing solutions are ranked according to an impartial criterion, while the ultimate judgement is left to the investigator.
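To make the notion of a Pareto set of partitions concrete, the following is a minimal sketch and not the authors' MOCCA implementation. It assumes each candidate clustering has already been scored by several validation measures, that every measure is to be maximized, and that the candidate names and score values are purely hypothetical. A candidate is kept if no other candidate is at least as good on every measure and strictly better on at least one.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a Pareto-dominates b (all objectives maximized)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_set(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical scores: (validation measure 1, validation measure 2),
# one entry per combination of cluster algorithm and number of clusters.
candidates = {
    "kmeans_k2": (0.90, 0.40),
    "kmeans_k3": (0.70, 0.70),
    "kmeans_k4": (0.60, 0.60),  # dominated by kmeans_k3 on both measures
    "hclust_k3": (0.50, 0.80),
}

front = pareto_set(candidates)
print(front)
```

In this toy example three of the four candidates survive, with different numbers of clusters among them, which illustrates why a single "optimal" clustering may hide plausible alternatives. How the surviving candidates are then ranked is not specified in the abstract and is left out of the sketch.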

Author information

Corresponding author

Correspondence to Hans A. Kestler.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (PDF 103 kb)

About this article

Cite this article

Kraus, J.M., Müssel, C., Palm, G. et al. Multi-objective selection for collecting cluster alternatives. Comput Stat 26, 341–353 (2011). https://doi.org/10.1007/s00180-011-0244-6

Keywords

  • Cluster analysis
  • Multi-objective optimization
  • Cluster number estimation
  • Cluster validation