Computational Statistics, Volume 26, Issue 2, pp 341–353

Multi-objective selection for collecting cluster alternatives

  • Johann M. Kraus
  • Christoph Müssel
  • Günther Palm
  • Hans A. Kestler
Original Paper

Abstract

Grouping objects into categories is a basic means of cognition. In machine learning and statistics, this task is addressed by cluster analysis. Yet how to assess the reliability and quality of clusterings remains controversial. In particular, it is hard to determine the optimal number of clusters inherent in the underlying data. Running different cluster algorithms and cluster validation methods usually yields different optimal clusterings. In fact, several clusterings with different numbers of clusters are plausible in many situations, because different methods are specialized for different structural properties. To account for multiple plausible clusterings, we employ a multi-objective approach for collecting cluster alternatives (MOCCA) from a combination of cluster algorithms and validation measures. In an application to artificial data as well as microarray data sets, we demonstrate that exploring a Pareto set of optimal partitions rather than a single solution can identify alternatives that conventional clustering strategies overlook. Competing solutions are ranked by an impartial criterion, while the final judgement is left to the investigator.
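
The idea of keeping a Pareto set of partitions instead of a single best one can be illustrated with a minimal sketch. This is not the authors' MOCCA implementation: the function names, the candidate labels, and the score values below are all hypothetical, and the sketch assumes every validation measure has been oriented so that higher is better. A candidate clustering stays in the set if no other candidate is at least as good on every measure and strictly better on one.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and
    strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the labels of all candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidates: (algorithm, number of clusters) mapped to two
# made-up validation scores, both to be maximised.
candidates = {
    "kmeans, k=2": (0.95, 0.60),
    "kmeans, k=3": (0.90, 0.72),
    "kmeans, k=4": (0.70, 0.65),
    "average linkage, k=3": (0.85, 0.75),
    "average linkage, k=5": (0.60, 0.50),
}

front = pareto_set(candidates)
print(front)
```

With these made-up scores, three candidates with different numbers of clusters survive, which mirrors the point that several partitions can be plausible at once. The sketch stops at the Pareto set; how the surviving candidates are then ranked is not specified here and is left out.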

Keywords

Cluster analysis, Multi-objective optimization, Cluster number estimation, Cluster validation

Supplementary material

ESM 1: 180_2011_244_MOESM1_ESM.pdf (PDF, 103 kb)

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Johann M. Kraus (1)
  • Christoph Müssel (1)
  • Günther Palm (1)
  • Hans A. Kestler (1, 2)
  1. Institute of Neural Information Processing, University of Ulm, Ulm, Germany
  2. Department of Internal Medicine I, University Hospital Ulm, Ulm, Germany
