Abstract
Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is a particular instance of the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In what follows, for ease of reference, we still refer to that instance as model selection. It is an important part of any statistical analysis. The techniques used for solving it are mainly either Bayesian or data-driven, and both are based on internal knowledge, that is, they use information obtained by processing the input data. Although both kinds of techniques have been evaluated in the realm of microarray data analysis, their merits relative to each other have not been assessed. Here we fill this gap in the literature by comparing three Bayesian methods with several state-of-the-art data-driven model selection methods. Our results show that, although Bayesian methods give good results in some cases, they cannot compete with the data-driven methods in their ability to predict the correct number of clusters in a dataset.
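To make the notion of model selection used here concrete, the following is a minimal sketch of choosing the number of clusters with a BIC-style score computed only from the input data. It is an illustration under stated assumptions, not the procedure evaluated in the article: the abstract does not name the three Bayesian methods, and the 1-D k-means routine, the shared-variance Gaussian model, and the names `kmeans_1d`, `bic_score` and `select_k` are all hypothetical choices made for this example.

```python
import math


def kmeans_1d(xs, k, iters=100):
    """Plain Lloyd iterations on 1-D data (illustrative helper, not from the article)."""
    s = sorted(xs)
    n = len(s)
    # Seed the centers at evenly spaced quantiles so the run is reproducible.
    centers = [s[int((i + 0.5) * n / k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Assign each point to its nearest center (ties go to the lower index).
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[j].append(x)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return centers, groups


def bic_score(xs, k):
    """BIC of a hard-assignment Gaussian model with one shared variance (assumed model)."""
    centers, groups = kmeans_1d(xs, k)
    n = len(xs)
    rss = sum((x - c) ** 2 for c, g in zip(centers, groups) for x in g)
    var = max(rss / n, 1e-12)  # maximum-likelihood shared variance, floored to avoid log(0)
    # Classification log-likelihood: mixing proportions plus Gaussian terms.
    mixing = sum(len(g) * math.log(len(g) / n) for g in groups if g)
    loglik = mixing - 0.5 * n * math.log(2 * math.pi * var) - rss / (2 * var)
    n_params = k + (k - 1) + 1  # k means, k-1 free proportions, 1 variance
    return -2.0 * loglik + n_params * math.log(n)


def select_k(xs, candidates):
    """Return the candidate number of clusters with the lowest BIC."""
    return min(candidates, key=lambda k: bic_score(xs, k))


# Toy input: three well-separated groups of five points each.
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [c + d for c in (0.0, 10.0, 20.0) for d in offsets]
best_k = select_k(data, range(1, 7))
print(best_k)
```

The penalty term `n_params * math.log(n)` is what stops the score from always favouring more clusters. A data-driven method would instead assess each candidate number of clusters by, for example, resampling the data and measuring how stable the resulting partitions are.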
Funding
Giosué Lo Bosco and Raffaele Giancarlo were supported by Progetto di Ateneo dell’Universitá degli Studi di Palermo 2012-ATE-0298 Metodi Formali e Algoritmici per la Bioinformatica su Scala Genomica.
Cite this article
Giancarlo, R., Lo Bosco, G. & Utro, F. Bayesian versus data driven model selection for microarray data. Nat Comput 14, 393–402 (2015). https://doi.org/10.1007/s11047-014-9446-5
DOI: https://doi.org/10.1007/s11047-014-9446-5