Bayesian versus data driven model selection for microarray data

Natural Computing

Abstract

Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from statistics to computer science. In this area, one of the most difficult challenges is a particular instance of the model selection problem: identifying the correct number of clusters in a dataset. In what follows, for ease of reference, we still refer to that instance as model selection. It is an important part of any statistical analysis. The techniques used to solve it are mainly either Bayesian or data-driven, and both are based on internal knowledge, that is, they use only information obtained by processing the input data. Although both kinds of technique have been evaluated in the realm of microarray data analysis, their relative merits have not been assessed. Here we fill this gap in the literature by comparing three Bayesian methods against several state-of-the-art data-driven model selection methods. Our results show that, although Bayesian methods give good results in some cases, they cannot compete with the data-driven methods in predicting the correct number of clusters in a dataset.
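To make the notion of model selection concrete, the following is a minimal sketch of how a Bayesian-type criterion can pick the number of clusters. The abstract does not name the three Bayesian methods that were compared, so the BIC-style score used here is an assumption chosen only because it is widely known. The 1-D k-means routine, the shared-variance Gaussian model, the function names and the toy data are likewise illustrative and are not taken from the article.

```python
import math

def kmeans_1d(xs, k, iters=100):
    """Lloyd's algorithm on 1-D data with a deterministic quantile-based start."""
    pts = sorted(xs)
    n = len(pts)
    # Illustrative initialisation: one point from the middle of each of k slices.
    centers = [pts[min(n - 1, (2 * j + 1) * n // (2 * k))] for j in range(k)]
    groups = [pts]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        new_centers = [sum(g) / len(g) if g else centers[j]
                       for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return groups

def within_ss(groups):
    """Within-cluster sum of squared deviations from each cluster mean."""
    total = 0.0
    for g in groups:
        if g:
            mean = sum(g) / len(g)
            total += sum((x - mean) ** 2 for x in g)
    return total

def bic(xs, k):
    """BIC-style score (lower is better) for a hard k-cluster partition.

    Assumed model: 1-D Gaussian clusters with a shared variance and mixing
    proportions n_j / n, which gives 2k free parameters
    (k means, k - 1 proportions, 1 variance).
    """
    n = len(xs)
    groups = kmeans_1d(xs, k)
    var = max(within_ss(groups) / n, 1e-12)  # floor avoids log(0)
    log_lik = sum(len(g) * math.log(len(g) / n) for g in groups if g)
    log_lik += -0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    return -2.0 * log_lik + 2 * k * math.log(n)

def select_k(xs, k_max):
    """Return the number of clusters in 1..k_max with the lowest score."""
    return min(range(1, k_max + 1), key=lambda k: bic(xs, k))

# Hypothetical toy data: two well-separated groups of five values each.
toy = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
best_k = select_k(toy, 4)
print(best_k)
```

A data-driven method would replace the closed-form penalty in `bic` with a quantity estimated from the data themselves, for example by resampling. The sketch only shows the general shape of such a criterion and says nothing about how the methods in the article perform.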

Funding

Giosué Lo Bosco and Raffaele Giancarlo were supported by Progetto di Ateneo dell’Università degli Studi di Palermo 2012-ATE-0298 Metodi Formali e Algoritmici per la Bioinformatica su Scala Genomica.

Author information

Corresponding author

Correspondence to Giosué Lo Bosco.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 62 KB)

Cite this article

Giancarlo, R., Lo Bosco, G. & Utro, F. Bayesian versus data driven model selection for microarray data. Nat Comput 14, 393–402 (2015). https://doi.org/10.1007/s11047-014-9446-5
