Bayesian versus data driven model selection for microarray data

Giancarlo, Raffaele; Lo Bosco, Giosué; Utro, Filippo

doi:10.1007/s11047-014-9446-5


Published: 16 July 2014

Volume 14, pages 393–402, (2015)

Natural Computing



Abstract

Clustering is one of the best-known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is a particular instance of the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In what follows, for ease of reference, we still refer to that instance as model selection. It is an important part of any statistical analysis. The techniques used for solving it are mainly either Bayesian or data-driven, and both are based on internal knowledge, that is, they use information obtained by processing the input data. Although both techniques have been evaluated in the realm of microarray data analysis, their relative merits have not been assessed. Here we fill this gap in the literature by comparing three Bayesian methods against several state-of-the-art data-driven model selection methods. Our results show that, although in some cases Bayesian methods guarantee good results, they cannot compete with the data-driven methods in their ability to predict the correct number of clusters in a dataset.
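
As a rough illustration of what Bayesian-style selection of the number of clusters can look like, the sketch below scores k-means partitions of a toy one-dimensional dataset with the Bayesian information criterion (BIC) and picks the k with the lowest score. The abstract does not say which three Bayesian methods the article compares, so this is not the authors' procedure. The function names (kmeans_1d, bic, select_k), the toy data, the quantile seeding, and the hard-assignment Gaussian likelihood with a single shared variance are all assumptions made for this example.

```python
import math


def kmeans_1d(points, k, iters=100):
    """Lloyd's algorithm on 1-D data, seeded deterministically at quantiles."""
    pts = sorted(points)
    n = len(pts)
    # Seed the centers at evenly spaced quantiles of the sorted data.
    centers = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


def bic(points, k):
    """BIC of a k-cluster partition (lower is better with this sign convention).

    Assumed model: hard assignments, Gaussian clusters, one shared variance.
    """
    n = len(points)
    centers, clusters = kmeans_1d(points, k)
    rss = sum((x - c) ** 2
              for c, members in zip(centers, clusters) for x in members)
    if n <= k or rss == 0.0:
        return math.inf  # degenerate fit, variance cannot be estimated
    variance = rss / (n - k)
    # Log-likelihood: mixing proportions plus the Gaussian density terms.
    log_lik = sum(len(m) * math.log(len(m) / n) for m in clusters if m)
    log_lik -= 0.5 * n * math.log(2 * math.pi * variance) + 0.5 * (n - k)
    # Free parameters: (k - 1) proportions, k means, 1 shared variance.
    n_params = (k - 1) + k + 1
    return -2.0 * log_lik + n_params * math.log(n)


def select_k(points, k_max):
    """Return the k in 1..k_max with the lowest BIC, plus all scores."""
    scores = {k: bic(points, k) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores


# Toy data: three well-separated groups of five values each.
data = [1.0, 1.2, 0.8, 1.1, 0.9,
        5.0, 5.2, 4.8, 5.1, 4.9,
        9.0, 9.2, 8.8, 9.1, 8.9]
best_k, scores = select_k(data, 5)
print(best_k)  # 3 for this toy data
```

A data-driven method would instead estimate the number of clusters from quantities computed on the data itself, for instance by resampling, rather than from a penalized likelihood of this kind.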



Funding

Giosué Lo Bosco and Raffaele Giancarlo were supported by Progetto di Ateneo dell’Universitá degli Studi di Palermo 2012-ATE-0298 Metodi Formali e Algoritmici per la Bioinformatica su Scala Genomica.

Author information

Authors and Affiliations

Dipartimento di Matematica e Informatica, University of Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo & Giosué Lo Bosco
Istituto Euro Mediterraneo di Scienza e Tecnologia, Via Emerico Amari 123, 90139, Palermo, Italy
Giosué Lo Bosco
Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Filippo Utro


Corresponding author

Correspondence to Giosué Lo Bosco.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 62 KB)


About this article

Cite this article

Giancarlo, R., Lo Bosco, G. & Utro, F. Bayesian versus data driven model selection for microarray data. Nat Comput 14, 393–402 (2015). https://doi.org/10.1007/s11047-014-9446-5

Issue Date: September 2015
