Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm

Giancarlo, Raffaele; Utro, Filippo

doi:10.1007/978-3-642-33757-4_20

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7597)

Included in the following conference series:

International Conference on Artificial Immune Systems

Abstract

Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from statistics to computer science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and have gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. Unfortunately, model selection, a fascinating and classic area of statistics with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we highlight: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (C) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.

An extended version of this manuscript appears in [20]; it is presented here as an invited contribution to the Bio- & Immune-Inspired Algorithms and Models for Multi-Level Complex Systems Workshop within ICARIS 2012.
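To make the notion of stability-based model selection concrete, the following is a minimal Python sketch of a generic resampling scheme: for each candidate number of clusters k, pairs of random subsamples are clustered independently and the two solutions are compared on the points they share. It is an illustration under stated assumptions, not the paradigm proposed in the paper. The toy k-means, the Rand index as agreement measure, the 80% subsample fraction, the synthetic data, and all function names (`kmeans`, `rand_index`, `stability`, `estimate_k`, `make_blobs`) are choices made here for illustration only.

```python
import random
from itertools import combinations

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Toy Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))

    def assign():
        return [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign()

def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree (together vs. apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def stability(points, k, rounds=10, frac=0.8, seed=0):
    """Mean agreement between clusterings of two random subsamples (assumed scheme)."""
    rng = random.Random(seed)
    n, m = len(points), int(frac * len(points))
    scores = []
    for _ in range(rounds):
        s1, s2 = rng.sample(range(n), m), rng.sample(range(n), m)
        l1 = dict(zip(s1, kmeans([points[i] for i in s1], k)))
        l2 = dict(zip(s2, kmeans([points[i] for i in s2], k)))
        shared = sorted(set(s1) & set(s2))  # compare only on common points
        scores.append(rand_index([l1[i] for i in shared], [l2[i] for i in shared]))
    return sum(scores) / len(scores)

def estimate_k(points, candidates):
    """Return the candidate k with the highest stability score."""
    return max(candidates, key=lambda k: stability(points, k))

def make_blobs(seed=1, per_blob=20):
    """Synthetic data: three well-separated 2-D blobs with bounded noise."""
    rng = random.Random(seed)
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(per_blob)]

if __name__ == "__main__":
    data = make_blobs()
    for k in range(2, 6):
        print(k, round(stability(data, k), 3))
    print("estimated k:", estimate_k(data, range(2, 6)))
```

The sketch also shows why such methods are slow, as the abstract notes: every candidate k requires many independent clustering runs on resampled data, so the cost of the underlying clustering algorithm is multiplied by the number of candidates and the number of resampling rounds.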

References

1. Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Powell, J., Yang, L., Marti, G., Moore, T., Hudson, J.J., Lu, L., Lewis, D., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, J., Warnke, R., Levy, R., Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P., Staudt, L.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)
2. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America 96, 6745–6750 (1999)
3. Andreopoulos, B., An, A., Wang, X., Schroeder, M.: A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics 10(3), 297–314 (2009)
4. Ben-David, S., von Luxburg, U., Pál, D.: A Sober Look at Clustering Stability. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 5–19. Springer, Heidelberg (2006)
5. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustering data. In: Seventh Pacific Symposium on Biocomputing, ISCB, pp. 6–17 (2002)
6. Benesty, J., Morgan, D., Sondhi, M.: A better understanding and an improved solution to the problems of stereophonic acoustic echo cancellation. In: ICASSP 1997: Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1997), vol. 1, p. 303. IEEE Computer Society (1997)
7. Bertoni, A., Valentini, G.: Model order selection for bio-molecular data clustering. BMC Bioinformatics 8 (2007)
8. Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., Sondak, V.: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536–540 (2000)
9. Bock, H.: On some significance tests in cluster analysis. Journal of Classification 2, 77–108 (1985)
10. Breckenridge, J.: Replicating cluster analysis: Method, consistency, and validity. Multivariate Behavioral Research 24(2), 147–161 (1989)
11. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
12. Chen, J., Lonardi, S.: Biological Data Mining. Chapman & Hall (2009)
13. D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23, 1499–1501 (2006)
14. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3 (2002)
15. Dudoit, S., Fridlyand, J.: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9), 1090–1099 (2003)
16. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall, London (1993)
17. Giancarlo, R., Scaturro, D., Utro, F.: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 9, 462 (2008)
18. Giancarlo, R., Scaturro, D., Utro, F.: Statistical indices for computational and data driven class discovery in microarray data. In: Chen, J.Y., Lonardi, S. (eds.) Biological Data Mining, pp. 295–335. CRC Press, San Francisco (2009)
19. Giancarlo, R., Utro, F.: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology 6, 1 (2011)
20. Giancarlo, R., Utro, F.: Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Computer Science 428, 58–79 (2012)
21. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
22. Gordon, A.: Null models in cluster validation. In: From Data to Knowledge: Theoretical and Practical Aspects of Classification, pp. 32–44. Springer (1996)
23. Handl, J., Knowles, J., Kell, D.: Computational cluster validation in Post-genomic data analysis. Bioinformatics 21(15), 3201–3212 (2005)
24. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2003)
25. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)
26. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
27. Kerr, M., Churchill, G.: Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. PNAS 98, 8961–8965 (2000)
28. Kraus, J., Kestler, H.: A highly efficient multi-core algorithm for clustering extremely large datasets. BMC Bioinformatics 11 (2010)
29. Levine, E., Domany, E.: Resampling method for unsupervised estimation of cluster validity. Neural Computation 13, 2573–2593 (2001)
30. McShane, L., Radmacher, M., Freidlin, B., Yu, R., Li, M.C., Simon, R.: Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469 (2002)
31. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)
32. Perou, C., Jeffrey, S., van de Rijn, M., Rees, C., Eisen, M., Ross, D., Pergamenschikov, A., Williams, C., Zhu, S., Lee, J., Lashkari, D., Shalon, D., Brown, P., Botstein, D.: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Sciences of the United States of America 96, 9212–9217 (1999)
33. Pollack, J., Perou, C., Alizadeh, A., Eisen, M., Amd, C.F., Williams, A.P., Jeffrey, S., Botstein, D., Brown, P.: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics 23, 41–46 (1999)
34. Raviv, Y., Intrator, N.: Bootstrapping with noise: An effective regularization technique. Connection Science 8, 355–372 (1996)
35. Ross, D., Scherf, U., Eisen, M., Perou, C., Spellman, P., Iyer, V., Jeffrey, S., van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J., Lashkari, D., Shalon, D., Myers, T., Weinstein, J., Botstein, D., Brown, P.: Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24, 227–235 (2000)
36. Roth, V., Lange, T., Braun, M., Buhmann, J.: A resampling approach to cluster validation. In: Proceedings 15th Symposium in Computational Statistics, pp. 123–128 (2002)
37. Sarle, W.: Cubic clustering criterion. Tech. rep., SAS (1983)
38. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the gap statistics. Journal Royal Statistical Society B 2, 411–423 (2001)
39. Utro, F.: Algorithms for internal validation clustering measures in the Post-genomic era, Doctoral Dissertation, University of Palermo (2011), http://arxiv.org/abs/1102.2915v1
40. Valentini, G.: Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics 23, 387–389 (2007)
41. Wolfinger, R., Gibson, G., Wolfinger, E., Bennet, L., Hamadeh, H., Bushel, C., Paules, R.: Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology, 625–637 (2001)

Author information

Authors and Affiliations

Dipartimento di Matematica ed Informatica, University of Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Filippo Utro

Editor information

Editors and Affiliations

Departamento de Computación, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV-IPN), Av. Instituto Politécnico Nacional No. 2508, Col. San Pedro Zacatenco, 07300, Mexico, D.F., Mexico
Carlos A. Coello Coello
School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, NG8 1BB, Nottingham, UK
Julie Greensmith
School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, NG8 1BB, Nottingham, UK
Natalio Krasnogor
Computer Laboratory, University of Cambridge, William Gates Building, 15 JJ Thomson Avenue, CB3 0FD, Cambridge, UK
Pietro Liò
Department of Mathematics and Computer Science, University of Catania, Viale A. Doria 6, 95125, Catania, Italy
Giuseppe Nicosia & Mario Pavone

Cite this paper

Giancarlo, R., Utro, F. (2012). Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm. In: Coello Coello, C.A., Greensmith, J., Krasnogor, N., Liò, P., Nicosia, G., Pavone, M. (eds) Artificial Immune Systems. ICARIS 2012. Lecture Notes in Computer Science, vol 7597. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33757-4_20

DOI: https://doi.org/10.1007/978-3-642-33757-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33756-7
Online ISBN: 978-3-642-33757-4
eBook Packages: Computer Science (R0)
