Stability-Based Model Selection for High Throughput Genomic Data: An Algorithmic Paradigm

  • Raffaele Giancarlo
  • Filippo Utro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7597)


Clustering is one of the best-known activities in scientific investigation and the object of research in many disciplines, ranging from Statistics to Computer Science. In this beautiful area, one of the most difficult challenges is the model selection problem, i.e., the identification of the correct number of clusters in a dataset. In the last decade, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and have gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and the best performing in terms of prediction, but the slowest in terms of time. Unfortunately, model selection, a fascinating and classic area of statistics with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we highlight: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant a given clustering solution is; (C) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.
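As a rough illustration of what a stability-based method for model selection does, the sketch below scores a candidate number of clusters k by clustering pairs of random subsamples and measuring how well the two clusterings agree on the points they share. It is not the paradigm proposed in the paper: the choice of k-means, the Rand index as agreement measure, the 80% subsampling fraction, and all function names (`kmeans`, `rand_index`, `stability`, `make_blobs`) are assumptions made only for this example.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns one label per point."""
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)


def stability(points, k, n_resamples=20, fraction=0.8, seed=0):
    """Average agreement between clusterings of two random subsamples (hypothetical score)."""
    rng = random.Random(seed)
    n = len(points)
    m = int(fraction * n)
    scores = []
    for _ in range(n_resamples):
        labelings = []
        for _ in range(2):
            idx = sorted(rng.sample(range(n), m))  # subsample without replacement
            labels = kmeans([points[i] for i in idx], k, rng)
            labelings.append(dict(zip(idx, labels)))
        # Compare the two clusterings only on points present in both subsamples.
        shared = sorted(labelings[0].keys() & labelings[1].keys())
        if len(shared) >= 2:
            scores.append(rand_index([labelings[0][i] for i in shared],
                                     [labelings[1][i] for i in shared]))
    return sum(scores) / len(scores)


def make_blobs(per_cluster=20, seed=1):
    """Synthetic 2-D data: three well-separated Gaussian blobs (illustrative only)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(per_cluster)]


if __name__ == "__main__":
    data = make_blobs()
    for k in range(2, 6):
        print(k, round(stability(data, k), 3))
```

In this sketch one would pick a k whose score is high relative to the others. Note that k = 1 is trivially stable under this score, which is one reason practical methods normalize or calibrate the raw stability values. Every candidate k requires many clustering runs, which illustrates why such methods are slow.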


Keywords: Cluster Algorithm; Cluster Solution; Microarray Data Analysis; Consensus Cluster; Model Selection Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Raffaele Giancarlo (1)
  • Filippo Utro (2)
  1. Dipartimento di Matematica ed Informatica, University of Palermo, Palermo, Italy
  2. Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, USA
