Unsupervised Stability-Based Ensembles to Discover Reliable Structures in Complex Bio-molecular Data

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2008)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5488)

Abstract

The assessment of the reliability of clusters discovered in bio-molecular data is a central issue in several bioinformatics problems. Several methods based on the concept of stability have been proposed to estimate the reliability of each individual cluster as well as the "optimal" number of clusters. In this conceptual framework, a clustering ensemble is obtained through bootstrapping techniques, noise injection into the data, or random projections into lower-dimensional subspaces. A measure of the reliability of a given clustering is then obtained through specific stability/reliability scores based on the similarity of the clusterings composing the ensemble. Classical stability-based methods do not provide an assessment of the statistical significance of the clustering solutions and are not able to directly detect multiple structures (e.g., hierarchical structures) simultaneously present in the data. Statistical approaches based on the chi-square distribution and on the Bernstein inequality show that stability-based methods can be successfully applied to the statistical assessment of the reliability of clusters, and to the discovery of multiple structures underlying complex bio-molecular data. In this paper we provide an overview of stability-based methods, focusing on stability indices and statistical tests that we recently proposed in the context of the analysis of gene expression data.
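
The general idea described in the abstract (perturb the data, cluster each perturbed copy, then score how similar the resulting clusterings are) can be illustrated with a minimal sketch. This is not the authors' method or software: it assumes noise injection as the perturbation, a plain k-means as the base clustering algorithm, and the mean pairwise Rand index as the stability score. The chi-square and Bernstein-based significance tests are not implemented here. All function names (`kmeans`, `rand_index`, `stability_score`, `two_blobs`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with farthest-first initialisation; returns labels."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Farthest-first initialisation: start from the first point, then
    # repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability_score(points, k, n_runs=10, noise=0.2, seed=0):
    """Mean pairwise Rand index over clusterings of noise-perturbed copies of the data."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_runs):
        # Noise injection: add Gaussian noise to every coordinate (assumed perturbation).
        noisy = [tuple(x + rng.gauss(0.0, noise) for x in p) for p in points]
        ensemble.append(kmeans(noisy, k))
    scores = [rand_index(a, b) for a, b in combinations(ensemble, 2)]
    return sum(scores) / len(scores)


def two_blobs(n_per_blob=30, seed=1):
    """Synthetic toy data: two well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)

    def blob(cx, cy):
        return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for _ in range(n_per_blob)]

    return blob(0.0, 0.0) + blob(10.0, 10.0)


if __name__ == "__main__":
    data = two_blobs()
    for k in (2, 3, 4):
        print(k, round(stability_score(data, k), 3))
```

On this toy data, the score for k = 2 should be close to 1, since the two blobs are recovered regardless of the injected noise. A score alone does not say whether a clustering is significantly more stable than another, which is the gap the statistical tests mentioned in the abstract are meant to address.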

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bertoni, A., Valentini, G. (2009). Unsupervised Stability-Based Ensembles to Discover Reliable Structures in Complex Bio-molecular Data. In: Masulli, F., Tagliaferri, R., Verkhivker, G.M. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2008. Lecture Notes in Computer Science, vol 5488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02504-4_3

  • DOI: https://doi.org/10.1007/978-3-642-02504-4_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02503-7

  • Online ISBN: 978-3-642-02504-4

  • eBook Packages: Computer Science, Computer Science (R0)
