Unsupervised Stability-Based Ensembles to Discover Reliable Structures in Complex Bio-molecular Data

Bertoni, Alberto; Valentini, Giorgio

doi:10.1007/978-3-642-02504-4_3


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5488)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

Assessing the reliability of clusters discovered in bio-molecular data is a central issue in several bioinformatics problems. Several methods based on the concept of stability have been proposed to estimate the reliability of each individual cluster as well as the "optimal" number of clusters. In this conceptual framework, a clustering ensemble is obtained through bootstrapping techniques, noise injection into the data, or random projections into lower-dimensional subspaces. A measure of the reliability of a given clustering is then obtained through specific stability/reliability scores based on the similarity of the clusterings composing the ensemble. Classical stability-based methods do not assess the statistical significance of the clustering solutions and cannot directly detect multiple structures (e.g. hierarchical structures) simultaneously present in the data. Statistical approaches based on the chi-square distribution and on the Bernstein inequality show that stability-based methods can be successfully applied to the statistical assessment of the reliability of clusters, and to the discovery of multiple structures underlying complex bio-molecular data. In this paper we provide an overview of stability-based methods, focusing on the stability indices and statistical tests that we recently proposed in the context of gene expression data analysis.
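
The general idea of a stability score can be illustrated with a minimal sketch. It is not the chapter's own indices or statistical tests: it assumes random subsampling as the perturbation, a plain k-means as the clustering algorithm, and the Rand index as the similarity between clusterings. All function names (`kmeans`, `rand_index`, `stability_score`, `two_blobs`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=25):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability_score(points, k, n_pairs=10, frac=0.8, seed=0):
    """Mean Rand index between clusterings of pairs of random subsamples.

    Assumption: subsampling stands in for the perturbations named in the
    abstract (bootstrapping, noise injection, random projections).
    """
    rng = random.Random(seed)
    n = len(points)
    m = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k, rng)))
        lab_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k, rng)))
        # Compare the two clusterings only on points present in both subsamples.
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(rand_index([lab_a[i] for i in shared],
                                 [lab_b[i] for i in shared]))
    return sum(scores) / len(scores)


def two_blobs(n_per_blob=30, seed=1):
    """Synthetic 2-D data: two well-separated Gaussian blobs (illustrative)."""
    rng = random.Random(seed)

    def blob(cx, cy):
        return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n_per_blob)]

    return blob(0.0, 0.0) + blob(10.0, 10.0)


if __name__ == "__main__":
    data = two_blobs()
    for k in (2, 3, 4):
        print(k, round(stability_score(data, k), 3))
```

On data with a clear two-cluster structure, the score for k = 2 would be expected to be close to 1, while other values of k tend to give less reproducible partitions. A raw score of this kind carries no significance level, which is the gap the statistical tests mentioned in the abstract are meant to address.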

Author information

Authors and Affiliations

DSI, Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135, Milano, Italia
Alberto Bertoni & Giorgio Valentini

Editor information

Editors and Affiliations

DISI - Dipartimento di Informatica e Scienze dell’Informazione, Università di Genova, Via Dodecaneso 35, 16146, Genova, Italy
Francesco Masulli
DMI, Dipartimento di Matematica ed Informatica, Università di Salerno, Via Ponte don Melillo, 84084, Fisciano (Sa), Italy
Roberto Tagliaferri
Department of Pharmaceutical Chemistry, School of Pharmacy, The University of Kansas, 2095 Constant Ave, Lawrence, 66047, Kansas, USA
Gennady M. Verkhivker

About this paper

Cite this paper

Bertoni, A., Valentini, G. (2009). Unsupervised Stability-Based Ensembles to Discover Reliable Structures in Complex Bio-molecular Data. In: Masulli, F., Tagliaferri, R., Verkhivker, G.M. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2008. Lecture Notes in Computer Science, vol 5488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02504-4_3

DOI: https://doi.org/10.1007/978-3-642-02504-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02503-7
Online ISBN: 978-3-642-02504-4
eBook Packages: Computer Science, Computer Science (R0)
