Heterogeneous Data Integration with the Consensus Clustering Formalism

Filkov, Vladimir; Skiena, Steven

doi:10.1007/978-3-540-24745-6_8

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 2994)

Included in the following conference series:

International Workshop on Data Integration in the Life Sciences

Abstract

Meaningfully integrating massive multi-experimental genomic data sets is becoming critical for the understanding of gene function. We have recently proposed methodologies for integrating large numbers of microarray data sets based on consensus clustering. Our methods combine gene clusters into a unified representation, or a consensus, that is insensitive to misclassifications in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their refinement and improvement. First, we compare our best heuristic to the popular majority rule consensus clustering heuristic and show that the former yields tighter consensuses. We then refine our consensus algorithm by clustering the source-specific clusterings before finding the consensus between them, thereby improving our original results and increasing their biological relevance. We demonstrate our methodology on three yeast data sets, with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.
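
To make the majority rule baseline mentioned in the abstract more concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' heuristic, which this page does not describe. It assumes each clustering is a list of cluster labels over the same genes, merges any two genes that share a cluster in a strict majority of the inputs, and scores a consensus by counting gene pairs on which it disagrees with each input (the symmetric difference distance between partitions, chosen here as an illustrative measure of tightness). The function names `pair_disagreements` and `majority_rule_consensus` are hypothetical.

```python
from itertools import combinations


def pair_disagreements(p, q):
    """Count item pairs that are co-clustered in exactly one of p and q.

    p and q are label lists of equal length (assumed input format).
    """
    return sum(
        (p[i] == p[j]) != (q[i] == q[j])
        for i, j in combinations(range(len(p)), 2)
    )


def majority_rule_consensus(clusterings):
    """Merge items that share a cluster in a strict majority of clusterings.

    Merging is transitive (union-find), so the result is always a partition.
    Returns a label list with labels numbered in order of first appearance.
    """
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        votes = sum(c[i] == c[j] for c in clusterings)
        if 2 * votes > len(clusterings):
            parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy input (made up for illustration): three clusterings of five genes.
clusterings = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
consensus = majority_rule_consensus(clusterings)
total = sum(pair_disagreements(consensus, c) for c in clusterings)
print(consensus, total)
```

A lower total disagreement corresponds to a tighter consensus under this particular distance; other distances between partitions would give different scores.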

Author information

Authors and Affiliations

CS Dept., UC Davis, One Shields Avenue, Davis, CA, 95616, USA
Vladimir Filkov
CS Dept., SUNY at Stony Brook, Stony Brook, NY, 11794, USA
Steven Skiena

Editor information

Editors and Affiliations

Dept. of Computer Science, University of Leipzig,
Erhard Rahm

About this paper

Cite this paper

Filkov, V., Skiena, S. (2004). Heterogeneous Data Integration with the Consensus Clustering Formalism. In: Rahm, E. (ed.) Data Integration in the Life Sciences. DILS 2004. Lecture Notes in Computer Science, vol 2994. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24745-6_8

DOI: https://doi.org/10.1007/978-3-540-24745-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21300-0
Online ISBN: 978-3-540-24745-6
eBook Packages: Springer Book Archive
