Abstract
Meaningful integration of massive multi-experiment genomic data sets is becoming critical for understanding gene function. We recently proposed methodologies for integrating large numbers of microarray data sets based on consensus clustering. Our methods combine gene clusters into a unified representation, or consensus, that is insensitive to misclassifications in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their refinement and improvement. First, we compare our best heuristic to the popular majority rule consensus clustering heuristic and show that the former yields tighter consensuses. Second, we refine our consensus algorithm by clustering the source-specific clusterings before finding the consensus among them, which improves our original results and increases their biological relevance. We demonstrate our methodology on three yeast data sets, with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.
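To make the terms in the abstract concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes each clustering is a list of cluster labels, one per gene; it measures disagreement between two clusterings by counting gene pairs that are co-clustered in exactly one of them; and it reads "majority rule" as joining every pair that is co-clustered in more than half of the inputs, then taking the transitive closure. The function names `pair_distance` and `majority_rule_consensus` are hypothetical, and the abstract does not specify either definition.

```python
from itertools import combinations


def pair_distance(p, q):
    """Count gene pairs that are co-clustered in exactly one of p and q.

    p and q are label lists of equal length (one label per gene).
    This pair-counting distance is an assumption made for illustration.
    """
    return sum(
        (p[i] == p[j]) != (q[i] == q[j])
        for i, j in combinations(range(len(p)), 2)
    )


def majority_rule_consensus(partitions):
    """Join genes co-clustered in a strict majority of the input partitions.

    Majority pairs are merged with a union-find structure, so the result is
    the transitive closure of the majority relation (one possible reading of
    the majority rule heuristic, assumed here).
    """
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        votes = sum(p[i] == p[j] for p in partitions)
        if 2 * votes > len(partitions):
            parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


def total_distance(consensus, partitions):
    """Sum of pair distances from a candidate consensus to every input."""
    return sum(pair_distance(consensus, p) for p in partitions)
```

Under these assumptions, a "tighter" consensus is one with a smaller `total_distance` to the input clusterings, so two heuristics can be compared by evaluating that sum on the consensus each one returns.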
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Filkov, V., Skiena, S. (2004). Heterogeneous Data Integration with the Consensus Clustering Formalism. In: Rahm, E. (ed.) Data Integration in the Life Sciences. DILS 2004. Lecture Notes in Computer Science, vol 2994. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24745-6_8
DOI: https://doi.org/10.1007/978-3-540-24745-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21300-0
Online ISBN: 978-3-540-24745-6
eBook Packages: Springer Book Archive