Heterogeneous Data Integration with the Consensus Clustering Formalism

  • Conference paper
Data Integration in the Life Sciences (DILS 2004)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 2994)

Abstract

Meaningfully integrating massive multi-experimental genomic data sets is becoming critical for the understanding of gene function. We have recently proposed methodologies for integrating large numbers of microarray data sets based on consensus clustering. Our methods combine gene clusters into a unified representation, or a consensus, that is insensitive to mis-classifications in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their refinement and improvement. First of all we compare our best heuristic to the popular majority rule consensus clustering heuristic, and show that the former yields tighter consensuses. We propose a refinement to our consensus algorithm by clustering of the source-specific clusterings as a step before finding the consensus between them, thereby improving our original results and increasing their biological relevance. We demonstrate our methodology on three data sets of yeast with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.
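As a rough illustration of the majority rule baseline named in the abstract, the sketch below places two genes in the same consensus cluster when more than half of the input clusterings co-cluster them. It is not the authors' implementation. The function name `majority_rule_consensus`, the dict-based input format, the merging of majority-linked pairs by transitive closure, and the choice to count votes only over clusterings that contain both genes (one possible way to tolerate missing values) are all assumptions made for this example.

```python
from itertools import combinations


def majority_rule_consensus(clusterings):
    """Combine several clusterings into one partition by majority vote.

    clusterings: list of dicts mapping item -> cluster label (hypothetical format).
    Returns a list of sets, one per consensus cluster.
    """
    items = sorted(set().union(*clusterings))
    parent = {x: x for x in items}  # union-find forest over the items

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        # Assumption: a clustering votes on a pair only if it contains both items.
        votes = [c[a] == c[b] for c in clusterings if a in c and b in c]
        if votes and 2 * sum(votes) > len(votes):
            parent[find(a)] = find(b)  # link pairs that a strict majority co-clusters

    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    sources = [
        {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
        {"g1": 0, "g2": 0, "g3": 0, "g4": 1},
        {"g1": 1, "g2": 1, "g3": 2, "g4": 2},
    ]
    print(majority_rule_consensus(sources))
```

Cluster labels are compared only within a single clustering, so the sources do not need to share a labeling scheme. Taking the transitive closure can chain loosely related genes into one cluster, which is one reason a majority rule consensus may be looser than one obtained by optimizing a distance between partitions.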

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Filkov, V., Skiena, S. (2004). Heterogeneous Data Integration with the Consensus Clustering Formalism. In: Rahm, E. (ed.) Data Integration in the Life Sciences. DILS 2004. Lecture Notes in Computer Science, vol 2994. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24745-6_8

  • DOI: https://doi.org/10.1007/978-3-540-24745-6_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21300-0

  • Online ISBN: 978-3-540-24745-6

  • eBook Packages: Springer Book Archive
