Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2009)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Abstract

Hierarchical clustering is a popular method for grouping together similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to enter the number of clusters desired. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
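
For readers unfamiliar with the metric, the variation of information between two clusterings C and C' of the same elements is VI(C, C') = H(C | C') + H(C' | C), the sum of the two conditional entropies. The Python sketch below computes it from two label lists. It only illustrates the metric, not the authors' tree-decomposition procedure; the function name `variation_of_information` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A), in nats.

    labels_a and labels_b are hypothetical inputs: two sequences that give,
    for the same ordered elements, the cluster label under each clustering.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_a)
    if n == 0:
        return 0.0

    size_a = Counter(labels_a)               # cluster sizes in A
    size_b = Counter(labels_b)               # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b)) # sizes of pairwise overlaps

    vi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * [log(p(a)/p(a,b)) + log(p(b)/p(a,b))]
        vi += (n_ab / n) * (log(size_a[a] / n_ab) + log(size_b[b] / n_ab))
    return vi


if __name__ == "__main__":
    # Toy comparison: a candidate clustering against known annotations.
    annotations = ["x", "x", "y", "y"]
    candidate = [1, 1, 1, 2]
    print(variation_of_information(annotations, candidate))
```

A value of zero means the two clusterings are identical up to relabeling, and lower values mean closer agreement, which is why the metric can serve as the objective when matching clusters to known annotations.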

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Navlakha, S., White, J., Nagarajan, N., Pop, M., Kingsford, C. (2009). Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information. In: Batzoglou, S. (ed.) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_29

  • DOI: https://doi.org/10.1007/978-3-642-02008-7_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02007-0

  • Online ISBN: 978-3-642-02008-7

  • eBook Packages: Computer Science, Computer Science (R0)