Metric Labeling and Semi-metric Embedding for Protein Annotation Prediction

  • Emre Sefer
  • Carl Kingsford
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6577)

Abstract

Computational techniques have been successful at predicting protein function from relational data (functional or physical interactions). These techniques have been used to generate hypotheses and to direct experimental validation. With few exceptions, the task is modeled as a multi-label classification problem in which the labels (functions) are treated independently or semi-independently. However, databases such as the Gene Ontology provide information about the similarities between functions. We explore using the Metric Labeling combinatorial optimization problem to exploit heuristically computed distances between functions, with the aim of predicting protein function more accurately in networks derived both from physical interactions and from a combination of other data types. To do this, we give a new technique (based on convex optimization) for converting heuristic semimetric distances into a metric with minimum least-squared distortion (LSD). The Metric Labeling approach is shown to outperform five existing techniques for inferring function from networks. These results suggest that Metric Labeling is useful for protein function prediction, and that LSD minimization can help solve the problem of converting heuristic distances to a metric.
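
The two ingredients named in the abstract can be illustrated on a toy instance. The sketch below is not the authors' implementation: the function names, the three-protein network, the costs, and the distances are all hypothetical, and exhaustive search stands in for a real Metric Labeling solver. It assumes the standard form of the Metric Labeling objective (a per-node assignment cost plus edge weights times the distance between the labels at the endpoints). The "repaired" distances are worked out by hand for this one instance, where a single triangle inequality is violated and the least-squares fix is a closed-form projection; a general instance would need a convex (quadratic) program.

```python
from itertools import product

def labeling_cost(labeling, assign_cost, edges, dist):
    """Metric Labeling objective: node assignment costs plus
    edge weight times the distance between the endpoint labels."""
    node_term = sum(assign_cost[p][labeling[p]] for p in labeling)
    edge_term = sum(w * dist[labeling[p]][labeling[q]] for p, q, w in edges)
    return node_term + edge_term

def best_labeling(nodes, labels, assign_cost, edges, dist):
    """Exhaustive search over all labelings; only viable for toy inputs."""
    best = None
    for combo in product(labels, repeat=len(nodes)):
        labeling = dict(zip(nodes, combo))
        cost = labeling_cost(labeling, assign_cost, edges, dist)
        if best is None or cost < best[0]:
            best = (cost, labeling)
    return best

def triangle_violations(dist, tol=1e-12):
    """Ordered triples (a, b, c) with dist[a][c] > dist[a][b] + dist[b][c]."""
    ls = list(dist)
    return [(a, b, c) for a in ls for b in ls for c in ls
            if dist[a][c] > dist[a][b] + dist[b][c] + tol]

def squared_distortion(dist, repaired):
    """Sum of squared changes over unordered label pairs."""
    ls = sorted(dist)
    return sum((dist[a][b] - repaired[a][b]) ** 2
               for i, a in enumerate(ls) for b in ls[i + 1:])

def symmetric(d12, d23, d13):
    """Build a symmetric distance table over three hypothetical labels."""
    return {"f1": {"f1": 0.0, "f2": d12, "f3": d13},
            "f2": {"f1": d12, "f2": 0.0, "f3": d23},
            "f3": {"f1": d13, "f2": d23, "f3": 0.0}}

# Hypothetical heuristic distances: a semimetric, since 3 > 1 + 1.
heuristic = symmetric(1.0, 1.0, 3.0)
# Hand-computed least-squares repair for this instance only:
# project (1, 1, 3) onto the half-space d13 <= d12 + d23.
repaired = symmetric(4 / 3, 4 / 3, 8 / 3)

nodes = ["p1", "p2", "p3"]
labels = ["f1", "f2", "f3"]
edges = [("p1", "p2", 1.0), ("p2", "p3", 1.0)]  # (protein, protein, weight)
assign_cost = {"p1": {"f1": 0.0, "f2": 5.0, "f3": 5.0},  # p1 annotated f1
               "p2": {"f1": 0.2, "f2": 0.5, "f3": 0.0},  # p2 weakly prefers f3
               "p3": {"f1": 5.0, "f2": 0.0, "f3": 5.0}}  # p3 annotated f2

cost, labeling = best_labeling(nodes, labels, assign_cost, edges, repaired)
print(triangle_violations(heuristic))
print(squared_distortion(heuristic, repaired))
print(cost, labeling)
```

In this toy network the unannotated protein p2 is pulled toward a label close to those of its annotated neighbours, even though its own assignment cost slightly favours a distant label. That is the effect of the distance term in the objective.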

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Emre Sefer (1, 2)
  • Carl Kingsford (1, 2)

  1. Department of Computer Science, University of Maryland, College Park, USA
  2. Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, USA
