Comparative Genomics-Based Prediction of Protein Function

  • Toni Gabaldón
Part of the Methods in Molecular Biology™ book series (MIMB, volume 439)


The era of genomics has opened new possibilities for the computational prediction of protein function. In particular, comparing fully sequenced genomes makes it possible to investigate the so-called genomic context of a gene, which includes its chromosomal position relative to other genes as well as its evolutionary record among the genomes considered. This information can be exploited to find functionally interacting partners for a protein of unknown function and thus to infer the biological process in which it plays a role. Such comparative genomics-based techniques are increasingly used in genome annotation and in the development of testable working hypotheses.
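
As a rough illustration of how an "evolutionary record among the genomes considered" can point to functional partners, the sketch below compares presence/absence profiles of genes across genomes and reports pairs with nearly identical profiles. It is a minimal example under stated assumptions, not the chapter's protocol: the gene names, the six-genome profiles, the Hamming-distance measure, and the `max_distance` threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Hypothetical data: presence (1) or absence (0) of each gene in six genomes.
profiles = {
    "geneA": (1, 0, 1, 1, 0, 1),
    "geneB": (1, 0, 1, 1, 0, 1),
    "geneC": (0, 1, 1, 0, 1, 0),
    "geneD": (1, 0, 1, 0, 0, 1),
}


def hamming(p, q):
    """Count the genomes in which two presence/absence profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(p, q))


def candidate_partners(profiles, max_distance=1):
    """Return (gene1, gene2, distance) for pairs with similar profiles.

    max_distance is an illustrative cutoff, not a recommended value.
    """
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        d = hamming(p1, p2)
        if d <= max_distance:
            pairs.append((g1, g2, d))
    # Most similar pairs first; ties keep alphabetical order.
    return sorted(pairs, key=lambda t: t[2])


partners = candidate_partners(profiles)
for g1, g2, d in partners:
    print(f"{g1} - {g2}: profiles differ in {d} genome(s)")
```

Genes that are gained and lost together across many genomes are candidates for participating in the same process. In practice, the profiles would be built from orthology assignments rather than typed in by hand, and the similarity measure would need to account for the phylogenetic relatedness of the genomes.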


Genomic context, phylogenetic profile, orthology, gene fusion, gene neighborhood, gene order, coevolution



Toni Gabaldón is supported by a long-term fellowship from EMBO (LTF 402–2005). He thanks Martijn A. Huynen for introducing him to the field of computational protein function prediction.



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Toni Gabaldón, Bioinformatics Department, CIPF, Valencia, Spain
