Inferring Orthology and Paralogy

  • Adrian M. Altenhoff
  • Christophe Dessimoz
Part of the Methods in Molecular Biology book series (MIMB, volume 855)

Abstract

The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplication, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differences among the various orthology inference methods and databases, and examine the difficult issue of verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.

Key words

Orthology; Paralogy; Tree reconciliation; Orthology benchmarking

Notes

Acknowledgments

We thank Stefan Zoller for helpful feedback on the manuscript. Part of this chapter started as an assignment for the graduate course “Reviews in Computational Biology” (263-5151-00L) at ETH Zurich.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Computer Science, ETH Zurich, Zurich, Switzerland
  2. Swiss Institute of Bioinformatics, Lausanne, Switzerland
