Homology and phylogeny and their automated inference

  • Review
Naturwissenschaften

Abstract

The analysis of the ever-increasing amount of biological and biomedical data can be pushed forward by comparing the data within and among species. For example, an integrative analysis of data from the genome sequencing projects for various species traces the evolution of the genomes and identifies conserved and innovative parts. Here, I review the foundations and advantages of this “historical” approach and evaluate recent attempts at automating such analyses. Biological data is comparable if a common origin exists (homology), as is the case for members of a gene family originating via duplication of an ancestral gene. If the family has relatives in other species, we can assume that the ancestral gene was present in the ancestral species from which all the other species evolved. In particular, describing the relationships among the duplicated biological sequences found in the various species is often possible by a phylogeny, which is more informative than homology statements. Detecting and elaborating on common origins may answer how certain biological sequences developed, and predict what sequences are in a particular species and what their function is. Such knowledge transfer from sequences in one species to the homologous sequences of the other is based on the principle of ‘my closest relative looks and behaves like I do’, often referred to as ‘guilt by association’. To enable knowledge transfer on a large scale, several automated ‘phylogenomics pipelines’ have been developed in recent years, and seven of these will be described and compared. Overall, the examples in this review demonstrate that homology and phylogeny analyses, done on a large (and automated) scale, can give insights into function in biology and biomedicine.
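The "my closest relative looks and behaves like I do" principle can be illustrated with a minimal sketch. It assumes a toy gene tree with branch lengths and a handful of annotated sequences; all gene names, branch lengths, and functions below are invented for illustration, and the code is not taken from any of the pipelines reviewed in the article. It predicts the function of an unannotated sequence from the annotated leaf at the smallest tree-path (patristic) distance.

```python
# Hypothetical toy gene tree: child -> (parent, branch length).
# Names, lengths, and annotations are invented for illustration only.
TREE = {
    "geneA_human": ("n1", 0.10),
    "geneA_mouse": ("n1", 0.12),
    "geneB_human": ("n2", 0.30),
    "query_fly": ("n2", 0.25),
    "n1": ("root", 0.40),
    "n2": ("root", 0.05),
}

# Assumed prior knowledge: functions of already characterised sequences.
KNOWN_FUNCTION = {
    "geneA_human": "kinase",
    "geneA_mouse": "kinase",
    "geneB_human": "transporter",
}


def path_to_root(node):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    dist = 0.0
    path = {node: 0.0}
    while node in TREE:
        parent, length = TREE[node]
        dist += length
        path[parent] = dist
        node = parent
    return path


def patristic_distance(a, b):
    """Sum of branch lengths along the tree path between nodes a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # With positive branch lengths, the lowest common ancestor
    # gives the smallest summed distance among all shared ancestors.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def transfer_function(query):
    """Return (closest annotated leaf, its function) for the query sequence."""
    nearest = min(KNOWN_FUNCTION, key=lambda leaf: patristic_distance(query, leaf))
    return nearest, KNOWN_FUNCTION[nearest]


if __name__ == "__main__":
    print(transfer_function("query_fly"))
```

In this toy tree the query shares its most recent ancestor with `geneB_human`, so that annotation is transferred even though other annotated sequences exist. A real analysis would first have to infer the tree from aligned sequences and would typically also weigh the reliability of each branch.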

Acknowledgement

The author wishes to thank the following people for their feedback on (parts of) the manuscript: Etienne Danchin, Tancred Frickey, Iddo Friedberg, Philippe Gouret, Jake Gunn-Glanville, Claus Kerkhoff, Stefan Lorkowski, Frederic Plewniak, Michael Rebhan, Kimmen Sjölander, Michael Spitzer, and Dion Whitehead.

Author information

Corresponding author

Correspondence to Georg Fuellen.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

ESM 1 (DOC 41 KB)

About this article

Cite this article

Fuellen, G. Homology and phylogeny and their automated inference. Naturwissenschaften 95, 469–481 (2008). https://doi.org/10.1007/s00114-008-0348-1

  • DOI: https://doi.org/10.1007/s00114-008-0348-1
