Efficient Methods for Estimating Amino Acid Replacement Rates

Journal of Molecular Evolution

Abstract

Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied from an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from using a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies the resolvent method of Müller and Vingron. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.
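
As background for readers unfamiliar with rate matrices, the sketch below shows how a reversible replacement rate matrix Q can be turned into replacement probabilities P(t) = exp(Qt) under the standard continuous-time Markov model. It does not implement either of the paper's estimation methods or the resolvent method. The three-letter alphabet, the frequencies, the exchangeability values, and the function names are all assumptions made up for illustration; a real amino acid matrix would be 20 × 20.

```python
# Toy illustration only: a 3-letter alphabet stands in for the 20 amino acids,
# and every number below is invented for the example.

def build_rate_matrix(exch, freqs):
    """Reversible rate matrix: Q[i][j] = exch[i][j] * freqs[j] for i != j."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        # The diagonal is still 0.0 here, so this makes the row sum to zero.
        q[i][i] = -sum(q[i])
    return q


def mat_mul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series
    on a scaled-down matrix, followed by repeated squaring."""
    n = len(q)
    scale = t / 2 ** squarings
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term now holds step^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


freqs = [0.5, 0.3, 0.2]            # hypothetical equilibrium frequencies
exch = [[0.0, 1.0, 2.0],           # hypothetical symmetric exchangeabilities
        [1.0, 0.0, 0.5],
        [2.0, 0.5, 0.0]]

q = build_rate_matrix(exch, freqs)
p = transition_probabilities(q, t=0.5)
for row in p:
    print([round(x, 4) for x in row])
```

Each row of `p` is a probability distribution over the residue observed after time `t`, given the starting residue. Because the exchangeabilities are symmetric, the model is reversible: `freqs[i] * p[i][j]` equals `freqs[j] * p[j][i]` up to rounding.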

References

  • Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459–468

  • Agarwal P, States DJ (1996) A Bayesian evolutionary distance for parametrically aligned sequences. J Comput Biol 3:1–17

  • Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402

  • Arvestad L, Bruno WJ (1997) Estimation of reversible substitution matrices from multiple pairs of sequences. J Mol Evol 45:696–703

  • Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280

  • Bishop M, Friday A (1985) Evolutionary trees from nucleic acid and protein sequences. Proc R Soc Lond B 226:272–302

  • Cao Y, Adachi J, Janke A, Pääbo S, Hasegawa M (1994) Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene. J Mol Evol 39:519–527

  • Dayhoff MO, Eck RV, Park CM (1972) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, National Biomedical Research Foundation, Washington, D.C., vol 5, pp 89–99

  • Devauchelle C, Grossmann A, Hénaut A, Holschneider M, Monnerot M, Risler J, Torrésani B (2001) Rate matrices for analyzing large families of protein sequences. J Comput Biol 8:381–399

  • Eddy SR (2001) HMMER: Profile hidden Markov models for biological sequence analysis. Available at: http://hmmer.wustl.edu/

  • Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376

  • Gill PE, Murray W, Wright MH (1991) Numerical linear algebra and optimization. Addison-Wesley, Reading, MA

  • Goldman N, Whelan S (2002) A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol 19:1821–1831

  • Goldman N, Thorne JL, Jones DT (1996) Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J Mol Biol 263:196–208

  • Gonnet G, Hallett M (1997) The Darwin manual. Available at: http://www.inf.ethz.ch/personal/gonnet/

  • Gonnet GH, Cohen M, Benner S (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445

  • Grimmett G, Stirzaker D (1992) Probability and random processes, 2nd ed. Oxford Science, Oxford, p 247

  • Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919

  • Holmes I, Rubin GM (2002) An expectation maximization algorithm for training hidden substitution models. J Mol Biol 317:753–764

  • Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282

  • Kahaner D, Moler C, Nash S (1989) Numerical methods and software. Prentice-Hall, Upper Saddle River, NJ

  • Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge

  • Koshi JM, Goldstein RA (1995) Context-dependent optimal substitution matrices. Protein Eng 8:641–645

  • Koshi J, Goldstein R (1996) Probabilistic reconstruction of ancestral protein sequences. J Mol Evol 42:313–320

  • Lanave C, Preparata G, Saccone C, Serio G (1984) A new method for calculating evolutionary substitution rates. J Mol Evol 20:86–93

  • Liò P, Goldman N (2002) Modeling mitochondrial protein evolution using structural information. J Mol Evol 54:519–529

  • Liò P, Goldman N, Thorne JL, Jones DT (1998) PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14:726–733

  • Müller T, Vingron M (2000) Modeling amino acid replacement. J Comput Biol 7:761–776

  • Müller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13

  • Pearson W, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448

  • Pollock D, Taylor W, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287:187–198

  • Smith NG, Eyre-Walker A (2001) A test of amino acid reversibility. J Mol Evol 52:467–469

  • Teichmann S, Mitchison G (1999) Is there a phylogenetic signal in prokaryote proteins? J Mol Evol 49:98–107

  • Thorne JL (2000) Models of protein sequence evolution and their applications. Curr Opin Genet Dev 10:602–605

  • Veerassamy S, Smith A, Tillier ER (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10:997–1010

  • Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699

  • Whelan S, de Bakker PI, Goldman N (2003) Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556–1563

  • Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556

  • Yang Z, Kumar S, Nei M (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650

  • Yu Y, Wootton J, Altschul S (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 100:15688–15693

Acknowledgments

I am grateful for very helpful remarks from anonymous referees. I am also indebted to Bengt Sennblad, Isaac Elias, and Nick Braun for comments on this work, and I thank Ann-Charlotte Berglund for help with both the manuscript and the questions on numerical analysis.

Author information

Correspondence to Lars Arvestad.

Additional information

[Reviewing Editor: Dr. Nicolas Galtier]

About this article

Cite this article

Arvestad, L. Efficient Methods for Estimating Amino Acid Replacement Rates. J Mol Evol 62, 663–673 (2006). https://doi.org/10.1007/s00239-004-0113-9
