Efficient Methods for Estimating Amino Acid Replacement Rates

Arvestad, Lars

doi:10.1007/s00239-004-0113-9

Published: 28 April 2006

Volume 62, pages 663–673 (2006)
Journal of Molecular Evolution

Abstract

Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied with an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adaption of a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies Müller-Vingron’s resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.
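
As a minimal illustration of what a replacement rate matrix encodes, the sketch below assumes the standard continuous-time Markov chain formulation, in which a rate matrix Q (rows summing to zero) gives the replacement probabilities over an evolutionary distance t as P(t) = exp(Qt). It does not implement the estimation methods of the paper. The three-state matrix, its values, and the helper names `mat_mul` and `mat_exp` are hypothetical and chosen only for illustration; real amino acid models use 20 states.

```python
# Toy example: a 3-state rate matrix Q. All numbers are invented for
# illustration and are not taken from the paper.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(q * t) by scaling and squaring.

    The matrix q * t is divided by 2**squarings, exponentiated with a
    truncated Taylor series, and then squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # After this step, term holds a**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Off-diagonal entries are replacement rates; each diagonal entry is the
# negative sum of the other entries in its row.
Q = [
    [-0.3, 0.2, 0.1],
    [0.2, -0.5, 0.3],
    [0.1, 0.3, -0.4],
]

# Replacement probabilities over an evolutionary distance of 0.5.
P = mat_exp(Q, 0.5)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of P is a probability distribution over the states a residue may occupy after distance t. Estimating a rate matrix from alignments is the inverse problem: observed pairs give information about P(t) at various unknown t, from which Q must be recovered.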

References

Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459–468
Agarwal P, States DJ (1996) A Bayesian evolutionary distance for parametrically aligned sequences. J Comput Biol 3:1–17
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Arvestad L, Bruno WJ (1997) Estimation of reversible substitution matrices from multiple pairs of sequences. J Mol Evol 45:696–703
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280
Bishop M, Friday A (1985) Evolutionary trees from nucleic acid and protein sequences. Proc R Soc Lond B 226:272–302
Cao Y, Adachi J, Janke A, Pääbo S, Hasegawa M (1994) Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene. J Mol Evol 39:519–527
Dayhoff MO, Eck RV, Park CM (1972) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, National Biomedical Research Foundation, Washington, D.C., vol 5, pp 89–99
Devauchelle C, Grossmann A, Hénaut A, Holschneider M, Monnerot M, Risler J, Torrésani B (2001) Rate matrices for analyzing large families of protein sequences. J Comput Biol 8:381–399
Eddy SR (2001) HMMER: Profile hidden Markov models for biological sequence analysis. Available at: http://hmmer.wustl.edu/
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Gill PE, Murray W, Wright MH (1991) Numerical linear algebra and optimization. Addison-Wesley, Reading, MA
Goldman N, Whelan S (2002) A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol 19:1821–1831
Goldman N, Thorne JL, Jones DT (1996) Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J Mol Biol 263:196–208
Gonnet G, Hallett M (1997) The Darwin manual. Available at: http://www.inf.ethz.ch/personal/gonnet/
Gonnet GH, Cohen M, Benner S (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
Grimmett G, Stirzaker D (1992) Probability and random processes, 2nd ed. Oxford Science, Oxford, p 247
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919
Holmes I, Rubin GM (2002) An expectation maximization algorithm for training hidden substitution models. J Mol Biol 317:753–764
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282
Kahaner D, Moler C, Nash S (1989) Numerical methods and software. Prentice-Hall, Upper Saddle River, NJ
Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge
Koshi JM, Goldstein RA (1995) Context-dependent optimal substitution matrices. Protein Eng 8:641–645
Koshi J, Goldstein R (1996) Probabilistic reconstruction of ancestral protein sequences. J Mol Evol 42:313–320
Lanave C, Preparata G, Saccone C, Serio G (1984) A new method for calculating evolutionary substitution rates. J Mol Evol 20:86–93
Liò P, Goldman N (2002) Modeling mitochondrial protein evolution using structural information. J Mol Evol 54:519–529
Liò P, Goldman N, Thorne JL, Jones DT (1998) PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14:726–733
Müller T, Vingron M (2000) Modeling amino acid replacement. J Comput Biol 7:761–776
Müller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13
Pearson W, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448
Pollock D, Taylor W, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287:187–198
Smith NG, Eyre-Walker A (2001) A test of amino acid reversibility. J Mol Evol 52:467–469
Teichmann S, Mitchison G (1999) Is there a phylogenetic signal in prokaryote proteins? J Mol Evol 49:98–107
Thorne JL (2000) Models of protein sequence evolution and their applications. Curr Opin Genet Dev 10:602–605
Veerassamy S, Smith A, Tillier ER (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10:997–1010
Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699
Whelan S, de Bakker PI, Goldman N (2003) Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556–1563
Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556
Yang Z, Kumar S, Nei M (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650
Yu Y, Wootton J, Altschul S (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 100:15688–15693

Acknowledgments

I am grateful for very helpful remarks from anonymous referees. I am also indebted to Bengt Sennblad, Isaac Elias, and Nick Braun for comments on this work, and I thank Ann-Charlotte Berglund for help with both the manuscript and the questions on numerical analysis.

Author information

Authors and Affiliations

Stockholm Bioinformatics Center, Albanova University Center, Royal Institute of Technology (KTH), SE-100 44, Stockholm, Sweden
Lars Arvestad

Corresponding author

Correspondence to Lars Arvestad.

Additional information

[Reviewing Editor: Dr. Nicolas Galtier]

About this article

Cite this article

Arvestad, L. Efficient Methods for Estimating Amino Acid Replacement Rates. J Mol Evol 62, 663–673 (2006). https://doi.org/10.1007/s00239-004-0113-9

Received: 11 April 2004
Accepted: 17 January 2006
Published: 28 April 2006
Issue Date: June 2006
DOI: https://doi.org/10.1007/s00239-004-0113-9
