Abstract
Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied with an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adaption of a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies Müller-Vingron’s resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix.
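As a rough illustration of what a replacement rate matrix encodes, the sketch below builds a small reversible rate matrix Q and computes the replacement probabilities P(t) = exp(Qt) that it implies. It is not one of the estimation methods proposed in this paper. The three-state alphabet, the numerical values, and the function names `build_reversible_q` and `transition_probabilities` are assumptions made for the example; a real amino acid model has 20 states.

```python
import numpy as np


def build_reversible_q(exchange, pi):
    """Build a reversible rate matrix with q_ij = s_ij * pi_j for i != j.

    `exchange` is a symmetric matrix of exchangeabilities s_ij and `pi`
    holds the equilibrium frequencies (both hypothetical toy inputs).
    """
    exchange = np.asarray(exchange, dtype=float)
    pi = np.asarray(pi, dtype=float)
    q = exchange * pi[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    # Each row must sum to zero, so the diagonal is minus the off-diagonal sum.
    np.fill_diagonal(q, -q.sum(axis=1))
    # Scale so that one time unit equals one expected replacement per site.
    expected_rate = -np.dot(pi, np.diag(q))
    return q / expected_rate


def transition_probabilities(q, pi, t):
    """Return P(t) = exp(Q t) for a reversible Q via a symmetric eigenproblem."""
    sqrt_pi = np.sqrt(np.asarray(pi, dtype=float))
    # For reversible Q, D^(1/2) Q D^(-1/2) is symmetric, with D = diag(pi).
    sym = sqrt_pi[:, np.newaxis] * q / sqrt_pi[np.newaxis, :]
    sym = 0.5 * (sym + sym.T)  # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    exp_sym = (eigvecs * np.exp(eigvals * t)) @ eigvecs.T
    # Undo the similarity transform: P = D^(-1/2) exp(S t) D^(1/2).
    return exp_sym / sqrt_pi[:, np.newaxis] * sqrt_pi[np.newaxis, :]


# Toy three-state example (values chosen only for illustration).
pi = np.array([0.5, 0.3, 0.2])
exchange = np.array([[0.0, 1.0, 2.0],
                     [1.0, 0.0, 0.5],
                     [2.0, 0.5, 0.0]])
q = build_reversible_q(exchange, pi)
p = transition_probabilities(q, pi, t=0.3)
print(np.round(p, 4))
```

Each row of `p` is a probability distribution over the state observed after time `t`, and `pi` is left unchanged by `p`, which is what makes it the equilibrium distribution of the process.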
References
Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459–468
Agarwal P, States DJ (1996) A Bayesian evolutionary distance for parametrically aligned sequences. J Comput Biol 3:1–17
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Arvestad L, Bruno WJ (1997) Estimation of reversible substitution matrices from multiple pairs of sequences. J Mol Evol 45:696–703
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280
Bishop M, Friday A (1985) Evolutionary trees from nucleic acid and protein sequences. Proc R Soc Lond B 226:272–302
Cao Y, Adachi J, Janke A, Pääbo S, Hasegawa M (1994) Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene. J Mol Evol 39:519–527
Dayhoff MO, Eck RV, Park CM (1972) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, National Biomedical Research Foundation, Washington, D.C., vol 5, pp 89–99
Devauchelle C, Grossmann A, Hénaut A, Holschneider M, Monnerot M, Risler J, Torrésani B (2001) Rate matrices for analyzing large families of protein sequences. J Comput Biol 8:381–399
Eddy SR (2001) HMMER: Profile hidden Markov models for biological sequence analysis. Available at: http://hmmer.wustl.edu/
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Gill PE, Murray W, Wright MH (1991) Numerical linear algebra and optimization. Addison-Wesley, Reading, MA
Goldman N, Whelan S (2002) A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol 19:1821–1831
Goldman N, Thorne JL, Jones DT (1996) Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J Mol Biol 263:196–208
Gonnet G, Hallet M (1997) The Darwin manual. Available at: http://www.inf.ethz.ch/personal/gonnet/
Gonnet GH, Cohen M, Benner S (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
Grimmett G, Stirzaker D (1992) Probability and random processes, 2nd ed. Oxford Science, Oxford, p 247
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919
Holmes I, Rubin GM (2002) An expectation maximization algorithm for training hidden substitution models. J Mol Biol 317:753–764
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282
Kahaner D, Moler C, Nash S (1989) Numerical methods and software. Prentice-Hall, Upper Saddle River, NJ
Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge
Koshi JM, Goldstein RA (1995) Context-dependent optimal substitution matrices. Protein Eng 8:641–645
Koshi J, Goldstein R (1996) Probabilistic reconstruction of ancestral protein sequences. J Mol Evol 42:313–320
Lanave C, Preparata G, Saccone C, Serio G (1984) A new method for calculating evolutionary substitution rates. J Mol Evol 20:86–93
Liò P, Goldman N (2002) Modeling mitochondrial protein evolution using structural information. J Mol Evol 54:519–529
Liò P, Goldman N, Thorne JL, Jones DT (1998) PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14:726–733
Müller T, Vingron M (2000) Modeling amino acid replacement. J Comput Biol 7:761–776
Müller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13
Pearson W, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444–2448
Pollock D, Taylor W, Goldman N (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287:187–198
Smith NG, Eyre-Walker A (2001) A test of amino acid reversibility. J Mol Evol 52:467–469
Teichmann S, Mitchison G (1999) Is there a phylogenetic signal in prokaryote proteins? J Mol Evol 49:98–107
Thorne JL (2000) Models of protein sequence evolution and their applications. Curr Opin Genet Dev 10:602–605
Veerassamy S, Smith A, Tillier ER (2003) A transition probability model for amino acid substitutions from blocks. J Comput Biol 10:997–1010
Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699
Whelan S, de Bakker PI, Goldman N (2003) Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556–1563
Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556
Yang Z, Kumar S, Nei M (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–1650
Yu Y, Wootton J, Altschul S (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 100:15688–15693
Acknowledgments
I am grateful for very helpful remarks from anonymous referees. I am also indebted to Bengt Sennblad, Isaac Elias, and Nick Braun for comments on this work, and I thank Ann-Charlotte Berglund for help with both the manuscript and the questions on numerical analysis.
Additional information
[Reviewing Editor: Dr. Nicolas Galtier]
About this article
Cite this article
Arvestad, L. Efficient Methods for Estimating Amino Acid Replacement Rates. J Mol Evol 62, 663–673 (2006). https://doi.org/10.1007/s00239-004-0113-9
DOI: https://doi.org/10.1007/s00239-004-0113-9