Abstract
Sequence divergence among orthologous proteins was characterized with 34 amino acid replacement matrices, sequence context analysis, and a phylogenetic tree. The resulting model was trained on very large datasets of aligned protein sequences drawn from 15 organisms, including protists, plants, Dictyostelium, fungi, and animals. It was compared with the models currently used in phylogeny, JTT+Γ±F and WAG+Γ±F, on a test dataset of 380 multiple alignments containing protein sequences from all five of these major taxonomic groups. The comparison indicates that our model should be preferred over JTT+Γ±F and WAG+Γ±F on datasets similar to the test dataset. Its strong performance can be attributed to its ability to match amino acid equilibrium frequencies more closely to the compositions found in alignment columns.
Acknowledgments
We thank T. Hwa for guidance and encouragement and R. Konecny of the W.M. Keck Laboratory for Integrated Biology II for helping to keep the processors running. R.O. was supported in part by a grant from the NIH to W.F.L. (GM-062350), grants from the NSF to T.H. (DMR-9971456, DMR-0211308, MCB-0083704), and the Center for Theoretical Biological Physics.
[Reviewing Editor: Dr. Martin Kreitman]
Cite this article
Olsen, R., Loomis, W.F. A Collection of Amino Acid Replacement Matrices Derived from Clusters of Orthologs. J Mol Evol 61, 659–665 (2005). https://doi.org/10.1007/s00239-005-0060-0
DOI: https://doi.org/10.1007/s00239-005-0060-0