Journal of Molecular Evolution, Volume 70, Issue 6, pp 605–612

Empirical Analysis of the Most Relevant Parameters of Codon Substitution Models

Abstract

Traditionally, codon models of evolution have been parametric, meaning that the 61 × 61 substitution rate matrix is derived from only a handful of parameters, typically the equilibrium frequencies, the ratio of nonsynonymous to synonymous substitution rates, and the ratio of transition to transversion rates. These parameters are reasonable choices, based on observations of which aspects of evolution often vary in coding DNA. However, the choices are relatively arbitrary, and no systematic empirical search has ever been performed to identify the best parameters for a codon model. Even in the empirical and semi-empirical models presented recently, only the average substitution rates were estimated from databases of real coding DNA, while the parameters used remained essentially the same as before. In this study we attempted to investigate empirically which parameters are most relevant for a codon model. By performing a principal component analysis (PCA) on 3666 substitution rate matrices estimated from single gene families, we determined the sets of most strongly co-varying substitution rates. Interestingly, the two most significant principal components (PCs) describe clearly identifiable parameters: the first PC separates synonymous from nonsynonymous substitutions, while the second PC distinguishes substitutions in which only one nucleotide changes from substitutions with two or three nucleotide changes. For the third and subsequent PCs no simple descriptions could be found.
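
The analysis described in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline or data: the rate vectors are synthetic, the single latent per-family factor, the noise level, and the function names `pair_features`, `pca`, and `synthetic_rates` are all assumptions made for illustration. The sketch labels every ordered pair of sense codons as synonymous or not and by the number of differing nucleotides, then runs a PCA (via SVD) on synthetic per-family rate vectors in which only the nonsynonymous entries co-vary.

```python
import itertools
import numpy as np

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa
        for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]  # the 61 sense codons


def pair_features():
    """Label every ordered pair of distinct sense codons.

    Returns two arrays of equal length: a boolean flag (True if the pair
    is synonymous) and the number of nucleotide positions that differ.
    """
    syn, ndiff = [], []
    for a, b in itertools.permutations(SENSE, 2):
        syn.append(CODE[a] == CODE[b])
        ndiff.append(sum(x != y for x, y in zip(a, b)))
    return np.array(syn), np.array(ndiff)


def pca(X):
    """PCA via SVD of the column-centred matrix X (one row per family).

    Returns the principal axes (rows of vt) and the fraction of variance
    explained by each component.
    """
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt, explained


def synthetic_rates(n_families=200, seed=0):
    """Synthetic (made-up) log-rate vectors, one row per gene family.

    Assumption for illustration only: a single hypothetical per-family
    factor shifts all nonsynonymous entries together, plus small noise.
    """
    rng = np.random.default_rng(seed)
    syn, _ = pair_features()
    factor = rng.normal(0.0, 1.0, size=(n_families, 1))
    noise = rng.normal(0.0, 0.1, size=(n_families, syn.size))
    return factor * (~syn).astype(float) + noise


if __name__ == "__main__":
    syn, ndiff = pair_features()
    vt, explained = pca(synthetic_rates())
    nonsyn = (~syn).astype(float)
    # Correlation between the first PC's loadings and the nonsynonymous flag.
    print(abs(np.corrcoef(vt[0], nonsyn)[0, 1]), explained[0])
```

Because the synthetic data contain exactly one co-varying group, the first principal axis lines up with the nonsynonymous indicator by construction; with real rate matrices the interpretation of each component has to be established by inspecting its loadings, as the abstract describes.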

Keywords

Codon models · Markov models · Coding sequence evolution · Codon substitutions

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Computer Science Department, ETH Zurich, Zürich, Switzerland
  2. Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
