Empirical Analysis of the Most Relevant Parameters of Codon Substitution Models
- 130 Downloads
Traditionally, codon models of evolution have been parametric, meaning that the 61 × 61 substitution rate matrix was derived from only a handful of parameters, typically the equilibrium frequencies, the ratio of nonsynonymous to synonymous substitution rates and the ratio between transition and transversion rates. These parameters are reasonable choices and are based on observations of what aspects of evolution often vary in coding DNA. However, the choices are relatively arbitrary and no systematic empirical search has ever been performed to identify the best parameters for a codon model. Even for the empirical or semi-empirical models that have been presented recently, only the average substitution rates have been estimated from databases of real coding DNA, but the parameters used were essentially the same as before. In this study we attempted to investigate empirically what the most relevant parameters for a codon model are. By performing a principal component analysis (PCA) on 3666 substitution rate matrices estimated from single gene families, the sets of the most co-varying substitution rates were determined. Interestingly, the two most significant principal components (PCs) describe clearly identifiable parameters: the first PC separates synonymous and nonsynonymous substitutions while the second PC distinguishes between substitutions where only one nucleotide changes and substitutions with two or three nucleotide changes. For the third and subsequent PCs no simple descriptions could be found.
KeywordsCodon models Markov models Coding sequence evolution Codon substitutions
We would like to thank Gina Cannarozzi and Gaston Gonnet for helpful discussions as well as the two reviewers for their valuable comments. Adrian Schneider is supported by a grant from the Swiss National Science Foundation (SNF).
- Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G (2005) OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. In: McLysath A, Huson DH (eds) RECOMB 2005 Workshop on Comparative Genomics. Lecture Notes in Bioinformatics, volume 3678. Springer-Verlag, Berlin. pp 61–72Google Scholar
- Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2(6):559–572Google Scholar
- R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, AustriaGoogle Scholar