Codon Deviation Coefficient: a novel measure for estimating codon usage bias and its statistical significance
Genetic mutation, selective pressure for translational efficiency and accuracy, level of gene expression, and protein function through natural selection are all believed to lead to codon usage bias (CUB). Therefore, informative measurement of CUB is of fundamental importance to making inferences regarding gene function and genome evolution. However, extant measures of CUB have not fully accounted for the quantitative effect of background nucleotide composition and have not statistically evaluated the significance of CUB in sequence analysis.
Here we propose a novel measure, the Codon Deviation Coefficient (CDC), that provides an informative measurement of CUB and its statistical significance without requiring any prior knowledge. Unlike previous measures, CDC estimates CUB by accounting for background nucleotide compositions tailored to codon positions and uses bootstrapping to assess the statistical significance of CUB for any given sequence. We evaluate CDC by examining its effectiveness on simulated sequences and empirical data and show that CDC outperforms extant measures by achieving a more informative estimation of CUB and its statistical significance.
As validated by both simulated and empirical data, CDC provides a highly informative quantification of CUB and its statistical significance, useful for determining comparative magnitudes and patterns of biased codon usage for genes or genomes with diverse sequence compositions.
Keywords: Codon deviation coefficient; CDC; Codon usage bias; CUB; Statistical significance; Background nucleotide composition; GC content; Purine content; Bootstrapping
CUB: Codon Usage Bias
CDC: Codon Deviation Coefficient
BNC: Background Nucleotide Composition
Positional Composition Set
Ai, Ti, Gi, Ci, Si, Ri: A, T, G, C, S, R at codon position i, respectively, where i = 1, 2, 3.
Codon usage bias or CUB, a phenomenon in which synonymous codons (those that encode the same amino acid) are used at different frequencies, is generally believed to be a combined outcome of mutation pressure, natural selection, and genetic drift [1, 2, 3, 4, 5]. Within any given species, genes often exhibit variable degrees of CUB. Moreover, CUB for an individual gene is closely related to gene expression through translational efficiency and/or accuracy [6, 7, 8, 9, 10]. Therefore, the ability to accurately quantify CUB for protein-coding sequences is of fundamental importance in revealing the underlying mechanisms behind codon usage and understanding gene evolution and function in general.
Over the past few years, a number of measures have been proposed for the quantification of CUB [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], leading to investigations of the patterns of CUB within and across species [24, 25, 26, 27, 28, 29, 30]. Since CUB is primarily shaped by selection and mutation, different measures are differentially informative with regard to distinguishing causes. For instance, there are purely descriptive measures of CUB as caused by the joint effects of mutation and selection, such as the Effective Number of Codons (Nc or ENC) and the Relative Synonymous Codon Usage. Other measures of CUB specifically reflect selection on codon usage associated with translation, such as the Codon Adaptation Index (CAI) and the Frequency of Optimal codons. In addition, a number of studies have attempted to estimate selection on codon usage based on population genetics [31, 32, 33, 34, 35].
These existing measures generally fall into two categories: they compare the observed codon usage distribution of a target coding sequence against either the distribution based on a reference set of highly expressed genes (e.g., CAI) or the distribution based on a null hypothesis of uniform usage of synonymous codons (e.g., Nc). The former measures are highly dependent on their corresponding reference sets (from which preferred codons are derived) and accordingly are limited by the comprehensiveness and accuracy of those reference sets. Since reference sets are species-specific, these measures are inappropriate for comparison of CUB across species. Additionally, they are unreliable in cases where there is inadequate knowledge about the highly expressed genes of a given species, such as newly sequenced species that have a limited number of annotated genes.
Due to these shortcomings, measures that do not require prior knowledge of reference gene sets have been implemented. These measures assume a null distribution of uniform usage of synonymous codons and estimate the departure of the observed codon usage from the expected. Among them, Nc is one of the most widely used measures. Its variant, Nc', incorporates the GC content of the coding sequence as background nucleotide composition (BNC) into CUB estimation. Accounting for BNC refines codon usage analysis, providing a comparable metric for analyses within and among species exhibiting various non-uniform BNCs. In the context of protein-coding sequences, for instance, bacteria have diverse BNCs, as their GC contents vary widely, from ~20% to ~80%. Even within a single species, genes often differ considerably in background GC content, as in the case of Escherichia coli str. K-12 substr. MG1655, whose genes have GC contents ranging from 26.9% (rfaS; length = 311 aa) to 66.8% (yagF; length = 655 aa). Therefore, it is crucial to measure the departure of codon usage from the corresponding background composition (instead of the presumed uniform codon usage). Due to its appropriate consideration of BNC, Nc' outperforms other relevant measures.
However, all extant measures (including Nc') still have limitations. First, they give a general estimate of CUB but lack straightforward procedures for assessing the statistical significance of the bias in codon usage for any given gene. Genes that vary in length and differ in CUB may exhibit different levels of statistical significance for their codon biases. Assessing statistical significance can considerably strengthen inferred functional relationships by discounting sampling error in correlated gene sets. Second, no previous measure is fully effective at incorporating BNC into CUB estimation. Although Nc' factors in GC content as BNC, it does not account for known variation in BNCs at the three codon positions. In bacteria, for instance, Bartonella quintana str. Toulouse and Clostridium thermocellum ATCC 27405 have very similar GC contents in coding sequences (40.5% and 40.4%, respectively), but their position-specific GC contents are quite different: 53.3% and 47.3% at the first codon position, 38.6% and 34.0% at the second codon position, and 29.5% and 39.9% at the third codon position, respectively. Likewise, genes within a given species can also have heterogeneous BNCs at the three codon positions; in E. coli, for example, there are two genes, emrE and hlyE, that are similar in their overall GC contents (41.5% and 41.1%) but different in positional GC contents: 42.7% and 48.2% at the first position, 46.4% and 32.0% at the second position, and 35.5% and 43.2% at the third position, respectively. Such differences in positional BNCs reflect the outcomes of diverse evolutionary mechanisms (e.g., dinucleotide bias, horizontal gene transfer, strand compositional asymmetry in bacteria, isochore structure in vertebrates, etc.), thus conflating the roles of mutation and selection acting at different codon positions.
Therefore, incorporation of differential positional BNCs into CUB estimation promises to increase its effectiveness and reliability.
Moreover, GC content is not the sole parameter of BNC. As illustrated in Zhang and Yu, joint use of GC and purine contents effectively models nucleotide, codon, and amino acid compositions. In contrast to the broad variation of GC content, purine content varies within a much narrower range around 50%, presumably because purines play a determinative role in the physicochemical properties of amino acids [44, 45]. Like GC content, purine content differs not only from one species to another, but also from one gene to another, and even between genes with similar GC contents. For instance, emrE and hlyE in E. coli, which are similar in their overall GC contents, have entirely different purine contents not only at the overall level (45.8% and 55.6%, respectively), but also at the three codon positions (54.5% and 68.3% at the first position, 34.5% and 48.2% at the second position, and 48.2% and 50.2% at the third position, respectively). Thus, in addition to GC content, purine content is also a significant feature of BNC.
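To make the positional quantities concrete, the following minimal Python sketch computes GC content and purine content at each codon position of a coding sequence. The function name `positional_composition` is hypothetical and is not part of the CAT package; the sketch assumes an in-frame sequence and simply skips ambiguous bases.

```python
def positional_composition(cds):
    """Return {codon position: (GC content, purine content)} for positions 1-3.

    Illustrative helper only; assumes `cds` starts in frame.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    result = {}
    for pos in range(3):
        # Every third base, starting at this codon position; skip N and other codes.
        bases = [b for b in cds[pos:usable:3] if b in "ACGT"]
        if not bases:
            raise ValueError("no unambiguous bases at codon position %d" % (pos + 1))
        gc = sum(b in "GC" for b in bases) / len(bases)      # S = G or C
        purine = sum(b in "AG" for b in bases) / len(bases)  # R = A or G
        result[pos + 1] = (gc, purine)
    return result


if __name__ == "__main__":
    for position, (gc, purine) in positional_composition("ATGGCCAAA").items():
        print(position, round(gc, 3), round(purine, 3))
```

Two genes with the same overall GC content can return quite different tuples here, which is the situation described for emrE and hlyE.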
Here we present a novel measure, the Codon Deviation Coefficient (CDC), and use it to characterize CUB and to ascertain its statistical significance. CDC takes account of both GC and purine contents, comprehensively addressing heterogeneous BNCs, not only in sequences but also at the three codon positions. It adopts the cosine distance metric to quantify CUB and employs bootstrapping to assess its statistical significance, requiring no prior knowledge of reference gene sets. We describe CDC in detail and provide comparative results in the form of an in-depth evaluation of simulated sequences and empirical data.
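The cosine-distance idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the CAT implementation: expected codon usage is taken to be the product of positional nucleotide frequencies, renormalized over the 61 sense codons of the standard genetic code, and the nucleotide frequencies at each position are derived from GC content S and purine content R by treating the two as independent. All function names are hypothetical.

```python
from itertools import product
from math import sqrt

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]


def base_freqs(s, r):
    """Nucleotide frequencies from GC content s and purine content r.

    Assumes GC status and purine status are independent.
    """
    return {"A": (1 - s) * r, "T": (1 - s) * (1 - r),
            "G": s * r, "C": s * (1 - r)}


def expected_usage(bnc):
    """bnc = [(S1, R1), (S2, R2), (S3, R3)] -> expected sense-codon frequencies."""
    f = [base_freqs(s, r) for s, r in bnc]
    raw = {c: f[0][c[0]] * f[1][c[1]] * f[2][c[2]] for c in SENSE_CODONS}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def observed_usage(cds):
    """Observed sense-codon frequencies of an in-frame coding sequence."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon in counts:  # skips stop codons and ambiguous codons
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no sense codons")
    return {c: n / total for c, n in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two codon usage vectors."""
    dot = sum(p[c] * q[c] for c in SENSE_CODONS)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm
```

Because both vectors are non-negative, the distance lies between 0 (observed usage matches the background expectation) and 1 (maximal departure from it).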
Expected codon usage
Codon usage bias
Statistical significance of codon usage bias
We implement bootstrap resampling with N = 10000 replicates for any given sequence to evaluate the statistical significance of non-uniform codon usage. Each replicate is randomly generated according to the sequence BNC (Si and Ri, i = 1, 2, 3) and the sequence length. Consequently, we obtain a bootstrap distribution of N estimates of CUB. A two-sided bootstrap P-value is calculated as twice the smaller of the two one-sided P-values. P ranges from 0 to 1. By convention, a statistically significant CUB is identified by P < 0.05. CDC is the first measure to apply bootstrap resampling to estimating the statistical significance of CUB. Bootstrapping may also be applicable to other related measures.
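The resampling procedure described above might look like the Python sketch below. It assumes replicates are drawn position by position from nucleotide frequencies derived from Si and Ri (GC and purine status treated as independent), does not filter in-frame stop codons, and accepts any CUB statistic as a callable. The names `random_sequence` and `bootstrap_p_value` are hypothetical and do not come from the CAT package.

```python
import random


def base_weights(s, r):
    """Weights for A, C, G, T from GC content s and purine content r (independence assumed)."""
    return [(1 - s) * r, s * (1 - r), s * r, (1 - s) * (1 - r)]


def random_sequence(bnc, n_codons, rng):
    """Draw n_codons codons whose three positions follow bnc = [(S1, R1), (S2, R2), (S3, R3)].

    Simplification: in-frame stop codons are not removed.
    """
    columns = [rng.choices("ACGT", weights=base_weights(s, r), k=n_codons)
               for s, r in bnc]
    return "".join(a + b + c for a, b, c in zip(*columns))


def bootstrap_p_value(observed, statistic, bnc, n_codons,
                      n_replicates=10000, seed=0):
    """Two-sided bootstrap P-value: twice the smaller one-sided P-value, capped at 1."""
    rng = random.Random(seed)
    reps = [statistic(random_sequence(bnc, n_codons, rng))
            for _ in range(n_replicates)]
    upper = sum(x >= observed for x in reps) / n_replicates  # P(replicate >= observed)
    lower = sum(x <= observed for x in reps) / n_replicates  # P(replicate <= observed)
    return min(1.0, 2 * min(upper, lower))
```

Here `statistic` stands in for whatever CUB estimate is being tested, and `observed` is that statistic computed on the real sequence. Fixing `seed` makes the P-value reproducible across runs.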
Implementation and availability
CDC is written in standard C++ and implemented in the Composition Analysis Toolkit (CAT), which is distributed as open-source software and licensed under the GNU General Public License. The software package, including compiled executables for Linux/Mac/Windows, example data, documentation, and source code, is freely available at http://cbb.big.ac.cn/software and http://cbrc.kaust.edu.sa/CAT.
Results and discussion
Comparative analysis on simulated data
Background nucleotide compositions at three codon positions specified in simulations
Codon usage bias across a variety of positional background compositions for GC and purine contents
Codon usage bias across all possible quantitative relationships among positional GC contents (panels: purine content = 0.3, 0.5, 0.7)
Codon usage bias across all possible quantitative relationships among positional purine contents (panels: GC content = 0.3, 0.5, 0.7)
Differences between estimated and expected codon usage biases, reported as (Estimated CUB) - (Expected CUB)
Application to empirical data
Correlation coefficients of codon usage bias with gene expression level (data sets: LB, n = 1762; M9, n = 2766; and four further data sets with n = 5142, 1651, 12184, and 1332)
On the whole, CDC outperformed scaled Nc' and scaled Nc in correlating closely with gene expression level. Although CDC and scaled Nc' produced comparable correlation coefficients in yeast (detailed below), CDC exhibited larger correlation coefficients than scaled Nc' and scaled Nc in all remaining cases (Table 6). When comparing CDC to CAI, we found comparable correlation coefficients in E. coli (LB medium) and yeast, but in general CDC performed better than CAI (Table 6 and Additional file 1). However, it should be noted that the values of CAI are calculated from expression data (since it requires a reference set of highly expressed genes), whereas those of CDC are not. When we restricted the above analysis to the top 10% of genes ranked by expression level, CDC continued to perform better than scaled Nc', scaled Nc, and CAI (Additional file 1). In addition, considering the correlation coefficients among these five species, we found that the smallest values always belonged to A. thaliana (regardless of the metric used), indicating relatively weaker selection on A. thaliana codon usage compared with the other four species (Table 6). This phenomenon was reported previously in a comparative analysis between A. thaliana and Oryza sativa. Overall, CDC correlated positively with gene expression level and, in most cases, more strongly than scaled Nc', scaled Nc, and CAI.
Ribosomal protein (RP) genes are, in general, both essential and highly expressed, and it is believed that their CUB values are greater than those of other genes. In the case of E. coli, CDC values for 54 RP genes vary from 0.244 to 0.481, larger than the mean and median values of all E. coli genes (Figure 4). Nearly all RP genes have statistically significant CUBs, with three exceptions (Additional file 3): (1) rpmE: CDC = 0.267, P = 0.1136; encoding RP L31, which may be loosely associated with the ribosome, (2) rpmF: CDC = 0.329, P = 0.1096; encoding RP L32, which is located near the peptidyltransferase center, and (3) rpmJ: CDC = 0.422, P = 0.0564; encoding RP L36, which is non-essential for protein synthesis. These results suggest that an accurate measure such as CDC has the potential to illuminate the evolutionary process that has operated on each gene.
In summary, we have described a novel measure of CUB, the Codon Deviation Coefficient. As validated by simulated sequences and empirical data, CDC outperforms other measures by providing informative estimates of CUB and its statistical significance. CDC requires no prior knowledge regarding gene expression or function, properly accounts for BNC, and uses a bootstrap assessment to evaluate the statistical significance of CUB. Therefore, CDC promises a significant advance in raw analysis of codon usage, providing the means to better reveal aspects of the historical evolutionary pressures on gene function without the assumptions of underlying reference data sets.
We thank anonymous reviewers for their critical comments and constructive suggestions on this manuscript. We also thank Joe Yu for helpful comments on this work and George Marselis for providing assistance on software hosting. This work was supported by King Abdullah University of Science and Technology (KAUST), Kingdom of Saudi Arabia, and the National Science and Technology Key Project (2008ZX1004-013), 863 Program (2009AA01A130), and 973 Program (2011CB944100) from the Ministry of Science and Technology, the People's Republic of China.
- 15. Ikemura T: Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 1981, 151(3):389–409. 10.1016/0022-2836(81)90003-6
- 23. Angellotti MC, Bhuiyan SB, Chen G, Wan XF: CodonO: codon usage bias analysis within and across genomes. Nucleic Acids Res 2007, (35 Web Server):W132–136.
- 46. Baeza-Yates R, Ribeiro-Neto B: Modern information retrieval. New York: ACM Press; 1999.
- 52. Bernstein JA, Khodursky AB, Lin PH, Lin-Chao S, Cohen SN: Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc Natl Acad Sci USA 2002, 99(15):9697–9702. 10.1073/pnas.112318199
- 56. Wuest SE, Vijverberg K, Schmidt A, Weiss M, Gheyselinck J, Lohr M, Wellmer F, Rahnenfuhrer J, von Mering C, Grossniklaus U: Arabidopsis female gametophyte gene expression map reveals similarities between plant and animal gametes. Curr Biol 2010, 20(6):506–512. 10.1016/j.cub.2010.01.051
- 58. Zhang Z, Yu J: On the organizational dynamics of the genetic code. Genomics Proteomics Bioinformatics 2010, in press.
- 64. Ikegami A, Nishiyama K, Matsuyama S, Tokuda H: Disruption of rpmJ encoding ribosomal protein L36 decreases the expression of secY upstream of the spc operon and inhibits protein translocation in Escherichia coli. Biosci Biotechnol Biochem 2005, 69(8):1595–1602. 10.1271/bbb.69.1595
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.