The factors dictating the codon usage variation among the genes in the genome of Burkholderia pseudomallei
- Cite this article as:
- Zhao, S., Zhang, Q., Chen, Z. et al. World J Microbiol Biotechnol (2008) 24: 1585. doi:10.1007/s11274-007-9652-8
Burkholderia pseudomallei is a recognized biothreat agent and the causative agent of melioidosis. Codon usage biases of all protein-coding genes (length greater than or equal to 300 bp) from the complete genome of B. pseudomallei K96243 have been analyzed. As B. pseudomallei is a GC-rich organism (68.5%), overall codon usage data analysis indicates that codons ending in G and/or C are indeed predominant in this organism. Multivariate statistical analysis, however, indicates that there is a single major trend in the codon usage variation among the genes in this organism, which has a strong positive correlation with the expressivities of the genes. The majority of the lowly expressed genes are scattered towards the negative end of the major axis, whereas the highly expressed genes are clustered towards the positive end. At the same time, from the two significant correlations between axis 1 coordinates and the GC and GC3s content of each sequence, and the clearly significant negative correlations between the ‘Effective Number of Codons’ values and GC and GC3s content, we inferred that codon usage bias is also affected by gene nucleotide composition. In addition, other factors such as the lengths of the genes and the hydrophobicity of the encoded proteins also influence the codon usage variation among the genes in this organism in a minor way. Notably, 21 codons have been defined as ‘optimal codons’ of B. pseudomallei. In summary, our work has provided a basic understanding of the mechanisms of codon usage bias and some useful information for improving the expression of target genes in vivo and in vitro.
Keywords: Burkholderia pseudomallei K96243; Codon usage; Correspondence analysis
Abbreviations
- FMDV: Foot-and-mouth disease virus
- RSCU: Relative synonymous codon usage
- ENC: Effective number of codons
- GC3s: The frequency of G+C at the synonymous third position of sense codons
- A3S, T3S, G3S and C3S: The adenine, thymine, guanine and cytosine content at synonymous third positions
- ORF: Open reading frame
The inter- and intra-genomic variation of the pattern of codon usage is a widespread phenomenon. This variation has been attributed to two main factors: natural selection acting on silent sites to increase the rate and/or the accuracy of translation, and mutational biases (Ikemura 1985). In unicellular organisms, such as Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, and Dictyostelium discoideum, codon usage is attributable to the equilibrium between natural selection and compositional mutation bias (Bulmer 1988; Sharp et al. 1993). However, in some prokaryotes with extremely high A+T or G+C contents (Sharp et al. 1993) and in humans (Karlin and Mrazek 1996), mutation bias is the major factor accounting for the variation in codon usage. More interestingly, a rather complex pattern was reported for Chlamydia trachomatis, where codon choices were the result of strand-specific mutational biases, natural selection acting at the level of translation, the hydropathy level of each protein, and amino acid conservation (Romero et al. 2000). Previous codon usage analyses have shown that codon usage bias is very complicated and is associated with various biological factors, such as gene expression level (Nakamura and Sugiura 2007; Sharp et al. 1993), gene length (Eyre-Walker 1996; Liu et al. 2004), gene translation initiation signal (Ma et al. 2002), protein amino acid composition (Noboru 1999), protein structure (Plotkin et al. 2006), tRNA abundance (Kanaya et al. 1999; Noguchi and Satow 2006), mutation frequency and patterns (Noboru 1999; Sau et al. 2005), and GC composition (Sueoka and Kawanishi 2000; Wan et al. 2004). Knowledge of codon usage patterns can provide a basis for understanding the mechanisms behind the biased usage of synonymous codons and for selecting appropriate host expression systems to improve the expression of target genes in vivo and in vitro.
Melioidosis is an emerging infectious disease of animals and humans caused by the Gram-negative bacterium Burkholderia pseudomallei, an environmental saprophyte present in wet soil and rice paddies in endemic areas. B. pseudomallei has come under renewed scientific investigation as a result of recent concerns about its potential future use as a biological weapon, and there is currently no vaccine available for it. The majority of infections are reported from East Asia and northern Australia, the highest documented rate being in northeastern Thailand, where melioidosis accounts for 20% of all septicaemias. Infection is acquired through skin abrasions or inhalation of contaminated soil or surface water. Clinical disease presents along a spectrum of severity ranging from acute fulminating sepsis, which carries high mortality rates, to chronic persistent infection that is difficult to resolve with current antibiotic therapies. Death usually occurs within the first 48 h as a result of septic shock, even in a setting where optimal antimicrobial chemotherapy is given. Of equal concern, there is evidence that the bacterium does not cause overt disease in all exposed individuals but is able to persist at unknown sites in the body and become reactivated later in life. The potential for the bacterium to cause disease after inhalation has also resulted in the inclusion of this pathogen on the Centers for Disease Control list of potential biothreat agents as a Category B agent.
Codon usage in B. pseudomallei has not been investigated in any detail, and it is not clear how (or even if) codon usage should vary among different genes. The complete sequence of the B. pseudomallei K96243 genome has now been determined (Holden et al. 2004). It is therefore of interest to characterize the codon usage pattern in this species. In this paper, we report an analysis of codon usage bias in the B. pseudomallei genome using multivariate statistical analysis and correlation analysis, and we also determine the optimal codons.
Materials and methods
The complete genome sequences of B. pseudomallei K96243 were obtained from NCBI (http://www.ncbi.nlm.nih.gov/), and the coding sequences (CDS) were extracted according to their coordinates (start and stop codon locations). Because there is a negative correlation between codon usage bias and gene length (that is, codon usage is restricted in short coding sequences), and in order to minimize sampling errors (Wright 1990), only those CDS (5328 in total) that are at least 100 codons long and have correct initiation and termination codons were included in the dataset.
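The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `passes_filter`, the set of accepted start codons (ATG, GTG, TTG), the rejection of internal stop codons, and the decision to count the stop codon towards the 100-codon minimum are all assumptions.

```python
# Hypothetical CDS filter: length >= 100 codons, valid start and stop codons.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 100                      # 300 bp, counting the stop codon (assumption)


def passes_filter(cds: str) -> bool:
    """Return True if a coding sequence satisfies the dataset criteria."""
    seq = cds.upper()
    if len(seq) % 3 != 0 or len(seq) // 3 < MIN_CODONS:
        return False
    if seq[:3] not in START_CODONS or seq[-3:] not in STOP_CODONS:
        return False
    # Assumed extra check: no in-frame stop codon before the last codon.
    internal = (seq[i:i + 3] for i in range(3, len(seq) - 3, 3))
    return not any(codon in STOP_CODONS for codon in internal)
```

A list of extracted sequences could then be reduced with `[s for s in sequences if passes_filter(s)]`.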
Measures of synonymous codon usage bias
Relative synonymous codon usage (RSCU)
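RSCU is usually defined as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so a value above 1 indicates a codon used more often than expected. The sketch below follows that usual definition under the standard genetic code; the function name `rscu` and the handling of absent amino acids (they are skipped) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """Return RSCU values for the codons of amino acids present in the gene."""
    seq = cds.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:
            continue  # Met and Trp have no synonyms; absent amino acids are skipped
        expected = total / len(codons)  # count under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values
```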
G+C and GC3s
The G+C value is the frequency of nucleotides that are guanine or cytosine. The GC3s value is the frequency of G+C at the third synonymously variable coding position (excluding Met, Trp, and termination codons).
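As a minimal sketch of these two definitions, assuming the standard genetic code (in which ATG and TGG are the only Met and Trp codons), the function names `gc_content` and `gc3s` are illustrative and are not taken from the software used in the study.

```python
# Codons without synonymous alternatives, plus stop codons (standard code).
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of nucleotides that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc3s(cds: str) -> float:
    """Fraction of G or C at the third position of synonymously variable codons."""
    seq = cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Keep only unambiguous codons that have synonymous alternatives.
    used = [c for c in codons if c not in EXCLUDED and set(c) <= set("ACGT")]
    if not used:
        raise ValueError("no synonymously variable codons in sequence")
    return sum(c[2] in "GC" for c in used) / len(used)
```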
Effective number of codons (ENC)
ENC is often used to measure the magnitude of codon bias for an individual gene, yielding values ranging from 20 for a gene with extreme bias using only one codon per amino acid to 61 for a gene with no bias using synonymous codons equally (Wright 1990).
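The range of 20 to 61 follows from how the statistic is built. A simplified sketch of Wright's estimator under the standard genetic code is given below: codon homozygosity F is computed per amino acid, averaged within each degeneracy class, and combined as Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The function name `enc`, the skipping of amino acids with fewer than two codons or non-positive F, and returning `None` when a class has no data are simplifying assumptions, so results may differ from those of CodonW.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def enc(cds: str):
    """Simplified effective number of codons (Nc) for one gene."""
    seq = cds.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met, Trp, or too few observations
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items() if v}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # Ile absent: average F2 and F4
    if len(mean_f) < 4:
        return None  # not enough data for this simplified estimate
    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(nc, 61.0)
```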
Codon adaptation index (CAI)
Gene expressivities were measured by calculating the CAI, a parameter that estimates the extent of bias toward codons known to be preferred in highly expressed genes. A CAI value lies between 0 and 1.0, and a higher value means a stronger codon usage bias and a higher expression level (Sharp and Li 1987a). This value has been widely used by different workers to estimate the expressivities of genes (Elisabeth and Richard 2000) and is now considered a well-accepted measure of gene expressivity. The set of sequences used to calculate CAI values in this study were the genes coding for ribosomal proteins.
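A minimal sketch of the CAI calculation is shown below, assuming the standard genetic code. Each codon receives a relative adaptiveness w (its count in the reference set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The function names and the pseudo-count of 0.5 for codons absent from the reference set are assumptions; the software used in the study may treat zero counts differently.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(reference_genes) -> dict:
    """Compute w for every synonymous codon from a reference gene set."""
    counts = Counter()
    for gene in reference_genes:
        seq = gene.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    families = {}
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    w = {}
    for codons in families.values():
        if len(codons) == 1:
            continue  # Met and Trp carry no information
        fam = {c: counts[c] or 0.5 for c in codons}  # assumed pseudo-count
        top = max(fam.values())
        for c in codons:
            w[c] = fam[c] / top
    return w


def cai(cds: str, w: dict):
    """Geometric mean of w over the synonymous codons of one gene."""
    seq = cds.upper()
    logs = [math.log(w[seq[i:i + 3]])
            for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] in w]
    return math.exp(sum(logs) / len(logs)) if logs else None
```

In the setting of this study, `reference_genes` would hold the ribosomal protein genes.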
Hydropathicity and length
The hydropathicity value is the general average hydropathicity (GRAVY) score for the hypothetical translated gene product (Lobry and Gautier 1994). It is calculated as the arithmetic mean of the hydropathic indices of the amino acids. The length value is the length of the gene.
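The GRAVY calculation can be sketched in a few lines. The example assumes the Kyte-Doolittle hydropathy scale, which is the scale commonly used for GRAVY scores; the text does not state which scale was applied, so this choice and the function name `gravy` are assumptions.

```python
# Kyte-Doolittle hydropathy indices (assumed scale for this illustration).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Arithmetic mean of the hydropathy indices of a protein sequence."""
    scores = [HYDROPATHY[aa] for aa in protein.upper() if aa in HYDROPATHY]
    if not scores:
        raise ValueError("no standard amino acids in sequence")
    return sum(scores) / len(scores)
```

Positive scores indicate a more hydrophobic product and negative scores a more hydrophilic one.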
COA on codon usage
A more extensive and quantitative analysis of the sources of variation among genes can be achieved using multivariate statistical analysis. The most commonly used method is correspondence analysis (COA). In this study, correspondence analysis was used to explore the variation of RSCU values among B. pseudomallei genes. After plotting genes in a 59-dimensional hyperspace according to their usage of the 59 sense codons, correspondence analysis identifies a series of new orthogonal axes accounting for the greatest variation among genes. The analysis yields the coordinate of each gene on each new axis, and the fraction of the total variation accounted for by each axis. A number of indices of codon bias were calculated for each gene.
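To make the idea of a "major axis" concrete, the sketch below extracts only the first correspondence-analysis axis from a genes-by-codons table of non-negative values, using power iteration on the matrix of standardized residuals. It is an illustrative stand-in for the software used in the study, not a reimplementation of it: the function name `first_coa_axis`, the fixed iteration count, and the requirement that every row and column has a non-zero sum are assumptions.

```python
import math


def first_coa_axis(table, iterations=500):
    """Return (row coordinates on axis 1, fraction of total inertia explained)."""
    total = sum(map(sum, table))
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(col) / total for col in zip(*table)]
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    resid = [[(x / total - r * c) / math.sqrt(r * c)
              for x, c in zip(row, col_mass)]
             for row, r in zip(table, row_mass)]
    inertia = sum(s * s for row in resid for s in row)
    # Power iteration for the leading right singular vector.
    v = [(-1) ** j * (j + 1.0) for j in range(len(col_mass))]
    for _ in range(iterations):
        u = [sum(s * x for s, x in zip(row, v)) for row in resid]
        v = [sum(resid[i][j] * u[i] for i in range(len(resid)))
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(s * x for s, x in zip(row, v)) for row in resid]
    explained = sum(x * x for x in u) / inertia
    coords = [x / math.sqrt(r) for x, r in zip(u, row_mass)]
    return coords, explained
```

The sign of the axis is arbitrary, so which end is called "positive" depends on the implementation.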
Correlations in codon usage variation among genes were analyzed using Spearman’s rank correlation method, with significance levels of P < 0.05 or P < 0.01.
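Spearman's coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below computes the coefficient only; it does not compute the P value, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation coefficient of two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

For example, `spearman_rho(axis1_coordinates, cai_values)` would give the rank correlation between two per-gene lists (both names are placeholders).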
Determination of optimal codons
One pair of datasets was used to define ‘optimal codons’: the 5% of the total genes with extremely high expression levels and the 5% with extremely low expression levels were taken as the High dataset and the Low dataset, respectively. Putative translationally optimal codons were identified as those used at significantly higher frequencies in the High dataset than in the Low dataset, using chi-square tests.
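One way to carry out this comparison is a 2×2 chi-square test per codon, contrasting the codon against the other synonymous codons of its amino acid in the two datasets. The sketch below takes that approach; the significance threshold of P < 0.01 (critical value 6.635 for one degree of freedom), the absence of a continuity correction, and the function names are assumptions, since the text does not give these details.

```python
CHI2_CRITICAL_P01 = 6.635  # chi-square critical value, 1 d.f., P = 0.01 (assumed threshold)


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def is_optimal(high_codon, high_family, low_codon, low_family):
    """Decide whether a codon is putatively optimal.

    high_codon / low_codon: counts of the codon in the High / Low dataset.
    high_family / low_family: counts of all synonymous codons for the same
    amino acid (including this codon) in each dataset.
    """
    a, b = high_codon, high_family - high_codon
    c, d = low_codon, low_family - low_codon
    more_frequent_in_high = a * low_family > c * high_family
    return more_frequent_in_high and chi_square_2x2(a, b, c, d) > CHI2_CRITICAL_P01
```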
The RSCU, GC3s, G+C, ENC, CAI, GRAVY and length values and the COA were calculated using the programs INteractive Codon Analysis version 1.20 (http://www.bioinfo-hr.org) and CodonW version 1.4 (http://www.codonw.sourceforge.net).
The correlation analysis was carried out using Spearman’s rank correlation method as implemented in the statistical software SPSS version 13.0 (http://www.spss.com).
Overall codon usage analysis
Table 1 Codon usage data in B. pseudomallei K96243
Heterogeneity of codon usage
Factors shaping codon usage
Effect of gene expressivities on codon usage
Correspondence analysis indicates that there is a single major trend in codon usage among the genes in this bacterium, and it is interesting to note that the position of the genes along the first axis generated by the analysis appears to be associated with expressivity. Clustered at one extreme were sequences of genes known or presumed to be expressed at high levels (such as ribosomal proteins, elongation factors, membrane proteins, heat-shock proteins, histone proteins, globin protein and dnaK, etc.). Genes presumed to be expressed at low levels (such as various kinases, zinc finger proteins, regulatory proteins, some hypothetical proteins, etc.) were scattered at the other extreme.
Effect of other factors on codon usage
It has long been noted that in organisms with a highly skewed base composition, mutational bias is the main factor shaping the codon usage variation among genes, whereas translational selection plays a minor role. Overall RSCU values (shown in Table 1) and the Nc-plot (shown in Fig. 3) provide definite indications that mutational bias acts in this organism in dictating the codon usage variation among the genes. In this study, axis 1 coordinates are significantly correlated with GC content (r = 0.319, P < 0.01); furthermore, ENC is significantly negatively correlated with both GC3s and GC content (r = −0.794, P < 0.01; r = −0.271, P < 0.01, respectively). In the B. pseudomallei genome, the GC content is 68.5%, and the third position of the codon tends to be ‘G’ or ‘C’ in the highly expressed genes. The highly expressed genes also have a high GC content. The CAI value and GC3s are also significantly correlated (r = 0.806, P < 0.01). These results support the view that highly expressed genes tend to use ‘C’ or ‘G’ at synonymous positions compared with lowly expressed genes. They also suggest that nucleotide compositional mutation bias may play an important role in shaping codon usage in the genome of this species, although it is less important than gene expression level.
In addition, axis 1 coordinates are also significantly correlated with the hydrophobicity of each protein (r = 0.164, P < 0.01) and with gene length (r = −0.300, P < 0.01); at the same time, ENC is significantly negatively correlated with both the hydrophobicity of each protein and gene length (r = −0.293, P < 0.01; r = −0.133, P < 0.01, respectively), indicating that apart from gene expression level and gene composition, gene length and the hydrophobicity of each protein also played a critical role in affecting B. pseudomallei codon usage.
Translational optimal codons
Table 2 Translational optimal codons of the B. pseudomallei genome
In this paper, we present evidence suggesting that the pattern of synonymous codon choices in the bacterium B. pseudomallei appears to be the result of a complex equilibrium between different forces, namely the natural selection at the translational level, nucleotide compositional mutation bias, the hydrophobicity of each protein and the length of each gene.
Any fitness differences among synonymous codons, perhaps associated with translational accuracy and/or efficiency, are expected to be very small, and thus selection among them is expected to be effective only in species with large effective population sizes (Bulmer 1987). On the one hand, in Escherichia coli and Saccharomyces cerevisiae (organisms expected to have very large effective population sizes), selection for efficient translation seems to determine codon frequencies, particularly in genes expressed at high levels (Sharp and Li 1987b). On the other hand, in humans, which have much smaller effective population sizes, there is as yet no evidence of selection among synonymous codons (Karlin and Mrazek 1996). In our study, the results show that codon usage bias increases with the level of gene expression in the genome of B. pseudomallei, as measured by the codon adaptation index. Selection may be due either to a direct effect of translation time on fitness or to the extra energy cost of proof-reading associated with longer translation times. It is therefore easy to see why, in the genome of B. pseudomallei, selection will be stronger, leading to greater bias, in a highly expressed gene whose codons are used more often.
Although the C. reinhardtii (Naya et al. 2001) and Echinococcus spp. (Fernandez et al. 2001) genomes have high GC contents, there was little evidence that genome composition shaped codon usage in these two genomes, but in D. melanogaster, the GC content was uniformly higher at silent sites in coding regions than in putatively neutrally evolving introns. In the B. pseudomallei genome, the GC content is 68.5%, and the third position of the codon tends to be ‘G’ or ‘C’ in the highly expressed genes. The highly expressed genes also have a high GC content. The CAI value and GC3s are also significantly correlated (r = 0.806, P < 0.01). These results support the view that highly expressed genes tend to use ‘C’ or ‘G’ at synonymous positions compared with lowly expressed genes. Overall codon usage patterns (Table 1) and the Nc-plot (Fig. 3) also confirmed that nucleotide compositional mutation bias is the relatively weaker influence on codon usage in the B. pseudomallei genome.
Apart from gene expression level and gene composition, gene length also played a critical role in affecting B. pseudomallei codon usage. In the Drosophila genome (Comeron et al. 1999), longer genes had lower codon usage bias. In the S. pneumoniae genome, however, longer genes had higher expression levels and higher codon usage bias (Hou and Yang 2000). These findings indicate that the relationship between gene length and codon usage differs among genomes, each according to its own requirements, and that there are no universal rules relating gene length and expression level across all genomes. In this study, the longer genes had higher expression levels and higher codon usage bias; we argue that this positive correlation could be caused by selection to avoid missense errors during translation. Since the cost of producing a protein is proportional to its length, selection in favor of codons that increase accuracy should be greater in longer genes, and long genes should therefore have higher synonymous codon bias.
For Chlamydia trachomatis and Thermotoga maritima, it was reported that codon choices were influenced by the hydropathy level of each protein (Romero et al. 2000; Zavala et al. 2002). In this study, codon usage is significantly positively correlated with the hydrophobicity of each protein in B. pseudomallei. The link between hydropathy and codon usage may be caused by the fact that many of the highly expressed sequences are hydrophilic, simply because they accomplish their function in the aqueous medium of the cell.
At the same time, we defined the 21 codons shared by the above-mentioned two comparisons as the optimal codons of B. pseudomallei (Table 2). This will be significant for the design of degenerate primers, the introduction of point mutations, the modification of heterologous genes, and the investigation of the evolutionary mechanisms of species at the molecular level.
In summary, our work has provided a basic understanding of the mechanisms of codon usage bias and some useful information for improving the expression of target genes in vivo and in vitro. As more complete genomes are studied, additional factors appear to shape the pattern of codon usage. This pattern is the result of biological processes (i.e. protein structure and folding, physiological constraints, translation, replication, transcription, mutation, etc.), and hence it becomes imperative to analyze codon usage in the light of this complexity. However, it is not yet possible to say that the ‘mutational bias-translational selection’ paradigm is insufficient to explain codon usage in all species; for the moment, all the ‘new factors’ can be explained in terms of this paradigm, although it is certainly becoming more complex.