World Journal of Microbiology and Biotechnology, Volume 24, Issue 8, pp 1585–1592

The factors dictating the codon usage variation among the genes in the genome of Burkholderia pseudomallei

Authors

    • Jingchu University of Technology
    • College of Life Science and Technology, Southwest University for Nationalities
  • Qin Zhang
    • Jingchu University of Technology
    • College of Life Science and Technology, Southwest University for Nationalities
  • Zhihua Chen
    • College of Life Science and Technology, Southwest University for Nationalities
Original Paper

DOI: 10.1007/s11274-007-9652-8

Cite this article as:
Zhao, S., Zhang, Q., Chen, Z. et al. World J Microbiol Biotechnol (2008) 24: 1585. doi:10.1007/s11274-007-9652-8

Abstract

Burkholderia pseudomallei is a recognized biothreat agent and the causative agent of melioidosis. The codon usage bias of all protein-coding genes (length greater than or equal to 300 bp) in the complete genome of B. pseudomallei K96243 was analyzed. As B. pseudomallei is a GC-rich organism (68.5%), the overall codon usage data indicate that codons ending in G and/or C are predominant in this organism. Multivariate statistical analysis, however, indicates that there is a single major trend in the codon usage variation among the genes, which has a strong positive correlation with gene expressivity. The majority of the lowly expressed genes are scattered towards the negative end of the major axis, whereas the highly expressed genes are clustered towards the positive end. At the same time, from the significant correlations between the axis 1 coordinates and the GC and GC3s contents of each sequence, and the significant negative correlations between the ‘Effective Number of Codons’ values and the GC and GC3s contents, we inferred that codon usage bias is also affected by gene nucleotide composition. In addition, other factors such as gene length and the hydrophobicity of the encoded proteins also influence the codon usage variation among the genes in this organism in a minor way. Notably, 21 codons have been defined as ‘optimal codons’ of B. pseudomallei. In summary, our work has provided a basic understanding of the mechanisms of codon usage bias and some useful information for improving the expression of target genes in vivo and in vitro.

Keywords

Burkholderia pseudomallei K96243; Codon usage; Correspondence analysis

Abbreviations

bp: Base pair

FMD: Foot-and-mouth disease

FMDV: Foot-and-mouth disease virus

RSCU: Relative synonymous codon usage

ENC: Effective number of codons

COA: Correspondence analysis

GC3S: The frequency of G+C at the synonymous third position of sense codons

A3S, T3S, G3S and C3S: The adenine, thymine, guanine and cytosine content at synonymous third positions

ORF: Open reading frame

SD: Standard deviation

Introduction

The inter- and intra-genomic variation in the pattern of codon usage is a widespread phenomenon. This variation has been attributed to two main factors: natural selection acting on silent sites to increase the rate and/or the accuracy of translation, and mutational biases (Ikemura 1985). In unicellular organisms, such as Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, and Dictyostelium discoideum, codon usage is attributable to the equilibrium between natural selection and compositional mutation bias (Bulmer 1988; Sharp et al. 1993). However, in some prokaryotes with extremely high A+T or G+C contents (Sharp et al. 1993) and in humans (Karlin and Mrazek 1996), mutation bias is the major factor accounting for the variation in codon usage. More interestingly, a rather complex pattern was reported for Chlamydia trachomatis, where codon choices were the result of strand-specific mutational biases, natural selection acting at the level of translation, the hydropathy level of each protein, and amino acid conservation (Romero et al. 2000). Previous codon usage analyses have shown that codon usage bias is very complicated and is associated with various biological factors, such as gene expression level (Nakamura and Sugiura 2007; Sharp et al. 1993), gene length (Eyre-Walker 1996; Liu et al. 2004), gene translation initiation signal (Ma et al. 2002), protein amino acid composition (Noboru 1999), protein structure (Plotkin et al. 2006), tRNA abundance (Kanaya et al. 1999; Noguchi and Satow 2006), mutation frequency and patterns (Noboru 1999; Sau et al. 2005), and GC composition (Sueoka and Kawanishi 2000; Wan et al. 2004). Knowledge of codon usage patterns can provide a basis for understanding the mechanisms underlying the biased usage of synonymous codons and for selecting appropriate host expression systems to improve the expression of target genes in vivo and in vitro.

Melioidosis is an emerging infectious disease of animals and humans caused by the Gram-negative bacterium Burkholderia pseudomallei, an environmental saprophyte present in wet soil and rice paddies in endemic areas. B. pseudomallei has come under renewed scientific investigation as a result of recent concerns about its potential future use as a biological weapon, and no vaccine is currently available for it. The majority of infections are reported from East Asia and northern Australia, the highest documented rate being in northeastern Thailand, where melioidosis accounts for 20% of all septicaemias. Infection is acquired through skin abrasions or inhalation of contaminated soil or surface water. Clinical disease presents along a spectrum of severity, ranging from acute fulminating sepsis, which carries high mortality rates, to chronic persistent infection that is difficult to resolve with current antibiotic therapies. Death usually occurs within the first 48 h as a result of septic shock, even in a setting where optimal antimicrobial chemotherapy is given. Of equal concern, there is evidence that the bacterium does not cause overt disease in all exposed individuals but is able to persist at unknown sites in the body and become reactivated later in life. The potential for the bacterium to cause disease after inhalation has also resulted in its inclusion on the Centers for Disease Control list of potential biothreat agents as a Category B agent.

Codon usage in B. pseudomallei has not been investigated in any detail, and it is not clear how (or even if) codon usage varies among its genes. The complete sequence of the B. pseudomallei K96243 genome has now been determined (Holden et al. 2004). It is therefore of interest to characterize the codon usage pattern of this species. In this paper, we report an analysis of codon usage bias in the B. pseudomallei genome using multivariate statistical analysis and correlation analysis, and we also determine the optimal codons.

Materials and methods

Dataset

The complete genome sequences of B. pseudomallei K96243 were obtained from NCBI (http://www.ncbi.nlm.nih.gov/). Coding sequences (CDS) were extracted according to their annotated coordinates (start and stop codon locations). Because there is a negative correlation between codon usage bias and gene length (that is, codon usage is restricted in short coding sequences), and in order to minimize sampling errors (Wright 1990), only those CDS (5328 in total) that were at least 100 codons long and had correct initiation and termination codons were included in the dataset.
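The filtering rule described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `keep_cds`, the set of accepted start codons, and the choice to count the stop codon towards the 100-codon threshold are all assumptions.

```python
# Hypothetical helper illustrating the dataset filter; names are assumptions.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keep_cds(seq, min_codons=100):
    """Return True if a CDS passes the length and start/stop codon checks."""
    seq = seq.upper()
    # A coding sequence must be a whole number of codons.
    if len(seq) % 3 != 0:
        return False
    # Assumption: the stop codon is counted in the codon total.
    if len(seq) // 3 < min_codons:
        return False
    # Require a recognised start codon and a terminal stop codon.
    return seq[:3] in START_CODONS and seq[-3:] in STOP_CODONS
```

A 100-codon sequence such as `"ATG" + "GCG" * 98 + "TGA"` passes the filter, while the same construct with 97 internal codons does not.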

Measures of synonymous codon usage bias

Relative synonymous codon usage (RSCU)

In order to normalize codon usage within datasets of differing amino acid compositions, RSCU values were calculated by dividing the observed codon usage by that expected when all codons for the same amino acid are used equally (Gupta et al. 2004; Hou and Yang 2000).
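As an illustration of this definition, the sketch below computes RSCU from a dictionary of codon counts. It assumes the standard genetic code written in the DNA alphabet (T instead of U); the names `CODON_TABLE` and `rscu` are hypothetical and are not taken from CodonW or any other program used in the study.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(codon_counts):
    """RSCU = observed count / count expected under equal synonymous usage."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values


# Alanine counts taken from Table 1 (GCU written as GCT).
ala = rscu({"GCT": 6059, "GCC": 69217, "GCA": 13951, "GCG": 195211})
```

With the alanine counts from Table 1, the values round to 0.09, 0.97, 0.20 and 2.75 for GCU, GCC, GCA and GCG, matching the RSCU column of the table.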

G+C and GC3s

The G+C value is the frequency of nucleotides that are guanine or cytosine. The GC3s value is the frequency of G+C at the third, synonymously variable codon position (excluding Met, Trp, and termination codons).
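A minimal sketch of both quantities is given below, assuming the standard genetic code and an in-frame DNA sequence. The function names are hypothetical, and codons containing ambiguous bases are simply skipped, which is an assumption rather than documented behaviour of the programs used in the study.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
NON_SYNONYMOUS = {"M", "W", "*"}  # Met, Trp and termination codons are excluded


def gc_content(seq):
    """Fraction of all nucleotides that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    thirds = [c[2] for c in codons
              if c in CODON_TABLE and CODON_TABLE[c] not in NON_SYNONYMOUS]
    if not thirds:
        raise ValueError("sequence contains no synonymous codons")
    return sum(base in "GC" for base in thirds) / len(thirds)
```

For the toy sequence `ATGGCGGCTTGGTAA`, only GCG and GCT are counted, so GC3s is 0.5 even though the overall G+C fraction is 8/15.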

Effective number of codons (ENC)

ENC is often used to measure the magnitude of codon bias for an individual gene, yielding values ranging from 20 for a gene with extreme bias using only one codon per amino acid, to 61 for a gene with no bias using synonymous codons equally (Wright 1990).
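The statistic can be sketched from Wright's (1990) definition, in which a homozygosity F is computed for each amino acid and averaged within each degeneracy class: Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The code below is a simplified illustration, not the CodonW implementation. Skipping amino acids with non-positive F and returning `None` when a degeneracy class is missing are assumptions made for brevity.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def enc(seq):
    """Effective number of codons (Wright 1990), simplified sketch."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    f_by_class = defaultdict(list)  # degeneracy -> homozygosity values
    for fam in families.values():
        k = len(fam)
        n = sum(counts[c] for c in fam)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few observations to estimate F
        sum_p2 = sum((counts[c] / n) ** 2 for c in fam)
        f = (n * sum_p2 - 1) / (n - 1)
        if f > 0:  # assumption: drop amino acids whose estimate is not positive
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    # If Ile (the only threefold family) is unusable, Wright suggests
    # averaging the twofold and fourfold classes.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if any(k not in mean_f for k in (2, 3, 4, 6)):
        return None
    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(nc, 61.0)  # values above 61 are capped at 61
```

A sequence that uses a single codon for every amino acid gives 20, and one that uses all 61 sense codons equally gives 61, the two limits quoted above.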

Codon adaptation index (CAI)

Gene expressivity was estimated using the codon adaptation index (CAI), which measures the extent of bias toward codons known to be preferred in highly expressed genes. A CAI value lies between 0 and 1.0, and a higher value means a stronger codon usage bias and a higher expression level (Sharp and Li 1987a). This index has been widely used to estimate the expressivity of genes by different workers (Elisabeth and Richard 2000) and is now considered a well-accepted measure of gene expressivity. The reference set of sequences used to calculate CAI values in this study consisted of the genes coding for ribosomal proteins.
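The calculation can be illustrated with a short sketch that follows the general definition of CAI: each codon receives a relative adaptiveness w (its count in the reference set divided by the count of the most used synonymous codon), and CAI is the geometric mean of w over the codons of a gene. The function names and the pseudocount of 0.5 for codons absent from the reference set are assumptions; the reference set here is any list of sequences supplied by the caller, standing in for the ribosomal protein genes.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """w = codon count / count of the most used synonymous codon."""
    counts = Counter()
    for seq in reference_seqs:
        counts.update(split_codons(seq))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":  # single-codon amino acids and stops are ignored
            families.setdefault(aa, []).append(codon)
    w = {}
    for fam in families.values():
        # Assumption: unseen codons get a small pseudocount to avoid log(0).
        adjusted = {c: counts[c] or pseudocount for c in fam}
        top = max(adjusted.values())
        for c in fam:
            w[c] = adjusted[c] / top
    return w


def cai(seq, w):
    """Geometric mean of w over the scored codons of a gene."""
    logs = [math.log(w[c]) for c in split_codons(seq) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else None
```

A gene built only from the most frequent reference codons scores 1.0, and any use of rarer synonymous codons lowers the value towards 0.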

Hydropathicity and length

The hydropathicity value is the general average hydropathicity (GRAVY) score of the hypothetical translated gene product (Lobry and Gautier 1994). It is calculated as the arithmetic mean of the hydropathic indices of all amino acids in the protein. The length value is the length of the gene.
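A minimal sketch of the GRAVY calculation follows. It assumes the Kyte and Doolittle hydropathy scale, which is the scale commonly used for this score; the text does not state which scale was applied, so the table below should be read as an assumption. The function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy indices (assumed scale; not stated in the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Arithmetic mean of the hydropathy index over all residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not scores:
        raise ValueError("no scorable residues in sequence")
    return sum(scores) / len(scores)
```

Positive scores indicate a hydrophobic protein overall and negative scores a hydrophilic one.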

COA on codon usage

A more extensive and quantitative analysis of the sources of variation among genes can be achieved using multivariate statistical analysis, of which correspondence analysis (COA) is currently the most commonly used method. In this study, COA was used to explore the variation of RSCU values among B. pseudomallei genes. After plotting genes in a 59-dimensional hyperspace according to their usage of the 59 sense codons (the 61 sense codons excluding those for Met and Trp), correspondence analysis identifies a series of new orthogonal axes accounting for the greatest variation among genes. The analysis yields the coordinate of each gene on each new axis and the fraction of the total variation accounted for by each axis. A number of indices of codon bias were also calculated for each gene.

Statistical methods

Correlations between the codon usage indices of the genes were analyzed using Spearman’s rank correlation, with significance levels of P < 0.05 or P < 0.01.
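For readers unfamiliar with the method, Spearman's coefficient is the Pearson correlation of the ranks of the two variables, with tied values given their average rank. The sketch below computes only the coefficient (not the P value) in plain Python; the function names are hypothetical and the code is an illustration, not the SPSS procedure.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks are used, any monotonically increasing relationship gives a coefficient of 1, regardless of its shape.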

Determination of optimal codons

A single pair of datasets was used to define the ‘optimal codons’: the 5% of genes with the highest expression levels and the 5% with the lowest expression levels were taken as the High dataset and the Low dataset, respectively. Putative translationally optimal codons were identified as those used at significantly higher frequencies in the High dataset than in the Low dataset, using chi-square tests.
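One way to carry out such a comparison is sketched below. It assumes a 2 × 2 chi-square test per codon (the codon versus all of its synonyms, in High versus Low) with one degree of freedom, for which the critical value at P < 0.01 is 6.635. The text does not specify how the test was set up, so this layout, the function names and the `families` argument are assumptions.

```python
CHI2_CRITICAL_P01_DF1 = 6.635  # chi-square critical value, df = 1, P < 0.01


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def optimal_codons(high, low, families):
    """Codons significantly more frequent in `high` than in `low`.

    high, low: dicts mapping codon -> count in each dataset.
    families: dict mapping amino acid -> list of its synonymous codons.
    """
    selected = []
    for codons in families.values():
        high_total = sum(high.get(c, 0) for c in codons)
        low_total = sum(low.get(c, 0) for c in codons)
        if high_total == 0 or low_total == 0:
            continue
        for codon in codons:
            h, l = high.get(codon, 0), low.get(codon, 0)
            more_frequent = h / high_total > l / low_total
            chi2 = chi_square_2x2(h, high_total - h, l, low_total - l)
            if more_frequent and chi2 > CHI2_CRITICAL_P01_DF1:
                selected.append(codon)
    return selected


# Phe and Pro counts taken from Table 2.
families = {"Phe": ["UUU", "UUC"], "Pro": ["CCU", "CCC", "CCA", "CCG"]}
high = {"UUU": 189, "UUC": 4014, "CCU": 43, "CCC": 1202, "CCA": 10, "CCG": 4322}
low = {"UUU": 1270, "UUC": 2032, "CCU": 634, "CCC": 879, "CCA": 735, "CCG": 1950}
result = optimal_codons(high, low, families)
```

With the Phe and Pro counts from Table 2, this sketch selects UUC and CCG, the two codons marked with an asterisk for those amino acids.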

Analysis tools

The RSCU, GC3s, G+C, ENC, CAI, GRAVY and length values, and the COA, were calculated using the programs INteractive Codon Analysis version 1.20 (http://www.bioinfo-hr.org) and CodonW version 1.4 (http://www.codonw.sourceforge.net).

The correlation analysis was carried out using Spearman’s rank correlation as implemented in the statistical software SPSS version 13.0 (http://www.spss.com).

Results

Overall codon usage analysis

The overall codon usage in the 5328 B. pseudomallei coding sequences shows the expected bias towards G+C-rich codons (Table 1). As a whole, the analyzed genes show a significant preference for one particular codon for each amino acid. They show a strong bias toward codons ending in G and/or C rather than T and/or A for all degenerate codons (7.26 times). It is evident that codons ending in G and/or C are predominant in the entire coding region.
Table 1

Codon usage data in B. pseudomallei K96243 (a)

| AA | Codon | Number | RSCU | AA | Codon | Number | RSCU |
|---|---|---|---|---|---|---|---|
| Ala | GCU | 6,059 | 0.09 | Leu | UUA | 665 | 0.02 |
| | GCC | 69,217 | 0.97 | | UUG | 12,822 | 0.39 |
| | GCA | 13,951 | 0.20 | | CUU | 6,622 | 0.20 |
| | GCG | 195,211 | 2.75 | | CUC | 86,878 | 2.63 |
| Arg | CGU | 8,859 | 0.35 | | CUA | 1,794 | 0.05 |
| | CGC | 100,438 | 3.97 | | CUG | 89,560 | 2.71 |
| | CGA | 5,467 | 0.22 | Lys | AAA | 7,551 | 0.29 |
| | CGG | 32,374 | 1.28 | | AAG | 45,367 | 1.71 |
| | AGA | 1,338 | 0.05 | Met | AUG | 40,522 | 1.00 |
| | AGG | 3,366 | 0.13 | Phe | UUU | 8,250 | 0.23 |
| Asn | AAU | 10,180 | 0.41 | | UUC | 61,993 | 1.77 |
| | AAC | 39,820 | 1.59 | Pro | CCU | 2,976 | 0.11 |
| Asp | GAU | 32,785 | 0.59 | | CCC | 24,647 | 0.94 |
| | GAC | 78,812 | 1.41 | | CCA | 2,231 | 0.09 |
| Cys | UGU | 1,323 | 0.14 | | CCG | 74,666 | 2.86 |
| | UGC | 17,450 | 1.86 | Ser | UCU | 1,469 | 0.08 |
| Gln | CAA | 11,886 | 0.38 | | UCC | 12,496 | 0.72 |
| | CAG | 50,907 | 1.62 | | UCA | 2,014 | 0.12 |
| Glu | GAA | 32,339 | 0.69 | | UCG | 52,935 | 3.06 |
| | GAG | 61,244 | 1.31 | | AGU | 2,323 | 0.13 |
| Gly | GGU | 7,514 | 0.18 | | AGC | 32,684 | 1.89 |
| | GGC | 128,981 | 3.16 | TER | UAA | 844 | 0.48 |
| | GGA | 5,896 | 0.14 | | UAG | 608 | 0.34 |
| | GGG | 20,957 | 0.51 | | UGA | 3,876 | 2.18 |
| His | CAU | 15,475 | 0.68 | Thr | ACU | 2,139 | 0.09 |
| | CAC | 30,155 | 1.32 | | ACC | 26,922 | 1.08 |
| Ile | AUU | 9,131 | 0.32 | | ACA | 2,971 | 0.12 |
| | AUC | 73,831 | 2.63 | | ACG | 67,677 | 2.71 |
| | AUA | 1,359 | 0.05 | Val | GUU | 5,582 | 0.15 |
| Trp | UGG | 26,888 | 1.00 | | GUC | 68,002 | 1.84 |
| Tyr | UAU | 15,633 | 0.65 | | GUA | 3,671 | 0.10 |
| | UAC | 32,267 | 1.35 | | GUG | 70,318 | 1.91 |

(a) Note: AA, amino acids; Number, number of codons; RSCU, cumulative relative synonymous codon usage of 5328 genes

Heterogeneity of codon usage

The effective number of codons used by a gene (ENC) and the G+C percentage at the third synonymous codon position (GC3s) are generally used to study the codon usage variation among the genes of an organism. Figure 1 shows the distribution of ENC values among B. pseudomallei genes. From the ENC values, which range from 24.70 to 61 (with a mean of 33.70 and a standard deviation of 5.65), we infer that there is wide variation in codon usage bias among the genes. Moreover, the distribution of GC3s (Fig. 2) further confirms the heterogeneity of codon usage bias among the genes: GC3s varies from 38% to 98%, with a mean of 87% and a standard deviation of 7%. These results indicate that, besides compositional constraints, other trends might influence the overall codon usage variation among the genes in B. pseudomallei.
https://static-content.springer.com/image/art%3A10.1007%2Fs11274-007-9652-8/MediaObjects/11274_2007_9652_Fig1_HTML.gif
Fig. 1

Distribution of effective number of codons (ENC) value for 5328 B. pseudomallei genes

https://static-content.springer.com/image/art%3A10.1007%2Fs11274-007-9652-8/MediaObjects/11274_2007_9652_Fig2_HTML.gif
Fig. 2

Compositional distribution of GC content at the synonymous third positions of codons (GC3s) for 5328 B. pseudomallei genes

Factors shaping codon usage

The Nc-plot

A plot of ENC against GC3s (Nc-plot) is an effective way to examine codon usage variation among genes. For example, if GC3s is zero, then only codons ending in A and T will be used, restricting the number of codons used to 30 out of the 61 sense codons. Wright (1990) argued that if a particular gene is subject only to G+C compositional constraints, it will lie on or just below the expected curve, whereas a gene subject to selection for translationally optimal codons will lie considerably below the expected curve. This method has been used to investigate the evolution of many genomes. The Nc-plot of B. pseudomallei genes (Fig. 3) shows that a small number of genes lie on the expected curve, which indicates that compositional constraints play a role in defining the codon usage variation among those genes. However, a majority of the points with low ENC values lie well below the expected curve, suggesting that other factors, independent of compositional constraints, have a primary influence on the codon usage variation among the genes.
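The expected curve in an Nc-plot is usually drawn from Wright's (1990) approximation Nc = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s expressed as a fraction. The short sketch below evaluates this curve; the function names are hypothetical, and `gap_below_curve` is only an illustrative helper for judging how far a gene lies below the expectation.

```python
def expected_enc(gc3s):
    """Expected ENC under GC3s compositional constraint alone (Wright 1990)."""
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


def gap_below_curve(observed_enc, gc3s):
    """Positive values mean the gene lies below the expected curve."""
    return expected_enc(gc3s) - observed_enc
```

The curve peaks at 60.5 for s = 0.5 and falls to about 31 at s = 0. For s = 0.87, the mean GC3s reported here, it gives about 40.3, compared with the observed mean ENC of 33.70.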
https://static-content.springer.com/image/art%3A10.1007%2Fs11274-007-9652-8/MediaObjects/11274_2007_9652_Fig3_HTML.gif
Fig. 3

Effective number of codons (ENC) used in a gene plotted against the G+C content at the synonymously variable third position (GC3S), for 5328 B. pseudomallei genes. The curve indicates the expected codon usage if GC compositional constraints alone account for codon usage bias

Correspondence analysis

The first axis of a correspondence analysis identifies the single largest source of variation among a set of multivariate data points; in this case, the single largest trend in codon usage among genes. The first axis accounts for 22.86% of all variation among genes. This is a high proportion, since 58 axes are produced in total, and the next three axes account for only 4.09, 3.67 and 3.37%, respectively. Apart from the first axis, no axis individually accounts for more than 10% of the total variation. The plot of genes on the first two axes (Fig. 4) shows most genes falling within a single cloud near the origin of the axes. These genes have an average ENC value of 32.52 with a standard deviation of 3.78, which indicates that they have broadly similar codon usage biases. A very few points, however, are scattered along the negative side of the first major axis, with an average ENC value of 50.07 and a standard deviation of 5.58, which confirms that the codon usage biases of these genes are not homogeneous. In addition, as the first axis explains only part of the variation in codon usage among the genes in this bacterial genome, we postulated that several factors shape the codon usage of B. pseudomallei genes.
https://static-content.springer.com/image/art%3A10.1007%2Fs11274-007-9652-8/MediaObjects/11274_2007_9652_Fig4_HTML.gif
Fig. 4

Plot of the two most prominent axes generated by the COA of the RSCU values for 5328 B. pseudomallei genes. Each point on the plot corresponds to the coordinates on the first and second principal axes produced by the COA. COA, Correspondence analysis; RSCU, relative synonymous codon usage

Effect of gene expressivities on codon usage

While correspondence analysis indicates that there is a single major trend in codon usage among the genes of this bacterium, it is interesting to note that the position of the genes along the first axis appears to be associated with expressivity. At one extreme cluster the sequences coding for genes that are known or presumed to be expressed at high levels (such as ribosomal proteins, elongation factors, membrane proteins, heat-shock proteins, histone proteins, globin protein and dnaK, etc.). Genes presumably expressed at low levels (such as various kinases, zinc finger proteins, regulatory proteins, some hypothetical proteins, etc.) are scattered at the other extreme.

In this study, we used the CAI value to estimate the expressivity of each gene of this organism. The positions of the genes on the first major axis were plotted against their corresponding CAI values (Fig. 5). The axis 1 coordinates are significantly positively correlated with the gene expression level assessed by CAI values and with GC3s content (r = 0.928, P < 0.01; r = 0.853, P < 0.01), and significantly negatively correlated with ENC values (r = −0.877, P < 0.01). At the same time, CAI values are significantly negatively correlated with ENC (r = −0.889, P < 0.01) and significantly positively correlated with GC3s (r = 0.806, P < 0.01) and GC (r = 0.197, P < 0.01). We therefore infer that gene expression level is the major factor shaping codon usage in B. pseudomallei: genes with higher expression levels exhibit a greater degree of codon usage bias, tend to be GC-rich, and prefer codons with G and/or C at the synonymous position.
https://static-content.springer.com/image/art%3A10.1007%2Fs11274-007-9652-8/MediaObjects/11274_2007_9652_Fig5_HTML.gif
Fig. 5

Scatter diagram of the positions of 5328 B. pseudomallei genes on the first major axis generated by correspondence analysis against their Codon Adaptation Index (CAI) values

Effect of other factors on codon usage

It has long been noted that in organisms with a highly skewed base composition, mutational bias is the main factor shaping the codon usage variation among genes, whereas translational selection plays a minor role. The overall RSCU values (Table 1) and the Nc-plot (Fig. 3) provide definite indications that mutational bias is acting in this organism in dictating the codon usage variation among the genes. In this study, the axis 1 coordinates are significantly correlated with GC content (r = 0.319, P < 0.01); furthermore, ENC is significantly negatively correlated with GC3s and with GC content (r = −0.794, P < 0.01; r = −0.271, P < 0.01). In the B. pseudomallei genome the GC content is 68.5%, and the third codon position tends to be ‘G’ or ‘C’ in the highly expressed genes, which also have a high GC content. The CAI value and GC3s are also significantly correlated (r = 0.806, P < 0.01). These results support the view that highly expressed genes tend to use ‘C’ or ‘G’ at synonymous positions more than lowly expressed genes do. They also suggest that nucleotide compositional mutation bias may play an important role in shaping codon usage in the genome of this species, although a less important one than gene expression level.

In addition, the axis 1 coordinates are also significantly correlated with the hydrophobicity of each protein (r = 0.164, P < 0.01) and with gene length (r = −0.300, P < 0.01); at the same time, ENC is significantly negatively correlated with the hydrophobicity of each protein and with gene length (r = −0.293, P < 0.01; r = −0.133, P < 0.01). This indicates that, apart from gene expression level and gene composition, gene length and the hydrophobicity of each protein also play a role in shaping B. pseudomallei codon usage.

Translational optimal codons

Given the correlation between the first axis of the COA and expression levels, we compared the codon usage patterns of the genes displaying the most extreme values (10%) at both ends of the first axis. The result of this analysis is displayed in Table 2. There are 21 codons whose usage is significantly increased among the highly expressed genes, which encode 19 amino acids (the only residue with no preferred triplet is Asp). We postulate that these codons are translationally optimal in B. pseudomallei.
Table 2

Translational optimal codons of the B. pseudomallei genome (a)

| Name | Codon | High RSCU | High Number | Low RSCU | Low Number | Name | Codon | High RSCU | High Number | Low RSCU | Low Number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Phe | UUU | 0.09 | 189 | 0.77 | 1,270 | Ser | UCU | 0.01 | 5 | 0.54 | 516 |
| | UUC* | 1.91 | 4,014 | 1.23 | 2,032 | | UCC | 0.51 | 475 | 0.96 | 921 |
| Leu | UUA | 0.00 | 2 | 0.22 | 294 | | UCA | 0.01 | 10 | 0.65 | 624 |
| | UUG | 0.12 | 229 | 1.26 | 1,679 | | UCG* | 3.74 | 3,469 | 1.78 | 1,715 |
| | CUU | 0.06 | 105 | 1.00 | 1,337 | Pro | CCU | 0.03 | 43 | 0.60 | 634 |
| | CUC* | 2.88 | 5,338 | 1.39 | 1,852 | | CCC | 0.86 | 1,202 | 0.84 | 879 |
| | CUA | 0.01 | 14 | 0.43 | 578 | | CCA | 0.01 | 10 | 0.70 | 735 |
| | CUG* | 2.93 | 5,423 | 1.70 | 2,277 | | CCG* | 3.10 | 4,322 | 1.86 | 1,950 |
| Ile | AUU | 0.15 | 278 | 1.07 | 1,395 | Thr | ACU | 0.02 | 24 | 0.57 | 617 |
| | AUC* | 2.85 | 5,442 | 1.54 | 2,007 | | ACC | 1.03 | 1,523 | 1.08 | 1,181 |
| | AUA | 0.01 | 13 | 0.39 | 514 | | ACA | 0.02 | 26 | 0.64 | 702 |
| Met | AUG | 1.00 | 2,713 | 1.00 | 1,756 | | ACG* | 2.94 | 4,360 | 1.71 | 1,858 |
| Val | GUU | 0.03 | 75 | 0.84 | 1,250 | Ala | GCU | 0.03 | 110 | 0.58 | 1,319 |
| | GUC* | 1.89 | 4,233 | 1.41 | 2,094 | | GCC | 0.91 | 3,444 | 1.04 | 2,361 |
| | GUA | 0.03 | 72 | 0.51 | 757 | | GCA | 0.09 | 331 | 0.81 | 1,830 |
| | GUG* | 2.05 | 4,595 | 1.23 | 1,831 | | GCG* | 2.97 | 11,211 | 1.56 | 3,532 |
| Tyr | UAU | 0.40 | 594 | 1.01 | 1,204 | Cys | UGU | 0.01 | 3 | 0.66 | 334 |
| | UAC* | 1.60 | 2,355 | 0.99 | 1,185 | | UGC* | 1.99 | 923 | 1.34 | 680 |
| TER | UAA | 0.73 | 65 | 0.73 | 65 | TER | UGA | 2.08 | 184 | 1.76 | 156 |
| | UAG | 0.19 | 17 | 0.51 | 45 | Trp | UGG | 1.00 | 1,269 | 1.00 | 1,267 |
| His | CAU | 0.42 | 524 | 0.99 | 967 | Arg | CGU | 0.16 | 201 | 0.95 | 1,030 |
| | CAC* | 1.58 | 1,958 | 1.01 | 988 | | CGC* | 4.84 | 6,013 | 1.91 | 2,056 |
| Gln | CAA | 0.23 | 428 | 0.81 | 1,202 | | CGA | 0.04 | 45 | 1.05 | 1,128 |
| | CAG* | 1.77 | 3,349 | 1.19 | 1,748 | | CGG | 0.91 | 1,133 | 1.20 | 1,299 |
| Asn | AAU | 0.17 | 299 | 0.92 | 1,265 | Ser | AGU | 0.01 | 6 | 0.65 | 622 |
| | AAC* | 1.83 | 3,221 | 1.08 | 1,472 | | AGC* | 1.72 | 1,593 | 1.43 | 1,377 |
| Lys | AAA | 0.12 | 258 | 0.79 | 1,289 | Arg | AGA | 0.00 | 2 | 0.42 | 453 |
| | AAG* | 1.88 | 4,036 | 1.21 | 1,974 | | AGG | 0.05 | 62 | 0.47 | 506 |
| Asp | GAU | 0.44 | 1,456 | 0.90 | 2,147 | Gly | GGU | 0.06 | 155 | 0.77 | 1,212 |
| | GAC* | 1.56 | 5,177 | 1.10 | 2,644 | | GGC* | 3.60 | 8,697 | 1.69 | 2,649 |
| Glu | GAA | 0.62 | 1,881 | 0.94 | 2,166 | | GGA | 0.03 | 67 | 0.77 | 1,211 |
| | GAG* | 1.38 | 4,157 | 1.06 | 2,449 | | GGG | 0.31 | 737 | 0.77 | 1,210 |

(a) Note: Comparison of codon usage frequencies between highly and lowly expressed sequences, as discriminated by the first axis of the COA. Number, number of occurrences. The codons marked with an * are significantly more frequent among the highly expressed genes (P < 0.01) according to a χ2 test

Discussion

In this paper, we present evidence suggesting that the pattern of synonymous codon choices in the bacterium B. pseudomallei appears to be the result of a complex equilibrium between different forces, namely the natural selection at the translational level, nucleotide compositional mutation bias, the hydrophobicity of each protein and the length of each gene.

Any fitness differences among synonymous codons, perhaps associated with translational accuracy and/or efficiency, are expected to be very small and thus effective only in species with large population sizes (Bulmer 1987). On the one hand, in Escherichia coli and Saccharomyces cerevisiae (organisms expected to have very large effective population sizes) selection for efficient translation seems to determine codon frequencies, particularly in genes expressed at high levels (Sharp and Li 1987b). On the other hand, in humans, which have much smaller effective population sizes, there is as yet no evidence of selection among synonymous codons (Karlin and Mrazek 1996). In our study, the results show that codon usage bias increases with the level of gene expression, as measured by the codon adaptation index, in the genome of B. pseudomallei. Selection may be due either to a direct effect of translation time on fitness or to the extra energy cost of proof-reading associated with longer translation times. It is therefore easy to see why, in the genome of B. pseudomallei, selection will be stronger, leading to greater bias, in a highly expressed gene whose codons are used more often.

Although the C. reinhardtii (Naya et al. 2001) and Echinococcus spp. (Fernandez et al. 2001) genomes have high GC contents, there is little evidence that genome composition shapes codon usage in these two genomes; in D. melanogaster, by contrast, the GC content is uniformly higher at silent sites in coding regions than in putatively neutrally evolving introns. In the B. pseudomallei genome the GC content is 68%, and the third codon position tends to be ‘G’ or ‘C’ in the highly expressed genes, which also have a high GC content. The CAI value and GC3s are also significantly correlated (r = 0.806, P < 0.01). These results support the view that highly expressed genes tend to use ‘C’ or ‘G’ at synonymous positions more than lowly expressed genes do. The overall codon usage patterns (Table 1) and the Nc-plot (Fig. 3) also confirm that nucleotide compositional mutation bias is the relatively weaker influence on codon usage in the B. pseudomallei genome.

Apart from gene expression level and gene composition, gene length also plays a role in shaping B. pseudomallei codon usage. In the Drosophila genome (Comeron et al. 1999), longer genes have lower codon usage bias, whereas in the S. pneumoniae genome (Hou and Yang 2000) longer genes have higher expression levels and higher codon usage bias. This indicates that different genomes have different gene lengths suited to their own requirements, and that there are no universal rules relating gene length and expression level across all genomes. In this study, the longer genes had higher expression levels and higher codon usage bias; we argue that this positive correlation could be caused by selection to avoid missense errors during translation. Since the cost of producing a protein is proportional to its length, selection in favor of codons that increase accuracy should be greater in longer genes, and long genes should therefore have higher synonymous codon bias.

For Chlamydia trachomatis and Thermotoga maritima, it has been reported that codon choices are influenced by the hydropathy level of each protein (Romero et al. 2000; Zavala et al. 2002). In this study, codon usage is significantly positively correlated with the hydrophobicity of each protein in B. pseudomallei. The link between hydropathy and codon usage may arise because many of the highly expressed sequences are hydrophilic, as they accomplish their function in the aqueous media of the cell.

At the same time, we defined the 21 codons shared by the above-mentioned two comparisons as the optimal codons of B. pseudomallei (Table 2). These will be useful in the design of degenerate primers, the introduction of point mutations, the modification of heterologous genes, and the investigation of the evolutionary mechanisms of species at the molecular level.

In summary, our work has provided a basic understanding of the mechanisms of codon usage bias and some useful information for improving the expression of target genes in vivo and in vitro. As more complete genomes are studied, different factors appear to shape the pattern of codon usage. This pattern is the result of biological processes (i.e. protein structure and folding, physiological constraints, translation, replication, transcription, mutation, etc.), and hence it becomes imperative to analyze codon usage in the light of this complexity. However, it is still not possible to say that the ‘mutational bias-translational selection’ paradigm is insufficient to explain codon usage in all species: all the ‘new factors’ can, for the moment, be explained in terms of this paradigm, although it is certainly becoming more complex.

Copyright information

© Springer Science+Business Media B.V. 2008