Introduction

The phenomenon of synonymous codon usage bias exists in a wide range of paradigms from prokaryote to eukaryote. Due to different genomes having their own characteristic patterns of synonymous codon usage [1], it has not been easy to provide a satisfactory explanation for the particular pattern found in a given genome. Compositional constraints and translational selection are thought to be the main factors accounting for codon usage variation among genes in different organisms such as Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Dictyostelium discoideum [2], Drosophila melanogaster [3] and Caenorhabditis elegans [4]. However, in some prokaryotes with extremely high A + T or G + C contents [2] and human [5], mutation bias is the major factor accounting for the variation in codon usage. In contrast, it was reported that translational selection at silent sites played the most important role in shaping codon usage in Zea mays [6] and Arabidopsis thaliana [7]. Recently, codon usage was suggested to be related to gene function [8] and protein secondary structure [9].

Codon usage information has also been analyzed for viruses, such as human immunodeficiency virus (HIV) [10], nucleopolyhedroviruses [11], cauliflower mosaic virus (CMV) [12], human RNA viruses [13], H5N1 virus [14], and hepatitis A virus [15]. For example, HIV has a marked codon usage bias, due to its strong preference for the A nucleotide [10]. Rubella virus has a genomic G + C content of 0.70 [16]. Codon usage in Epstein-Barr virus (a DNA virus) may have an influence on regulation of latent versus productive infection [17]. In contrast, in nucleopolyhedroviruses codon usage appears to be simply a consequence of uneven base composition [11]. These published studies are mostly restricted to particular groups of viruses and have usually addressed phylogenetic questions [11, 1820]. Moreover, mutational pressure rather than translational selection is the most important determinant of the codon bias in some RNA viruses [11, 13, 14, 21, 22]. However, a recent study showed that the G + C compositional constraint is the main factor that determines the codon usage bias in iridovirus genomes [23]. Clearly, studies of the synonymous codon usage in viruses can reveal information about the molecular evolution of individual genes and such information would be relevant in understanding the regulation of viral gene expression and also to vaccine design where the efficient expression of viral proteins may be required to generate immunity [24].

Foot-and-mouth disease (FMD) is a highly contagious disease of cattle and other cloven-hoofed animals. Apart from the influence on animal health and welfare, the economic impact of an outbreak of FMD can be of great importance for a country’s export trade. The 2001 European outbreak of FMD which mainly affected the UK is estimated to have cost of 6000 million Euros [25]. Effective vaccines and stringent control measures have enabled FMD eradication in most developed countries, which maintain unvaccinated, seronegative herds in compliance with strict international trade policies. However, Outbreaks with devastating economic consequences still occur in many developing regions of Asia, Africa and South America, posing a serious problem for commercial trade with FMD-free countries. FMD is caused by FMD virus (FMDV), a small, non-enveloped virus that contains a single stranded positive-sense RNA genome of about 8500 nucleotides from the Aphthovirus genus of the Picornaviridae family. The capsid is made of four proteins (VP1, VP2, VP3 and VP4) and there are seven serotypes (A, O, C, Asia I, SAT 1, SAT 2, and SAT 3) that can be further divided into many genotypes according to nucleotide difference in the capsid proteins [26]. Although genome sequences of FMDV have been published and some studies have been performed on them in recent years [2730], little codon usage analysis is available. Such information might give some clues to the features of virus biology and some evolutionary information of this virus, and also it is of interest to understand the factors that shape codon usage in this species. In the present study, the codon usage bias was analyzed in these seven serotypes of FMDV genomes. The key evolutionary determinants of codon usage bias in these viruses were also investigated.

Materials and methods

FMDV genome sequences

A total of 40 FMDV genomes were used in this study (Table 1), including 23 Type O genomes, four type A, six type C, two type Asia I , one Type SAT1, three Type SAT2, and one Type SAT3. The complete sequences these FMDV genomes were obtained from EMBI (http://www.ebi.ac.uk/cgi/) and NCBI (http://www.ncbi.nlm.nih.gov/). Serial number (SN), length of each genome, EMBI or GenBank accession numbers are listed in Table 1. FMDV genomes SN 6, 9, 13 and 19 have been reported to be from cattle. FMDV genomes SN 14, 23 were from pigs and FMDV genomes SN 15,16 and 17 have been reported to pig-adapted isolates carrying a deletion in 3A (codons93–102). Alterations or deletion in this region are associated with the reduced ability of the virus to cause FMD in cattle. A program based on Perl has been developed to extract the annotated ORF sequences from each genome.

Table 1 Genomes examined, strain, length and accession numbersa

Synonymous codon usage measures

In order to examine synonymous codon usage without the confounding influence of amino acid composition of different ORF samples, relative synonymous codon usage values (RSCU) of different codons in each ORF sample was calculated as described previously [31]. Additionally, the ‘Effective Number of Codons’ (ENC) was used to quantify the codon usage bias of a ORF [32], which is the best overall estimator of absolute synonymous codon usage bias (Comeron and Aguade, 1998) [51]. The reported value of ENC is always between 20 (when only one codon is used for each amino acid) and 61 (when all codons are used equally) [32]. GC3S, the frequency of the nucleotide G + C at the synonymous third codon position (excluding Met, Trp and the termination codons), was also used to calculate the extent of base composition bias. Similarly, GC1S and GC2S are the frequencies of the nucleotide G + C at the synonymous first and second position, respectively. In oder to examine the relationship between codon usage variation and compositional constraints, the GC1S, GC2S and GC3S of each selected ORF have been calculated.

Correspondence analysis

Correspondence analysis (COA) was used to investigate the major trend in codon usage variation among ORFs. In order to minimize the effect of amino acid composition on codon usage, each ORF is represented as a 59-dimensional vector. Each dimension corresponds to the RSCU value of one sense codon (excluding Met, Trp and the stop codons). The axis of a correspondence analysis identifies the source of the variation among a set of multivariate data points. This method has been successfully used to investigate the variation of RSCU values among ORFs [3338].

Statistical analysis

Correlation analysis was carried out using the Spearman’s rank correlation analysis method. Cluster analysis was done by Hierarchical cluster method and the distances between selected sequences were calculated by the Euclidean distance method.

Analysis tools

The RSCU, GC3s, ENC, G + C, GRAVY, Length value, and COA were calculated using the program CodonW version 1.4 (http://codonw.sourceforge.net). The correlation analysis and Cluster analysis were carried out by using the multianalysis software SPSS version 13.0 (http://spss.com).

Results

Synonymous codon usage in FMD

In order to better understand the synonymous codon usage variation among FMDV isolates, the average codon usage was analyzed for these 40 FMDV genomes. As shown in Table 2 (see http://222.210.17.171/yak/FMDV/ Table 2.doc for details)., the codons ending in G or C are favored and the global pattern of codon usage is very similar among all FMDV genome examined, which indicates that there have not been significant compositional changes since these species diverged from their last common ancestor.

Table 2 Synonymous codon usage in different Serotypes of FMDVa

Compositional properties of coding sequences

In order to investigate if these 40 FMD viruses’ coding sequences examined display similar compositional features, mean ENC and GC3S, were calculated and summarized in Table 3. The values of ENC among these FMDV genomes examined are very similar, which vary from 50.02 to 52.79 with a mean value of 51.58 and S.D. of 0.67. All the ENC values of these strains are more than 50. The data suggest the homogeneity of synonymous codon usage among FMDV genomes examined. The concept is further supported by the GC3S values for each FMDV strain, which range from 61.00 to 68.00% with a mean of 63.80% and S.D. of 0.01. Therefore, taken together with published data of codon usage bias among some RNA viruses [5, 13, 3941], we could conclude that codon usage bias in FMDV genomes is less biased, and there is no significant variation of synonymous codon usage among FMDV seven serotypes.

Table 3 The values of ENC, GC3S and ORF length for these 40 FMDV strains

Correspondence analysis on codon usage

In order to investigate the variation of RSCU values among ORFs, correspondence analysis (COA) was implemented on these 40 FMDV genomes examined as a single-dataset-based on the RSCU value of each strain’ ORFs. As mentioned, the axis of a correspondence analysis identifies the source of the variation among a set of multivariate data points. The four largest trends in codon usage among these ORFs were observed: the first axis accounts for 31.20% of all variation among genomes, whereas the next three axes accounts for 16.50%, 12.40%, and 8.10%, respectively.

Effect of mutational bias on codon usage

In order to investigate if the evolution of codon usage bias is controlled by mutation pressure or by natural selection, firstly, G + C content at the first and second codon positions (G + C12) was compared with that at synonymous third codon positions (G + C3S) (Fig. 1). A highly significant correlation was observed (r = 0.432, P < 0.05), indicating that patterns of base composition are most likely the result of mutation pressure, and not natural selection, since the effects are present at all codon positions. Second, for each strain, actual codon bias was plotted against both G + C3S and the expected ENC value if codon usage bias is solely due to biased base composition (i.e., G + C content). Result showed that the actual codon usage indices are close to the values expected from their G + C composition, although all are slightly lower (Fig. 2). Thirdly, we plotted the first and second axis values in COA and GC3S values of each strains (Fig. 3A,B). The patterns of codon usage in different strains also appear to be closely related to the GC content on the third codon position. Linear regression analysis has been implemented to each virus genome to find some correlation between synonymous codon usage and nucleotide compositions of the ORFs. We also found that axis 1 coordinates are correlated with GC3S and GC (r = 0.843, P < 0.01; r = 0.813, P < 0.01), while there are significant correlations between axis 2 value and GC3S(r = 0.355, P < 0.05). Taken together, these analyses indicate that most of the codon usage bias among these FMDV genomes is directly related to the nucleotide composition. Furthermore mutational bias is the major factor responsible for the variation of synonymous codon usage among ORFs in these virus genomes.

Fig. 1
figure 1

Correlation between GC content at first and second codon positions (GC12) with that at synonymous third codon positions (GC3S). ORF, open reading frame

Fig. 2
figure 2

Distribution of the codon usage index, ENC, and GC content at synonymous third codon positions (GC3S). The curve indicates the expected codon usage if GC compositional constraints alone account for codon usage bias. ENC: effective number of codons; ORF, open reading frame

Fig. 3
figure 3

Correlation between the first (A), second axis (B) values in COA and GC3S values of each FMDV strains. COA: Correspondence analysis; GC3S: the frequency of the nucleotide G + C at the synonymous third codon position (excluding Met, Trp and the termination codons); ORF, open reading frame

Effect of other factors on codon usage

Generally, mutational bias and natural selection, such as, ORF length and the hydrophobicity of each protein are thought to be the factors accounting for the codon usage variation among ORFs in different organisms. However, whether there is any selection pressure that also contributes to the codon usage variation among these virus ORFs and which selection pressure determines the codon usage variation remained to be understood. Therefore, we performed a linear regression analysis on axis 1, axis 2 and axis 3 between the hydrophobicity of each protein and ORF length. It was found that axis 2 coordinates are also significantly correlated with the hydrophobicity of each protein (r = 0.659, P < 0.01), while axis 1 and axis 3 coordinates are also significantly correlated with the ORF length (r = −0.564, P < 0.01; r = 0.610, P < 0.01) respectively, indicating that the hydrophobicity of each protein and ORF length are also critical in affecting these viruses’ codon usage, although they were less important than that of the mutational bias.

Cluster

Based on the RSCU variation of these 40 FMDV strains examined, a cluster tree was generated by using a hierarchical cluster method. As shown in Fig. 4, these 40 FMDV strains examined were divided into three main lineages (I, II and III).

Fig. 4
figure 4

Cluster tree based on the relative synonymous codon usage (RSCU) values of 40 FMDV strains examined. The cluster tree was generated by using Hierarchical cluster method

Lineage I mainly contained strains of A, O, C and Asia 1, and branched to give eight sublineages (I1—I8). Five C [2730, 39] strains were clustered into sublineage I1 and another C [26] was clustered into sublineage I5 separately. Sublineage I2 was composed of the three strains A [23, 24, 38], but another A [25] was clustered into sublineage I8 separately. Two O [1, 6] strains were clustered into sublineage I3. Sublineage I4 was composed of the two O strains [2], while 14 A strains [35, 813, 1720, 22] and one Asia 1 strain [31] were clustered into sublineage I6. In the end, other Asia 1strain [32] was clustered into sublineage I7 separately.

Lineage II comprised five strains [3336, 37] from SAT1, SAT2 and SAT3.

Lineage III included five strains [7, 1416, 21] from O.

Discussion

Codon usage bias in FMDV was investigated in the present study. Up to now, it is unclear how FMDV serotypes might affect codon choice in FMD viruses and we used RSCU [42], ENC [32], COA [43, 44] and GC3S, to measure the synonymous codon usage bias in order to minimize the effects of serotypes on codon bias, which have been successfully used to analyze the variation of codon usage among different viruses species [11, 13, 14, 22, 45, 46]. The analysis revealed that codon usage bias is low in most cases. As a case in point, the values of ENC vary from 50.02 to 52.79 (S.D. = 0.67) and the GC3S values range from 61.00 to 68.00% (S.D. of 0.02). The average ENC value of 51.53 among 40 strains can be compared to those seen in other organisms such as H5N1 virus , severe acute respiratory syndrome Coronavirus (SARSCoV) and Porcine adenovirus where mean values of 50.91, 48.99, and 38.97, respectively, have been reported [14, 22, 47]. In the case of human RNA viruses, the average ENC value also probably lies close to 45 since the distribution of values for [13] individual virus ranges uniformly from just under 38.5 to 58.3 [13]. One possible explanation about why FMDV had had a lower codon usage bias than other RNA viruses [13, 14, 22, 47] is that a low bias is advantageous to viruses that need to replicate efficiently in vertebrate cells, with potentially distinct codon preferences.

A general mutational bias, which affects the whole genome would, certainly account for the majority of the codon usage variation. The genome base compositions affected the codon usage in Entamoeba histolytica genome [48]. Although the Chlamydomonas reinhardtii genome [49] had high GC contents, there was a little evidence that the genome composition shaped the codon usages in this genome. C. elegans showed a weak, but statistically significantly negative correlation between ‘G + C’ content and gene expression levels [50]. In human RNA viruses, H5N1 virus and SARS Coronavirus, mutation pressure rather than natural (translational) selection is the most important determinant of the codon bias. In this study, the general association between codon usage bias and base composition suggests that mutational pressure, rather than natural (translational) selection is supported by the highly significant correlation between GC12 and GC3S (r = 0.432, P < 0.05), and the result of ENC-plot (Fig. 2). The fact that GC content varies in a similar way at all codon positions is usually assumed to be the result of mutational bias. A general mutational bias, which affects the whole genome would certainly account for the majority of the codon usage variation. A similar pattern of codon usage has been reported amongst some RNA viruses [13]. Since mutation rates in RNA viruses are much higher than those in DNA viruses [40], it is understandable that mutation pressure is the determinant source of codon usage bias in the 40 FMDV strains included in this study. Therefore, mutational bias is the major factor responsible for the variation of synonymous codon usage among ORFs in these virus genomes.

In Drosophila [51] genome, longer genes had lower codon usage bias. But, the longer genes had higher expression level and higher codon usage bias in S.penumoniaes genome [52]. In some virus, such as nucleopolyhedroviruses [11], H5N1 virus [14], SARS Coronavirus [22], adenoviruses [47], ORF length has no effect on the variations of synonymous codon usage. Those indicated that different genomes had different ORF lengths which accommodated their particular genome’s best requirements, and there were not universal rules about ORF length and codon usage in all genomes. In this study, the ORF length had played a critical role in affecting FMDV codon usage. The mechanisms that lead this is not clear, which is needed a more comprehensive analysis.

It was reported that codon choices were influenced the hydropathy level of each protein in Chlamydia trachomatis, and Thermotoga maritime [53, 54]. In this study, codon usage is significantly positively correlated with the hydrophobicity of each FMDV. The link with hydropathy and codon usage may be caused by the fact that the expressed sequences are hydrophilic just because they accomplish their function in the aqueous media of the cell.

Up to date, phylogenetic analyses have been performed largely on FMDV sequences from the 1D coding regions. These analyses have permitted the discrimination among serotypically related FMDV strains [55]. Another analysis of the complete FMDV genomes indicated phylogenetic incongruities between different genomic regions which were suggestive of interserotypic recombination [56]. In this study, phylogenetic analyses based on the RSCU values of the 40 FMDV strains examined were carried out using a hierarchical cluster method. The result indicated complex phylogenetic relationships also exist between different FMDV isolates as determined by Carrillo et al. 2005 [56]. For instance, Five O strains (HKN/2002, Tau-Yuan TW97 (Taiwan, 1997), Chu-Pei (Taiwan) (pig strain), Yunlin/Taiwan/97, and lz were clustered into Lineage III; One C strain (Argentina/85) was clustered into sublineageI5 and one Asia 1 strain (YNBS/China/58) was clustered into sublineageI6. Although they all belong to EURO-SA topotype [55], five C in sublineage I1 are sub-serotype C1 and one C in sublineage I5 is sub-serotype C3. The three A strains in Sublineage I2 was EURO-SA topotype and A strains sublineageI8 is Asia topotype, O stains in sublineageI3 is EURO-SA topotype, while 14 O strains in sublineage I6 belong to ME-SA topotype, PanAsia stains, were clustered into. O strains in Lineage III are Cathay topotype. All five SAT strains were clustered into Lineage II. Taken together, just as stated by other researchers [30, 5658], these results suggest that FMDV sequences may undergo intertypic recombination, which conceivably undergo complex recombination events and the result of phylogenetic analyses based on the RSCU values fail to display serotype-specific phylogenetic relationships. These observations raise interesting questions about FMDV genome evolution in nature and the relative contribution of recombination to the generation of FMDV genetic and population diversity.

As we know FMD is highly contagious, affects all cloven-hoofed animals, and is caused by FMDV that exists as antigenically diverse serotypes and intra-typical variants (subtypes); Some published results has shown that the overall extent of codon usage bias in RNA viruses is low and there is a little variation in bias between genes or genomes [13, 14, 21, 22]. Our analysis revealed that although there are a few, variations in codon usage bias among different FMD viruses, codon usage bias in FMDV is low. Due to lack of data and politic factors, in this article, it is impossible to obtain information on the virus isolation background, vaccination etc, but clearly, a more comprehensive analysis is needed to reveal more information about codon usage bias variation within and among FMD viruses and what other factors are responsible, including the influence of factors, such as cell tropism, principal host species, method of transmission, and viral genetic structure. Such information would then allow us to more precisely judge the relative importance of mutation pressure versus natural selection in determining base composition and codon usage in these pathogens.

The recent European epizootic of FMD has made us aware of the great economic losses to be endured because no effective preventive and control measures are available for FMD [59]. Up to our knowledge, our work is the first report of the codon usage analysis on FMDV, which has provided a basic understanding of the mechanisms for codon usage bias and the processes governing the evolution of FMDV