Introduction

Synonymous codons are not used randomly. Rather, some codons are used more frequently than others. Mutational pressure and translational selection are thought to be the main factors that account for codon usage variation among genes in different organisms [1–4]. Understanding the extent and causes of biases in codon usage is essential to the understanding of viral evolution, particularly the interplay between viruses and the immune response [5]. However, in contrast to many organisms such as bacteria, yeast, Drosophila, and mammals, where codon usage bias and nucleotide composition have been studied in great detail [6], the factors shaping synonymous codon usage bias and nucleotide composition in viruses, especially in animal viruses, have been studied only to a limited extent. For human RNA viruses, it has been observed that codon usage bias is related to mutational pressure, G + C content, the segmented nature of the genome and the route of transmission of the virus [7]. For some vertebrate DNA viruses, genome-wide mutational pressure, rather than natural selection for specific coding triplets, is the main determinant of codon usage [5]. Analysis of the bovine papillomavirus type 1 (BPV1) late genes has revealed a relationship between codon usage and tRNA availability [8]. In the mammalian papillomaviruses, it has been proposed that differences from the average codon usage frequencies in the host genome strongly influence both viral replication and gene expression [9]. Codon usage may play a key role in regulating latent versus productive infection in Epstein-Barr virus [10]. Recently, it was reported that codon usage is an important driving force in the evolution of astroviruses and small DNA viruses [11, 12]. Clearly, studies of synonymous codon usage in viruses can reveal much about the molecular evolution of viruses or individual genes. Such information would be relevant in understanding the regulation of viral gene expression.

To date, little codon usage analysis has been performed on classical swine fever virus (CSFV), which is the pathogen that causes classical swine fever (CSF), an economically important and highly contagious disease of swine. Although eradicated from many countries, CSF continues to cause serious problems in different parts of the world [13]. CSFV is an enveloped virus with a single stranded RNA genome, which contains a single open reading frame (ORF) encoding a polyprotein that, following cellular and viral protease-mediated co- and post-translational processing, gives rise to 11–12 final cleavage products [14]. Studies on the phylogenetic relationship of CSFVs have divided the viruses into 3 main genotypes and 10 subgenotypes based on sequence comparisons of 190 nt of E2 sequence [15]. Based on differences in virulence, CSFVs can also be divided into three clusters, namely, highly virulent strains, moderately virulent strains, and avirulent strains [16]. Recently, we have analyzed the positive selection pressure acting on the CSFV envelope protein genes, Erns, E1, and E2, and identified several specific codons subject to diversifying positive selection in Erns and E2 [17]. In order to better understand the characteristics of the CSFV genome and to reveal more information about the viral genome, we have analyzed the codon usage and dinucleotide composition. In this report, we sought to address the following issues concerning codon usage in CSFV: (i) the extent and causes of codon bias in CSFV; (ii) the relationship between CSFV genotype and codon usage; and (iii) how CSFV virulence might affect codon usage.

Materials and methods

Materials

Three complete genomes of CSFV were previously sequenced by our laboratory (AF407339, AF091507, and AF092448) [18, 19]. The other available complete CDS of CSFV were downloaded from GenBank in March 2008, and sequences with >99% sequence identity were excluded. A total of 35 CSFV genomes [18–33] representing 6 subgenotypes (1.1, 1.2, 2.1, 2.2, 2.3, and 3.4) and all 3 levels of virulence (highly virulent strains, moderately virulent strains, and avirulent strains) were used in this study. The genotyping of the 35 CSFV genomes was performed using the CSFV sequence database (http://viro08.tiho-hannover.de/eg/eurl_virus_db.htm) based on 190 nt of E2 sequence [34]. The serial number (SN), mononucleotide composition of each genome, GenBank accession numbers, subgenotype, virulence, and other detailed information are listed in Table 1.

Table 1 Classical swine fever virus genomes used in this study

Codon usage indices

Relative synonymous codon usage (RSCU) values of each codon in each ORF were used to measure the synonymous codon usage [35]. RSCU values are largely independent of amino acid composition and are particularly useful in comparing codon usage between genes, or sets of genes that differ in their size and amino acid composition. The effective number of codons (ENC), which is the best overall estimator of absolute synonymous codon usage bias [37], was used to quantify the codon usage bias of an ORF [36]. The ENC values range from 20 to 61. The stronger the codon preference in a gene, the smaller the ENC value is. In an extremely biased gene where only one codon is used for each amino acid, this value would be 20; in an unbiased gene, it would be 61. The index GC3s was used to calculate the fraction of the nucleotides G + C at the synonymous third codon position (excluding Met, Trp, and the termination codons). Similarly, GC12s is the fraction of the nucleotides G + C at the synonymous first and second positions. The general average hydrophobicity (GRAVY) score and the frequency of aromatic amino acids (Aromo) in the hypothetical translated gene product were also computed. All the indices mentioned above were calculated using the program CodonW, version 1.4.
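As an illustration of the two simplest indices described above, the following minimal sketch computes RSCU and GC3s for a single ORF. It is not the CodonW implementation: the function names are hypothetical, the standard genetic code is assumed, and RSCU is taken as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for the ORF being analysed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_families():
    """Return {amino acid: [codons]} for amino acids with more than one codon.

    Met, Trp and the stop codons are excluded, leaving 59 codons in total.
    """
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    return {aa: cs for aa, cs in families.items() if aa != "*" and len(cs) > 1}


def count_codons(orf):
    """Count in-frame codons; RNA input (U) is converted to the DNA alphabet."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Codons containing ambiguous bases are skipped.
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(orf):
    """RSCU = observed count * family size / total count of the family."""
    counts = count_codons(orf)
    values = {}
    for codons in synonymous_families().values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this ORF
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def gc3s(orf):
    """Fraction of G + C at the third position of synonymous codons."""
    counts = count_codons(orf)
    syn = [c for cs in synonymous_families().values() for c in cs]
    total = sum(counts[c] for c in syn)
    gc = sum(counts[c] for c in syn if c[2] in "GC")
    return gc / total if total else float("nan")
```

An RSCU value above 1 means the codon is used more often than expected under equal usage within its synonymous family, and a value below 1 means it is used less often. Keys of the returned dictionary use the DNA alphabet (T in place of U).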

Correspondence analysis (COA)

The relationships between variables and samples can be explored using multivariate statistical analysis. Correspondence analysis (COA) was used to study the major trend in codon usage variation among ORFs. In order to minimize the effects of amino acid composition on codon usage, each ORF is represented as a 59-dimensional vector; each dimension corresponds to the RSCU value of one sense codon (excluding AUG, UGG, and stop codons). Major trends within this dataset can be determined using measures of relative inertia and genes ordered according to their positions along the axis of major inertia.

Relative dinucleotide abundance in CSFV ORFs

The relative abundance of dinucleotides in the CSFV ORFs was assessed using the method described by Karlin and Burge [38]. The odds ratio $\rho_{xy} = f_{xy}/(f_x f_y)$, where $f_x$ denotes the frequency of the nucleotide X and $f_{xy}$ the frequency of the dinucleotide XY, was calculated for each dinucleotide. As a conservative criterion, for $\rho_{xy} > 1.23$ (or $\rho_{xy} < 0.78$), the XY pair is considered to be of high (or low) relative abundance compared with a random association of mononucleotides [38].
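The odds ratio and the conservative thresholds described above can be sketched as follows. This is an illustrative reading of the formula, not the authors' script: the function names are hypothetical, and the single-strand (non-symmetrized) form of the ratio is assumed because the CSFV genome is single-stranded RNA.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {XY: f_xy / (f_x * f_y)} for the 16 dinucleotides of one strand."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    mono = Counter(seq)
    # Overlapping dinucleotides: a sequence of length n contains n - 1 of them.
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x in "ACGU":
        for y in "ACGU":
            fx, fy = mono[x] / n, mono[y] / n
            fxy = di[x + y] / (n - 1)
            # The ratio is undefined when either nucleotide is absent.
            ratios[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return ratios


def classify(rho, low=0.78, high=1.23):
    """Apply the conservative thresholds quoted in the text."""
    if rho < low:
        return "under-represented"
    if rho > high:
        return "over-represented"
    return "normal"
```

For example, `classify(0.426)` returns `"under-represented"`, which matches the interpretation of a dinucleotide whose odds ratio falls well below 0.78.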

Statistical analysis

Correlation analysis was carried out using Spearman’s rank correlation analysis method. All statistical analyses, as well as cluster analysis, were carried out using the statistical analysis software SPSS Version 15.0.
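For readers without access to SPSS, Spearman's rank correlation coefficient can be reproduced with a few lines of code. The sketch below is a generic implementation (Pearson correlation of average ranks, with tied values sharing the mean of their ranks); the function names are hypothetical, and it returns only the coefficient, not the P value reported by SPSS.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # one variable is constant
    return cov / math.sqrt(vx * vy)
```

In this study the two inputs would be, for instance, the first-axis COA coordinates and the GC3s values of the 35 ORFs.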

Results

Synonymous codon usage variation in CSFV

In order to investigate the extent of codon bias in CSFV, the RSCU values of the different codons in each ORF were calculated. The details of each ORF and the overall RSCU values of 59 codons in 35 CSFV genomes are shown in Table 1 and supplemental material, respectively. The preferentially used codons were A-ended, C-ended, and G-ended codons (see supplemental material). It is interesting to note that no U-ended codons were used as preferential codons. In order to investigate whether these 35 coding sequences of CSFV display similar compositional features, ENC and GC3s values were calculated (Table 1). The ENC values of the different CSFV ORFs vary from 51.07 to 52.15, with a mean of 51.703 and S.D. of 0.2635. We found that all the ENC values for CSFV ORFs are high. Based on this finding, together with published data on codon usage bias among some RNA viruses [39–43], we conclude that the codon usage bias in the CSFV genome is slight. Similarly, the GC3s values of the CSFV strains, which range from 49.4% to 52.1% with a mean of 50.23% and S.D. of 0.735%, also confirm the homogeneity of synonymous codon usage among different CSFV viruses.

Correspondence analysis of codon usage

To investigate synonymous codon usage variation among CSFV viruses, COA was implemented for all 35 CSFV ORFs selected for this study. Figure 1 depicts the position of each ORF on the plane defined by the first and second principal axes generated by COA on RSCU values of ORFs. The first principal axis accounts for 36.87% of the total variation. The next three axes account for 19.54%, 8.79%, and 7.54% of the variation, respectively. This observation indicates that although the first major axis explains a substantial amount of variation in trends in codon usage, the second major axis also has an appreciable impact on total variation in synonymous codon usage. It is worth noting that several CSFV Chinese C strains that can replicate efficiently in rabbits but not in swine have similar coordinates (Fig. 1) to two CSFV Riems strains, which can replicate efficiently in swine. This suggests that the host may not influence the codon usage bias between the CSFV C strain and other CSFV strains. In fact, our study demonstrated that a 12-nt insertion (CUUUUUUCUUUU) at position 61 of 3′ UTR may be responsible for the characteristics of the CSFV Chinese C strain [44].

Fig. 1

A plot of the values of the first and second axes of each ORF in COA. The first axis accounts for 36.89% of all variation among ORFs and the second axis accounts for 19.54% of the total variation. The box indicates that CSFV Chinese C strains and CSFV Riems strains were clustered together

Mutational pressure is the main factor accounting for codon usage variation in CSFV

Mutational pressure and translational selection are thought to be the main factors that account for codon usage variation in different organisms [1–4]. Hence, in order to establish which factor can explain codon usage in CSFV, the G + C content at the first and second codon positions (GC12s) was first compared with that at the synonymous third position (GC3s). It was found that GC12s and GC3s are significantly correlated (r = 0.483, P < 0.01). This suggests that they are most likely the result of mutational pressure, as natural selection would be expected to act differently on different codon positions. Additionally, Wright [36] suggested that the ENC-plot (ENC plotted against GC3s) be used as part of a general strategy to investigate patterns of synonymous codon usage. Genes whose codon choice is constrained only by a G + C mutation bias will lie on or just below the curve of the predicted values. As shown in Fig. 2, all of the spots lie below the expected curve, indicating that the codon usage bias in these 35 genomes is greatly influenced by the G + C compositional constraints. Furthermore, the correlation between the first or second axis values in COA and GC12s or GC3s values of each strain was analyzed. As shown in Table 4, the first axis value in COA of each selected genome, which contains most of the variation in synonymous codon usage bias between these genomes, is closely correlated with the GC composition at the first, second, and third codon positions. The second axis in the COA of each gene is also closely correlated with the GC12s. This analysis indicated that most of the codon usage bias among different ORFs is directly related to the nucleotide composition. Therefore, the compositional constraint is the main determinant of the variation in synonymous codon usage among different CSFV ORFs.
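The ENC-plot comparison can be illustrated with a short sketch. The expected curve used here is the approximation given by Wright [36] for a gene whose codon choice is constrained only by G + C composition, ENC = 2 + s + 29/(s² + (1 − s)²), where s is the GC3s fraction. The function names are hypothetical and are not part of CodonW or SPSS.

```python
def expected_enc(gc3s):
    """Expected ENC under G + C mutational bias alone (Wright's approximation).

    gc3s is a fraction between 0 and 1, not a percentage.
    """
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def lies_below_curve(observed_enc, gc3s):
    """True if the observed ENC falls below the expected curve at this GC3s."""
    return observed_enc < expected_enc(gc3s)
```

With the mean values reported above (ENC of 51.703 at a GC3s of 0.5023), `lies_below_curve(51.703, 0.5023)` is true, which is consistent with the spots lying below the curve in the ENC-plot.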

Fig. 2

Effective number of codons used in each ORF plotted against the GC3s. The continuous curve plots the relationship between GC3s and ENC in the absence of selection. All of the spots lie below the expected curve

The relative abundance of dinucleotide and CpG suppression also shape the codon usage in CSFV

It has been reported that dinucleotide biases can affect codon bias. To study the possible effect of the composition of dinucleotides on codon usage in CSFV, the relative abundances of the 16 dinucleotides in the 35 CSFV genomes were calculated. As shown in Table 2, the frequencies of occurrence for dinucleotides were not randomly distributed and no dinucleotides were present at the expected frequencies. The relative abundance of CpG showed the most marked deviation from the “normal range” (mean ± S.D. = 0.426 ± 0.018). The relative abundance of UpG and CpC also showed slight deviation from the “normal range” (mean ± S.D. = 1.250 ± 0.018 and 1.262 ± 0.019, respectively). Among the 16 dinucleotides, 6 are correlated with the first axis value in COA and 8 are correlated with the second axis value in COA (Table 3). These observations indicated that the composition of dinucleotides, which is independent of the overall base composition but still the result of differential mutational pressure, also determines the variation in synonymous codon usage among different CSFV ORFs. To study the possible effects of CpG under-representation on codon usage bias, the RSCU values of the eight codons that contain CpG (CCG, GCG, UCG, ACG, CGC, CGG, CGU, and CGA) were analyzed. Of these eight codons, seven [GCG (mean 0.375), UCG (mean 0.125), ACG (mean 0.406), CGC (mean 0.141), CGG (mean 0.200), CGU (mean 0.0794), and CGA (mean 0.139)] were markedly suppressed, while CCG (mean 0.676) was slightly suppressed. To study the possible effects of UpG and CpC over-representation on codon usage bias, codons that contain UpG (UUG, CUG, GUG, UGC, and UGU) or CpC (UCC, CCU, CCC, CCA, CCG, ACC, and GCC) were analyzed. Of the five UpG-containing codons, three [CUG (mean 1.677), GUG (mean 1.408), and UUG (mean 1.366)] were markedly over-used. Since both cysteine codons [UGC (mean 1.082) and UGU (mean 0.918)] begin with UpG, these two UpG-containing codons are almost equally used. Of the seven CpC-containing codons, two [ACC (mean 1.342) and GCC (mean 1.347)] were over-used, and UCC (mean 0.745) was slightly suppressed. Among the remaining four CpC-containing codons, all of which encode proline, CCA (mean 1.520) is markedly over-used; CCG (mean 0.676), which is also a CpG-containing codon, is slightly suppressed; and CCU (mean 0.933) and CCC (mean 0.871) are almost equally used.

Table 2 Relative abundance of the 16 dinucleotides in 35 Classical swine fever virus with complete genomes available
Table 3 Summary of correlation analysis between the first two axes in COA and sixteen dinucleotides in the selected viruses

The effect of selection pressure on codon usage

As shown in Fig. 2, the majority of the actual ENC values are slightly lower than the expected ENC values. This implies that although codon bias is mainly explained by mutational pressure, there are other factors, with less of an effect, that also influence the codon bias. To test whether any selection pressure contributes to the codon usage variation between these CSFVs, we performed a correlation analysis between axis values in COA and the aromaticity or GRAVY score of each polyprotein. It was found that both axis 1 and axis 2 are significantly correlated with the aromaticity score (r = −0.526, P < 0.01, and r = 0.473, P < 0.01, respectively), indicating that the frequency of aromatic amino acids (Phe, Tyr, Trp) in the hypothetical translated gene product of each ORF is also related to the observed variation in codon bias. No significant relationship was found between axis values in COA and GRAVY using Spearman’s correlation (Table 4).

Table 4 Summary of correlation analysis between the first two axes in COA and GC12s, GC3s, GRAVY, or aromaticity in the selected 35 CSFV ORFs

The effect of CSFV genotype and virulence on codon usage

Beyond the factors mentioned above, we were also concerned with how CSFV genotype and virulence might affect codon usage. Based on the variation in RSCU values among the 35 CSFV genomes, a cluster tree was generated by the hierarchical clustering method. As shown in Fig. 3, these 35 CSFV genomes were divided into 7 sublineages. Sublineages I-1 and I-2 contain all subgenotype 1.1 strains, and sublineage I-2 contains almost all avirulent strains in subgenotype 1.1. Sublineages I-3, II-1, II-2, II-3, and II-4 contain the subgenotypes 1.2, 2.1, 2.3, 3.4, and 2.2, respectively. It should be noted that the distance between sublineages II-2 and II-3 is smaller than the distance between sublineages II-2 and II-4 (Fig. 3). Since sublineages II-2 and II-4 contain the subgenotypes 2.3 and 2.2, respectively, both of which belong to genotype 2, the distance between these two sublineages would be expected to be smaller than the distance between sublineage II-2 and sublineage II-3 (which contains subgenotype 3.4). This discrepancy may be due to the special characteristics of strain 39 in subgenotype 2.2 (see Discussion).

Fig. 3

A dendrogram representing the extent of divergence in synonymous codon usage in 35 CSFV strains constructed with the hierarchical clustering method. SG subgenotype; SL sublineage

Discussion

Studies of synonymous codon usage in viruses can reveal much about viral genomes. In the present study, we analyzed synonymous codon usage and dinucleotide composition in CSFV. We found that, as for other viruses such as H5N1 influenza virus (mean ENC = 50.91) [39, 43], SARS-CoVs (mean ENC = 48.99) [40], human Bocavirus (mean ENC = 44.45) [41], and foot-and-mouth disease virus (mean ENC = 51.53) [42], the ENC values for CSFV are high (mean ENC = 51.7), indicating that the overall extent of codon usage bias in CSFV genomes is low. In fact, Jenkins et al. [7] have previously reported that the overall extent of codon usage bias in RNA viruses is low, with an average ENC value close to 45. Nevertheless, we still wished to determine the factors that constrain codon usage in CSFV. According to the selection–mutation–drift model [35, 45], mutational pressure and translational selection are generally thought to be the main factors that account for codon usage variation between genes in different organisms [1–4]. In our study, the general correlation between codon usage bias and base composition suggests that mutational pressure is the main factor that determines codon usage bias in CSFV; this conclusion is also supported by the highly significant correlation between GC12s and GC3s (r = 0.483, P < 0.01) and by the result of the ENC-plot (Fig. 2). Since mutation rates in RNA viruses are much higher than those in DNA viruses [46], it is understandable that mutational pressure is the major cause of codon usage bias in the 35 CSFV strains included in this study.

The majority of the actual ENC values are slightly lower than the expected ENC values (Fig. 2), indicating that there are other factors, albeit with smaller effects, that also influence codon bias. We then asked how CSFV genotype and virulence might affect codon usage. Our cluster analysis revealed that the CSFV genotype also constrains codon usage, since different CSFV strains with the same genotype were clustered together with only one exception, CSFV strain 39 (Fig. 3). CSFV strain 39 (AF407339) was, however, postulated to be a recombinant virus by He et al. [47]. To date, phylogenetic analyses have been performed largely on one or three genomic regions rather than the complete genome, which might limit their ability to genotype recombinant viruses. Our RSCU-based clustering, on the other hand, was based on the complete CDS of each virus. It is therefore expected that the two clustering methods will place recombinant viruses differently. Our results suggest that CSFV strain 39 might indeed be a recombinant virus and also raise interesting questions about CSFV evolution and the relative contribution of intertypic recombination to the generation of CSFV genetic diversity. Furthermore, our results indicate that virulence is not significantly influenced by codon bias, since not all avirulent strains were clustered together. Although 9 of the 11 avirulent strains of subgenotype 1.1 were clustered together (Fig. 3 subgenotype 1.1B), the other avirulent strains were clustered with highly virulent strains, and 5 moderately virulent strains were also not clustered together (Fig. 3). At present, however, only small numbers of complete CDS of CSFV are available, and these cover only six subgenotypes. Clearly, more complete sequences are needed to allow us to make more precise judgments.

Because of a previous report about CpG under-representation in RNA and small DNA viruses [10], we wanted to determine whether the relative abundances of dinucleotides in CSFV affect codon usage. The frequencies of occurrence for dinucleotides were not randomly distributed and no dinucleotides were present at the expected frequencies (Table 2). The general correlation between the axis values in COA and the relative dinucleotide abundances (Table 3) suggests that codon usage in CSFV can also be strongly influenced by underlying biases in dinucleotide frequencies. As a case in point, seven of the eight CpG-containing codons are markedly suppressed and the eighth (CCG) is slightly suppressed. The marked CpG deficiency is a common phenomenon in small eukaryotic viruses [48, 49]. The CpG deficiency was proposed to be related to the immunostimulatory properties of unmethylated CpGs, which are recognized by the host’s innate immune system as a pathogen signature [5, 49]. Indeed, unmethylated CpG motifs in DNA sequences can be recognized by TLR9 [50], and unmethylated CpG motifs in ssRNA may stimulate monocytes through a novel mechanism [51]. This notion was further supported by the fact that CpG is not suppressed in the genomes of most large viruses [48, 49], possibly because they encode a range of proteins that interfere with cellular pathogen recognition. As a case in point, vaccinia poxvirus encodes antagonists of TLRs [52]. In CSFV, Ruggli et al. and our group have shown that the Npro and Erns proteins can prevent both poly(IC)- and NDV-mediated IFN-α/β induction [53–56]. Inhibition by the Npro protein is thought to involve an inactivation of interferon regulatory transcription factor 3 (IRF-3) [57]. However, no evidence has been found to support the notion that the Npro and Erns proteins interfere with the recognition of unmethylated CpG motifs in ssRNA. It is likely that the codon usage bias in CSFV is also related to the selective forces of its host’s innate immune system.

Taken together, our study reveals that codon usage bias in CSFV is slight and that mutational pressure is the main factor affecting codon usage variation in CSFV. Other factors, such as dinucleotide composition, genotype, aromaticity, and even innate immune selective forces, also significantly influence codon usage bias. However, due to a lack of sequence data and detailed information about these isolates, it is currently impossible to perform an exhaustive analysis of CSFV codon usage. Clearly, a more comprehensive analysis, based on more available data, is needed to reveal more about the viral genome. To our knowledge, this work is the first report of codon usage analysis in CSFV, and it provides a basic understanding of the mechanisms that give rise to codon usage bias. The results we have reported are also useful in understanding the processes involved in CSFV evolution.