Introduction

For 18 of the 20 amino acids (all except Met and Trp), the degeneracy of the genetic code allows multiple codons to encode the same amino acid, and unequal use of these synonymous codons results in codon usage bias in genes [7, 24]. Codon usage analysis has been applied to prokaryotes and eukaryotes, such as Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Caenorhabditis elegans and humans [4, 16, 25, 27]. Several reports have shown that codon usage bias is highly correlated with tRNA abundance, GC content, mRNA secondary structure, exon splicing constraints, translation rate and gene expression level [12, 18, 26]. The study of codon usage can provide evidence about the molecular evolution of viruses, and analysis of codon usage patterns can also improve our understanding of the relationship between viruses and their hosts.

Bovine viral diarrhea virus (BVDV) is a member of the genus Pestivirus within the family Flaviviridae. The genus also includes classical swine fever virus (CSFV) and Border disease virus (BDV) of sheep [3, 20]. Based on a comparison of the 5’ untranslated region (UTR) and the Npro- and E2-encoding sequences [23, 30], BVDV can be divided into two genotypes: BVDV-1 and BVDV-2 [21, 22]. The genome of each genotype is a single positive-stranded RNA of approximately 12.3 kb, consisting of a single large open reading frame (ORF) flanked by 5’ and 3’ untranslated regions [6, 8]. BVDV strains can grow in epithelial cell cultures with a cytopathic (CP) or noncytopathic (NCP) effect [17].

Although BVDV is highly genetically variable, little information about the synonymous codon usage patterns of BVDV genomes has been acquired to date [13, 29]. To our knowledge, this is the first report of codon usage analysis of BVDV. In this study, we analyzed the codon usage data and base composition of 22 available complete ORFs of BVDV to obtain clues to the genetic evolution of this virus.

Materials and methods

Sequence data

A total of 22 BVDV genomes, consisting of 14 strains of genotype 1 and 8 strains of genotype 2, were used to analyze the relevant factors of synonymous codon usage patterns and nucleotide contents in this study. The genotype, phenotype, country of isolation and GenBank accession numbers of these strains are listed in Table 1. In addition, 22 different well-conserved genes of Bos taurus were selected to examine the relationship between codon preferences in the host and the viruses (Table 2). All of the abovementioned coding sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/Genbank/).

Table 1 Information about the 22 BVDV genomes used in this study
Table 2 Information about 22 conserved genes of Bos taurus

Calculation of relative synonymous codon usage

To investigate the patterns of synonymous codon usage without the confounding influence of amino acid composition among all BVDV samples, the relative synonymous codon usage (RSCU) values of codons in the ORFs of BVDV were calculated according to a formula described in previous reports [25, 32]:

$$ {\text{RSCU}} = \frac{g_{ij}}{\sum\limits_{j = 1}^{n_{i}} g_{ij}}\, n_{i} $$

where \( g_{ij} \) is the observed number of the jth codon for the ith amino acid, which has \( n_{i} \) types of synonymous codons. A codon with an RSCU value greater than 1.0 shows positive codon usage bias, while a codon with a value less than 1.0 shows negative codon usage bias. An RSCU value of 1.0 means that the codon is chosen equally and randomly among its synonyms.
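The RSCU definition above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study; the names `CODON_TABLE`, `count_codons` and `rscu` are hypothetical, and the input is assumed to be an in-frame ORF given as a DNA or RNA string using the standard genetic code.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code with bases ordered U, C, A, G ("*" marks stop codons).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codons(orf):
    """Count in-frame codons of an ORF given as a DNA or RNA string."""
    seq = orf.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def rscu(orf):
    """Return {codon: RSCU} for amino acids that have synonymous codons
    and occur at least once in the ORF."""
    counts = count_codons(orf)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        n_i = len(codons)                       # number of synonymous codons
        total = sum(counts[c] for c in codons)  # all codons of this amino acid
        if n_i == 1 or total == 0:
            continue  # Met/Trp, or amino acid absent from the ORF
        for codon in codons:
            values[codon] = counts[codon] * n_i / total
    return values
```

For example, `rscu("GCTGCTGCTGCC")` gives an RSCU of 3.0 for GCU and 1.0 for GCC, because GCU accounts for three of the four alanine codons in a family of four synonyms.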

The effective number of codons

The effective number of codons (ENC) is used to measure the deviation of BVDV codon usage from expected random usage and is independent of hypotheses involving natural selection [5]. ENC values range from 20 to 61, and the stronger the codon preference in a gene, the smaller its ENC value. In an extremely biased gene where only one codon is used for each amino acid, this value would be 20, and if all codons were used equally, it would be 61 [28, 31]. The formulas for ENC are as follows:

$$ {\text{ENC}} = 2 + \frac{9}{F_{2}} + \frac{1}{F_{3}} + \frac{5}{F_{4}} + \frac{3}{F_{6}} $$
$$ F = \frac{n\sum\limits_{i = 1}^{k} P_{i}^{2} - 1}{n - 1}\quad (n > 1),\quad P_{i} = \frac{n_{i}}{n} $$

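Here, n is the observed number of codons used for an amino acid, k is the number of its synonymous codons, and \( P_{i} \) is the usage frequency of the ith codon (\( n_{i}/n \)). \( F_{2} \), \( F_{3} \), \( F_{4} \) and \( F_{6} \) are the mean F values of the amino acids with two, three, four and six synonymous codons, respectively. ENC is influenced by the amino acid content of the gene and its length.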
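As a rough illustration of the two formulas above, the following Python sketch computes F for each amino acid, averages it within each degeneracy class and combines the averages into ENC. The function name `enc` is hypothetical. Capping the result at 61 and raising an error when a degeneracy class is missing are assumptions made here for simplicity, not details taken from this study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code with bases ordered U, C, A, G ("*" marks stop codons).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def enc(orf):
    """Effective number of codons of an in-frame ORF (DNA or RNA string)."""
    seq = orf.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    f_values = defaultdict(list)  # degeneracy class k -> F of each amino acid
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n <= 1:
            continue  # Met/Trp, or too few observations (F needs n > 1)
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f_values[k].append((n * sum_p2 - 1) / (n - 1))

    total = 2.0  # contribution of Met and Trp
    # (degeneracy class, number of amino acids in that class)
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        if not f_values[k]:
            raise ValueError(f"no usable amino acid with {k} synonymous codons")
        mean_f = sum(f_values[k]) / len(f_values[k])
        if mean_f <= 0:
            raise ValueError(f"mean F for class {k} is not positive")
        total += n_amino_acids / mean_f
    return min(total, 61.0)  # assumption: values above 61 are capped at 61
```

With one codon per amino acid every F equals 1 and the function returns 20; with perfectly uniform usage the finite-sample correction pushes the raw value above 61, which is why the cap is applied.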

The fraction of each codon within its synonymous family

Codon frequency normalizes the codon observations to a fraction for each codon within its synonymous family [1]. To examine the degree of similarity in codon usage between BVDV and its host animal (Bos taurus), the fraction of each codon within its synonymous family was compared between the 22 ORFs of BVDV and the 22 genes of Bos taurus. A total of 59 standard codons were used, excluding the single codons AUG (Met) and UGG (Trp) and the three termination codons.

Statistical analysis

Principal component analysis (PCA) was conducted to analyze the major trends in codon usage pattern among BVDV samples. PCA is a statistical method that performs a linear mapping to extract features from an input distribution that are optimal in the mean squared error sense, and it can be used by self-organizing neural networks to form unsupervised preprocessing modules for classification problems [15]. To minimize the effect of amino acid composition on codon usage, each ORF was represented as a 59-dimensional vector, in which each dimension corresponds to the RSCU value of one sense codon, excluding AUG (Met), UGG (Trp) and the three stop codons.
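To make the PCA step concrete, the sketch below extracts the first principal axis from a list of RSCU vectors and reports the fraction of total variance it explains. It is a minimal stdlib-only illustration, not the procedure used in the study: power iteration on the covariance matrix stands in for a full eigendecomposition, and the function name `first_principal_axis`, the iteration count and the random seed are all assumptions.

```python
import math
import random


def first_principal_axis(rows, iterations=1000, seed=0):
    """Return (scores, explained_fraction) for the first principal axis.

    rows: one numeric vector per ORF (e.g. 59 RSCU values), all the same length.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))
    if total_variance == 0.0:
        raise ValueError("all rows are identical; no variance to explain")

    def cov_times(v):
        return [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]

    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    rng = random.Random(seed)
    axis = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = cov_times(axis)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data; change seed")
        axis = [x / norm for x in w]

    eigenvalue = sum(a * b for a, b in zip(axis, cov_times(axis)))
    scores = [sum(r[j] * axis[j] for j in range(d)) for r in centered]
    return scores, eigenvalue / total_variance
```

The returned scores play the role of the first-axis coordinate of each ORF, and the explained fraction corresponds to the share of total variation reported for that axis. The sign of the scores is arbitrary, as in any PCA.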

A Spearman’s rank correlation analysis was used to identify relationships among nucleotide content, RSCU and principal component factors of BVDV. A linear least-square regression was conducted to evaluate the correlation between the fraction of synonymous codons in BVDV and that in the genes of Bos taurus. General average hydrophobicity (GRAVY) and aromaticity scores were used to investigate hydrophobic properties of the targeted proteins. Both scores of each protein were obtained using the software Codon W 1.2.4.
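The quantities used in this analysis can be sketched as follows. This is an illustrative stand-in for the CodonW calculations, not a reproduction of them: GRAVY is assumed to be the mean Kyte-Doolittle hydropathy of the residues, aromaticity is assumed to be the fraction of Phe, Tyr and Trp residues, and Spearman's coefficient is computed as the Pearson correlation of average ranks. All function names are hypothetical, and no p-value is computed.

```python
import math

# Kyte-Doolittle hydropathy scale (assumed basis of the GRAVY score).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy over the standard residues of a protein sequence."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def aromaticity(protein):
    """Fraction of aromatic residues (Phe, Tyr, Trp)."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(aa in "FWY" for aa in residues) / len(residues)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length lists."""
    return pearson(ranks(x), ranks(y))
```

In this setting, `x` could hold one value per strain (for example the first-axis score) and `y` the matching GRAVY or nucleotide content values.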

Results

The characteristics of synonymous codon usage in BVDV

To investigate the extent of codon usage bias in BVDV, the RSCU values of all codons in the 22 BVDV strains were calculated. Only one preferred codon, AGU, has U at the third position; all of the remaining preferred codons end with A, C or G (Table 3). Moreover, the BVDV genome is A+U-rich, with an A+U content ranging from 53.63% to 55.11% (mean 54.46%, S.D. 0.35), yet most of the preferentially used codons are G/C-ended (G/C-ended:A/U-ended = 10:8), suggesting that the percentage of G+C at the third position may influence the pattern of synonymous codon usage (Table 4). The ENC values of these BVDV ORFs are similar, varying from 50.69 to 52.6, with a mean of 51.43 and an S.D. of 0.46. These data show that the extent of codon preference in BVDV genes is basically stable.

Table 3 Synonymous codon usage in BVDV ORFs
Table 4 Nucleotide content of 22 BVDV genomes

Genetic relationship based on synonymous codon usage

Principal component analysis was carried out to identify trends in codon usage bias among the ORFs. We detected one major trend in the first axis (\( f_{1}^{'} \)), which accounted for 26.51% of the total variation, and another in the second axis (\( f_{2}^{'} \)), which accounted for 13.02% of the total variation. A plot of \( f_{1}^{'} \) and \( f_{2}^{'} \) for each ORF is shown in Supplementary Fig. 1. Compared with the scattered BVDV genotype 1 strains, all BVDV genotype 2 strains clustered more tightly. Interestingly, there appears to be a clear geographical demarcation within the BVDV-1 group.

Fig. 1 The relationship between the effective number of codons (ENC) and the GC content of the third codon position (GC3)

Compositional properties of all BVDV genomes

Natural selection and mutation pressure are thought to be the main factors that account for codon usage variation in different organisms. The A%, U%, C%, G% and (C+G)% values were compared with the A3%, U3%, C3%, G3% and (G+C)3% values, respectively, and an interesting and complex correlation was observed. In detail, the (C+G)3% values have highly significant correlations with the A%, U%, C%, G% and (C+G)% values, indicating that (C+G)3% may reflect an interaction between mutation pressure and natural selection. In contrast, the U% and C% values did not correlate with the A3%, U3%, G3% and C3% values (Table 5). Both cases suggest that nucleotide constraints possibly influence synonymous codon usage in BVDV. Correlation analysis was also used to examine the relationships among ENC values, (G+C)3% values and (C+G)% values. A highly significant correlation was observed between ENC and (C+G)% (Spearman r = 0.765, p < 0.01), and a significant correlation was observed between ENC and (G+C)3% (Spearman r = 0.534, 0.01 < p < 0.05), indicating that codon usage bias is influenced by nucleotide constraints.

In addition, the correlation between the \( f_{1}^{'} \) value and the A%, C%, G%, U%, A3%, C3%, G3%, U3%, (G+C)% and (G+C)3% values of each strain was analyzed. Significant correlations were found between nucleotide composition and synonymous codon usage (Table 6), revealing that most of the codon usage bias among ORFs of BVDV strains is directly related to base composition. We found that \( f_{1}^{'} \) also had a significant negative correlation with the general average hydrophobicity (GRAVY) of each protein (Spearman r = -0.737, p < 0.01) and a negative correlation with the aromaticity of each protein (Spearman r = -0.455, p = 0.033), indicating that the expressed sequences are hydrophilic, since they accomplish their functions in the aqueous media of the cell.
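The nucleotide contents compared in this section can be derived from an ORF as in the sketch below. It is a simplified illustration: overall and third-position contents are taken over every codon of the ORF, whereas dedicated software may exclude Met, Trp and stop codons from third-position statistics. The function name `composition` and the dictionary keys are hypothetical.

```python
def composition(orf):
    """Percent base content overall and at third codon positions.

    Assumes an in-frame ORF given as a DNA or RNA string; every codon is
    counted, including Met, Trp and stop codons (a simplification).
    """
    seq = orf.upper().replace("T", "U")
    seq = seq[: len(seq) - len(seq) % 3]  # drop a trailing partial codon
    if not seq:
        raise ValueError("sequence contains no complete codon")
    third = seq[2::3]  # every third base, starting at codon position 3
    result = {}
    for base in "AUCG":
        result[base + "%"] = 100.0 * seq.count(base) / len(seq)
        result[base + "3%"] = 100.0 * third.count(base) / len(third)
    result["GC%"] = result["G%"] + result["C%"]
    result["GC3%"] = result["G3%"] + result["C3%"]
    return result
```

Collecting these values for each strain gives the per-strain lists on which a rank correlation with ENC or the first-axis score can be computed.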

Table 5 Correlation analysis between the A, U, C, G content and the A3, U3, C3, G3 content in all ORFs
Table 6 Correlation analysis between \( f_{1}^{'} \) and nucleotide content

Effect of other factors on codon usage

As shown in Figure 1, a plot of actual ENC values against both the (G+C)3% and the expected ENC value provides a useful display of the main features of codon usage patterns. The curve indicates the expected codon usage if it is influenced only by the (G+C)3% value of the genome:

$$ {\text{ENC}} = 2 + s + 29\left[ s^{2} + \left( 1 - s \right)^{2} \right]^{-1} $$

where s represents the given (G+C)3% value [31]. However, all of the points lie below the expected curve, with lower ENC values than expected, which suggests that although codon usage bias is influenced by mutational pressure, other factors must also contribute to the variation of codon usage in these genes. Therefore, we performed a further correlation analysis between \( f_{1}^{'} \) from the principal component analysis and the GRAVY and aromaticity scores of each protein (Table 6).
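The expected curve is simple to evaluate numerically. The sketch below assumes that s is expressed as a fraction between 0 and 1 rather than as a percentage; the function names `expected_enc` and `enc_gap` are hypothetical.

```python
def expected_enc(s):
    """Expected ENC when codon usage depends only on GC3 content.

    s is the (G+C)3 content as a fraction in [0, 1] (assumption: not percent).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def enc_gap(observed_enc, s):
    """Observed minus expected ENC; negative values lie below the curve."""
    return observed_enc - expected_enc(s)
```

The curve peaks at s = 0.5, where the expected value is 60.5, and falls to 31 and 32 at s = 0 and s = 1. A negative `enc_gap` for a strain corresponds to a point lying below the expected curve in the plot.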

Comparison of codon usage in BVDV and its host

The average proportion of each codon within its synonymous family in BVDV (excluding strain no. 14, which was isolated from swine) was plotted against that in Bos taurus to explore the relationship between BVDV and its host in codon usage. When both proportions are less than or equal to 0.15, the codon is defined as having a low frequency of usage; when one proportion is at least twice the other, the difference in frequency is considered great. The plot showed a clear linear relationship between BVDV and Bos taurus, indicating that the virus and host have very similar patterns of codon usage (\( r^{2} \) = 0.697). The patterns indicate that the least frequently used codons in the host were also the non-preferred codons of the virus, such as UCG (Ser), CCG (Pro), ACG (Thr), CGU, CGC, CGA, CGG (Arg) and GCG (Ala), whereas some codons were highly scattered, including CUA (Leu), AGG (Arg), AUA and AUU (Ile). Linear regression analysis was also performed to investigate the relationship between the codon usage pattern of strain 14 and that of the other BVDV strains. There was no significant difference between the two patterns (P < 0.05).
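The comparison described above combines a least-squares fit with a simple threshold rule, which can be sketched as follows. The function names and the labels "low", "different" and "similar" are hypothetical, and checking the low-frequency rule before the two-fold rule is an assumption, since the text does not state which takes precedence.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y on x: (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy * sxy / (sxx * syy)


def classify(f_virus, f_host):
    """Label one codon from its fraction in the virus and in the host."""
    if f_virus <= 0.15 and f_host <= 0.15:
        return "low"        # rarely used in both virus and host
    high, low = max(f_virus, f_host), min(f_virus, f_host)
    if high >= 2 * low:
        return "different"  # one fraction is at least twice the other
    return "similar"
```

Here `x` and `y` would hold the 59 codon fractions for the host and the virus, respectively, and `classify` would be applied to each codon in turn.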

Discussion

Natural selection is a phenomenon that alters the behavior and fitness of living organisms within a given environment. It is the driving force of evolution. Mutation pressure is the change in some gene frequencies due to the repeated occurrence of the same mutations. There are not many biologically realistic situations where mutation pressure is the most important evolutionary process. However, for RNA viruses, the mutation rate is sometimes high enough that mutation pressure needs to be considered.

It is well established that synonymous codon usage reveals genetic information about some viral genomes [10, 14]. In this study, the evidence suggests that the synonymous codon usage bias in BVDV genes is low (mean ENC = 51.43, greater than 40). Together with published data on codon usage bias in other RNA viruses, such as influenza A H5N1 virus and SARS coronavirus, with mean ENC values of 50.91 and 48.99, respectively [10, 33], this indicates that RNA viruses share a similarly low degree of codon usage bias. Bahir et al. also reported that there is a strong resemblance in codon usage between viruses and their host cells [2]. This suggests that low codon bias may help BVDV to replicate efficiently in host cells.

The general association between codon usage indices and composition constraints shows that mutation pressure plays an important role in determining codon usage variation in BVDV. This is supported by the highly significant correlation between the codon usage index (\( f_{1}^{'} \)) and the A%, U%, G%, C%, A3%, U3%, G3% and C3% values (Table 6). The relationship between the actual ENC values and (G+C)3% is weaker than that of the expected values (Fig. 1). We suggest that mutation pressure is one of the main factors responsible for the variation of synonymous codon usage in BVDV genomes. Further analysis showed that the C3% values of BVDV isolates were low, with an average C3 content of 17.47% and an S.D. of 3.05, yet six preferential codons end with C (Table 3). Meanwhile, the U3% value is higher than the C3% value (mean U3%:mean C3% = 19.97:17.47), but only one U-ended codon, AGU, is preferentially used. This indicates that natural selection is possibly involved in the patterns of synonymous codon usage. No correlation was found between C% or U% and A3%, U3%, G3% or C3% (Table 5), suggesting that nucleotide constraints are involved in codon usage patterns due to low U% and C% values. Aromaticity is one of the factors in variations in amino acid usage [19]. The \( f_{1}^{'} \) values had a negative correlation with the aromaticity of each protein (Table 5). In this study, the degree of aromaticity had a negative correlation with codon usage bias of BVDV, suggesting that natural selection may be involved in BVDV evolution.

Fig. 2 A plot of average proportions of codons of BVDV within its synonymous family (excluding strain no. 14, which was isolated from swine) and Bos taurus (excluding the codons for Trp, Met and the three termination codons)

BVDV was first reported in 1946 [11], and the scattered distribution of the 14 BVDV-1 strains may imply that diversity among BVDV-1 strains has increased as the virus has evolved (Supplementary Fig. 1). Three BVDV-1 strains isolated from Asia were different from the other BVDV-1 strains, implying that the strains isolated from Asia were distantly related to American or European strains. However, the strains from America were more closely related to those from Europe than to those from Asia. The low diversity in BVDV-2 might result from the limited number of samples. It is most likely that codon usage bias in BVDV is related to genotype and geographic factors.

The remarkable similarity in the codon usage patterns between the viruses and Bos taurus reveals that natural selective pressure gives BVDV higher adaptability to its host. This adaptability makes it possible for the virus to survive in the host cell and to use the components of the cell to produce more of itself. However, there is no evidence that the viruses are generally adapted to the codon usage patterns of their host (AUU, CUA, AGG, and AUA), and this is consistent with mutational bias theory [1]. Although it has been reported that isolate 14 was first found in swine, its nucleotide content is similar to that of strains originating from cattle, suggesting that strain 14 is also a possible cattle-origin virus.

In this study, our analysis reveals that codon usage bias in BVDV is low, and mutation pressure is the main factor that affects codon usage variation in BVDV. Other factors, including base composition, genotype, geography, GRAVY, and even aromaticity may also significantly influence codon usage bias.

Although our study provides a basic understanding of the codon usage patterns of BVDV and the roles played by mutation pressure and natural selection, a more comprehensive analysis is needed to reveal more information about codon usage bias variation within BVDV viruses and the other responsible factors.