Introduction

Porcine epidemic diarrhea virus (PEDV), first isolated in the early 1970s from pigs in Europe, is an enteric pathogen belonging to the genus Alphacoronavirus, family Coronaviridae, order Nidovirales [1,2,3]. Porcine epidemic diarrhea (PED) caused by PEDV is a highly contagious disease with high mortality in piglets and symptoms of severe diarrhea and dehydration. PED has a worldwide distribution and is prevalent in many pig-raising countries in Europe, Asia, America, and Australia, causing serious damage to the pig farming industry [4,5,6]. PEDV is an enveloped, single-stranded, positive-sense RNA virus with a genome of about 28 kb, including 5’ and 3’ untranslated regions (UTRs) and seven open reading frames (ORFs) [5]. More than two-thirds of the PEDV genome is occupied by ORF1a and ORF1b, which encode replicase polyproteins. In addition, four structural proteins, the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins, are encoded by ORF2, ORF4, ORF5, and ORF6, respectively, and one accessory protein is encoded by ORF3, which is located between the S and E coding regions.

A previous report indicated that PEDV originated in bats. Of concern, this virus can replicate not only in porcine cells but also in bat and human cells, and receptor homologs for the virus are present in various species, suggesting that it has the potential for interspecies transmission [7]. Molecular and phylogenetic analysis of PEDV isolates has shown that insertions, deletions, and point mutations are common and that strains can be divided into several separate clades corresponding to different geographical locations [8, 9]. Therefore, given the high frequency of genetic mutations in PEDV and the pandemic of coronavirus disease 2019 in humans, further research on this virus is needed. Here, a comprehensive analysis of codon usage bias (CUB) was performed to investigate the evolution of PEDV.

Materials and methods

Sequence data and compositional analysis

Nucleotide sequences of the coding regions of each PEDV strain were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/genbank) and concatenated into a complete coding region. A total of 551 full-length PEDV sequences that were available up to March 2020 were obtained. The accession numbers and other detailed information including the strain name, collection year, and country are listed in Supplementary Table S1. Overall nucleotide compositions (A%, U%, G%, and C%) and G + C content (GC%) of each PEDV coding region were analyzed using BioEdit (version 7.0.9.0). The nucleotide composition at the third codon position (A3s, U3s, C3s and G3s) and the GC content at the third codon position (GC3s) of synonymous codons were determined using CodonW 1.4.4. The detailed information is shown in Supplementary Table S2.
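The compositional measures described above can be illustrated with a minimal Python sketch. It is not the BioEdit or CodonW implementation: the function names and the toy sequence are hypothetical, and GC3s is approximated here as the G + C fraction at the third position of all codons other than AUG, UGG, and the three stop codons.

```python
from collections import Counter

# Codons with no synonymous alternative (AUG, UGG) and the stop codons are
# left out of third-position statistics (assumption of this sketch).
EXCLUDED = {"AUG", "UGG", "UAA", "UAG", "UGA"}


def to_rna(seq):
    """Upper-case the sequence and write T as U."""
    return seq.upper().replace("T", "U")


def nucleotide_percentages(seq):
    """Return the percentage of A, C, G and U in a coding sequence."""
    counts = Counter(to_rna(seq))
    total = sum(counts[b] for b in "ACGU")
    return {b: 100.0 * counts[b] / total for b in "ACGU"}


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    rna = to_rna(seq)
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    third = [c[2] for c in codons
             if c not in EXCLUDED and set(c) <= set("ACGU")]
    if not third:
        return float("nan")
    return sum(b in "GC" for b in third) / len(third)


# Hypothetical toy sequence: AUG GCU GCC UUA UGG UAA
toy = "AUGGCUGCCUUAUGGUAA"
percentages = nucleotide_percentages(toy)
toy_gc3s = gc3s(toy)
```

For the toy sequence, only GCU, GCC, and UUA contribute to the third-position statistic, so one of three third positions is G or C.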

Relative synonymous codon usage (RSCU)

The relative synonymous codon usage (RSCU) value is the ratio of the observed frequency of one specific synonymous codon to the expected frequency (no codon usage bias), which is an important measure of codon usage bias [10, 11]. If the RSCU value of a codon is 1.0, there is no codon usage bias, and there is equal usage of the codons for that amino acid. If the RSCU value is higher than 1.0, there is positive codon usage bias, and if the value is less than 1.0, there is negative codon usage bias. RSCU values higher than 1.6 and lower than 0.6 indicate overrepresented and underrepresented codons, respectively [12, 13]. In this study, the RSCU value of each codon was calculated using the CodonW program, and the corresponding values for the host of PEDV (swine) were downloaded from the codon usage database (https://www.kazusa.or.jp/codon).
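The RSCU definition above can be sketched in a few lines of Python. This is an illustrative reimplementation under the standard genetic code, not the CodonW code used in the study; the function names and the toy sequence are hypothetical. For each amino acid, the observed count of a codon is divided by the count expected if all synonymous codons were used equally.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return RSCU values for the 59 codons that have synonymous alternatives."""
    rna = seq.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) < 2:  # skip stop codons, Met and Trp
            continue
        total = sum(counts[c] for c in family)
        for c in family:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(family) / total if total else float("nan")
    return values


def classify(value):
    """Apply the thresholds quoted in the text (> 1.6 and < 0.6)."""
    if value > 1.6:
        return "overrepresented"
    if value < 0.6:
        return "underrepresented"
    return "unbiased"


# Hypothetical toy sequence: three GCU codons and one GCC codon (alanine).
toy_rscu = rscu("GCUGCUGCUGCC")
```

In the toy sequence, GCU accounts for three of four alanine codons, giving an RSCU of 3.0, while the unused GCA and GCG codons have an RSCU of 0.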

Effective number of codons (ENC) and ENC plot analysis

The effective number of codons (ENC) indicates the degree of codon usage bias and reflects the extent of preference of synonymous codons [14]. ENC values range from 20 to 61. A value of 20 indicates a maximum level of codon bias, whereas a value of 61 indicates a complete lack of bias [15]. In general, if the ENC value is ≤ 35, the coding sequence is considered to have significant codon usage bias [16]. ENC values were calculated for each PEDV sequence using the CodonW program.

ENC plot analysis was used to identify factors influencing codon usage variation. An ENC-GC3s plot was generated using GraphPad Prism 8. The expected ENC value for each GC3s was calculated using the following equation:

$$ENC_{expected} = 2 + s + \left( {\frac{29}{{s^{2} + (1 - s)^{2} }}} \right)$$

where s is the GC3s value. If codon usage is constrained only by mutation pressure, the points will lie on or around the expected curve. However, if other factors also constrain codon usage, the observed ENC values will lie below the expected curve [17].
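As a quick numerical illustration of the expected-ENC equation, the following Python sketch evaluates the curve at a given GC3s value. The function name is hypothetical, and s is assumed to be expressed as a fraction between 0 and 1 rather than as a percentage.

```python
def expected_enc(s):
    """Expected ENC under mutation pressure alone, for a GC3s fraction s."""
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


# With equal use of G/C and A/U at the third position the curve peaks at 60.5.
peak = expected_enc(0.5)

# Value of the curve at the mean GC3s reported in this study (35.10%).
at_mean_gc3s = expected_enc(0.351)
```

At s = 0.351 the expected value is approximately 55.6, so an observed ENC of about 48 falls below the curve.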

Principal component analysis (PCA)

To identify major variation trends in codon usage patterns among the different PEDV strains, PCA was performed by analyzing the relationship between variables and samples [18]. Specifically, the RSCU values of each strain were represented as a 59-dimensional vector corresponding to the 59 synonymous codons (excluding AUG, UGG, and the three stop codons), and these vectors were then transformed into uncorrelated variables (principal components) [16]. The first two axes capture the largest share of the variation in codon usage among strains, so PCA plots were constructed using these two axes. PCA was performed using SPSS software (version 22), and the figures were drawn using GraphPad Prism 8.0.

Hydropathicity and aromaticity analysis

General average hydropathicity and aromaticity are two major factors affecting translation and natural selection [19]. GRAVY and Aroma values are used to evaluate these factors and to represent the frequencies of hydrophobic and aromatic amino acids, respectively [20]. These values were calculated using the CodonW 1.4.4 program.
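For illustration, both indices can be computed from a translated protein sequence with a short Python sketch. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and defines Aroma as the fraction of Phe, Tyr, and Trp residues; it is not the CodonW code, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")


def _residues(protein):
    """Keep only the 20 standard residues (drops stops, gaps and unknowns)."""
    return [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]


def gravy(protein):
    """Mean hydropathy of the protein (general average hydropathicity)."""
    residues = _residues(protein)
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def aroma(protein):
    """Fraction of aromatic residues (Phe, Tyr, Trp) in the protein."""
    residues = _residues(protein)
    return sum(aa in AROMATIC for aa in residues) / len(residues)
```

A positive GRAVY value indicates a predominantly hydrophobic protein, and a negative value a predominantly hydrophilic one.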

Parity rule 2 (PR2) bias plot analysis

Parity rule 2 (PR2) plot analysis was used to evaluate the effect of mutation pressure and natural selection at the third codon position for the four-codon amino acids. The PR2 plot distinguishes between AU bias [A3/(A3 + U3)] and GC bias [G3/(G3 + C3)]. Generally, if the effects of mutation pressure and natural selection are equal, the points will sit in the center of the plot, where A = U and G = C [21]. The PR2 plot was drawn using GraphPad Prism 8.0.
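The two PR2 coordinates can be sketched in Python as follows. The sketch assumes that "four-codon amino acids" refers to the eight fourfold-degenerate codon families (Ala, Gly, Pro, Thr, Val, and the four-codon boxes of Arg, Leu, and Ser); the source does not list the families, and the function name is hypothetical.

```python
from collections import Counter

# First two bases of the fourfold-degenerate codon families (assumption):
# Ala GCN, Arg CGN, Gly GGN, Leu CUN, Pro CCN, Ser UCN, Thr ACN, Val GUN.
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CU", "CC", "UC", "AC", "GU"}


def pr2_bias(seq):
    """Return (AU bias, GC bias) at the third position of fourfold codons."""
    rna = seq.upper().replace("T", "U")
    third = Counter(
        rna[i + 2]
        for i in range(0, len(rna) - 2, 3)
        if rna[i:i + 2] in FOURFOLD_PREFIXES
    )

    def ratio(a, b):
        total = third[a] + third[b]
        return third[a] / total if total else float("nan")

    au_bias = ratio("A", "U")  # A3 / (A3 + U3)
    gc_bias = ratio("G", "C")  # G3 / (G3 + C3)
    return au_bias, gc_bias


# Hypothetical toy sequences.
balanced = pr2_bias("GCAGCUGGGGGC")  # third positions A, U, G, C
skewed = pr2_bias("GCUGCUGCAGGC")    # third positions U, U, A, C
```

A sequence with balanced third-position usage maps to the center of the plot (0.5, 0.5); values below 0.5 on either axis indicate a preference for U over A or for C over G.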

Neutrality plot analysis

Neutrality plot analysis is a widely used method for investigating the effects of natural selection and mutation pressure on codon usage by plotting the GC12s values against GC3s values [22, 23]. Each point represents an independent PEDV strain, and a regression line is plotted. If the regression line lies near the diagonal (slope = 1), this indicates that mutation pressure was the dominant cause of the codon usage bias, with weak external selection pressure. Alternatively, natural selection is considered the main force shaping codon usage if the slope of the regression line tends toward 0. The neutrality plot was drawn using GraphPad Prism 8.0.
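The regression underlying a neutrality plot can be sketched with ordinary least squares in plain Python. This is an illustration only, since the study used GraphPad Prism; the function name and the toy GC3s and GC12s values are hypothetical and are not data from this study.

```python
import math


def neutrality_regression(gc3, gc12):
    """Least-squares fit of GC12 on GC3; returns (slope, intercept, r)."""
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    syy = sum((y - mean_y) ** 2 for y in gc12)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical toy values for three strains (fractions, not percentages).
toy_gc3 = [0.30, 0.35, 0.40]
toy_gc12 = [0.44, 0.45, 0.46]
slope, intercept, r = neutrality_regression(toy_gc3, toy_gc12)
```

In this toy example the slope is 0.2, far from the diagonal, which under the interpretation above would point to selection rather than mutation pressure as the dominant force.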

Results

Compositional analysis and ENC analysis

The nucleotide U was the most abundant, with a mean value of 33.51 ± 0.058% (mean ± SD), followed by similar amounts of A (24.85 ± 0.024%), G (22.63 ± 0.048%) and C (19.00 ± 0.041%). The mean AU content (58.40 ± 0.077%) was higher than the GC content (41.67 ± 0.077%). Analysis of the nucleotides at the third position of synonymous codons showed that U3s (54.41 ± 0.171%) was more frequent than A3s (23.95 ± 0.076%), C3s (22.86 ± 0.121%) and G3s (22.45 ± 0.138%). The mean GC3s value was 35.10 ± 0.170%, which was also lower than the AU3s value. The ENC values ranged from 47.81 to 48.49, and the mean ENC value was 48.04 ± 0.11 (mean ± SD).

RSCU analysis

RSCU values were calculated for the 59 synonymous codons to determine the codon usage bias of the PEDV genome (Table 1). Ten codons, namely, CUU, CCU, AUU, CGU, ACU, GUU, GCU, UCU, UUG, and GGU were overrepresented (mean RSCU value > 1.6), and eleven codons, including CUA, CCC, CCG, AUA, CGA, CGG, ACG, GUA, GCG, GGA, and GGG, were underrepresented (mean RSCU value < 0.6). Among the 18 most abundantly used codons, three were G-ended and 15 were U-ended, indicating that codons ending with U were the most frequently used. PEDV and swine were found to have only three preferred codons in common (CAG [Gln], AAG [Lys], GAG [Glu]).

Table 1 Overall RSCU of the 551 collected PEDV genomic sequences

ENC plot analysis and correlation analysis

An ENC-GC3s plot was generated to investigate the role of mutational pressure in shaping codon usage bias. As shown in Fig. 1, all points in the plot, regardless of the country or continent from which the isolate originated, lay below the standard curve. The correlation between ENC values and the relative amount of each nucleotide (A, C, G, U, and GC) was analyzed, and a strong correlation was found, with P-values well below 0.01 (Table 2). To further explore the effect of mutational pressure on codon usage, the correlation between nucleotide composition and codon composition (A3s, C3s, G3s, U3s, and GC3s) was also analyzed (Table 2). Significantly positive correlations were identified for all homologous nucleotide comparisons (for example, A versus A3s), whereas significantly negative correlations were identified for some of the nonhomologous comparisons.

Fig. 1

ENC plots showing the relationship between ENC values and GC3s values. Each point represents one PEDV strain. (A) points classified by country. (B) points classified by continent. All of the points were below the standard curve, indicating that mutational pressure and other factors both play a role in PEDV codon usage.

Table 2 Correlation analysis of nucleotide composition and ENC

Principal component analysis (PCA)

PCA was used to detect variations in codon usage and to construct the distributions of each vector. The first axis accounted for 25.33% of the total variation, while the next three axes accounted for 14.63%, 10.02%, and 7.23%, respectively (Fig. 2A). As the first two axes together accounted for 39.96% of the total variation in codon usage, PCA plots of the first and second axes were constructed based on different countries, continents, and dates (Fig. 2B, C, D). Subsequently, the correlation between the first two axes and nucleotide composition was analyzed (Table 3).

Fig. 2

Principal component analysis of PEDV. Each point represents one PEDV strain. (A) The distributions of the first 20 vectors by PCA. Columns represent the relative inertia, and the curve represents the cumulative inertia. (B) PCA plot constructed with the first two axes according to country. (C) PCA plot according to continent. (D) PCA plot according to date

Table 3 Correlation between the first two axes and nucleotide composition

The role of natural selection in codon usage bias

The correlations among GRAVY, Aroma, axis1, axis2, ENC, and nucleotide composition were analyzed to assess the influence of natural selection (Table 4). Most of the correlations were significant, with P-values far below 0.01. Of note, GRAVY was not correlated with axis1 or axis2, suggesting that, of the two amino acid properties, aromaticity is more closely associated than hydropathicity with codon usage variation in PEDV proteins.

Table 4 Correlation analysis for GRAVY, Aroma, the first two axes, ENC, and nucleotide composition

PR2 bias plot analysis

A PR2 bias plot for PEDV genomes showed that all points were located at the bottom right of the plot, indicating that C and U were used more frequently than G and A in the third codon position (Fig. 3).

Fig. 3

PR2 plot analysis. The PR2 bias plot was calculated for the complete PEDV genome.

Neutrality plot analysis

The main factors determining the codon usage pattern in PEDV genomes were identified by neutrality plot analysis (Fig. 4). A slight negative correlation was found between GC12s and GC3s values (r = -0.13, P < 0.01). The slope of the regression line was only 0.0416, suggesting that natural selection was the main force, while mutation pressure played a minor role in the codon usage pattern of the PEDV genome.

Fig. 4

Neutrality analysis performed by plotting GC12s values against GC3s values for the complete genome. The regression line is represented by the black straight line. The regression equation is also shown.

To explore whether natural selection acted equally on the structural and non-structural PEDV proteins, a neutrality plot analysis was also carried out for each gene (Fig. 5). Significant correlations were found between the GC12s and GC3s values of all genes (P < 0.01), but only the ORF3 gene showed a strong correlation (r = 0.67). The slopes for the ORF1ab, S, ORF3, E, M, and N genes were 0.016, -0.230, -0.400, 0.100, 0.036, and 0.078, respectively. Thus, the contribution of natural selection (1 − |slope|) was 98.4%, 77.0%, 60.0%, 90.0%, 96.4%, and 92.2%, respectively, demonstrating that natural selection played a dominant role in the codon usage of each PEDV gene.
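The percentages reported above are consistent with taking the contribution of natural selection as 1 − |slope|. This relationship is inferred from the reported numbers rather than stated as a formula in the text. A short Python check, with a hypothetical function name, reproduces the per-gene values and the genome-level value:

```python
# Regression slopes reported for each gene in the neutrality plot analysis.
slopes = {
    "ORF1ab": 0.016, "S": -0.230, "ORF3": -0.400,
    "E": 0.100, "M": 0.036, "N": 0.078,
}


def selection_contribution(slope):
    """Contribution of natural selection (%) taken as 1 - |slope| (assumed)."""
    return round((1.0 - abs(slope)) * 100.0, 2)


contributions = {gene: selection_contribution(s) for gene, s in slopes.items()}

# Genome-level slope reported for the complete coding region.
genome_contribution = selection_contribution(0.0416)
```

The remaining share, |slope| × 100, is the portion attributed to mutation pressure under the same convention.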

Fig. 5

Neutrality analysis performed for each gene. (A) ORF1ab, (B) S, (C) ORF3, (D) E, (E) M, (F) N

Discussion

PED is a highly contagious disease with worldwide distribution, and the high genetic variability of PEDV has been confirmed repeatedly [3, 24, 25]. Although the genetic diversity and evolution of PEDV had been investigated previously, a systematic analysis of the codon usage bias of the complete PEDV genome is still needed. A previous study reported the codon usage bias of PEDV with 43 strains collected up to 2014 [11], and two articles reported the CUB of individual regions of the genome (the N gene and the ORF3 gene) [26, 27], but in view of the large number of new complete PEDV genome sequences reported in the past few years, it was necessary to perform a new comprehensive analysis to fill the gaps. In this study, we used 551 PEDV sequences to determine the codon usage bias in the PEDV genome.

RSCU analysis is widely applied to standardize the analysis of codon usage bias. In this study, 10 codons were overrepresented and 11 codons were underrepresented, revealing considerable codon usage bias. In general, codons that are used less by the host are selected in the process of evolution of coronaviruses. Here, we found that PEDV and its host have only three preferred codons in common, implying that PEDV tends to use codons that are less used by the host in order to avoid competition with the host cell during gene translation.

The nucleotide content and codon usage compositions can reflect the effect of mutation pressure on CUB. For PEDV, the AU content was higher than the GC content, and likewise, there was a preference for AU in the third codon position. In RSCU analysis, most of the abundantly used codons were U-ended (15/18). Moreover, analysis of the correlation between nucleotide composition and codon composition indicated a significant correlation in most cases. The coding sequences of PEDV were found to be AU-rich, and mutational pressure was found to be an important force affecting the codon usage bias. In addition, an analysis of the correlation between the first two axes and nucleotide composition showed a weak correlation in more than half of the comparisons, suggesting that mutation pressure and other factors contribute to the codon usage in PEDV strains.

The ENC is a useful measure of the extent of the codon usage bias of the virus, and low codon usage bias might make it easier for the virus to overcome host defense mechanisms. For complete PEDV genome sequences, the ENC values ranged from 47.81 to 48.49, and the mean ENC value was 48.04 ± 0.11. For comparison, the mean ENC values for other coronaviruses are as follows: (1) bovine coronavirus (mean ENC = 43.78) [28], (2) SARS coronavirus (mean ENC = 48.99) [29], (3) porcine deltacoronavirus (mean ENC = 52.85) [30]. Although the mean ENC value of PEDV was not the highest among them, it is still greater than 45, which shows that the codon usage bias is somewhat low. Compared with the mean ENC value for PEDV in the first report, published six years ago (47.91 ± 0.13) [11], the latest value has changed little and has a similarly small standard deviation. Of note, the mean ENC value of SARS-CoV-2 has been reported to be 51.90 ± 2.59 (mean ± SD) [31], which is higher than that of most coronaviruses, suggesting that it is well adapted to its host and able to overcome its defense mechanisms.

ENC plot and PR2 bias plot analyses are widely applied to evaluate the influence of mutation pressure and natural selection [14, 21]. In our ENC plot analysis, all of the points were below the standard curve, revealing that mutational pressure and other forces, such as natural selection, gene length, tRNA abundance, or RNA structure, together shaped the PEDV codon usage pattern. PR2 bias plot analysis further indicated that natural selection and mutation pressure influenced the codon usage bias of PEDV with unequal contributions.

PCA was performed to identify major variation trends. As shown in Fig. 2B, most of the points from one country were concentrated in the same region, confirming that natural selection is an important factor in shaping the codon usage bias. In particular, data from China and South Korea were comparatively dispersed, suggesting that the contribution of mutation pressure to shaping the CUB in these strains was greater than that in strains from other countries. Furthermore, American strains formed two groups in the PCA plot, exactly corresponding to two genetically different PEDV strains isolated in the USA (U.S. PEDV prototype and S-INDEL-variant strains) [32]. Phylogenetic analysis indicated that the PEDV prototype strains emerging in the USA originated from China [33], which accounts for the observation that some of the data points from China and the USA were concentrated in the same region of the plot. As shown in Fig. 2C, the data from Asia were more dispersed than those from other continents, suggesting a stronger contribution of mutation pressure in Asian strains and a more conserved evolutionary process on other continents. Regarding the collection date, the data points tended to cluster together up to 2013 but tended to disperse later (Fig. 2D). This suggests that the impact of mutation pressure in shaping the CUB changed over time, from weak to strong and then to weak. Due to the large number of complete PEDV genome sequences uploaded to NCBI in recent years, more analyses can be performed with high accuracy, allowing the whole evolutionary process of the PEDV codon usage pattern to be studied.

Neutrality plot analysis is one of the most common methods for exploring the effects of natural selection and mutation pressure. The results showed that the relative constraint (natural selection) was 95.84%, strongly suggesting that natural selection was the main force in determining the CUB. This conclusion was also supported by other analyses. First, in the correlation analyses, significant positive correlations were also observed in nonhomologous nucleotide comparisons, which implies that natural selection might play a considerable role in determining the codon usage pattern. Second, many significant correlations were observed with GRAVY and Aroma, which are two major indexes for natural selection. Third, the PCA plot showed that most of the data points, when classified by country, continent, or date, were relatively concentrated. We also compared our latest data with the first study of the CUB of PEDV. More significant correlations with GRAVY and Aroma were observed, which is consistent with the results of the neutrality plot analysis (an increase in relative constraint). The PCA plot based on the country of origin showed that points were less clustered together than they had been previously, which is in agreement with the small decrease in the mean ENC value, both of which reveal a higher codon usage bias at the present time. In addition, we carried out a neutrality plot analysis for each gene. Compared with previous studies performed with the N gene (natural selection = 65.19%) and the ORF3 gene (natural selection = 76.32%), the results of our study showed that natural selection remained the main force shaping the codon usage pattern.

In conclusion, the results of this comprehensive analysis of the synonymous codon usage patterns in the PEDV genome revealed a low level of codon usage bias and suggested that natural selection was the primary force influencing codon usage. Given the growing PEDV epidemic situation, the latest data on the evolution of this virus will benefit further basic research.