Frequent disease outbreaks in shrimp culture have resulted in great losses to the industry. A Global Aquaculture Alliance (GAA) study in 2008 estimated that total losses due to diseases over the past 15 years might be on the order of US$ 15 billion. Among the various shrimp pathogens, viruses are responsible for most disease outbreaks. Infections due to white spot syndrome virus (WSSV), yellow head virus (YHV), infectious myonecrosis virus (IMNV), and Taura syndrome virus (TSV) may result in significant production losses and mortalities [11, 17].

Due to the redundancy of the genetic code, the majority of amino acids (except Met and Trp) can be encoded by more than one synonymous codon. During evolution, each species is subjected to specific genomic pressures that can influence its codon usage pattern, resulting in preferred selection of some codons over others within the same synonymous group [12]. Natural selection and mutational pressure have been found to be the two most important factors affecting codon usage in the majority of viruses [2, 27]. The effects of natural selection can be evident in the form of gene expression, translational selection (speed and accuracy) and protein folding [13, 32]. Under mutational pressure, codon usage of a particular region/gene is shaped by its nucleotide composition. For example, a high GC content in a particular genomic region will lead to a GC-biased codon usage pattern. Viruses are dependent on the host cell for their replication and transmission, and thus, the interrelationship of codon usage patterns between the host and virus may have a significant impact on virus survival and evolution. Knowledge of these patterns helps in achieving a better understanding of viral gene expression and regulation, which in turn can be applied to efficient in vitro expression of viral proteins and vaccine development [5, 25].

In this study, we analysed the genomes and codon usage patterns of five RNA and two DNA shrimp viruses. Several codon usage metrics in IMNV, gill-associated virus (GAV), TSV, YHV, Penaeus vannamei nodavirus (PvNV), Penaeus monodon hepandensovirus (also known as hepatopancreatic parvovirus [HPV]) and Penaeus monodon penstyldensovirus 1 (also known as infectious hypodermal and hematopoietic necrosis virus [IHHNV]) were evaluated to investigate the extent of codon bias and influencing factors. Coding sequences (CDSs) for all available genomes of five RNA viruses and two DNA viruses (Table 1) were obtained from NCBI GenBank (http://www.ncbi.nlm.nih.gov).

Table 1 Shrimp DNA and RNA viruses and their GenBank accession numbers

Overall nucleotide compositions (AT and GC) as well as nucleotide compositions at the third position of synonymous codons (AT3 and GC3) were calculated to assess genome composition bias. Except for GAV and PvNV, all shrimp viruses had significantly (Wilcoxon signed-rank test, p < 0.05) AT/AU-biased genomes. All of the following nucleotide values are reported as percentages. IMNV and YHV, with mean ± SD AU values of 62.3 ± 2.8 and 54.2 ± 2.0, showed the highest and lowest bias, respectively (Supplementary Table 1). This AT/AU bias in the majority of shrimp viruses suggests that A/U-ended codons might be preferred over G/C-ended codons. The GC composition at the third position of synonymous codons (GC3) might have a significant impact on the overall codon usage pattern, and the calculated GC3 values in our study ranged from 26.4 ± 2.4 (in IMNV) to 50.5 ± 8.4 (in GAV).
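
As an illustration of these composition metrics, the following minimal Python sketch computes overall GC content and GC3 for a single CDS. It is not the CodonW implementation used in the study; the function names are hypothetical, and the sketch assumes the standard genetic code, an in-frame sequence, and that Met, Trp and stop codons are excluded from the third-position count.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumed; not taken from the study).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gc_percent(seq):
    """Overall GC content of a sequence, as a percentage."""
    seq = seq.upper().replace("U", "T")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc3_percent(cds):
    """GC content at the third position of synonymous codons, as a percentage.

    Codons for Met and Trp, stop codons, and codons containing
    ambiguous bases are skipped.
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    informative = [c for c in codons if CODON_TABLE.get(c, "*") not in "MW*"]
    return 100.0 * sum(c[2] in "GC" for c in informative) / len(informative)
```

The corresponding AT (or AU) values are simply 100 minus the GC values when only unambiguous bases are present.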

Correlations between nucleotide compositions at non-synonymous and synonymous codon positions (GC1, 2 and GC3) were used to evaluate the comparative influences of mutational pressure and natural selection on codon composition. In the case of DNA viruses, significant but negative correlations between GC1, 2 and GC3 values were found for HPV (r = -0.636, p < 0.05) and IHHNV (r = -0.840, p < 0.05). Among RNA viruses, IMNV showed significant correlation (r = 0.868, p < 0.05) between GC1, 2 and GC3, whereas non-significant correlation was observed for YHV. Due to the limited amount of data available, no correlation analysis was performed for GAV, TSV and PvNV. These observations suggest that in the case of HPV, IHHNV and YHV, base composition does not follow the same pattern at all codon positions, implying that natural selection has occurred, while in the case of IMNV, mutational pressure appears to play an important role.
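
The comparison of GC1, 2 with GC3 can be sketched as follows. This is a simplified illustration rather than the procedure used in the study: the helper names are hypothetical, GC3 is taken over all codons (without excluding Met, Trp or stop codons), and only the Pearson coefficient is computed, without a p-value.

```python
from math import sqrt


def gc12_and_gc3(cds):
    """Return (mean GC% at codon positions 1 and 2, GC% at position 3)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    fractions = []
    for position in range(3):
        bases = cds[position:usable:3]
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    gc1, gc2, gc3 = fractions
    return 100.0 * (gc1 + gc2) / 2.0, 100.0 * gc3


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

In use, `gc12_and_gc3` would be applied to every CDS of one virus and the two resulting lists passed to `pearson_r`. A coefficient near +1 indicates that composition varies in the same way at all codon positions.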

Nucleotide compositions (overall as well as at the third position of synonymous codons), effective numbers of codons (ENC) and relative synonymous codon usage (RSCU) metrics were calculated using CodonW 1.4.4 software, developed by John Peden (http://sourceforge.net/projects/condonw) [20]. ENC is used to quantify the codon usage bias in absolute terms. The lowest value of 20 signifies that each amino acid is encoded by one codon, whereas the highest value of 61 indicates completely random codon usage. ENC values revealed the presence of moderate (35 < ENC < 50) codon usage bias in IMNV, GAV, YHV, HPV and IHHNV (Supplementary Table 1). On the other hand, TSV and PvNV, with ENC values (mean ± SD) of 53.7 ± 0.7 and 56 ± 7.4, respectively, showed low bias.
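
For readers unfamiliar with the ENC statistic, the sketch below follows the general form of Wright's estimator, in which a codon homozygosity F is computed per amino acid and averaged within each degeneracy class. It is an illustration only and may differ from CodonW in edge cases. Skipping amino acids with fewer than two observations or a non-positive F, and raising an error when a degeneracy class is empty, are simplifying assumptions of this sketch.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def effective_number_of_codons(cds):
    """Estimate ENC (20 = extreme bias, 61 = no bias) for one CDS."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))

    # Collect observed counts for each degenerate amino acid.
    by_amino_acid = defaultdict(dict)
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":
            by_amino_acid[aa][codon] = counts.get(codon, 0)

    # Homozygosity F = (n * sum(p^2) - 1) / (n - 1), grouped by degeneracy.
    f_by_class = defaultdict(list)
    for codon_counts in by_amino_acid.values():
        n = sum(codon_counts.values())
        if n < 2:
            continue  # too few observations to estimate F
        f = (n * sum((c / n) ** 2 for c in codon_counts.values()) - 1) / (n - 1)
        if f > 0:
            f_by_class[len(codon_counts)].append(f)

    # ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, capped at 61.
    enc = 2.0
    for degeneracy, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class[degeneracy]
        if not values:
            raise ValueError(f"no usable {degeneracy}-fold amino acids in CDS")
        enc += n_amino_acids / (sum(values) / len(values))
    return min(enc, 61.0)
```

The constant 2 accounts for Met and Trp, which have one codon each. Short genes can lack whole degeneracy classes, which is one reason ENC estimates for small viral CDSs should be interpreted with care.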

Among the RNA viruses, one PvNV gene and all IMNV and TSV genes were found to lie roughly on the pENC curve towards the low GC3 region (Fig. 1). Moreover, significant correlation (r = 0.957, p < 0.05) between GC3 and ENC was also observed for IMNV. No correlation analysis was performed for TSV and PvNV. These data imply that mutational pressure is the major influencing factor in IMNV and TSV codon usage bias and in some of the genes of PvNV [32]. On the other hand, the ENC values for all genes of GAV, YHV and HPV as well as for the remaining genes of PvNV were found to lie well below the pENC curve. For YHV, a significant negative correlation (r = -0.895, p < 0.05) was observed between GC3 and ENC, whereas no correlation was found for HPV. For GAV, YHV and HPV, natural selection seems to be dominant. In the case of IHHNV, the ENC values for all of the genes were found to lie well below the pENC curve, but with positive correlation (r = 0.886, p < 0.05) between GC3 and ENC. Interestingly, mutational pressure and natural selection might be acting equally on the codon usage bias of PvNV and IHHNV.

Fig. 1

The ENC versus GC3 curve for shrimp viruses. The continuous black line represents the predicted ENC (pENC) curve generated by assuming that GC3 mutational bias was the sole determinant of codon usage bias. Observed GC3-versus-ENC relationships are represented by different symbols
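
A predicted ENC curve of this kind is commonly generated with Wright's expression, ENC = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3 fraction. The sketch below assumes that this standard expression underlies the pENC curve; the function names are hypothetical, and GC3 is taken as a percentage to match the values reported here.

```python
def predicted_enc(gc3_percent):
    """Expected ENC if GC3 composition alone determined codon usage."""
    s = gc3_percent / 100.0
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def distance_below_curve(observed_enc, gc3_percent):
    """Positive when a gene lies below the predicted curve."""
    return predicted_enc(gc3_percent) - observed_enc
```

A gene whose observed ENC is close to `predicted_enc` at its GC3 value is consistent with compositional (mutational) constraints alone, whereas a large positive `distance_below_curve` points to additional factors such as selection.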

For efficient utilization of host cell mechanisms, viruses may show significant adaptations during evolution. Thus, RSCU analysis was performed to compare the patterns of codon usage between shrimp viruses and their host (P. monodon) (Supplementary Table 2). For the shrimp viruses (except GAV and YHV), the numbers of A/U-ended preferred and overrepresented codons (RSCU > 1.6) were higher than those of the corresponding G/C-ended codons. In the case of underrepresented (RSCU < 0.6) codons in all viruses, the majority were found to be G/C-ended (Supplementary Table 3). As the majority of shrimp viruses had a significantly AT/AU-biased genome, the preference for A/U-ended codons and the underrepresentation of G/C-ended codons were as expected. In contrast, out of 18 preferred codons for P. monodon, 17 were found to be G/C-ended.

Comparison of RSCU values also indicated that preferred codons common between individual viruses and the host were as low as 1 to 8 (Supplementary Table 2). IMNV had only one preferred codon (CCA [Pro]) in common with its host, while GHV had eight common preferred codons (UUC [Phe], GAC [Asp], AUC [Ile], UAC [Tyr], CAC [His], CCA [Pro], AAC [Asn], and AAG [Lys]). Overall, somewhat antagonistic trends were observed between codon usage patterns of the viruses and the host.
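
The RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. A minimal sketch of this calculation, with the 1.6 and 0.6 cut-offs used above, is given below. It is not the CodonW code; the names are hypothetical, the standard genetic code is assumed, and codons are reported in the DNA alphabet (T in place of U).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon groups for the 18 degenerate amino acids.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        SYNONYMS[aa].append(codon)


def rscu(cds):
    """Return {codon: RSCU} for amino acids that occur in the CDS."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    values = {}
    for codons in SYNONYMS.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def classify(values, over=1.6, under=0.6):
    """Label each codon as 'over', 'under' or 'neutral' by its RSCU."""
    return {
        c: "over" if v > over else "under" if v < under else "neutral"
        for c, v in values.items()
    }
```

For example, `rscu("GCUGCUGCUGCC")` assigns GCT a value of 3.0, GCC a value of 1.0, and the two unused Ala codons a value of 0.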

The codon adaptation index (CAI) is used as a measure of relative adaptedness of the codon usage pattern of a gene/set of genes towards a reference set of genes, which in turn tells us the extent to which natural translational selection has been effective in influencing the codon usage bias [26]. In this study, CAI values for viral genes were calculated against a reference set of host genes [23]. The expected CAI (eCAI) values for each gene were also calculated as described by Puigbo et al. [22], to determine the statistical significance of the CAI values. The CAI values of all of the genes of GAV and YHV, two genes each of PvNV and HPV, and one gene each of TSV and IHHNV were found to be higher than their respective eCAI values (Supplementary Table 4). These observations suggest that certain genes of these viruses show significant adaptation to their host, and this could be due to translational selection. For IMNV, all genes had CAI values that were lower than their respective eCAI values, signifying the absence of codon adaptation and translational pressure.
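
The CAI of a gene is the geometric mean of the relative adaptiveness values (w) of its codons, where w is the usage of a codon in the reference set relative to the most-used synonymous codon. The sketch below illustrates the idea only and is not the tool used in the study. The function names and the pseudocount of 0.5 for codons unseen in the reference set are assumptions, and the eCAI significance threshold is not implemented.

```python
from collections import Counter, defaultdict
from itertools import product
from math import exp, log

# Standard genetic code, built in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        SYNONYMS[aa].append(codon)


def split_codons(cds):
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cdss, pseudocount=0.5):
    """Compute w for each codon from a list of reference (host) CDSs."""
    counts = Counter()
    for cds in reference_cdss:
        counts.update(split_codons(cds))
    weights = {}
    for codons in SYNONYMS.values():
        if sum(counts[c] for c in codons) == 0:
            continue  # amino acid absent from the reference set
        # Assumption: unseen codons get a small pseudocount to avoid log(0).
        adjusted = {c: counts[c] or pseudocount for c in codons}
        top = max(adjusted.values())
        for c, value in adjusted.items():
            weights[c] = value / top
    return weights


def codon_adaptation_index(cds, weights):
    """Geometric mean of w over the codons of a CDS that have a weight."""
    logs = [log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in CDS")
    return exp(sum(logs) / len(logs))
```

Codons for Met and Trp and stop codons carry no weight and are ignored, so a gene built entirely from the host's most-used codons scores 1.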

Multivariate correspondence analysis (COA) was used to identify the major trends of codon usage patterns in viral CDSs and to represent these trends along continuous axes for easy visual interpretation. In this study, COA on RSCU was performed separately for each virus. For all of the viruses, axis 1 and axis 2 were together able to explain >50% of the variation. Thus, a scatter plot was prepared by plotting the value of each gene along axis 1 and axis 2 (Fig. 2). Although virus genes were widely distributed along both axes, specific clustering based on the type of gene rather than on the virus type was observed. For example, all capsid protein genes from IHHNV (IHHNV_CP) formed a separate group distinct from the non-structural protein gene 1 (IHHNV_NS1) and non-structural protein gene 2 (IHHNV_NS2) groups. These observations suggest that different genes within the same virus were subjected to varying selection pressures during evolution.

Fig. 2

COA-RSCU plot depicting the distribution of viral genes along axis 1 and axis 2. NCP, nucleocapsid protein; PPP, polyprotein precursor; REP, replicase polyprotein; NS1, non-structural protein 1; NS2, non-structural protein 2; SP, structural protein; CP, capsid protein; RP, RNA-dependent RNA polymerase; NS, non-structural protein; SGP, structural glycoprotein; B2P, B2-like protein

In COA, correlation analysis between axis 1 and other codon usage parameters can indicate the influence of various factors on codon bias. Significant correlations (r > 0.82, p < 0.05) between axis 1 – GC3 and axis 1 – ENC were observed for both IMNV and IHHNV. In the case of HPV, a significant correlation between axis 1 and GC3 (r = 0.65, p < 0.05) was observed, whereas axis 1 showed no correlation with ENC. YHV showed non-significant correlations between axis 1 – GC3 and axis 1 – ENC. No correlation analysis was performed for GAV, TSV and PvNV. These results confirm our previous observations that mutational pressure plays an important role in the codon usage bias of IMNV and IHHNV but has only a minor influence in the case of YHV and HPV.

It has been reported that dinucleotide frequencies can have a significant effect on codon usage bias [6]. In this study, we calculated the relative abundances of all 16 dinucleotides as the ratio of their observed to expected frequencies using the compseq programme [24, 28]. Karlin et al. [14] reported that the CpG dinucleotide was underrepresented in small DNA viruses, but large DNA viruses showed no bias against CpG. Thus, we also calculated the relative dinucleotide abundances of all 16 pairs for two large DNA shrimp viruses, namely white spot syndrome virus (WSSV) and Penaeus monodon nudivirus (PmNV) to investigate possible differences between small-DNA and RNA shrimp viruses (Supplementary Table 5). Among the various dinucleotides, CpG was found to be underrepresented in all viruses, with a relative abundance consistently lower than the expected value of one. Interestingly, although most of the shrimp virus genomes had significant AT/AU bias, patterns of underrepresentation were also observed for UpA in all viruses. On the other hand, the dinucleotide UpG was consistently overrepresented in all viral sequences. The CpA dinucleotide was also consistently overrepresented in all shrimp viruses except IHHNV and WSSV. We also determined the RSCU values of CpG-containing codons to investigate their effect on codon usage bias. Out of eight CpG-containing codons (CCG, GCG, UCG, ACG, CGC, CGG, CGU, and CGA), only CGU was the preferred codon in its synonymous group, and the majority of these codons were found to be underrepresented in their groups (Supplementary Table 2). CGU was also found to be a preferred codon for YHV, GAV and PvNV.
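
The observed-to-expected ratio for a dinucleotide XpY can be written as f(XY) / (f(X) · f(Y)), where f denotes a frequency in the sequence. The sketch below illustrates this ratio in plain Python rather than with compseq. The function name is hypothetical, the sequence is assumed to contain only unambiguous bases, and U is converted to T, so that UpA and UpG appear as "TA" and "TG".

```python
from collections import Counter
from itertools import product


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: observed / expected frequency} for a sequence."""
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    mono = Counter(seq)
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n) * (mono[y] / n)
        if expected == 0:
            continue  # a base is absent, so the ratio is undefined
        observed = di[x + y] / (n - 1)
        ratios[x + y] = observed / expected
    return ratios
```

A ratio of one means the dinucleotide occurs as often as expected from base composition alone, so values below one (as reported here for CpG and UpA) indicate underrepresentation.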

Large DNA viruses such as WSSV and PmNV, with genome sizes of approximately 300 kb and 125 kb, respectively, encode hundreds of proteins. These proteins play an important role in viral pathogenicity, survival and replication [29, 33, 35]. Except for GAV and YHV (genome size, ~26 kb), small shrimp viruses have genomes of 10 kb or less, and these small viruses encode two to five proteins. In comparison to large viruses, small viruses are more dependent on host machinery for their survival and replication. This may lead to different patterns of base composition and codon usage in small and large viruses [25]. The majority of these viruses use deoptimized codons (ones that occur at low frequency in the host genome), and only a few preferred codons (1 to 8) are common between individual viruses and their hosts. On a whole-genome basis, somewhat antagonistic trends were observed between codon usage patterns of these shrimp viruses and their host. Antagonistic, coincident, or a mixture of both types of codon usage between the virus and its host has also been reported for hepatitis A virus, poliovirus, and chikungunya virus [5, 7, 19]. Previously, it was reported that the whole-genome codon usage pattern of PmNV, an important shrimp pathogen, was antagonistic to its host but that forces of natural selection were able to overcome this antagonism in some genes [28]. Coincidence between viral and host codon usage may lead to improved translation efficiency, whereas antagonism may result in slow viral mRNA translation and viral replication. It has been reported that certain viruses such as hepatitis A virus are unable to shut down the synthesis of host proteins. Thus, in order to synthesize its own proteins, hepatitis A virus must compete with the host for the cellular translational machinery and therefore uses deoptimized codons to avoid competition with the host for cellular tRNAs.
This strategy results in slow synthesis of viral proteins, including those involved in RNA replication. This low rate of translation and RNA replication might also help hepatitis A virus to grow slowly and thereby avoid host defences [21]. The presence of deoptimized codons and a host-antagonistic codon usage pattern in shrimp viruses suggests that these viruses may also use similar strategies to replicate and to avoid host defences. In our analysis, the CAI metric also revealed that, in spite of overall antagonistic trends in shrimp viruses, some genes showed significant adaptation towards the host’s codon usage pattern. Replicase, capsid and non-structural protein genes from these viruses had CAI values that were higher than their respective eCAI values, suggesting that forces of natural selection were able to have a significant impact on codon usage of these genes. Moreover, gene plots obtained by COA also showed specific clustering based on the type of gene rather than the virus type. It was observed that similar genes from different geographical isolates of the same virus formed a unique cluster that was distinctly separated from other genes of the same virus. These observations suggest that, during evolution, shrimp viruses have been subjected to gene-specific selection pressures, resulting in unique codon usage patterns.

The dinucleotide frequency can have a strong influence on the codon usage bias of DNA and RNA viruses. Thus, for each shrimp virus, the relative abundance of all 16 dinucleotides was calculated as the ratio of their observed to expected frequencies. CpG was found to be consistently underrepresented in all shrimp viruses. In the case of vertebrate viruses, unmethylated CpGs in viral genomes are recognized as a pathogen signature by the host’s pattern-recognition receptors (PRRs), specifically Toll-like receptor 9 (TLR9) [15]. Binding of these CpG motifs to TLR9 triggers an innate immune response in the host [10]. Recently, the existence of a Toll pathway in shrimp has also been suggested [16, 31]. Several Toll-like receptors (TLRs) from shrimp have been cloned and characterized [1, 30, 34]. Studies have also revealed that the Toll pathway in shrimp responds to bacterial [8] and viral [9] infections. These studies have also suggested that, like vertebrate TLRs, shrimp TLRs play an important role in innate immunity through their ability to recognize microbe-associated molecular patterns [31]. Thus, it is possible that shrimp viruses have evolved with underrepresentation of CpGs to avoid the host immune response. It has been reported previously that CpG bias is limited to small DNA viruses, whereas large DNA viruses show the expected CpG frequencies without any bias [14]. Shackelton et al. [25] explained the lack of CpG bias in large viruses by suggesting that by virtue of encoding large numbers of proteins, these viruses have a higher capacity to interfere with host PRRs. However, we observed consistent underrepresentation of CpG in both small and large (WSSV, PmNV) shrimp viruses. This observation suggests that both small and large shrimp viruses are quite dependent on the host for replication/survival and might not have much ability to interfere with host PRRs. However, this hypothesis needs to be tested in further studies.
In spite of the significant AT/AU bias in the majority of shrimp viruses, UpA dinucleotides were also found to be underrepresented in all of them. The underrepresentation of UpA has also been observed in other genomes, including those of vertebrates, invertebrates, plants, and prokaryotes [4]. The susceptibility of UpA uracils to the host’s RNase has been suggested as one of the reasons for the underrepresentation of this particular dinucleotide. Moreover, UpA is also present in two of the three stop codons [3]. Thus, underrepresentation of UpA in viral genomes might be helpful in mitigating the risk of nonsense mutations resulting in incomplete proteins. Similar to other studies [5, 18], overrepresentation of CpA and UpG was also observed in shrimp viruses, but the significance of this observation is unknown to us. The data suggest that shrimp virus genomes have been subjected to selective pressure during evolution, leading to an alteration of their dinucleotide frequencies and corresponding codon usage patterns. Finally, this study suggests that the codon usage biases in shrimp viruses are due to the interrelationship between the genome composition, selective constraints in the form of translational efficiency, and the need to escape the host immune response.