1 Background

Hepatitis E virus (HEV) is a small RNA virus belonging to the family Hepeviridae and is the causative agent of hepatitis E, a potentially serious acute disease [1, 2]. HEV is transmitted primarily through contaminated water or through the consumption of undercooked meat products derived from infected animals (swine, deer or wild boar) [3, 4]. The HEV genome is a positive-sense, single-stranded RNA molecule of approximately 7.2 kb in length, flanked by 5′ and 3′ untranslated regions (UTRs) [5]. The genome carries a 7-methylguanosine cap at the 5′ end and a poly(A) tail at the 3′ end, and contains three open reading frames (ORFs): ORF1, ORF2 and ORF3. ORF1 encodes the largest product, a non-structural polyprotein with multifunctional domains required for viral replication [6, 7]. ORF2 encodes the capsid protein [8]. ORF3 encodes a phosphorylated protein with multiple functions [9, 10]. An additional reading frame (ORF4) has recently been identified in HEV genotype 1 (GT 1) isolates; the ORF4 protein is expressed only during endoplasmic reticulum (ER) stress [11]. This newly identified ORF4 is exclusive to HEV GT 1 [11]. ORF4 has been demonstrated to play a significant functional role in the replication cycle of GT 1 HEV, and evidence suggests that it interacts with multiple viral and host proteins to enhance virus replication [11, 12].

The present study analyzed compositional bias, in terms of nucleotide composition and synonymous codon usage patterns, in the HEV ORF4 protein genes. The degeneracy of the genetic code allows more than one codon to encode a given amino acid; alternative codons encoding the same amino acid are termed synonymous codons. In viruses, the preference for some codons over others has been well documented. This phenomenon is referred to as codon usage bias (CUB) [13, 14]. CUB is considered an important force in the evolution of viral genomes. Factors influencing CUB include mutational pressure, natural selection, G + C content, protein secondary structure and selective transcription and replication [15,16,17,18]. Previous reports have suggested that natural selection and directional mutation pressure are the two major mechanisms that account for codon usage variation among viral genomes [15, 19,20,21]. However, mutational bias, rather than natural selection, has been found to be the dominant factor affecting codon usage patterns in some RNA viruses [22,23,24,25]. The development of a disease results from a complex interaction among several factors, including the virulence of the pathogen, the defense response of the host and environmental aspects [26, 27]. Together with CUB, these factors decide the outcome of the host–pathogen interaction [28, 29]. Pathogens can adapt better to their hosts and environment through evolutionary changes that are reflected in their CUB patterns. Moreover, the efficiency with which a pathogen infects its host depends significantly on codon optimization, because codon optimization affects the growth of the pathogen in its environment [28]. Similarity between the codon usage patterns of a virus and its hosts may influence the fitness of the virus, its evasion of the host immune system and its evolution [30, 31]. Therefore, the study of codon usage in viruses can reveal important information about virus evolution, regulation of gene expression and protein synthesis. Despite the importance of the ORF4 region, its codon usage patterns have not been determined [32, 33]. This investigation was therefore carried out to analyze the codon usage patterns of the HEV ORF4 protein genes.

Codon usage analysis has been carried out extensively for the protein genes of the other HEV reading frames, i.e., ORF1, ORF2 and ORF3 [34]. Baha and colleagues evaluated the codon usage patterns of these ORFs, but the compositional constraints on codon usage in ORF4 have not been analyzed [34]. In this study, we performed a comprehensive analysis of nucleotide composition and synonymous codon usage, based on the ORF4 nucleotide sequences available in NCBI GenBank, to determine the evolutionary factors that could play an important role in shaping codon usage patterns. To the best of our knowledge, this analysis provides the first insights into the codon usage patterns of ORF4 protein genes. This study will also shed light on the distinguishing genetic features of HEV present in the ORF4 sequences.

2 Methods

2.1 Sequence data acquisition

Nucleotide sequences of the ORF4 protein genes were retrieved from the GenBank database of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov). Sequences were selected and processed as follows: (A) sequences from the same or different countries, collected at varying time intervals, were assembled so as to avoid repetition; (B) sequences were included from different hosts, namely human, rat and ferret; (C) the sequences collected from GenBank were categorized into datasets; (D) three datasets were prepared, one for each host organism (human, rat and ferret); (E) multiple sequence alignment of these datasets was carried out using the ClustalW algorithm implemented in BioEdit Sequence Alignment Editor 7.2.5 [35]. The complete list of sequences used for each host organism is provided in the additional files (Additional files 1–3: Tables S1–S3).

2.2 Nucleotide composition analysis

The following nucleotide composition properties of the ORF4 sequences were calculated using Mega-X (Version 10.1.7): (1) overall nucleotide frequencies (A%, C%, T/U% and G%); (2) nucleotide frequencies at the third codon position (A3%, C3%, U3% and G3%); and (3) G + C content at the first (GC1), second (GC2) and third synonymous (GC3) codon positions. Five codons that cannot exhibit usage bias were omitted from the nucleotide composition analysis: the three termination codons (UAG, UGA and UAA), which do not code for any amino acid, and the codons AUG and UGG, which are the only codons for Met and Trp, respectively.
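The composition measures described above can be illustrated with a short Python sketch. It is not the Mega-X procedure used in this study: the function name `composition` and the toy sequence are hypothetical, and for simplicity the position-specific values are computed over all codons other than the five excluded ones.

```python
from collections import Counter

# Codons excluded from the analysis: three stop codons plus AUG (Met) and UGG (Trp).
EXCLUDED = {"UAG", "UGA", "UAA", "AUG", "UGG"}

def composition(seq):
    """Return percentage nucleotide composition for an in-frame coding sequence."""
    seq = seq.upper().replace("T", "U")
    # Split into complete codons, dropping any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    codons = [c for c in codons if c not in EXCLUDED]
    if not codons:
        raise ValueError("no informative codons in sequence")
    n = len(codons)
    overall = Counter("".join(codons))
    third = Counter(c[2] for c in codons)
    # Overall frequencies (A%, C%, U%, G%) and third-position frequencies (A3% ...).
    result = {f"{b}%": 100 * overall[b] / (3 * n) for b in "ACUG"}
    result.update({f"{b}3%": 100 * third[b] / n for b in "ACUG"})
    # G + C content at codon positions 1, 2 and 3.
    for pos in range(3):
        gc = sum(c[pos] in "GC" for c in codons)
        result[f"GC{pos + 1}%"] = 100 * gc / n
    result["GC%"] = result["G%"] + result["C%"]
    return result

# Hypothetical toy sequence, not taken from the study datasets.
example = composition("ATGGCCCTGCGGTGA")
```

In the toy sequence, the start and stop codons are dropped, so only GCC, CUG and CGG contribute to the percentages.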

2.3 Relative synonymous codon usage (RSCU) analysis

Relative synonymous codon usage (RSCU) is the ratio between the observed usage frequency of a codon and the frequency expected if all synonymous codons for the same amino acid were used equally. The RSCU value is therefore 1 when all synonymous codons for an amino acid are used equally [36]. The RSCU index was determined as follows:

$$\text{RSCU} = \frac{G_{ij}}{\sum_{j=1}^{n_i} G_{ij}}\, n_i$$

where RSCU is the relative synonymous codon usage value, Gij is the observed number of the jth codon for the ith amino acid, and ni is the number of synonymous codons encoding that amino acid. Codons with RSCU values > 1.6 and < 0.6 were considered “overrepresented” and “underrepresented”, respectively, whereas codons with an RSCU value of 1 were regarded as unbiased (used at the average level) [37]. The mean RSCU values of the ORF4 protein genes were calculated using Mega-X (Version 10.1.7), in order to reveal the codon usage patterns without the effect of amino acid composition and sequence length.
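The RSCU definition and the thresholds above can be sketched in Python as follows. This is a minimal illustration assuming the standard genetic code, not the Mega-X implementation; the names `rscu` and `classify` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(seq):
    """Return {codon: RSCU} for every codon family present in an in-frame sequence."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    values = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue  # stop codons, Met and Trp carry no synonymous-usage information
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            # Observed count divided by the count expected under equal usage.
            values[c] = counts[c] * len(codons) / total
    return values

def classify(value):
    """Apply the > 1.6 and < 0.6 thresholds used in this section."""
    if value > 1.6:
        return "overrepresented"
    if value < 0.6:
        return "underrepresented"
    return "intermediate"

# Hypothetical toy sequence: Gly is encoded by GGC three times and GGG once.
values = rscu("GGCGGCGGCGGGGCUGCC")
```

By construction, the RSCU values within one codon family sum to the number of synonymous codons in that family, which is a convenient sanity check on any implementation.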

2.4 Relationship between overall nucleotide composition and nucleotide composition at the 3rd codon position

Correlations between the overall nucleotide contents (A, T, G, C and GC) and the corresponding contents at the third codon position (A3, T3, G3, C3 and GC3) were assessed to determine whether natural selection or mutation pressure acted individually, or both acted together, on the evolution of ORF4 in the natural hosts of HEV.
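A correlation of this kind can be sketched as follows. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here; the function name `pearson_r` and the per-sequence values are hypothetical and serve only as an illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)

# Hypothetical per-sequence percentages (overall GC and GC3), for illustration only.
gc = [55.0, 56.5, 58.0, 60.5, 63.5]
gc3 = [62.0, 64.0, 66.5, 68.0, 70.5]
r = pearson_r(gc, gc3)
```

The same function would be applied to each pair in turn (A with A3, T with T3, and so on), with one value per sequence in the dataset.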

3 Results

3.1 Analysis of nucleotide composition in coding sequences

The nucleotide composition of the ORF4 protein genes was calculated to analyze the effect of compositional constraints on codon usage. The results are presented in Table 1 and Fig. 1.

Table 1 Nucleotide composition analysis of ORF4 of hepatitis E viruses
Fig. 1 Comparative analysis of nucleotide composition patterns between HEV host organisms (human, rat and ferret)

Human: C and G were the most abundant nucleotides in these coding sequences, with averages of 35.597% and 27.966%, respectively, compared with U (21.341%) and A (15.094%). The most frequent nucleotide at the third codon position was G3S (39.245%), followed by C3S (31.194%), A3S (16.352%) and U3S (13.207%); the third position thus followed the trend G3S > C3S > A3S > U3S. The overall GC content (63.563%) was higher than the AU content (36.441%), which indicated a GC-biased composition. The average overall GC content and the GC contents at positions GC1, GC2 and GC3 were 63.563%, 52.829%, 67.421% and 70.439%, respectively (Additional file 1: Table S1) (Table 1).

Rat: C and U were the most abundant nucleotides in these coding sequences, with averages of 29.451% and 27.122%, respectively, compared with G (27.070%) and A (16.356%). The most frequent nucleotide at the third codon position was G3S (34.782%), followed by C3S (31.754%), U3S (17.003%) and A3S (16.459%); the third position thus followed the trend G3S > C3S > U3S > A3S. The overall GC content (56.498%) was higher than the AU content (43.478%), which indicated a GC-biased composition. The average overall GC content and the GC contents at positions GC1, GC2 and GC3 were 56.498%, 50.387%, 52.639% and 66.536%, respectively (Additional file 2: Table S2) (Table 1).

Ferret: C and U were the most abundant nucleotides in these coding sequences, with averages of 28.768% and 27.119%, respectively, compared with G (26.358%) and A (17.753%). The most frequent nucleotide at the third codon position was G3S (32.717%), followed by C3S (30.597%), U3S (20.706%) and A3S (15.978%); the third position thus followed the trend G3S > C3S > U3S > A3S. The overall GC content (55.126%) was higher than the AU content (44.872%), which indicated a GC-biased composition. The average overall GC content and the GC contents at positions GC1, GC2 and GC3 were 55.126%, 51.63%, 50.434% and 63.314%, respectively (Additional file 3: Table S3) (Table 1).

Taken together, these initial results indicated that C was overrepresented, whereas A was underrepresented, in the HEV ORF4 protein genes, while G and U (T) showed no consistent pattern across hosts. In addition, the GC content exceeded 50% and was therefore higher than the AU content (< 50%) in the ORF4 protein genes of all three hosts.

3.2 Analysis of codon usage patterns in coding sequences

The RSCU measure was used to evaluate the codon usage pattern of the ORF4 protein gene sequences. RSCU values were computed for every codon in each gene sequence to determine the extent to which C-ended codons were preferred. The results are presented in Table 2 and Fig. 2.

Table 2 Average RSCU values in ORF4 of hepatitis E viruses
Fig. 2 Comparative analysis of relative synonymous codon usage (RSCU) patterns between HEV-hosts (human, rat and ferret)

Human: Of the 18 preferred codons (UCC, UCA, UCG, AGU, AGC, CCU, CCC, CCA, CCG, ACC, ACG, GCU, GCC, GCG, CAG, UGC, GGC and GGG), 13 were C/G-ending (C-ending: 7; G-ending: 6) and 5 were U/A-ending (U-ending: 3; A-ending: 2) (Additional file 4: Table S4) (Table 2). This indicated a preference for C- and G-ended codons over U- and A-ended codons in the gene sequences. Among the preferred codons, 3 were overrepresented, with RSCU values > 1.6 (CAG, UGC and GGC), 14 had RSCU values between 0.6 and 1.6 (UCC, UCA, UCG, AGU, AGC, CCC, CCA, CCG, ACC, ACG, GCU, GCC, GCG and GGG), and one was underrepresented, with an RSCU value < 0.6 (CCU).

Rat: Of the 25 preferred codons (UUU, UUC, UUA, UUG, CUC, CUA, CUG, AUU, AUC, AUA, GUG, UCC, UCG, AGC, CCU, CCG, ACA, ACG, GCC, GCA, UGC, CGC, CGG, AGG and GGC), 17 were C/G-ending (C-ending: 9; G-ending: 8) and 8 were U/A-ending (A-ending: 5; U-ending: 3) (Additional file 5: Table S5) (Table 2). This indicated a preference for C- and G-ended codons over U- and A-ended codons in the gene sequences. Among the preferred codons, 6 were overrepresented, with RSCU values > 1.6 (GUG, AGC, GCC, CGC, AGG and GGC), 18 had RSCU values between 0.6 and 1.6 (UUU, UUC, UUG, CUC, CUA, CUG, AUU, AUC, AUA, UCC, UCG, CCU, CCG, ACA, ACG, GCA, UGC and CGG), and one was underrepresented, with an RSCU value < 0.6 (UUA).

Ferret: Of the 22 preferred codons (UUU, UUC, UUG, CUA, CUG, AUU, AUC, AUA, GUC, UCU, UCA, AGC, CCU, CCC, GCC, UAC, CAC, CAG, GAG, CGC, CGG and AGG), 15 were C/G-ending (C-ending: 9; G-ending: 6, as counted from the listed codons) and 7 were U/A-ending (U-ending: 4; A-ending: 3) (Additional file 6: Table S6) (Table 2). This indicated a preference for C- and G-ended codons over U- and A-ended codons in the gene sequences. Among the preferred codons, 7 were overrepresented, with RSCU values > 1.6 (UUG, UCU, GCC, CAC, CAG, CGC and AGG), while the remaining 15 had RSCU values between 0.6 and 1.6 (UUU, UUC, CUA, CUG, AUU, AUC, AUA, GUC, UCA, AGC, CCU, CCC, UAC, GAG and CGG). No underrepresented (RSCU < 0.6) codon was found.

The host-specific RSCU analysis revealed that C/G-ending codons were preferred over U/A-ending codons in the ORF4 coding sequences of all host organisms. The number of preferred codons per host followed the order 25 (rat) > 22 (ferret) > 18 (human), and the sets of overrepresented and underrepresented codons differed between hosts. Thus, our RSCU findings revealed both similarities and discrepancies in the codon usage patterns among HEV-hosts.
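The tally of codon endings reported above can be reproduced directly from a list of preferred codons. The sketch below uses the human list given in this section; the variable names are hypothetical.

```python
from collections import Counter

# Preferred codons reported above for the human-host dataset.
HUMAN_PREFERRED = [
    "UCC", "UCA", "UCG", "AGU", "AGC", "CCU", "CCC", "CCA", "CCG",
    "ACC", "ACG", "GCU", "GCC", "GCG", "CAG", "UGC", "GGC", "GGG",
]

# Count codons by the nucleotide at their third position.
ending_counts = Counter(codon[-1] for codon in HUMAN_PREFERRED)
gc_ending = ending_counts["C"] + ending_counts["G"]
au_ending = ending_counts["A"] + ending_counts["U"]

print(dict(ending_counts), gc_ending, au_ending)
```

Applying the same tally to the rat and ferret lists gives a quick consistency check on the C/G-ending and U/A-ending counts for each host.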

3.2.1 Relationship among hosts by comparing codon usage frequency

Most amino acids are encoded by more than one codon, and it has been documented that the usage of synonymous codons is not random [38]. Using the RSCU values for each HEV-host, we computed the frequency of the preferred codon for each amino acid, in order to analyze the influence of host selection pressure on the codon usage patterns of HEV. The preferred codons, i.e., those used with higher frequency than their synonymous alternatives, are listed for each HEV-host in Table 3 (Additional files 4–6: Tables S4–S6).

Table 3 Preferred codons for each amino acid in the ORF4 of HEV-hosts

Ten amino acids, Ile (I), Ala (A), Gln (Q), Asn (N), Lys (K), Asp (D), Glu (E), Cys (C), Arg (R) and Gly (G), showed the same preferred codon in all three natural HEV-hosts: AUU for Ile, GCC for Ala, CAG for Gln, AAC for Asn, AAG for Lys, GAU for Asp, GAG for Glu, UGC for Cys, CGC for Arg and GGC for Gly. This indicated a phenomenon of “mutual codon preference”. These codons (AUU, GCC, CAG, AAC, AAG, GAU, GAG, UGC, CGC and GGC) therefore represent the coincident portion of codon usage, i.e., preferred codons shared by all the natural HEV-hosts. For other amino acids, the preferred codons differed between host organisms. For instance, the three HEV-hosts each used a different preferred codon for Ser (UCG in human, AGC in rat and UCU in ferret).

In other cases, two hosts shared a preferred codon while the third differed. Human and rat shared UUC as the preferred codon for Phe, whereas ferret preferred UUU; human and ferret shared UUG for Leu, whereas rat preferred CUG; human and rat shared GUG for Val, whereas ferret preferred GUC; human and ferret shared CCC for Pro, whereas rat preferred CCU; human and rat shared ACG for Thr, whereas ferret preferred ACC; and rat and ferret shared UAC for Tyr, whereas human preferred UAU.

Our results clearly indicated that codon usage patterns in ORF4 gene sequences showed a mixture of coincidence and antagonism among HEV-hosts.
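The comparison described in this subsection can be expressed as a small lookup over the preferred codons reported above. The sketch below is illustrative only: the dictionary holds the codons named in the text (amino acids not discussed there are omitted), and the names `PREFERRED` and `agreement` are hypothetical.

```python
# Preferred codon per amino acid as (human, rat, ferret), as reported in the text above.
PREFERRED = {
    "Ile": ("AUU", "AUU", "AUU"),
    "Ala": ("GCC", "GCC", "GCC"),
    "Gln": ("CAG", "CAG", "CAG"),
    "Asn": ("AAC", "AAC", "AAC"),
    "Lys": ("AAG", "AAG", "AAG"),
    "Asp": ("GAU", "GAU", "GAU"),
    "Glu": ("GAG", "GAG", "GAG"),
    "Cys": ("UGC", "UGC", "UGC"),
    "Arg": ("CGC", "CGC", "CGC"),
    "Gly": ("GGC", "GGC", "GGC"),
    "Ser": ("UCG", "AGC", "UCU"),
    "Phe": ("UUC", "UUC", "UUU"),
    "Leu": ("UUG", "CUG", "UUG"),
    "Val": ("GUG", "GUG", "GUC"),
    "Pro": ("CCC", "CCU", "CCC"),
    "Thr": ("ACG", "ACG", "ACC"),
    "Tyr": ("UAU", "UAC", "UAC"),
}

def agreement(codons):
    """Classify a (human, rat, ferret) triple by how many distinct codons it holds."""
    labels = {1: "shared by all hosts", 2: "shared by two hosts", 3: "host-specific"}
    return labels[len(set(codons))]

summary = {aa: agreement(codons) for aa, codons in PREFERRED.items()}
for aa, label in summary.items():
    print(f"{aa}: {label}")
```

With these data, ten amino acids fall in the coincident group, six are shared by two hosts, and Ser is the only amino acid with a different preferred codon in each host.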

3.3 Comparative analysis of the RSCU values among hosts

Moreover, the most frequently used, least frequently used and unused codons also showed common attributes and differences in codon usage patterns among HEV-hosts, as presented in Table 4. These observations further emphasized that mutual codon preference occurred for some codons, while shared preference was lacking for others, among the hosts.

Table 4 List of used codons based on frequency in ORF4 of hepatitis E viruses

3.4 Effect of natural selection in shaping the codon usage patterns in HEV

It has been suggested that, if mutational pressure alone shapes synonymous codon usage bias, the frequencies of A and U/T should equal those of C and G at the third codon position [17]. However, we observed large variations in nucleotide composition across the ORF4 gene sequences (Table 1). This indicated that other mechanisms, including natural selection, influenced the codon usage bias in HEV. These findings therefore suggested that compositional constraints under mutational bias, in combination with natural selection, shaped the codon usage patterns in ORF4 coding sequences across all hosts.
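The imbalance at the third codon position can be made explicit with the values reported in Section 3.1. The sketch below follows one reading of the criterion above, namely comparing the combined A3 + U3 frequency with the combined G3 + C3 frequency; the function name `au_gc_split` is hypothetical.

```python
# Third-position nucleotide frequencies (%) as reported in Section 3.1.
THIRD_POSITION = {
    "human": {"A": 16.352, "U": 13.207, "G": 39.245, "C": 31.194},
    "rat": {"A": 16.459, "U": 17.003, "G": 34.782, "C": 31.754},
    "ferret": {"A": 15.978, "U": 20.706, "G": 32.717, "C": 30.597},
}

def au_gc_split(freqs):
    """Return (A3 + U3, G3 + C3), rounded to three decimals."""
    au = round(freqs["A"] + freqs["U"], 3)
    gc = round(freqs["G"] + freqs["C"], 3)
    return au, gc

splits = {host: au_gc_split(freqs) for host, freqs in THIRD_POSITION.items()}
for host, (au, gc) in splits.items():
    print(f"{host}: A3+U3 = {au}%, G3+C3 = {gc}%, difference = {round(gc - au, 3)}")
```

In all three hosts the G3 + C3 total matches the reported GC3 value and exceeds the A3 + U3 total by a wide margin, so the frequencies are far from equal.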

4 Discussion

The enormously high genetic diversity of HEV, together with the lack of an appropriate culture system for its propagation, poses a major challenge to the improvement of treatment methods. Multiple HEV genotypes and subtypes have been identified by nucleotide sequence analysis [39, 40]. Characterizing genetic properties to identify common regions and possible differences between genotypes is expected to contribute to the development of effective preventive measures against HEV infection. Our previous investigations elucidated the ORF4 protein structure in different host organisms [41] and its role as a probable drug target [42]. In this context, we conducted a bioinformatics study of ORF4 sequences of HEV, analyzing codon usage patterns in different host organisms to provide insights into common attributes and differences in codon usage for the amino acids of this viral protein. It is hoped that these findings will help in identifying more efficient and precise approaches for treatment protocols.

The genetic code comprises 64 codons, of which 61 encode the 20 standard amino acids and three are termination codons. Each amino acid is encoded by a group of one to six codons, so most amino acids can be encoded by alternative codons belonging to the same group; these alternatives are termed ‘synonymous’ codons. CUB is the phenomenon in which one codon is preferred over its synonymous partners [15, 43]. CUB is considered a distinctive property that differs appreciably among genes as well as genomes [36, 44]. Investigations have reported that codon usage patterns help in understanding the molecular organization of genomes. Owing to improvements in sequencing technologies, CUB has gained more attention, and codon usage patterns have been studied in several prokaryotic and eukaryotic organisms [45]. As viruses are obligate parasites, they require a set of proteins and enzymes to colonize the host and counteract its defense mechanisms [46]. The establishment of an association between a host and a virus depends on translational accuracy [47], which is largely affected by synonymous codon usage patterns [45, 48]. Mutational bias and natural selection are the two major forces that govern overall codon usage variation in genomes. Mutation pressure, rather than translational selection, has been reported to be the primary determinant of codon bias in human RNA viruses [49]. Considering these forces together helps to determine whether the selection of preferred codons has been influenced by mutational pressure or by natural selection. Thus, in the present study, we performed a systematic survey of the evolutionary pressures (i.e., mutational bias and natural selection) acting on ORF4 to gain insights into its codon usage patterns. The codon usage patterns of the ORF1, ORF2 and ORF3 protein genes have been elucidated [34]; however, the codon usage patterns of ORF4 remained to be determined. This study is the first of its kind to describe the codon usage patterns of the ORF4 gene of HEV in three different host organisms (human, rat and ferret).

Nucleotide composition constraints affect codon usage patterns, and we therefore performed a nucleotide composition analysis of the HEV ORF4 protein genes. The analysis revealed an overrepresentation of C and an underrepresentation of A in the overall nucleotide composition. This agrees with the previous investigation by Baha and colleagues of HEV isolates encompassing different genotypes and hosts, which revealed C as the most-represented and A as the least-represented nucleotide [34]. As in that study, our analysis also showed a random distribution of G and T (U) nucleotides [34]. The ORF4 genes were rich in GC, again in agreement with the previous report that all the ORF coding sequences of HEV had a high overall GC content (exceeding 50%) [34]. The compositional characteristics revealed a C/G-rich nucleotide pattern in human, whereas rat and ferret showed C/T(U) richness. These results are comparable with the earlier findings that ORF1 and ORF3 showed a C/G-rich composition, while ORF2 showed a prevalence of C/T(U) nucleotides [34]. However, the pattern observed in ORF4 differs from that reported for most RNA viruses (HIV, hepatitis C and rubella viruses), which showed a high prevalence of A rather than C [50]. This opposite bias in nucleotide composition could be due to adaptation of a common ancestor of modern HEV strains to its host during the process of evolution [51]. The contrast with the majority of RNA viruses is also consistent with the earlier report on the other reading frames (ORF1, ORF2 and ORF3) [34]. Thus, our findings from the initial compositional analysis are consistent with the previous report on codon usage patterns in HEV ORFs [34].

Next, we examined the role of selection forces in determining the codon usage patterns of ORF4 genes. In viruses, AU- or GC-rich composition has been suggested to correlate with RSCU patterns, such that AU-rich genomes prefer codons ending with A/U and GC-rich genomes prefer codons ending with G/C. This trend supports the influence of mutational pressure [49]. In human, the nucleotide compositional bias of ORF4 was in line with its RSCU pattern, which suggests that mutation pressure is a major driving factor in shaping its codon usage pattern. In rat and ferret, however, although the sequences had higher percentages of C and U nucleotides, the RSCU patterns showed a preference for C- and G-ended codons, i.e., the RSCU results were not consistent with the initial nucleotide composition. This suggested the involvement of factors other than nucleotide composition in shaping the synonymous codon usage patterns in these two host organisms. In this context, we observed large variations in nucleotide composition across the ORF4 gene sequences, which indicated that other mechanisms, including natural selection, influenced the codon usage bias in HEV. Thus, it could be interpreted that both mutation pressure and natural selection shaped the codon usage patterns of ORF4 coding sequences. Our findings are consistent with previous codon usage analyses of HEV that demonstrated the predominance of mutation pressure [51] and natural selection [34], respectively.

We next analyzed the relationship between the codon usage patterns of ORF4 in its natural hosts. The common attributes and differences among HEV-hosts were examined by computing the frequency of preferred codons for each amino acid from their RSCU values. The number of preferred codons varied among the natural hosts, being highest in rat and lowest in human. The numbers of overrepresented and underrepresented codons also varied between host organisms. This noteworthy variation in preferred codon usage among HEV-hosts implied that the codon usage patterns of ORF4 in different host organisms were subjected to different selection pressures. Furthermore, the most used and least used codons also showed similarities and differences between hosts. Thus, HEV ORF4 showed a mixture of two codon usage patterns: coincidence and antagonism. This is similar to previous findings in other viruses, such as HCV [52] and enterovirus [53]. A recent investigation of HEV has also shown both similarities and discrepancies in the codon usage patterns of the ORF1 Y-domain region, which further substantiates our present findings [54]. It has been proposed that the coincident portions of codon usage assist in the efficient translation of the corresponding amino acids between viruses and their respective hosts [55, 56], whereas the antagonistic portions promote the correct folding of viral proteins, even though the translation efficiency of the corresponding amino acids is decreased [57,58,59]. In summary, our findings revealed that none of the hosts showed complete resemblance to, or complete discrepancy from, the other HEV-hosts.

The findings from such bioinformatics codon usage studies can be validated experimentally and could further be utilized in clinical trials to advance our understanding of HEV biology. Similar investigations of other viruses may shed new light on their biology.

5 Conclusion

The present study documents codon usage analysis of HEV ORF4 for the first time. This bioinformatics approach is expected to strengthen our understanding of the common attributes and differences in codon usage patterns among ORF4 protein genes. The nucleotide compositional analysis showed that C was overrepresented and A was the least-represented nucleotide. The synonymous codon usage analysis revealed that the preferred codons mostly ended with C or G. Moreover, the codon usage pattern among HEV-hosts was a mixture of coincidence and antagonism. The study suggests that synonymous codon usage in ORF4 is shaped by an evolutionary process, perhaps reflecting a dynamic interplay of mutation and selection forces that adjusts codon usage to different hosts and conditions. Investigating codon usage patterns is essential for understanding viral evolution and for the efficient expression of viral proteins so that they elicit an efficient immune response. Such strategies of codon optimization toward preferred codons are very useful in vaccine development. The present study is anticipated to increase our knowledge of the mechanisms influencing codon usage and the evolution of ORF4.