Introduction

Human immunodeficiency virus 1 (HIV1) is the representative species of the Lentivirus genus and the cause of the acquired immunodeficiency syndrome (AIDS) pandemic, an ongoing public health problem. Nearly 40 years after the first AIDS case was reported, neither a vaccine nor a cure has been developed. The Lentivirus genus belongs to the Retroviridae family and Orthoretrovirinae subfamily; this genus currently comprises 10 species. Lentiviruses other than primate lentiviruses (HIV1, HIV2, and simian immunodeficiency virus [SIV]) cause immune deficiency, encephalitis, and infectious anemia in mammals such as cattle, cats, goats, sheep, horses, and pumas. HIV comprises two genetically distinct species, HIV1 and HIV2. Although they have the same transmission mode and similar immunological effects, HIV2 has lower pathogenicity and infectivity, a longer incubation period, a lower likelihood of AIDS development, and a lower plasma viral load, compared with HIV1 (Vidyavijayan et al. 2017). In addition, unlike the globally distributed HIV1, HIV2 is mainly confined to West Africa; however, it has recently been reported in several countries, including Angola, Mozambique, Portugal, France, the United States, and India (Visseaux et al. 2016). Genomic nucleotide sequence analysis of HIV1 and HIV2 has shown only 55% identity (Motomura et al. 2008); moreover, each species encodes distinct accessory proteins. In addition to the three structural proteins gag, pol, and env, HIV1 encodes the tat, rev, vif, nef, vpr, and vpu regulatory/accessory proteins; HIV2 encodes the vpx protein instead of vpu. According to genetic distance and phylogenetic analyses, HIV1 can be divided into the M (main), O (outlier), N (not M-not O), and P groups. The M group is the global epidemic group, comprising 95% of HIV1; it is further divided into 10 subtypes (A-D, F-H, J-L) separated by similar genetic distances, together with circulating recombinant forms generated by recombination between subtypes. HIV2 has eight lineages (A-G, U); HIV2 lineages are equivalent to HIV1 groups in terms of genetic distance or origin (Sharp et al. 1999; Robertson et al. 2000). According to epidemiological and phylogenetic studies regarding the gag, pol, and env genes of primate lentiviruses, groups M and N of HIV1 were transmitted from SIVcpz (Gao et al. 1999), while groups O and P were transmitted from SIVgor (Sharp and Hahn 2010; Hemelaar 2012), through at least four independent cross-species transmission events. At least eight independent transmissions of viruses from sooty mangabeys to humans gave rise to the eight lineages of HIV2. Thus far, SIV infection has been confirmed in more than 40 non-human primate species. Although SIV does not cause disease in its native hosts, Old World monkeys (Rey-Cuille et al. 1998), SIVmac (which originated from SIVsmm through a cross-species transmission event) causes simian AIDS in Asian macaques (Apetrei et al. 2005). Furthermore, SIVcpz was initially presumed not to cause disease in chimpanzees, but Keele et al. confirmed that it causes progressive cluster of differentiation 4-positive (CD4+) T cell loss and premature death in chimpanzees (in a manner similar to HIV1); these effects negatively impact health and reproduction. All lentiviruses have three structural genes in common (gag, pol, and env); however, unlike other lentiviruses, primate lentiviruses do not have a dUTP diphosphatase (dUTPase)-coding region in the pol open reading frame (Foley 2000).

Although HIV has a high mutation rate and undergoes frequent recombination, its genomic nucleotide composition has been robustly conserved for a long period of time, maintaining high adenine (A) content (up to 40%) and low cytosine (C) content (van der Kuyl and Berkhout 2012). Greater A-bias was observed in the pol gene, and generally low A-bias was observed in regions overlapping with other regulatory/accessory protein-coding genes (van Hemert and Berkhout 1995). This high A content has been suggested as the cause of guanine (G) → A hypermutation of HIV1 RNA, mainly because of dNTP pool imbalance during reverse transcription (Vartanian et al. 2002). Apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like 3G (APOBEC3G), a human host restriction factor, also contributes to G → A hypermutation; however, this effect is weaker than the impact of dNTP pool imbalance (Yu et al. 2004). Therefore, although the vif protein of HIV binds to APOBEC3G and degrades it, the HIV genome has a highly A-biased base composition pattern (Sheehy et al. 2002; Desimmie et al. 2014). The A-rich nucleotide composition of the HIV genome increases the selectivity for A-rich codons and tends to place A mainly at the third codon position, the most neutral part of the codon. A-biased synonymous codon usage is observed in HIV; HIV1 also shows a codon usage pattern lacking CpG dinucleotides, which are characteristic of eukaryotes. The zinc finger antiviral protein (ZAP), a host cellular antiviral protein, specifically targets regions of HIV1 genomic RNA with a high CpG dinucleotide content and represses translation, thereby inhibiting viral replication (Ficarelli et al. 2020). This may promote the suppressed use of CpG dinucleotides in HIV1.

Most codon usage analyses of HIV have focused on describing the codon usage patterns of all nine genes (Pandit and Sinha 2011), comparing structural genes’ codon usage patterns among geographical locations (Ahn and Son 2006), identifying codon usage pattern changes over time (year or early/late infection stages at the patient level) (Pandit and Sinha 2011; Meintjes and Rodrigo 2005), and contrasting the codon usages of HIV1 and HIV2 (Vidyavijayan et al. 2017). However, the indices used have been limited; codon usage analysis performed at the level of the primate lentivirus group, which includes more than 40 species of SIV, is insufficient. Therefore, we explored the evolutionary and genetic codon usage characteristics common and specific to the structural genes (gag, pol, and env) of HIV1, HIV2, and SIV.

Materials and methods

Data collection

Sequence data for phylogeny, compositional analysis, and codon usage pattern analysis were collected from NCBI Virus, with a specific focus on the gag, pol, and env open reading frames of HIV1, HIV2, and SIV. To ensure the accuracy of the sequence data, data cleaning was performed in Java; this included checking whether the length of each sequence was a multiple of 3, whether stop codons were present, and whether the sequence contained characters other than A, T, C, and G.
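The three cleaning checks can be illustrated with a minimal Python sketch. This is not the Java code used in the study; the function name `check_cds` is hypothetical, and the sketch assumes that "presence of stop codons" refers to in-frame stop codons before the final codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
VALID_BASES = set("ATCG")

def check_cds(seq):
    """Return a list of problems found in one coding sequence (empty if clean)."""
    seq = seq.upper()
    problems = []
    # Check 1: a coding sequence should consist of complete codons.
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Check 2: only unambiguous bases are accepted.
    if set(seq) - VALID_BASES:
        problems.append("characters other than A, T, C, G")
    # Check 3 (assumption): flag in-frame stop codons, ignoring the last codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

A sequence would be retained only when `check_cds` returns an empty list.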

Phylogenetic analysis

To confirm the evolutionary relationships among HIV1, HIV2, and SIV, five HIV1 and HIV2 sequences were randomly selected, along with two SIV sequences for each host species, to construct a phylogenetic tree for each of the gag, pol, and env genes. We collected as many sequences of each gag, pol, and env gene as were available to obtain sufficient data, and manually selected accurate sequence information for the analysis of codon usage indices (Table 1). The phylogenetic analysis in our study had two main objectives. The first objective was to confirm phylogenetic relationships among the primate lentiviruses isolated from human hosts (HIV1 and HIV2) and non-human primate hosts (SIV) (Table 2). The second objective was to group sequences for the analysis of codon usage indices based on clustering patterns within the three gene trees, taking into account the genus, tribe, and subfamily of the SIV host. When grouping by host based on the gag and pol gene sequences, we had to consider that the SIVcer group had the minimum number of available sequences (5); therefore, trees were repeatedly created by building datasets with sequences randomly extracted from the sequence pool for each group (Table 1). Afterwards, we confirmed that the clustering pattern was consistent among all reconstructed trees. Regarding SIV, some host species only had one sequence available; therefore, 60, 59, and 53 sequences were used to construct the final phylogenetic trees of the gag, pol, and env genes, respectively. After multiple sequence alignments had been performed using ClustalW (Thompson et al. 1994), phylogenetic trees were constructed using the maximum-likelihood method (Kishino and Hasegawa 1989) based on the Kimura 2-parameter model in MEGA-X (Kumar et al. 2018). The number of bootstrap replications was set to 1000 to measure the reliability of each internal node.
To minimize possible bias among the phylogenetic trees, the above process was performed five times for each gag, pol, and env gene; a representative tree with a high degree of agreement was generated. The evolutionary relationships of primate lentiviruses were confirmed. Regarding SIV, codon usage analysis grouping was performed based on the phylogenetic tree results and considering the genus, tribe, and subfamily of the host. The results of codon usage comparative analysis were interpreted considering the phylogenetic relationships among lentivirus species.

Table 1 Data collected for analysis of codon usage indices
Table 2 SIV host information

Compositional analysis

The nucleotide compositions of the gag, pol, and env coding sequences of HIV1, HIV2, and SIV were analyzed using CodonW (https://sourceforge.net/projects/codonw/) and CAIcal (http://genomes.urv.es/CAIcal/) programs. The relative frequencies of the overall (A%, T%, C%, and G%) and third codon (A3%, T3%, C3%, and G3%) positions of each nucleotide, as well as the contents of GC and AT for the whole sequence and third codon position, were intuitively visualized using MS Excel 2016.
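As an illustration of the quantities described above, the following Python sketch computes overall and third-position base percentages for a single coding sequence. It is a simplified stand-in, not the CodonW or CAIcal implementation: the function name `composition` is hypothetical, and every codon is counted (CodonW-style "GC3s" values restrict the calculation to synonymous codons, which this sketch does not do).

```python
def composition(seq):
    """Percentages of each base overall and at the third codon position."""
    seq = seq.upper()
    third = seq[2::3]  # every third base, starting from codon position 3
    result = {}
    for base in "ATCG":
        result[base + "%"] = 100 * seq.count(base) / len(seq)
        result[base + "3%"] = 100 * third.count(base) / len(third)
    # Combined contents for the whole sequence and the third position.
    result["GC%"] = result["G%"] + result["C%"]
    result["AT%"] = result["A%"] + result["T%"]
    result["GC3%"] = result["G3%"] + result["C3%"]
    result["AT3%"] = result["A3%"] + result["T3%"]
    return result
```

Group-level values would then be obtained by averaging these percentages over all sequences in a virus group.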

Parity rule 2 (PR2) analysis

For the gag, pol, and env genes of HIV1, HIV2, and SIV, PR2 plots were generated for the four-fold degenerate amino acids (Valine [Val], Proline [Pro], Threonine [Thr], Alanine [Ala], and Glycine [Gly]) and for the four-codon families of Leucine (Leu: CTT, CTC, CTA, CTG), Serine (Ser: TCT, TCC, TCA, TCG), and Arginine (Arg: CGT, CGC, CGA, CGG). In the PR2 plot, the x-axis is G3/(G3 + C3) and the y-axis is A3/(A3 + T3), indicating GC bias and AT bias, respectively. The mean G3/(G3 + C3) and A3/(A3 + T3) values for HIV1, HIV2, and SIV are shown as different icons; the relative positions of the lentivirus groups with respect to the point (0.5, 0.5), where mutation pressure and selection pressure are balanced, were used to identify the direction and magnitude of the PR2 bias.
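The two PR2 coordinates can be sketched in Python for a single four-codon family, identified by its first two bases (for example, GT for Val or CT for the CTN family of Leu). The function name `pr2_bias` and the prefix list are illustrative assumptions, not part of the original analysis pipeline.

```python
from collections import Counter

# First two bases of the eight four-codon families used in the PR2 plot:
# Val, Pro, Thr, Ala, Gly, Leu (CTN), Ser (TCN), Arg (CGN).
FOURFOLD_PREFIXES = ["GT", "CC", "AC", "GC", "GG", "CT", "TC", "CG"]

def pr2_bias(seq, prefix):
    """Return (G3/(G3+C3), A3/(A3+T3)) for codons starting with `prefix`."""
    seq = seq.upper()
    # Count third-position bases among in-frame codons of this family.
    counts = Counter(
        seq[i + 2]
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 2] == prefix
    )
    a, t, g, c = counts["A"], counts["T"], counts["G"], counts["C"]
    gc_bias = g / (g + c) if g + c else None  # x-axis of the PR2 plot
    at_bias = a / (a + t) if a + t else None  # y-axis of the PR2 plot
    return gc_bias, at_bias
```

A value of `None` is returned when the family does not occur with the relevant third-position bases, so that such sequences can be excluded from group means.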

Neutrality analysis

A neutral evolutionary analysis was performed to quantify mutation and selection pressures by comparing the GC content of the first and second codon positions (GC12s) with that of the third codon position (GC3s). The effect of mutation usually has “directionality” toward higher or lower GC content; this directional mutation pressure often acts more on neutral parts (e.g., the third codon position) than on functionally important regions (Sueoka 1988). Mutations in the first and second codon positions, except in six-fold degenerate amino acids, are non-synonymous mutations that change the primary structure of a protein and eventually affect its function; these do not occur often. Hence, under strong selection pressure, there are minimal differences in GC12s content among sequences of the same virus. Using the neutrality plot, the relative magnitude of the directional mutation pressure can be estimated by the degree of change in GC12s according to changes in GC3s (i.e., the slope of the regression line). For each of the gag, pol, and env genes, we created a scatter plot with GC12s as the ordinate and GC3s as the abscissa to determine whether the two values were correlated in each virus group. If a correlation existed, a regression line was generated to identify the major pressures on the codon usage pattern of the gene. A regression line slope of 1 (i.e., all points were distributed along the diagonal) was regarded as indicating no or weak Darwinian selection pressure; in this context, codon usage bias is mediated solely by mutation pressure. Conversely, a regression line slope of 0 (i.e., all points were located on a line parallel to the abscissa) indicated no difference in the GC content of the first and second codon positions among the gene sequences; in this context, evolution is driven solely by selection pressure.
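The neutrality analysis can be sketched in Python as follows. The function names are hypothetical, GC12 and GC3 are computed over all codons (a simplification of the GC12s and GC3s values reported by codon usage software), and the conversion of the slope into percentages assumes that the absolute slope multiplied by 100 is read as the relative mutation pressure, with the remainder attributed to selection.

```python
def gc12_gc3(seq):
    """Return (GC12, GC3) as fractions for one coding sequence."""
    seq = seq.upper()
    seq = seq[: len(seq) - len(seq) % 3]  # keep complete codons only

    def gc(bases):
        return sum(b in "GC" for b in bases) / len(bases)

    gc1, gc2, gc3 = gc(seq[0::3]), gc(seq[1::3]), gc(seq[2::3])
    return (gc1 + gc2) / 2, gc3

def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys (GC12) regressed on xs (GC3)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def pressure_split(slope):
    """Assumed reading of the slope: (mutation %, selection %)."""
    mutation = abs(slope) * 100
    return mutation, 100 - mutation
```

For example, a slope of 0.1 would be read as roughly 10% directional mutation pressure and 90% selection pressure under this assumption.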

Relative synonymous codon usage analysis

Relative synonymous codon usage (RSCU) is defined as the observed frequency of a codon divided by its expected frequency, which is the mean frequency of all synonymous codons encoding that amino acid (Sharp and Li 1986). Excluding the codons for Methionine (Met) and Tryptophan (Trp) and the three stop codons, 59 RSCU values (one per codon) were calculated for each coding sequence. RSCU values are not affected by the sequence length and amino acid composition of the gene, enabling comparisons among genes of different lengths, as well as comparisons between amino acids (Sharp and Li 1986). A codon RSCU value of > 1 indicates that the corresponding codon is used more frequently than expected under random usage. Moreover, a codon RSCU value of > 1.6 indicates over-representation, while a codon RSCU value of < 0.6 indicates under-representation in the coding sequence (Wong et al. 2010). RSCU values were calculated using the CodonW and CAIcal software. The mean RSCU value of each codon was obtained for each lentivirus group; a scatterplot was generated for each of the gag, pol, and env genes using XLSTAT for comparative analysis.
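The RSCU definition above can be illustrated with a short Python sketch based on the standard genetic code. It is not the CodonW or CAIcal implementation; the names `CODON_TABLE` and `rscu` are hypothetical, and amino acids absent from the sequence are simply omitted from the result.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stops).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(seq):
    """RSCU per codon: observed count / mean count of its synonymous codons."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Group the 59 codons of interest by amino acid (Met, Trp, stops excluded).
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "MW*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not present in this sequence
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values
```

With this definition, a codon used exactly as often as the average of its synonymous codons has an RSCU of 1.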

Effective number of codons and intrinsic codon deviation index analysis

The effective number of codons (ENC) is a simple measure of overall codon usage bias, which can be easily calculated from the codon frequency table; it is minimally influenced by gene length and amino acid composition. The ENC can assume values from 20 to 61; 20 indicates extreme codon usage bias (only one codon is used for each amino acid) and 61 indicates no bias (alternative synonymous codons are used equally for the corresponding amino acid) (Wright 1990). In general, an ENC value of < 35 indicates significant codon usage bias and an ENC value of > 50 indicates generally random codon usage (Jiang et al. 2008). Severe codon usage bias is usually attributed to high selection pressure, but ENC values alone provide limited evidence of such selection pressure. Since the GC3s % value of a sequence affects its ENC value, an ENC-GC3s plot (the x-axis is GC3s % and the y-axis is ENC) is usually used to determine the magnitude of the selective pressure. To determine the mutation pressure acting on each gene in the virus, a curve of expected ENC values versus GC3s % values was generated assuming no selection pressure. If the ENC value of a gene is located near the expected curve, the gene is presumed to be under mutational pressure only. Conversely, as the distance from the expected curve increases, the codon usage pattern is increasingly shaped by natural selection or a pressure other than mutation pressure. The intrinsic codon deviation index (ICDI) is a useful measure of the codon bias of genes from species in which the optimal codons are unknown (Freire-Picos et al. 1994); it is highly correlated with the ENC and the codon adaptation index (CAI). The ICDI ranges from 0 (for nonbiased genes) to 1 (for highly biased genes). Generally, an ICDI value of < 0.3 is considered to indicate low codon usage bias, while an ICDI value of > 0.5 is considered to indicate high codon usage bias (Freire-Picos et al. 1994). Using the ICDI, it is possible to evaluate the codon usage bias for each virus and check whether the pattern is consistent with the results of ENC analysis.
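The expected curve in the ENC-GC3s plot is commonly drawn from the approximation given by Wright (1990), ENC = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s expressed as a fraction. The Python sketch below assumes that this standard formula was used; the function names are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC under mutation pressure alone (Wright 1990 approximation).

    gc3s is the GC content at synonymous third positions, as a fraction (0-1).
    """
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)

def enc_deviation(observed_enc, gc3s):
    """Distance of an observed ENC value below (+) or above (-) the curve."""
    return expected_enc(gc3s) - observed_enc
```

Sequences with a large positive deviation lie well below the expected curve, which is the situation interpreted in the text as evidence of pressures other than mutation.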

Calculation of the CAI and similarity D(A, B) analysis

The CAI represents the relative adaptiveness of the codon usage of a gene to the codon usage of highly expressed genes; it can be used to predict the level of gene expression, assess the adaptation of viral genes to a particular host, and compare codon usage among organisms (Sharp and Li 1987). CAI values range from 0 to 1; higher values are expected to indicate greater codon usage similarity to the reference set and higher expression levels (Jia et al. 2015). Homo sapiens was used as the reference set for HIV1, HIV2, and SIV because codon usage information was not available for some host species of SIV. To exclude the effects of the GC and amino acid compositions of the query sequence from the CAI value, 500 random sequences were generated by the Markov method for each HIV1, HIV2, and SIV group. Next, the Kolmogorov–Smirnov test was performed to evaluate whether the expected CAI (eCAI) values of the generated sequences followed a normal distribution; the critical value of the one-sided test was calculated at a significance level of 0.05. The mean and critical values of eCAI were calculated using the CAIcal web server. Next, the actual CAI values for each group were compared with the 5% critical eCAI values; greater actual CAI values were presumed to indicate translation pressure. Then, using the Wilcoxon and Mann–Whitney U tests, values centered around the mean eCAI value for each group were compared (Puigbo et al. 2008a, b). If the query sequence set was excessively large, it could not be successfully submitted to the CAIcal web server to calculate the eCAI value. Thus, for the three genes (gag, pol, and env) of HIV1 and the env gene of SIVmacaca, 500 sequences were randomly extracted and used as the query. The similarity D(A, B) index quantifies virus and host similarity by considering the RSCU values of the 59 codons (excluding those for Met and Trp and the stop codons) of the virus and the host as two vectors in a 59-dimensional space, then calculating the cosine of the angle between them (Zhou et al. 2013). The similarity D(A, B) index was calculated for all lentiviruses using H. sapiens as the host, and the RSCU values of the 59 human codons were calculated from the codon usage table provided by the codon usage database. A correlation test between the similarity D(A, B) value and the CAI value was performed, and the mean similarity D(A, B) values of the HIV1 and HIV2 groups were compared. In addition, the HIV1 and HIV2 groups were compared with the groups of their origin, SIVcpz_gor and SIVsmm.
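Both indices can be illustrated with a minimal Python sketch. It assumes the usual definitions: the CAI as the geometric mean of the relative adaptiveness weights of the codons in a gene (Sharp and Li 1987), and D(A, B) = (1 − R(A, B)) / 2, where R(A, B) is the cosine between the viral and host RSCU vectors. The function names and the dictionary-based inputs are hypothetical, and the sketch is not the CAIcal implementation.

```python
import math

def cai(seq, weights):
    """Geometric mean of relative adaptiveness weights (each in (0, 1]).

    `weights` maps codons to w = (codon frequency) / (frequency of the most
    used synonymous codon) in the reference set; codons absent from `weights`
    (e.g., Met, Trp, stop codons) are skipped.
    """
    seq = seq.upper()
    logs = [
        math.log(weights[seq[i:i + 3]])
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 3] in weights
    ]
    return math.exp(sum(logs) / len(logs))

def similarity_index(virus_rscu, host_rscu):
    """D(A, B) = (1 - cosine(virus RSCU vector, host RSCU vector)) / 2."""
    codons = sorted(host_rscu)  # fixed codon order for both vectors
    a = [virus_rscu.get(c, 0.0) for c in codons]
    b = [host_rscu[c] for c in codons]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return (1 - dot / norm) / 2
```

Under these definitions, identical RSCU vectors give D(A, B) = 0, and the index increases as the viral and host codon usage patterns diverge.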

Results

Evolutionary relationships among primate lentiviruses

Phylogenetic analyses of gag, pol, and env confirmed that HIV1 and HIV2 clustered together with the SIV species identified as their respective origins (Fig. 1). The taxonomies of SIV hosts and the abbreviations of SIV according to host species are shown in Table 2. Separate phylogenetic analyses of each gene showed that HIV1 formed the same clades with SIVcpz and SIVgor, while HIV2 formed the same clades with SIVsmm and SIVmacaca. SIVcpz has been proposed as a recombinant of SIVgsn(3') and SIVrcm(5') (Sharp et al. 2005), and the gene tree constructed in this study also showed the evolutionary distance distribution pattern supporting this assumption; SIVcpz was clustered close to SIVgsn and SIVrcm in the env and pol gene trees, respectively. Furthermore, SIVlst and SIVsun, whose hosts belong to the genus Allochrocebus, were closely related to SIVcolobus. Among the three phylogenetic trees, the env gene exhibited greater variation than the gag and pol genes. Given that the env gene encodes the HIV envelope glycoprotein, a known target for cytotoxic T lymphocytes and neutralizing antibodies, the distinctive level and pattern of env gene variations may be due to selection for antigenic diversity. The selective pressure on the env gene by cell-mediated immunity might result in escape mutations. Thus, the host’s antibody response could continuously drive viral evolution with an initial response to the transmitted virus and with sequential responses to escape variants (Richman et al. 2003). The fact that the env gene can evolve to encode envelope glycoproteins for interactions with altered receptors or coreceptors, which allow the virus to enter alternative host cells, further supports the distinctive level and pattern of env gene variations compared to that of other genes. In most instances where the host monkey species were closely related (i.e., they belonged to the same genus), the corresponding SIV species were also closely related. 
Therefore, groupings for codon analysis of SIV were based on host genus; SIV species were classified into groups after checking phylogenetic relationships and considering tribe or subfamily information. In the grouping process, the specificities of SIVmnd1 and SIVsmm in the phylogenetic tree topology were considered. SIVmnd1 was closely related to SIVmandrillus (comprising SIVmnd2 and SIVdrl) only in the env gene tree. Although sooty mangabeys belong to the genus Cercocebus, SIVsmm was more closely related to the SIVmacaca group than to SIVagi and SIVrcm, which infect members of the genus Cercocebus. Only one complete SIVmnd1 sequence (M27470) was available, and there were large differences between SIVmnd1 and SIVmandrillus in terms of the gag and pol genes; thus, this complete sequence was excluded from the SIVmandrillus group. SIVsmm does not cause AIDS in sooty mangabeys; therefore, it was separated from the genus Cercocebus group and analyzed separately. The remaining SIVagi and SIVrcm were analyzed as SIVcerco. The Colobus, Piliocolobus, and Procolobus genera belong to the African group of the subfamily Colobinae and formed the same phylogenetic clade in all three genes; accordingly, they were grouped as SIVcolobus. Chimpanzees and gorillas are members of the same family, Hominidae; analyses of the three genes showed that SIVcpz and SIVgor formed a clade with HIV1. Therefore, these were included in the SIVcpz_gor group. The genera Cercopithecus and Miopithecus belong to the Cercopithecini tribe and were closely located in the phylogenetic tree; they were grouped as the SIVpithecus group. For the analysis of nucleotide composition and codon usage patterns, SIV from the genus Chlorocebus was named the SIVagm group and SIV from the genus Allochrocebus was named the SIVallo group. The number of gene sequences of each virus used for the final index calculation is shown in Table 1.

Fig. 1

Phylogenetic trees constructed with nucleotide sequences of lentivirus structural genes. Maximum likelihood trees using nucleotide sequences show evolutionary relationships of primate lentiviruses. Bootstrap values > 70% based on 1000 replications are shown at each node, and branch lengths represent evolutionary distances. [◆] HIV1; [▽] HIV2; [△] SIVagm; [▲] SIVallo; [□] SIVcer; [] SIVcol; [○] SIVcpz_gor; [●] SIVmac; [■]SIVmnd; [▼] SIVpit; [] SIVsmm

Compositional properties of lentivirus structural genes

Regarding the overall nucleotide compositions of the gag, pol, and env genes, the A content was most abundant in all three genes, while the C content was lowest in pol and env (Fig. 2). For gag gene, the SIVallo, SIVcerco, and SIVcolobus groups had the lowest C content, but the T content was the lowest in other 8 groups. Analysis of the nucleotide composition at the third codon position among the three genes showed that the A3 content was highest in all genes; the mean A3 contents of pol, gag, and env were 45.03%, 38.76%, and 33.9%, respectively. Gag exhibited the highest A3 content in the HIV1 group, while pol and env exhibited the highest A3 contents in the SIVcpz_gor group. In all three genes, A3 was more frequent than T3, while G3 was more frequent than C3 in all groups except the HIV2 group of the env gene. Figure 2 shows that, with respect to gene-specific AT/GC bias, all three genes were rich in AT. Generally, AT3 bias was more severe than overall AT bias, but the HIV2 and SIVpithecus groups of the env gene uniquely showed the opposite pattern. Therefore, the AT bias in these two groups mainly originated from the AT compositions of the first and second codon positions. The top 4 AT3 biases in pol and env genes were exhibited by HIV1, SIVcerco, SIVcpz_gor, and SIVcolobus. Regarding the gag gene, SIVallo showed the greatest AT3 bias, followed by HIV1, SIVcerco, and SIVcpz_gor. In contrast, the groups with the lowest AT3 biases were HIV2 and SIVpithecus for pol and env genes. The gag gene showed the lowest AT3 bias in the SIVpithecus group, followed by SIVmandrillus and HIV2. Overall, HIV1 and SIVcpz_gor showed greater AT3 biases in all three genes, compared with HIV2.

Fig. 2

Compositional features of lentiviral gag, pol, and env genes. a Distribution of A, T, C, and G in lentiviral gag genes. b Distribution of A, T, C, and G at the third codon position. c Total GC/AT content and GC/AT content at the third codon position. High AT contents and much higher AT3 contents were observed, indicating the predominant use of A-end codons in primate lentiviral genes

Influences of mutational pressure and natural selection on lentivirus codon usage bias

PR2 analysis showed that, for six amino acids (excluding Leu and Arg), all virus groups generally exhibited similar bias patterns (Fig. 3). Val and Gly generally use codons ending in A or G, whereas Pro, Thr, Ala, and Ser use codons ending in A or C. For Pro, Thr, Ala, and Ser, codons ending in C are preferred to codons ending in G; this minimizes the use of CpG dinucleotides, thereby circumventing the restrictive effect of host cell antiviral proteins and permitting viral replication. The host cell zinc finger antiviral protein binds to regions with a high CpG dinucleotide content and inhibits viral replication (Ficarelli et al. 2020); this finding suggests that the HIV1 evasion mechanism may have played a role in the formation of this genetic pattern. Regarding the gag gene, the HIV1 group almost completely avoided the use of the ACG codon for Thr; only the SIVsmm group was biased towards codons ending in C rather than G for Gly. Unlike the other groups, SIVpithecus and SIVmacaca favored codons ending in G (rather than C) for Ser. In the pol gene, the G3/(G3 + C3) values for nine groups ranged from 0.52 to 0.62, whereas the values for the HIV2 and SIVmandrillus groups were 0.36 and 0.79, respectively. In contrast to the other groups, HIV2 was found to prefer the GTC codon (rather than GTG) for Val in the pol gene. For Ala in the env gene, the SIVsmm and SIVmacaca groups had a G3/(G3 + C3) value of 0.6 or higher, indicating that codons ending in G were used more frequently than codons ending in C. Thus, these two groups did not avoid the use of CpG dinucleotides; the HIV2 and SIVcerco groups, which formed the same clade with SIVsmm and SIVmacaca in the phylogenetic analysis of env, also used the GCC and GCG codons evenly. The GC biases for Leu ranged from 0.38 to 0.80 in gag and from 0.37 to 0.76 in pol; in env, the A3/(A3 + T3) and G3/(G3 + C3) values did not deviate significantly from the point (0.5, 0.5).

Fig. 3

Mean PR2-bias plot of gag, pol, and env genes of 11 lentivirus groups. AT-bias (A3/(A3 + T3)) and GC-bias (G3/(G3 + C3)) were calculated separately for each four-fold degenerate amino acid (Val, Pro, Thr, Ala, and Gly) and four-codon family of Leu, Ser, and Arg. For all three structural genes of primate lentiviruses, consistent G-end codon avoidance was observed in the amino acids Pro(CC-), Thr(AC-), Ala(GC-), and Ser(TC-), and this CpG dinucleotide avoidance pattern was particularly evident in pol genes

Neutral evolution analysis enables evaluation of the relationship between GC3s content and GC12s content, which can be used to quantify the relative magnitudes of selection and mutation pressures (Fig. 4). Theoretically, the absolute value of the slope of the regression line is within the range 0 to 1; values closer to 0 indicate greater selection pressure. Thus, if the GC12s value changes in a manner similar to the GC3s value, the virus is under weak selection pressure; if changes in the GC12s value are limited to a small range, the virus is under high selection pressure. A correlation analysis of each virus group was performed for each of the three genes. Regarding the gag gene, HIV1 (r = −0.053, p < 0.0001), HIV2 (r = 0.37, p = 0.001), SIVcerco (r = −0.9746, p = 0.0048), SIVmacaca (r = −0.2895, p < 0.0001), SIVmandrillus (r = −0.9302, p = 0.0008), and SIVpithecus (r = 0.5435, p = 0.0363) showed significant correlations between GC12s and GC3s values. Regarding the pol gene, HIV1 (r = 0.2014, p < 0.0001), HIV2 (r = 0.4803, p = 0.0008), SIVagm (r = 0.3819, p = 0.015), SIVcolobus (r = 0.7664, p = 0.0445), SIVcpz_gor (r = −0.6292, p < 0.0001), and SIVpithecus (r = 0.589, p = 0.034) showed significant correlations between GC12s and GC3s values. Regarding the env gene, HIV1 (r = 0.0653, p < 0.0001), SIVagm (r = 0.2926, p = 0.0038), SIVmacaca (r = −0.6714, p < 0.0001), SIVpithecus (r = 0.6066, p = 0.0076), and SIVsmm (r = −0.6667, p < 0.0001) showed significant correlations between GC12s and GC3s values. All three genes showed a distribution generally parallel to the x-axis, suggesting that selection pressure plays a major role in the formation of codon usage in primate lentiviruses. Regarding the gag gene, all groups except HIV2 had a negative slope. The slope of the regression line for a virus group indicates the relative magnitude of the directional mutation pressure experienced by the virus.
Regarding the gag gene, the relative contributions of directional mutation pressure to codon usage pattern formation in the HIV1, HIV2, SIVcerco, SIVmacaca, SIVmandrillus, and SIVpithecus groups were 2.55%, 17.25%, 41.07%, 4.68%, 33.28%, and 13.94%, respectively. Thus, in the gag gene, the HIV1 group showed the highest selection pressure (97.45%), followed by the SIVmacaca group (95.32%), the SIVpithecus group (86.06%), the HIV2 group (82.75%), the SIVmandrillus group (66.72%), and the SIVcerco group (58.93%). Regarding the pol gene, the slopes of the HIV1, HIV2, SIVagm, SIVcolobus, and SIVpithecus groups, but not the SIVcpz_gor group, were > 0, indicating positive correlations. The directional mutation pressures of the HIV1, HIV2, SIVagm, SIVcolobus, SIVcpz_gor, and SIVpithecus groups in the pol gene were 7.26%, 41.36%, 0.01%, 17.51%, 19.38%, and 7.83%, respectively. Among them, SIVagm (99.99%) and HIV2 (58.64%) were under the maximum and minimum selection pressures in pol, respectively; the HIV1, SIVpithecus, SIVcolobus, and SIVcpz_gor groups were under selection pressures of 92.74%, 92.17%, 82.49%, and 80.62%, respectively. Regarding the env gene, the GC12s and GC3s values showed positive correlations in the HIV1, SIVagm, and SIVpithecus groups, while they showed negative correlations in the SIVmacaca and SIVsmm groups. The magnitudes of directional mutation pressure in the HIV1, SIVagm, SIVmacaca, SIVpithecus, and SIVsmm groups in the env gene were 3.06%, 9.18%, 15.24%, 7.97%, and 40.67%, respectively; the groups under the most and least selection pressure were HIV1 (96.94%) and SIVsmm (59.33%), respectively. In addition, selection pressures in the SIVpithecus, SIVagm, and SIVmacaca groups were 92.03%, 90.82%, and 84.76%, respectively.

Fig. 4

Neutrality plot for lentiviral gag, pol, and env genes. Different virus groups are represented by different icons, and each icon represents a sequence. If a correlation exists, the linear regression equation is presented; “not fit” indicates that no correlation was present between GC12s and GC3s values. The extent of change in GC12s was very limited compared to that in GC3s, implying that selection pressure plays a major role in shaping codon usage patterns in primate lentiviruses compared to mutational pressure

Variation in RSCU value and codon usage preference

Through analysis of the mean RSCU values for all virus groups using the gag, pol, and env genes, overrepresented codons (RSCU > 1.6) were identified for each group (Fig. 5). Among all codons of the three genes, the CTG (Leu) codon in the gag gene had the largest RSCU difference between virus groups; notably, groups in which it was overrepresented (SIVmacaca: 1.68) and underrepresented (HIV1: 0.51) coexisted. The AGA codon of the gag gene was the most frequently overrepresented codon, and the AGC codon showed the largest variation. Regarding the mean RSCU value of AGC, HIV1 and HIV2 had the highest mean value (2.2), followed by SIVmacaca and SIVsmm (1.7), and then SIVallo and SIVpithecus (0.7). Figure 5 shows that the RSCU values of the TTA, ATA, and GTA codons of gag were significantly higher in the HIV1 group than in the other groups. Concerning the two-codon amino acids Tyrosine (Tyr) and Cysteine (Cys), the HIV1 group was biased toward the use of the TAT and TGT codons, in contrast to the even use of the two synonymous codons in the HIV2 group. In terms of Leu codon usage in pol, the HIV2 group was considerably different. Among the 10 groups other than HIV2, the mean RSCU value of the TTA codon was high (2.18); in HIV2, it was 1.52. Concerning the CTA codon, the mean RSCU value of the other groups was 1.27, whereas HIV2 showed a value of 2.27; HIV2 thus exhibited a greater preference for the CTA codon than for the TTA codon. In terms of Ser codon usage in pol, the TCA and AGT codons were preferred; the HIV2, SIVallo, SIVmacaca, and SIVsmm groups had a greater preference for the TCA codon, while the remaining groups had a greater preference for the AGT codon. In terms of Cys codon usage in pol, synonymous codons were used evenly in most groups; a pattern of TGT codon overrepresentation was observed in the HIV1 and SIVcpz_gor groups.
Regarding the pol gene, codons ending in A or T were preferred for all two-codon amino acids except Cys. Ser codon usage in env differed significantly among virus groups: all groups except SIVmacaca (which most frequently used TCT) preferred TCA or AGT, and the HIV1, SIVagm, SIVallo, and SIVcolobus groups preferred AGT. In terms of Cys codon usage in env, all groups preferred the TGT codon except for HIV2 (which most frequently used TGC). Concerning Ile codon usage, pol and env were biased toward the ATA codon; in gag, the SIVcerco, SIVmacaca, SIVmandrillus, and SIVsmm groups preferred ATT. For Thr, the ACA codon was overrepresented and the ACG codon was underrepresented in all three genes. The CCA, ACA, GCA, AGA, AGG, and GGA codons were overrepresented; the corresponding amino acids Pro, Thr, Ala, Arg, and Gly showed a similar pattern in all three structural genes.
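
The RSCU values discussed above can be reproduced with a short calculation. The following is a minimal sketch, assuming the conventional definition of RSCU (the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid) and the standard genetic code; the function names `rscu` and `classify` are illustrative and are not taken from the software used in this study. The thresholds follow the ones stated in the text (> 1.6 and < 0.6).

```python
from collections import Counter

# Standard genetic code (NCBI translation table 1) in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def rscu(seq):
    """Return {codon: RSCU} for every amino acid that occurs in seq.

    Stop codons and single-codon amino acids (Met, Trp) are excluded.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(
        seq[i:i + 3] for i in range(0, usable, 3) if seq[i:i + 3] in CODON_TABLE
    )
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if len(family) < 2 or total == 0:
            continue  # Met, Trp, or an amino acid absent from the sequence
        for codon in family:
            # observed count divided by the mean count within the family
            values[codon] = counts[codon] * len(family) / total
    return values


def classify(value, over=1.6, under=0.6):
    """Label an RSCU value using the thresholds given in the text."""
    if value > over:
        return "overrepresented"
    if value < under:
        return "underrepresented"
    return "neutral"


# Toy sequence (hypothetical): three TGT codons and one TGC codon.
for codon, value in sorted(rscu("TGTTGTTGTTGC").items()):
    print(codon, round(value, 2), classify(value))
```

Within each amino acid, the RSCU values sum to the number of synonymous codons, so a value of 1.0 corresponds to unbiased use.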

Fig. 5

RSCU plot of lentiviral gag, pol, and env genes. The degree of variation among the 11 groups was the smallest in the pol gene. Codons with RSCU values > 1.6 were considered overrepresented, and codons with RSCU values < 0.6 were considered underrepresented. The CCA, ACA, GCA, AGA, AGG, and GGA codons were overrepresented (RSCU > 1.6) in all structural genes of lentiviruses and, based on RSCU values, there was a dominant preference for A-ending codons

Codon usage patterns in gag, pol and env

The mean ENC values of gag, pol, and env were 47.80, 43.47, and 49.18, and the ranges were 44.27–52.52, 40.62–46.62, and 46.53–53.42, respectively. The mean ENC value of pol was lowest, and it showed the least variance among virus groups (Fig. 6). Therefore, the overall codon usage bias of pol was highest among the structural genes, while the codon usage bias of env was lowest. The virus groups showing high codon bias in all three genes were HIV1 and SIVcolobus; the SIVpithecus group consistently showed low codon bias. The HIV2 and SIVsmm groups showed low codon usage bias only in env. In the gag gene, the SIVagm, SIVcpz_gor, and SIVmandrillus groups demonstrated a vertical distribution on the ENC-GC3s plot, with small GC3s % variation and large ENC value variation (Fig. 7); therefore, within these groups, a smaller ENC value is indicative of greater selection pressure. Also in the gag gene, the SIVpithecus group had a distinct distribution of sequences on the ENC-GC3s plot. The SIVasc group was under greater selection pressure than the SIVdeb group. Regarding the pol gene, the degrees of selection pressure were similar among virus groups. Contrary to this overall pattern, the SIVagm group was under less selection pressure; within this group, smaller ENC values indicated greater selection pressure. Also in the pol gene, the HIV2 group was under slightly greater selection pressure than the HIV1 group, and SIVlhoest in the SIVallo group was less affected by selection pressure than SIVsun. Regarding the env gene, the SIVsmm and SIVcolobus groups showed little variation in GC3s % and a large difference in ENC values; therefore, the smaller ENC values in the two groups were indicative of greater selection pressure. In the env gene, SIVcol was less affected by selection pressure than SIVwrc and, in contrast to the pol gene, SIVlhoest was under greater selection pressure than SIVsun.
Regarding the env gene, the SIVmacaca group showed a distribution parallel to the abscissa with a large GC3s % variation and a small ENC value variation, suggesting that the larger GC3s % value was indicative of greater selection pressure. Although the sequences in the SIVpithecus group of env were located throughout the ENC-GC3s plot, SIVtal was under greater selection pressure, compared with SIV that infects other species of the genus Cercopithecus. Consistent with the ENC results, the ICDI values of all three genes were < 0.3, indicating low codon usage bias (Table 3). In terms of ICDI results, the groups with comparatively high codon usage bias in the three genes were HIV1 and SIVcolobus. ENC and ICDI index analyses showed that HIV1 had a greater codon usage bias than HIV2, consistent with a prior report (Vidyavijayan et al. 2017). Based on Pearson correlation analysis of ENC, ICDI, CAI, and similarity D(A, B) findings for each gene, pairwise negative correlations were observed between ENC and ICDI, and between CAI and similarity D(A, B). The ICDI and CAI values also showed negative correlations (Table 4). Therefore, greater overall codon usage bias for a particular gene was associated with lower similarity to the H. sapiens codon usage pattern.
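
The reasoning behind the ENC-GC3s plot can be illustrated with a short calculation. The following is a minimal sketch, assuming the expected curve follows Wright's formula, ENC = 2 + s + 29 / (s² + (1 − s)²), where s is GC3s expressed as a fraction; the helper `enc_shortfall` and the GC3s value in the example are hypothetical and are not taken from this study. A sequence that lies well below the expected curve has a stronger codon usage bias than its GC3s content alone would predict.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage is shaped only by GC3s (Wright's formula).

    gc3s is a fraction in [0, 1], not a percentage.
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_shortfall(observed_enc, gc3s):
    """Relative distance of a sequence below the expected curve.

    Values near 0 suggest that GC3s composition alone explains the bias;
    larger positive values suggest additional constraints such as selection.
    """
    expected = expected_enc(gc3s)
    return (expected - observed_enc) / expected


# Hypothetical point: the mean pol ENC (43.47) at an assumed GC3s of 0.30.
print(round(expected_enc(0.30), 2), round(enc_shortfall(43.47, 0.30), 3))
```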

Fig. 6

ENC box plot of lentiviral gag, pol, and env genes. Cross: mean ENC value; dots: minimum and maximum ENC values. The HIV1 and SIVcpz_gor groups are highlighted by red boxes and the HIV2 and SIVsmm groups are highlighted by blue boxes. The mean ENC values for gag, pol, and env were 47.80, 43.47, and 49.18, respectively, indicating that codon diversity was greater in the gag and env genes than in the pol gene. HIV1 showed a strong synonymous codon usage bias in all three genes

Fig. 7

ENC-GC3s plot of lentiviral gag, pol, and env genes. The smooth line is the expected value of ENC calculated based on GC3s content. In the gag gene, sequences within the same group show greater differences in selection pressure than in the pol and env genes. A small portion of HIV1 sequences located on the expected curve within the env ENC-GC3s plot are clones of highly neurovirulent HIV1 isolates

Table 3 ICDI values of lentiviral gag, pol, and env genes
Table 4 Correlations of ENC, ICDI, CAI, and D(A, B)

Differences in viral adaptation to codon usage in humans

The CAI value of gag ranged from 0.715 to 0.746 among the 11 groups; it was 0.746 ± 0.005 in the SIVmacaca group, indicating the greatest adaptability to the reference set, H. sapiens (Table 5). The CAI values of pol and env were highest in the SIVmandrillus group (0.708 ± 0.004) and HIV2 group (0.730 ± 0.005), respectively; they were lowest in the HIV1 group (0.691 ± 0.004) and SIVcpz_gor group (0.700 ± 0.006), respectively. Among the three genes, pol and gag showed the smallest and largest differences, respectively, in CAI values among virus groups. Because the CAI value is affected by the nucleotide and amino acid composition of the sequence, the eCAI value at a significance level of 0.05 was calculated using the method established by Puigbò et al. Five hundred sequences were generated by the Markov method to reflect the nucleotide and amino acid compositions of the three genes. The Kolmogorov–Smirnov normality test showed that the 500 nucleotide sequences generated in all groups exhibited a normal distribution (< 0.061). Next, the actual CAI value for each group and the eCAI value at the 5% significance level were compared. There were no significant differences between the CAI and mean eCAI values in the 11 groups in terms of the pol and env genes. Therefore, the CAI values of the pol and env genes can be fully accounted for by the GC content of the sequence itself, suggesting that these genes experienced less translation pressure. Regarding the gag gene, the SIVallo, SIVcolobus, SIVcpz_gor, SIVmacaca, and SIVsmm groups were adapted to the codon usage pattern of H. sapiens; the other groups did not exhibit a significant difference from the calculated eCAI values. Therefore, HIV1 and HIV2 have high codon usage similarity with H. sapiens in all genes, but this similarity was not mainly derived from translation pressure. Regarding the gag gene, the CAI value was centered on the mean value of the eCAI for the SIVallo, SIVcolobus, SIVcpz_gor, SIVmacaca, and SIVsmm groups.
At a significance level of 0.05, the Mann–Whitney U test indicated that the SIVmacaca group had significantly greater adaptability to H. sapiens than the other four groups, while the SIVsmm group had greater adaptability to H. sapiens than the SIVcpz_gor group.
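
For orientation, the CAI calculation itself can be sketched as follows. This is a minimal illustration, assuming the usual definition of CAI as the geometric mean of the relative adaptiveness w of each codon, where w is the reference-set count of a codon divided by the count of the most used synonymous codon. The function names, the 0.5 floor for unseen codons, and the toy reference counts are assumptions made for this example; they are not the H. sapiens reference table or the software used in this study, and the eCAI randomization step is not reproduced here.

```python
import math


def relative_adaptiveness(reference_counts):
    """Map each codon to w = count / highest count in its synonymous family.

    reference_counts: {amino_acid: {codon: count}} from a reference gene set.
    Single-codon amino acids (Met, Trp) and stop codons should be left out.
    """
    weights = {}
    for family in reference_counts.values():
        best = max(family.values())
        if best == 0:
            continue  # amino acid absent from the reference set
        for codon, count in family.items():
            # Assumption: a floor of 0.5 avoids log(0) for unseen codons.
            weights[codon] = max(count, 0.5) / best
    return weights


def cai(seq, weights):
    """Geometric mean of w over the codons of seq that have a weight."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    logs = [
        math.log(weights[seq[i:i + 3]])
        for i in range(0, usable, 3)
        if seq[i:i + 3] in weights
    ]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts for two amino acids (Cys and Arg) only.
toy_reference = {
    "C": {"TGT": 10, "TGC": 40},
    "R": {"AGA": 20, "CGT": 5},
}
w = relative_adaptiveness(toy_reference)
print(round(cai("TGCAGA", w), 3), round(cai("TGTTGC", w), 3))
```

A sequence built only from the most used reference codons reaches a CAI of 1, and every less favored codon pulls the geometric mean down.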

Table 5 CAI and 5% critical eCAI values of lentiviral genes

Quantification of codon usage similarity between virus and host showed that the similarity D(A, B) values of HIV1 and HIV2 in gag were 0.117 ± 0.006 and 0.086 ± 0.008, respectively (Table 6); throughout this analysis, a lower D(A, B) value is interpreted as greater similarity to the host codon usage pattern. The similarity D(A, B) values of the SIVcpz_gor and SIVsmm groups (the presumed origins of HIV1 and HIV2) were 0.095 ± 0.008 and 0.087 ± 0.007, respectively. At a significance level of 0.05, the Mann–Whitney U test showed that the similarity D(A, B) value of HIV1 was significantly higher than the corresponding value of HIV2, and also significantly higher than the value of SIVcpz_gor. However, the similarity D(A, B) values did not significantly differ between the HIV2 and SIVsmm groups. The similarity D(A, B) values of HIV1, HIV2, SIVcpz_gor, and SIVsmm to H. sapiens in pol were 0.132 ± 0.005, 0.121 ± 0.006, 0.132 ± 0.006, and 0.119 ± 0.008, respectively. The Mann–Whitney U test showed results identical to the findings in the gag analysis, using a significance level of 0.05. The similarity D(A, B) values of HIV1, HIV2, SIVcpz_gor, and SIVsmm to H. sapiens in env were 0.096 ± 0.005, 0.063 ± 0.005, 0.099 ± 0.004, and 0.077 ± 0.005, respectively. Using the Mann–Whitney U test for assessment of env, the similarity D(A, B) value of the HIV1 group was significantly higher than the corresponding value of the HIV2 group; however, the similarity D(A, B) values of HIV1 and HIV2 were significantly lower than the corresponding values of SIVcpz_gor and SIVsmm, respectively. In general, in terms of codon usage, lentiviruses showed high similarity to H. sapiens; this similarity was highest in env and lowest in pol. In all three structural genes, HIV2 exhibited comparatively greater similarity to the human codon usage pattern than HIV1. In the gag gene, HIV1 showed lower similarity to human codon usage than SIVcpz_gor. In the env gene, both HIV1 and HIV2 showed a codon usage pattern more similar to that of humans than their presumed origin viruses, SIVcpz_gor and SIVsmm.
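
The direction of the D(A, B) comparison can be made concrete with a small example. The sketch below assumes the commonly used cosine-based definition, in which R(A, B) is the cosine of the angle between the viral and host RSCU vectors and D(A, B) = (1 − R(A, B)) / 2; this definition is not restated in this section, so it should be read as an assumption. The function name `similarity_index` and the toy RSCU values are hypothetical.

```python
import math


def similarity_index(virus_rscu, host_rscu):
    """Return D(A, B) = (1 - R(A, B)) / 2 for two {codon: RSCU} mappings.

    R(A, B) is the cosine similarity computed over the codons shared by
    both mappings. D ranges from 0 (identical usage pattern) to 1.
    """
    codons = sorted(set(virus_rscu) & set(host_rscu))
    if not codons:
        raise ValueError("no codons in common")
    a = [virus_rscu[c] for c in codons]
    b = [host_rscu[c] for c in codons]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        raise ValueError("an RSCU vector is all zeros")
    return (1.0 - dot / norm) / 2.0


# Toy RSCU values for the two Cys codons (hypothetical, for illustration).
host = {"TGT": 0.9, "TGC": 1.1}
even_virus = {"TGT": 1.0, "TGC": 1.0}
biased_virus = {"TGT": 1.8, "TGC": 0.2}
print(similarity_index(even_virus, host) < similarity_index(biased_virus, host))
```

Under this definition, the virus with the more host-like codon usage yields the smaller D(A, B) value, which matches the way the values in Table 6 are read in the text.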

Table 6 Similarity D(A, B) values of lentiviral gag, pol, and env genes

Discussion

Because of codon degeneracy, 18 amino acids (excluding Met and Trp) are encoded by two or more codons. Synonymous codons encoding the same amino acid are not used randomly, and varying degrees of codon usage bias have been confirmed in almost all species. The use of different synonymous codons directly affects translation, including its initiation (Kudla et al. 2009; Goodman et al. 2013), efficiency, and accuracy (Drummond and Wilke 2008), as well as RNA structure and folding (Shabalina et al. 2006), thus affecting gene function. Optimizing the codon usage of the HIV1 gag gene increases the expression level of the virus (Deml et al. 2001; Gao et al. 2003; Smith et al. 2004), confirming that improvements in mRNA stability and nuclear escape are critical factors, rather than increased translational efficiency (Kofman et al. 2003). In addition, HIV1 replication and env expression were aborted by substitution of the synonymous codon (AGG → CGU) encoding Arg in the gp41 gene region, due to disruption of the secondary structure of intronic splicing silencer RNA (Jordan-Paiz et al. 2020). Analysis of the synonymous codon usage pattern enables prediction of the evolutionary direction of the viral genome, and an adequate understanding of viral codon usage facilitates codon editing. These aspects allow identification of gene function and elucidation of novel antiviral mechanisms in the innate immune system, as well as unknown areas of the viral life cycle.

In the codon usage analysis of primate lentiviruses in this study, a strong A bias was confirmed in all three structural genes, consistent with previous findings regarding HIV (van Hemert and Berkhout 1995; Pandit and Sinha 2011). Unlike pol and env, in which C is the least used nucleotide, gag shows diminished use of both T and C; in gag, T was the least used nucleotide in all primate lentivirus groups except SIVallo, SIVcerco, and SIVcolobus. Furthermore, in env, only the HIV2 and SIVpithecus groups had a lower AT3 bias, compared with the overall AT bias; the AT bias in these two groups was mainly formed by the AT composition of the first and second codon sites. The HIV1 genome shows an extreme A bias and lacks CpG dinucleotides; thus, for Arg, the AGA codon is used most frequently and all CGN codons are underrepresented (van Hemert and Berkhout 1995). In this study, we found that the synonymous codon usage pattern of Arg was highly conserved in all primate lentivirus groups. Moreover, the use of synonymous codons for the two-codon amino acid Cys in HIV1 was heavily biased toward the TGT codon, unlike the even use of the two synonymous codons in humans (van Hemert and Berkhout 1995). RSCU analysis that included HIV2 and the SIVs showed that only HIV1 in gag, and HIV1 and SIVcpz_gor in pol, exhibit this TGT bias. In the gag and pol genes of the other primate lentiviruses, the TGT and TGC codons are used evenly, similar to the findings in humans. In a PR2 analysis, an A > T bias was observed in all four-codon amino acids of gag and pol, and a G > C bias was observed in Val and Gly. However, a C > G bias was observed in Pro, Thr, and Ala; this presumably avoids CpG dinucleotides in the codons for Pro, Thr, and Ala. Furthermore, for Val in pol, only the HIV2 group used GTC more frequently than GTG.
Regarding the env gene, the SIVsmm and SIVmacaca groups used GCG more frequently than GCC, unlike the strong GCC codon usage bias in gag and pol; the HIV2 and SIVcerco groups used the two codons in a similar manner. Phylogenetic analysis of env indicated that these four groups (SIVsmm, SIVmacaca, HIV2, and SIVcerco) showed the closest evolutionary relationship, consistent with the synonymous codon usage pattern of Ala. Therefore, editing the corresponding codon could help determine whether this pattern has a biologically important function. Notably, the ENC-GC3s plot and neutral evolution analysis showed that all primate lentiviruses were more affected by selection pressure than by mutation caused by the GC composition of the gene, consistent with prior reports regarding HIV1. The CAI and similarity D(A, B) values indicated that although there was a high degree of similarity to human codon usage in all three structural genes of HIV, this similarity was not caused by translation pressure. In addition, compared with HIV1, the codon usage of HIV2 is more similar to human codon usage, but its overall codon usage bias is lower. Finally, the origin viruses of HIV (SIVcpz_gor and SIVsmm) exhibit greater similarity to human codon usage in the gag gene, confirming their robust adaptability to human codon usage. Therefore, under selection pressure in the gag gene, HIV1 and HIV2 may have evolved away from human codon usage after interspecies transmission from SIV hosts to humans.

Conclusion

We confirmed the overall codon usage patterns of primate lentiviruses, then explored the evolutionary and genetic characteristics of HIV1, HIV2, and SIV. Because the grouping of sequence data for codon usage pattern analysis is based on the phylogenetic topology of gag, pol, and env, as well as the classification systems of SIV hosts, differences in patterns within groups and differences due to HIV lineages or subtypes and geographic distribution were not considered. Information such as codon deoptimization, dinucleotide usage, and codon pair usage can be applied to multiple RNA viral genomes to generate novel attenuated vaccines. By overcoming safety and stability issues, information from codon usage analysis will be useful for attenuated HIV1 vaccine development. A recoded HIV1 variant can be used as a vaccine vector or in immunotherapy to induce specific innate immune responses. Further research regarding HIV1 dinucleotide usage and codon pair usage will facilitate new approaches to the treatment of AIDS.