Introduction

A curative therapy or preventive vaccine for HIV-1-infected patients remains elusive to date. Standard HIV treatment is confronted with the emergence of viral resistance to existing drug classes, necessitating the development of inhibitors with new mechanisms of action [1]. The gag polyprotein, essential for HIV-1 morphogenesis, comprises four major domains (matrix, capsid, nucleocapsid, p6) and two small spacer peptides (p1, p2) [2]. Recently, HIV-1 inhibitors that target different stages of virion morphogenesis demonstrated promising antiviral activity, mainly by inhibiting capsid assembly, disrupting nucleocapsid binding with viral RNA/DNA or blocking proteolytic processing of polyproteins during maturation [2–5].

HIV-1 subtype B isolates were predominantly used for the in vitro experiments. Non-B subtypes, however, account for 90% of HIV-1 infections worldwide [6], and amino acid (AA) compositions can differ by up to 30% between subtypes [7]. Recently, treatment failure of patients in a phase II clinical study of the maturation inhibitor bevirimat was attributed to natural polymorphisms at drug binding positions, which occurred in subtype-specific patterns [8]. Studies that extensively investigate the implications of HIV-1 diversity for gag-directed drug development are lacking to date. In this large-scale analysis, we examined the distribution of naturally occurring sequence variability in full-length gag sequences of major HIV-1 subtypes. Moreover, we evaluated the impact of HIV-1 subtypes on the conservation of the gag drug binding positions and multisite binding pockets published to date.

Results

We analyzed 10862 full-length gag sequences that fulfilled the quality criteria, encompassing 8 HIV-1 group M subtypes and CRFs: A1 (n = 1648), B (n = 4131), C (n = 2780), D (n = 443), F1 (n = 35), G (n = 49), CRF01_AE (n = 1714) and CRF02_AG (n = 62). Sequences were sampled from 61 countries between 1981 and 2012. Additional file 1: Table S1 summarizes more than 50 gag inhibitors including their binding sites, target protein, mechanism of action, HIV-1 subtypes and PDB data. These candidate inhibitors were either small organic molecules or peptides and primarily targeted the capsid or nucleocapsid proteins. A total of 136 gag positions were reported as drug binding positions, of which 53 interacted with more than one inhibitor.

The AA distribution at 500 gag positions among HIV-1 group M sequences is shown in Figure 1, and subtype-specific distributions are shown in Additional file 2: Figure S1. Heterogeneity in consensus sequences was observed at 142 (28.4%) positions across subtypes, while pairwise comparisons of consensus sequences showed an average difference of 11.6% between subtypes. On average, 43.6 ± 2.7% of positions harbored at least one polymorphism relative to their subtype consensus residue (Table 1). The capsid protein contained the lowest proportion of polymorphic positions (29.4%), followed by nucleocapsid (42.5%), matrix (59.9%) and p6 (65.6%). Moreover, of 147 conserved positions in gag, 67.8% were in capsid, 11.2% in nucleocapsid, 10.5% in matrix and 4.6% in p6. Pairwise AA diversity (Additional file 3) of full-length gag sequences decreased from 17.0 ± 1.6% between subtypes to 9.0 ± 1.0% within subtypes (Table 2). The mean AA diversity was significantly lower for capsid (5.0 ± 0.8%) than for nucleocapsid (7.9 ± 2.8%), matrix (13.2 ± 2.0%) or p6 (14.7 ± 2.0%) (p-value < 0.05) (Table 3). The distributions of the conservation index (CI; see Methods) along full-length gag revealed three conserved regions, located at the nucleocapsid zinc-finger domains, the capsid N-terminal domain (NTD) and the capsid C-terminal domain (CTD) (Figure 2).

Figure 1

Distribution of natural variations at 500 gag positions of HIV-1 group M (subtypes: A1, B, C, D, F1, G and CRF01_AE, CRF02_AG). The first position of each protein region is labeled with its protein name in a box. Annotated protein regions are indicated as colored bars: light-green for matrix (positions 1–132), light-blue for capsid (133–363), dark-green for p2 (364–377) and p1 (433–448), dark-blue for nucleocapsid (378–432) and grey for p6 (449–500). HXB2 indices for both full-length gag and individual proteins are shown on top of the colored bars (e.g. '180|48' indicates the gag position 180 and the capsid position 48). Known drug binding positions are marked with red stars. The consensus subtype B amino acid for each position is shown directly under the bar, and is highlighted green when the consensus AA differed in one or more subtypes. Natural polymorphisms are shown below the consensus subtype B amino acids; proportions (%) are colored blue for proportions ≥ 5% and orange otherwise. Figure S1 in Additional file 2 provides the distribution of natural polymorphisms within each individual subtype.

Table 1 Natural polymorphism proportions in gag domains and drug binding positions across 8 HIV-1 subtypes and CRFs (%)
Table 2 The inter- and intra-subtype diversity of gag AA sequences in 8 HIV-1 subtypes and CRFs (%)
Table 3 The pairwise AA diversity of gag domains in 8 HIV-1 subtypes and CRFs (%)
Figure 2

Amino acid conservation in HIV-1 full-length gag. (A) Density plots of CI values are shown for 8 HIV-1 subtypes. Secondary structures are indicated for each protein region, with thick lines for helices and thin lines for coiled-coil structures. Positions conserved in all subtypes are colored blue (layer 1 in a small circle), known drug binding positions are colored red (layer 2) and regions where HIV-1 peptide inhibitors have been derived are colored green (layer 3). (B) Distributions of CI values at 500 gag positions across 8 HIV-1 subtypes and CRFs. Visualization software: Circos v0.64 (http://circos.ca/).

Subtype-specific AA prevalence at the 136 drug binding positions is shown in Figure 3. Most positions were located within capsid (72.1%) followed by nucleocapsid (12.5%), matrix (9.6%) and p2 (5.9%). Of these positions, 41.2% were conserved across all subtypes, while 20.6% showed a different consensus AA in one or more subtypes. On average, 33.8% of drug binding positions harbored at least one polymorphism and 16.3% had at least one polymorphism above 5% prevalence. Non-B subtypes displayed 32 polymorphisms at 20 binding positions that were absent in subtype B. Every inhibitor had at least one polymorphic binding position and 15 inhibitors had more than 50% of drug binding positions showing natural polymorphisms. Among all inhibitors, PF-3450074 [9] targeted the most conserved binding positions at the capsid N-terminal domain, with only one being polymorphic (T107A/S, ≤ 6.2%) (Additional file 1: Table S2).

Figure 3

Natural polymorphisms at 136 drug binding positions in 8 HIV-1 subtypes and CRFs. For each gag position, the HXB2 index is shown at the top, followed by the consensus amino acid and natural polymorphisms. Polymorphisms with proportions ≥ 5% are indicated with blue superscripts; orange otherwise.

Finally, we analyzed known crystal structures of 9 protein-inhibitor complexes, with 8 inhibitors targeting a total of 75 positions (binding pockets 1–4) in capsid and one targeting 23 positions in nucleocapsid (binding pocket 5) (Figure 4, Additional file 2). Natural polymorphisms with prevalence ≥ 5% were observed at 28 positions of the binding pockets. Conserved positions made up 56% of the positions in the capsid binding pockets and 43% of those in the nucleocapsid binding pocket. Pocket 1 had the lowest average CI value (0.0024), followed by pocket 2 (0.008), pocket 3 (0.0216), pocket 4 (0.0337) and pocket 5 (0.0369).

Figure 4

Mapping of drug binding positions and binding pockets to HIV-1 gag protein monomers. The surface spectrum colors indicate the most to the least conserved positions in subtype B from blue CI = 0 to pink CI ≥ 0.1. (A) Secondary structures of 4 gag proteins and 2 spacer peptides, annotated with five drug binding pocket locations. Gag proteins in cartoon representation are colored olive for matrix, blue for capsid, yellow for nucleocapsid, grey for p6, gold for p1 and p2. Bound inhibitors are represented in green sticks. (B) Mapping of drug binding positions to a surface representation of gag structure, with front and back views. Hypothesized binding positions of bevirimat are also annotated; known drug binding positions are colored red. (C) Surface representation of gag conservation in HIV-1 subtype B (Figure S3 in Additional file 2 illustrates other subtypes). (D) Surface representations of five drug binding pockets in HIV-1 subtype B (Figure S2 in Additional file 2 shows other subtypes). Inhibitor names are annotated according to publication (Additional file 1: Table S1). PDB entries of gag proteins: matrix, 1HIW; capsid, 3NTE; p2, 1U57; nucleocapsid, 2M3Z; p6, 2C55. PDB data of capsid inhibitors: 2BUO, 2L6E, 2XDE, 4E91, 4E92, 2JPR and 4INB, each of which was superimposed to 3H4E using PDBs of 5 drug binding pockets: pocket 1, 2XDE; pocket 2, 4INB; pocket 3, 2BUO; pocket 4, 4E91; pocket 5, 2M3Z. PyMOL V1.5 (http://www.pymol.org/).

Discussion and conclusions

To our knowledge, our large-scale analysis provides the first detailed mapping of functional conservation of gag across major HIV-1 subtypes, with implications for the rational design of gag inhibitors. With more than 50 gag inhibitors published to date, targeting virion morphogenesis is considered a potential new drug class for HIV-1 treatment [2]. A clinical proof-of-concept was demonstrated in a phase II clinical trial of the maturation inhibitor bevirimat [10], which blocks proteolytic processing at the capsid-p2 cleavage site [11]. Lack of response was observed in 50% of patients and attributed to naturally occurring polymorphisms in the p2 region [8]. A single polymorphism, V370A, is sufficient for a 40-fold reduction in bevirimat drug susceptibility [12], with A370 representing the consensus amino acid in several non-B subtypes. Natural diversity was also observed to affect the effectiveness of other experimental gag inhibitors [13–15]. Polymorphisms T190I, E230D and I256V, for instance, reduced drug susceptibility to the benzodiazepine and benzimidazole compounds [13]. Moreover, known HIV vaccine candidates containing the subtype B gag gene in HIV-derived vectors did not show sufficient protective efficacy in several large-scale clinical trials [16]. The high diversity of the gag and env genes within and between subtypes can contribute to the challenges of designing a global HIV vaccine that neutralizes all HIV-1 subtypes [17]. For the development of an HIV vaccine and of a potential new drug class targeting virion morphogenesis [2], an assessment of gag functional conservation and of polymorphisms at known drug binding positions is warranted.

We found that 23.4% of drug binding positions in full-length gag showed natural polymorphisms in non-B subtypes that were not detected in subtype B. More importantly, all gag inhibitors had at least one polymorphic binding position irrespective of subtype. We also found levels of gag intra- and inter-subtype diversity (9.04% and 17.0%) that exceeded diversity estimates of the key viral enzymes (< 7% and < 11%) targeted by standard HIV-1 treatment [18]. However, capsid, the most conserved gag protein, has the same level of intra-subtype diversity as integrase (~5%) [18], favoring it as a conserved drug target.

The capsid protein, targeted by most candidate inhibitors, accounted for 67.7% of conserved gag positions and contained 72.1% of the 136 binding positions previously reported. Our sequence analysis identified two conserved capsid regions (Figure 2) located at the interaction interfaces between N-terminal domains (NTD-NTD) as well as between N-terminal and C-terminal domains (NTD-CTD) (Figure 5). These interaction interfaces, crucial for the assembly and stabilization of pentamer and hexamer lattices [19], provide potential conserved drug targets. To identify the most suitable drug target, we described 4 crystallized drug binding pockets in capsid (Figure 4, Additional file 2: Figure S4). Inhibitors that target pockets 1–3 have shown promising antiviral activity against capsid multimerization in different subtype strains by altering NTD-CTD interaction (pockets 1 and 3) or NTD-NTD interaction (pocket 2) [15, 20, 21]. Pocket 4 is less conserved and its polymorphic residues make direct contact with inhibitors, hindering the development of inhibitors that target this pocket [13].

Figure 5

Visualization of conserved regions in capsid and nucleocapsid. The capsid hexamer structure (PDB: 3H4E) is shown in top (A) and side (B) views, with the 6 capsid units (pink, blue), conserved NTD-NTD interaction domains (yellow) and conserved NTD-CTD interaction domains (red). Figure (C) shows the structural complex of nucleocapsid and RNA (left, PDB: 1A1T) and the structural complex of nucleocapsid and inhibitor CAA (right, PDB: 2M3Z). The first zinc-finger domain (nucleocapsid positions: 14–29, gag positions: 389–404) and the second zinc-finger domain (nucleocapsid positions: 35–50, gag positions: 410–425) are colored red and orange, respectively. Figures S5 and S6 in Additional file 2 provide detailed structures of conserved gag regions.

Another potential drug target is the nucleocapsid protein, containing two critical zinc-finger domains for binding with viral RNA genomes [2]. Our conservation analysis mapped the conserved nucleocapsid regions to zinc-finger domains (Figures 2 and 5) and confirmed previous findings of absolute conservation of CCHC motifs at zinc-coordinating positions [22]. However, we detected considerable variation at other positions, which may alter drug binding and affect antiviral activity. Furthermore, nucleocapsid inhibitors tend to suffer from limited specificity and high toxicity due to the ubiquitous presence of zinc finger domains in many human proteins [4].

Matrix inhibitors with broad spectrum antiviral activities were recently reported, but mutations at drug binding positions significantly reduced their effectiveness [23, 24]. We also observed many natural variants at their drug binding sites (Additional file 1: Table S2), suggesting that further optimization of matrix inhibitors is needed.

Studies that analyzed genetic variability and drug binding site heterogeneity in gag using large-scale sequence populations are lacking. Previously, small subtype B sequence datasets were used to characterize gag conservation (n = 125) [25] or positive selective pressure (n = 635) [26]. Polymorphisms at drug binding sites of capsid inhibitor PF-3450074 [9] and conservation of nucleocapsid zinc-finger domains [22] were also reported using fewer than 200 sequences. The only large-scale analysis that we found [27] quantified the drug binding site conservation of a single matrix inhibitor and lacked information on subtype-specific variations. By contrast, we presented here a large-scale and integrative analysis using 10862 full-length gag sequences, 136 gag inhibitor drug binding positions and 14 PDB structures. Natural polymorphisms of full-length gag were detected across 8 major HIV-1 subtypes and a robust estimation of functional conservation was performed using CI analysis, which incorporated biochemical similarities between amino acids (Additional file 3). This sequence analysis predicted three conserved drug targets in gag (Figure 2) which were confirmed by existing structural knowledge (Figure 5).

This study is limited in that it neither addressed how to optimize known gag inhibitors nor quantified the impact of newly identified polymorphisms on antiviral activities of investigated inhibitors. We collected all available PDBs of gag-inhibitor structures from the RCSB protein data bank, but more crystallized complexes are needed to reveal novel mechanisms of action. Moreover, the limited number of available gag sequences for subtypes F1, G and CRF02_AG (n < 100) may have affected the identification of polymorphic positions, but consistent conservation patterns were observed in gag regardless of HIV-1 subtype (Figure 2). While we attempted to be as comprehensive as possible, additional inhibitors may have been reported. Conservation of their binding positions can nevertheless be deduced from our full-length gag analysis. Future studies are also needed to address whether interactions between gag and protease can affect gag drug binding sites, leading to compromised drug activities of gag inhibitors [28].

In conclusion, our study presented a comprehensive mapping of functional conservation in gag and strengthened the idea of capsid as a potential target for HIV-1 therapeutics. Increased knowledge of HIV-1 natural diversity in drug binding pockets contributes to the rational design of gag inhibitors, although designing gag inhibitors whose drug binding sites are conserved across HIV-1 subtypes remains a challenge.

Methods

We retrieved 12543 gag sequences spanning all 1500 base pairs from the HIV Los Alamos database (http://www.hiv.lanl.gov). Sequences were aligned against the HXB2 reference and manually curated using Seaview 4.3 [29]. Hypermutated sequences were detected using the Los Alamos hypermut tool [30]. HIV-1 subtype was determined by the Rega [31] and COMET subtyping tools (http://comet.retrovirology.lu/). Sequence quality was ensured by excluding duplicates and sequences with internal stop codons, hypermutations, more than 1% ambiguous nucleotides, discordant subtype classification or an identical combination of patient code, sampling year and country. The analysis was restricted to the major subtypes and circulating recombinant forms (CRFs) characterizing the global HIV-1 subtype distribution [6]. For each individual subtype, amino acids that differed from the corresponding consensus AA and had a prevalence ≥ 0.5% were defined as polymorphisms [18]. PDB data of protein-inhibitor complexes were collected from the RCSB Protein Data Bank [32] and are summarized in Additional file 1. The AA sequences in each PDB were aligned against the HXB2 reference. Drug binding pockets were defined as the protein positions for which the minimum Euclidean distance between atoms of the inhibitor and non-hydrogen atoms of the residue was less than 5 Å [33]. Information on known gag candidate inhibitors and binding positions was retrieved from more than 50 publications and is summarized in Additional file 1.
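
The two rule-based definitions in this paragraph (polymorphism calling at a 0.5% prevalence threshold and binding pockets at a 5 Å distance cutoff) can be illustrated with a minimal Python sketch. It is not the Matlab toolbox of Additional file 4. The function names, the input data layout and the example coordinates are hypothetical, and gaps or ambiguous residues are assumed to have been removed beforehand.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def call_polymorphisms(column, min_prevalence=0.005):
    """Consensus AA and polymorphisms for one alignment column of ONE subtype.

    `column` is an iterable of single-letter amino acids (hypothetical input
    layout). An AA is called a polymorphism when it differs from the consensus
    and its prevalence is >= min_prevalence (0.5% in the text).
    """
    counts = Counter(column)
    total = sum(counts.values())
    consensus = counts.most_common(1)[0][0]
    polymorphisms = {
        aa: n / total
        for aa, n in counts.items()
        if aa != consensus and n / total >= min_prevalence
    }
    return consensus, polymorphisms


def pocket_positions(residue_atoms, ligand_atoms, cutoff=5.0):
    """Positions with a non-hydrogen atom closer than `cutoff` (in Å) to the ligand.

    `residue_atoms` maps a position to a list of (element, (x, y, z)) tuples and
    `ligand_atoms` is a list of (x, y, z) tuples; both layouts are assumptions
    made for this sketch, not a PDB parser.
    """
    pocket = set()
    for position, atoms in residue_atoms.items():
        heavy = [xyz for element, xyz in atoms if element != "H"]
        if any(dist(a, b) < cutoff for a in heavy for b in ligand_atoms):
            pocket.add(position)
    return sorted(pocket)


# Toy column: V is the consensus, A (1.5%) passes the threshold, I (0.4%) does not.
consensus, polys = call_polymorphisms("V" * 981 + "A" * 15 + "I" * 4)

# Toy coordinates: residue 107 has a carbon 4.5 Å from the ligand atom, whereas
# residue 200 only comes close through a hydrogen, which is ignored.
toy_residues = {
    107: [("C", (0.0, 0.0, 0.0)), ("H", (1.0, 0.0, 0.0))],
    200: [("H", (4.0, 0.0, 0.0)), ("C", (20.0, 0.0, 0.0))],
}
pocket = pocket_positions(toy_residues, [(4.5, 0.0, 0.0)])
```

In this sketch, a tie for the consensus residue is broken by first occurrence in the column, which is an arbitrary choice.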

To quantify the degree of positional conservation, a conservation index (CI) was calculated for each position by averaging pairwise scores between all AAs using the BLOSUM62 substitution matrix. Adapted from Karlin and Brocchieri [34], the CI of position x is calculated as:

$$\mathrm{CI}_x = 1 - \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{S(x_i, x_j)}{\sqrt{S(x_i, x_i)\, S(x_j, x_j)}}$$

where $x_i$ is the amino acid at position x in the ith sequence of the multiple sequence alignment (MSA), N is the number of sequences in the MSA and $S(x_i, x_j)$ is the BLOSUM62 substitution score between amino acids $x_i$ and $x_j$. Because the denominator must not be zero, a linear transformation was applied to $S(x_i, x_j)$ by adding |min(S)| + 1, that is, the absolute value of the minimum score plus one. CI values range between 0 and 1, with a CI value of 0 indicating that AA variation was absent at that position. A position was considered highly conserved if its CI was below 0.01 in each HIV-1 subtype, a cutoff which corresponds approximately to a cumulative polymorphism prevalence below 1% (Additional file 3). The Mann–Whitney U test was performed to compare CI distributions. Performance of the CI method is evaluated in Additional file 3 and our Matlab toolbox for sequence analysis is available in Additional file 4.
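
The conservation index described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Matlab implementation. Pairwise scores are normalized by the square root of the product of the two self-scores, in the manner of Karlin and Brocchieri. Only a four-residue excerpt of BLOSUM62 is embedded to keep the example self-contained, and the value −4 is used as the minimum of the full matrix. Pairs are grouped by residue counts instead of looping over all sequence pairs, which gives the same double sum. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Excerpt of BLOSUM62 (symmetric); a real analysis would load the full matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}
BLOSUM62_MIN = -4  # lowest score in the full BLOSUM62 matrix


def shifted_score(a, b, scores=BLOSUM62_EXCERPT, min_score=BLOSUM62_MIN):
    """Substitution score shifted by |min(S)| + 1 so that every score is positive."""
    raw = scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return raw + abs(min_score) + 1


def conservation_index(column):
    """CI of one alignment column: 1 minus the mean normalized pairwise score."""
    counts = Counter(column)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two sequences are required")
    # Pairs of identical residues each contribute a normalized score of exactly 1.
    total = sum(c * (c - 1) / 2 for c in counts.values())
    # Pairs of different residues a, b occur counts[a] * counts[b] times.
    for a, b in combinations(counts, 2):
        norm = shifted_score(a, b) / sqrt(shifted_score(a, a) * shifted_score(b, b))
        total += counts[a] * counts[b] * norm
    return 1 - 2 * total / (n * (n - 1))


ci_invariant = conservation_index("V" * 100)      # no variation at all
ci_rare_ile = conservation_index("V" * 99 + "I")  # one conservative V-to-I change
highly_conserved = ci_rare_ile < 0.01             # cutoff quoted in the text
```

With these excerpted scores, an invariant column yields a CI of 0 and a column with a single V-to-I substitution among 100 sequences stays below the 0.01 cutoff. A less similar substitution at the same prevalence would give a larger CI, because the index weights variation by biochemical similarity.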