Defective HIV proviruses are produced in large quantities during natural infection due to mutations introduced during the error-prone process of HIV reverse transcription and APOBEC-induced hypermutation [1, 2]. In untreated patients, this process is counterbalanced by the unhindered production of new intact proviruses by virus replication, but in patients on suppressive antiretroviral therapy (ART), defective proviruses accumulate to very high levels [3]. Bruner et al. [4] showed that even in patients who started ART during early infection, 93% of all proviruses were defective, and if HIV replication was blocked by ART during chronic infection, this percentage of defective HIV genomes reached 98%. Similar percentages of defective proviruses have been reported by other groups [5, 6]. It is thought that ART selects for defective proviruses due to continuous cytotoxic T cell (CTL)-mediated surveillance for cells that produce foreign viral antigens, which in ART-treated patients is not counterbalanced by virus replication [7, 8]. CTL pressure does decrease after initiation of ART due to decreased antigen exposure, but does not disappear completely [9].

Although defective HIV proviruses are considered clinically irrelevant by many, they do frustrate the accurate measurement of the clinically relevant reservoir of intact HIV genomes that forms a major barrier to curing infected individuals. Furthermore, defective proviruses can be expressed and recognized by the host immune system, which may “distract” CTLs from eliminating the latent reservoir [7, 8, 10] and contribute to the increased levels of immune activation and inflammation on ART [11, 12]. It is therefore important to analyse the pool of defective HIV genomes in greater detail [13].

The structure of most of these defective HIV genomes is consistent with the requirement for little or no viral protein expression, as the HIV open reading frames acquire inactivating mutations, either by means of large deletions or hypermutation (Fig. 1). A detailed molecular analysis confirmed the protein expression defect for these proviruses [3]. However, a significant subclass of defective HIV genomes is explained less easily. This distinct MSD-Ѱ subclass carries a relatively small deletion in the non-coding part of the HIV genome between the LTR promoter and the first Gag open reading frame (Fig. 1). This region encodes the 5ʹ-untranslated region (5ʹ-UTR) of the HIV RNA genome and contains many post-transcriptional replication signals, including the major splice donor (MSD) that is used in the generation of all spliced HIV transcripts and the packaging signal Ѱ that ensures the selective encapsidation of HIV RNA in assembling virion particles [14]. The reported size of this MSD-Ѱ class of defective HIV-1 proviruses varies somewhat between studies, ranging from 5% and 6.5% in early studies [3, 4] to 11% in a recent study using a novel provirus sequencing assay [5].

Fig. 1

Schematic of the mutations observed in defective HIV proviruses. The HIV genome is depicted on top, with underneath the large deletions and hypermutations (X, nucleotide substitution) that are found in 77–90% of defective proviruses in patients receiving therapy [3, 5, 13]. The bottom part represents a blow-up of the untranslated leader region of the HIV genome (RNA coordinates +182/+338) that is affected in the MSD-Ѱ class of defective proviruses. The positions of several replication signals are marked (PBS, primer binding site; DIS, dimerization initiation signal; AUG-Gag, the first start codon that is used for Gag translation). The deletions and mutations reported by Ho et al. [3] are schematically depicted, showing clustering around the MSD (shadowed).

The persistence of the MSD-Ѱ mutated proviruses during ART suggests an inability to produce viral proteins, but no explanation for such a production defect has yet been presented. In fact, MSD inactivation was shown to induce alternative RNA splicing events that can give rise to the expression of viral proteins, e.g. Tat and Rev, or aberrant proteins [7]. Although the level of gene expression can be reduced for these HIV genomes, e.g. due to reduced Tat levels, the corresponding host cells will still be recognized and cleared by CTLs. The MSD-Ѱ class has therefore not been fully understood thus far. Based on extensive findings in the HIV molecular biology literature that have so far been overlooked in this context, we report an attractive, yet simple explanation for the protein production defect of MSD-Ѱ mutated HIV genomes.

We started with a sequence alignment of the previously reported MSD-Ѱ mutants to identify the critical motifs that were consistently affected. For instance, Fig. 1 shows the deletions reported in the study by Ho et al. [3]. All deletions include the MSD motif, whereas Ѱ sequences frequently remain present, arguing for a functional role of the MSD motif that controls HIV-1 RNA splicing. In addition, proviruses with point mutations in the MSD region were also reported, e.g. affecting the critical intronic GU dinucleotide of the splice donor site (UG-GU mutated to UG-GG) [3]. Importantly, it was demonstrated that such a point mutant can exhibit a severe replication defect in reconstructed viruses [3].
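The classification step described above (MSD deleted, point-mutated, or intact) can be illustrated with a minimal Python sketch. It is not the alignment procedure used for the published data sets. The function name `msd_status`, the toy sequences, and the reduction of the splice donor to the four nucleotides UG-GU mentioned in the text are all assumptions made for illustration; real donor recognition by U1 snRNP involves a longer sequence context.

```python
from typing import Optional

# Assumption: the donor is modelled only as exonic "UG" followed by intronic
# "GU", the dinucleotides named in the text. This is a deliberate
# simplification of splice donor recognition.
EXON_END = "UG"
INTRON_START = "GU"


def msd_status(rna: str, donor_index: Optional[int]) -> str:
    """Classify the major splice donor (MSD) of one leader sequence.

    rna          RNA sequence (A/C/G/U) of the leader region.
    donor_index  0-based index of the first intronic nucleotide in `rna`,
                 or None when the alignment shows that the site is deleted.
    Returns "deleted", "mutated" or "intact".
    """
    if donor_index is None or donor_index < 2:
        return "deleted"
    exon = rna[donor_index - 2:donor_index]
    intron = rna[donor_index:donor_index + 2]
    if len(intron) < 2:
        # Sequence ends before the intronic dinucleotide is complete.
        return "deleted"
    if exon == EXON_END and intron == INTRON_START:
        return "intact"
    return "mutated"


# Hypothetical toy sequences, not real HIV-1 leader sequences.
wild_type = "ACUGGUGAGU"      # ...UG|GU...
point_mutant = "ACUGGGGAGU"   # ...UG|GG..., the GU-to-GG change from the text

print(msd_status(wild_type, 4))
print(msd_status(point_mutant, 4))
print(msd_status("ACAGU", None))
```

In this sketch the position of the donor is supplied by the caller, because locating it in a patient-derived sequence requires an alignment to a reference genome that is outside the scope of the example.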

The literature on HIV molecular biology does provide clues on the MSD-Ѱ mystery. Previous work indicated that the process of HIV RNA polyadenylation is highly regulated. The biological challenge is that the viral RNA genome encodes two identical polyadenylation (polyA or pA) signals as part of the 5ʹ and 3ʹ R (repeat) regions near the 5ʹ and 3ʹ ends of HIV RNA. Therefore, regulation is of key importance to suppress the 5ʹ pA site and/or to selectively activate the 3ʹ pA site (Fig. 2a). Work from several groups proposed multiple-layer regulatory mechanisms to achieve negligible 5ʹ pA activity and full 3ʹ pA activity. An early model indicated that the 5ʹ pA site is not frequently used because of its close proximity to the promoter, suggesting that the transcriptional complex needs to mature to become sensitive to pA signals [15]. We demonstrated that both sites are partially suppressed by being part of a local hairpin that reduces binding of cleavage polyadenylation specificity factor (CPSF) [16, 17]. Complete inactivity of the 5ʹ pA site was demonstrated to be linked to the MSD site positioned about 200 nucleotides downstream [18]. Importantly, these results were obtained with HIV proviral constructs, which emphasizes the physiological importance of this MSD-pA interaction. Efficient interaction of U1 snRNP with the MSD was reported to be critical for complete inactivation of the 5ʹ pA site [19] and follow-up work indicated an important role for stem-loop 1 of the U1 snRNP [20]. Novel mutational approaches recently confirmed the importance of the MSD region for HIV gene expression [21] and the role of the MSD in regulated polyadenylation [22]. No MSD is present downstream of the 3ʹ pA site, thus avoiding its inactivation. To complete the regulatory mechanism, the 3ʹ pA site is also partially suppressed by local RNA structure but able to gain full activity due to an upstream enhancer (USE) element that is uniquely present upstream of this site. This enhancer was shown to act as CPSF entry site for the structurally obstructed 3ʹ pA site [23, 24]. Figure 2a illustrates this complex regulatory mechanism, which seems unique for HIV among the Retroviridae.

Fig. 2

Model for 5ʹ pA site activation in the HIV genome by MSD-inactivation. a Cartoon of the proposed model for pA site regulation in the HIV RNA genome: suppression of the 5ʹ pA site by the downstream MSD and activation of the 3ʹ pA site by the upstream USE. See the text for further details. The pA hairpins and the upstream TAR hairpins are shown. The pA hairpin structure suppresses both 5ʹ and 3ʹ polyadenylation and allows the MSD/USE control. The 3ʹ TAR hairpin juxtaposes the USE and the 3ʹ pA site, which may enhance USE-mediated activation of polyadenylation [14]. The black triangles indicate the position of the AAUAAA polyadenylation signal. The grey arrow represents the actual site of polyadenylation at position 97 (5ʹ copy) or 9229 (3ʹ copy). b Illustrated are the HIV transcripts expected for wild-type MSD+ viruses (full-length unspliced and spliced versions; SA is one of the many splice acceptors in the HIV genome) and mutant MSD− proviruses (only short TAR transcripts). (A)n is the polyA tail.

With this mechanistic background, it can easily be understood that inactivation of the MSD by mutation or deletion can trigger an effective shutdown of HIV transcription through activation of the 5ʹ pA site. It follows that a small characteristic HIV transcript of 97 nucleotides plus polyA tail will be synthesized in the cells that carry an MSD-mutated provirus, as illustrated in Fig. 2b. This non-coding HIV transcript that encompasses the TAR motif was indeed reported by the Proudfoot laboratory back in 1995 [18]. This short TAR transcript is polyadenylated at the 5ʹ pA site and was confirmed in studies on the regulatory role of the polyA hairpin structure [25, 26, 27]. As this short transcript encodes the complete TAR element, it may be processed into the TAR miRNA, the precise role of which has not yet been determined [28, 29].
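The logic of the model can be written down as a small decision rule, sketched here in Python under simplifying assumptions. The cleavage positions 97 and 9229 are taken from the Fig. 2 legend. The names `Transcript` and `predicted_transcript`, and the all-or-nothing treatment of 5ʹ pA site suppression, are assumptions made for illustration; the sketch ignores the partial effects of the pA hairpin and the USE and does not represent measured expression levels.

```python
from dataclasses import dataclass

# Cleavage positions in RNA coordinates, as given in the Fig. 2 legend.
FIVE_PRIME_PA_POSITION = 97
THREE_PRIME_PA_POSITION = 9229


@dataclass(frozen=True)
class Transcript:
    pa_site: str             # "5' pA" or "3' pA"
    cleavage_position: int   # last templated nucleotide before the polyA tail
    can_encode_protein: bool


def predicted_transcript(msd_intact: bool) -> Transcript:
    """Predict the primary transcript under the MSD-pA model.

    Assumption: suppression of the 5' pA site is treated as binary. With an
    intact MSD the 5' pA site is silent and transcription continues to the
    3' pA site. With an inactivated MSD the 5' pA site is used, and only the
    short non-coding TAR transcript is made.
    """
    if msd_intact:
        return Transcript("3' pA", THREE_PRIME_PA_POSITION, True)
    return Transcript("5' pA", FIVE_PRIME_PA_POSITION, False)


for intact in (True, False):
    t = predicted_transcript(intact)
    label = "MSD+" if intact else "MSD-"
    print(label, t.pa_site, t.cleavage_position, t.can_encode_protein)
```

The point of the rule is that the MSD-inactivated branch terminates upstream of every open reading frame, so no alternative splicing event downstream can restore protein expression in this model.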

Although removal of all HIV coding capacity by a large internal deletion is also very effective in preventing HIV gene expression, MSD inactivation is arguably the most elegant way to produce non-expressing proviruses, the host cells of which will survive under massive CTL pressure. This 5ʹ pA activation model seems much more relevant to explain the loss of HIV protein production than the proposed model of alternative usage of splice sites, which at best could reduce and not interrupt HIV protein expression. Although the inactivation of HIV splice sites can indeed trigger the usage of new splice sites [30,31,32,33], this does not prevent protein translation and consequently CTL recognition.

Consistent with the mechanistic model presented in Fig. 2a, short TAR-containing HIV transcripts are produced in treated patients at a level at least 10-fold higher than extended HIV transcripts [34, 35]. These authors did assume that short TAR-containing transcripts represent abortive transcripts. Our model predicts that short transcripts that are polyadenylated at the 5ʹ-pA site may significantly contribute to this small RNA pool.

One could argue that the same end result, that is activation of the 5ʹ-pA site, could be achieved by weakening or opening of the local polyA hairpin structure that suppresses CPSF binding [17, 25]. However, this would require surgical precision for the provirus mutation as the sequence elements that control the polyadenylation process should not be affected. These include the canonical AAUAAA signal and the actual cleavage site that are embedded in the polyA hairpin [24, 36]. This may explain why 5ʹ pA-activation by hairpin destabilization is not observed, at least not frequently.

The 5ʹ pA activation model applies not only to the relatively minor MSD-Ѱ class of defective HIV proviruses, but also to those members of the two major classes of defective proviruses, with large deletions or hypermutated genomes, in which the MSD is destroyed. The latter two classes were supposed to be defective by inactivation of one or multiple open reading frames, but 5ʹ pA activation provides a dominant mechanism to abort any viral protein expression. This new mechanism may therefore also be very relevant for scenarios dealing with the relevance of ongoing viral protein expression [6, 7].

The generation of variant HIV genomes is the result of two independent processes: mutation and subsequent selection of the most fit virus. In this case, host cells carrying an HIV provirus with a protein production defect will survive preferentially under the intense CTL pressure that has built up in infected individuals during months or years of unsuppressed virus replication. The mechanistic MSD-pA scenario that we propose suggests that the cells with proviruses carrying MSD-inactivating mutations are selected because of their non-protein-expressing phenotype. Although not likely to be of decisive influence, the presence of hotspots of viral recombination may also influence the type of MSD deletions that occur (Fig. 1). In particular, this MSD-Ѱ part of the viral RNA genome is highly structured and can cause the viral reverse transcriptase to pause [37], which can induce recombination and MSD deletion. In any case, the subsequent selection of cells that do not express viral proteins is the key event.

A complete understanding of the pool of defective HIV proviruses remains of critical importance for accurate measurement of the latent virus reservoir. There may be multiple ways to inactivate HIV and we here describe that—besides prominent deletions and hypermutations—more subtle changes like MSD mutations can also destroy HIV expression.