Background

HIV replication requires integration of a cDNA copy of the viral RNA genome into cellular chromosomes, followed by transcription and splicing to yield viral mRNA. Alternative splicing allows the small 9.1 kb HIV genome to generate at least 108 mRNA transcripts encoding at least 9 proteins and polyproteins [1–6]. During replication, HIV also reprograms cellular transcription and splicing. For example, the virus-encoded Vpr protein arrests the cell cycle [7–10] and the viral Tat protein binds to P-TEFb and alters transcription at the HIV promoter and some cellular promoters as well [11–16].

Changes in host cell gene expression have been reported during HIV infection [17–29] and differences in expression have been observed associated with the stage [30] and progression [31] of disease. Multiple studies suggest that cells detect HIV infection, in part through the recognition of cytoplasmic DNA in abortive infections [32–34], and respond by inducing interferon-regulated, apoptotic and stress response pathways [18–22, 25–28]. Several studies have also suggested that HIV infection disrupts normal cellular splicing pathways [28, 35]. However, results have varied with many experimental parameters, including target cell type, HIV isolate and the duration of infection. Many previously published studies have focused on infections with lab-adapted HIV strains in transformed cell lines [17, 18, 24, 25, 28, 29, 36], and so results may not be fully reflective of infections in patients.

HIV infection also appears to induce the expression of human endogenous retroviruses (HERVs) [37], particularly HERV-K [38–42], and retrotransposons [43]. Immune responses to HERV proteins appear stronger in HIV-infected individuals, suggesting candidate markers of infection and possible vaccine targets [44–47]. In contrast, two recent RNA-Seq studies of expression during HIV infection did not report increases in HERV RNA [24, 25]. The origin of this discrepancy is unclear.

The suggestion that HIV integration may disrupt cellular cancer-associated genes and thereby promote cell proliferation [48–51] has focused attention on the range of novel message types formed when HIV integrates within transcription units [52–56]. Chimeric reads containing HIV and cellular sequence are also of interest due to the potential of lentiviral vectors to trigger oncogenesis in gene therapy patients through insertional mutagenesis [57–60]. A better understanding of chimera formation would help clarify this phenomenon in both HIV infections and lentiviral vector-based gene therapies.

In this study, we sought to generate data more representative of HIV replication in patients by using Illumina sequencing to analyze transcriptional responses after infection of primary T cells with HIV89.6, a low passage patient isolate [61]. This represents a continuation of a long term effort to understand HIV-host cell interactions at the transcriptional level that began with analysis of transcription by HIV89.6 in primary T cells using Pacific Biosciences long read single molecule sequencing [6]. Our strategy here was to analyze a single time after infection in depth with over one billion sequence reads from HIV89.6-infected and uninfected host cells. These data were then combined with 147,281 unique integration site sequences from the same infections and the Pacific Biosciences data on HIV89.6 transcription to (1) elucidate effects of HIV infection on host cell mRNA abundances and splicing, (2) characterize viral message structure in detail and (3) probe the nature of the chimeras formed between host cell and viral RNAs.

Results

Infections studied

Primary CD4+ T cells from a single human donor were infected with HIV89.6, a clade B primary clinical isolate [61], in three replicates. For comparison, two additional replicates from the same donor were mock infected. Samples were harvested 48 h after viral inoculation, which allowed for widespread infection in the primary T cell cultures, though some cells may have been infected secondarily by viruses produced in the first round. Thus cultures probably were not tightly synchronized but did have extensive representation of infected primary T cells. From these samples, we obtained 1,161,705,678 101 bp reads; 1,021,207,853 were mapped to the human genome and 24,783,844 to the HIV89.6 provirus (Table 1). Below we first discuss the influence of infection on cellular gene activity and RNA splicing, then analyze HIV RNAs and lastly identify chimeras formed between HIV and cellular RNAs.

Table 1 Samples and sequencing coverage

Changes in gene activity in primary T cells upon infection with HIV89.6

We observed significant expression changes in 3,142 genes (false discovery rate of \(q<0.01\)), which is 17.1 % of expressed cellular genes (Additional file 1). The genes with the most extreme increases during HIV infection, all \(>6\)-fold higher, included IFI44L, RSAD2, HMOX1, MX1, USP18, IGJ, OAS1, CMPK2, DDX60, IFI44, IFI6, IFNG and CCL3. All of these have been reported to be involved in innate immunity [62] or are interferon-inducible [63], highlighting a strong innate immune response in the cells studied. Genes with the largest decreases, all \(>3\)-fold lower, were GNG4, GPA33, IL6R, CCR8, RORC, AFF2 and CCR2.

Many Gene Ontology [64] categories were significantly enriched for differentially expressed genes (Additional file 2). Notably upregulated with infection were genes involved in apoptosis, immune responses and cytokine production (all \(q<10^{-4}\)) and downregulated were ribosomal protein genes and related pathways (\(q<10^{-15}\)). These changes suggest that the cells responded to HIV infection with the induction of inflammatory, interferon-regulated and apoptotic responses, patterns posited from several previous studies [18–22, 24–27, 29, 36, 65]. Expression significantly increased for several genes that are characteristic of other hematopoietic lineages, e.g. hemoglobin \(\beta\), CD8, CD20 and CD117, while several CD4+ T cell specific genes, e.g. CD4 and CD3, were downregulated, potentially consistent with de-differentiation of infected or bystander cells. We return to this point in the discussion.

Comparison of transcriptional profiles from HIV89.6 infection of primary T cells to data on HIV infection in other cell types

We sought to identify the transcriptional responses that were most conserved upon HIV infection and so collected and analyzed data from four other studies of transcription in HIV-infected cells (Additional file 3). These included two studies of infection of the SupT1 cell line [24, 25], a study of ex vivo infection of primary CD4+ T cells [26] and a study of lymphatic tissue biopsies from acutely viremic patients [30]. Genes were scored as increased or decreased in activity in infected cell populations, and the amount of agreement was compared among the different studies.

No gene was called as differentially expressed in all five studies. Eight genes were differentially expressed in the same direction in 4 out of 5 studies; AQP3 and EPHX2 were downregulated with HIV infection and CD70, EGR1, FOS, ISG20, RGS16 and SAMD9L were upregulated. A full listing is provided in Additional file 4. Several of the upregulated genes are known to be interferon-inducible, again emphasizing the role of innate immune pathways.

For each pair of studies, we compared whether they agreed on the identities of differentially expressed genes and whether they agreed on the direction of change (Fig. 1). The responses to infection in primary cells showed notable differences from responses in the SupT1 cell line. The two SupT1 studies were significantly similar to each other (odds ratio: 1090, 95 % confidence interval (CI) 232–16,400, Fisher’s exact test \(p<10^{-15}\) for direction of change in differentially expressed genes) but were not significantly associated or were negatively associated with data from ex vivo primary cells and from lymphatic tissue from acutely infected HIV patients. In contrast, our data were significantly associated with the primary cell (odds ratio: 75.7, 95 % CI 16.9–701, \(p<10^{-15}\)) and lymphatic tissue data (odds ratio: 6.49, 95 % CI 1.52–24.9, \(p=0.003\)). This documents significant differences in responses to HIV infection between infected primary cells and SupT1 cells and suggests that results of infections in primary cells more closely align with actual acute HIV infections in patients. SupT1 cells might be expected to respond to infection differently than primary cells since they have several nonsynonymous mutations in innate immunity genes [66], have blocks in immune signaling pathways [67] and fail to activate many interferon-stimulated genes during HIV infection [27].
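The pairwise test on direction of change can be illustrated with a minimal sketch. The 2×2 counts below are invented for illustration and are not taken from any of the studies, and the function name is hypothetical. It computes the sample odds ratio and a two-sided Fisher's exact p value using only the Python standard library; the confidence intervals quoted above would require a conditional estimate that this sketch does not attempt.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    a: up in both studies, b: up in study 1 but down in study 2,
    c: down in study 1 but up in study 2, d: down in both.
    Returns (sample odds ratio, p value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x genes in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables no more probable than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(p, 1.0)

# Hypothetical counts of genes called differentially expressed in both studies
odds_ratio, p_value = fisher_exact_2x2(40, 5, 4, 30)
```

An odds ratio well above 1 with a small p value would indicate that two studies tend to agree on whether a shared gene goes up or down.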

Fig. 1

Comparisons among studies quantifying cellular gene expression after HIV infection. For each pair of studies, the association between up- and downregulation calls was measured for genes identified by both studies as differentially expressed (above the diagonal). As another comparison, we also measured the agreement between studies for which genes were called differentially expressed regardless of direction (below the diagonal). The color scale shows the conservative (i.e. closest to 1) boundary of the confidence interval of the odds ratio with blue indicating a positive association and red a negative association between studies. For confidence intervals overlapping 1, the value was set to 1. Therefore all colored squares indicate significant associations

Comparison of the HIV-infected cell transcriptional profile to additional experimental T cell profiles

To investigate the transcriptional changes in more depth, we compared the results of the five studies of HIV infection to transcriptional profiles comparing immune cell subsets available at the Molecular Signatures Database (MSigDB) [68]. The MSigDB reports genes that are increased or decreased in relative expression for 185 pairs of transcriptional profiles involving CD4+ T cells. We compared the lists of affected genes in each pair to genes altered in activity by HIV infection. Those pairs of studies with the most significant associations with HIV89.6 data are shown in Fig. 2a. For comparison, the associations with the four other HIV transcriptional profiling studies mentioned above are shown as well.

Fig. 2

Comparisons of the effect of HIV infection on cellular gene expression to additional studies comparing transcription in subsets of immune cells. The MSigDB database was used to extract 185 sets of differentially expressed genes from pairs of transcriptional profiling studies of immune cell subsets involving CD4+ T cells. For each pair of studies, we used Fisher’s exact test to measure the association between up- and downregulation calls for genes identified as differentially expressed in both our HIV study and the comparator immune subsets. a The transcriptional profiles with strongest associations with changes observed in our study of HIV89.6 infection of primary T cells. Blue indicates a positive association between changes seen in HIV-infected cells and the first immune subset (text colored blue) while red indicates a positive association with the second immune subset (text colored red). The color scale shows the conservative (i.e. closest to 1) boundary of the confidence interval of the odds ratio. For confidence intervals overlapping 1, the value was set to 1. Therefore all colored squares indicate significant associations. b As in a, but showing the transcriptional profiles most strongly associated with changes observed in lymph node biopsies from acutely infected patients [30]

The most significant associations for our data showed gene expression in HIV89.6-infected cells moving away from typical T cell expression patterns and towards patterns more similar to B cells, myeloid cells and bulk peripheral blood mononuclear cells (all Fisher’s exact test \(p<10^{-15}\)) (Fig. 2a). These changes were also seen, although to a lesser extent, in the Imbeault et al. [69] study which also used primary CD4+ T cells.

For comparison, we also extracted those profiles most strongly associated with the transcriptional data on lymphatic tissue of HIV patients [30]. The profiles showed patterns similar to strongly stimulated T cells, autoimmune disease and to the Th1 T cell subset (all \(p<0.01\)) (Fig. 2b). Our data in primary CD4+ T cells paralleled the changes seen in lymphatic tissue. These transcriptional changes again highlight the strong immune response generated by HIV infection in primary cells.

Intron retention

Cells respond to infection by shutting down macromolecular synthesis at multiple levels [70–74], so we investigated whether cells also showed perturbations in splicing efficiency after infection. As a probe, we created a database of cellular genomic regions annotated exclusively as exons or introns in all splice forms in the UCSC gene database [75] and quantified expression in these regions in infected and uninfected cells. We found a significant increase in intronic sequences relative to exonic sequence (Wilcoxon test \(p<10^{-15}\)) (Fig. 3a). This increase in intronic sequence was reproducible between replicates in our study (Kendall’s \(\tau =0.42\), \(p<10^{-15}\)) (Fig. 3b). We reanalyzed RNA-Seq data from Chang et al. [25] and also documented intron retention that correlated with the changes seen in our data (Kendall’s \(\tau =0.12\), \(p<10^{-15}\)) (Fig. 3c).

Fig. 3

Changes in the abundance of intronic regions with HIV infection. Expression of intronic and exonic regions was quantified as the proportion of reads mapping within the intron/exon out of the total reads mapping to the transcription units overlapping that intron/exon. a Comparison of the ratios of expression between infected and uninfected replicates in exclusively intronic or exonic regions of transcription units. b Reproducibility of intron retention between replicates. Each point quantifies the change in expression with HIV infection for a specific intronic region. The x-axis shows changes in gene activity accompanying infection for one set of replicates (Infected-1 and Infected-2 vs. Uninfected-1) and the y-axis shows the same data for different replicates (Infected-3 vs. Uninfected-2). c Reproducibility of intron retention between studies. The plot is arranged as in b but with all data from our study combined on the x-axis and corresponding data from Chang et al. [25] on the y-axis
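The quantification described in this legend can be sketched as follows. This is a minimal illustration with invented read counts, not the pipeline used in the study; the function names are hypothetical, and regions with zero reads in either condition are simply reported as undefined here, which is an assumption.

```python
import math

def region_proportion(region_reads, unit_reads):
    """Share of a transcription unit's reads that fall in one region."""
    if unit_reads <= 0:
        raise ValueError("transcription unit has no mapped reads")
    return region_reads / unit_reads

def log2_infection_ratio(infected, uninfected):
    """log2 ratio of region proportions, infected over uninfected.

    Each argument is a (region_reads, unit_reads) pair.
    Returns None when either proportion is zero (ratio undefined).
    """
    p_inf = region_proportion(*infected)
    p_un = region_proportion(*uninfected)
    if p_inf == 0 or p_un == 0:
        return None
    return math.log2(p_inf / p_un)

# Hypothetical counts: an intron gains share after infection,
# an exon from the same transcription unit loses a little.
intron_change = log2_infection_ratio((30, 1000), (10, 1000))
exon_change = log2_infection_ratio((900, 1000), (950, 1000))
```

Normalizing by the reads of the overlapping transcription unit keeps a change in overall gene expression from being mistaken for a change in intron retention.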

A possible artifactual explanation for enrichment of intronic sequences could involve greater DNA contamination in the infected cell samples. That is, if the relative amount of DNA differed between treatments, the amount of apparent intronic sequences could also differ due to sequencing of contaminating DNA. To examine whether DNA contamination was abundant in our samples, we compiled a collection of 27 large gene desert regions, defined here as (1) regions outside the centromere and first and last cytoband, (2) containing less than 1 % unknown sequence, (3) containing no genes annotated in UCSC genes [75], (4) containing no repeats annotated in the RepeatMasker database [76] and (5) spanning more than 100 kb. No reads were mapped to these 41 Mb of gene deserts in any sample, arguing against explanations based on DNA contamination. Thus these data indicate that intron retention was increased in these cell populations upon HIV infection, revealing a previously undisclosed aspect of the host cell transcriptional response to infection.

Ribosomal protein genes were especially enriched for introns with strong increases in expression with HIV infection (odds ratio: 55.5, 95 % CI 36.9–81.5, Benjamini–Hochberg corrected Fisher’s exact test \(q<10^{-15}\) for introns with a Bayesian 95 % credible interval for differential abundance of \(>2\times\) change). Intron retention was not restricted to particular introns but was evident in most introns in affected genes (Additional file 5A). No other Gene Ontology category had a \(q<0.01\) after excluding introns from ribosomal protein genes.

Intron retention has been linked to intronic characteristics such as splice site strength, GC content and intron width across many cell types [77]. To see if a similar pattern existed in our data, we fit a lasso-regularized logistic regression [78] to predict differential expression of an intronic region based on GC content, width and 3′ and 5′ splice site strength [79] of the introns overlapping the region. We also included a term indicating whether the intron was in a ribosomal protein gene and, because HIV has been reported to induce the expression of HERVs and other repetitive elements, a term indicating if the intron contained any repeat annotation in the RepeatMasker database. The resulting model selected only whether the intron contained a repetitive element and whether it was in a ribosomal protein gene and reduced cross-validated mean square error by only 1 %. Thus, it appears that the intron retention induced by HIV infection does not follow the same patterns seen when comparing cell types and that much of the variation in HIV-induced intron retention remains unexplained.

Induction of transcription from HERVs and retrotransposons by HIV89.6 infection

Because some differentially expressed introns appeared associated with repetitive elements, we investigated the expression of HERVs, transposons and other repeated sequences. Figure 4a shows a comparison of the association between changes in expression with HIV89.6 infection and genomic repeat types annotated in the RepeatMasker database [76] over varying levels of differential expression. At high levels of expression change, ERV-9 (odds ratio: 154, 95 % CI 83.1–262, \(p<10^{-15}\) for LTRs with a Bayesian 95 % credible interval for differential abundance \(>4\times\) change) and its long terminal repeat LTR12C (odds ratio: 145, 95 % CI 98.9–210, \(p<10^{-15}\)) are the only repeats highly associated with HIV infection. Looking at genomic repeats with any significant increase during HIV infection, the expression of many recently acquired genomic repeats, including L1HS, LTR5_Hs (a human specific long terminal repeat of HERV-K), AluYa5, AluYg6 and SVA_D and SVA_F, were associated with HIV89.6 infection (Fig. 4b).

Fig. 4

Repeat categories enriched upon infection with HIV. a The association of repeat regions differentially expressed after HIV89.6 infection of primary T cells observed for varying thresholds of differential expression. The threshold used to call a gene differentially expressed based on the Bayesian posterior median was varied and Fisher’s exact test was used to assess whether any genomic repeats had a significant association with this differential expression. Note that only ERV-9 (annotated as HERV9-int in the RepeatMasker database) and its corresponding long terminal repeat ERV-9/LTR12C were significantly associated with large changes in expression. b Enrichment of repeat categories in regions differentially expressed (Bayesian 95 % credible interval \(>2\times\) change) between HIV-infected and control CD4+ T cells. The repeated sequences are ordered on the x-axis by the extent of induction within each class, with circles indicating repeats annotated as hominid specific and squares marking all other repeats; the y-axis shows the p value for upregulation after infection. The dashed line indicates a Bonferroni corrected p value of 0.05. c The proportion of human mapped reads that align within classes of genomic repeats for data from primary CD4+ T cells from this study and SupT1 cells from Chang et al. [25]. A single read mapping multiple times to a given category was only counted once

We saw a relationship between the age of a genomic repeat and its likelihood of being induced by HIV89.6 infection. The most highly enriched repeats were associated with relatively recent hominid-specific repeat classes as annotated by the RepeatMasker database (for repeat classes with \(p<10^{-50}\), odds ratio: 31.6, 95 % CI 8.88–112, \(p=10^{-7}\)). In HERV-K (HML-2), the most recently active endogenous retrovirus in the human genome [80–82], we saw that integrations unique to the human genome [82] were more likely to be differentially expressed than older HERV-Ks (odds ratio: 5.38, 95 % CI 1.93–16.0, \(p=0.0005\)).

Previous RNA-Seq studies of cellular expression during HIV infection in transformed cell lines did not report increases in HERV mRNA [24, 25]. To investigate this difference, we downloaded and analyzed the RNA-Seq data from Chang et al. [25], which quantified gene activity in transformed SupT1 cells infected with a lab-adapted strain of HIV. We found a much higher level of HERV expression in their data in both HIV-infected cells and uninfected controls than in primary cells (Fig. 4c). We suspect that in SupT1 cells, as with many cancerous cells [83–87], the baseline expression of transposons and endogenous retroviruses is higher than in primary cells, masking further induction by HIV infection.

We observed heterogeneous expression among ERV-9/LTR12C sequences and so investigated the primary sequence determinants. We observed that LTR12C has variants with differing numbers of repeated sequences in the U3 region just upstream of the transcription start site (Fig. 5a), an important region for transcription initiation [88]. The U3 region of LTR12C also contains multiple motifs for transcription factors NFY, GATA2 and MZF1 [89]. To clarify factors affecting expression levels, we counted the number of sequences matching these transcription factors’ binding motifs, checked for a TATA box [90] within 50 bp upstream of the transcription start site, assigned each LTR12C to the short or long length class, counted the number of mutations away from the consensus for that length class and checked for integration in a transcription unit. We then applied a logistic regression to test the effects of these variables on LTR12C differential expression. We found that HIV89.6-induced transcription was more likely for LTR12C containing the short length variant of the 3′ U3 region, located within a transcription unit, containing a TATA box motif and containing greater numbers of GATA2 motifs (Fig. 5b).
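The sequence features used as regression inputs can be illustrated with a small sketch. The motif patterns below are simplified consensus sequences assumed for illustration (a GATA core and a TATA[AT]A[AT] box); they are not the motif definitions used in the study, and counting on both strands is likewise an assumption. The example sequence is invented.

```python
import re

# Simplified consensus motifs, assumed here for illustration only
GATA_CORE = "GATA"
TATA_BOX = re.compile(r"TATA[AT]A[AT]")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_motif(seq, motif):
    """Count overlapping matches of a fixed motif on both strands."""
    seq = seq.upper()
    patterns = {motif, reverse_complement(motif)}
    # A lookahead finds overlapping occurrences as well
    return sum(len(re.findall(f"(?={p})", seq)) for p in patterns)

def has_tata_box(seq, tss_index, window=50):
    """True if a TATA box lies within `window` bp upstream of the start site."""
    upstream = seq[max(0, tss_index - window):tss_index].upper()
    return TATA_BOX.search(upstream) is not None

# Invented U3-like fragment with the start site placed at its 3' end
example = "CCGATAAGGTATCTTTATAAAAGGCC"
gata_count = count_motif(example, GATA_CORE)
tata_present = has_tata_box(example, tss_index=len(example))
```

Per-element counts of this kind, together with length class and genomic context, would then form the predictor columns of the logistic regression.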

Fig. 5

Characteristics of ERV-9/LTR12C sequences associated with induction upon infection of primary T cells with HIV89.6. a An alignment of the 3′ end of the U3 region of repeats annotated as ERV-9/LTR12C. Each row is a section of the long terminal repeat sequence and each column a base in that sequence colored by nucleotide identity. For clarity, positions appearing in less than 2 % of sequences are omitted. Two distinct classes are visible with a short form and long forms containing varying numbers of repeated sequence. Mutations away from the consensus can also be seen. b The proportion of LTR12C regions with significant increases in read abundance after infection with HIV and their 95 % confidence intervals separated by the length class of the LTR, presence in a gene, presence of a TATA box and the number of GATA2 motifs in the U3 region. These variables were selected by stepwise regression comparing differential expression of LTR12C to the length class of the LTR, the number of mutations away from consensus, the number of NFY, GATA2 and MZF1 motifs and the presence of a TATA box motif within 50 bp of the transcription start site. Variables are labeled with the estimated change in log odds ratio (\(\beta\)) and their Wald test p values

Transcription extending several hundred kilobases from several ERV-9/LTR12C has recently been reported [91]. In contrast in our data, only 14 LTR12C appeared to have continuous transcription more than 1000 bp downstream of the LTR and the maximum length of continuous transcription was only 9275 bp. Transcription from some of these LTR12C does appear to continue directly into transcription units of cellular genes, suggestive of the potential for regulatory function (Additional file 5B).

HIV mRNA synthesis and splicing

Over 24 million Illumina reads mapped to HIV89.6, yielding an average coverage of over 240,000-fold. Reads mapping to HIV89.6 comprised between 3.4–4.8 % of mapped reads in the infected samples (Table 1). It is unclear whether HIV infection increases or decreases the amount of mRNA in infected cells but if we assume HIV-infected cells contain the same amount of mRNA as uninfected cells and adjust for rates of infection ranging between 21–37.5 % (Table 1), we estimate that HIV transcripts comprise between 13.0–16.2 % of the total polyadenylated mRNA nucleotides in infected cells 48 h after initial infection. This parallels previous estimates of around 10 % [92] at 48 h postinfection, 38 % at 24 h [25] or 30 % after 72 h [18].
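The adjustment for infection rate amounts to a single division, sketched below. Pairing the extremes of the two reported ranges is an assumption made for illustration, since the per-sample values are in Table 1; the function name is hypothetical.

```python
def hiv_fraction_in_infected_cells(hiv_read_fraction, infection_rate):
    """Fraction of mRNA nucleotides that are viral within infected cells.

    Assumes infected and uninfected cells carry the same amount of mRNA
    and that uninfected cells contribute no HIV reads.
    """
    if not 0 < infection_rate <= 1:
        raise ValueError("infection_rate must be in (0, 1]")
    return hiv_read_fraction / infection_rate

# Assumed pairings of the reported extremes (illustration only)
low_infection_sample = hiv_fraction_in_infected_cells(0.034, 0.21)
high_infection_sample = hiv_fraction_in_infected_cells(0.048, 0.375)
```

Under these assumed pairings the estimates fall at roughly 16 % and 13 %, in line with the range quoted above.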

Over 47,257 single reads spanned previously reported HIV splice junctions, allowing a quantitative assessment of donor and acceptor utilization (Fig. 6a). As expected from previous studies [4, 6], the most abundant junctions were D1-A5 and D4-A7. We confirmed the use of unusual splice acceptors A8c and A5a, previously reported in HIV89.6 [6]. In the Illumina sequencing, we saw a higher abundance of D1-A1 and D1-A2 splice junctions than in PacBio sequencing [6], possibly indicative of recovery bias in PacBio sequencing.

Fig. 6

Transcription and splicing of the HIV89.6 RNA. a Junctions between HIV splice donors and acceptors observed in the RNA-Seq data. Acceptors are shown as the columns and donors as the rows with the coloring indicating the frequency of each pairing. b The relative abundance of 78 HIV89.6 transcripts as determined by a combination of PacBio sequencing [6] and Illumina sequencing. Message structures were generated by targeted long read single molecule sequencing, which allowed association of multiple splice junctions in single sequence reads. The Illumina short read sequencing allowed normalization of message abundances between size classes. The inferred HIV message population is shown colored by relative abundance

A 3′ bias is apparent in our sequencing data (Additional file 6A). This could be due to the poly-A capture step of the protocol where any break in the RNA would result in loss of distal 5′ sequences [93]. We used sequence reads from the large unspliced HIV intron 1 to measure this bias by regressing the \(\log\) of the number of fragments with a 5′-most end starting at a given position against the distance of that position from the viral polyadenylation site, yielding an estimated probability of breakage of 0.021 % per base (Additional file 6A). Given this rate of truncation, there is only a 14 % chance of reaching the 5′ end of the 9171 nt unspliced HIV genome (\((1-0.00021)^{9171}\)).
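The breakage estimate can be sketched as a log-linear fit. If each base breaks independently with probability \(p\), the number of fragments whose 5′ end lies at distance \(d\) from the polyadenylation site is proportional to \((1-p)^{d}\), so the slope of \(\log\)(count) against \(d\) equals \(\log (1-p)\). The conversion from slope to \(p\) and the synthetic counts below are assumptions made for illustration; the real fit used reads from HIV intron 1.

```python
import math

def estimate_break_prob(distances, counts):
    """Least-squares slope of log(count) vs distance, converted to p."""
    n = len(distances)
    logs = [math.log(c) for c in counts]
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx          # equals log(1 - p) under the model
    return 1.0 - math.exp(slope)

def prob_intact(break_prob, length_nt):
    """Chance that a transcript of length_nt bases has no break."""
    return (1.0 - break_prob) ** length_nt

# Synthetic, noise-free counts generated with p = 0.00021
synthetic_distances = [0, 1000, 2000, 3000, 4000]
synthetic_counts = [5000 * (1 - 0.00021) ** d for d in synthetic_distances]
p_hat = estimate_break_prob(synthetic_distances, synthetic_counts)

# Chance of reaching the 5' end of the 9171 nt unspliced genome
p_full = prob_intact(0.00021, 9171)
```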

Ocwieja et al. [6] determined the relative abundance of similarly sized HIV89.6 transcripts using PacBio single molecule sequencing, but were not able to estimate the relative abundance of all transcripts due to a sequencing bias favoring shorter transcripts. For this reason, relative abundances could only be specified within message size classes (i.e. the 4 kb, 2 kb and unexpectedly a 1 kb size class as well) and the overall quantitative abundances were unknown. Our RNA-Seq data are unable to reconstruct the multiply spliced messages due to short read lengths but do permit estimation of size class abundances after correcting for 3′ bias (Additional file 6). Thus the PacBio data reported by Ocwieja et al. [6] and the Illumina data reported here can be combined to determine the complete relative abundance of 78 HIV89.6 transcripts (Fig. 6b).

The most abundant HIV mRNAs were the unspliced HIV genome (37.6 %), a transcript encoding Nef (D1-A5-D4-A7: 15.5 %), two 1 kb size class transcripts (D1-A5-D4-A8c: 10.6 %, D1-A8c: 4.9 %) and two Rev-encoding transcripts (D1-A4c-D4-A7: 4.2 %, D1-A4b-D4-A7: 3.1 %).

Using these abundances, we can estimate the number of HIV89.6 genomes in these primary T cells 48 h after infection. To determine the proportion of the mRNA nucleotides from viral transcripts, we multiplied the estimated abundances by their transcript lengths. Unspliced genome transcripts appear to form 79 % of the mRNA nucleotides from HIV89.6 transcripts. Assuming T cells contain at least 0.1 pg of mRNA then an infected cell should contain at least 0.011 pg of unspliced HIV transcript (\(0.1\text {pg}\times 0.14\frac{\text {HIV mRNA nt}}{\text {cell mRNA nt}}\times 0.79\frac{\text {unspliced mRNA nt}}{\text {HIV mRNA nt}}\)) or, assuming 9171 bases of RNA weigh about \(5 \times 10^{-6}\) pg, at least 2200 HIV genomes at 48 h post infection. This estimate roughly agrees with previous estimates of HIV production per cell [92, 94, 95].
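The arithmetic behind this estimate can be laid out step by step. The inputs are the values stated above; the average nucleotide mass of about 330 g/mol used to check the weight of one genome is a common approximation and an assumption of this sketch, not a value given in the text.

```python
AVOGADRO = 6.022e23
AVG_NT_MASS_G_PER_MOL = 330.0   # assumed average mass of one RNA nucleotide

mrna_per_cell_pg = 0.1          # assumed minimum mRNA per T cell
hiv_fraction = 0.14             # HIV share of mRNA nucleotides in infected cells
unspliced_fraction = 0.79       # unspliced share of HIV mRNA nucleotides
genome_length_nt = 9171

# Mass of unspliced HIV RNA per infected cell, in pg
unspliced_pg = mrna_per_cell_pg * hiv_fraction * unspliced_fraction

# Mass of one genome: grams per molecule, converted to pg (1 g = 1e12 pg)
genome_mass_pg = genome_length_nt * AVG_NT_MASS_G_PER_MOL / AVOGADRO * 1e12

# Genome count using the rounded weight of 5e-6 pg stated in the text
genomes_per_cell = unspliced_pg / 5e-6
```

The computed genome mass comes out close to the \(5 \times 10^{-6}\) pg quoted above, and the resulting count is about 2200 genomes per infected cell.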

Human-HIV chimeric reads

In our data, 80,045 reads contained sequences matching to both HIV and human genomic DNA. For a baseline measure of HIV89.6 integration patterns, we used ligation-mediated PCR to recover provirus-human junctions from the same infected cell populations, yielding 147,281 unique integration sites [96].

Comparison between these two datasets revealed abundant RNA-Seq chimeras formed between HIV and mitochondrial sequences while no proviral integrations into mitochondria were observed (Additional file 7A) or have been previously reported [53]. This likely indicates significant contamination with chimeras formed during the preparation of libraries for sequence analysis [97–104]. Potential mechanisms include template switching between sequences with shared similarity during reverse transcription [105–107] and priming off incomplete transcripts during DNA synthesis [97, 98, 108, 109]. To account for these artifacts, we retained only the 605 reads with no overlap and no unknown intervening sequence between human and HIV portions (Additional file 7B) where the HIV sequence bordered the 3′ or 5′ end of HIV or an HIV splice donor or acceptor (Additional file 7C).

Chimeric messages composed of HIV and cellular RNA sequences can be formed by cellular gene transcription reading into the integrated provirus, by HIV transcription reading out through the viral polyadenylation site or by splicing between human and viral splice sites. In our filtered data, the predominant forms appear to be derived from reading through the HIV polyadenylation signal into the surrounding DNA (78 %), splicing out of the viral D4 splice donor to join to human splice acceptors (17 %) and reading into the HIV 5′ LTR from human sequence (4.0 %) (Fig. 7). No splice site other than D4 had more than two chimeric reads observed.

Fig. 7

Analysis of chimeric RNA sequences containing both human and HIV sequences. Counts of the location in the HIV genome of the HIV-human junctions in filtered chimeric reads. Due to abundant sequencing artifacts (Additional file 7), reads were filtered to exclude reads where the human and HIV portions contained overlapping complementarity at the sequence junction (a sign of potential artifactual formation) and to exclude reads where the viral portion did not start at a known splice site or 5′ or 3′ border of the HIV genome

The filtered chimeric reads had many traits consistent with biological chimera formation. The reads containing HIV D4 joined to human sequences had the characteristics expected of splicing—72.1 % of the chimeric junctions mapped to known human acceptors and 96.1 % mapped to a location immediately preceded by the AG consensus of human mRNA acceptors. The reads containing the 5′ or 3′ LTR border were almost exclusively (93 %) found in transcription units, with odds of being in a gene 2.3-fold (95 % CI 1.6–3.2, \(p=10^{-7}\)) higher than integration sites from the same sample. The readthrough chimeras were also more likely to be located in an exon than integration sites (odds ratio: 2.1, 95 % CI 1.6–2.6, \(p=10^{-7}\) only considering integration sites and chimeras in transcription units).

Chimeric sequences have the potential to alter the expression of proto-oncogenes leading to proliferation of the host cell [57–60]. We investigated possible effects of integration on cell proliferation by asking whether chimeric RNAs were more common at proto-oncogenes. HIV has been reported to integrate near oncogenes more often than expected by chance [110] and here integrations were more frequent in genes annotated as proto-oncogenes by the Uniprot Knowledgebase [111, 112] than in matched random controls [113] (odds ratio: 3.84, 95 % CI 3.72–3.97, \(p=0.0005\)). To account for this preference, we compared the locations of RNA-Seq chimeras to those of integration sites from the same samples. In these data, we saw no significant enrichment for chimeric mRNA to originate in transcription units annotated as proto-oncogenes relative to integration sites (Fisher’s exact test \(p=0.15\)). This lack of significant enrichment might be expected since cells were infected for only 48 h and there would be little opportunity for selection to occur during cell growth.

We next compared whether the human and viral segments of chimeric reads agreed or disagreed in orientation (i.e. strand transcribed) for reads with the human portion mapped within annotated transcription units. The sequencing technique used here does not preserve strand information, but we can check whether the strand of a sequence read agrees or disagrees with the annotated gene strand and compare this to the observed strand of the HIV portion of the read. Chimeras involving HIV splice donor D4 were highly enriched for matching orientations (odds ratio: 52.5, 95 % CI 12.1–307, \(p=10^{-11}\)) suggesting that pairing with human splice acceptors constrains the orientation of D4 chimeric reads. We also found a strong association between the orientation of the human and HIV portions of chimeric reads within 3′ and 5′ chimeras (odds ratio: 6.2, 95 % CI 3.9–10.2, \(p<10^{-15}\)). This highly significant enrichment of HIV and human genes in the same orientation might indicate that antisense HIV RNA is rapidly degraded by a response to double-stranded RNA or that polymerases oriented in opposing directions interfere with one another during elongation.
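The orientation comparison above reduces to a 2×2 table (human portion sense or antisense relative to the annotated gene, crossed with HIV portion sense or antisense). The following is a minimal sketch of how an odds ratio and a two-sided Fisher's exact p-value can be computed for such a table. The counts are hypothetical, chosen only for illustration, and the function names are not from the study. Note that R's `fisher.test` reports a conditional maximum likelihood odds ratio, which differs slightly from the simple sample odds ratio computed here.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Hypothetical counts (NOT from the study):
# rows = human portion sense / antisense to the annotated gene,
# columns = HIV portion sense / antisense.
same_same, same_opp, opp_same, opp_opp = 40, 5, 6, 38
print(odds_ratio(same_same, same_opp, opp_same, opp_opp))
print(fisher_exact_two_sided(same_same, same_opp, opp_same, opp_opp))
```

A large odds ratio with a small p-value in such a table corresponds to the enrichment for matching orientations described in the text.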

Based on these data, we can propose a lower bound on the relative abundance of chimeras. If we assume that our filtering removed nearly all artifacts so that we have few false positives, then our estimate should be lower than the true proportion of chimeras. In our data, only \(\frac{604}{12{,}689{,}879} = 0.0048\,\%\) of reads containing sequence mapping to HIV also contained identifiable chimeric junctions. However, this is an underestimate because in an HIV-derived mRNA, any fragment of the sequence will be mappable to HIV, while for a chimeric sequence only a read spanning the HIV-human junction will allow identification of a chimera. If we assume that 25 bases of sequence are necessary to map to human or HIV sequence, then, with the 100 bp reads used here, only read fragments starting between 75 and 25 bp downstream of the chimeric junction, a window of 50 positions, will be identifiable. If we assume the average chimeric mRNA sequence is at least 2 kb long, then a read from a chimeric sequence has at most a \(\frac{50}{2000}=2.5\,\%\) chance of containing a mappable junction. Thus, a lower bound for the proportion of HIV mRNAs that also contain human-derived sequences is 0.2 % (\(\frac{0.0048\,\%}{2.5\,\%}\)). Looking only at splicing from HIV donor D4, we saw 16,843 reads containing a junction from D4 to an HIV acceptor and 104 reads from D4 to human sequence. Thus, in our data, 0.6 % of D4 splice products form junctions with human acceptors instead of HIV acceptors.
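The arithmetic behind this lower bound can be retraced in a few lines. The sketch below uses only the counts and assumptions stated in the paragraph above (25 mappable bases on each side, 100 bp reads, chimeric transcripts of at least 2 kb); the function and variable names are illustrative.

```python
def chimera_lower_bound(chimeric_reads, hiv_reads, read_length=100,
                        min_mappable=25, transcript_length=2000):
    """Return (observed fraction, detection probability, lower bound).

    A junction is detectable only when at least `min_mappable` bases fall
    on each side of it, leaving read_length - 2 * min_mappable informative
    start positions per transcript.
    """
    observed = chimeric_reads / hiv_reads
    informative_starts = read_length - 2 * min_mappable
    detection_prob = informative_starts / transcript_length
    return observed, detection_prob, observed / detection_prob


observed, detection_prob, bound = chimera_lower_bound(604, 12_689_879)
print(f"observed chimeric fraction: {observed * 100:.4f} %")
print(f"junction detection probability: {detection_prob * 100:.1f} %")
print(f"lower bound on chimeric HIV mRNA: {bound * 100:.2f} %")

# Fraction of D4 splice junctions that join a human acceptor.
d4_to_hiv, d4_to_human = 16_843, 104
d4_fraction = d4_to_human / (d4_to_hiv + d4_to_human)
print(f"D4 junctions to human acceptors: {d4_fraction * 100:.2f} %")
```

Because a longer assumed transcript length lowers the detection probability, the 2 kb minimum makes the bound conservative.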

Discussion

Here we used RNA-Seq to analyze mRNA accumulation and splicing in primary T cells infected with the low passage isolate HIV89.6. We did not carry out dense time series analysis, compare different human cell donors or compare different perturbations of the infections; instead, we focused on generating a dense data set at a single time point. We analyzed replicate infected cell and control samples to allow discrimination of within-condition versus between-condition variation and assessed differences using a series of bioinformatic approaches. Many previous studies have used microarray technology or RNA-Seq to study gene activity in HIV-infected cells [17–22, 24–28, 36], usually analyzing infections of transformed cell lines or laboratory-adapted strains of HIV-1. Here we present what is to our knowledge the deepest RNA-Seq data set reported for infection in primary T cells using a low passage HIV isolate.

This RNA-Seq data set was paired with a set of 147,281 unique integration site sequences extracted from the same infections, which were critical to our ability to quality control chimeric reads. An advantage of studies using cell lines and laboratory-adapted strains is that a high proportion of cells can be infected, whereas in this study we achieved only \(\mathord {\sim }30\,\%\) infection. However, we report distinctive features of the transcriptional response not seen in studies of HIV infections in cell lines. Novel in this study are (1) identification of intron retention as a consequence of HIV infection, (2) the finding of activation of ERV-9/LTR12C after HIV infection, (3) generation of a quantitative account of the structures and abundances of over 70 HIV89.6 messages and (4) clarification of the predominant types of HIV-host transcriptional chimeras. These findings are discussed below.

Broad changes in host cell mRNA abundances were evident after infection, with over 17 % of expressed genes changing significantly in activity. Changes included response to viral infection, apoptosis and T cell activation. Although it is not possible here to separate the response of infected and bystander cells, this study highlights the drastic changes in cellular expression caused by HIV-1 infection. In a meta-analysis including four previously published studies, no gene was detected as differentially expressed in all five studies and only a handful of genes appeared in four out of five studies. Further analyses showed that expression changes appear to be cell type specific, raising concerns that studies using cell lines may not fully reflect host cell responses in in vivo infections.

Unexpectedly, intronic sequences were more common in the RNA-Seq data from cells after HIV89.6 infection than in mock infected cells. The mechanism is unclear. It is possible that the splicing machinery is reduced in activity after 48 h of infection, perhaps as a part of the antiviral response of infected and bystander cells. HIV infection does appear to alter expression and localization of some splicing factors [35, 114, 115] and genes involved in RNA splicing were more likely to be differentially expressed in our study (Benjamini–Hochberg corrected Fisher’s exact test \(q=2\times 10^{-5}\)). Alternatively, fully spliced mRNAs might be more rapidly degraded after infection, possibly by interferon-mediated induction of RNaseL [116, 117], or off-target binding of viral protein Rev might mediate export of incompletely spliced cellular transcripts [118, 119]. A speculative possibility is that HIV89.6 encodes a factor that alters cellular splicing or promotes mRNA degradation to optimize splicing and translation of viral messages.

Ribosomal protein genes were particularly affected by intron retention. Several of these genes have been reported to autoregulate protein abundance through a feedback loop where the protein represses splicing of its own mRNA transcripts to generate unproductive spliceforms [120–122]. HIV infection can cause a decrease in fully processed ribosomal RNA [29], likely through the interferon-activated RNaseL pathway [123, 124]. Here, we do not have a direct measure of rRNA abundance due to the poly-A selection but we did see an apparent decrease in total RNA yield in HIV-infected samples. Decreased rRNA might lead to more free ribosomal proteins which could suppress splicing of ribosomal gene transcripts during HIV infection. However, previous reports of alternative splicing in ribosomal protein genes have involved specific introns rather than the broad intron retention seen here, perhaps indicating that both the intron retention and the general decrease of expression of ribosomal genes may be part of an innate immune response repressing translation [125, 126].

Infection resulted in increased expression of specific cellular repeated sequences. HERVs, in particular HERV-K, have previously been observed to show increased RNA accumulation with HIV infection [37–42, 47] and possibly represent vaccine targets because of their production of distinctive proteins [44–47, 83, 127]. Here, though we saw modest increases in HERV-K expression, ERV-9 had the greatest changes in expression (33 LTR12C and 14 ERV-9 annotated regions with greater than \(4\times\) change in expression). Previous RNA-Seq studies of HIV infection in cell lines did not report increases in HERV expression [24, 25] but this difference is likely due to a much higher baseline expression of HERVs in transformed cell lines. We also observed increases in LINE and Alu element transcription, as has been reported previously [43], and changes in ERV-9/LTR12C expression associated with the density of transcription factor binding motifs within specific U3 variants.

Many of the repeated sequence elements that were induced by HIV89.6 infection are relatively recently integrated in the human genome. The reason for this pattern has been unclear. It may be that older elements have accumulated more mutations, resulting in an inactivation of transcriptional signals. Alternatively, perhaps the elements that are induced have been recruited for transcriptional control of cellular functions, so that their transcriptional activity is preserved evolutionarily [90, 91, 128–131].

Comparison of the results of sequencing HIV89.6 messages using long-read single molecule sequencing (Pacific Biosciences from Ocwieja et al. [6]) and dense short read sequencing (Illumina data reported here) allowed a full quantitative accounting of more than 70 HIV89.6 splice forms. The full length unspliced HIV RNA comprised 37.6 % of all messages, corresponding to more than 2000 genomes per cell. Notably abundant messages included the full length genome and spliced transcripts encoding Nef and Rev. The full set of messages is summarized in Fig. 6b.

Our previous analysis using PacBio sequencing [6] revealed an unusually prominent 1 kb size class. HIV89.6 encodes a splice acceptor (A8c) within Nef responsible for formation of the short messages. Our data indicated that two members of the 1-kb size class, D1-A5-D4-A8c and D1-A8c, accounted for 10.6 % and 4.9 % of all viral messages. The 1 kb size class as a whole accounted for fully 20 % of messages. The function of this large amount of 1 kb transcript is unknown. The most abundant 1 kb transcripts do not appear to encode significant open reading frames although other 1 kb transcripts can encode a Rev-Nef fusion [6]. Most HIV/SIV variants do appear to encode an acceptor near this position, suggesting a potential unknown function for these short spliced forms [6, 132, 133]. This analysis also suggests a lower proportion of 4 kb messages than has been seen for another isolate [134] suggesting that these ratios may vary with strain, time of infection or other conditions [6].

After filtering, we detected a sizeable number of apparently authentic chimeras containing both HIV and cellular sequences. Mechanisms of insertional activation have been studied intensively in animal models of transformation and in adverse events in human gene therapy. One of the most common mechanisms involves insertion of a retroviral enhancer near a cellular promoter, so that transcription of a nearby gene is increased [58, 135–137]. However, another common mechanism involves formation of chimeric messages involving both cellular and viral/vector sequences [57, 58]. A targeted in vitro study of chimeric message formation by lentiviral vectors showed examples of multiple types of splice-in messages [59], which may have been more frequent and more varied than for the HIV89.6 proviruses studied here. The low level of chimeric splicing into and reading into HIV in this study may be a reflection of the high rate of HIV transcription in these infected cells. Because HIV was so highly expressed, there would be more opportunities for polymerase to splice out of or read through the HIV genome than to read or splice in. The vast majority of HIV proviruses in expanded clones in well-suppressed patients appear to be defective [51]. Going forward, it will be of interest to investigate whether these HIV proviruses are damaged in ways that promote formation of chimeric transcripts.

Lastly, we note that several features of the transcriptional response to HIV89.6 infection were suggestive of de-differentiation away from T cell specific expression patterns. The increase in expression of cellular HERVs and LINEs is characteristic of cells in early development. Specific HERVs and transposons, including ERV-9/LTR12C and HERV-K, have been implicated in regulating gene activity early in development [90, 128, 131, 138–141]. Several genes related to other hematopoietic cell types showed elevated RNA abundance after HIV89.6 infection. These data are of interest given the finding that patients undergoing long term ART can contain long lived T cell clones that may contribute to the latent reservoir [51, 142–145]. Possibly the transcriptional responses seen here in infected primary T cells are reflective of processes leading to the formation of latently-infected cells with stem-like properties.

Conclusions

Infections of primary T cells with a low passage HIV isolate showed several distinctive features compared to previously published data using T cell lines and/or lab-adapted HIV strains. We found strong changes in expression in genes related to immune response and apoptosis similar to studies of HIV infection in patient samples and primary cells but different from studies performed in SupT1 cell lines. Notable changes after infection included intron retention and activation of recently integrated retrotransposons and endogenous retroviruses, in particular ERV-9/LTR12C. We also present a complete quantitative accounting of over 70 messages from HIV89.6 and specify the major virus-host chimeras as splicing from viral splice donor 4 to cellular acceptors and readthrough from the 5′ and 3′ ends of the provirus.

Methods

Cell culture and viral infections

HIV89.6 stocks were generated by the University of Pennsylvania Center for AIDS Research. 293T cells were transfected with a plasmid encoding an HIV89.6 provirus, and harvested virus was passaged in SupT1 cells once. Viral stocks were quantified by measuring p24 antigen content. Primary CD4+ T cells were isolated by the University of Pennsylvania Center for AIDS Research Immunology Core from apheresis product from a single healthy male donor (ND365) using the RosetteSep Human CD4+ T Cell Enrichment Cocktail (StemCell Technologies). The Immunology Core maintains the IRB-approved protocol (IRB #705906) and receipt of these cells is considered secondary use of de-identified human specimens.

T cells were stimulated for 3 days at \(0.5 \times 10^6\) cells per milliliter in R10 media (RPMI 1640 with GlutaMAX (Invitrogen) supplemented with 10 % FBS (Sigma-Aldrich) with 100 U/mL recombinant IL2 (Novartis) + 5 µg/mL PHA-L (Sigma-Aldrich)). Here PHA and IL2 were used for their strong activating effects but further investigation using cells activated in a more physiological way might be informative. Cells were infected in triplicate and mock infections were performed in duplicate. For each infection, \(6.6 \times 10^6\) cells were mixed with 1.32 µg HIV89.6 in a total volume of 2.25 mL. Each infection mixture was split into three wells of a 6 well plate for spinoculation at 1200 g for 2 h at 37 °C. Cells were incubated an additional 2 h at 37 °C. Cells were then pooled into flasks and volume was increased to a total of 12 mL. Spreading infection was allowed to proceed for 48 h at 37 °C, after which cells were harvested. \(10^6\) cells were harvested for flow cytometry, and \(6 \times 10^6\) cells were pelleted following two washes in PBS for nucleic acid extraction. Genomic DNA and total RNA were isolated from \(6 \times 10^6\) T cells per infection using the AllPrep DNA/RNA Mini Kit (Qiagen) with Qiashredder columns (Qiagen) for homogenization according to the manufacturer’s instructions. DNA was eluted in 140 µL elution buffer. RNA samples were treated with DNase prior to elution in 40 µL water.

Analysis of HIV89.6 integration sites in primary T cells

Integration site sequences were determined for DNA fractions from the above infections after ligation mediated PCR [96]. A total of 147,281 unique integration site sequences were determined. An analysis of integration site distributions for these samples was reported in Berry et al. [96].

mRNA sequencing

Messenger RNA was isolated and amplified from purified total cellular RNA (3 µL or approximately 9 µg from each uninfected sample, 25 µL or approximately 3 µg from each infected sample) using the Illumina TruSeq RNA sample preparation kit according to manufacturer’s protocol. SuperScript III (Invitrogen) was used for reverse transcription. Each sample was tagged with a separate barcode and sequenced on an Illumina HiSeq 2000 using 100 bp paired-end chemistry.

Flow cytometry

To assess percent infected cells, \(10^6\) cells per infection were stained for flow cytometry. All staining incubations were at room temperature. Cells were first washed in PBS and then twice in FACS wash buffer (PBS, 2.5 % FBS, 2 mM EDTA). Cells were fixed and permeabilized with CytoFix/CytoPerm (BD) for 20 min and washed with Perm-Wash Buffer (BD) before staining with anti-HIV-Gag-PE (Beckman Coulter) for 60 min. Finally, cells were washed in FACS wash buffer and resuspended in 3 % PFA. Samples were run on an LSRII (BD) and analyzed with FlowJo 8.8.6 (Treestar). Cells were gated as follows: lymphocytes (SSC-A by FSC-A), then singlets (FSC-A by FSC-H), then by Gag expression (FSC-A by Gag).

Analysis

Reads were aligned to the human genome using a combination of BLAT [146] and Bowtie [147] through the Rum pipeline [148]. Estimates of fragments per kilobase of transcript per million mapped reads and changes in expression for cellular genes were calculated by Cufflinks [149]. Reads found to contain sequence similar to the HIV genome using a suffix tree algorithm were aligned against the HIV89.6 genome using BLAT [146]. All statistical analyses were performed in R 3.1.2 [150]. RNA-Seq reads from Chang et al. [25] were downloaded from the Sequence Read Archive (SRP013224) and aligned using the Rum pipeline.

Gene lists were obtained from the supplementary materials of four other studies of differential gene expression during HIV infection [24–26, 30]. We called genes differentially expressed in Li et al. [30] if they had a reported \(p<0.01\) or in Lefebvre et al. [24], Chang et al. [25] and Imbeault et al. [26] if they had an adjusted \(p<0.05\). We called genes as differentially expressed in our own study if the adjusted \(p<0.01\). For the comparison of differentially expressed genes regardless of direction in Fig. 1 (below the diagonal), it was unclear exactly how many genes were assessed in each study, so we assumed a background of 14,192 genes (the number of genes that could be tested for significance in our data).

We obtained transcriptional profiles comparing immune cell subsets from the Molecular Signatures Database (MSigDB) [68]. The MSigDB sets used in Fig. 2a were GSE10325 LUPUS CD4 TCELL VS LUPUS BCELL, GSE10325 CD4 TCELL VS MYELOID, GSE10325 CD4 TCELL VS BCELL, GSE10325 LUPUS CD4 TCELL VS LUPUS MYELOID, GSE3982 MEMORY CD4 TCELL VS TH1, GSE22886 CD4 TCELL VS BCELL NAIVE, GSE11057 CD4 CENT MEM VS PBMC, GSE11057 CD4 EFF MEM VS PBMC, GSE3982 MEMORY CD4 TCELL VS TH2 and GSE11057 PBMC VS MEM CD4 TCELL and in Fig. 2b were GSE36476 CTRL VS TSST ACT 72H MEMORY CD4 TCELL OLD, GSE10325 CD4 TCELL VS LUPUS CD4 TCELL, GSE22886 NAIVE CD4 TCELL VS 12H ACT TH1, GSE3982 CENT MEMORY CD4 TCELL VS TH1, GSE17974 CTRL VS ACT IL4 AND ANTI IL12 48H CD4 TCELL, GSE24634 IL4 VS CTRL TREATED NAIVE CD4 TCELL DAY5, GSE24634 NAIVE CD4 TCELL VS DAY10 IL4 CONV TREG, GSE1460 CD4 THYMOCYTE VS THYMIC STROMAL CELL and GSE1460 INTRATHYMIC T PROGENITOR VS NAIVE CD4 TCELL ADULT BLOOD.

We downloaded the RepeatMasker [76] track from the UCSC genome browser [151] and used the SAMtools library [152] to assign reads to the repeat regions. HERV-K age estimates were obtained from the supplementary materials of Subramanian et al. [82].

We used a Bayesian estimate of the ratio of expression in uninfected and HIV-infected samples to account for sampling effort and differing expression in genomic regions. We modeled the observed counts as a binomial distribution with a flat beta prior (\(\alpha =1,\beta =1\)) separately for uninfected and infected samples. We then Monte Carlo sampled the two posterior distributions to estimate the posterior distribution of the ratio. For introns, the number of binomial successes was set to the number of reads mapped to the intron and the number of trials was the total number of reads observed in the genes overlapping that intron. For repeat regions, the number of binomial successes was set to the number of reads mapped to that region and the number of trials was the total number of reads mapped to the human genome.
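The estimate described above can be sketched as follows. With a flat Beta(1, 1) prior and \(k\) successes in \(n\) trials, the posterior for the binomial proportion is Beta(\(1+k\), \(1+n-k\)), so the posterior of the ratio can be approximated by drawing from each posterior and dividing. This is an illustrative Python sketch, not the R code used in the study; the read counts, the number of draws and the choice of infected over uninfected as the direction of the ratio are assumptions.

```python
import random


def posterior_ratio(succ_uninf, trials_uninf, succ_inf, trials_inf,
                    draws=10_000, seed=0):
    """Monte Carlo posterior of the infected/uninfected proportion ratio.

    Each proportion gets a Beta(1 + successes, 1 + failures) posterior,
    which follows from a binomial likelihood and a flat Beta(1, 1) prior.
    Returns the 2.5 %, 50 % and 97.5 % posterior quantiles of the ratio.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(draws):
        p_uninf = rng.betavariate(1 + succ_uninf,
                                  1 + trials_uninf - succ_uninf)
        p_inf = rng.betavariate(1 + succ_inf, 1 + trials_inf - succ_inf)
        ratios.append(p_inf / p_uninf)
    ratios.sort()
    lower = ratios[int(0.025 * draws)]
    median = ratios[draws // 2]
    upper = ratios[int(0.975 * draws) - 1]
    return lower, median, upper


# Hypothetical counts (NOT from the study): reads in one intron out of
# all reads in the overlapping gene, for uninfected and infected samples.
lower, median, upper = posterior_ratio(50, 10_000, 150, 10_000)
print(lower, median, upper)
```

A region would then be called as changed when the whole credible interval lies beyond a chosen fold-change threshold, as with the \(>2\times\) criterion used for LTR12C.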

Lasso regression was performed using the R package glmnet [153]. The \(\lambda\) smoothing parameter of the lasso regression was optimized by finding the \(\lambda\) with lowest mean squared error in a 500-fold cross validation and picking the simplest model with misclassification error within one standard error of the minimum.
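The "one standard error" selection described above can be sketched as below. This is an illustrative Python version of the general rule, not the glmnet implementation: it takes per-fold cross-validation errors for each candidate \(\lambda\), finds the \(\lambda\) with the lowest mean error, and then returns the largest \(\lambda\) (the sparsest lasso model) whose mean error is within one standard error of that minimum. The \(\lambda\) grid and fold errors are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_se_lambda(lambdas, fold_errors):
    """Return (lambda with minimum CV error, lambda chosen by the 1-SE rule).

    lambdas: candidate penalty values.
    fold_errors: fold_errors[i] holds the per-fold errors for lambdas[i].
    """
    means = [mean(errs) for errs in fold_errors]
    ses = [stdev(errs) / sqrt(len(errs)) for errs in fold_errors]
    best = min(range(len(lambdas)), key=lambda i: means[i])
    threshold = means[best] + ses[best]
    # A larger lasso penalty gives a sparser, i.e. simpler, model.
    eligible = [lam for lam, m in zip(lambdas, means) if m <= threshold]
    return lambdas[best], max(eligible)


# Hypothetical cross-validation errors for three penalties, four folds each.
lambdas = [0.01, 0.1, 1.0]
fold_errors = [
    [0.200, 0.220, 0.180, 0.200],
    [0.205, 0.225, 0.185, 0.205],
    [0.400, 0.420, 0.380, 0.400],
]
lambda_min, lambda_1se = one_se_lambda(lambdas, fold_errors)
print(lambda_min, lambda_1se)
```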

To estimate determinants of ERV-9/LTR12C expression, we fit a logistic regression of whether each LTR12C region increased in expression with HIV89.6 infection (95 % Bayesian credible interval \(>2\times\) change) on characteristics of the LTR12C regions. We extracted all the LTR12C regions from the human genome and determined the U3-R boundary using an ends-free alignment of the previously reported U3-R border [88–90, 154, 155] against the sequences. Regions less than 1,000 bases long were discarded. Previous studies disagreed about the location of the LTR12C transcription start site and it appears that transcription may start in several places [88, 155]. We took the 5′ most site that had agreement between studies (transcription starting with TGGCAACCC). We split the sequences into short and longer length classes based on repeated sequences about 70 bases upstream from the transcription start site. For the short class and the 3 subtypes within the long length class, we generated a consensus sequence and counted the Levenshtein edit distance between each consensus and each corresponding sequence. We also counted the number of NFY motifs (CCAAT or ATTGG), MZF1 motifs (GTGGGGA) and GATA2 motifs (GATA or TATC) in the entire U3 region and checked if a TATA box (AATAAA) [90] was present in the 50 bases upstream of the TSS. A final regression model was selected using stepwise regression with an AIC cutoff of 5. For display, the LTR12C sequences were aligned with MUSCLE [156].
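Two of the sequence features described above, motif counts on either strand and Levenshtein distance to a consensus, can be computed as in the following sketch. The motifs are those listed in the text; the example U3 fragment is made up, and counting overlapping occurrences is an assumption, since the text does not say how overlaps were handled.

```python
def count_motif(seq, motif):
    """Count occurrences of motif in seq, allowing overlapping matches."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def count_either_strand(seq, motif, revcomp):
    """Count a motif plus its reverse complement (e.g. CCAAT and ATTGG)."""
    return count_motif(seq, motif) + count_motif(seq, revcomp)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical U3 fragment (NOT a real LTR12C sequence).
u3 = "TTCCAATGTGGGGAACGATAGATTGGTATC"
nfy = count_either_strand(u3, "CCAAT", "ATTGG")
mzf1 = count_motif(u3, "GTGGGGA")
gata2 = count_either_strand(u3, "GATA", "TATC")
print(nfy, mzf1, gata2, levenshtein(u3, "TTCCAATGTGGGGAACGATA"))
```

Counts like these, together with the edit distance to the subtype consensus, would form the predictor columns of the logistic regression.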

The abundance of the HIV RNA size classes was estimated as described in Additional file 6. These estimates were then multiplied by the within size class proportions estimated by Ocwieja et al. [6] using PacBio sequencing of HIV89.6 to yield proportions over 78 measured HIV89.6 RNAs.
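The combination step described above amounts to multiplying each size class abundance by the within-class proportion of each transcript. A minimal sketch follows; all numbers and all transcript names other than those mentioned in the text are hypothetical placeholders, not the study's estimates.

```python
def transcript_proportions(class_abundance, within_class):
    """Overall proportion of each transcript among all HIV messages.

    class_abundance: size class -> fraction of all HIV messages.
    within_class: size class -> {transcript -> fraction within the class}.
    """
    overall = {}
    for size_class, transcripts in within_class.items():
        for name, fraction in transcripts.items():
            overall[name] = class_abundance[size_class] * fraction
    return overall


# Hypothetical inputs (NOT the study's estimates).
class_abundance = {"unspliced": 0.4, "4kb": 0.2, "2kb": 0.2, "1kb": 0.2}
within_class = {
    "unspliced": {"genome": 1.0},
    "4kb": {"env_like_a": 0.5, "env_like_b": 0.5},
    "2kb": {"nef_like": 0.75, "rev_like": 0.25},
    "1kb": {"D1-A5-D4-A8c": 0.5, "D1-A8c": 0.25, "other_1kb": 0.25},
}
overall = transcript_proportions(class_abundance, within_class)
print(overall)
```

As long as the class abundances sum to one and the proportions within each class sum to one, the resulting transcript proportions also sum to one.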

Availability of supporting data

RNA-Seq reads from this study are available at the Sequence Read Archive under accession number SRP055981. The integration site data is available at the Sequence Read Archive under accession number SRP057555.