Bioinformatics and HIV Latency

Despite effective treatment, HIV is not completely eliminated from the infected organism because of the existence of viral reservoirs. A major reservoir consists of infected resting CD4+ T cells, mostly of memory type, that persist over time due to the stable proviral insertion and a long cellular lifespan. Resting cells do not produce viral particles and are protected from viral-induced cytotoxicity or immune killing. However, these latently infected cells can be reactivated by stochastic events or by external stimuli. The present review focuses on novel genome-wide technologies applied to the study of integration, transcriptome, and proteome characteristics and their recent contribution to the understanding of HIV latency.


Introduction
The study of biological processes, including viral infections, benefits from novel technologies for exploration and discovery. In the past decade, new developments have focused on genome-wide analyses, including high-throughput sequencing technologies for genome and transcriptome analyses, as well as mass spectrometry-based technologies for proteomic studies. These technologies aim at providing the most comprehensive snapshot picture of a specific cellular content, thereby generating large amounts of data. In turn, large-scale datasets require the development of tools and methods to handle and analyze these data adequately. This review focuses on experimental and bioinformatic methodologies applied to the topic of HIV latency.

HIV Latency
HIV infection is today considered a chronic disease, thanks to the efficacy of antiretroviral therapy (ART) [1, 2•, 3-7]. ART aims at blocking multiple viral replication steps and enzymatic activities, i.e., entry, reverse transcriptase, integrase, and protease, thereby limiting virus escape through selection of variants. ART results in undetectable viremia in the plasma of HIV+ individuals but fails to achieve complete virus eradication, hence requiring life-long treatment [1, 2•, 3-7]. HIV persistence is explained by the existence of reservoirs that are established early upon HIV exposure [8••, 9-11]. Two types of reservoirs have been described that are not mutually exclusive and that are likely to co-exist in some individuals. The first reservoir consists of anatomical sanctuaries, such as the central nervous system, gut, or lungs, where ART drug concentration may be suboptimal due to incomplete penetration [12-14]. Infection may persist in these sites due either to partial inhibition of viral replication or to cell-to-cell spread of the virus [12,13,15]. The second reservoir relies on the persistence of HIV in long-lived cells in a latent and thus "hidden" state. This state relies on HIV's ability to stably insert its genome in the host cell chromosome, thereby ensuring its life-long association with the infected host cell. This latent cell reservoir consists mostly of resting memory CD4+ T cells, but other cells have been shown to contribute as well, including naïve CD4+ T cells, macrophages, dendritic cells, and hematopoietic progenitor cells [16-22]. Latently infected CD4+ T cells can be induced, i.e., reactivated, and are thus able to produce particles through stochastic transcription or through immune activation, thereby leading to intermittent detectable viremia or viremia rebound after ART cessation [2•, 3, 23••].
The reservoir of latently infected CD4+ T cells is currently considered to be the major obstacle to HIV cure. Global scientific efforts have focused on infected CD4+ T cells in order to understand the nature of latency, i.e., the nature of the postintegration blocks hampering viral particle production, and to design new strategies to stimulate cells to exit latency. For this purpose, multiple models have been developed to characterize the various mechanisms that can contribute to HIV latency [24,25]. Models have also been useful for testing reactivation compounds that induce viral particle production from latently infected CD4+ T cells [26••, 27-29]. The present review focuses on recent genome-wide analyses performed on models of HIV latency (Table 1).

The Inducible Reservoir
Available methods either overestimate or underestimate the size of the latent viral reservoir. This is mainly due to (i) the use of blood samples as the source of the latent reservoir, (ii) the predominance of cells carrying defective viral genomes, which obscures the proportion of latently infected cells, and (iii) the incomplete success of current methods used to reactivate latently infected cells in vitro. Indeed, it has been estimated that out of one million blood-purified resting CD4+ T cells from ART-treated individuals, on average, ∼1000 cells (ranging approximately between 100 and 2000 cells) were infected by HIV and thus contained proviral DNA [30••]. However, only a small proportion of these HIV+ cells (11.7 %) carry replication-competent viral genome sequences and are thus inducible [31••]. Using phytohemagglutinin, interleukin-2, and irradiated peripheral blood mononuclear cells to stimulate CD4+ T cells in a viral outgrowth assay, only 1 % of HIV+ cells were successfully induced, thereby illustrating that current reactivation protocols and methods stimulate only a fraction of the total inducible reservoir [30••, 31••]. Other methods of stimulation using anti-CD3/CD28 and interleukin-7 for 7 days were able to induce particle production from 1.5 % of HIV+ cells (ranging from 0.6 to 2.4 % in the 13 patients tested) [23••]. Finally, reactivation of latently infected cells may occur stochastically and spontaneously at a frequency of 0.041 % (0.03-0.15 %) [23••].
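The proportions quoted above can be put side by side with a few lines of arithmetic. The following is a minimal sketch that takes the figures in the text at face value (∼1000 HIV+ cells per million, 11.7 % replication-competent, 1 % induced in the outgrowth assay); the constant and function names are illustrative and do not come from the cited studies.

```python
# Figures quoted in the text; all names here are illustrative assumptions.
HIV_POSITIVE_PER_MILLION = 1_000        # average HIV+ cells per 10^6 resting CD4+ T cells
FRACTION_REPLICATION_COMPETENT = 0.117  # 11.7 % of HIV+ cells
FRACTION_INDUCED_IN_ASSAY = 0.01        # ~1 % of HIV+ cells induced in the outgrowth assay


def reservoir_estimates(hiv_positive: int = HIV_POSITIVE_PER_MILLION) -> dict[str, float]:
    """Return approximate cell counts per million resting CD4+ T cells."""
    replication_competent = hiv_positive * FRACTION_REPLICATION_COMPETENT
    induced = hiv_positive * FRACTION_INDUCED_IN_ASSAY
    return {
        "hiv_positive": float(hiv_positive),
        "replication_competent": replication_competent,
        "induced_in_assay": induced,
        # Share of the replication-competent cells that the assay actually induces.
        "share_of_inducible_recovered": induced / replication_competent,
    }


estimates = reservoir_estimates()
for label, value in estimates.items():
    print(f"{label}: {value:.3g}")
```

Under these assumptions, roughly 117 cells per million would carry replication-competent provirus while only about 10 are induced, i.e., less than a tenth of the inducible reservoir is recovered by a single round of stimulation.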
These studies indicate that latency is complex, as stimulation does not lead to 100 % reactivation of latently infected cells [26••, 31••]. Cillo et al. also show that, under their stimulation conditions, 7.5 % of infected cells were expressing cell-associated RNA, an almost 40× increase in HIV transcription compared to unstimulated resting CD4+ T cells. The gap between successful viral particle production and viral transcription suggests that transcriptional latency cannot recapitulate all aspects of the viral reservoir and that other postintegration blocks exist, consistent with multiple recent studies using latency-reactivating agents and multiple models of HIV latency [23••, 26••, 32••].

Integration Site Location and HIV Latency
Reverse transcription of the viral RNA genome and its integration in the host cell chromosome are two hallmarks of retroviruses. In the past decade, thanks to the availability of the human genome sequence, many efforts have focused on understanding the preferential sites of HIV integration, as well as their consequences for the host cell, mostly regarding insertional mutagenesis [33-35]. HIV has been shown to integrate preferentially into active transcription units, with no preference for exons over introns or for a particular orientation [34,36,37]. The site of integration may have consequences for both HIV transcription and host gene transcription because of chromatin arrangement and RNA interference [35,38]. Therefore, the site of integration can affect the balance between viral transcriptional success and latency, as well as the balance between cell death and clonal expansion.
Analysis of integration site distribution commonly uses a three-step experimental approach: (i) DNA fragmentation (sonication, restriction enzymes), (ii) linker ligation, and (iii) PCR amplification using one primer annealing in the long terminal repeat (LTR) and one primer annealing in the ligated linker [39-41]. The amplicons are sequenced using next-generation sequencing technologies, generating millions of short reads (usually ranging from ∼20 bp to a few hundred bp). Virus sequences are removed, and the trimmed sequences are aligned to the human genome to locate the integration site. This method efficiently captures the preferential pattern of integration site distribution with respect to the human reference genome annotations. However, it fails to quantify the frequency of integration sites occurring at the exact same location, due to the bias caused by the PCR amplification step [42••]. This limitation was recently addressed using random shearing of the DNA or through the usage of random decamer primers tailed with a U5 sequence that allowed discriminating between two identical integration site locations originating from an   [44•, 45•]. The location of the proviral insertion in these clonally expanded infected cells was enriched in genes involved in cell division cycle or cancer, supporting the notion that the site of viral integration may promote homeostatic proliferation. It is important to note that, despite this observation, cancer due to HIV insertional mutagenesis is not a major issue in HIV-infected individuals [36,46]. Alternative hypotheses explaining the observed cell clonal expansion include (i) the infection of stem cells [20,21,47,48] and (ii) the possible survival advantage of cells carrying defective viral genomes [31••, 46]. It would thus be interesting to know if the expanded cell clones are part of the inducible or defective viral reservoir. This would guide viral eradication strategies to account for those cells and explore ways to specifically inhibit their proliferation.
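The idea behind random shearing can be illustrated with a short sketch. It assumes, as an interpretation of the approach rather than a description of any published pipeline, that reads sharing both integration site and shear breakpoint are PCR duplicates of one DNA fragment, whereas distinct breakpoints at the same site come from different cells. The `AlignedRead` record and the example coordinates are hypothetical; real input would come from an aligner.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class AlignedRead(NamedTuple):
    chrom: str       # chromosome of the host-virus junction
    position: int    # genomic coordinate of the integration site
    strand: str      # orientation of the provirus, "+" or "-"
    breakpoint: int  # coordinate where random shearing cut the host DNA


Site = tuple[str, int, str]


def distinct_fragments_per_site(reads: Iterable[AlignedRead]) -> dict[Site, int]:
    """Count distinct shear breakpoints observed for each integration site."""
    breakpoints: dict[Site, set[int]] = defaultdict(set)
    for read in reads:
        site = (read.chrom, read.position, read.strand)
        breakpoints[site].add(read.breakpoint)
    return {site: len(points) for site, points in breakpoints.items()}


# Hypothetical reads: the first site is seen three times with one breakpoint
# (PCR duplicates); the second is seen with three different breakpoints.
example_reads = [
    AlignedRead("chr17", 100, "+", 250),
    AlignedRead("chr17", 100, "+", 250),
    AlignedRead("chr17", 100, "+", 250),
    AlignedRead("chr7", 500, "-", 610),
    AlignedRead("chr7", 500, "-", 655),
    AlignedRead("chr7", 500, "-", 655),
    AlignedRead("chr7", 500, "-", 702),
]

counts = distinct_fragments_per_site(example_reads)
print(counts)
```

The count per site is a lower bound on the number of cells sharing that integration site, because two cells can by chance be sheared at the same coordinate.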
The link between the integration site and HIV latency is considered to be at the transcriptional level, resulting in silencing via epigenetic regulation of the chromatin environment or transcriptional interference [25,49,50]. Spatial features of the integration site associated with transcriptional latency have been shown to favor (i) heterochromatin (centromeric alphoid repeats) and intergenic regions [51,52], (ii) flanking cellular genes that are highly expressed [51,53], and (iii) the same orientation as cellular genes [54]. In contrast, Dahabieh et al. suggested that NF-kB activity and cell activation state correlate with viral gene expression efficiency rather than genomic integration features [55].
The role of integration site location in HIV transcriptional activity has recently been assessed in five models of HIV latency, identifying features specific for each model [42••]. Viral silencing was stronger if the host gene expression was high in the Jurkat latent model and in active CD4+ T cells, while the contrary was observed in a central memory CD4+ T cell model. The same-orientation integration bias in the Bcl-2 latency model was confirmed but was not identified in the other latency models [54]. Integration in intergenic regions was favored in active or resting CD4+ cell models only. In contrast, integration in alphoid repeats was preferentially identified in the Jurkat latency model, confirming a previous study, and in resting and central memory cell models [51]. Finally, only the H4K12ac acetylation epigenetic mark correlated with latency in active, resting, and central memory CD4+ cell models but not in the Bcl-2 and Jurkat cell line models. This study highlighted characteristics specific for each model but failed to identify common features of integration site selection specific to HIV latency across all latency models, leaving open the debate about the contribution of integration site location to transcriptional silencing [42••]. Single-cell analysis may facilitate understanding of the link between integration site location and viral transcription.

Transcriptome and HIV Latency
Transcriptome analyses capture a snapshot of RNA molecules within the cell at a specific time. The mRNA molecules will be translated, shaping the proteome content of the cell and therefore its functionality. Although the transcriptome only partially correlates with the proteome, it still provides a valuable approximation of the cell condition at a specific time [56-58]. The amount of a specific mRNA species results from a balance between the production rate and the decay rate. The former is determined by an interplay of regulatory processes acting at transcription initiation, elongation, and termination steps, and the latter is influenced by proteins binding to the 3′ untranslated region (3′ UTR) and the activity of regulatory RNAs such as microRNAs and noncoding RNAs [59-68]. External stimuli will affect this balance, tipping it towards one direction or another [69,70]. An additional layer of complexity resides in alternative splicing, for both HIV and host cells [59, 71-73].
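The balance between production and decay can be written down under a common simplifying assumption that is not taken from the cited studies: a constant production rate $k_{\mathrm{prod}}$ and first-order decay with rate constant $k_{\mathrm{dec}}$ for an mRNA species of abundance $m$. The symbols are introduced here for illustration only.

```latex
\frac{dm}{dt} = k_{\mathrm{prod}} - k_{\mathrm{dec}}\, m
\quad\Longrightarrow\quad
m^{*} = \frac{k_{\mathrm{prod}}}{k_{\mathrm{dec}}}
\quad \text{at steady state } \left(\frac{dm}{dt} = 0\right)
```

In this simplified picture, a twofold rise in abundance can reflect either a doubling of production or a halving of decay, which is why a transcriptome snapshot alone does not identify the regulatory step involved.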
Transcriptional silencing of HIV has been considered a major cause of latency [25,49,50]. The majority of studies investigating transcriptional silencing have focused on mechanisms acting on the viral LTR promoter, highlighting the role of the viral Tat protein, cellular proteins, histone posttranslational modifications, and chromatin remodeling [25,49,50,74-77]. However, only a few studies have focused on the transcriptome analysis of latently infected cells. Evans et al. developed a new primary model of HIV latency, which consists of co-culturing resting CD4+ T cells with myeloid dendritic cells (mDC) for 24 h, followed by CCR5-tropic HIV-GFP infection [78]. Latently infected cells were defined as nonproliferating CD4+ T cells not expressing viral-encoded GFP and were thus sorted by fluorescence-activated cell sorting (FACS) as SNARF-1 high and GFP− cells 5 days post-infection. This model provided evidence that direct contact of CD4+ T cells with mDC, but not with plasmacytoid DC (pDC), facilitates establishment of HIV latency. The study used microarrays to compare latently infected CD4+ T cells versus mock-infected CD4+ T cells co-cultured with mDC and identified only a limited number of genes, mostly type I interferon-regulated genes (IFIT1, IFI27, and OAS1), as differentially expressed. These genes may represent latency biomarkers but may also originate from noninfected exposed cells.
Gene arrays are being replaced by high-throughput sequencing technologies, as these do not require prior knowledge about gene sequences and provide a higher accuracy. Current methods differ in the library preparation and in the targeted RNA species. For example, total RNA sequencing (including small RNAs) allows capturing a complete picture of RNA within a cell, coding or noncoding [79], but requires a high coverage and does not discriminate between nuclear immature transcripts and cytoplasmic mature transcripts. Messenger RNA-Seq (mRNA-Seq) is based on poly(A) capture, thereby requiring less sequencing depth, and will reflect more faithfully the coding transcriptome, although it will not account for coding transcripts that are not poly-adenylated [80,81]. Recent improvements in library preparation protocols include determination of RNA strand specificity as well as of splicing variants of the coding transcripts. The analysis of these sequences requires computational tools to align the transcripts to the genome and identify transcript isoforms.
The analysis by Evans et al. highlighted the importance of analyzing purified populations of latently infected resting CD4+ T cells in order to identify specific latency biomarkers [78]. Because it is difficult to isolate an uncontaminated population of latently infected cells, only a limited number of studies have investigated the cellular transcriptome of HIV latency in primary models. Mohammadi et al. used a primary CD4+ T cell model infected with an attenuated viral construct. Successfully infected cells were isolated and purified by FACS and allowed to revert to a resting, hence latent, state by co-culture with H80 feeder cells [32••]. They analyzed the transcriptome of the infected cells over time by mRNA-Seq, thereby following the establishment of latency. Cellular activation or resting state was shown to be a major driver of transcriptional differences between cells. Except for the persistent detection of viral transcripts, only a very limited number of host genes were found to be differentially expressed between noninfected and latently infected cells, suggesting that HIV had a very minor impact on the transcriptome of the resting cell. Latently infected cells were also exposed for 8 and 24 h to diverse stimuli to reactivate latency, including SAHA (vorinostat), disulfiram, and IL-7. Although some of these molecules affected viral transcription, they failed to stimulate viral protein expression, underscoring the possibility that post-transcriptional blocks also contribute to the latent phenotype. This study indicated that HIV latency reflects transcriptional and post-transcriptional blocks, including nuclear export and translation [26••, 32••, 82]. The multiplicity of mechanisms leading to HIV latency was also illustrated recently by the comparison of several latency-reactivating agents and their different efficiencies across latency models [26••].

Proteome and HIV Latency
The flow of genetic information in the cell reaches a key level with the translation of messenger RNA into functional protein. Each protein is modulated through tissue-specific dynamic processes such as alternative splicing, post-translational modifications, interactions with other molecules, and formation of protein complexes, which together account for the multidimensional, complex nature of the proteome. A recently published blueprint of the human proteome reported approximately 3×10^6 unique peptides covering 84 % of the annotated coding genome [83]. Analyzing the proteins of a biological system gives the opportunity to understand and reveal the intricacies of many cellular functional pathways that are inaccessible through genomic or transcriptomic studies.
During HIV infection, as in the case of any viral infection, the proteome of the infected host cell constitutes a key layer where a plethora of responses and interactions with the viral components take place [84,85]. HIV orchestrates various cellular processes by interacting with and manipulating the functional blocks of the host cell in order to promote viral replication and ensure its survival [86]. On the host side, various studies have identified changes in post-translational modifications, localization, and amount of proteins upon infection by HIV [87,88]. In addition to taking advantage of the host cellular proteome, HIV brings along an arsenal of 15 viral proteins that interfere with the host cell, establish interactions, and carry diverse post-translational modifications in order to ensure key steps of the virus life cycle.
Recent technological advances and the advent of mass spectrometry (MS)-based proteomics have made a strong positive impact on virology and have helped uncover and characterize many proteomic responses triggered by viral infection [89,90]. A standard MS experiment consists of a succession of steps: fragmentation of sample proteins into smaller peptides, separation, ion detection, and relative or absolute protein quantitation. The general pipeline can be adapted to address more specific questions such as studying alterations in the phosphoproteome [91]. Online databases and search algorithms are used to map the obtained peptides to putative proteins. The resulting proteomic MS data have additional complexity since, often, more than one candidate protein is associated with the measurements from a set of peptides. A necessary step in MS data pre-processing is to filter out proteins with few, uncertain peptide measurements. MS experiments yield relative or absolute protein measurements between two different conditions (e.g., HIV-infected and mock-infected cells), which need to be normalized.
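The two pre-processing steps mentioned above, filtering and normalization, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the two-peptide threshold, the record layout, and the choice of median-centering (which presumes that most proteins do not change between conditions) are all assumptions, and the protein names and intensities are invented.

```python
import math
from statistics import median


def normalized_log2_ratios(proteins: dict[str, dict], min_peptides: int = 2) -> dict[str, float]:
    """Drop weakly supported proteins, then median-center log2(infected/mock)."""
    log_ratios = {
        name: math.log2(record["infected"] / record["mock"])
        for name, record in proteins.items()
        # Keep proteins backed by enough peptides and with usable intensities.
        if record["n_peptides"] >= min_peptides
        and record["infected"] > 0
        and record["mock"] > 0
    }
    if not log_ratios:
        return {}
    center = median(log_ratios.values())
    return {name: value - center for name, value in log_ratios.items()}


# Hypothetical summed intensities per protein in the two conditions.
example = {
    "protein_A": {"n_peptides": 5, "infected": 200.0, "mock": 100.0},
    "protein_B": {"n_peptides": 3, "infected": 100.0, "mock": 100.0},
    "protein_C": {"n_peptides": 4, "infected": 400.0, "mock": 100.0},
    "protein_D": {"n_peptides": 1, "infected": 900.0, "mock": 100.0},  # filtered out
}

ratios = normalized_log2_ratios(example)
print(ratios)
```

Here `protein_D` is discarded because a single peptide supports it, and the remaining log2 ratios are shifted so that their median is zero.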
Jager et al. used affinity purification coupled with MS (AP-MS) to construct a map of HIV-host interactions at the proteome level. The authors devised a scoring system coined MiST, which incorporates the reproducibility, abundance, and specificity of a protein to filter out nonspecific interactions. They reported on 497 high-confidence interactions, mainly enriched in transcription and regulation of viral protein posttranslational modifications [92]. Another study employed isobaric tags for relative and absolute quantitation (iTRAQ) coupled with tandem MS to determine the host proteome response to HIV infection in a time-dependent manner [93]. Differentially expressed proteins were identified between HIV- and mock-infected samples at three different time points post-infection. Interestingly, the groups of differentially expressed proteins differed between early and late time points, capturing an early response enriched for cell proliferation, protein synthesis, DNA recombination, repair, and maintenance versus a late reaction linked to T cell activation [93]. Stable isotope labeling by amino acids in cell culture (SILAC) profiling of phosphopeptides during HIV entry in the host cell has revealed responses to HIV infection at the level of the host phosphoproteome, notably suggesting that HIV affects the phosphorylation of serine-rich host proteins to ease its replication and subsequent release [94•].
A series of host transcription factors, such as NF-kB, NFAT, or Sp1, bind to the HIV promoter, thereby modulating viral transcriptional activity in infected CD4+ T cells. Their nuclear availability varies between active and resting, latently infected cells [25,95,96]. Furthermore, reversible posttranslational modifications of histones, such as deacetylation, have been found to play an important role in the establishment and maintenance of latency by modeling the neighboring chromatin organization at the viral LTR promoter [25,27,49,50]. At the membrane level, 17 proteins involved in cell survival pathways or in the regulation of cellular adherence and transfer were found to be disrupted in latently infected cells [97]. The same study also showed that targeted treatment with inhibitors of several of these proteins, precisely XIAP and BTK, influenced the viability of latently infected cells [97]. Understanding the molecular pathways contributing to HIV latency at the proteome level may help in designing novel treatments to eliminate the quiescent virus.
Little is known about HIV-host interactions and the host proteome response during latency and reactivation. To date, no study has compared the whole proteomes of latent and activated HIV-infected cells. Glycoproteomic tandem MS has been employed to profile the disruption in glycoprotein secretion from latently HIV-infected T cells and to identify aberrantly secreted proteins in plasma from infected patients. Six glycoproteins, L-selectin, neogenin, galectin-3-binding protein, CECR1, ICOS, and the phospholipid transfer protein, were found to be secreted in excess by latently infected T cells and elevated in patient plasma [98]. Although the mechanisms through which these proteins contribute to HIV latency remain elusive, this study represents a starting point to explore how latency alters protein secretion of the host cell. Another recent study focused on characterizing histone post-translational modifications in SupT1 cell lines infected with HIV, UV-inactivated HIV, and mock-infected in order to find a link to the transcriptional behavior of the cell. Using nano-LC-MS/MS, a set of histone modifications was detected in HIV- versus mock-infected cells, providing insight on how the virus modulates chromatin accessibility to ensure its transcription during infection [99]. This study illustrates the epigenetic mechanisms used to manipulate chromatin to ensure transcriptional silencing during latency. Improvements in measurement techniques and joint analyses of all post-translational modifications should help in assembling a complete view of the whole cellular proteome, thereby informing on the identity and the role of specific modified proteins in the process of latency.

Conclusions
Unraveling the different aspects of latency will be a major step towards efficient HIV treatment and eradication. However, HIV latency and its reactivation have proven to be complex processes with many confounding factors. Genome-wide data provide a snapshot of the latently infected cell, at different molecular levels, i.e., integration site, transcriptome, and proteome. However, several aspects may influence the analysis: (i) The resting status of the cell limits the possibility of identifying unique markers of latency, (ii) the presence of a large number of infected cells carrying defective viral genome sequences obscures the features of the latent viral reservoir that can be reactivated with the goal of cure, and (iii) the analysis of high-throughput genome-wide data is particularly challenging due to the large amount of data as well as systematic technological and biological noise that may lead to high false-positive rates. Adequate bioinformatic tools are required to account for these effects in order to process and analyze the data. In-depth and dynamic genome-wide studies can provide essential contributions to the understanding of establishment of latency during resting condition, maintenance of latency, and the steps towards full reactivation.
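One widely used way to limit false positives when thousands of genes or proteins are tested at once is to control the false discovery rate. The sketch below implements the Benjamini-Hochberg adjustment in plain Python as an illustration. It is not a method attributed to any study discussed in this review, and the p-values are invented.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from four independent tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Features whose adjusted value falls below a chosen threshold (0.05 in this example) are reported, with the expectation that at most that fraction of the reported features are false positives.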

References
Papers of particular interest, published recently, have been highlighted as: • Of importance •• Of major importance