Vertebrate genomes contain retroviral sequences that are believed to be remnants of exogenous retroviral infection from the distant past [1]. The genesis of these endogenous retroviruses (ERVs) necessitates establishment of provirus by the ancestral, exogenous retrovirus within host germ cells, such that these elements are maintained as heritable genetic elements in the host species. Human endogenous retroviruses (HERVs) are not uncommon, and account for at least 8% of the total human DNA. By comparison with DNA sequences from exogenous retrovirus families, three main classes of HERVs, gamma, beta, and delta have been identified [2]. Phylogenetic analyses identified HERV-K, a betaretrovirus, and HERV-H, a gammaretrovirus, as the most recent entries into the genomes of primates, 10 and 25 million years ago, respectively [3, 4].

Though some HERVs are transcriptionally active, most retroviral sequences in the human genome are corrupted by mutations or successive insertion of transposable elements; most HERVs lack intact open reading frames for viral protein production and no autonomously replicating HERV has been identified. As a result, HERVs are generally considered non-functional, junk DNA. In some cases, though, HERVs make significant - even essential - contributions to the normal physiology of the host species [58]. HERVs have also been implicated in the development of pathological conditions [914]. Interestingly, HERVs can be activated by exogenous retrovirus infection [15] and, in reciprocal fashion, the immune response to these elements can influence the outcome of infection by pathogenic, exogenous retroviruses such as HIV-1 [16, 17].

Exogenous retroviruses integrate into locations throughout the host chromosomal DNA in a quasi-random fashion (reviewed extensively in [18]). Previously, we developed statistical tools to identify association between the sites of provirus establishment and chromosome marks – as determined by chromatin immuno-precipitation with massively parallel DNA sequencing (ChIPSeq) [18, 19]. From this analysis we identified chromatin modifications (H3K4me3, H3K4me1, and H3K9ac) and binding sites for transcription factors (STAT1) that predict the site of integration for gamma-retroviruses with great accuracy in a cell-type specific manner [18, 19]. Precise markers such as these were not identified for other classes of retrovirus such as the lentivirus HIV-1 [18]. Here, the same statistical methods were exploited to determine whether any ERVs retain association with cell type-specific chromatin features that might have determined the site of integration in the distant past, or that served to fix the provirus in the genetic patrimony of the host species.


Search for chromatin features near the site of ERVs

Previously observed, high-level association between gamma-retrovirus integration sites and particular epigenetic markers [18] prompted a quest to find association between any known endogenous retroviral element in the human and mouse genomes, and the cell-type specific localization of particular chromosomal features. For this purpose, ChIPSeq profiles were evaluated from more than 40 different human and mouse cell types, including ES cells, iPS cells, monocytes, HeLa cells, CD4+ T cells, and CD34+ hematopoietic cells (Table 1). H3K4me3 was used in the initial analysis because of the availability of ChIPSeq datasets from a large number of cell types for this marker. The ERV dataset was compiled using all endogenous, LTR-containing elements annotated via RepeatMasker on the human reference genome hg18 (UCSC) or the mouse reference genome mm9 (UCSC).

Table 1 Data used for analysis of endogenous retroviruses

Figure 1 shows this analysis schematically. The block Associator is a computational module fed by ChIPSeq profiles and LTR loci. Among all LTRs, only HERV-H showed a significant association with H3K4me3, though this association was only with some human cell types; the location of the endogenous gamma-retrovirus was associated with H3K4me3 profiles in ES cells (F-score >0.8; p < 10-300, by Fisher exact test) and, to a lesser extent, in iPS cells (F-score >0.7; p < 10-100). F-scores > 0.5 are considered significant with 1.0 maximal [18], so these values are highly significant. In contrast, no retroviral element in the mouse was found to be associated with H3K4me3.

Figure 1
figure 1

Schematic showing the strategy for detecting association between endogenous retro elements and the H3K4me3 marker. H3K4me3 CHiPSeq data for more than 40 different cell types were checked for association within 2 kB of all annotated endogenous retroviral elements (hg18) using methods described in reference [18]. Only HERV-H proviruses showed association with H3K4me3, and this was only significant in human ES cell lines. No association between H3K4me3 and endogenous retro-elements was detected in the mouse.

The data were assessed by means of Hierarchical Clustering. Each association profile was calculated as a function of the distance between HERV-H and the nearest H3K4me3 marker, discretized with steps of 0.5 kB. The specific cells are listed on the x axis of Figure 2, and the window size in kilobases is shown on the y-axis. The Euclidean distance between these profiles was used to discriminate among clusters. The algorithm identified three main clusters of high, medium and low association (represented in red, green and blue in Figure 3). The cluster with the highest association (red) was populated by ES cells (H1, H9 and I3) and iPS cells (iPS-15b and iPS-11a), with a mean F score of 0.74. The cluster with medium association (green) was populated by bone marrow mesenchymal stem cells (DMSC), breast cancer cells (vHMEC), fetal lung cells and fetal brain cells; it had F scores that were just barely significant. Some iPS cells (A6, PDB1lox, 18c, C1) fall within this cluster, thus confirming that many iPS cells have epigenetic profiles distinct from those of ES cells [29]. The blue cluster consisted of differentiated cells, including CD4+ T cells, fibroblasts, pancreatic islet cells, and HeLa cells; the mean F score of 0.36 was insignificant. The association of HERV-H with the post-translational histone modification H3K4me3 correlated strongly with the degree of cell differentiation, a relationship that was clearly visible using chromosome projection mandalas (Figure 4).

Figure 2
figure 2

H3K4me3 association with HERV-H proviruses, as measured with the F score, in 40 different cell types, as a function of the distance from the marker. The F score is reported on the z-axis. Cell types are listed on the x-axis, while the window size is reported on the y-axis. Yellow to red bands indicate a significant association (F score >0.5). Significance decreases as the color shifts from red to blue. 12 out of 15 of the cell types with significant F scores are either human ES cells or iPS cells.

Figure 3
figure 3

Hierarchical cluster analysis of the HERV-H/H3K4me3 F score, as a function of the window size (horizontal axis). F score is represented using the color scale as in Figure 2. The highly associated cluster (red tree) was made of human ES and iPS cells. The cluster with medium association (green) consisted of mesenchymal stem cells (DMSC), fetal lung and brain cells, and some iPS cells, with F scores that were barely significant. The blue cluster consisted of differentiated cells and the mean F score of 0.36 was insignificant.

Figure 4
figure 4

Chromosome projection mandalas showing the proximity of each HERV-H provirus to the nearest site of H3K4Me3 on the chromosome, in human I3 ES cells, iPS-15b cells, and HeLa cells. Each dot on the mandala indicates an HERV-H provirus, as described in reference [18]. The angular distance around the mandala indicates the linear position of each provirus on the indicated chromosome. The radial distance from the perimeter indicates the distance of the provirus from the nearest H3K4Me3 site, in log scale from 0 to 1 megabase. Blue dots are HERV-H proviruses within 2 kB from the nearest marker. Red dots are proviruses >2 kB away from the nearest H3K4Me3 site. The association strength (F score) is written under each Mandala. F score > 0.5 constitutes a significant association.

HERV-H expression in ES cells

Given the remarkable, pluripotent stem cell-specific association of HERV-H with H3K4me3, a marker for transcriptionally active promoters [30, 31], HERV-H expression would be expected to be higher in ES and iPS cells than in differentiated cell types. Consistent with this possibility, HERV-H was not associated with H3K27me3 (F score 0.2), a marker for transcriptional repression in ES cells [32]. ENCODE Project RNA-seq data sets [33] of paired 75 nucleotide reads from human ES cells and 6 differentiated cell types were assessed for HERV-H RNA peaks. In H1 human ES cells, HERV-H RNA accounted for 2% of the total RNA. This extraordinary level of HERV-H expression was confirmed with RNA-seq data [34] from H9, another ES cell line. For comparison, expression of the younger and better-conserved beta-retrovirus HERV-K was 1000-fold lower than HERV-H in H1 ES cells. HERV-H RNA was 100-lower in HeLa cells and more than 100-fold lower in K562 myelogenous leukemia cells, GSM12878 lymphoblastoid cells, HepG2 hepatocellular liver carcinoma cells, human umbilical vein endothelial cells (HUVEC), and NHEK epidermal keratinocytes. BRD2, a gene that is expressed at nearly the same level in all these cell types was 25-fold lower in expression than HERV-H in ES cells (Figure 5).

Figure 5
figure 5

Cumulative expression of all HERV-H proviruses in human H1 ES cells (hESC), HeLa cells, or K562 cells, compared to expression of HERV-K and BRD2, a constitutive gene with the same expression level in all three cell types. In human ES cells, HERV-H is expressed 1000-fold higher than HERV-K and 25-fold higher than BRD2. HERV-H expression is barely detectable in HeLa, and no significant HERV-H expression was detected in K562 cells. RNASeq data for this analysis were from reference [33].

Almost all the HERV-H RNA expressed in ES cells had the structure 5LTR-gag-pro-3LTR with deletion of the pol and env regions, and no intact open reading frames. HERV-H fragments containing pol and env sequences were barely detected (Figure 6).

Figure 6
figure 6

Mapping of RNA-seq reads from H1 human ES cells on a schematic of the HERV-H provirus. The quantity of each RNA read was normalized to the reads corresponding to the 5 LTR. Only RNA fragments corresponding to 5LTR-gag-pro-3LTR were expressed to a significant level in human ES cells.

It has been reported that human ES cell DNA is hypo-methylated with respect to differentiated cell lines, and that this global effect releases all endogenous elements, such as SINEs, LINEs, and HERVs, from transcriptional silencing [35]. To determine if HERV-H expression in ES cells is simply a result of this global trend, the RNA-seq data was used to compare the expression level of all repetitive elements in ES cells. Compared to LINEs and SINEs, HERV-H was by far the most expressed repetitive element in human ES cells, accounting for nearly all transcription of the HERV family (Figure 7).

Figure 7
figure 7

HERV-H expression accounts for nearly all HERV expression in human ES cells and is a not a non-specific consequence of wide-spread hypo-methylation in these cells. Quantitation of the RNA-seq reads from H1 ES cells, broken down according to the LINES, SINES, all HERVs, the nearly 1,000 HERV-H proviruses, and conventional genes. HERV-H RNA accounted for nearly all the HERV RNA in human ES cells, and 2% of total RNA. Specific vs. non-specific expression was determined by comparing the expression level of each element to the surrounding sequences.

It is also important to consider that expression of endogenous retro-elements might be influenced by local effects from adjacent transcriptional units. The difference between the expression level of the repetitive elements and the expression level of the surrounding sequences was assessed (see Methods for details). By this measure, HERV-H exhibited transcriptional specificity comparable to that of conventional genes, in terms of specific/unspecific transcription ratio (0.85 vs. 0.86), while LINEs and SINEs had a transcription ratio of 0.06 and 0.03, respectively (Figure 7).

HERV-H expression over the course of human ES cell differentiation

The disparity between the high transcriptional level of HERV-H in human ES cells, and the near absence of HERV-H transcription in differentiated cells prompted an assessment of HERV-H expression as ES cells differentiate. To accomplish this, raw data was analyzed from a 4-point time course experiment ( in which RNA was collected from human ES cells in the undifferentiated state (N0), early initiation (N1), neural progenitor (N2), and pre-glial cell (N3) stages [26]. High HERV-H transcriptional levels were confirmed in the undifferentiated N0 stage, and HERV-H levels decreased progressively with a magnitude that correlated with the differentiation time-point (Figure 8).

Figure 8
figure 8

HERV-H expression correlates with differentiation status. HERV-H expression as H1 human ES cells differentiate down a pathway towards neural progenitors and early glial cells. Black bars indicate unspecific expression; white bars represent specific expression, adjusted for expression as described in Figure 7. BRD2 has the same expression level at each stage of differentiation and was used to normalize HERV-H RNA levels.

HERV-H and pluripotency transcription factors

When ectopically expressed in particular combinations, the transcription factors NANOG, OCT4, and SOX2 are capable of reprogramming mature somatic cells into pluripotent stem cells [3639]. Conversely, the expression of these factors decreases as cells differentiate [40]. The RNA-seq data sets utilized above that measure the dynamics of RNA in ES cells as they differentiate into neural progenitors was analyzed for expression of HERV-H and for these transcription factors. HERV-H RNA levels correlated well with those of NANOG and OCT4 (Figure 9). SOX2 was more stably expressed during differentiation than was NANOG or OCT4 and its expression did not correlate with that of HERV-H.

Figure 9
figure 9

HERV-H expression levels correlate with those of pluripotency transcription factors NANOG and OCT4 as human ES cells move down a differentiation pathway. N0, undifferentiated ES cells. N1, early initiation stage of differentiation. N2, neural progenitor stage. OCT4 and NANOG are positively correlated with HERV-H (ρ = 0.95, ρ = 0.84, respectively). SOX2 shows no correlation.

To explore the possibility that these pluripotency transcription factors are somehow involved in HERV-H transcription, an association analysis was conducted between HERV-H locations and the ChIPSeq binding profiles of NANOG, OCT4, SOX2, and KLF4 extracted from ES cells [24, 25] (Figure 10). Over a 2 kB window, NANOG showed the most significant association (F score 0.75, p < 10-180). OCT4 weakly associated with HERV-H (F score 0.55, p < 0.0001), while SOX2 and KLF4 were not associated (F score 0.46 and 0.24, respectively).

Figure 10
figure 10

Modified chromosome projection mandalas depicting NANOG, OCT4, SOX2 and KLF4 associations with HERV-H proviruses. Mandalas were generated as described in reference [18] and Figure 4 above. The green ring indicates the HERV-H 5 LTR region (500 nucleotides from the transcriptional start site). Dot size is proportional to the expression level of each single provirus. NANOG is bound to the 5LTR of almost all highly expressed retroviruses. OCT4 shows a significant association. SOX2 binds all expressed HERV-H in a region between 1 KB and 2 KB while KLF4 does not show any significant association pattern.

Statistical significance increased considerably when the expression level of individual HERV-H proviruses was monitored (Figure 11A and B). As visualized on the chromosome projection mandala in Figure 11C, NANOG binds to the LTRs of 96% of the 50 most highly expressed HERV-H proviruses. These 50 proviruses account for 80% of all HERV-H RNA. The LTR region corresponding to the first 0.5 kB of HERV-H proviral sequence is highlighted in green, with dot sizes proportional to expression levels. With a window of 0.5 kB, the F score for NANOG was nearly perfect (0.97, p < 10-300). Limited to this subset of HERV-H proviruses the F scores for OCT4 (0.90 wi2kB, p < 10-300) and SOX2 (0.85 wi4kB; p < 10-300) were also tightly linked to HERV-H. KLF4, in contrast, was not associated with HERV-H (F score 0.34). The average distance of the NANOG, OCT4 and SOX2 ChIPSeq peaks from the HERV-H transcription start sites was 250 basepairs, 1 kB and 2 kB, respectively, as if these factors were uniformly distributed along the HERV-H promoter in human ES cells (Figure 11B). Indeed, for 80% of the highly expressed HERV-H proviruses NANOG, OCT4 and SOX2 map in this order.

Figure 11
figure 11

Ordered spacing of pluripotency transcription factors, binding to the HERV-H 5 LTR in human ES cells. (A) Association strength (F score) of NANOG, OCT4 and SOX2 binding sites with the 50 most highly expressed HERV-H proviruses (accounting for 80% of total HERV-H expression), as a function of distance from the HERV-H transcription start site (TSS). Maxima in F score indicate the distance of greatest association. (B) Average distance of NANOG (red), OCT4 (blue) and SOX2 (green) to HERV-H TSS is shown schematically. As expected from a uniform distribution model, the average distance is half of the distance between maximal association and TSS. (C) Chromosome Projection Mandala combining NANOG, OCT4 and SOX2 with respect to 50 HERV-H proviruses. The three embryonic transcription factors bind with the same order (NANOG-OCT4-SOX2) to the promoter region of the most expressed HERV-Hs.

Association of LINE elements with NANOG in human ES cells

As a matter of comparison, a chromosome projection mandala was created in which NANOG binding sites in ES cells were compared with the location of the LINEs (Figure 12). Only one LINE, located on chromosome 13 between nucleotides 55,048,158 and 55,050,476, was expressed to an appreciable level. Interestingly, a deeper inspection of this region revealed a particular genomic structure where LINE transcription was driven by an adjacent HERV-H, bounded by NANOG binding sites at both LTRs. This element does not have an intact ORF, and its functional role is unclear. It is also noteworthy that the adjacent HERV-L is not expressed, even if it is immediately adjacent to the transcribed region, confirming that HERV-H expression is not a non-specific consequence of hypomethylation in ES cells.

Figure 12
figure 12

(A) Chromosome projection mandala depicting the association between NANOG and expressed LINEs in human ES cells. Only one LINE (large blue dot on chromosome 13) was expressed to high level in these cells. (B) The genomic region around the LINE on chromosome 13 is shown with the UCSC Genome Browser, where the linear chromosome is mapped on the horizontal axis. The position of the adjacent HERV-H and HERV-L proviruses is shown. Four biological replicates confirm HERV-H and LINE expression in human ES cells while no expression was detected in K562 cells (only one of the 4 replicates is shown). The direction of transcription was determined by strand-specific sequencing [33]. NANOG bound to both LTRs (represented with black squares along the LTR row) in the adjacent HERV-H. The adjacent HERV-L was not transcribed.


This study discovered a strong association between the genomic location of HERV-H proviruses and H3K4me3-modified histones, that is exclusive to human ES and some iPS cells. Consistent with H3K4me3 serving as a marker for active transcription, HERV-H expression was high in these pluripotent stem cells. Moreover, the pluripotency transcription factors NANOG, OCT4 and SOX2 bound to the LTRs of transcriptionally active HERV-H proviruses, or within 2 kB of them. NANOG, OCT4 and SOX2 frequently co-occupy the promoters of target genes, many of which are transcription factors that regulate development such as homeodomain proteins [41].

These observations strongly support the hypothesis that HERV-H transcripts play a role in human pluripotency and that this role is finely regulated by three of the most important transcription factors in ES cells. In addition to the binding of NANOG, OCT4, and SOX2 to the HERV-H promoter, HERV-H RNA decreased as ES cells differentiated, in a manner that was proportional to the expression of NANOG and OCT4. Conversely, HERV-H RNA was undetectable in primary fibroblasts but increased enormously after forced re-programming to generate pluripotent stem cells (unpublished data provided by Audrey Letourneau and Stylianos Antonarakis). HERV-H, then, can be exploited as a reliable marker of ES cell pluripotency, as well as an indicator of the degree of “stemness” of iPS cells as they are generated from fibroblasts.

HERV-H transcripts are 5 to 6 kB in length and lack open reading frames. We can only speculate about the function of these lncRNAs. They might, for example, serve to soak-up miRNAs that promote differentiation, as has been shown with linc-MD1 in muscle differentiation [42] or the PTENP1 pseudogene in the regulation of PTEN and growth suppression [43]. They might bind to chromatin and act as a scaffold for the local recruitment of pluripotency transcription factors, similar to other lncRNAs like HOTAIR for histone modification complexes [44] and Xist in the context of X-chromosome inactivation [4446]. Alternatively, HERV-H might counteract retrovirus spread by interfering with packaging of retroviral genomic RNA [47, 48] or by soaking up miRNAs that are required for retrovirus transduction.

The study here failed to identify chromatin markers that associate with endogenous retro-elements in mice. This was somewhat surprising given the many endogenous retro-elements in this species, including endogenous gamma-retroviruses, some of which are intact and functional [8]. It was also surprising because exogenous gamma-retroviruses have the same integration site preferences in mouse cells as they have in human cells [18]. MLV integration sites are associated with the H3K4me3 profile in mouse embryonic fibroblasts (F score = 0.83; p < 10-100). Similar results with murine hematopoietic stem cells (F score of 0.81; p < 10-100) indicate that, as in human cells, the association strength is cell-type dependent.

One possible explanation for the failure to identify chromatin markers associated with endogenous murine retroviruses is species-specific differences in the recruitment of the transcriptional silencing machinery. In murine ES cells, for example, a sequence-specific DNA-binding protein, ZNF809 [49], recruits TRIM28 and other components of the cellular machinery that silence MLV [50]. ZNF809 has no orthologue in humans; perhaps ZNF809 arose as a result of selective pressure exerted by murine specific gamma-retroviruses during evolution.

Previous work demonstrated that when exogenous retroviruses integrate they home to sites of H3K4me3 [18]. Similarly, the association of endogenous gamma-retrovirus HERV-H with H3K4me3 suggests that when human and simian germ cells were bombarded with the HERV-H ancestor 15 to 30 millions years ago, these ancient retroviruses integrated in proximity to H3K4me3-marked chromatin. These proviruses might then have retained these cell-type specific marks as they became fixed in the primate genome. Alternatively, unmethylated HERV-H LTRs might have recruited chromatin remodeling factors and induced H3K4me3 modification of the viral promoter after integration had occurred.

Analysis of the DNA surrounding HERV-H proviruses failed to clarify which of these two scenarios is more likely. Additionally, search for epigenetic markers like H3K4me3 in syntenic regions in the mouse genome was attempted to determine if these chromatin marks are conserved across the species and predate the entry of HERV-H into the primate genome. DNA surrounding HERV-H proviruses in the human genome was aligned to the mouse genome (using the tool LiftOver, after excision of all repetitive elements (performed with RepeatMasker, Nothing informative was found by measuring the association of mouse H3K4me3 with these syntenic regions and comparing these values with those obtained using control loci.

P300 and H3K27ac bind intra-species conserved regions and co-localize in embryonic-specific enhancers [34, 51]. These markers are also associated with HERV-H in human ES cells, lying within 4 kB of 80% of HERV-H proviruses (F score 0.95; p < 10-300). It seems unlikely that an exogenous retrovirus would be capable of recruiting these factors by exploiting random conserved regions around its integration site. This suggests that a pre-existing layer of epigenetic markers favored integration of HERV-H into particular host loci and that these features are still preserved millions of years later.


Among retroelements in the human genome, the endogenous gammaretrovirus HERV-H is extraordinary for its high level expression in embryonic stem cells, in which it makes up 2% of all polyadenylated RNA. The human genome has ~1,000 copies of HERV-H, and the majority of the HERV-H RNA is encoded by a subset of 50 of these. HERV-H expression decreases as ES cells lose pluripotency, to the point where its expression is undetectable in fibroblasts. Consistent with this expression pattern, HERV-H is also expressed to high level in many iPS cells, though expression in some iPS cells is more modest; this heterogeneity may reflect reported differences in the epigenetic profile of many iPS cells, when compared with ES cells [29]. This suggests, then, that HERV-H RNA offers a relatively stringent marker for human pluripotency that would be worth monitoring during the generation of new iPS lines. The HERV-H RNAs in ES cells average about 5 kB in length and encode no protein. It, therefore, seems likely that HERV-H RNA contributes to pluripotency by acting as a chromatin-associated structural element or by acting as a microRNA decoy.


Statistical analysis of association between chromatin markers and retro-elements

Association with a given marker was defined as the presence of the endogenous proviral DNA within a fixed distance (usually 2 kilobases) from the nearest marker on the linear sequence of the chromosome. Unlike exogenous retroviruses, the endogenous virus is already integrated. Therefore, to restore the conditions before integration, the distance from the j-esim marker M j with peaks in loci {mj0, …, m jN } and the i-esim provirus V i spanning along the loci {V i s, V i e} has been calculated as

d V i , M j = min V i s M j , V i e M j = min V i c M j V i e V i s 2

where V i c = V i s + V i e 2 is the central locus of the provirus.

As a control dataset, we randomly selected 100000 genomic locations. Association strength was measured with the statistical method based on the F score, as previously described in [18].

Formally the F β -score is defined as the β-weighted harmonic mean of Precision(P) and Sensitivity(R): F β 1 + β 2 PR β 2 P + R .

Here β = 0.5 to give more weight to Precision than to Sensitivity. This balances type I and type II errors by adjusting for the high rate of False Positives inherent to the examination of large datasets for genome-wide binding sites according to statistical significance (F score based statistics and comparison with other measures have been extensively discussed in [18]).

Markers with F scores ranging between 0.5 and 1 were considered to be associated with endogenous integration sites.

RNASeq data analysis

HERV-H is present in more than 1000 “imperfect” copies in the human genome and its transcripts share a number of short conserved regions (each around 100 bp). Therefore, deep sequencing of those transcripts yields reads (25-75 bp long sequences) which perfectly align to several genomic loci. Indeed, multireads mapping is still a challenging process [52]. The strategy adopted here was to perform the alignment of uniquely mapped single- and paired-end reads and to reassign the multiple-mapped reads in function of the expression level of the surrounding (context) region [26]. HERV-H expression was evaluated in term of “Reads Per Kilobase per Million mapped reads (RPKM)” with the standard formula E r = K N r L r N T [53] where N r is the number of reads mapping onto the r transcript, L r is the length (in kB) of the r esim transcript and N T is the total number of reads K = 106.

Alignments of RNASeq generated reads have been performed with a two-step procedure. First we used Bowtie[54] on raw data, admitting up to two mismatches for each alignment, and then we discriminated between unique mapping reads and multireads, 80% and 20% of the total number of reads respectively. As expected, many multireads matched with repeated elements. Ignoring them would have resulted in a potential underestimation of the expression of endogenous retroviruses and, in general, of all repetitive elements. Therefore, we adopted a probabilistic assignment based on the amount of reads that univocally map onto the surrounding regions (context), as described in more detail in the next paragraph. This evaluation is also useful to establish if a repeated element is expressly transcribed or if it is part of another structure (i.e. a gene).

Probabilistic alignment of multiple mapping reads

The uniquely mapping read was defined as a short sequence generated by high throughput sequencing that can be aligned to a single genomic region s. Accordingly, we defined the multiread as a sequence r that aligns to a set of M regions {S1,…,SM}. For each region s i the context region is c i  = C i /(C i s i ) being C i a genomic region of n nucleotides encompassing s i . The assumption that the amount of reads is proportional to the amount of actual mRNA implies that the set of multireads is distributed on the reference genome accordingly to the amount of uniquely mapping reads aligned to the context regions. Consider R D s as the function giving the number of reads of the dataset D that map univocally to the genomic region s described by the tuple (chr, start, end). Therefore, the probability of the read r to be actually part of the mRNA generated from the region s i is estimated as:

P D r s i = R D c i j M R D c j .

Eventually, the set of multireads mapping to the same M regions is then partitioned to {S1,…,SM} accordingly to P D (s i ).


The first axiom of high throughput sequencing asserts that the number of reads aligned to a specific genomic region is proportional to how much RNA has been generated by this region within the cell. At this point it is worth observing that repeated element (RE) sequences might be present in the RNA just because they are part of longer mRNAs.

Since we expect that elements having a specific biological function are independently transcribed, we attempted to distinguish between RE specifically expressed with their own promoter from those that are part of longer RNAs. The number of reads mapped to a region s can be naively modeled as a linear combination of specific reads T(s), unspecific reads U(s) and additional zero-mean noise σ2 that account for all other experimental and non-systematic fluctuations that can randomly influence the output of the sequencing process. Formally:

R s = U S + T s + σ 2

where R s , as before, is the function giving the number of reads assigned to the region s.

Therefore, the mean number of reads in the region s is:

E T s = R s U s .

In order to estimate U(s), it is possible to count the number of reads mapped to the context region c as previously shown. Therefore we set

U s = R c 2 ,

and we eventually adopt the following approximation to correct for the non-specificity of transcription:

E T s R s R c 2