Background

With the advent of sequencing projects, coding genes have been revealed to correspond to a tiny fraction of eukaryotic genomes. In the human genome, the protein-coding genes represent less than 2 % of the genome, whereas repeated sequences represent more than half of it [1]. While a large fraction of the non-coding sequences was first thought to bear no function [2], it is now known to be composed of a mixture of repetitive DNA and non-functional sequences interspersed with non-coding RNA genes and regions that are crucial for transcriptional and post-transcriptional regulation [3, 4]. A large part of repeated DNA is classified as transposable elements (TEs). TEs are middle-repeated DNA sequences that have the ability to move from one position to another along chromosomes [5, 6]. These mobile elements typically encode all the proteins necessary for their movement and possess internal regulatory regions, allowing for their independent expression. Globally, two main classes have been described according to their transposition intermediates. Retrotransposons use an RNA intermediate and form class I, composed of the LTR-retrotransposons (endogenous retrovirus-like elements bearing Long Terminal Repeat sequences at each extremity) and the non-LTR retrotransposons LINEs and SINEs (standing for Long and Short Interspersed Nuclear Elements, respectively), which are the most frequent in the human genome [2]. Transposons use a DNA intermediate and form class II. In the human genome, TE distribution appears to be linked to gene function. Indeed, Alu elements, a particular family of SINEs, were shown to be absent from the neighborhood of genes implicated in transcription and regulation [7].
Moreover, we have previously shown that TE content is associated with the function of neighboring genes: while TE-free genes are more frequently involved in development, transcription, and regulation of transcription, TE-rich genes are enriched for the functions of transport and metabolism [8].

Because of their presence in genomes, TEs have a significant impact on genome evolution by promoting various types of mutations [9, 10]. In particular, TEs possess their own regulatory sequences, and they could alter the normal expression pattern of neighboring genes when inserted in intergenic regions [11]. As an example, the MER20 element contributed to the origin of a novel gene regulatory network dedicated to pregnancy in placental mammals [12] and ERV1 elements have wired new genes into the core regulatory network of embryonic stem cells [13]. Moreover, the presence of SINEs affects the expression of neighboring genes in tumor tissue cells, with more gene deregulation associated with more SINEs in the gene vicinity [14]. In humans, 0.3 % of TE insertions have been suggested to cause disease, i.e. one insertion in every 20–100 live births [15], and approximately 96 new transposition events were directly linked to single-gene diseases [16]. Overall, the human genome harbors millions of TE insertions that could potentially affect its functioning under certain conditions. Because the effects associated with TE insertions can potentially be harmful for the host genome, TE activity needs to be regulated, a role that is partly undertaken by epigenetic mechanisms.

For the past few years, epigenetic modifications have been shown to contribute to gene expression regulation. For example, epigenetic changes can explain part of the variation in gene expression observed between tissues of a single organism [17–20], or the fate of honeybees by affecting the differentiation between the queen and the workers [21]. These examples are likely to represent only a tiny fraction of all the possible effects of epigenetic processes. Three main intertwined epigenetic mechanisms have been described so far: DNA methylation, RNA interference, and histone modifications. DNA methylation usually occurs in the context of CpG dinucleotides in animals and is associated with transcriptional silencing in vertebrates [22–25]. The RNA interference mechanism is characterized by the synthesis of small noncoding RNAs, which, when associated with a protein complex, can target messenger RNAs and trigger their degradation [26, 27]. Histone modifications correspond to post-translational biochemical changes occurring at particular amino acid residues of these proteins [23, 28, 29]. Depending on the type of histone modification, the effect can be either to compact or to relax the chromatin structure, both of which have a direct impact on gene accessibility for RNA polymerase and therefore on gene expression [19, 30]. Depending on the organism, the role of each epigenetic mechanism may be more or less predominant in gene regulation. For example, DNA methylation is implicated in a large number of cellular functions in mammals and in plants, while it is almost absent from Drosophila [22, 31]. In normal condition, depending on the residues and the histones, the hypermethylation of histones can be associated with methylated and repressed DNA sequences [32]. Therefore, one might expect that global alterations of histone modification patterns could disrupt gene expression. Numerous research studies have associated epigenetic changes with human diseases.
For instance, cancer cells harbor global epigenetic abnormalities that could have been the starting point of tumor development [33]. For example, CpG islands, unmethylated regions overlapping the majority of human gene promoters, become hypermethylated when associated with tumor-suppressor genes, leading to their transcriptional silencing while the whole genome undergoes a global hypomethylation in cancer condition [34, 35]. Specific histone modifications, and other epigenetic processes, have been shown to specifically target TEs (for reviews, see [36, 37]). While TEs are usually methylated (and therefore silenced) in normal human cells, TE methylation is abolished in cancer cells, allowing TEs to be activated and to affect the integrity of the cell [38, 39]. For example, specific endogenous retroviruses produce viral particles in human melanoma cells [40], and TE expression is enhanced in urothelial and renal carcinoma cells [41], in some carcinomas [42], in human leukemia [43, 44], and in human colorectal, ovarian and breast cancers [45–48]. These activations potentially result from different epigenetic modifications occurring in a cancer cell. Most studies concerning the epigenetic alterations occurring on TEs in a cancer environment have focused on DNA methylation (for a review see [49]). While only a few studies investigated TE histone modifications, a global loss of monoacetylation of lysine 16 and of trimethylation of lysine 20 on histone H4 has been found associated with repetitive elements [50]. Moreover, the spread of TE histone modifications to adjacent regions has been observed in plants, fungi, and mouse [51–54], suggesting that the presence of TEs may influence the epigenetic state of neighboring genes.
Among the different mechanisms that could explain the effects of epigenetic changes in a cancer cell, the implication of TE insertions, harmless in normal conditions but for which epigenetic changes could lead to a cascade of deregulation either causing or reinforcing the tumor status of a cell, still needs to be investigated.

Here, we first observed the variation of ten histone modifications and TE content of genes according to their genomic position in normal condition. We observed that genes are generally more enriched in activating modifications at all chromosome locations compared to repressive modifications. We then compared the histone modification landscapes of genes in normal and cancer blood cell lines, according to their TE neighborhood. Our results showed that the presence of TEs near human genes is associated with greater changes in histone enrichment. Finally, we could highlight that differentially expressed human genes harbored larger histone enrichment variation related to the presence of TEs. Taken together, these results suggest that the presence of TEs near genes could favor important variation in gene expression when the cell environment is modified in human.

Methods

Data acquisition

Gene locations were downloaded from the Biomart server using the Martview tool [55] (www.ensembl.org/biomart/martview/) for the latest version of the human genome (GRCh37.p10 = hg19). Out of a total of 62,380 genes in the human genome, we filtered for protein-coding genes located on the 22 autosomes and the two sex chromosomes, removing those located on the mitochondrial genome and unidentified chromosomes, and retrieved 19,071 genes. For each gene, Ensembl identification number, strand orientation, and localization (start and end positions on the chromosome) were collected.
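The filtering step described above can be sketched as follows. This is only an illustration: the record layout and the field names `biotype` and `chrom` are assumptions made for the example, not the column names of the BioMart export, and the gene identifiers are made up.

```python
# Chromosomes retained in the analysis: 22 autosomes plus X and Y.
KEPT_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y"}


def keep_gene(gene):
    """Return True for protein-coding genes on chromosomes 1-22, X or Y.

    `gene` is a dict with the hypothetical keys 'biotype' and 'chrom'.
    """
    return (gene["biotype"] == "protein_coding"
            and gene["chrom"] in KEPT_CHROMOSOMES)


# Toy records (made-up identifiers, for illustration only).
genes = [
    {"id": "geneA", "chrom": "1", "biotype": "protein_coding"},
    {"id": "geneB", "chrom": "MT", "biotype": "protein_coding"},  # mitochondrial
    {"id": "geneC", "chrom": "X", "biotype": "lincRNA"},          # not coding
    {"id": "geneD", "chrom": "Y", "biotype": "protein_coding"},
]
filtered = [g["id"] for g in genes if keep_gene(g)]
```

Genes on the mitochondrial genome or on unplaced scaffolds fail the chromosome test, and non-coding genes fail the biotype test.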

TE insertions in the human genome were previously identified using RepeatMasker [56], a program that determines the occurrences of sequences with homology to consensus TE sequences present in the Repbase database [57], and were retrieved from the website of the University of California, Santa Cruz (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). The RepeatMasker output files were parsed using the program “One code to find them all” [58] (with the --strict option) to assemble each TE copy and determine its localization.

Locations of histone modifications produced by ChIP-seq experiments were downloaded for the latest version of the human genome from the ENCODE Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/). They correspond to broader regions of enrichment (broadPeaks) [59]. These regions were retrieved for 10 histone modifications (H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, and H4K20me1) and for two different conditions: a lymphoblastoid cell line derived from normal peripheral blood lymphocytes of a female donor (GM12878, named “normal condition”) and a leukemic cell line derived from a female patient with chronic myeloid leukemia (K562, named “cancer condition”). The two replicates of expression data obtained by RNA-seq experiments were retrieved for the two different conditions (GM12878 and K562) from the ENCODE Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/).

Mean histone enrichment for each gene

To determine the mean histone enrichment of each gene for a given histone modification, we computed the average fold enrichment ε of the histone modifications for the positions covered by an entire gene, normalized by the gene size (E1). We chose not to focus only on the promoter region since it has been shown that some of the modifications can be enriched also along the transcribed region of a gene with very different levels of enrichment between active and inactive genes [60, 61].

$$ \varepsilon (h)=\frac{\sum {e}_i}{n\ast l} $$
(1)

with h the histone modification, n the number of values of fold enrichment of the histone modification h mapped within the gene, e_i the value of enrichment of the histone modification h at position i mapped within the gene, and l the length of the gene.
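As a minimal sketch of Eq. 1, the function below takes the fold-enrichment values mapped within one gene for one histone modification and returns ε(h). The function name, the input layout, and the choice to return 0 for a gene without any mapped value are assumptions made for the example.

```python
def mean_histone_enrichment(fold_enrichments, gene_length):
    """Eq. 1: sum of the e_i divided by (n * l).

    fold_enrichments: the n fold-enrichment values e_i mapped within the gene.
    gene_length: the gene length l, in base pairs.
    """
    n = len(fold_enrichments)
    if n == 0:
        # Assumption: a gene with no mapped value gets an enrichment of 0.
        return 0.0
    return sum(fold_enrichments) / (n * gene_length)


# Toy example: three mapped values on a 1,000 bp gene.
epsilon = mean_histone_enrichment([2.0, 4.0, 6.0], 1000)
```

Here the sum of the values is 12, n is 3 and l is 1,000, so ε equals 12 / 3,000.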

Computation of the density and coverage of TEs in the vicinity of genes

To estimate the amount of TEs within and around genes, we first used each TE position to allocate it to a gene vicinity, using a 2 kb-flanking region upstream of the gene, to include gene promoters, and downstream of the gene [8]. Then, for each gene, the density in TEs, reported as the number of insertions per base pair (E2), and the coverage in TEs, in percentage of the gene (E3), were computed in general for all TEs and for each TE type (DNA transposons, LTR-retrotransposons, LINEs, and SINEs).

$$ {D}_g=\frac{N}{L_g-{L}_{\mathrm{TE}}} $$
(2)
$$ {C}_g=\frac{L_{\mathrm{TE}}}{L_g} $$
(3)

with g the gene, N the number of TEs, L_g the length of the gene plus its 2 kb-flanking region, and L_TE the number of nucleotides annotated as TEs in the region encompassing the given gene.
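Eqs. 2 and 3 can be illustrated with the sketch below. Several points are assumptions made for the example and are not stated in the text: coordinates are 1-based and inclusive, the 2 kb flank is added on both sides, a TE is allocated to a gene as soon as it overlaps the extended region, only the overlapping part of a TE counts toward L_TE, and TE annotations do not overlap each other.

```python
def te_density_and_coverage(gene_start, gene_end, te_intervals, flank=2000):
    """Return (D_g, C_g) for one gene, following Eqs. 2 and 3.

    te_intervals: (start, end) pairs of TE copies on the same chromosome.
    """
    region_start = gene_start - flank
    region_end = gene_end + flank
    region_length = region_end - region_start + 1  # L_g

    n_te = 0      # N, number of TE copies allocated to the gene
    te_bases = 0  # L_TE, nucleotides annotated as TE in the region
    for te_start, te_end in te_intervals:
        overlap = min(te_end, region_end) - max(te_start, region_start) + 1
        if overlap > 0:
            n_te += 1
            te_bases += overlap

    non_te_bases = region_length - te_bases
    # Density is undefined if the whole region is annotated as TE.
    density = n_te / non_te_bases if non_te_bases > 0 else float("nan")
    coverage = te_bases / region_length  # multiply by 100 for a percentage
    return density, coverage


# Toy gene of 10,000 bp with two TE copies nearby and one far away.
density, coverage = te_density_and_coverage(
    10001, 20000, [(9001, 9300), (15001, 15700), (30000, 30500)]
)
```

In this toy case the extended region spans 14,000 bp, two TE copies contribute 1,000 bp, and the third copy lies outside the region and is ignored.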

These two different metrics were used because the number of TEs associated with a gene is affected by the size of the gene and its flanking region, and by the size of the TEs themselves. Whereas the density estimates the number of insertions, the coverage measures the proportion of nucleotides belonging to an element in the sampled sequence. The relationship between these two statistics was tested by a Spearman correlation test.
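For readers unfamiliar with the Spearman coefficient, the sketch below computes it from scratch as the Pearson correlation of the ranks, with tied values receiving their average rank. It only returns the coefficient, not the p value of the test, and the function names are chosen for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values standing in for per-gene TE density and TE coverage.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
```

Because the coefficient depends only on ranks, it captures any monotonic relationship between density and coverage, not only a linear one.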

Genes were clustered according to their level of density and coverage of TEs using the pam() function in R [62]. This algorithm, called “Partitioning Around Medoids”, provides a robust clustering method because outliers have a less important impact than in the k-means method often used for clustering [63]. The main difference between the two methods is that pam() minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances, and that each medoid (the center of a cluster) is an actual point within the dataset. The genes with density and coverage equal to 0 were defined as TE-free genes (4,300 genes). The remaining 14,771 genes were clustered with the pam() function to discriminate between the TE-intermediate (9,132 genes) and the TE-rich genes (5,639 genes). To have more precise information concerning the influence of particular TE types, we also classified the 14,771 genes according to the density and coverage for each particular TE type using the pam() function. We thus determined 11 different categories: the all-TE-intermediate and all-TE-rich categories, which correspond to genes with respectively intermediate and rich levels for every TE type, and the SINE-rich, LINE-rich, DNA-rich, LTR-rich, SINE-intermediate, LINE-intermediate, DNA-intermediate, LTR-intermediate, plus a “mix” category, which contains genes with a combination of TE types. To avoid any confounding factors due to the simultaneous presence of different types of TE near genes, we applied a strict rule to determine the category. For example, LINE-intermediate genes are free from other TE types.
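The idea of clustering around medoids can be illustrated with the simplified sketch below. It is not the algorithm implemented by pam(), which uses dedicated BUILD and SWAP phases; it is a plain alternating k-medoids scheme with a naive initialisation, shown only to make the notion of a medoid concrete. The function names and the toy points are assumptions made for the example.

```python
import math


def assign_to_medoids(points, medoids):
    """Group point indices by their nearest medoid (Euclidean dissimilarity)."""
    clusters = [[] for _ in medoids]
    for idx, point in enumerate(points):
        nearest = min(range(len(medoids)),
                      key=lambda c: math.dist(point, points[medoids[c]]))
        clusters[nearest].append(idx)
    return clusters


def k_medoids(points, k, max_iter=100):
    """Simplified k-medoids: alternate assignment and medoid update."""
    medoids = list(range(k))  # naive start: the first k points
    for _ in range(max_iter):
        clusters = assign_to_medoids(points, medoids)
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:  # keep the previous medoid if a cluster empties
                new_medoids.append(old)
                continue
            # The medoid is the member with the smallest total dissimilarity
            # to the other members, so it is always an observed point.
            best = min(members, key=lambda m: sum(
                math.dist(points[m], points[o]) for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to_medoids(points, medoids)


# Toy two-dimensional points standing in for (density, coverage) pairs.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, clusters = k_medoids(points, k=2)
```

Unlike a k-means centroid, each returned medoid is the index of a real data point, which is what limits the influence of outliers on the cluster centers.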

Differential gene expression and functional analyses by GO term enrichments

RNAseq reads from both samples GM12878 and K562 were trimmed to ensure sequencing quality using the unsupervised approach of the program UrQt [64] and aligned against human genes using Tophat2 [65]. Alignment counts were obtained on sorted bam files using htseq-count [66], and differential gene expression was assessed using DESeq2 [67]. We used an adjusted p-value threshold <0.1 for significance, which allowed us to identify 7,724 genes differentially expressed over the 19,071 total protein coding genes. We determined the enrichment in particular GO terms in a list of target genes (for example down-regulated genes in cancer condition) by comparing it with the list of all the genes in the genome using GOrilla [68] and REVIGO [69].

Statistical analyses

All statistical analyses were performed using the R software [62]. To account for multiple testing and to be conservative, we used the Bonferroni correction and considered as significant the results with p values < 0.05/n, n being the number of tests performed.
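The Bonferroni rule above amounts to the short sketch below. The function names and the p values are illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p value as significant if it is below alpha / n."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Ten made-up p values, e.g. one test per histone modification.
p_values = [0.0004, 0.003, 0.02, 0.3, 0.0049, 0.0051, 0.8, 0.04, 0.001, 0.6]
flags = significant_after_bonferroni(p_values)
```

With ten tests the per-test threshold is 0.005, which appears consistent with the p < 0.005 cut-off quoted for comparisons across the 10 histone modifications.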

Results

Histone modifications and TE enrichment of genes vary according to the gene position on chromosome

We observed the mean histone enrichment of genes according to gene position on chromosomes in normal condition (Fig. 1a, Additional file 1 for each chromosome). We split each chromosome in bins representing 5 % of the total chromosome length, i.e., genes located in terminal regions of the chromosomes are located in bins 5 % and 100 %. Independently of the chromosome location and for both sex and autosomal chromosomes, genes are on average less enriched for repressive histone modifications than for activating histone modifications. However, there are some local variations according to the histone modification. On sex chromosomes, H3K27ac is particularly enriched at four locations. In each case, this is due to a small subset of genes that display particularly high enrichment for this modification (Additional file 2). Some of these genes are also responsible for the peak corresponding to a high level of enrichment for H3K4me3. Less important peaks of mean enrichment are also observed on autosomal chromosomes for three locations, which concern the same histone modifications in addition to H3K9ac (Additional file 3).
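The binning along chromosomes can be sketched as follows. The function name is hypothetical, and the use of a single integer position per gene (for example its start coordinate) is an assumption, since the text does not say which coordinate assigns a gene to a bin. Bins are labeled by their upper bound, from 5 % to 100 %.

```python
def chromosome_bin(position, chromosome_length, bin_percent=5):
    """Return the label (5, 10, ..., 100) of the bin containing `position`.

    Integer ceiling division avoids floating-point edge effects at
    bin boundaries; a position exactly on a boundary falls in the lower bin.
    """
    n_bins = 100 // bin_percent
    index = -(-(position * 100) // (chromosome_length * bin_percent))
    index = min(max(index, 1), n_bins)  # clamp to the valid range
    return index * bin_percent


# Toy chromosome of 100 Mb.
length = 100_000_000
first_bin = chromosome_bin(1_000_000, length)
middle_bin = chromosome_bin(29_500_000, length)
last_bin = chromosome_bin(100_000_000, length)
```

Genes near either chromosome end therefore fall into the 5 % and 100 % bins.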

Fig. 1

a Distribution of the mean histone enrichment along sex and autosomal chromosomes for the 10 histone modifications in the normal condition (GM12878). b Distribution of the TE density and TE coverage of genes along sex and autosomal chromosomes

We also observed the variation in TE density and TE coverage of genes according to their location on chromosomes. As both metrics are highly correlated (r = 0.95, p < 2.2e-16), either of them can be used to determine the TE richness of each gene vicinity. Globally, TE density and TE coverage values tend to be lower for genes located on sex chromosomes than for autosomal genes (Fig. 1b, Additional file 4 for each chromosome). Moreover, the level of variation in TE density and TE coverage is greater for genes located on sex chromosomes than for autosomal genes. In particular, genes located in the 30 % bin of the sex chromosomes display a higher TE density and coverage than the genes from the other parts of these chromosomes.

The presence of TEs is locally associated with greater changes in the chromatin environment of genes between normal and cancer conditions

We determined how the histone enrichment of genes varies between the two conditions, normal and cancer. There is no clear general pattern of enrichment or depletion in activating modifications associated with cancer (Fig. 2). However, except for the activating modification H3K79me2, all modifications display different profiles of enrichment between the two conditions (Wilcoxon paired tests, p < 0.005). For example, genes are on average more enriched in H3K27ac in normal condition compared to the cancer condition, whereas the reverse holds for the H3K27me3 modification.

Fig. 2

Mean histone enrichment of genes for the 10 histone modifications in the two conditions: normal (GM12878) and cancer (K562). The modifications known to participate in the expression of genes or to be associated with open chromatin are represented in green. Those known to induce gene repression or to be associated with closed chromatin are represented in red. Vertical bars indicate the mean +/− standard errors

To determine if the presence of TEs near genes may be associated with greater changes in histone modifications of genes between the two conditions, we computed the mean histone enrichment for the genes according to their TE category: TE-free, TE-intermediate or TE-rich (Fig. 3). For each condition, we found that some histone modification enrichments vary when comparing TE-rich and TE-free genes (Additional file 5; Wilcoxon tests, p < 1.67e-3). For example, in normal condition, TE-rich genes are more than twice as enriched for H3K9ac as TE-free genes (εH3K9ac = 15.49 and 6.01 respectively, p < 2.2e-16). We then compared the histone enrichment for each gene between the two conditions and we observed that, except for H3K79me2 in all gene categories and for H3K27ac in TE-free and TE-rich genes, the histone enrichment is different between the two conditions inside each gene category (Wilcoxon paired tests, p < 8.3e-4). TE-rich genes are more enriched in H3K9ac in normal condition than in cancer condition (εH3K9ac = 15.49 and 7.98 respectively, p < 2.2e-16). However, TE-rich genes are more enriched in H3K4me2 and H3K27me3 in cancer condition (εH3K4me2 = 12.72 and εH3K27me3 = 4.13) compared to the normal condition (εH3K4me2 = 9.15 and εH3K27me3 = 1.87, p < 2.2e-16 and p < 2.2e-16 respectively).

Fig. 3

Heatmap of the mean enrichment for the 10 histone modifications of genes according to the TE category of their neighborhood in the two conditions: normal (GM12878) and cancer (K562). The number of genes of each category is given (n). High enrichments are toward yellow color whereas low enrichments are toward dark blue color

The previous analyses showed that histone enrichment does vary according to the TE content in the neighborhood of genes. However, it is not expected that particular levels of enrichment could be systematically associated with the presence or absence of TEs. We tested whether the presence of TEs is associated with a greater variation in histone enrichment between the two conditions, whatever the level of enrichment. To determine any over- or under-representation of each gene category according to their proportion in the genome, we compared their number to (i) the number of genes displaying similar enrichment in normal and cancer conditions, and (ii) the number of genes displaying significantly different enrichment between the two conditions. The results are presented in Fig. 4. Chi2 homogeneity tests showed that the distribution of the number of genes from each TE-content category is significantly different when considering variation in histone enrichment compared to their distribution in the whole genome (p < 0.0025). Globally, the TE-free genes more frequently show similar histone modification enrichment in the two conditions, while TE-rich genes tend to exhibit differences. For example, the genes without variation in histone enrichment between normal and cancer conditions for H3K4me1 and H4K20me1 are more represented by TE-free genes compared to their proportion in the genome (respectively 52.79 and 34.52 %, instead of 22.55 %). For the same histone modifications, in the genes that exhibit different histone enrichment between normal and cancer conditions, the proportion of TE-free genes decreases (15.68 % for H3K4me1, and 15.13 % for H4K20me1) whereas the proportion increases for the TE-intermediate (50.35 % for H3K4me1 and 49.57 % for H4K20me1) and TE-rich genes (33.97 % for H3K4me1, and 35.30 % for H4K20me1).
Taken together, these results indicate that a gene with TEs in its vicinity is more likely to have a change in histone enrichment between the two conditions compared to a TE-free gene.
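The Chi2 homogeneity tests used above compare category counts between groups of genes. The sketch below computes the statistic and its degrees of freedom for a contingency table; the counts are made up for illustration and do not come from the study, and the function name is an assumption. A p value is given only for the special case of 2 degrees of freedom, where the chi-square survival function reduces to exp(-x/2).

```python
import math


def chi2_homogeneity(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    table: list of rows, one row per group of genes, one column per
    TE-content category (e.g. TE-free, TE-intermediate, TE-rich).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all groups shared the same proportions.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up counts: two groups of 100 genes across three TE categories.
table = [[30, 50, 20],
         [10, 50, 40]]
statistic, dof = chi2_homogeneity(table)

# With 2 degrees of freedom the p value has a closed form.
p_value = math.exp(-statistic / 2) if dof == 2 else None
```

A small p value indicates that the two groups do not share the same distribution across TE-content categories.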

Fig. 4

Gene proportion according to the TE category of their neighborhood. The gene proportion is shown for the global genome and between the two conditions (normal (GM12878) and cancer (K562)) for genes displaying the same histone enrichment and for genes displaying different histone enrichment for the 10 histone modifications

In some particular cases, TEs can be associated with various histone modifications according to their classes [70, 71]. To determine if similar patterns were found when considering TE types individually, we computed the mean differential enrichment of genes between normal and cancer conditions according to the TE type in the gene neighborhood for each histone modification (Fig. 5 and Additional file 6). The presence of different types of TEs near genes is associated with different effects (Kruskal-Wallis tests, p < 0.005). In particular, SINE-rich, LTR-intermediate, and TE-free genes are more enriched for H3K4me3 in normal condition, whereas LINE-rich, LINE-intermediate, and all-TE-rich genes are more enriched for this modification in cancer condition.

Fig. 5

Differential histone enrichment between normal (GM12878) and cancer (K562) conditions for the 10 histone modifications of genes according to the TE category of their neighborhood. The number of genes of each category is given (n). Greater enrichment in normal condition is shown toward blue, whereas greater enrichment in cancer condition is shown toward red. White corresponds to an absence of differential enrichment between the two conditions

Differentially expressed genes between normal and cancer conditions have particular histone enrichment variations and TE environment

To test a possible association between the presence of TEs, particular histone enrichment, and gene expression, we analyzed in more detail the 7,699 genes differentially expressed between the two conditions for which histone modifications were associated, the 25 missing genes being located on unidentified chromosomes. Down-regulated genes in the cancer condition compared to normal one are enriched for functions in the regulation of lymphocyte activation, the defense response, and the immune system process. Up-regulated genes are enriched for functions in cytoskeleton organization, cell cycle process, sulfur compound biosynthesis, regulation of vesicle mediated transport, single organism cell process, and post-translational protein folding (Additional file 7). We have also compared our datasets of down- and up-regulated genes to the set of census cancer genes identified in the COSMIC database (http://cancer.sanger.ac.uk/cosmic; [72]). The results show that among the 596 census genes that have been identified as “cancer genes”, meaning genes for which mutations have been causally implicated in cancer, 156 and 120 correspond to genes from our sets of down- and up-regulated genes respectively.

The mean histone enrichment of up- and down-regulated genes in cancer condition in comparison to the normal one is reported in Table 1, for both conditions. The histone enrichment is significantly different between the two conditions for all modifications, and for up- and down-regulated genes (Wilcoxon paired tests, p < 0.0025), with the only exceptions of H3K4me2 for down-regulated genes and H3K27me3 for up-regulated genes. Both up- and down-regulated genes display the same pattern, with more enrichment in normal condition for H3K27ac, H3K36me3, H3K9me3, H3K9ac, and more enrichment in cancer condition for H4K20me1. It is therefore unlikely that the divergence of expression in response to cancer is due to these modifications. However, up-regulated genes are more enriched for H3K4me1, H3K4me2, H3K4me3, and H3K79me2 in cancer condition whereas the down-regulated genes are depleted for these activating modifications in the same condition (except for H3K4me2, which displays no difference between the normal and cancer conditions). Symmetrically, down-regulated genes are more enriched in cancer condition for the repressive histone modification H3K27me3 whereas up-regulated genes do not show variation between the two conditions. These differences could potentially explain the divergence of expression of these genes between the two tested conditions. In order to determine if some particular functions could be more represented among these genes, we looked at the Gene Ontology terms of the most highly down-regulated genes that are TE-rich and enriched in H3K27me3 in cancer condition (Additional file 8). Interestingly, seven out of the 15 genes are implicated in immune system process and response to stress, among which one gene, LCK, is identified as a “cancer gene” in the COSMIC database. Similarly, we looked at the most highly up-regulated genes that are either TE-intermediate or TE-rich, and enriched in H3K79me2 (Additional file 9).
In that case, there are fewer common GO terms, but we note that among the 43 genes, six are involved in immune system process and response to stress, and four are involved in transcription from RNA polymerase II. Among the genes from this last category, two have been identified as “cancer genes” in the COSMIC database (GATA1 and GATA2).

Table 1 Mean histone enrichment for the 10 histone modifications of genes according to their expression divergence between normal and cancer condition

The TE environment appears to be associated with the variation in histone modifications observed between the up- and down-regulated genes (Table 2). Among the differentially expressed genes displaying enrichment or depletion in particular histone modifications, we tested whether the number of genes in each local TE landscape category is different from that observed in the total genome. We first considered the down-regulated genes with more enrichment in H3K27me3 in cancer condition (1,514 genes) and depleted in H3K4me1 (1,649 genes), H3K4me3 (1,420 genes), and/or H3K79me2 (1,766 genes). Globally, the proportions are different for all comparisons (Chi2 homogeneity tests, p < 0.0055). More specifically, there is an increase of LTR-rich genes inside each group of genes (9.44 % (total genome) versus 17.97 % (H3K27me3), 15.46 % (H3K4me1), 14.37 % (H3K4me3), and 16.08 % (H3K79me2)) whereas the proportion of TE-free genes greatly decreases (22.55 % (total genome) versus 11.23 % (H3K27me3), 14.55 % (H3K4me1), 15.56 % (H3K4me3), and 12.85 % (H3K79me2)). We also observe an increase in the proportion of DNA-intermediate genes (0.08 % (total genome) versus 0.13 % (H3K27me3) and 0.14 % (H3K4me3)), all-TE-intermediate genes (1.67 % (total genome) versus 2.77 % (H3K27me3)), and all-TE-rich genes (1.00 % (total genome) versus 2.46 % (H3K4me3) and 2.38 % (H3K79me2)), but also a decrease in the proportions of SINE-rich, SINE-intermediate, and LTR-intermediate genes. Among the up-regulated genes that display enrichment in H3K4me1 (2,334 genes), H3K4me2 (2,345 genes), H3K4me3 (2,583 genes), and/or H3K79me2 (1,819 genes), the proportions of SINE-rich, DNA-intermediate, and LTR-rich genes increase whereas the proportions of LINE-intermediate, LTR-intermediate, and TE-free genes decrease.

Table 2 Gene number (proportion) among differentially expressed genes according to the TE-content category and their enrichment in histone modifications in cancer condition

Discussion

In this work, we showed that genes are generally more enriched for activating histone modifications than for repressive ones when considering all positions on chromosomes, in both autosomal and sex chromosomes. This may reflect the fact that genes are usually enriched in regions associated with an open chromatin state [73, 74]. We did not detect any significant effect of the local gene density on a chromosome on the histone modification enrichment pattern (Spearman correlation tests, data not shown). However, at a finer scale, we know that variations among genes exist according to their function in the tissue considered. We observed regions with a high level of enrichment for activating histone modifications, which are due to especially high values associated with a small number of genes. This could point to genes particularly active in the analyzed cell line since it has been shown that histone modification levels are good predictors of the gene expression level [75]. When we analyzed the TE content near genes, we observed that genes are on average more enriched in TEs when located on autosomal chromosomes than when located on sex chromosomes. This is in general agreement with previous analyses made on the TE distribution in the human genome, where the density of some retroelements is higher on autosomal chromosomes than on the X chromosome [76], which could be associated with variation in the recombination rate on these chromosomes.

We did not observe any general pattern of increase or decrease of histone modifications according to their effect on gene expression in association with cancer compared to the normal state, but the two conditions showed significantly different landscapes for enrichment. Variances of enrichment for some histone modifications appear to be larger for genes in normal condition. This points out the need to better understand how labile epigenetic modifications are, and to quantify how much they vary among normal conditions, across time, or even among individuals, a whole body of research that is just starting [77]. For the purpose of the study, we made the hypothesis that the “within condition” variation can be estimated using the large number of genes corresponding to the whole genome.

Our results showed that there is more variation in the histone enrichment of genes between normal and cancer condition when the genes are enriched in TEs. This could be linked to the fact that TEs can be associated with particular epigenetic modifications. In human and mouse, TEs are associated with H3K9me3 and H4K20me3 [78, 79]. In mouse, an association of the modification H3K27me3 with SINEs and gene-rich regions has been shown [80]. Histone modifications play a major role in the global silencing of TEs in mammalian genomes, even if some variability exists regarding the TE family [78, 79, 81, 82]. Interestingly, some of the histone modifications are likely to be cell-type specific, which could indicate that some of those targeting TEs may regulate the expression of “host” genes, especially if they provide the host with a function [82]. Particular histone modifications of TEs have also been shown to spread to the neighboring regions of the TE insertion. For example, Intracisternal A-particle (IAP) elements, which are moderately repeated TEs in mouse (~1000 copies), induce H3K9me3 and H4K20me3 targeting on flanking regions of their insertion [54]. A similar observation has been made in plants, in which the insertions of TEs in euchromatic regions induce the local formation of heterochromatin [53, 81, 83]. Hence, the presence of particular histone modifications associated with TEs could influence the epigenetic profile of neighboring genes, due to the synergistic or antagonistic actions of different histone modifications [84]. In cancer condition, the global modifications occurring on TEs may also spread to neighboring genes, inducing changes in their expression, which in turn would perturb various genetic networks. Indeed, in cancer cells, silencing of tumor-suppressor genes by hypermethylation of CpG island promoters is associated with deacetylation of histones H3 and H4, loss of H3K4me3, and gain of H3K9me and H3K27me3 [35, 85].
However, unmethylated tumor-suppressor genes are silenced when hypoacetylation and hypermethylation of histones H3 and H4 are present, indicating that changes in histone modifications alone can be sufficient to repress a gene [34]. A global reduction of monoacetylated H4K16 has been observed in cancer cells, along with a loss of the active modification H3K4me3 and of the repressive modification H4K20me3, and a gain of the repressive modification H3K27me3 [50, 85, 86]. Interestingly, we did not observe an association with more repressive histone modifications for TE-rich genes compared to TE-free genes in the normal condition, as could be expected if all TE insertions were targeted only by silencing modifications. Some TE insertions might have been selected for their adaptive role in gene regulation, and are therefore not silenced by the host genome. An “exaptation hypothesis” has been suggested along these lines [87]. The authors proposed that the role of TE epigenetic modifications could be adaptive, with TEs having been recruited to participate in the regulation of host genes, although some evidence remains in support of the alternative “genome defense” hypothesis, in which the epigenetic regulatory system evolved to silence TEs and prevent their deleterious activities. In any case, this implies that not all TE insertions in a genome will have the same impact on gene expression, depending on how natural selection has acted on them.

Among the genes that are differentially expressed between the two conditions and show variation in histone enrichment, genes with particular TEs in their vicinity are over-represented while TE-free genes are under-represented. This was especially clear for down-regulated genes. This result supports a causal link between the presence of TEs, histone modifications, and changes in gene expression. In the cancer condition, epigenetic remodeling of large genomic regions is observed, as well as a loss of control of various epigenetic mechanisms [88, 89]. The presence of TEs in these regions could thus trigger particular changes in epigenetic modifications when compared to regions devoid of TEs. Interestingly, the effect seems to change according to the type of TEs present near genes. We showed here that the proportion of LTR-rich genes increases among down-regulated genes with a depletion in several activating histone modifications and an enrichment in the repressive modification H3K27me3 in the cancer condition. Similarly, an effect on gene expression has been observed for L1 elements inserted into genes, associated with DNA hypomethylation in the cancer condition [90]. In addition, we observed that LINE-intermediate and LINE-rich genes are less represented among up-regulated genes in the cancer condition, which could be linked to the same effect.
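To make the notion of over-representation concrete, the following is a minimal Python sketch of a one-sided hypergeometric test applied to one TE category within a set of differentially expressed genes. It is an illustration only and not necessarily the statistical procedure used in this study; the function names and the gene counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, category_total, sample_size):
    """P(X >= k) when drawing `sample_size` genes without replacement from
    `population` genes, of which `category_total` belong to the TE category."""
    upper = min(category_total, sample_size)
    total = comb(population, sample_size)
    favourable = sum(
        comb(category_total, i) * comb(population - category_total, sample_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def fold_enrichment(k, population, category_total, sample_size):
    """Observed fraction of the category in the sample over its genome-wide fraction."""
    return (k / sample_size) / (category_total / population)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only: 20000 genes in total,
    # 2000 of them in one TE category, 500 down-regulated genes,
    # 80 of which belong to that category.
    p = hypergeom_upper_tail(80, 20000, 2000, 500)
    fe = fold_enrichment(80, 20000, 2000, 500)
    print(f"fold enrichment = {fe:.2f}, one-sided p-value = {p:.3g}")
```

A fold enrichment above 1 with a small upper-tail probability would indicate over-representation of the category; under-representation (as for TE-free genes) would be assessed with the complementary lower tail.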

In this study, we have assumed that all TE insertions currently present in the human genome are fixed. Although this is true for the large majority of the millions of insertions in this genome, a small number of TE families corresponding to non-LTR retrotransposons are known to be still active and potentially able to produce new insertions, which corresponds to a few thousand active copies [16, 91]. Since more transcriptional activity of TEs has been observed in cancer conditions, new insertions could be generated for the families that are still active. Some studies have indeed identified several hundred somatic transposition events in various cancer tissues that were mainly found inside known cancer genes, indicating a direct link between the new insertions and cancer development [92–95]. Novel insertions may produce particular changes in the epigenetic profiles of the genes inside or near which they insert, changes that we would not be able to detect here. However, this would not completely change the global pattern we observed, since these new insertions cannot completely change the TE category of genes, except for some of the TE-free genes. Moreover, since we focused on genes having one category of TE in their neighborhood to avoid the confounding effects of various TE families, it is unlikely that new insertions would have occurred in the genes we considered. Although new cancer insertions may not blur the observations we made, the use of polymorphic insertions would be especially interesting to directly measure the influence on gene expression and epigenetic modifications of the differential presence or absence of active TEs near particular genes. For example, the study of paralogous regions in the human genome has shown that the presence of Alu elements is associated with DNA methylation divergence, with hypermethylated regions being closer to Alus than their corresponding hypomethylated copies [96].
Thus, the differential presence of some TE insertions could in some cases be associated with variation in the epigenetic landscape of genes, which may be associated with a certain susceptibility to cancer development. These polymorphic insertions have been shown to be more numerous than somatic cancer insertions, since they can represent a few thousand sequences [16, 92, 97]. However, these insertions are usually not found near genes, as a consequence of the direct action of natural selection, which eliminates deleterious mutations. It can therefore be expected that not having considered these insertions does not modify our current results.

Conclusions

Our analyses have shown that the genomic environment of genes is important for understanding changes in gene expression when the cell changes condition. The presence of TEs around genes may have a crucial impact on their epigenetic landscape.

Abbreviations

GO, Gene Ontology; LINEs, Long Interspersed Nuclear Elements; LTR, Long Terminal Repeat; SINEs, Short Interspersed Nuclear Elements; TEs, transposable elements