Background

Individuals and populations display variable susceptibility to infectious diseases, chronic inflammatory disorders, and autoimmunity [1, 2]. Over the last decade, it has become clear that such disparities partly result from differences in the host genetic make-up, with an increasing number of genes being associated with varying abilities to fight infections at the individual and population level [3, 4]. Furthermore, population genetic studies have revealed that pathogen-driven selection has substantially impacted human genetic diversity [5, 6]. Because the mortality, and thus the selective pressure, imposed by pathogens have been paramount [7], human populations had to adapt to the different pathogenic environments they encountered around the globe, and genes involved in host defense are among the functions most strongly selected for by natural selection [5, 8,9,10,11]. While substantial evidence supports this hypothesis at the genetic level, we still know little about the degree of naturally occurring epigenetic variation at the population level and how this may impact immune phenotypes.

As the immune system is the primary interface with the human pathogenic environment, the study of DNA methylation [12, 13] offers a unique opportunity to explore the interplay between the genome and environmental cues. DNA methylation can be affected by a range of external factors, such as nutrition, toxic pollutants, social environment, and infectious agents [14,15,16,17,18,19]. Furthermore, numerous studies have mapped DNA sequence variants associated with DNA methylation variation [20,21,22,23,24,25,26,27,28], i.e., methylation quantitative trait loci (meQTLs), and ~ 20% of the inter-individual variation in DNA methylation has been attributed to genetics [29, 30]. DNA methylation variation has also been associated with complex traits, including aging [31], body mass index [32], various cancers [33, 34], obesity [35], and autoimmune and inflammatory disorders [36, 37]. Yet, most studies of human epigenome variation, both in health and disease conditions, have focused on populations of homogeneous genetic ancestry, primarily of European descent.

A few studies, however, have reported that population differences in ancestry, habitat, or lifestyle affect DNA methylation, providing an initial assessment of the contribution of genetic factors and gene-environment (G × E) interactions to population-level epigenetic variation [38,39,40,41,42,43,44]. Yet, these studies investigated DNA methylation variation from virus-transformed lymphoblastoid cell lines or whole blood, so the differences observed could reflect, at least partially, epigenetic changes induced by cell immortalization or heterogeneity in blood cell composition that was not fully accounted for [45,46,47]. Thus, the extent of DNA methylation variation related to ancestry, and its genetic determinants, in a cellular setting relevant to immunity are far from clear.

A growing body of research has reported ancestry-related variation in terms of immune gene expression levels. Two recent studies found marked differences between individuals of African and European ancestry in their transcriptional responses to infectious challenges [48, 49] and showed that regulatory variants (i.e., expression quantitative trait loci, eQTLs) explain a substantial proportion of these population differences. Still, a large fraction of the variance in gene expression, both across individuals and populations, cannot be attributed to genetic factors and remains unexplained [48,49,50,51,52,53,54,55]. In this context, DNA methylation represents an additional, possible layer for variation in gene regulation [56]. The observed correlations between DNA methylation and gene expression levels can be positive and negative; in the canonical model, high levels of methylation at promoter regions are often associated with low gene expression, but elevated gene body methylation is also associated with active expression [28, 47, 57,58,59,60]. There is also increasing evidence that DNA methylation can play both passive and active roles in the regulatory interactions influencing gene expression, but the causality relationships between DNA methylation, gene expression, and genetic factors are not fully understood [19, 23, 56]. Furthermore, genetic variants associated with complex traits or diseases by genome-wide association studies (GWAS) often overlap both eQTLs and meQTLs, suggesting that disease risk can be mediated, directly or indirectly, by variation in DNA methylation [61,62,63,64,65,66,67].

Here, we aimed to broaden our understanding of the mechanistic links between ancestry-related differences in DNA methylation, genetic factors, and immune gene regulation. To do so, we build upon the EvoImmunoPop collection of primary monocytes originating from healthy individuals of African and European ancestry [48]. We profiled the DNA methylome of 156 donors, including 78 of each ancestry, using the high-resolution Infinium MethylationEPIC array, which captures methylation variation at more than 850,000 sites. This new dataset was combined with both genome-wide genotyping and whole-exome sequencing data, as well as with RNA-sequencing profiles from resting and stimulated monocytes with various immune stimuli, obtained from the same individuals. Such a system-level approach, integrating epigenetic, genetic, and transcriptional data, allowed us to assess the extent to which population-level variation in DNA methylation and its genetic determinants impact transcriptional activity related to immune responses.

Results

Population differences in DNA methylation profiles of primary monocytes

To assess population differences in DNA methylation of a purified innate immune cell type, we characterized DNA methylation variation at > 850,000 CpG sites across the genome, in monocytes originating from 156 male healthy volunteers: 78 of African descent (AFB, median age = 30.9 years) and 78 of European descent (EUB, median age = 25.9 years), all living in Belgium. Note that AFB individuals moved to Belgium between the ages of 6–45 years old (median age = 29 years). After normalization and filtering (see “Materials and methods”), we retained a final dataset of 552,141 methylation sites in the 156 individuals (Additional file 1: Figure S1). Principal component analysis (PCA) of DNA methylation clearly separated AFB and EUB along the first two PCs, which explained together 11.6% of the total variance (Fig. 1a). At a false discovery rate (FDR) = 1%, we identified 77,857 sites (14.1% of the total number) that presented a significant difference between AFB and EUB in their mean level of DNA methylation, after adjusting for age and surrogate variables. When restricting our analyses to CpGs that presented a mean difference > 5% (measured by the β value [68], see “Materials and methods”), we identified a total of 12,050 differentially methylated sites between populations (DMS) that mapped to 4818 genes. Because the age distributions of AFB and EUB individuals significantly differ (Wilcoxon P value = 10−4; Additional file 1: Figure S2), and age might have a non-linear effect on DNA methylation [69], we also investigated with ANOVA the extent to which DNA methylation is non-linearly affected by age in our dataset. Our analyses showed that such effects had little to no impact on the population differences in DNA methylation detected (Additional file 2: Supplementary Note 1).

Fig. 1
figure 1

Population differences in DNA methylation profiles. a Principal component analysis (PCA) of DNA methylation profiles for all 156 individuals. Red and blue circles represent African (AFB) and European (EUB) individuals, respectively. The proportions of variance explained by PC1 and PC2 are indicated. b Genomic location of differentially methylated sites (DMS), for CpG sites hyper-methylated in AFB (red) and in EUB (blue). Odds ratio and 95% confidence intervals are displayed for AFB-DMS and EUB-DMS, comparing their localization in different genomic locations as provided by Illumina (TSS1500, TSS200, 5′UTR, 1stExon, Body, Exon boundaries [ExonBnd], and 3′UTR), and in enhancer and promoter regions specifically detected in monocytes by ChromHMM phase 15 (see refs. [110, 111]). Odds ratios were computed against the general distribution of the 552,141 CpGs of our dataset. c Proportion of DMS that are hyper-methylated either in AFB (red) or in EUB (blue) individuals. The density of β values of one CpG site by category is given as an illustration of the population differences, with red and blue lines representing the methylation density in AFB and EUB, respectively. d Gene Ontology (GO) enrichment analyses of AFB- and EUB-DMS. For both groups, the top-GO categories reaching 5% FDR are shown, together with the number of genes per category and the log10-transformed FDR-adjusted enrichment P values

The genomic distribution of DMS, which were highly enriched in enhancer regions (odds ratio (OR) ~ 2.6, P = 1.42 × 10−224), was independent of the population where hyper-methylation was observed (Fig. 1b). However, of the 12,050 DMS, 76.3% were more methylated in AFB than in EUB, with respect to the observed 54% when considering all CpGs (Fisher’s exact P < 2.2 × 10−16) (Fig. 1c). The corresponding genes were enriched in Gene Ontology (GO) categories related to cellular periphery and plasma membrane (Fig. 1d). The remaining 23.7%, which were hyper-methylated in EUB, were enriched in sites located in genes largely associated with immune response regulation and responses to external stimulus (Fig. 1c, d; Additional file 3: Table S1). These results cannot be explained by population differences in monocyte subpopulations (i.e., CD14high/CD16neg [Classical], CD14high/CD16low [Intermediate], and CD14low/CD16high [Non-Classical]), as adding these subpopulations as covariates in the model did not alter our results (Additional file 1: Figure S3). Furthermore, we detected no CpG sites whose levels of methylation correlate significantly with monocyte subtypes (FDR = 5%), indicating that the effects of monocyte subpopulations on DNA methylation are negligible at the epigenome-wide level. Together, these analyses reveal genes and functions that present extensive differences in DNA methylation between individuals of African and European ancestry, in the context of primary monocytes.

Genetic factors drive most ancestry-related DNA methylation variation

We next examined the genetic determinants of the observed population differences in DNA methylation, and mapped methylation quantitative trait loci (meQTLs). We first tested for local associations between DNA methylation variation at CpGs and SNPs located within a 100-kb window (cis-meQTLs), using MatrixEQTL [70] (see “Materials and methods”). We set a 5% FDR threshold, considering one association per CpG site and using 100 permutations (P < 1 × 10−5). We adjusted for age, two surrogate variables (accounting for batch effects and unknown confounders, see “Materials and methods”), and the first two PCs of the genetic data (Additional file 1: Figure S4), to account for population stratification. To detect subtle effects, we merged all individuals and included ancestry as a covariate, but simultaneously, we analyzed the two populations separately to detect putative population-specific effects. For all subsequent analyses, we present the significant results of these two approaches combined, unless otherwise indicated.

We identified 69,702 CpGs associated with at least one genetic variant in at least one population (~ 12.6% of all sites, referred to as meQTL-CpGs). Given that multiple linked SNPs can be associated to the same CpG, we kept the best-associated SNP for each meQTL-CpG. However, we also used a fine mapping approach [51] to detect independent SNPs associated to each CpG (see “Materials and methods”). In doing so, we detected 9826 additional meQTLs (Additional file 1: Figure S5), providing a more thorough view of the contribution of proximate genetic variants to DNA methylation variation. The median distance between a CpG and its associated SNP was ~ 3.8 kb (Additional file 1: Figure S6), supporting the close genetic control of DNA methylation [22, 28, 41, 65]. Furthermore, we found a 2.2-fold enrichment of meQTL-CpGs in enhancers (P < 1 × 10−326), a trend that was even more pronounced for meQTLs associated with population differences in DNA methylation (meQTL-DMS; OR ~ 2.8, P = 6.8 × 10−317, Additional file 1: Figure S7).

Focusing on ancestry-related differences, we observed that ~ 70.2% of DMS harbor a significant meQTL, with respect to the 12.6% detected genome-wide (Fisher’s exact P < 2.2 × 10−16; Fig. 2a). These meQTLs were found to account, on average, for ~ 58% of the observed population differences in DNA methylation (Additional file 1: Figure S8, see “Materials and methods”). Furthermore, meQTLs presented opposite effects on DNA methylation as a function of population differences in allelic frequency, i.e., a derived allele at higher frequency in Africans was generally associated with high levels of DNA methylation, while a derived allele at higher frequency in Europeans was primarily associated with low DNA methylation (Fig. 2b). This observation provides a genetic explanation for the unbalanced patterns of hyper-methylation, observed at DMS, between Africans and Europeans (Fig. 1c).

Fig. 2
figure 2

Genetic control of population differences in DNA methylation levels. a Proportions of CpGs and DMS associated to genetic variants identified in the three meQTL studies: merging the two populations (gray shades), mapping in AFB only (red shades) and in EUB only (blue shades). For each mapping, proportions among all 552,141 tested CpG sites and among DMS are indicated in light and dark colors, respectively. ***Fisher’s exact P < 2.2 × 10−16. b Contour plot of meQTL effects on DMS as a function of their difference in derived allelic frequencies (DAF) between populations. For each of the 8459 DMS for which we detected at least one meQTL, we used a kernel density estimation to draw the contour plot of the effect of the derived allele of the meQTL onto methylation (beta, Y axis) according to the ΔDAF (DAFEUB – DAFAFB, X axis). The coefficient and P value of Pearson’s correlation test are displayed. The marginal distribution of the two variables is displayed: top for ΔDAF, and right for beta. c, d Examples of meQTLs detected in this study. Boxplots represent the distribution of β values as a function of genotype, for AFB (red) and EUB (blue) individuals. The minor allele frequency of each meQTL is presented for each population on the top. Gray lines indicate the fitted linear regression model for β value~genotype for each population. e Fold enrichment of meQTLs associated with DMS in GWAS hits. For each of the 17 parental EFO categories, the fold enrichment, the 95% confidence intervals obtained by bootstrap, and the associated P values are shown

Local meQTLs can, a priori, lead to population differences in DNA methylation following two main models: (i) the meQTL has a similar effect in both populations but present different allelic frequencies (Fig. 2c), or (ii) the meQTL is present at similar frequencies but display population-specific effects, revealing more complex interactions (Fig. 2d). We therefore investigated the population specificity of the 69,702 meQTL-CpGs detected using a model selection approach (see “Materials and methods”). We found 2868 (4.1%) significant population-specific effects (1337 AFB-specific and 1531 EUB-specific), suggesting the occurrence of G × E or G × G effects.

Ancestry-related meQTLs are enriched in associations with complex traits and diseases

Given that a large fraction of genetic variants identified by GWAS are thought to act by affecting gene regulation [71,72,73,74], we investigated the putative functional impact of the detected meQTLs on ultimate complex phenotypes. In practice, we searched for enrichments in GWAS hits among our set of 79,528 meQTLs, correcting for linkage disequilibrium (see “Materials and methods”). Focusing on the 17 parental classes of the Experimental Factor Ontology (EFO) classification [75], we found that meQTLs were enriched in significant hits for all these functional categories (Additional file 1: Figure S9, OR ~ 2.1–5.5, P < 4.1 × 10−10). Stronger enrichments were detected for meQTLs associated with population differences in DNA methylation (OR ~ 2.7–9.8, P < 2.9 × 10−3), in particular for phenotypes related to hematological measurements, neurological disorders, immune system disorders, inflammatory measurements, and digestive system disorders (Fig. 2e).

Because DNA methylation and meQTLs have been shown to be largely cell or tissue dependent [23, 76,77,78,79,80,81], we next searched for the specific traits that account for the signals detected at the parental category “immune system disorder”, given our focus on primary monocytes. We found that meQTLs overlapped variants associated with diseases such as osteoarthritis, psoriasis, systemic lupus erythematosus, inflammatory skin disease, or type 1 diabetes (Additional file 1: Figure S10). For example, the meQTL SNP rs629953 presents markedly different frequencies between AFB and EUB (DAF AFB 7.5% versus DAF EUB 62%), leading to variable population-level DNA methylation at TNFAIP3 (cg06987098), and has been associated with psoriasis susceptibility [82, 83]. Together, our analyses support that complex traits and variable DNA methylation are pleiotropically associated with genetic variation [39, 60, 63, 64], but extend these associations to variants affecting ancestry-related epigenetic variation in the context of an innate immunity cell type.

Exploring the distant genetic control of DNA methylation variation

We subsequently searched for the effects of distant genetic variants on DNA methylation variation (trans-meQTLs). To limit the burden of multiple testing, and because trans-meQTLs are enriched in cis-eQTLs for genes encoding transcription factors (TF) [65], we focused on two non-independent subsets of genetic variants: (i) the 4037 SNPs detected as cis-eQTLs for one of 600 TF-coding genes and, more generally, (ii) the 73,561 SNPs located in the vicinity (± 10 kb) of the TSS of these genes. Only associations for which the SNP-CpG distance was higher than 1 Mb were considered, at an FDR of 5% (P < 1 × 10−9). Given the generally low power to map trans-associations, we performed this analysis by considering all individuals together and including ancestry as a covariate.

We identified 133 CpG sites associated with at least one distant SNP, for a total of 672 trans-meQTLs that involved 91 independent loci (Additional file 4: Table S2). Among these, we detected a number of hubs of distant genetic control of DNA methylation variation, including six TFs (ZNF429, CTCF, FOXI1, ZBTB25, MKL2, and NFATC1) where local genetic variation was associated with at least 10 different CpGs in trans. Highlighting one pertinent example, a single genetic variant (rs7203742) nearby CTCF—encoding a transcriptional regulator with 11 highly conserved zinc-finger domains—controls the degree of DNA methylation at 30 CpG sites, ~ 29.4% of all CpGs regulated in trans. Furthermore, of the 21 trans-regulated CpGs that were detected as DMS, 12 were controlled by the same CTCF variant. That this variant (T → C) presents high levels of population differentiation (DAF AFB 24% vs. EUB 88%, FST = 0.59 in the 1% of the genome-wide distribution) suggests the action of positive selection targeting the derived allele in Europeans. This observation makes of CTCF not only a master regulator of DNA methylation, as previously observed [65], but also an important contributor to differences in DNA methylation between human populations.

Dissecting the mechanistic relationships between DNA methylation and gene expression

We leveraged the availability of RNA-sequencing data from the same individuals [48] to obtain new insights into the mechanistic relationships between DNA methylation and gene expression variation, in African and European individuals. We associated the levels of expression of 12,578 genes in primary monocytes with those of DNA methylation at CpGs located within 100 kb of their TSS, for a total of 513,536 CpG sites. Associations were considered significant if they passed a P value threshold determined using 100 permutations (FDR = 5%, P < 5 × 10−5) (see “Materials and methods”).

We identified 1666 CpGs whose levels of DNA methylation were associated with gene expression (eQTMs), for a total of 811 genes (eQTM-genes) associated with at least one CpG in one population group (Additional file 5: Table S3). The KEGG pathways associated with eQTM-genes contained a large number of immune-related pathways, providing a link between DNA methylation and gene expression in the context of immunity (Fig. 3a). When investigating the population specificity of the 811 eQTMs (see “Materials and methods”), we detected 93 significant population-specific effects (43 AFB-specific and 50 EUB-specific). The majority of these cases (80 out of 93) corresponded to genes whose eQTMs were also under genetic control, suggesting, again, the occurrence of G × G or G × E interactions.

Fig. 3
figure 3

Correlations of DNA methylation with gene expression. a Networks of KEGG pathways of genes detected in the eQTM mapping. b Genomic location of eQTMs, for positively and negatively associated CpG sites (light and dark yellow, respectively). Odds ratio were computed against the general distribution of the 552,141 CpGs from our dataset. The distribution of eQTMs according to the direction of their effect on gene expression is shown. c Proportions of different groups of CpG sites in all tested sites (left panel) and among the detected eQTMs (right panel)

Based on current genomic annotations, eQTMs were mostly negatively correlated to gene expression (69.5% vs. 30.5%, see also refs. [23, 28, 65, 84, 85]). Negatively correlated sites were strongly enriched in enhancers (OR~ 2.6, P = 6.6 × 10−59) (Fig. 3b), highlighting their major role in transcriptional regulation [86,87,88]. In addition, we found a slight excess of negative associations in promoters (OR ~ 1.2, P = 1.8 × 10−2) and nearby TSS (TSS1500) (OR ~ 1.4, P = 7.2 × 10−13), as expected following the canonical model. Conversely, positive associations were enriched in sites located nearby UTRs, particularly 3′-UTR (OR ~ 1.8, P = 8.4 × 10−5) [89], but depleted in sites located in promoters (OR ~ 0.6, P = 1.1 × 10−4) (Fig. 3b). Furthermore, we found that eQTMs were strongly enriched in DMS (OR ~ 11.8, P < 1.93 × 10−216) and, importantly, in meQTL-CpGs (OR ~ 33.2, P < 1 × 10−326) (Fig. 3c). Together, these observations indicate that DNA methylation variation, in particular at sites that are differentially methylated across populations (DMS), is much more likely to be under genetic control when associated with gene expression differences (eQTMs), than random CpG sites.

Exploring the underlying causality between regulatory loci and gene expression

Because the respective roles of genetic and epigenetic factors in transcriptional regulation are not fully understood [56], we next mapped eQTLs (FDR = 5%, see “Materials and methods”) to identify the cases where DNA methylation, gene expression, and genetic variants show significant associations between all pairs (Additional file 1: Figure S11). We thus obtained 552 trios, each of them consisting of one gene, one to various CpGs and one to various SNPs (containing 68.1% of the genes detected in the eQTM mapping). This suggested potential, causal relationships between these variables—a latent, though challenging, question in epigenetics. To infer causality between regulatory loci (i.e., eQTMs and eQTLs) and gene expression variation for these specific trios, we first used an elastic net model to build two intermediate variables measuring (i) DNA methylation variability attributable to genetics for the trios presenting more than one SNP and (ii) gene expression variability attributable to DNA methylation for the trios presenting more than one CpG (see “Materials and methods”).

We used a Bayesian approach [90] to assess potential causal effects of a mediating variable M (DNA methylation) on the relationship between an independent variable X (genetics) and a dependent variable Y (gene expression) [91]. When comparing the performance of this method with that of an approach based on partial correlations, using simulated data and various genomic scenarios, we found similar results between the two approaches in terms of sensitivity and specificity (Fig. 4a, b; Additional file 1: Figure S12; see “Materials and methods”). We then ran the mediation analysis on each trio, adjusting for regular covariates (age and surrogate variables), but also for the fourth and second PCs of gene expression and DNA methylation, respectively. The latter covariates were added because they likely capture potential confounding factors inducing correlation between DNA methylation and expression, which would violate the assumption of the causal inference model (Additional file 1: Figure S13). Note that reverse causation was found to be unlikely in our experimental setting and was thus not considered in our analyses (Additional file 2: Supplementary Note 2).

Fig. 4
figure 4

Inference of the causal effects of DNA methylation on gene regulation. a Representation of a simulated scenario, with the three varying parameters (α, β, and τ). b Comparison of the mediation analysis (med) with a partial correlation approach (PartCor) using a range of different simulated parameters for α (0.3–0.8), β (0.9–0.1), and τ (0.1–0.9). Note that the parameter range simulated for β and τ was adjusted so that we kept 75% of the variance unexplained (random noise parameter γ = 0.25). The difference of the area under the curve (AUC) between the two approaches is represented with different shades of red and blue. The sizes of the circles are proportional to the mean AUC of the two approaches. Two examples of the ROC curves are shown in the upper part of the figure. c Number of mediated and non-mediated eQTM-genes for negative and positive associations between DNA methylation and gene expression. The percentages of these two categories are also indicated. d Proportion of variance of gene expression explained by DNA methylation (light gray) and genetics (dark gray), in mediated and non-mediated cases

At FDR = 5%, we identified 165 genes where the genetic control of expression levels was mediated by DNA methylation (i.e., α × β was significantly different from zero, Fig. 4a), in at least one population. Remarkably, in 66 of these cases, mediation occurred through CpG sites that are differentially methylated across populations (DMS) (Additional file 6: Table S4). The proportion of mediated genes whose expression was positively and negatively correlated to DNA methylation was similar, ranging from 26 to 31% (Fig. 4c). Expectedly, we found that, among mediated genes, DNA methylation explained a significantly higher proportion of the variance of gene expression than genetics (mean R2 = 23.4% versus 15.4%, respectively; Wilcoxon P = 3.3 × 10−11), in contrast with the 387 non-mediated cases where we observed the opposite trend (Wilcoxon P = 7.8 × 10−37) (Fig. 4d).

We also found that CpG sites mediating gene expression were preferentially located in enhancers (OR ~ 2.5, P = 4.0 × 10−21), highlighting again the major role of these regions in epigenetic regulatory mechanisms [92,93,94]. These CpGs were depleted in promoters (OR ~ 0.7, P = 1.4 × 10−2), which were otherwise enriched in non-mediating CpGs (OR ~ 1.3, P = 5.9 × 10−3). Notably, 86.6% of mediating CpGs fell directly into a TF-binding site (TFBS), with respect to the expected 76.9% at the genome-wide level (OR ~ 1.9, Fisher’s exact P = 8.64 × 10−7). This result suggests that DNA methylation might actively regulate transcriptional activity through the modulation of TF binding, a hypothesis that requires experimental validation.

Interestingly, among mediated cases, we found key genes of the immune response, such as NLRP2, RAI14, NCF4, or ICAM4, and genes with functions related to transcriptional activity, encoding zinc-finger proteins (Additional file 6: Table S4). This suggests a more extensive role of DNA methylation in regulating gene expression than the local associations described here, through the regulation of DNA-binding protein activity.

Impact of immune perturbation on genetic and epigenetic interactions

Finally, we sought to understand how DNA methylation variation at the basal state affects transcriptional responses to immune activation. We used RNA-sequencing data, obtained from the same individuals, after exposure to various stimuli: LPS activating TLR4 and Pam3CSK4 activating TLR1/2, both pathways sensing bacterial components, R848 activating TLR7/8, predominantly sensing viral nucleic acids, and influenza A virus (IAV) [48]. We then mapped response-QTMs (reQTMs) using fold changes in gene expression between non-stimulated and stimulated states, for all genes expressed in either condition (see “Materials and methods”).

We found 230 unique genes whose response to immune activation was associated with DNA methylation in at least one condition; most associations were context-specific, with only 7 genes detected in all conditions (Fig. 5a; Additional file 5: Table S3). Furthermore, a 2.5-fold increase was observed in the number of reQTM-genes detected upon activation with viral-stimuli (R848 and IAV; 197 unique genes) with respect to those detected for bacterial ligands (LPS and Pam3CSK4; 78 unique genes) (Fig. 5a). For example, we detected a reQTM upon R848 stimulation for CARD9 in EUB and CD1D upon IAV infection in AFB, both genes known to play an important role in host defense (Fig. 5b, c). Despite reQTMs and eQTMs present a similar genomic distribution (Additional file 1: Figure S14), we observed an important shift towards positive associations between DNA methylation and transcriptional responses, in particular to TLR ligands (Fig. 5d). This shift was mainly accounted for by reQTMs that present the strongest associations between DNA methylation and gene expression in the non-stimulated condition (Additional file 1: Figure S15), corresponding to 109 genes (47% of the total). This contrasts with the canonical model of negative associations primarily observed at reQTMs presenting the strongest associations at the stimulated state, corresponding to 131 genes (57% of the total). Note that 10 genes were associated with reQTMs of both groups.

Fig. 5
figure 5

Effects of DNA methylation on transcriptional responses to immune stimulation. a Number of genes harboring reQTMs in single conditions or combinations of stimulations. b, c Examples of reQTMs detected in this study. Lines indicate the fitted linear regression model, and gray shades the 95% confidence intervals of these models. b The distribution of the expression values of CD1D at the non-stimulated (yellow) and after IAV infection (purple) is plotted as a function of β values, for AFB individuals only. c The distribution of the expression values of CARD9 at the non-stimulated (yellow) and upon R848 stimulation (blue) is plotted as a function of β values, for EUB individuals only. d Number of reQTM-genes by condition and according to the direction of their association with DNA methylation. e Number of mediated and non-mediated reQTM-genes per stimulation condition. The percentages of these two categories for each condition are also indicated. f Proportion of variance of gene expression explained by DNA methylation, among negative (dark colors) and positive (light colors) associations, in mediated cases

To explore causal mediation effects of DNA methylation in the context of immune activation, we mapped response-QTLs (see “Materials and methods”). Following our previous rationale (Additional file 1: Figure S11), we identified 141 trios (61.3% of the 230 reQTM-genes, Additional file 6: Table S4). At FDR = 5%, we detected 40 genes (28.4%) where the genetic control of their transcriptional response was mediated by DNA methylation (Fig. 5e). Although non-significant, we found a higher proportion of mediation for genes whose response was positively associated with DNA methylation, as compared to negative associations, in particular for viral challenges (OR ~ 2.0; Fisher’s exact P = 0.33) (Additional file 1: Figure S16). Among mediated genes in the viral conditions, the proportion of gene expression variance explained by DNA methylation was higher for positive than for negative associations, again at odds with the non-stimulated condition (Fig. 5f). More generally, our analyses illustrate the value of mapping reQTMs and studying the underlying patterns of causality, to uncover mechanisms that might explain disparities in the way individuals and populations respond to immune activation.

Discussion

Our population epigenetic results, obtained in the setting of an innate immunity cell population, demonstrate extensive differences in DNA methylation profiles between two populations that differ in their genetic ancestry but share the same present-day environment. Such population differences were observed at the epigenome-wide level (explaining ~ 12% of the total variance in DNA methylation) and involved 12,050 sites that were mostly located in genes with functions related to cell periphery or immune response regulation. Previous studies have searched for ancestry-related differences in DNA methylation in various human populations and cell types [16, 38,39,40,41, 43, 95]. Although comparisons across studies are complicated by differences in experimental settings and statistical thresholds used to detect ancestry-associated CpG sites, these range from 299 between Caucasian- and Asian/mixed-descent individuals living in Canada [16] to 36,897 between European CEU and African YRI [39]. An interesting insight that can be drawn from our analyses is that genes involved in the activation and regulation of immune responses tend to present higher levels of DNA methylation in individuals of European ancestry, with respect to those of African ancestry, mostly owing to genetic control. That up to 16% of immune-related genes that are hyper-methylated in Europeans are also differentially expressed between populations [48] could provide a mechanistic explanation for the ancestry-related differences in transcriptional responses to bacteria reported in macrophages, where European ancestry is associated with lower inflammatory responses [49].

Although variation in past environmental exposures and socioeconomic factors may contribute to population differences in DNA methylation, we found that 70% of differentially methylated sites between African and European ancestry groups were associated with at least one meQTL. This indicates that population differences in DNA methylation are mostly driven by DNA sequence variants [38, 40,41,42]. In some cases, a single genetic variant can account for important population differences at multiple CpG sites, as attested by the trans-meQTL we detected at CTCF, whose local genetic variation has been shown to alter distant DNA methylation patterns in whole blood [65]. We show that a CTCF variant (rs7203742) regulates DNA methylation of 30 distant CpGs, 40% of which are differentially methylated between populations. We also found that all CTCF trans-regulated CpGs fall within a TFBS, confirming our initial hypothesis about the mechanism by which a genetic variant might alter DNA methylation at a distant CpG site. Interestingly, 9 out of the 30 CTCF trans-regulated CpGs fall within a TFBS of CTCF, while the remaining 21 fall within a TFBS specific to other TFs such as YY1, ESR1, or ZNF143. This observation is consistent with a model of pioneer transcription factor activity [96] and suggests that CTCF acts as a pioneer factor that will generate changes in chromatin state that, in turn, will become accessible for binding of secondary factors.

At the genome-wide level, we find that the quantitative impact of DNA methylation on gene expression variation is lower than that reported by some previous studies, possibly reflecting differences in experimental settings and statistical power (e.g., cell types and sample sizes) [23, 65, 84, 89]. For example, a study of 204 healthy newborns detected substantial variation across tissues in the number of genes whose expression levels were associated with DNA methylation, ranging from 596 in fibroblasts to 3838 in T cells [23]. We detected, at the non-stimulated state, 811 eQTM-genes (6% of the total number of expressed genes), a figure that drops to 230 for reQTM-genes across stimulation conditions. However, a limitation of our study is that we measured DNA methylation at the basal state, while gene expression was obtained after 6 h. Studies including a more comprehensive range of epigenetic marks obtained at different time points—in different cell types and tissues originating from individuals of various ancestries—are needed to more precisely understand the interplay between these regulatory elements and quantify their respective roles in the regulation of transcriptional activity.

The detected eQTMs were found to be drastically enriched in genetic control (OR ~ 33.2, P < 1 × 10−326, Fig. 3c), which highlights the coordinated action of genetic and epigenetic factors in driving gene expression variation but raises questions about the causal role of DNA methylation [56]. Despite cautious interpretation of causality in mediation analyses is required [97], our analysis provides a first estimate of the potential direct role of DNA methylation in regulating transcriptional activity, in both resting and stimulated monocytes. At the non-stimulated state, we find that ~ 20% of eQTM-genes show evidence of a causal mediation effect of DNA methylation. Although a similar extent of mediation was found upon immune stimulation (~ 17%), we detected specific patterns upon treatment with viral challenges, where a higher occurrence of positive associations was observed among mediated cases. These findings mostly reflected cases where high levels of DNA methylation were associated with low gene expression in the non-stimulated condition, thus requiring stronger responses to reach high levels of gene expression upon cell perturbation. These trends suggest a major, direct, and context-specific role of DNA methylation in the regulation of immune responses, whose complexity requires further investigation.

Finally, we found that meQTLs, in particular those associated with ancestry-related differences, are enriched in GWAS hits related to immune disorders. This suggests that DNA methylation has an important impact on the cellular activity of monocytes and ultimately affect phenotypic outcomes. Nonetheless, a large fraction of the variance of DNA methylation and gene expression remains unexplained. Additional work is needed to quantify the relative impact of genetic, epigenetic, environmental, and lifestyle factors in driving variation of DNA methylation and gene expression, both in resting and stimulated cells. Furthermore, although the causal mediation analyses presented in this study reinforce the notion that DNA methylation can play a direct role in regulating gene expression in humans [23, 98], monitoring the kinetics of variation in DNA methylation and gene expression after exposure to different infectious agents will broaden our understanding of the interplay between these molecular phenotypes and their impact on endpoint phenotypes.

Conclusion

Our study reveals extensive variation in DNA methylation profiles between individuals and populations, with ancestry-related differences being mostly explained by genetic variation. It also suggests that DNA methylation can have a direct, causal impact on the transcriptional activity of primary monocytes, providing new insight into the nature of the host factors that drive immune response variation in humans.

Materials and methods

Sample collection and monocyte purification

The EvoImmunoPop collection consists of 156 individuals (males between 20 and 50 years old, mean 31.5 years old) from two different ancestries (78 of European and 78 of African descent), who were recruited at the Center for Vaccinology from the Ghent University Hospital (Ghent, Belgium) [48]. For each participant, 300 ml of whole blood was collected into anticoagulant EDTA-blood collection tubes and peripheral blood mononuclear cells (PBMCs) were purified using Ficoll-paque density gradients (#17-1440-03, GE Healthcare). Monocytes were positively selected from purified PBMCs using magnetic CD14 microbeads (#130-050-201, MiltenyiBiotec), as per manufacturer’s instructions. All samples had a monocyte purity higher than 90% with a mean value of 97%.

DNA methylation profiling and data normalization

Genomic DNA was extracted from the monocyte fraction using a phenol/chloroform protocol followed by ethanol precipitation. The DNA was then bisulfite converted, and BC-DNA was then processed using the Illumina Infinium MethylationEPIC BeadChip Kit (Illumina, San Diego, CA) to obtain the methylation profile of each individual at more than 850,000 CpG sites genome-wide.

In total, 184 samples were hybridized with the EPIC array, including 172 unique samples and 12 technical replicates. We removed any technically unreliable probes: (i) potentially cross-hybridizing probes (83,635 probes), (ii) those located on the X and Y chromosomes (17,229 probes), and (iii) probes overlapping SNPs that present a frequency higher than 1% in at least one of the studied populations (206,998 probes). These SNPs were chosen based on our own genotyping dataset, as well as on the 1000 Genomes project [99]. To control for the quality of the probes and samples, we filtered out individuals with > 5% of probes associated with a detection P value > 10−3, and then, probes with a detection P value > 10−3 in one or more individuals (6833 probes). Following this filtering process, 552,141 of the original 866,836 sites on the array were retained.

We calculated methylation levels from raw data, using the R Bioconductor lumi package [100]. Given that the M value has been shown to provide better detection sensitivity than β values at extreme levels of modification [68], we used the M value to run all statistical analysis unless otherwise stated. Note that in some instances of the text and figures, β values are reported for ease of clarity and interpretation. M values were then adjusted for background noise with the normal-exponential using out-of-band probes (noob) from the R Bioconductor minfi package [101]. Next, normalization for color bias was performed using lumiMethyC with the “quantile” method, and for methylated/unmethylated intensity variation using the lumiMethyN with the “ssn” method [100]. Finally, we corrected for technical differences between type I and type II assay designs, by performing beta-mixture quantile normalization [102]. To correct for known batch effects and potential hidden confounders, we used the sva function from the sva Bioconductor package [103] with age as a variable of interest. Additionally, five EUB samples were removed because they presented an excess of hemimethylated sites, leaving 89 EUB and 78 AFB samples. To obtain equal power in the two studied populations, we down-sampled the European group to 78 samples by randomly removing 11 EUB samples, for an overall final cohort of 156 individuals.

Extraction of differentially methylated sites (DMS)

To detect CpG sites presenting statistically different levels of DNA methylation between AFB and EUB, we fitted a linear regression model for each CpG site: M value ~ population + age + two surrogate variables + error, and next applied an empirical Bayes smoothing to the standard errors using the R Bioconductor limma pipeline [104]. P values were adjusted using the Benjamini and Hochberg method. DMS were extracted using a threshold of adjusted P value (< 0.01) and a difference in the mean β value of each population |Δβ| > 5%.

Mapping of methylation quantitative trait loci (meQTLs)

All individuals were genotyped for a total of 4,301,332 SNPs on the Illumina HumanOmni5-Quad BeadChips and went through whole-exome sequencing with the Nextera Rapid Capture Expanded Exome kit, on the Illumina HiSeq 2000 platform, with 100-bp paired-end reads. Details of the processing of genotyping and whole-exome sequencing data, together with imputation using the 1000 Genomes Project imputation panel [99], are reported in ref. [48]. For the meQTL mapping, we filtered out SNPs with a minor allele frequency < 5% in the populations studied and kept a final dataset of 10,278,745 SNPs (i.e., corresponding to the merged genotyping and whole-exome sequencing dataset after imputation; 8,913,090 SNPs in Africans and 6,178,808 SNPs in Europeans). Age, PC1 and PC2 of the genotype matrix, and two surrogate variables, as identified with the sva R package, were used as covariates in the linear model.

We mapped meQTLs using the statistical framework implemented in the MatrixEQTL R package [70]. For local associations (i.e., distance SNP-CpG ≤ 100 kb), we performed two independent mappings using (i) the direct linear model from the MatrixEQTL pipeline and (ii) a Kruskal-Wallis rank test. Associations were considered significant when passing the 5% FDR threshold in both mappings. Two models were considered: merging all individuals and including a binary variable adjusting for ancestry or keeping the two populations separately. To detect all possible independent SNPs regulating methylation at a single CpG site in cis, we regressed out genotypes of all primary cis-meQTLs and then performed cis-meQTL mapping on the regressed methylation data to find secondary cis-meQTLs. We repeated this process in a stepwise fashion until no additional independent cis-meQTLs were detected. This allowed us to refine our local meQTL mapping by detecting all possible independent SNP-CpG associations.

For distant, trans-acting associations (i.e., distance between SNP and CpG ≥ 1 Mb or on different chromosomes), we restricted our analysis to SNPs located in the vicinity of transcription factor (TF) coding genes, to limit the burden of multiple testing. Specifically, we selected (i) all SNPs located less than 10 kb to the TSS of any expressed TF in our dataset and (ii) SNPs detected as cis-eQTLs for these TFs. For each SNP, we only investigated CpG sites that mapped at least 1 Mb from the SNP or located on other chromosomes, using a Kruskal-Wallis rank test.

For both cis- and trans-meQTLs, FDR was computed by mapping meQTLs on 100 datasets with the M values permuted within each population. We then kept, after each permutation, the most significant P value per CpG site, across populations (probe-level FDR). Finally, we computed the FDR associated with different P value thresholds for cis or trans, and subsequently selected the P value threshold that provided a 5% FDR: P = 1 × 10−5 and P = 1 × 10−9 for cis- and trans-meQTLs, respectively.

Investigating the genetic basis of population differences in DNA methylation

We aimed at identifying the proportion of the population differences in DNA methylation that was accounted for by genetic variability. To do so, for the 8459 DMS that were associated with at least one meQTL, we computed the following ratio:

$$ ExpDiff=\frac{\beta \times \Delta DAF}{\Delta Meth} $$

with β reflecting the effect of the derived allele of the meQTL on methylation, ΔDAF the difference in allelic frequencies between Europeans and Africans (DAFEUB − DAFAFB), and ΔMeth the observed difference in the mean levels of DNA methylation between European and African individuals (\( \overline{Meth_{EUB}}-\overline{Meth_{AFB}}\Big) \).

Note that this ratio is not bound to [0:1], as the effect of genetics onto the overall population differences in DNA methylation can be counteracted by opposite effects of independent origins (e.g., environmental factors or non-detected independent genetic effects).

Detecting population-specific meQTLs

We aimed at refining our meQTL mapping by detecting population-specific meQTL effects (i.e., SNPs present at similar frequencies in both populations but having different effect sizes on DNA methylation between populations). To do so, we used a Bayesian model selection approach to identify specific and shared effects for each of the 69,702 CpGs that we detected as being associated with at least one genetic variant. Specifically, for each CpG-SNP pair, we computed the likelihood of three models:

$$ lm\left( Meth\sim SNP+ Pop\right) $$
(i)
$$ lm\left( Meth\sim {SNP}_{EUB}+ Pop\right) $$
(ii)
$$ lm\left( Meth\sim {SNP}_{AFB}+ Pop\right) $$
(iii)

with SNPEUB coded 0,1,2 in EUB individuals and 0 in AFB individuals, and SNPAFB coded 0,1,2 in AFB individuals and 0 in EUB individuals. We next calculated the posterior probability of each model assuming that all models are equally likely a priori. We then set a threshold of 0.9 to consider one of the models as supported by the data. Thus, a meQTL is classified as EUB-specific if the posterior probability of model (ii) is higher than 0.9, or AFB-specific if the probability of model (iii) is higher than 0.9.

GWAS enrichment analyses

We used the NHGRI GWAS catalog [105] to first select all significant SNPs that were significantly associated with a complex trait or disease at a P < 1 × 10−8. Using this set of GWAS hits, we next extracted all SNPs in LD with each of these hits (R2 > 0.8) and classified the resulting final set of 166,248 SNPs according to their parental Experimental Factor Ontology (EFO) term [75].

We then selected all meQTLs in our dataset that passed the P value threshold corresponding to FDR 5% in our initial mapping, and filtered out meQTLs that were in LD (R2 > 0.8) keeping one SNP per independent loci (56,574 independent SNPs). For the resampling set, we considered all SNPs that were initially used for the meQTL mapping and pruned them for LD (R2 > 0.8), yielding a final set of 921,466 SNPs. Resampling was performed using bins of allelic frequencies at intervals of 5%.

Finally, we tested for fold enrichments of meQTLs in GWAS hits, for each of the 17 parental EFO categories [75]. The fold enrichment was calculated by comparing the number of LD pruned-meQTLs that were found to correspond to GWAS hits (or were in LD with GWAS hits) with the expected number estimated through 10,000 resamples. P values associated to the fold enrichment were calculated by fitting a normal distribution to the empirical distribution of our 10,000 resampled sets of SNPs. Confidence intervals were computed using 10,000 resamples by bootstrap. The same procedure was applied when searching for enrichments of meQTLs specifically in GWAS hits related to the 268 traits of the “Immune system disorder” EFO parental term.

Expression quantitative trait methylation (eQTM) analysis

To identify associations between DNA methylation levels and gene expression of nearby genes, we leveraged RNA-sequencing data obtained from the same individuals, both at the non-stimulated state (NS) and in response to four immune stimuli [48]. Briefly, RNA-sequencing was performed on the Illumina HiSeq2000 platform with 101-bp single-read sequencing with fragment size of around 295 bp, and outputs of around 30 million single-end reads per sample were obtained. A total of 763 RNA-sequencing samples from our filtered dataset of 156 donors were analyzed for gene expression profiling, including 156, 151, 153, 148, and 155 samples for the NS, LPS, Pam3CSK4, R848, and IAV conditions, respectively. Details of cell culture, immune stimulation conditions, and RNA-seq processing can be found in ref. [48].

Using the RNA-sequencing data from the NS condition, we mapped eQTMs (i.e., CpGs whose variation is associated with gene expression) in a window of 100 kb around the TSS of each gene (12,578 expressed genes in primary monocytes). The associated P values and the coefficients of correlation between methylation profiles and gene expression were obtained using Spearman’s rank correlation. FDR was computed by mapping eQTMs on 100 datasets with the M values permuted, and kept, after each permutation, the most significant P value per gene (gene-level FDR). We selected the P value threshold that provided a 5% FDR (P = 5 × 10−5).

We also mapped eQTMs in the context of the response to the various stimulations, namely response-QTMs (reQTMs). To do so, the same procedure explained above for the eQTM mapping was followed, using the fold change of expression upon stimulation as a measure of the host response to infection. Specifically, we calculated the difference of the log2 of expression values between the stimulated and non-stimulated states, corrected for the effect of low values of FPKM, for each gene expressed in at least one of the two conditions.

$$ Diff={\log}_2\left(1+{FPKM}_{Stim}\right)-{\log}_2\left(1+{FPKM}_{NS}\right)={\log}_2\left(\frac{1+{FPKM}_{Stim}}{1+{FPKM}_{NS}}\right) $$
$$ FoldChange=\frac{1+{FPKM}_{Stim}}{1+{FPKM}_{NS}}={2}^{Diff} $$

For the mapping of eQTMs and reQTMs, we conducted two separate analyses: merging all individuals and including ancestry as a covariate, or keeping the two populations separately.

Expression quantitative trait loci (eQTL) analysis

We mapped expression quantitative trait loci (eQTLs) using the MatrixEQTL R package [70], leveraging our genotyping and expression data [48]. As for the meQTL mapping, we filtered out SNPs with a minor allele frequency < 5% in the populations studied and kept a final dataset of 10,278,745 SNPs. Age and PC1/PC2 of the genotype matrix were used as covariates in the linear model. Two different models were used: merging all individuals and including ancestry as a covariate, or keeping the two populations separately. We also mapped response quantitative trait loci (reQTLs), using the fold change of expression described above, instead of expression, and the same covariates that we used for the eQTL mapping.

For both eQTLs and reQTLs, FDR was computed by mapping eQTLs/reQTLs on 100 datasets with the expression values permuted within each population. We then kept, after each permutation, the most significant P value per gene, across populations (gene-level FDR). Finally, we computed the FDR associated with different P value thresholds for eQTLs or reQTLs, and subsequently selected the P value threshold that provided a 5% FDR: P = 5 × 10−5 and P = 5 × 10−6 for eQTLs and reQTLs, respectively.

Simulations to infer causality

We simulated different scenarios to infer causal relationships between DNA methylation and gene expression. For each scenario, we started by randomly selecting genomic blocks of 1 Mb each along the genome to keep realistic expectations of genetic structure. We next randomly sampled SNPs in these blocks, which we used to simulate methylation and gene expression data. For example, in a scenario where a genetic variant influences DNA methylation variation that, in turn, actively regulates gene expression (see Fig. 4a), we followed the next steps:

  1. (i)
    $$ {G}_{i\_ std}=\frac{\left({G}_i-\overline{G_i}\right)}{sd\ \left({G}_i\right)} $$
  2. (ii)
    $$ {M}_i=\sqrt{\alpha_i}\times {G}_{i_{std}}+\sqrt{\left(1-{\alpha}_i\right)}\times {\varepsilon}_i $$
  3. (iii)
    $$ {M}_{i\_ std}=\frac{\left({M}_i-\overline{M_i}\right)}{sd\ \left({M}_i\right)} $$
  4. (iv)
    $$ {E}_i=\sqrt{\gamma \times {\beta}_i}\times {M}_{i_{std}}+\sqrt{\gamma \times {\tau}_i}\times {G}_{i_{std}}+\sqrt{\left(1-\gamma \times \left({\beta}_i+{\tau}_i\right)\right)}\times {\zeta}_i $$

where Gi is the genotype of the ith sampled variant and Gi_std the standardized value of its genotype; Mi is the simulated methylation data and Mi_std its standardized methylation value; Ei is the simulated gene expression data; αi is the proportion of variance of Mi that is explained by Gi, and γ is a noise parameter that corresponds to the total proportion of variance of Ei that is explained by Gi and Mi. βi and τi are the proportions of explained variance that are attributable to Gi and Mi respectively (satisfying βi + τi = 1). Finally, εi and ζi are random, normally distributed residuals. Note that in the simulation presented in Fig. 4a, b, we used a gamma of 0.25, so that 75% of the variance of gene expression remained unexplained.

Detection of genetic variants-DNA methylation-gene expression trios

To infer causality between regulatory loci and gene expression variation, we considered eQTLs that were also detected as meQTLs, and, out of this subset, we kept only those for which the meQTL-CpG had previously been identified as an eQTM of the eQTL-gene (Additional file 1: Figure S11). When multiple SNPs or CpGs where present in a trio, we used an elastic net model, to build linear predictors of (i) gene expression based on DNA methylation variability for trios with multiple CpGs and (ii) DNA methylation based on genetic variability for trios with multiple SNPs. These predictors were then used as summary variables for DNA methylation variability (i) or genetic variability (ii). Specifically, the glmnet function from the R package glmnet [106] was used to fit the generalized linear model via penalized maximum likelihood, with an elastic net mixing parameter α of 0.5. The strength of the penalty λ1se was chosen as the largest value of lambda such that the error was within 1 standard deviation of the minimum lambda, when performing k-fold cross validation with the cv.glmnet function. Finally, the generic R function predict was used to build the optimal linear predictor in each case. For the trios presenting more than one SNP, we also used a predictor of gene expression based on genetic variability, as summary variable for the genetic variability, and found no differences in our simulation-based mediation results when compared to building the summary variable from a predictor of DNA methylation (data not shown).

Mediation analyses

For conducting causal mediation analyses, we used a Bayesian approach as implemented in the mediation R package [90]. Briefly, this approach estimates causal effects of a mediating variable M (DNA methylation) on the relationship between an independent variable X (genetics) and a dependent variable Y (gene expression). In this scenario, the global effect of X on Y can be written as ρX → Y = τ + α · β, where τ is the specific effect of X on Y, α the specific effect of X on M, and β the specific effect of M on Y. With this, the product α·β represents the mediation effect of G on Y, through M. The mediate function of the mediation R package was used to compute point estimates for average causal mediation effects, as well as 1000 simulation draws of average causal mediation effects. The empirical distribution of simulated effects was used to fit a normal distribution, which was subsequently used to compute empirical P values for the H0 hypothesis “α·β = 0.” We used the R function p.adjust with method “fdr” to correct at a FDR = 5%.

For comparison purposes with the mediation analyses, we conducted on simulated data a partial correlation approach to test for independence between expression and methylation levels when accounting for genetic variability. We used the pcor.test function from the R package ppcor [107] to compute P values of the partial correlation between simulated expression and methylation data.