Integration of DNA methylation patterns and genetic variation in human pediatric tissues help inform EWAS design and interpretation
The widespread use of accessible peripheral tissues for epigenetic analyses has prompted increasing interest in the study of tissue-specific DNA methylation (DNAm) variation in human populations. To date, characterizations of inter-individual DNAm variability and DNAm concordance across tissues have been largely performed in adult tissues and therefore are limited in their relevance to DNAm profiles from pediatric samples. Given that DNAm patterns in early life undergo rapid changes and have been linked to a wide range of health outcomes and environmental exposures, direct investigations of tissue-specific DNAm variation in pediatric samples may help inform the design and interpretation of DNAm analyses from early life cohorts. In this study, we present a systematic comparison of genome-wide DNAm patterns between matched pediatric buccal epithelial cells (BECs) and peripheral blood mononuclear cells (PBMCs), two of the most widely used peripheral tissues in human epigenetic studies. Specifically, we assessed DNAm variability, cross-tissue DNAm concordance and genetic determinants of DNAm across two independent early life cohorts encompassing different ages.
BECs had greater inter-individual DNAm variability than PBMCs, and highly variable CpGs were more likely to be positively correlated between the matched tissues than less variable CpGs. These sites were enriched for CpGs under genetic influence, suggesting that a substantial proportion of DNAm covariation between tissues can be attributed to genetic variation. Finally, we demonstrated the relevance of our findings to human epigenetic studies by categorizing CpGs from published DNAm association studies of pediatric BECs and peripheral blood.
Taken together, our results highlight a number of important considerations and practical implications in the design and interpretation of EWAS analyses performed in pediatric peripheral tissues.
Keywords: DNA methylation; Genetic variation; Surrogate tissues; Peripheral blood leukocytes; Buccal epithelial cells; Illumina 450K array; Pediatric
BEC: buccal epithelial cell
PBMC: peripheral blood mononuclear cell
EWAS: epigenome-wide association study
G × E: gene-by-environment
mQTL: methylation quantitative trait loci
450K array: Infinium HumanMethylation450 array
BMIQ: Beta MIxture Quantile dilation
PCA: principal component analysis
FDR: false discovery rate
Epigenome-wide association studies (EWASs) are becoming increasingly popular, in part due to their potential to enhance our understanding of the determinants of health and disease, including potential early life embedding of experiences and exposures and their association with later life outcomes [1, 2, 3, 4, 5, 6, 7]. The term “epigenetics” describes mitotically heritable modifications of DNA and its regulatory components, including chromatin and non-coding RNA, that potentially modulate cellular states or fate through gene expression changes, without changing the DNA sequence itself [8, 9, 10]. DNA methylation (DNAm), which involves the covalent attachment of a methyl group to a cytosine primarily at cytosine–phosphate–guanine (CpG) dinucleotides, is the best-studied chromatin mark in human populations due to its relative stability and ease of measurement with quantitative array-based methods [11, 12]. To date, EWASs have identified differential DNAm across a broad range of contexts including disease states, genetic background and environmental exposures, thereby providing evidence for the potential contribution of DNAm in mediating gene-by-environment (G × E) interactions [1, 13, 14].
Given that tissue specificity is an integral feature of epigenetic profiles, as different tissues and cell types acquire distinct epigenomes during differentiation, the selection of tissue source is a key consideration in the careful design and interpretation of EWAS analyses [15, 16, 17]. The collection of a disease-relevant target tissue allows for the direct assessment of epigenetic associations that may be implicated in the underlying phenotypic or disease biology. In certain cases, readily accessible peripheral tissues may represent the target tissue; for example, peripheral blood mononuclear cells (PBMCs) can be used to investigate DNAm associations with immune or inflammatory phenotypes [4, 18, 19, 20]. However, in many cases, the target tissue, such as brain, muscle or adipose tissue, may be impossible or very difficult to collect from living individuals, or to obtain at sufficient quality for analysis from postmortem samples. Easily accessible peripheral tissues are therefore often used in human epigenetic studies for biomarker discovery in lieu of target tissues that are difficult to collect. This is particularly relevant to pediatric cohorts, in which biopsy specimens with invasive collection procedures or postmortem samples are less common than in adult populations. As such, more readily accessible tissues with minimally invasive collection procedures, such as cord blood, saliva, buccal epithelial cells (BECs) or PBMCs, are widely used tissue source materials for early life EWASs. The use of pediatric tissues in DNAm analyses is further complicated by the fact that widespread alterations occur in tissue-specific DNAm patterns during development, therefore conferring additional complexity in the selection of appropriate source material for early life DNAm studies [21, 22].
Furthermore, changes in cell composition within a tissue are a potential source of confounding in EWAS, as shown for a number of DNAm associations, including changes during development and certain environmental exposures such as smoking [1, 23, 24, 25, 26, 27].
Currently, two major focal points in human epigenetic research are to elucidate the tissue specificity of DNAm patterns with respect to individual CpGs and to assess inter-individual variation within a single tissue [21, 28, 29, 30]. At a population level, a number of studies have examined the concordance of DNAm patterns across multiple tissues [20, 29, 30, 31, 32, 33, 34]. Findings have shown that beyond tissue-specific differences in absolute DNAm measures, inter-individual DNAm variability also varies by tissue type [20, 31]. For example, previous work by our group has shown that BECs have greater DNAm variability than matched PBMCs at both the genome-wide level and at individual CpGs. Moreover, CpG sites with higher DNAm variability tend to be more correlated between matched tissues [29, 30, 31, 34]. Although these results provide important insights into the comparability of DNAm measures across matched tissues, the analyses to date have been conducted in adult tissues, thereby limiting their relevance to DNAm profiles from pediatric samples. As previous studies have demonstrated that developmental changes in blood DNAm patterns tend to be more pronounced and occur more rapidly in childhood, the examination of DNAm concordance and variability in pediatric tissues represents an important and currently missing step in our understanding of EWAS associations from pediatric peripheral tissues [21, 22].
Genetic variation represents an additional contributor to DNAm patterns in tissues, with genetic influences accounting for approximately 20–80% of DNAm variance within a tissue [35, 36, 37, 38, 39, 40]. Methylation quantitative trait loci (mQTL), sites at which DNAm is associated with genetic variation, are present across the genome and are often consistent across tissues, ancestral populations and developmental stages [41, 42, 43, 44]. Notably, genetically influenced sites of inter-individual DNAm variation, which can co-occur across tissues, may be biologically informative. For example, allele-specific DNAm of the FK506 binding protein 5 (FKBP5) gene, which has been associated with risk of developing stress-related psychiatric disorders, responds to glucocorticoid stimulation in a similar way in peripheral blood cells and neuronal progenitor cells. Within a particular tissue, such as blood, mQTL are often stable across development [43, 46]. Moreover, approximately 75% of the inter-individual regional DNAm variance within a single tissue can be best described by G × E models. As such, delineating the contribution of genetic influences to tissue-specific DNAm may help clarify the interpretation of EWAS associations.
Given that early life development brings about sizable changes to DNAm patterns, it is important to examine DNAm variability and concordance between peripheral tissues, as well as genetic influences on early life DNAm patterns, in childhood [21, 22]. To this end, we used matched PBMC and BEC samples, two commonly used peripheral tissues in EWAS, from two independent early life cohorts in order to identify (a) differences in inter-individual variability and concordance of DNAm between these tissues and (b) genetic contributions to these patterns at the site-specific level. Our results showed that genome-wide DNAm variability differed between tissues, with BECs exhibiting greater inter-individual DNAm variability over PBMCs. Moreover, we found that highly variable CpGs were more likely to be positively correlated between matched tissues and enriched for DNAm sites under genetic influence. Finally, we demonstrated the relevance of our findings to EWAS analysis by categorizing DNAm associations that were previously identified in pediatric BECs and peripheral blood. Collectively, these findings highlighted a number of potential insights and considerations for the appropriate design and interpretation of EWAS analyses performed in commonly used peripheral tissues of pediatric samples.
Study cohorts and DNAm data processing
Table 1 Sample characteristics for C3ARE and GECKO cohorts (columns: age range (years) at BEC collection (mean); age range (years) at PBMC collection (mean))
C3ARE: n = 16 total (50% F)
GECKO: n = 79 total (46% F)
DNAm data, as measured across ~ 485,000 CpGs by the Illumina 450K array, were filtered down to overlapping 419,507 sites which passed independent quality control measures in both cohorts. Each 450K dataset was normalized to remove probe type differences and adjusted for cell type heterogeneity in each tissue using established bioinformatic correction methods [34, 48, 49, 50, 51]. Genetic variants were measured genome-wide using the Illumina Infinium PsychChip. Following probe filtering for low-quality probes, 550,200 and 547,662 SNP probes remained for analysis in C3ARE and GECKO, respectively. We used these corrected DNAm and genotyping data of matched PBMC and BEC samples from both cohorts to assess inter-individual DNAm variability, DNAm concordance across tissues and genetic influence on DNAm, in order to gain insight into DNAm variation in these commonly used pediatric peripheral tissues.
BEC DNAm had significantly greater inter-individual variability than PBMC DNAm
Apart from tissue-specific differences in reference range, we also observed a cohort-specific difference in DNAm variability. Specifically, CpGs in GECKO had a significantly greater median reference range than C3ARE CpGs in both tissues (Wilcoxon rank-sum test, p values = 2.2 × 10^−16). In BECs, the median reference range was 1.6% higher in GECKO than in C3ARE, and in PBMCs it was greater by 1.3%. This difference remained significant when GECKOsub was used in lieu of GECKO (BEC difference = 1.2%, PBMC difference = 1.1%, Wilcoxon rank-sum test, p values = 2.2 × 10^−16), suggesting that these cohort-specific DNAm variability differences occurred irrespective of sample size and may be related to age-associated increases in DNAm variability, as previously described [53, 54, 55, 56, 57, 58].
Variable CpGs were more highly correlated between tissues
We next sought to investigate DNAm variability and concordance at individual CpGs. Specifically, we aimed to identify “informative sites,” which we defined as CpGs that are both variable across individuals and highly correlated between BECs and PBMCs, using a previously described method. Such CpGs may be predictive of PBMC DNAm when measured in BECs or vice versa. To be classified as informative, i.e., variable and concordant, a CpG was required to have a reference range ≥ 5% in both tissues and to meet the minimum correlation coefficient between tissues of 0.47 in GECKO samples and 0.32 in C3ARE samples, as determined by a beta mixture model run on highly variable CpGs in each cohort. Overlapping CpGs that met these criteria in both cohorts resulted in a set of 8140 informative sites. Of note, we observed a greater than expected by chance overlap (3682 out of 8140 sites, 45%, 10,000 permutations, p < 1 × 10^−4) between our set of informative sites and informative CpGs previously identified between matched samples from adult brain and blood tissues. Visualization of our six most correlated informative sites revealed continuous distributions of positively correlated DNAm values between the tissues, as expected (Fig. 2b). However, the most variable informative sites exhibited discrete distributions with 2–3 distinct clusters, rather than a typical continuous distribution, suggesting that this set may be enriched for CpGs under genetic influence (Fig. 2b).
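The two criteria for an informative site can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this section: the reference range is taken here as the spread between the 10th and 90th percentiles of beta values, a Pearson coefficient stands in for whichever correlation measure was actually used, and the function names and example values are hypothetical.

```python
from math import sqrt

def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers (0 <= q <= 1)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (pos - lower) * (xs[upper] - xs[lower])

def reference_range(betas, low=0.10, high=0.90):
    """ASSUMED definition: spread between the 10th and 90th percentiles."""
    return quantile(betas, high) - quantile(betas, low)

def pearson(x, y):
    """Pearson correlation (an assumption; the coefficient type is not given)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / sqrt(sxx * syy)

def is_informative(bec, pbmc, min_corr, min_range=0.05):
    """Variable in both tissues AND correlated between them."""
    variable = (reference_range(bec) >= min_range
                and reference_range(pbmc) >= min_range)
    return variable and pearson(bec, pbmc) >= min_corr

# Hypothetical matched beta values for one CpG in six individuals.
bec = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
pbmc = [0.15, 0.22, 0.35, 0.41, 0.55, 0.58]
print(is_informative(bec, pbmc, min_corr=0.47))
```

In this sketch the cohort-specific correlation cutoff (0.47 for GECKO, 0.32 for C3ARE) is passed in as `min_corr`; deriving it from a beta mixture model is outside the scope of the example.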
Genetic variation contributed to tissue concordance
After quality control processing and variability filtering of the C3ARE DNAm and genotyping data, 16,138 and 17,563 SNP-CpG pairs could be tested for validation in PBMCs and BECs, respectively (mQTL that were not tested for validation lacked genetic variability in the C3ARE cohort). This resulted in a total of 1871 PBMC-specific, 3705 BEC-specific and 1097 shared-tissue validated cis-mQTL (FDR ≤ 0.05 and DNAm change per allele ≥ 2.5%), which exhibited highly consistent effect sizes between GECKO and C3ARE cohorts (Spearman rho = 0.92, p = 2.2 × 10^−16) (Fig. 3a, b). The overlap of validated cis-mQTL between tissues was greater than expected by chance (10,000 permutations, p value < 1 × 10^−4) (Fig. 3a and Additional file 3: Fig. S6). This suggested that genetic influences contributed to covariation between tissues. Finally, we found a significant overlap of our 1871 PBMC-specific and 1097 shared-tissue cis-mQTL with previously published mQTL hits from whole blood samples of 7-year-old children in the ARIES cohort (1810 out of 2968 sites, 61%, 10,000 permutations, p < 1 × 10^−4), further supporting our mQTL findings.
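The two-part validation filter (FDR ≤ 0.05 and DNAm change per allele ≥ 2.5%) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: the per-allele effect is taken as the absolute least-squares slope of beta value on allele count, FDR is controlled with the Benjamini-Hochberg procedure, p values are supplied by the caller, and all names and data are hypothetical.

```python
def slope_per_allele(dosages, betas):
    """Least-squares slope of DNAm (beta) on allele count (0, 1 or 2)."""
    n = len(dosages)
    mx, my = sum(dosages) / n, sum(betas) / n
    sxx = sum((d - mx) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("SNP has no genetic variability in this sample")
    sxy = sum((d - mx) * (b - my) for d, b in zip(dosages, betas))
    return sxy / sxx

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_mqtl(pairs, fdr=0.05, min_effect=0.025):
    """pairs: list of (pair_id, dosages, betas, pvalue) for SNP-CpG pairs.

    Keeps pairs passing both the FDR cutoff and the ASSUMED effect-size
    definition (absolute beta change per allele of at least 2.5%).
    """
    qvalues = benjamini_hochberg([p[3] for p in pairs])
    hits = []
    for (pair_id, dosages, betas, _), q in zip(pairs, qvalues):
        if q <= fdr and abs(slope_per_allele(dosages, betas)) >= min_effect:
            hits.append(pair_id)
    return hits

# Hypothetical SNP-CpG pairs: genotype dosages, beta values, association p value.
geno = [0, 1, 2, 0, 1, 2]
pairs = [
    ("pair_A", geno, [0.30, 0.35, 0.40, 0.30, 0.35, 0.40], 0.001),  # strong, large
    ("pair_B", geno, [0.30, 0.31, 0.32, 0.30, 0.31, 0.32], 0.001),  # strong, small
    ("pair_C", geno, [0.30, 0.35, 0.40, 0.30, 0.35, 0.40], 0.8),    # not significant
]
print(select_mqtl(pairs))
```

Requiring a minimum effect size in addition to statistical significance removes associations that are detectable but too small to be distinguished reliably from technical noise on the array.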
We next sought to characterize our validated cis-mQTL by their genomic localization and functional features. Firstly, the 4980 unique CpGs associated with the validated cis-mQTL showed a greater than expected by chance enrichment in intergenic regions and were depleted in intragenic and north shelf regions (2–4 kb upstream of CpG islands) (Additional file 4: Fig. S7A, FDR ≤ 0.05). In particular, both the CpGs associated with tissue-specific cis-mQTL and the CpGs associated with shared-tissue cis-mQTL were significantly enriched at intergenic and intragenic regions and showed significant depletion at promoters and CpG islands, where DNAm levels tend to be low and there is limited inter-individual variation (Additional file 4: Fig. S7B and C, FDR ≤ 0.05). However, tissue-specific mQTL CpGs exhibited significant enrichment at south shelf regions (2–4 kb downstream of CpG islands), whereas shared-tissue mQTL CpGs were significantly enriched in north shores (0–2 kb upstream of CpG islands) but depleted in north shelf regions (Additional file 4: Fig. S7B and C, FDR ≤ 0.05). In addition, we found that CpGs associated with shared-tissue cis-mQTL exhibited a greater than expected by chance enrichment of informative CpGs (687 out of 812 unique CpGs in shared-tissue cis-mQTL, 85%, 10,000 permutations, p < 1 × 10^−4), further substantiating that site-specific DNAm correlation between tissues is influenced, in part, by genetic variation (Additional file 5: Fig. S8).
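The "greater than expected by chance" overlaps reported with 10,000 permutations can be illustrated with a simple resampling sketch. The null model is an assumption, since this section does not describe it: random CpG sets of the same size as the set of interest are drawn from a background of all tested CpGs, and the empirical p value is the fraction of draws whose overlap is at least as large as the observed one. Function and variable names are hypothetical.

```python
import random

def overlap_permutation_test(set_a, set_b, background, n_perm=10000, seed=0):
    """Empirical test of whether set_a overlaps set_b more than chance.

    ASSUMED null: random sets the size of set_a drawn without replacement
    from `background`. Returns (observed_overlap, empirical_p).
    """
    rng = random.Random(seed)
    background = list(background)
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    exceed = 0
    for _ in range(n_perm):
        draw = rng.sample(background, len(set_a))
        if sum(1 for cpg in draw if cpg in set_b) >= observed:
            exceed += 1
    return observed, exceed / n_perm

# Hypothetical example: 1000 background CpGs, two sets of 50 sharing 30 CpGs.
background = range(1000)
mqtl_cpgs = range(0, 50)
informative_cpgs = range(20, 70)
observed, p = overlap_permutation_test(mqtl_cpgs, informative_cpgs, background,
                                       n_perm=2000)
print(observed, p)
```

When no random draw reaches the observed overlap, the returned fraction is 0 and the result is conventionally reported as p < 1/n_perm, which corresponds to p < 1 × 10^−4 for 10,000 permutations.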
Tissue-specific differential DNAm was consistent across cohorts
Differentially methylated sites were common in previously published EWAS findings
We then applied this categorization scheme to previously reported EWAS findings from pediatric BEC or PBMC tissues to provide an example of how the classification of CpGs can aid in the interpretation of such studies. We selected five published studies that used the 450K array in pediatric BECs or peripheral blood to assess DNAm variation associated with puberty, aging in early life, childhood psychotic symptoms, fetal alcohol spectrum disorder and autism spectrum disorder [61, 62, 63, 64, 65]. By implementing our CpG classification scheme on their respective lists of significant EWAS hits, we found that cis-mQTL, as identified here, accounted for 0.02–13.5% of significant CpGs reported in these five studies. Differentially methylated CpGs comprised the most represented type of CpG across all five studies, with only one study demonstrating an overlap of 24.3% with our identified informative sites (Fig. 5b; Additional file 6: Table S2). This suggested that the majority of DNAm associations identified in these EWASs were likely specific to peripheral blood or BECs, rather than shared across tissues. Finally, we tabulated our CpG classifications across all 419,507 DNAm probes assessed in our study in order to serve as a resource for researchers wishing to compare their own EWAS results (Additional file 7). Collectively, these findings reveal the importance of considering DNAm variability and concordance between tissues, as well as genetic influences on these patterns, when interrogating and interpreting EWAS findings from pediatric peripheral tissues.
In this study, we comprehensively compared genome-wide DNAm in BECs and PBMCs using matched samples from two independent pediatric cohorts. Moreover, we leveraged the strength of paired DNAm and genotyping profiles to define cis-mQTL across the genome and assess the influence of local genetic variation on DNAm variability and tissue concordance. Our findings showed that at the genomic and site-specific level, BECs had greater inter-individual DNAm variability over PBMCs, with highly variable CpGs more likely to be positively correlated between the matched tissues. In our subsequent cis-mQTL analyses, we observed distinct genetic influences on tissue-specific DNAm and confirmed that a sizeable proportion of shared DNAm patterns between tissues resulted from allelic variation. Finally, we provided a classification framework for the post hoc examination of EWAS associations and examined the representation of our categorized CpGs in published EWAS findings performed in pediatric BECs and PBMCs.
Our findings highlighted extensive differences in DNAm patterns between tissues and thus the importance of tissue selection when designing an EWAS. To a large extent, EWAS tissue selection in early life cohorts is guided by two factors. Firstly, ease of collection is particularly important in this age range and may restrict tissue availability. Buccal swabs are less invasive than venipuncture, and the latter contributes to participation refusal in pediatric cohorts. Secondly, the relevance of the tissue to the phenotype or exposure being tested represents an important consideration for all EWAS analyses, irrespective of age. As peripheral blood represents a circulating tissue with broad immune and inflammatory functions, it might be more relevant to a wider range of health phenotypes than BECs. However, another hypothesis posits that tissues that arise from the same germ layer are more epigenetically similar and thus might be a preferred choice for surrogate tissue selection. For example, it has been proposed that BEC DNAm may reflect brain DNAm more closely than blood DNAm does, as both BECs and brain derive from the ectodermal germ layer [32, 65]. Adding to the complexity of this issue, we found that BEC DNAm had significantly greater inter-individual variability than PBMC DNAm at the genome-wide level and at the site-specific level, a finding consistent with observations in adult BECs and PBMCs. Having a higher proportion of variable CpGs might be desirable for EWAS analyses, as testing any tissue with little inter-individual DNAm variation would naturally limit effect sizes. From this perspective, BECs might represent a more appropriate choice of peripheral tissue than PBMCs for population-based epigenetic studies.
However, it is worth noting that while we did correct for cellular heterogeneity in both tissues using bioinformatic deconvolution approaches, the higher proportion of variable CpGs in BECs may, to some extent, be attributed to the greater diversity of cell types or residual cellular heterogeneity in BECs compared with PBMCs (i.e., epithelial and hematopoietic in the former and entirely hematopoietic in the latter).
Taking advantage of our matched sample design, we were able to rigorously interrogate the extent of correlation between DNAm signatures of BECs and PBMCs. CpGs with greater variability were more likely to be correlated between matched tissues, as best exemplified by the 8140 informative sites we identified. These may aid in the inference of unmeasured PBMC or BEC DNAm (when the other tissue is measured) as well as in the prioritization of sites for cross-tissue replication. In the latter case, cross-tissue replication typically involves the generation of candidate gene lists in accessible tissues for validation in less available tissues, such as postmortem samples, an approach which can boost confidence in identified associations [69, 70, 71]. There was a substantial overlap (45%) between our informative sites and those previously published in matched adult blood and brain tissues. However, we found only 1.9% of total measured CpGs to be informative by our measures and thresholds, as compared to 9.7% found in the previous analyses of adult samples from our laboratory. These quantitative differences might be due to a number of reasons, with the most likely being that the blood–brain informative sites were identified using a single cohort, while our blood–buccal informative sites were filtered down to sites that were common across both GECKO and C3ARE cohorts; other explanations may be methodological (i.e., slight differences in analytical thresholds derived from empirical testing) or biological (i.e., blood may be more epigenetically similar to brain tissue than to BECs, resulting in more informative sites). An in-depth analysis of such cross-tissue comparisons between pediatric and adult samples, ideally by means of longitudinal sampling of DNAm, may help elucidate such sources of tissue variation across the lifespan.
Integration of genetic and epigenetic information may further clarify the relative contribution of genetic and environmental factors to inter-individual DNAm variability. We found that genetic variation contributed both to inter-individual DNAm variation within a tissue and to common DNAm variation between tissues. This is in general agreement with previous findings showing that many, but not all, mQTL have consistent effects across tissues and human populations and are generally depleted in genomic regions which tend to have low DNAm variability, such as promoters and CpG islands, but enriched in more variable intergenic and intragenic regions [41, 43, 44, 46, 72]. It is currently unclear why we observed more BEC-specific mQTL in our matched design as compared to PBMC-specific or cross-tissue mQTL. The most likely explanation is that BECs contained more validated cis-mQTL due to greater inter-individual DNAm variability. It is also tempting to speculate that allelic variation contributes more strongly to DNAm in BECs than in PBMCs, because blood DNAm might be more plastic and responsive due to the role of blood cells in the immune system [73, 74, 75]. For example, changes in genome-wide transcriptional programs and DNAm profiles are observed in response to an inflammatory stimulus in blood leukocytes, which could be incongruent with a high degree of fixed, genetically driven DNAm patterns in these cells [73, 74, 75]. In a more complicated paradigm, DNAm variation may be best explained by the interaction of both genetic and environmental factors (G × E interactions), as previously demonstrated in blood-based DNAm profiles [45, 47].
As touched upon in several recent reviews, genetic contribution to DNAm might be more prominent in shaping the DNA methylome than initially anticipated, and thus affect the analysis and interpretation of EWAS findings [1, 76]. To illustrate this, we tested for the presence of our categorized CpGs in published EWAS findings. Notably, we found that while most identified EWAS associations may be distinct to the tissue in which they were examined, in some instances, these associations may be reflected across multiple tissues and/or under genetic influence. For example, we observed CpGs associated with autism spectrum disorder to contain the highest proportion of cis-mQTL. While there might be a number of reasons for this, it is possible that the proportion of genetically influenced CpGs found in an EWAS may be proportional to the heritability of the phenotype under examination, although such hypotheses will require rigorous testing in large cohorts across a diverse spectrum of phenotypes with and without heritable contributions. Furthermore, it is difficult to discern whether having a high proportion of mQTL in EWAS analyses is favorable or not. Previous work has shown that the majority of variably methylated regions are best described by an interaction of both genetic and environmental factors [47, 71]. Emerging findings from neonatal blood samples have additionally shown that the bulk of variable DNAm sites are best accounted for by either additive (G + E) or interaction (G × E) models, suggesting that environmental influences on DNAm may be further delineated with the inclusion of genotype information [47, 77]. As such, any mQTL CpGs found in an EWAS may offer alternate interpretations to phenotypic associations with DNAm and would require further investigation for potential gene-environment effects.
It is worth noting that our study had a few inherent limitations. Firstly, in both GECKO and C3ARE cohorts, PBMCs were collected from individuals at a slightly later time point than BECs, resulting in an age-related difference (0–1.5 years for C3ARE; 0.5–2.3 years for GECKO) between matched tissues, which may have affected analyses of DNAm variability. However, we anticipate that age-related differences in DNAm variability are relatively small compared to tissue-specific differences, as our findings are consistent with previous work performed on age-matched tissues in adults. Another limitation was the relatively small sample size of our cohorts, which may have inflated type II error rates. We also chose not to assess distal genetic effects on DNAm (i.e., trans-mQTL) due to the increased multiple testing burden, but rather prioritized cis-mQTL, as previous work has suggested these may be more functionally linked to nearby CpGs [47, 59, 60]. As well, previous work in blood has shown that the proportion of DNAm variance explained by trans-mQTL is much lower than that of cis-mQTL. For these reasons, we examined SNPs that were directly measured and not imputed, as performed in other pediatric mQTL analyses, within a 5 kb window [47, 78]. As a result, we likely underestimated the number of mQTL present in our tissues. Future work using large cohorts will be required to clarify the contribution of distal genetic variants to DNAm in other peripheral tissues. In addition, our mQTL findings were limited by the coverage of the 450K array, which interrogates less than 2% of all DNAm sites across the genome, although this includes 96% of all mapped CpG islands. As such, it is generally biased toward CpG-dense promoter regions, which typically have limited inter-individual and inter-tissue variation [21, 79, 80, 81].
Finally, while we found the Houseman blood deconvolution method to perform well in our cohorts, evidence of substantial DNAm changes across the lifespan, especially during early childhood, necessitates the refinement of cell deconvolution methods, including adjusting for age, to allow for more nuanced estimation of cell types in early life [21, 22, 58].
The work here presents a comprehensive assessment of local genetic influences on DNAm in matched BECs and PBMCs, as well as a characterization of DNAm variability and concordance between paired pediatric tissues. Moreover, our results highlight a number of possible considerations for EWAS analyses, including the potential enrichment of mQTL findings following prefiltering to variable CpGs to reduce the multiple testing burden, and possible strategies to facilitate in-depth curation of EWAS hits. Such post hoc examination of significant differentially methylated CpGs will hopefully support the interpretation of EWAS findings and aid in the prioritization of candidate associations for functional validation.
Study cohorts and tissue samples
Matched tissues were obtained from subsets of two separate pediatric cohorts. Specifically, samples from the previously described C3ARE (Cleaning, Carrying, Changing, Attending, Reading and Expressing) cohort were collected from 16 individuals (8 females; 50%) aged 3–5 years (age range 3.6–4.2 years (BEC) and 4.5–5.2 years (PBMC)) from Vancouver, British Columbia. The GECKO (Gene Expression Collaborative Kids Only) cohort samples comprised 79 individuals (36 females; 46%) aged 6–13 years (age range 6–11 years (BEC) and 7–13 years (PBMC)), also from Vancouver, British Columbia. Birth dates were not available for all GECKO participants; age in years was recorded at the BEC sample collections. In both cohorts, the majority of BEC samples were collected at the first visit and PBMCs were collected at a later date. In the C3ARE cohort, follow-up visits ranged from 7 days to 1.5 years, with three pairs of matched BECs and PBMCs being collected on the same day. In the GECKO cohort, the follow-up visits at which peripheral blood was collected ranged from 6 months to 2.3 years after the initial visit. Demographic descriptors of both cohorts are provided in Table 1. All experimental procedures were conducted in accordance with institutional review board policies at the University of British Columbia. Written informed consent was obtained from a parent or legal guardian and assent was obtained from each child before study participation. For both cohorts, BECs were collected using Isohelix Buccal Swabs (Cell Projects Ltd., Kent, UK) and stabilized with Isohelix Dri-Capsules for storage at room temperature prior to DNA extraction, as previously described. Whole blood was collected into Vacutainer® CPT™ Cell Preparation Tubes (Becton, Dickinson and Company, NJ, USA) and PBMCs were isolated following centrifugation, washing and resuspension in R10 media (Sigma-Aldrich, MO, USA), as previously described.
PBMC pellets were frozen and stored at − 80 °C until DNA extraction.
DNA isolation and DNA methylation arrays
Genomic DNA from stabilized buccal samples was isolated using Isohelix Buccal DNA Isolation Kits (Cell Projects Ltd., Kent, UK) and was purified and concentrated using DNA Clean & Concentrator (Zymo Research, CA, USA). Genomic DNA was extracted from PBMC pellets using the DNeasy kit (Qiagen, MD, USA). DNA yield and purity were assessed using a Nanodrop ND-1000 (Thermo Fisher Scientific, MA, USA). Bisulfite conversion of DNA (750 ng) was performed using the Zymo Research EZ DNA Methylation Kit (Zymo Research, CA, USA). Samples were subsequently randomized, and 160 ng of bisulfite-converted DNA was applied to the Illumina Infinium HumanMethylation450 BeadChip (450K) array, as per manufacturer’s protocols (Illumina, CA, USA).
DNA methylation array data quality control and normalization
Data from each cohort were analyzed separately. Specifically, raw intensity values from the DNAm arrays were imported into Illumina GenomeStudio V2011.1 software and subjected to initial quality control checks for array staining, extension and bisulfite conversion, followed by color correction and background adjustment using control probes contained on the 450K array. Data were exported from GenomeStudio as beta values, which represent the estimated DNAm level based on a ratio of intensities between methylated and unmethylated alleles, with beta values ranging from 0 (unmethylated) to 1 (fully methylated). Subsequent processing and analysis were performed in R Version 3.2.1 (http://www.r-project.org). Profiles from 65 probes targeting single nucleotide polymorphisms (SNPs) were used to ensure matched tissue samples originated from the same individual. The 65 SNP probes were subsequently filtered out of the dataset. Since the cohorts were not equally matched for sex, we removed sex chromosome probes (11,648) from both datasets. Additional probe filtering was performed in which poorly performing probes, including those with detection p values greater than 0.01 or with missing beta values in more than 2% of samples, were removed (14,400 C3ARE, 13,374 GECKO). Reannotation of the Illumina 450K array was used to filter probes that are known to be polymorphic at the target CpG. Probes with non-specific in silico binding to the sex chromosomes were also removed. Final probe count after quality control probe filtering was 429,494 probes for C3ARE and 430,581 probes for GECKO. Following quality control processing, quantro indicated that quantile normalization was inappropriate, as the global DNAm distributions differed markedly between the two distinct tissues. Beta Mixture Quantile dilation (BMIQ) normalization was performed to remove differences between Type I and Type II probes on the 450K array, yielding normalized DNAm values.
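The beta value and the probe-level filters described above can be sketched in a few lines of Python. This is an illustration only, since the analysis itself was run in GenomeStudio and R. The offset of 100 in the denominator is the stabilizing constant conventionally used for Illumina beta values; treating a probe as failed when any single sample exceeds the detection p value threshold is an assumption, and the probe identifiers are hypothetical.

```python
def beta_value(methylated, unmethylated, offset=100):
    """Beta = M / (M + U + offset), with negative intensities floored at 0."""
    m = max(methylated, 0)
    u = max(unmethylated, 0)
    return m / (m + u + offset)

def failing_probes(betas, detection_p, max_det_p=0.01, max_missing=0.02):
    """Return the set of probe IDs that fail quality control.

    betas:       dict probe_id -> list of per-sample beta values (None = missing)
    detection_p: dict probe_id -> list of per-sample detection p values
    ASSUMPTION: one sample above max_det_p is enough to fail a probe.
    """
    failed = set()
    for probe, values in betas.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        poorly_detected = any(p > max_det_p for p in detection_p[probe])
        if missing_fraction > max_missing or poorly_detected:
            failed.add(probe)
    return failed

# Hypothetical data for three probes measured in four samples.
betas = {
    "probe_1": [0.81, 0.79, 0.83, 0.80],
    "probe_2": [0.10, None, 0.12, 0.11],   # missing in 25% of samples
    "probe_3": [0.45, 0.50, 0.47, 0.52],
}
detection_p = {
    "probe_1": [0.001, 0.002, 0.001, 0.003],
    "probe_2": [0.001, 0.001, 0.001, 0.001],
    "probe_3": [0.001, 0.050, 0.001, 0.001],  # one poorly detected sample
}
print(sorted(failing_probes(betas, detection_p)))
```

Probes returned by `failing_probes` would be dropped from every sample so that both tissues are compared on the same set of CpGs.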
Cell type correction of DNA methylation data
The effects of cellular heterogeneity on DNAm measures were removed from PBMC and BEC samples in both cohorts. Specifically, blood cell type proportions were estimated for the PBMC samples using the established Houseman blood deconvolution method [49, 50]. This blood deconvolution algorithm has been previously used in pediatric blood DNAm profiles, where it was shown to perform reasonably well. To test whether this was also true in our GECKO and C3ARE samples, we assessed the appropriateness of the Houseman probeset panel in our pediatric blood samples compared to adult blood profiles [49, 86]. We downloaded the original adult blood DNAm dataset (Reinius) on which the Houseman method was trained (Accession# GSE35069) and filtered to 500 probes used in the algorithm that were common across all GECKO, C3ARE (following preprocessing) and Reinius samples. Given that this Houseman signature comprises 600 statistically related probes, 500 of which passed quality control in both GECKO and C3ARE, we chose to use two commonly used analytical approaches, principal component analysis (PCA) and hierarchical clustering, to determine the relationship of methylation states between cohorts in the data. PCA showed an overlap of child and adult PBMC profiles in the two top-ranking PCs (accounting for 98% of the DNAm variance of the Houseman probeset panel) and similarly, adult samples did not cluster separately from child samples in the hierarchical clustering analysis. Collectively, these findings suggested that DNAm at CpGs used in the Houseman deconvolution signature was similar between adult and child blood samples (Additional file 8: Fig. S1). Given that no cell deconvolution algorithm for buccal tissues exists and that buccal swabs, like saliva, are predominantly composed of BECs and leukocytes, we used a saliva-based deconvolution method which was designed to predict these cell types from underlying DNAm patterns [34, 68, 87].
Predicted cell proportions from both PBMC and BEC tissues were used to normalize cellular heterogeneity within each tissue using a regression-based strategy (Additional file 9: Fig. S2). PCA was subsequently used to confirm that the correlation of estimated cell type proportions to DNAm variance within a tissue was minimal in the corrected 450K datasets (data not shown).
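One common form of regression-based adjustment fits a linear model of DNAm on the estimated cell proportions for each CpG and then adds the residuals back to the mean DNAm of that CpG. The sketch below shows this idea for a single covariate (for example, an estimated buccal cell proportion). It is a simplified illustration under that assumption rather than the exact procedure used here, which would typically include several cell type proportions at once. The function name is hypothetical.

```python
def adjust_for_cell_proportion(betas, proportions):
    """Remove the linear effect of one cell proportion from one CpG.

    Fits beta ~ intercept + slope * proportion by ordinary least squares and
    returns residuals plus the mean beta, so values stay on the beta scale.
    """
    n = len(betas)
    mean_beta = sum(betas) / n
    mean_prop = sum(proportions) / n
    sxx = sum((p - mean_prop) ** 2 for p in proportions)
    if sxx == 0:
        return list(betas)  # no variation in cell proportion: nothing to remove
    sxy = sum((p - mean_prop) * (b - mean_beta)
              for p, b in zip(proportions, betas))
    slope = sxy / sxx
    # residual + mean simplifies to beta - slope * (proportion - mean proportion)
    return [b - slope * (p - mean_prop) for b, p in zip(betas, proportions)]


# Hypothetical CpG whose DNAm is fully explained by cell proportion
props = [0.1, 0.2, 0.3, 0.4]
raw = [0.25, 0.30, 0.35, 0.40]
print(adjust_for_cell_proportion(raw, props))
```

After adjustment, the CpG mean is preserved while the variance attributable to the covariate is removed, which is the property later checked by PCA.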
Assessment of cross-tissue correlation, tissue-specific variability and tissue-specific differences in DNA methylation data
Prior to subsequent DNAm analyses, the corrected 450K datasets were filtered down to overlapping probes (419,507) between the GECKO and C3ARE cohorts. Probewise cross-tissue Spearman’s correlations were calculated on beta values between the matched PBMC and BEC tissues. Inter-individual variability of each CpG was calculated as the range between the 10th and 90th percentile beta values for each CpG, referred to as “reference range”. This method captures variability across the bulk of samples while being largely robust to outlier samples.
In order to assess sample size-related differences in our DNAm analyses between GECKO and C3ARE, we performed 100 trials of Monte Carlo simulations. Specifically, we randomly subsampled the GECKO cohort to the same size as the C3ARE cohort (n = 16 individuals) 100 times and reran the cross-tissue correlations and reference range calculations on the subsamples. We reported the average correlation coefficients, p values and reference ranges from the 100 trials, which we refer to as “GECKOsub.”
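The subsampling procedure amounts to repeatedly drawing 16 individuals without replacement, recomputing a statistic, and averaging over trials. A minimal sketch follows. The function name, the `statistic` callback and the fixed seed are illustrative assumptions.

```python
import random


def subsample_average(individual_ids, statistic, subsample_size=16,
                      n_trials=100, seed=0):
    """Average a statistic over repeated random subsamples of individuals.

    `statistic` is any function that takes a list of individual IDs and
    returns a number (e.g. a correlation coefficient or a reference range
    computed on the matched samples of those individuals).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    results = []
    for _ in range(n_trials):
        subset = rng.sample(individual_ids, subsample_size)  # without replacement
        results.append(statistic(subset))
    return sum(results) / n_trials


# Hypothetical cohort of 50 individuals; the statistic here is a placeholder
ids = list(range(50))
print(subsample_average(ids, lambda subset: len(set(subset))))
```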
Paired Wilcoxon signed-rank tests were used to compare global differences in reference range between matched BEC and PBMC samples. Fligner–Killeen tests were used to compare probewise variability differences in each of the cohorts. Using previously published methods, we aimed to identify informative sites between BECs and PBMCs, which we defined as CpGs that are both variable across individuals and highly correlated between both tissues. To identify informative sites, we first subset each cohort down to CpGs with a reference range greater than 0.10 in both tissues. We subsequently ran a beta mixture model on Spearman correlation rho values generating two Gaussian distributions, which separated out a group of highly concordant CpGs (Additional file 10: Fig. S3). The distribution of Spearman rho values in this set of highly correlated CpGs was used to define a threshold correlation coefficient, the cutoff being two standard deviations below the mean of the distribution. The resulting threshold was rho > 0.47 in the GECKO cohort and rho > 0.32 in the C3ARE cohort. We also set a minimum reference range of 0.05 in both tissues to exclude CpGs with little inter-individual variation.
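The threshold rule and the resulting classification can be expressed in a few lines. In the sketch below, the mean and standard deviation of the highly concordant component are taken as given (they would come from the fitted mixture model), and the numeric values in the example are hypothetical. Treating the minimum reference range as inclusive (≥ 0.05) is also an assumption.

```python
def correlation_threshold(component_mean, component_sd):
    """Cutoff two standard deviations below the mean of the high-rho component."""
    return component_mean - 2 * component_sd


def is_informative(rho, range_bec, range_pbmc, rho_threshold, min_range=0.05):
    """A CpG is informative if it is concordant across tissues and variable in both."""
    return (rho > rho_threshold
            and range_bec >= min_range
            and range_pbmc >= min_range)


# Hypothetical component parameters, chosen only to illustrate the arithmetic
threshold = correlation_threshold(component_mean=0.75, component_sd=0.14)
print(round(threshold, 2))
print(is_informative(rho=0.60, range_bec=0.20, range_pbmc=0.15,
                     rho_threshold=threshold))
```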
Finally, we identified CpGs which were differentially methylated between tissues by running Wilcoxon signed-rank tests across all probes in the C3ARE, GECKO and GECKOsub datasets. For all tests, the resulting p values were adjusted using the Benjamini–Hochberg (BH) false discovery rate (FDR) method. CpGs which passed an FDR < 0.05 and an effect size threshold, delta beta > 0.05, independently in all three datasets, C3ARE, GECKO and GECKOsub, were classified as “differential sites.”
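For readers unfamiliar with the BH adjustment, the sketch below implements the standard step-up procedure and combines it with the effect size filter. The function names are illustrative, and applying the delta beta threshold to the absolute difference between tissues is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differential(adjusted_p, delta_beta, fdr=0.05, min_delta=0.05):
    """FDR and effect size criteria for one CpG in one dataset."""
    return adjusted_p < fdr and abs(delta_beta) > min_delta


# Hypothetical p values for four CpGs
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A CpG would be classified as a differential site only if `is_differential` holds in each of the three datasets.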
SNP genotyping arrays
In the GECKO cohort, DNA for genotyping was collected from saliva samples of 63 individuals using the Oragene OG-500 DNA all-in-one system as per the manufacturer’s protocol (DNA Genotek Inc, ON, Canada). In the C3ARE cohort, genomic DNA for genotyping was obtained from PBMC samples as described above. Genotypes were measured at 588,454 SNP sites using the Illumina Infinium PsychChip BeadChip (PsychChip), as per the manufacturer’s protocols (Illumina, CA, USA). Content for the PsychChip includes 264,909 proven tag SNPs found on the Infinium Core-24 BeadChip, 244,593 markers from the Infinium HumanCoreExome BeadChip, and 50,000 additional markers associated with common psychiatric disorders.
Preprocessing of SNP genotyping data and PCA analyses for genetic ancestry
Quality control preprocessing of Illumina Infinium PsychChip data was performed separately for each cohort according to recommended guidelines. Specifically, SNPs with a low 10th percentile GenCall score or with a low average GenCall score were filtered out. Additionally, SNP probes located on mitochondrial DNA, on sex chromosomes or without chromosome labels were removed. After probe filtering, final SNP probe counts for the C3ARE and GECKO datasets were 550,200 and 547,662, respectively. To test for differences in genetic ancestry between the two cohorts, we ran all samples in PCA, using the 542,699 SNPs called for every individual in both processed datasets. Genetic ancestry was not found to differ significantly between the cohorts (Additional file 11: Fig. S4), as determined by Wilcoxon rank-sum tests of GECKO vs C3ARE in PC1 scores (p = 0.8) and PC2 scores (p = 0.4). Therefore, genetic ancestry was not considered in further analyses.
We ran cis-mQTL analyses in each cohort separately, using GECKO as the discovery cohort and C3ARE as the validation cohort. In the GECKO cohort, PsychChip data were filtered after quality control to remove any SNP probes containing missing values in 5% of all samples, leaving 560,770 SNPs. In addition, SNPs with a minor allele frequency less than 5% or not in Hardy–Weinberg equilibrium were removed. Remaining SNPs (249,835) were then numerically coded as 1, 2 or 3 for correlational analyses. All SNPs used in mQTL analyses were directly measured on the array, rather than generated through imputation. CpGs with a reference range of less than 5% were removed from mQTL analysis; this was performed separately in each tissue, leaving 131,706 CpGs in PBMCs and 210,784 CpGs in BECs. Finally, SNP–CpG pairs less than 5 kb apart were tested as mQTL using Spearman correlations. We selected a 5 kb window as previous mQTL analyses using whole genome bisulfite sequencing data reported that associations between SNP–CpG pairs are more likely to be causal within a 5 kb window [47, 59, 60, 91, 92]. In GECKO, a total of 165,591 unique SNP–CpG pairs in PBMC and 261,739 unique SNP–CpG pairs in BEC were interrogated for associations between DNAm and allelic variation; this included 145,222 SNP–CpG pairs tested in both tissues. Pairs with FDR ≤ 0.05 and DNAm change per allele ≥ 2.5% were designated as cis-mQTL hits and followed up for validation in the C3ARE cohort. For validation testing in the C3ARE samples, SNP–CpG pairs were further filtered to exclude those with SNPs that were (a) not present in the filtered C3ARE PsychChip data or (b) monomorphic or had fewer than 2 heterozygotes in the C3ARE samples. The mQTL analyses were repeated in the C3ARE data. SNP–CpG pairs with FDR ≤ 0.05 and DNAm change per allele ≥ 2.5% were designated as validated cis-mQTL hits and followed up in subsequent analyses.
All genotyping and DNAm data were analyzed using the human assembly GRCh37 (hg19) genome build. All SNPs are reported on the (+) strand, according to standard practices in the field.
Representation of identified sites in published EWAS findings
In order to relate our results to published EWAS findings performed in pediatric cohorts, we selected five published studies which used the 450K array to measure DNAm profiles in pediatric BECs or peripheral blood. Specifically, these studies examined DNAm variation associated with puberty, aging in early life, childhood psychotic symptoms, fetal alcohol spectrum disorder and autism spectrum disorder [61, 62, 63, 64, 65]. For each study, we downloaded the list of probes reported as significant and matched these probes to sites which we identified as: (1) informative sites, (2) differential sites and/or (3) cis-mQTL-associated CpGs. For one study, in which differentially methylated regions (DMRs) were reported, we downloaded the dataset (Accession # GSE50759) and extracted individual probes underlying the DMRs.
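The matching step is a set overlap between the probes reported by a study and each category of sites. A minimal sketch, with a hypothetical function name and placeholder probe identifiers, is given below. Because the categories are not mutually exclusive, a probe can appear in more than one overlap.

```python
def categorize_probes(reported_probes, informative, differential, mqtl_cpgs):
    """Overlap a study's significant probes with each site category."""
    reported = set(reported_probes)
    informative, differential, mqtl_cpgs = (
        set(informative), set(differential), set(mqtl_cpgs))
    return {
        "informative": reported & informative,
        "differential": reported & differential,
        "cis_mqtl": reported & mqtl_cpgs,
        "uncategorized": reported - informative - differential - mqtl_cpgs,
    }


# Placeholder probe identifiers, not taken from any published study
result = categorize_probes(
    reported_probes=["cg01", "cg02", "cg03", "cg04"],
    informative=["cg01", "cg02"],
    differential=["cg02", "cg09"],
    mqtl_cpgs=["cg03"],
)
print({category: sorted(probes) for category, probes in result.items()})
```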
SAI, SJG and MSK designed the research study and formulated the research questions. RGB, WTB, JO and MSK were responsible for cohort design, participant recruitment and collection of samples for the C3ARE and GECKO cohorts. JLM performed all microarray experiments, along with SJG. SAI and SJG performed all data analyses and cowrote the manuscript together with MSK. All authors read and approved the final manuscript.
We would like to acknowledge Dr. Meaghan Jones, Rachel Edgar, Dr. Magda Price, Dr. Lisa McEwen, Nicole Gladish and Dr. Alexandre Lussier for their analytical and editorial support.
The authors declare that they have no competing interests.
Availability of data and materials
The DNAm datasets supporting the conclusions of this article are available in the Gene Expression Omnibus (GEO) database (GSE124366).
Consent for publication
Ethics approval and consent to participate
Collection of all human samples used in this study was approved by the University of British Columbia, Children and Women’s Hospital Ethics board (Certificates H07-01317 (C3ARE) and H07-02773 (GECKO)). Written informed consent was obtained from a parent or legal guardian, and assent was obtained from each child before study participation.
RG Barr was supported by a Tier 1 Canada Research Chair in Community Child Health Research and was a Senior Fellow of the Canadian Institute for Advanced Research. MS Kobor is the Tier 1 Canada Research Chair in Social Epigenetics, Senior Fellow of the Canadian Institute for Advanced Research, and Sunny Hill BC Leadership Chair in Child Development. SJ Goodman was supported by a CIHR Frederick Banting and Charles Best Doctoral Research Award.
- 30. Hannon E, Lunnon K, Schalkwyk L, Mill J. Interindividual methylomic variation across blood, cortex, and cerebellum: implications for epigenetic studies of neurological and neuropsychiatric phenotypes. Epigenetics. 2015;10:1024–32. https://doi.org/10.1080/15592294.2015.1100786.
- 33. Slieker RC, Bos SD, Goeman JJ, Bovée JVMG, Talens RP, van der Breggen R, et al. Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450k array. Epigenetics Chromatin. 2013;6:26. https://doi.org/10.1186/1756-8935-6-26.
- 34. Smith AK, Kilaru V, Klengel T, Mercer KB, Bradley B, Conneely KN, et al. DNA extracted from saliva for methylation studies of psychiatric traits: evidence tissue specificity and relatedness to brain. Am J Med Genet Part B Neuropsychiatr Genet. 2015;168:36–44. https://doi.org/10.1002/ajmg.b.32278.
- 36. Grundberg E, Meduri E, Sandling JK, Hedman AK, Keildson S, Buil A, et al. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. Am J Hum Genet. 2013;93:876–90. https://doi.org/10.1016/j.ajhg.2013.10.004.
- 38. Gertz J, Varley KE, Reddy TE, Bowling KM, Pauli F, Parker SL, et al. Analysis of DNA methylation in a three-generation family reveals widespread genetic influence on epigenetic regulation. PLoS Genet. 2011;7:e1002228. https://doi.org/10.1371/journal.pgen.1002228.
- 39. Chen L, Ge B, Casale FP, Vasquez L, Kwan T, Garrido-Martín D, et al. Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell. 2016;167:1398–1414.e24.
- 40. Cheung WA, Shao X, Morin A, Siroux V, Kwan T, Ge B, et al. Functional variation in allelic methylomes underscores a strong genetic contribution and reveals novel epigenetic alterations in the human epigenome. Genome Biol. 2017;18:50. https://doi.org/10.1186/s13059-017-1173-7.
- 43. Smith AK, Kilaru V, Kocak M, Almli LM, Mercer KB, Ressler KJ, et al. Methylation quantitative trait loci (meQTLs) are consistently detected across ancestry, developmental stage, and tissue type. BMC Genomics. 2014;15:145. https://doi.org/10.1186/1471-2164-15-145.
- 51. Jones MJ, Islam SA, Edgar RD, Kobor MS. Adjusting for cell type composition in DNA methylation data using a regression-based approach. In: Population epigenetics: methods and protocols. New York: Springer New York; 2017. p. 99–106. http://link.springer.com/10.1007/7651_2015_262.
- 59. Banovich NE, Lan X, McVicker G, van de Geijn B, Degner JF, Blischak JD, et al. Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels. PLoS Genet. 2014;10:e1004663. https://doi.org/10.1371/journal.pgen.1004663.
- 76. Jones MJ, Moore SR, Kobor MS. Principles and challenges of applying epigenetic epidemiology to psychology. Annu Rev Psychol. 2017;69:459–85. https://doi.org/10.1146/annurev-psych-122414-033653.
- 77. Czamara D, Eraslan G, Lahti J, Page CM, Lahti-Pulkkinen M, Hämäläinen E, et al. Variably methylated regions in the newborn epigenome: environmental, genetic and combined influences. bioRxiv. 2018. http://biorxiv.org/content/early/2018/10/17/436113.abstract.
- 83. Miller GE, Chen E, Fok AK, Walker H, Lim A, Nicholls EF, et al. Low early-life social class leaves a biological residue manifested by decreased glucocorticoid and increased proinflammatory signaling. Proc Natl Acad Sci USA. 2009;106:14716–21. https://doi.org/10.1073/pnas.0902971106.
- 86. Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlén S-E, Greco D, et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS ONE. 2012;7:e41361. https://doi.org/10.1371/journal.pone.0041361.
- 87. Eipel M, Mayer F, Arent T, Ferreira MRP, Birkhofer C, Gerstenmaier U, et al. Epigenetic age predictions based on buccal swabs are more precise in combination with cell type-specific DNA methylation signatures. Aging (Albany NY). 2016;8:1034–44. https://doi.org/10.18632/aging.100972.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.