Introduction

Epigenetics usually refers to heritable molecular modifications that have effects independent of the primary DNA sequence and that can be modified by environmental exposures at various developmental stages throughout the lifespan [1, 2]. Epigenetic modifications can be inherited across cell generations and exert a long-term impact on the development of chronic diseases. DNA methylation (DNAm) is an essential epigenetic mechanism for normal development and is associated with several key processes linked to chronic diseases. In DNA methylation, a methyl group is transferred from a donor molecule to cytosine, a reaction catalyzed by DNA methyltransferases (DNMTs). Most human diseases are thought to be due to both genetic and environmental factors, and to the interplay between genes and environment. Studying the epigenome in well-characterized samples may enable us to discover novel genes and pathways through which genetic factors and environmental exposures influence disease development, and thereby provide new targets for prevention and treatment [2–4].

Expanding research in epigenetics and epigenomics has increased our knowledge of the important roles of epigenetics in both normal development and the disease process. Since the completion of the Human Genome Project in 2003 [5], multiple consortium studies have focused on identifying and annotating epigenomic features in humans. The National Institutes of Health (NIH) Roadmap Epigenomics Project (http://www.roadmapepigenomics.org/) aims to produce a public resource of human epigenomic data to catalyze basic biology and disease-oriented research. The mapping consortium has generated a comprehensive map of the dynamic human DNA methylome [6]. The International Human Epigenome Consortium (IHEC) is a global consortium whose primary goal is to provide the research community with access to high-resolution reference human epigenome maps for normal and disease cell types (http://www.ihec-epigenomes.org/). Another major community project, the ENCyclopedia Of DNA Elements (ENCODE), has targeted the identification of all functional DNA elements in the human genome, including some epigenetic modifications [7, 8]. The mapping of epigenomic features has provided important data sets and information on the molecular mechanisms linking epigenetic variants and functional outcomes. Additionally, the European Union has committed more than €200 million in funding to support over 300 epigenetics projects [9]. The availability of high-throughput genotyping methods, high-density reference panels of the human epigenome, and well-characterized samples enables epigenetic association studies at the genomic and population levels. An epigenome-wide association study (EWAS) examines epigenome-wide markers in many individuals to scan for epigenetic markers associated with a trait. Current EWAS in human populations are all based on genome-wide measurement of DNA methylation of cytosine. Thus, I refer to these studies as methylome-wide association studies (MWAS) to distinguish them from studies of other epigenetic markers such as histone modifications.

MWAS has emerged as a promising approach to searching for molecular mediators of genetic and environmental factors, and for unexplained disease risks. Current population-level MWAS rely on high-density microarrays [10] or sequencing-based methods [11] following biochemical modification or enrichment of genomic DNA [12]. Although MWAS represent only a small portion of epigenetic studies, several have successfully identified DNAm sites associated with disease traits [13∙, 14, 15∙∙, 16]. Using hundreds of subjects, these MWAS have much smaller sample sizes than a typical genome-wide association study (GWAS) of the same trait, indicating rather larger effects of the identified DNAm sites [15∙∙]. However, this first wave of MWAS aiming to discover associations between epigenomic variation and disease traits is limited by a number of issues, such as imperfect technologies, tissue and cell-type specificity, access to biospecimens, sample size, and the analytical framework for data pre-processing and statistical modeling [17]. The field has evolved rapidly and equipped us with improved designs, methods, and tools for MWAS. Several recent articles have reviewed the design, analysis, and interpretation of EWAS [12, 18–20], especially in the context of epigenetic association studies of human disease. Given these thorough reviews of EWAS of complex diseases and traits, here I focus on recent reports on environmental and genetic determinants of the DNA methylome, and discuss their influences on MWAS. In the following sections, I include only recent reports that used a methylome-wide approach in a minimum of 50 subjects, because of power considerations.

Genetic Determinants of DNA Methylome

Genetic factors, along with environmental factors and stochastic processes, are the primary sources of epigenetic variation [21]. Genetic influences can come from key genes that maintain the epigenetic profile (e.g., the DNMT gene family), or from neighboring genetic variants that affect the epigenetic state (e.g., by modifying the binding affinity of enzymes). DNA methylation quantitative trait loci (meQTLs) refer to genetic variations within genomic regions associated with a DNAm site. A genome-wide meQTL map can be obtained by correlating GWAS data with DNA methylomic data from the same samples. Several studies have reported genome-wide meQTLs in peripheral blood, brain, lung, adipose tissue, lymphoblastoid cell lines (LCLs), and tumor tissues. I summarize thirteen meQTL studies published since 2010 in Table 1 [22–34]. All studies analyzed cis-meQTLs (i.e., the SNP and DNAm site are located in the same genomic region), while a few also reported trans-meQTLs. Because of the large number of SNP-DNAm pairs to test, trans-meQTL analyses are more computationally intensive and require a more stringent multiple-testing correction. The definition of the cis-effect varies arbitrarily from study to study, ranging from 5 kb [27] to the whole chromosome [33], with the majority (8 out of 13) falling between 100 kb and 1 Mb from the DNAm site. This heterogeneous definition of cis-effects poses a challenge to the direct comparison of cis-meQTL maps between studies, since non-overlapping sites can simply result from the inclusion or exclusion of certain SNP-DNAm pairs. I would recommend a 1 Mb flanking region as a standard definition of the cis-effect in meQTL analysis, as it is inclusive and not very computationally demanding.
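The cis/trans distinction under a fixed flanking window can be sketched in a few lines. This is a minimal illustration, not the procedure of any study cited here: the `Site` type, the `classify_pair` helper, and all coordinates are hypothetical, and the window is treated as inclusive at exactly 1 Mb by assumption.

```python
from typing import NamedTuple

CIS_WINDOW_BP = 1_000_000  # 1 Mb flanking region, the definition recommended in the text


class Site(NamedTuple):
    """A genomic position (hypothetical helper type for this sketch)."""
    chrom: str
    pos: int  # base-pair coordinate


def classify_pair(snp: Site, cpg: Site, window: int = CIS_WINDOW_BP) -> str:
    """Return 'cis' if the SNP is on the same chromosome as the DNAm site
    and within `window` bp of it (inclusive); otherwise return 'trans'."""
    if snp.chrom == cpg.chrom and abs(snp.pos - cpg.pos) <= window:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only
pairs = [
    (Site("chr1", 1_500_000), Site("chr1", 1_200_000)),  # 300 kb apart
    (Site("chr1", 1_500_000), Site("chr1", 5_000_000)),  # 3.5 Mb apart
    (Site("chr2", 1_500_000), Site("chr1", 1_500_000)),  # different chromosomes
]
labels = [classify_pair(snp, cpg) for snp, cpg in pairs]
print(labels)
```

Passing a different `window` (for example 5,000 for a 5 kb definition) shows how the same SNP-DNAm pair can be counted as cis in one study and trans in another, which is the comparability problem described above.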

Table 1 Summary of published studies of meQTL in multiple human tissues and cell types

All studies used at least one type of array-based platform to measure the DNA methylome, either the Illumina Infinium HumanMethylation27 (27 K) or the more recent HumanMethylation450 (450 K) BeadChip (Illumina Inc., San Diego, CA), the latter providing better coverage of genomic regions. Interestingly, all studies excluded the DNAm sites located on the sex chromosomes, although the methylation status of thousands of such sites was measured. Compared with DNAm sites located on the autosomes, most X chromosome sites are hemimethylated in females, showing a bimodal distribution strongly associated with sex [24, 35]. The sex-specific DNAm sites on the X chromosome are caused by X-chromosome inactivation (XCI), characterized by X chromosome-wide methylation in the female genome [36]. A new analytical strategy specific to such bimodally distributed DNAm sites would fill the gap left by ignoring the X chromosome in meQTL studies, and benefit MWAS of human diseases.

Another common theme is that the majority of studies (11 out of 13) scanned for meQTL in samples of European ancestry (EA). In addition to two studies of HapMap LCLs [24, 34] and one study of human variation panel (HVP) LCLs [28], only two studies reported non-EA meQTL maps, one in African Americans (AA) [32] and one in Asians [33] using peripheral and cord blood from individuals. Many genetic associations identified in GWAS of EA are not transferable to other racial groups due to their diverse genetic backgrounds. Similarly, the association between common SNPs and DNAm sites can be different across racial groups. Smith et al. highlighted the overlapping meQTLs between racial groups [32]. However, a large proportion of cis-meQTLs are race-specific, and the inter-race consistency of trans-meQTLs is unknown. Therefore, more studies in diverse racial groups are needed to address potentially distinct genetic influence on DNA methylome.

Using over 700,000 genome-wide SNPs and 22,928 autosomal DNAm sites from 460 unrelated AA individuals, I analyzed the genome-wide meQTLs of peripheral blood leukocytes (PBLs). After adjusting for age, sex, cell type proportions, and batch effect, I identified 16,320 significant meQTL pairs after correction for multiple testing (Bonferroni-corrected alpha level of 0.05). These meQTLs are summarized in Fig. 1, ordered by the chromosomal position of SNPs and CpG sites. Among the 16,320 meQTLs, representing 6,081 unique SNPs and 2,991 unique CpG sites, 32 % are cis-meQTLs (within a 1 Mb flanking region) spread across the entire genome (points along the diagonal line in Fig. 1). Consistent with meQTL studies of other tissues and cell types [22, 24], the genetic contributions to the methylome observed in this AA sample are ubiquitous, both locally (in cis) and from long distance (in trans). A number of SNPs are associated with multiple DNAm sites, which implies the existence of master genetic regulators of epigenetic markers.
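To give a sense of how stringent a Bonferroni correction becomes at this scale, the sketch below computes the per-test threshold under two loudly stated assumptions that the text does not confirm: that every SNP was tested against every DNAm site, and that the SNP count is exactly 700,000 (the text says "over 700,000"). The function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at `alpha`."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed counts, for illustration only
N_SNPS = 700_000  # "over 700,000" in the text, rounded down here
N_CPGS = 22_928   # autosomal DNAm sites reported in the text

n_tests = N_SNPS * N_CPGS  # all SNP-DNAm pairs (an assumption)
threshold = bonferroni_threshold(0.05, n_tests)
print(f"{n_tests:,} tests -> per-test p-value threshold {threshold:.2e}")
```

Under these assumptions there are about 16 billion tests, and a pair must reach a p value on the order of 3 × 10⁻¹² to be declared significant, which illustrates why trans-meQTL scans demand a far more stringent correction than cis-only scans.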

Fig. 1 A genomic map of meQTLs in human peripheral blood lymphocytes

These reported meQTL studies cover both young and old adults, as well as neonates. However, none of the studies involves pediatric samples between the newborn period and the early teens. The DNA methylation profile changes dramatically during this critical developmental stage [37], which makes it important for understanding the genetic effect on the methylome and on chronic diseases. More importantly, early life exposures may interact with genetic factors to synergistically affect the methylome. Thus, future meQTL studies need to focus on the pediatric age group to complete our understanding of genome-methylome correlation across the entire life-span.

LCLs have been developed to provide epidemiological studies with an ‘unlimited’ supply of DNA for genetic studies. These immortalized cell lines can be easily obtained to investigate molecular systems involving the transcriptome, proteome, and epigenome. LCLs from HapMap [24, 34] and HVP [28] were used to map meQTLs. Although the overall correlation of the DNAm profile between LCLs and PBLs was high, the transformation process may alter the methylation status of a large number of DNAm sites and increase the inter-individual variation [35]. Therefore, PBLs better reflect the natural DNAm profile of an individual, and are preferred over LCLs in MWAS. Because of the tissue and cell-type specificity of DNAm, meQTLs can differ substantially from tissue to tissue [26, 32] and from cell type to cell type [27]. The functional impact of these tissue-specific meQTLs may be important for understanding the pathophysiology of diseases in target tissues. To date, we do not have meQTL maps of many targeted tissues and cell types from sizeable population samples. We need to establish a more complete panel of tissue-specific meQTL maps as a publicly available resource, to better understand the genetic determinants of the DNA methylome and its role in chronic diseases.

Utilities and Applications of meQTL

Overall, these studies cataloged a large number of meQTLs in multiple tissues and cell types across the human genome, and provide a rich resource not only to understand the genetic regulation of DNAm, but also to illustrate an important molecular mechanism mediating the interaction between genetic variants and environment for chronic diseases. Several recent studies have demonstrated the utilities of meQTL data.

First, as a potentially functional feature of the genome, DNAm sites may mediate the genetic association between SNPs and disease traits. meQTLs have been used to link the genetic associations to disease traits from recent GWAS. Significant GWAS loci are enriched for meQTLs. Over 300 unique GWAS SNPs covering 34 % of reported diseases/traits are meQTLs in LCLs [34]. Among the top susceptibility variants of bipolar disorder, meQTLs are enriched in the cerebellum but not in lymphocytes [38]. Liu et al. reported that rheumatoid arthritis-associated genetic variants may function through the epigenetic mechanism as a potential molecular mediator for gene-environment interaction [13∙]. Shi et al. reported that four out of the five established lung cancer loci in people of European ancestry are meQTLs in lung tissue. In aggregate, cis-meQTLs in lung tissue are enriched for lung cancer risk [31].

Secondly, meQTLs can be used as instrumental variables in Mendelian randomization (MR) studies of DNAm markers [3, 39]. MR was initially proposed as an epidemiologic method to obtain unbiased estimates of putative causal effects without conducting a randomized trial [40, 41]. The MR approach uses a genetic variant that mimics the biological effects of a modifiable exposure. If the exposure truly alters the disease risk, the genetic variant should also be associated with the disease through the causal pathway. Because genetic variants are randomly assigned to offspring during meiosis, the genotype distribution in a population should not be biased by confounding. Only a genetic variant in the causal pathway should be associated with the disease outcome, carrying the association through the causal exposure. MR assumes that the genetic variants are independent of the confounders, that the genetic variants are reliably associated with the exposure, and that there is no direct effect of the genetic variants on disease. In epigenetic epidemiologic studies, the MR approach can be applied to study (1) the causal environmental factors for an epigenetic profile, and (2) the causal epigenetic risk factors for human disease, as detailed in recent publications [3, 39].
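One common single-instrument MR estimator, which the text does not name and which is shown here only as an illustrative assumption, is the Wald ratio: the variant-outcome association divided by the variant-exposure association. In the sketch below the instrument would be a meQTL, the exposure a DNAm level, and the outcome a disease trait; the function name and all numbers are hypothetical, and the standard error uses a first-order approximation that ignores uncertainty in the variant-exposure estimate.

```python
from typing import Tuple


def wald_ratio(beta_zx: float, beta_zy: float, se_zy: float) -> Tuple[float, float]:
    """Wald ratio estimate of the causal effect of exposure X on outcome Y.

    beta_zx: association of the genetic variant Z with the exposure X
    beta_zy: association of the genetic variant Z with the outcome Y
    se_zy:   standard error of beta_zy

    Returns (estimate, approximate standard error). The standard error is a
    first-order approximation that treats beta_zx as known without error.
    """
    if beta_zx == 0:
        raise ValueError("the variant must be associated with the exposure (beta_zx != 0)")
    estimate = beta_zy / beta_zx
    se = abs(se_zy / beta_zx)
    return estimate, se


# Hypothetical summary statistics, for illustration only
estimate, se = wald_ratio(beta_zx=0.50, beta_zy=0.10, se_zy=0.04)
print(f"causal estimate = {estimate:.2f}, approximate SE = {se:.2f}")
```

The estimate is only interpretable as causal if the three MR assumptions listed above hold; a weak variant-exposure association (small `beta_zx`) inflates both the estimate and its standard error.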

Environmental Modifiers of the DNA Methylome

Environmental factors are also important epigenetic modifiers [2, 20]. Epigenetic variation may mediate the environmental risks for human diseases during the entire life-span, from early embryogenesis and in utero development through childhood to adulthood. Although environment-induced epigenetic changes may vary at different developmental stages, the cumulative effects can eventually lead to chronic disease in later life. While facing challenges in exposure measurement, epityping technology, availability of biospecimens, study design, and analytical methods, genome-wide environmental epigenetics studies also hold the promise of discovering epigenetic dysregulation that mediates life-course exposures [20]. Heijmans et al. first showed that the early life environment can cause DNAm changes in humans that persist more than 60 years after the Dutch Hunger Winter [42]. Individuals who were prenatally exposed to famine had less DNA methylation of the IGF2 gene than their unexposed, same-sex siblings, after adjustment for age and relatedness [42]. A number of environmental factors, such as socioeconomic status [43], early life environment [44], traumatic experience [45], pollutants [46], nutrition [47], and physical activity [48∙, 49], have been associated with the DNAm profile. These preliminary findings require replication in studies with larger sample sizes and consistent phenotyping. To date, cigarette smoking is the most convincing environmental modifier of the DNA methylome. Strong epigenetic associations with smoking behavior have been consistently demonstrated in several comprehensive MWAS in different populations. Since the first MWAS of cigarette smoking was published in 2011 [50∙∙], 11 more MWAS have been reported in sizable population samples. In Table 2, I summarize a total of ten MWAS of smoking in adults [50∙∙, 51, 53, 54∙, 55, 56, 57, 58, 59∙], and two MWAS of the effect of maternal smoking on offspring [60∙, 61].

Table 2 Summary of published methylome-wide association studies of cigarette smoking

Cigarette Smoking-Related DNA Methylation

Cigarette smoking is an environmental risk factor for many chronic diseases, including cardiovascular disease (CVD) and cancer, the deadliest diseases in the US. Smoking can induce cellular and molecular changes, including epigenetic modification, but the short-term and long-term epigenetic modifications caused by cigarette smoking at the gene level are not well understood. Recent MWAS have identified and replicated smoking-related DNAm sites in samples of European ancestry [50∙∙, 51, 53]. The most significant smoking-related DNAm sites are hypomethylated among smokers [53, 54∙, 58]. Several studies also reported an inverse association between pack-years and DNA methylation, and a positive association between time since quitting smoking and DNA methylation [51, 55, 58, 62]. With hundreds of smokers and non-smoking controls, over a dozen differentially methylated loci have been discovered and replicated in at least two studies. The most replicable smoking-related DNAm loci include F2RL3 (factor II receptor-like 3), AHRR (aryl hydrocarbon receptor repressor), GPR15 (G-protein-coupled receptor 15), 2q37.1, LRRN3 (leucine rich repeat neuronal 3), AKT3 (v-akt murine thymoma viral oncogene homolog 3), LIM2 (lens intrinsic membrane protein 2, 19 kDa), NCAPD3 (non-SMC condensin II complex, subunit D3), and CNTNAP2 (contactin associated protein-like 2). These smoking-related DNAm loci are often replicable across racial groups [54∙]. So far, we have limited knowledge of how these genes and their products relate to smoking and physiological function. F2RL3 plays a role in platelet activation and cell signaling, which may mediate the risk for cardiovascular disease. The aryl hydrocarbon receptor triggers expression of a diverse set of genes, some of which are involved in the metabolism of toxins from cigarette smoke [63, 64].

Joubert et al. conducted an MWAS of maternal smoking and identified 10 loci differentially methylated in umbilical cord blood. Interestingly, these loci include reported sites associated with adult smoking (e.g., AHRR, CYP1A1 and CNTNAP2) and sites uniquely associated with maternal smoking exposure (e.g., GFI1) [60∙]. The notable difference in smoking-related DNAm between newborns and adults may indicate distinct effects of direct versus indirect smoking exposure, or of different developmental stages. Markunas et al. replicated 7 previously reported DNAm loci and identified 10 new regions using 287 mothers who smoked during pregnancy and their newborn infants [61]. The DNAm variants in newborns may mediate the effect of in utero smoking exposure on health outcomes in later life. A recent MWAS compared the DNAm changes associated with tobacco smoking with those associated with snuff use. In contrast to tobacco smoking, smokeless tobacco use is not significantly associated with any DNAm site, nor with any enrichment of biological functions and molecular processes [59∙]. This interesting observation suggests that smoking-related DNAm variations are most likely caused by the combustion products of tobacco, not by its natural components.

Utilities and Applications of Environment-Associated DNAm Marker

The environment-associated DNAm sites may serve as novel biomarkers for both short-term and long-term exposures. In the case of cigarette smoking, DNAm sites located in F2RL3 and AHRR not only predict current smoking as cotinine does, but also distinguish former smokers from never-smokers [65, 66]. The DNAm levels of F2RL3 sites are associated with the cumulative dose of smoking, as well as the time since quitting [65]. Therefore, these DNAm sites are potentially better predictors of the long-term risk of smoking than the often-flawed self-reported smoking data. Such epigenetic biomarkers of long-term exposures may exist for other environmental risk factors for which the longitudinal profile of the exposure is challenging to measure. For these methylation biomarkers, we can assess whether they are more strongly associated with disease traits than the traditional risk factors in epidemiologic studies. More importantly, we may identify new epigenetic predictors of disease outcomes. Recently, two studies showed that DNAm sites in F2RL3 predict secondary cardiovascular events among patients with stable coronary heart disease [67], and predict total mortality over a 10-year follow-up after adjusting for smoking status [62]. Building on single-marker predictors, we can develop a joint methylation risk score (MRS) for each risk factor with potentially stronger predictive ability. In light of recent MWAS findings for physical activity [48∙, 49], BMI [15∙∙], and blood lipids [16], we will be able to examine the predictive ability of the MRS and may improve the prognosis of chronic diseases. Epigenetic plasticity, in combination with the life-course exposure to disease risks, will offer a new window into a more comprehensive explanation for the development of chronic diseases and inter-individual variation.
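The text does not define how an MRS would be computed. One plausible construction, assumed here by analogy with weighted genetic risk scores, is a weighted sum of methylation levels across the selected DNAm sites, with weights taken from a prior MWAS. The sketch below is illustrative only: the function name, the site identifiers, the weights, and the methylation values are all hypothetical.

```python
from typing import Mapping


def methylation_risk_score(betas: Mapping[str, float],
                           weights: Mapping[str, float]) -> float:
    """Weighted sum of methylation levels over the sites listed in `weights`.

    betas:   methylation level (0 to 1) per DNAm site for one individual
    weights: per-site effect estimate, e.g. from a prior MWAS (assumed given)
    """
    missing = [site for site in weights if site not in betas]
    if missing:
        raise KeyError(f"methylation values missing for sites: {missing}")
    return sum(weights[site] * betas[site] for site in weights)


# Hypothetical sites, weights, and methylation levels, for illustration only.
# Negative weights reflect hypomethylation among exposed individuals.
weights = {"site_A": -2.0, "site_B": -1.0}
individual = {"site_A": 0.5, "site_B": 0.25, "site_C": 0.9}  # site_C is not in the score

score = methylation_risk_score(individual, weights)
print(score)
```

Sites absent from `weights` are ignored, while a site required by the score but missing from an individual's data raises an error rather than being silently treated as zero.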

Partition of Environmental and Genetic Influences on DNA Methylome

The genetic and environmental contributions to a disease are not fixed across the life-span. The proportions of the genetic and environmental components of each DNAm site may vary between studies because of the developmental stage, tissue type of interest, and life-course experience. For chronic diseases, the cumulative environmental effects may dominate the genetic effect in older adults. For age-related chronic diseases such as cardiovascular disease and chronic kidney disease, the epigenetic mechanism may be particularly important in explaining different disease risks among people carrying the same genetic risk alleles. The classical twin design has been used to estimate the contributions of genetic and environmental effects to complex traits. The phenotypic variation in a trait can be decomposed into genetic variance and environmental variance. These unobserved variance components can be quantitatively estimated from the observed phenotypic covariance in monozygotic (MZ) and dizygotic (DZ) twin pairs.
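One classical way to make this decomposition concrete is Falconer's approximation under the ACE model. The text does not say which estimator is intended, so the following is an illustrative assumption. Here $a^2$, $c^2$, and $e^2$ denote the standardized variance shares of additive genetic effects, shared environment, and unshared environment, and $r_{MZ}$ and $r_{DZ}$ are the observed trait correlations within MZ and DZ pairs.

```latex
\begin{align}
  a^2 + c^2 + e^2 &= 1, \\
  r_{MZ} &= a^2 + c^2, \\
  r_{DZ} &= \tfrac{1}{2}a^2 + c^2, \\
  \text{hence}\quad a^2 &= 2\,(r_{MZ} - r_{DZ}), \qquad
  c^2 = 2\,r_{DZ} - r_{MZ}, \qquad
  e^2 = 1 - r_{MZ}.
\end{align}
```

The factor of one half reflects that DZ twins share on average half of their segregating alleles, whereas MZ twins share all of them; any MZ correlation below 1 is therefore attributed to unshared environment (and measurement error).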

MZ twins carry identical genetic information in the primary sequence of DNA, but their epigenetic profiles diverge during aging [68]. Fraga et al. showed that MZ twins were epigenetically indistinguishable during the early years of life, whereas older MZ twins exhibited remarkable differences in DNAm and histone acetylation [68]. The increase in epigenetic differences between MZ twins over time strongly suggests an influence of the unshared environment on the epigenome. Rates of discordant disease phenotypes in MZ twins are usually above 50 % even for highly heritable diseases [69–71]. It has been suggested that epigenetics can make a significant contribution to phenotypic discordance in MZ twins [72, 73]. The epigenomes of discordant MZ twin pairs are, by definition, matched by genetic profile, age, and sex, three important predictors of the epigenetic profile. In addition, the MZ twin design allows controlling for unmeasured confounders such as shared environmental influences. Therefore, using discordant MZ twin pairs is the most powerful design for studying epigenomic variation associated with targeted disease traits [74]. The discordant twin model is particularly useful in detecting moderate effects even with small samples of twin pairs. Several epigenetic studies of discordant MZ twins have investigated DNA methylation profiles in relation to human diseases such as diabetes [75], psychiatric disorders [76–79], cancer [80], and autoimmune diseases [81–83].

In addition to the discordant twin design, we can also use MZ twin pairs to study the within-pair DNAm difference associated with quantitative traits, which is driven by the unshared environment. We are able to estimate both within-pair and between-pair effects in a single regression model [84]. Using 69 pairs of middle-aged MZ twins, we found that the within-twin effect of cg22891070 (a BMI-associated CpG site [15∙∙]) is 7.7 × 10⁻⁴ (p value of 0.93), whereas the between-twin effect is 0.031 (p value of 0.07). This observation hints that the epigenetic association with BMI at the HIF3A locus is not likely driven by the unshared environment.
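The within-pair and between-pair decomposition can be sketched as follows. This is a minimal illustration under stated assumptions, not the model fitted above: it regresses an outcome y on a predictor x split into the pair mean (between-pair term) and each twin's deviation from that mean (within-pair term), with no covariates, and which variable plays the role of x or y is left open. With complete pairs the two terms are orthogonal, so the single-model least-squares slopes equal the two simple slopes computed here. The function name and data are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Twin = Tuple[float, float]  # (x, y) for one twin


def within_between_effects(pairs: List[Tuple[Twin, Twin]]) -> Tuple[float, float]:
    """Return (within_pair_slope, between_pair_slope) of y on x.

    within:  slope of each twin's y deviation on x deviation from the pair mean
    between: slope of pair-mean y on pair-mean x
    """
    mean_x, mean_y = [], []
    dev_xy = dev_xx = 0.0
    for (x1, y1), (x2, y2) in pairs:
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        mean_x.append(mx)
        mean_y.append(my)
        for x, y in ((x1, y1), (x2, y2)):
            dev_xy += (x - mx) * (y - my)
            dev_xx += (x - mx) ** 2

    grand_x, grand_y = mean(mean_x), mean(mean_y)
    s_xy = sum((mx - grand_x) * (my - grand_y) for mx, my in zip(mean_x, mean_y))
    s_xx = sum((mx - grand_x) ** 2 for mx in mean_x)
    if dev_xx == 0 or s_xx == 0:
        raise ValueError("x must vary both within and between pairs")
    return dev_xy / dev_xx, s_xy / s_xx


# Hypothetical data built so that y = 2 * (pair mean of x) + 0.5 * (deviation of x)
pairs = [
    ((1.0, 3.5), (3.0, 4.5)),
    ((4.0, 9.5), (6.0, 10.5)),
    ((2.0, 7.0), (6.0, 9.0)),
]
within, between = within_between_effects(pairs)
print(f"within-pair slope = {within:.2f}, between-pair slope = {between:.2f}")
```

Because MZ co-twins share genotype, age, sex, and rearing environment, a within-pair slope near zero alongside a nonzero between-pair slope is the pattern that would point away from the unshared environment as the driver of an association.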

Future Opportunities in DNA Methylome Research

The modifiability of epigenetics provides an important molecular mechanism for response to the external and internal environments, but also poses a great challenge in understanding the causality underlying the identified epigenetic association. Many environmental factors can potentially modify the DNAm across the life-span, and many are variable over time. Thus, long-term profiles of DNAm have to be measured and studied to understand their relationship to environment and disease development. In the case of pharmacoepigenetic research, the DNAm profiles have to be captured prior to and after the exposure to establish the causal effect of a given treatment. To improve the accuracy and efficiency of recruiting participants and collecting biospecimens, Sun and Davis developed a novel strategy for pharmacoepigenetic research taking advantage of the real-time data of an electronic medical record system [85∙]. The longitudinal studies of DNA methylome and other epigenomic features will be critical to establish the contribution to human diseases in the life-span.

The integrative genomic approach involving transcriptomic and GWAS data has greatly improved our understanding of the DNA methylome. However, the technology for comprehensive profiling of environmental factors is not yet available. The exposome is a new concept referring to the totality of human environmental exposures. Although we currently cannot measure or model the exposome, improved metabolomic technology is a promising way to measure thousands of metabolites representing both the external and internal environment [86, 87]. A joint metabolome-DNA methylome study surveyed the epigenetic associations with 649 blood metabolic traits [88]. The identified associations were driven either by an underlying genetic effect, or by environmental and life-style-dependent factors without any underlying genetic signal. These findings extend the role of DNAm in regulating metabolism. Regarding epityping, both the 27 K and 450 K arrays, as well as bisulfite sequencing (BSS), require bisulfite conversion of genomic DNA in order to distinguish methylated from unmethylated cytosines. This chemical conversion does not distinguish 5-methylcytosine (5mC) from 5-hydroxymethylcytosine (5hmC) [89], another type of cytosine modification with potentially different functions in cellular processes [90]. Future sequencing-based MWAS that can distinguish 5hmC from 5mC may discover 5hmC-specific loci associated with disease traits.

Conclusions

Epidemiologic and human genetic studies have demonstrated that complex diseases are caused by both genetic and environmental factors. The epigenome, including the DNA methylome, links environmental exposures and genetic effects to the development of human diseases. Thus, study of the DNA methylome may lead to a new approach to examining gene-environment interaction and to identifying unexplained risks for complex diseases beyond known environmental and genetic risks. Our knowledge of the human DNA methylome has advanced dramatically in recent years. With high-throughput technologies such as microarrays and next-generation sequencing, large sample sizes from human populations can be achieved to unveil the potential impact of the DNA methylome on human disease. Recent MWAS involving hundreds to thousands of subjects have successfully identified epigenetic associations with human diseases and with environmental and genetic factors. However, a sheer increase in sample size is not sufficient to address the challenges and limitations facing MWAS. To fully recognize the role of the DNA methylome in complex diseases and to eventually establish a new epigenetic approach to preventing and treating diseases, we have to carefully consider the characteristics of samples, the study design, and the key research questions for each MWAS. Therefore, epidemiologic methods and designs will play a very important role in the era of epigenomic epidemiology and MWAS. We will need more well-designed MWAS and follow-up studies to continue unraveling the complex relationship between the human epigenome and diseases, to further understand the epigenetic mechanisms of complex diseases, and to examine the utility of DNAm markers.