Epigenetics is the study of non-genetic cellular processes that may be inherited, are stable through cell division, and may change in response to external and internal cellular stimuli. Epigenetic markers may change within an individual over time and have been shown to exhibit cell-type specificity [1-3]. Epigenetics has been shown to play a critical role in cell differentiation, development, and tumorigenesis [4,5]. DNA methylation is probably the best studied epigenetic modification of DNA, but our understanding of DNA methylation is still in its infancy. In vertebrates, DNA methylation occurs when a methyl group is added to the fifth carbon of the cytosine residue, mainly in the context of neighboring cytosine and guanine nucleotides in the genome (5-CG-3 dinucleotides or CpG sites), and is mediated by DNA methyl-transferases [6,7]. DNA methylation has been shown to play an important functional role in the cell, including involvement in DNA replication and gene transcription, with substantial downstream association with development, aging, and cancer [1-3,8-10].

CpG sites are under-represented in the human genome relative to their expected frequency as a result of being mutation hotspots, where the deamination of methylated cytosines encourages CpG sites to mutate to TpG sites [5,11]. Although CpG sites are mainly methylated across the mammalian genome [12], there are distinct, mostly unmethylated CG-rich regions called CpG islands (CGIs), which have a G+C content greater than 50% [5,11,13]. CGIs account for 1 to 2% of the genome and are often located in promoters and exonic regions in mammalian genomes [14,15]. Methylation patterns in CGIs that are in promoter regions, where most previous studies have focused attention, have recently been shown to differ from methylation patterns elsewhere, indicating a specific biological role for these promoter CGIs [12]. CGIs have been shown to co-localize with DNA regulatory elements such as transcription factor binding sites (TFBSs) [16-23] and DNA binding insulator proteins, such as CTCF, which insulate downstream DNA from upstream methylation activity [24]. Across the genome, DNA methylation levels have been shown to be dependent on context: methylation levels are relatively predictable within particular genomic regions. In particular, predictable levels of methylation have been observed in active chromatin marks [25-27] and cis-acting DNA regulatory elements [14,28]. Context-dependent methylation suggests cellular processes that regulate methylation and also provides clues as to how methylation may impact cellular phenotypes.

The non-uniform distribution of CpG sites across the human genome and the important role of methylation in cellular processes imply that characterizing genome-wide DNA methylation patterns is necessary for a better understanding of the regulatory mechanisms of this epigenetic phenomenon [29]. Recent advances in methylation-specific microarray and sequencing technologies have enabled the assay of DNA methylation patterns genome-wide at single base-pair resolution [29]. The current gold standard for quantifying single-site DNA methylation levels across a genome is whole-genome bisulfite sequencing (WGBS), which quantifies DNA methylation levels at ∼26 million (out of 28 million in total) CpG sites in the human genome [30-32]. However, WGBS is prohibitively expensive for most current studies, is subject to conversion bias, and is difficult to perform in particular genomic regions [29]. Other sequencing methods include methylated DNA immunoprecipitation sequencing, which is experimentally difficult and expensive, and reduced representation bisulfite sequencing, which assays CpG sites in small regions of the genome [29]. As an alternative, methylation microarrays, and the Illumina HumanMethylation450 BeadChip in particular, measure bisulphite-treated DNA methylation levels at ∼482,000 preselected CpG sites genome-wide [33]; however, these arrays assay less than 2% of CpG sites, and this percentage is biased to gene regions and CGIs. Quantitative methods are needed to predict methylation status at unassayed sites and genomic regions.

In this study, we examined measurements of methylation levels in 100 individuals using the Illumina 450K BeadChip [34]. Within these methylation profiles, we examined the patterns and correlation structure of the CpG sites, with attention to characterizing methylation patterns in CGI regions. Using features that include neighboring CpG site methylation status, genomic location, local genomic features, and co-localized regulatory elements, we developed a random forest (RF) classifier to predict single-CpG-site methylation levels genome-wide. Using this method, we were able to identify DNA regulatory elements that were especially predictive of DNA methylation levels at single CpG sites, providing hypotheses for experimental studies on mechanisms by which DNA methylation is regulated or leads to biological changes or disease phenotypes.

Related work in DNA methylation prediction

Methylation status is a difficult epigenomic feature to characterize and predict because assayed DNA methylation marks are: (a) an average across the sampled cells, (b) specific to a cell type, (c) environmentally unstable and (d) not well correlated within a genomic locus [2,35,36]. Specific CpG sites may show differential methylation status across platforms, cell types, individuals or genomic regions [37,38]. A number of methods to predict methylation status have been developed (Additional file 1: Table S1). Most of these methods assume that methylation status is encoded as a binary variable, e.g., a CpG site is either methylated or unmethylated in an individual [28,39-45].

Related methods have often limited predictions to specific regions of the genome, such as CGIs [40-43,45,46]. These methods make predictions of average methylation status for windows of the genome instead of individual CpG sites (with one exception [38]). All of the studies that achieved prediction accuracy ≥90% [40,43,45,46] predicted average methylation status within CGIs or DNA fragments within CGIs. Most of the CpG sites in CGIs are unmethylated across the genome [12] – for example, 16% of CpG sites in CGIs in samples from the human brain were found to be methylated using a WGBS approach [47] – so it is not surprising that classifiers limited to these regions perform well. Studies extending prediction beyond CGIs uniformly achieved lower accuracies, ranging from 75% to 86%. Only two studies predicted methylation levels as a continuous variable: one study was limited to ∼400 bp DNA fragments instead of a genome-wide analysis [48], and the other used as prediction features the same CpG site in reference samples [38].

Across these methods, features that are used for DNA methylation prediction include: DNA composition (proximal DNA sequence patterns), predicted DNA structure (e.g., co-localized introns), repeat elements, TFBSs, evolutionary conservation (e.g., PhastCons [49]), single nucleotide polymorphisms (SNPs), GC content, Alu elements, histone modification marks, and functional annotations of nearby genes. Several studies used only DNA composition features [28,39,42,44,48]. Bock et al. used ∼700 features including DNA composition, DNA structure, repeat elements, TFBSs, evolutionary conservation, and number of SNPs [40]; Zheng et al. included ∼300 features including DNA composition, DNA structure, TFBSs, histone modification marks, and functional annotations of nearby genes [45]. One study used as features methylation levels from the same CpG sites in reference samples from different cell types [38]. The relative contribution of each feature to prediction quality is not quantified well within or across these studies because of the different methods and prediction objectives.

The majority of these methods are based on support vector machine (SVM) classifiers [28,38-41,43,45,46,48]. General non-additive interactions between features are not encoded when using linear kernels, which are used by most of these SVM-based classifiers. If a more sophisticated kernel is used, such as a radial basis function kernel, within the SVM-based approach, the contribution of each feature to prediction quality is not readily available. Three studies included alternative classification frameworks: one found that a decision tree classifier achieved better performance than an SVM-based classifier [46]. Another study found that a naive Bayes classifier achieved the best prediction performance [42]. A third study used a word composition-based encoding method [44].

Our method for predicting DNA methylation levels at CpG sites genome-wide differs from these current state-of-the-art classifiers in that it: (a) uses a genome-wide approach, (b) makes predictions at single-CpG-site resolution, (c) is based on a RF classifier, (d) predicts methylation levels β instead of methylation status τ, (e) incorporates a diverse set of predictive features, including regulatory marks from the ENCODE project, and (f) allows the quantification of the contribution of each feature to prediction. We find that these differences substantially improve the performance of the classifier and also provide testable biological insights into how methylation regulates, or is regulated by, specific genomic and epigenomic processes.


Characterizing methylation patterns

DNA methylation profiles were measured in whole blood samples from 100 unrelated human participants by Illumina HumanMethylation450 BeadChips at single-CpG-site resolution for 482,421 CpG sites [50]. single-CpG-site methylation levels are quantified by β, the proportion of probes for this CpG site that are methylated, which is computed as the methylated probe intensity divided by the sum of both the methylated and unmethylated probe intensities; thus, β ranges from zero (the CpG site is unmethylated) to one (the CpG site is fully methylated). Within the single-CpG-site β values across individuals, we controlled for probe chip position, sample age, and sample sex. After these data were filtered and preprocessed (see Materials and methods), 394,354 CpG sites remained across the 22 autosomal chromosomes.

First, we examined the distribution of DNA methylation levels, β, at CpG sites on autosomal chromosomes across all 100 individuals. The majority of CpG sites were either hypermethylated or hypomethylated (levels of methylation that are consistently higher or lower than 0.5, respectively), with 48.2% of sites with β>0.7 and 40.4% of sites with β<0.3 (Additional file 1: Figure S1A). Using a cutoff of 0.5, across the methylation profiles and individuals, 54.8% of these CpG sites have a methylated status (β≥0.5). Across the individuals, we observed distinct patterns of DNA methylation levels in different genomic regions (Additional file 1: Figure S1B). Using CGIs labeled in the UCSC genome browser [51], we defined CGI shores as regions 0 to 2 kb away from CGIs in both directions and CGI shelves as regions 2 to 4 kb away from CGIs in both directions [34]. We found that CpG sites in CGIs were hypomethylated (81.2% of sites with β<0.3) and sites in non-CGIs were hypermethylated (73.2% of sites with β>0.7), while CpG sites in CGI shore regions had variable methylation levels following a U-shape distribution (39.0% of sites with β>0.7 and 46.2% of sites with β<0.3), and CpG sites in CGI shelf regions were hypermethylated (78.2% of sites with β>0.7). These distinct patterns reflect highly context-specific DNA methylation levels genome-wide.

DNA methylation levels at nearby CpG sites have previously been found to be correlated (indicating possible co-methylation), particularly when CpG sites are within 1 to 2 kb from each other [35,36]. These methylation patterns stand in contrast with correlation among nearby genetic polymorphisms due to linkage disequilibrium, which often extends to large genomic regions from a few kilobases to >1 Mb [52]. We quantified the correlation of methylation levels β between neighboring pairs of CpG sites using the absolute value Pearson’s correlation across individuals. We found that correlation of methylation levels between neighboring (i.e., adjacent CpG sites in the genome that are both assayed) CpG sites decreased rapidly to approximately 0.4 within ∼400 bp, in contrast to sharp decays noted within 1 to 2 kb in previous studies with sparser CpG site coverage (Figure 1A) [35,36].

Figure 1
figure 1

Correlation of methylation levels between neighboring CpG sites. The x-axis represents the genomic distance in bases between the neighboring CpG sites, or assayed CpG sites that are adjacent in the genome. Different colors and points represent subsets of the CpG sites genome-wide, including pairs of CpG sites that are not adjacent in the genome but that are the specified distance apart (non-adjacent). The CGI shore and shelf CpG sites are truncated at 4,000 bp, which is the length of the CGI shore and shelf regions. The solid horizontal line represents the background (absolute value correlation or mean squared Euclidean distance, MED) level from 50,000 pairs of CpG sites from different chromosomes. (A) Absolute value of the correlation between neighboring sites across all individuals (y-axis). The lines represent cubic smoothing splines fitted to the correlation data. (B) Median MED was calculated (y-axis) across pairs of CpG sites within the genomic distance window (x-axis). bp, base pair; CGI, CpG island; MED, mean squared Euclidean distance.

To make this decay more precise, we contrasted the observed decay to the level of background correlation (0.22), which is the median absolute value Pearson’s correlation between the methylation levels of pairs of randomly selected pairs of CpG sites across chromosomes (Figure 1A). We found substantial differences in correlation between neighboring CpG sites versus randomly sampled pairs of CpG sites at matching distances, presumably because of the dense CpG tiling on the 450K array within CGI regions. Interestingly, the slope of the correlation decay plateaus after the CpG sites are approximately 400 bp apart (both for neighbors and for randomly sampled pairs at a matching distance). However, the distribution of correlation between pairs of CpG sites matches the distribution of background correlation even within 200 kb (Figure 2A, Additional file 1: Figure S2A). We found the rate of decay in the correlation to be highly dependent on genomic context; for example, for neighboring CpG sites in the same CGI shore and shelf region, correlation decreases continuously until it is well below the background correlation (Figure 1A). Because of the over-representation of CpG sites near CGIs on the 450K array, we see an increase in correlation as the distance between neighboring sites extends past the CGI shelf regions, where there is lower correlation with CGI methylation levels than we observe in the background. While this suggests that there may be types of methylation regulation that extend to large genomic regions, the pattern of extreme decay within approximately 400 bp across the genome indicates that, in general, methylation may be biologically manipulated within very small genomic windows. Thus, neighboring CpG sites may only be useful for prediction when the sites are sampled at sufficiently high densities across the genome.

Figure 2
figure 2

Histogram of correlation and MED of methylation values between pairs of CpG sites. The x-axes represent the correlation or MED of methylation values between pairs of CpG sites; the left column plots show the histogram of correlation of CpG sites within 200 kb (A), 1 Mb (C) and 10 Mb (E); the right column plots show the histogram of MED of CpG sites within 200 kb (B), 1 Mb (D) and 10 Mb (F). The distribution of the background is calculated by 50,000 random selected pairs of CpG sites and is shown in blue; The distributions of correlation and MED with corresponding distances are shown in pink. dist, distance; kb, kilobase, MB, megabase; MED, mean squared Euclidean distance.

We repeated these experiments using the mean squared Euclidean distance (MED) between pairs of CpG site levels to quantify patterns of decay of methylation within each individual, instead of across individuals, as is measured with the correlation analyses (see Materials and methods; Figure 1B, Figure 2B,D,F). In general, the MED trends echo the local patterns seen in the correlation analysis and also appear to be specific to a region. In CGI regions, the MED of neighboring sites was low and increased slowly with genomic distance. In contrast, MED in CGI shore and shelf regions increased rapidly to an MED higher than background MED (i.e., median MED between pairs of CpG sites within the same individual across chromosomes; 0.13), indicating that the edges of a single shore and shelf region are less predictive of each other than any two CpG sites at random. The individual-specific MED between neighboring sites (Additional file 1: Figure S2B) shows substantial deviation from the background distribution of MED at 200 kb relative to the correlation at this distance, indicating that there may be biological manipulation of methylation in larger genomic regions, but this manipulation may be specific to an individual, such as being driven by genetic variants or environmental effects. That said, for both MED and correlation distributions, we applied the Kolmogorov–Smirnov test to the background distribution and MED or correlation distributions at three genomic distances. We found that, in every case, the P values indicated that the null hypothesis – that the two samples came from the same underlying distribution – should be rejected. We found it difficult to reconcile the results of these tests against the relative histograms and quantile–quantile plots of the same samples (Figure 2, Additional file 1: Figure S2), which showed low levels of enrichment at high correlation and low MED in the non-background distributions.

As we observed that methylation patterns at neighboring CpG sites depend on genomic content, we further investigated methylation patterns within CGIs, CGI shores, and CGI shelves. Methylation levels at CGIs and CGI shelves were fairly constant genome-wide and across individuals – CGIs are hypomethylated and CGI shelves are hypermethylated – but CGI shores exhibit a reproducible but drastic pattern of change (Figure 3A). CpG sites in CGI shores have a monotone increasing pattern of methylation status from CGIs towards CGI shelves, and this pattern is symmetric in the CGI shores upstream and downstream of CGIs. We examined the MED between methylation status for pairs of CpG sites in these regions, and we found that MED within the CGI and within the CGI shelves is low, consistent with the variance we observed within DNA methylation profiles in these regions (Figure 3B). Additionally, we found that the MED between CpG sites in the shelves appears to increase as the sites are further away from the CGI on the shelf, suggesting a circular dependency in CpG site methylation across the ends of the shelf sequences. It is interesting that the CpG sites in the shore regions are substantially more predictive of CpG sites in the shelf regions than those in the CGI regions, although this may indicate a less precise delineation of the shore and shelf regions relative to the CGI and CGI shore delineation.

Figure 3
figure 3

Methylation levels, correlation within CGI. Since each CGI is a different length, each CGI was split into 40 equal-sized windows, and methylation levels and correlation were averaged within each window. (A) The x-axis is the mean β value within a window in the CGI, CGI shore, or CGI shelf regions across all sites in all individuals with a window size of 100 bp. (B) Methylation values of each CpG site in a CGI, CGI shore or CGI shelf Oxford, were compared with all other sites in the same CGI using MED. The x-axis and y-axis represent the genomic position of each CGI with a scale of 1:100, i.e. one unit in the matrix represents 100 bp distance. The MED of each unit cell was calculated for all pair-wise CpG sites corresponding to that matrix position and averaged over the 100 individuals. bp, base pair; CGI, CpG island; MED, mean squared Euclidean distance.

To quantify the amount of variation in DNA methylation explained by genomic context, we considered the correlation between genomic context and principal components (PCs) of methylation levels across all 100 samples (Figure 4). We found that many of the features derived from a CpG site’s genomic context appear to be correlated with the first principal component (PC1). The methylation status of upstream and downstream neighboring CpG sites and a co-localized DNAse I hypersensitive (DHS) site are the most highly correlated features, with Pearson’s correlation r=[0.58,0.59] (P<2.2×10−16). Ten genomic features have correlation r>0.5 (P<2.2×10−16) with PC1, including co-localized active TFBSs ELF1 (ETS-related transcription factor 1), MAZ (Myc-associated zinc finger protein), MXI1 (MAX-interacting protein 1) and RUNX3 (Runt-related transcription factor 3), and co-localized histone modification trimethylation of histone H3 at lysine 4 (H3K4me3), suggesting that they may be useful in predicting DNA methylation status (Additional file 1: Figure S3). That said, the features themselves are well correlated; for example, active TFBS ELF1 is highly enriched within DHS sites (r=0.67,P<2.2×10−16) [53,54].

Figure 4
figure 4

Correlation matrix of prediction features with first ten PCs of methylation levels. The x-axis corresponds to one of the 122 features; the y-axis represents PCs 1 through 10. Colors correspond to Pearson’s correlation, as shown in the legend. PC, principal component.

Binary methylation status prediction

These observations about patterns of DNA methylation suggest that correlation in DNA methylation is local and dependent on genomic context. Thus, prediction of DNA methylation status based only on methylation levels at neighboring CpG sites may not perform well, especially in sparsely assayed regions of the genome. Using prediction features, including neighboring CpG site methylation levels and features characterizing genomic context, we built a classifier to predict binary DNA methylation status. Status, which we denote using τ i,j ∈{0,1} for i∈{1,…,n} samples and j∈{1,…,p} CpG sites, indicates no methylation (0) or complete methylation (1) at CpG site j in sample i. We computed the status of each site from the β i,j variables: \(\tau _{i,j} = \mathbb {1}[\beta _{i,j} > 0.5]\). For each sample, there were 378,677 CpG sites with neighboring CpG sites on the same chromosome, which we used in these analyses.

The 124 features that we used for DNA methylation status prediction fall into four different classes (see Additional file 1: Table S2 for a complete list). For each CpG site, we include the following feature sets:

  • neighbors: genomic distances, binary methylation status τ and levels β of one upstream and one downstream neighboring CpG site (CpG sites assayed on the array and adjacent in the genome)

  • genomic position: binary values indicating co-localization of the CpG site with DNA sequence annotations, including promoters, gene body, intergenic region, CGIs, CGI shores and shelves, and nearby SNPs

  • DNA sequence properties: continuous values representing the local recombination rate from HapMap [55], GC content from ENCODE [56], integrated haplotype scores (iHSs) [57], and genomic evolutionary rate profiling (GERP) calls [58]

  • cis-regulatory elements: binary values indicating CpG site co-localization with cis-regulatory elements (CREs), including DHS sites, 79 specific TFBSs, ten histone modification marks and 15 chromatin states, all assayed in the GM12878 cell line, the closest match to whole blood [56]

We used a RF classifier, which is an ensemble classifier that builds a collection of bagged decision trees and combines the predictions across all of the trees to produce a single prediction. The output from the RF classifier is the proportion of trees in the fitted forest that classify the test sample as a 1, \(\hat {\beta }_{i,j}\in [0,1]\) for i={1,…,n} samples and j={1,…,p} CpG sites assayed. We thresholded this output to predict the binary methylation status of each CpG site, \(\hat {\tau }_{i,j} \in \{0,1\}\), using a cutoff of 0.5. We quantified the generalization error for each feature set using a modified version of repeated random subsampling (see Materials and methods). In particular, we randomly selected 10,000 CpG sites genome-wide for the training set, and we tested the fitted classifier on all held-out sites in the same sample. We repeated this ten times. We quantified prediction accuracy, specificity, sensitivity (recall), precision (1− false discovery rate), area under the receiver operating characteristic (ROC) curve (AUC), and area under the precision–recall curve (AUPR) to evaluate our predictions (see Materials and methods).

Using 122 features (excluding β for one upstream and one downstream neighboring CpG site but including status τ) and considering all CpG sites with two neighboring CpG sites in our data, we achieved an accuracy of 91.9% and an AUC of 0.96 (Figure 5A). We considered the role of each subset of features (Table 1). For example, if we only included genomic position features, the classifier had an accuracy of 78.6% and AUC of 0.85. Including DNA sequence properties and TFBS features increased the accuracy to 85.7% and the AUC to 0.92. When we included all classes of features except for neighbors, the classifier achieved an accuracy of 89.0% and an AUC of 0.94, a significant improvement in prediction from only considering genomic position features (t-test; P=7.75×10−23). These results suggest that TFBSs, histone modifications, and chromatin state are predictive of DNA methylation. However, we also found that the genomic context features improved prediction significantly over using only the neighbor features, which has an accuracy of 90.7% and an AUC of 0.94 (t-test; P=3.45×10−18).

Figure 5
figure 5

Prediction performance of methylation status and level. (A) ROC curves of cross-genome validation of methylation status prediction. Colors represent classifier trained using feature combinations specified in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (B) ROC curves for different classifiers. Colors represent prediction for a classifier denoted in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (C) Precision–recall curves for region-specific methylation status prediction. Colors represent prediction on CpG sites within specific genomic regions as denoted in the legend. Each precision–recall curve represents the average precision–recall for prediction on the held-out sets for each of the ten repeated random subsamples. (D) Two-dimensional histogram of predicted methylation levels versus experimental methylation levels. x- and y-axes represent assayed versus predicted β values, respectively. Colors represent the density of each matrix unit, averaged over all predictions for 100 individuals. CGI, CpG island; Gene_pos, genomic position; k-NN, k-nearest neighbors classifier; ROC, receiver operating characteristic; seq_property, sequence properties; SVM, support vector machine; TFBS, transcription factor binding site; HM, histone modification marks; ChromHMM, chromatin states, as defined by ChromHMM software [107].

Table 1 Performance of methylation status prediction using different prediction models

Cross-sample prediction

To determine how predictive methylation profiles were across samples, we quantified the generalization error of our classifier genome-wide across individuals. In particular, we trained our classifier on 10,000 sites from one individual, and predicted methylation status for all CpG sites for the other 99 individuals. The classifier’s performance was highly consistent across individuals (Additional file 1: Figure S4), suggesting that individual-specific covariates – different proportions of cell types, for example – do not limit prediction accuracy. Aware of the unbalanced proportion of female and male samples in our study, we further investigated prediction performance across sex. The classifier’s performance is highly consistent when training on females and predicting CpG site methylation status in males, and vice versa (Additional file 1: Figure S5).

To test the sensitivity of our classifier to the number of CpG sites in the training set, we investigated the prediction performance for different training set sizes. We found that training sets with greater than 1,000 CpG sites had fairly similar performance (Additional file 1: Figure S6). Throughout these experiments, we used a training set size of 10,000, to strike a balance between sufficient numbers of training samples and computational tractability.

Cross-platform prediction

To quantify classification across platform and cell-type heterogeneity, we investigated the classifier’s performance on WGBS data [59,60]. In particular, we categorized each CpG site in a WGBS sample based on whether that CpG site was assayed on the 450K array (450K site) or not (non 450K site); neighboring sites in the WGBS data are sites that are adjacent on the genome when both are 450K sites. We use one WGBS sample from b-cells, which will match some proportion of each whole blood sample; we note that the 450K array whole blood samples will contain heterogeneous cell types in contrast to the WGBS data. Overall, we see a much higher proportion of hypomethylated CpG sites on the 450K array relative to the WGBS data (Additional file 1: Figure S7) because of the disproportionate representation of hypomethylated CpG sites within CGIs on the 450K array.

First, we investigated cross-platform prediction, training our classifier on a 450K array sample and testing on WGBS data. We trained the classifier on 10,000 CpG sites in the 450K array samples, and then we tested on 100,000 CpG sites in WGBS data twice – once restricting the test set to 450K sites and once restricting the test set to non 450K sites. We repeated this experiment ten times. Next, we performed the same experiment but trained and tested on the WGBS data. Because the proportion of hypomethylated and hypermethylated sites was imbalanced for CpG sites not on the 450K array, we used a precision–recall curve instead of a ROC curve to measure the prediction performance [61]. We used all 122 features and considered prediction of inverse CpG status \({\hat {\tau }} = -(\tau - 1)\) in this experiment, to assess the quality of the predictions for the less frequent class of hypomethylated CpG sites.

Trained on 450K array data and tested on WGBS 450K sites, our RF classifier achieved an accuracy of 89.3%; trained on 450K array data and tested on WGBS non 450K sites, our RF classifier achieved an accuracy of 92.2% (Figure 6; Table 2). Training and testing exclusively on WGBS data showed a similar performance, with an accuracy of 90.0% for CpG sites in the 450K sites and 92.4% for CpG sites in the non 450K sites (Figure 6). Predictions for CpG sites in non 450K sites had lower precision at high recall rates because it is more difficult to predict unmethylated sites in the sequencing data as there are many more unmethylated CpG sites. These results suggest that our RF classifier is able to generalize across platforms and methylation assay types.

Figure 6
figure 6

Prediction performance on WGBS data and cross-platform prediction. Precision–recall curves for cross-platform and WGBS prediction. Each precision–recall curve represents the average precision–recall for prediction on the held-out sets for each of the ten repeated random subsamples. WGBS, whole-genome bisulfite sequencing.

Table 2 Performance of methylation prediction using whole-genome bisulfite sequencing data

Comparison of random forest classifier with other classifiers

We compared the prediction performance of our RF classifier with several other classifiers that have been widely used in related work (Table 3). In particular, we compared our prediction results from the RF classifier with those from a SVM classifier with a radial basis function kernel, a k-nearest neighbors classifier (k-NN), logistic regression, and a naive Bayes classifier. We used identical feature sets for all classifiers, including all 122 features used for prediction of methylation status with the RF classifier. We quantified performance using repeated random resampling with identical training and test sets across classifiers.

Table 3 Performance of methylation status prediction using different classifiers

We found that the k-NN classifier showed the worst performance on this task, with an accuracy of 73.2% and an AUC of 0.80 (Figure 5B). The naive Bayes classifier showed better accuracy (80.8%) and AUC (0.91). Logistic regression and the SVM classifier both showed good performance, with accuracies of 91.1% and 91.3% and AUCs of 0.96% and 0.96%, respectively. We found that our RF classifier showed significantly better prediction accuracy than logistic regression (t-test; P=3.8×10−16) and the SVM (t-test; P=1.3×10−13). We note also that the computational time required to train and test the RF classifier was substantially less than the time required for the SVM, k-NN (test only), and naive Bayes classifiers. We chose RF classifiers for this task because, in addition to the gains in accuracy over SVMs, we were able to quantify the contribution to prediction of each feature, which we describe below.

Region-specific methylation prediction

Studies of DNA methylation have focused on methylation within promoter regions, restricting predictions to CGIs [40,41,43-46,48]; we and others have shown DNA methylation has different patterns in these genomic regions relative to the rest of the genome [12], so the accuracy of these prediction methods outside of these regions is unclear. Here we investigated regional DNA methylation prediction for our genome-wide CpG site prediction method restricted to CpGs within specific genomic regions (Additional file 1: Table S3). For this experiment, prediction was restricted to CpG sites with neighboring sites within 1 kb distance because of the small size of CGIs.

Within CGI regions, we found that predictions of methylation status using our method had an accuracy of 98.3%. We found that methylation level prediction within CGIs had an r=0.94 and a root-mean-square error (RMSE) of 0.09. As in related work on prediction within CGI regions, we believe the improvement in accuracy is due to the limited variability in methylation patterns in these regions; indeed, 90.3% of CpG sites in CGI regions have β<0.5 (Additional file 1: Table S4). Conversely, prediction of CpG methylation status within CGI shores had an accuracy of 89.8%. This lower accuracy is consistent with observations of robust and drastic change in methylation status across these regions [62,63]. Prediction performance within various gene regions was fairly consistent, with 94.9% accuracy for predictions of CpG sites within promoter regions, 93.4% accuracy within gene body regions (exons and introns), and 93.1% accuracy within intergenic regions. Because of the imbalance of hypomethylated and hypermethylated sites in each region, we evaluated both the precision–recall curves and ROC curves for these predictions (Figure 5C and Additional file 1: Figure S8).

Predicting genome-wide methylation levels across platforms

CpG methylation levels β in a DNA sample represent the average methylation status across the cells in that sample and will vary continuously between 0 and 1 (Additional file 1: Figure S9). Since the Illumina 450K array measures precise methylation levels at CpG site resolution, we used our RF classifier to predict methylation levels at single-CpG-site resolution. We compared the prediction probability (\({\hat \beta }_{i,j} \in \left [0,1\right ]\)) from our RF classifier (without thresholding) with methylation levels (β i,j ∈[0,1]) from the array, and validated this approach using repeated random subsampling to quantify generalization accuracy (see Materials and methods). Including all 122 features used in methylation status prediction, but modifying the neighboring CpG site methylation status τ to be continuous methylation levels β, we trained our RF classifier on 450K array data and evaluated the Pearson’s correlation coefficient (r) and RMSE between experimental and predicted methylation levels (Table 1; Figure 5D). We found that the experimentally assayed and predicted methylation levels had r=0.90 and RMSE =0.19. The correlation coefficient and the RMSE indicate good recapitulation of experimentally assayed levels using predicted methylation levels across CpG sites.

We quantified the performance of methylation level prediction on WGBS data. We trained on CpG sites from the 450K array, and tested the classifier on CpG sites from the WGBS data, both restricted to CpG sites in the 450K sites set and restricted to CpG sites in the non 450K sites set. We achieved different correlations (r=0.62 and 0.84, P<2.2×10−16) but similar RMSE (0.23 and 0.24, P=3×10−16) when predicting methylation levels for CpG sites in the 450K sites set and CpG sites in the non 450K sites set, respectively, in WGBS data. We suspect that the performance difference between the two experiments reflected in the correlation coefficients may be due to the overabundance of CpG sites from CGIs included on the 450K array and, correspondingly, in the 450K sites set of CpG sites.

Feature importance for methylation prediction

We evaluated the contribution of each feature to overall prediction accuracy, as quantified by the Gini index. In the RF classifier, the Gini index measures the decrease in node impurity, or the relative entropy of the observed positive and negative examples before and after splitting the training samples on a single feature, of a given feature over all trees in the trained RF. We computed the Gini index for each of the 122 features from the trained RF classifier for predicting methylation status. Our analysis confirmed that the upstream and downstream neighboring CpG site methylation statuses are the most important features for prediction (Additional file 1: Table S5, Figure 7). When we restrict prediction to promoter or CGI regions, the Gini score of the neighboring site status features increased relative to other features, echoing our observation that the non-neighbor feature sets are less useful when a CpG site’s neighbors are nearby, and thus more informative. In contrast, we found that the Gini index of the genomic distance to the neighboring CpG site feature decreased, suggesting that neighboring genomic distance is an important feature to consider when some neighbors are more distant and correspondingly less predictive.

Figure 7
figure 7

Top 20 most important features by Gini index. Gini index of the top 20 features for prediction in different genomic regions. Colors represent different types of features: neighbors in red, genomic position in green, sequence properties in blue and CREs in black. (A) Gini index for whole-genome prediction. (B) Gini index for prediction in promoter regions. (C) Gini index for prediction in CGIs. CGI, CpG island; CRE, cis-regulatory element; DHS, DNAse I hypersensitive; UpMethyl, upstream CpG site; DownMethyl, downstream CpG site; UpDist, distance in bases to the upstream CpG site; DownDist, distance in bases to the downstream CpG site.

The CRE features also have variable Gini indices across experiments. We found that DHS sites are strongly predictive of an unmethylated CpG site; the DHS site feature has the third most significant Gini index across these experiments. This observation is consistent with a previous study showing that CpG sites in DHS sites tend to be unmethylated [64]. GC content, which was also ranked highly based on Gini index, may have a substantial contribution to prediction as a proxy for other important features, such as CGI status and CpG density. We found that the feature rankings based on Gini index differed when predicting methylation status in specific genomic regions (Figure 7), implying context-specific DNA methylation mechanisms.

When predicting methylation status in arbitrary regions, several transcription factors (TFs) and histone modifications were among the most highly ranked features across experiments. Some of these CREs have a reported association with DNA methylation, including ELF1, RUNX3, MAZ, MXI1, and MAX. Indeed, the ETS-related transcription factor (ELF1) has been shown to be over-represented in methylated regions, associating DNA methylation with hematopoiesis in hematopoietic stem cells [65]. RUNX3 (Runt-related transcription factor 3), a strong tumor suppressor associated with diverse tumor types, has been suggested to be associated with cancer development through regulating global DNA methylation levels [66-71]. RUNX3 expression is associated with aberrant DNA methylation in adenocarcinoma cells [70], primary bladder tumor cells [68], and breast cancer cells [69]. For another tumor suppressor transcription factor, MXI1 (MAX-interacting protein 1), expression levels (specifically, lack of expression) have been reported to be associated with promoter methylation levels and neuroblastic tumorigenesis [72]. It has been suggested that suppression of MAZ (Myc-associated zinc finger protein) may be associated with DNA methyltransferase I, the key factor for de novo DNA methylation [73,74]. MXI1 and MAX (Myc-associated factor X) both interact with c-Myc (myelocytomatosis oncogene), a well-characterized oncogene, which has been shown to be methylation sensitive, meaning that the TF motifs contain CpG sites and, thus, TF binding is sensitive to methylation status at those sites [75]. This suggests a potential regulatory relationship between MAX, MXI1, and DNA methylation that may extend to downstream cancer tumor development.

The association between specific histone modifications and DNA methylation is poorly understood. A previous study suggested that high H3K4 methylation and H3 acetylation are associated with MYC recognition [76], suggesting regulatory relationships among DNA methylation, histone modification, and transcription factor binding. Our results suggest that further work is needed to clarify this relationship, as a subset of histone modifications appear to be predictive of methylation status.

We found that the correlation between a binary feature and PC1 is proportional to the Gini index of that feature (Figure 4 and Additional file 1: Table S5). The variation in the Gini index rankings for CREs varied more than we expected based on the other features (Additional file 1: Figure S10). CREs that co-occur with CpG sites more often tend to be more important for prediction, according to the Gini index. We found that the Gini index of a binary feature has a log linear relationship with the number of co-occurrences of that binary feature with CpG sites in the data set: the more often a CpG site in the training data co-occurred with a CRE, the higher the Gini index rank of that CpG site (Additional file 1: Figure S10). There were several outliers to this trend, including co-localization with bound POL3 (RNA polymerase III), C-fos (a proto-oncogene), and histone modifications H3K9ac and H4K20me. These features were less important than we would predict using the fitted linear regression model of log Gini index. This trend limits the strong conclusions that associate specific CREs with DNA methylation biochemically from a high Gini index rank for that CRE; it may be that there are general relationships between CREs and CpG sites that we are learning, but a relatively high CRE frequency in these data may artificially inflate the rank of that CRE in comparison to the others (Additional file 1: Figure S10). Most CpG sites within TFBSs have low average methylation levels (Additional file 1: Table S4). Several TFBSs have disproportionately high average methylation levels, for example, ZNF274 (Zinc-finger protein 274) and JunD (Jun D proto-oncogene); however, both of these outliers also have a low co-occurrence frequency with CpG sites in these data, suggesting that this finding may be an artifact.


We characterized genome-wide and region-specific patterns of DNA methylation. We performed these characterizations based on summary statistics instead of a model-based analysis, which may lead to less dramatic region-specific methylation patterns than in our study (L Pachter, personal communication). These region-specific patterns raise additional questions, including how these observations may resolve or at least suggest causal relationships between methylation and other genomic and epigenomic processes. Since there are SNP associations with complex traits, it is likely that the genotype drives associated processes rather than the other way around; the causal relationship is established by inductive logic, since it is biologically difficult to perform site-specific mutation. The dynamic nature of CpG site methylation means that no such causal relationship can be established inductively; however, experiments can be designed to establish the impact of changing the methylation status of a CpG site [77,78]. Conditional analyses, such as those developed for DNA, may prove to be illuminating for epigenomics [79,80], but the current data are still difficult to interpret. For example, does a TFBS containing a CpG site prevent methylation when a transcription factor is actively bound, or does a methylated CpG site in a TFBS prevent a TF from binding to that site?

We built a RF predictor of DNA methylation levels at CpG site resolution. In our comparison between an RF classifier and alternative classifiers, we found that improvements of the RF classifier include better prediction, especially in sparsely sampled genomic regions, and biological interpretability, which comes from the ability to readily extract information about the importance of each feature in prediction. An additional benefit of using cell-type-specific features (i.e., CREs) is that the predictions are robust to differential methylation across cell types [81,82]. The accuracy results for predictions based on this model are promising, in particular the cross-cell-type heterogeneity and cross-platform performance, and suggest the possibility of imputing CpG site methylation levels genome-wide in the future using WGBS samples as reference. For example, if we assay a set of individuals in an epigenome-wide association study on the Illumina 450K array, we may be able to impute the missing genome-wide CpG sites up to WGBS assays. We are still far from the prediction accuracies currently expected for SNP imputation for downstream use in genome-wide association studies; however, in imputation we would include CpG site-specific methylation levels from reference samples, instead of predicting methylation levels in a site-independent way [38,83]. Our cross-sample analysis illustrates that including methylation profiles from other individuals as reference may improve accuracies substantially. However, because of biological, batch, and environmental effects on DNA methylation, it is possible that precise imputation will require a much larger reference panel relative to DNA imputation. As in genome-wide association studies, all of these imputation methods will fail to predict rare or unexpected variants [84], which may hold a substantial proportion of association signal for both genome-wide and epigenome-wide association studies [85-87]. This work raises the additional question, then, of how best to sample CpG sites across the genome given the methylation patterns and the possibility of imputation; for example, it may be sufficient to assay a single CpG site within a CGI and impute the others, given the high correlation between methylation values in CpG sites within the same CGI.

We identified genomic and epigenomic features that were most predictive of methylation status for co-located CpG sites. The biological functions of CGI shore and shelf regions, and in particular the impact of methylation in these regions, are mostly unknown; however, it has been shown there is substantial DNA methylation variation in CGI shore regions relative to other regions in the genome, and these alterations may contribute to cancer development and tissue differentiation [62,63]. We hope to characterize the role of CGI shore and shelf regions better with respect to enrichment of particular regulatory elements in the future to understand the cellular role of these regions and the specific, curious pattern of methylation found within them.

One particularly important driver of methylation that we do not study carefully here is methylation quantitative trait loci (QTLs), or genetic drivers of methylation [35,88,89]. There is substantial work on the enrichment of methylation QTLs within SNPs and genetic loci that appear to regulate gene transcription levels (expression quantitative trait loci, or eQTLs), DHS site status (dsQTLs), and others [35,64,88,90-92]. The characterizations described here lead us to consider identifying QTLs associated with deviations from CRE-specific methylation patterns instead of single CpG sites, as has been done with methylation in CGI shore regions and associations with cancer [63].


We investigated genome-wide methylation in 100 individuals profiled using the Illumina 450K array. We identified patterns of correlation in DNA methylation at CpG sites specific to CGIs, CGI shores, and non-CGIs, quantifying the variability within CGI shore regions and a pattern of correlation across the shelf regions by which correlation increases with distance. We built a RF classifier to predict methylation as a binary status and as a continuous level at single-CpG-site accuracy, using as features neighboring CpG site information, genomic position, DNA sequence properties, and CRE co-location information. We found that our RF-based method outperformed alternative methylation status classifiers and produces interpretable results. We found that the accuracy of our classifier remained high when predicting methylation status in WGBS data, and across samples. Our approach quantifies features that are most predictive of CpG status: we found that neighboring CpG site methylation levels, location in a CGI, and co-localized DHS sites and specific transcription factor binding sites were most predictive of DNA methylation levels. We identified several TFBSs, including ELF1, MAZ, MXI1, and RUNX3, and histone modifications that are highly predictive of methylation levels in whole blood. These predictive features may play a mechanistic role in methylation, either in regulating the methylation of CpG sites or as a downstream partner in modifying the cellular phenotype.

Materials and methods

DNA methylation data

Illumina HumanMethylation450 array data were obtained for 100 unrelated human participants from the TwinsUK cohort [93]. All participants provided written informed consent in accordance with local ethics research committees. The 100 individuals were adult unselected volunteers and included 97 female and three male participants (age range 27 to 78). Whole blood was collected and DNA was extracted using standard protocols.

Illumina HumanMethylation450 array (Illumina 450K) measured the DNA methylation values for more than 482,000 CpG sites per individual at single-nucleotide resolution. Genomic coverage includes 99% of reference sequence genes, with an average of 17 CpG sites per gene region distributed across the promoter, 5 UTR, first exon, gene body, and 3 UTR, and 96% of CGIs [34,94].

Methylation values for each CpG site are quantified by the term β, which is the fraction of methylated bead signal over the sum of the methylated and unmethylated bead signals:

$$ \beta =\frac{\max (\text{Methy},0)}{\max (\text{Methy},0)+\max (\text{Unmethy},0)+\alpha} $$

where Methy represents the signal intensity of the methylated probe and Unmethy represents the signal intensity of the unmethylated probe. The quantity β ranges from 0 (unmethylated) to 1 (fully methylated).

Data quality control was implemented using R [95] (version 2.15.3). We removed 17,764 CpG sites for which probes mapped to multiple loci in the human genome reference sequence. CpG sites that are SNPs, that had missing values, or that had detection P>0.01 were excluded. Methylation data from probes mapping to the X and Y chromosomes were excluded. We were left with 394,354 CpG sites from 100 individuals in downstream analyses. We normalized the data as follows. Within the methylation level data, we controlled for array number, probe position on the array, age, and sex by taking the residual from a fitted linear regression model. The sum of residuals and intercepts of each site was scaled to [0,1] by truncating all sites with values larger than 1 to 1 and all sites with values smaller than 0 to 0. We assessed data quality to identify sample outliers and batch effects using principal component analysis (PCA) [96] across individuals and CpG sites; no obvious outliers were identified.

We downloaded the WGBS data (BED files) from NCBI Gene Expression Omnibus (GEO) [GEO:GSE46644], sample GSM791827 [59,60]. CD19+ B cells were purified from peripheral blood collected from one healthy female donor. Bisulfite sequencing and read mapping processes were described in previous work [60]. The methylation levels for each CpG site were quantified by the ratio of the number of methylated and the total reads at each CpG site. Only CpG sites with greater than 5 × coverage were included. Methylation level data from the X and Y chromosomes were excluded. After quality control, there were 10,000,890 CpG sites in the WGBS data. Because we used only a single sample, we did not control for PCs.

Correlation and principal component analysis

The statistical analyses were implemented using R and Bioconductor [97] (version 2.15.3). Methylation correlations between CpG sites were assessed by the absolute value of Pearson’s correlation coefficient and MED:

$$ \text{MED} = \frac{\sum_{i=1}^{n}(x_{i,j}-x_{i,k})^{2}}{n}, $$

where x i,j and x i,k represent the methylation values of the two CpG sites being compared j and k, and n represents the number of samples in the comparison. For neighboring CpG sites, pairs of CpG sites assayed on the array that were adjacent in the genome were sampled; the genomic distance between the pairs of CpG sites were within the range x−200 bp to x bp, where x∈{200,400,600,…,6,000}. The correlation and MED of a 200-bp window was not computed, as there were too few CpG sites. The non-adjacent pair correlation or MED values are the average absolute value correlation or MED of 5,000 pairs of CpG sites that were not immediate neighbors with their genomic distances in the same range as for the adjacent CpG sites.

We performed PCA on methylation values of CpG sites by computing the eigenvalues of the covariance matrix of a subsample of CpG sites using the R function svd. Among the 378,677 CpG sites that have complete feature information, 37,868 sites (every tenth CpG site) were sampled along the genome across all autosomal chromosomes. Absolute value Pearson’s correlation was calculated between each feature and the first ten PCs. PCA was performed by plotting the PC biplot (scatterplot of first two PCs), colored by feature status of each CpG site, and by computing the Pearson correlation between the PCs and the feature status across CpG sites.

Random forest and comparison classifier

We used the randomForest package in R in the implementation of the RF classifier [98] (version 4.6-7). Most of the parameters were kept as default, but ntree was set to 1,000 to balance efficiency and accuracy in our high-dimensional data. We found the parameter settings for the RF classifier (including the number of trees) to be robust to different settings, so we did not estimate parameters in our classifier. The Gini index, which calculates the total decrease of node impurity (i.e., the relative entropy of the class proportions before and after the split) of a feature over all trees, was used to quantify the importance of each feature:

$$ I(A)=1-\sum_{k=1}^{c}{p_{k}^{2}}, $$

where k represents the class and p k is the proportion of sites belonging to class k in node A.

We used the SVM implementation in the e1071 package in R [99] with a radial basis function kernel. The parameters of the SVM were optimized by tenfold cross-validation using a grid search. The penalty constant C ranged from 2−1,21,…,29 and the parameter γ in the kernel function ranged from 2−9,2−7,…,21. The parameter combination that had the best performance – γ=2−7 and C=23 – was used to generate the results used in the comparisons.

For k-NN, we used the knn function in R, with the number of neighbors equal to the square root of the number of samples in the training set. For the logistic regression classifier, we used the logistic regression classifier implemented in the R base package with the function glm and family = ‘binomial’. We set the threshold for classification to \(\hat {\beta }_{i,j} \geq 0.5\). For the naive Bayes classifier, we used the naiveBayes function in the R e1071 package.

Features for prediction

A comprehensive list of 124 features were used in prediction (Additional file 1: Table S2). The neighbor features were obtained from data from the Methylation 450K Array. The position features, including gene coding region category, location in CGIs, and SNPs, were obtained from the Methylation 450K Array Annotation file. DNA recombination rate data were downloaded from HapMap (phaseII_B37, update date 19 January 2011) [55]. GC content data were downloaded from the raw data used to encode the gc5Base track on hg19 (update date 24 April 2009) from the UCSC Genome Browser [100,101]. iHSs were downloaded from the HGDP selection browser iHS data of smoothedAmericas (update date 12 February 2009) [57,102], and GERP constraint scores were downloaded from SidowLab GERP++ tracks on hg19 [58,103].

For the CRE features, DHS sites data were obtained from the DNase-seq data for the GM12878 cell line produced by Crawford Lab at Duke University (UCSC accession [wgEncode:EH000534], submitted date 20 March 2009). Chromatin immunoprecipitation sequencing (ChIP-seq) data for 79 specific TFBSs were obtained from the narrow peak files from the GM12878 cell line that were available before June 2012 from the ENCODE website. Ten histone modifications were obtained from the peak files from the GM12878 cell line that were available before December 2013 from the ENCODE website and 15 chromatin states were obtained from the Broad ChromHMM data from the GM12878 cell line (UCSC accession [wgEncode:EH000033], submitted data 21 March 2011) [56].

Neighboring CpG site methylation status τ was encoded as methylated (τ=1) when the site has β≥0.5 and unmethylated (τ=0) when β<0.5. For continuous features, the feature value is the value of that feature at the genomic location of the CpG site; for binary features, the feature status indicates whether the CpG site is within that genomic feature or not. DHS sites were encoded as binary variables indicating a CpG site within a DHS site. TFBSs were included as binary variables indicating the presence of a co-localized ChIP-Seq peak. iHSs, GERP constraint scores and recombination rates were measured in terms of genomic regions. For GC content, we computed the proportion of G and C within a sequence window of 400 bp, as this feature was shown to be an important predictor in a previous study [41]. Among all 124 features, 122 of them (excluding β values of upstream and downstream neighboring CpG sites) were used for methylation status predictions, and all, excluding methylation status of upstream and downstream neighboring CpG sites τ, were used for methylation level predictions. When limiting prediction to specific regions, e.g., CGIs, we excluded those region-specific features from the data.

Prediction evaluation

Our methylation predictions were at single-CpG-site resolution. For regional-specific methylation prediction, we grouped the CpG sites into either promoter, gene body, and intergenic region classes, or CGI, CGI shore and shelf, and non-CGI classes according to the Methylation 450K array annotation file, which was downloaded from the UCSC genome browser [51].

The classifier performance was assessed by a version of repeated random subsampling validation. Within a single individual, ten times we sampled 10,000 random CpG sites from across the genome into the training set, and we tested on all other held-out sites. The prediction performance for a single classifier was calculated by averaging the prediction performance statistics across each of the ten trained classifiers. We checked the performance with smaller training set of sizes 100, 1,000, 2,000, 5,000 and 10,000 sites in the same evaluation setup. In cross-sample analyses, we set the size of the training set to 10,000 randomly chosen CpG sites to balance computational performance and accuracy. We then evaluated the consistency of methylation pattern in different individuals by training the classifier using 10,000 randomly chosen CpG sites in one individual, and then using the trained classifier to predict all of the CpG sites for the remaining 99 individuals. In cross-gender analyses, we randomly chose 10,000 CpG sites from one randomly chosen female or male and tested on all CpG sites from another randomly chosen female or male. This was repeated ten times.

In cross-platform prediction and WGBS prediction, we sampled 10,000 randomly chosen CpG sites from 450K data or CpG sites categorized as 450K sites in WGBS data as training sets. We tested on 100,000 randomly chosen CpG sites that were categorized as 450K sites or non 450K sites in the WGBS data. The prediction performance for a single classifier was calculated by averaging the prediction performance statistics across each of the ten trained classifiers.

We quantified the accuracy of the results using the specificity (SP), sensitivity (recall) (SE), precision, accuracy (ACC), and Matthew’s correlation coefficient (MCC). Note that truly significant CpG sites are those that are methylated, and truly null CpG sites are those that are unmethylated in these data. These values were calculated as follows:

$$ \text{SP}=\frac{\text{TN}}{\text{TN}+\text{FP}} $$
$$ \text{SE} =\frac{\text{TP}}{\text{TP}+\text{FN}} $$
$$ \text{Precision} =\frac{\text{TP}}{\text{TP}+\text{FP}} $$
$$ \text{Recall} =\frac{\text{TP}}{\text{TP}+\text{FN}} $$
$$ \text{ACC} =\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FP}+\text{TN}+\text{FN}} $$
$$ \text{MCC} \,=\,\frac{\text{TP}\times \text{TN}-\text{FP}\!\times \! \text{FN}}{\sqrt{(\text{TP}\,+\,\text{FN})\!\times \!(\text{TP}\,+\,\text{FP})\!\times\!(\text{TN}\,+\,\text{FP})\!\times \!(\text{TN}\,+\,\text{FN})}}, $$

for true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for a particular threshold. We computed ROC curves, precision–recall curves, AUC, and AUPR; AUC and AUPR reflect the overall prediction performance considering both type I (FPs) and type II errors (FNs) [39,104]. We used the ROCR package in R [105].

To estimate continuous methylation levels (\(\hat {\beta }\)), we used the classifier output of prediction probability from the RF classifier directly as an estimate of a specific β∈[0,1]. Prediction accuracy was evaluated using Pearson’s correlation coefficient and RMSE:

$$ r_{x,y}=\frac{\sum_{j=1}^{p}(x_{j}-\bar{x})(y_{j}-\bar{y})}{(p-1)\cdot\sigma_{x}\cdot \sigma_{y}} $$
$$ \text{RMSE}_{x,y} = \sqrt{\frac{\sum_{j=1}^{p}(y_{j}-x_{j})^{2}}{p}}, $$

where x j ,y j are the experimental and predicted values for the jth CpG site, respectively, \(\bar {x}\), \(\bar {y}\) are the means of the experimental and predicted methylation levels, respectively, and σ x , σ y are the empirical standard deviations of the experimental and predicted values, respectively.

Availability of data and code

We have released the TwinsUK 450K Array data for 100 samples through GEO [GEO:GSE62992]. We have released the R code associated with the processing and analyses of these data on the Engelhardt Group website [106].