Background

DNA methylation is an epigenetic modification known to play a prime role in gene silencing and is an important topic in epigenetic research. However, due to technology-dependent errors there are inconsistencies between methylation measurements from different methods [1]. Incorrect methylation calls could result in the discovery of spurious associations between methylation patterns and specific phenotypes in epigenome-wide association studies (EWAS). We worked towards assigning a measure of confidence to individual CpGs to down-weigh or exclude positions with inconsistent measurements in such studies. We used methylation measurements from the Infinium HumanMethylation450 microarray (β450K) and whole genome bisulfite sequencing (βWGBS) to evaluate whether locus-specific measurement differences, Δβ = β450K − βWGBS, are predictable using statistical learning techniques.

Methods

Methylation for Illumina WGBS data from HepaRGd7R2 was called with Bis-SNP [2], while methylation for Infinium 450K data from the same cell line was determined using RnBeads [3] and normalized with BMIQ [4]. For a uniform feature representation, we considered windows of reads overlapping with CpGs on the microarray (Figure 1). As predictors we examined sets of read sequences, their consensus sequences (with and without base frequencies), and non-sequence features such as base quality and depth of coverage. To obtain a predictive model independent of the methylation state, we masked CpG positions by introducing gaps or zeroing base frequencies.

Figure 1
figure 1

Data preprocessing. (1) Only reads overlapping with a CpG on the Infinium 450K chip are retained. (2) Windows are extended to the left and right of each CpG according to the maximum read length, yielding a uniform feature representation. (3) For each CpG, a consensus sequence is formed from its corresponding set of reads. Additionally, the position-specific frequency of each base is extracted. (4) Finally, CpG positions are masked by introducing gaps in the sequence or zeroing frequencies.

To predict Δβ, we built support vector regression models based on Illumina WGBS data. Read similarity was measured with numerical, string [57], and set kernels [8]. We introduced the notion of hybrid string kernels to afford a similarity measure for both numeric and string input simultaneously. These kernels are based on scaling the motif similarity scores of two sequences according to the similarity of their base frequency profiles.

Results

For a read-based set kernel utilizing the weighted degree kernel with shifts [6], we found that the predicted values of Δβ correlated significantly with the observed outcomes (r = 0.37, p-value < 2.2 · 10−16). Furthermore, the hybrid weighted degree kernel (r = 0.234) outperformed the weighted degree kernel with shifts (r = 0.22) by also considering the frequencies of individual bases in addition to the consensus sequences. Non-sequence features were less predictive of the outcome than the sequence, e.g., RBF kernels on base quality and depth of coverage attained only correlations of r = 0.057 and r = 0.003 with the outcome, respectively.

Conclusion

To our knowledge, this is the first approach indicating that differences between methylation measurements from bisulfite sequencing and the Infinium HumanMethylation450 microarray are predictable from the reads. The results suggest that features beside the sequence play only a minuscule role in the emergence of inconsistent methylation measurements. We were able to show that, in this scenario, set kernels and hybrid string kernels provide well-suited similarity measures. Further work is necessary to validate the model's generalizability for data from other cell lines and to evaluate its practical merit.