Background

Cytosine methylation (5mC) is a widely conserved epigenetic mark [14] with important roles in the regulation of gene expression and the silencing of transposable elements (TEs) and repeats [5, 6]. Experimentally-induced changes in 5mC patterns have been shown to affect plant phenotypes [79], rates of meiotic recombination [1013], genome stability [1418] and alter plant-environment interactions [1922]. Similar to genetic mutations, changes in 5mC patterns can also occur spontaneously as a result of errors in DNA methylation maintenance [2326]. There is substantial evidence in plants that experimentally-induced as well as spontaneously occurring 5mC changes can be stably inherited across multiple generations, independently of genetic changes [27]. Cytosine methylation has therefore emerged as a potentially important factor in plant evolution [2830] and as a possible molecular target for the improvement of commercial crops [31, 32].

Plant methylomes are now routinely studied using whole-genome bisulfite sequencing (WGBS), a next generation sequencing (NGS) method that can interrogate the methylation status of individual cytosines at the genome-wide scale. The application of this technology has been instrumental in dissecting the molecular pathways that establish and maintain 5mC patterns in plant genomes. Unlike in animals, plants methylate cytosines in context CG, but also extensively in contexts CHG and CHH, where H= A, T, C [5]. Methylation at CG dinucleotides (mCG) is maintained by methyltransferase 1 (MET1), which is recruited to hemi-methylated CG sites in order to methylate the complimentary strand in a template-dependent manner during DNA replication [33]. By contrast, mCHG is maintained dynamically by the plant specific chromomethylase 3 (CMT3) [34], and requires continuous interactions with H3K9me2 (dimethylation of lysine 9 on histone 3) [35]. Asymmetrical methylation of CHH sites (mCHH) is established and maintained by another member of the CMT family, CMT2 [2, 36]. Similar to CMT3, CMT2 dynamically methylates CHH in H3K9me2-associated regions. In addition to these context-specific maintenance mechanisms, all three sequence contexts can also be methylated de novo via RNA-directed DNA methylation (RdDM) [5], which involves short-interfering 24 nucleotide small RNAs (siRNA) that guide the de novo methyltransferase domains rearranged methyltransferase 2 (DRM2) to homologous target sites throughout the genome [37, 38].

Although these methylation pathways appear to be broadly conserved across plant species, recent data indicates that there is extensive variation in 5mC patterns both between but also within species [3, 39]. Efforts to explore the origin of this variation and its implications for plant evolution, ecology and agriculture will require large inter- and intraspecific methylome datasets. Such datasets are currently emerging. To date, the methylomes of over 50 plant species have been analyzed using WGBS [3, 4], including representative species of major taxonomic groups such as angiosperms (flowering plants), gymnosperms, ferns, and non-vascular plants. In addition, the methylomes of over 1000 natural A. thaliana accessions are now available [40], as well as those of several experimentally derived populations [41]. However, deep inter- and intraspecific WGBS measurements remain cost-prohibitive, particularly for species with large genomes. Most published plant methylomes have therefore been sequenced far below saturation (i.e. large number of cytosines in the genome are not covered). Indeed, even simple genomes, like that of the model plant A. thaliana (Col-0 accession), are typically only sequenced to about 10-30X. At this depth, about 5-10% of cytosines have missing data (i.e. zero read coverage) and about 15-20% have nearly uninformative read coverage (<3 reads), and this problem is exacerbated in more complex genomes, like those of rice and maize (see Fig. 1).

Fig. 1
figure 1

Coverage distributions. a-c Percentage of cytosines with X coverage (strand-specific). d-f Percentage of cytosines with missing data (red) and “uninformative” coverage (green), defined as less than three reads

Low to moderate sequencing depths in individual samples have cumulative consequences for analyzing population-level data. For instance, in the recently released 1000 A. thaliana methylome data [40] (measured at 5X coverage per strand on average), 92% of cytosines have missing data in at least one sample when 100 accessions are compared (Additional file 1: Figure S1). These incomplete measurements will reduce statistical power in genome-wide methylation QTL (meQTL) mapping studies, in estimates of epimutation rates, or in ecological studies that aim to correlate site-specific methylation levels with environmental/climatic variables. Moreover, incomplete measurements also complicate and potentially bias methylome scans for signatures of epigenetic selection using methylation site frequency spectrum (mSFS) analytic approaches [28]. One way to circumvent the missing data problem is to calculate methylation levels over larger regions, ranging anywhere from several hundred to several thousand basepairs and to use these methylation levels for downstream population-level analyses. In the above-mentioned A. thaliana population data, only 36% of 100bp regions in the genome are missing in at least one sample of the 100 accessions, compared with 92% of individual cytosines, and this percentage further decreases with larger region sizes. However, while region-based methylation levels are useful measures for descriptive and correlative analyses, these measures obscure detailed insights into the base-resolution methylation status calls, and thus arguably undermine the key advantages of WGBS over other lower resolution technologies such as MeDIP-seq. Base-resolution status calls are needed to be able to apply existing population (epi)genetic models to population methylome data, and to be able to test explicit hypotheses about the evolutionary forces that shape methylome variation patterns within and among species [28].

In order to maximize the information contained in WGBS data and to facilitate cost-effective sequencing decisions for future studies, we developed METHimpute, a Hidden Markov Model (HMM) based imputation algorithm for the construction of base-resolution methylomes from WGBS data. The unique feature of this algorithm is its ability to impute the methylation status and level of cytosines with missing or uninformative coverage, thus yielding complete methylomes even with low-coverage WGBS datasets. Indeed, using published WGBS data from Arabidopsis thaliana (thale cress), Oryza sativa (rice) and Zea mays (maize), we demonstrate that METHimpute accurately constructs base-resolution methylomes from data with an average coverage as low as 6X, suggesting that typical sequencing costs could be cut without a significant loss of information.

Results

Conceptual overview

WGBS is an NGS-based method in which DNA is treated with sodium bisulfite prior to sequencing in order to convert unmethylated cytosines into uracils and finally into thymines during PCR amplification. Hence, a cytosine in a bisulfite treated read that maps to a cytosine in the reference genome provides evidence for methylation, while a thymine that maps to a cytosine does not. Many specialized short read mapping programs make use of this information and output so-called methylation levels [4244]; that is, the proportion of aligned reads that support that a cytosine is methylated out of all the reads covering the site. Methylation levels are inherently noisy due to inefficiencies in the sodium bisulfite conversion step. Moreover, tissue heterogeneity and the highly dynamic maintenance methylation at CHH and CHG, which requires feedback loops with histone modifications and small RNAs [5, 6], lead to intermediate methylation levels which are very susceptible to experimental variation. Finally, in WGBS data a large proportion of cytosines are often either not covered by any sequencing read or are covered only by a few number of reads (Fig. 1), meaning that methylation levels at these positions cannot be determined.

To overcome these limitations we developed METHimpute, a Hidden Markov Model (HMM) for the construction of base-resolution methylomes from WGBS data. METHimpute takes methylated and unmethylated read counts at every cytosine as input, and outputs discrete methylation status calls (unmethylated or methylated), together with recalibrated methylation levels between 0 and 1 for every cytosine in the genome, regardless of coverage (Fig. 2).

Fig. 2
figure 2

Conceptual overview of METHimpute. a Cytosines on the sequenced genome are assumed to be either unmethylated or methylated. b Bisulphite-sequencing and alignment yields methylation levels for each cytosine, i.e. the number of reads showing methylation divided by the total number of reads. c Emission densities for each state are obtained with a binomial test with state-specific parameters. Note that "imputed" cytosines, i.e. cytosines without any reads, are treated identically as all other cytosines. However, since the emission densities for all states are 1 for imputed cytosines, the methylation status call is purely driven by the neighborhood of cytosines. d Model fitting yields posterior probabilities for methylation status calls. e Inferred methylation status calls and methylation levels

The METHimpute algorithm fits a two-state HMM to the observed methylation counts. The two hidden states correspond to the unmethylated (U) and methylated (M) components, with component-specific binomial emission densities. The estimates of the binomial parameters (pU and pM) and the HMM transition matrix (i.e. the collection of probabilities to transition from one hidden state to another) are estimated freely during model training for different sequence contexts, thus requiring no empirical knowledge of the conversion rate. In the present analysis we have used contexts CG, CCG, CWG, CAA, CTA and CCA |CHY (where H={A,C,T},W={A,T} and Y={C,T}), following evidence of their different methylation characteristics [45]. If necessary the model could be extended to account for different emission and transition parameters for every context and annotation category (genes, TEs, CpG density, etc.), instead of context alone, although this would only be possible for well-annotated genomes.

Based on the model fits, the probability that a given cytosine belongs to one of the hidden states is given by the posterior probabilities γU and γM (Fig. 2d and “Methods” section). A cytosine’s maximum posterior probability represents its most likely methylation status (Fig. 2d, e), and the magnitude of this probability can be used as a measure of confidence in the underlying status call. In addition to methylation status calls, METHimpute outputs recalibrated methylation levels per cytosine, calculated as m=pU·γU+pM·γM (Fig. 2e). A key feature of METHimpute is its ability to infer the methylation level and status for cytosines with missing data (i.e. zero read coverage) or for those with poor read coverage (i.e. less than 3 reads). It achieves this inference iteratively during HMM training by borrowing information from neighboring sites. The algorithm therefore outputs complete, base-resolution methylomes, that can otherwise only be obtained through very high-depth sequencing experiments.

Imputation-guided construction of complete Arabidopsis, rice and maize methylomes

To demonstrate the performance of METHimpute we analyzed representative WGBS datasets from A. thaliana (Col-0) (replicate 1: 8.6X; rep.2: 15.7X coverage per cytosine per strand) [41], O. sativa (japonica nipponbare) (rep.1: 7.4X, rep.2.: 6.9X, rep.3: 4.6X) [46], and Z. mays (B73) (rep.1: 1.6X, rep.2: 3.3X, rep.3: 2.4X) [47]. These three species cover a wide spectrum of plant genomes in terms of length and complexity: the A. thaliana, O. sativa and Z. mays genomes are 120 Mb, 374 Mb and 2.1 Gb in length, respectively, and have an estimated repeat content of 10, 28-35 and 85% [4851]. The mapping statistics for each dataset are detailed in Additional file 2: Table S1. Alignment and pre-processing of the data was carried out using a single pipeline as described in the “Methods” section. Runtimes and memory requirements for METHimpute are listed in Additional file 2: Table S4.

Despite average coverage being relatively high, a substantial proportion of cytosines had either missing data or low coverage. For instance, in the A. thaliana (rep.1: 8.6X), O. sativa (rep.3: 4.6X) and Z. mays (rep.2: 3.3X) datasets, about 9% (3.71M), 24% (39.54M) and 26% (36.77M) of all cytosines had missing data (i.e. zero read coverage) and 24% (10.27M), 49% (79.38M) and 60% (85.5M) were nearly uninformative (here defined as coverage <3 reads) (Fig. 1d-f and Additional file 1: Figure S2 for the other replicates). Interestingly, the genome-wide proportions of missing or uninformative sites were highly context dependent, being highest for CCA |CHY, probably as a result of less unique short read alignments in this context as it is more abundant in repetitive regions of the genome (Additional file 1: Figures S3 and S4).

We applied METHimpute to the above-described datasets and evaluated the quality of the resulting methylation calls. For A. thaliana, O. sativa and Z. mays, the algorithm imputed the methylation status of all 3.71M, 39.54M and 36.77M missing data cytosines, respectively, and inferred the methylation status of all 10.27M, 79.38M and 85.5M uninformative cytosines.

Inferred methylation calls capture known biology

To evaluate the quality of the inferred methylation status calls and levels we examined the per-cytosine posterior probability of being either unmethylated (U) or methylated (M). This probability represents a measure of statistical confidence in the underlying methylation call, with a value of 1 being the most confident. We found that the distribution of maximum posterior probability values for imputed cytosines shows a clear peak around 1 and a tail of lower confidence values (Fig. 3 and Additional file 1: Figure S5 for the other replicates), suggesting that the algorithm produces high-confidence methylation calls for a large proportion of cytosines with missing data. Indeed, 58% (1.50M), 54% (3.96M) and 83% (6.43M) of imputed cytosines in A. thaliana, O. sativa and Z. mays were called with high confidence (defined as posterior probability ≥0.9), and these numbers increased to 91% (4.16M), 90% (6.64M) and 93% (9.56M) for cytosines covered by only one or two reads.

Fig. 3
figure 3

Maximum posterior distributions for imputed cytosines (coverage = 0), uninformative cytosines (coverage = 1 or 2) and informative cytosines (coverage ≥3). For Arabidopsis (a), rice (b) and maize (c), for each context. The figure shows the distributions of the maximum posterior probabilities with density on the y-axis and the maximum posterior probability on x-axis. The maximum posterior probability, i.e. the confidence in the methylation status calls, is generally lower for sites with less coverage

To assess whether the inferred methylation levels are consistent with known biology, we constructed meta-methylation profiles for annotated repeats and genes using cytosines separated in three different categories: informative (coverage ≥3), uninformative (coverage = 1 or 2) and imputed cytosines (coverage =0). Regardless of coverage category, METHimpute confirms that A. thaliana TE sequences are heavily methylated in all sequence contexts, with a marked decrease in methylation levels at their 5’ and 3’ ends (Fig. 4b and Additional file 1: Figure S6b for the other replicate). The CCA |CHY context shows the lowest methylation levels and CG shows the highest, consistent with Gouil and Baulcombe 2016 [45], and the ordering is conserved for imputed and uninformative cytosines. Similar profiles were detected for repeat elements in O. sativa and Z. mays, with high CG, CCG and CWG methylation, and very low levels of CAA, CTA, and particularly CCA |CHY methylation, consistent with known results (Fig. 4d, f and Additional file 1: Figure S6 for the other replicates) [52].

Fig. 4
figure 4

Enrichment profiles for genes (left panels) and transposable elements or repeats (right panels) for Arabidopsis (a, b), rice (c, d) and maize (e, f), for each context. Sub-panels show the enrichment profiles for imputed (coverage = 0), uninformative (coverage = 1 or 2) and informative cytosines (coverage ≥3). See the “Methods” section for definition of the recalibrated methylation level

In line with numerous methylome studies in Arabidopsis (e.g. [45, 53, 54]), METHimpute finds that A. thaliana genes are intermediately methylated in CG context, and essentially unmethylated at all other contexts. (Fig. 4a and Additional file 1: Figure S6a for the other replicate). Genic meta-methylation profiles for O. sativa and Z. mays were generally similar to those of A. thaliana (Fig. 4c, e and Additional file 1: Figure S6 for the other replicates), with the exception that both crop species are known to also methylate genic CHG context, probably owing to the fact that genes in these complex genomes often overlap or contain heavily methylated TE or repeat copies.

Taken together the above analyses illustrate two points: first, METHimpute infers annotation-specific methylation profiles that are consistent with published reports; and second, the methylation profiles inferred from imputed or uninformative cytosines recapitulate the patterns seen for highly-informative cytosines, indicating that - regardless of coverage - the inferred methylation calls are robust and biologically meaningful.

Saturation analysis for the performance assessment of imputed methylomes

METHimpute achieves high quality imputations by leveraging information from neighboring cytosines via the estimated distance-dependent transition probabilities (see “Methods” section). Therefore, confidence in the imputed calls is higher for cytosines that are closer to informative sites (Additional file 1: Figure S7). This spatial dependency remains high over distances of 10-40 bp and then decays to background levels. We reasoned that our imputation method may therefore be relatively robust even in shallow WGBS experiments.

To test this directly, we implemented a saturation analysis similar to Libertini et al. 2016 [55], where we compared high-coverage datasets with low-coverage subsets of these datasets. Bam files with mapped reads for the Arabidopsis, rice and maize replicates were merged to obtain samples with 23.2X, 18.6X and 7.2X coverage per cytosine per strand, respectively (Additional file 2: Table S1). These merged files were downsampled to generate a series of reduced datasets, ranging from 90 to 10% of the original data (Additional file 2: Table S3). Upon downsampling, the proportion of cytosines with zero read coverage increased from 5% (23.2X) to 31% (13.47M, 2.6X) in A. thaliana, from 11% (18.6X) to 40% (65.41M, 1.8X) in O. sativa and from 14% (7.2X) to 37% (52.07M, 2.2X) in the Z. mays data (Fig. 5d-f). We ran METHimpute on each reduced dataset and calculated the F1-score in the status calls relative to those obtained with the full data. The F1-score is defined as the harmonic mean of precision and recall, and the status calls of the full dataset were assumed as ground truth.

Fig. 5
figure 5

Saturation analysis. a-c F1-score for METHimpute and the binomial test, compared to the full sample, respectively. The F1-score is the harmonic mean of precision and recall. d-f Proportion of imputed cytosines. g-i Proportion of the genome in each state. The x-axis shows the average strand-specific coverage per cytosine

Our analysis shows that performance remains remarkably high despite drastic decreases in sequencing depth (Fig. 5a-c, Additional file 1: Figure S8 with precision and recall, Additional file 1: Figure S9 F1-score per context). With data as low as 5X coverage per cytosine (strand-specific), the F1-score was as high as 95% (U: 95%, M: 74%) in Arabidopsis, 97% (U: 97%, M: 88%) in rice and 99% (U: 99% M: 98%) in maize. In general, annotations with a large percentage of missing cytosines in the high coverage datasets were less accurately called upon downsampling (Additional file 1: Figure S4). These include in particular transposable elements and repeats. The exception to this trend were 5’ UTRs, which in all three species showed a large percentage of cytosines with missing data but a low amount of miscalled sites upon downsampling.

To put the above accuracy analysis into perspective, we compared the HMM-based imputation method with a much simpler method based on the commonly used binomial test: Methylation states for informative cytosines (>=3 reads) were called with a binomial test, and methylation states for missing and uninformative cytosines (<3 reads) were imputed by assigning the majority methylation state of covered cytosines of the same context in the 200bp neighborhood of the missing or uninformative cytosine. Cytosines without any informative neighbors in a 200bp neighborhood were not imputed and treated as “undefined”, and therefore counted as false negatives in the downsampled data if the full dataset was informative in these positions. We find that the accuracy obtained with this approach is less robust to average sequencing depth. With only 5X data, the F1-score drops down to 93% (U: 93%, M: 74%) in Arabidopsis, 94% (U: 93%, M: 84%) in rice and 95% (U: 93%, M: 91%) in maize (Fig. 5a-c).

Finally, we also considered the fidelity of the recalibrated methylation levels upon downsampling. Recalibrated methylation levels can be interpreted as the probability of observing a methylated read at a given position, and they are highly correlated with original methylation levels: For Arabidopsis, rice and maize, the correlation (linear fit) was 0.91, 0.94 and 0.93, respectively (p-value ≤2e−16). To assess their fidelity upon downsampling, we calculated the correlation between recalibrated methylation levels per cytosine and per 100 bp window to the full coverage dataset, and compared that to the results obtained from the original methylation level (Additional file 1: Figure S10). Per-cytosine recalibrated methylation levels show slightly higher correlations than original methylation levels, and with 10% of the original data the correlations for Arabidopsis, rice and maize are 0.89, 0.90 and 0.93, respectively. Window-based recalibrated methylation levels showed the same correlation performance as the original ones, with remarkably high correlations even when only 10% of the original data was retained (0.95, 0.95, 0.83 for Arabidopsis, rice, maize). These results suggest that recalibrated methylation levels can be used for downstream methylation analysis, since they are correlated to original methylation levels and are robust upon downsampling, while providing base-resolution information even at low sequencing depth.

Overall, both for status calls and for recalibrated methylation levels, METHimpute produces robust results even at very low sequencing depth, suggesting that the algorithm offers a cost-effective solution for methylome studies of large genomes and for population-level studies involving a large number of samples.

Re-calibrated estimates of genome-wide and context-specific methylation levels

Plant species differ greatly in their genome-wide methylation levels (GMLs, i.e. the proportion of cytosines that are methylated) [3, 4]. In a recent survey of about 30 angiosperms, GMLs were found to be as low as 5% in Theobroma cacao to as high as 43% in Beta vulgaris, with a mean of about 16% [3, 39]. Much of this diversity appears to be the result of differences in genome size and repeat content, as well as differences in the efficiency of DNA methylation maintenance pathways [28]. Precise estimates of GMLs are important for studying the evolutionary forces that shape plant methylomes over short and long time-scales, and for understanding genome-epigenome co-evolution. However, obtaining GML estimates based on WGBS data is not trivial, as they are highly dependent on the method used for methylation status calling and on the depth of the sequencing experiment. In A. thaliana, for instance, reported GML estimates vary widely between studies. This dependency is even larger when considering context-specific GMLs (i.e. the proportion of methylated cytosines in a given context; CG-GMLs, CHG-GMLs, CHH-GMLs), with CHH-GMLs being by far the most variable between studies, with reported values ranging from as low as 1.51% [1] to as high as 3.91% [3].

In order to bypass many of the statistical issues in calling methylation states, especially in shallow WGBS data, recent studies have proposed so-called weighted genome-wide methylation levels (wGMLs) as a proxy for GMLs. A wGML is a non-statistical measure which is obtained by counting the number of methylated reads over the total number of reads at the genome-wide scale. Figure 5g-i shows clearly that wGMLs are robust upon down-sampling in any sequence context in the A. thaliana, O. sativa and Z. mays data, thus justifying its use. By contrast, GMLs calculated from base-resolution binomial status calls (i.e. #mC/all Cs) are highly unstable, particularly in non-CG contexts and when sequencing depth is low (Fig. 5g-i).

In order to assess whether the re-calibrated methylation levels provided by METHimpute can also be used to obtain robust estimates of GMLs, we calculated wGMLs by summing the per-cytosine re-calibrated methylation level genome-wide, weighted by coverage. Using this measure we find that METHimpute-derived wGMLs perform nearly identical to naive wGMLs, both in terms of robustness and magnitude (Fig. 5g-i, Additional file 1: Figure S11 with replicates). This demonstrates that METHimpute recalibrated levels are consistent with original methylation levels and known biology not only at the individual cytosine level, but also aggregated over 100bp windows and genome-wide, with the added advantage that they are available for all positions in the genome.

METHimpute facilitates insights into bisulfite conversion rates

One source of measurement noise in WGBS data is the bisulfite conversion procedure prior to sequencing. Bisulfite treatment of DNA is typically performed long enough so that all unmethylated cytosines are converted to uracils. The conversion success (or rate) is typically high. Most studies report conversion rates of about 0.99, implying that only about 1% of all unmethylated cytosines failed to convert. Knowledge of this rate is important not only to verify that bisulfite reaction was efficient but also to be able to separate biological signal from noise in downstream analyses of the data. Empirical estimates of the conversion rate are often obtained by including unmethylated chloroplast and virus genomes as controls in the WGBS workflow, and counting the number of non-converted cytosines from the mapped reads.

A helpful byproduct of the METHimpute fitting procedure is that the conversion rate can be directly estimated from the sequenced material without requiring auxiliary information from chloroplast or virus genomes. METHimpute achieves this in the HMM framework by estimating the probability, pU, of finding a methylated read given that the underlying cytosine is unmethylated (see “Methods” section), which can be used to derive the conversion rate. To obtain these rates we focus on estimates of pU in context CG to exclude potential biases arising from the “fuzzy” maintenance of methylation at CHG and CHH sites. For A. thaliana and Z. mays our estimated conversion rates were 0.989 and 0.961, respectively, which is remarkably close to chloroplast-based estimates of 0.993 and 0.970.

Although bisulfite conversion kits and protocols have been optimized to achieve the highest conversion rate possible the specificity of the reaction is not perfect. A well-known trade-off is that some methylated cytosines can be accidentally converted to uracils, and are later falsely detected as unmethylated. Some controls (commercial or artificially methylated DNA fragments) are available to estimate this inappropriate conversion rate, but, to our knowledge, they are not systematically used in WGBS experiments. Some studies using such controls have shown that the inappropriate conversion rate (% of methylated cytosines converted to uracils) ranges from 0.09 to 6.1% depending on the kit and protocol used [5658].

METHimpute approximates this value by estimating the parameter pM for the M component (see Methods), which can be used to calculate the probability of finding an unmethylated cytosine given that the underlying cytosine is truly methylated. Again, focusing on CG sites, we estimate the methylated cytosines conversion rate at 6.3, 11.5 and 16% in O. sativa, Z. mays and A. thaliana, respectively. Although these estimates are close to the empirical rates reported in the literature, they are biased upward most likely owing to the fact that the parameter pM is partly confounded with methylation variation arising from cellular heterogenity in the sampled tissues. We therefore suspect that our estimates become more accurate in situations where tissue heterogeneity is minimized.

Nonetheless, the ability of METHimpute to provide an accurate estimate of the conversion rate for unmethylated cytosines and an upper-bound estimate for methylated cytosines could be utilized to calibrate WGBS experiments in the laboratory when no controls are available.

Discussion

A key advantage of WGBS over alternative measurement technologies is its ability to provide base-resolution measurements from bulk and - more recently - also from single cell data. Since its first application in the model plant A. thaliana in 2008 [53, 54], WGBS has become an integral tool for studying the methylomes of increasingly large plant genomes and for surveying patterns of natural methylome variation within and among plant species. However, the relatively high costs associated with this technology pose limits on the sequencing depths that can be achieved within most experimental budgets. A typical solution is to sequence methylomes far below saturation, which results in substantial measurement noise and missing data at the level of individual cytosines.

Here we presented METHimpute, a Hidden Markov Model for the construction and imputation of complete methylomes from shallow or deep WGBS data. We show that our approach imputes high-confidence methylation calls for uncovered cytosines that are sufficiently close to informative cytosines (<40 bp). For cytosines in widely uncovered regions, our approach imputes low-confidence calls which might be filtered out in downstream analyses, since they do not contain more information than background frequencies of methylation. A threshold for filtering can be determined from the asymptotic behavior of the maximum posterior probability as in Additional file 1: Figure S7. In the presented analysis we have used six different sequence contexts, but our implementation is general enough to allow also other context specifications. For example, the model could be extended to account for different emission and transition parameters for every context and annotation category (genes, TEs, CpG density, etc.), instead of context alone, although this would only be possible for well-annotated genomes.

We recommend the use of METHimpute instead of the binomial test for the analysis of WGBS data whenever methylation status calls are required. Furthermore, METHimpute solves the problem of missing data in population epigenetic studies, which will facilitate the estimation of epigenetic mutation rates and methylation site frequency spectrum analyses.

Conclusions

Here we introduced METHimpute, an imputation-based HMM for the construction of complete methylomes from shallow or deep WGBS data. Our analyses show that the algorithm can impute the methylation status of cytosines with missing (i.e. zero read coverage) or uninformative coverage (i.e. coverage of less than 3 reads), as well as their recalibrated methylation levels. We demonstrate that these imputations are not only statistically robust, but also biologically meaningful. Our estimates suggest that routine use of this algorithm could reduce sequencing costs of typically sized methylome experiments without a substantial loss of biological information. The method works with small, streamlined genomes like that of Arabidopsis but also with large, repeat-rich genomes like those of most commercial crops, thus making it a flexible software tool for the analysis of DNA methylomes of a wide spectrum of species. METHimpute is implemented as an R-package and seamlessly integrates with the extensive bioinformatic tool sets available through Bioconductor. The algorithm has been extensively tested in plants, but it should also be applicable in non-plant species.

Methods

Hidden Markov Model for methylation calling

Outline

We define an N=2 state Hidden Markov Model (HMM), where the states i represent unmethylated (U) and methylated (M) cytosines. The emission densities for each state are binomial distributions, which can be interpreted as a binomial test on the number of methylated counts m over total counts r. The probability parameter pi of the binomial test can be interpreted as the probability of finding m methylated counts out of r total counts, given the state i. Note that in this definition 1−pU is the conversion rate, i.e. the probability of a read showing non-methylation when the cytosine is indeed non-methylated. Cytosines are not equally spaced in the genome, and we therefore chose a distance dependent transition matrix A, where the distance dependent change in transition probabilities is modeled by an exponential function. Furthermore, to account for different sequence contexts, we implemented context-specificity for both the binomial test and the transition probabilities.

Mathematical description

The probability P of observing methylated mt and total rt read count at a particular cytosine t in context ct can be written as

$$ P_{t}\left(m_{t}, r_{t}, \mathbf{p}_{c_{t}}\right) = \sum_{i \in \{U,M\}} \gamma_{it} B_{{ic}_{t}}\left(m_{t}, r_{t}, p_{{ic}_{t}}\right), $$
(1)

where γi are the posteriors (mixing weights) and Bi are binomial distributions with context-specific parameter pic. The binomial distribution is defined as

$$ B(m, r, p) = {{r}\choose{m}} p^{m} (1-p)^{r-m}. $$
(2)

All probability parameters of the binomial tests (i.e. the probabilities of a success) are estimated freely during model training (next section). For C=6 contexts and N=2 states, N·C=12 independent parameters pic need to be fitted.

The distance dependent transition probabilities from cytosine t in state i to cytosine t+1 in state j, separated by distance dt,t+1 and in transition context ct,t+1, can be described as

$$ {\begin{aligned} A_{ij,c_{t,t+1}}\left(A^{o}_{ij,c_{t,t+1}}, d_{t,t+1}, D_{c_{t,t+1}}, N\right) &= A^{o}_{ij,c_{t,t+1}} \mathrm{e}^{-d_{t,t+1}/D_{c_{t,t+1}}} \\ & \quad + \frac{1}{N} \left(1-\mathrm{e}^{-d_{t,t+1}/D_{c_{t,t+1}}}\right). \end{aligned}} $$
(3)

Here, \(A^{o}_{ij,c_{t,t+1}}\) are the transition probabilities without distance dependency (or for adjacent cytosines with dt,t+1=0). \(D_{c_{t,t+1}}\) is a constant that reflects how fast neighboring cytosines lose correlation. The distance dependency is constructed in such a way that all transitions \(A_{ij,c_{t,t+1}}\) are equally likely for an infinite distance dt,t+1=. For C=6 contexts the model has C·C=36 transition contexts and thus 36 different transition matrices with dimensions N×N.

The constants Dc are determined by a non-linear least-squares (nls) fit to the correlation decay between cytosines in transition context ct,t+1 (see Additional file 1: Figure S12 for all used transition contexts). The formula for the fit is \(y_{c}(d) = a0 * \mathrm {e}^{-d/D_{c}}\), where yc is the correlation between neighboring cytosines at distance d in transition context c. The parameters a0 and Dc are fitted by the nls-fit.

The correlation is calculated between adjacent cytosines, with no other cytosines in between. This reflects the definition of the transition probabilities in the Hidden Markov Model, where transitions are defined from one cytosine to the next in the sequence.

Model fitting

Model parameters are fitted with the Baum-Welch algorithm [59]. The distance-dependent transition probabilities require modified updating formulas compared to a standard Baum-Welch algorithm without distance dependency. The derivation of the modified updating formulas is detailed below, and uses notation introduced in [60].

The conditional expectation Q that needs to be maximized can be written as

$$ {\begin{aligned} Q &= \sum_{i}^{N} \gamma_{i,t=0} ~ log\left(\pi_{i}\right) + \sum_{i,j,t}^{N,N,T-1} \xi_{ijt} ~ log\left(A_{ij,c_{t,t+1}}\right)\\ & \quad + \sum_{i,t}^{N,T} \gamma_{it} ~ log\left(f_{i}\right). \end{aligned}} $$
(4)

The updated transition probabilities \(A^{\prime {o}}_{ijc}\) can be obtained by solving \(\frac {\partial Q}{\partial A^{o}_{ijc}} = 0\) using the method of Lagrange multipliers to deal with the constraint \(\sum ^{N}_{j} A^{o}_{ijc} = 1\).

$$ {\begin{aligned} A^{\prime{o}}_{ijc} &= \left(\sum_{t}^{T-1} \delta_{c,c_{t,t+1}} ~ \xi_{ijt} ~ \frac{A^{o}_{ijc}}{A_{ij,c_{t,t+1}}} \frac{\partial A_{ij,c_{t,t+1}}}{\partial A^{o}_{ij,c_{t,t+1}}} \right) \\ & \quad \left/ \left(\sum_{t,j}^{T-1,N} \delta_{c,c_{t,t+1}} ~ \xi_{ijt} ~ \frac{A^{o}_{ijc}}{A_{ij,c_{t,t+1}}} \frac{\partial A_{ij,c_{t,t+1}}}{\partial A^{o}_{ij,c_{t,t+1}}} \right)\right.. \end{aligned}} $$
(5)

Here, \(\delta _{c,c_{t,t+1}}\) is the Kronecker delta function, which ensures that only terms in the correct transition context c are included into the sum.

Similarly, the updated parameters for the binomial test can be obtained by solving \(\frac {\partial Q}{\partial p_{ic}} = 0\). For independent binomial tests, this yields

$$ p^{\prime}_{ic} = \left(\sum^{T}_{t} \delta_{c,c_{t}} ~ \gamma_{it} ~ m_{t} \right) \left/ \left(\sum^{T}_{t} \delta_{c,c_{t}} ~ \gamma_{it} ~ r_{t} \right)\right.{.} $$
(6)

The methylation status it is determined by maximizing over the posterior probabilities it=argmaxi(γit).

Finally, we can use the posterior probabilities γU|M,t and estimated parameters pic to define a recalibrated methylation level \(m^{\prime }_{t}\) that is defined on every cytosine t in the genome and can serve as input for other applications:

$$ m^{\prime}_{t} = p_{U,c_{t}} \cdot \gamma_{U,t} + p_{M,c_{t}} \cdot \gamma_{M,t} $$
(7)

Plants DNA methylation data

We used published data (fastq files containing bisulfite sequencing reads) from three model plant species: Arabidopsis thaliana, rice (Oryza sativa Japonica cv. Nipponbare) and maize (Zea mays B73). We used three replicates for rice and maize, and two replicates for Arabidopsis. Each sample was mapped to the latest available version of the reference genome for this species. Details and references on these datasets, reference genomes and annotations files, as well as additional alignment metrics can be accessed in Additional file 2: Table S2.

Mapping of bisulphite sequenced (BS-seq) reads and construction of DNA methylomes

Read sequences (Additional file 2: Table S2) were quality trimmed and adapter sequences were removed with Cutadapt (version 1.9; python version 2.7.9; [61]). Trimming was performed on both ends using the single-end mode and the quality threshold was set to a phred score of 20 (q=20). We applied the default error rate of 10% for the removal of the adapter sequences. Afterwards, we discarded reads shorter than 40 base pairs. Reads were subsequently mapped to an indexed genome. The maximum allowed proportion of mismatches was set to 0.05 (m=0.05, 5 mismatches per 100 bp) and the maximum insert size was set to 1000 bp (X=1000). BS-Seeker2 (v2.0.10; [44]) using Bowtie2 (version 2.2.2; [62]) was chosen for the alignment of the reads. Samtools (version 1.3.1; using htslib 1.2.1; [63]) was used to remove duplicates (samtools rmdup -s) and to sort bam files (samtools sort). Methylomes were subsequently constructed through the bs_seeker2-call_methylation.py module from BS-Seeker2 (v2.0.10; [44]). CGmap files containing methylome information were used as an input for METHimpute.