Background

Different cell types in one organism exhibit distinct gene expression profiles, despite sharing the same genomic sequence. Epigenomic regulation is essential for this phenomenon and contributes to the maintenance of cellular identity. In that regard, covalent DNA cytosine modifications have an important role in gene regulation in a number of eukaryotic species, including mice and humans [1]. The best-studied cytosine modification is 5-methylcytosine (5mC), which entails the addition of a methyl group to the \(5^{\prime }\) carbon of cytosine. Widely known for its effect on gene expression, 5mC occurs in diverse genomic contexts [2, 3].

Active demethylation of methylcytosine to its unmodified form proceeds through successive oxidation to 5-hydroxymethylcytosine (5hmC), 5-formylmethylcytosine (5fC), and 5-carboxylmethylcytosine (5caC) [4, 5], mediated by ten-eleven translocation (TET) enzymes [6] (Additional file 1: Fig. S1). While 5hmC has less genome-wide abundance than 5mC, it is nonetheless recognized as a stable modification [7]. Furthermore, 5hmC is increasingly implicated in gene regulation processes [8]. We know less about 5fC and 5caC, largely because they are even less abundant than 5hmC [9].

In mouse embryonic stem cells (mESCs), 5fC accounts for only 0.0014% of cytosine bases [10], while 5caC accounts for a miniscule 0.000335% [4] compared to nearly 3% of 5mC [4] and 0.055% of 5hmC [10]. Hence, fewer studies investigate the genome-wide distribution and functions of 5fC and 5caC [11,12,13]. In fact, 5fC and 5caC are often regarded as mere intermediates of the demethylation cascade. Nevertheless, while it remains uncertain if 5fC and 5caC do play a distinctive and pan-tissue regulatory function, several lines of evidence suggest that they too can modulate gene expression [8].

Although these covalent cytosine modifications do not alter DNA base pairing, they do protrude into the major and minor grooves of DNA and impact other aspects of DNA conformation [14]. These effects can influence the DNA binding of transcription factors [15, 16]. Many transcription factors prefer specific motifs, enabling the sequence specificity of transcriptional control [17]. The position weight matrix (PWM) model allows the computational identification of transcription factor binding sites by characterizing the position-specific preference of a transcription factor over the A/C/G/T DNA alphabet [18].

Just as transcription factors distinguish one unmodified nucleobase from another, some transcription factors distinguish between unmodified and modified bases. For example, some transcription factors, such as MeCP2, bind to methyl-CpG [19]. This type of non-sequence-specific modified nucleobase binding, however, occurs only in specific protein families [20].

A few transcription factors have well-characterized modification preferences. For example, both C/EBP\(\upalpha\) and C/EBPβ have increased binding activity in the presence of central CpG methylation, formylation, or carboxylation of their canonical binding motif (consensus: TTGC|GCAA). Both DNA strands contribute and hemi-modification leads to a reduced effect [21]. 5hmC inhibits binding of C/EBPβ, but not C/EBP\(\upalpha\) [21]. ZFP57 also prefers methylated motifs, specifically in the context of a completely centrally methylated TGC\(\overline {\text C}\)GC(R) heptamer (\(\overline {\text C}\) indicates methylation on the positive strand and G on the negative strand) [22, 23].

Additional methylation often occurs in ZFP57 motifs with a final guanine residue as the core binding site [23]. Crystallography and fluorescence polarization analyses further confirm this preference [24]. ZFP57 has successively decreasing affinity for the oxidized forms of 5mC [24]. In contrast, the basic helix-loop-helix (bHLH) family transcription factor c-Myc, has a strong preference for unmethylated E-box motifs, often preferring the fully unmethylated CACGTG hexamer [25, 26]. Many other bHLH transcription factors also demonstrate such a preference [27,28,29,30,31].

Other transcription factors also have methylation sensitivity [32]. Protein binding microarray data demonstrate that central CpG-methylated motifs have strong binding activity for multiple transcription factors [15]. Interestingly, these data also show that methylated motifs often differ from the unmethylated sequences that those transcription factors usually bind. Some transcription factors may even show increased binding in the presence of 5caC [33]. In Arabidopsis thaliana, among 327 transcription factors, 248 (76%) exhibited sensitivity to covalent DNA modifications, with 14 preferring modified DNA [34].

Transcription factors act as both readers and effectors of methylation [20]. They may bind to a modified base to prevent its modification or, in some instances, to increase the likelihood of its modification. Alternatively, transcription factors could bind to reverse an existing modification. These scenarios could occur in different genic contexts, potentially mediated by different motif groups. Even factors within the same family may have differences in modified binding preferences, conferring additional specificity or assisting in stable protein-DNA complex formation. This regulatory interplay [20, 35] highlights the need for additional genome-wide characterizations of transcription factor binding preferences in the context of modified DNA.

The role of modified DNA in transcription factor binding has motivated the development of a computational framework to elucidate and characterize altered motifs. A comprehensive in vitro analysis, coupled with selected follow-up crystal structures, revealed the mechanistic basis for some 5mC interactions [36]. A random forest [37, 38] combined genomic and methylation data [39] to predict transcription factor binding. Those predictions, however, did not attempt to predict the preference of factors for methylated DNA [39]. The MethMotif database enumerates methylated transcription factor motifs [40].

Most recently, Grau et al. [41] analyze an expanded alphabet genome from whole genome bisulfite sequencing (WGBS) data, for 5mC only. Their focus differs from ours, however. They emphasize that their models go beyond PWMs, the standard model to describe transcription factor DNA-binding specificities and allow for intra-motif dependencies. Their comparisons mainly focus on classification performance bench-marking in differentiating bound versus unbound sequences. Song et al. [42] demonstrate an in vitro method to assess modification-specific preferences of all cytosine states. They demonstrate distinct preferences of both symmetric and hemi-modifications. Hernandez-Corchado et al. [43] also recently provide a joint model of accessibility and methylation. They use this model to explore a large number of chromatin immunoprecipitation-sequencing (ChIP-seq) datasets, assessing many transcription factor binding site preferences for 5mC.

Existing work has often indirectly analyzed the impact of modified bases on binding, focused on improved motif elucidation itself, or often categorized modified binding preferences in a largely binary fashion. Mostly, when modeling the affinity of transcription factors for DNA sequences, previous work has not treated modified nucleobases as first-class objects akin to unmodified nucleobases, adding artificial distinctions unlikely to reflect the underlying biophysical interactions. There has been a dearth of large-scale comprehensive analyses including modified motifs. Also, there has been an absence of specific experimental follow-up to predicted motif preferences, directly detecting modified bases.

Here, we describe methods to analyze covalent DNA modifications and their effects on transcription factor binding sites by introducing an expanded epigenetic DNA alphabet. While others proposed expanding the genomic alphabet in other ways [44], we (in our earlier preprint of this work) [45] and Ngo et al. [46, 47] first proposed expanding it in this context for facilitating bioinformatic analyses of cytosine modifications. Unlike our work, however, Ngo et al. [46] focused on motif identification in this expanded alphabet. We, rather, leverage existing tools to focus on downstream consequences, such as distinct groups of modified-preferring motifs and specific predictions of modified binding preferences. We introduce Cytomod, a software to integrate DNA modification information into a single genomic sequence and we detail the use of extensions to the Multiple EM for Motif Elicitation (MEME) Suite [48] to analyze 5mC and 5hmC transcription factor binding site sensitivities. We validate our predictions for the transcription factor OCT4 by providing conjoint cleavage under targets and release using nuclease (CUT&RUN) [49, 50] datasets across conventional, methylated-, and hydroxymethylated-enriched sequences. Our results especially highlight that most factors can bind in both unmodified and modified contexts, to varying extents and often with different groups of motifs. While it was previously known that DNA methylation affects binding, here, we show that modified motifs are considerably more complex than previously appreciated and that many new motifs with varied modified binding preferences exist, to different extents across a variety of transcription factors.

Results

Expanded-alphabet genomes facilitate the analysis of modified base data

We created an expanded-alphabet genome sequence using oxidative (ox) and conventional WGBS maps of 5mC and 5hmC for naive ex vivo mouse CD4+ T cells [51]. We expand the standard A/C/G/T alphabet, adding the symbols m (5mC), h (5hmC), f (5fC), and c (5caC). We also designed and implemented symbols for the reverse strand, preserving information of complements (Methods). This allows us to more easily adapt existing computational methods, that work on a discrete alphabet, to work with epigenetic cytosine modification data.

Next, we generated individual modified genomes across four replicates of combined ox and conventional WGBS data [51] and for a variety of modified base calling thresholds. These calibrated modified genomes allowed us to accurately assess transcription factor binding site affinities, for both 5mC and 5hmC. We elaborate upon these base calling thresholds and their use in creating calibrated genomes in the next subsection and in Comparing motif modifications, using hypothesis testing. In order to construct modified genome sequences, specific to the varied epigenetic state of a cell type, we designed the Cytomod software. It allows us to rapidly construct combined threshold-specific modified genomes, using single-base resolution data. Modified genomes with our expanded alphabet allowed us to deploy our methods across large datasets including those from Encyclopedia of DNA Elements (ENCODE) [52].

We used these modified genome sequences as the basis for the extraction of genomic regions implicated by ChIP-seq data for all assessed transcription factors. These modified sequences have a central role not just in our method, but also in enabling bioinformatic analyses of modifications more generally (Discussion). Using the thresholds discovered in the murine analyses, we created conventional 5mC maps for the human K562 erythroid leukemia cell line [53, 54], from ENCODE WGBS data.

In addition to creating new standalone software to instantiate our expanded alphabet concept, we also updated the MEME Suite [48] and associated software, implementing the ability to work with custom alphabets, such as our expanded epigenomic alphabet. We created the MEME::Alphabet Perl module as part of the implementation. Others can use this module to rapidly obtain suitable expanded-alphabet definitions, making it easier to extend older code bases. This Perl module does not create expanded alphabets or expanded alphabet genome sequences but rather provides capabilities for other Perl software to read and handle biomolecular sequences with expanded alphabets. Moreover, it provides a reference for implementing the same capabilities in other programming languages. These changes allow comprehensive analyses of epigenetic states, including their impacts on transcription factor binding, with support for any additional modified bases. Furthermore the software improvements make all future MEME Suite tools compatible with expanded alphabets, enabling continuing innovation and insights in these areas.

Our methods yield suitable base-calling thresholds for downstream analyses

We constructed expanded-alphabet modified genomes, making discrete calls from the continuous output of our modification calling pipeline. The pipeline produced floating-point numbers in [0, 1] indicating the strength of evidence for a modification at each position. We determined whether to call a base modified or not by comparing the output values to a threshold value fixed across the whole genome (Comparing motif modifications, using hypothesis testing).

A grid search for transcription factor binding thresholds at 0.01 increments allowed us to determine suitable thresholds (0.3 and 0.7) for further investigation (Additional file 1: Fig. S2). Overall, this grid search demonstrated the suitability of a wide range of thresholds, indicating the range for which one can adequately maintain both specificity and sensitivity of modified binding detection. For example, de novo analyses of C/EBPβ confirmed the preference for methylated DNA, with methylated motifs having much greater central enrichment than their unmethylated counterparts, at both the 0.3 (Fig. 1) and 0.7 thresholds (Fig. 2).

Fig. 1
figure 1

C/EBPβ (GSM915179 [55]; 11 434 ChIP-seq peaks) CentriMo analysis of de novo and JASPAR motifs (Methods). Depicts female replicate 2 of the combined WGBS and oxWGBS data [51] at a 0.3 modification threshold. A The CentriMo result with the JASPAR C/EBPβ motif (orange), top Discriminative Regular Expression Motif Elicitation (DREME) unmethylated C/EBPβ motif (blue), and DREME methylated motifs (red, cyan, and green). B Sequence logo of the JASPAR C/EBPβ motif. C Sequence logo of the top DREME unmodified motif. D Sequence logo of the top DREME methylated motif. E Sequence logo of the second DREME methylated motif. F Sequence logo of the third DREME methylated motif. Listed p-values computed by CentriMo [56]. For consistency, we depict the JASPAR sequence logo using MEME’s relative entropy calculation and colouring

Fig. 2
figure 2

C/EBPβ (GSM915179 [55]; 11 434 ChIP-seq peaks) CentriMo analysis of de novo and JASPAR motifs (Methods). Depicts female replicate 2 of the combined WGBS and oxWGBS data [51] at a 0.7 modification threshold. the CentriMo result with the JASPAR C/EBPβ motif (orange), top DREME unmethylated C/EBPβ motif (blue), and DREME methylated motifs (red, cyan, and green). B Sequence logo of the JASPAR C/EBPβ motif. C Sequence logo of the top DREME unmodified motif. D Sequence logo of the top DREME methylated motif. E Sequence logo of the second DREME methylated motif. F Sequence logo of the third DREME methylated motif. Listed p-values computed by CentriMo [56]. For consistency, we depict the JASPAR sequence logo using MEME’s relative entropy calculation and colouring

We show the assessments at different thresholds not for comparing against each other, but to demonstrate the robustness of our results to varying the threshold. At both the minimum and the maximum of our modified base calling threshold calibration, we can elucidate expected modified motif preferences. In both cases, the expected motif has strong central enrichment across ChIP-seq peaks. Only the central region of these motif enrichment analyses are relevant. The bounding of suitable thresholds provided by the grid search analysis will likely prove useful for assessing future datasets as well.

Hypothesis testing reveals altered transcription factor binding preferences

Expanded-alphabet analysis shows results consistent with known preferences

We used a hypothesis testing approach on the expanded-alphabet sequence to examine the preferences of transcription factors for modified or unmodified DNA. First, we analyzed three transcription factors with previously known methylation or hydroxymethylation sensitivities. ZFP57 [23] and C/EBPβ [21] show a preference for methylated DNA, while c-Myc prefers unmethylated DNA [25, 26]. Additionally, C/EBPβ has reduced affinity for hydroxymethylated DNA [21].

We used the known preferences as controls to calibrate our modification-calling thresholds and to validate our approach. We used c-Myc as the positive control for an unmethylated binding preference [25, 26]. As positive controls for methylated binding preferences, we used both ZFP57 and C/EBPβ [21,22,23,24] (Detection of altered transcription factor binding in modified genomic contexts).

In this hypothesis testing framework, we tested all known unmodified transcription factor binding motifs against all possible 5mC and 5hmC modifications at all CpG dinucleotides. That is, for each unmodified and modified version of all motifs, across every transcription factor, we assessed the motif’s expected DNA binding affinity using the adjusted central enrichment p-value from CentriMo [56] (Detection of altered transcription factor binding in modified genomic contexts). For this analysis, we included motifs of interest from de novo results, and we partially or fully changed the base at a given motif position to each modified base, to comprehensively assess its affinity (Table 3; Comparing motif modifications, using hypothesis testing). To compare all binding affinities, we subtracted the \(\log _{10}\)-transformed p-value of the modified motif from the unmodified motif. Positive values for this difference represented a preference for the modified motif, while negative values represented a preference for the unmodified.

The expected transcription factor binding preferences for c-Myc, ZFP57, and C/EBPβ held across all four biological replicates of WGBS and oxWGBS data and for all investigated modified nucleobase calling thresholds (Fig. 3). The thresholds we investigated, representing modification confidence, varied from 0.01–0.99 inclusive, at 0.01 increments. We also obtained the same results for multiple different ChIP-seq replicates for these three transcription factors (Additional file 1: Fig. S2). Perturbations of binding assessments, such as peak-calling stringency (Additional file 1: Fig. S3) and required degree of motif statistical significance (Additional file 1: Fig. S4) demonstrated the robustness of our results.

Fig. 3
figure 3

Relationship between unmodified versus modified motif statistical significance of central enrichment (from CentriMo [56]) and modified base calling thresholds across different WGBS and oxWGBS specimens, in mice [51]. We compare each unmodified motif, at each threshold, to its top three most significant modifications for c-Myc and C/EBPβ, but only the single most significant modification for ZFP57. The displayed motif pairs changes at individual thresholds, depending on which motif pairs stay in the top three. See Tables 1 and 3 for an overview of modified base notation. Sign of value indicates preference for the unmodified (negative) motif or the modified (positive) motif. Rows: single ChIP-seq replicates for a particular transcription factor target, one each of A c-Myc (mESCs; Krepelova et al. [57]), B ZFP57 (CB9 mESCs; Strogantsev et al. [23]), and C C/EBPβ (C2C12 cells; ENCFF001XUT). Columns: replicates of WGBS and oxWGBS (mouse CD4+ T cells; Kazachenka et al. [51])

None of the c-Myc log p-value differences exceeded zero, confirming that c-Myc favours unmodified E-box motifs over modified c-Myc motifs. Two methylated motifs had the greatest increase in predicted binding affinity for C/EBPβ: TTGmGCAA and TTGC1TCA (see Tables 1 and 3 for an overview of modified base notation). As expected, ZFP57 favours binding to modified nucleobases over their unmodified counterparts. The well-known TGCm1m1 motif [23] had one of the greatest increases in predicted binding affinity of ZFP57 for modified DNA.

Table 1 The expanded epigenetic alphabet. This includes the known modifications to cytosine and symbols for each guanine complementary to a modified nucleobase. Symbol colours indicate recommended display colour in representations such as genome browser tracks

While ZFP57 had a strong preference for methylated DNA, we also observed a noticeable preference for hydroxymethylated DNA (Fig. 4F). CentriMo quantifies these preferences [56], both in terms of p-value significance, and in terms of the centrality of motif concentration (Detection of altered transcription factor binding in modified genomic contexts).

Fig. 4
figure 4figure 4figure 4

Modified versus unmodified motifs, combining score and cluster information, for selected transcription factors. These plots come from non–spike-in calibrated data, for the \({500\,\text {bp}}\) regions surrounding peak summits. We clip scores beyond \(\pm 4000\) and plot them at the threshold to maximize dynamic range where most scores occur. Some combinations of the displayed hypothesis pairs had multiple data points (for example, multiple identical hypothesis pairs, but for different data sub-types or stringencies). We aggregated these data points by plotting the maximum score. Below each plot is an asymmetric, diverging, colour scale that further highlights modification-preferring motifs. The colour scales are identical across plots. We depict a larger selection of transcription factors in Additional file 1: Fig. S7. A FEZF2 and C/EBPβ. B Individual motifs illustrating the range of preferences found for C/EBPβ in K562. Left: the least modified-preferring motif (\(\text {score} = -2177.28\)); centre: a motif lacking substantial preference (\(\text {score} = 2.36\)); right: the most modified-preferring motif (\(\text {score} = 3785.86\)). C JUND and JUN. D USF1 and USF2. E c-Myc. F ZFP57. G OCT4. H Individual motifs illustrating the range of preferences found for OCT4. Left: the least modified-preferring motif (\(\text {score} = -762.53\)); centre: a motif lacking substantial preference (\(\text {score} = 2.22\)); right: the most modified-preferring motif (\(\text {score} = 518.01\)). I The most highly significant and centrally enriched DREME motif, for OCT4 hmC-Seal CUT&RUN in mESCs (replicate 1). See Tables 1 and 3 for an overview of modified base notation

CentriMo reported 328 of 393 total ZFP57 methylated motifs with a score \({>}0\) (median: 171.2; max: 2292, exemplifying the strong preference; Motif clustering of modified binding preferences) across all our assessed ZFP57 datasets. This included motifs from mESCs, from both our previously mentioned BC8/CB9 Strogantsev et al. [23] datasets, and 2 motifs, both scoring positively, from Quenneville et al. [22] (Fig. 4F). Hydroxymethylated CpGs had a substantially smaller increase in binding affinity than methylated motifs (Fig. 3), but still greater than the completely unmethylated motif. CentriMo reported 291 of 435 total ZFP57 hydroxymethylated motifs with a score \({>}0\) (median: 152.4; max: 1379) across all motifs in our BC8/CB9 datasets (Fig. 4F). Our Quenneville et al. [22] analysis did not reveal any sufficiently significant hydroxymethylated motifs of any score. Most modified motifs that scored above zero, however, had at least one 5mC and one 5hmC nucleobase (Fig. 4F, red motifs). Therefore, our expanded-alphabet methodology recapitulates the observation that ZFP57 has the greatest binding affinity for motifs containing 5mC, followed by 5hmC, and then by unmodified cytosine [24].

Overall, these positive control results for known binding preferences allowed us to select thresholds sufficient to accurately assess modified binding preferences regardless of tissue-specific differences in modification frequency. This led us to discover novel modified motifs, across the wider array of transcription factors to follow.

Expanded-alphabet analysis enables comparisons across a wide array of transcription factors

Similarities in protein structure of transcription factors might form a useful categorical framework for expectations regarding modified base affinity. To that end, we looked for shared preferences among families of transcription factors for modified or unmodified bases in both the mouse and the human data. Mostly, we defined families with TFClass [58, 59] (Assessment of transcription factor familial preferences).

We and others [27,28,29,30,31] have found a consistent preference for unmethylated binding motifs across a broad selection of bHLH transcription factors. Some motifs for bHLH transcription factors had a putatively modified preference versus their unmodified JASPAR counterparts. Unmodified de novo motifs we generated for the same transcription factors, however, consistently had more significant p-values (Additional file 1: Fig. S7). This suggests that, as expected, these transcription factors usually preferred the unmodified motif. The leucine-zipper subfamily of bHLH transcription factors, however, had a subset of motifs that preferred to bind in a modified context. For example, both USF1 and USF2 preferred to bind in unmodified and modified contexts, to differing extents, and had mixed binding preferences within motif clusters (Fig. 4D; Discussion).

Many zinc finger family motifs displayed a propensity toward modified motifs, but not all. EGR1/ZIF268/NGFI-A, a Cis2–His2 zinc finger, showed a moderate binding preference for methylated DNA, with multiple positively scoring hypothesis pairs, including some \({>}100\). These very high scores indicate an exceptionally strong predicted binding preference for the modified, over the unmodified, nucleobases.

Conversely, ZNF384/CIZ/TNRC1 of the same sub-family, present in both our mouse and human analyses, had only weak evidence of a preference for binding modified DNA, with only a single hypothesis pair scoring above 10 (Additional file 1: Fig. S2). We suspect this factor intrinsically has the ability to bind in both unmodified and modified contexts, perhaps with a weak modified preference. This would likely hold across quite different tissue types. Unlike most of our analyzed transcription factors, this result occurred at a significant level in both our mouse and human datasets, allowing us to form this more general conclusion.

While we can use our methods to analyze and group transcription factors by their families, few clear signals of strong preferences nicely stratify in this manner. This suggests that complex preferences tend to outweigh family-specific patterns. Factors such as local epigenetic state or tissue type likely play a larger role in locus-specific transcription factor binding.

Our hypothesis testing confirms C/EBPβ’s dichotomous binding preferences

C/EBPβ provides an excellent test case for the impact of modified bases on transcription factor binding because of its dichotomous preferences for 5mC versus 5hmC [21]. Our method recapitulated this preference, across all ChIP-seq datasets, for all replicates of oxidative and conventional WGBS. Methylated motif pairs generally had positive ratios, whereas hydroxymethylated motif pairs had negative ratios (Additional file 1: Fig. S4).

One positive strand, hemi-methylated motif (TTGmGTCA), presented an exceptional case. Surprisingly, we observed a preference for the unmodified motif over its hemi-methylated motif. Unlike the consensus C/EBPβ motif. Unlike the consensus motif, this motif corresponds to the chimeric C/EBP|CRE octamer. This chimeric transcription factor has a more modest preference toward its methylated DNA motif [21]. Nonetheless, we would still have expected a weak preference for the hemi-methylated motif, over its unmodified counterpart. Additionally, we found greater enrichment for hemi-methylation than complete methylation, which contradicts findings of both strands contributing to the preferential binding of C/EBPβ [21]. This may arise from technical issues with hemi-methylation in our modified sequence, or because our methods have greatest accuracy only within specific cell types or contexts.

Many transcription factors bind in modified and unmodified contexts, with variable motif preferences

We analyzed 144 transcription factors to characterize their overall motifs and their affinities to methylated and hydroxymethylated DNA. Leveraging our hypothesis testing approach and normalized CentriMo-based scoring methods, we assessed all detected motifs, and specifically characterize their likely binding affinities. Our analyses revealed that several factors bind in both modified and unmodified contexts (Fig. 4). Unlike prior analyses which often aimed to binarize binding preferences, our results highlight that most factors can bind in both contexts, albeit to varying extents and with different motifs.

For example, protein binding microarray analyses have led to the conclusion that binding of the transcription factor JUND is “uniformly inhibited by 5mC” [60]. Overall, these data accord well with our results, for which almost all tested hypothesis pairs (37/44) showed an unmethylated preference. Nevertheless, closer inspection of these prior analyses reveals that they contain a small group of motifs where JUND showed a slight preference for 5mC [60]. Specifically, at least 8 cytosine-containing motifs have 5mC z-scores above zero, with at least 3 such motifs having scores of close to 30. Similar findings applied to 5hmC. This indicates that JUND likely has at least some preference for hydroxymethylated motifs, despite mostly preferring to bind in unmodified contexts.

In our analysis, JUND showed a preferences for binding to 7 motifs (including one hydroxymethylated motif) if the motifs were methylated (Fig. 4C). JUN, a transcription factor related to JUND, showed similar preferences (Fig. 4C).

FEZF2 appeared to have two completely different motifs, with no overlap in their preferences for modified versus unmodified cytosine (Fig. 4A). Indeed, removing modification information from modified-preferring FEZF2 motifs led to a single motif cluster, distinct from the main unmodified motif clusters. Therefore, two distinct motif classes for FEZF2 appear to exist.

Many of the motifs we found, across a wide array of transcriptions factors including those discussed above, were novel. Many motif groups, especially when viewed as collapsed or root motifs, often have similarity with previously reported motifs. Nonetheless, within these motif groups, we often find additional variations, as well as a number of entirely new motif groups, for most transcription factors.

More than half of the transcription factors we assessed bound almost or entirely exclusively in unmodified contexts (Additional file 1: Fig. S7). Specifically, if we limit our analysis to transcription factors without even a single slightly positive motif, \({49.3}\%\) of factors had all tested hypothesis pairs score below zero. This varied across our overall dataset, with other factors having some occasional modified preferences. Overall, we assessed a total of 144 distinct transcription factors, across all datasets. Grouping these by modification preference, with each factor potentially in multiple groups to reflect the possibility of mixed preferences, we found:

  • Seventy-one transcription factors appeared to bind only to unmodified motifs. Each transcription factor in this grouping had no positively scoring motifs.

  • Fourteen transcription factors appeared to bind predominantly to modified motifs. Each transcription factor in this grouping had motifs with an upper quartile score \({\ge }0\).

  • Nine transcription factors had no clear modified motif preferences. Each transcription factor in this grouping had no motifs with a score outside of \([-50, 50]\), excepting those transcription factors where every motif scored positively. Of these transcription factors, 5 had three or fewer significant motifs.

Modified-base CUT&RUN validates our predictions for OCT4

While OCT4 bound to a number of motifs in an unmodified context, some OCT4 motifs preferred binding in both methylated and hydroxymethylated states. A preference of OCT4 for methylated motifs has previously been reported [36], but we are unaware of any reports of a preference of OCT4 for hydroxymethylated sequences. Interestingly, those hydroxymethylated motifs appeared to predominantly cluster either on their own, or with the canonical OCT4 homo- and hetero-dimer motifs, rather than mixing with other motif groups, such as those belonging to methylated motifs or co-factors.

We validated our OCT4 predictions, by performing CUT&RUN [49, 50] experiments in mESCs, with conventional, bisulfite-converted, and hmC-Seal-seq [61] library preparations. These three sets of library preparations allowed us to characterize the modification states of OCT4-bound fragments across both methylated and hydroxymethylated contexts.

We observed that OCT4 has a strong preference to bind in a hydroxymethylated context, in line with our predictions. When comparing unconverted, conventional CUT&RUN to hmC-Seal-seq CUT&RUN, OCT4 and similar de novo motifs were preferentially bound in hydroxymethylated context (Fig. 5; top DREME motif: \(p = 4.5 \times 10^{-149}\); top OCT4 motif: \(p = 2.5 \times 10^{-13}\)) than in the unmodified context (top DREME motif: \(p = 4.1 \times 10^{-104}\); top OCT4 motif: \(p = 1.1 \times 10^{-8}\)), all with comparable or greater motif centrality. These motifs also had at least similar preferences for binding in a methylated context (Additional file 1: Fig. S6).

Fig. 5
figure 5

CentriMo [56] results for replicate 1 of OCT4 CUT&RUN in mESCs. Motifs include the top three DREME motifs with colour indicating rank: first (red); second (purple), third, where applicable (dark green). Motifs also include the top non-POU5F1 JASPAR motif (cyan), and the JASPAR POU5F1B motif (orange). Both of these motifs come from the JASPAR 2020 [62] core vertebrate set. We generated these results using \({500}\,\text {bp}\) regions centred upon the summits of MACS 2 [63] peaks generated from those CUT&RUN fragments ≤120 bp. We called peaks using IgG controls and without any spike-in calibration (Data processing). Listed p-values computed by CentriMo [56]. For consistency, we depict the JASPAR sequence logo using MEME’s relative entropy calculation and colouring. A bisulfite-converted (methylated; 2797 CUT&RUN peaks) sequences. B hmC-Seal (3974 CUT&RUN peaks) sequences. Also depicts the top MEME [64] de novo motif (blue). C Sequence logo of the POU5F1B motif (MA0792.1). D Sequence logo of the top DREME de novo motif

The predicted cluster that included the canonical POU5F1B JASPAR motif (MA0792.1) also showed enrichment for 5hmC motifs. Overall, our findings suggest that OCT4 specifically binds hydroxymethylated nucleobases, in concert with methylated and unmodified binding sites.

Discussion

We developed a method for creating modified genomic sequences at suitable thresholds, using our tool Cytomod. We have added expanded alphabet capabilities to the widely used MEME Suite [48], a set of software tools for the sequence-based analysis of motifs. This included extending several of its core tools, including: MEME [64], DREME [65], and CentriMo [56], used in a unified pipeline through MEME-ChIP [66]. We undertook further extension of all downstream analysis tools and pipelines, and most of the MEME Suite [48] now supports arbitrary alphabets. Our approach has yielded a much greater understanding of transcription factor’s affinities and motifs in a modified genomic context. We validated our novel OCT4 binding site predictions, generating new high-quality binding site data, in both unmodified and modified genomic contexts.

We devised a hypothesis testing approach to enable more accurate comparisons between unmodified and modified motifs. Hypothesis testing, with equal central region widths and relative entropies, leads to more interpretable results than the standard CentriMo analyses, in that it permits a direct comparison of centrality p-values. These p-values help assess the statistical significance of the motif within the central region of its detected binding enrichment—a strong indicator of direct DNA binding [56]. We often observed the expected outcomes for many replicates of conventional CentriMo runs with de novo motifs, such as with C/EBPβ (Figs. 1 and 2) and ZFP57 (Additional file 1: Fig. S5). We encountered instances, however, in which de novo CentriMo analyses did not show the expected motif binding preference. This occurred for c-Myc and for a small subset of ZFP57 CentriMo results pertaining to de novo motifs, despite the hypothesis testing robustly corroborating its expected preference for unmethylated DNA (Additional file 1: Fig. S2). Overall, our hypothesis testing framework allows for a more accurate comparison than a direct assessment of de novo motifs, which would be less well-controlled for technical biases.

Prior to our introduction of the expanded epigenetic alphabet concept, it was not possible to perform direct, non-subsequent analyses to assess modified bases like 5mC in motifs that leverage standard algorithms (such as for motif elucidation) or existing engineering (such as comprehensive analysis pipelines). While some of the transcription factors examined had previously known preferences for modified or unmodified cytosines, the motifs found here are, in effect, completely new, due to this expanded alphabet allowing the joint consideration of cytosine modification status and multiple-base sequence specificity.

Without the expansion of the alphabet, one cannot directly use motifs to search for nor distinguish between any unmodified or modified bases. As such, our analyses and motifs result directly from this expansion, and all our results demonstrate quantified elucidation of expanded alphabet motifs. One can see this most emphatically in our validation results, where we show that we can directly detect motifs pertaining to bound chromatin fractions specifically containing the predicted modification of interest (Fig. 5).

Various biochemical complexities increase the difficulty of mapping cytosine modifications. These complexities include strand biases [9], populations of cells with different modifications at the same locus, and hemi-methylation [67]. Our use of Maximum Likelihood Methylation Levels (MLML) [68] to provide consistent estimators of modification, our relative entropy normalization, and our controlled hypothesis testing approach all help to minimize the impact of these challenges.

Cytosine modifications occur most frequently at CpG dinucleotides. Nevertheless, non-CpG 5mC nucleobases still exist, particularly in mESCs [69, 70]. Within a population of cells, at a given locus, unmodified nucleobases and different kinds of modified nucleobases often co-occur [9].

Our methods ensure the comprehensive analysis of non-CpG modifications. While this can result in some modified hypotheses being unlikely to occur in some cell types, we can still evaluate and score those hypotheses in an unbiased and tissue-specific manner. Therefore, some motifs shown may be unlikely to occur but will usually tend to have scores near 0. One can interpret such scores as a weak preference, should that motif be present. Our DNAmod database catalogues these and many other DNA modifications [71].

We suspect that the inability of de novo analyses to reveal modified binding preferences primarily arises from being unable to integrate modified and unmodified motifs. Our de novo analyses cannot compensate for the large differences in modified versus unmodified background frequencies. De novo analysis involves some form of optimization or heuristic selection of sites—an inherently variable process. Modified motifs have particular characteristics that differ from most unmodified motifs. Most notably, they necessarily differ from the overall and likely local sequence backgrounds, as a result of the low frequency of modifications. Conversely, an unmodified genome sequence has a comparably uniform nucleobase background, and unmodified motifs usually appear within local sequence of highly similar properties to the motifs themselves [72]. Accordingly, without specifically accounting for these confounds, modified motifs can get lost within a background of irrelevant unmodified motifs or one might not find comparable sets of motifs. Also, modified motifs that a de novo analysis finds might not be comparable to any unmodified counterpart. This may arise from the potential pairs of motifs having substantially different lengths, often with the modified motifs having significantly shorter length. Comparing motifs having sequence properties that often indicate a poor-quality motif also remains difficult. These properties include repetitious motifs, or off-target motifs, such as zingers—common contaminant motifs similar to CTCF, ETS, JUN, and THAP11 [73]. Hypothesis testing, with relative entropy normalization, can mitigate these concerns. Possibly, however, we often simply observe bona fide, but non-canonical, motifs. Non-canonical binding sites have far more abundance and importance than generally appreciated [74]. Therefore, while our approach can yield biologically relevant modified de novo motifs, one should not rely solely on these motifs’ binding preferences to conclusively establish a factor’s preference for modified DNA. Our hypothesis testing approach, however, helps mitigate the above biases, allowing for more robust comparisons.

Our method is robust in the face of parameter perturbations. There exists an inherent trade-off between a lower and higher modification calling threshold. The low threshold may yield more modified loci but potentially introduce false positives, while a higher threshold may prove too stringent to detect modified base binding preferences. Nonetheless, expanded-alphabet motif analysis across a broad range of modified base calling thresholds consistently led to the same expected results, across three transcription factors and a number of ChIP-seq and bisulfite sequencing replicates (Additional file 1: Fig. S2). We selected a lower threshold of 0.3, based primarily on the observation of increased variance and decreased apparent preference for unmethylated DNA for c-Myc below this threshold, across multiple replicates (Additional file 1: Fig. S2). We also selected an upper threshold of 0.7, based primarily on the rapid decrease in relative affinity for methylated over unmethylated motifs in ZFP57 (Fig. 3) and, to a lesser extent, C/EBPβ (Additional file 1: Fig. S4). Furthermore, modification of peak calling stringency for a set of ZFP57 datasets did not negatively impact our ability to detect the affinity of ZFP57 for methylated DNA (Additional file 1: Fig. S3).

The consistency of our controls (c-Myc, ZFP57, and C/EBPβ) provides confidence in the ability of our method to detect and accurately characterize the effect of modified DNA on transcription factor binding. We applied our method to a diverse array of ChIP-seq data in order to identify biologically meaningful binding preferences. We were also able to confirm OCT4 binding preferences, by generating new experimental data.

We found that motifs often enriched for hemi-modified, as opposed to completely modified binding sites. These hemi-modified motifs often had more central enrichment, as measured by CentriMo, than those with complete modification of a central CpG dinucleotide (Figs. 1 and 2). This appears surprising, because in vitro experiments have demonstrated that each modification usually has an additive effect for transcription factors that prefer modified DNA, resulting in completely modified motifs having the greatest affinity [21, 24]. This might imply that the hemi-methylated motifs arise from technical artifacts, either in the bisulfite sequencing data or from the methods used. Alternatively, the hemi-methylation events we detect may arise from asymmetric binding affinities of transcription factors for 5mC (and 5hmC). ZFP57, for example, has known asymmetric recognition of 5mC, with negative strand methylation more important than positive strand methylation in the TGCCGC motif [24]. In addition, there exists evidence for a preference for hemi-methylation of the C/EBP half-site |GmAA [60]. Therefore, the hemi-methylated motifs observed in some of our analyses, especially for C/EBPβ, may represent bona fide preferences. This would accord with similar findings in an independent analysis [42]. Nevertheless, further work is needed to determine whether the hemi-methylated motifs we discover reflect an actual biological preference.

There exist few high-quality single-base resolution datasets of 5hmC, 5fC, and 5caC. We had previously attempted analyses using DNA modification data that did not have single-base resolution, such as from assays like methylation DNA immunoprecipitation (MeDIP) [75], that did not employ single-base resolution methods [76]. A lack of single-base resolution makes it difficult to create a discrete genome sequence with a reasonable abundance of the modification under consideration without biasing the sequence. This makes downstream analyses of transcription factor binding uninformative. Therefore, it is essential to have single-base resolution data, for any modifications that one wishes to analyze. Additionally, many single-base resolution datasets use some form of reduced representation approach that enriches CpGs, because this allows sequencing at reduced depth, while still capturing many DNA modifications. The use of reduced representation bisulfite sequencing (RRBS) data can lead to confounding factors, due to the non-uniform distribution of methylated sites surveyed. Accordingly, we recommend avoiding similar enrichment approaches for use with our framework.

The ChIP-seq data we used were not generated in the same cell type as the WGBS and oxWGBS data. While cell type specificity might cause confounding effects, we consistently observed the expected preferences in transcription factor binding for the expected modification affinities across multiple ChIP-seq replicates, often in different cell types. Therefore, we expect that using ChIP-seq and WGBS data from different cell types will lead to meaningful results.

Although we predominantly observed the expected transcription factor binding preferences, in some limited instances we did not. For example, we found that USF1 and USF2 appear to have a subset of 5mC- and 5hmC-preferring motifs (Fig. 4D). This contradicts previous in vitro work [77], which showed that USF1 prefers to bind neither 5mC nor 5hmC. Additionally, the same study suggested that while TCF3, a transcription factor related to USF1 and USF2, can undergo a conformational shift to bind to 5hmC, USF1 cannot. The in vitro work largely derived from structural preferences, however: it “only focused on the most obvious, steric and hydrogen bonding effects...[more] complex methods are required to explain [their] subtler [protein binding microarray] results” [77]. Nonetheless, in their data, some of the C versus 5hmC and 5mC versus 5hmC z-score plots depict a few significant motif pairs with near equal preference for versus against hydroxymethylation. This indicates that their more general conclusion may not accurately sum up all of their data either; our results are more in accord than it would appear at first glance.

Our results for USF1 only had a single weakly positive-scoring motif containing a hydroxymethylated base: AAAhYAmA. This motif had only a slight preference to bind over its unmodified counterpart. This preference may instead have arisen primarily from the methylated base near the end of the motif. The low score might indicate that there is no strong preference; it could also suggest a technical artifact. All our other modified-preferring motifs for USF1 preferred to bind in a methylated context, with most showing only weak preferences. This stood in contrast to the many methylated motifs that showed strong preferences for their unmodified counterparts. Overall, these results suggest that not all USF1 motifs tend to bind in unmodified contexts, even if most of them may do so. USF2, however, has a non-negligible number of motifs that appear able to bind in a hydroxymethylated context, having 5hmC motifs that scored above zero.

Our limited assessment of transcription factor preferences across different families did not yield clear conclusions. Despite previous findings for specific families, like bHLH factors tending to prefer to bind to unmodified DNA [27,28,29,30,31], most conclusions in this area are ambiguous [40,41,42, 78]. One reason for this is the different motif groups that prefer unmethylated versus methylated binding for many transcription factors. This tends to confound binary categorization even for individual transcription factors. Across a whole family of transcription factors, making this binary call becomes even more difficult. Even closely related transcription factors, like USF1/2, often have different degrees of preferences and variable motifs. A second reason is observation bias with regards to transcription factors for which one can find binding data. There is a particular depletion of binding data for the large number of zinc finger transcription factors. This bias may account, at least partially, for the often observed greater number of unmodified-preferring factors [20, 36]. For an unbiased assessment of transcription factors across families, we need data on a less biased set of transcription factors [36, 79, 80].

The MEME Suite’s [48] new custom alphabet capability permits further downstream analyses of modified motifs. Our custom alphabet is provided together with this software and is available both from the MEME Suite webpage and as the standalone MEME::Alphabet package. For example, one can find individual motif occurrences with Find Individual Motif Occurrences (FIMO) [81] or conduct pathway analyses with Gene Ontology for MOtifs (GOMO) [82]. Alternatively, one can use FIMO results for pathway analyses through tools like Genomic Regions Enrichment of Annotations Tool (GREAT) [83] or Biological Enrichment of Hidden Sequence Targets (BEHST) [84]. For further interpretation of the results, one can use downstream pathway analysis tools, such as Enrichment Map [85, 86]. This permits inference of implicated genomic regions and biological pathways.

We designed all of our software so that others can readily extend our approach to additional DNA modifications. Technology now allows the detection of a number of DNA modifications [71] at high resolution, such as 5-hydroxymethyluracil (5hmU), 5-formyluracil (5fU), 8-oxoguanine (8-oxoG), and 6-methyladenine (6mA), many of which occur in diverse organisms [87,88,89]. We provide recommendations for the nomenclature of these modified nucleobases, among others (Additional file 2: Appendix A [90, 91]). We used these recommendations in our database of DNA modifications, DNAmod [71]. The Global Alliance for Genomics and Health (GA4GH) [92] has also adopted these recommendations for use in sequence alignment/map (SAM) and binary alignment/map (BAM) formats [93].

For representation of sequence data with modifications as input to neural network architectures, one could use our expanded alphabet. One could use this for either modified motif elucidation or for overall classification of a transcription factor’s propensity to bind modified bases. Naively, one could encode the expanded epigenetic alphabet by simply extending the standard one-hot DNA encoding, as long used in motif elucidation [94], to add our additional symbols. Without additional changes, however, we would not recommend this approach because of the substantial disparity in modified base frequencies.

For representing expanded epigenetic alphabet data, we would suggest using an approach where significantly lower frequency of modified bases compared to unmodified has a lower impact. One might instead encode all nucleobases as vectors representing their functional groups, as recently done in a similar unmodified context [95]. Alternatively, one might adapt recent work which designed filters to create sparse codes, similar to images, that can effectively encode DNA motifs [96]. This work constructs the encoding from PWMs, as we do for unmodified bases. It uses a one-hot encoding for DNA, but should allow for the incorporation of altered background frequencies. These approaches likely have more resilience to biases in this expanded alphabet context, while still enabling the ready application of modern neural networks to modified nucleobase data.

It will be important to characterize modified binding affinities in vivo, in addition to the more abundant in vitro approaches, such as high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) [97] and DNA affinity purification sequencing (DAP-seq) [34]. While the in vitro approaches contribute to our improved understanding of the underlying biophysics, only in vivo analyses can directly assess the actual cellular binding events that lead to differences in gene expression pattern. By using available ChIP-seq data, our work contributes to this effort and bolsters it by providing transcription factor CUT&RUN datasets that directly assess unmodified and modified binding states. These data represent a unique type of experiment, one needed to fully understand the role of methylation and hydroxymethylation in transcription factor binding.

Conclusions

We provide a framework for transcription factor binding motif analyses on sequences containing DNA modifications. Our approach’s ability to reproduce known transcription factor binding affinities and the validation of our predictions for OCT4 suggest that these methods meaningfully predict the modification sensitivity of transcription factors. One can use our approach to analyze a wide array of transcription factors across diverse sets of epigenetic modifications, in any organism for which suitable data exist. The existence of specific transcription factor binding motifs whose recognition is driven by cytosine modifications may explain why transcription factors bind specific repetitive element loci, as opposed to every genome-wide iteration of the motif. Our work provides an initial foundation towards a better understanding of this important aspect of motif specificity.

Methods

Our combinatorial and statistical approach to assess the impact of DNA modifications uses an expanded epigenetic alphabet to harness existing the powerful motif analysis workflows of the MEME Suite [48] and Regulatory Sequence Analysis Tools (RSAT) matrix-clustering [98] (Fig. 6). We report each step of this process, in detail, in the subsections below, outlining every processing step of our analysis methods.

Fig. 6
figure 6

Overall workflow of all main software employed for our analyses. Cylinders: datasets; rectangles: processes

An expanded epigenetic alphabet

To analyze DNA modifications’ effects upon transcription factor binding, we developed a model of genome sequence that expands the standard A/C/G/T alphabet. Our model adds the symbols m (5mC), h (5hmC), f (5fC), and c (5caC). This allows us to more easily adapt existing computational methods, that work on a discrete alphabet, to work with epigenetic cytosine modification data.

Each symbol represents a base pair in addition to a single nucleotide, implicitly encoding a complementarity relation. Accordingly, we add four symbols to represent G when paired with modified C (Table 1): 1 (G:5mC), 2 (G:5hmC), 3 (G:5fC), and 4 (G:5caC). This ensures that complementation remains a lossless operation. The presence of a modification alters the base pairing properties of a complementary guanine [14], which this also captures. We number these symbols in the same order in which the TET enzyme acts on 5mC and its oxidized derivatives (Additional file 1: Fig. S1) [5].

Many cytosine modification-detection assays only yield incomplete information of a cytosine’s modification state. For example, conventional bisulfite sequencing alone determines modification of cytosine bases to either 5mC or 5hmC, but cannot resolve between those two modifications [5]. Even with sufficient sequencing to disambiguate all modifications, we require statistical methods to infer each modification from the data, resulting in additional uncertainty. To capture common instances of modification state uncertainty, we also introduce ambiguity codes: z/9 for a cytosine of (completely) unknown modification state, y/8 for a neither hydroxymethylated nor methylated cytosine, x/7 for a hydroxymethylated or methylated cytosine, and w/6 for a formylated or carboxylated cytosine (Table 2). These codes are analogous to those defined by the Nomenclature Committee of the International Union of Biochemistry already in common usage, such as for unknown purines (R) or pyrimidines (Y) [99, 100].

Table 2 Ambiguous bases for uncertain modification states. The MEME Suite recognizes these ambiguity codes in the same manner as the ambiguous bases already in common usage, such as R for A or G in the conventional DNA alphabet

Cytomod: method for creation of an expanded-alphabet genome sequence

Like most epigenomic data, abundance and distribution of cytosine modifications varies by cell type. Therefore, we require modified genomes for a particular cell type and would not necessarily expect downstream analyses to generalize. Accordingly, we first need to construct a modified genome that pertains to the organism, assembly, and tissue type we wish to analyze. This modified genome uses the described expanded alphabet to encode cytosine modification state, using calls from single-base resolution modification data.

To do this, we created a Python program called Cytomod. It loads an unmodified assembly and then alters it using provided modification data. It relies upon Genomedata [101] and NumPy [102] to load and iterate over genome sequence data. Cytomod can take the intersection or union of replicates pertaining to a single modification type. It also allows one to provide a single replicate of each type, and potentially to run it multiple times to produce multiple independent replicates of modified genomes. It permits flagging of ambiguous input data, such as when only possessing conventional bisulfite sequencing data, therefore yielding only x/7 as modified bases. Cytomod additionally produces browser extensible data (BED) [103, 104] tracks for each cytosine modification, for viewing in the University of California, Santa Cruz (UCSC) [103] (Fig. 7), or Ensembl genome browsers [105].

Fig. 7
figure 7

Differential cytosine modification status in naive mouse T cells for a 25 kbp region (within cytoband 17qB1) surrounding Zfp57 and Mog. This UCSC Genome Browser [103] display includes RepeatMasker [106] regions, CpG islands [107], GENCODE [108] genes, and calls for modified nucleobases h (5hmC), m (5mC), x (5mC/5hmC), z (C with unknown modification state), 1 (G:5mC), 2 (G:5hmC), 7 (G:5mC/5hmC), and 9 (G:C with unknown modification state)

We subjected the unaligned, paired-end, BAM files output from the sequencer to a standardized internal quality check pipeline. Widely known to work well, we selected Bismark [109] for alignment [110, 111]. We use the following processing pipeline: sort the unaligned raw BAM files in name order using Sambamba [112] (version 0.5.4); convert the files to FASTQ [113], splitting each paired-end (using BEDTools [114] version 2.23.0 bamtofastq); align the FASTA files to NCBI m37/mm9 using Bismark [109] (version 0.14.3), which used Bowtie 2 [115,116,117], in the default directional mode for a stranded library; sort the output aligned files by position (using Sambamba sort); index sorted, aligned, BAMs (SAMtools [93] version 1.2 index); convert the processed BAM files into the format required by MethPipe, using to-mr; merge sequencing lanes (using direct concatenation of to-mr output files) for each specimen (biological replicate), for each sex, and each of WGBS and oxWGBS; sort the output as described in MethPipe’s documentation (by position and then by strand); remove duplicates using MethPipe’s duplicate-remover; run MethPipe’s methcounts program; and finally run MLML [68]. After alignment, we excluded all random chromosomes. We use modifications called beyond a specified threshold (as described below) as input for Cytomod (with Genomedata [101] version 1.36.dev-r0).

In bulk data, usually one considers a base modified or not using some threshold, above which one “calls” a particular modified base or set of possible modifications. There exist several ways to perform modified base calling, generally first involving computing a proportion of modification, at a specific position. We use the MLML [68] method to do this. Then, we must decide the value sufficient to call a modification downstream.

MLML [68] outputs maximum-likelihood estimates of the levels of 5mC, 5hmC, and C, between 0 and 1. It outputs an indicator of the number of conflicts—an estimate of methylation or hydroxymethylation levels falling outside of the confidence interval computed from the input coverage and level. An abundance of conflicts can indicate the presence of non-random error [68]. We assign z/9 to all loci with any conflicts, regarding those loci as having unknown modification state. Our analysis pipeline accounts for cytosine modifications occurring in any genomic context. It additionally maintains the data’s strandedness, allowing analyses of hemi-modification.

Mouse expanded-alphabet genome sequences

We used conventional and oxidative WGBS data generated for naive CD4+ T cells, extracted from the spleens of C57BL/6J mice, aged 6 weeks–8 weeks. The dataset authors obtained a fraction enriched in CD4+ T cells, by depletion of non-CD4+ T cells by magnetic labelling, followed by fluorescence-activated cell sorting to get the CD4+, CD62L+, CD44low, and CD25 naive pool of T cells. We previously published these data [51] as part of the BLUEPRINT project [118] (GSE94674 [119]; GSE94675 [120]). We analyzed biological replicates separately, 2 of each sex.

For our mouse datasets, we aligned sequencing reads with Bowtie 2 [115,116,117] version 2.2.4. We used MethPipe [121] (development version, commit 3655360 [122]), to process the data.

We used our mouse datasets to calibrate our modified base calling thresholds. MLML [68] combines the conventional and oxidative bisulfite sequencing data to yield consistent estimations of cytosine modification state. In our case, with two inputs per mouse run (WGBS and oxWGBS), we obtain values of 0, 1, or 2. We created modified genomes using a grid search, in increments of 0.01, for a threshold t, for the levels of 5mC (m) and 5hmC (h), as described in Fig. 8.

Fig. 8
figure 8

Conditions on the MLML [68] confidence levels of 5mC (m) and 5hmC (h) in relation to a threshold t that lead to the calling of different modified nucleobases. We call a modification if m or h equal or exceed the threshold. These base assignments assume that MLML had no conflicts for the locus under consideration. If any conflicts occur, we use z as the base, irrespective of the values of m or h. We depict bases for the positive strand only, and complement those occurring on the negative strand, as outlined in Tables 1 and 2

We use half of the threshold value for assignment to x/7, since we consider that consistent with the use of the full threshold value to call a specific modification. Namely, if t suffices to call 5mC or 5hmC alone, \(m + h \ge t\) ought suffice to call x/7.

Human expanded-alphabet genome sequence

Using publicly available ENCODE WGBS data (ENCFF557TER and ENCFF963XLT), we created K562 (RRID: CVCL_0004) modified genome for the GRCh38/hg38 assembly at 0.3 and 0.7 stringencies. WGBS data alone does not differentiate 5mC and 5hmC. As the data cannot differentiate between these states, one might represent them as x. Nonetheless, we represent modified bases from the WGBS data as m, both for convenience, and because, in most cases, these positions are just methylated. We processed these datasets, as previously described. We aligned human datasets with Bowtie 2 [115,116,117] version 2.2.3 and processed with MethPipe [121] (release version 3.4.2 [123]).

Cytomod performance

Generally, one needs to run Cytomod only once per analysis project, so we did not focus on improving runtime performance. For the analyses of hundreds of ChIP-seq datasets undertaken here, outside of development, initial tests, and threshold calibration, we only had to run Cytomod four times. In other words, we only needed four modified genome sequences to complete most of our work, two per each selected threshold, one for the mouse genome and one for the human genome. Since future work will likely not require further Cytomod development and threshold calibration, others using Cytomod would not need to run it more than a few times either. Nonetheless, Cytomod performs quickly already, considering each run produces a complete modified genome assembly.

Typically, Cytomod completes work on an \({{\sim}3\,}\) Gbp genome in considerably less than 8 h on a single core of an Intel Xeon E5-2650 v2 2.6 GHz Linux workstation, and it uses less than 24 GB of RAM. Cytomod requires output disk storage space of approximately the same size as the unmodified input assembly. We benchmarked some specific illustrative runtimes. First, creating a modified GRCh38/hg38 assembly with 1 modification track (x; WGBS data only) took Cytomod 3 h 15 min 26 s. Second, creating a modified NCBI m37/mm9 assembly with 4 modification tracks (m, h, x, z; both WGBS and oxWGBS data) took 4 h 14 min 40 s.

Detection of altered transcription factor binding in modified genomic contexts

Following creation of expanded-alphabet genome sequences, we performed transcription factor binding site motif discovery, enrichment, and modified-unmodified comparisons. Here, we use mouse assembly NCBI m37/mm9 for all murine analyses, since we wanted to make use of all Mouse ENCODE [124] ChIP-seq data (RRID: CVCL_0188) without re-alignment nor lift-over. Specifically, we used the Mus musculus Illumina (San Diego, CA, USA) iGenome [125] packaging of the UCSC NCBI m37/mm9 genome. This assembly excludes all alternative haplotypes as well as all unreliably ordered, but chromosome-associated, sequences (the so-called “random” chromosomes). While ideal for downstream analyses, that assembly does not suffice for aligning data ourselves. Exclusion of these additional pseudo-chromosomes might deleteriously impact alignments, by resulting in the inclusion of spuriously unique reads. Therefore, we used the full UCSC NCBI m37/mm9 build when aligning to a reference sequence. For our human datasets, we used GRCh38/hg38, with all K562 ENCODE datasets.

We used all K562 peak calls, processed as outlined below, from Human ENCODE ChIP-seq data, from preliminary data processed for Karimzadeh et al. [126]. We briefly recapitulate the processing steps here. First, they align the raw reads with Bowtie 2 [115,116,117] (version 2.2.3). Then, they then de-duplicate reads using SAMtools [93] (version 0.1.19) and filter for those with a mapping quality of greater than 10 using option -bq 10. Finally, they call peaks without any control, using MACS 2 [63] (version 2.1.0) callpeak and options --qvalue 0.001 --format BAM --gsize hs --nomodel --call-summits.

We updated the MEME Suite [48] to work with custom alphabets, such as our expanded epigenomic alphabet. Additionally, we created the MEME::Alphabet Perl module to assist with its internal functionality. We incorporated these modifications into MEME Suite version 4.11.0.

We characterize modified transcription factor binding sites using MEME-ChIP [66]. It allows us to rapidly assess the main software outputs of interest: MEME [64] and DREME [65], both for de novo motif elucidation; CentriMo [56, 127], for the assessment of motif centrality; SpaMo [128], to assess spaced motifs (especially relevant for multipartite motifs); and FIMO [81].

We mainly focus upon CentriMo [56] for the analysis of our results. It permits inference of the direct DNA binding affinity of motifs, by assessing a motif’s local enrichment. In our case, we scan peak centres with PWMs, for the best match per region. We generate the PWMs used from MEME-ChIP, by loading the JASPAR 2014 [129, 130] core vertebrates database, in addition to any elucidated de novo motifs from MEME or DREME. CentriMo counts and normalizes the number of sequences at each position of the central peaks to estimate probabilities of central enrichment. CentriMo smooths and plots these estimates, using a one-tailed binomial test to assess the significance of central enrichment [56].

MEME-ChIP [66] can yield repetitive motifs, without masking of low complexity sequences. Existing masking programs do not support modified genomes, and we accordingly mask the assembly, prior to modification with Cytomod. We use this masking only for downstream motif analyses. We use Tandem Repeat Finder (TRF) [131] (version 4.07b in mice and 4.09 in humans) to mask low complexity sequences. We used the following parameters: 2 5 5 80 10 30 200 -h -m -ngs, from published TRF parameter optimizations [132]. For version 4.09, to ensure compatibility with GRCh38/hg38 or larger future genomes, we increased the maximum “expected” tandem repeat length to \({12\,000\,000}\), adding -l 12.

We ran MEME-ChIP [66], against Cytomod genome sequences for regions pertaining to ChIP-seq peaks from transcription factors of interest. For this analysis, we used the published protocol for the command-line analysis of ChIP-seq data [133]. We employed positive controls, in two opposite directions, to assess the validity of our results. We use c-Myc as the positive control for an unmethylated binding preference [25, 26]. For this control, we used ChIP-seq data from a stringent streptavidin-based genome-wide approach with biotin-tagged Myc in mESCs from Krepelova et al. [57] (GSM1171648 [134]). Also, we used murine erythroleukemia and CH12.LX Myc Mouse ENCODE samples (ENCFF001YJE and ENCFF001YHU). Conversely, we used both ZFP57 and C/EBPβ as positive controls for methylated binding preferences [21,22,23,24]. For C/EBPβ, we used Mouse ENCODE ChIP-seq data, conducted upon C2C12 cells (ENCFF001XUT) or myocytes differentiated from those cells (ENCFF001XUR and ENCFF001XUS). Also, we used one replicate of ZFP57 peaks provided by Quenneville et al. [22]. When processing this replicate, we used the same parameters as for our other ZFP57 samples, except for employing default MACS stringency (\(q = 0.05\)). The reduced peak calling stringency allowed us to ensure sufficient peaks for this older, lower-coverage, dataset. We constructed a ZFP57 BED file using BEDTools [114] (version 2.17.0) to subtract the control influenza hemagglutinin (HA) ChIP-seq (GSM773065 [135]) from the target (HA-tagged ZFP57: GSM773066 [136]). We retain only target regions with no overlap with any features implicated by the control file, yielding 11 231 of 22 031 features.

Modified binding preferences of ZFP57

We used ZFP57 ChIP-seq data, provided by Strogantsev et al. [23] (GSE55382 [137]), to examine the modified binding preferences of that transcription factor. Strogantsev et al. [23] derived these 40 bp single-end reads from reciprocal F1 hybrid Cast/EiJ \(\times\) C57BL/6J mESCs (BC8: sequenced C57BL/6J mother \(\times\) Cast father and CB9: sequenced Cast mother \(\times\) C57BL/6J father).

We re-processed the ZFP57 data to obtain results for NCBI m37/mm9. We performed this re-processing similarly to some of the Mouse ENCODE datasets, to maximize consistency for future Mouse ENCODE analyses. We obtained raw FASTQs using Sequence Read Archive (SRA) Toolkit’s fastq-dump. Then, we aligned the FASTQs using Bowtie [115] (version 1.1.0; bowtie -v 2 -k 11 -m 10 -t --best --strata). We sorted and indexed the BAM files using Sambamba [112]. Finally, we called peaks, using the input as the negative enrichment set, using MACS 2 [63] (version 2.0.10) callpeak, with increased stringency (\(q = 0.00001\)), with parameters: --qvalue 0.00001 --format BAM --gsize mm. This resulted in 90 478 BC8 and 56 142 CB9 peaks.

We used the ChIPQC [138] Bioconductor [139] package to assess the ChIP-seq data quality. We used the two control and two target runs for each of BC8 and CB9. Then, we used ChIPQC(samples, consensus=TRUE, bCount=TRUE, summits=250, annotation="mm9", blacklist="mm9-blacklist.bed.gz", chromosomes=chromosomes). We set the utilized list of mouse chromosomes to only the canonical 19 autosomal and 2 sex chromosomes. Using a blacklist, we filtered out regions that appeared uniquely mappable but empirically show artificially elevated signal in short-read functional genomics data. We obtained the blacklist file from the NCBI m37/mm9 ENCODE blacklist website (https://sites.google.com/site/anshulkundaje/projects/blacklists) [52]. The BC8 data had 13.7% fraction of reads in peaks (FRiP) and the CB9 data had 9.12% FRiP. Additionally, we performed peak calling at the default \(q = 0.05\). This resulted in many more peaks for both BC8 (197 610 peaks; 27.6% FRiP) and CB9 (360 932 peaks; 19.7% FRiP). The CB9 sample had a smaller fraction of overlapping reads in blacklisted regions (RiBL). At the default peak calling stringency, BC8 had an RiBL of 29.7%, while CB9 had only 8.38%. This likely accounts for our improved results with the CB9 replicate (Additional file 1: Fig. S2).

Additionally, we analyzed three ZFP57 ChIP-seq replicates (100 bp paired-end reads) pertaining to mESCs in pure C57BL/6J mice [140]. We paired each replicate with an identically conducted ChIP-seq in a corresponding sample, which lacked ZFP57 expression (ZFP57-null controls).

For the pure C57BL/6J data, we used the same protocol as for the hybrid data, except for the following three differences. First, instead of input as the negative control, we used the ZFP57-null ChIP-seq data. We ran Bowtie in paired-end mode (using -1 and -2). Second, we omitted the Bowtie arguments --best --strata, which do not work in paired-end mode. Instead, we added -y --maxbts 800, the latter of which we set with --best’s value, in lieu of the default threshold of 125. Third, we set MACS to paired-end mode (option -f BAMPE). This resulted in very few peaks, however, when processed with the same peak-calling stringency as the hybrid data (at most 1812 peaks) and FRiP values under 2%. Even when we used the default stringency threshold, we obtained at most 4496 peaks, with FRiP values of around 4.5%. Nonetheless, we still observed the expected preference for methylated motifs (Additional file 1: Fig. S2).

Processing of additional OCT4 and n-Myc datasets

We additionally used OCT4 and n-Myc ChIP-seq data, from Yin et al. [36] (GSE94634 [141]). Except as indicated below, we processed these in the same manner as our ZFP57 data. For these transcription factors, we used mouse data for three conditions per factor. These conditions consist of a wild-type sample, a triple-knockout sample for TET1+TET2+TET3, and a triple-knockout sample for DNMT1+DNMT3A+DNMT3B. OCT4 has two antibodies, both replicated, across all three conditions. Because of pooling of both replicates for one antibody in the TET triple-knockout condition, this leads to \(\text {3 conditions} \times \text {2 antibodies} \times \text {2 replicates} - \text {1 pooled replicate} = 11\) OCT4 samples. The n-Myc came from only a single antibody, resulting in \(\text {3 conditions} \times \text {2 replicates} = 6\) samples. We used the provided IgG samples as negative controls for peak calling for this dataset. As discussed previously [36], we also called peaks for matching samples against each of their mouse, rabbit, and goat IgG samples.

Comparing motif modifications, using hypothesis testing

To directly compare various modifications of motifs to their cognate unmodified sequences, we adopted a hypothesis testing approach. One can derive motifs of interest from a de novo result that merits further investigation. Often, however, researchers identify motifs of interest using prior expectations of motif binding preferences in the literature, such as for c-Myc, ZFP57, and C/EBPβ. For every unmodified motif of interest, we can partially or fully change the base at a given motif position to some modified base (Table 3).

Table 3 Illustrative examples of possible changes made to convert unmodified motifs to specific modified counterparts, for downstream hypothesis testing. We use stacked letters like simple sequence logos. At these positions, N represents any base frequencies other than the base being modified. These make up the other positions in the motif’s PWM. \(\overset{\texttt {C}}{\texttt {N}} \rightarrow \texttt{m}\) indicates that a position containing cytosine is modified by setting all base frequencies other than m to 0 and setting the frequency of m to 1. Conversely, \(\overset{\texttt {C}}{\texttt {N}}\texttt{\raisebox{4pt}{N}} \rightarrow \ \overset{\texttt {h}}{\texttt {N}}\texttt{\raisebox{4pt}{G}}\) indicates that a position containing cytosine is modified by replacing the frequency apportioned to C with h, leaving the other base frequencies at that position unmodified. We portray the second base of each dinucleotide as having a frequency of 1. This second base, however, could also comprise different bases of various frequencies, including the base shown

To directly compare modified hypotheses to their cognate unmodified sequences robustly, we tried to minimize as many confounds as possible. We fixed the CentriMo central region width (options --minreg 99 --maxreg 100). We also compensated for the substantial difference in the background frequencies of modified versus unmodified bases. Otherwise, vastly lower modified base frequencies can yield higher probability and sharper CentriMo peaks, since when CentriMo scans with its “log-odds” matrix, it computes scores for nucleobase b with background frequency \(\text {f}\left( b\right)\) as

$$\begin{aligned} \log \left( \frac{\Pr \left( b\right) }{\text {f}\left( b\right) }\right) . \end{aligned}$$

To compensate for this, we ensured that any motif pairs compared have the same length and similar relative entropies. To do this, we used a larger motif pseudocount for modified motifs (using CentriMo option --motif-pseudo). We computed the appropriate pseudocount, as described below, and provided it to iupac2meme. We set CentriMo’s pseudocount to 0, since we had already applied the appropriate pseudocount to the motif. We seek to normalize the average relative entropies of the PWM columns between two motifs.

The relative entropy (or Kullback-Leibler divergence), \(D_{\textrm{RE}}\), of a motif m of length |m|, with respect to a background model b over the alphabet A, of size |A|, is [142]

$$\begin{aligned} D_{\textrm{RE}}\left( m, b\right) = \sum _{i = 0}^{|{m}| - 1}\sum _{j = 0}^{|{A}| - 1}\left( m_{ij} \log_{2} \left( \frac{m_{ij}}{b_{j}}\right) \right) . \end{aligned}$$
(1)

For each position, i, in the motif, the MEME Suite adds the pseudocount parameter, \(\alpha\), times the background frequency for a given base, j, at the position: \(m^{\prime }_{ij} = m_{ij} + \alpha b_{j}\). This omits the effective number of observed sites, which the MEME Suite also accounts for, essentially setting it to 1.

Accordingly, to equalize the relative entropies, we needed only substitute \(m^{\prime }_{i\,j}\) for each \(m_{i\,j}\) in Eq. 1 and then isolate \(\alpha\). In this process, we solve for \(\alpha\), by equating \(D_{\textrm{RE}}\) for the unmodified motif with that of the modified motif, substituting as above, while holding \(\alpha\) for the unmodified motif constant. If we proceed in this fashion, however, our pseudocount would depend upon the motif frequency at each position and the background of each base in the motif. Instead, we can make a number of simplifying assumptions that apply in this particular case. First, the unmodified and modified motifs we compare differ only in the modified bases, which in this case, comprise only C or G nucleobases, with a motif frequency of 1. Additionally, we set the pseudocount of the unmodified motif to a constant 0.1 (CentriMo’s default). Thus, the pseudocount for a single modified base is the value \(\alpha\), obtained by solving, for provided modified base background frequency \(b_{m}\) and unmodified base frequency \(b_{u}\):

$$\begin{aligned} 1 + \alpha b_{m} \log_{2} \left( \frac{1 + \alpha b_{m}}{b_{m}}\right) = 1 + 0.1 b_{u} \log_{2} \left( \frac{1 + 0.1 b_{u}}{b_{u}}\right) . \end{aligned}$$
(2)

Equation 2 only accounts for a single modification, however, on a single strand. For complete modification, we also need to consider the potentially different background frequency of the modified bases’ complement. Thus, for a single complete modification, with modified positions \(m_{1}\) and \(m_{2}\) and corresponding unmodified positions \(u_{1}\) and \(u_{2}\), modified base background frequencies \(b_{m_{1}}\), \(b_{m_{2}}\), and unmodified base frequencies \(b_{u_{1}}\), \(b_{u_{2}}\), we obtained

$$\begin{aligned} 1 + \alpha&\ b_{m_{1}} \log_{2} \left( \frac{1 + \alpha b_{m_{1}}}{b_{m_{1}}}\right) + 1 + \alpha b_{m_{2}} \log_{2} \left( \frac{1 + \alpha b_{m_{2}}}{b_{m_{2}}}\right) \nonumber \\&= 1 + 0.1 b_{u_{1}} \log_{2} \left( \frac{1 + 0.1 b_{u_{1}}}{b_{u_{1}}}\right) + 1 + 0.1 b_{u_{2}} \log_{2} \left( \frac{1 + 0.1 b_{u_{2}}}{b_{u_{2}}}\right) . \end{aligned}$$
(3)

We numerically solved for \(\alpha\) in Eq. 3 for each modified hypothesis, using fsolve from SciPy [143]. Finally, we may have multiple modified positions. We always either hemi-modify or completely modify all modified positions, so the pseudocount is the product of modified positions and the \(\alpha\) value from Eq. 3.

The pseudocount obtained in this fashion does not exactly equalize the two motif’s relative entropies, since we do not account for the effect that the altered pseudocount has upon all the other positions of the motif. It does, however, exactly equalize the relative entropies per column (RE/col, as defined by Bailey et al. [142]) of the modified versus unmodified motifs, which suffices to ensure correctly normalized comparisons.

Then, we performed hypothesis testing for an unmodified motif and all possible 5mC/5hmC modifications of all CpGs for known modification-sensitive motifs for c-Myc, ZFP57, and C/EBPβ. These modifications consist of the six possible combinations for methylation and hydroxymethylation at a CpG, where a CpG is not both hemi-methylated and hemi-hydroxymethylated. These six combinations are: mG, C1, m1, hG, C2, and h2. For c-Myc, we constructed modified hypotheses from the standard unmodified E-box: CACGTG. For ZFP57, we tested the known binding motif, as both a hexamer (TGCCGC) and as extended heptamers (TGCCGCR and TGCCGCG) [22, 23]. We additionally tested motifs that occurred frequently in our de novo analyses, C(C/A)TGm1(C/T) (A). We encoded this motif as the hexamer MTGCGY and heptamers, with one additional base for each side: CMTGCGY and MTGCGYA. This encoding permitted direct comparisons to the other known ZFP57-binding motifs of the same length. Finally, for C/EBPβ we tested the modifications of two octamers: its known binding motif (TTGCGCAA) and the chimeric C/EBP|CRE motif (TTGCGTCA) [21].

Using CentriMo, we assessed motifs for their centrality within their respective ChIP-seq datasets. Then, we computed the ratio of CentriMo central enrichment p-values, adjusted for multiple testing [56], for each modified/unmodified motif pair. For numerical precision, we computed this ratio as the difference of their log values returned by CentriMo. This determines if the motif prefers a modified (positive) or unmodified (negative) binding site.

We conducted hypothesis testing across all four replicates of mouse WGBS and oxWGBS data, for a grid search of modified base calling thresholds. The levels output by MLML [68], allowed us to obtain these thresholds. We interpret these values as our degree of confidence for a modification occurring at a given locus. We conducted our grid search from 0.01–0.99 inclusive, at 0.01 increments. Finally, we plotted the ratio of CentriMo p-values across the different thresholds, using Python libraries Seaborn [144] and Pandas [145]. We used IPython [146] and IGV [147] during initial testing and data exploration. We also used GNU Parallel [148] throughout our workflow. Then we extended this combinatorial hypothesis testing approach, across all JASPAR and de novo motifs from our Mouse and Human ENCODE datasets, at our 0.3 and 0.7 selected thresholds.

Assessment of transcription factor familial preferences

We used TFClass [58, 59] (downloaded November 8, 2017) to categorize and group analyzed transcription factors into families and super-families. We used Pronto [149] (version 0.2.1) to parse the TFClass ontology files.

For transcription factors either not categorized at time of analysis or that yielded inexact matches, we manually assigned them to families and super-families. We performed curation by searching one or more of GeneCards [150], Genenames.org [151], UniProt [152], Gene3D [153], InterPro [154], Pfam [155], SMART [156], and SUPERFAMILY [157].

We manually re-named a number of transcription factors in the family assignment to match the names used elsewhere in our data, largely removing hyphens for consistency, creating a group for POLR2A, and adding a number of missing transcription factors. The following factors (with asterisks denoting any suffix and slashes denoting synonymous factors) underwent this manual annotation: RAD21, REC8, SCC1, ZC3H11*, CHD*, NELFE, PAH2, SIN3*, PIAS*, ZMIZ*, KLHL, HCFC*, EP300, TCF12, TIF1*/TRIM24/TRIM28/TRIM33, SMC*, KAT2A/GCN5, and SMARCA4/BRG1.

We aggregated all hypothesis testing results across either mouse or K562 datasets distinctly. Grouped by modification type (m or h), we aggregated across stringency (0.3 or 0.7), replicate of origin, and unmodified hypothesis. When comparing a modified hypothesis pair to its unmodified counterpart, different replicates of data may produce different scores. In this instance, to aggregate multiple similar hypothesis tests, we took the maximum absolute value score. For each transcription factor, we retained only the most statistically significant (“top-1”) or top three most statistically significant (“top-3”) hypothesis pairs. We omitted hypothesis pairs that lack statistical significance (\(\text{p-value} \ge 0.05\)).

Motif clustering of modified binding preferences

We used RSAT matrix-clustering [98] to hierarchically cluster similar motifs. For each transcription factor, we clustered each of its unmodified motifs, alongside their modified counterparts. These motifs matched the set of hypothesis pairs for that factor.

We partitioned each transcription factor’s motifs into unmodified-preferring (\(\text {score} < -\epsilon\)), modified-preferring (\(\text {score} > \epsilon\)), and those without any substantive preference (\(-\epsilon \le 0 \le \epsilon\)). Here, we set \(\epsilon = 5\), to ignore any near-neutral preferences.

After this, we removed duplicate hypothesis pairs, selecting only those with the scores furthest away from zero. Then, we plotted these clusters, annotated by their score, in a treemap [158] plot. We created this plot using R [159] (version 3.5.1) ggplot2’s [160] treemapify [161] and Python (version 2.7.15) Pandas (version 0.22.0) [145] data structures through rpy2 [162] (version 2.8.6).

We designed a colour scheme to highlight motifs with strong preferences. To do this, we used white for scores of 80, above or below zero (namely, from \(-80\) to 80). This represented an expansion of disregarded motif from the \(\epsilon = 5\) threshold used above. We also kept the colour ramp linear within a score of 20 on either side of 0. Outside of this range around the centre, the ramp becomes logarithmic.

To further highlight the rarer and lower scores occurring for motifs preferring to bind in modified contexts, we altered the mid-point of the colour scheme. To do this, we re-centred the colour scheme, shifting it \(-10\), thereby biasing it towards modified contexts, in shades of red. The re-centring offset skews the entire colour scheme toward red hues, including moving the white regions accordingly.

Validation of our OCT4 findings, using CUT&RUN

After finding that OCT4 had a number of both methyl- and hydroxymethyl-preferring motifs, we performed CUT&RUN [49, 50] on mESCs, targeting OCT4. We also performed CUT&RUN for a matched IgG, for use as a background during peak calling. We performed CUT&RUN on mESCs, targeting OCT4. We subjected the resultant DNA to 3 workflows: conventional library preparation for sequencing on an Illumina platform, bisulfite sequencing, and Nano-hmC-Seal-seq [61]. Using a NovaSeq 6000 (Illumina), we sequenced the resulting libraries from the 3 workflows, using a paired-end \(2 \times {150}\,\text {bp}\) read configuration (Princess Margaret Genomics Centre, Toronto, ON, Canada).

Cell lines

Using feeder-free conditions, we grew male E14 murine embryonic stem cells [163] on \({10}\,{\text{cm}}\) plates gelatinized with 0.2% porcine skin gelatin type A (Sigma, St. Louis, MO, USA) at 37 \(\,^{\circ}\)C and 5% CO2. The ES-E14TG2a (E14) embryonic stem cells were a gift from the Fazzio lab (RRID: CVCL_9108). We cultured cells in N2B27+2i media [164]. Briefly, this media contains DMEM/F12 [165] (Sigma) and Neurobasal media (ThermoFisher, Waltham, MA, USA), supplemented with \(0.5\times\) B27 (Invitrogen, Waltham, MA, USA), \(1\times\) N-2 Supplement, \(50\ \mu{\text {mol}/\text{l}}\) 2-mercaptoethanol (ThermoFisher), \({2}\,\text {mmol/l}\) glutamine (ThermoFisher), Leukemia Inhibitory Factor (LIF), \(3\ \mu{\text{mol}/\text{l}}\) CHIR99021 glycogen synthase kinase (GSK) inhibitor (p212121, Boston, MA, USA), and \(1\ \mu\text{mol/l}\) PD0325091 mitogen-activated protein kinase/extracellular signal-regulated kinase kinase (MEK) inhibitor (p212121). We passaged cells every \({48}\,\text {h}\) using trypsin (Gibco, Waltham, MA, USA) and split them at a ratio of \({\sim }1{:}8\) with fresh medium. We conducted routine anti-mycoplasma cleaning (LookOut DNA Erase spray, Sigma) and screened cell lines by PCR to confirm no mycoplasma presence.

CUT&RUN assay

We performed CUT&RUN as described elsewhere [166,167,168] using recombinant Protein A-micrococcal nuclease (pA-MN). Briefly, we extracted nuclei from \(\sim\)4 500 000 embryonic stem cells using a nuclear extraction buffer comprised of \({20}\,\text {mmol/l}\) 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES)-KOH [169], pH 7.9; \({10}\,\text {mmol/l}\) KCl; \({0.5}\,\text {mmol/l}\) spermidine; 0.1% Triton X-100; 20% glycerol; and freshly added protease inhibitors. We bound the nuclei to  \(500\ \mu{\text{l}}\) pre-washed lectin-coated concanavalin A magnetic beads (Polysciences, Warrington, PA, USA). Then, we washed beads in binding buffer (\({20}\,\text {mmol/l}\) HEPES-KOH, pH 7.9, \({10}\,\text {mmol/l}\) KCl, \({1}\,\text {mmol/l}\) CaCl2 MnCl2). We pre-blocked immobilized nuclei with blocking buffer (\({20}\,\text {mmol/l}\) HEPES, pH 7.5, \({150}\,\text {mmol/l}\) NaCl, \({0.5}\,\text {mmol/l}\) spermidine, 0.1% bovine serum albumin (BSA), \({2}\,\text {mmol/l}\) EDTA, fresh protease inhibitors). We washed the nuclei once in wash buffer (\({20}\,\text {mmol/l}\) HEPES, pH 7.5, \({150}\,\text {mmol/l}\) NaCl, \({0.5}\,\text {mmol/l}\) spermidine, 0.1% BSA, fresh protease inhibitors). Following this, we incubated nuclei in wash buffer containing primary antibody (anti-Oct4, Diagenode (Denville, NJ, USA) cat no. C15410305 or anti-IgG, Abcam (UK) cat. no. ab37415; RRID: AB_2631996) for \({1}\,\text {h}\) at \({4}\,^{\circ}\)C with rotation. Then, we incubated in wash buffer containing recombinant pA-MN for \({30}\,\text {min}\) at \({4}\,^{\circ}\)C with rotation.

Using an ice-water bath, we equilibrated samples to \({0}\,^{\circ}\)C and added \({3}\,\text {mmol/l}\) CaCl2 to activate pA-MN cleavage. Then, we performed sub-optimal digestion, at \({0}\,^{\circ}\)C for \({30}\,\text {min}\). As described in Step 31 of Skene et al. [50], we intentionally conducted digestion at a temperature lower than optimal, to prevent otherwise unacceptable background cleavage levels [49, 50]. We chelated digestion with 2XSTOP+ buffer (\({200}\,\text {mmol/l}\) NaCl, \({20}\,\text {mmol/l}\) EDTA, \({4}\,\text {mmol/l}\) ethylene glycol-bis(\(\upbeta\)-aminoethyl ether) \(N, N, N^{\prime}, N^{\prime}{\text{-}}\)-tetraacetic acid (EGTA), \(50\ \mu{\text{g/ml}}\) RNase A, \(40\ \mu{\text{g/ml}}\) glycogen, and \({1.5}\,\text {pg}\) MNase-digested Saccharomyces cerevisiae mononucleosome-sized DNA spike-in control).

After RNase A treatment and centrifugation, we released and separated genomic fragments. We digested protein using proteinase K. Finally, we purified DNA using phenol:chloroform:isoamyl alcohol extraction, followed by ethanol precipitation.

Library preparation for bisulfite sequencing

We prepared our bisulfite sequencing library using \({30}\,\text {ng}\) of CUT&RUN DNA. We used the Ultra II Library Preparation Kit (New England Biolabs (Canada) (NEB), cat. no. E7645L) following manufacturer’s protocol, with some modifications. In brief, after end-repair and A-tailing, we ligated NEBNext methylated adapters for Illumina (NEB, cat. no. E7535) at a final concentration of \(0.04\ \mu{\text{mol/l}}\) onto the DNA, followed by incubation at \({20}\,^{\circ}\)C for \({20}\,\text {min}\). Post adapter incubation, we subjected adapters to USER enzyme digestion at \({37}\,^{\circ}\)C for \({15}\,\text {min}\) prior to clean up using AMPure XP Beads (Beckman Coulter, cat. no. A63881).

We bisulfite-converted adapter-ligated CUT&RUN DNA using Zymo Research EZ DNA Methylation Kit (Zymo, Irving, CA, USA, cat. no. D5001) following the alternative protocol for the Infinium Methylation Assay (Illumina). Briefly, we added \(5\ \mu{\text{l}}\) of M-Dilution buffer to purified adapter-ligated DNA, and adjusted total sample volume to \(50\ \mu{\text{l}}\) with sterile molecular grade water. We incubated samples at \({37}\,^{\circ}\)C for \({15}\,\text {min}\), prior to the addition of \(100\ \mu{\text{l}}\) of CT Conversion Reagent. We further incubated samples, prior to purification, at (\({95}\,^{\circ}\)C for \({30}\,\text {s}\), \({50}\,^{\circ}\)C for \({60}\,\text {min}\)) for 16 cycles, then at \({4}\,^{\circ}\)C for at least \({10}\,\text {min}\), following the manufacturer’s protocol. Finally, we eluted samples in \(23\ \mu{\text{l}}\) molecular grade water.

We amplified the bisulfite-converted DNA using \(2\times\) HiFi HotStart Uracil+ ReadyMix (KAPA, Wilmington, MA, USA, cat. no. KK2801), and unique dual index primers (NEB, cat. no. E6440S) in a final volume of \(50\ \mu{\text{l}}\). We performed this using the following PCR program: \({98}\,^{\circ}\)C for \({45}\,\text {s}\), followed by 17 cycles of: \({98}\,^{\circ}\)C for \({15}\,\text {s}\), \({65}\,^{\circ}\)C for \({30}\,\text {s}\), \({72}\,^{\circ}\)C for \({30}\,\text {s}\), and final extension at \({72}\,^{\circ}\)C for \({60}\,\text {s}\). We purified and dual size selected amplified libraries using AMPure (Beckman Coulter, ON, Canada) XP Beads at \(0.6\times\) to \(1.0\times\) bead ratio, eluted in a volume of \(20\ \mu{\text{l}}\).

Library preparation for hmC-Seal sequencing

We performed library preparation using the NEB Ultra II Library Preparation Kit (NEB, cat. no. E7645L) on \({30}\,\text {ng}\) CUT&RUN DNA, as per the manufacturer’s protocol, with the below modifications. In brief, after end-repair and A-tailing, we purified adapter ligated DNA using AMPure XP Beads at a \(0.9\times\) ratio and eluted in \(11.5\ \mu{\text{l}}\) sterile water.

We added three spike-in DNA controls to the adapter ligated DNA to assess specific enrichment of modified DNA fragments. Controls consisted of \(0.2\ {\text{ng}}/\mu{\text{l}}\) working stocks of unmethylated and methylated Arabidopsis DNA spike-in controls from the Diagenode DNA methylation control package (cat. no. C02040012). They also included the 5hmC spike-in control DNA (amplified from the APC promoter) from the Active Motif Methylated DNA standard kit (Active Motif, cat. no. 55008). We combined \({0.3}\,\text {ng}\) of each spike-in DNA in a final volume of \(4.5\ \mu{\text{l}}\) per experimental sample. We mixed the adapter ligated DNA with the spike-in DNA mix. Then we aliquoted \(1.6\ \mu{\text{l}}\) of this mix into a separate PCR tube and stored it at \({-20}\,^{\circ}\)C, as an input control.

We 5hmC-glucosylated the remaining \(14.4\ \mu{\text{l}}\) CUT&RUN DNA mixed with spike-in controls, as previously described [61], with the below modifications. Briefly, we 1:1 diluted a \(3\ \mu{\text{mol/l}}\) stock of uracil diphosphate (UDP)-azide-glucose (Active Motif, cat. no. 55020) in \(1\times\) phosphate-buffered saline (PBS) to establish a working stock of \(1.5\ \mu{\text{mol/l}}\) for Mastermix preparation. We prepared a \(20\ \mu\text{l}\) glucosylation Mastermix per experimental sample consisting of the following: \(14.4\ \mu\text{l}\) CUT&RUN DNA mixed with spike-in controls, \(50\ \mu{\text{mol/l}}\) HEPES (pH 8.0), \({25}\,\text {mmol/l}\) MgCl2, \({0.1}\,\text {mmol/l}\) UDP-azide-glucose and 1 U of T4 Phage \(\upbeta\)-glucosyltransferase (NEB, cat. no. M0357L). We incubated the mix for \({1}\,\text {h}\) at \({37}\,^{\circ}\)C, to promote glucosylation.

Then, we performed biotinylation of azide-labelled 5hmC residues of the glucosylated DNA fragments. In sterile water, we prepared \({20}\,\text {mmol/l}\) dibenzocyclooctyne-PEG4-biotin conjugate (Bioscience, cat. no. CLK-A105P4-10) and stored it in one-time use aliquots at \({-20}\,^{\circ}\)C, to avoid freeze-thaw. We mixed \(20\ \mu{\text{l}}\) of glucosylated DNA with \(1.8\ \mu{\text{mol/l}}\) dibenzocyclooctyne-PEG4-biotin in a final reaction volume of \(22\ \mu{\text{l}}\), then incubated \({2}\,\text {h}\) at \({37}\,^{\circ}\)C to promote biotinylation. Then, we prepared MicroSpin P-30 Gel Columns (Bio-Rad, Hercules, CA, USA, cat. no. 7326223), following the manufacturer’s protocol, and used them to purify total DNA fragments from reaction components. Briefly, we loaded the sample onto the column, then centrifuged \({4}\,\text {min}\) at \(1000\!\times\!\text{g}\) to elute purified DNA sample in \(22\ \mu{\text{l}}\) of Tris buffer.

To specifically capture biotinylated 5hmC DNA fragments, we prepared \(2\times\) binding and washing (B&W) buffer (\({10}\,\text {mmol/l}\) Tris-HCl, \({10}\,\text {mmol/l}\) EDTA, \({2}\,\text {mol/l}\) NaCl). Using \(20\ \mu{\text{g}}\) of MyOne Streptavidin C1 Dynabeads (ThermoFisher, cat. no. 65001), we re-suspended in \({0.2}\,\text {ml}\) of \(1\times\) B&W buffer per experimental sample to wash beads. We subjected beads to 3 total washes, then re-suspended to a final volume of \(22\ \mu{\text{l}}\) per sample in \(2\times\) B&W buffer. We added \(22\ \mu{\text{l}}\) of purified total DNA fragments to \(22\ \mu{\text{l}}\) of washed beads, then incubated \({15}\,\text {min}\) under gentle rotation to promote streptavidin-biotin binding.

To isolate beads containing streptavidin-bound biotinylated DNA fragments, we incubated them on magnet for \({3}\,\text {min}\). Then, we washed the beads 3 times with \(1\times\) B&W buffer to remove non-biotinylated DNA fragments lacking 5hmC. Finally, we re-suspended the beads in \(50\ \mu{\text{l}}\) of low TE buffer.

We conducted quantitative PCR (qPCR) to compare enrichment of 5hmC spike-in after biotin enrichment relative to input sample stored earlier. qPCR in the form of a \(10\ \mu{\text{l}}\) reaction consisted of \(1\times\) SYBR Fast qPCR Mastermix (KAPA, cat. no. KK4601), \(1\ \mu{\text{l}}\) of template DNA, and primers at a final concentration of \(0.3\ \mu{\text{mol/l}}\). We set up different reactions for each primer set to detect each spike-in control DNA separately. We used template-specific forward and reverse primers. For 5hmC, we quantified spike-in DNA fragment from the Active Motif Methylated DNA standard kit. We also quantified the methylated or unmethylated Arabidopsis DNA spike-in controls from the Diagenode DNA Methylation Control package kit. We amplified with the following PCR program: \({98}\,^{\circ}\)C for \({30}\,\text {s}\), followed by 40 cycles of \({98}\,^{\circ}\)C for \({30}\,\text {s}\) and \({60}\,^{\circ}\)C for \({15}\,\text {s}\) (with image capture), ending with melt curve analysis.

To generate bead-free template for library DNA amplification, we established PCR reaction mix containing \(0.3\ \mu{\text{mol/l}}\) of unique dual index primers (NEB, cat. no. E6440S), \(1\times\) NEBUltra II Q5 MM, and DNA/bead template for a final volume of \(100\ \mu{\text{l}}\) per sample. We split samples into \(2\times{50}\ \mu{\text{l}}\) reactions and amplified using the following PCR program: \({98}\,^{\circ}\)C for \({30}\,\text {s}\), followed by 5 cycles of \({98}\,^{\circ}\)C for \({10}\,\text {s}\), \({60}\,^{\circ}\)C for \({75}\,\text {s}\), ending with a hold at \({4}\,^{\circ}\)C. Then, we transferred reaction tubes to a magnetic rack and transferred bead-free supernatant to new PCR tubes. To amplify DNA libraries for a maximum of 16 cycles total (including initial 5 cycles), we used the same PCR conditions for the bead-free template. Then, we dual size selected DNA libraries using AMPure XP beads at \({0.7}\times\) to \({1.0}\times\) ratio, as described in the library preparation for bisulfite sequencing.

Sample sequencing

We performed library preparation of \({5}\,\text {ng}\) of CUT&RUN DNA, following the NEB Ultra II Library Preparation Kit (cat. no. E7645L) manufacturer’s protocol. We used different NEB dual indices for each sample (Table 4). We sequenced all libraries on a NovaSeq 6000 sequencing system using a SP flow cell run in standard mode, with paired-end \(2\!\times\!150\,\text{bp}\) read length configuration. This allowed us to obtain the desired number of reads per sample (Table 4).

Table 4 CUT&RUN samples used in our experiments, with their sequencing technique, indices, and target read details. Target reads represent the number of single-end equivalent Illumina passing-filter read estimates we sought to obtain

Data processing

We performed base calls using Real-Time Analysis (RTA) (version 3.4.4). Using bcl2fastq (version 2.20), we converted Binary Base Call (BCL) files to FASTQ files.

We processed the CUT&RUN sequences as follows. Before alignment, we trimmed adapter sequences with fastp (version 0.19.4) [170]. We assessed sequencing data quality using FastQC (version 0.11.8) [171], Picard [172] (version 2.6.0) CollectInsertSizeMetrics, QualiMap [173] (version 2.2) bamqc, Preseq [174] (version 2.0.0) bound_pop and lc_extrap, DeepTools [175] (version 3.1.3), and MultiQC [176] (version 1.7). For tools requiring Java, we used Java SE 8 Update 45. For tools requiring, Python we used version 2.7.12, except as otherwise noted.

We aligned reads to GRCm38/mm10 with Bismark [109] (version 0.22.3; for 5mC or 5hmC sequences). Bismark used Bowtie 2 [115,116,117] (version 2.4.1; also directly used for conventional sequences), SAMtools [93] (version 1.10), and BEDTools [114] (version 2.29.2). We used Bismark’s default parameters, save those controlling output destinations and use of multiple cores, and parameters passed to Bowtie 2, as described below.

We used Bowtie 2 parameters as recommended [177], excepting increasing alignment sensitivity, and specifying implied or default parameters. Therefore, we used the parameters -D 20 -R 6 -N 1 -L 18 -i S,1,0.25 for increased sensitivity, slightly more so than the --very-sensitive-local preset. We used -I 10 for a minimum fragment length of 10 bp and -X 700 for a maximum fragment length of 700 bp, as recommended [50, 177]. This range of fragment lengths included those we selected for during library preparation (30 bp–280 bp). We also used the parameters --local --phred33 --no-unal --no-discordant --no-mixed. For alignments used for calculating the spike-in coefficient, we did not permit dovetailing (--no-dovetail) nor overlaps (--no-overlap), as recommended [50].

For post-processing, we used Sambamba [112] (version 0.7.1), including marking duplicates. Where applicable, we performed spike-in calibration as described by Meers et al. [177].

For our final OCT4 results, we did not use our S. cerevisiae spike-in calibrated data. In the unmodified context, the spike-in calibrated data made little difference. In the modified context, insufficient modified bases in the spike-in probably prevented us from properly calibrating.

We called peak summits using MACS 2 [63] (version 2.1.2). We ensured that the input only included reads with insert sizes \({\le } {120}\,\text {bp}\), as recommended [50, 177], by using DeepTools [175] (version 3.1.3) alignmentSieve.

For data not calibrated with spike-in, we used MACS 2 callpeak, specifying treatment and control inputs and outputs as usual. We used the additional MACS 2 parameters, --buffer-size 1000000 --format BAMPE --gsize mm --qvalue 0.05 --call-summits --bdg --SPMR.

For spike-in calibrated data, we used advanced MACS sub-commands, constructed to yield a peak calling scheme that worked well for CUT&RUN datasets. Specifically, we used pileup on BAMPE input, then bdgopt to multiply by the scaling factor defined by the spike-in calibration. Then, we added a pseudocount of 1.0 to mimic the default workflow, using bdgcmp, specifying --pseudocount 0.0 --method qpois, followed by bdgpeakcall, with --cutoff \(-\ln (0.05)/\ln (10)\). This cutoff parameter represented the usual q-value cutoff of 0.05 converted to \(-\!\log_{10}\) space.

For bisulfite-converted data, we extracted and called peaks only upon methylated reads (filtering through Sambamba using Bismark’s added “XM:Z:” tag). We regarded all hmC-Seal-seq reads as completely hydroxymethylated.

Finally, we used MEME-ChIP (version 4.11.2.1, with Perl version 5.18.1) [66], with DREME [65], as previously described, on TRF-masked genome (same parameters as before, using version 4.9). For this particular processing, we used SAMtools [93] version 1.3.1 and BEDTools [114] version 2.27.1.