The past few decades have seen a revolution in the understanding of mechanisms of 'epigenetic' inheritance - the passing on from one generation of cells to another of information that affects phenotype but does not alter the actual DNA sequence. One of the longest known epigenetic modifications is the methylation of cytosines in DNA, which is typical of organisms with large genomes whose cells undergo many divisions over the organism's lifetime. In recent years the realization that inappropriate methylation of promoters of tumor suppressor genes may contribute to oncogenesis has sparked renewed interest in DNA methylation. Despite many years of work, however, how changes in DNA methylation status occur during normal cell differentiation and whether these changes have a role in regulating gene expression still remain unclear. One of the few unambiguously recognized facts about DNA methylation is that a methylated promoter is highly likely to be silenced, whereas hypomethylation of a promoter can be associated with active or silent genes.

As with all things DNA-centric, the genomics revolution holds great promise for understanding where cytosine methylation occurs throughout the genome, potentially suggesting biological function through 'guilt by association'. In a study published recently in Nature, Meissner et al. [1] ally the power of massively parallel DNA sequencing to a clever strategy for picking out CG-rich sequences from the genome to produce the most comprehensive insight so far into cytosine methylation at nucleotide resolution in a large mammalian genome, that of the mouse. Such techniques will undoubtedly increase the potential for understanding the biological role of this venerable epigenetic regulator and its dysregulation in disease.

Cytosine methylation goes 'omic

Current assays for cytosine methylation are generally divided into those involving protein-mediated detection and those using chemical detection. The former allow lower-resolution discovery studies genome-wide, a useful first pass in many applications, whereas the latter allow nucleotide-resolution studies but have been refractory to scaling. Only chemical detection can provide whole-genome, single-base resolution from individual DNA molecules. This level of resolution is based on the deamination of cytosine, but not 5-methylcytosine, to uracil by bisulphite and related compounds. In vitro mutagenesis of DNA with bisulphite followed by resequencing thus identifies unmethylated cytosines, as they will have been replaced by thymines after replication of the converted DNA. Failure to find thymine replacement is taken as a signal that the cytosine in question was originally the site of a 5-methylcytosine. Careful analysis of the resulting sequence is necessary to rule out technical artifacts, but as almost all mammalian cytosine methylation occurs within the context of CG dinucleotides, non-CG cytosines serve as useful controls for conversion of unmethylated Cs.
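
The logic of bisulphite detection described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure of any particular study: the function names, the 12-base reference and the choice of methylated position are all hypothetical, and the read is assumed to be already aligned to the same strand of the native reference.

```python
def bisulphite_convert(seq, methylated_positions):
    """Sequence read out after bisulphite treatment and amplification:
    unmethylated C is read as T, 5-methylcytosine is still read as C."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_cytosines(reference, read):
    """Classify every reference C by its context and by what the read shows."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        context = "CG" if reference[i + 1 : i + 2] == "G" else "non-CG"
        if read_base == "C":
            status = "methylated"
        elif read_base == "T":
            status = "unmethylated"
        else:
            status = "ambiguous"  # sequencing error or a genuine variant
        calls[i] = (context, status)
    return calls


def non_cg_conversion_rate(calls):
    """Fraction of non-CG cytosines read as T, used as a conversion control."""
    non_cg = [status for context, status in calls.values() if context == "non-CG"]
    if not non_cg:
        return None
    return sum(status == "unmethylated" for status in non_cg) / len(non_cg)


# Hypothetical example: Cs at positions 1 and 8 are in CG context,
# Cs at positions 4 and 5 are not; only position 1 is methylated.
reference = "ACGTCCATCGGA"
read = bisulphite_convert(reference, {1})
calls = call_cytosines(reference, read)
print(read)
print(calls)
print(non_cg_conversion_rate(calls))
```

A non-CG conversion rate well below 1 in real data would point to incomplete conversion rather than to genuine methylation, which is the control the paragraph above refers to.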

Two types of bisulphite assays have been used, exploring what we call intermolecular and intramolecular dimensions. By intermolecular, we mean the overall frequency of methylation at a given CG across a population of molecules, whereas by intramolecular we mean the coordinate methylation patterns of different CGs located in cis on the same molecule. The short read lengths of the study by Meissner et al. [1] limit the ability to detect intramolecular information, although given evidence of correlation between CGs located up to 1 kb in cis in the human genome [2] this may not prove a great loss. On the other hand, intermolecular information is determined by read number per cytosine, and these data are valuable when allelic methylation is being studied, such as in genomic imprinting (for a review see [3]), and can indicate heterogeneity of methylation occurring in different cells in the population being studied. In any case, we note that the ideal assay to explore both dimensions of nucleotide-resolution bisulphite sequence would be one with deep coverage consisting of long cis read information.
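
The difference between the two dimensions can be made concrete with a small Python sketch. The data are invented for illustration: each string stands for one sequenced molecule spanning three CG sites in cis, with "M" for methylated and "U" for unmethylated, and the function names are hypothetical.

```python
from collections import Counter


def per_site_frequency(reads):
    """Intermolecular view: methylated fraction at each CG across molecules."""
    n_sites = len(reads[0])
    return [sum(r[i] == "M" for r in reads) / len(reads) for i in range(n_sites)]


def per_molecule_patterns(reads):
    """Intramolecular view: how often each cis pattern of CGs is observed."""
    return Counter(reads)


# Two hypothetical populations with identical per-site frequencies.
allelic = ["MMM", "MMM", "UUU", "UUU"]  # e.g. one allele methylated, one not
mosaic = ["MUM", "UMU", "MUU", "UMM"]   # methylation scattered across molecules

print(per_site_frequency(allelic), per_molecule_patterns(allelic))
print(per_site_frequency(mosaic), per_molecule_patterns(mosaic))
```

Both populations show 50% methylation at every site, so only the per-molecule patterns distinguish an allelic arrangement from a mosaic one; short reads covering a single CG would recover the first list but not the second.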

Such a platform does not yet exist, but significant insights have been obtained into the relatively small Arabidopsis thaliana genome using massively parallel sequencing of bisulphite-converted DNA on a short-read platform (from Illumina) [4]. The substantially greater size of the human genome makes a comparable approach daunting, especially in analyzing the data. In large part, this is because after bisulphite conversion, the four-base 'native' genome is effectively collapsed into a three-base genome, with the original Cs remaining on only some CG dinucleotides. Worse, the strands of the genome are no longer complementary, so the effective size of the converted genome doubles, making sequence mapping to 6 Gb converted mammalian genomes a serious challenge. The study of Meissner et al. [1] partly gets round this problem by using a clever strategy of 'reduced-representation bisulphite sequencing'.
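
The collapse to three bases and the loss of strand complementarity can be checked directly. The following Python sketch uses a made-up 10-base sequence and assumes, for simplicity, that no cytosine is methylated; the function names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def fully_convert(seq):
    """Bisulphite conversion of a completely unmethylated strand (5' to 3')."""
    return seq.replace("C", "T")


top = "ACGTTCGGAC"                 # hypothetical native top strand
bottom = reverse_complement(top)   # native bottom strand

top_converted = fully_convert(top)
bottom_converted = fully_convert(bottom)

print(sorted(set(top_converted)))  # only three letters remain
# The converted strands no longer pair with each other, so each needs
# its own reference sequence for mapping.
print(reverse_complement(top_converted) == bottom_converted)
# Two independent converted references, each the length of the native one.
print(len(top_converted) + len(bottom_converted), "vs", len(top))
```

Because each converted strand must be searched separately, the reference space is twice the native genome length, which is how a genome of roughly 3 Gb becomes the 6 Gb mapping problem mentioned above.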

Reduced-representation bisulphite sequencing

Meissner et al. extend previous work from the same group [5] which used Sanger sequencing on a restriction-enzyme-based sampling of the genome. In the new work, reduced representation of the genome was achieved by the isolation of small restriction fragments generated by the methylation-insensitive type II endonuclease MspI, resulting in enrichment of CG-rich regions, and thereby directing the analysis towards CpG islands and less to CG-depleted regions of the genome. This enabled comprehensive sequencing of this fraction and easier mapping of the bisulphite-converted sequences onto this more limited search space. Moreover, because of the imposed directionality of the sequencing adaptors, there was only one strand to analyze, further simplifying the mapping problem at the expense of identifying instances of hemimethylation - a relatively small price to pay. Only around 1% of the genome was sampled in this study, however, which leaves the remaining 99% of potential methylation space unexplored.
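
The principle of the reduced representation can be illustrated with an in silico digest. MspI recognizes CCGG and cuts after the first C whether or not the internal CG is methylated, so every fragment flanked by two sites begins and ends within a CG-containing site. The Python sketch below is a simplification: the toy sequence, the function names and the size window are assumptions chosen for illustration, not the parameters used by Meissner et al.

```python
import re

CUT_OFFSET = 1  # MspI cuts C^CGG, i.e. after the first base of the site


def mspi_fragments(seq):
    """Split a sequence at every MspI site (lookahead catches overlapping sites)."""
    cuts = [m.start() + CUT_OFFSET for m in re.finditer("(?=CCGG)", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep fragments within a length window, mimicking gel size selection."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Hypothetical sequence: a CG-dense stretch and a CG-free stretch,
# each bounded by MspI sites.
seq = "AAA" + "CCGG" + "TTTTCGCGA" + "CCGG" + "A" * 20 + "CCGG" + "TT"

fragments = mspi_fragments(seq)
internal = fragments[1:-1]  # only these carry an MspI end on both sides
selected = size_select(internal, min_len=10, max_len=20)  # illustrative window
print(fragments)
print(selected)
```

Selecting short fragments favours regions where CCGG sites are closely spaced, which is why the representation is biased towards CpG islands and away from CG-depleted sequence.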

Analysis of the sequencing data was designed to ensure that only reads with unique alignments to the reference mouse genome were picked up. This involved mapping the sequence data to a simulated bisulphite-converted unmethylated genome, converting any remaining Cs in the sequence reads to Ts for initial alignment, and eliminating reads with more than one map position. Once mapped, the original Cs were recalled to the analysis, counting the frequency at which Cs and Ts were located at the same position across reads, and thus inferring the methylation status. To avoid including sequencing errors, quality scores for sequencing base calls were utilized as a data filter, ignoring low-quality bases at potential methylation sites, an important advance on previous studies [6, 7].
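
The steps just described (convert, align uniquely, recall the original Cs, count, filter by quality) can be sketched as follows. This is a deliberately minimal Python illustration and not the authors' pipeline: it uses exact string matching in place of a real aligner, handles one strand only, and the function names, toy reference, reads and quality cut-off of 20 are all assumptions.

```python
from collections import defaultdict


def c_to_t(seq):
    return seq.replace("C", "T")


def find_all(haystack, needle):
    """Start positions of every exact match, including overlapping ones."""
    hits, start = [], haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def call_methylation(reference, reads, min_quality=20):
    """reads: list of (sequence, list of per-base quality scores)."""
    converted_ref = c_to_t(reference)  # simulated unmethylated, converted genome
    counts = defaultdict(lambda: {"C": 0, "T": 0})
    for seq, quals in reads:
        hits = find_all(converted_ref, c_to_t(seq))
        if len(hits) != 1:  # discard unmapped and multiply mapped reads
            continue
        offset = hits[0]
        for i, (base, qual) in enumerate(zip(seq, quals)):
            pos = offset + i
            # Recall the original read base at each reference C.
            if reference[pos] == "C" and base in "CT" and qual >= min_quality:
                counts[pos][base] += 1
    return {pos: c["C"] / (c["C"] + c["T"]) for pos, c in sorted(counts.items())}


reference = "GATCGATTACGGA"  # hypothetical; Cs at positions 3 and 9
reads = [
    ("ATCGATTA", [30] * 8),                          # C retained at position 3
    ("ATCGATTA", [30] * 8),
    ("ATTGATTA", [30] * 8),                          # T at position 3
    ("ATTGATTA", [30, 30, 5, 30, 30, 30, 30, 30]),   # low-quality base, ignored
    ("TTATGGA", [30] * 7),                           # T at position 9
    ("TTATGGA", [30] * 7),
    ("ATT", [30] * 3),                               # maps twice, discarded
]
levels = call_methylation(reference, reads)
print(levels)
```

In this toy data set position 3 is called as two-thirds methylated and position 9 as unmethylated. A practical implementation would need an aligner that tolerates mismatches and would treat the two converted strands separately.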

Bisulphite genomic representations have traditionally demonstrated bias in genomic coverage, but recent observations suggest that by avoiding a step of cloning in bacteria, current massively parallel sequencing protocols have avoided such bias, although this should be determined empirically for each platform. Any detected distortion of representation and/or coverage at a CG site when comparing native and bisulphite-converted spaces could be used to adjust the inferred DNA methylation levels from those regions of the genome. Another useful in silico experiment would be to assess the impact of random data assembly within the representation space. An assessment of the contribution of random data to the inferred methylation levels would allow this artifact to be accounted for analytically in a region-specific manner. Finally, a future analytical goal should be to modify the match matrices used for alignment scoring. Mapping reads to a native genome using a penalty matrix tolerant to the bisulphite-induced SNPs in the reads should allow more accurate positioning of reads and identification of sample-specific SNPs to extract optimum-quality methylation data. Certainly, moving to larger endeavors (such as the whole genome) will require such fundamental approaches to be optimized. The study by Meissner et al. [1] sets the stage and serves as a compass for future efforts. Encouragingly, their platform's ability to measure genomic methylation was demonstrated through the convincing detection of biologically significant variation occurring with cell differentiation.

Insights into the biological significance of cytosine methylation

Meissner et al. had two major biological aims: to look first at static relationships of cytosine methylation with DNA sequence features and histone modifications in embryonic stem (ES) cells; and then to test the dynamic properties of cytosine methylation during cell differentiation. In the first case, their analysis largely confirmed previous findings, with transposable elements consistently methylated and a strong correlation between CG density and hypomethylation. However, not all transposable elements behaved the same way - autonomously transposing long interspersed nuclear elements (LINEs) and long terminal repeat (LTR) elements were methylated whatever their CG content, whereas the more CG-rich subgroup of non-autonomous short interspersed nuclear elements (SINEs) were generally unmethylated and looked comparable to nonrepetitive DNA of similar CG content.

Correlation of cytosine methylation with histone-modification patterns was confirmed, with results suggesting that histone H3 lysine 4 and lysine 9 methylation are better predictors of cytosine methylation than CG content, although trimethylation of lysine 27 on H3 did not have the same predictive power. The last finding is interesting, as biochemical studies have found the enzyme responsible for methylating lysine 27 (the Polycomb group protein EZH2) to be in a complex with DNA methyltransferases [8], suggesting that both these repressive marks - trimethylated H3K27 and methylated cytosines - might have been expected to coincide in the genome. Interestingly, CG-depleted non-promoter regions enriched in H3 lysine 4 dimethylation displayed a greater tendency for the DNA to be unmethylated, suggesting that the cell generally marks regulatory elements with DNA hypomethylation, an idea first proposed by Adrian Bird (for review see [9]).

A question that arises when considering cytosine methylation mapping studies is how and why methylation is directed to some sequences and not others. Much has been learned from the study of methylation mutants in the mouse and Arabidopsis [10-12]. Cytosine methylation is catalyzed by a family of DNA methyltransferases that do not have innate sequence specificity, with one recognized exception, the Dnmt3a-Dnmt3L complex, which appears to process CGs spaced apart by 8-10 bp (a single helical turn) more efficiently [13]. Demethylation may occur either passively - by failure to remethylate after replication - or actively by demethylase activity: potential demethylases might include glycosylase [14] or base excision activity [15] replacing the methylcytosine with cytosine, or even mutagenic deamination mediated by DNA methyltransferases [16], although this field is not free from controversy [17]. The correlation between histone modifications and DNA methylation supports ideas that methylation may be targeted by pre-existing signals such as histone modification [18], or that both histone modification and cytosine methylation could be directed by sequence-specific molecules such as transcription factors or even small RNAs (for a review see [19]). However, one observation suggests that we are still missing part of the mechanism for establishing and maintaining the methylation pattern. In both the mouse and Arabidopsis, DNA left unmethylated as a result of mutation of DNA methyltransferase will become remethylated on reintroduction of the enzyme, but such remethylation occurs much more slowly in the absence of the SWI/SNF maintenance factor DDM1/LSH1 [11, 20-22].

The advent of see-it-all 'omics has reopened investigation into the role of cytosine methylation in cellular differentiation - a controversial topic. There are clear differences in cytosine methylation between distinct cell types [23], but many promoters are constitutively hypomethylated even when the gene is transcriptionally silent [24]. Meissner et al. [1] addressed this issue by differentiating ES cells to neural lineages in culture and retesting the same loci for cytosine methylation. They found that CG-dense regions remained hypomethylated for the most part, whereas CG-depleted regions were more likely to change DNA methylation status with differentiation. The so-called 'bivalent chromatin domains' first described in ES cells [25] tended to remain constitutively cytosine hypomethylated. The authors also compared in-vivo-derived, minimally cultured neural precursor cells (NPCs) with NPCs derived in vitro, revealing substantially less methylation in the primary cells. However, multiple passaging of the in-vivo-derived NPCs resulted in methylation patterns comparable with the ES-cell-derived NPCs, implicating cell culture in the generation of these 'epialleles'. Intriguingly, a specific subset of CG-dense promoters tended to be more consistently susceptible to this acquisition of methylation, an observation paralleling changes observed in cancer cells [26]. The effect of tissue culture on the variability of cytosine methylation has been recognized for some time [27]; the current study points to a similar phenomenon and was able to test far more CGs than previously possible. The obvious question is whether chromatin modifications, which are shown to be correlated with cytosine methylation, are also influenced by cell culture or whether this is an observation specifically affecting cytosine methylation.

The study by Meissner et al. represents a breakthrough in the ability to study cytosine methylation in mammalian cells, an advance that is directly due to the introduction of massively parallel sequencing technologies, coupled with a clever experimental design that allowed comprehensive sequencing of CpG islands and mapping of degenerate sequences to the genome. With longer reads, intramolecular cytosine methylation could be explored more comprehensively, and increased sequencing throughput should some day make whole-genome methylation studies feasible for genomes as large as those of mouse and human. Meissner et al. note the potential for increasing genomic coverage by adding further restriction enzyme representations or by coupling massively parallel bisulphite sequencing with the sequence-capture approach [28]. With these and other advances, the potential for revealing how cytosine methylation exerts its effects in normal cells will be immense, forming the basis for understanding the changes that we see in human disease.