Introduction

In mammals, DNA methylation is predominantly, if not exclusively, found in CpG dinucleotides, due to site specificity of the known DNA methyltransferases [1]. Although it was reported in the early 1960s that cytosines can be methylated, it was not until two decades later that DNA methylation was fully recognized as an important player in gene regulation [2-4]. By acting coordinately with histone tail modifications and recruitment of an array of proteins involved in chromatin condensation, DNA methylation participates in gene silencing, independently of changes in DNA sequence [5]. The large majority of CpG dinucleotides in the human genome are methylated, and this results in a depletion of CpG sites due to conversion to thymines by deamination [6, 7]. Unmethylated CpG sites escape depletion and are clustered in relatively small areas called CpG islands. A widely accepted definition of CpG islands was formulated by Gardiner-Garden and Frommer and takes into account local GC content, observed-to-expected frequency of CpGs and length of the region [8]. The appropriate values of these parameters have been disputed in recent publications, and alternative definitions have been proposed in an attempt to better match the definition of CpG islands to biological function [9-11]. Regardless of the definition, roughly one-third of CpG islands overlap with gene promoters, and as many as 70% of human promoters are associated with a CpG island. The vast majority of these promoter-associated CpG islands are unmethylated in normal tissues in both active and inactive genes, and thus do not explain tissue-specific gene expression [12]. Exceptions to this general pattern are imprinted genes, X-inactivated genes in women, and germ-cell-restricted genes, where promoter CpG island methylation is present [13]. Outside of CpG islands, the bulk of methylated cytosines in normal tissues is found in repetitive DNA elements, mostly retrotransposons of the LINE and SINE classes [14].
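The three quantities that enter the CpG island definition (region length, GC content and the observed-to-expected CpG ratio) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the default thresholds (at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are the values commonly quoted for the Gardiner-Garden and Frommer criteria; as noted above, alternative definitions use other cut-offs, so they are exposed as parameters.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were distributed independently: (C * G) / N
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply length, GC-content and obs/exp thresholds (assumed defaults)."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

In practice such statistics are computed in sliding windows along a chromosome rather than on a single pre-cut region; the sketch only shows the per-region arithmetic.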

DNA methylation is an extremely dynamic process during fertilization and embryogenesis. Almost complete loss of methylation occurs very early, and selective re-methylation occurs during implantation [15, 16]. The pattern of methylation established after this stage is remarkably stable, although, as discussed above, methylation is rare in bona fide promoter CpG islands in adult tissues. Remodeling of these patterns is found in human diseases, especially cancer, with global demethylation (mainly at repetitive DNA) and local hypermethylation (frequent in promoter CpG islands) being hallmarks of most neoplasias [17-19]. Since DNA methylation results in gene silencing, it has been recognized as a frequent cause of inactivation of tumor suppressor genes and other genes important for tumor development [20]. There is a vast literature on promoter CpG island methylation in cancer, with evidence supporting its role in disease progression [21]. Also of note is the existence of a subset of tumors with extensive, concomitant methylation of multiple genes, which has been termed CpG island methylator phenotype (CIMP) [22, 23]. Additionally, DNA methylation has proven to be an important therapeutic target. Two drugs with demethylating activity (azacitidine and decitabine) have been approved by the Food and Drug Administration (FDA) for treatment of myelodysplastic syndrome, and are being tested in clinical trials for treatment of other leukemias as well as solid tumors [24-26]. These broad implications support the in-depth study of DNA methylation in cancer and normal tissues.

Array-based methodologies for large-scale analysis

One of the main obstacles to DNA methylation analysis is that methylated cytosines cannot be detected simply by sequencing. During polymerase chain reaction (PCR) amplification, methylated cytosines are not differentiated by the DNA polymerase and, like unmethylated cytosines, they are paired with guanine. Thus, reading of methylated cytosines depends on indirect methods. The most commonly used are (1) restriction enzyme-based approaches, which take advantage of methylation-sensitive enzymes, (2) affinity-based approaches, where antibodies against either 5-methylcytosine or methyl-binding domain proteins are used to collect the methylated fraction of the genome, and (3) bisulfite conversion of non-methylated cytosines to uracil (read as thymine after PCR amplification) through a hydrolytic deamination reaction, which takes advantage of the non-reactivity of methylated cytosines to free hydroxyl groups. Each of these methods has an important application in studying the epigenome and has been applied, individually or in combination, to individual genes and also to large-scale analyses (Table 1). Among these methods, bisulfite conversion is the gold standard, due to its potentially high resolution when combined with sequencing methods. In this way, every single cytosine can be identified as methylated or unmethylated.
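The logic of bisulfite-based read-out can be sketched in a few lines of Python. This is an idealized illustration only, assuming complete conversion, a single strand and error-free reads; the function names and the use of 0-based positions are hypothetical choices for the example. Unmethylated cytosines are shown as T (their read-out after amplification), while methylated cytosines remain C.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Simulate the read-out of a bisulfite-treated strand.

    Cytosines at `methylated_positions` (0-based) are protected and stay C;
    all other cytosines are read as T after conversion and amplification.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, converted_read):
    """Compare an aligned converted read with the reference, cytosine by cytosine.

    Returns {position: True/False}; True means the cytosine was read as C
    (methylated), False means it was read as T (unmethylated).
    """
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), converted_read.upper())):
        if ref == "C":
            if obs == "C":
                calls[i] = True
            elif obs == "T":
                calls[i] = False
    return calls
```

For example, `bisulfite_convert("ACGTCCGA", methylated_positions=[1])` keeps only the cytosine at position 1, and `call_methylation` recovers that state from the converted sequence.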

Table 1 Recent methodologies applied to whole human genome DNA methylation analysis

All the above-mentioned strategies to unveil methylated cytosines have been applied to microarray platforms to achieve moderate- and high-resolution coverage of the human genome. In the first generation of methylation microarrays, methylated genomic fragments were selectively amplified in a ligation-mediated PCR after DNA digestion with one or more methylation-sensitive enzymes and, after labeling with fluorescent dyes, hybridized against a normal control [27, 28]. Soon thereafter, the gold-standard status of bisulfite modification to study DNA methylation prompted the generation of microarray platforms exploiting this chemical to study methylated cytosines. These arrays mostly targeted a few genes by tiling oligonucleotide probes representing the bisulfite-converted methylated and unmethylated versions of the promoter sequence [29, 30]. These methods suffered from low throughput and complicated probe design and were soon abandoned in favor of restriction-enzyme-based methods.

Since then, the microarray platforms have increased in gene density, and genome-wide coverage can be achieved with tiling arrays. Concomitantly, variations of the restriction-enzyme-based methods were developed to maximize the number of studied genomic targets and to increase the sensitivity and specificity of the method. Our group developed a strategy based on the well-established methylated CpG island amplification protocol (MCA). The advantage of the method is the use of two isoschizomer enzymes with differential sensitivity to methylated cytosines (SmaI and XmaI) which, due to their recognition site, preferentially target CpG islands [31]. Done this way, the method yields a positive representation of methylated fragments (Figure 1), which results in higher sensitivity and specificity compared to other methods. Since then, this method has been applied to study the methylome of leukemias, liver cancer and normal peripheral blood lymphocytes [12, 21, 32]. Other enzymes tested by other groups include HpaII/MspI (HELP - HpaII-tiny fragment enrichment by ligation-mediated PCR [33]) and McrBc, which, contrary to methylation-sensitive enzymes, preferentially fragments the DNA between a pair of methylated CpGs at a critical distance.

Figure 1

Schematic diagram of the methylated CpG island amplification microarray (MCAM) method. Enrichment for methylated DNA and reduction of genome complexity is achieved by serial digestion with SmaI (methylation sensitive) and XmaI (methylation insensitive) restriction enzymes, followed by ligation of adaptors and PCR amplification. The resulting amplicons, representative of the methylated fraction of tumor and normal cells, are labeled and co-hybridized on a microarray platform. Image acquisition and data analysis allow identification of methylated and non-methylated genes by comparing intensity values of Cy5 and Cy3 dyes for each pair of tumor and control samples. In this example, the M-A plot of normalized data from the cancer cell line MDA-MB-435 compared to normal peripheral blood is presented; the amplicons were co-hybridized to a custom Agilent microarray containing 44,000 oligonucleotide probes targeting human promoter CpG islands.
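The M and A values of an M-A plot follow the usual two-color microarray convention: M is the log2 ratio of the two dye intensities and A is their mean log2 intensity. The sketch below assumes background-corrected, strictly positive intensities; which dye labels the tumor sample is not specified here, so the argument names `cy5` and `cy3` are only placeholders for the two channels.

```python
import math


def ma_values(cy5, cy3):
    """Return (M, A) for one probe from two positive channel intensities.

    M = log2(cy5 / cy3)               (log ratio between channels)
    A = 0.5 * (log2 cy5 + log2 cy3)   (mean log intensity)
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    log5 = math.log2(cy5)
    log3 = math.log2(cy3)
    return log5 - log3, 0.5 * (log5 + log3)
```

Probes with M well above zero are enriched in the first channel relative to the second; the cut-off used to call a probe methylated depends on the normalization and is not implied by this sketch.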

The success of restriction-enzyme-based methods is largely dependent upon their capacity to simplify the genome prior to PCR amplification (thus allowing a more uniform, unbiased amplification), generating what has been called a reduced representation. However, since only selected sites can be studied at once, these methods are not truly genome-wide and can be biased to genome compartments (for example, CG-rich versus CG-poor areas). Two affinity-based strategies were developed to circumvent this limitation. In one method, termed methylated DNA immunoprecipitation (MeDIP), antibodies against 5-methyl-cytosine were used to pull down the methylated fraction of the genome, which was then co-hybridized against the unprocessed DNA from the same sample [34]. In another strategy, antibodies against the methyl-binding domain proteins MBD2 and MBD3L1 were used to capture methylated DNA fragments. This methylated-CpG island recovery assay (MIRA) was performed similarly to MeDIP, in the sense that the control sample is the unprocessed DNA. A recent comparison of the sensitivity and specificity of HELP, MeDIP and McrBc fragmentation methods showed that each was biased in a different way [35]. Among these, the authors found McrBc fragmentation to have the highest potential for improvement, and modified it to achieve more precise mapping of methylated CpG sites, a method they called comprehensive high-throughput arrays for relative methylation (CHARM).

Next-generation sequencing

Microarray-based methods, despite their high resolution, are generally far from being truly genome-wide analyses. Close to genome-wide coverage can be achieved by the combination of one of the affinity-based methods and high-density tiling arrays, and this has been done to study the methylome of B lymphoid blood cells at 100-bp resolution [36]. Such an approach is quite expensive and time consuming, explaining why few research groups have used it to study whole-genome methylation. The introduction of what has been called next-generation sequencing brought a fresh excitement to genome and epigenome analysis. By making possible the reading of millions of sequences at once, next-generation sequencing shifted the balance among methods for revealing genome-wide DNA methylation in favor of the gold-standard bisulfite-based detection. Currently, there are four main competing next-generation sequencing technologies available: Illumina Genome Analyzer, generally referred to as Solexa sequencing, from Illumina, Inc.; SOLiD™ System, from Applied Biosystems; HeliScope Single Molecule Sequencer, from Helicos BioSciences; and 454 Sequencing, from Roche. Despite variations, all platforms take advantage of parallel processing of thousands to millions of DNA sequences at a time (massively parallel sequencing), and the base detection is either based on classical Sanger sequencing (using fluorescently labeled nucleotides) or the innovative pyrosequencing method. This is a rapidly advancing field and companies are strongly competing to increase genome coverage per run and to reduce the cost of their method.

As for whole-genome tiling microarrays [37], the first organism to have its methylome sequenced at single-base resolution was the plant Arabidopsis thaliana [38, 39]. To do this, two groups fragmented the genomic DNA by sonication prior to ligation of PCR primer adaptors and bisulfite conversion, and performed shotgun sequencing using the Illumina Solexa platform. Compared to the human methylome (and the methylome of all mammals), the methylome of Arabidopsis is quite complex: in addition to methylation in CpG dinucleotides, there are also CHG and CHH methylation (H = A, C or T). From an analytical point of view, the possible combinations of methylated/unmethylated cytosines are less complex in humans than in Arabidopsis, making sequence matching and assembling less laborious. However, the Arabidopsis genome is just a fraction of the size of the human genome (119 Mb in Arabidopsis versus 3.1 Gb in human). Thus, the size of the human genome has been the main obstacle to whole-genome sequencing.

Not long after the Arabidopsis methylome was fully sequenced, the mouse methylome of pluripotent and differentiated cells from various tissues was sequenced with moderate coverage. To circumvent the genome size obstacle (the mouse genome is 2.7 Gb in size), the authors took advantage of the reduced representation generated from DNA digestion with the MspI restriction enzyme, which has a recognition site (CCGG) abundant in CpG islands [40]. In this technique (reduced representation bisulfite sequencing, RRBS), size-selected DNA fragments, targeting the most CpG island-enriched fraction, are subjected to bisulfite treatment and Illumina Solexa sequencing. While analysis of the human methylome by RRBS has not yet been reported, this ingenious technique is very promising for such investigation. Meanwhile, the human methylome has been studied using other reduced representation strategies. A target-specific approach using 'padlock' probes was recently introduced by two different groups [41, 42]. By presenting a unique sequence at each end, designed to match the bisulfite-converted genome, these probes capture targeted regions and create a circular molecule. The internal part of these probes is a universal sequence that allows for simultaneous amplification of all circularized, captured sequences prior to massively parallel sequencing. Coincidentally, in their initial articles, both groups demonstrated the feasibility of their method by sequencing 10,000 targets, but the method can be extended to more or fewer targets according to the research goal. Interestingly, there seems to be an inherent bias in the process, with some circularized DNA being preferentially amplified or sequenced. Thus, some additional optimization of the method will be necessary prior to increasing the number of targets per analysis. It is also important to note that, since target selection is part of the procedure, these methods are not genome-wide. However, they are of extreme practical use when there is a strong interest in specific genome regions or in promoter CpG islands alone. In one of these reports, the authors go one step further and introduce a less biased approach, termed MSCC (for methyl-sensitive cut counting) [41]. In this method, the authors use the methylation-sensitive restriction enzyme HpaII, which, similarly to its methylation-insensitive isoschizomer MspI, cuts the genome at CCGG sites and thus covers 90% or more of the human CpG islands. The ligation of adaptors to the generated fragments, followed by PCR and massively parallel sequencing, results in mapping of unmethylated cytosines in the CCGG context. The authors present an inverse correlation between the abundance of MSCC tags and measured cytosine methylation per region, but recognize that a much larger sequencing effort is necessary to increase accuracy at low methylation densities. In an independent publication, Brunner et al. [43] described an approach similar to MSCC, but introduced MspI-digested DNA as a control in the procedure, to discriminate CpG sites that can be assayed and mapped uniquely in the genome from those that cannot, and thereby reduce the rate of false-positive methylation calls.
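The reduced representation produced by CCGG digestion, which underlies both RRBS and MSCC, can be illustrated with a small in silico digest in Python. This is a sketch under simplifying assumptions: a single linear sequence, complete digestion, cleavage after the first C of the site (C^CGG, as for MspI), and no modeling of methylation sensitivity. The function names and the size window passed to `size_select` are hypothetical; the window actually used in a given protocol should be taken from that protocol.

```python
import re


def mspi_fragments(seq):
    """Cut a linear sequence at every CCGG site, between the first C and CGG."""
    seq = seq.upper()
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length lies within [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]
```

Because CCGG sites are dense in CpG islands, short fragments are enriched for island sequence, which is the rationale for size selection. Note that the first and last fragments of a linear input end at the sequence boundary rather than at a restriction site.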

The first human methylome at single-base resolution was published earlier this year [44]; the authors employed the MethylC-Seq method, previously used to sequence the Arabidopsis methylome. This landmark report is impressive both in its methodology and in its findings. One embryonic stem cell (ESC) line and one fetal lung fibroblast line were sequenced and, to achieve a 14-fold coverage of the genome, more than 1 billion Solexa reads were generated for each. The results support that the methylome is very different between undifferentiated and differentiated cells, and the authors' unexpected findings of significant non-CpG methylation in ESCs (up to 25% of the methylated cytosines were in CHG and CHH contexts, similar to Arabidopsis cytosine methylation) strongly support that the physiological impact of DNA methylation will be better captured in whole-genome, deep, unbiased analyses. However, until sequencing costs are significantly reduced, human methylome analysis at single-base resolution will be restricted to a few samples at a time. Studies in cancer, however, will need more extensive analysis. At the minimum, cancer studies require the sequencing of dozens, if not hundreds, of samples due to their inherent genetic and epigenetic heterogeneity, and the various disease grades and prognostic groups. Additionally, genome-wide mapping of methylated cytosines must be quantitative rather than just qualitative; thus, massively parallel sequencing requires several-fold coverage of each individual CpG dinucleotide, which makes the task prohibitively expensive. As a compromise, strategies based on reduced representation of the genome are currently more practical for whole-methylome analysis.
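The link between sequencing effort and quantitative methylation calls can be made concrete with two small calculations, sketched here in Python. Both are deliberately idealized: mean coverage is computed as if every read mapped uniquely and uniformly, and the methylation level of a cytosine is taken as the fraction of covering reads that show it as unconverted. The function names are hypothetical, and the read length is left as an input because it is not stated above.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Idealized mean fold coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def methylation_level(methylated_reads, total_reads):
    """Fraction of reads covering a cytosine that report it as methylated.

    Returns None when the site has no coverage.
    """
    if total_reads == 0:
        return None
    return methylated_reads / total_reads
```

The second function shows why several-fold coverage per site is needed: with only two or three reads over a CpG, the estimate can take just a handful of values, so intermediate methylation levels cannot be resolved.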

Emerging technologies: single-molecule sequencing

Much of the excitement about advances in DNA sequencing technologies has emerged from the race to achieve genome-wide analysis of the human genome for $1,000 or less. At the same time as improvements to the performance of next-generation sequencing are being carried out to reduce costs, totally new technologies are emerging. One of the most promising new technologies uses nanopores to achieve fast and reliable DNA sequencing. An electric current is generated by passing the DNA molecule through these nanopores and, although very weak, this current can be accurately measured and is dependent on the nucleotide base passing through the pore [45]. Importantly, done this way, DNA sequencing is possible without prior DNA amplification or use of labeled nucleotides. In terms of methylome analysis, this is very exciting: it has been reported that the electric current-based nanopore detection can differentiate methylated from unmethylated cytosines directly, bypassing the need for bisulfite treatment [46]. There is still much improvement to be made before this technology is ready to be commercialized, and one of the main technical difficulties is to pass the DNA molecule through the nanopores at the right speed, enabling correct base detection without gaps.

Conclusions

Genome-wide methods for methylome analysis have evolved at a rapid pace. The methodological advances achieved in the last five years have moved the field from single-gene detection to the possibility of whole-genome studies at the single-base level, or at least at high resolution. A better understanding of the function of DNA methylation in healthy and diseased tissues is likely to arise from these more detailed investigations and their correlation with both genetic and other epigenetic studies. Specifically in cancer, the study of the methylome of various disease stages and response to therapies will improve patient care by providing markers of progression and response to treatment.