The epigenome

Eukaryotic genomes are packaged into chromatin, consisting mostly of nucleosomes composed of approximately 147 bp of DNA wrapped around basic histone octamers [1]. Nucleosomes package DNA approximately 10,000-fold to form metaphase chromosomes, and so are essential for faithful segregation of sister genomes at mitosis. As nucleosomes occupy approximately 70% of the chromatin landscape during interphase, they must be mobilized during processes that require access to DNA, such as replication, transcription, repair, and binding by regulatory proteins. The occupancy, positioning, and composition of nucleosomes, as well as chemical modifications of histones and DNA, form a complex landscape superimposed on the genome: the epigenome [2]. Whereas the genome sequences of many organisms are now essentially complete [3], inquiry into their epigenomes is grossly incomplete due to the complexity and dynamics of the individual epigenomic constituents.

As in prokaryotes, sequence-specific DNA-binding proteins stand at the top of the eukaryotic transcriptional regulatory hierarchy, and differential expression of transcription factors (TFs) results in cell type-specific differences. Most other key chromatin components are found in all cells of an organism, and dynamically change their distribution as a result of TF binding. The incorporation of histone variants [4] and the covalent modification of histone tails [5] help to mediate the inheritance of expression states of a gene by regulating the accessibility of DNA. Additionally, hundreds of chromatin-associated proteins, including ATP-dependent chromatin remodelers [6] and histone modifying enzymes [5], interact with chromatin to modulate its structure. Notably, mutations in nucleosome remodelers and in the histone constituents of chromatin have been implicated in human developmental disorders and cancer [6, 7]. Thus, high-resolution genomic analysis of chromatin structure and the proteins that influence it is a major focus of biological technology development to study both basic cellular processes and the pathogenesis of human disease.

Many methods have been developed to probe various aspects of the epigenome (Table 1). Until recently, however, the resolution of genome-wide methods for epigenome characterization, such as ChIP-chip [8] and MeDIP [9], was on the order of hundreds of base pairs, owing to the use of hybridization-based read-out technologies and chromatin preparation protocols based on random fragmentation. With the advent of massively parallel short-read DNA sequencing and its potential for single base-pair resolution, there has been a renaissance of interest in traditional methods for chromatin characterization, including the use of bisulfite sequencing for mapping DNA methylation [10] and the use of non-specific nucleases, including micrococcal nuclease (MNase) [11], deoxyribonuclease I (DNase I) [12] and exonuclease [13] (Table 1). Here, we focus on recently developed strategies for characterizing nucleosomes, TFs and chromatin-associated proteins at base-pair resolution, and we discuss prospects for full epigenome characterization.

Table 1 Strategies for epigenome mapping

Technologies for base-pair resolution epigenomic mapping

Several recent studies have introduced methods for analyzing various protein components of the epigenome at base-pair resolution while simultaneously addressing specific limitations of current epigenomic protocols. Below, we discuss the techniques upon which each of these high-resolution methods is based and how these new methods address the limitations of current epigenomic technologies.

MNase-seq

Digestion of chromatin with MNase has long been used to study chromatin structure in a low-throughput manner [14] and has more recently been combined with tiled microarray analysis (MNase-chip) or massively parallel DNA sequencing (MNase-seq) to study nucleosome positioning, occupancy, composition, and modification genome-wide [15]. MNase is a single-strand-specific secreted glycoprotein that is thought to cleave one strand of DNA as the helix breathes, then cleave the other strand to generate a double-strand break. MNase evidently 'nibbles' on the exposed DNA ends until it reaches an obstruction, such as a nucleosome. Though MNase has primarily been used to study nucleosomes, its mode of action suggests that it will be blocked by any obstruction along the DNA, such as a DNA-binding protein, allowing for the determination of genomic regions protected by non-histone proteins. By combining MNase digestion with paired-end sequencing of protected DNA to determine precise fragment lengths, specific sizes of MNase-protected particles can be recovered with or without affinity purification and mapped. Indeed, we have used paired-end MNase-seq to map the distributions of both nucleosomes and paused RNA polymerase II in Drosophila cells [16]. Kent and colleagues [17] also used paired-end MNase-seq of native yeast chromatin to map the positions of both nucleosomes and sequence-specific TFs. Floer and colleagues [18] employed MNase digestion in conjunction with paired-end crosslinking chromatin immunoprecipitation (X-ChIP)-seq to identify binding sites for the RSC (remodels the structure of chromatin) complex, identifying partially unwrapped nucleosomes in the process. Importantly, these studies showed that DNA fragments as small as approximately 50 bp could be recovered after MNase digestion, suggesting applications for MNase-seq in epigenome mapping beyond nucleosome analysis.
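
As an illustration of how paired-end fragment lengths can separate classes of MNase-protected particles, the following minimal Python sketch bins fragments by length. The cutoff values and function names are assumptions chosen for illustration only; they are not taken from the studies cited above, and appropriate cutoffs would need to be set for each dataset.

```python
from collections import Counter

# Hypothetical length cutoffs in bp, chosen only for illustration.
SUBNUCLEOSOMAL_MAX = 80
NUCLEOSOMAL_MIN = 135
NUCLEOSOMAL_MAX = 165


def fragment_length(start, end):
    """Length of a fragment with 0-based, half-open coordinates [start, end)."""
    return end - start


def size_class(length):
    """Assign a fragment length to a coarse size class."""
    if length <= SUBNUCLEOSOMAL_MAX:
        return "subnucleosomal"
    if NUCLEOSOMAL_MIN <= length <= NUCLEOSOMAL_MAX:
        return "nucleosomal"
    return "other"


def count_size_classes(fragments):
    """Count fragments per size class; fragments are (start, end) pairs."""
    return Counter(size_class(fragment_length(s, e)) for s, e in fragments)


# Toy fragments with lengths 50, 147, 30 and 210 bp.
fragments = [(100, 150), (1000, 1147), (2000, 2030), (3000, 3210)]
counts = count_size_classes(fragments)
print(counts)
```

Because both ends of each fragment are sequenced, the length is obtained directly from the mapped coordinates rather than estimated.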

A basic limitation of paired-end sequencing as a readout for MNase digestion, and epigenomic methods in general, is that standard short-read sequencing library preparation protocols are optimized for DNA fragments of nucleosomal size (approximately 150 bp) or larger and involve size-selection of DNA [19], while regions of DNA protected by TFs are often up to an order of magnitude smaller. To circumvent this limitation, we introduced a modified library construction protocol to facilitate paired-end sequencing of DNA fragments as small as approximately 25 bp. By combining MNase digestion time points with mapping of a broad range of fragment sizes (approximately 25 to >200 bp), the distributions and dynamics of nucleosomes and non-histone proteins were analyzed [20]. Notably, subnucleosomal and nucleosomal particles can occupy the same genomic position within a population of cells, suggesting a highly dynamic interplay between nucleosomes and other chromatin-associated factors. As paired-end sequencing provides both fragment position and length, these two parameters can be displayed as a two-dimensional 'dot-plot'. The X-axis position of each dot represents the distance of the fragment midpoint to the center of a genomic feature such as a TF binding site (TFBS), and the Y-axis position represents its fragment length (Figure 1). The resulting graph is referred to as a 'V-plot', because the minimal region of DNA protected is seen as the vertex of a 'V' corresponding to the fragment midpoint on the X-axis and its length on the Y-axis. Based on examination of V-plots for >100 TFs, the binding sites for TFs known to participate in nucleosome phasing, such as Abf1 and Reb1 [21, 22], displayed well-positioned flanking nucleosomes and were flanked by subnucleosomal particles. V-plotting was also applied to ChIP data to show that the tripartite structure of the approximately 125-bp functional centromere sequence precisely corresponds to occupancy by a Cse4-containing centromeric nucleosome that is immediately flanked by particles corresponding to the Cbf1 TF and the kinetochore-specific Cbf3 complex [23].
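
The V-plot coordinates described above can be sketched in a few lines of Python. This is a minimal illustration rather than the published implementation: the function name, the half-open coordinate convention, and the toy values are assumptions.

```python
def vplot_points(fragments, feature_center):
    """Convert paired-end fragments into V-plot (x, y) coordinates.

    fragments: iterable of (start, end) pairs, 0-based half-open.
    feature_center: genomic coordinate of the feature (e.g. a TFBS center).
    x = fragment midpoint minus feature center; y = fragment length.
    """
    points = []
    for start, end in fragments:
        length = end - start
        midpoint = (start + end) / 2
        points.append((midpoint - feature_center, length))
    return points


# Toy example: a 40-bp fragment centered on the feature and a
# 150-bp fragment whose midpoint lies 25 bp to its left.
points = vplot_points([(980, 1020), (900, 1050)], feature_center=1000)
print(points)
```

In practice the points from many instances of a feature would be aggregated and drawn as a scatter plot or density map.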

Figure 1

V-plots reveal chromatin features of transcription factor binding sites. (a) V-plot of MNase-seq data from Kent et al. [17] centered on binding sites for the Cbf1 transcription factor. Figure adapted from Henikoff et al. [20]. (b) Interpretive diagram of a V-plot. A dot representing the midpoint of each paired-end fragment is placed on the graph. Its Y-axis value represents its length and its X-axis value represents the distance of its midpoint from the center of a given genomic feature (in this case, a transcription factor binding site (TFBS)). Locations of dots corresponding to each fragment are indicated by red arrows. The minimal region protected by the transcription factor (TF) is indicated by the intersection of the left and right diagonals on the Y-axis and also as the width of the gap on the X-axis resulting from extrapolation of the diagonals to Y = 0. The left diagonal results from fragments cleaved precisely to the right of the TF-protected region, and the converse is true of the right diagonal. The triangular densities flanking the TF-protected region are generated by protected regions adjacent to the TFBS that are cleaved between the TFBS and the protein responsible for the density.
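
The geometry of the diagonals can be checked with a short Python sketch. It assumes an idealized case that real data only approximate: a protected region of fixed width centered at position 0, and fragments cleaved exactly at one edge of that region. All function names are hypothetical.

```python
def left_diagonal_point(width, overhang):
    """Fragment cut exactly at the right edge of the protected region,
    extending `overhang` bp beyond its left edge. Returns (x, y)."""
    left = -width / 2 - overhang
    right = width / 2
    return ((left + right) / 2, right - left)


def right_diagonal_point(width, overhang):
    """Mirror image: cut exactly at the left edge, extending to the right."""
    left = -width / 2
    right = width / 2 + overhang
    return ((left + right) / 2, right - left)


def x_intercept(p1, p2):
    """Extrapolate the line through two points down to Y = 0."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return x1 - y1 / slope


width = 20  # hypothetical protected width in bp
vertex = left_diagonal_point(width, 0)  # both diagonals meet here
left_x0 = x_intercept(vertex, left_diagonal_point(width, 30))
right_x0 = x_intercept(vertex, right_diagonal_point(width, 30))
gap = abs(left_x0 - right_x0)
print(vertex, gap)
```

Under these assumptions, a fragment with overhang d has midpoint at -d/2 or +d/2 and length width + d, so the diagonals have slopes of -2 and +2. They intersect on the Y-axis at the protected width and, when extrapolated to Y = 0, are separated on the X-axis by that same width.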

MNase-seq with paired-end sequencing offers several advantages for epigenomic profiling. By mapping a wide range of fragment sizes, the genomic distributions of both nucleosomes and numerous non-histone proteins can be assessed using a single sequenced sample, making the method especially cost-effective. The method does not require epitope tags or antibodies and is thus easily adapted to a range of cell types, particularly those for which affinity reagents are unavailable or impractical. No more than 25 cycles of sequencing per fragment end are needed to accurately map fragments onto genomes as large as that of Drosophila [24] and mouse (unpublished data), and the use of fewer cycles reduces both the cost and machine time for sequencing. Although MNase has a well-known AT-cleavage preference, in practice it causes only a minor mapping bias [25], which can be mitigated computationally if necessary [26]. A basic drawback of MNase-seq for mapping non-nucleosomal particles is that the identity of such particles cannot be formally established by this method alone, as multiple proteins may bind identical sequences. However, the recovery of non-nucleosomal particles from soluble native chromatin [20] suggests that this material is suitable for high-resolution ChIP-seq; indeed, it has been successfully applied to ChIP-seq mapping of paused RNA polymerase II in Drosophila [24]. The use of native chromatin for ChIP-seq (N-ChIP) may also offer solutions to issues associated with standard crosslinking ChIP protocols, such as epitope masking and protein-protein crosslinking due to formaldehyde treatment and the intrinsically low resolution of ChIP protocols employing sonication [27].

DNase-seq

DNase I is a non-specific endonuclease that has long been used for mapping sites of 'open' chromatin based on their hypersensitivity to cleavage [12]. Mapping of DNase I hypersensitivity with tiled microarrays (DNase-chip) or high-throughput sequencing (DNase-seq) has also been used to study the epigenome [28]. DNase I preferentially cleaves nucleosome-depleted genomic sites including regulatory elements such as promoters, enhancers, and insulators as well as TFBSs. DNase-seq identifies sites of DNase I digestion at base-pair resolution and offers an inverse approach to MNase-seq, as it infers the presence of DNA-occluding particles between hypersensitive sites whereas MNase maps the regions protected by such particles.

Hesselberth and colleagues [29] employed DNase-seq of yeast chromatin to map chromatin structure at computationally predicted binding sites for several TFs. Analysis of raw DNase-seq data revealed small regions of DNase protection within overall hypersensitive sites, likely indicative of TF binding. However, given that multiple proteins can bind to identical sequences, it is necessary to integrate DNase-seq data with ChIP-seq data for definitive identification of the protein responsible for a particular DNase footprint. To this end, Boyle and colleagues [30] recently combined DNase-seq with TF ChIP-seq to precisely determine the DNA bound by several TFs in human cells. As in the study by Hesselberth and colleagues [29], analysis of raw DNase-seq data revealed footprints of DNase resistance within larger hypersensitive sites. DNase-seq was also central to the recent characterization of the human epigenome by the ENCODE consortium [31].
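
The idea of a footprint, a stretch of low DNase I cleavage flanked by high cleavage, can be illustrated with a simple Python sketch. The scoring function below (mean flanking cuts minus mean central cuts) is an assumption made for illustration; it is not the statistic used in the studies cited above, and the function names and toy data are hypothetical.

```python
def cut_counts(cut_positions, region_start, region_end):
    """Per-base DNase I cut counts over [region_start, region_end)."""
    counts = [0] * (region_end - region_start)
    for pos in cut_positions:
        if region_start <= pos < region_end:
            counts[pos - region_start] += 1
    return counts


def footprint_score(counts, start, end, flank):
    """Mean cuts in the two flanks minus mean cuts in [start, end).

    A larger score indicates stronger protection relative to the flanks.
    """
    center = counts[start:end]
    flanks = counts[max(0, start - flank):start] + counts[end:end + flank]
    if not center or not flanks:
        raise ValueError("window and flanks must lie within the region")
    return sum(flanks) / len(flanks) - sum(center) / len(center)


# Toy data: three cuts at every base of two 10-bp flanks and a single
# cut inside the 10-bp putative footprint between them.
cuts = list(range(0, 10)) * 3 + list(range(20, 30)) * 3 + [15]
counts = cut_counts(cuts, 0, 30)
score = footprint_score(counts, start=10, end=20, flank=10)
print(score)
```

A score of this kind locates a protected region but, as noted above, does not identify the protein responsible for it.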

DNase-seq offers advantages to epigenomic analysis that are similar to MNase-seq in some respects. As it does not rely on antibodies or epitope tags, DNase-seq can query the genomic occupancy of numerous proteins in a single experiment and can be adapted to a range of cell types. However, given that multiple proteins can bind to identical sequences, integration of DNase-seq with ChIP-seq data is necessary to formally identify the protein responsible for a particular region of DNase protection. Mapping of nucleosome positioning with DNase-seq is also somewhat complicated, as DNase I cleaves nucleosomal DNA with 10 bp periodicity [32].

ChIP-exo

ChIP localizes proteins to specific sites on the genome and has become the most widely used epigenomic mapping technique in many fields of biological investigation. ChIP in combination with tiled microarray analysis (ChIP-chip) or high-throughput sequencing (ChIP-seq) has been extensively used to study the genomic distributions of hundreds of proteins [33]. While many important insights have been gained through ChIP-chip and ChIP-seq, there are limitations. Standard ChIP protocols employ sonication to fragment chromatin, which produces a heterogeneous mixture of fragments [34]. This issue is further compounded by size selection of 200 to 400 bp fragments during library preparation, a standard procedure in ChIP-seq protocols involving sonication [19]. Lastly, most ChIP-seq libraries are sequenced in single-end mode, wherein only one end of each DNA fragment is sequenced, and the resulting short sequence reads are computationally extended to approximate the size of each sequenced fragment. Taken together, these issues intrinsically limit the resolution of popular genome-wide ChIP methods.
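
The computational extension of single-end reads can be sketched as follows. This is a minimal illustration under stated assumptions: the 200-bp fragment size is a hypothetical value, each read is represented only by the genomic coordinate of its 5' end and its strand, and the function name is not drawn from any particular software package.

```python
def extend_read(five_prime, strand, fragment_size, chrom_length):
    """Extend a single-end read to an assumed fragment size.

    five_prime: 0-based coordinate of the read's 5'-most base.
    strand: '+' or '-'.
    Returns a 0-based half-open interval clipped to the chromosome.
    """
    if strand == "+":
        start, end = five_prime, five_prime + fragment_size
    elif strand == "-":
        start, end = five_prime - fragment_size + 1, five_prime + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), min(chrom_length, end)


# The true fragment length is unknown, so every read is extended by
# the same assumed size (here a hypothetical 200 bp).
plus = extend_read(1000, "+", 200, 10_000)
minus = extend_read(1000, "-", 200, 10_000)
clipped = extend_read(50, "-", 200, 10_000)
print(plus, minus, clipped)
```

Because all reads receive the same assumed length regardless of the actual fragment they came from, the inferred fragment boundaries are only approximate, which is one source of the limited resolution discussed above.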

To improve the resolution of ChIP-seq, Rhee and Pugh [35] introduced a technique called ChIP-exo. ChIP-exo involves performing a standard X-ChIP followed by λ exonuclease treatment. λ exonuclease degrades DNA in a 5' to 3' direction, and a protein crosslinked to DNA blocks digestion a specific number of bases 5' to the bound protein on each DNA strand. This in effect creates a 5' barrier at a fixed distance from the protein, past which the exonuclease cannot digest, leaving sequences 3' of the barrier intact. Following a specialized sequencing library preparation and single-end high-throughput sequencing, the 5' ends of the resulting sequence reads are mapped back to the genome, where they precisely demarcate the 5' barriers created by the protein-DNA crosslinks. Protein-bound locations thus appear as peak pairs, with one peak on either side of the bound protein. By precisely mapping the boundaries of exonuclease cleavage, ChIP-exo circumvents the limited resolution generally associated with single-end ChIP-seq.
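
The peak-pair logic can be illustrated with a minimal Python sketch. It assumes that each read is reduced to the coordinate of its 5' end and its strand, and it simply takes the most frequent 5' end on each strand as that strand's border. Actual peak calling is considerably more involved, and the function names and toy data are hypothetical.

```python
from collections import Counter


def strand_peak(reads, strand):
    """Most frequent 5'-end position on one strand (lowest position on ties)."""
    counts = Counter(pos for pos, s in reads if s == strand)
    if not counts:
        raise ValueError(f"no reads on strand {strand!r}")
    best = max(counts.values())
    return min(pos for pos, n in counts.items() if n == best)


def peak_pair(reads):
    """Return (left border, right border, midpoint) of a bound region.

    Forward-strand 5' ends mark the left exonuclease barrier and
    reverse-strand 5' ends mark the right barrier.
    """
    left = strand_peak(reads, "+")
    right = strand_peak(reads, "-")
    return left, right, (left + right) / 2


# Toy reads: (5'-end coordinate, strand).
reads = [(990, "+")] * 5 + [(991, "+")] + [(1010, "-")] * 4 + [(1012, "-")]
pair = peak_pair(reads)
print(pair)
```

Because exonuclease digestion proceeds 5' to 3' on each strand, the two borders come from opposite strands, and the protein is inferred to lie between them.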

ChIP-exo was applied to several yeast TFs, as well as to the human insulator-binding protein CTCF. Comparison of the yeast TF Reb1 ChIP-exo and ChIP-seq data revealed that ChIP-exo peaks displayed a standard deviation of 0.3 bp versus 24 bp for ChIP-seq peaks, a nearly 100-fold improvement in resolution [35]. The increased resolution of ChIP-exo revealed novel features regarding the modes of genomic binding by these factors. For instance, Reb1 displayed primary and secondary sites of occupancy. Secondary sites were defined as Reb1-occupied sites bound to a lesser extent than strongly occupied Reb1 sites within 100 bp. Notably, these primary-secondary Reb1 binding events were not resolved by standard ChIP-chip or ChIP-seq, indicating that ChIP-exo can resolve multiple crosslinking events within a single bound region. ChIP-exo analysis of other factors also revealed previously unidentified low-occupancy binding sites and refined and expanded the repertoire of sequence motifs associated with factor binding. For instance, CTCF occupancy was positively correlated with the presence of various sequence modules within a single consensus motif. CTCF binding sites with more modules tended to be more highly occupied, consistent with previous studies showing that CTCF uses various combinations of its 11 zinc fingers to bind distinct combinations of motif modules [36].

ChIP-exo addresses several limitations associated with conventional ChIP-seq. The precise mapping of nuclease protection boundaries allows for base-pair resolution determination of protein-bound sequences versus standard ChIP methodologies, which offer only an approximation of bound sequences. Additionally, unbound DNA contaminates ChIP samples, increasing background signal, which may lead to false positives in the case of highly enriched contaminating sequences and false negatives in the case of sites that are weakly bound by the protein of interest. Like MNase and DNase I, exonuclease treatment removes unbound DNA, greatly reducing the background of ChIP experiments (ChIP-exo signal-to-noise, 300- to 2,800-fold versus 7- and 80-fold for ChIP-chip and ChIP-seq, respectively), allowing for identification of low-occupancy binding sites and enabling in-depth analysis of relationships between DNA sequence and TF occupancy. Overall, ChIP-exo offers a base-pair resolution method by which to assess protein occupancy and further dissect the complex interplay between DNA sequence and TFs in genomic regulation and should be readily applicable to systems with available ChIP reagents.

Adapting other epigenomic methods to single base-pair resolution mapping

MNase-seq, DNase-seq, and ChIP-exo, discussed above, are successful modifications of classical techniques for genome-wide analysis of epigenomic features. However, many other techniques have been used to map epigenomes (Table 1). One such technique is a novel targeted chemical cleavage approach that provides base-pair resolution mapping of nucleosome positions [37]. We therefore asked if other current techniques could be adapted for single base-pair resolution epigenome mapping.

Formaldehyde-assisted isolation of regulatory elements (FAIRE) [38] and Sono-seq [39] have been routinely used to map regions of 'open' chromatin. Both techniques rely on the fact that nucleosomes are much more readily crosslinked to DNA than are DNA-binding proteins when cells are treated with formaldehyde. While there are some differences in the FAIRE and Sono-seq protocols, they are based on the same principle. Cells are treated with formaldehyde to crosslink protein-DNA interactions and cells or isolated nuclei are sonicated to shear chromatin. Following sonication, samples are subjected to phenol-chloroform extraction. DNA not crosslinked to proteins ('open' chromatin) is recovered in the aqueous phase, while protein-DNA complexes are retained in the interface. DNA from the aqueous phase is then analyzed by microarray hybridization or high-throughput sequencing. However, as sonication produces a heterogeneous mixture of fragments and only non-protein-associated DNA is recovered, the precise positions of particles delimiting regions of 'open' chromatin cannot be obtained with these techniques. To map the precise positions of DNA-occluding particles using either the FAIRE or Sono-seq chromatin preparation protocol, the protein-DNA complexes contained in the insoluble fraction, which is normally discarded, could be purified and subjected to exonuclease digestion to generate DNA ends a uniform distance from each protein-DNA crosslink, as in ChIP-exo. High-throughput sequencing of exonuclease-digested chromatin would then reveal precise locations of DNA-protecting particles, and this approach could also be coupled to affinity purification to precisely localize specific factors.

Summary and future directions

While the development of technologies for base-pair resolution characterization of epigenomes is still in its early stages, important insights regarding chromatin organization have already been obtained with these methods. ChIP-exo provides a method to precisely map the genomic binding of proteins in systems where ChIP reagents are readily available. MNase-seq allows for mapping of nucleosomes and non-histone proteins within a single sample and like DNase-seq is easily adapted to any system with a sequenced genome. In combination with ChIP-seq, MNase-seq and DNase-seq provide powerful methods for base-pair resolution identification of protein binding sites. These techniques are summarized schematically in Figure 2.

Figure 2

Summary of techniques for base-pair resolution epigenome mapping. Schematic representations of ChIP-exo, MNase-seq, and DNase-seq. In ChIP-exo, chromatin is sonicated and specific fragments are isolated with an antibody to a protein of interest. ChIP DNA is trimmed using λ exonuclease, purified, and sequenced. In MNase-seq, nuclei are isolated and treated with MNase to fragment chromatin. Chromatin is then subjected to DNA purification with or without prior affinity purification and MNase-protected DNA is sequenced. In DNase-seq, nuclei are isolated and treated with DNase I to digest chromatin. DNase-hypersensitive DNA is then ligated to linkers, affinity purified, and sequenced. HS, hypersensitive.

While epigenomic profiling is relatively straightforward in single-cell systems, it is more challenging in multicellular organisms, where different cell types are tightly interwoven in complex tissues. Indeed, ChIP-exo, MNase-seq, and DNase-seq have generally been performed either in yeast, which are unicellular, or in cultured cells from other organisms, which are not necessarily reflective of the in vivo situation in the organism from which they were derived. To profile cell type-specific epigenomes at base-pair resolution, it will be necessary to combine the above technologies with methods for the isolation of specific cell types from a complex milieu. One such method is fluorescence-activated cell sorting (FACS), involving purification of fluorescently labeled cells or nuclei. FACS has been used to isolate specific cell populations from mouse and human brain and mouse embryonic mesoderm for chromatin analysis [40, 41]. Another technique, isolation of nuclei tagged in specific cell types (INTACT), has been used to isolate nuclei from individual cell types in Arabidopsis, Caenorhabditis elegans, and Drosophila for expression and preliminary chromatin profiling [42, 43]. Combining these techniques with the various methods of base-pair resolution epigenome analysis detailed above should provide striking insights into the regulatory networks underlying specific cell identities.

As base-pair resolution epigenomic techniques are further developed and the cost of sequencing continues to decrease, genome-wide profiling of cell type-specific chromatin landscapes will become increasingly routine. The precise mapping of TFs, of nucleosomal features (positioning, occupancy, composition, and modification), and of ATP-dependent chromatin remodelers may provide the epigenomic equivalent of genome sequencing projects, delineating the regulatory frameworks by which the various cell types within an organism use the same genome to generate distinct cellular identities.