Structure determination of genomic domains by satisfaction of spatial restraints
The three-dimensional (3D) architecture of a genome is non-random and known to facilitate the spatial colocalization of regulatory elements with the genes they regulate. Determining the 3D structure of a genome may therefore prove an essential step in characterizing how genes are regulated. Currently, there are several experimental and theoretical approaches that aim at determining the 3D structure of genomes and genomic domains; however, approaches integrating experiments and computation to identify the most likely 3D folding of a genome at medium to high resolutions have not been widely explored. Here, we review existing methodologies and propose that the integrative modeling platform (http://www.integrativemodeling.org), a computational package developed for structurally characterizing protein assemblies, could be used for integrating diverse experimental data towards the determination of the 3D architecture of genomic domains and entire genomes at unprecedented resolution. Our approach, through the visualization of looping interactions between distal regulatory elements, will allow for the characterization of global chromatin features and their relation to gene expression. We illustrate our work by outlining the recent determination of the 3D architecture of the α-globin domain in the human genome.
Keywords: chromosome conformation, chromatin structure, integrative modeling, computational structural biology
Abbreviations
- 3C: Chromosome conformation capture
- 4C: Circular chromosome conformation capture
- 5C: Chromosome conformation capture carbon copy
- Bp, Kb, Mb: Base pairs, kilo base pairs, mega base pairs
- DamID: DNA adenine methyltransferase identification
- FISH: Fluorescence in situ hybridization
- IMP: Integrative modeling platform
- NMR: Nuclear magnetic resonance
- PCR: Polymerase chain reaction
- RMSD: Root mean square deviation
The three-dimensional (3D) architecture of a genome facilitates the colocalization in space of sequentially distant loci, which is essential for carrying out their specific functions (Takizawa et al. 2008). The determination of the 3D architecture of whole genomes or genomic domains is deemed necessary for characterizing how genes are regulated. Unfortunately, existing light microscopy technologies like fluorescence in situ hybridization (FISH), although very informative, are not yet sufficient to provide high-resolution information on chromatin loops and the interactions between distal DNA segments. Therefore, an integrative and general approach for determining the spatial organization of chromatin may prove very useful not only for identifying long-range relationships between genes and distant regulatory elements but also for elucidating chromatin higher-order folding principles.
Previously, chromatin conformation has been modeled, for example, using polymer physics (Dekker et al. 2002; Mateos-Langerak et al. 2009) and molecular dynamics (Wedemann and Langowski 2002). Such methods rely on physics-based approaches and only partially leverage the current wealth of experimental data on chromatin folding; however, they have proven to be very valuable for understanding general features of chromatin fibers, including flexibility, compaction, and unpacking (Dekker 2008; Wachsmuth et al. 2008; Langowski and Heermann 2007; Rosa and Everaers 2008). This special issue of the Chromosome Research journal includes several such approaches (Fritsch and Langowski 2010; Mirny 2010).
Higher resolution experimental techniques such as chromosome conformation capture (3C)-based approaches (Dekker et al. 2002; Lieberman-Aiden et al. 2009; Dostie et al. 2007; Zhao et al. 2006; Simonis et al. 2006) have prompted the development of new hybrid methods that aim at integrating experiments and computation for identifying the most likely 3D folding of a chromatin domain at medium to high resolutions. The 3C-driven approaches, in combination with computational modeling, have so far been capable of generating medium resolution models of the topological conformation of the HoxA cluster (Fraser et al. 2009), the α-globin domain (Baù et al. 2010), and the yeast genome (Duan et al. 2010).
This review starts by describing existing 3C-based methodologies for 3D determination of genomic domains and proposes that the integrative modeling platform (IMP, http://www.integrativemodeling.org) (Alber et al. 2007a) can be used for integrating such experimental data towards the determination of their 3D architecture at unprecedented resolutions. We then review the modeling of the 3D architecture of the α-globin domain in chromosome 16 of the human genome, which has been recently described in detail (Baù et al. 2010). This review ends by summarizing our thoughts on future directions for the structure determination of genomic domains.
Chromatin interaction maps for 3D modeling
3C-based methods allow the investigation of the overall spatial organization of genomic domains, chromosomes, and entire genomes (Miele and Dekker 2009). Briefly, 3C-based methods rely on formaldehyde cross-linking between spatially close DNA loci through their corresponding bound proteins. Cross-linked DNA is then digested with a specific restriction enzyme and ligated under very diluted conditions that favor intramolecular (i.e., the cross-linked fragments) over intermolecular ligation. Cross-linking reversal and ligation product quantification by the polymerase chain reaction (PCR) using locus-specific primers returns the frequencies at which interactions occur. Given that the original 3C technique (Dekker et al. 2002) was only applicable to single pairwise loci at a time and within a relatively small DNA region (up to hundreds of kilo base pairs), there has been a plethora of new approaches that expanded 3C. For example, the so-called circular chromosome conformation capture (4C) techniques allow for the characterization of a genomic domain by a one-to-many loci interaction analysis. Those approaches take advantage of the fact that most of the 3C ligation products are already circular or can be easily circularized and then inversely amplified by PCR. Four different laboratories developed 4C technologies in parallel, which differ in the restriction enzymes used, the step at which the circular DNA is formed and the analysis of the amplified fragments. Such methods include circular 3C (Zhao et al. 2006), 3C-on-chip (Simonis et al. 2006), open-ended 3C (Wurtele and Chartrand 2006), and “olfactory receptor” 3C (Lomvardas et al. 2006). More recently, the 3C carbon copy technology (5C) was developed to allow the simultaneous and parallel detection of interactions within relatively large genomic domains or even entire chromosomes (Dostie and Dekker 2007; Dostie et al. 2006). 
In 5C, the PCR step of 3C is replaced by ligation-mediated amplification (LMA) followed by the detection of ligation products. With LMA, it is possible to use simultaneously thousands of primers, allowing the parallel detection of millions of chromatin interactions. Thus, the generated 5C library is an “amplified” version of the 3C library that can be analyzed by microarray analysis or deep sequencing. Since 5C experiments are designed so that 5C primers anneal across the 3C ligation products, the specific design of the 5C primers allows interrogating multiple pairwise loci interactions in an all-against-all fashion (Lajoie et al. 2009). Such experiments are thus suitable for generating complete and extensive loci interaction matrices for large genomic regions (up to a few mega base pairs). Finally, the Hi-C technology, which allows for an unbiased genome-wide analysis, was recently developed to overcome the need for predefining a set of target loci to investigate (van Berkum et al. 2010). Hi-C’s key step is the imprinting of ligated products (i.e., products of interacting loci) with a biotin marker that later on is precipitated with streptavidin beads. Such a step allows for specifically rescuing ligation products and discarding non-ligated DNA, which is necessary for further large-scale genome-wide sequencing of the ligation products. Hi-C was recently applied to the entire human genome at 1 Mb resolution (Lieberman-Aiden et al. 2009).
The 3C-based methods result in a measure of the frequency of interaction between loci located within the studied genomic domain; however, they do not give direct information on the spatial distances between the interacting loci. It is in this scenario that the integration of 3C-based experiments with computational analysis is necessary for further determining the 3D conformation of a genomic domain. For example, Dostie and colleagues developed a suite of computer programs to identify the so-called “chromatin conformation signatures” using 5C data (Fraser et al. 2009). The work resulted in distinct structures of the HoxA cluster depending on the cellular differentiation stage. Starting from a random conformation, the models were iteratively changed to obtain a 3D conformation that minimized its root mean square deviation to the theoretical interloci distances, calculated simply as the inverse of the 5C interaction frequencies. The final models were then used to visualize the chromatin conformation signatures of the HoxA cluster. More recently, Noble and colleagues (Duan et al. 2010), using a similar approach, built 3D models of the entire yeast genome coupling 4C (Simonis et al. 2006) with massively parallel sequencing. Interaction frequencies calculated by 4C were converted into distances upon the assumption that polymer packing determines intrachromosomal distances (Bystricky et al. 2004). In particular, interaction frequencies were translated into nuclear distances by assigning 130 bp of packed chromatin to a length of 1 nm. Each chromosome was then represented as a series of beads of 10 Kb, which were assigned to the closest restriction enzyme fragment resulting from the 4C experiment. Finally, a 3D model of the entire yeast genome was constructed by minimizing an objective function that scored the fitting of all bead distances to the theoretical distances as derived from the 4C experiments.
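The two conversions described above can be illustrated with a minimal Python sketch. It is not the code used in either study: the function names and the toy frequencies are hypothetical, and only the inverse-frequency rule, the root mean square deviation, and the 130 bp per nm packing ratio come from the text.

```python
import math

def frequencies_to_distances(freqs):
    """Target distance for each locus pair, taken as 1 / interaction frequency."""
    return {pair: 1.0 / f for pair, f in freqs.items() if f > 0}

def rmsd_to_targets(coords, targets):
    """Root mean square deviation between model distances and target distances."""
    sq = [(math.dist(coords[i], coords[j]) - d) ** 2 for (i, j), d in targets.items()]
    return math.sqrt(sum(sq) / len(sq))

def bp_to_nm(length_bp, bp_per_nm=130.0):
    """Packed chromatin length, assuming 130 bp per nm as quoted in the text."""
    return length_bp / bp_per_nm

# Hypothetical interaction frequencies for three loci (arbitrary units).
freqs = {(0, 1): 0.5, (1, 2): 0.25, (0, 2): 0.125}
targets = frequencies_to_distances(freqs)

# A hypothetical candidate model with the three loci placed on a line.
coords = {0: (0.0, 0.0, 0.0), 1: (2.0, 0.0, 0.0), 2: (6.0, 0.0, 0.0)}
model_rmsd = rmsd_to_targets(coords, targets)

print(targets)
print(model_rmsd)
print(bp_to_nm(10_000))  # packed length of one 10 Kb bead, in nm
```

A modeling run would repeatedly perturb `coords` and keep changes that lower `model_rmsd`; the search itself is omitted here.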
As outlined above, 3C-based experimental data about the structure of genomic domains can only be translated into a 3D model via computational methods. Next, we propose that the IMP modeling software can be used for such a task.
Structure determination by IMP
IMP’s conceptual framework for structure determination is similar to the determination of protein structure by two-dimensional (2D) nuclear magnetic resonance (NMR) spectroscopy, where the nuclear Overhauser effect between nuclear spins is used to observe, via the 2D-nuclear Overhauser effect spectroscopy (NOESY) spectra, correlations between resonances from spins that are spatially close. In NMR, a polypeptide is represented by its atoms (particles in IMP), which will be placed in the 3D space based on the spatial distances between them (restraints in IMP) calculated from the 2D-NOESY maps (Wagner et al. 1987). In contrast to constraints, restraints are subject to probability distributions allowing the restrained particles to move within predefined limits. The final 3D ensemble of solutions of a biomolecule corresponds to the spatial positioning of all atoms that best satisfies the input experimental restraints. In contrast to NMR determination, which relies on 2D-NOESY data, IMP was developed as a general platform for simultaneously integrating diverse structural information available about the object of interest (Alber et al. 2008). Such data may greatly vary in accuracy and resolution and may originate from any type of experimental or theoretical observations of the system. Data integration by IMP normally results in a deterministic ensemble of solutions of higher resolution than any of the individual observations (Alber et al. 2007b). The IMP conceptual framework consists of four steps: representation, scoring, optimization, and analysis.
The first step in the integration of experimental data into a computational framework is the definition of an adequate representation of the system so that the use of the available information makes the search of the 3D conformational space feasible. Indeed, the detail of representation (or resolution) of the system determines the accessible conformational space. In other words, coarse-grained representations are more suitable for large conformational searches while fine-grained representations require more computational power to explore the same search space. IMP represents an object by a set of hierarchical particles and their properties (including their Cartesian coordinates that spatially position them). The hierarchy of the particles allows for a flexible representation of the system at different resolutions, which allows for the appropriate representation of the diverse input data.
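To make the idea of a hierarchical representation concrete, the following Python sketch defines a small particle tree in which a coarse particle can be positioned from its finer children. The `Particle` class, its fields, and the example coordinates are hypothetical illustrations and do not reproduce the IMP data structures or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Particle:
    """A node in a hypothetical particle hierarchy (not the IMP API)."""
    name: str
    xyz: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    children: List["Particle"] = field(default_factory=list)

    def leaves(self):
        """Return the finest-grained particles below this node."""
        if not self.children:
            return [self]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out

    def centroid(self):
        """Coarse-grained position: mean coordinates of all leaf particles."""
        pts = [p.xyz for p in self.leaves()]
        return tuple(sum(p[k] for p in pts) / len(pts) for k in range(3))

# Two fragments, each split into two finer segments (coordinates are made up).
frag1 = Particle("frag1", children=[Particle("s1", (0.0, 0.0, 0.0)),
                                    Particle("s2", (2.0, 0.0, 0.0))])
frag2 = Particle("frag2", children=[Particle("s3", (4.0, 0.0, 0.0)),
                                    Particle("s4", (6.0, 0.0, 0.0))])
domain = Particle("domain", children=[frag1, frag2])

print(frag1.centroid(), domain.centroid())
```

Low-resolution data can then be applied to `frag1` or `domain`, and higher-resolution data to the individual segments, without changing the underlying tree.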
The key step in structure determination by IMP is the proper evaluation of the generated models (i.e., the different solutions compatible with the input data). Therefore, the observations about the system—be it experimental, physical, or statistical—need to be translated into measurable and formulable relationships between the particles that represent the system. For this purpose, IMP uses joint probability density functions (pdf) affecting attributes of the particles including their Cartesian coordinates. Each independent observation of the system thus results in a number of pdfs affecting one or many particles. The final scoring function, also called the IMP objective function, will be then the sum of the individual pdfs affecting all particles in the system. The functional forms of the restraints implemented in IMP are diverse and were initially developed to determine the structure of protein assemblies (Alber et al. 2008).
Once the system is represented at the appropriate scale(s) and the relationship between the particles is formulated based on the observations, a conformational solution of the modeled object is obtained by minimizing the IMP objective function, that is, by simultaneously reducing the violations of all imposed restraints. Since many different conformational solutions could satisfy (to a certain degree) the imposed restraints, it is necessary to generate a large number of independent structures to ensure an adequate conformational search. The selection of the optimization protocol in IMP depends on the representation and scoring schema of the system (Alber et al. 2008).
Finally, the structural analysis of the resulting ensemble of possible solutions consistent with the input restraints will reveal important aspects of the IMP modeling. It may inform, for example, on the degree of satisfaction of the imposed restraints, conflicts between different experimental observations or the intrinsic structural variability of the modeled object. Moreover, such analysis may prove useful for designing new, more informative experiments which may help to increase the resolution of the models.
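One simple form of such an analysis is to count, for each restraint, the fraction of models in the ensemble that violate it. The Python sketch below assumes that every restraint is a target distance between two particles and that a fixed tolerance defines a violation; the function names, the tolerance, and the toy ensemble are hypothetical.

```python
import math

def violated(coords, restraint, tol=0.1):
    """True if the model distance differs from the target by more than tol."""
    i, j, d0 = restraint
    return abs(math.dist(coords[i], coords[j]) - d0) > tol

def violation_frequency(models, restraints, tol=0.1):
    """Fraction of models violating each restraint, in restraint order."""
    return [sum(violated(m, r, tol) for m in models) / len(models)
            for r in restraints]

# Two hypothetical models of three particles.
models = [
    {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (0.0, 2.0, 0.0)},
    {0: (0.0, 0.0, 0.0), 1: (3.0, 0.0, 0.0), 2: (0.0, 2.0, 0.0)},
]
# The first two restraints ask for incompatible distances between particles 0 and 1.
restraints = [(0, 1, 1.0), (0, 1, 3.0), (0, 2, 2.0)]

freqs = violation_frequency(models, restraints)
print(freqs)
```

Restraints that are each violated in a large share of the models, like the first two here, point to conflicting observations or to genuine structural variability, whereas a frequency of 0 indicates a restraint satisfied by the whole ensemble.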
Next, we briefly describe our recent work (Baù et al. 2010) using IMP for determining the 3D conformation of the α-globin domain in the human chromosome 16 based solely on 5C experimental observations.
3D models of the α-globin domain using 5C data: a proof of principle
Modeling the 3D structure of a complex system such as a genomic domain implies the exploration of a large conformational space. Determining the exact level of representation is important for balancing the need to capture all observations about the system and the required computational time to generate solutions that satisfy such observations. The α-globin domain was represented by a set of 70 particles, one for each of the resulting restriction fragments after digestion by HindIII. Each particle in the system was assigned an excluded volume defined as a sphere of radius proportional to the particle’s size (in base pairs). Considering the canonical 30-nm fiber, the relationship between length and base content was set to 0.01 nm per base pair (Gerchman and Ramakrishnan 1987). After the system was properly represented, a set of restraints was assigned to each particle defining their relative 3D position and thus the final spatial organization of the whole α-globin domain.
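The representation step can be sketched in a few lines of Python. The 0.01 nm per base pair ratio comes from the text; treating the resulting length as the diameter of the sphere (so that the radius is half of it) is an assumption made here for illustration, and the fragment lengths are hypothetical.

```python
NM_PER_BP = 0.01  # length per base pair for the canonical 30-nm fiber (from the text)

def fragment_radius_nm(length_bp, nm_per_bp=NM_PER_BP):
    """Sphere radius for a restriction fragment.

    Assumption: the fragment's occupied length (length_bp * nm_per_bp)
    is taken as the sphere diameter, so the radius is half of it.
    """
    return 0.5 * length_bp * nm_per_bp

# Hypothetical HindIII fragment lengths in base pairs.
fragments_bp = [4000, 7500, 12000]
radii = [fragment_radius_nm(n) for n in fragments_bp]

# Distance at which two consecutive spheres touch (sum of their radii).
touching = [radii[i] + radii[i + 1] for i in range(len(radii) - 1)]

print(radii)
print(touching)
```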
The 5C data consist of a matrix of interaction frequencies between restriction fragments which do not give a direct measure of the Euclidean distances between the particles representing the fragments. Nevertheless, given the assumption that spatially adjacent fragments interact more frequently than spatially distant ones, the frequency by which two loci are captured in an experiment represents the probability for those two loci to be spatially close in a given cell state (Lieberman-Aiden et al. 2009). Restraints were assigned to each of the 70 particles in our system following three basic assumptions: (1) neighbor (i.e., i to i + 1..2) and non-neighbor (i.e., i to i + 3..n) particles followed different 5C Z score distributions, which reflected their different responses in 5C experiments (Dekker 2006); (2) consecutive particles were spatially restrained proportionally to the occupancy of their chromatin fragments with a relationship of 0.01 nm/bp, assuming a canonical 30-nm fiber; and (3) two non-neighbor fragments could not get closer than 30 nm, which corresponds to the diameter of the chromatin fiber. Given these assumptions, we were able to define two different linear relationships to map 5C Z scores onto Euclidean distances that were then used to restrain pairs of particles. Distances between neighbor particles were derived from the relationship between the sum of the radii of the experimental restriction enzyme fragments involved in the interaction and their corresponding 5C Z scores (red points and regression line in Fig. 1c). For non-neighbor particles, three IMP parameters were empirically determined to define the type and magnitude of the restraint to be applied: the minimum possible distance between two non-interacting particles (yellow regression line in Fig. 1c) and the Z score upper- and lower-bound cutoffs (blue dashed vertical lines in Fig. 1c).
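The general shape of such a mapping can be sketched as follows. This is an illustration only: the Z scores are computed directly from toy frequencies, and the slope and intercept of the linear relationship are hypothetical placeholders rather than the fitted values. The 30-nm floor for non-neighbor pairs follows assumption (3) above.

```python
import statistics

def z_scores(values):
    """Standard scores: (value - mean) / population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def z_to_distance(z, slope, intercept, floor_nm=30.0):
    """Linear map from a Z score to a distance in nm, never below floor_nm."""
    return max(floor_nm, slope * z + intercept)

# Hypothetical interaction frequencies for five non-neighbor pairs.
freqs = [1.0, 2.0, 3.0, 4.0, 5.0]
zs = z_scores(freqs)

# Hypothetical linear relationship: higher Z score means shorter distance.
SLOPE, INTERCEPT = -100.0, 200.0
distances = [z_to_distance(z, SLOPE, INTERCEPT) for z in zs]

print(zs)
print(distances)
```

In practice a separate slope and intercept would be used for neighbor and for non-neighbor pairs, in line with the two linear relationships mentioned in the text.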
The optimal values for these three IMP parameters were obtained by maximizing the correlation coefficient between the input 5C frequency matrix and a contact map generated from the 3D models built by IMP. The correlation coefficient was 0.69 for the α-globin models built using 400 nm as the maximum distance between non-neighbor fragments, −0.1 for the lower-bound Z score cutoff, and +0.9 for the upper-bound Z score cutoff (Fig. 1c). Three different types of restraints were then applied to the 70 particles: (1) harmonic restraints, which were set between pairs of neighbor particles and between pairs of non-neighbor particles with Z scores higher than the upper-bound cutoff, maintained two particles at a given equilibrium distance (red in Fig. 1d); (2) lower-bound harmonic restraints, which were set between pairs of non-neighbor particles with Z scores lower than the lower-bound cutoff, maintained two particles farther than a given equilibrium distance (blue in Fig. 1d); and (3) upper-bound harmonic restraints, which were set between pairs of neighbor particles with no experimental data available, maintained two particles within a given equilibrium distance (green in Fig. 1d). For example, an upper-bound harmonic was set between neighbor particles 26 and 28 with missing 5C data, a pair of harmonic restraints was set between non-neighbor particle 20 and particles 26 and 28 with 5C Z scores higher than +0.9, and a pair of lower-bound harmonic restraints was set between non-neighbor particle 62 and particles 26 and 28 with Z scores lower than −0.1 (Fig. 1e). In total, the 70 particles representing the α-globin domain were restrained by 1,049 restraints, of which 235 were harmonic, 709 lower-bound harmonic, and 105 upper-bound harmonic.
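The assignment rules and the three restraint forms can be summarized in a short Python sketch. The cutoffs (−0.1 and +0.9) and the rules come from the text; the function names, the force constant `k`, and the choice to leave non-neighbor pairs without data unrestrained are assumptions, and none of this reproduces the IMP API.

```python
def harmonic(d, d0, k=1.0):
    """Penalizes any deviation from the equilibrium distance d0."""
    return k * (d - d0) ** 2

def lower_bound_harmonic(d, d0, k=1.0):
    """Penalizes only distances shorter than d0 (keeps particles apart)."""
    return k * (d - d0) ** 2 if d < d0 else 0.0

def upper_bound_harmonic(d, d0, k=1.0):
    """Penalizes only distances longer than d0 (keeps particles together)."""
    return k * (d - d0) ** 2 if d > d0 else 0.0

def choose_restraint(is_neighbor, z, lower=-0.1, upper=0.9):
    """Pick a restraint type for a particle pair; z is None when 5C data are missing."""
    if z is None:
        # Assumption: non-neighbor pairs without data are left unrestrained.
        return "upper_bound" if is_neighbor else None
    if is_neighbor or z > upper:
        return "harmonic"
    if z < lower:
        return "lower_bound"
    return None  # non-neighbor pair with a Z score between the two cutoffs

print(choose_restraint(True, None))    # neighbor pair, missing data
print(choose_restraint(False, 1.2))    # non-neighbor pair, strong interaction
print(choose_restraint(False, -0.5))   # non-neighbor pair, weak interaction
print(choose_restraint(False, 0.3))    # non-neighbor pair, between cutoffs
```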
Once the system is represented by a set of particles and their imposed restraints, IMP 3D structure determination is expressed as an optimization problem. Starting from a random configuration (Fig. 1f), particles are moved in search of the relative positions in space (i.e., a 3D conformation) that minimally violate the imposed restraints (Fig. 1g). The quality of a model is then measured by the IMP objective function, which is the sum of violations of all individual restraints applied to the particles representing the system. Each restraint is scored based on the difference between the distance in the model and its equilibrium distance. For harmonic restraints, the score scales with the square of the distance difference. When all the restraints are satisfied in the final model, the IMP objective function of the system is 0, whereas inconsistencies are penalized depending on the magnitude of the individual violations. The optimization protocol adopted in our method consisted of 500 Monte Carlo rounds and five local optimization steps taken at each round with a simulated annealing protocol (Kirkpatrick et al. 1983). At each Monte Carlo step, a defined set of particles was moved by translating their Cartesian coordinates by a random displacement drawn from a Gaussian distribution with a sigma of 0.25 centered on their current position. According to the Metropolis criterion, a move that lowered the IMP objective function was always accepted, whereas a move that raised it was accepted with a probability that decreased exponentially with the size of the increase relative to the temperature of the system. To ensure a proper search of the conformational space, this protocol was run 50,000 times with different random starting conditions. Therefore, our model building by IMP resulted in 50,000 different solutions of the α-globin domain structure.
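A simplified version of this kind of search is sketched below in Python. It is not the protocol used for the α-globin models: only harmonic distance restraints are scored, one particle is moved per step, the local optimization steps are omitted, and the geometric cooling schedule, temperatures, and toy restraints are assumptions. The 500 rounds and the Gaussian sigma of 0.25 follow the text.

```python
import math
import random

def score(coords, restraints):
    """Sum of squared deviations from equilibrium distances (harmonic only)."""
    return sum((math.dist(coords[i], coords[j]) - d0) ** 2
               for i, j, d0 in restraints)

def anneal(restraints, n_particles, rounds=500, sigma=0.25,
           t_start=1.0, t_end=0.01, seed=0):
    """Monte Carlo search with Metropolis acceptance and geometric cooling."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)]
              for _ in range(n_particles)]
    initial = current = score(coords, restraints)
    best, best_coords = current, [p[:] for p in coords]
    for step in range(rounds):
        # Assumed schedule: temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, rounds - 1))
        i = rng.randrange(n_particles)
        old = coords[i][:]
        coords[i] = [x + rng.gauss(0.0, sigma) for x in old]
        delta = score(coords, restraints) - current
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current += delta
            if current < best:
                best, best_coords = current, [p[:] for p in coords]
        else:
            coords[i] = old  # reject the move
    return initial, best, best_coords

# Hypothetical restraints: three particles at unit distance from each other.
restraints = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]

# Independent runs from different random starts; keep the lowest-scoring one.
runs = [anneal(restraints, 3, seed=s) for s in range(20)]
best_run = min(runs, key=lambda r: r[1])
print(best_run[0], best_run[1])
```

Running many independent searches and keeping all of them, rather than only the best one, yields the ensemble of solutions used in the analysis step.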
3D models of the α-globin domain reveal the formation of chromatin globules
Our 3D models, which were validated by FISH experiments (Baù et al. 2010), accurately reflected the known long-range interactions between the α-globin genes and their distant regulatory elements. The 3D structure of the α-globin domain showed the presence of a higher-order chromatin folding motif in which groups of adjacent genes clustered together to form what we termed “chromatin globules.” The general features of such chromatin globules included the existence of a limited number of chromatin loops of 50–70 Kb with an average path length of ~500 nm, and anchoring points separated by ~100–200 nm. Analysis of the internal architecture of these globules revealed that active genes were enriched in the cores of these structures. These observations suggested that chromatin globules could represent subnuclear structures dedicated to gene expression, perhaps related to the clustering of shared transcription machineries.
Experimental observations on the structure of genomic domains
Acknowledgments
We thank the Dekker group for their support during the development of our approach. We also thank the IMP community (especially the Sali Lab) and the Chimera developers (http://www.cgl.ucsf.edu/chimera). Financial support from the Spanish Ministerio de Ciencia e Innovación (BIO2007/66670 and BFU2010/19310) is also acknowledged. This review was partially based on the authors’ previous work (Baù et al. 2010).
References
- Baù D, Sanyal A, Lajoie B, Capriotti E, Byron M, Lawrence JB, Dekker J, Marti-Renom MA (2010) The three-dimensional folding of the α-globin gene domain reveals formation of chromatin globules. Nat Struct Mol Biol. doi: 10.1038/nsmb.1936
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermuller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J et al (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799–816
- Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, Green RD, Dekker J (2006) Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 16:1299–1309
- Dostie J, Zhan Y, Dekker J (2007) Chromosome conformation capture carbon copy technology. Curr Protoc Mol Biol, Chapter 21, Unit 21.14
- Fritsch C, Langowski J (2010) Chromosome dynamics, molecular crowding, and diffusion in the interphase cell nucleus: a Monte Carlo lattice simulation study. Chromosome Res. doi: 10.1007/s10577-010-9168-1
- Hughes JR, Cheng JF, Ventress N, Prabhakar S, Clark K, Anguita E, De Gobbi M, De Jong P, Rubin E, Higgs DR (2005) Annotation of cis-regulatory elements by identification, subclassification, and functional assessment of multispecies conserved sequences. Proc Natl Acad Sci U S A 102:9830–9835
- Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289–293
- Mirny L (2010) The fractal globule as a model of chromatin architecture in the cell. Chromosome Res. doi: 10.1007/s10577-010-9177-0
- Van Berkum NL, Lieberman-Aiden E, Williams L, Imakaev M, Gnirke A, Mirny LA, Dekker J, Lander ES (2010) Hi-C: a method to study the three-dimensional architecture of genomes. J Vis Exp (39). pii:1869
- Wagner G, Braun W, Havel TF, Schaumann T, Go N, Wuthrich K (1987) Protein structures in solution by nuclear magnetic resonance and distance geometry. The polypeptide fold of the basic pancreatic trypsin inhibitor determined using two different algorithms, DISGEO and DISMAN. J Mol Biol 196:611–639
- Zhao Z, Tavoosidana G, Sjolinder M, Gondor A, Mariano P, Wang S, Kanduri C, Lezcano M, Sandhu KS, Singh U, Pant V, Tiwari V, Kurukuti S, Ohlsson R (2006) Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet 38:1341–1347