Reconstructing high-resolution chromosome three-dimensional structures by Hi-C complex networks
Hi-C data have been widely used to reconstruct chromosomal three-dimensional (3D) structures. One of the key limitations of Hi-C is the unclear relationship between spatial distance and the number of Hi-C contacts. Many methods used a fixed parameter when converting the number of Hi-C contacts to wish distances. However, a single parameter cannot properly explain the relationship between wish distances and genomic distances or the locations of topologically associating domains (TADs).
We have addressed one of the key issues of using Hi-C data, that is, the unclear relationship between spatial distances and the number of Hi-C contacts, which is crucial to understand significant biological functions, such as the enhancer-promoter interactions. Specifically, we developed a new method to infer this converting parameter and pairwise Euclidean distances based on the topology of the Hi-C complex network (HiCNet). The inferred distances were modeled by clustering coefficient and multiple other types of constraints. We found that our inferred distances between bead-pairs within the same TAD were apparently smaller than those distances between bead-pairs from different TADs. Our inferred distances had a higher correlation with fluorescence in situ hybridization (FISH) data, fitted the localization patterns of Xist transcripts on DNA, and better matched 156 pairs of protein-enabled long-range chromatin interactions detected by ChIA-PET. Using the inferred distances and another round of optimization, we further reconstructed 40 kb high-resolution 3D chromosomal structures of mouse male ES cells. The high-resolution structures successfully illustrate TADs and DNA loops (peaks in Hi-C contact heatmaps) that usually indicate enhancer-promoter interactions.
We developed a novel method to infer the wish distances between DNA bead-pairs from Hi-C contacts. High-resolution 3D structures of chromosomes were built based on the newly-inferred wish distances. This whole process has been implemented as a tool named HiCNet, which is publicly available at http://dna.cs.miami.edu/HiCNet/.
Keywords: Chromosomal three-dimensional structure; Hi-C complex network; Wish distance; Converting parameter; Small-world network; Topologically associating domain
FISH: Fluorescence in situ hybridization
TAD: Topologically associating domain
Chromosome conformation capture techniques [1, 2, 3, 4] can detect physical interactions between pairs of genomic loci. In particular, the more recent Hi-C technique can identify chromosomal contacts at the whole-genome level. In the past few years, Hi-C experiments have been conducted on different species and cell lines [5, 6, 7, 8, 9], and the resolution of Hi-C experiments has improved from 1 Mb to 1 kb [6, 9]. Recently, a computational method based on deep learning has been developed to enhance Hi-C data resolution.
Hi-C contact data have been widely used in different fields, such as exploring the Xist transcript mechanism, predicting DNA methylation, and revealing structural properties of chromosomes, e.g., topologically associating domains (TADs) and peaks/loops. TADs, which are chromosomal segments of megabase size or smaller, have been found to be conserved between different cell lines and across different species. TADs are identified based on the property that Hi-C contact counts within a TAD are markedly higher than those between two adjacent TADs. It has also been shown that the boundary regions of TADs are enriched with certain genomic factors, such as the insulator binding protein CTCF. Loops are identified from local peaks in a Hi-C contact matrix: the peak pixels show a clear enrichment of Hi-C contacts, while the pixels in their neighbourhood do not have high contact counts. A peak indicates that a loop may physically reside in the peak region. Peaks are also conserved across different cell lines and species and can reside at topological domain boundaries and CTCF binding sites. However, raw Hi-C data have been shown to contain systematic biases [13, 14], which need to be removed before the data are used. Several efficient normalization tools eliminate the known biases (e.g., restriction enzyme cutting sites, GC content, and mappability) in raw Hi-C data, such as Hicpipe, ICE, HiCNorm, KR [9, 17], and scHiCNorm.
Another important application of Hi-C data is to reconstruct chromosome 3D structures. Several methods based on simulation and probability models have been developed [18, 19, 20, 21, 22, 23, 24]. A widely used approach first converts Hi-C contacts into wish Euclidean distances, based on the assumption that wish distances follow a power-law relationship with Hi-C contacts (δ = c^(-α), where δ is the wish distance, c is the Hi-C contact number, and α is a converting parameter), and then runs an optimization process that calculates three-dimensional coordinates using algorithms such as metric multidimensional scaling [21, 22, 24].
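The power-law conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HiCNet implementation: the function name, the toy contact matrix, and the choice to leave zero-contact pairs undefined are all hypothetical.

```python
def contacts_to_wish_distances(contacts, alpha):
    """Convert a square contact matrix to wish distances via delta = c ** (-alpha).

    Assumption for this sketch: pairs with zero contacts carry no distance
    information, so they are returned as None instead of a number.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [
        [c ** (-alpha) if c > 0 else None for c in row]
        for row in contacts
    ]


# Hypothetical 3-bead contact matrix (symmetric, zero diagonal).
toy_contacts = [
    [0, 8, 1],
    [8, 0, 27],
    [1, 27, 0],
]

# A single fixed alpha for all bead pairs, as in the fixed-parameter approach.
wish = contacts_to_wish_distances(toy_contacts, alpha=1 / 3)
```

More contacts map to a smaller wish distance, so the bead pair with 27 contacts ends up closer than the pair with 8, which in turn is closer than the pair with a single contact.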
It has been observed that the Hi-C contact probability of mammalian chromosomes is inversely proportional to genomic distance on each chromosome (c ~ s^(-1)). Meanwhile, based on previous studies of polymers, the volume scales proportionally with the chain length, i.e., the genomic distance (d^3 ~ s). Therefore, Varoquaux et al. concluded that the relationship between Hi-C contacts and spatial distances was d ~ c^(-1/3) (i.e., α = 1/3). Based on this conclusion, they modeled chromosomal 3D structures at different resolutions using the same parameter (1/3). However, this fixed conversion between the number of Hi-C contacts and wish distances has drawbacks, especially when applied to different resolutions, different organisms [21, 26], and different time points during the cell cycle. When the number of Hi-C contacts is larger than 10, the wish distances converted using δ ~ c^(-1/3) are very small and differ little from one another (Additional file 1: Figure S1a), which makes it hard to distinguish these interactions in terms of spatial distance. For example, among the contacts between positions 20 beads apart (a chromosome is evenly divided into beads, and each bead is 40 kb), more than 50% have more than 10 Hi-C contacts in today’s high-resolution Hi-C data sets (Additional file 1: Figure S1b). This indicates that the δ ~ c^(-1/3) formula may not work well now that Hi-C experiments reach high resolution by generating significantly larger numbers of Hi-C reads.
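The scaling argument behind α = 1/3 can be written out step by step. The block below only restates the two proportionalities given in the text and combines them; the proportionality constants are omitted.

```latex
\begin{align}
c &\sim s^{-1}
  && \text{(contact probability vs. genomic distance)} \\
d^{3} &\sim s \;\Rightarrow\; d \sim s^{1/3}
  && \text{(polymer volume scaling)} \\
d &\sim \left(c^{-1}\right)^{1/3} = c^{-1/3}
  && \text{(substituting } s \sim c^{-1}\text{)}
\end{align}
```

As a worked instance with the proportionality constant set to 1, c = 10 gives δ ≈ 0.46 and c = 100 gives δ ≈ 0.22, so a tenfold increase in contacts shrinks the wish distance by a factor of only about 0.46.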
Therefore, it is reasonable to assume δ ~ c^(-α), but α should be bead-pair dependent instead of a fixed value for all bead-pairs. Zhang et al. designed a method to dynamically assign values for α: it used semi-definite embedding to infer the spatial organization of chromosomes and then calculated Hi-C contacts in reverse to obtain the optimal α, for which the inferred Hi-C contacts best fitted the original ones. The whole process was time-consuming because it needed to reconstruct the 3D structure first. In comparison, our method does not need to generate a 3D structure first. Chromosome3D used the Spearman correlation between Hi-C contacts and inferred distances to tune the parameter, but it still needed to generate many structures to obtain the best parameter.
To evaluate a reconstructed 3D structure, the distances parsed from it are usually compared with fluorescence in situ hybridization (FISH) data [6, 19, 20]. The chromosomal interactions detected by FISH are usually considered accurate and are therefore used as benchmarks. However, such benchmarks are small in scale because usually only a few genomic interactions can be measured by FISH. Therefore, we also used the Xist localization intensity on the X-chromosome and ChIA-PET data to evaluate our structures.
Engreitz et al. conducted RNA Antisense Purification (RAP) experiments in mouse embryonic stem (ES) cells to detect the localization intensities of the lncRNA Xist while the X-chromosome was being inactivated. They found that Xist transcripts bound more intensively at DNA sites in spatial proximity to the Xist locus and less intensively at DNA sites spatially far from the Xist locus (Hi-C contact data were used to measure spatial proximity). They detected a significant correlation between 3D distances to the Xist locus and the Xist localization intensities. If the inferred distances or inferred 3D structures are accurate, the same strong correlation should be found.
Dowen et al. applied cohesin ChIA-PET in mouse ES cells to detect protein-enabled long-range chromatin interactions. A unique feature of ChIA-PET is the inclusion of chromatin immunoprecipitation (ChIP) at the beginning to enrich the fragments bound by a particular protein of interest. Together with the design of using two aliquots before fragment ligation, this makes ChIA-PET good at detecting protein-enabled interactions. Therefore, we can use these ChIA-PET-confirmed interactions to evaluate our inferred Euclidean distances or reconstructed 3D structures.
In this study, we present a new method to model the converting factor α based on the tendency of a bead to be clustered with neighboring beads in a complex network named the Hi-C network (HiCNet). The optimized converting factor α enables us to directly generate optimized pairwise Euclidean distances without generating a 3D structure. The optimized distances are not only consistent with the definitions of intra- and inter-TADs, but also fit FISH data and ChIA-PET-confirmed interactions well. We further used the optimized distances and another round of optimization to reconstruct the chromosomal 3D structures of mouse ES cells at a high resolution of 40 kb and found that, compared to other existing methods, our inferred 3D structures better fit a FISH data set.
In this equation, pij is the Pearson’s correlation coefficient between the ith and jth rows of the normalized Hi-C matrix, which are the Hi-C profiles of the ith and jth beads with all other beads, respectively. Therefore, a high value of pij indicates that the ith and jth beads are spatially close because the two beads have similar Hi-C contact patterns with all other beads. In Eq. (5), p0 is a threshold and is set to 0.95 in our research. In this way, the second term of Eq. (4) has the following effect: if any two beads in a triple have a high correlation (e.g., > 0.95), the “clustering strength” values αi, αj, and αk should be highly similar or the same. These triples impose important global constraints on the inferred “clustering strength” because the three beads in a triple may not be adjacent but may be irregularly spread over the entire chromosome. Multiple triples of this kind can improve the accuracy of the inferred distances because they add the consideration of correlations on normalized Hi-C contacts, which have been found helpful for removing noise from raw Hi-C contact matrices.
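The row-correlation step can be illustrated with a small, dependency-free Python sketch. It computes the Pearson correlation between the contact profiles of two beads and lists the bead pairs whose correlation exceeds the threshold p0 = 0.95. It does not reproduce Eq. (4) or Eq. (5); the function names, the toy matrix, and the handling of constant rows are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption for this sketch: a constant row is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(matrix, p0=0.95):
    """Return bead pairs (i, j), i < j, whose row correlation exceeds p0."""
    n = len(matrix)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(matrix[i], matrix[j]) > p0:
                pairs.append((i, j))
    return pairs


# Hypothetical normalized contact matrix for four beads.
toy_matrix = [
    [10.0, 8.0, 1.0, 0.5],
    [9.0, 10.0, 1.2, 0.4],
    [1.0, 1.1, 10.0, 7.0],
    [0.6, 0.5, 8.0, 10.0],
]
pairs = highly_correlated_pairs(toy_matrix)
```

In a triple-based constraint such as the one described above, pairs returned by a function like this would be the ones whose α values are pulled toward each other.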
The λ values (i.e., λ1, λ2, and λ3) in Eq. (4) are weight parameters tuned based on fluorescence in situ hybridization (FISH) data (six pairs, three from chromosome 2 and the other three from chromosome 11).
The second constraint is the triangle inequality, where δij is the inferred distance between beads i and j. This constraint covers a large number of triangles, each consisting of three beads (Additional file 1: Figure S1c), and keeps the inferred distances δ among the three beads from violating the triangle inequality. These triangles follow a regular pattern (i and j are adjacent, and k cannot be i or j) and are more densely distributed along the chromosome, which is different from the triples in Eq. (5). Both constrain the inferred distances, but from different perspectives.
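A check of this constraint on a candidate distance matrix might look like the Python sketch below. It enumerates the regular triangles described above (i and j adjacent, k any other bead) and reports those whose longest side exceeds the sum of the other two. The function name, the tolerance, and the toy matrices are hypothetical, and the sketch only verifies the inequality; it does not show how the constraint enters the optimization.

```python
def triangle_violations(dist, tol=1e-9):
    """List triangles (i, i + 1, k) whose distances break the triangle inequality.

    dist is a symmetric matrix of pairwise distances. A triangle is reported
    when its longest side is greater than the sum of the two shorter sides.
    """
    n = len(dist)
    violations = []
    for i in range(n - 1):
        j = i + 1  # i and j are adjacent beads
        for k in range(n):
            if k == i or k == j:
                continue
            sides = sorted([dist[i][j], dist[i][k], dist[j][k]])
            if sides[2] > sides[0] + sides[1] + tol:
                violations.append((i, j, k))
    return violations


# Three beads on a straight line: distances are consistent.
consistent = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

# Same beads, but the 0-2 distance is too long to be realizable.
inconsistent = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]

ok = triangle_violations(consistent)
bad = triangle_violations(inconsistent)
```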
Notice that by solving the above optimization problem, we obtain the inferred distances δij, which are the optimized Euclidean distances between every pair of beads. For many studies, these optimized distances are all that is needed, for example when calculating the correlation between Euclidean distances and Xist localization intensities. For many studies, the final purpose of reconstructing a 3D structure is to analyze it quantitatively, and the pairwise Euclidean distances are among the most frequently used structural features of a 3D structure.
We also assigned the inferred distances back to the Hi-C complex network as the weight of edges. In this way, the weighted Hi-C complex networks can directly provide optimized Euclidean distance for all bead pairs with no need to reconstruct the 3D structure.
Relationships between inferred distances and Hi-C contacts
The normalized Hi-C data were downloaded from http://chromosome.sdsc.edu/mouse/hi-c/download.html. Our method was performed on 20 chromosomes of mouse embryonic stem (ES) cells at the resolution of 40 kb. The distribution of optimal α parameters for the twenty chromosomes can be found in Additional file 1: Figure S2.
Second, we also found that αij is positively correlated with Hi-C contact cij (see Additional file 1: Figure S5) when we only considered Hi-C contacts not equal to zero and genomic distance between two beads (i.e., |i - j|) larger than 0.1 times total number of beads on a chromosome, which was following the same practice as in .
Small-world properties of Hi-C complex networks
We constructed the Hi-C complex network for each chromosome, e.g., the Hi-C network for chromosome 10 had 3164 vertices and 9492 edges; and the Hi-C network for X-chromosome had 3651 vertices and 10,953 edges.
A small-world network is defined as having the following properties: (1) a small average shortest path length L; (2) a large clustering coefficient; (3) an average path length L that is proportional to the logarithm of the number of nodes in the network. The 20 networks we created for mESC meet all three properties: (1) the average path lengths of the 20 chromosome networks are within [1.5, 2.0] (Fig. 4b); (2) the average clustering coefficients of the 20 chromosome networks are mostly within [0.4, 0.6] (Fig. 4c); (3) the average path length grows proportionally with the logarithm of the number of vertices in each network (Fig. 4b). Two chromosomes are particularly interesting: chromosome 19, which has the smallest average path length and the largest average clustering coefficient, and the X-chromosome, which has the largest average path length and the smallest average clustering coefficient. Future research can further study their network topologies.
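The two network statistics used in this check, the average clustering coefficient and the average shortest path length, can be computed for an unweighted, undirected graph with the short Python sketch below. It uses only the standard library. The adjacency-dictionary representation and the toy graph are assumptions for illustration; the sketch does not show how HiCNet builds its networks from Hi-C data, and it assumes the graph is connected.

```python
from collections import deque


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        nb = list(nbrs)
        # Count edges that exist among the neighbours of this node.
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if nb[b] in adj[nb[a]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean hop count over all ordered node pairs, via breadth-first search."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical graph: a triangle (0, 1, 2) with one extra vertex 3 attached to 2.
toy_graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}
clustering = average_clustering(toy_graph)
path_length = average_shortest_path_length(toy_graph)
```

For a network with thousands of vertices, the all-pairs breadth-first search is quadratic in the worst case, so a graph library would be the more practical choice.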
Evaluation of the inferred distances by FISH, RAP, and ChIA-PET
First, we compared our inferred distances with FISH data from mouse embryonic stem (ES) cells (six pairs, three from chromosome 2 and the other three from chromosome 11). Because the parameters in the target function (Eq. 4) were tuned based on these FISH data, it was not surprising that our inferred distances achieved a higher correlation with the FISH data (r = 0.81) than αfixed (r = 0.73). Both were better than randomly selected α values (r = 0.59).
Second, we used the localization intensities of the long non-coding RNA Xist to evaluate our inferred distances. Engreitz et al. found that Xist transcripts are bound more intensively to DNA sites in spatial proximity to the Xist locus and less intensively to DNA sites far away from the Xist locus (significant correlations were found). We used RAP data to see whether our inferred distances matched this finding. Our method achieved a stronger correlation with the RAP data (r = −0.64, n = 906) than αfixed (r = −0.59), and both were better than random α values (r = −0.36).
Chromosomal 3D structure inference using Hi-C complex networks
We further tested whether our inferred 3D structures fitted Hi-C contact patterns. We generated a Hi-C contact heatmap of X-chromosome at the resolution of 500 kb, which was normalized by KR method (Fig. 7e). Plotting the heatmap for the whole chromosome at 40 kb resolution is hard to achieve. However, we did plot 40 kb resolution heatmaps for a segment of chromosome 10 (see Fig. 6). We then parsed the Euclidean distances from the reconstructed 40 kb resolution 3D structure and averaged them into 500 kb resolution. In this way, we were able to draw the distance heatmap at 500 kb resolution (Fig. 7f). We performed the same procedure and plotted the heatmaps of distances parsed from the 40 kb resolution 3D structures generated by PASTIS (Additional file 1: Figure S7) and ChromSDE (Additional file 1: Figure S8). From Fig. 7e and f, we observed that our inferred 3D structure better matched the general patterns in Hi-C contact heatmap.
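Averaging a fine-resolution distance matrix into coarser bins, as described above for 40 kb to 500 kb, can be sketched as follows. Because 500 is not a multiple of 40, the sketch assigns each 40 kb bead to the 500 kb bin that contains its start coordinate; this assignment rule, the function name, and the inclusion of the zero diagonal in the averages are assumptions, since the text does not specify them.

```python
def coarsen_distance_matrix(dist, fine_kb=40, coarse_kb=500):
    """Average a fine-resolution distance matrix into coarser bins.

    Assumption for this sketch: fine bead i starts at i * fine_kb and belongs
    to the coarse bin containing that start coordinate. Every fine entry,
    including the zero diagonal, contributes to the average of its coarse cell.
    """
    n = len(dist)
    bin_of = [(i * fine_kb) // coarse_kb for i in range(n)]
    m = bin_of[-1] + 1
    sums = [[0.0] * m for _ in range(m)]
    counts = [[0] * m for _ in range(m)]
    for i in range(n):
        for j in range(n):
            sums[bin_of[i]][bin_of[j]] += dist[i][j]
            counts[bin_of[i]][bin_of[j]] += 1
    return [
        [sums[a][b] / counts[a][b] for b in range(m)]
        for a in range(m)
    ]


# Hypothetical example: four beads on a line, merged two at a time.
fine = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
coarse = coarsen_distance_matrix(fine, fine_kb=1, coarse_kb=2)
```

With the default 40 kb and 500 kb settings, this rule places either 12 or 13 beads in each coarse bin.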
Many methods can reconstruct chromosomal 3D structures. However, the goal of reconstructing chromosomal 3D structures is not only to visualize them, but also to analyze them quantitatively. In many cases, the Euclidean distances between all bead pairs are the only information needed for the quantitative analysis of a 3D structure. In this type of analysis, our optimized distances can be used directly, with no need to reconstruct a 3D structure (and then parse the distances out of it).
Moreover, after we assign the optimized distances back to the Hi-C complex networks as edge weights, the topology of these networks integrates optimized Euclidean distances in 3D space. This provides a new perspective for modeling and studying chromosomal 3D structures. For example, it would be interesting to cluster vertices based on network topology (with weights considered) and then compare the clusters in the networks with the known genomic locations of topologically associating domains. The current definition of TADs is mostly based on 2D Hi-C enrichment. In contrast, the network-clustering approach would be based on 3D distances, even though there is no need to construct the 3D structure.
Furthermore, since our inferred distances are already optimized, reconstructing a 3D structure from these distances becomes faster and less complicated. In addition, the two rounds of optimization and the inclusion of FISH data in the first optimization (some of the parameters in Eq. 4 are tuned by FISH data) make the reconstructed 3D structure more accurate and a better fit to the FISH observations (which are not the same FISH data used to tune the parameters in Eq. 4).
We note that very few chromosomal 3D structure reconstruction methods have been evaluated using ChIA-PET. Therefore, we used two more measures to evaluate our inferred wish distances against those converted using α = 1/3. First, when we considered only bead pairs with a number of Hi-C contacts in the range [12, 12.5], our inferred wish distances between beads within the same TAD were markedly smaller than those between beads from different TADs, which better matches the property of TADs. Second, our inferred wish distances had a higher correlation with Xist transcript localization than the distances inferred using α = 1/3. To evaluate the 3D structures we inferred, we used another FISH data set; the results show that our inferred 3D structures are more consistent with this new FISH data set than those generated by two other 3D-reconstruction methods, PASTIS and ChromSDE, with different α values.
We developed a novel method to infer the wish distances between DNA bead-pairs from Hi-C contacts. Our inferred distances better fitted the definitions of TADs, FISH data, and the localization patterns of Xist transcripts compared to the distances generated by using a fixed parameter. High-resolution 3D structures of chromosomes were built based on the newly-inferred wish distances. The whole process has been implemented as a tool named HiCNet.
Publication of this article was sponsored by the National Institutes of Health grant R15GM120650 to ZW and start-up funding from the University of Miami to ZW.
Availability of data and materials
HiCNet has been implemented in C++. It is publicly available at http://dna.cs.miami.edu/HiCNet/.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 17, 2018: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-17.
TL designed and implemented the system and benchmarked the results. TL and ZW wrote the manuscript. ZW advised the research. All of the authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
- 2. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C. Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 2006;16(10):1299–309.
- 4. Zhao Z, Tavoosidana G, Sjölinder M, Göndör A, Mariano P, Wang S, Kanduri C, Lezcano M, Sandhu KS, Singh U. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet. 2006;38(11):1341–7.
- 14. Liu T, Wang Z. scHiCNorm: a software package to eliminate systematic biases in single-cell Hi-C data. Bioinformatics. 2018;34(6):1046–7.
- 22. Zhang Z, Li G, Toh K-C, Sung W-K. 3D chromosome modeling with semi-definite programming and Hi-C data. J Comput Biol. 2013;20(11):831–46.
- 27. Ay F, Bunnik EM, Varoquaux N, Bol SM, Prudhomme J, Vert J-P, Noble WS, Le Roch KG. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res. 2014;24(6):974–88.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.