Graph-based clustering of Hi-C data recovers better clusters than non-graph-based clustering
To assess the utility of graph-based clustering for Hi-C data over non-graph-based clustering, we compared three algorithms: (1) hierarchical clustering, (2) k-means, and (3) spectral clustering. Hierarchical clustering and k-means have been used widely to analyze functional genomics data sets such as gene expression [26] and chromatin marks [27]. The spectral clustering algorithm is a graph-based clustering method that clusters the eigenvectors of the Laplacian operator on a graph [25]. For all three clustering methods, we considered different distance metrics: (1) Euclidean distance, (2) Pearson’s correlation, (3) Spearman’s correlation, (4) contact counts, and (5) log2 of contact counts. In total, we had 15 clustering approaches that differed by clustering algorithm and distance metric.
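The graph-based step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that negative Spearman correlations are clipped to zero to form edge weights, that the symmetric normalized Laplacian is used, that embedded rows are scaled to unit length, and that the eigenvectors are grouped with a plain k-means. The function names and the toy contact matrix are hypothetical.

```python
import numpy as np

def rank_rows(x):
    """Rank the values within each row (0-based; ties broken arbitrarily)."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def spearman_affinity(counts):
    """Spearman correlation between contact profiles (rows), clipped at zero."""
    rho = np.corrcoef(rank_rows(counts))
    return np.clip(rho, 0.0, None)  # assumption: negative weights are dropped

def spectral_embedding(affinity, k):
    """First k eigenvectors of the symmetric normalized graph Laplacian."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    _, vectors = np.linalg.eigh(laplacian)  # eigenvalues come back ascending
    return vectors[:, :k]

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_clusters(counts, k):
    """Cluster regions from a symmetric contact-count matrix."""
    embedding = spectral_embedding(spearman_affinity(counts), k)
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    return kmeans(embedding, k)

# Toy contact matrix: two groups of four regions that contact each other densely.
rng = np.random.default_rng(0)
block = np.kron(np.eye(2), np.ones((4, 4)))
counts = 2.0 + 18.0 * block + rng.random((8, 8))
counts = (counts + counts.T) / 2.0
labels = spectral_clusters(counts, k=2)
```

On the toy matrix the two dense groups of regions end up in separate clusters. For real Hi-C matrices a library implementation with a more careful initialization would usually be preferable.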
We applied each clustering method to Hi-C data from the human H1 embryonic stem cell (hESC) line [3], binned into 2755 1-Mbp bins. Each method was applied to obtain k=10 clusters (Methods). We evaluated the quality of clusters from each clustering method using five different statistical measures: (1) the Davies–Bouldin index (DBI), (2) the silhouette index (SI), (3) the difference in contact counts between regions in the same cluster and between regions from different clusters (delta contact count), (4) the number of clusters enriched for a regulatory signal (e.g. transcription factor occupancy or histone modification), and (5) analysis of variance (ANOVA) of a regulatory signal. DBI measures within-cluster scatter relative to between-cluster separation and is non-negative; the lower the value, the better the clustering. SI assesses the boundaries of clustering and ranges between −1 and 1; the lower the value, the worse the clustering. To count enriched clusters, the Kolmogorov–Smirnov (KS) test was used to assess whether a particular feature was significantly higher in a cluster than in the genomic background. ANOVA was used to examine how well the clusters explain the variation in a particular regulatory signal. DBI, SI, and the delta contact count served as internal validation metrics of clustering that need only the data being clustered, while the number of enriched clusters and ANOVA served as measures of external validation.
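Two of the internal validation measures can be written down compactly. The sketch below uses the standard Davies–Bouldin definition (average over clusters of the worst ratio of summed scatter to centroid distance) and assumes that the delta contact count is the mean count over same-cluster pairs minus the mean count over different-cluster pairs; the exact definition used in the study is given in its Methods, and the function names and toy data here are hypothetical.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Mean over clusters of the largest (scatter_i + scatter_j) / centroid distance."""
    ids = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

def delta_contact_count(counts, labels):
    """Mean count of same-cluster pairs minus mean count of different-cluster pairs."""
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # ignore self-contacts
    return float(counts[same & off_diagonal].mean() - counts[~same].mean())

# Toy data: four regions forming two tight, well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
counts = np.array([[0, 10, 1, 1],
                   [10, 0, 1, 1],
                   [1, 1, 0, 10],
                   [1, 1, 10, 0]], dtype=float)
good = np.array([0, 0, 1, 1])  # matches the structure
bad = np.array([0, 1, 0, 1])   # cuts across it

dbi_good, dbi_bad = davies_bouldin(points, good), davies_bouldin(points, bad)
delta_good, delta_bad = delta_contact_count(counts, good), delta_contact_count(counts, bad)
```

The labeling that matches the structure gives the lower DBI and the larger delta contact count, which is the direction of "better" for each measure.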
A comparison of different clustering approaches showed considerable variation among the different methods (Fig. 1). For example, using DBI and SI, hierarchical clustering with 1-Pearson’s correlation as a distance measure was among the best performing methods (Fig. 1a, b), but it was among the worst when using the number of enriched clusters or ANOVA (Fig. 1d, e). To compare the different clustering approaches across all these measures, we therefore ranked each method on a scale of 1 to 15 (appropriately adjusting ties) on each of the evaluation metrics, and computed the average rank for each method. Based on the average rank, the top five methods were spectral clustering on contact count (1), hierarchical clustering with 1-Spearman’s correlation as the distance measure (2), spectral clustering with Spearman’s correlation (3), spectral clustering with Euclidean distance (4), and k-means using Euclidean distance (5). Thus, three of the top five ranking methods were spectral clustering variants. We next inspected the patterns of enrichment in the clusters from each method. We found that clusters obtained from spectral clustering with Spearman’s correlation (Fig. 2) were most distinct in their patterns of enrichment compared to the other variants of spectral clustering (Additional file 1: Figure S1) and hierarchical clustering (Additional file 1: Figure S2, Additional file 2). In particular, spectral clustering with Spearman’s correlation found three clusters that were significantly enriched with open chromatin signatures (described in detail in the next section, Fig. 2d). In contrast, clusters from hierarchical clustering were unbalanced and all the activating marks were concentrated in one cluster. Thus, the clusters obtained from spectral clustering on Spearman’s correlation are likely more biologically meaningful based on external validation measures and are comparable to hierarchical clustering approaches for internal validation metrics. Based on these observations, we selected spectral clustering (Spearman’s correlation) for our subsequent analysis. We note that our clustering framework is flexible and can use other definitions of graph weights as well.
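The rank aggregation used to compare methods across metrics can be illustrated as follows. This is a sketch under one stated assumption: tied methods share the mean of the ranks they span, which is a common convention but may not be the exact tie adjustment used in the study. The method names and scores are made up for illustration.

```python
def rank_with_ties(scores, higher_is_better):
    """Rank methods from 1 (best); tied methods share the mean of their ranks."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0  # mean of 1-based positions i+1 .. j+1
        for name in order[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def average_rank(metrics):
    """metrics maps a metric name to (scores per method, higher_is_better)."""
    totals = {}
    for scores, higher_is_better in metrics.values():
        for name, rank in rank_with_ties(scores, higher_is_better).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(metrics) for name, total in totals.items()}

# Hypothetical scores for three methods on two metrics.
metrics = {
    "DBI": ({"A": 0.4, "B": 0.6, "C": 0.6}, False),  # lower is better
    "SI": ({"A": 0.1, "B": 0.5, "C": 0.3}, True),    # higher is better
}
avg = average_rank(metrics)
```

Each metric carries its own direction flag so that a low DBI and a high SI both translate into a good (small) rank before averaging.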
Spectral clustering can incorporate both cis and trans interactions and identifies two major types of clusters
Inspection of the chromosomal coverage of our clusters in hESC showed that most (six of ten) clusters cover multiple chromosomes, revealing cis and trans interactions (Fig. 2a, Methods). Spectral clustering of only trans interactions finds a similar number of multi-chromosomal clusters, suggesting that our clustering is robust and that intra-chromosomal interactions do not overshadow the inter-chromosomal interactions (Additional file 1: Figure S3, Additional file 2).
To interpret our clusters functionally and relate them to downstream gene expression programs, we tested our clusters for statistical enrichment of multiple genome-wide regulatory signals including chromatin marks (H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K79me2, H4K20me1, H3K9ac, H3K27ac, H3K27me3, and H3K9me3), LADs, early versus late replication timing (RepTime), general transcription factors (POLII, TAF, TBP, CTCF, P300, and CMYC), cohesin components (RAD21 and SMC3), open chromatin from DNase I hypersensitivity assays, number of genes, and various classes of repeat elements [short interspersed nuclear elements (SINEs), long interspersed elements (LINEs), and long terminal repeats (LTRs)].
We found that clusters C0, C1, and C2 were significantly enriched with gene-rich regions, open chromatin (DNase I), SINE, and activating and repressive marks, with the exception of H3K9me3, which varied between the clusters (KS test P<0.05, Fig. 2d, e). Cluster C0 was also moderately enriched for LADs while C1 and C2 were depleted in LADs. The remaining seven clusters were associated with LADs, LINEs, and LTRs, and were depleted for genes and chromatin marks. The clusters comprising entirely regions from one chromosome were associated with LADs and either LINEs (C7 and C8) or LTRs (C6). We also observed that SINE and LINE enrichments are mutually exclusive: SINEs tend to be associated with clusters with high genomic activity (i.e. enriched for different chromatin marks and gene-rich regions), while LINE and LTR elements are associated with LAD clusters. Our observation that the clusters associated with gene-rich regions are depleted in LADs and clusters associated with gene-poor regions are enriched for LADs is in agreement with previous studies that showed LADs are relatively gene poor [28]. Because the clusters appeared to be discriminated based on activity, we asked if DNase I footprints can explain the association of all other marks. We observed significant conditional mutual information between each signal and the clustering assignments given DNase I, which suggests there is information to be gained by the clustering that is not captured in the DNase I signal (Additional file 1: Methods, Additional file 1: Figure S4). Furthermore, the observed values of the different evaluation metrics (SI, DBI, and delta contact counts) are significantly higher than random, suggesting that we are not over-clustering (Additional file 1: Methods and Additional file 1: Figure S5).
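The conditional mutual information check can be made concrete with a plug-in estimate on discrete variables. The sketch below assumes that each regulatory signal and the DNase I signal have already been discretized into levels (the binning and the significance test used in the study are described in its supplementary Methods and are not reproduced here). It computes I(X; Y | Z) = Σ p(x, y, z) log2[ p(z) p(x, y, z) / (p(x, z) p(y, z)) ] from counts.

```python
from collections import Counter
from math import log2

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in bits for equal-length discrete sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        # p(xyz) * p(z) / (p(xz) * p(yz)) simplifies to a ratio of counts.
        total += (c / n) * log2(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return total

# Hypothetical discretized data for four regions.
dnase = [0, 0, 1, 1]
cluster = [0, 1, 0, 1]

# A signal that tracks the clusters independently of DNase I: one extra bit.
cmi_informative = conditional_mutual_information([0, 1, 0, 1], cluster, dnase)

# A signal that is fully determined by DNase I: nothing left to explain.
cmi_redundant = conditional_mutual_information([0, 0, 1, 1], cluster, dnase)
```

A value near zero means DNase I already accounts for the association between a signal and the clusters; a clearly positive value means the clustering carries additional information.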
In parallel, we clustered the genomic regions using k-means on their one-dimensional signal profiles (Fig. 2f) and compared these clusters to the spectral clusters based on a hypergeometric test. Several of the Hi-C clusters were mutually enriched in these k-means clusters (Fig. 2g), suggesting that these two partitions of the data are consistent with each other. For example, the spectral clusters C1 and C2 (with high genomic activity) had significant overlap with the k-means clusters C0 and C4. However, the Hi-C clusters do not have a one-to-one mapping with the one-dimensional signal k-means clusters (e.g. the C3 k-means cluster had significant overlap with the C4, C6, and C7 spectral clusters), suggesting that the Hi-C clusters capture additional information that is specific to the 3D organization of the genome. We repeated this analysis for Hi-C data in a mouse ESC (mESC) line from Dixon et al. [3] (Additional file 1: Figure S6 and Additional file 2), and observed similar patterns, suggesting that our clusters are capturing generalizable properties of chromosomal organization.
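The hypergeometric comparison of two partitions reduces to a tail probability on the size of the overlap between a pair of clusters. A minimal sketch using only the standard library is shown below; the function names and the toy region sets are hypothetical, and any multiple-testing correction applied in the study is not shown.

```python
from math import comb

def hypergeom_sf(overlap, population, size_a, size_b):
    """P(X >= overlap) when size_b regions are drawn at random from a population
    that contains size_a regions belonging to cluster A."""
    upper = min(size_a, size_b)
    total = comb(population, size_b)
    hits = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return hits / total

def overlap_pvalue(cluster_a, cluster_b, population):
    """Enrichment P value for the overlap of two sets of region IDs."""
    return hypergeom_sf(len(cluster_a & cluster_b), population,
                        len(cluster_a), len(cluster_b))

# Toy example: 20 regions, one spectral cluster and one k-means cluster.
spectral_c1 = set(range(0, 8))
kmeans_c0 = set(range(4, 10))
p_value = overlap_pvalue(spectral_c1, kmeans_c0, population=20)
```

Computing the test for every pair of clusters gives a matrix of P values; taking the negative logarithm of each entry yields a score that grows with the strength of the overlap.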
To test the sensitivity of our conclusions to fixed-sized bins, we also considered regions defined by TADs. Briefly, we aggregated the counts in TADs defined in Dixon et al. [3] and clustered the resulting matrix (Additional file 1: Methods). We observed similar patterns of enrichment in these clusters and found that 43 % of the total bases were co-clustered when using a fixed bin size and clusters of TADs (Additional file 1: Figure S7, Additional file 2). We also repeated our analysis of the hESC data for multiple resolutions, 100 and 500 kbp. There was a significant overlap of base pair coverage between clusters at different resolutions (64 % for 100 and 500 kbp, 53 % for 100 kbp and 1 Mbp, and 71 % for 500 kbp and 1 Mbp), which is significantly greater than random (Additional file 1: Table S1). Furthermore, we could find a one-to-one mapping for the majority of the clusters, and the mapped clusters also exhibited similar patterns of enrichment as the 1-Mbp regions (Additional file 1: Figure S8).
Hi-C data clusters from spectral clustering recapitulate known and novel higher-order organizational units
To examine the relationship between our spectral clusters and major chromosomal architectural units such as compartments on individual chromosomes [8], we applied k=2 clustering to our data. A compartment is defined by a subset of regions on a chromosome that densely interact with each other, but are depleted for interactions with other regions on the chromosome. We obtained the cluster assignment for all regions in a chromosome and compared these cluster assignments to the compartments (Additional file 1: Figure S9 and Methods). The majority of the chromosomes (except for chromosomes 16, 19, 20, 21, and 22) were partitioned into two clusters by our approach, indicating the presence of compartment-like structures in our clustering results. Pairs of regions that were clustered together by spectral clustering tended to be in the same compartment as assessed by two independent measures of co-clustering. In the majority of the chromosomes (18 out of 23), these measures are significantly higher than what is expected by chance (F score: 60–80 %, t test P<3.49×10⁻⁵, and Rand index: 50–80 %, t test P<1.45×10⁻⁵, Additional file 1: Figure S9), suggesting that spectral clustering with k=2 can also recover aspects of compartments. Chromosomes 16, 19, 20, 21, and 22 are not detectable as separate clusters with k=2, likely because they tend to co-localize in the nucleus [8]. Applying the spectral clustering method at higher resolution (e.g. 40 kbp instead of 1 Mbp) can recover TAD-like structures (Additional file 1: Figure S10a, b, c, d, and Additional file 1: Methods). In addition, applying the clustering method to each chromosome separately can also recover clusters with significant overlap with the compartments (Additional file 1: Figure S10e, f, g).
These results further suggest that graph-based clustering can be a general and powerful approach for recovering different organizational units of the genome, spanning both cis (within one chromosome) and trans (between chromosomes) interactions.
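The two co-clustering measures used for the compartment comparison can both be computed from pairs of regions. The sketch below assumes the usual pair-counting definitions: the Rand index is the fraction of region pairs on which the two partitions agree, and the F score is the harmonic mean of pairwise precision and recall. The study's Methods give the definitions actually used; the toy labels are hypothetical.

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count region pairs by whether they are co-clustered in A, in B, both, or neither."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)

def pairwise_f_score(predicted, reference):
    """Harmonic mean of pairwise precision and recall against a reference partition."""
    both, only_pred, only_ref, _ = pair_counts(predicted, reference)
    if both == 0:
        return 0.0
    precision = both / (both + only_pred)
    recall = both / (both + only_ref)
    return 2 * precision * recall / (precision + recall)

# Toy chromosome with six regions: k=2 cluster labels versus compartment labels.
clusters = [0, 0, 0, 1, 1, 1]
compartments = [0, 0, 1, 1, 1, 1]
ri = rand_index(clusters, compartments)
f1 = pairwise_f_score(clusters, compartments)
```

Both measures are invariant to how cluster IDs are numbered, which matters here because cluster 0 need not correspond to a particular compartment.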
Arboretum-Hi-C: A multi-task spectral clustering algorithm for comparative analysis of Hi-C data
Having determined that spectral clustering is a powerful approach for analyzing Hi-C data from one cell line, we next developed a new approach, Arboretum-Hi-C, to systematically compare the 3D organization across multiple cell types and species. Arboretum-Hi-C combines two clustering strategies: spectral clustering and multi-task clustering (Fig. 3). Multi-task clustering is a special case of multi-task learning [29], where the goal is to solve multiple learning tasks simultaneously. Arboretum-Hi-C takes as input n different Hi-C data sets (n=3 in Fig. 3), representing possibly different cell lines or species, a tree describing the hierarchical relationship between the data sets, the number of clusters k, and a mapping of regions between the different data sets. The Hi-C data sets represent observed data as the leaves of the tree (Fig. 3). As output, Arboretum-Hi-C returns the cluster assignments of regions in each Hi-C data set. Arboretum-Hi-C is based on a previous multi-task clustering approach, Arboretum [30], which uses a generative probabilistic model to cluster expression data from multiple species while accounting for the hierarchical relationships among the species as described by a phylogenetic tree (Methods). However, instead of expression matrices at each leaf node, we now have Hi-C interaction graphs. Edges in these graphs are weighted, with edge weights corresponding to Spearman’s correlation since this gave the best results among different distance metrics. Our general approach is nevertheless applicable to different definitions of edge weight (e.g. contact count between a pair of regions). To cluster these graphs, we apply Gaussian mixture model-based clustering to the first k eigenvectors of each graph’s Laplacian (Additional file 1: Methods).
Major modules of chromosome contact interactions are shared between human and mouse cell lines
We applied Arboretum-Hi-C to two human and two mouse cell lines that were studied in Dixon et al. [3]. Two of these cell lines represent the undifferentiated ESC state in both organisms (hESC and mESC, respectively), and the other two cell lines represent examples of a terminally differentiated cell state (IMR90 human fibroblasts and mouse cortex, referred to as hIMR90 and mCortex, respectively). We first examined 1318 1-Mbp human and mouse regions that constitute one-to-one orthologous regions (Methods). Results at a higher resolution (500 kbp) are described subsequently.
We considered two possible hierarchical relationships of these four data sets (Additional file 1: Figure S11 and Additional file 1: Methods) and used the probabilistic framework of Arboretum-Hi-C to select between these two trees. In one tree, the cell lines from the same species were closer to each other, and in the other, the embryonic cell lines from the two species were closer to each other and the differentiated cell lines were closer to each other. We observed that the first tree, in which the Hi-C data within a species were closer to each other, had a greater data likelihood (Additional file 1: Figure S11). Therefore, we performed our subsequent analysis with this tree topology.
Application of Arboretum-Hi-C to these four data sets identified ten clusters of interacting regions, several of which exhibited conserved patterns of interactions (Fig. 4). The multi-task clustering framework of Arboretum-Hi-C provides a correspondence between clusters of one cell line/species to the clusters of another cell line/species. That is, cluster Ci from hESC would correspond to cluster Ci of mESC (and all other data sets examined), where i ranges from 0 to k−1. This correspondence or mapping of clusters between different data sets (as further described and validated below) enables a systematic comparison of patterns of interactions and the regions that participate in these interactions. We visually examined the patterns of these clusters based on the eigenvectors (Fig. 4a) as well as Spearman’s correlation matrices for regions in each cluster (Fig. 4b). Several clusters exhibited conserved patterns of eigenvectors and interactions across all four data sets (C1 and C2), while some clusters were more similar between cell lines of the same species [C6 (human) and C5], and some clusters captured similarity in the ESC state between species (C3 and C4).
To examine the extent of conservation at the region level, we examined these clusters in two ways. First, we extracted the core conserved set of regions by obtaining those regions that were in the same cluster in all species and cell lines (Fig. 4c). We observed a striking pattern of conservation of interactions in this conserved set of regions. For some clusters, this represented a high fraction of their elements (>30 % for clusters C3, C4, and C7), or a moderate fraction (10–30 % for clusters C0, C1, C2, and C5), while for some clusters this represented a small fraction (<10 % for clusters C6, C8, and C9). Cluster C3 was the most conserved, with 44 % of its regions in the conserved core set. Second, we compared the clusters, one pair of cell lines or species at a time, using the significance of overlap between the orthologous regions of a cluster from one species (or cell line) and those of a cluster from another species (or cell line). We quantified the overlap in orthologous regions using the negative log of the hypergeometric test P value as described in Roy et al. [30], and visualized them using red-blue heat maps (Fig. 5a), for every pair of species or cell lines. The off-diagonal elements of the heat map denote the shared chromosomal organization between clusters of different IDs, and the diagonal elements measure the extent of conservation between clusters of the same ID (Fig. 5a, red-blue heat maps). We found that between hESC and mESC (same cell type but different species), there was a larger number of strong red diagonal elements compared to hESC and mCortex.
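Extracting the conserved core is a set operation over the per-data-set cluster assignments. The sketch below assumes that regions carry a shared orthologous ID across data sets and that the conserved fraction of a cluster is measured against that cluster's size in one chosen reference data set; both the data layout and the choice of denominator are assumptions made for illustration.

```python
from collections import Counter

def conserved_core(assignments):
    """Group regions by cluster ID if they have that same ID in every data set.

    assignments maps a data set name to {region ID: cluster ID}.
    """
    datasets = list(assignments.values())
    shared = set(datasets[0]).intersection(*datasets[1:])
    core = {}
    for region in shared:
        ids = {dataset[region] for dataset in datasets}
        if len(ids) == 1:
            core.setdefault(ids.pop(), set()).add(region)
    return core

def core_fraction(assignments, reference):
    """Fraction of each cluster in the reference data set that lies in the core."""
    core = conserved_core(assignments)
    sizes = Counter(assignments[reference].values())
    return {cluster: len(core.get(cluster, ())) / size
            for cluster, size in sizes.items()}

# Toy assignments for four orthologous regions in two data sets.
assignments = {
    "hESC": {"r1": 0, "r2": 0, "r3": 1, "r4": 1},
    "mESC": {"r1": 0, "r2": 1, "r3": 1, "r4": 1},
}
core = conserved_core(assignments)
fractions = core_fraction(assignments, reference="hESC")
```

Because the multi-task framework gives matched cluster IDs across data sets, comparing IDs directly is meaningful; with independently clustered data sets the IDs would first have to be aligned.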
To compare the extent of conservation between the clusters identified by Arboretum-Hi-C to clusters identified by applying spectral clustering to the data sets independently, we calculated the difference in the diagonal elements and off-diagonal elements for every pair of Hi-C data sets over multiple random initializations of the algorithm. We found that using Arboretum-Hi-C there is greater conservation between clusters (Fig. 5b, box plot of Arboretum-Hi-C clusters) of the matched cell lines (hESC vs mESC) than between different cell lines (hESC vs mCortex). In contrast, independent clustering of the Hi-C data using non-multi-task spectral clustering did not discriminate between the cell lines and estimated a similar extent of conservation for both matched and different cell lines (Fig. 5b). Overall, the patterns of conservation and divergence from the non-multi-task clustering may not be as biologically meaningful as those from Arboretum-Hi-C.
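The diagonal versus off-diagonal comparison can be summarized by a single number per pair of data sets. The sketch below assumes the summary is the mean of the diagonal overlap scores minus the mean of the off-diagonal scores; the text does not specify the exact aggregation, and the score matrices shown are invented for illustration.

```python
def diagonal_contrast(scores):
    """Mean diagonal entry minus mean off-diagonal entry of a k x k score matrix."""
    k = len(scores)
    diagonal = [scores[i][i] for i in range(k)]
    off_diagonal = [scores[i][j] for i in range(k) for j in range(k) if i != j]
    return sum(diagonal) / len(diagonal) - sum(off_diagonal) / len(off_diagonal)

# Hypothetical -log P overlap scores between clusters of two data sets.
matched_pair = [[8.0, 1.0, 0.0],
                [1.0, 6.0, 1.0],
                [0.0, 1.0, 7.0]]
different_pair = [[3.0, 2.0, 2.0],
                  [2.0, 3.0, 2.0],
                  [2.0, 2.0, 3.0]]

contrast_matched = diagonal_contrast(matched_pair)
contrast_different = diagonal_contrast(different_pair)
```

A larger contrast means that clusters with the same ID share more orthologous regions than clusters with different IDs. Repeating the calculation over random initializations gives the distribution summarized in a box plot.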
To assess the extent to which conserved chromosomal modules exhibit similar regulatory signals and validate the mapping of clusters between data sets identified by Arboretum-Hi-C, we examined these clusters for enrichment of regulatory signals (Fig. 6). Arboretum-Hi-C mESC and hESC clusters of the same ID exhibited similar patterns of enrichment. In particular, clusters C0, C1, and C2 in both hESC and mESC were associated with gene-rich, open-chromatin, chromatin-mark-modified, LAD-depleted regions (Fig. 6a, b). Similarly, clusters C3, C4, C7, C8, and C9 were gene poor and associated with LADs and repeat elements. SINEs tend to be associated with gene-rich, active-chromatin, mark-modified regions, while LINEs and LTRs are associated with LADs and gene-poor regions. Overall, we found that Arboretum-Hi-C clusters in both species could be grouped into clusters with high (C0, C1, and C2) and low genomic activity (C3, C4, C7, C8, and C9). While some clusters exhibited additional signal enrichment (e.g. mESC C9, H3K9me3, and DNase I), clusters with the same ID exhibited similar patterns of enrichment, despite not being completely orthologous, thus validating the correspondence of chromosomal cluster IDs of Arboretum-Hi-C.
To assess the effect of bin size in the definition of orthology mapping of the regions and the subsequent Arboretum-Hi-C analysis, we repeated our experiments at a higher resolution of 500 kbp, clustering a total of 2342 regions. As observed in the 1-Mbp case, we found conserved modules between human and mouse cell lines that could be matched based on their enrichment patterns (Additional file 1: Figure S12a, b and Additional file 2). Furthermore, we observed significant overlap between clusters obtained at 500-kbp resolution and 1-Mbp resolution (Additional file 1: Figure S12c), suggesting that changes in the bin size at this resolution (1 Mbp to 500 kbp) do not significantly affect the resulting clusters. To test whether intra-chromosomal interactions create a bias by overshadowing the inter-chromosomal interactions, we repeated our analysis after removing any interactions that are between regions of the same chromosome in human or mouse (Additional file 1: Methods). As in independent clustering, we observed significant overlap between clusters derived from inter-chromosomal interactions and clusters using both inter- and intra-chromosomal interactions (Additional file 1: Figure S13).
Changes in chromosome contact modules between human and mouse cell lines
We next examined module divergence between species and module dissimilarity between cell lines by inspecting the off-diagonal elements of the red-blue heat maps in Fig. 5a. This analysis relied on our characterization of clusters into high and low activity described in the previous section. We found that most of the module transitions were between clusters of the same type, that is, from low activity to low activity, or from high activity to high activity (Table 1). However, there were a few examples of transitions between modules with high and low genomic activity that we discuss below.
Table 1 Number of significant divergence events between pairs of clusters of different types in each pair of cell lines
Changes in high and low activity modules between species are associated with LADs and chromatin activity
We found three examples of transitions between species-specific modules of different activities. One transition was between hESC C9 and mESC C0 involving 29 regions spanning a total of 29 Mbp. The C9 cluster is associated with LADs, whereas C0 is associated with open chromatin and histone modification marks (Fig. 6a, b, c). Comparison of regulatory features of these 29 regions (hES_9 mES_0) against regions that maintained their cluster assignment in C9 (hES_9 mES_9) and C0 (hES_0 mES_0) in hESC showed that these switched regions were less LAD-rich than the regions in cluster C9 (KS test P<8.02×10⁻², Fig. 6e). In mESC, where these regions were assigned to C0, a cluster with high activity, they tended to have a lower propensity for LADs than other regions associated with C9 (KS test P<8.66×10⁻⁵), and were more like other elements of C0. These regions exhibit a similar tendency for the number of genes and DNase I elements (Fig. 6e(ii), (iii)). A second transition, also between a high- and low-activity module, was from hESC C8 (low activity) to mESC C0 (high activity, Fig. 6a, b, c) and included 21 regions. In both hESC and mESC, these regions have significantly lower LAD content than the regions with conserved assignments to cluster C8 (KS test P<4×10⁻³, Additional file 1: Figure S14a). In addition, in hESCs, these regions have a significantly higher LAD content than regions that are in cluster C0 in both species (KS test P<1×10⁻⁴). Similarly, DNase I and gene count in human regions that switch are intermediate between the conserved members of C8 and C0 in both human and mouse (Additional file 1: Figure S14b, c). The third transition was in a different direction, involving regions in a high-activity module in human (C1) and a module C5 in mouse, which was not significantly enriched for any signals in mouse, but is likely a low-activity cluster based on the enrichment profile of the orthologous human C5. Although the human regions that transitioned to module C5 in mouse did not exhibit a significantly different distribution in LADs, they exhibited a significantly depleted pattern of enrichment for DNase I (KS test P<7.89×10⁻³) and gene count (KS test P<2.18×10⁻² when comparing diverged regions to C1 in mouse, Additional file 1: Figure S14d, e). Overall, these results suggest that the regions that switch their chromatin interaction preference between species are associated with different one-dimensional signals than the regions that maintain their interaction preference between species.
Changes in module assignment between cell lines are associated with CTCF and RAD21 binding sites
In addition to transitions in modules between species, we found several examples of transitions between clusters with high and low activity among cell lines of the same species (five within human and four within mouse, Table 1). One example of such transitions is between cluster C9 (low activity) of hIMR90 and cluster C1 (high activity) of hESC. Figure 6f shows the pattern of correlation of contact counts for the regions in clusters C1 and C9 and regions that change their cluster assignment. To relate these transitions to the binding profiles of general transcription factors, we examined the distribution of binding of transcription factors measured in both cell lines, namely CEBPB, CTCF, MAFK, POLR2A, and RAD21 (Fig. 6g and Additional file 1: Figure S15). Among these transcription factors, CTCF and RAD21 appeared to discriminate hESC regions that remained in C1 and those that were in C9 in hIMR90 (KS test P<6.04×10⁻⁴ and P<1.12×10⁻⁴, respectively). Similarly, in hIMR90, these regions were more enriched than the regions that were in cluster C9 in both cell lines (KS test P<6.20×10⁻² for CTCF and P<3.05×10⁻² for RAD21). This differential enrichment suggests that CTCF and RAD21, which are known to be major players in chromosomal architecture and organization [31], likely contribute to cell-type-specific behavior between a differentiated and undifferentiated cellular state.