Background

3D chromatin structure is characterized by Topologically Associated Domains (TADs) and chromatin loops, which create physical interactions between genes and distant regulatory sequences [1]. CTCF and the protein complex cohesin are localized to the boundaries of TADs [2,3,4], where they serve as barriers to the spread of chromatin. Genetic variation in these sequences has the potential to influence the binding of these factors and contribute to variability in chromatin structure in humans. However, little is known about patterns of topological variation in the population and the underlying genetic mechanisms.

Structural Variants (SVs) are a major source of genetic variability, and SVs have significant functional impact on the genome through the deletion or rearrangement of coding and regulatory sequences. Notably, large SVs that disrupt or re-establish chromatin contacts are associated with rare monogenic disorders, including human limb malformations [5,6,7] and female-to-male sex reversal [5]. Multiple recent studies have begun to examine the potential of SVs to influence chromatin conformation by theoretical modeling of ChIA-PET [8] or Hi-C [9] data from a single cell line (GM12878). However, these studies have not directly investigated how genetic variation between individuals contributes to variation in large-scale chromatin structure.

In this study, we investigated the effect of common SV polymorphism on 3D chromatin structure in a sample of individuals from the 1000 Genomes Project [10]. Specifically, we sought to test the hypothesis that deletions of the boundary regions between adjacent TADs could result in large-scale alterations in chromatin conformation. We performed Chromatin Conformation Capture (Hi-C) sequencing of lymphoblastoid cell lines (LCLs) of 19 individuals from the 1000 Genomes Project, and we tested the effects of common SVs on the numbers of nearby chromatin contacts.

Results

We hypothesize that SVs could influence TAD structure indirectly by disrupting regulatory sequences that control formation of TADs in adjacent genomic regions. In addition, we anticipate that SVs will have direct effects on the coverage and spacing of paired-end reads similar to the effects that are ordinarily observed for SVs in whole genome sequence data [11]. We sought to distinguish these two types of effects by separately quantifying the direct effects on chromatin interactions that span a deletion breakpoint and indirect effects on chromatin interactions adjacent to a deletion. We illustrate this with an example in Fig. 1; a large deletion of ~ 80 kb that disrupts the complement factor H-related genes CFHR3 and CFHR1. This deletion has been associated with decreased risk of age-related macular degeneration (AMD), an increased risk of atypical hemolytic uremic syndrome (aHUS), and systemic lupus erythematosus (SLE) [12,13,14,15]. A map of chromatin contacts for the deleted region and two adjacent TADs (spanning 1.24 Mb) is illustrated in Fig. 1 at a 40 kb resolution. The average number of contacts is shown for subjects who were homozygous for the deletion (Fig. 1 a) and for subjects who were homozygous for the reference allele (Fig. 1 b). As expected, the deletion results in loss of contacts in bins that overlap with the deleted region, and as adjacent regions are brought closer together, we observe an increase in contacts that span the deletion.

Fig. 1

Deletion of CFHR3 and CFHR1 is associated with variation in chromatin conformation. Maps of chromatin interaction surrounding an 80 kb deletion of the CFHR3 and CFHR1 genes (hg19 position chr1:196,728,877–196,808,865) are depicted by averaging the counts within the contact matrices of subjects homozygous for the deletion haplotype (N = 3, Panel a) and subjects homozygous for the reference haplotype (N = 12, Panel b). Normalized counts were plotted as a heatmap with red tone representing the number of chromatin interactions in 40 kb bins. To better visualize the effects for this example, the correlation of counts with the deletion haplotype was tested for all bins across a 1.24 Mb region by linear regression, and regression coefficients were displayed as a blue-red heatmap (Panel c)

The regional effects of the CFHR3/1 deletion on TAD structure were examined in more detail by correlating counts with genotype for all elements of the contact matrix using linear regression controlling for ancestry and sex. The resulting correlation matrix is visualized as a heatmap of the regression coefficients (Fig. 1 c, see Methods). The correlation matrix reveals a pattern consistent with an increase in interactions between the proximal TAD (involving the CFH gene) and the distal TAD (involving a broad region between the genes CFHR2 and CRB1). A portion of the CFHR3/1 deletion overlaps with multiple annotated segmental duplications (SDs), which could potentially confound the mapping of Hi-C read pairs. A similar analysis was conducted after masking segmental duplications, and the observed effects were unchanged. Therefore, the effects we observe are not explained by the segmental duplications or by contacts between paralogous sequences. Furthermore, a map of SDs across the region (Fig. 1 c) shows that the positive effects that span the deletion primarily involve contacts between heterologous sequences.

To more rigorously determine the association of deletions with chromatin conformation, we used a linear regression model to test for the effects of deletions on chromatin contacts. We again use the CFHR3/1 example to illustrate (Fig. 2). Counts were averaged for elements that span the deletion and for flanking regions within 240 kb (Fig. 2 a), a region chosen as the optimal distance by a parameter sweep (see Methods). The effects of deletions on chromatin conformation were then tested for “span” and “flank” separately by linear regression controlling for ancestry principal components (PCs) and sex. Other potential confounders, including surrogate variables, were evaluated to account for unknown sources of noise (see Methods); however, including these additional covariates did not reduce the overall inflation of the test statistic (Additional file 1: Fig. S1). The effect of the CFHR3/1 deletion on spanning contacts was statistically significant (Fig. 2 b, p-value: 0.002), but the deletion did not have a significant effect on the number of contacts in the flanking regions that overlap with the adjacent TADs (Fig. 2 c).

Fig. 2

Testing the effect of common deletions on chromatin conformation. The Hi-C map of chromatin interactions for the 80 kb CFHR3/1 deletion was separated into regions that interact across the deletion (span) and regions that do not cross the deletion (flank) as they can exhibit different behavior with the removal of the deletion bins (Panel a). The effect of the deletion on chromatin conformation was investigated by linear regression, showing a significant effect in the span region (p-value: 0.002, Panel b) and no effect in the flank region (Panel c). The same analysis was run for all common deletions and p-values stratified by size at a 10 kb threshold were displayed in a QQ plot. Large deletions have the strongest effect in the span region while the contribution from small deletions is non-existent (Panel d). Large deletions show a smaller effect in the flank region (Panel e)

We next sought to extend the analysis of Hi-C data to all common deletions in the phase 3 release of the 1000 Genomes Project [10]. Analysis was restricted to all deletions that were present in ≥ 3/19 samples (N = 2180 deletions). The deletions ranged in size from 51 bp to 125 kb, with an average size of 2622 bp. The magnitude of the genetic effects was assessed based on genomic inflation of the test statistic (λ). A Quantile-Quantile (QQ) plot of observed regression p-values relative to an empirical null distribution based on permutation of genotypes shows very modest effects for deletions overall, λ = 1.10 and 1.04 for span (Fig. 2 d) and flank (Fig. 2 e) respectively, but the effects were stronger for large (> 10 kb) deletions (λ = 3.30 and 1.20 for span and flank respectively). The magnitude of the effect of large deletions on the spanning contacts was greater than for small deletions (Kolmogorov-Smirnov test, p-value: 7.63 × 10⁻⁶), but was not significantly different for the flank region (p-value: 0.132). Summary statistics for all deletions that were tested are included in Additional file 2: Table S1. Given that the effects of common deletions on chromatin conformation are driven by large deletions, our subsequent analyses focused on this subset of SVs.

TAD boundaries correlate with insulator and barrier elements that control chromatin conformation and gene regulation [2]. We therefore hypothesized that deletions could have more dramatic effects on chromatin conformation when they occur in TAD boundaries. Common large deletions (N = 80 deletions) were separated into deletions at TAD boundaries (TB, N = 16 deletions) and those not at a TAD boundary (NonTB, N = 64 deletions). The distribution of regression coefficients for common large deletions in TB/NonTB categories was compared against an empirical null distribution based on permutation of genotypes. These results show a statistically significant positive effect for the span region of NonTB deletions (Wilcoxon rank-sum test, p-value: 0.002) (Fig. 3 a). A visualization of the change in chromatin structure is illustrated by averaging each element of the contact matrix within 240 kb of a deletion across loci in TB/NonTB categories separately (Fig. 3b, c). For NonTB deletions we observe an increase in the number of deletion spanning contacts (Fig. 3a) that is concentrated within a narrow region around the deletion (Fig. 3b). This pattern is consistent with the “direct” effects of deletion on the number of breakpoint-spanning read pairs. We do not see a significant effect of NonTB deletions on the number of contacts within the adjacent flanking regions. For TB deletions, we did not detect significant effects on the number of spanning or flanking contacts (Fig. 3a). These results suggest that TB deletions have effects that are relatively subtle or that are quite variable between loci, but studies of larger samples would be needed to determine if effects differ consistently between TB and NonTB deletions. Analysis was repeated after masking segmental duplications and results were unchanged (Additional file 3: Fig. S2).

Fig. 3

Large deletions that do not intersect a TAD boundary have a significant positive effect on the number of contacts that span the deletion region. To determine if the strength or direction of effects differed for deletions located at the boundaries of TADs, regression coefficients from our genome wide analysis were compared between groups of deletions located at TAD boundaries (TB) and those not at TAD boundaries (NonTB) (Panel a). A Wilcoxon rank-sum test was performed for each group against a null distribution, resulting in a significant positive effect for the span region of NonTB deletions (p-value: 0.002). To visualize the topological changes of these effects, a blue-red heatmap of regression coefficients was constructed for NonTB and TB deletions separately. A linear regression was performed for each pairwise bin interaction and coefficients were averaged across deletions. Deletions not present at TAD boundaries have positive values in the span region (Panel b). Deletions that intersect TAD boundaries do not have a unique trend in the span or flank region (Panel c)

A recent paper has described a method to predict the potential of deletions to cause the fusion of two adjacent TADs [9], a potential mechanism described in [16]. This study reported that deletions at TAD boundaries are under negative selection and deletions with a high “fusion score” were skewed toward a low frequency. Using the deletion-spanning contacts for 80 large common deletions as a measure of TAD fusion, we examined whether there was a correlation between the fusion score of the deletion and the coefficient from the regression. We found no correlation of the predicted fusion scores with the observed effects of these deletions on spanning contacts (Additional file 4: Fig. S3).

Our results suggest that large SVs have detectable effects on chromatin conformation. Since the above analysis focused on deletions, it did not assess the largest common SVs known to exist in the population, which include large inversions of 8p23.1 (3.87 Mb) and 7q11.1 (2.45 Mb). To characterize the effects of large inversions on chromatin conformation, inversion genotypes were obtained from single-cell strand sequencing (Strand-seq) of a subset of 9 subjects in the 1000 genomes project [17], and the correlation of chromatin contacts across the region was visualized (Fig. 4 a). The most dramatic effects of the inversion involve contacts that span the inversion breakpoints, denoted by the black triangle, and these effects span distances > 2 Mb from the breakpoint.

Fig. 4

Long-range effects of a large 8p23 inversion on chromatin conformation. A correlation heatmap shows chromatin interactions that are gained (red) and lost (blue) on the inversion haplotype relative to the reference (Panel a). The gray region corresponds to missing values that could not be normalized. The inversion region is depicted by the black triangle. Hi-C matrices for samples that were homozygous for the absence of an inversion and homozygous for the inversion at 8p23.1 were averaged separately and annotated (Panel b). The TAD structure, along with the associated genes, is preserved in a mirrored fashion. Chromatin interactions for the inversion were mirrored to aid visual comparison with the reference.

The availability of a full assembly of the 8p23.1 inversion haplotype [18] enabled us to map TAD structure of the inversion haplotype by directly mapping Hi-C data of subjects that were homozygous for the 8p23.1 inversion to the inversion haplotype. The average number of contacts is shown for subjects with homozygous genotypes for the inversion (Fig. 4 b, bottom) and the reference haplotype (Fig. 4 b, top). TAD structures of the reference and inversion haplotypes were similar, and the same 5 TADs were defined. Patterns of long-range contacts for the inversion of 7q11.1 were similar (Additional file 5: Fig. S4).

We hypothesize that the genetic variants that influence chromatin conformation could thereby influence gene regulation [19]. However, the effects detectable in our current dataset are restricted to large SVs, relatively few of which represent lead variants for expression quantitative trait loci (eQTLs). Of the 2180 common deletions from our analysis and 5128 SV-eQTLs that were previously identified in another study [20], 75 common deletions tested in this study correspond to SV-eQTLs, and these were larger on average with an average length of 5.98 kb compared to the rest of the 2105 deletions which had an average length of 2.5 kb. A Wilcoxon rank sum test was performed between these two groups to determine if there was a significant difference between the regression p-value distribution of the deletions with SV-eQTLs and the regression p-value distribution of deletions without SV-eQTLs in the span region. However, SVs that were driving eQTLs did not have stronger effects on chromatin contacts (p-value: 0.45). Summary statistics for all deletions are annotated with SV-eQTLs in Additional file 2: Table S1.

Discussion

Hi-C has enabled discoveries related to understanding the structural and functional basis of the genome. We show that large common deletions have significant effects on patterns of chromatin conformation with effects that are sufficiently large to be detectable in our small sample of 19 subjects.

Large common deletions have a distinctive signature characterized by positive effects on contacts that span the deletion. The most dramatic example was a common deletion polymorphism at CFHR3/1, which results in the gain of contacts that span a broad region between two adjacent TADs. An increase in the number of contacts between two distinct TADs is an effect reminiscent of “TAD fusion” [21] (Fig. 1). However, for most large common deletions, the effects on the number of deletion-spanning contacts were more subtle and were concentrated within a narrow region around the deletion (Fig. 3 b).

The effect of common SVs on 3D chromatin conformation has potential significance for gene regulation. However, in our current sample size, we are only able to capture effects from the largest and most common SVs, few of which are associated with expression QTLs.

Our results are consistent with common SVs having signatures in Hi-C data that are distinguishable but subtle. We reason that common SVs might tend to have relatively small effects on TAD structure as compared to rare pathogenic variants that have been described previously [5,6,7]. Deletions that remove TAD boundaries and cause TAD fusion may be under negative selection in the population and would therefore tend to be rare. Well-powered characterization of the effects of SVs on chromatin structure and gene regulation would therefore require Hi-C characterization of common variants in larger samples combined with targeted Hi-C and RNA sequencing of patient samples with specific rare disease associated variants.

Large common inversions have distinct effects on chromatin interactions that span the inversion breakpoints, and these effects can extend for distances > 2 Mb. TAD structures within the large inverted segments of two common inversions appear to be well preserved, suggesting that the sequences within the inverted regions are sufficient to determine their 3D structures.

Conclusions

Our analysis has shown that large common SVs can influence local 3D chromatin structure, and the strength and direction of the observed effect vary by locus. Deletions and inversions have distinct signatures. Deletions increase the amount of chromatin interaction between adjacent regions, while inversions rearrange the contacts that span their breakpoints.

Methods

Generation of Hi-C data for 19 subjects

Hi-C data was generated for 19 subjects from the 1000 Genomes Project (Additional file 2: Table S1) using a “dilution” HindIII protocol as previously described [1]. Data collection is described in detail within a companion manuscript [22]. Hi-C allows for unbiased identification of chromatin interactions by using the following process: cells are cross-linked with formaldehyde; DNA is digested using the HindIII restriction enzyme, which leaves a five-prime overhang; the five-prime overhang is filled in with nucleotides, including a biotinylated residue; the resulting fragments are ligated under dilute conditions; and DNA is sheared and fragments containing biotin are identified by paired-end sequencing [1]. Read ends were aligned to hg19 with BWA-MEM v0.7.8 [23] and, in the case of split alignments, the five-prime-most alignment was used as the primary alignment. Reads without a five-prime end alignment and alignments with low mapping quality were filtered out. WASP was used to generate alternative reads, which were realigned using BWA-MEM [24, 25]. Reads that did not have all alternative reads aligned to the same location were removed. Reads were then re-paired, and valid read pairs were defined as pairs in which both reads passed this filtering.

Contact matrices were generated and normalized by dividing read pairs into 40 kb bin pairs and normalizing raw counts using HiCNorm [26, 27]. To compare matrices across samples, we needed to remove unwanted variation between matrix elements due to date of processing as well as remove any other batch effects. This was corrected for by using Bandwise Normalization and Batch effect Correction (BNBC, preprint on bioRxiv https://www.biorxiv.org/content/10.1101/214361v1). This method involves performing quantile normalization on a matrix that contains all contacts between loci at a fixed genomic distance.
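The band-wise step described above can be illustrated with a minimal sketch. It is not the BNBC implementation: it only shows quantile normalization of one matrix band (all contacts at a fixed genomic distance) across samples, it omits the batch effect correction, and the function names are hypothetical.

```python
import numpy as np

def quantile_normalize(columns):
    """Quantile-normalize the columns of a 2D array (one column per sample)."""
    order = np.argsort(columns, axis=0)
    ranks = np.argsort(order, axis=0)                   # rank of each entry within its sample
    reference = np.sort(columns, axis=0).mean(axis=1)   # mean of each quantile across samples
    return reference[ranks]

def normalize_band(matrices, k):
    """Quantile-normalize band k (k >= 0) across same-shaped symmetric contact matrices."""
    band = np.column_stack([np.diagonal(m, offset=k) for m in matrices])
    normalized = quantile_normalize(band)
    idx = np.arange(normalized.shape[0])
    out = [np.array(m, dtype=float) for m in matrices]  # copies, inputs are untouched
    for s, m in enumerate(out):
        m[idx, idx + k] = normalized[:, s]
        m[idx + k, idx] = normalized[:, s]              # keep the matrix symmetric
    return out
```

After this step, band k has the same distribution of values in every sample; applying it to each band in turn gives a distance-stratified normalization.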

Defining TAD boundaries

TADs were defined as follows. The Directionality Index (DI) was computed for each 40 kb bin and used in a Hidden Markov Model to predict the probability of each bin having upstream bias, no bias, or downstream bias [2]. TAD boundaries were called as regions switching from upstream bias to downstream bias.
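As an illustration of the first step, the sketch below computes a Directionality Index per bin in the form introduced in reference [2]: with A the upstream contacts, B the downstream contacts, and E = (A + B)/2, DI = sign(B − A) × ((A − E)²/E + (B − E)²/E). The window of 50 bins (2 Mb at 40 kb resolution) is an assumption, since the window is not stated here, and the Hidden Markov Model step is not reproduced.

```python
import numpy as np

def directionality_index(contacts, window=50):
    """Directionality Index per bin from a symmetric contact matrix.

    window is the number of bins examined on each side of a bin
    (assumed value; not given in the text).
    """
    n = contacts.shape[0]
    di = np.zeros(n)
    for i in range(n):
        a = contacts[i, max(0, i - window):i].sum()    # upstream contacts
        b = contacts[i, i + 1:i + 1 + window].sum()    # downstream contacts
        e = (a + b) / 2.0                              # expectation under no bias
        if e == 0 or a == b:
            continue                                   # no contacts or no bias: DI stays 0
        di[i] = np.sign(b - a) * ((a - e) ** 2 / e + (b - e) ** 2 / e)
    return di
```

Negative values indicate upstream bias and positive values indicate downstream bias, so a boundary candidate sits where the sign flips from negative to positive.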

Extracting structural variant regions from the Hi-C contact matrix

Genotypes for 68,818 SVs were obtained on the same subjects from the phase 3 SV calls from the 1000 genomes project [10]. The phase 3 SV call set includes 42,279 deletions, 6,025 duplications and 20,514 inversion/insertion/complex SVs, of which 5,517 deletions, 101 duplications, and 227 inversion/insertion/complex SVs were present at least once in our sample of 19 subjects. Given that deletions vastly outnumber all other classes of variants, we focused our primary analysis on these. Only deletion alleles that were present in ≥3/19 subjects (N = 2180 deletions, Additional file 2: Table S1) were included in our analysis. Deletions were then mapped to 40 kb bins within the chromosome Hi-C contact matrices. The bins of the contact matrix that “span” or “flank” each deletion were then defined as illustrated in Fig. 2. To determine the flanking distance that optimally captures the effect of deletions on flanking regions, multiple bin sizes were tested by a parameter sweep. Effects weakened as the distance increased from the deletion and 6 flank bins displayed the largest effect.
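The mapping of a deletion to bins and the extraction of span and flank regions might look like the following sketch. The exact bin sets are defined by Fig. 2, so the definitions used here are assumptions: span is taken as all pairs of one left-flank bin and one right-flank bin, flank as all pairs of distinct bins on the same side, and coordinates are treated as 0-based and half-open. The function name is hypothetical.

```python
import numpy as np

BIN_SIZE = 40_000  # 40 kb resolution, as in the text

def span_and_flank_means(contacts, del_start, del_end, n_flank=6, bin_size=BIN_SIZE):
    """Mean contacts in the span and flank regions around one deletion."""
    first = del_start // bin_size            # first bin overlapping the deletion
    last = (del_end - 1) // bin_size         # last bin overlapping the deletion
    left = np.arange(max(0, first - n_flank), first)
    right = np.arange(last + 1, min(contacts.shape[0], last + 1 + n_flank))
    # Span: contacts that cross the deletion (left-flank bin with right-flank bin).
    span = contacts[np.ix_(left, right)]
    # Flank: contacts between distinct bins on the same side of the deletion.
    flank_values = []
    for side in (left, right):
        block = contacts[np.ix_(side, side)]
        flank_values.append(block[np.triu_indices(len(side), k=1)])
    flank = np.concatenate(flank_values)
    return span.mean(), flank.mean()
```

Bins overlapping the deletion are excluded from both regions, which corresponds to masking them. A deletion close to a chromosome end yields fewer flank bins, and a caller would need to handle the case where one side is empty.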

Quantifying effects of common deletions on TAD structure

Quantitative effects of deletions on chromatin conformation were tested by Ordinary Least Squares Regression (OLSR) using Python. First, bins that overlapped with SVs were masked and specific deletion-flanking and deletion-spanning target regions were defined within 240 kb (six 40 kb bins) on either side of the deletion (Fig. 2 a). For each sample, contacts were averaged across the flanking and spanning target regions respectively. Regression was performed for each deletion on the span and flank regions separately, controlling for sex and for ancestry PCs obtained from SNP genotypes using PLINK1.9 software [28]. The regression was constructed with normalized chromatin interaction counts between regions near the deletion as the dependent variable and the number of deletion alleles as the independent variable (0: Homozygous reference, 1: Heterozygous deletion, 2: Homozygous deletion).
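A minimal sketch of such a regression for one deletion and one region is given below, using NumPy and SciPy rather than any particular package the authors may have used. It assumes the averaged contacts are the response and the deletion allele count (0, 1, 2) is the predictor of interest, with the covariates supplied as additional columns. The function name is hypothetical.

```python
import numpy as np
from scipy import stats

def ols_genotype_effect(contacts, genotype, covariates):
    """OLS of per-sample mean contacts on genotype plus covariates.

    Returns the genotype coefficient and its two-sided p-value.
    """
    y = np.asarray(contacts, dtype=float)
    # Design matrix: intercept, genotype, then covariates (e.g. ancestry PCs, sex).
    X = np.column_stack([np.ones(len(y)), genotype, covariates])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = len(y) - X.shape[1]                      # residual degrees of freedom
    sigma2 = residuals @ residuals / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = beta[1] / np.sqrt(cov_beta[1, 1])
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return beta[1], p_value
```

With 19 samples, an intercept, the genotype, two ancestry PCs, and sex, 14 residual degrees of freedom remain, which is why the number of covariates has to be kept small.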

Selection of covariates used in regression model

The genomic inflation factor (λ) was used to determine how much of the effect could be attributable to confounding variables such as ethnicity or other unobserved noise in the data that could be captured with surrogate variables. Covariate terms were added one at a time and λ was calculated for the span and flank regions after each addition (Additional file 1: Fig. S1A). The possible confounding variables tested include ancestry PCs to control for population stratification, sex, and surrogate variable PCs to control for variation within each chromosome. Given the sample size of 19, the model becomes saturated with more than two variables [29]. Covariates were chosen according to the combination that minimized λ. The lowest inflation was obtained with two ancestry PCs and sex as covariates. The first two ancestry PCs explained 47% of the variance. The model with ancestry PCs and sex was used for the rest of the study, and regression coefficients for all loci were displayed in a boxplot (Fig. 3 a).
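One common way to compute λ from a set of p-values is sketched below: p-values are converted to 1-degree-of-freedom chi-square statistics, and the observed median is divided by the expected median. Whether the authors used exactly this definition is an assumption; the optional `p_null` argument, also an assumption, takes permutation p-values so that the expected median comes from an empirical null rather than the theoretical one.

```python
import numpy as np
from scipy import stats

def genomic_inflation(p_observed, p_null=None):
    """Genomic inflation factor: ratio of observed to expected median chi-square."""
    chi2_obs = stats.chi2.isf(np.asarray(p_observed, dtype=float), df=1)
    if p_null is None:
        expected_median = stats.chi2.ppf(0.5, df=1)   # theoretical null, about 0.455
    else:
        chi2_null = stats.chi2.isf(np.asarray(p_null, dtype=float), df=1)
        expected_median = np.median(chi2_null)        # empirical (permutation) null
    return np.median(chi2_obs) / expected_median
```

A value near 1 indicates that the observed p-values follow the null distribution, while values above 1 indicate an excess of small p-values.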

Visualization of topological effects for CFHR3/1 and across multiple loci

Effects were visualized for select loci as heatmaps of regression coefficients. Each heatmap is constructed by applying the regression model for all bins separately across a target genomic region. To visualize the topological effect for CFHR3/1, the regression coefficients for each bin were then plotted as a heatmap with red indicating positive correlation, blue indicating negative correlation, and bins that overlapped the deletion were indicated in gray (Fig. 1 c).

In addition, to visualize “average” effects across multiple loci, matrices were centered on the left and right deletion boundaries, and the median regression coefficient for each bin across multiple loci was displayed as a heatmap (Fig. 3 b and c).
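The two visualizations can be sketched as follows, under the same assumption as for the single-region regression (contacts as the response, deletion allele count as the predictor). The first function fits the model to every matrix element at once and returns the map of genotype coefficients; the second takes the element-wise median of several such maps that have already been cut to the same size and centered on their deletions. Function names are hypothetical, and plotting is left out.

```python
import numpy as np

def coefficient_map(matrices, genotype, covariates):
    """Genotype regression coefficient for every element of the contact matrix.

    matrices has shape (samples, bins, bins).
    """
    mats = np.asarray(matrices, dtype=float)
    n, b, _ = mats.shape
    X = np.column_stack([np.ones(n), genotype, covariates])
    # One least-squares fit per matrix element, solved jointly.
    beta, _, _, _ = np.linalg.lstsq(X, mats.reshape(n, -1), rcond=None)
    return beta[1].reshape(b, b)

def median_effect_map(coefficient_maps):
    """Element-wise median across equally sized, deletion-centered maps."""
    return np.nanmedian(np.stack(coefficient_maps), axis=0)
```

Using the median rather than the mean keeps a single locus with an extreme coefficient from dominating the combined map.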

Analysis of large inversions

Hi-C chromatin interactions for the bins that overlap the inversion and 62 bins on each side of the inversion were extracted. The Pearson correlation between the number of chromatin interactions and genotype was computed for each bin across the 9 samples that had both Hi-C data and inversion calls available. The Pearson correlation for each bin was displayed as a heatmap (Fig. 4 a).
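A per-bin Pearson correlation of this kind can be computed for all matrix elements at once, as in the sketch below. The function name is hypothetical; elements with no variation across samples come out as NaN.

```python
import numpy as np

def pearson_map(matrices, genotype):
    """Pearson correlation between genotype and contacts for every matrix element.

    matrices has shape (samples, bins, bins); genotype has one value per sample.
    """
    mats = np.asarray(matrices, dtype=float)
    g = np.asarray(genotype, dtype=float)
    mats_centered = mats - mats.mean(axis=0)
    g_centered = g - g.mean()
    covariance = np.tensordot(g_centered, mats_centered, axes=(0, 0))
    denominator = np.sqrt((g_centered ** 2).sum() * (mats_centered ** 2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        return covariance / denominator    # NaN where an element never varies
```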

Annotation of structural variants with summary statistics and eQTLs

All 2180 common deletions were first annotated with summary statistics from the regression analysis by reporting a p-value and regression coefficient describing the effect of the variant on both the flank region and span region. The SVs were then intersected with the TAD boundaries defined above, and an SV was defined as overlapping a TAD boundary if the intersection was at least 1 bp. An empty element in the table represents no overlap with a TAD boundary. All deletions were intersected with SV-eQTLs previously identified in another study [20]. If these SV-eQTLs were also present within the GWAS Catalog [19], then the table was further annotated with gene information such as gene name and gene ID.
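The 1 bp overlap criterion can be written out as a short sketch. It assumes 0-based, half-open (start, end) intervals on the same chromosome; the function name and the tuple representation are hypothetical.

```python
def overlapping_boundaries(deletion, boundaries, min_overlap=1):
    """Indices of TAD boundaries that share at least min_overlap bp with a deletion."""
    d_start, d_end = deletion
    hits = []
    for i, (b_start, b_end) in enumerate(boundaries):
        overlap = min(d_end, b_end) - max(d_start, b_start)   # shared length in bp
        if overlap >= min_overlap:
            hits.append(i)
    return hits
```

An empty list corresponds to an empty element in the table, meaning that the deletion does not overlap any TAD boundary.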