Background

The three-dimensional (3D) architecture of the eukaryotic genome has been shown to be an important factor in regulating transcription [1,2,3]. In the nucleus, DNA is folded into a highly organised structure, allowing transcriptional and regulatory machinery to be in specific nuclear territories for efficient usage. The impact of DNA folding and the resulting physical interactions can have dramatic impacts on the regulation of the genes, enabling non-coding regions such as regulatory elements (e.g. enhancers and silencers) to act on distally located gene promoters with disruption of chromosomal organisation increasingly linked to disease [4,5,6]. However, while highly organised, the folding structure of the 3D genome can also be highly dynamic to allow for the flexibility and modularity to facilitate regulatory action across a wide-range of cell types and biological processes, such as development, immune homeostasis, cancer and diseases.

In recent decades, the development of chromosome conformation capture assays and high-throughput sequencing has facilitated the construction of 3D genomes at high resolution, enabling the identification of cell type and tissue-specific 3D interactions between regions in the genome. However, the analysis of such data is complicated by the massive amount of identified physical interactions, hindering the detection and interpretation of interactions that are biologically meaningful. In this review, we introduce the background of 3D genome structure and its components, followed by a summary of the protocols that are commonly used to study 3D genome architecture in recent years, focusing on Hi-C protocols and other derived methods, whilst the use of microscopy to image 3D genome organisation has also been recently reviewed [7]. We then thoroughly review current in silico methods for identification of potentially functional interactions, which are contacts with higher chance to be biologically functionally relevant, and categorise them into three methodological groups.

Chromosome architecture and gene regulation

Within eukaryotic nuclei, chromosomal DNA is condensed and folded into highly organised 3D structures, with distinct functional domains [8, 9]. A key consequence of chromosome folding is that it can bring DNA regions that are far away from each other on the same linear DNA polymer (i.e. intra-chromosomal), into close proximity, allowing direct physical contact to be established between regions. Interchromosomal interactions may also play an important role in transcriptional regulation but are less studied. The best characterised examples of this type of interaction include the clustering of ribosomal genes to form the nucleolus and the clustering of olfactory receptor genes to ensure the monogenic and mono-allelic expression in an individual olfactory neuron [10].

The most basic level of chromosome organisation is chromatin “Loop” structures (Fig. 1A). Chromatin loops are formed based on a loop extrusion model, where linear DNA is squeezed out through the structural maintenance of chromosomes (SMC) cohesin complex until the complex encounters convergent CTCF bound at loop anchor sequences [8, 11,12,13,14]. Chromatin loops can either bring distal enhancers and gene promoters into close proximity to increase gene expression, or exclude an enhancer away from the loop to initiate boundaries to repress gene expression [15,16,17]. The archetypal chromatin looping factors are the CCCTC-binding protein (CTCF) and Cohesin complex [18,19,20], with the initial transient chromatin loops are created by the Cohesin complex during the extrusion process, or anchored on one CTCF binding site while the other anchor moving dynamically [11, 21, 22]. Moreover, specific transcription factors such as EKLF, GATA-1, FOG-1, NANOG and YY1 [23,24,25,26,27,28] were confirmed to play important roles in the regulation of chromatin looping.

Fig. 1
figure 1

Illustration of genome architecture and the corresponding Hi-C interaction maps. Top panel: interaction heatmaps A, B, C, D are in different scales (kb or Mb per pixel) to correlate with the diagrams of 3D structures in the bottom panel, yellow boxes in A and B are identified TADs and small blue boxes in A indicate chromatin loops. The purple box in A is a frequently interacting region, with its classical “V” shape pattern coloured in purple dotted lines. Heatmaps were generated using Juicebox [29] with published Hi-C data of GM12878 [3]. Bottom panel: diagrams of 3D structures in the genome

Chromatin folding and DNA looping in particular leads to the formation of large-scale chromatin structures such as topologically associated domains (TADs) and chromosome compartments (Fig. 1B) [30]. TADs are defined by chromatin interactions occurring more frequently within the TAD boundaries, with TAD boundaries often demarcating a change in interaction frequency [30]. TAD boundaries are also enriched for the insulator-binding protein CTCF and cohesin complex [19, 20]. CTCF motif orientation appears to play a role in demarking TAD boundaries with some studies indicating that the majority of identified TADs (~ 60–90%) have a CTCF motif at both anchor boundaries with convergent orientation [3, 31, 32]. This is consistent with the loop extrusion model mentioned above, suggesting that the formation of most TADs are form by extrusion and are strictly confined by boundaries established by ‘architectural’ proteins such as CTCF and SMC cohesin complex [33], along with the boundaries engaging with strong 3D interactions [34]. Moreover, experimental inversion of CTCF orientation or complete removal of the CTCF binding sites have been shown to disrupt the formation or shift the boundary of a TAD [14, 16, 32], further emphasising the important role of CTCF defining TAD boundaries. The size of TADs are highly dependent on the resolution of the data and the chosen TAD caller and parameters [35], it can vary from hundreds of kilobases (kb) to 5 megabases (Mb) in mammalian genomes [36, 37], and also show significant conservation in related species [38], suggesting that they may serve as the functional base of genome structure and development. With higher sequencing depth, patterns of interactions across regions within a TAD can be further divided into “sub-TADs” with a median size of 185 kb using one kilobase resolution data [3], enabling finer scale investigation of the genome structure [39, 40]. In addition to “sub-TADs”, many other terms of TADs with different sizes and features have been proposed, including “micro-TADs” [41], “mega-domains” [42] and “super-TADs” [43]. However, functional distinction between the “conventional TADs” and them is still unclear. Evidence has shown that TADs are crucial structural units of long-range gene regulation [44,45,46,47], with interactions such as promoter–enhancer looping mostly found within the same TADs [48], and abnormal interactions across TADs (inter-TADs) can lead to significant regulation of expression level of important genes [49].

At a multi-megabase scale, the genome organisation is spatially segregated into euchromatin (gene-rich regions) or heterochromatin (gene-poor regions) to form active and inactive domains called ‘Compartments’ (Fig. 1C) [2]. This compartmentalisation of chromosome folding depicts the global organisation of chromosomes in the nucleus, where compartment A corresponds to gene-dense, euchromatic regions, and compartment B corresponding to gene-poor heterochromatin. Using higher resolution data, the genome can be further grouped into six sub-compartments, compartment A is separated into A1 and A2 whereas compartment B is separated into B1, B2, B3 and B4, with each one associated with specific histone marks [3]. Sub-compartments A1 and A2 are enriched with active genes and the activating histone marks H3K4me3, H3K36me3, H3K27ac and H3K4me1. Sub-compartments A1 and A2 are also depleted in nuclear lamina and nucleolus-associated domains (NADs). B1 domains correlate with H3K27me3 positively and H3K36me3 negatively, B2 and B3 are enriched in nuclear lamina but B3 is depleted in NADs, and B4 is an 11-Mb region, containing lots of KRAB-ZNF genes [3].

The interaction of transcription factors bound at regulatory elements, such as promoters, enhancers and super-enhancers, mediate the transcription level of a gene via interactions which are the direct result of the 3D chromosome structure, but which appear to be long-distance interactions when viewed through lens of a linear chromosome [50,51,52]. One early and well-characterised example is the interaction between beta-globin locus and its locus control region (LCR) [53]. During the development and differentiation of erythroid in human and mouse, the LCR, which is located 40–60 kb away from beta-globin genes, contains the hypersensitive sites that are exhibiting strong enhancer function and contacting to beta-globin genes distally via chromatin loops to regulation gene expressions [54,55,56]. Hox gene clusters, essential for patterning the vertebrate body axis, are also governed by a rich enhancer interaction network. Using chromatin conformation capture methods, a number of studies found that the transcriptional activation or inactivation of Hox clusters requires a bimodal transition between active and inactive chromatin [30, 57,58,59,60]. Taken together, the 3D genome structure governing long-distance contacts can build complex gene regulatory networks, allowing for either multiple enhancers to interact with a single promoter or a single enhancer to contact multiple promoters [61]. Disruption of these long-range regulatory networks is increasingly being linked to both monogenic and complex diseases [62, 63].

Hi-C assays to quantify chromatin interactions

In order to investigate the 3D genome architecture, a series of protocols called chromosome conformation capture (3C) assays have been developed that specifically capture the physical interactions between regions of DNA [1, 2, 64,65,66]. A suite of 3C-derived high-throughput DNA sequencing assays have been developed, including circular chromosome conformation capture sequencing (4C-seq) [64, 67], chromosome conformation capture carbon copy (5C) [65], chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) [66], enrichment of ligation products (ELP) [68] and higher resolution chromosome conformation capture sequencing (Hi-C) [2], which vary in complexity or the scale of the interactions that are captured. The initial 3C method used PCR to quantify specific ligation products between a target sequence and a small number of defined regions [1]. 4C-seq, known as the “one vs all” method, uses an inverse PCR approach to convert all chimeric molecules associated with a specific region of interest generated in the proximity ligation step into a high-throughput DNA sequencing library [67]. 5C increased the number of regions that could be captured by multiplexing PCR reactions [65], and it is also considered as the first “many vs many” approach and has been used to examine the long-range interactions of between transcription start sites and approximately 1% of the human genome [69]. ChIA-PET implements a similar approach, however uses a specific, bound protein, generally a transcription factor protein, generating a protein-centric interaction profile [31]. ELP implements a double digestion strategy to improve the enrichment of 3C products in the library and is able to generate a detailed genome-wide contact map of the yeast genome [68].

Compared to other approaches, Hi-C, also known as the genome conformation capture method [70], is the first “all vs all” method of genome-wide, 3C-derived assay to capture all interactions in the nucleus, allowing for a more complete snapshot of nuclear conformation at the global level [36]. Hi-C works through cross-linking DNA molecules in close proximity via a formaldehyde treatment, preserving the 3D interaction between two genomic regions. The cross-linked DNA is then usually fragmented using a restriction enzyme, such as the 6-bp recognition enzyme HindIII [30, 71] or 4 bp cutter MboI, DpnII and Sau3AI, and the resultant DNA, ends held in close spatial proximity by the DNA cross-links, are ligated into chimeric DNA fragments. Subsequent steps convert these chimeric DNA fragments into linear fragments to which sequencing adapters are added to create a Hi-C library. The library is then sequenced using high-throughput sequencing technology, specifically limited to Illumina paired-end (as opposed to single-end/fragment) DNA sequencing to enable the accurate identification of the two ends of the hybrid molecule [2]. In the initial development of Hi-C, the identification of Hi-C interactions was impacted by the number of spurious ligation products generated as a result of the ligation step being carried in solution allowing for greater freedom for random inter-complex ligation reactions to occur. The resolution of Hi-C interactions in these earlier approaches was also limited by the cutting frequency of a 6-base restriction enzyme, such as HindIII [2, 30, 72,73,74]. To address these issues, an in situ Hi-C protocol was developed [3], where the ligation steps were performed within the constrained space of the nuclei, reducing the chance of random ligation [75, 76]. Furthermore, in situ Hi-C used a 4-base-cutter (such as MboI) for digestion, increasing the cutting frequency in the genome and improving the resolution of captured interactions [3]. Using this method, the first 3D map of the human genome was constructed using the GM12878 cell line with approximately 4.9 billion interactions [3], enabling interaction resolution at the kilobase level. In recent years, the in situ Hi-C protocol has been developed further to target different technical and/or biological questions (Table 1).

Table 1 Different Hi-C-derived methods. Optimisations indicate their modification in their protocols compared to traditional Hi-C

Owing to the vast complexity of the Hi-C ligation products generated, it is often too costly to sequence samples to a sufficient depth to achieve the resolution necessary to investigate specific interactions such as promoter–enhancer interactions, leading to the development of capture Hi-C (CHi-C) [82]. CHi-C employs a sequence capture approach, using pools of probes complementary to thousands of restriction fragments, to enrich for molecules containing the region of interest from the Hi-C library. This significantly reduces the complexity of the libraries and enables a significant increase in the number of detectable interactions within specific regions without the need for ultra-deep sequencing. Therefore CHi-C, has been used in many cases to analyse specific types of long-range interactions, such as interactions linked to promoter or enhancer regions. For example, CHi-C was recently used to characterise promoter interactions in 17 human primary hematopoietic cells to demonstrate the highly cell type-specific nature of many promoter interactions even with a group of related cell types [51]. Similar to CHi-C, another series of approaches, including Capture-C [83], NG Capture-C [84] and Tiled-C [85], that focus on capturing chromatin interaction of interest have been developed. Compared to the CHi-C protocols, they enrich the 3C library with biotinylated capture oligonucleotides instead of enrich the biotinylated Hi-C library, allowing the library to retain maximal library complexity, which is important for analysing data from small cell numbers [85].

Like many other high-throughput sequencing approaches, Hi-C continues to be modified to improve the efficiency and resolution of the approach. DNase Hi-C was developed to reduce the bias introduced through the use of restriction enzymes (e.g. MboI recognises GATC), due to the uneven distribution of restriction sites throughout the genome [77, 93]. Instead, DNase Hi-C replaces the restriction enzyme digestion of cross-linked DNA with the endonuclease DNase I that has a much reduced DNA sequence specificity to reduce bias in identifying Hi-C interactions. Commercial Hi-C library preparation kit such as Omni-C kit from Dovetail Genomics [94] exploits the use of DNase and is designed specifically to overcome limitations of only capturing Hi-C interactions near restriction sites. Similar to DNase Hi-C, Micro-C uses micrococcal nuclease (MNase) digestion, enabling the generation of high-resolution contact maps at 200 bp to ~ 4 kb scale in budding yeast [78] and sub-kilobase resolution contact maps in mammalian cells [41, 95]. What’s more, BL-Hi-C uses HaeIII, which has higher cutting frequency in the human genome compared to other 4-base cutter like MboI, to conduct digestion and a two-step ligation optimisation to reduce the chance of ligating event of random DNAs, increasing the capture efficiency with active regions in the genome and reducing the probability of random ligation events [79]. In addition to increasing the capture efficiency, optimised protocols are now much more cost effective. For example, DLO Hi-C [80] avoids biotin labelling and pull-down steps, and tagHi-C [81] uses Tn5-transposase tagmentation, similar to ATAC-seq, to capture the chromatin structure with hundreds of cells.

The integration of Hi-C with other genomic applications, such as chromatin immunoprecipitation (ChIP), formaldehyde-assisted isolation of regulatory elements (FAIRE) or bisulfite treatment has also occurred. The ChIP-integrated approaches, including HiChIP and PLAC-seq, combining the in situ Hi-C with ChIP, generating a Hi-C library enriched for interactions associated with specific bound proteins [86, 87], increasing the resolution of the library while reducing the sequencing cost. Combining the phenol–chloroform extraction step from FAIRE-seq [96] with in situ Hi-C, OCEAN-C was developed to prioritise the chromatin interactions on open chromatin [88]. Similarly, integrating with an assay called column purified chromatin (CoP), which is enriched for accessible chromatin regions such as active promoters, enhancers and insulators, HiCoP was recently developed to identify chromatin contacts in regulatory regions [89]. Methyl-HiC has been developed to jointly profile the DNA methylation and 3D genome structure [90]. Recent studies have also revealed that DNA methylation is able to impact 3D genome structure via polycomb complexes, which play an important part in repressing key developmental genes [27, 97,98,99,100].

The optimisations introduced by protocols such as Micro-C largely improve the cross-linked DNA capture specificity, allowing higher resolution data to be generated with less sequencing cost. Based on these optimisations, Hi-C 2.0 and Hi-C 3.0 have been developed as the updated versions of Hi-C protocol in recent years [91, 92]. In Hi-C 3.0, the protocol uses a combination of two restriction enzymes, DdeI and DpnII, and MNase to generate short fragments, which can improve the identification of genome compartmentalisation. Additionally, Hi-C 3.0 also uses DSG as cross-linker in addition to formaldehyde to generate cross-linked DNA, improving the enrichment level of regulatory elements such as promoters and enhancers in the identified chromatin loops [92].

As the development of Hi-C approaches continue, it is essential that computational methods are standardised in order to provide consistent results that are comparable across species or cell types. In the next section, we review the current data processing methods that are used in standard Hi-C sequencing approaches.

Prioritisation of chromatin interactions

Methodologies to extract meaningful, potentially functional information from the massive number of interactions identified through Hi-C data can be categorised into three groups: structural-based methods, detection of significant interactions and data integration (Fig. 2). The first approach is to define structures such as A/B compartments and TADs, based on the 2D interaction patterns across the genome. The second approach is to investigate only a subset of Hi-C interactions that are identified from a statistical test based on a trained model. Finally, taking advantage of the publicly available databases or the generation of epigenomics data in parallel with Hi-C data, the third approach is to prioritise interactions that are more likely to be biologically relevant through the investigation of genomic and epigenomic information. These approaches are not mutually exclusive and in many cases can be combined to address specific questions in genome organisation and gene regulation.

Fig. 2
figure 2

Approaches to prioritise interactions from Hi-C datasets. In this review, we categorised the approaches to identify potentially functional interactions into three ways, including significant interactions identification, structures summarisation and data integration. Referenced tools and sub-categorical analyses are marked on the figure with boxes and stars, respectively

Structural-based identification methods

Methods that identify structural aspects of chromatin interactions (i.e. A/B compartments and TADs) are employed as an avenue to reduce the dimensionality of the 3D interaction patterns across the genome by clustering or summarising regions with similar patterns across the genome. The A/B compartments are commonly predicted with normalised Hi-C matrices generated using vanilla coverage (VC) [2], Knight and Ruiz’s method (KR) [101] or iterative correction and eigenvector decomposition (ICE) [102]. Normalised data are then used to calculate Pearson’s correlation and through principal component analysis (PCA), the eigenvectors of the first (or second) principal component (PC) are usually used to assign bins to A or B compartments. Current analysis toolkits, such as Juicer [103] and FAN-C [104], have optimised correlation matrix functions to identify A/B compartments from Hi-C matrices without significant taxes on memory and computational resources.

As detailed above, TADs are defined as structures with interactions that occur within TADs rather than across TADs [30]. As such, they are often identified by finding domains where contacts are enriched within the same TAD as compared to neighbouring TADs [30, 105]. Currently, there are over 20 commonly used TADs callers that have been developed using various methodologies. For instance, arrowhead [3], armatus [106], directionality index [30], insulation score [107] and TopDom [108] use their own linear scoring system, clusterTAD [109] and ICFinder [110] are based on clustering, TADbit [111], TADtree [112] and HiCseg [113] use statistical models; and MrTADFinder [114] and 3DNetMod [115] rely on network-modelling approaches [37, 116]. Although comparisons reveal low reproducibility among tools, especially in the number and mean size of identified TADs, recent reviews [37, 116] have suggested a preference for TAD callers that allow for the detection of nested TADs or overlapped TADs, such as rGMAP [117], armatus, arrowhead and TADtree.

While theoretically similar to TAD calling, frequently interacting regions (FIREs) are also commonly used to describe structural interaction characteristics. Defined as genomic regions with significant interaction profile, FIREs exhibit strong connectivity with multiple regions in the chromosome neighbourhood [73]. FIREs can be easily visualised on the Hi-C interaction map, with interacting signals appearing from both sides of the FIREs, forming a characteristic “V” shape (Fig. 1A). Unlike TADs and compartments, which exhibit a certain level of conservation across cell types (about 50 ~ 60 and 40%, respectively) [3, 30, 73, 118], FIREs appear to be cell type- and tissue-specific and are often located near key cell phenotype-defining genes. However, similar to TADs, FIREs formation seems to be dependent on the Cohesin complex, as its depletion results in decreasing interactions at FIREs [73]. They are also enriched for super-enhancers, suggesting FIREs play an important role in the dynamic gene regulation network [119, 120]. Similar to FIREs, “V” shape structural feature that is referred to as “line” structure was observed at the edge of the TADs during the exploration or loop extrusion model using simulated Hi-C data [14].

Methods for identification of significant chromatin interactions

In order to prioritise potentially meaningful chromatin interactions, statistical significance is assigned to Hi-C interactions by comparing them to a background model and assessing the probability of observing the experimental set of counts if the background model were the underlying method of generating observed counts. The interaction frequency generally decays with increasing linear distance, and by applying this background model meaningful interactions can be identified through a higher than normal frequency. Here we summarise the current methodologies of significant interactions identification and categorise them into two groups; global background model methods, which define a background signal model by considering the read count of any pair of interactions, and local background model methods, which account for interactions in the neighbouring areas to identify peak interactions with statistical significance (Table 2).

Table 2 Methods for identification of statistically significant interactions for Hi-C data

Global background-based methods

The initial study which assigns statistical significance to Hi-C interactions is done in the yeast genome. The chromatin interactions in the yeast genome was first separated into intra-chromosomal interactions (within the same chromosome) and inter-chromosomal interactions (across two chromosomes), followed by a binomial distribution to assign confidence estimates for inter-chromosomal interactions [121]. A binning method is then used to account for the characteristic pattern of intra-chromosomal interactions, with the observed interacting probability decaying as the genomic distance increases linearly. This is then used to compute interacting probabilities for each bin separately and assigning statistical significance using the same binomial distribution as used for inter-chromosomal interactions [121]. Based on the same binomial distribution concept, Fit-Hi-C uses spline fitting procedure instead of binning, reducing the bias of artifactual stair-step pattern, allowing detection of statistically significant interactions in the mammalian genome [122]. Additionally, Fit-Hi-C also incorporates an extra refinement step using a conservative model with stringent parameters to remove outlier interactions, which can be applied iteratively, to achieve a more accurate empirical null model. However, Fit-Hi-C was initially limited by only allowing bin sizes larger than 5 kb to compute significance due to the heavy memory usage when dealing with higher resolution data. However this has been improved with recent updates [123], and is now able to handle data with high resolution (bin sizes from 1 to 5 kb). Another important new feature is that it is now accepting multiple input formats so that it is compatible with different Hi-C analysis pipelines. Another similar tool is included in the Homer toolkit [124], which accounts for biases such as sequencing depths, linear distance between regions, GC bias and chromatin compaction to establish a background model to estimate the expected interaction count between any two regions, followed by the use of a cumulative binomial distribution to assign significance to interactions. GOTHiC [125] also uses relative coverage of two interacting regions to estimate both known and unknown biases, followed by a cumulative binomial distribution to build the background model to identify significant interactions.

The Negative Binomial distribution is commonly utilised in the analysis of count-based data, including popular RNA-seq analysis tools such as edgeR [136] and DEseq2 [137], and has been implemented in a number of Hi-C programs such as HIPPIE [72, 127]. This method uses a negative binomial model to estimate the statistical significance of the interactions in one fragment region (< 2 Mb) while accounting for restriction fragment length bias and interacting probability distance bias simultaneously. However, negative binomial models can be confounded by many bins with zero counts [128] and a number of programs have developed approaches to account for “zero-inflated” observations. HiC-DC, for example, uses a hurdle negative binomial regression model to identify significant interactions [128], modelling the probability of non-zero counts and the rate of observed counts as separate components of the model.

While physical interactions between loci found in close linear proximity are likely to be more prevalent in Hi-C datasets, a known bias in Hi-C libraries is the correlation between two nearby restriction fragments brought about by ligation events. Ligation events can be the result of bias or random collision events between restriction fragments during library preparation, so with high coverage sequencing, false signals can impact the identification of significant interactions [72]. To tackle this problem, HMRFBayesHiC uses a negative binomial distribution to model observed interactions [72], followed by a hidden Markov random field model to account for the correlation between restriction fragments, and to model interaction probabilities [129]. This implementation required significant resources to run, leading to the development of FastHiC [130], which enables higher accuracy of interaction identification and faster performance. Recently, another tool called MaxHiC also based on negative binomial distribution was developed [131]. Compared to other tools, all parameters of the background model in MaxHiC are established by using the ADAM algorithm [138] to maximise the logarithm of likelihood of the observed Hi-C interactions. Significant interactions identified by MaxHiC were shown to outperform tools such as Fit-Hi-C/FitHiC2 and GOTHiC in identifying significant interactions enriched between known regulatory regions [131].

Compared to traditional Hi-C protocols, Capture Hi-C (CHi-C) requires different analytic methods due to the extra bias driven by the enrichment step in the protocol. Capture libraries can be regarded as a subset of the original Hi-C library, meaning the interaction matrix of CHi-C is asymmetric, and interestingly not accounted for in traditional normalisation methods [82, 132]. Because of this, many analysis approaches are specifically designed for CHi-C data analysis. CHiCAGO (Capture Hi-C Analysis of Genomic Organisation) was developed to account for biases from the CHi-C protocol and identify significant interactions [132], using a negative binomial distribution to model the background local profile and an additional Poisson random variable to model technical artefacts [132]. CHiCAGO uses the implicit normalisation method ICE [102] and multiple testing stages based on p-value weighting [139] to carefully identify significant interactions from each CHi-C dataset [132]. Another CHi-C-specific tool called ChiCMaxima was developed to identify significant interactions by defining them as local maxima after using loess smoothing on bait-specific interactions [133]. Compared to CHiCAGO, ChiCMaxima’s approach is more stringent and exhibits a more robust performance when comparing biological replicates [133]. As well as being applicable to conventional HiC data, MaxHiC is also able to identify significant interactions in CHi-C data [131] and offers robust performance to identify regulatory areas compared to CHi-C-specific tools including CHiCAGO [131].

Like the other capture approaches, HiChIP cannot use traditional (Hi-C-specific) interaction callers (e.g. Fit-Hi-C or GOTHiC) due to the inherent biases associated with an enrichment with specific immunoprecipitation targets [86]. Hichipper was developed to firstly identify ChIP peaks while accounting for the read density bias in restriction fragments, enabling a more accurate identification of interactions from HiChIP dataset [140]. While hichipper does not implement any function to identify significant interactions, FitHiChIP was developed to account for non-uniform coverage bias and distance bias in restriction fragments using a regression model, together with 1D peak information in a spline fitting procedure to accurately identify significant interactions from HiChIP data [126].

Local background-based methods

Chromatin looping structures can be regarded as the basic unit of 3D genomic architecture and play an important role in the regulatory process, by bringing distal promoter and enhancer elements together or excluding enhancers from the looping domain [15,16,17]. Chromatin loops from Hi-C data were first defined by searching for the strongest “pixel” on a normalised Hi-C contact map (Fig. 1A). Different from the global background models used by methods like Fit-Hi-C and MaxHiC, using a local background model to compare all pixels in a neighbouring area is able to detect pixels with the strongest signals as the anchor points of chromatin loops [3]. A searching algorithm named Hi-C Computational Unbiased Peak Search (HICCUPS) was therefore developed to rigorously search for these pixels based on the local enrichment in the pixel neighbourhood, followed by hypothesis testing with Poisson statistics, enabling the identification of chromatin loops from Hi-C data [3]. Somewhat similar to TADs, published information on chromatin loops demonstrates structural conservation between a number of human cell lines (~ 55–75% similarity), and between human and mouse (about 50% similarity), suggesting conserved loops may serve as a basic functional unit for the genome [3]. However, loop detection using HICCUPS requires high-resolution data with extremely high sequencing depth. For example, almost 5 billion unique interactions were required by HICCUPS to identify 10,000 unique loops in the GM12878 cell line [3]. This limitation can potentially be addressed by the current development of deep learning approaches, such as DeepHiC [141] using generative adversarial networks, as well as HiCPlus [142] and HiCNN [143] which use deep convolutional neural networks. Such methods can be used to increase the resolution of Hi-C data to achieve necessary resolution so that chromatin loops can be identified, or to improve loop detection accuracy [141, 142].

Hardware requirements to identify loops in high-resolution data is also extremely restrictive with HICCUPS requiring specific architectures (i.e. NVIDIA GPUs) to identify looping patterns. However this has been addressed recently with the HICCUPS algorithm being reimplemented in the cooltools package (https://github.com/mirnylab/cooltools), allowing HICCUPS to be run on a regular server or compute cluster [95]. Alternatively, an approach called cLoops was implemented which identifies peak interactions from chromatin contact map [134]. cLoops initiates loop detection by finding candidate loops via an unsupervised clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [144], which enables computing statistical significance of interactions with less amount of input and reduced computational resources. Candidate loops are then compared with a permuted background model, based on the interaction decay over linear distance, to estimate statistical significance.

Further investigation in high-resolution Hi-C data (< = 10 kb), another local background model method was developed to identify architectural stripe structures rather than loops [135]. The stripe structure is similar to FIRE, where a genomic region contacts other regions of the entire domain with high interacting frequency [135]. Its identification algorithm Automated identification of stripes computes the pixel-specific enrichment relative to its local neighbourhood, then performs Poisson statistics to test if the signal is statistically significant [135]. It was further shown that stripe anchors highly correspond to loop anchors, and stripes appear to be relevant with enhancer activity [135, 145].

Potentially functional interaction identification via data integration

While variation in gene-coding regions can lead to significant alterations in one gene or abnormalities across a region in the genome, causing mendelian diseases such as chronic granulomatous disease [146], cystic fibrosis [147] and Fanconi’s anaemia [148], the fundamental motivation for identifying interacting regions across a genome is to establish how non-coding regions of the genome impact gene expression [1, 149, 150]. However potentially functionally relevant interactions, whether this be chromatin interactions between gene promoters and enhancers or transcription factor binding mechanisms, are often established in a cell type-specific manner [71, 82]. By integrating Hi-C interactions with local or publicly available genomic, transcriptomic and epigenomic datasets, such as regulatory elements, gene expression, genetic variation and quantitative trait loci (QTL) information, potentially functional interactions can be prioritised.

Potentially functional Hi-C interactions can be identified by integration with transcriptomics and enhancer data. Promoter–enhancer interactions (PEI), promoter–promoter interactions (PPI) or enhancer–enhancer interactions (EEI), where distal promoters or enhancers are brought into close proximity by chromatin contacts to form complex contact, are three widely accepted potentially functional Hi-C interaction types to be studied [51, 69, 151,152,153,154,155,156,157]. These interaction categories are often identified by finding overlaps of promoter or enhancer signals separately at each anchor of a Hi-C interaction [51, 155, 158]. However, when identifying PEI or PPI from Hi-C data for a specific cell type, the gene expression profile of such cell type should be considered to determine which promoters are active given that promoter interactions are shown highly cell-type specific [51].

Similar to promoters of expressed genes, active enhancers of a specific cell type are necessary to identify potentially functional PEI or EEI for a specific cell type. Expressed enhancers (eRNAs) or experimentally verified enhancers of different human cell types and tissues are available in publicly available projects and databases such as FANTOM5 project [159], the NIH Roadmap Epigenomics project [160], the EU Blueprint project [161], ENCODE [162, 163] and ENdb [164]. Additionally, previous studies also used cell type-specific histone markers ChIP-seq data, such as H3K27ac and H3K4me1, or integrated chromHMM chromatin state information predicted from a variety of epigenomic sequencing information [165, 166] to indicate the activity of an enhancer in a specific cell type [51, 155, 158, 167, 168]. In addition to using Hi-C data, there are numerous methods that have been developed to predict potentially functional interactions based on histone marker signals [169], gene expression and methylation data [170], ATAC-seq data [171], DNase-seq data [172] or even DNA sequence alone [173]. These types of methods have been comprehensively reviewed in a recent review study [174].

Besides promoters and enhancers, Super-enhancers (SEs) are another major regulatory element that is crucial to the identification of potentially functional interactions. SEs are defined as a clustered region of enhancers exhibiting significantly higher levels of active enhancer marks and an enrichment with transcription factor binding sites (TFBS) [175]. These regions act as “regulatory hubs”, which are higher-order complexes consisting of interactions between multiple enhancers and promoters at individual alleles [152, 176, 177]. The formation of these regulatory hubs are proposed to be the consequence of the high level of TF and co-factor localisation to the SE interacting to form a biomolecular condensate by a phase separation model [178,179,180,181,182,183]. Identified Hi-C interactions with linkages to SE have been shown to be potentially functional by mediating multiple gene expression regulations three-dimensionally, or being essential for cell identity and development [50, 184,185,186,187,188,189]. SE can be identified from H3K27ac ChIP-seq using the ROSE algorithm [186], and currently SE information can be easily accessible from databases such as AnimalTFDB [190], PlantTFDB [191], GTRD [192], SEdb [193], dbSUPER [194] and SEA [195, 196], allowing cell-type regulatory hubs to identified and linked to phenotypic traits and/or disease.

In genome-wide association studies (GWAS), almost 90% of the identified genetic single-nucleotide polymorphisms (SNPs) associated with phenotypic traits are located in non-coding regions such as gene desert, which are areas lacking protein-coding genes, hence making the interpretation of the functions of such variants much more challenging than the ones located within or nearby protein-coding genes [197,198,199]. Hi-C data have been proved to be useful in many studies for addressing this issue by forming linkages between diseases-associated variants and genes using long-range chromatin interactions. For examples, interactions between gene promoters and variation-located long coding RNAs (lncRNA), where GWAS SNPs can impact the expression of the target genes by affecting the binding of TF binding to the lncRNA [200]; direct interactions between SNPs and multiple genes, exhibiting co-regulation function of the SNPs [201]; interaction networks based on a SNP, bringing gene promoter, TF binding site and active enhancer region together by chromatin interactions to affect gene expression [202]. Variants may also impact gene-coding regions over large distances meaning that target genes of the variations are not necessarily their closest proximal gene [71, 203]. Currently, databases such as GWAS catalog [204], ImmunoBase [205], GWAS Central [206], GWAS ALTAS [207] and GWASdb [208] contain information of the level of genetic association of each variant to specific diseases, which are invaluable data to be integrated in a high-dimensional interaction dataset.

Tissue-specific quantitative trait loci (QTLs) are identified as the possession of variants that can significantly impact the level of quantitative trait [209], such as expression QTLs (eQTLs) that affect the expression level of the target genes [210], histone QTLs (hQTLs) that affect histone modifications [211, 212], methylation QTLs (meQTLs) that impact DNA methylations [213, 214] and ATAC-QTL that affect the accessibility of the corresponding areas [215]. In recent QTL studies, QTLs are found to affect their target regions by the long-range chromatin interactions between them observed from Hi-C data. For example, Greenwald et al. has recently used pancreatic islet-specific data to investigate the risk gene loci of type 2 diabetes (T2D) [216]. In their work they combined gene and enhancers interaction maps generated from Hi-C data, together with variant and gene expression linkage data, provided by tissue-specific eQTL analysis, to establish an enhancer network for T2D risk loci. In support of genetic variation at enhancers influencing transcriptional regulation, Yu et al. used HiC data to demonstrate that eQTLs tend to be in close spatial proximity with their target genes [217]. Additionally, a recent multi-tissues integration analysis between eQTLs and Hi-C interactions revealed the close proximity between eQTLs and their target genes, indicating that eQTLs regulate the expression of their target genes through chromatin contacts [217]. Therefore, with publicly available QTL databases such as the GTEx project [210], seeQTL [218], Haploreg [219], Blood eQTL browser [220], Pancan-meQTL [221] and QTLbase [222], the linkages between such QTLs and their target genes or regions can be used to infer potentially functional Hi-C interactions.

Future prospects

The investigation of 3D chromosome structure can provide novel insights into the complex regulatory network in the genome. The development of Hi-C and its derived protocols have facilitated the studies of the 3D genome structure, generating numerous high-quality datasets. However, due to the complexity of the Hi-C library preparation and analysis, the biologically meaningful, small-scale interactions may still lack sufficient signals, hindering the detection and interpretation of 3D interactions. The approaches that we presented in this review all aim to reduce the complexity of 3D interaction data, narrowing down information based on structure, statistical inference and additional lines of experimental evidence (i.e. cell type-specific epigenomic data).

Incremental development of Hi-C calling applications (chromatin loops, TADs, etc.) has continued with a focus on correcting biases introduced by library preparation and sequencing. As more and more sequencing data are deposited on open-access data repositories such as NCBI Short Read Archive (SRA) [223] and European Nucleotide Archive (ENA) [224], it has allowed the development of novel Machine Learning models trained on known interactions to identify novel patterns when applying these models to new datasets. Incorporation of publicly available cell type/tissue-specific epigenomics data into these machine learning models of chromatin interactions will allow for more accurate predictions on the molecular mechanisms by which diseases-associated genetic acts. In the future, such models of 3D interactions can potentially be used as markers for disease screening and used for personalised medicine development.

Although the development in protocol efficiency, parallel algorithmic improvements are likely to improve current approaches for identifying 3D interactions. Additional imaging technologies such as real-time signal fluorescence in situ hybridisation and advanced imaging approaches such as STORM imaging have been used to visualise the nuclear organisation in living cells and leading to the identification of clusters of clutch domains that are thought to correspond to TAD [7, 225]. Lastly the ability to engineer specific mutations in DNA through genome editing technology such as the CRISPR–Cas9 system [226, 227], means that future experiments using Hi-C and 3D imaging in-parallel with genetically modification of genomes will vastly improve our understanding of how variation may impact genomic structure, and the regulations of gene expression.

Conclusion

In this review, we first introduced the three-dimensional chromosome architecture in different scales, followed by presenting the chromosome conformation capture assays, with a focus on Hi-C and its variations, which are the state-of-the-art methods for investigating the 3D genome structure. Lastly, we comprehensively reviewed methodologies that are developed to reduce the complexity of 3D physical interactions identified from Hi-C datasets to detect potentially functional interactions. We also categorised the methods into three types, including structural-based detection methods, significant chromatin interactions identification methods and data integration methods. Taken together, by utilising these methods carefully, we are able to detect physical interactions with biological meaning and impact from complicated Hi-C dataset, which may serve a purpose in diagnosis and precision medicine.