Input data for MHiC
MHiC accepts contact maps (Hi-C interactions) in three different formats generated by the leading Hi-C analysis tools HiCUP, HiC-Pro, and HOMER. After obtaining contact maps from these tools, MHiC converts them to a single matrix with at least 5 columns: id, fragment 1 chromosome, fragment 2 chromosome, fragment 1 start position, and fragment 2 start position (for the HiCNorm method this matrix has 8 columns, including the GC content, effective length, and mappability features). MHiC then preprocesses the data, for example by changing the data resolution, calculating mid-locus positions, or removing diagonal interactions. In the next step, the data are converted to the input format of the GOTHiC, HiCNorm, or Fit-Hi-C background model, depending on the user's needs. In the final step, MHiC stores and visualizes the modeling results.
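The preprocessing steps named above (changing the resolution, taking mid-locus positions, removing diagonal interactions) can be illustrated with a minimal Python sketch. It is not MHiC's R implementation: the function name `rebin` and the tuple layout (chromosome 1, position 1, chromosome 2, position 2, count) are assumptions made for illustration only.

```python
from collections import defaultdict

def rebin(interactions, resolution):
    """Aggregate (chr1, pos1, chr2, pos2, count) records into fixed-size bins.

    Hypothetical helper: each position is replaced by the midpoint of the
    bin it falls into, and interactions whose two ends land in the same
    bin (the diagonal) are dropped.
    """
    binned = defaultdict(int)
    for chr1, pos1, chr2, pos2, count in interactions:
        mid1 = (pos1 // resolution) * resolution + resolution // 2
        mid2 = (pos2 // resolution) * resolution + resolution // 2
        if chr1 == chr2 and mid1 == mid2:
            continue  # diagonal (self) interaction
        # Order the two ends so that (a, b) and (b, a) are counted together.
        end1, end2 = sorted([(chr1, mid1), (chr2, mid2)])
        binned[end1 + end2] += count
    return dict(binned)

example = [
    ("chr1", 1200, "chr1", 48000, 3),
    ("chr1", 47000, "chr1", 3000, 2),   # same bin pair as above, reversed
    ("chr1", 5000, "chr1", 9000, 7),    # both ends in the first bin: diagonal
    ("chr1", 15000, "chr2", 15000, 1),  # inter-chromosomal, kept
]
result = rebin(example, 10000)
```

At a resolution of 10 kb, the first two records collapse into one bin pair and the third is removed as a diagonal interaction.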
HiCUP
HiCUP [8] is a pipeline produced by the Babraham Institute to map and perform quality control on Hi-C data. HiCUP outputs include two text files. The first is a file with four columns: id, flag, chromosome and locus position. The second is a digest file which includes chromosome ID, fragment start position and fragment end position. In the first file, two separate rows with the same ID define an interaction. In order to create this structure, users should use the hicup2gothic script, which is available as a HiCUP tool.
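The pairing rule described above, where two rows sharing an ID define one interaction, can be sketched in Python as follows. The function name `pair_reads` and the in-memory tuple layout (id, flag, chromosome, position) are assumptions for illustration; the sketch does not reproduce the hicup2gothic script itself.

```python
def pair_reads(rows):
    """Group four-column rows (id, flag, chromosome, position) into interactions.

    Returns a list of (id, (chrom_a, pos_a), (chrom_b, pos_b)) tuples and a
    dict of rows whose mate was never seen.
    """
    pending = {}   # first end of each read pair, keyed by read id
    pairs = []
    for read_id, _flag, chrom, pos in rows:
        if read_id in pending:
            first_end = pending.pop(read_id)
            pairs.append((read_id, first_end, (chrom, pos)))
        else:
            pending[read_id] = (chrom, pos)
    return pairs, pending

rows = [
    ("r1", 0, "chr1", 100),
    ("r2", 16, "chr2", 500),
    ("r1", 16, "chr1", 900),
    ("r2", 0, "chr3", 40),
    ("r3", 0, "chr1", 7),    # no mate in this toy input
]
pairs, unpaired = pair_reads(rows)
```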
HiC-Pro
HiC-Pro [9] was developed by Nicolas Servant to process Hi-C data from raw FASTQ files into normalized contact maps. The HiC-Pro output is a matrix file with three columns: locus 1 ID, locus 2 ID, and interaction count (the number of interacting reads between the two loci), and a BED file with four columns: chromosome ID, fragment start position, fragment end position, and fragment ID.
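To show how the two HiC-Pro files relate, the following Python sketch joins the three-column matrix with the four-column BED file to recover chromosome and start position for each interaction. The helper names and the whitespace-delimited parsing are assumptions for illustration, not part of HiC-Pro or MHiC.

```python
def load_bed(lines):
    """Map fragment ID -> (chromosome, start, end) from four-column BED lines."""
    fragments = {}
    for line in lines:
        chrom, start, end, frag_id = line.split()
        fragments[int(frag_id)] = (chrom, int(start), int(end))
    return fragments

def join_matrix(matrix_lines, fragments):
    """Turn 'id1 id2 count' lines into (chr1, start1, chr2, start2, count)."""
    interactions = []
    for line in matrix_lines:
        id1, id2, count = line.split()
        chr1, start1, _ = fragments[int(id1)]
        chr2, start2, _ = fragments[int(id2)]
        # Counts are parsed as floats so normalized matrices also work.
        interactions.append((chr1, start1, chr2, start2, float(count)))
    return interactions

bed = ["chr1\t0\t1000\t1", "chr1\t1000\t2000\t2", "chr2\t0\t1000\t3"]
matrix = ["1\t2\t5", "1\t3\t2"]
table = join_matrix(matrix, load_bed(bed))
```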
HOMER
HOMER [10] is an analysis tool that contains several programs and analysis routines to facilitate the analysis of Hi-C data. In the Hi-C data processing section, HOMER processes FASTQ and bowtie2 files to map and perform quality control on Hi-C data. In this process, HOMER creates CSV files that define Hi-C interactions for the next processing steps. For instructions on creating these files, users should consult the HOMER website (http://homer.ucsd.edu).
To identify Hi-C significant interactions and visualize Hi-C contact maps, we have developed MHiC in two main modules. The first module of MHiC is implemented as an R package to provide multiple backgrounds and correction models. The second module is a user-friendly graphical interface, which provides an interactive environment for users to plot Hi-C interactions in both an Arc diagram and a contact map diagram. MHiC accepts input data from different tools such as HiCUP, HiC-Pro and HOMER and then identifies significant interactions through the GOTHiC, HiCNorm and Fit-Hi-C methods at a desired resolution of the contact map.
Identifying significant interactions with MHiC
We developed MHiC based on the GOTHiC, HiCNorm, and Fit-Hi-C background models. These methods use different mathematical models to identify significant interactions. In the following, we explain each of the models in detail.
GOTHiC
GOTHiC was developed by Mifsud et al. This method assumes both ends of each read-pair are affected by biases. Therefore, the probability of observing \( n_{j,h} \) or more read-pairs between two loci, j and h, by chance in a dataset of N reads is given by the cumulative binomial density:
$$ pval_{j,h}=1-\sum_{i=0}^{n_{j,h}-1}\binom{N}{i}{\left({p}_{j,h}\right)}^{i}{\left(1-{p}_{j,h}\right)}^{N-i} $$
(1)
where the probability that a read pair is the consequence of a spurious ligation between two sites is:
$$ {p}_{j,h}=2\ast \mathit{relativecoverage}_j\ast \mathit{relativecoverage}_h $$
(2)
and the relative coverage of a given site or region is:
$$ \mathit{relativecoverage}_j=\frac{\mathit{reads}_j}{2N} $$
(3)
where \( \mathit{reads}_j \) is the mapped read count for genomic locus j. After calculating the probabilities, this method uses the Benjamini-Hochberg multiple-testing correction to obtain a false discovery rate adjusted p-value (q-value), which is used to find significant interactions. The Benjamini-Hochberg procedure controls the false discovery rate: when many tests are performed, some small p-values (for example, below 5%) arise by chance, which could lead to incorrectly rejecting true null hypotheses. In this method, the p-values are first sorted in ascending order and ranked. Then, each p-value is multiplied by m, the number of comparisons, and divided by its assigned rank, \( r_{j,h} \), to give the adjusted p-value:
$$ qval_{j,h}= pval_{j,h}\ast \frac{m}{r_{j,h}} $$
(4)
In this method, m is defined as the maximum number of possible interactions between all regions.
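Equations (1) to (4) can be traced with a small Python sketch. It is a direct transcription for small N, not the GOTHiC or MHiC implementation: the function names are hypothetical, the exact binomial sum is only practical for toy inputs, and capping the q-value at 1 is an addition not stated in Eq. (4).

```python
from math import comb

def relative_coverage(reads_at_locus, n_reads):
    """Eq. (3): share of all read ends (2N) that map to one locus."""
    return reads_at_locus / (2 * n_reads)

def gothic_pvalue(n_jh, reads_j, reads_h, n_reads):
    """Eqs. (1) and (2): probability of n_jh or more read pairs by chance."""
    p = 2 * relative_coverage(reads_j, n_reads) * relative_coverage(reads_h, n_reads)
    below = sum(
        comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i) for i in range(n_jh)
    )
    return 1 - below

def adjust_pvalues(pvalues, m=None):
    """Eq. (4): q = p * m / rank, with ranks taken in ascending p-value order.

    m defaults to the number of p-values supplied; pass a larger value to
    use the maximum number of possible interactions instead.
    """
    m = len(pvalues) if m is None else m
    order = sorted(range(len(pvalues)), key=lambda k: pvalues[k])
    qvalues = [0.0] * len(pvalues)
    for rank, k in enumerate(order, start=1):
        qvalues[k] = min(1.0, pvalues[k] * m / rank)  # cap at 1 (assumption)
    return qvalues

pval = gothic_pvalue(n_jh=1, reads_j=10, reads_h=10, n_reads=100)
qvals = adjust_pvalues([0.01, 0.04, 0.03])
```

Standard Benjamini-Hochberg implementations additionally enforce that q-values never decrease with rank; that step is left out here to stay close to Eq. (4).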
HiCNorm
HiCNorm was developed by Ming Hu et al. HiCNorm assumes a Poisson distribution to model sequencing errors and artefacts. It normalizes Hi-C contact maps and estimates the bias effects by using the effective length feature and the GC content feature while fixing the mappability feature as a Poisson offset. In this process, the normalized Hi-C contact map (e) for chromosome i at loci j and h is calculated based on the effective length feature (x), the GC content feature (y), the mappability feature (z), and the Hi-C contact map (u). The equations for intra-chromosomal Hi-C interactions are as follows:
$$ {e}_{j,h}^i=\frac{u_{j,h}^i}{t_{j,h}^i} $$
(5)
where t is calculated by:
$$ {t}_{j,h}^i=\exp \left[{\beta}_0^i+{\beta}_{len}^i\lg \left({x}_j^i{x}_h^i\right)+{\beta}_{gc}^i\lg \left({y}_j^i{y}_h^i\right)+\lg \left({z}_j^i{z}_h^i\right)\right] $$
(6)
The equations for inter-chromosomal Hi-C interactions between chromosomes i1 and i2 are:
$$ {e}_{j,h}^{i_1{i}_2}=\frac{u_{j,h}^{i_1{i}_2}}{t_{j,h}^{i_1{i}_2}} $$
(7)
where t is calculated by:
$$ {t}_{j,h}^{i_1{i}_2}=\exp \left[{\beta}_0^{i_1{i}_2}+{\beta}_{len}^{i_1{i}_2}\lg \left({x}_j^{i_1}{x}_h^{i_2}\right)+{\beta}_{gc}^{i_1{i}_2}\lg \left({y}_j^{i_1}{y}_h^{i_2}\right)+\lg \left({z}_j^{i_1}{z}_h^{i_2}\right)\right] $$
(8)
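As a worked illustration of Eqs. (5) to (8), the Python sketch below computes the expected count t from the three features and divides the observed count by it. The coefficient values are made up, since in HiCNorm they are estimated by Poisson regression, which is not shown here. The function names are hypothetical, and reading lg as the natural logarithm (to match the exponential) is an assumption.

```python
from math import exp, log

def expected_count(beta0, beta_len, beta_gc, x_j, x_h, y_j, y_h, z_j, z_h):
    """Eq. (6)/(8): expected contacts from effective length (x), GC content (y)
    and mappability (z); mappability enters as an offset with coefficient 1."""
    return exp(
        beta0
        + beta_len * log(x_j * x_h)
        + beta_gc * log(y_j * y_h)
        + log(z_j * z_h)
    )

def normalize(u, t):
    """Eq. (5)/(7): normalized contact = observed count / expected count."""
    return u / t

# Made-up coefficients and features, chosen so the arithmetic is easy to follow:
# t = exp(log(6) + log(0.25) + log(1)) = 1.5
t = expected_count(0.0, 1.0, 1.0, x_j=2, x_h=3, y_j=0.5, y_h=0.5, z_j=1, z_h=1)
e = normalize(3, t)
```

The same two functions cover the intra- and inter-chromosomal cases; only the fitted coefficients differ between them.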
Fit-Hi-C
The Fit-Hi-C method was developed by Ferhat Ay et al. This method uses a binomial distribution and works on intra-chromosomal interactions. In the first step, this method assumes that a single observed contact is equally likely to come from any of the M possible pairs of loci, so the null probability of this contact being between a specific locus pair is p = 1/M. Therefore, given N observed contacts, the probability that a given pair has a contact count of exactly k is:
$$ \Pr\left(K=k\right)=\binom{N}{k}{p}^k{\left(1-p\right)}^{N-k} $$
(9)
The P-value is the corresponding cumulative probability of observing at least k contacts:
$$ P\left(K\ge k\right)=\sum \limits_{i=k}^N\Pr\left(K=i\right) $$
(10)
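For a concrete sense of Eqs. (9) and (10), the Python sketch below evaluates the uniform null with p = 1/M on toy numbers. The function names are hypothetical, and summing the exact upper tail is only practical for small N.

```python
from math import comb

def contact_pmf(k, n_contacts, p):
    """Eq. (9): probability that one locus pair receives exactly k contacts."""
    return comb(n_contacts, k) * p**k * (1 - p) ** (n_contacts - k)

def contact_pvalue(k, n_contacts, p):
    """Eq. (10): probability of at least k contacts (upper tail)."""
    return sum(contact_pmf(i, n_contacts, p) for i in range(k, n_contacts + 1))

n_pairs = 10              # M: possible locus pairs (toy value)
p_uniform = 1 / n_pairs   # null probability for any one pair
tail = contact_pvalue(1, 10, p_uniform)
```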
In the second step, this method replaces the binned binomial method (contact probability p) with a spline-fitting procedure that provides a more precise estimate of the probability of observing a contact at a specified genomic distance \( d_{j,h} \). In other words, Fit-Hi-C replaces the contact probability p in Eq. (9) with a function \( f^{(1)}(d) \), which is computed by a spline fit to the observed contact probabilities of locus pairs based on their genomic distances. To achieve a smooth spline fit, this method segregates the locus pairs into b equal-occupancy bins (in this method b = 200). The smallest distances in bin i and bin i + 1 define the lower and upper genomic distance boundaries, \( s_i \) and \( e_i \), respectively, for bin i. Then, for each bin i, this method computes three values: (1) the average number of contact counts per locus pair, \( c_i \); (2) the prior contact probability that a given mid-range read comes from one specific locus pair in this bin, \( \frac{c_i}{N} \), where N is the total number of mid-range reads; and (3) the average interaction distance \( d_i \) over all locus pairs in the bin, including pairs that have a contact count of zero. This method then fits a univariate spline to the resulting b points \( \left({d}_1,\frac{c_1}{N}\right),\dots, \left({d}_b,\frac{c_b}{N}\right) \):
$$ {f}^{(1)}(d)=\sum \limits_{i=1}^b{f}^{(1)}\left({d}_i\right){f}_i(d) $$
(11)
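The binning that produces the b points for the spline can be sketched in Python as below. Two things are assumptions: "equal occupancy" is read as roughly equal total contact count per bin, which the text does not spell out, and the function names are hypothetical. The spline fit itself is left out; a library routine such as SciPy's `UnivariateSpline` could be applied to the returned points.

```python
def summarize_bin(chunk, total_reads):
    """Return (d_i, c_i / N) for one bin of (distance, contact_count) pairs."""
    mean_distance = sum(d for d, _ in chunk) / len(chunk)
    mean_count = sum(c for _, c in chunk) / len(chunk)   # includes zero counts
    return (mean_distance, mean_count / total_reads)

def equal_occupancy_points(pairs, n_bins):
    """Split locus pairs, sorted by distance, into bins of similar read total."""
    ordered = sorted(pairs)                       # ascending genomic distance
    total_reads = sum(c for _, c in ordered)      # N
    target = total_reads / n_bins                 # reads wanted per bin
    points, chunk, chunk_reads = [], [], 0
    for distance, count in ordered:
        chunk.append((distance, count))
        chunk_reads += count
        # Close the bin once it holds its share; the last bin takes the rest.
        if chunk_reads >= target and len(points) < n_bins - 1:
            points.append(summarize_bin(chunk, total_reads))
            chunk, chunk_reads = [], 0
    if chunk:
        points.append(summarize_bin(chunk, total_reads))
    return points

toy_pairs = [(10, 4), (20, 4), (30, 2), (40, 2), (50, 0), (60, 2), (70, 2)]
points = equal_occupancy_points(toy_pairs, n_bins=2)
```

With these toy values, each of the two bins holds 8 of the 16 reads, but the second bin spans five locus pairs, so its per-pair contact probability is lower.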
In the third step, this method uses a two-phase spline fitting procedure to modify the binning method, which involves producing a more accurate estimate of the null distribution by excluding contacts that are likely to be real.
Visualize Hi-C interactions
We developed a graphical user interface for MHiC. The interface enables the user to set parameters, generate significant interactions, and visualize Hi-C contact maps. We implemented the visualization as an HTML page that shows Hi-C interactions on an Arc diagram or a Heatmap. The MHiC user interface contains two sections: 1) calling background models to detect statistically significant interactions; 2) visualization options for displaying a Hi-C contact map. More details about the parameters related to each section can be found in the user manual supplementary file.
Data
The visualization part of MHiC needs a file with five columns (chr1, locus1, chr2, locus2 and readCount) to create an Arc diagram or a Heatmap diagram.
Arc diagram
The Arc diagram only works on cis interactions. The number of interacting reads between two regions determines the thickness of the Arc link. Users can change the data resolution directly in the diagram, without a separate resolution-conversion step. In addition, MHiC can visualize interactions in a specific range of fragments, and users can select an individual fragment to visualize that region's interactions. Users can also set a read-count threshold to remove interactions that have fewer reads than the threshold. MHiC can import different annotation files to show on top of the Arc diagram, and it can highlight valid interactions by changing their colors or by removing invalid interactions. Users can also change the diagram's colors as needed.
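The cis-only restriction and the read-count threshold can be expressed as a short filter over the five-column records (chr1, locus1, chr2, locus2, readCount). This Python sketch only illustrates the selection logic; the function name is hypothetical and it is not the code behind the MHiC interface.

```python
def select_arc_interactions(rows, min_reads):
    """Keep cis interactions whose read count is at least min_reads."""
    kept = []
    for chr1, locus1, chr2, locus2, read_count in rows:
        if chr1 != chr2:
            continue  # inter-chromosomal: not drawn in the Arc diagram
        if read_count < min_reads:
            continue  # fewer reads than the user-set threshold
        kept.append((chr1, locus1, chr2, locus2, read_count))
    return kept

rows = [
    ("chr1", 5000, "chr1", 45000, 12),
    ("chr1", 5000, "chr1", 25000, 2),    # below threshold
    ("chr1", 15000, "chr2", 15000, 30),  # inter-chromosomal
]
arcs = select_arc_interactions(rows, min_reads=5)
```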
Contact map Heatmap
The Contact map diagram draws a Heatmap of interactions. The Heatmap diagram can show interactions within a single chromosome, multiple chromosomes, or the whole genome. In this diagram, the number of interacting reads or the p-value between two regions determines the interaction's color. Users can select the color range for interactions: for example, when a user selects yellow and black, the interaction read counts are shown on a scale between these two colors. The Contact map diagram otherwise has the same options as the Arc diagram, except that it can also show the whole genome and inter-chromosomal interactions.