A guideline based on fluorescence in-situ hybridization (FISH) experiments
The FISH protocol attaches fluorescent tags to a few specific genomic sites. It allows accurate measurement, in a population of fixed cells, of the spatial distances between these sites and of their distribution. However, the number of investigated sites is very limited, in contrast to the genome-wide coverage permitted by conformational capture techniques. FISH experiments have been used to check that conformational capture actually provides information on in-vivo distances. They provide the only independent constraint on the 3D reconstruction from Hi-C maps.
A negative correlation has been observed, for the sites tagged by FISH, between their distance \(d_{ij}\) (averaged over numerous single cells) and the number \(C_{ij}\) of Hi-C reads, or equivalently the contact frequency \(F_{ij}\) [2]. This correlation was the argument for using L as a proxy for the 3D distances. In the experimental situation considered in [2], it could be satisfactorily summarized by a heuristic power law \(d_{ij}\sim F_{ij}^{-\alpha _{FISH}}\), with a (non-universal) exponent \(\alpha_{FISH}\approx 0.227\) (Fig. 3).
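As a purely illustrative sketch of how such an exponent can be estimated, the following Python snippet fits a power law by linear regression in log-log coordinates. The function name `fit_power_law_exponent` and the synthetic data are assumptions introduced here; they are neither the data nor the fitting procedure of [2].

```python
import numpy as np

def fit_power_law_exponent(freq, dist):
    """Estimate alpha in dist ~ freq**(-alpha) by least squares in log-log space."""
    freq = np.asarray(freq, dtype=float)
    dist = np.asarray(dist, dtype=float)
    valid = (freq > 0) & (dist > 0)          # logarithms need positive values
    slope, _intercept = np.polyfit(np.log(freq[valid]), np.log(dist[valid]), 1)
    return -slope                             # the exponent is minus the slope

# Synthetic data (hypothetical): distances generated with exponent 0.227
# and a small multiplicative noise, only to exercise the fit.
rng = np.random.default_rng(0)
freq = rng.uniform(1.0, 1000.0, size=200)
dist = 5.0 * freq ** (-0.227) * np.exp(rng.normal(0.0, 0.01, size=200))
alpha_hat = fit_power_law_exponent(freq, dist)
```

With real data, the reliability of such a fit depends on the range of frequencies covered, so the estimated exponent should be read as a descriptive summary only.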
In the analyses that follow, we used Hi-C data obtained in human cells (lymphoblastoids) as in [2] (Fig. 3), but with a higher resolution [3] (Fig. 4a).
Tunable graph distances
In line with the power-law correlation observed in FISH data, we endow each contact-associated edge with a length \(L_{ij}\sim F_{ij}^{-\alpha }\), depending on a tunable parameter α. This extension, proposed for L used as an ansatz for the distances [25, 26], is here integrated into our network-based computation of the distances. We investigated the influence of the value of α on the properties of the shortest-path distance matrix D and its relationship with F (short blue arrow in Fig. 2a), with two extreme cases: α=0.2 (the rounded value of the exponent observed experimentally in the above-described situation) and α=1 (the value adopted in the original algorithm).
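The construction of L and D can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results: pairs without any recorded contact are given an infinite edge length, shortest paths are computed with a plain Floyd-Warshall loop, and the function names and the 4×4 matrix `F_toy` are hypothetical.

```python
import numpy as np

def edge_lengths(F, alpha):
    """Edge lengths L_ij = F_ij**(-alpha); no edge (infinite length) where F_ij = 0."""
    F = np.asarray(F, dtype=float)
    L = np.full(F.shape, np.inf)
    positive = F > 0
    L[positive] = F[positive] ** (-alpha)
    np.fill_diagonal(L, 0.0)                  # a site is at distance 0 from itself
    return L

def shortest_path_distances(L):
    """All-pairs shortest-path distances on the contact network (Floyd-Warshall)."""
    D = L.copy()
    for k in range(D.shape[0]):
        # Allow paths going through intermediate site k.
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

# Toy symmetric contact-frequency matrix (hypothetical values).
F_toy = np.array([[0, 10, 1, 0],
                  [10, 0, 10, 1],
                  [1, 10, 0, 10],
                  [0, 1, 10, 0]], dtype=float)
L_toy = edge_lengths(F_toy, alpha=1.0)
D_toy = shortest_path_distances(L_toy)
```

In this toy example with α=1, the weak contact between sites 0 and 2 (length 1) is bypassed by the path through site 1 (length 0.2), and the pair (0, 3), which has no recorded contact, receives a finite distance.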
By definition, the shortest-path distance \(D_{ij}\) is always smaller than or equal to the edge length \(L_{ij}\), as can be seen in Fig. 4b. It is expected, and intended, that D does not rely on low contact frequencies, associated with long edges in the contact network. Figure 4b shows that the difference between D and L is indeed more marked for smaller contact frequencies, i.e. larger distances. We quantified this feature by the percentage \(N_{Sh}\) of pairs (i,j) with a nontrivial shortest-path distance \(D_{ij}<L_{ij}\). The pairs of sites contributing to \(N_{Sh}\) are those with low contact frequencies, for which the shortest path travels through different and shorter connections than the edge (i,j). When α increases, the discrepancy between L and D is observed to increase, as illustrated by the two panels of Fig. 4b. This trend is assessed by plotting the percentage \(N_{Sh}\) as a function of α (Fig. 4c). The correlation between the contact frequency \(F_{ij}\) observed for a pair of sites and their shortest-path distance \(D_{ij}\) can be summarized by a scaling law, with an exponent \(\alpha_{Sh}\) (minus the slope of the red lines in Fig. 4b). The dependence of \(\alpha_{Sh}\) on α is shown in Fig. 4d. A crossover is observed at a value α≈0.2.
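The percentage \(N_{Sh}\) can be illustrated on a toy example. The sketch below assumes that all pairs i<j are counted, including pairs without any recorded contact (treated as infinitely long edges); this counting convention, the function names and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def shortest_path_matrix(L):
    """All-pairs shortest paths by Floyd-Warshall on a dense length matrix."""
    D = L.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def nontrivial_percentage(F, alpha):
    """Percentage of pairs i<j whose shortest-path distance is below the edge length."""
    F = np.asarray(F, dtype=float)
    L = np.full(F.shape, np.inf)              # assumption: no contact -> no edge
    positive = F > 0
    L[positive] = F[positive] ** (-alpha)
    np.fill_diagonal(L, 0.0)
    D = shortest_path_matrix(L)
    upper = np.triu_indices(len(F), k=1)
    # Small relative tolerance so that rounding does not create spurious shortcuts.
    shortcut = D[upper] < L[upper] * (1.0 - 1e-12)
    return 100.0 * shortcut.mean()

# Toy symmetric contact-frequency matrix (hypothetical values).
F_toy = np.array([[0, 10, 1, 0],
                  [10, 0, 10, 1],
                  [1, 10, 0, 10],
                  [0, 1, 10, 0]], dtype=float)
alphas = [0.2, 0.5, 1.0]
n_sh = [nontrivial_percentage(F_toy, a) for a in alphas]
```

On this toy matrix, only the pair without any recorded contact acquires a shortcut at α=0.2, whereas half of the pairs do at α=1, which mimics the increasing trend described above.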
Overall, the improvement brought by using shortest-path distances D as an input to MDS is more important for larger distances and larger values of α. However, a large value of α is not necessarily the best choice: in this regime, the distances D are derived mainly from a few large contact frequencies measured in the Hi-C experiment, while less frequent contacts do not contribute. This filters out noise and unreliable recordings, but possibly also relevant information. In addition, the scaling of the distances with respect to the contact frequencies is modified by the shortest-path computation; Fig. 4d provides a calibration curve for the considered data, allowing one to control \(\alpha_{Sh}\) by a proper choice of α. Further analysis is presented below, with a focus on the extreme values α=0.2 and α=1.
Effect of the multidimensional scaling
We further explored the relationship between the reconstructed distances R and the contact frequencies F (long blue arrow in Fig. 2a) as a function of α. We moreover compared two versions of MDS, corresponding to different optimization criteria and hence different approximations. Classical MDS corresponds to the minimization of \(\sum _{i,j}(D_{ij}-R_{ij})^{2}\). The strength of this method is that it reduces to the determination of the first three eigenvectors of the metric matrix M, as explained above. Its weakness is the weak constraint on small distances, since minimizing the error is achieved mainly by controlling the large distances. This dominance of large distances can be corrected by considering the relative error [25], leading to the so-called (nonclassical) metric MDS (see Methods). Importantly, both classical MDS and metric MDS are applied to the shortest-path distance matrix D. In contrast, MDS applied to L is highly unstable, due to the treatment of infinite or abnormal components of L and the fact that L is not a distance matrix [9]. As regards computational time, nonclassical MDS starts from the classical MDS solution and hence takes more time. At larger sizes, their computational performances converge, because the (common) limiting step is the computation of shortest paths; see Additional file 1: Figure S1.
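The eigenvector-based step of classical MDS can be sketched as follows. The snippet uses the textbook double-centering construction of the Gram matrix, which we assume here plays the role of the metric matrix M; the relative-error sum shown at the end is one plausible form of the criterion attributed to metric MDS, not necessarily the exact one of the Methods. All names and the random test points are hypothetical.

```python
import numpy as np

def classical_mds(D, dim=3):
    """Coordinates from a distance matrix via the top eigenvectors of the Gram matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering operator
    M = -0.5 * J @ (D ** 2) @ J               # Gram ("metric") matrix
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]     # indices of the 'dim' largest ones
    lam = np.clip(eigvals[top], 0.0, None)    # guard against small negative values
    return eigvecs[:, top] * np.sqrt(lam)

def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical check: distances between random 3D points are recovered exactly.
rng = np.random.default_rng(1)
X_true = rng.normal(size=(20, 3))
D = pairwise_distances(X_true)
X = classical_mds(D)
R = pairwise_distances(X)

off_diagonal = ~np.eye(len(D), dtype=bool)
absolute_error = ((D - R) ** 2).sum()
relative_error = (((D[off_diagonal] - R[off_diagonal]) / D[off_diagonal]) ** 2).sum()
```

When D is a true Euclidean distance matrix in three dimensions, both error sums vanish up to rounding; with shortest-path distances they generally do not, and the two criteria weight small and large distances differently.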
As shown in Fig. 5, we observe a correlation between the reconstructed distances R and the contact frequencies F, which can be summarized by a power law with exponent \(\alpha^{*}\) (minus the slope of the red lines in Fig. 5a and b), depending on the value of α and on the MDS implementation. Note that we do not claim that these power laws have a deep meaning, reflecting e.g. some self-similar or fractal structure of the chromosomes; the range of the fit is not large enough to make such a claim. These power laws are used as the simplest way to quantitatively describe the correlation between F and the distance matrices L, D and R. The comparison of the exponent \(\alpha^{*}\) with \(\alpha_{Sh}\) (Fig. 4c) and α (Fig. 5d) provides a global quantification of the effect on the distances of the MDS step and of the integrated algorithm, respectively. A local quantification will be implemented in the next section.
The value of α initially taken in the expression of the edge lengths L is not recovered in the relationship between the reconstructed distances and the contact frequencies, with exponent \(\alpha^{*}\). Part of the difference between the two exponents comes from the shortest-path computation (Fig. 4d), and part from the MDS dimensional reduction (Fig. 5c). This latter figure shows that metric MDS has a smaller impact on the exponent \(\alpha^{*}\) than classical MDS. Using Fig. 4d, it is possible to choose a value of α to get the desired correlation behavior in the reconstructed structure, with some limitations. Notably, the effect of MDS on \(\alpha^{*}\) is weaker at larger α.
The value \(\alpha_{FISH}=0.227\) is at the lower boundary of the accessible range for \(\alpha^{*}\). However, this exponent has been obtained from experimental data corresponding to large distances. This experimental range is difficult to delineate precisely, so that a partial fit would not be reliable; it is nevertheless apparent in Figs. 5a–d (dashed black line) that a smaller exponent \(\alpha^{*}\) would be obtained in the large-distance range, supporting the experimental consistency of the reconstructed structure.
Flexibility of the extended ShRec3D algorithm
We computed the component-wise relative error \(|D_{ij}-R_{ij}|/D_{ij}\) to analyze quantitatively the action of the MDS step according to the scale. The comparisons displayed in Fig. 6a and b show that metric MDS better controls the error on small distances than classical MDS, which performs better at large distances, as expected mathematically. The trade-off offered by implementing either classical or metric MDS is more marked for α=1; see also Additional file 1: Figure S2.
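A scale-resolved view of the relative error \(|D_{ij}-R_{ij}|/D_{ij}\) can be sketched as follows. The binning by quantiles of D, the function name `relative_error_by_scale` and the artificial R (equal to D plus a constant offset) are assumptions chosen only to make the scale dependence visible; they do not reproduce Fig. 6.

```python
import numpy as np

def relative_error_by_scale(D, R, n_bins=5):
    """Mean of |D_ij - R_ij| / D_ij over pairs i<j, within quantile bins of D_ij."""
    D = np.asarray(D, dtype=float)
    R = np.asarray(R, dtype=float)
    upper = np.triu_indices_from(D, k=1)
    d, r = D[upper], R[upper]                 # assumes D_ij > 0 for i != j
    rel = np.abs(d - r) / d
    edges = np.quantile(d, np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each pair to a bin; the largest distance goes to the last bin.
    bin_index = np.clip(np.searchsorted(edges, d, side="right") - 1, 0, n_bins - 1)
    mean_rel = np.array([rel[bin_index == b].mean() for b in range(n_bins)])
    return edges, mean_rel

# Hypothetical example: a constant absolute error of 0.05 on every distance.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
R = D + 0.05
np.fill_diagonal(R, 0.0)
edges, mean_rel = relative_error_by_scale(D, R)
```

In this artificial case the absolute error is the same at all scales, so the relative error decreases from the smallest to the largest distance bin, which shows why a criterion based on absolute errors leaves small distances poorly constrained.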
It is also apparent in the respective 3D reconstructions (Fig. 6c and d) that metric MDS reproduces small-scale features (e.g. small loops), while the global shape is more clearly represented with classical MDS.
For small values of α (Fig. 6c), the reconstructed structure is more compact, closer to the results of imaging experiments. For larger values of α (Fig. 6d), the reconstructed 3D structure is more extended, which is especially suitable for 3D genome browsers. Tuning the exponent α thus allows one to focus on either short or large scales.
Note that a distortion arises in Fig. 6c and d due to the 2D projection of the 3D structures on the printed sheet. The alignment of the structures obtained with different MDS implementations has been done without any rescaling, since they are based on the same distance matrix D.
A rescaling is, however, necessary to compare the structures obtained for different values of α, as presented in Fig. 7. Small-scale features are reproduced with α=0.2, while the skeleton of the overall shape is better perceived with α=1. Intermediate values of α offer a continuous trade-off between these two extreme behaviors, as can be seen in Additional file 1: Figure S3. The reconstruction of the whole chromosome 1 is presented in Additional file 1: Figure S4, as an illustration of the performance of our algorithm at large sizes.
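The superposition of two reconstructed structures, with or without rescaling, can be sketched with an orthogonal Procrustes alignment. This is one standard way to align point sets and is an assumption here, since the alignment procedure actually used is not detailed in this section; the function `align_structures` and the random coordinates are hypothetical. Reflections are allowed, because MDS coordinates are only defined up to rotation, translation and mirror symmetry.

```python
import numpy as np

def align_structures(A, B, rescale=False):
    """Superimpose coordinates A (n x 3) onto B (n x 3) in the least-squares sense."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - a_mean, B - b_mean           # remove translations
    U, S, Vt = np.linalg.svd(A0.T @ B0)
    Q = U @ Vt                                # optimal orthogonal transformation
    # Optional global scale factor, needed when the two structures were
    # built from different distance matrices (e.g. different alpha).
    scale = S.sum() / (A0 ** 2).sum() if rescale else 1.0
    return scale * A0 @ Q + b_mean

# Hypothetical check with a known rotation/reflection, translation and scale.
rng = np.random.default_rng(3)
A = rng.normal(size=(15, 3))
Q0, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
shift = np.array([1.0, -2.0, 0.5])
B_rigid = A @ Q0 + shift
B_scaled = 2.5 * A @ Q0 + shift
aligned_rigid = align_structures(A, B_rigid)
aligned_scaled = align_structures(A, B_scaled, rescale=True)
```

Setting `rescale=False` corresponds to comparing structures derived from the same distance matrix D, while `rescale=True` adds the global scale factor needed when the values of α differ.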