Introduction

Theoretical and experimental efforts of the last two decades have established that kinetic and thermodynamic properties of the protein folding process can be inferred from some spatial features of the native structure itself1,2,3. For instance, the contact map of the native state4,5,6, the matrix indicating which pairs of residues are close in space, determines the folding nucleus, i.e. the group of residues whose interaction network is essential for driving the folding. Similarly, the loops formed between residues in contact have an average chemical length, the contact order, which is strongly correlated with the folding time7,8,9. However, some proteins, being extremely self-entangled in space, are characterized by a folding process that cannot be simply rationalized by local (contact) properties. Examples are proteins hosting knots10,11,12,13,14,15,16,17, slipknots18,19, lassos20,21 and links22,23. These complex motifs were found in about 6% of the structures deposited in the Protein Data Bank (PDB) and, although it is expected that their presence can severely restrict the available folding pathways17,22, it is not clear how proteins avoid the ensuing kinetic traps and fold into the topologically correct state.

A crucial question is whether and how these topologically entangled motifs affect the protein energy landscape. According to the well established paradigm of minimal frustration24,25, energetic interactions in proteins are optimized in order to avoid as much as possible the presence of unfavorable interactions in the native state. Although non optimized interactions may result in kinetic traps along the folding pathway, some amount of residual frustration has been detected and related to functionality and allosteric transitions26.

A further issue is whether the effect of topology-induced traps depends on the folding direction along the chain: if proteins fold cotranslationally when they are being produced at the ribosome, one then expects sequential folding pathways proceeding from the N-terminus to be less hindered by such traps.

To understand the relevance of topological motifs in proteins, here we quantify the amount of self-entanglement in native structures by computing the Gaussian entanglement (GE)9,23, a generalization of the Gauss integrals used to compute the linking number27 (see Materials and Methods for details). Indeed, if integrals were computed for two closed curves, e.g. loops in proteins closed by disulphide bridges or by any other form of covalent bond20,21,22 (Fig. 1b), the result would be the (integer) linking number27. By applying the method to open chains9,23,28,29,30, the GE provides a real number that quantifies the mutual winding of any pair of subchains along the structure9 or between two proteins in a dimer23.
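As a rough illustration of how a Gauss double integral can be evaluated on a discretized backbone, the following Python sketch uses a midpoint rule over the segments of two polylines. The function name `gauss_entanglement`, the midpoint discretization, and the test geometry are assumptions made here for illustration; the definition actually used in this work is the one given in Materials and Methods.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def gauss_entanglement(chain_a, chain_b):
    """Midpoint-rule estimate of (1/4pi) * double integral of
    (r_a - r_b) . (dr_a x dr_b) / |r_a - r_b|^3 over two polylines."""
    total = 0.0
    for p0, p1 in zip(chain_a, chain_a[1:]):
        mid_a = tuple((u + v) / 2 for u, v in zip(p0, p1))
        step_a = _sub(p1, p0)
        for q0, q1 in zip(chain_b, chain_b[1:]):
            mid_b = tuple((u + v) / 2 for u, v in zip(q0, q1))
            step_b = _sub(q1, q0)
            r = _sub(mid_a, mid_b)
            total += _dot(r, _cross(step_a, step_b)) / math.sqrt(_dot(r, r)) ** 3
    return total / (4 * math.pi)

# Sanity check on two closed, singly linked rings (a Hopf link).
n = 100
ring_a = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n), 0.0)
          for k in range(n + 1)]
ring_b = [(1.0 + math.cos(2 * math.pi * k / n), 0.0, math.sin(2 * math.pi * k / n))
          for k in range(n + 1)]
hopf_value = gauss_entanglement(ring_a, ring_b)  # magnitude close to 1

# The same function applied to open arcs returns a real, generally non-integer, number.
open_value = gauss_entanglement(ring_a[:60], ring_b[:60])
```

For the closed rings the estimate approaches an integer of magnitude one, whereas for open arcs nothing constrains the result to be an integer, which is the property exploited by the GE.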

Figure 1

Sketches of proteins with (a) a knot, (b) two linked loops with cysteine closures (magenta dots), (c) two linked loops with virtual non-covalent closures (yellow and green dots form two different contacts), and (d) a loop (red) intertwined with an open chain portion (blue) - a “thread”. (e) A configuration with the loop (γi) closer to the C-terminus than the thread (γj), and (f) one with the loop closer to the N-terminus. In the two latter pictorial representations of non-structured proteins, we also show the loop-thread sequence separation s and the loop length m.

Our analysis, when applied to a large-scale database of protein domains (see Materials and Methods), identifies entangled motifs that are more elusive than knots (see Fig. 1a). For instance, portions of proteins characterized by high values of GE correspond to links between non-covalent loops (Fig. 1c) as well as to interlacings between a loop and another part of the polypeptide chain (Fig. 1d). A preliminary search for some of these motifs was carried out in the 1980s but, due to the shortage of protein structures available in the PDB and the specificity of the chosen entanglement to be explored (threading, pokes or co-pokes31), the conclusion was that these forms of entanglement were rare32. This finding was practically used to discriminate between natural proteins and artificial decoys33,34.

By performing a detailed analysis of protein structures with the GE tool, we discover that mutually entangled motifs such as those sketched in Fig. 1c,d are, at first glance, not uncommon, given that about one third of the 16968 analyzed proteins include at least one entangled loop. Nonetheless, we find that naturally occurring folds are much less topologically intertwined than same-length protein-like structures generated by all-atom molecular dynamics35.

More importantly, by focusing on the pairs of amino acids forming contacts at the end of entangled loops, we discover that they are enriched in hydrophilic classes with respect to the mainly hydrophobic generic contacts. Therefore, the corresponding effective interactions are on average significantly weaker. The presence of non-optimized interactions and the consequent energetic frustration could be interpreted as the result of natural selection toward sequences that keep the intertwined structures more flexible.

Another possible fingerprint of evolutionary mechanisms is the observation that entangled loops more frequently follow the chain portion threading through them (Fig. 1e) rather than preceding it along the chain as in Fig. 1f. Indeed, in the case of cotranslational folding, where the N-terminus starts to fold already during the translation process, it seems kinetically simpler to first fold the threading chain portion (blue in Fig. 1e) and then bundle the (red) loop around it than to first fold the loop and then thread the other portion through it (Fig. 1f), as already pointed out in the literature31.

Results and Analysis

Proteins with entangled loops are not rare

We use the contact GE parameter \({G^{\prime} }_{c}\) to find protein domains with at least one loop γi intertwined with a “thread” γj, which is another portion of the protein (Fig. 1d–f). More precisely, we can associate \({G^{\prime} }_{c}\)(i, j) to a given loop-thread pair by using the Gauss double integral described in Materials and Methods. By maximizing |\({G^{\prime} }_{c}\)(i, j)| over all the possible threads γj we assign an entanglement score \({G^{\prime} }_{c}\)(i) to the loop and, by further maximizing |\({G^{\prime} }_{c}\)(i)| over all loops γi, we find the entanglement \({G^{\prime} }_{c}\) of the protein. Unlike similar quantities defined for closed curves, \({G^{\prime} }_{c}\) is a real number. Nevertheless, we define a loop γi in a configuration as in Fig. 1d to be entangled if |\({G^{\prime} }_{c}\)(i)| ≥ 1. Such a threshold is natural because a linking number |L| = 1 is the minimum value that guarantees that two closed curves are linked27. In a data set of 16968 protein domains, 5375 (31.7%) host at least one entangled loop. We also monitor the value L′ of the linking entanglement (LE) for a single protein, defined as \({G^{\prime} }_{c}\) for two subchains that are both loops, as in Fig. 1c. In Fig. 2 we show five examples of “entangled” protein domains, along with their respective values of \({G^{\prime} }_{c}\) and L′.
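The two nested maximizations described above can be sketched in a few lines of Python. The sketch assumes the pair scores have already been computed and stored in a dictionary keyed by (loop, thread); the function names, the labels and the numerical values are hypothetical and serve only to illustrate how the sign is retained while the magnitude is maximized.

```python
def loop_scores(pair_scores):
    """For each loop, keep the thread whose score has the largest magnitude (sign retained)."""
    best = {}
    for (loop, thread), g in pair_scores.items():
        if loop not in best or abs(g) > abs(best[loop][1]):
            best[loop] = (thread, g)
    return best

def protein_score(best_per_loop):
    """Signed loop score of largest magnitude over all loops of the protein."""
    return max((g for _, g in best_per_loop.values()), key=abs)

def entangled_loops(best_per_loop, threshold=1.0):
    """Loops whose score reaches the |G'c(i)| >= 1 threshold used in the text."""
    return sorted(loop for loop, (_, g) in best_per_loop.items() if abs(g) >= threshold)

# Hypothetical pair scores for a toy protein with two loops and two threads.
pair_scores = {
    ("loop_1", "thread_a"): 0.4,
    ("loop_1", "thread_b"): -1.3,
    ("loop_2", "thread_a"): 0.9,
}
best = loop_scores(pair_scores)
```

In this toy case `loop_1` is assigned the score −1.3 through `thread_b` and is the only entangled loop, so the protein score is also −1.3.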

Figure 2

Some examples of protein domains with nontrivial entanglement, in which one notes a looped portion (red, with yellow ends, following the color code of Fig. 1d) intertwined with another portion of the protein, a thread (blue). (a) protein 2bjuA02, with L′ = 1.145 (the blue chain also contains a loop) and \({G^{\prime} }_{c}\) = 1.285 of similar magnitude; (b) protein 3thtA01, with significant entanglement (\({G^{\prime} }_{c}\) = −1.56) but without two linked loops (L′ = −0.34); (c) protein 3tnxA00, with large \({G^{\prime} }_{c}\) = −3.07, while L′ = −1.31 is much smaller in magnitude and has the same sign; (d) protein 2i06A01, with \(L^{\prime} =0.74\simeq 1\), partitioned in the two corresponding linked loops and thus following the color code of Fig. 1c (green ends of the blue loop); (e) again 2i06A01, with \({G^{\prime} }_{c}\) = −1.26, highlighting the related loop-thread partition. In the last two panels one notes that the sign of L′ is opposite to the sign of \({G^{\prime} }_{c}\). It is an example of the coexistence of different forms of entanglement in the same protein domain. (f) protein 1otjA00, one of the protein domains with largest (absolute) Gaussian entanglement, with \({G^{\prime} }_{c}\) = −3.24 and L′ = −3.02. The red loop, with yellow ends, is extremely entangled with the blue portion (which in this case also contains a loop).

The non trivial entanglement features of protein structures, when analyzed with GE and LE, are apparent in Fig. 3A, where protein domains are represented in the L′ vs \({G^{\prime} }_{c}\) space. All the points lie in the region |L′| ≤ |\({G^{\prime} }_{c}\)| because the latter quantity is defined as an extremum over a wider subset.

Figure 3

(A) Plot of L′ vs \({G^{\prime} }_{c}\) for each protein in the CATH database; the five proteins shown in Fig. 2a–f are highlighted with the corresponding letter. (B) Smoothed histogram of data with significant linking (|L′| > 1/2). The highest probability is around \({G^{\prime} }_{c}\simeq L^{\prime} \simeq 1\). The data with the values of L′ and \({G^{\prime} }_{c}\) computed for each protein in the CATH database are available at http://researchdata.cab.unipd.it/id/eprint/123.

A typical example with \({G^{\prime} }_{c}\simeq L^{\prime} \simeq 1\) is shown in Fig. 2a. As expected, however, there are cases of proteins with at least one entangled loop (|\({G^{\prime} }_{c}\)| ≥ 1) and all pairs of loops with negligible |L′|. These proteins correspond to the conformation sketched in Fig. 1d and to the natural protein represented in Fig. 2b. In other cases, the difference between |\({G^{\prime} }_{c}\)| and |L′| is large, even in the presence of linked loops. This is due to the behavior of the protein portion which, after threading the first loop, forms a second loop linked with it, and then continues to wind around it without further looping, see Fig. 2c.

It is interesting to observe that in several other cases the GE has a different sign with respect to the LE. This may take place if the chain winds around itself with opposite chiralities in different portions of the same protein. An example is shown in Fig. 2d,e. One of the most entangled structures found in the database, with \({G^{\prime} }_{c}\simeq L^{\prime} \simeq -\,3\), is shown in Fig. 2f.

Figure 3A shows that the GE is distributed over a broad spectrum of values and that the threshold |\({G^{\prime} }_{c}\)| ≥ 1 for entangled loops is conservative enough. Clusters emerge in the density plot of Fig. 3B, where the majority of points, which have low LE, are removed by excluding data with |L′| < 1/2 (see also Fig. S1, which is an enlargement of Fig. 3A). The clusters are found around \(L^{\prime} \simeq \pm \,1\) vs \({G^{\prime} }_{c}\simeq \pm \,1\); in particular, the most populated region has \({G^{\prime} }_{c}\simeq L^{\prime} \simeq 1\).

For further analysis, we consider only the GE indicator, which captures more varieties of entangled motifs than the LE (e.g. winding without linking, see Fig. 2b).

Entangled loops are found more frequently on the C-terminal side of the corresponding intertwining segment

In the definition of \({G^{\prime} }_{c}\)(i, j), the role of the loop γi cannot be exchanged with that of the other chain portion γj. This feature of the GE may be exploited to detect possible asymmetries in the respective location along the backbone of γi and γj. The score \({G^{\prime} }_{c}\)(i) associates to a given loop γi the open arm with which it is mostly entangled. We consider separately the following two cases: when the threading arm is between the N-terminus and the loop (N-terminal thread, see Fig. 1e) or when it is between the loop and the C-terminus (C-terminal thread, Fig. 1f).

We now focus on the properties of entangled loops (3.75% of the total) and we count how many of them are classified as N-terminal threads or C-terminal threads. In principle, there is no reason to expect one of the two classes to be more populated than the other.

In order to discuss carefully the statistical significance of possible asymmetries, we need to take into account that some degree of correlation occurs in the counting of entangled loops. In fact, different loops can belong to essentially the same topological configuration, for example when a protein arm intertwines both with the loop formed between amino acids i1 and i2, and with the one formed between i1 + 1 and i2. Thus, we employ a clustering procedure based on a pairwise distance defined between loops (see Materials and Methods for details).

By using the effective counts given by the clustering procedure, we find that the fraction of N-terminal (C-terminal) threads within entangled loops is 0.55 (0.45). The resulting bias in favor of the N-terminal threads (Fig. 1e) over the C-terminal ones (Fig. 1f) is statistically significant at the level of 14 standard deviations. A somewhat similar result was found by studying topological barriers in protein folding36.
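One simple way to express such a bias in units of standard deviations is a binomial z-score against the null hypothesis that the two classes are equally likely. The sketch below assumes effectively independent counts and uses a placeholder effective sample size, `n_eff = 20000`, which is hypothetical and not a number taken from the analysis; the actual significance estimate relies on the clustering procedure described in Materials and Methods.

```python
import math

def binomial_z(fraction, n_effective, p_null=0.5):
    """Deviation of an observed fraction from p_null, in units of the binomial standard error."""
    std_err = math.sqrt(p_null * (1.0 - p_null) / n_effective)
    return (fraction - p_null) / std_err

n_eff = 20000  # hypothetical effective number of independent entangled loops

# Observed fraction of N-terminal threads after the |G'c| maximization.
z_max = binomial_z(0.55, n_eff)

# Fraction of putative N-terminal threads in the random reference case.
z_random = binomial_z(0.487, n_eff)
```

With this placeholder the two deviations come out at roughly 14 and −3.7 standard errors, the same order of magnitude as the significances quoted in the text, which suggests that a sample of this size is at least a plausible scale for the effective counts.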

One may ask whether N-terminal threads are favored simply because entangled loops are by themselves closer to the C-terminus, without the need of considering the |\({G^{\prime} }_{c}\)| maximization that selects the intertwining thread. In order to check this, we consider a random reference case, whereby one putative thread is sampled randomly for each entangled loop, with uniform probability across all possible segments that do not overlap with it. As a matter of fact, putative N-terminal threads are not favored in the reference case. We find instead a small bias in the opposite direction; namely, the fraction of putative N-terminal (C-terminal) threads within entangled loops is 0.487 (0.513). This small but statistically significant (3.5 standard deviations) imbalance suggests that entangled loops are, if anything, located slightly closer to the N-terminus, thereby highlighting even more that the favored placement of the intertwining thread on the N-terminal side of the entangled loop is a genuine effect.

Entangled loops favor positive chirality only for N-terminal threads

The formation of an entangled structure is not simple, as it requires a non local concerted organization of the amino acids in space, where a crucial role is played by the order of formation of different native structural elements along the folding pathway18. A misplaced nucleation event in the early stages of the folding pathway might prevent the protein from folding correctly. For spontaneous “in vitro” refolding, there is no reason to expect the folding order of different elements to be related to a preferential specific direction along the chain.

Nevertheless, an asymmetry can be envisaged if a protein folds cotranslationally, according to the following argument. For the C-terminal thread, the loop might be formed in the early folding stages, making it difficult for the rest of the protein to entangle with it and thus to reach the native conformation. Conversely, for the N-terminal thread, the loop could wrap more easily around the open threading arm, already folded in its native conformation, after ejection from the ribosome. If confirmed, this picture would explain the asymmetry we observe between N- and C-terminal threads. The asymmetry could in any case be interpreted as a possible fingerprint of an evolutionary process, intimately related to entanglement regulation driven by cotranslational folding.

Such a conclusion is corroborated by looking, separately for C- and N-terminal threads, at the normalized distributions of loop-thread sequence separations s, plotted in Fig. 4a. The distributions for the random reference case (empty symbols) are very similar for N-terminal (triangles) and C-terminal (circles) threads, showing again that if the putative thread is chosen randomly no asymmetry is present. One would expect a uniform reference distribution for loops at a fixed distance from the relevant chain terminus (N for N-threads). The regular decay observed for increasing s is due instead to the fact that different protein domains have different lengths and different loops are located differently along the backbone. On the other hand, the |\({G^{\prime} }_{c}\)(i, j)| maximization leading to \({G^{\prime} }_{c}\)(i) preferentially selects arms that start just after (or before) the loop, at a distance of one or a few amino acids. This is similar to what was already observed for pokes33, and reflects the fact that a rapid turning of the protein chain is the simplest way to maximize the mutual winding between two subchains. However, loop-thread pairs that are one unit distance apart are significantly more favored for C-terminal (squares) than for N-terminal (diamonds) threads (notice the logarithmic scale and the associated statistical errors), showing again a genuine asymmetry between the two cases. A Kolmogorov-Smirnov test yields \(P=4\times {10}^{-58}\) for the null hypothesis that the two distributions are the same. Interestingly, the resulting enhancement at intermediate separations (5 < s < 20) allows N-terminal threads to follow closely the reference decay. Consistently with cotranslational folding, N-terminal threads could allow for more complex topological structures with on average larger separations, when compared to C-terminal threads.

Accordingly, the distribution of \({G^{\prime} }_{c}\)(i) values for both the N- and C-terminal threads, shown in Fig. 4b, highlights that the values around \({G^{\prime} }_{c}\)(i) ≈ 1 are more probable for N-terminal threads. Strikingly, this happens only for positive \({G^{\prime} }_{c}\)(i) values, whereas for negative ones there is a small bias favoring C-terminal threads. As a matter of fact, we find C-thread entangled loops to be balanced between positive and negative chiralities to within 0.3%, whereas N-thread entangled loops are highly biased (74%) in favor of positive chiralities, \({G^{\prime} }_{c}\)(i) > 0.
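For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, the following standard-library Python sketch computes the KS statistic, i.e. the largest gap between two empirical cumulative distributions, together with the usual asymptotic p-value. It is a generic textbook version written under simplifying assumptions (independent samples, no correction for the discreteness of s or for the clustering-based effective counts), and the sample values are hypothetical; it is not the procedure used to obtain the P-value quoted above.

```python
import math
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_pvalue_asymptotic(d, n_a, n_b, terms=100):
    """Asymptotic Kolmogorov p-value, 2 * sum_k (-1)^(k-1) exp(-2 k^2 lambda^2)."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))

# Hypothetical loop-thread separations for two small groups of loops.
separations_n = [1, 1, 2, 3]
separations_c = [1, 2, 2, 5, 8]
d_stat = ks_statistic(separations_n, separations_c)
p_value = ks_pvalue_asymptotic(d_stat, len(separations_n), len(separations_c))
```

For such tiny samples the asymptotic p-value is only indicative; with the thousands of loops available in a large database, even modest differences between the two distributions translate into very small p-values.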

Figure 4

(a) For four cases (see legend), distributions of the loop-thread sequence separation s. Error bars are based on the effectively independent counts determined through the clustering procedure. (b) For the separate cases of N- and C-terminal threads (see legend), tails of the distributions of the loop entanglement for |\({G^{\prime} }_{c}\)(i)| > 1/2. Error bars are based on the effectively independent counts determined through the clustering procedure.

Natural proteins are less entangled than protein-like compact conformations

In the ensemble of the CATH domains there are 3617208 loops, of which 135530 (3.75%) are entangled. To assess whether this fraction is small or large we compare it with an analogous quantity computed in an unbiased reference state formed by a set of putative alternative compact conformations (i.e. rich in secondary structures) that a protein could in principle adopt. This ensemble is found in a poly-valine “VAL60” database35, obtained with an all atom simulation that accurately sampled the configurational space of a homopolypeptide formed by 60 valine amino acids (see Materials and Methods for details).

For a proper comparison with VAL60, we restrict our CATH database to proteins of comparable length, selecting the 772 proteins with length n from n = 55 to n = 64 amino acids. In this reduced “CATH60” ensemble of natural proteins there are 47954 loops, of which 138 (0.3%) are entangled. There are 19 proteins (2.46%) hosting at least one entangled loop. These values are of course lower than those for the full CATH ensemble, in which longer proteins can host more entanglement. In VAL60 there are 2284693 loops, of which 57577 are entangled (2.52%), a fraction roughly nine times larger than for natural proteins of CATH60. Similarly, 3560 out of the 30064 VAL60 structures host at least one entangled loop (11.8%), a fraction about five times larger than for natural proteins.
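The percentages and ratios quoted for the three ensembles follow directly from the raw counts reported above, as the short Python check below illustrates. All numbers are taken from the text; only the dictionary layout and variable names are introduced here for convenience.

```python
# Raw counts as reported in the text.
counts = {
    "CATH":   {"loops": 3617208, "entangled": 135530},
    "CATH60": {"loops": 47954, "entangled": 138,
               "structures": 772, "with_entangled": 19},
    "VAL60":  {"loops": 2284693, "entangled": 57577,
               "structures": 30064, "with_entangled": 3560},
}

def percent(part, whole):
    return 100.0 * part / whole

# Percentage of loops that are entangled in each ensemble.
loop_pct = {name: percent(c["entangled"], c["loops"]) for name, c in counts.items()}

# Percentage of structures hosting at least one entangled loop.
structure_pct = {name: percent(c["with_entangled"], c["structures"])
                 for name, c in counts.items() if "structures" in c}

# How much more entangled VAL60 is than CATH60, by either measure.
loop_ratio = loop_pct["VAL60"] / loop_pct["CATH60"]
structure_ratio = structure_pct["VAL60"] / structure_pct["CATH60"]
```

The loop-level ratio comes out close to nine and the structure-level ratio close to five, so by either measure the homopolypeptide ensemble is several times more entangled than natural proteins of the same length.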

However, it is known that, presumably for kinetic reasons35, VAL60 is characterized by loops that are on average longer than those of natural proteins. Consequently, to avoid any possible bias in the comparison, we divide loops into classes of homogeneous length m. For some classes, the normalized histograms of the GE for the CATH60 and VAL60 datasets are plotted in Fig. 5a–c. In all cases it is apparent that the range of \({G^{\prime} }_{c}\)(i) is wider for the VAL60 homopolypeptides than for the natural proteins. The marked difference between the two distributions can be appreciated in Fig. 5d, where the root mean square \({G^{\prime} }_{c}\)(i) is plotted as a function of the loop length: the values for VAL60 are always significantly higher than those for natural proteins. Note that the root mean square \({G^{\prime} }_{c}\)(i) increases with m only up to half of the protein length. From there on, the remaining subchain starts getting too short to entangle.
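The length-resolved comparison amounts to grouping loops by length and computing a root mean square within each group. The Python sketch below assumes classes of five consecutive lengths, in line with the intervals 20 ≤ m ≤ 24, 30 ≤ m ≤ 34 and 40 ≤ m ≤ 44 used for the histograms; the function name and the input values are hypothetical.

```python
import math
from collections import defaultdict

def rms_by_length_class(loops, width=5):
    """Root mean square of the loop score within classes of loop length.

    `loops` is an iterable of (m, g) pairs: loop length and signed loop score.
    Each class is labelled by its smallest length, e.g. 20 for 20 <= m <= 24.
    """
    squares = defaultdict(list)
    for m, g in loops:
        squares[(m // width) * width].append(g * g)
    return {start: math.sqrt(sum(v) / len(v)) for start, v in sorted(squares.items())}

# Hypothetical (length, score) pairs.
loops = [(20, 0.3), (22, -0.4), (31, 1.0)]
rms = rms_by_length_class(loops)
```

Because the score is squared before averaging, positive and negative chiralities contribute equally, so the resulting curve measures the spread of the distribution rather than its mean.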

Figure 5

For both natural protein domains of length n in the range 55 ≤ n ≤ 64 from the CATH database, and the VAL60 ensemble of homopolypeptides, we plot the normalized histogram of \({G^{\prime} }_{c}\)(i) for loops of length m in the intervals 20 ≤ m ≤ 24 (a), 30 ≤ m ≤ 34 (b), and 40 ≤ m ≤ 44 (c). (d) For natural protein domains and the VAL60 ensemble, root mean squared \({G^{\prime} }_{c}\)(i) as a function of the loop length m.

In conclusion, we have clear statistical evidence that entangled loops occur less frequently in natural proteins than in random compact protein-like structures.

Amino acids at the ends of entangled loops are frustrated

In the preceding sections we provided two independent pieces of evidence that, although entangled loops are not rare in natural protein structures, their occurrence and position along the backbone chain are kept under control. A possible reason is the need to limit potential kinetic traps in the folding process brought about by entangled loops, for example by deferring their formation to the later stages of the folding pathway. Thus, we expect to detect a related fingerprint in the specific amino acids found in contact with each other at the end of entangled loops (“entangled contacts”). We check whether such amino acids share the same statistical properties as the amino acids forming any possible contact (“normal contacts”).

The frequency with which two amino acids are in contact is typically employed to estimate knowledge based potentials37,38. In a nutshell, if two amino acids a and b are in contact more frequently than average, they are expected to manifest a mutual attraction and are therefore characterized by a negative effective interaction energy Enorm(a, b) (see Materials and Methods).

If effective interaction energies are computed by restricting the analysis only to the entangled contacts, a new set of entangled contact potentials EGE(a, b) can be derived. The discrepancies between such potentials and the normal ones can be conveniently captured by an enrichment score ΔEenr(a, b). A negative enrichment score ΔEenr(a, b) < 0 implies that (a, b) are more frequently in contact when they are at the ends of entangled loops, and vice-versa for positive scores. Figure 6 shows that ΔEenr(a, b) anticorrelates with Enorm(a, b). This correlation is statistically significant. The Pearson correlation coefficient is r = −0.31, with a P-value of \(2\times {10}^{-6}\). The Spearman rank correlation is ρ = −0.23 with a P-value of \(6\times {10}^{-4}\).
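The sign convention of the enrichment score can be made concrete with a toy calculation. The sketch below assumes a simple log-ratio form, minus the logarithm of the contact frequency among entangled contacts divided by the frequency among normal contacts; this functional form, the function names and the counts are illustrative assumptions, and the definition actually used is the one given in Materials and Methods.

```python
import math

def contact_frequencies(counts):
    """Normalize raw contact counts per amino acid pair to frequencies."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def enrichment_scores(normal_counts, entangled_counts):
    """Assumed form: -ln(f_entangled / f_normal) for each amino acid pair."""
    f_norm = contact_frequencies(normal_counts)
    f_ent = contact_frequencies(entangled_counts)
    return {pair: -math.log(f_ent[pair] / f_norm[pair])
            for pair in f_ent
            if f_ent[pair] > 0 and f_norm.get(pair, 0) > 0}

# Hypothetical contact counts for three amino acid pairs.
normal_counts = {("LEU", "LEU"): 600, ("LYS", "GLU"): 300, ("SER", "THR"): 100}
entangled_counts = {("LEU", "LEU"): 30, ("LYS", "GLU"): 50, ("SER", "THR"): 20}
scores = enrichment_scores(normal_counts, entangled_counts)
```

In this toy data the hydrophobic pair is depleted among entangled contacts and gets a positive score, while the polar pairs are enriched and get negative scores, mirroring the convention stated in the text.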

Figure 6

Scatter plot of the enrichment score ΔEenr(a, b) vs normal contact potential Enorm(a, b). Each point is for an amino acid pair (a, b) and is colored according to amino acid types: black for pairs of aromatic residues (HIS, PHE, TRP, TYR); magenta for CYS-CYS; green for the rest. The dashed line is a linear fit with slope −0.12. Error bars are computed with a bootstrapping procedure and we plot only errors for ΔEenr as those for Enorm are smaller than the symbol size.

The anticorrelation of Fig. 6 has an important consequence: pairs of amino acids that in a globular protein interact strongly (Enorm(a, b) < 0, mainly hydrophobic amino acids) are present less often (ΔEenr(a, b) > 0) in entangled contacts, while amino acids that typically interact weakly (Enorm(a, b) > 0, mainly polar and hydrophilic amino acids) are instead more abundant (ΔEenr(a, b) < 0) at the ends of entangled loops. We checked that this result is not trivially due to entangled contacts being preferentially located on the protein surface, finding that residues involved in entangled contacts are even slightly more buried in the protein interior than those involved in normal contacts (see Fig. S2). The marked difference between the two sets of scores Enorm(a, b) and EGE(a, b) emerges clearly from the graphical representations in Fig. 7 of Enorm(a, b) and ΔEenr(a, b), in which positive and negative values are marked red and blue, respectively, whereas white boxes mark scores that are not significant within the related statistical uncertainty.

Figure 7

(a) Normal contact potential Enorm; amino acids are ranked from left to right (top to bottom) with increasing average Enorm (over row/column). (b) Enrichment score ΔEenr for entangled contacts. Different backgrounds are used for highlighting negative and positive values: blue for E < −E0, light blue for −E0 ≤ E ≤ 0, pink for 0 < E ≤ E0, and red for E > E0 with E0 = 35. White is used for scores that differ from zero less than the corresponding statistical uncertainty, computed by means of a bootstrapping procedure.

The blue spots in Fig. 7a represent interactions between amino acids that interact frequently with each other (mainly hydrophobic pairs), whereas the red area is populated by amino acids that are rarely in contact (mainly polar pairs).

In Fig. 7b, the blue spots highlight amino acids that have decreased their energy score and which are therefore more present at the ends of the entangled loops than in normal contacts. These include mainly polar amino acids. Note that proline is particularly enriched at the end of entangled loops. The red spots in Fig. 7b indicate amino acids which are less present at the ends of the entangled loops than in normal contacts. These include mainly hydrophobic ones. The case of cysteine self-interaction is instructive: the strongest attractive interaction between amino acids turns out to be the most diminished one at the end of entangled loops (see also Fig. 6), consistently with the very low number of linked loops closed by disulphide bonds (Fig. 1b) that was found in the PDB39.

Interestingly, the four aromatic amino-acids (HIS, PHE, TRP, TYR) violate the general trend. Interactions between aromatic pairs are found in the bottom-left quadrant of Fig. 6. Despite being very frequent in normal contacts (all their mutual entries are dark blue in Fig. 7a), they become even more abundant when at the ends of entangled loops (still blue in Fig. 7b), highlighting a special role likely played by aromatic rings in such complex structures.

Figures 6 and 7b provide clear evidence for the existence of an evolutionary pressure shaping the amino acid sequences. This natural bias weakens energetically the contacts which close entangled loops, consistently with the argument that a too early stable formation of the loop could prevent the correct folding of the full protein.

These results are very robust to changes in the \({G^{\prime} }_{c}\) threshold used to define entangled contacts, in the minimum length m0 of the considered subchains, and to the introduction of a minimum loop-thread separation s0, see Figs S3–S5.

Discussion and Conclusions

With the notion of Gaussian entanglement we extend the measure of mutual entanglement between two loops to any pair of open subchains of a protein structure. This allows us to perform an unprecedented large-scale investigation of the self-entanglement properties of protein native structures, through which we identify and locate a large variety of entangled motifs (Fig. 2), by focusing on the notion of “entangled loop”, a loop intertwining with another subchain (Fig. 1d). Different entangled motifs can coexist in the same protein domain, even with opposite chiralities, and a few domains exhibit a pair of loops intertwining as many as three times around each other (see the examples in Fig. 2d,f, and points in Fig. 3). Gaussian entanglement could be used to improve the classification of existing protein folds40, as previously done with Gauss integrals computed over the whole protein chain41.

Our analysis shows unequivocally (Fig. 5) that, although entangled motifs are present in a remarkably high fraction, 32%, of protein domains, these host fewer entangled loops than protein-like decoys produced with molecular dynamics simulations35. The question is then why natural folds avoid overly entangled conformations with otherwise plausible secondary structure elements. Are entangled loops obstacles for the folding process? If yes, how does Nature cope with them when they are present?

To answer these questions, we recall that an efficient folding of proteins is fundamental for sustaining the biological machinery of cell functioning. The rate and the energetics of the protein folding process, which are defined by its energy landscape, are encoded in the amino acid sequence. Over the course of evolution, this landscape was shaped to allow and stabilize protein folding, avoiding possible slowdowns.

We find indeed two clear hallmarks suggesting that the entangled loops in proteins are kept under control: (i) a statistically significant asymmetry in their positioning with respect to the other intertwining chain portion, which is consistent with cotranslational folding promoting the presence of entangled loops on the C-terminal side of the intertwining thread (see Fig. 4a); (ii) weak non optimized interactions between the amino acids in contact at the end of entangled loops, an example of energetic frustration (see Figs 6 and 7). Both these findings suggest that the late formation of entangled loops along the folding pathway could be a plausible control mechanism to avoid kinetic traps. On the other hand, the additional stability that one expects to be provided by entangled and knotted structures can compensate for the presence of such weak interactions.

Interestingly, interactions between aromatic amino acid pairs are promoted at the end of entangled loops (see Fig. 7b), suggesting that their presence could be related to the protein biological function. Whether entangled loops may have specific biological functions is an intriguing open question, as in the case of knots in protein domains17,42.

Finally, we detect a remarkable bias, favoring positive chiralities, that is present only for entangled loops on the C-terminal side of the intertwining thread (see Fig. 4b). This suggests that the observed chirality bias arises in the context of cotranslational folding. A simple possibility is that loop winding of the C-terminal part of the chain may have a preferred orientation when just released from the ribosome. Further work will be needed to test this speculation and to fully rationalize the chirality bias. As a matter of fact, the ribosome can discriminate the chirality of amino acids during protein synthesis43.

Stemming from works on glassy transitions44,45, the concept of minimal frustration between the conflicting forces driving the folding process is a well established paradigm24,25,26 in protein physics. It has been further argued26 that frustration is an essential feature for the folding dynamics and that it can give surprising insights into how proteins fold or misfold.

Is it possible to reconcile the frustration detected at the ends of entangled loops with the minimal frustration principle? Let us assume that a non optimal ordering of the events along the folding pathways (for example, the formation of a loop which has then to be threaded by another portion of the protein to form an entangled structure) is highly deleterious. In order to prevent this, it could indeed be preferable to select suitable sub-optimal interactions. In fact, this would be a remarkable example of minimal frustration in action, having to compromise between topological and energetic frustration.

Clearly, further data from both simulations and experiments will be needed to confirm the proposed folding mechanism. In either case, a simple protocol could consist in mutating into cysteine both residues at the ends of an entangled loop, provided no other cysteines are present in the sequence, and in assessing whether the folding is then hindered by the formation of a disulfide bridge in oxidizing conditions. In the context of knotted proteins, single molecule force spectroscopy techniques were shown to be particularly useful in controlling the topology of the unfolded state46. Similarly, both “in vivo” folding experiments47 and appropriate simulation protocols48,49,50 could be employed to test the possible role of cotranslational folding in determining the patterns detected for entangled motifs: double cysteine mutants would then be predicted to be more deleterious for the folding of C-terminal threads with respect to N-terminal threads. In all cases, it is essential to gather statistics over several different proteins before validating or rejecting our hypothesis. The signals that we reveal in this contribution are statistical in nature; therefore, we do not expect all entangled loops to form late in the folding process, nor all C-terminal threads to be cotranslationally disfavored. For example, it has been recently proposed, in the context of deeply knotted proteins, that loops formed by an earlier synthesized portion of the same protein can be actively threaded by nascent chains at the ribosome51. However, this is not in contradiction with our findings, since knotted proteins are much less frequent than the general entangled motifs discussed here.

Materials and Methods

CATH database

We use the v4.1 release of the CATH database for protein domains, with a non-redundancy filter of 35% homology52. To avoid introducing entanglement artificially for proteins with big gaps in their experimental native structures, we do not consider any protein in the CATH database that presents a distance > 10 Å between subsequent Cα atoms in the coordinate file. We find that this selection keeps Nprot = 16968 out of the available 21155 proteins. CATH domain names such as 2bjuA02 refer to the 2nd domain from chain A with PDB code 2bju. The CATH database is available at http://download.cathdb.info/cath/releases/all-releases/v4_1_0/.
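The gap filter described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `has_large_gap` and the representation of a domain as a list of Cα coordinates (in Å) are assumptions made here.

```python
import math

def has_large_gap(ca_coords, max_gap=10.0):
    """Return True if two subsequent C-alpha atoms are more than max_gap (angstrom) apart.

    ca_coords: list of (x, y, z) tuples, in chain order (hypothetical input format).
    """
    for a, b in zip(ca_coords, ca_coords[1:]):
        if math.dist(a, b) > max_gap:
            return True
    return False

# Domains for which has_large_gap(...) is True would be discarded.
```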

Poly-valine database

The VAL60 database is an ensemble of 30064 structures obtained by an exhaustive exploration of the conformational space of a 60 amino acid poly-valine chain described with an accurate all-atom interaction potential35. The exploration was performed with molecular dynamics simulations using the AMBER03 force field53 and the molecular dynamics package GROMACS54 and by exploiting a bias exchange metadynamics approach55 with 6 replicas. The simulation was performed in vacuum at a temperature of 400 K. The conformations have been selected as local minima of the potential energy with a secondary structure content of at least 30% and a small gyration radius. The protein-like character of VAL60 conformations was successfully tested by using different criteria commonly employed to assess the quality of protein structures35. The stability of a small subset of VAL60 structures was successfully tested even after mutation of all residues to alanine. Crucially, it was observed that the VAL60 database contains almost all the naturally occurring folds of similar length35. However, these known folds form a rather small subset of the full ensemble, which can be thought of as an accurate representation of the universe of all possible conformations physically attainable by polypeptide chains of length around 60. A repository for the VAL60 database is available at https://doi.org/10.5061/dryad.1922.

Mathematical definition of the linking number and its computational implementation

The linking number between two closed oriented curves γi = {r(i)} and γj = {r(j)} in \({{\mathbb{R}}}^{3}\) may be computed with the Gauss double integral

$$G\equiv \frac{1}{4\pi }{\oint }_{{\gamma }_{i}}{\oint }_{{\gamma }_{j}}\frac{{{\boldsymbol{r}}}^{(i)}-{{\boldsymbol{r}}}^{(j)}}{{|{{\boldsymbol{r}}}^{(i)}-{{\boldsymbol{r}}}^{(j)}|}^{3}}\cdot (d{{\boldsymbol{r}}}^{(i)}\times d{{\boldsymbol{r}}}^{(j)})$$
(1)

It is an integer number and a topological invariant27. If computed for open curves, it becomes a real number G′ (the GE) that quantifies the mutual entanglement between the curves9,23,28,29,30. In proteins, piece-wise linear curves join the coordinates of subsequent Cα atoms. In particular, γi is an open subchain joining Cα atoms from index i1 to i2 and similarly γj is another nonoverlapping subchain from j1 to j2.

We specialize to the configurations studied in ref.9, in which amino acids i1 and i2 are required to be in contact. In this study, a contact is present if any of the heavy (non hydrogen) atoms of residue i1 is near any of the heavy atoms of residue i2, namely if they are at a distance of at most d = 4.5 Å. The “contact” Gaussian entanglement of these configurations (sketched in Fig. 1d–f) is named \({G^{\prime} }_{c}\)(i, j). Since proteins are thick polymers and bonds joining Cα atoms are quite far from each other (compared to their length), we may approximate the integral in Eq. (1) with a discrete sum. Given the coordinates ri of the Cα atoms, the average bond positions \({{\boldsymbol{R}}}_{i}\equiv \frac{1}{2}({{\boldsymbol{r}}}_{i}+{{\boldsymbol{r}}}_{i+1})\) and the bond vectors \({\rm{\Delta }}{{\boldsymbol{R}}}_{i}\equiv {{\boldsymbol{r}}}_{i+1}-{{\boldsymbol{r}}}_{i}\) enter in the estimate of \({G^{\prime} }_{c}\)(i, j) for γi and γj,

$${G^{\prime} }_{c}(i,j)\equiv \frac{1}{4\pi }\sum _{i={i}_{1}}^{{i}_{2}-1}\,\sum _{j={j}_{1}}^{{j}_{2}-1}\,\frac{{{\boldsymbol{R}}}_{i}-{{\boldsymbol{R}}}_{j}}{{|{{\boldsymbol{R}}}_{i}-{{\boldsymbol{R}}}_{j}|}^{3}}\cdot ({\rm{\Delta }}{{\boldsymbol{R}}}_{i}\times {\rm{\Delta }}{{\boldsymbol{R}}}_{j}).$$
(2)

We then associate a contact entanglement \({G^{\prime} }_{c}\)(i) to a “loop” γi as the extreme (i.e. with largest modulus) \({G^{\prime} }_{c}\)(i, j), for all “threads” γj, with j2 − j1 ≥ m0 (m0 = 10). Finally, the contact entanglement \({G^{\prime} }_{c}\) of a protein is the extreme of \({G^{\prime} }_{c}\)(i) for all loops of length m = i2 − i1 ≥ m0. The linking entanglement L′ is equal to \({G^{\prime} }_{c}\) for configurations with two loops as in Fig. 1c. It is not exactly the linking number L because the two closures between contacts are not performed.
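As an illustration of Eq. (2) and of the search for the extreme thread, a minimal pure-Python sketch is given below. The function names (`gauss_pair`, `loop_entanglement`) are hypothetical. The thread enumeration is a naive one and assumes that a thread shares no residue with the loop (j2 < i1 or j1 > i2); it is not claimed to reproduce the implementation used for the results in the main text.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _mid(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, (a[2] + b[2]) / 2)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def gauss_pair(ca, i1, i2, j1, j2):
    """Discrete Gauss sum of Eq. (2) between bonds i1..i2-1 and bonds j1..j2-1.

    ca: list of C-alpha coordinates (x, y, z); bond k joins ca[k] and ca[k+1].
    """
    total = 0.0
    for i in range(i1, i2):
        R_i, dR_i = _mid(ca[i], ca[i + 1]), _sub(ca[i + 1], ca[i])
        for j in range(j1, j2):
            R_j, dR_j = _mid(ca[j], ca[j + 1]), _sub(ca[j + 1], ca[j])
            d = _sub(R_i, R_j)
            total += _dot(d, _cross(dR_i, dR_j)) / math.sqrt(_dot(d, d)) ** 3
    return total / (4 * math.pi)

def loop_entanglement(ca, i1, i2, m0=10):
    """Extreme (largest modulus) Gauss sum over all threads of length >= m0.

    Assumption: threads lie entirely before i1 or entirely after i2.
    """
    best = 0.0
    n = len(ca)
    for j1 in range(n):
        for j2 in range(j1 + m0, n):
            if j2 < i1 or j1 > i2:
                g = gauss_pair(ca, i1, i2, j1, j2)
                if abs(g) > abs(best):
                    best = g
    return best
```

The cost of this direct enumeration grows quickly with chain length; it is meant only to make the definitions concrete.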

Clustering procedure for counting effectively independent loops

Each entangled loop γ is characterized by five numbers, its two indices (i1, i2), the indices of the threading portion (j1, j2), and the corresponding Gaussian entanglement \({G^{\prime} }_{c}\)(i, j). It is thus natural to define a distance between two entangled loops γA and γB as

$${d}_{AB}=\sqrt{{({i}_{1}^{A}-{i}_{1}^{B})}^{2}+{({i}_{2}^{A}-{i}_{2}^{B})}^{2}+{({j}_{1}^{A}-{j}_{1}^{B})}^{2}+{({j}_{2}^{A}-{j}_{2}^{B})}^{2}+{w}_{g}{[{G^{\prime} }_{c}({i}^{A},{j}^{A})-{G^{\prime} }_{c}({i}^{B},{j}^{B})]}^{2}}.$$
(3)

where wg is a weight to be defined. In order to count the effectively independent loops, we use the following procedure within each protein in the CATH database. First, we select the loop with the largest number of neighbors, namely with the largest number of loops at a distance smaller than a threshold d*. We assign the selected loop and all its neighbors to the same cluster, removing them from the running list of loops. We iterate this procedure until the running list is empty, so that each loop γi belongs to a cluster with a given number of members \({N}_{{C}_{i}}\). Each loop is then included in all the statistics and distributions reported in the main text, with an effective counting weight \(1/{N}_{{C}_{i}}\). By using d* = 20, wg = 104, we find an overall effective count of 18041 independent entangled loops, about 13% of the original 135530 counts. The results reported in the main text are qualitatively robust against reasonable variations of the d*, wg parameters. The data with the values of \({G^{\prime} }_{c}\) computed for each loop in each protein in the CATH database, and grouped after clustering, are available at http://researchdata.cab.unipd.it/id/eprint/123.
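The greedy clustering and the resulting counting weights can be sketched as follows. This is an illustrative reading of the procedure: the names `loop_distance` and `cluster_weights` are hypothetical, each loop is represented as a tuple (i1, i2, j1, j2, G), neighbors are counted among the loops still in the running list, and ties are broken by loop index. The last two points are assumptions not specified in the text.

```python
import math

def loop_distance(a, b, w_g):
    """Distance of Eq. (3) between two loops given as (i1, i2, j1, j2, G)."""
    index_part = sum((a[k] - b[k]) ** 2 for k in range(4))
    return math.sqrt(index_part + w_g * (a[4] - b[4]) ** 2)

def cluster_weights(loops, d_star, w_g):
    """Return the counting weight 1/N_C for every loop of one protein."""
    remaining = set(range(len(loops)))
    weights = [0.0] * len(loops)
    while remaining:
        def neighbours(p):
            return [q for q in remaining
                    if q != p and loop_distance(loops[p], loops[q], w_g) < d_star]
        # Loop with the most neighbors in the running list (ties: lowest index).
        centre = max(sorted(remaining), key=lambda p: len(neighbours(p)))
        members = [centre] + neighbours(centre)
        for m in members:
            weights[m] = 1.0 / len(members)
        remaining -= set(members)
    return weights
```

With these weights, the effective number of independent loops in a protein is simply `sum(weights)`, which equals the number of clusters found.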

Inference of statistical potentials

In order to estimate effective interactions between amino acids in protein structures, we use an established knowledge based approach38. Pairwise potentials can be obtained by analyzing databases of known protein conformations56. These potentials are derived by measuring the probability of an observable, such as the formation of a contact, relative to a reference unbiased state37. The probability is converted into an energy by employing Boltzmann’s law57.

The first step includes characterizing the reference null space of possible pairs of amino acids. All amino acid pairs within each protein sum up to a grand total of N generic pairs (i.e. just combinatorial pairings not necessarily related to a spatial contact) in our ensemble of protein structures. In the same way, given two amino acid kinds a and b, one sums up the occurrence of a-b pairs within each protein to a grand total of N(a, b) pairs in the ensemble.

To quantify energies of “normal” contacts Enorm(a, b) between amino acids of type a and b, we consider two amino acids to be in contact if any inter-residue pair of their side chain heavy atoms is found at a distance lower than 4.5 Å. By considering only the ensemble of amino acids which are in contact within each protein, their total counting results in Nc generic contacts. Similarly, the specific contacts between amino acids of kind a and b are summed up to a total Nc(a, b).
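The contact criterion used for the potentials can be illustrated with a short sketch. The function name `in_contact` and the input format (one list of side chain heavy atom coordinates per residue, in Å) are assumptions made for this example.

```python
import math

def in_contact(atoms_a, atoms_b, cutoff=4.5):
    """True if any pair of atoms, one from each residue, is closer than cutoff (angstrom)."""
    return any(math.dist(p, q) < cutoff for p in atoms_a for q in atoms_b)
```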

The statistical potentials for normal contacts are defined by comparing the frequencies37,58

$$f(a,b)=N(a,b)/N,\,\,{f}_{c}(a,b)={N}_{c}(a,b)/{N}_{c},$$
(4)

within the ensemble of all pairs or the ensemble of contacts, respectively. If fc(a, b) is relatively high compared to f(a, b), it means that chemistry and natural selection favored the organization of native protein structures toward configurations where a and b types were in contact. Thus, the argument is that a lower potential energy should be associated with such contacts; the normal contact potentials are therefore given by

$${E}_{{\rm{norm}}}(a,b)=-\,\tau \,\mathrm{log}\,\frac{{f}_{c}(a,b)}{f(a,b)},$$
(5)

where we introduced a parameter τ = 100 for the convenience of rescaling numbers and rounding them off to readable integers.
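A minimal sketch of Eqs. (4) and (5) is given below. It assumes that "log" denotes the natural logarithm, that pairs are unordered (a-b is the same as b-a), and that pairs never observed in contact are simply skipped. The function name `contact_potential` and the input format (lists of amino acid type pairs) are hypothetical.

```python
import math
from collections import Counter

def contact_potential(all_pairs, contact_pairs, tau=100.0):
    """Return {(a, b): E_norm(a, b)} from lists of amino acid type pairs.

    all_pairs: every combinatorial pair (a, b) in the ensemble.
    contact_pairs: the subset of pairs that are in spatial contact.
    """
    n_all = Counter(tuple(sorted(p)) for p in all_pairs)
    n_con = Counter(tuple(sorted(p)) for p in contact_pairs)
    N = sum(n_all.values())
    N_c = sum(n_con.values())
    energies = {}
    for key, count in n_con.items():
        if n_all[key] == 0:
            continue  # should not happen: contacts are a subset of all pairs
        f = n_all[key] / N      # f(a, b) in Eq. (4)
        f_c = count / N_c       # f_c(a, b) in Eq. (4)
        energies[key] = -tau * math.log(f_c / f)  # Eq. (5)
    return energies
```

For instance, a pair type that makes up half of all pairs but three quarters of the contacts gets a negative (favorable) energy of −τ log 1.5.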

We can compute similar potentials EGE(a, b) for “entangled” contacts, by restricting the analysis to the subset of contacts between amino acids that are at the end of an “entangled loop”, defined as a loop γi for which |\({G^{\prime} }_{c}\)(i)| ≥ 1, that is a loop for which at least one thread γj exists such that the corresponding |\({G^{\prime} }_{c}\)(i, j)| ≥ 1. Within all proteins, we count a total of \({N}_{c}^{G}\) such contacts, of which \({N}_{c}^{G}(a,b)\) are between amino acids of kind a and b, and hence

$${f}_{c}^{G}(a,b)={N}_{c}^{G}(a,b)/{N}_{c}^{G},$$
(6)
$${E}_{{\rm{GE}}}(a,b)=-\,\tau \,\mathrm{log}\,\frac{{f}_{c}^{G}(a,b)}{f(a,b)}.$$
(7)

To easily capture dissimilarities between Enorm(a, b) and EGE(a, b), we introduce an enrichment score defined as

$${\rm{\Delta }}{E}_{{\rm{enr}}}(a,b)=-\,\tau \,\mathrm{log}\,\frac{{f}_{c}^{G}(a,b)}{{f}_{c}(a,b)}={E}_{{\rm{GE}}}(a,b)-{E}_{{\rm{norm}}}(a,b)$$
(8)
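The enrichment score of Eq. (8) depends only on the contact frequencies, since the reference frequency f(a, b) cancels in the difference EGE − Enorm. A short numerical sketch, with hypothetical function names and made-up counts, again assuming the natural logarithm:

```python
import math

def potential(f_obs, f_ref, tau=100.0):
    """Generic statistical potential -tau * log(f_obs / f_ref), as in Eqs. (5) and (7)."""
    return -tau * math.log(f_obs / f_ref)

def enrichment(n_c_ab, n_c, n_g_ab, n_g, tau=100.0):
    """Eq. (8) from raw counts of normal and entangled contacts."""
    f_c = n_c_ab / n_c   # f_c(a, b)
    f_g = n_g_ab / n_g   # f_c^G(a, b)
    return potential(f_g, f_c, tau)

# Made-up counts: the pair is twice as frequent among entangled contacts.
delta = enrichment(n_c_ab=50, n_c=1000, n_g_ab=10, n_g=100)
```

A negative value indicates that the pair is enriched at the ends of entangled loops relative to normal contacts, while a positive value indicates depletion.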

In all cases we have imposed a constraint on the pairs of amino acids considered: the two amino acids i1 and i2 in contact must have an index difference i2 − i1 ≥ m0 = 10. This threshold m0 removes any possible bias in comparing potentials due to the entanglement constraint, which requires entangled loops and threads of a minimal length to be present.

We computed all statistical potentials, together with the related uncertainties, by using a bootstrapping procedure with 101 independent resamplings. The counts and the statistical scores obtained for each amino-acid pair are available at http://researchdata.cab.unipd.it/id/eprint/123.
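One possible form of such a bootstrap is sketched below for a single amino acid pair. The text does not specify which units are resampled; this example assumes that whole proteins are resampled with replacement, and that the uncertainty is the standard deviation over resamplings. The function names and the per-protein tuple format (N(a, b), N, Nc(a, b), Nc) are hypothetical.

```python
import math
import random
import statistics

def potential_from_counts(n_ab, n, nc_ab, nc, tau=100.0):
    """E_norm(a, b) of Eq. (5) from summed counts."""
    return -tau * math.log((nc_ab / nc) / (n_ab / n))

def bootstrap_potential(proteins, n_resamples=101, tau=100.0, seed=0):
    """Mean and standard deviation of the potential over bootstrap resamplings.

    proteins: list of per-protein counts (n_ab, n, nc_ab, nc) for one pair (a, b).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_resamples):
        sample = rng.choices(proteins, k=len(proteins))  # resample with replacement
        n_ab, n, nc_ab, nc = (sum(col) for col in zip(*sample))
        if n_ab > 0 and nc_ab > 0:  # skip resamplings where the log is undefined
            values.append(potential_from_counts(n_ab, n, nc_ab, nc, tau))
    return statistics.mean(values), statistics.stdev(values)
```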