To sample the conformational space of a protein, we use a Branch-and-Prune algorithm to build a tree in which each node represents a solution for one atomic position. In the present work, we limit ourselves to calculating the backbone and Cβ atomic coordinates.
The constraints used to generate atomic coordinates along the Branch-and-Prune algorithm are the following:
1. covalent distance constraints corresponding to bond lengths and bond angles, whose values are derived from high-resolution small molecule X-ray crystal structures [25];
2. NMR distance constraints;
3. van der Waals radii for non-bonded atom pairs (i,j): a fraction of the sum of the van der Waals radii of the two atoms provides a lower bound on the corresponding inter-atomic distance:
$$ d_{ij}\geq \sigma (r^{vdw}_{i} + r^{vdw}_{j}), $$
((1))
where σ∈[0,1] is typically around 0.85. The values for the radii are given in Table 1 [26,27]. These lower bounds apply only in the cases where no larger lower bound has been determined from NMR distance constraints;
4. distances derived from the backbone torsion angles ϕ and ψ;
5. hydrogen bonds in α-helices;
6. amino-acid chirality;
7. α-helix geometry.
Table 1 Van der Waals radii (see [26] and [27])
The atom coordinates are calculated, one by one, following the atom order \(P_{ato}\) described in Figure 3 and previously proposed in [24]. In this order, some atoms are repeated to ensure that any entered atom is defined by distance constraints with respect to three preceding atoms in \(P_{ato}\) [24]. The carbonyl oxygens and the Cβ atoms, which are not present in the order \(P_{ato}\), are calculated separately.
The tree is then built using a recursive procedure, called the branching phase, which creates each node of the tree. The created nodes are submitted to the pruning devices, which decide whether each node should be kept or removed. If a node is removed, all branches starting from it are also pruned. A pruning device checks whether a partial solution is feasible, i.e. whether a set of embedded atoms fulfills the constraints (1)-(7) described above.
In the following, we describe the branching phase and the pruning devices. Then, the complexity of the algorithm is described from a theoretical point of view, before presenting some application cases.
Branching devices
The tree parsed during iBP is formed by nodes, each corresponding to one set of atomic coordinates from the order \(P_{ato}\) (Figure 3) [24]. At each level of the tree, the atomic coordinates of the corresponding atom are calculated by a recursive procedure, called the branching phase. The current atom position is defined by distance constraints to three other atoms. These distances are obtained from the constraints (1-3) described above: (1) the covalent constraints, (2) the NMR distance constraints, (3) the van der Waals radii.
If the distance constraints specify a unique value rather than an interval, this signifies that the distances to three immediate predecessors from the current vertex are known: these are the centers of the three spheres, and the distances are the radii of these spheres. The position of the current vertex/atom is thus defined by the intersection of three spheres, so there are at most two solutions for the current atom position: this is called a 2-branching situation (Figure 4).
When a distance is not uniquely defined, but rather defined by lower and upper bounds, i.e. \(d_{i,j}\in[l_{i,j},u_{i,j}]\), this distance is uniformly discretized by sampling b≥1 values in \([l_{i,j},u_{i,j}]\), as depicted in Figure 5.
$$ \tilde d_{i}=\left\{ l_{i,i-3} + (t-1)\frac{(u_{i,i-3}-l_{i,i-3})}{b} : t=1,\ldots,b\right\}. $$
((2))
In this case, we have a b-branching situation.
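The uniform sampling of Equation (2) can be sketched in a few lines of Python. This is only an illustration: the function name `discretize_interval` and the numerical bounds are assumptions made here, not part of the iBP code.

```python
def discretize_interval(lower, upper, b):
    """Return the b samples l + (t - 1) * (u - l) / b, t = 1..b, of Eq. (2)."""
    if b < 1:
        raise ValueError("b must be at least 1")
    if upper < lower:
        raise ValueError("upper bound must not be smaller than lower bound")
    step = (upper - lower) / b
    return [lower + (t - 1) * step for t in range(1, b + 1)]


# Hypothetical bounds (in Angstrom) for one interval distance, b = 4.
samples = discretize_interval(3.0, 5.0, 4)
print(samples)  # [3.0, 3.5, 4.0, 4.5]
```

Note that, with the formula as written in Equation (2), the samples start at the lower bound and the upper bound itself is never selected.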
The algorithm used for calculating the atom coordinates is then applied to each set of \(\tilde {d}_{i}\) values sampled for the distance constraints. The choice of the discretization factor b is a crucial point: a small value might lead to an infeasible problem because none of the sampled distances may be feasible; a larger value increases the computational burden. In general, the finer the discretization, the more accurate the computation, but it is not trivial to determine the optimal value for b. One way to choose b is to consider that the number of nodes in the search tree is bounded by \(3+2^{l}b^{k}\), where l is the number of tree levels where we have a 2-branching situation, and k is the number of tree levels where we have a b-branching situation [28]. Appropriate values of b should result in a manageable number of nodes.
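To give a feeling for how quickly this bound grows with b, the following sketch evaluates it for a few values. It reads the bound as 3 + 2^l b^k, which is our interpretation of the expression above; the function name `node_bound` and the values of l and k are purely illustrative.

```python
def node_bound(l, k, b):
    """Bound 3 + 2**l * b**k on the number of search-tree nodes (assumed form)."""
    if l < 0 or k < 0 or b < 1:
        raise ValueError("l and k must be non-negative and b at least 1")
    return 3 + (2 ** l) * (b ** k)


# Hypothetical instance: 10 levels with 2-branching, 5 levels with b-branching.
for b in (2, 4, 8):
    print(b, node_bound(l=10, k=5, b=b))
```

Doubling b multiplies the second term by 2^k, so even a moderate number of interval distances makes large values of b impractical.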
Given the positions of the three previous atoms k−3, k−2, k−1 in the order \(P_{ato}\), and given the distance constraints between these atoms and the atom k to be embedded, the position of k is calculated by a recursive matrix multiplication making use of the set of distances \(d=\{d_{k,k-1},d_{k,k-2},d_{k,k-3}\}\) between the previous atoms and k. Although there are several methods to compute sphere intersections [29], in our experience, the best trade-off between efficiency and numerical stability is given by the use of recursion matrices [23], and of the two following angles: (i) the torsion angle \(\omega_{3}\) formed by atoms {k,k−1,k−2,k−3}, which depends on the distance between k and k−3; (ii) the angle \(\theta_{2}\) formed by atoms {k,k−1,k−2}.
The recursion is applied through the equation:
$$ \begin{aligned} \left[\begin{array}{c} x_{k} \\ y_{k} \\ z_{k} \\ 1 \end{array} \right] &= B_{1} B_{2} B_{3} \ldots B_{k}(d,\sigma) \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right]\\ &= Q_{k-1}B_{k}(d,\sigma) \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right] = Q_{k} \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right], \end{aligned} $$
((3))
where:
$$ \begin{aligned} B_{k} (d,\sigma) = \left[\begin{array}{cccc} -\cos\theta_{2} & -\sigma \sin\theta_{2} &0& -d_{k,k-1}\cos\theta_{2}\\ \sigma\sin\theta_{2}\cos\omega_{3} & -\cos\theta_{2}\cos\omega_{3} & -\sin\omega_{3}& \sigma d_{k,k-1} \sin\theta_{2}\cos\omega_{3}\\ \sigma\sin\theta_{2}\sin\omega_{3} & -\cos\theta_{2} \sin\omega_{3}&\cos\omega_{3} &\sigma d_{k,k-1} \sin\theta_{2}\sin\omega_{3}\\ 0 & 0& 0&1 \end{array}\right], \end{aligned} $$
((4))
and σ∈{+1,−1}. The series of recursion matrices is initialized as:
$$ \begin{aligned} B_{1}= \left[\begin{array}{cccc} 1 & 0& 0&0\\ 0 & 1& 0&0\\ 0 & 0& 1&0\\ 0 & 0& 0&1 \end{array}\right], B_{2}= \left[\begin{array}{cccc} -1 & 0& 0& -d_{2,1}\\ 0 & 1& 0&0\\ 0 & 0& -1&0\\ 0 & 0& 0&1 \end{array}\right],\\ B_{3}= \left[\begin{array}{cccc} -\cos\theta_{3} & -\sin\theta_{3} &0& -d_{3,2}\cos\theta_{3}\\ \sin\theta_{3}& -\cos\theta_{3} & 0 & d_{3,2} \sin\theta_{3}\\ 0 & 0& 1&0\\ 0 & 0& 0&1 \end{array}\right]. \end{aligned} $$
((5))
Here \(d_{2,1}\) is the distance between the first and the second atom, and \(d_{3,2}\) the distance between the third and the second atom in the order \(P_{ato}\).
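The recursion of Equation (3) can be illustrated by a minimal pure-Python sketch. It transcribes \(B_{k}(d,\sigma)\) from Equation (4) as printed and takes the third matrix as \(B_{k}\) with ω = 0 and σ = +1, which places the third atom in the plane at distance \(d_{3,2}\) from the second one. The helper names (`matmul`, `b_matrix`, `embed_chain`) and the simplifying assumption of a single bond length and a single bond angle for the whole chain are ours; the actual implementation is in C++ and handles the full atom order.

```python
import math


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(4)) for j in range(4)]
            for i in range(4)]


def b_matrix(d, theta, omega, sigma=1):
    """Recursion matrix B_k(d, sigma), transcribed from Eq. (4)."""
    ct, st = math.cos(theta), math.sin(theta)
    co, so = math.cos(omega), math.sin(omega)
    return [
        [-ct, -sigma * st, 0.0, -d * ct],
        [sigma * st * co, -ct * co, -so, sigma * d * st * co],
        [sigma * st * so, -ct * so, co, sigma * d * st * so],
        [0.0, 0.0, 0.0, 1.0],
    ]


def embed_chain(d, theta, omegas, sigmas):
    """Embed 3 + len(omegas) atoms (assumption: constant d and theta)."""
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b2 = [[-1.0, 0.0, 0.0, -d],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, -1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
    b3 = b_matrix(d, theta, 0.0)  # third atom: B_k with omega = 0, sigma = +1
    steps = [identity, b2, b3]
    steps += [b_matrix(d, theta, w, s) for w, s in zip(omegas, sigmas)]
    q = identity
    coords = []
    for b_k in steps:
        q = matmul(q, b_k)                          # Q_k = Q_{k-1} B_k
        coords.append((q[0][3], q[1][3], q[2][3]))  # fourth column of Q_k
    return coords


# Hypothetical values: 1.5 A bonds, 110 degree angles, two torsion choices.
coords = embed_chain(1.5, math.radians(110.0),
                     [math.radians(60.0), math.radians(-45.0)], [1, -1])
for xyz in coords:
    print(tuple(round(c, 3) for c in xyz))
```

Because every \(B_{k}\) combines a rotation with a translation of length \(d_{k,k-1}\), consecutive atoms of the embedded chain are separated by exactly the prescribed bond length, whatever the values of ω and σ.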
The total number of \(B_{k}\) matrices to be calculated along the parsing of the tree is bounded by \(2|P_{ato}|b\), where \(|P_{ato}|\) is the size of the ordered atom list \(P_{ato}\). The product \(Q_{k-1}B_{k}\) is calculated in two steps: (1) the fourth column of \(Q_{k}\), which gives the coordinates of k, is computed; (2) only if k is not pruned, the three remaining columns are computed.
We must distinguish two cases when embedding an atom k. If it is the first appearance of k in \(P_{ato}\), we use Equation 3 to compute all possible embeddings of k for σ∈{+1,−1} and the set of distances d. If it is not the first appearance of k in \(P_{ato}\), we need to take into account that numerical instabilities generate matrices leading to coordinates for k that differ slightly from those computed the first time. To decrease the impact of these numerical errors, we compute the set of distances d, the angles \(\theta_{2},\omega_{3}\) and, for σ∈{+1,−1}, the corresponding matrices \(B_{k}(d,+1)\) and \(B_{k}(d,-1)\), which lead to two possible embeddings of k (Equation 3), \(k^{+}=Q_{k-1}B_{k}(d,+1)\) and \(k^{-}=Q_{k-1}B_{k}(d,-1)\). We choose the embedding whose coordinates are the closest to the previous coordinates of this atom.
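The selection rule for a repeated atom amounts to a nearest-point choice between the two candidate embeddings. A minimal sketch, with a hypothetical helper `closest_embedding` and made-up coordinates:

```python
import math


def closest_embedding(previous, candidates):
    """Return the candidate position nearest to the previously stored one."""
    return min(candidates, key=lambda xyz: math.dist(xyz, previous))


# Hypothetical coordinates: k_plus differs from the stored position only by
# round-off, whereas k_minus is the other branch.
previous = (1.20, 0.50, -0.30)
k_plus = (1.2001, 0.4999, -0.3002)
k_minus = (1.20, 0.50, 0.30)
print(closest_embedding(previous, [k_plus, k_minus]))
```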
Each carbonyl oxygen \(O_{i-1}\) is uniquely determined for residue i, once \(C_{i-1}\), \(N_{i}\) and \(H_{i}\) have been embedded, since these atoms are all part of the peptide plane [30]. As is common practice (see, e.g., [31-33]), we fix here the torsion angle ω of the peptide plane to -180° or 0°. In a previous implementation [34], the positions of the carbonyl oxygens were not stored. Although this approach leads to memory savings, the availability of carbonyl oxygen positions can improve the definition of the α-helix secondary structure.
The positions of the carbonyl oxygens are thus now calculated in the following way. If \(k=O_{i-1}\) is the carbonyl oxygen atom located at the vertex k, and \(\{v_{1},v_{2},v_{3}\}\) are the vertices corresponding to atoms \(\{C_{i-1},N_{i},H_{i}\}\), belonging to the same peptide plane π, we denote by \(n_{\pi}\) the normal vector to π. The coordinates of k can then be computed by solving the following non-linear system:
$$ \left\{ \begin{array}{ll} \| k - v_{i} \|^{2} = d_{ki}^{2}, & i=1,2,3\\ n_{\pi}^{T} (v_{1} - k) = 0& \end{array} \right.. $$
((6))
where \(d_{ki}\) is the distance between atom k and atom \(v_{i}\). Using an approach similar to that employed in [35], we obtain the equivalent linear system:
$$ \left\{ \begin{array}{l} 2 (v_{2} - v_{1})^{T} k = d_{k1}^{2} - d_{k2}^{2} -\|v_{1}\|^{2} +\|v_{2}\|^{2}\\ 2 (v_{3} - v_{1})^{T} k = d_{k1}^{2} - d_{k3}^{2} -\|v_{1}\|^{2} +\|v_{3}\|^{2}\\ n_{\pi}^{T} (v_{1} - k) = 0 \end{array} \right. $$
((7))
The parameter \(d_{k1}\) is the length of the bond connecting \(O_{i-1}\) and \(C_{i-1}\); the parameters \(d_{k2}\) and \(d_{k3}\) are the distances between \(k=O_{i-1}\) and \(N_{i}\), \(H_{i}\), calculated from the bond angles and bond lengths between atoms of the peptide plane, and from the angle ω of 180° in a trans peptide plane. The case of the cis peptide plane can be treated in the same way, by setting ω to 0°.
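The linear system (7) is small enough to be solved explicitly. The sketch below builds its three rows and solves them by Cramer's rule in pure Python; this is a stand-in chosen to keep the example self-contained, whereas the implementation described in this paper relies on LAPACK. All function names and the test coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    det = det3(m)
    if abs(det) < 1e-12:
        raise ValueError("singular system: reference atoms are (nearly) collinear")
    solution = []
    for col in range(3):
        mc = [[rhs[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(mc) / det)
    return tuple(solution)


def place_in_plane(v1, v2, v3, d1, d2, d3):
    """Position k in the plane of v1, v2, v3 from its distances to them (Eq. 7)."""
    normal = cross(sub(v2, v1), sub(v3, v1))      # n_pi, up to normalization
    rows = [tuple(2.0 * c for c in sub(v2, v1)),
            tuple(2.0 * c for c in sub(v3, v1)),
            normal]
    rhs = [d1 ** 2 - d2 ** 2 - dot(v1, v1) + dot(v2, v2),
           d1 ** 2 - d3 ** 2 - dot(v1, v1) + dot(v3, v3),
           dot(normal, v1)]                       # n^T k = n^T v1
    return solve3(rows, rhs)


# Hypothetical coplanar test case: k_true lies in the plane of v1, v2, v3.
v1, v2, v3 = (1.0, 2.0, 0.5), (2.3, 2.2, 0.4), (2.8, 3.1, 1.0)
k_true = tuple(a - 0.6 * (b - a) + 0.5 * (c - a) for a, b, c in zip(v1, v2, v3))
k = place_in_plane(v1, v2, v3, math.dist(k_true, v1),
                   math.dist(k_true, v2), math.dist(k_true, v3))
print(k, k_true)
```

The first two rows are obtained by subtracting the first distance equation of (6) from the second and third ones, which removes the quadratic term in k; the third row is the planarity condition.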
Following the idea proposed for carbonyl oxygens, the coordinates k of a Cβ atom can be computed from previously calculated atoms, because the four distances of k to atoms \(\{v_{1}=C_{\alpha},v_{2}=H_{\alpha},v_{3}=N,v_{4}=C\}\) are exactly known, and because these five atoms are not coplanar. The coordinates k are calculated by solving the linear system:
$$ \left\{ \begin{aligned} 2 (v_{2} - v_{1})^{T} k = d_{k1}^{2} - d_{k2}^{2} -\|v_{1}\|^{2} +\|v_{2}\|^{2}\\ 2 (v_{3} - v_{1})^{T} k = d_{k1}^{2} - d_{k3}^{2} -\|v_{1}\|^{2} +\|v_{3}\|^{2}\\ 2 (v_{4} - v_{1})^{T} k = d_{k1}^{2} - d_{k4}^{2} -\|v_{1}\|^{2} +\|v_{4}\|^{2} \end{aligned} \right. $$
((8))
The parameter \(d_{k1}\) is the length of the bond connecting k=Cβ and Cα; the parameters \(d_{k2}\), \(d_{k3}\) and \(d_{k4}\) are the distances between k=Cβ and Hα, N, C, calculated from the bond angles and bond lengths between these atoms.
Pruning devices
Once the set of possible coordinates of the atom k has been determined in the branching phase described above, pruning devices are used to check whether the coordinates of k are feasible. In some cases described below, the coordinates of k along with the coordinates of previously embedded atoms are checked together. If the check is negative, the solution obtained for k is discarded, which prunes all tree branches originating from the node k. In this section, we present the pruning devices used to accept or discard the coordinates of the atom k generated by the branching devices. The pruning device applies all these tests as soon as the involved atoms have been embedded.
Direct distance feasibility (DDF)
As the coordinates for an atom k are determined, we first check that all distances between k and the other embedded atoms respect the given lower and upper bounds arising from the constraints (1-3) listed in section “Solving the DGP with iBP”.
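A minimal sketch of such a check is given below. The data layout (dictionaries keyed by atom name), the function name `ddf_feasible`, the tolerance `tol` and all numerical values are assumptions made for illustration only.

```python
import math


def ddf_feasible(x_k, embedded, bounds, tol=1e-3):
    """Check the bounds on the distances between k and already embedded atoms.

    embedded: atom name -> coordinates; bounds: atom name -> (lower, upper).
    """
    for atom, (lower, upper) in bounds.items():
        if atom not in embedded:
            continue  # partner not embedded yet: nothing to check
        dist = math.dist(x_k, embedded[atom])
        if dist < lower - tol or dist > upper + tol:
            return False
    return True


# Hypothetical partial embedding and distance bounds (in Angstrom).
embedded = {"N": (0.0, 0.0, 0.0), "CA": (1.46, 0.0, 0.0)}
bounds = {"N": (2.3, 2.6), "CA": (1.5, 1.55), "O": (1.2, 1.3)}
print(ddf_feasible((2.0, 1.42, 0.0), embedded, bounds))  # True
print(ddf_feasible((5.0, 0.0, 0.0), embedded, bounds))   # False
```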
Torsion angle feasibility (TAF)
The values of the backbone torsion angles ϕ, ψ are used as a pruning device, by checking whether they are located in the permitted regions of the Ramachandran plot. The pruning device, first introduced in [34], is implemented in the following way. The torsion angle \(\xi_{ijkl}\) defined by a quadruple of atoms {i,j,k,l} must fall into a domain \(\Xi_{ijkl}\), up to a certain tolerance \(\varepsilon_{t}>0\). In general, \(\Xi_{ijkl}\) is the union of κ disjoint intervals, i.e.
$$ \Xi_{ijkl} = \bigcup\limits_{c=1}^{\kappa} \Xi_{ijkl}^{c} $$
((9))
From the bounds on a torsion angle \(\xi_{ijkl}\) it is possible to derive bounds on the distance \(d_{il}\), noticing that
$$ d_{il}(\xi_{ijkl}) = \sqrt{d_{ij}^{2} + d_{lj}^{2} - 2(\cos(\xi_{ijkl})\sqrt{ef} + bc) d_{ij}d_{lj} }, $$
((10))
where:
$$ \begin{aligned} b&= \frac{1}{2}\frac{d_{lj}^{2} + d_{jk}^{2} - d_{lk}^{2}}{d_{lj}d_{kj}} \\ c&= \frac{1}{2}\frac{d_{ij}^{2} + d_{jk}^{2} - d_{ik}^{2}}{d_{ij}d_{jk}}\\ e&= 1-b^{2}, f= 1-c^{2}.\\ \end{aligned} $$
((11))
Taking the maximum and minimum values of \(d_{il}(\xi_{ijkl})\) for \(\xi_{ijkl}\in\Xi_{ijkl}\), we obtain an interval \([l_{il},u_{il}]\) for the distance \(d_{il}\). The sign of the angle \(\xi_{ijkl}\) is used as an additional pruning criterion along with the \(d_{il}\) interval.
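Equations (10) and (11) translate directly into code. The sketch below evaluates \(d_{il}\) for a given torsion angle and derives the distance interval for one torsion interval contained in [−π, π]; it exploits the fact that \(d_{il}\) depends on the torsion angle only through its cosine, so that interior extrema can only occur at −π, 0 or π. Function names and the test geometry are hypothetical.

```python
import math


def d_il_from_torsion(xi, d_ij, d_jk, d_lj, d_lk, d_ik):
    """Distance d_il as a function of the torsion angle xi, Eqs. (10)-(11)."""
    b = 0.5 * (d_lj ** 2 + d_jk ** 2 - d_lk ** 2) / (d_lj * d_jk)
    c = 0.5 * (d_ij ** 2 + d_jk ** 2 - d_ik ** 2) / (d_ij * d_jk)
    e, f = 1.0 - b ** 2, 1.0 - c ** 2
    root = math.sqrt(max(e * f, 0.0))  # guard against round-off
    radicand = (d_ij ** 2 + d_lj ** 2
                - 2.0 * (math.cos(xi) * root + b * c) * d_ij * d_lj)
    return math.sqrt(max(radicand, 0.0))


def d_il_bounds(xi_lo, xi_hi, **dists):
    """Interval [l_il, u_il] for xi in [xi_lo, xi_hi], within [-pi, pi]."""
    candidates = [xi_lo, xi_hi]
    # cos(xi) is monotone between multiples of pi, so interior extrema of
    # d_il can only be located at xi = -pi, 0 or pi.
    candidates += [x for x in (-math.pi, 0.0, math.pi) if xi_lo < x < xi_hi]
    values = [d_il_from_torsion(x, **dists) for x in candidates]
    return min(values), max(values)


# Hypothetical geometry with a torsion of 70 degrees about the j-k axis.
xi = math.radians(70.0)
p_j, p_k, p_i = (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (-0.5, 1.2, 0.0)
p_l = (1.9, 1.1 * math.cos(xi), 1.1 * math.sin(xi))
dists = dict(d_ij=math.dist(p_i, p_j), d_jk=math.dist(p_j, p_k),
             d_lj=math.dist(p_l, p_j), d_lk=math.dist(p_l, p_k),
             d_ik=math.dist(p_i, p_k))
print(d_il_from_torsion(xi, **dists), math.dist(p_i, p_l))
print(d_il_bounds(math.radians(-70.0), math.radians(70.0), **dists))
```

Since the distance does not distinguish between ξ and −ξ, the sign of the torsion angle has to be checked separately, as stated above.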
Dijkstra shortest-path (DSP)
As introduced in [23], we can exploit the fact that the distances are Euclidean to improve the iBP pruning capabilities. We extend and generalize the procedure presented in [36] in the following way. We introduce an auxiliary graph \(G^{+}\) with the same topology as the graph connecting the atoms in the protein, but such that the weight of each edge (i,j) is the upper bound of the distance \(d_{ij}\). For every pair of atoms i,j, the length of the shortest path between i and j in \(G^{+}\) is a valid over-estimate of \(d_{ij}\). We thus use an all-pairs shortest-path algorithm, the Floyd-Warshall algorithm [37], to refine the upper bound for each pair of atoms.
The Dijkstra Shortest-Path pruning device uses the refined upper bounds of inter-atomic distances in the following way. According to Lemma 4 in [23], for an atom k and for each atom pair i,j such that i<j<k in the order \(P_{ato}\) and for which \(d_{ik}\) is known, the embedding of k can be pruned if:
$$ \|i-j\| - d_{ik}> u_{jk} $$
((12))
where \(u_{jk}\) is the upper bound of the atom pair (j,k) obtained using the Floyd-Warshall algorithm [37].
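The two ingredients of this device, the refinement of the upper bounds and the test of Equation (12), can be sketched as follows. The graph representation (a dictionary of edge upper bounds on atoms numbered from 0), the function names and the numerical values are illustrative assumptions.

```python
import math


def refine_upper_bounds(n, upper):
    """All-pairs shortest paths (Floyd-Warshall) on the upper-bound graph.

    upper: dict {(i, j): u_ij} of known upper bounds between atoms 0..n-1.
    Returns an n x n matrix of refined upper bounds.
    """
    u = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), w in upper.items():
        u[i][j] = min(u[i][j], w)
        u[j][i] = min(u[j][i], w)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][m] + u[m][j] < u[i][j]:
                    u[i][j] = u[i][m] + u[m][j]
    return u


def dsp_prune(x_i, x_j, d_ik, u_jk):
    """Pruning test of Eq. (12): ||x_i - x_j|| - d_ik > u_jk."""
    return math.dist(x_i, x_j) - d_ik > u_jk


# Hypothetical four-atom chain with one extra bound between atoms 0 and 2.
u = refine_upper_bounds(4, {(0, 1): 1.5, (1, 2): 1.5, (2, 3): 1.5, (0, 2): 2.5})
print(u[0][3])  # 4.0, obtained through atom 2
print(dsp_prune((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), d_ik=1.5, u_jk=3.0))  # True
```

In the second call, atoms i and j are 10 Å apart while k must stay within 1.5 Å of i, so k cannot be closer than 8.5 Å to j, which violates the upper bound of 3 Å.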
Chirality (CHI)
The pruning of atom coordinates through the amino-acid chirality is implemented through the so-called CORN rule of thumb: in amino acids, the groups COOH, R (sidechain), NH2 and H are bonded to the chiral center Cα carbon. Starting with the hydrogen atom away from the viewer, if these groups are arranged clockwise around the Cα carbon, then the amino-acid is in the D-form. If these groups are arranged counter-clockwise, the amino-acid is in the L-form. The CORN rule was restated by imposing that the torsion angle defined by the atoms C, Cβ, N, Hα of residue i for the D-form, or C, N, Cβ, Hα of residue i for the L-form, is positive.
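The chirality test therefore reduces to the sign of a signed torsion angle computed from four positions. The sketch below uses a common atan2-based formula, in which a clockwise rotation viewed along the central axis from the second to the third point counts as positive; the function names and the toy points (which are not real atomic coordinates) are assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion(p0, p1, p2, p3):
    """Signed torsion angle (radians) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Components of b0 and b2 perpendicular to the central axis b1.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.atan2(_dot(_cross(b1, v), w), _dot(v, w))


def passes_l_chirality(c, n, cb, ha):
    """L-form test as stated in the text: torsion(C, N, CB, HA) > 0."""
    return torsion(c, n, cb, ha) > 0.0


# Toy points only: mirroring the configuration flips the sign of the angle.
pts = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
mirrored = [(-x, y, z) for x, y, z in pts]
print(math.degrees(torsion(*pts)), math.degrees(torsion(*mirrored)))
print(passes_l_chirality(*pts), passes_l_chirality(*mirrored))
```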
α-helix secondary structure
We first proposed the use of α-helix information as a pruning device in the context of the iBP algorithm in [34]. The α-helix location can be determined from an analysis of the NMR chemical shifts by TALOS [38]. Four criteria are used to enforce the formation of an α-helix: (i) the formation of backbone hydrogen bonds between amide hydrogens and carbonyl oxygens, (ii) the alignment of the amide and carbonyl functions, checked by a qualitative condition on the energy of the hydrogen bond, (iii) the definition of backbone ϕ and ψ torsion angles already described in the Torsion Angle Feasibility, (iv) the definition of three additional angles θ, θ’ and θ” similar to the ones introduced by Grishaev et al. [39].
On a sequence of m+1 contiguous residues \(I_{\alpha}=\{i,i+1,\ldots,i+m\}\) forming an α-helix, for any pair of residues (i−4,i) belonging to \(I_{\alpha}\), the lower and upper bounds on the distance between the carbonyl oxygen \(O_{i-4}\) and the amide hydrogen \(H_{i}\) should be compatible with the formation of a hydrogen bond. The lower and upper bounds are defined in an input parameter file of iBP, and were set to 1.9 and 3.0 Å, respectively, in the present work.
The condition checking the alignment of atoms involved in the hydrogen bond is implemented by evaluating a local energy term defined in the DSSP package [40]:
$$ q_{1}q_{2}\!\left[ \frac{1}{d_{O_{i-4}N_{i}}}\! +\! \frac{1}{d_{C_{i-4}H_{i}}}- \frac{1}{d_{O_{i-4}H_{i}}}\! - \!\frac{1}{d_{C_{i-4}N_{i}}} \right]\cdot f< -0.5, $$
((13))
with \(q_{1}=0.42\), \(q_{2}=0.2\) and f=332, where \(d_{AB}\) is the distance between atoms A and B.
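The energy criterion of Equation (13) is straightforward to evaluate once the four distances are known. In the sketch below, the function names and the example distances are hypothetical; the constants are those quoted above.

```python
def hbond_energy(d_on, d_ch, d_oh, d_cn, q1=0.42, q2=0.2, f=332.0):
    """Left-hand side of Eq. (13) for the pair O(i-4)/C(i-4) and N(i)/H(i)."""
    return q1 * q2 * (1.0 / d_on + 1.0 / d_ch - 1.0 / d_oh - 1.0 / d_cn) * f


def is_hbond(d_on, d_ch, d_oh, d_cn, cutoff=-0.5):
    """True when the energy term is below the cutoff of Eq. (13)."""
    return hbond_energy(d_on, d_ch, d_oh, d_cn) < cutoff


# Illustrative distances (in Angstrom) for a well-aligned hydrogen bond.
energy = hbond_energy(d_on=2.9, d_ch=3.0, d_oh=1.9, d_cn=4.1)
print(round(energy, 3), is_hbond(2.9, 3.0, 1.9, 4.1))  # -2.567 True
```

When the four distances are equal, the energy term vanishes and the criterion is not satisfied, which reflects the fact that the test probes the alignment of the N-H and C=O groups rather than their mere proximity.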
The last criterion enforces the angles θ, θ’ and θ” to lie in the intervals [0°, 70°], [0°, 90°] and [110°, 180°], respectively.
Implementation details
In this section we provide an overview of the main implementation features. The iBP algorithm has been coded in C++ with extensive use of template meta-programming [41], STL [42,43], and BOOST (www.boost.org). Linear systems, such as (7), are solved using the LAPACK library [44].
Discretizable DGP instances were represented by simple weighted undirected graphs G=(V,E,d), which were handled by the Boost Graph Library (BGL) [45]. The points in \(\mathbb {R}^{3}\) were represented using the Boost Geometry Library (also known as Generic Geometry Library, GGL: www.boost.org).
Constraints on distances, angles or energy are typically expressed by enforcing a variable x to take values in a domain \(\mathcal{D}\), which is generally the union of intervals and singletons:
$$ \mathcal{D} =\left\{ \bigcup_{j=1}^{m} \bar x_{j} \right\}\cup \left\{ \bigcup_{i=1}^{k} \left[{x_{i}^{l}},{x_{i}^{u}}\right]\right\}. $$
((14))
The Boost Interval Library (BIL – see [46,47]) was used to store such a representation, and to perform basic operations on intervals and singletons. On top of the BIL, we define the type domain, which contains a set of intervals and operations such as intersection, scaling, etc. The BIL also allows the underlying data format of the interval (single/double precision real, integer) to be selected.
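To make the role of the domain type concrete, here is a small Python analogue of a union of intervals and singletons as in Equation (14). It is not the C++ type built on the BIL: the class name `Domain`, its methods and the example values are assumptions chosen for illustration.

```python
class Domain:
    """Union of closed intervals; a singleton x is stored as (x, x)."""

    def __init__(self, pieces):
        self.pieces = self._merge(pieces)

    @staticmethod
    def _merge(pieces):
        """Sort the pieces and fuse those that overlap or touch."""
        merged = []
        for lo, hi in sorted(pieces):
            if lo > hi:
                raise ValueError("interval bounds out of order")
            if merged and lo <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
            else:
                merged.append((lo, hi))
        return merged

    def contains(self, x, tol=0.0):
        return any(lo - tol <= x <= hi + tol for lo, hi in self.pieces)

    def intersect(self, other):
        out = []
        for a_lo, a_hi in self.pieces:
            for b_lo, b_hi in other.pieces:
                lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
                if lo <= hi:
                    out.append((lo, hi))
        return Domain(out)

    def scale(self, factor):
        """Multiply every bound by a non-negative factor."""
        if factor < 0:
            raise ValueError("factor must be non-negative")
        return Domain([(lo * factor, hi * factor) for lo, hi in self.pieces])


# Hypothetical angle domains in degrees, including one singleton.
allowed = Domain([(0.0, 70.0), (110.0, 180.0)])
window = Domain([(60.0, 120.0), (150.0, 150.0)])
print(allowed.intersect(window).pieces)
# [(60.0, 70.0), (110.0, 120.0), (150.0, 150.0)]
```

Storing singletons as degenerate intervals keeps a single code path for intersection and membership tests, at the price of relying on a tolerance when exact values are compared with computed ones.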