The specific molecular system used in our experiments consists of a pre-formed NFGAILS heptamer fibril and an incoming monomer. One monomer consists of 53 heavy atoms, so the overall system of eight peptides comprises 424 heavy atoms. The simulation is performed in aqueous solution at 310 K. For details on the MD parameters, see the supplementary information (SI).
The locking phase of each of the six possible initial docking contacts described in Sect. 2.1 (three each for the “even” and the “odd” configuration) essentially corresponds to a separate molecular system, with its own transition manifold. Hence, also for longer peptides with more native contacts, each starting state of interest needs to be investigated separately. “Interesting” states would typically be those with the highest probability to be formed during the docking phase, but could also be selected based on specific chemical expert knowledge. Here, we limit our investigations to the two outer, i.e., most exposed, contacts LEU–PHE and PHE–LEU of the “even” configuration. However, the experimental setup is equally valid for the remaining contacts.
To facilitate the subsequent analysis, we impose two artificial restrictions on the binding process:
1. The heavy atoms of the fibril core are restrained to their crystal configuration. This way, only the motion of the monomer atoms is relevant for the subsequent analysis.
2. The initial native contact is prevented from breaking. This prevents the monomer from dissociating from the fibril and thus ending the locking phase. As such trajectories are not part of any successful docking pathways, they represent wasted computational effort.
The restraints are realized by imposing a strong harmonic potential on the respective atom positions.
As these restraints leave the heptamer fibril essentially motionless (except for fast, low-amplitude vibrations in the restraint potential), the system effectively consists only of the 53 heavy atoms of the incoming monomer. Hence, we will consider only the degrees of freedom of the monomer in the transition manifold analysis. We therefore have \(N=53\) atoms which we consider in Cartesian coordinates, leading to a \(3 \cdot 53 = 159\)-dimensional state space. Moreover, due to the fixed position of the template fibril in space, the incoming monomer cannot undergo any global translational or rotational movement, which normally would have to be removed by alignment to some reference structure.
Sampling of the reaction space
The first step of the transition manifold algorithm now consists of sampling starting points \(x_k\) from configuration space. The sampled states should roughly cover the full range of the reaction, i.e., contain states that are “freshly docked”, “almost locked”, and everything in between, to obtain a dense covering of the transition manifold. Note that for this, it is not necessary to sample the admissible state space densely, as one point on the transition manifold corresponds to many points in state space, and in theory, one of these points is sufficient to mark the transition manifold in an embedding. The number of required sample points hence scales (linearly) with the size and complexity of the transition manifold, and not with the dimension of the state space.
For creating the random samples, we use a heat sampling approach: we consider a configuration with all native contacts between the monomer and fibril intact, i.e., the bound state. We then restrain the initial contact (as well as the heptamer fibril), and simulate the system at a very high temperature at which the unrestrained contacts break. The resulting trajectory will explore all of the admissible state space, but no bonds will be formed due to the high temperature. The same technique has previously been applied [19] to generate the “milestone states” for a Markov model analysis of the \(\text {A}\beta _{16-22}\) amyloid. As in [19], we used a temperature of 1000 K for the heat sampling, but were able to reduce the simulation length from 50 ns to 20 ns, due to the smaller size of NFGAILS.
From the resulting trajectory, we then sample the desired number n of starting points, separately for each of the two docking contacts. As the subsampling method, we apply the k-means clustering algorithm with \(k=n\) to the high-temperature trajectory. Note, however, that the purpose of this step is not to find clusters in the trajectory, but instead to exploit the fact that the centroids generated by k-means are evenly spaced across the whole data range. This generates starting points that more uniformly cover the admissible state space compared to, for example, simple random subsampling. This trick of using k-means as a subsampling method is also commonly used in the construction of Markov state models [29].
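The subsampling step can be illustrated with a minimal NumPy sketch of Lloyd's k-means algorithm. The function name `kmeans_subsample`, the iteration count, and the toy data standing in for the heat-sampling trajectory are assumptions made for illustration only; they are not taken from the authors' code.

```python
import numpy as np

def kmeans_subsample(frames, n, n_iter=50, seed=0):
    """Lloyd's k-means with k = n; returns the n centroids as starting points.

    frames: array of shape (n_frames, dim), e.g. dim = 159 for the monomer.
    """
    rng = np.random.default_rng(seed)
    # Initialize the centroids with n distinct, randomly chosen frames.
    centroids = frames[rng.choice(len(frames), size=n, replace=False)].copy()
    frames_sq = (frames**2).sum(axis=1)[:, None]
    for _ in range(n_iter):
        # Squared distances frame-to-centroid via |a|^2 - 2ab + |b|^2.
        d2 = frames_sq - 2.0 * frames @ centroids.T + (centroids**2).sum(axis=1)[None, :]
        labels = d2.argmin(axis=1)
        for j in range(n):
            members = frames[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

# Toy stand-in for the high-temperature trajectory (hypothetical data).
rng = np.random.default_rng(1)
toy_frames = rng.uniform(size=(2000, 3))
starts = kmeans_subsample(toy_frames, n=16)
```

Note that centroids are averages of frames and therefore need not coincide with any frame of the trajectory; whether the centroids themselves or the nearest trajectory frames are used as starting points is a design choice this sketch leaves open.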
For the number of starting points, we found that choosing \(n=192\) leads to a clear image of the transition manifold in the later embeddings. For longer peptides, with more complex transition pathways lined with non-native intermediate states, this number will grow accordingly.
Parallel simulation
In the next step, the transition densities \(p^t(x_k,\cdot )\) associated with each test point \(x_k\) need to be approximated by Monte Carlo sampling. The number of samples M required to approximate the density up to a given error tolerance scales with the variance of the density [30]. This variance will be small, as \(p^t(x_k,\cdot )\) is non-zero only in a small portion of state space (recall that the simulation time t is only long enough for the system to equilibrate locally). This holds independently of the system size, and hence, M is essentially independent of the peptide length.
As there is no practical way for us to estimate this variance prior to sampling, we will justify our choice a posteriori: if a clear low-dimensional structure is visible in the final embedding, the number of samples has been sufficient; otherwise, more samples have to be created. We will see that \(M=32\) samples produce a reasonably clear embedding of the transition manifold.
As explained in Sect. 2.2, the parameter t must fall between the fast and slow timescales. The estimation of these timescales is the only step in our algorithm that requires (limited) expert chemical knowledge. We can expect the elastic bond- and valence-angle vibrations to belong to the fast process and be irrelevant for the locking dynamics. The equilibration of these vibrations occurs on the picosecond timescale. Moreover, the residue side chains may contain quickly equilibrating torsion angle rotations, which fall on timescales of a few hundred picoseconds.
The slow processes on the other hand will consist of the backbone configurational changes that are associated with the formation of the remaining native contacts. In [19], the longest formation time of a single native contact in the A\(\beta _{16-22}\) amyloid has been found to be on the order of 6 ns. As NFGAILS and A\(\beta _{16-22}\) are of comparable size, we take 6 ns as an estimate for the slow timescale. In conclusion, t should be chosen on a timescale of several hundred picoseconds. To characterize the slow and fast degrees of freedom more precisely, we will perform our experiments for \(t=0.1\) ns, \(t=0.4\) ns, and \(t=1\) ns, and compare the results.
The sampling is now realized by performing \(M=32\) MD simulations for each of the \(n=192\) test points, each simulation with different random momenta and a different random seed on the heat bath. Hence, overall, \(n\cdot M = 6144\) simulations need to be performed for each of the two initial contacts we consider. Simulations were performed on a 1536-core compute cluster (32 Intel Xeon 9242 CPUs) using the GROMACS molecular dynamics package [31], which allows easy parallelization of multiple runs of the same system via the multidir option. The overall runtime for one contact was 14 h. The resulting GROMACS structure files of the simulation end points (for the three lag times mentioned above) are available in the SI.
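The bookkeeping behind the simulation count can be sketched as follows. The dictionary layout and the seed formula are hypothetical choices that merely guarantee one distinct seed per run; they do not describe how the actual runs were organized.

```python
from itertools import product

n_starts = 192    # starting points x_k per initial contact
n_replicas = 32   # M independent simulations per starting point

# One (starting point, replica) pair per simulation.
# The seed formula is a hypothetical choice that only ensures uniqueness.
runs = [
    {"start": k, "replica": m, "seed": k * n_replicas + m}
    for k, m in product(range(n_starts), range(n_replicas))
]

n_runs = len(runs)
n_unique_seeds = len({r["seed"] for r in runs})
print(n_runs, n_unique_seeds)  # 6144 6144
```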
Transition manifold analysis
In this section, we describe the various steps of the transition manifold analysis that are performed on the simulation data. The transition manifold data analysis was performed using the special-purpose pyTMRC (Python Transition Manifold Reaction Coordinate) package [32]. The completion time for all the steps described in this section was less than 5 min on a 4-core laptop. Two Jupyter notebooks, implementing the analysis for the LEU–PHE and the PHE–LEU initial contact, respectively, can be found in the SI. To reproduce our results, download the pyTMRC package, download and extract the end point data, and execute all cells in the notebooks.
Pair-wise distances
In a first step, the samples are used to estimate the position of the transition densities relative to each other (in density space), i.e., to compute the distance matrix \(D\in \mathbb {R}_+^{n\times n}\). For the statistical distance (called d in Sect. 2.2), we use the maximum mean discrepancy (MMD) [33], which, as the name suggests, measures the discrepancy between two densities by computing the mean of each function in a class of test functions under both densities, and taking the maximum difference between the two means. More precisely, we define the distance d as
$$\begin{aligned} d(p^t(x_i,\cdot ),p^t(x_j,\cdot )) :=&\sup _{f\in \mathcal {F}}\left| \mathbb {E}_{x\sim p^t(x_i,\cdot )}[f(x)]\right. \\&\left. - \mathbb {E}_{x\sim p^t(x_j,\cdot )}[f(x)] \right| , \end{aligned}$$
where the class of test functions f is generated by the so-called kernel function \(k:\mathbb {R}^{3N}\times \mathbb {R}^{3N}\rightarrow \mathbb {R}\):
$$\begin{aligned} \mathcal {F} = {\text {span}}\big \{k(x,\cdot ),~x\in \mathbb {R}^{3N}\big \}. \end{aligned}$$
For the kernel k, we use a Gaussian kernel of bandwidth \(\sigma =5000\). The bandwidth was optimized manually to produce the clearest image of the transition manifold under the MDS embedding (see the next section). The MMD has been shown, both analytically and numerically, to preserve the distance structure of the transition manifold [34]. Moreover, its estimation from samples of the compared densities is straightforward.
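To illustrate how the MMD can be estimated from samples, the following sketch computes the standard biased (V-statistic) estimate with a Gaussian kernel. The kernel convention \(k(x,y)=\exp (-\Vert x-y\Vert ^2/(2\sigma ^2))\) is an assumption, so the meaning of the bandwidth here may differ from that of \(\sigma =5000\) above; the function names and the toy data are hypothetical as well.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gram matrix k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) (assumed convention)."""
    d2 = (X**2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y**2).sum(axis=1)[None, :]
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def mmd(X, Y, sigma):
    """Biased (V-statistic) MMD estimate between the sample sets X and Y."""
    m2 = (gaussian_gram(X, X, sigma).mean()
          + gaussian_gram(Y, Y, sigma).mean()
          - 2.0 * gaussian_gram(X, Y, sigma).mean())
    return np.sqrt(max(m2, 0.0))  # clip tiny negative values from round-off

def mmd_distance_matrix(samples, sigma):
    """samples: list of n arrays of shape (M, dim), one per starting point."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = mmd(samples[i], samples[j], sigma)
    return D

# Toy example: three clouds of M = 32 end points each (hypothetical data).
rng = np.random.default_rng(0)
clouds = [rng.normal(loc=c, size=(32, 3)) for c in (0.0, 1.0, 5.0)]
D_toy = mmd_distance_matrix(clouds, sigma=1.0)
```

In the toy example, the cloud centered at 5 should end up farther from the first cloud than the cloud centered at 1, which is the ordering that the embedding steps rely on.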
Euclidean embedding
To visualize the low-dimensional structure of the transition manifold that is encoded in D, we use the multi-dimensional scaling (MDS) algorithm [27, 35]. MDS constructs a set of n points in Euclidean space of selectable dimension (in our case, two-dimensional), so that the pair-wise distances between those points approximate D optimally. More precisely, MDS implicitly constructs an embedding of the densities, i.e., a map \(\mathcal {E}:L^1(\mathbb {R}^{3N})\rightarrow \mathbb {R}^2\)
$$\begin{aligned} \mathcal {E}: p^t(x_i,\cdot ) \mapsto z_i \in \mathbb {R}^2, \quad i=1,\ldots ,n, \end{aligned}$$
so that the Euclidean distances between the embedded points, i.e., \(\Vert z_i-z_j\Vert _2\), optimally approximate the distance \(D_{ij}\), for all pairs \(i,j=1,\ldots ,n\). Note that the domain of \(\mathcal {E}\) is the infinite-dimensional space of absolutely integrable functions \(L^1(\mathbb {R}^{3N})\), which includes probability densities. The points \(z_i\) then serve as the Euclidean representation of the densities \(p^t(x_i,\cdot )\).
We specifically use the implementation of MDS provided by the Python package Scikit-learn [36]. Besides the distance matrix D, it does not require additional input parameters.
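The idea of embedding a distance matrix can be illustrated with classical (Torgerson) MDS, which has a closed-form solution via an eigendecomposition. This is a swapped-in variant chosen for brevity and is not the scikit-learn routine used above, which to our understanding minimizes a stress function iteratively; the function name and the toy data are assumptions.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: embed n points so that Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D**2) @ J                      # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:dim]            # indices of the dim largest
    scale = np.sqrt(np.maximum(evals[idx], 0.0))   # clip small negative eigenvalues
    return evecs[:, idx] * scale

# Toy check: distances between planar points are reproduced by the embedding.
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))
D_pts = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
Z = classical_mds(D_pts, dim=2)
D_emb = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
```

For exactly Euclidean input, as in the toy check, the embedded distances match the input up to round-off; for an MMD distance matrix the match is only approximate.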
Reaction coordinate computation
Next, we seek the “best” one-dimensional parametrization of the low-dimensional structure encoded in D. Pulled back onto the starting points, this will then become our final reaction coordinate. Again, we have multiple options in choosing the error metric. For preserving the distances in D directly, the one-dimensional MDS embedding gives the optimal result. However, due to its higher robustness to outliers and good performance in previous computations [21], we here use the diffusion maps method. Its parametrization optimally preserves the so-called diffusion distance between the points underlying the matrix D, which characterizes closeness by a high transition probability in an artificially constructed Markov jump process between the points (not to be confused with the original molecular dynamical process). This process, a discretized heat diffusion, contains a scale parameter \(\tau \) controlling the velocity of the diffusion, which we choose as \(\tau =20\) (optimized manually to achieve an even parametrization of the structure observed in the MDS embedding).
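A minimal sketch of a one-dimensional diffusion-map coordinate computed from a distance matrix is given below. The kernel \(\exp (-D_{ij}^2/\tau )\) and the plain row normalization are assumed conventions that may differ from those of the pyTMRC implementation, so the value of \(\tau \) used here is not comparable to \(\tau =20\) above.

```python
import numpy as np

def diffusion_coordinate(D, tau):
    """Dominant non-trivial diffusion-map coordinate for a distance matrix D."""
    K = np.exp(-D**2 / tau)                 # assumed kernel convention
    deg = K.sum(axis=1)
    # Symmetric conjugate of the row-stochastic matrix P = diag(deg)^-1 K.
    S = K / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    # The last eigenvector belongs to eigenvalue 1 (constant after rescaling);
    # the second to last one yields the dominant non-trivial coordinate.
    return evecs[:, -2] / np.sqrt(deg)

# Toy check: points along a line segment are parametrized monotonically.
s = np.linspace(0.0, 1.0, 30)
D_line = np.abs(s[:, None] - s[None, :])
xi = diffusion_coordinate(D_line, tau=0.05)
corr = abs(np.corrcoef(s, xi)[0, 1])
```

The sign of the coordinate is arbitrary, which is why the toy check looks at the absolute correlation with the position along the line.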
Shortest locking pathway
Finally, we discuss how the transition manifold embedding can be used to identify transition pathways and artificial trajectories between two states \(x_A\) and \(x_B\) on the transition manifold. There is no single, universally accepted concept of an “optimal” transition pathway between two states, and many proposed definitions with differing objectives and physical interpretations exist [37,38,39]. The TMF proposes another such pathway, namely the geodesic between \(p^t(x_A,\cdot )\) and \(p^t(x_B,\cdot )\) on the transition manifold \(\mathbb {M}\). This is the shortest differentiable curve \(\varGamma \) in the metric space \(L^1\) that starts in \(p^t(x_A,\cdot )\), ends in \(p^t(x_B,\cdot )\), and does not leave \(\mathbb {M}\). As each point \(p^t(x,\cdot )\in \mathbb {M}\) corresponds to exactly one starting point \(x\in \mathbb {R}^{3N}\), we can “pull back” \(\varGamma \) to a “traditional” transition pathway \(\gamma \) in state space by setting \(\gamma (x):= \varGamma \left( p^t(x,\cdot )\right) \). Note that, while \(\gamma \) has a clear interpretation within the TMF, its interpretation in terms of more intuitive dynamical concepts such as transition probabilities or minimum energy pathways is still outstanding. For further discussion on the link between the transition manifolds and transition path theory, see [21].
As our data consist only of discrete samples close to \(\mathbb {M}\), we take a heuristic approach for the numerical computation of \(\gamma \). We construct a weighted, complete graph \(G= (V,E,W)\) with nodes \(V=\{x_1,\ldots ,x_n \}\) and edges \(E=\{(x_i,x_j)~|~i,j=1,\ldots ,n\}\). For the weight matrix \(W\in \mathbb {R}^{n\times n}\), we take the squared maximum mean discrepancy
$$\begin{aligned} W_{ij} = D_{ij}^2. \end{aligned}$$
The squaring compresses small, local distances, and further increases already long distances. The discrete shortest path in G between the nodes \(x_A, x_B\) therefore tends to take small steps instead of large jumps, and is encouraged to follow the transition manifold. We can hence take this discrete shortest path as a heuristic approximation of \(\gamma \).
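The effect of the squared weights can be seen in a small sketch that runs Dijkstra's algorithm on a complete graph. The function name and the toy distance matrix are assumptions for illustration; any shortest-path routine for non-negative weights would serve the same purpose.

```python
import heapq
import numpy as np

def shortest_path(W, source, target):
    """Dijkstra's algorithm on a complete graph with non-negative weights W."""
    n = W.shape[0]
    dist = [float("inf")] * n
    prev = [-1] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        if u == target:
            break
        for v in range(n):
            cand = d + float(W[u, v])
            if v != u and cand < dist[v]:
                dist[v] = cand
                prev[v] = u
                heapq.heappush(heap, (cand, v))
    # Walk back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy example: five points on a line with distances D_ij = |i - j|.
s = np.arange(5.0)
D_toy = np.abs(s[:, None] - s[None, :])
path = shortest_path(D_toy**2, 0, 4)
print(path)  # [0, 1, 2, 3, 4]
```

With the unsquared distances, the direct edge from node 0 to node 4 would have the same cost as the chain through all intermediate nodes; with squared weights the chain costs 4 instead of 16, so the path visits every intermediate point.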