Predicting Supramolecular Structure from the Statistics of Individual Molecular Events

As manipulating the self-assembly of supramolecular and nanoscale constructs at the single-molecule level increasingly becomes the norm, new theoretical scaffolds must be erected to replace the thermodynamic and kinetics based models used to describe traditional bulk phase syntheses. Like the statistical mechanics underpinning these latter theories, the framework we propose uses state probabilities as its fundamental objects; but, contrary to the Gibbsian paradigm, our theory directly models the transition probabilities between the initial and final states of a trajectory, foregoing the need to assume ergodicity. We leverage these probabilities in the context of molecular self-assembly to compute the overall likelihood that a specified experimental condition leads to a desired structural outcome. We demonstrate the application of this framework to a toy model in which N identical molecules can assemble into oligomers of different lengths and conclude with a discussion of how the high computational cost of such a fine-grained model can be overcome through approximation when extending it to larger, more complex systems.


Introduction
Just as verbal communication is facilitated by the grammatical structure of sentences, which are themselves built up from combinations of words, each of which is a specific sequence of letters; biomolecular communication is principally driven by the folded structures of proteins, which are assembled through electrostatic binding interactions between polypeptide sequences, each generated from the alphabet of amino acids.The careless verbal malapropism that can completely change the meaning of a sentence has a biomolecular analogue as well in the form of misfolding events, which can at best render a protein functionally inert.At worst, the misfolding of proteins can cause great harm, such as in the cases of the neuronal amyloid-beta (Aβ) protein and the pancreatic amylin, whose misfolded structures can seed the formation of plaques that have been implicated as a potential cause for Alzheimer's disease and type II diabetes, respectively [1].
If even nature has yet to achieve perfectly reliable molecular self-assembly after billions of years, it is hardly surprising that human efforts to synthesize nanoscale structures have likewise struggled with exerting precise structural control.Gold nanoparticle [2] and liposome [3] syntheses have difficulty achieving acceptable levels of monodispersity, biofilm [4] and other monolayer surface depositions [5] are prone to disorder and defects, and supramolecular assemblies [6] are often plagued by competing interactions that lead to disparate products.The problem with these experimental techniques is that, like most naturally ocurring forms of synthesis and self-assembly, they rely upon a presumption of ergodicity.Essentially, ergodicity is the idea that, given enough time, a group of molecules will eventually manage to assemble into the configuration that minimizes their free energy-possibly only after stochastically exploring many less favorable structural outcomes.Many of these alternate structures are only modestly less stable than the optimal structure, however, and the probability of a large enough thermal fluctuation enabling the improperly assembled molecules to break back apart and reassemble "correctly" is too negligible to be observed on experimentally acceptable time scales [7].
One way to tip the odds in the experimenter's favor would be to stop relying upon diffusion to bring molecules together in the bulk phase and instead both limit the number of reacting molecules present and exercise control over their collisional trajectories in order to promote the formation of desirable products.Newer laboratory techniques such as optical and magnetic tweezers [8], molecular beams [9], and microand nanofluidics [10] all embody this approach towards improved synthetic control.While experiments have moved boldly in this new direction, theory has sadly lagged behind, with self-assembly still being modeled in ways that largely rely on statistical thermodynamics and kinetics [11][12][13].This older modeling paradigm is insufficient to correctly predict the structural outcomes of experiments that involve only a few molecules and that by construction occur far from thermodynamic equilibrium.Theorists are essentially still using a thermal game of Plinko as a model for experiments that are more akin to molecular Tetris.
In this paper we attempt to address this gap by introducing a new modeling framework for the probability of observing a desired molecular product that derives from considerations of the stochastic dynamics of a single selfassembly sequence and its branching structural end states.The result is a probabilistic model that is more appropriate for predicting the outcomes of the newer generation of "single-molecule" experiments and that requires neither an ergodic hypothesis nor a thermodynamic limit.A short version of this work appeared in [14].

The model
We imagine a generalized experimental setup in which molecular subunits are injected into a system and propelled by some means towards a targeted site of self-assembly.This site could, for example, be a surface where a monolayer is intended to be grown, or it could be a single molecule or a partially assembled "seed" structure suspended in an optical or magnetic trap.Typically the experimenter can only directly control external parameters such as the temperature, the molecular emission frequency, or the distance to the selfassembly site.We denote this set of tunable parameters as σ , understanding that the specific members of this set will depend upon the precise details of the experiment.
When each molecule is emitted into the system, it will have some initial state characterized by dynamic variables such as its velocity, orientation, quantum vibrational state, and the time of its emission relative to the start of the experiment.For the i th emitted molecule, we define this initial state as ψ i .The set of the initial states of all the molecules involved in the assembly procedure of interest will be denoted {ψ}, and the distribution of this set of states will depend upon the tuning parameters σ through a conditional probability distribution function p S ({ψ}|σ ), where the subscript S denotes this as the "source" probability.
At some later set of times, the emitted molecules will, one-by-one, reach the self-assembly site; and their dynamical state will have evolved as a consequence of the stochastic transport process used to propel them there.We will denote this new set of dynamical states as {φ}, emphasizing that the arrival time of each molecule at the target site, which will also be a random variable, is included as part of {φ}.The distribution of this latter set of states will depend upon the initial set of states, and we denote the corresponding conditional probability p T ({φ}|{ψ}) as the "transport" probability.
In most self-assembling systems, there are numerous structural products that can be observed at long times depending on the specific details of the molecular collisions, i.e., the set of arrival states {φ}.If we denote the set of all these possible structures as S, then the conditional probability that any specific structure s ∈ S is observed will be p R (s ∈ S|{φ}), which we denote the "result" probability.Note that this probability depends on the states of all the molecules that collide with the self-assembly siteeven those that do not end up binding to become part of the final structure.For example, while more molecules must necessarily be involved to produce larger, more complex structures, one can imagine that bombarding the selfassembly site with new molecules at too fast a rate could actually serve to disrupt the assembly process, ultimately making smaller final structures more likely than larger ones.
The ultimate probability that we wish to compute is p F (s|σ ), the overall final probability that a set of input parameters will result in molecules assembling into structure s.This probability can be related to the latter three by the following convolution integral: The complexity involved in actually evaluating Eq. 1 will naturally depend upon the details of the system under consideration.
In this paper, we restrict our attention to a toy model that demonstrates how this theoretical framework might be applied and what sort of predictions it can be used to make.In this model, we assume that N identical molecules are released at randomly selected times into a one-dimensional drift-diffusion channel characterized by a drift speed v, a diffusion constant D, and a channel length .The first molecule to traverse the channel binds to a receptor site that catalyzes a self-assembly process with the second molecule to arrive, resulting in a dimer state.We assume that this assembly process takes a finite amount of time, which we denote as the assembly time T α .If the next molecule arrives while dimer assembly is still occuring, it will be repelled, preserving the dimer state.If, on the other hand, it arrives once the dimer is complete, then a trimer will be formed.This process continues until all N molecules have traversed the channel and have either been repelled or polymerized.Figure 1 depicts a cartoon representation of our toy model and summarizes its possible outcomes in the simplest nontrivial case of N = 3.
Continuing to restrict our attention to the three particle scenario, we can label the three molecules 0, 1, and 2 based on the order in which they are released into the channel, and we may then define the initial state of the i th particle ψ i as its injection time τ i .Its interaction state φ i can be defined analogously as its arrival time t i at the channel terminus.Because the first such arrival time may be thought of as the start of the experiment, the absolute release and arrival times are irrelevant, and we can replace these six time variables with four time intervals.We define the intervals Δτ 1 and Δτ 2 as the differences between the injection times of particle 1 and particle 0 and particle 2 and particle 0, respectively.Similarly, the intervals Δt 1 and Δt 2 are the equivalent differences in the arrival times of the particles at the self-assembly site.Because the order in which the particles arrive is not fixed, due to the stochasticity of the transport down the channel, these latter time intervals can potentially be negative.
The space of self-assembled structures S in this case contains only the dimer and trimer configurations, which we shall denote as s 2 and s 3 , respectively.The probability of observing a dimer at the end of the experiment depends upon whether or not the third molecule to interact does so within a time T α of the second molecule's arrival.We can thus write the dimer result probability p R (s 2 |Δt 1 , Δt 2 ) as the following conditional: ( In the above, Θ(t) is the Heaviside step function.We adopt the convention that it takes value unity when its argument exceeds zero and has value zero otherwise.Since the only other possibility is that a trimer is formed, p R (s Note that, regardless of the order of monomer arrival, if we define the time between the second and third monomer arrival events as Δt 21 , then Eq. 2 can be rewritten more compactly as p R (s This simplification is a direct consequence of the three monomers being identical in our model.If they were instead each a different chemical species, the specific order of their arrival would impact the self-assembly process, and Eq. 2 would typically have a different functional form for each ordering. The first passage time across a drift-diffusion channel is distributed according to the standard inverse Gaussian Fig. 1 Toy model timeline.If t = 0 is the time at which the first two monomers initiate their self-assembly into a dimer, then t = Δt is the time at which the third monomer arrives.Depending upon whether this time is smaller or larger than the self-assembly time scale T α , the final state of the model will either be the original dimer or a trimer distribution I G(μ, λ; t) [15], analytically continued to be zero for negative values of its time argument: The parameter μ ≡ /v is the time it takes to cross the channel in the absence of diffusion, and λ ≡ 2 /2D is the average time it would take in the absence of drift.This suggests the following form for the transport distribution where the time variable of integration t is defined as t 0 − τ 0 , the total time it takes the first emitted monomer to traverse the channel.Finally, we will assume that each molecule has an equal chance of being released into the channel at any moment in time after the previous molecule's emission, which results in the release time intervals being exponentially distributed (as in a radioactive decay process).Assuming an average injection rate 1/τ , we get Note that the dependence of this distribution on Δτ 1 cancels out of the exponent and that τ is the presumptive tuning parameter of the experiment.Equations 2, 4, and 5 can be substituted into Eq. 1 to calculate the total dimer probability p F (s 2 |τ ).The simple conditional form of the result probability p R (s 2 |Δt 1 , Δt 2 ) will lead to a modification in the limits of integration over the arrival time intervals.This leads to a more complicated looking expression for p F (s 2 |τ ) that is nonetheless more straightforward to evaluate numerically: The three integrals over the arrival time intervals in the above expression correspond, respectively, to the cases in which particle 0 arrives first, second, and third.The factors of 2 account for the symmetry, in each case, of swapping the index labels 1 and 2.

Results and discussion
Even for such a simple system, the numerical integration required to calculate p F (s 2 |τ ) is computationally intensive, with the principal time sink being the repeated evaluations of Eq. 4 for all the different values of the release and arrival time intervals needed to evaluate Eq. 6.We resolved this difficulty by parallelizing the computation, evaluating each instance of p T (Δt 1 , Δt 2 |Δτ 1 , Δτ 2 ) on a separate thread of an Nvidia GeForce GTX TITAN GPU with 3,072 cores, 12 GB of RAM, and 1,000 MHz clock speed.This reduced the total computational time by a factor of roughly 1,000.For our integration mesh, we chose a lattice spacing (bin width) of 0.02 time units, and a mesh domain defined in terms of model time units by the inequalities 0 ≤ Δτ 1 ≤ Δτ 2 , 0 ≤ Δτ 2 ≤ r, −10 ≤ Δt 1 ≤ r + 10, and −10 ≤ Δt 2 ≤ r + 10, where the integration range r was set equal to 20 time units.
For each point on this mesh, the formally infinite upper limit of each of the parallelized time integrals was approximated as 50 time units.These restricted integration ranges were sufficient to approximately normalize all of the probability distributions of the model to within an acceptable tolerance.After computing the transport probability at each point of the chosen integration mesh, it became tractable to evaluate the integrals over the release and arrival time intervals serially, using an Intel Core i7-2600 CPU with 3.40GHz clock speed and 8 GB of memory.We demonstrate how this computational time varies with integration range r in Fig. 2 for three different bin widths.As the logarithmic scale makes clear, the serial computation time grows roughly exponentially with the integration range.It also scales approximately as an inverse power law of the bin width, with a negative exponent of about 4.
We plot our numerically evaluated probability p F (s 2 |τ ) in Fig. 3(A) as a function of the self-assembly time scale T α for values of the mean release interval τ = 0.5, 1.5, 2.5, and 3.5, in descending order.For all curves, the time scales μ and λ were both fixed at unity.As expected, when selfassembly is instantaneous (T α = 0), there is no interval of time during which the third molecule can be repelled, so trimer formation is inevitable (p F (s 2 |τ ) = 0).At the other extreme, as T α → ∞, the dimer becomes the only possible product (p F (s 2 |τ ) → 1).As molecule emissions into the channel become more infrequent (larger τ ), the arrival times between molecules will also tend to increase, depressing the dimer probability.These curves are all fit very well by an extended exponential function of the form IntegraƟon Range ComputaƟonal Time (sec) Fig. 2 Computational time (in seconds) plotted versus the integration range (in model time units) for three different bin widths (also in model time units): from top to bottom, 0.02 (red), 0.05 (green), and 0.1 (blue).The ordinate axis is on a log scale to better illustrate the exponential growth of the computational time for sufficiently large integration ranges where c 1 and c 2 are fitting parameters that may depend in a complicated manner upon some dimensionless combination of the time scales τ , μ, and λ.These best fit functions are plotted as solid curves over the numerical data in Fig. 3(A).
Figure 3(B) also plots the final dimer probability versus T α , but this time τ is held fixed at τ = 1.5 and μ and λ are varied instead.The top curve is the same as the second to top curve in panel (A) (μ = λ = 1).The remaining curves are, in descending order, for (μ, λ) = (2,2), (2,1), and (4,2).These curves illustrate several general trends.First, the dimer probability decreases monotonically with increasing μ, reflecting the fact that a less facilitated channel will tend to space out the arrival times of the molecules, making trimer formation more likely.Increasing λ tends to have the opposite effect, since reducing the diffusivity of the channel narrows the distribution of arrival times (Eq.3), resulting in a less noisy channel.The variance of the inverse Gaussian distribution is μ 3 /λ, explaining why p F (s 2 |τ ) has a stronger dependence on μ than on λ.These curves are well modeled by the same class of fitting function used in Fig. 3(A).
Perhaps the most informative way of quantifying how self-assembly depends upon our model parameters is with a "phase diagram," where a relevant parameter subspace is divided into regions based upon the most probable structure in each.For our toy system, this phase diagram is fairly simple and is plotted in Fig. 4(A) as a function of the control parameter τ and the self-assembly time T α .The transport parameters μ and λ are both fixed at unity.The phase boundary, which turns out to be approximately linear (R 2value of ≈ 0.997), was determined by finding, for each value of τ , the critical value of T α for which p F (s 2 |τ ) = 1/2.For large T α and small τ , the shorter average interval between particle emissions and the longer assembly time will make it more likely for the third particle to arrive while the first two are still docking, thereby frustrating trimer formation.In the opposite limit, the time between emissions will be long and assembly will occur swiftly-both circumstances that favor the trimer product.
The phase diagram of our toy model naturally becomes more interesting as the number of structural end states increases, but although it is straightforward to generalize Eqs. 2, 4, and 5 to account for N > 3 particles, the computational cost rapidly becomes untenable.Even the next most complicated case of N = 4 taxed our available GPU resources, limiting the accuracy of our numerical integrations as the range of possible departure and arrival time intervals increased with growing τ .To circumvent this, we simulated the drift-diffusion process directly as a one-dimensional Gaussian-distributed random walk and computed the probabilities p F (s|τ ) as the fraction of replicate simulations whose set of first-passage times corresponded in our model to the end state s.Formally, the transport process in our model is a continuous Wiener process, which is only equivalent to a Gaussian random walk in the limit that the discrete time step of the walk goes to zero, but we found that a time step of dt = 0.001 and 10 5 replicate simulations was sufficient to produce quantitative agreement with the N = 3 results we computed through numerical integration.The computational time was substantially lowered by this approach as well.
The phase diagram computed from our simulations for the N = 4 case is plotted in Fig. 4(B).With four molecules being released into our one-dimensional channel, a tetrameric end state becomes possible in addition to the dimer and trimer structures of the previous case, and we see from the figure that all three final oligomers have a region of parameter space where they are statistically favored.The boundaries separating the regions are both approximately linear.Note that while in the three-particle case the phase diagram regions represent ranges of the parameters τ and T α for which one structure is the majority product (p F (s i |τ ) > 0.5), in the case of four or more particles, the regions only denote which product subsumes a plurality of the probability (p F (s i |τ ) > p F (s j |τ ), ∀j = i).
To better portray the prevalence of each structural end state at different values of τ and T α , we plot a trio of heat maps in Fig. 5  The first is the fact that there are two distinct trimer formation sequences, not counting permutations in the molecular labels, compared with only one each for the dimer and tetramer.Specifically, the trimer will result if the difference in the second and third arrival times or the third and fourth arrival times is less than T α .For the dimer, both of those time differences must be less than T α , whereas for the tetramer neither must be.This enables the trimer structure to have a substantial likelihood of formation even in the regions where it is not the most favored oligomer.
The second factor contributing to the trimer's dominance can be understood from considering the distribution of the random variable Δt i − Δt i−1 , the difference in the arrival times of two consecutively released molecules.The mean of this random variable is τ , the average difference in their release times, but the variance is equal to 2μ 3 /λ+(2i−1)τ 2 .(To understand why this is the case, note that Δt i − Δt i−1 can be rewritten in terms of first passage times and release time intervals as When τ is very small, the standard deviation approaches a constant value, making it not improbable for one pair of consecutive arrival times to exceed T α , so long as T α (2μ 3 /λ) 1/2 .Similarly, as τ gets large, the standard deviation tends to grow faster than the mean, making it not unlikely that one pair of consecutive arrival times will be less than T α .These circumstances allow for a substantial trimer probability to bleed into the regions of dimer and tetramer dominance.

Conclusions
While the framework we have devised for quantifying selfassembly in terms of individual molecular interactions is quite general, we have seen even in the simple case of our three-molecule toy model that its computational cost is problematic, especially were one to extend it to the selfassembly of long biopolymers like proteins or microtubules.The number of dynamical variables that characterize the set of interacting states {φ} will necessarily grow linearly with the number of molecules considered, but the real problem is that an integral like that in Eq. 4 will have to be evaluated for every permissible set of values these variables can take.The number of integrations will thus grow exponentially with the number of molecules, rendering even parallelization schemes unfeasible for supramolecular assemblies consisting of more than a handful of subunits.
For systems where direct simulation of the dynamics is similarly intractable, the most straightforward way to address this problem is to make physically sensible approximations that constrain the hypervolumetric domain of the variables {φ}, thereby reducing the number of integrals that must be computed in parallel.In our N = 3 toy model, for example, we must consider a range of Δt 1 and Δt 2 values broad enough to allow for 3! = 6 different interaction orders.If we work in the large τ limit, however, we can assume that the probability of nonconsecutively released particles interacting in reversed order is negligible.This reduces the number of permissible interaction orderings to three (removing the factor of two from the second term on the right of Eq. 6 and deleting the third term entirely) and eliminates the need to consider negative values of Δt 2 .This is only a modest gain, but when we extend our toy model to include tetrameric structures, this approximation scheme reduces the number of allowable orderings from 4! = 24 to a paltry five.
Another way to reduce the computational costs of our modeling framework is to take advantage of the fact that the conditional probabilities at its core are not limited to describing physics at the scale of individual molecular interactions.In experiments performed by Freer et al. [16], for example, nanofluidics and electrophoresis were combined to direct nanowires to self-assemble across designated electrode sites.In that context, it would make more sense to treat individual nanowires as the subunits in our model, even though each wire is composed of a large number of atoms.Similarly, the method developed by Nie et al. [17] to generate Janus microparticles by throttling two imiscible liquids through a narrow microfluidic aperture represents a potential application of our theory where the self-assembly would most appropriately be modeled at the continuum fluid level.In both these cases, the physical forces involved are well understood and straightforward to numerically model, and the number of subunits involved and assembled structures possible are reasonably limited.We hope to leverage systems like these in the future to validate our methods against real experimental results.

Fig. 4 (Fig. 5
Fig. 4 (A) The phase diagram for the three-molecule toy model when μ = λ = 1.The phase boundary separating the dimer and trimer-favoring regions is approximately linear.(B) The phase diagram for the four-molecule toy model, also for μ = λ = 1.Both the dimer/trimer and trimer/tetramer phase boundaries are roughly linear, though they have different slopes