Diffusions interacting through a random matrix: universality via stochastic Taylor expansion

Consider $(X_{i}(t))$ solving a system of $N$ stochastic differential equations interacting through a random matrix $\mathbf J = (J_{ij})$ with independent (not necessarily identically distributed) random coefficients. We show that the trajectories of averaged observables of $(X_i(t))$, initialized from some $\mu$ independent of $\mathbf J$, are universal, i.e., they depend on the distribution of $\mathbf{J}$ only through its first and second moments (assuming, e.g., sub-exponential tails). We take a general combinatorial approach to proving universality for dynamical systems with random coefficients, combining a stochastic Taylor expansion with a moment matching-type argument. Concrete settings for which our results imply universality include aging in the spherical SK spin glass, and Langevin dynamics and gradient flows for symmetric and asymmetric Hopfield networks.


Introduction
Markov processes with random coefficients arise in numerous contexts: e.g., dynamics of spin glasses, optimization on random landscapes, and learning with neural networks. In many cases, when the underlying randomness is Gaussian, they have been found to give rise to a rich class of behaviors, including metastability, trapping, and aging. In this paper, we analyze a class of stochastic differential systems (sds's) in their high dimensional limit, where the couplings are linear and encoded by a random matrix. We show that trajectories of polynomial statistics of the sds are universal: they have the same high-dimensional behavior if one replaces the Gaussian interaction matrix by a non-Gaussian one with the same mean and variance profiles.
Universality can broadly be described as the phenomenon that for high dimensional ensembles $(X_i)_{i \le N}$ governed by a large number of independent random variables $(Z_i)_{i \le N}$, macroscopic statistics of the ensemble only depend on the laws of $(Z_i)$ through their low moments. The most classical example of universality is, of course, the central limit theorem (clt), where $(X_i) = (Z_i)$ and the statistic is the normalized sum. Slightly more involved examples are invariance principles, where the limiting Brownian motion depends on the distribution of the random walk increments only through its first and second moments.
Lindeberg's classical proof of the clt iteratively replaces $Z_i$ with $\tilde Z_i$ (a Gaussian with the same mean and variance) and shows that the cumulative effect of these replacements is microscopic. This approach has proven to be very robust, and has been generalized, e.g., to polynomials $f(Z_1, \ldots, Z_N)$ in [28,33] and, more generally, to smooth functions with bounded derivatives in [8,9]. A more combinatorial approach is a moment matching argument, which compares moments of statistics $f(X_1, \ldots, X_N)$ to moments of $f(\tilde X_1, \ldots, \tilde X_N)$ and shows that the difference is dominated by the differences in the first few moments of $Z_i$ and $\tilde Z_i$.
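To make the replacement mechanism concrete, here is a small self-contained sketch (not from the paper; the statistic $(\sum_i Z_i)^4$, the tiny dimension, and all function names are illustrative). It swaps Rademacher coordinates for Gaussian ones one at a time and tracks the exact expectation of the statistic at each hybrid step, using only the independence of the coordinates and their moment sequences:

```python
from itertools import product
from math import factorial

# Moment sequences m[k] = E[Z^k], k = 0..4, for a standard Gaussian and a
# Rademacher (+/-1 with prob 1/2) variable -- both mean zero, variance one.
GAUSS = [1, 0, 1, 0, 3]
RADEM = [1, 0, 1, 0, 1]

def expect_sum_pow4(moms):
    """Exact E[(Z_1 + ... + Z_n)^4] for independent Z_i with the given
    per-coordinate moment lists, via the multinomial expansion."""
    n = len(moms)
    total = 0.0
    for ks in product(range(5), repeat=n):
        if sum(ks) != 4:
            continue
        coef = factorial(4)          # multinomial coefficient 4!/(k_1!...k_n!)
        prod_m = 1.0
        for k, m in zip(ks, moms):
            coef //= factorial(k)
            prod_m *= m[k]           # independence: expectation factorizes
        total += coef * prod_m
    return total

# Lindeberg scheme: swap coordinates one at a time, Rademacher -> Gaussian.
N = 3
steps = [expect_sum_pow4([GAUSS] * i + [RADEM] * (N - i)) for i in range(N + 1)]
print(steps)  # [21.0, 23.0, 25.0, 27.0]
```

Each swap moves the expectation by exactly the per-coordinate fourth-moment gap ($3 - 1 = 2$); matching one more moment would shrink the cumulative effect further, which is precisely the mechanism the replacement argument exploits.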
With these approaches, universality has been proven in a wide range of ensembles where the relationship between $(X_i)$ and $(Z_i)$ is more complicated. A fundamental example is when $(X_i)$ are the eigenvalues of a random matrix with entries $(Z_i)$. There, the empirical distribution of $(X_i)$ is well known to have the same limit as in the Gaussian case (e.g., the semi-circle law for Wigner matrices [38]). Remarkably, in the last decade universality has been found to extend to local statistics of the ensemble $(X_i)$, e.g., the typical size of gaps between eigenvalues, and $k$-point correlations. Universality in random matrix theory has been a tremendous success and we cannot hope to do justice to the literature therein; we instead refer to the seminal works [19,36] and the surveys [20,37].
A separate class of ensembles for which universality has been shown are examples of interacting particle systems from statistical physics, in particular the family of mean-field spin glass models. A canonical example is a spin glass in which $N$ particles in states $(X_i)$ interact through a random symmetric coupling matrix (or, in the case of higher order interactions, tensor) composed of independent entries $(Z_i)$. More precisely, these interactions endow the particles with an energy landscape, or Hamiltonian, that is topologically complex, and $(X_i)$ are drawn from the corresponding Gibbs distribution. The statistics of $(X_i)$ in such families of spin glasses have been found to exhibit an extremely rich and varied phase diagram, featuring phenomena like breaking of ergodicity and replica symmetry [32]. Most of their analysis, including the calculation of the free energy and the proof of the celebrated Parisi formula for the overlap distribution, was first carried out in the Gaussian setting [22,31,35]. Talagrand later showed that these results also hold in the case of Bernoulli $(Z_i)$ [34]; this universality was extended to general $(Z_i)$ as an application of [9].
The dynamics (Markov processes exploring the Hamiltonian) of such spin glass models are a prototype and motivating force for this paper. The general setting we consider here is a system of $N$ linearly coupled sde's, where the couplings are encoded in a random matrix $J$, driven by $N$ independent Brownian motions. That is, $X_t = (X_1(t), \ldots, X_N(t))$ is the solution to the sds (1.1), where $J$ is a random matrix with independent entries (up to, possibly, a symmetry constraint) and variance profile $m = (m_{ij})_{i,j}$, scaled such that $\mathbb E[\|J\|_2] = O(1)$, $h$ is a bounded drift vector, and $\Sigma$ is an affine transform of $X_t$. Note that for $\Sigma(X_t)$ non-constant, we do not expect an explicit closed-form solution to (1.1). In the $N \to \infty$ limit, the diffusions of (1.1) encompass many interesting and well-studied models of Markov processes with random coefficients, and give rise to rich and varied behavior. This includes metastability, aging, and non-Markovian limiting evolution equations, in, e.g., randomly coupled (geometric) Brownian motions, and Langevin dynamics and gradient flows for the spherical Sherrington-Kirkpatrick (sk) spin glass and symmetric and asymmetric Hopfield nets [6,13,24-26]: concrete applications are described in Section 1.4. In many such examples, the analysis is more tractable when $J$ is Gaussian, as one can use tools like Gaussian integration by parts, Girsanov's theorem, and the rotational invariance of the Gaussian ensemble.
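As a toy numerical illustration of the universality claim (a sketch under simplifying assumptions not made in the paper: $\Lambda = 0$, $h = 0$, constant unit diffusion coefficient, zero initialization, and an Euler-Maruyama discretization), one can simulate the linearly coupled system with Gaussian and with Rademacher couplings on the same Brownian path and compare an averaged quadratic observable:

```python
import numpy as np

def simulate(J, T=1.0, dt=1e-3, seed=0):
    """Euler--Maruyama for dX_t = J X_t dt + dB_t, X_0 = 0: a minimal
    instance of the linearly coupled system (illustrative only)."""
    rng = np.random.default_rng(seed)   # same seed => same Brownian path
    N = J.shape[0]
    X = np.zeros(N)
    for _ in range(int(T / dt)):
        X = X + J @ X * dt + np.sqrt(dt) * rng.standard_normal(N)
    return X

N = 400
rng = np.random.default_rng(1)
A_gauss = rng.standard_normal((N, N))
A_rad = rng.choice([-1.0, 1.0], size=(N, N))   # same first two moments
obs = {}
for name, A in [("gauss", A_gauss), ("rad", A_rad)]:
    J = (A + A.T) / (2 * np.sqrt(N))   # symmetric, O(1/N) entry variances
    X = simulate(J)
    obs[name] = np.mean(X ** 2)        # averaged observable (1/N) sum X_i^2
print(obs)  # the two values should be close for large N
```

The two observables agree up to finite-$N$ fluctuations even though the entry laws differ, which is the content of universality at the level of averaged statistics.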
In this paper, we develop a simple combinatorial framework for proving universality for the solution trajectories of sds's of the form (1.1). Before describing our approach, we explain a few difficulties one encounters when trying to prove universality for solutions of randomly coupled dynamical systems using the approaches described above. We begin by considering a Lindeberg approach, where we examine the effect that re-sampling one $J_{ij}$ has on an averaged statistic $F(t) = F(X_1(t), \ldots, X_N(t))$. The obstacle to such an approach is that changing $J_{ij}$ to $\tilde J_{ij}$, say, beyond affecting the drift $\sum_{1 \le i \le N} J_{ij} X_i(t) + h_j$ of the $j$-th coordinate of the sds, also induces a highly non-linear effect both on $X_j(t)$ and on $X_i(t)$ for all $i \neq j$. The problem instead lends itself to comparing the effect of $J \to \tilde J$ in a more averaged way.
An alternative approach would be to use the linear structure of the problem in a strong way, relying on sharp universality results for the spectra of random matrices. This approach, while feasible if $\Sigma(X_t)$ is constant, requires one to diagonalize the problem without loss of generality, i.e., it requires an assumption of joint rotational invariance for the laws of $(X_0, J, B)$. In [2], such an approach is followed for analyzing the dynamics of the spherical sk model, and their results hold assuming the law of $J$ is invariant under the orthogonal group and its spectrum satisfies certain large deviation estimates satisfied by the goe. However, this restriction would exclude, e.g., the uniform measures on $[-1,1]^N$ and $\{\pm 1\}^N$, which lack the rotational symmetry, and could not accommodate non-constant $\Sigma(X_t)$.
Very recently, [17] proved a universality result for the asymmetric Langevin dynamics for the soft-spin sk model. There, large deviations theory was used to obtain exponential control on the empirical measure on sample paths (as obtained in the Gaussian setting in [6,7]), together with sharp control on the Radon-Nikodym derivative between the Gaussian paths and those driven by non-Gaussian $J$ on short time scales, to show universality for the empirical measure $L_N = \frac{1}{N}\sum_i \delta_{X_i(t)}$. Their arguments were able to handle a deterministic non-linearity in the drift through a (double-well) confining potential, but the need for control at the exponential scale forced them to take, e.g., asymmetric i.i.d. $J$ and $\Sigma \equiv 1$.
Our combinatorial approach, by contrast, works quite generally: it is robust to symmetric and asymmetric choices of $J$ with nonhomogeneous means and variances, and to general choices of diffusion coefficients in (1.1), including $\Sigma(X_t)$ non-constant, which makes the diffusion non-linear, and $\Sigma \equiv 0$, which corresponds to a deterministic dynamical system. Lastly, the analysis works for arbitrary initializations independent of $J$. The assumption of linear drift is, of course, important, and one would like to be able to drop it. We emphasize, though, that it is primarily used to justify the absolute convergence of the Taylor expansion of the semigroup, which one could hope to justify by other means for higher order diffusions given that a strong solution exists; the remaining combinatorial framework for moments of the generator may then generalize. We discuss this in Remark 1.5.
We end this section by mentioning two recent results [1,10] showing universality for a Lipschitz family of approximate message passing (amp) algorithms, a discrete-time state evolution that has found many applications to inference and optimization in high dimensions. Some of the ideas there appear similar in spirit to ours, using a combinatorial approach to control moments of the final state of the amp. All the same, the general setting of (1.1) introduces many key differences: e.g., the diffusions of (1.1) are in general non-linear, not globally Lipschitz, and have built-in stochasticity.
1.1. Setup: diffusions with random linear interactions. Consider an $N$-dimensional stochastic differential system with a mixture of random and deterministic linear interactions, along with, possibly, some constant drifts. More precisely, consider the sds $X_t := (X_i(t))_{i=1}^N$ driven by the following parameters. Suppose that for some matrix $m = (m_{ij})_{i,j}$ we have random interactions given by a random matrix $A = (A_{ij})_{i,j}$ whose mean-zero entries have variances $\mathbb E[A_{ij}^2] = m_{ij}$. We assume that the entries $A_{ij}$ are either fully independent, or independent up to a symmetry constraint $A_{ij} = A_{ji}$. Let $P_A$ be the law of $A$. In order to scale the interactions to have an order one cumulative effect, it will be convenient to work with the rescaled interaction matrix $J := A/\sqrt{N}$. We then denote the distribution induced by $P_A$ on $J$ by $P_J$.
We further consider additional deterministic interactions $\Lambda = (\Lambda_{ij})_{i,j}$, constrained in sparsity and size by some constant $C_\Lambda < \infty$ (the $\|\cdot\|_0$-norm of a vector being its number of non-zero entries), together with external drift parameters $h = (h_i)_i$ and diffusion coefficients $\Sigma(X_t)$ governed by a matrix $\sigma = (\sigma_{ij})$. The sds $(X_t)_{t \ge 0} = (X_1(t), X_2(t), \ldots, X_N(t))_{t \ge 0}$, initialized from some random $X_0$ distributed according to a product measure $\mu$, is driven by a standard Brownian motion $B_t = (B_1(t), \ldots, B_N(t))$ as in (1.2), where for ease of notation we hereon set $X_0(t) \equiv 1$, so that $(\sigma_{0j})_{j \ge 1}$ capture the constant diffusion coefficients. We denote the martingale part of $X_t$ by $M_t$. The process $X_t$ is well-defined for a.e. $J$ and all $t \ge 0$ (as we have finite, possibly $N$-dependent, operator norms $\|J\|_2$, $\|\Lambda\|_2$ and $\|(\sigma_{ij})_{i \ge 1}\|_2$; see e.g., [30, Theorem 5.2.1]).
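The displayed formulas of this subsection (the rescaling of $A$ and the system (1.2) itself) were lost in extraction; the following is a sketch of their likely form, reconstructed from the surrounding prose (the exact placement of the $\sigma$ indices is an assumption):

```latex
J_{ij} := \frac{A_{ij}}{\sqrt{N}}, \qquad
\mathbb{E}[A_{ij}] = 0, \quad \mathbb{E}[A_{ij}^2] = m_{ij},
\qquad\text{and, with } X_0(t) \equiv 1,
\]
\[
dX_i(t) = \Big(\sum_{j=1}^{N}\big(J_{ij}+\Lambda_{ij}\big)X_j(t) + h_i\Big)\,dt
        + \Big(\sum_{j \ge 0}\sigma_{ji}X_j(t)\Big)\,dB_i(t),
\qquad
M_i(t) := \int_0^t \Big(\sum_{j \ge 0}\sigma_{ji}X_j(s)\Big)\,dB_i(s).
```

In particular, the coefficients $\sigma_{0i}$ multiply the constant coordinate $X_0 \equiv 1$, so they capture the constant part of the diffusion coefficient, matching the description above.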
Notational comment. There are three distinct sources of randomness above dictating the law of the solution $X_t$ to (1.2): the law of the interaction matrix, $P_J$, the law of the Brownian motions, denoted $P_B$, and the law of the initial data, $\mu$. Each of these is a product measure, and we do not distinguish notationally between the laws of the individual entries of $J$, $B$ or $X_0$ and those of the ensembles.
In proving universality, we consider the difference between the laws $P_J$, $P_{\tilde J}$ induced by $P_A$ and $P_{\tilde A}$ with matching variance profiles $m = \tilde m$. For ease of notation, we will henceforth denote the corresponding expectations by $\mathbb E$ and $\tilde{\mathbb E}$, respectively.

1.2. Main results. We begin by describing the observables to which our universality results apply. The building blocks of these observables are chosen among a family $\mathcal F$ of vector-valued functions. We establish universality in the mean for weighted empirical averages of monomials in functions from $\mathcal F$ evaluated at a finite collection of times. Specifically, fixing an $m$-tensor $a = (a_{i_1,\ldots,i_m})$ with entries bounded by $C_a$ and a $p$-tuple of times $t = (t_1, \ldots, t_p)$, for every $\ell \le m$ we fix $p$ observables $Y^{(\ell,1)}, \ldots, Y^{(\ell,p)} \in \mathcal F$ which are to be evaluated at these $p$ times; the resulting observable $F$ is given in (1.5). We also need a sub-exponential tail constraint on $\mu$ and $P_A$ beyond the minimal assumptions of zero mean and matching variances of $P_A$ and $P_{\tilde A}$; this is henceforth referred to as Hypothesis 1.
Hypothesis 1. Assume that the law $\mu$ is a product of laws $\mu_i$ of $X_i(0)$ having finite moments of all orders, bounded uniformly over $i$ and $N$: that is, there exist $C_\mu(r) \ge 1$ such that for every finite $r$, $\mathbb E_\mu[|X_i(0)|^r] \le C_\mu(r)$. Further assume $P_A$ has uniformly bounded exponential tails, i.e., (in one of several equivalent formulations) $\mathbb E[|A_{ij}|^\ell]^{1/\ell} \le \ell\, C_A$, $\forall \ell \ge 1$ and some $C_A < \infty$. (1.8) For ease of notation for dependencies on constants, we denote $C_\star := \max\{C_A, C_\Lambda, C_h, C_\sigma^2\}$, and state our first result, on universality at the level of the mean (hence also of moments), for observables (1.5).
Theorem 1. Let $\mu, P_A, P_{\tilde A}$ satisfy Hypothesis 1 and suppose that $A, \tilde A$, symmetric or independent, are mean-zero with matching variance profile $m = (m_{ij})_{i,j}$. For any $T, m, p < \infty$ and $a \in \mathbb R^{N^m}$ with $\|a\|_\infty \le C_a$, there exists $C(T, m, p, C_a, C_\star, C_\mu) < \infty$ such that for every $N$ and $F$ as in (1.5), the difference of expectations $|\mathbb E[F] - \tilde{\mathbb E}[F]|$ is at most $C N^{-1/2}$. For a more restricted class of observables, with additional restrictions on the distributions $\mu$, $P_A$ and $P_{\tilde A}$, we extend the above to almost sure and $L^q$ convergence for the observable trajectories. Precisely, we restrict the observables of (1.5) to $m = 1$ and $p = 2$, leaving the quadratic observables of (1.9). In order to extend Theorem 1 to convergence for the trajectories of these observables, we further need to assume that $\Sigma$ is constant, so that $M_t$ is just a scaled Brownian motion, and to assume the following concentration property on $\mu, P_A, P_{\tilde A}$, which we refer to as Hypothesis 2.

Hypothesis 2. A sequence of probability measures $(P^{(n)})_{n \ge 1}$, laws of random variables $Z_n$ in metric spaces $(\mathcal X_n, d)$, satisfies exponential concentration for Lipschitz functions if there exists some $C > 0$ such that for any sequence of 1-Lipschitz functions $f_n : (\mathcal X_n, d) \to (\mathbb R, |\cdot|)$ and all $\lambda > 0$, $P^{(n)}(|f_n(Z_n) - \mathbb E[f_n(Z_n)]| \ge \lambda) \le C e^{-\lambda/C}$. Assume that $\mu, P_A$ respectively satisfy exponential concentration for Lipschitz functions on $\mathbb R^N$ and $\mathbb R^{N^2}$ (or $\mathbb R^{N(N+1)/2}$ if $A$ is symmetric), equipped with their Euclidean norms, for some $C_\mu, C_A > 0$.
Remark 1.1. Recall, from the theory of measure concentration, that Hypothesis 2 holds for any distribution on $\mathbb R^n$ which satisfies a Poincaré inequality with constant $c > 0$ (independent of $n$), namely, for all nice $f$, $\mathrm{Var}[f(Z_n)] \le c\,\mathbb E[|\nabla f(Z_n)|^2]$ (see [21]). By the tensorization of the Poincaré inequality, if $Z_n = (Z_1, \ldots, Z_n)$ and each of the laws of the $Z_i$ satisfies this inequality, then the product also satisfies it, with the worst of the constants $c$. Since $\mu, P_A$ are product measures, the marginal laws can come from any distribution satisfying a Poincaré inequality in $n = 1$. These include:
• Exponential, Gaussian, and log-concave measures of the form $\exp(-V(x))$ for $V(x)$ strictly convex,
• Linear functionals of r.v.'s satisfying a Poincaré inequality: e.g., the uniform measure on $[-1,1]$.
The next theorem shows that under Hypothesis 2, any F of the form (1.9) concentrates around its mean.
Combining Theorems 1 and 2 we get the following strong universality for such quadratic observables.
Corollary 3. Suppose $\mu, P_A, P_{\tilde A}$ satisfy Hypotheses 1-2, where $A, \tilde A$, symmetric or independent, are mean-zero and have matching variance profile $m = (m_{ij})_{i,j}$. Let $F(\cdot)$ and $\tilde F(\cdot)$ be as in (1.9), for $a \in \mathbb R^N$ such that $\|a\|_\infty \le C_a$, with respect to the corresponding solutions $X_t, \tilde X_t$ of (1.2) with constant $\Sigma$, i.e., $\sigma_{ij} = 0$ if $i \neq 0$. Then, for every $T < \infty$, we have that $\sup_{t \in [0,T]} |F(t) - \tilde F(t)| \to 0$ as $N \to \infty$, almost surely and in $L^q$ for $q \ge 1$.
→ 0 as $N \to \infty$. Similarly, upon using the triangle inequality for $\|\cdot\|_q$, we get from Theorems 1-2 that the corresponding bound holds. Further, $N \mapsto p_N(\cdot)$ decreases pointwise on $[C, \infty)$, while for any $q \ge 1$ the preceding integral is finite for all $N$ large enough. With $\{Z_N^q\}_N$ uniformly integrable, it follows that $Z_N \to 0$ also in $L^q$.

1.3. Proof strategy. As mentioned in the introduction, traditional approaches to proving universality run into substantial difficulty when applied to diffusions with random coefficients. The dependence on specific entries of the random matrix is quite delicate, as $J_{ij}$ enters the drift both directly and through its effect on the history of $X_t$: this effect can exponentially amplify small differences; in fact, the exponential amplification is inherent to the problem at hand.
At a high level, our strategy for proving Theorem 1, and the main novelty of the paper, is to leverage the independence of $\mu$ from $P_J, P_{\tilde J}$ by pulling back $f(X_t)$ and $f(\tilde X_t)$ to properties of (time) derivatives of $f(X_t)$ evaluated at $t = 0$. At the level of expectations, these derivatives can be seen as iterates of the infinitesimal generator applied to the function $f$, which can then be controlled by combinatorial moment methods. The dominant contribution comes from terms that are polynomials of degree at most two in $(J_{ij})_{ij}$; since the first two moments of $P_A$ and $P_{\tilde A}$ match, these terms do not contribute to the difference in expectations. We emphasize that the approach does not rely on an explicit solution to the sde of (1.2), nor does it use exponential control or large deviations theory as in [17], or refined estimates on the spectrum of $A$ as in the setting of [2], where, crucially, the process has a rotational symmetry.
Recall that the sde defined in Eq. (1.2) has infinitesimal generator $L$, which we split into the parts coming from the random interactions, the deterministic interactions, the drift, and the diffusion: $L = L_J + L_\Lambda + L_h + L_\Delta$. (1.12) By Itô's formula, for every $f$, say, in $C^\infty(\mathbb R^N)$, the evolution of $\mathbb E[f(X_t)]$ is governed by the semi-group operator $P_t = P_t(J)$, with formal expansion $P_t = e^{tL}$ (1.13) in terms of the generator $L$. In order to reduce the problem to a combinatorial question, we wish to Taylor expand the semi-group operator $P_t f = e^{tL} f$. As long as $f$ is smooth and the Taylor expansion converges absolutely (shown in Section 2.2), this formal expansion is valid and we can exchange expectations over $\mu, P_J, P_{\tilde J}$ with the sum, and compute expectations of powers of the generator $L$ acting on $f$. Namely, the difference in expectations is bounded by controlling (1) the size in $N$, and (2) the growth in $k$, of the differences $\mathbb E_{\mu \otimes P_J}[(L^k f)(X_0)] - \mathbb E_{\mu \otimes P_{\tilde J}}[(\tilde L^k f)(X_0)]$. (1.14) Expanding these terms as words in $L_J, L_\Lambda, L_h, L_\Delta$, we observe that a non-zero difference between the two expectations in (1.14) can only come from summands (monomials in $J, X, \Lambda, h, \sigma$) satisfying:
• Every $J_{ij}$ that is present must appear at least twice.
• At least one $J_{ij}$ must appear at least three times.
This is because the means of $P_A, P_{\tilde A}$ are zero and their variances match. A careful analysis of this combinatorial problem eventually yields that the contributions of these monomials are, together, $O(N^{-1/2})$ in $N$ and $o(k!)$ in $k$; this computation is carried out in Section 2.3.
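A quick numeric check of this moment-matching dichotomy (illustrative only; the two entry laws and function names are not from the paper): for the standard Gaussian and Rademacher laws, the first three moments agree, so monomials in which every $J_{ij}$ appears at most twice (or even three times, for these symmetric laws) have identical expectations, while a fourth power is the first place a difference can enter:

```python
# Exact low moments of two mean-zero, variance-one entry laws:
# standard Gaussian (E Z^{2k} = (2k-1)!!) vs. Rademacher (E Z^k in {0, 1}).
def gauss_moment(k):
    if k % 2 == 1:
        return 0                      # odd moments vanish by symmetry
    out = 1
    for j in range(1, k, 2):          # double factorial (k-1)!!
        out *= j
    return out

def rademacher_moment(k):
    return 0 if k % 2 == 1 else 1     # (+/-1)^k is 1 for even k

for k in range(1, 5):
    g, r = gauss_moment(k), rademacher_moment(k)
    print(k, g, r, "match" if g == r else "differ")
```

By independence, the expectation of a monomial factorizes over the distinct entries $J_{ij}$, so matching per-entry moments up to order two forces matching expectations for any word in which no entry appears three or more times; this is exactly what the two bullet points encode.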
Remark 1.2. One may notice that in the case where $\Sigma(X_t)$ is constant, so that $M_t$ is just a Brownian motion, we are left with a linear sds, and one could use this linearity in a more central way: expectations of monomials in $(X_i(t))_i$ can be solved explicitly as Gaussian integrals and time integrals over words in $e^{sJ}$ and $(X_i(t))_i$. If the system $X_t$ is invariant under rotations, then we can work in the eigenbasis of $J$, so that it is diagonal, and apply universality results for the spectrum of $J$. Absent rotational symmetry, however, the natural step would be to Taylor expand $e^{sJ}$, at which point the expansion and the resulting combinatorics are similar to, and perhaps less transparent than, our generator-based approach. Of course, for non-constant $\Sigma(X_t)$ as in Theorem 1, the sds is non-linear, and such an approach would not generalize.
In Section 3, we extend this bound on the difference in expectations of statistics $f$ to multi-time observables, then to statistics that contain the driving martingale terms, and finally establish universality at the level of expectations for observables of the form (1.5), as stated in Theorem 1. In Section 4, we adapt the approach of [3] to establish Theorem 2, namely, to show that the restricted class of observables of (1.9) concentrate around their expectations, by localizing to a set of large probability on which $F$ is $O(N^{-1/2})$-Lipschitz in the triplet $(X_0, J, (M_t)_{t \in [0,T]})$ and using Hypothesis 2.
1.4. Applications. In this section, we discuss systems for which Theorem 1 through Corollary 3 imply concrete universality results. All the examples that follow will be in the context of constant $\Sigma$, i.e., $\sigma_{ij} = 0$ if $i \neq 0$, where both Theorems 1-2 apply. Among the examples with non-constant $\Sigma$, one which may be of interest is a system of geometric Brownian motions interacting linearly through $J$.
We next describe two well-studied families of Markov processes/dynamical systems to which our results apply: Langevin dynamics and gradient flows on various energy landscapes (Hamiltonians) or loss functions.
Langevin dynamics. In the case where $J$ and $\Lambda$ are symmetric matrices and the $\sigma_{0j}$ are identically one, (1.2) corresponds exactly to the Langevin dynamics for a quadratic Hamiltonian $H$; the linearity of the diffusion corresponds to the Hamiltonian being quadratic. The Langevin dynamics is a reversible Markov process designed such that, when non-degenerate, its invariant measure on $\mathbb R^N$ is given by $d\pi(x) \propto e^{-H(x)}dx$. For Hamiltonians coming from spin glass theory, the Langevin dynamics has been analyzed at length in the case of Gaussian disorder and found to have a varied and rich behavior; in §1.4.1, we explore this further in the context of a simple spin glass model, the spherical sk model.
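The Hamiltonian display was lost in extraction; since the drift of (1.2) is $(J+\Lambda)x + h$ with $J$, $\Lambda$ symmetric, it can be reconstructed (up to the paper's temperature normalization, which is an assumption here) as:

```latex
H(x) = -\tfrac{1}{2}\,\langle x,\,(J+\Lambda)\,x\rangle - \langle h,\,x\rangle,
\qquad\text{so that}\qquad
-\nabla H(x) = (J+\Lambda)\,x + h,
```

and with unit diffusion coefficients the dynamics then takes the familiar Langevin form $dX_t = -\nabla H(X_t)\,dt + dB_t$, whose invariant measure is $d\pi(x) \propto e^{-H(x)}dx$ when non-degenerate.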
Gradient flows. The case where the $\sigma_{0j}$ are identically zero, i.e., where besides the randomness of $J$ and, possibly, the initial data, the dynamics is deterministic, also fits into the framework of the paper. Here, given $J$ and $X_0$, the law of the dynamics is the delta function on the trajectory of the solution to the corresponding drift ode, namely (1.2) without the martingale term. We now turn to a few well-studied concrete problems to which our results are applicable.
1.4.1. The (soft) spherical sk model. The dynamics of spin glasses are a canonical setting in which Markov processes with random coefficients are studied in their thermodynamic ($N \to \infty$) limit. The short-time ($N \to \infty$, then $T \to \infty$) behavior of Langevin dynamics, especially in the context of spin glasses, has been extensively studied in both the physics and math literature [2-7, 11, 12, 15, 18]. Perhaps the most well-known mean-field spin glass is the Sherrington-Kirkpatrick (sk) spin glass, where $N$ spins taking values in $\{+1,-1\}$ interact pairwise with one another, their interaction strengths moderated by "coupling" parameters $J_{ij} = J_{ji}$ drawn i.i.d., say, Gaussian. We discuss a simplification of this known as the spherical sk model, which has been found to nevertheless exhibit some of the same phenomena. Take a symmetric matrix $J = (J_{ij})_{ij}$ with i.i.d. entries of law $P_J$; the spherical sk model has the Hamiltonian of (1.16). To avoid differential geometry on the sphere, it is sometimes preferable to extend the Hamiltonian to all $x \in \mathbb R^N$ (note that the Hamiltonian is homogeneous, so dividing $x$ by $\|x\|/\sqrt N$ gives the same process on $S^{N-1}(\sqrt N)$). Instead of adding a non-linear confining force as is done in, e.g., [2], we either add a linear confining force $F_K(x) = Kx$, or have no confinement ($K = 0$); the linearity of the system ensures no finite-time blowup. Consider now the Langevin dynamics at inverse temperature $\beta > 0$ for the Hamiltonian of (1.16), corresponding to $X_t = X_t^{(\beta)}$ solving the sds (1.17). We also consider the gradient flow, where we take $\beta = \infty$ so that the Brownian motion term drops out: $X_t$ is then the (deterministic) dynamical system following the (random) gradient vector field of $H(x) + F_K(\|x\|^2/N)$. The following universality for the above system is an immediate corollary of Corollary 3.
Corollary 1.3. Suppose $A$ and $\tilde A$, symmetric, have mean zero and matching variance profiles $m_{ij} = 1 + \mathbf 1\{i = j\}$, and consider the sds's $X_t$ and $\tilde X_t$ given by (1.17). Suppose $\mu$ is independent of $P_A, P_{\tilde A}$ and these satisfy Hypotheses 1-2. Then for $F$ as in (1.9) with $Y, Y' \in \mathcal F$ and $\|a\|_\infty \le C_a$, for every $T < \infty$, $\sup_{t \in [0,T]} |F(t) - \tilde F(t)| \to 0$ almost surely and in $L^q$ for $q \ge 1$.
As shown in [14] and rigorously proved in [2], when $J$ is Gaussian, the spherical sk model, or the soft spherical sk model with confining potential $F$ satisfying $F(x)/x \to \infty$ as $x \to \infty$, exhibits a sharp aging transition. Informally, aging is the notion that the older a system gets, the more it remembers its past; formally, it corresponds to a transition in the behavior of the auto-correlation $C_N(s,t) := \frac1N \sum_i X_i(s) X_i(t)$, between an (fdt) regime where $C_N(s,t) \sim \Phi(t-s)$ and an aging regime where $C_N(s,t) \sim \Phi(t/s)$ for large $s,t$. In [2], it was established that for $J$ having rotationally invariant law, e.g., a goe matrix, $C_N(s,t)$ solves a non-linear equation [2, Eq. (2.16)], which exhibits exactly this type of transition at some $\beta_{\mathrm{ag}}$. Our results allow us to read off universality for this limiting behavior, as formalized in the following corollary.

Corollary 1.4. Consider the Langevin dynamics for the soft spherical sk model, as defined in (1.17), where $P_A$ is a Wigner matrix satisfying Hypothesis 2, the confinement is $F_K(x) = Kx$ for some $K > \mathbb E[\|J\|_{2\to2}]$, and the initialization $\mu$ is, e.g., standard Gaussian, independent of $P_A$. Then, for every $\beta \in (0,\infty]$ and every $T < \infty$, the limit $(\lim_{N\to\infty} C_N(s,t))_{s,t\in[0,T]}$ exists and satisfies [2, Eq. (2.16)].
In the specific case of $\beta = \infty$, the conclusions of [2, §3.2.2] apply, and the solution exhibits aging: i.e., there is a $\gamma > 0$ (specified therein) such that for every $\lambda > 1$, $\lim_{s\to\infty}\lim_{N\to\infty} C_N(s,\lambda s)/C_N(s,s) = \lambda^{-\gamma}$. Proof. For the first statement, while [2, Theorem 2.6] is stated for confinement $F$ growing super-linearly, following the proof one sees that this is only used to localize the process, for which it suffices for $K$ to exceed $\|J\|_{2\to2}$ (which for Wigner matrices is a.s. less than $2+\epsilon$ for any $\epsilon > 0$). The first part of the corollary therefore follows from Corollary 1.3 together with the result of [2, Theorem 2.6] showing that for $A$ standard normal, $C_N(s,t)$ satisfies [2, (2.16)].
For concreteness, the analysis of the limiting equation [2, (2.16)] and the derivation of the aging transition are carried out in [2] only for a specific choice of quadratic $F$. One could in principle perform the same analysis with other choices of $F$, including the linear $F = F_K$ corresponding to the case we consider, and understand the limiting behavior of $C_N(s,t)$ as $N \to \infty$, then $s,t \to \infty$, as $\beta$ varies. We do not pursue this, and instead notice that in the specific case of $\beta = \infty$, homogeneity allows us to disregard the choice of confining potential and obtain universality for the zero-temperature aging behavior. To see this, since $H(x)$ is a homogeneous polynomial, at $\beta = \infty$ the increment $dX_t$ is a constant multiple (the constant depending only on $X_t$) of $d(X_t/\|X_t\|)$. Therefore, at $\beta = \infty$, the projection of the dynamics (1.17) onto the sphere $S^{N-1}(\sqrt N)$ matches the projection of the Langevin sds of [2], regardless of the choice of confining potential used therein. We apply Corollary 1.3 first to deduce that $\lim_{s\to\infty}\lim_{N\to\infty} C_N(s,s) =: C_\infty$ is the same for Gaussian and non-Gaussian $P_A$. Then, applying it to $C_N(s,\lambda s)$, we find that the $N\to\infty$ limit of the normalized auto-correlation is the same for Gaussian and non-Gaussian $P_A$, and is further independent of the choice of confining potential: as such, for any $P_A$, it has the same $N \to \infty$ limit as in [2].

Remark 1.5. It would be of interest to consider similar Langevin dynamics for the spherical or soft spherical $p$-spin glass models for $p > 2$. Permitting higher order interactions gives rise to a wealth of more complicated models and different behavior. At the level of the off-equilibrium Langevin dynamics, these lead to the famous Cugliandolo-Kurchan/Crisanti-Horner-Sommers limit of coupled integro-differential equations for $C_N(s,t)$ and an integrated response [3,11,12,15,16,18,23], as well as the evolution of other observables, e.g., the Hamiltonian and its square gradient [5].
Our combinatorial framework suggests that the differences in expectations (over $p$-tensors $J$ and $\tilde J$) of averaged observables are microscopic, as long as there is a non-linear confining potential to prevent finite-time blowup. The complication lies in the fact that the two non-linearities (from the interactions and from the confining potential) only cancel in combination, and these cancellations are not easily seen in the Taylor series obtained by expanding in powers of the generator; thus we are not able to show that this series is absolutely summable and exchange the infinite sum with its expectation.
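For the $p = 2$ case treated in §1.4.1, the zero-temperature dynamics and its auto-correlation are easy to simulate directly. The following sketch (illustrative only; a discretized spherical gradient flow stands in for the $\beta = \infty$ dynamics, and all parameters are arbitrary) compares $C_N(s,t)$ for Gaussian and Rademacher disorder from a common initialization:

```python
import numpy as np

def sphere_flow(J, T=4.0, dt=1e-2, x0=None):
    """Discretized gradient ascent for H(x) = <x, Jx>/2 renormalized to the
    sphere S^{N-1}(sqrt N): a caricature of the beta = infinity spherical
    sk dynamics of Section 1.4.1 (not the paper's exact scheme)."""
    N = J.shape[0]
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(int(T / dt)):
        x = x + dt * J @ x
        x = x * np.sqrt(N) / np.linalg.norm(x)   # project back to the sphere
        traj.append(x.copy())
    return np.array(traj)

N, dt = 300, 1e-2
rng = np.random.default_rng(3)
x0 = rng.standard_normal(N)            # initialization independent of J
corr = {}
for name, A in [("gauss", rng.standard_normal((N, N))),
                ("rad", rng.choice([-1.0, 1.0], size=(N, N)))]:
    J = (A + A.T) / np.sqrt(2 * N)     # Wigner normalization
    traj = sphere_flow(J, x0=x0)
    s, t = int(1.0 / dt), int(4.0 / dt)
    corr[name] = traj[s] @ traj[t] / N  # auto-correlation C_N(1, 4)
print(corr)
```

The two auto-correlations should agree up to finite-$N$ fluctuations, consistent with Corollary 1.3; the open question discussed above is whether the analogous agreement persists for $p$-tensor interactions with $p > 2$.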

1.4.2. Symmetric and asymmetric Hopfield networks. Let us also mention a different context in which diffusions of the form (1.1) appear. Hopfield networks were introduced in [25] and have become one of the simplest and most fundamental examples of neural networks. In this model, each of $N$ neurons $(X_i)_i$ is either active ($+1$) or inactive ($-1$) depending on whether its input $\sum_j J_{ij} X_j$, for some weights $J = (J_{ij})_{i,j}$, exceeds a deterministic threshold $h_i$. This model was introduced in the symmetric setting, but has since been analyzed extensively in both symmetric and asymmetric setups [13,24,39].
One typically initializes the neurons at some pre-determined state independent of $J$, e.g., all inactive/active, or uniformly at random, and tracks their time-evolution, whereby each neuron activates/de-activates at some rate depending on the relationship between its input and its threshold. Though there are many ways this is implemented, one is to soften the problem to continuous state space, either on the sphere or in full space, and add stochasticity by running some Langevin dynamics. This is the approach pursued in [13] as well as, e.g., [39]. Then, with a linear confining force, our results imply universality for both the symmetric and asymmetric Langevin dynamics (and gradient flows) of general Hopfield networks: this includes universality for observables capturing the energy/loss in the network, its square gradient, and its "memory".

1.4.3. Rayleigh quotient minimization for random matrices. We conclude with a related optimization problem in high dimensions: optimizing the Rayleigh quotient of a random matrix $J$ with a given mean and variance profile. Maximizing the Rayleigh quotient is an efficient way to find the top eigenvector and eigenvalue of the random matrix via local iteration, e.g., either gradient descent or Langevin dynamics at low temperature (large $\beta$). To place this in the framework of (1.2), take $H(x) = \langle x, Jx\rangle$ and either no confining force or $F_K' \equiv K$ for some $K > \|J\|_{2\to2}$ in (1.17). In the situation where the matrix ensemble is rotationally invariant, e.g., the goe, the limiting trajectories of, say, $H(X_t)$ for the gradient flow/Langevin dynamics can be explicitly solved (by diagonalization). Corollary 3 implies these limiting trajectories are universal, and thus match the limiting trajectories obtained when $J$ is not Gaussian. In [1,10], similar universality results were described for an amp approach to finding the top eigenvalue/eigenvector of $J$.

2. Universality of expectations of monomial observables
In this section, we prove that two solutions $X$ and $\tilde X$ of (1.2), driven by $J$ and $\tilde J$, are such that expectations of observables of the form (1.9) are universal, as long as $A$ and $\tilde A$ have the same variance profiles. As discussed in Section 1.3, we reduce differences in expectations to combinatorial calculations by expanding the Markov transition semi-group of the process $X_t$ in terms of its generator, an approach for proving universality in randomly driven dynamical systems which is the key contribution of this paper.
For the entirety of this paper, we will take two distributions $P_A$ and $P_{\tilde A}$ on $A$ and $\tilde A$ that are mean zero and have the same, uniformly bounded, variance profiles $m = \tilde m$. Recall that $P_A$ and $P_{\tilde A}$ are either fully independent or symmetric ensembles. For conciseness, we present our results in the case of fully independent (in particular, not symmetric) ensembles. The case where they are symmetric is handled mutatis mutandis and only induces a few constant factors in certain estimates (see Remark 2.9 for more on these minimal modifications).

2.1. Main result on difference in expectations.
The observables in Theorem 1 are composed of polynomials in $J$ and $X$, as well as $M$. We first establish the universality of expectations for general monomials in $J$ and $X$ via a combinatorial moment-matching type of argument. In Section 3, such universality for monomials that additionally involve the martingale is reduced to that of monomials only in $J$ and $X$.
More precisely, the statistics we consider throughout this section are of the following form. Fix any $s$ (not necessarily distinct) pairs $\alpha = (\alpha_1, \ldots, \alpha_s)$ where each $\alpha_k = (i_k, j_k)$, and an $r$-tuple (not necessarily distinct) $\gamma = (\gamma_1, \ldots, \gamma_r)$ where each $\gamma_i \in \{1, \ldots, N\}$. Then consider observables $f_{\alpha,\gamma}(x)$ of the form (2.1). For an $s$-tuple of pairs $\alpha$, let
- $I_\alpha$ count the number of distinct pairs in $\alpha$, i.e., $I_\alpha = |\{\alpha_1, \ldots, \alpha_s\}|$,
- $I_{\alpha,1}$ count the number of $(\alpha_k)_k$ which appear exactly once in $\alpha$, and
- $I^+_{\alpha,1}$ equal $I_{\alpha,1}$ plus the indicator that no pair appears more than twice in $\alpha$.

Our bound on the distance between the expectations of $f_{\alpha,\gamma}(X_t)$ and $f_{\alpha,\gamma}(\tilde X_t)$ depends on $\alpha, \gamma$ and the laws $\mu, P_A, P_{\tilde A}$ only through $C_\star$, $C_\mu$, $s$, $r$ and $I^+_{\alpha,1}$. More precisely, we derive here the following.

Proposition 2.1. There exists $C = C(r, s, T, C_\star, C_\mu(r))$ such that for every $T, r, s \ge 0$, every $s$-tuple of pairs $\alpha$ and every $r$-tuple $\gamma$, if $P_A$, $P_{\tilde A}$ and $\mu$ satisfy Hypothesis 1, then

Observe that in the case $s = 0$, the right-hand side is $CN^{-1/2}$.
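The three combinatorial counters just defined can be computed mechanically; the following small helper is our own illustration, not part of the paper.

```python
from collections import Counter

def index_counts(alpha):
    """Return (I_alpha, I_alpha_1, I_plus_alpha_1) for an s-tuple of index pairs."""
    counts = Counter(alpha)
    I_alpha = len(counts)                                  # distinct pairs
    I_alpha_1 = sum(1 for c in counts.values() if c == 1)  # pairs appearing exactly once
    no_triples = all(c <= 2 for c in counts.values())      # no pair more than twice
    return I_alpha, I_alpha_1, I_alpha_1 + int(no_triples)
```

Note that the empty tuple (the case $s = 0$) gives $I^+_{\alpha,1} = 1$, matching the convention used later in the proof of Theorem 1.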
Remark 2.2. The above proposition shows that having more distinct $J$'s in the observable decreases the difference in expectations by more than the factor $N^{-s/2}$ one would expect from the typical size of $J_{ij}$. This gain should be expected due to clt-type cancellations: one way to motivate this scaling is by recalling averaged statistics which have $J$ in them, in the context of the spherical sk model, the most relevant being the following. (Notice that these statistics are not rescaled by the number of order-one sized monomials; they nevertheless remain on the $O(1)$ scale due to additional cancellations from $(J_{ij})$.) This gain in the scaling has to be visible at the level of the difference in expectations under $P$ and $\tilde P$ in order to hope for universality of such statistics.
Recall from Section 1.3 that our high-level strategy is to reduce the expectations of statistics of the solution $X_t$ of the sds to combinatorial calculations in terms of mixed moments of $J$ and $X_0$. This is possible by expressing such an expectation as $P_t f(X_0)$ and then Taylor expanding $P_t = e^{tL}$, where $L$ is the generator of the process $X_t$ as defined in (1.12). In order for this expansion to be valid, and therefore our approach to be permissible, we need the Taylor expansion of $e^{tL}$ to converge absolutely for each fixed $N$. In the next subsection, we show that indeed, with $\mu, P_A, P_{\tilde A}$ satisfying Hypothesis 1, for each fixed $N$ the infinite series corresponding to $P_t f$ converges absolutely, so we can follow this plan.
Before proceeding further, we make the following notational remark.

2.2. Switching the expectation and the infinite series. The goal of this subsection is to prove the following absolute convergence result.
As a consequence of Proposition 2.3 and Fubini–Tonelli, we may use the following expansion.
Corollary 2.4. Suppose $P_A$, $P_{\tilde A}$, $\mu$ satisfy Hypothesis 1. Setting $L$ and $\tilde L$ for the corresponding generators, we have, for every $N \ge N_o(r, T, C_\star)$, every $t < \infty$, and every $s$-tuple of pairs $\alpha$ and $r$-tuple of indices $\gamma$,
Proceeding hereafter to prove Proposition 2.3, we fix $r, s, \alpha$ and $\gamma$, and set $f = f_{\alpha,\gamma}$. Aiming for upper bounds on $E[|L^k f(X_0)|]$ which are summable against $T^k/k!$, we first utilize (1.12) to expand $L^k$ as a sum over the $4^k$ words $W$ in the letters $\{L_J, L_\Lambda, L_h, L_\Delta\}$ and thereby get the bound

where for every $x \in \mathbb R^N$, $Wf(x)$ should be understood as $(W_k \cdots W_2 W_1 f)(x)$. For every word $W \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$, let $k_J = k_J(W)$ denote the number of $L_J$'s that appear in $W$, and similarly define $k_\Lambda$, $k_h$, and $k_\Delta$, so that $k_J + k_\Lambda + k_h + k_\Delta = k$ and the following structural decomposition of $Wf$ holds.
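The bookkeeping of words and letter-counts above can be checked mechanically; the following enumeration is our own small illustration (letter names are placeholders). It confirms that there are $4^k$ words in the expansion of $L^k$, and that the number of words with a prescribed signature $(k_J, k_\Lambda, k_h, k_\Delta)$ is the corresponding multinomial coefficient.

```python
from itertools import product
from math import factorial

letters = ("L_J", "L_Lam", "L_h", "L_Del")
k = 5
words = list(product(letters, repeat=k))        # all words in the expansion of L^k
assert len(words) == 4 ** k                     # 4^k words in total

def signature(word):
    """(k_J, k_Lam, k_h, k_Del) of a word W."""
    return tuple(word.count(letter) for letter in letters)

# Count words with signature (2, 1, 1, 1) and compare with the multinomial 5!/2!.
n_words = sum(1 for w in words if signature(w) == (2, 1, 1, 1))
multinomial = factorial(5) // (factorial(2) * factorial(1) * factorial(1) * factorial(1))
```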
Claim 2.5. For any word $W \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$ with $k_J, k_\Lambda, k_h, k_\Delta$ occurrences of the corresponding letters, $Wf$ can be expressed as a sum of (not necessarily distinct) monomials of the form
In view of Hypothesis 1 on $P_A$, we have that for every $N$, $\ell \ge 0$, and index pair $\alpha$,

Thus, if $I_{\alpha\amalg\beta}$ distinct index pairs appear at multiplicities $(n_\ell + 1)_{\ell \le I_{\alpha\amalg\beta}}$ in the sequence $\alpha \amalg \beta$ of length $k_J + s$, then by the independence of $(J_\alpha)_\alpha$,

Consequently, with $X_0$ independent of $J$, we have in view of the assumed bounds on $(\Lambda_{ij})_{i,j}$, $(\sigma_{ij})_{i,j}$ and $(h_i)_i$, that for any term of the form (2.3) with $I_\zeta$ entries such that $\zeta_\ell \notin (0j)_j$,

using in the last inequality also (1.6) from Hypothesis 1 on $\mu$, and the definition of $C_\star$. Our next result is a first step in controlling the number of monomial terms that can appear in the expansion of each word $W \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$.

Lemma 2.6. For every $k_J, k_\Lambda, k_h, k_\Delta$ and every $\beta, \beta', \zeta', \zeta, \xi$, if we let $\phi = \phi_{\beta,\beta',\zeta',\zeta,\xi}$ be as in (2.3), then $L_h\phi$, $L_J\phi$, $L_\Lambda\phi$ and $L_\Delta\phi$ can each be expressed as a sum of at most $r$, $rN$, $rN_\Lambda$ and $rN_\sigma^2$ many such monomials, respectively, each of the same form (with possibly different $\beta, \beta', \zeta', \zeta, \xi$) as (2.3), with the respective $k_J$, $k_\Lambda$, $k_h$ or $k_\Delta$ increased by one.
Fixing $N$, $k$, an $s$-tuple of pairs $\alpha$, an $r$-tuple of indices $\gamma$ and $W \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$, upon inductively applying Lemma 2.6 we can express $Wf$ as the sum of at most (2.9) many non-zero monomials of the form (2.3). Recall that for a monomial $\phi$, we use $I_\zeta$ for the number of $\zeta_\ell \notin (0j)_j$, $I_\alpha$ for the number of distinct pairs in $\alpha$, $I_{\alpha\amalg\beta}$ for the number of distinct pairs in $\alpha\amalg\beta$, and introduce $I_\star = I_{\alpha\amalg\beta} - I_\alpha$, which counts the number of distinct pairs in $\{\beta\}\setminus\{\alpha\}$. A careful examination of the proof of Lemma 2.6 yields the following significant refinement of the crude bound (2.9).

Proposition 2.7. Fix $N, r, s, k \ge 0$, an $s$-tuple of pairs $\alpha$, an $r$-tuple of indices $\gamma$, and $W \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$. Then, of the monomials in such an expansion of $Wf$, at most
$$\binom{k_J}{I_\star,\, n_1, \ldots, n_{I_{\alpha\amalg\beta}}} \binom{2k_\Delta}{I_\zeta}\, r^k\, N^{I_\star}\, N_\Lambda^{k_\Lambda}\, N_\sigma^{I_\zeta} \qquad (2.10)$$
have $I_\zeta$ elements of $\zeta$ with $\zeta_\ell \notin (0j)_j$, and the $I_{\alpha\amalg\beta} = I_\alpha + I_\star$ distinct pairs in $\alpha\amalg\beta$ appear with multiplicities $\{n_\ell + 1_{\{\ell > I_\alpha\}}\}_{\ell \le I_{\alpha\amalg\beta}}$ within the sequence $\beta$ of length $k_J$. (N.b. we ordered the $(n_\ell)$ with the multiplicities in $\beta$ of the distinct pairs of $\alpha$ appearing first, and the multiplicities in $\beta$ of the remaining $I_\star$ distinct pairs next.)

Proof. The first improvement in (2.10) over (2.9) comes from observing that the growth factor $N_\sigma$ applies only in those $I_\zeta$ of the $2k_\Delta$ applications of $L_\Delta$ within $W$ which have led to an element $\zeta_\ell \notin (0j)_j$ (see (2.8)), and that there are at most $\binom{2k_\Delta}{I_\zeta}$ ways to choose which $I_\zeta$ elements of $\zeta$ are not from the $0$-th row of $\sigma$. Similarly, the growth factor $N$ in counting the number of monomials after applying $L_J$ is only relevant during the $I_\star$ applications of $L_J$ within $W$ in which a new pair $(ij)$ is selected (see (2.6)). The left-most term in (2.10) counts the number of ways to select the locations of these $I_\star$ new elements within the $k_J$-long sequence $\beta$, and thereafter to partition the remaining $k_J - I_\star$ consistently with having the prescribed $n_\ell \ge 0$ repeats for each of the $I_{\alpha\amalg\beta}$ distinct pairs in question.
Putting all this together yields the stated bound (2.10) on the number of relevant monomials in the expansion of W f .
Proof of Proposition 2.3. Combining Proposition 2.7 with the bound (2.4), we deduce that for any word $W$ of length $k$ and any $\alpha$ whose $I_\alpha$ distinct terms appear with multiplicities $(c_\ell)_{\ell \le I_\alpha}$,

where the inner sum is over all partitions of $k_J - I_\star$ into $I_{\alpha\amalg\beta}$ indistinguishable integers $n_\ell \ge 0$. Since $\sum_\ell c_\ell = s$ and $n_\ell + c_\ell \le k_J + s$ for all $\ell$, the right-most product is at most $(k_J + s)^s$. Further, the number of $(n_\ell)_\ell$ considered here is at most the number of integer partitions of $k_J$, which grows slower than $e^{k_J}$ (c.f. the Hardy–Ramanujan asymptotic partition formula). Thus, we find that for $C(r, s, C_\mu, C_\star)$ finite and any word $W$ of length $k$,

Since $k! \ge k_J!(k - k_J)!$, the bounds (2.11) and (2.2) yield the stated absolute convergence of the infinite series. Specifically, fixing $T < \infty$ and setting $\delta = 1/(16TreC_\star)$, we have that

where, as before, $I_\star = I_{\alpha\amalg\beta} - I_\alpha$ denotes the number of distinct elements in $\{\beta\}\setminus\{\alpha\}$.
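The partition-count bound invoked here, namely that the number of integer partitions of $k_J$ grows slower than $e^{k_J}$, is easy to verify for moderate sizes (a sketch of our own):

```python
from math import exp

def partitions(n):
    """Number of integer partitions of n, via the standard dynamic program
    (coin-change counting with parts 1..n, order-independent)."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]
```

For instance $p(5) = 7$ and $p(10) = 42$, while $e^5 \approx 148$ and $e^{10} \approx 22026$; the Hardy–Ramanujan asymptotics $p(k) \sim e^{\pi\sqrt{2k/3}}/(4k\sqrt3)$ show the gap only widens.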
Proof. By the independence of $J$, $\tilde J$ and $\mu$, the difference in expectations can be non-zero only if

which, for independent, zero-mean $(J_{ij})_{ij}$ of matching variances $\frac1N m = \frac1N \tilde m$, requires that simultaneously:

No pair $\alpha_\star$ appears exactly once in the concatenation $\alpha \amalg \beta$. (2.15)

Some $\alpha_\star$ appears more than twice in the concatenation $\alpha \amalg \beta$. (2.16)

The condition (2.15) implies that each of the $I_\star$ distinct elements in $\{\beta\}\setminus\{\alpha\}$ must appear at least twice in $\{\beta\}$, to which end we need at least $2I_\star$ applications of $L_J$ to select those elements. In addition, some other $I_{\alpha,1}$ of the $k_J$ applications of $L_J$ must align exactly with the pairs $(\alpha_{ij})$ appearing only once in $\alpha$, so necessarily $k_J \ge 2I_\star + I_{\alpha,1}$. Further, the condition (2.16) requires $k_J + s \ge 3$, and when no pair appears more than twice in $\alpha$, an extra application of $L_J$ beyond the preceding $2I_\star + I_{\alpha,1}$ is needed to produce the third appearance of some $\alpha_\star$, as stated in (2.14).
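The pair of conditions (2.15)–(2.16) amounts to a simple check on multiplicities: any pair appearing exactly once makes both expectations vanish (mean zero), and if every pair appears at most twice the matched first and second moments make the two expectations equal. A hedged sketch (the helper name is ours):

```python
from collections import Counter

def difference_can_survive(pairs):
    """Whether the difference of expectations of a product of J's over the listed
    index pairs can be non-zero under two laws with matching first two moments:
    no pair appears exactly once (2.15), and some pair appears more than twice (2.16)."""
    counts = Counter(pairs)
    no_singleton = all(c != 1 for c in counts.values())  # else both expectations vanish
    some_triple = any(c >= 3 for c in counts.values())   # else first two moments match
    return no_singleton and some_triple
```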
We are now able to prove that the expectations of monomials of the form f α,γ (X t ) are universal.
Proof of Proposition 2.1. Fixing $\alpha$, $\gamma$, in view of Lemma 2.8 it suffices, when bounding the rhs of (2.13), to consider only words $W$ and monomials $\phi$ for which (2.14) holds. Thus, upon combining the bound (2.4) on $E[|\phi(X_0)|]$ and $\tilde E[|\phi(X_0)|]$ with Proposition 2.7, we find, for the same constant $C$ as in (2.11), that

Plugging (2.17) into (2.13), as in the derivation of (2.12), we get for $\delta = 1/(16TreC_\star)$ and $N \ge \rho$:

where $\bar C = 2Ce^{-1/\delta}\rho^{I^+_{\alpha,1}/2}$. This completes the proof, as both series on the rhs of (2.18) are finite and independent of $N$.

Remark 2.9. In the case of symmetric random matrices $A$, $\tilde A$ (where only the upper triangular and diagonal elements are independent), we identify index pairs $\beta = ij$ and $\bar\beta = ji$ as being the same. We do so whenever considering $I_\alpha$, $I_{\alpha,1}$, $I^+_{\alpha,1}$, $I_{\alpha\amalg\beta}$, $I_\star$, and the multiplicities $(n_\ell)_\ell$, as well as in the restrictions (2.15)–(2.16) imposed on the multiplicities within $\alpha\amalg\beta$. Once this is done, the only difference in our proof is to replace in (2.10) the weight $r^k$ by $(2r)^k$.

3. The extension to multi-time polynomial observables
In this section, we extend the results of Section 2 to more general observables, namely those that contain coefficients that depend on the driving martingale, and those that depend on the trajectory through multiple times, rather than just one. We then use those extensions to prove Theorem 1. To this end, fix any $l$, any $(\alpha^{(1)}, \ldots, \alpha^{(l)})$ each consisting of $s_i$ pairs, any $(\gamma^{(1)}, \ldots, \gamma^{(l)})$ each consisting of $r_i$ indices, and also fix $m$ indices $\xi = (\xi_1, \ldots, \xi_m)$. Fix $l$ times $0 \le t_1 \le \cdots \le t_l \le T$ and $m$ times $0 \le u_1 \le \cdots \le u_m \le T$. For $f_{\alpha^{(i)},\gamma^{(i)}}$ as in (2.1), consider observables of the form (3.1). Let $\bar r = \sum_i r_i + m$ and let $\bar\alpha$ denote the concatenation $\alpha^{(1)} \amalg \cdots \amalg \alpha^{(l)}$, of length $\bar s := \sum_i s_i$.
Proposition 3.1. There exists a finite $C(\bar r, \bar s, m, l, T, C_\star, C_\mu(\bar r))$ such that for every $l, m$ and every $(\alpha^{(i)})_{i \le l}$,

We proceed to prove Proposition 3.1, which we thereafter combine with a short combinatorial estimate, bounding the number of terms with specific values of $I^+_{\bar\alpha,1}$, to establish Theorem 1.
We express the expectation $E_B$ with respect to the Brownian motion of $g(t)$ in terms of the (diffusion) semi-group operator as

Expanding each semi-group operator in terms of powers of the generator $L$, the above is precisely

Taking the difference in expectations between $E$ and $\tilde E$, upon justifying swapping the expectation with the infinite sum (as done in Section 2.2), and using the fact that
$$\binom{k}{k_1, \ldots, k_l}\, l^{-k} \;\le \sum_{\substack{k_1, \ldots, k_l \ge 0\\ \sum_i k_i = k}} \binom{k}{k_1, \ldots, k_l}\, l^{-k} = 1\,,$$
for every $k_1, k_2, \ldots, k_l$ such that $k_1 + \cdots + k_l = k$, we obtain that

The following structural property of the words appearing above will allow us to reduce the analysis of multi-time observables to the combinatorial analysis of one-time observables $f_{\bar\alpha,\bar\gamma} = f^{(1)} f^{(2)} \cdots f^{(l)}$, for $\bar\alpha = \alpha^{(1)} \amalg \cdots \amalg \alpha^{(l)}$ and $\bar\gamma := \gamma^{(1)} \amalg \cdots \amalg \gamma^{(l)}$, which we have already completed.
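The multinomial identity invoked above is the multinomial theorem evaluated at $x_1 = \cdots = x_l = 1/l$; a quick numerical confirmation (illustrative only):

```python
from itertools import product
from math import factorial

def multinomial(ks):
    """Multinomial coefficient (sum ks)! / (k_1! ... k_l!)."""
    out = factorial(sum(ks))
    for k in ks:
        out //= factorial(k)
    return out

def check_identity(k, l):
    """Sum of multinomial(k; k_1..k_l) over all compositions of k into l parts
    equals l^k, i.e. the weights multinomial * l^{-k} sum to one."""
    total = sum(multinomial(c) for c in product(range(k + 1), repeat=l) if sum(c) == k)
    return total == l ** k
```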
Claim 3.3. Fix words $W_1, \ldots, W_l$ in the letters $\{L_J, L_\Lambda, L_h, L_\Delta\}$, with $k_1, \ldots, k_l$ letters of each appearing, respectively. Then the function

consists of a sum of (not necessarily distinct) monomials of the form

Moreover, each monomial $\phi(x)$ appearing in this expansion must also appear in the corresponding monomial expansion of $W f_{\bar\alpha,\bar\gamma}$ for $W = W_1 \cdots W_l \in \{L_J, L_\Lambda, L_h, L_\Delta\}^k$.
Proof. The structure of the monomials is evident. Every such monomial in $W_1 f^{(1)} W_2 f^{(2)} \cdots W_l f^{(l)}$ must also appear in the monomial expansion of $[W_1 \cdots W_l] f_{\bar\alpha,\bar\gamma}$, because a subset of the terms in the latter are obtained by applying the letters in $W_l$ to $f^{(l)}$, then the letters in $W_{l-1}$ to $f^{(l-1)} (W_l f^{(l)})$, and so on. Finally, observe that $W_1 \cdots W_l$ is always a word in $\{L_J, L_\Lambda, L_h, L_\Delta\}^k$.

With Claim 3.3 in hand, we further get that

where the sums are over the monomials $\phi$ in the decomposition of $W_1 f^{(1)} \cdots W_l f^{(l)}$ and that of $W f_{\bar\alpha,\bar\gamma}$, per Claim 3.3. Note that each summand on the rhs of (3.4) is at most some $(k+1)^l l^k$ times the corresponding summand of (2.13) for the choice $f = f_{\bar\alpha,\bar\gamma}$, for which we have already deduced the bound (2.17). Utilizing the latter and the elementary bound $k + 1 \le (k_J + 1)(k + 1 - k_J)$, by proceeding as in the derivation of (2.18) we find that for $C = C(\bar r, \bar s, C_\mu(\bar r), C_\star)$ finite, $\delta = 1/(16\,l\,T\bar r\,e\,C_\star)$ positive and $N \ge (2/\delta)^2$,

for some finite $\bar C = \bar C(l, \bar r, \bar s, T, C_\star, C_\mu(\bar r))$.
We now add in the driving martingale observables (i.e., m > 0) and conclude the proof of Proposition 3.1.
Proof of Proposition 3.1. We reduce the situation $m > 0$ to the combinatorial calculations of Lemma 3.2 by utilizing the following expansion from Itô's lemma:

When expanding (3.1) in this manner, the terms containing only products of $X_{\xi_i}(u_i)$ can be absorbed into $\gamma$, in which case their difference in expectations has already been handled in Lemma 3.2, so by linearity it suffices for us to focus on handling terms of the form

Thus, fixing $l, m, (\alpha^{(i)}), (\gamma^{(i)}), \xi$ and letting $h(t, u) = h_{(\alpha^{(i)}),(\gamma^{(i)}),\xi}(t, u)$, we obtain after swapping the expectation and integrals that

which thereby yields the following bound on the relevant difference in expectations

Proceeding hereafter wlog to bound the difference in expectations for $h(t, \tau)$, we suppose for ease of exposition that $0 \le t_l = \tau_0 \le \tau_1 \le \cdots \le \tau_m$ (the situation where the two groups intertwine is analyzed similarly, with the obvious modifications). As in the proof of Lemma 3.2, first expressing $E_B$ in terms of the semi-group operator and then expanding that in powers of the generator $L$, we find that

with the sum running over the corresponding monomial decomposition. Then, utilizing again Claim 3.3, as well as the bound $k! \ge \bar k!/(\bar k)^m$, we arrive at

where as before $\bar\alpha = \alpha^{(1)} \amalg \cdots \amalg \alpha^{(l)}$ is of length $\bar s = \sum_i s_i$, while $\bar\gamma$, of length $\bar r = \sum_i r_i + m$, now has the additional elements $(x_{\xi_i})_{i \le m}$. Up to this update of $\bar r$ and the immaterial weight factor $(k/(lT))^m$ of its summands, the expression on the rhs of (3.5) is the same as that in (3.4). We thus conclude, as in the proof of Lemma 3.2, that for some $C(l, m, \bar r, \bar s, T, C_\star, C_\mu(\bar r))$, all $t \in [0, T]^l$ and $u \in [0, T]^m$,

3.2.
Proof of Theorem 1. Fix $T, m, p, C_a$, $a \in \mathbb R^{N^m}$ such that $\|a\|_\infty \le C_a$, and $t \in [0, T]^p$. For every $\ell \le m$, fix observables $Y^{(\ell,1)}, \ldots, Y^{(\ell,p)} \in \mathcal F$ and let $F(t)$ be as in (1.5) with those choices. By linearity of expectations and the uniform bound on $\|a\|_\infty$, it suffices to show that, uniformly over $i_1, \ldots, i_m$,

We denote by $\bar s$ the number of $Y$ terms appearing in the preceding product which are a coordinate of $G_t$. In case $\bar s = 0$, the bound (3.6) follows from Proposition 3.1 at $\bar s = 0$, in which case $I^+_{\bar\alpha,1} = 1$. Otherwise, we expand every term in that product which is a coordinate of $G_t$, to obtain a sum of monomials of the form (3.1). Each of these monomials has a sequence $\bar\alpha$ of length $\bar s$, and as a result of such an expansion there are at most $\bar s^{\bar s} N^{I_{\bar\alpha}}$ monomials with precisely $I_{\bar\alpha}$ distinct pairs in the sequence $\bar\alpha$. Note that for any $\bar\alpha$,
$$\bar s + I^+_{\bar\alpha,1} \ge 2I_{\bar\alpha} + 1\,.$$
Indeed, each pair which appears once in $\bar\alpha$ is counted both in $\bar s$ and in $I_{\bar\alpha,1}$, all other pairs are counted at least twice in $\bar s$, and for any $\bar\alpha$ of maximal multiplicity two, we have added one to $I^+_{\bar\alpha,1}$. Consequently, the bound of Proposition 3.1 on the difference in expectations for each of these $\bar s^{\bar s} N^{I_{\bar\alpha}}$ many monomials is at most $CN^{-I_{\bar\alpha}-1/2}$ for some constant $C(T, m, p, C_\star, C_\mu)$. From this, the bound (3.6) immediately follows upon enumerating the at most $\bar s$ many choices for $I_{\bar\alpha}$.
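The inequality $\bar s + I^+_{\bar\alpha,1} \ge 2I_{\bar\alpha} + 1$ can be verified by brute force on small tuples; the following check is our own, not part of the proof. It enumerates all tuples of length up to five drawn from three distinct pairs.

```python
from collections import Counter
from itertools import product

def counters(alpha):
    """Return (I, I_plus): the number of distinct pairs, and the count of
    pairs appearing exactly once plus the no-pair-thrice indicator."""
    c = Counter(alpha)
    I = len(c)
    I_1 = sum(1 for v in c.values() if v == 1)
    return I, I_1 + int(all(v <= 2 for v in c.values()))

pairs = [(0, 0), (0, 1), (1, 0)]
# Check s + I_plus >= 2 I + 1 over all tuples of length 1..5 from three pairs.
holds = all(
    s + counters(alpha)[1] >= 2 * counters(alpha)[0] + 1
    for s in range(1, 6)
    for alpha in product(pairs, repeat=s)
)
```

Note the inequality is tight, e.g., for two distinct singleton pairs: $\bar s = 2$, $I = 2$, $I^+ = 3$, and $2 + 3 = 2\cdot 2 + 1$.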

4. Concentration for quadratic observables: Proof of Theorem 2
Assuming henceforth that $M_t$ is a scaled Brownian motion (i.e., that $\sigma_{ij}$ are identically zero for $i \ne 0$), our goal is to prove Theorem 2 about the concentration, uniformly over $t \in [0, T]^2$, of the quadratic observable of (1.9),

To this end, we introduce in Subsection 4.1 high-probability localizing sets $\mathcal L_{N,R}$ on which various norms of $X_t$ (and our observables $F(t)$) are uniformly bounded. We begin by bounding the probability that $(X_0, J, M) \notin \mathcal L_{N,R}$.
Lemma 4.1. There exist $C = C(T, C_\mu, C_A, C_\sigma) > 0$ and $R_0(T, C_\mu, C_A, C_\sigma) < \infty$, such that for every $R \ge R_0$, if $\mu$, $P_A$ satisfy Hypotheses 1–2, then

Proof. We bound $\mathcal L^c_{N,R}$ by the union of the events where each of the three norms is greater than $RN/3$. First, since $M_t$ is a Brownian motion (scaled by $(\sigma_{0j})_j$), by Doob's maximal inequality for the sub-martingale $\exp(\delta\|M_t\|^2)$, we have for some $C(C_\sigma) > 0$, any $R \ge TR_0(C_\sigma)$ and all $N$,

Next, since $\mu$ satisfies Hypotheses 1–2, the independent $X_i(0)$ have uniform (in $i$ and $N$) second moments and exponential tails. Hence, applying [29, Theorem 3] for the centered sum of i.i.d. variables that stochastically dominate $X^2_i(0)$, we have for some $C(C_\mu) > 0$, any $R \ge R_0(C_\mu)$ and all $N$,

It thus remains only to show that when $P_A$ satisfies Hypothesis 2, we have for some $C(C_A) > 0$, any $R \ge R_0(C_A)$ and all $N$,

To this end, recall [27, Theorem 2] that there exists a universal constant $C$ such that for any matrix $A$ with independent, zero-mean entries of second moments $m_{ij}$ and fourth moments $b_{ij}$,

For $P_A$ satisfying Hypothesis 1, $b_{ij}$ and $m_{ij}$ are bounded uniformly in $i, j$ and $N$ (see (1.8)). Hence, in the case where $A$ is composed of independent entries, for some $C(C_A)$ finite and all $N$,

Likewise, representing a symmetric $A$ as $A = A_+ + A_-$, with $A_+$ the upper triangular (including the diagonal) part of $A$ and $A_-$ its lower triangular part, [27, Theorem 2] holds for the matrices $A_-$ and $A_+$ of zero-mean, independent entries (with uniformly bounded fourth moments). Thus, (4.4) holds also in this case, up to a factor of 2. Thanks to (4.4), if $\sqrt R \ge 4C$, then Hypothesis 2 for $P_A$ yields the bound (4.3), upon recalling that $\|A\|_{2\to 2}$, which is the largest singular value of $A$, is 1-Lipschitz in its entries (endowed with the Euclidean norm; on $A_+$ when $A$ is assumed symmetric).
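The $\sqrt N$ scaling of $\|A\|_{2\to2}$ underlying the (4.4)-type bound is easy to see numerically; the sketch below is our own (the limiting constant $2$ is the Bai–Yin value for square matrices with iid variance-one entries).

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = []
for N in (100, 200, 400):
    A = rng.standard_normal((N, N))        # iid entries, mean 0, variance 1
    top_sv = np.linalg.norm(A, ord=2)      # largest singular value = ||A||_{2->2}
    ratios.append(top_sv / np.sqrt(N))     # Bai-Yin: tends to 2 as N grows
```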
We further have on the sets $\mathcal L_{N,R}$ the following localization for both $(X_t)_{t\in[0,T]}$ and $(G_t)_{t\in[0,T]}$.
In addition, for every $a$ such that $\|a\|_\infty \le C_a$ (uniformly over $N$) and every $Y, Y' \in \mathcal F$, if $F(t)$ is as in (1.9), we have for all $k \ge 1$,

Proof. Setting $e_N(t) = \frac{1}{\sqrt N}\|X_t\|$, we get upon expanding (1.2) that

From the definition of the 2-to-2 norm, evidently

Hence, by Cauchy–Schwarz,

$\int_0^t e_N(s)\,ds$, where in the last inequality we rely on our assumption that $\|\Lambda\|_{1\to1} \le C_\Lambda$ and $\|\Lambda\|_{\infty\to\infty} \le C_\Lambda$, to deduce that $\|\Lambda\|_{2\to2} \le C_\Lambda$. Combining these bounds on $(I_i)_{i\le5}$ and dividing out by $e_N(t)$, we see that

By Gronwall's inequality, using the localization to $\mathcal L_{N,R}$, it then follows that for any $t \in [0, T]$,

yielding the lhs of (4.5) as soon as $R \ge R_0(T, C_\star) \ge 1$. From the lhs of (4.8) we know that $\|G_t\| \le \sqrt R\,\|X_t\|$ throughout $\mathcal L_{N,R}$, hence after suitably increasing $C_0$ and $R_0$, the rhs of (4.5) holds as well.
To deduce the uniformly bounded moment estimate of (4.6) for $X_t$, recall first from the lhs of (4.5) that

Combining the latter bound with that of Lemma 4.1, we arrive at

The rhs decreases in $N$, and as $f'(R) = (C_0 T k)/(2\sqrt R)\,f(R)$, it is finite for $\sqrt N/C > C_0 T k$, yielding the lhs of (4.6). The rhs of (4.6) follows by applying the same reasoning to $Z^k_N$, utilizing the rhs of (4.5).
Turning to (4.7), note that for any $k \ge 1$ and $F(t)$ of (1.9) with $\|a\|_\infty \le C_a$, by Cauchy–Schwarz,

Thus, yet another application of Cauchy–Schwarz results in

If $Y$ is $1$, this latter expectation is simply one. If $Y$ is $M$, using the tail bound of (4.2) in combination with (4.9) (now for $f(R) = (R/3)^k$), the latter expectation is uniformly bounded in $N$. Lastly, if $Y$ is from $\{X, G\}$, the expectation above is uniformly bounded in $N$ by (4.6). Combining these yields the desired (4.7).

Proposition 4.3. Fixing $a$ such that $\|a\|_\infty \le C_a$ and $Y, Y' \in \mathcal F$, denote by $F(t; (X_0, J, M))$ the observable in (1.9) evaluated on the trajectory $X_t$ constructed out of the triplet $(X_0, J, M)$. There exist $R_0(T, C_a, C_\star)$ and $C(T, C_a, C_\star)$ such that for any $R \ge R_0$, all $N$ and $(X_0, J, M)$,

4.2. A Lipschitz property.
The key to Proposition 4.3 is to show that $X_t$ is $O(1)$-Lipschitz on $\mathcal L_{N,R}$ endowed with $\|\cdot\|_{\rm mix}$. Specifically, denoting by $X_t(X_0, J, M)$ the solution to (1.2) constructed from the triplet $(X_0, J, M)$, and by $X'_t(X'_0, J', M')$ the solution constructed from the triplet $(X'_0, J', M')$, our next lemma establishes a Lipschitz bound on $X_t - X'_t$, uniform over $\mathcal L_{N,R}$.

Lemma 4.4. There exist $R_0(T, C_\star)$, $C(T, C_\star)$ such that for all $R \ge R_0$ and $(X_0, J, M), (X'_0, J', M') \in \mathcal L_{N,R}$,
$$\sup$$

Proof. Following the strategy of proof of [3, Lemma 2.6], we let

and, expanding over $j \le N$, we have by the definition of the solution $X_t$ of the sds (1.2)–(1.3) that

where $G'(\cdot)$ is defined as $G(\cdot)$ but constructed using $J'$ instead of $J$. By Cauchy–Schwarz,

Recalling (4.8), we similarly find that

Turning to the terms involving $G(\cdot)$ or $G'(\cdot)$, observe first that

Using the localization to $\mathcal L_{N,R}$, we thus find that

where in the last inequality we further assumed $R \ge R_0(T, C_\star)$, utilizing the lhs of (4.5). Further increasing $R_0$ so that $Te^{C_0\sqrt{R_0}T} \ge 1$, upon combining the bounds on $(I_i)_{i\le5}$ and dividing out by $e_N(t)$, we see that

Recall that $\|J\|^2_{2\to2} \le \sum_{ij} J^2_{ij}$, so by Gronwall's inequality there exists $C(T, C_\star)$ such that for any $R \ge R_0$, every $N$ and all $t \in [0, T]$,

as claimed.
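The Lipschitz dependence of $X_t$ on the triplet can be seen in a toy Euler scheme with shared Brownian increments (all parameters below, e.g., the confinement strength $K = 3$ that makes the drift contractive, are our own illustrative assumptions): perturbing only the initial data, the terminal discrepancy stays of the order of the initial one.

```python
import numpy as np

def flow(X0, J, noise, dt=0.01, K=3.0):
    """Euler scheme for dX = (J X - K X) dt + dB, with prescribed increments."""
    X = X0.copy()
    for dB in noise:                        # shared Brownian increments
        X = X + (J @ X - K * X) * dt + dB
    return X

rng = np.random.default_rng(3)
N, steps, dt = 200, 100, 0.01
J = rng.standard_normal((N, N)) / np.sqrt(N)
noise = [np.sqrt(dt) * rng.standard_normal(N) for _ in range(steps)]
X0 = rng.standard_normal(N)
X0_pert = X0 + 1e-3 * rng.standard_normal(N)   # perturb only the initial data
d0 = np.linalg.norm(X0 - X0_pert)
dT = np.linalg.norm(flow(X0, J, noise, dt) - flow(X0_pert, J, noise, dt))
```

Since $K$ exceeds the spectral width of $J$ here, the map is in fact contractive, so $\|X_T - X'_T\| \le \|X_0 - X'_0\|$ up to discretization error; the general lemma only requires an $O(1)$ Lipschitz constant.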
Proof of Proposition 4.3. Fix $Y_1, Y_2 \in \mathcal F$, $a$ such that $\|a\|_\infty \le C_a$ and $t = (t_1, t_2) \in [0, T]^2$. Equipped with Lemma 4.4 and (4.11), it remains to establish a Lipschitz control on differences of $F(t; (X_0, J, M))$ in terms of differences of $G_t$, $X_t$ and $M_t$ corresponding to any pair of triplets $(X_0, J, M)$ and $(X'_0, J', M')$ in $\mathcal L_{N,R}$. To this end, we start with the following bound on differences of $F(t; \cdot)$:

Since the two terms on the rhs can be bounded symmetrically, wlog we focus on the first one, which by Cauchy–Schwarz is at most

where, as before, $X'_t$ is constructed out of the triplet $(X'_0, J', M')$. Now recall from $(X_0, J, M) \in \mathcal L_{N,R}$ and Proposition 4.2 that the right-most term in (4.12) is at most $\exp(C_0\sqrt R\, T)$ for all $R \ge R_0$, in which case by the preceding

Recalling Lemma 4.4 and (4.11), we deduce that for some $C(T, C_\star) > 0$, every $R \ge R_0$, and all $(X_0, J, M)$,

Putting these all together, we deduce that there exist some other $R_0(T, C_\star)$ and $C(T, C_a, C_\star)$ such that

We conclude this subsection by combining the respective exponential concentrations of Lipschitz functions due to $\mu$, $P_A$ and $P_B$. Since
$$\le E\, P_B\big(|I_M| > r/3 \,\big|\, X_0, J\big) + E\, P_J\big(|I_J| > r/3 \,\big|\, X_0\big) + \mu\big(|I_{X_0}| > r/3\big)\,,$$
we see that the exponential concentrations for 1-Lipschitz functions of $\mu$, $P_A$ and $P_B$ lift to exponential concentration of $P$ for functions that are 1-Lipschitz in the triplet $(X_0, J, M)$ on $(E_N, \|\cdot\|_{\rm mix})$.

4.3. Proof of Theorem 2. We first prove a concentration estimate for $F$ at a fixed pair of times $t \in [0, T]^2$, before extending this to the full trajectory $(F(t))_{t\in[0,T]^2}$ by bounding the modulus of continuity of $F$.

Proposition 4.6. Suppose $\mu$, $P_A$ satisfy Hypotheses 1–2. There exists $C(T, C_a, C_\star, C_\mu)$ large, such that for any $F$ as in (1.9) with $\|a\|_\infty \le C_a$, $Y, Y' \in \mathcal F$, all $t \in [0, T]^2$, $\lambda > 0$ and $N \ge N_0(T, C_a, C_\star, C_\mu)$,

(4.14)

Proof. In proving [3, Lemma 2.5] it is shown, using a Lipschitz extension, that if $P$ satisfies exponential concentration for Lipschitz functions as in (1.10), and $V$ is an $A$-Lipschitz function on a set $\mathcal L$ on which $|V|$ is uniformly bounded by $K$, then for some universal constant $C > 0$ and every $\lambda > 0$,

For $R = R_0$ we can embed the constant factor $2D(R_0)$ into $C$ and further adjust $C_3$ to bound the pre-exponent $2(C_2 + K(R_0))$ within the factor $\exp(-\sqrt{R_0 N}/(2C_3))$ multiplying it, resulting in $q_N(\lambda; R_0)$ as in the top line on the rhs of (4.14). For a better tail decay, consider $R_\lambda = (\eta\log\lambda)^2 \ge R_0$, with $\eta = 1/(2C_1)$, so that $D(R_\lambda) = C_1 e^{C_1\eta\log\lambda} \le C_1\lambda/\log\lambda$ for all $\lambda \ge 4$. In addition, once $\sqrt N/(2C_3) \ge 4C_0T$, we can again embed the pre-exponent $2(C_2 + K(R_\lambda))/\lambda$ within the factor $\exp(-\sqrt{R_\lambda N}/(2C_3))$ multiplying it. Thus, upon adjusting the various constants, we end up with $q_N(\lambda; R_\lambda)$ as in the bottom line on the rhs of (4.14).
Setting hereafter $R$ for the larger of the $R_0$ and $R_\lambda$ values from the preceding proof of Proposition 4.6, recall that the event $\mathcal L^c_{N,R}$ was already ruled out as part of the derivation of (4.14). Thus, proceeding to prove Theorem 2, we fix $\varepsilon = N^{-k}$, $k > 1$, and apply Proposition 4.6 at the $M_N = \lceil TN^k\rceil^2$ grid points $t_{i,j} = (i\varepsilon, j\varepsilon)$ within $[0, T]^2$, to deduce by the union bound that

$$P(\mathcal L^c_{N,R}) + P\Big(\sup$$

It is easy to check that $2M_N q_N(\lambda)$ is further bounded by $p_N(3\lambda)$ of (1.11), once we suitably enlarge the constant $C$ on the rhs of (1.11) relative to that of (4.14). In addition, since the right-most term in (4.15) exceeds one whenever $E[|V|1_{\mathcal L^c_{N,R}}] = E[|F(t)|1_{\mathcal L^c_{N,R}}] \ge \lambda/2$, if that inequality holds for any $t \in [0, T]^2$, then $q_N(\lambda)$, and in turn $p_N(3\lambda)$ of (1.11), would exceed one. Thus, we may assume wlog that

Restricting to $\lambda > 1/\sqrt N$ (as otherwise $p_N(3\lambda) \ge 1$), and using $p_N(3\lambda) \gg M_N\exp(-(\lambda^2\wedge\lambda)N^k/C')$ (as $k > 1$) with the above, the stated bound of Theorem 2 follows from the following short-time estimates.