Diffusions interacting through a random matrix: universality via stochastic Taylor expansion

Consider $(X_i(t))$ solving a system of $N$ stochastic differential equations interacting through a random matrix $\mathbf{J} = (J_{ij})$ with independent (not necessarily identically distributed) random coefficients. We show that the trajectories of averaged observables of $(X_i(t))$, initialized from some $\mu$ independent of $\mathbf{J}$, are universal, i.e., they depend on the distribution of $\mathbf{J}$ only through its first and second moments (assuming, e.g., sub-exponential tails).
We take a general combinatorial approach to proving universality for dynamical systems with random coefficients, combining a stochastic Taylor expansion with a moment-matching type argument. Concrete settings for which our results imply universality include aging in the spherical SK spin glass, and Langevin dynamics and gradient flows for symmetric and asymmetric Hopfield networks.


Introduction
Markov processes with random coefficients arise in numerous contexts: e.g., dynamics of spin glasses, optimization on random landscapes, and learning with neural networks. In many cases, when the underlying randomness is Gaussian, they have been found to give rise to a rich class of behaviors, including metastability, trapping, and aging. The setting we consider here is that of a system of $N$ linearly coupled SDE's, where the couplings are encoded in a random matrix $\mathbf{J}$, driven by $N$ independent Brownian motions. That is, $X_t = (X_1(t), \ldots, X_N(t))$ is the solution to the SDS
$$dX_t = (\mathbf{J} X_t + h)\,dt + \Sigma(X_t)\,dB_t, \qquad (1.1)$$
where $\mathbf{J}$ is a random matrix with independent entries (up to, possibly, a symmetry constraint) and variance profile $m = (m_{ij})_{i,j}$ scaled such that $\mathbb{E}[\|\mathbf{J}\|_2] = O(1)$, $h$ is a bounded drift vector, and $\Sigma(X_t)$ is an affine transform of $X_t$. Note that for $\Sigma(X_t)$ non-constant, we do not expect an explicit closed-form solution to (1.1). In the $N \to \infty$ limit, the diffusions of (1.1) encompass many interesting and well-studied models of Markov processes with random coefficients, and give rise to rich and varied behavior. This includes metastability, aging, and non-Markovian limiting evolution equations, in e.g., randomly coupled (geometric) Brownian motions, and Langevin dynamics and gradient flows for the spherical Sherrington-Kirkpatrick (SK) spin glass and symmetric and asymmetric Hopfield networks [6,13,25-27]: concrete applications are described in Sect. 1.4. In many such examples, the analysis is more tractable when $\mathbf{J}$ is Gaussian, and one can use tools like Gaussian integration by parts, Girsanov's theorem, and the rotational invariance of the Gaussian ensemble.
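As a concrete illustration of a system of this form, the following minimal sketch simulates an SDS with drift $\mathbf{J}X_t + h$ and constant diffusion coefficient by Euler-Maruyama, and compares an averaged observable under Gaussian versus Rademacher couplings with matching first and second moments. All parameter choices and names here are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_sds(J, h, sigma, x0, T=1.0, dt=1e-3, rng=None):
    """Euler--Maruyama discretization of dX_t = (J X_t + h) dt + sigma dB_t."""
    rng = rng or np.random.default_rng(0)
    x = x0.astype(float).copy()
    for _ in range(round(T / dt)):
        x += (J @ x + h) * dt + sigma * np.sqrt(dt) * rng.standard_normal(len(x))
    return x

N = 200
rng = np.random.default_rng(1)
h, x0 = np.zeros(N), np.ones(N)

# couplings scaled by 1/sqrt(N) so that ||J||_2 = O(1); the two ensembles
# have matching first and second moments (mean 0, variance 1/N per entry)
J_gauss = rng.standard_normal((N, N)) / np.sqrt(N)
J_rad = rng.choice([-1.0, 1.0], size=(N, N)) / np.sqrt(N)

m_gauss = simulate_sds(J_gauss, h, 1.0, x0, rng=np.random.default_rng(2)).mean()
m_rad = simulate_sds(J_rad, h, 1.0, x0, rng=np.random.default_rng(2)).mean()
# universality predicts the two averaged observables agree up to small fluctuations
```

The same Brownian seed is reused for both ensembles so that the only difference between the two runs is the coupling distribution.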
In this paper, we develop a simple combinatorial framework for proving universality for the solution trajectories of SDS's of the form (1.1). Before describing our approach, we explain a few difficulties one encounters when trying to prove universality for solutions of randomly coupled dynamical systems, using some of the approaches developed for other universality results. We begin by considering a Lindeberg approach, where we examine the effect that re-sampling a single $J_{ij}$ has on an averaged statistic $F(t) = F(X_1(t), \ldots, X_N(t))$. The obstacle in employing such an approach is that changing $J_{ij}$ to $\tilde{J}_{ij}$, say, beyond affecting the drift $\sum_{1 \le i \le N} J_{ij} X_i(t) + h_j$ of the $j$-th coordinate of the SDS, also induces a highly non-linear effect both on $X_j(t)$ and on $X_i(t)$ for all $i \ne j$. The problem instead lends itself to comparing the effect of $\mathbf{J} \to \tilde{\mathbf{J}}$ in a more averaged way.
An alternative approach would be to use the linear structure of the problem in a strong way, relying on sharp universality results for the spectra of random matrices. This approach, while feasible if $\Sigma(X_t)$ is constant, requires one to diagonalize the problem without loss of generality; i.e., it requires an assumption of joint rotational invariance for the laws of $(X_0, \mathbf{J}, B)$. In [2], such an approach is followed for analyzing the dynamics of the spherical SK model, and their results hold assuming the law of $\mathbf{J}$ is invariant under the orthogonal group and its spectrum satisfies certain large deviation estimates satisfied by the GOE. However, this restriction would not include the cases of, e.g., the uniform measures on $[-1,1]^N$ and $\{\pm 1\}^N$, which lack the rotational symmetry, and could not include the case of non-constant $\Sigma(X_t)$.
Very recently, [17] proved a universality result for the asymmetric Langevin dynamics for the soft-spin SK model. There, large deviations theory was used to obtain exponential control on the empirical measure on sample paths (as obtained in the Gaussian setting in [6,7]), together with sharp control, via Girsanov's theorem, on the Radon-Nikodym derivative between the Gaussian paths and those driven by non-Gaussian $\mathbf{J}$ on short time scales, to show universality for the empirical measure of the trajectories. While such an approach allows for a deterministic non-linearity in the drift through a (double-well) confining potential, it cannot handle degenerate diffusions, e.g., the gradient flow. Further, the need for control on the trajectories at the exponential scale forces [17] to consider only asymmetric i.i.d. $\mathbf{J}$ (whereby the Radon-Nikodym derivative is a product of functions of independent rows of $\mathbf{J}^{\mathsf{T}}$).
We introduce a simple combinatorial approach to proving universality for SDS's of the form (1.1), similar in flavor to the moment method. Namely, we sidestep the inherent difficulty of the problem, that the transformation $\mathbf{J} \to \tilde{\mathbf{J}}$ affects $X_j(t)$ both through the couplings $(J_{ij})_i \to (\tilde{J}_{ij})_i$ and through the trajectories $(X_i(t))_i \to (\tilde{X}_i(t))_i$. We do so by Taylor expanding the semigroup $P_t f = \mathbb{E}_{X_0}[f(X_t)]$ in powers of the infinitesimal generator: each term appearing in this expansion is a polynomial in $(x_i), (J_{ij})$ evaluated at $X_0$ where, crucially, the initial data is independent of $(J_{ij})$. One then finds that on order one timescales, the predominant contribution to $\mathbb{E}[P_t f]$ is from polynomials whose degree in $(J_{ij})_{i,j}$ is at most two. We refer to Sect. 1.3 for more details.
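Schematically (with $P_t = e^{tL}$ for the generator $L$, and $\mathbb{E}$, $\tilde{\mathbb{E}}$ the expectations over the two coupling laws), the expansion underlying this strategy reads:

```latex
\mathbb{E}_{\mu}\,\mathbb{E}\big[P_t f(X_0)\big]
  \;=\; \sum_{k \ge 0} \frac{t^k}{k!}\,
        \mathbb{E}_{\mu}\,\mathbb{E}\big[(L^k f)(X_0)\big],
```

so each summand is a mixed moment of $(J_{ij})$ and $(X_i(0))$; monomials of degree at most two in $(J_{ij})$ have matching expectations under the two coupling laws, and only higher-degree monomials contribute to $\mathbb{E} - \tilde{\mathbb{E}}$.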
This approach works quite generally, and is robust to symmetric and asymmetric choices of $\mathbf{J}$ with non-homogeneous means and variances, and to general choices of diffusion coefficients in (1.1), including $\Sigma(X_t)$ non-constant, making the diffusion non-linear, and $\Sigma \equiv 0$, corresponding to a deterministic dynamical system. Lastly, the analysis works for arbitrary initializations independent of $\mathbf{J}$. The assumption of linear drift is, of course, important, and one would like to be able to drop it. We emphasize, though, that it is primarily used to justify the absolute convergence of the Taylor expansion of the semigroup, which one could hope to justify by other means for higher order diffusions given that a strong solution exists; the remaining combinatorial framework for moments of the generator may then generalize. We discuss this in Remark 1.5. We end this section by mentioning two recent results [1,10] showing universality for a Lipschitz family of approximate message passing (AMP) algorithms, a discrete-time state evolution that has found many applications to inference and optimization in high dimensions. Some of the ideas there appear similar in spirit to ours, using a combinatorial approach to control moments of the final state of the AMP. All the same, the general setting of (1.1) introduces many key differences: e.g., the diffusions of (1.1) are in general non-linear, not globally Lipschitz, and have a built-in stochasticity.

Setup: diffusions with random linear interactions
Consider an $N$-dimensional stochastic differential system with a mixture of random and deterministic linear interactions, along with, possibly, some constant drifts. More precisely, consider the SDS $X_t := (X_i(t))_{i=1}^N$ driven by the following parameters.
Suppose that for some matrix $m = (m_{ij})_{i,j}$ we have random interactions given by a random matrix $\mathbf{A} = (A_{ij})$ with mean-zero entries of variance profile $m$, i.e., $\mathbb{E}[A_{ij}] = 0$ and $\mathbb{E}[A_{ij}^2] = m_{ij}$. We assume that the entries $A_{ij}$ are either fully independent, or are independent up to a symmetry constraint $A_{ij} = A_{ji}$. Let $P_A$ be the law of $\mathbf{A}$. In order to scale the interactions to have an order one cumulative effect, it will be convenient to work with the rescaled interaction matrix $\mathbf{J} = \mathbf{A}/\sqrt{N}$. We then denote the distribution induced by $P_A$ on $\mathbf{J}$ by $P_J$.

We further consider additional deterministic interactions $\Lambda = (\lambda_{ij})_{i,j}$ whose entries and row supports are uniformly bounded by some constant $C < \infty$ (the $\|\cdot\|_0$-norm of a vector is its number of non-zero entries). We also consider external drift parameters $h = (h_i)_i$ and diffusion coefficients $\Sigma(X_t)$ governed by the matrix $\sigma = (\sigma_{ij})_{i,j}$.

The SDS $(X_t)_{t \ge 0} = (X_1(t), X_2(t), \ldots, X_N(t))_{t \ge 0}$, initialized from some random $X_0$ distributed according to a product measure $\mu$, is driven by a standard Brownian motion $(B_t)_{t \ge 0}$, where for ease of notation we hereon set $X_0(t) \equiv 1$, so that $(\sigma_{0j})_{j \ge 1}$ capture the constant diffusion coefficients. We denote the martingale part of $X_t$ by $M_t$. The process $X_t$ is well-defined for a.e. $\mathbf{J}$ and all $t \ge 0$ (as we have finite, possibly $N$-dependent, operator norms $\|\mathbf{J}\|_2$, $\|\Lambda\|_2$ and $\|(\sigma_{ij})_{i \ge 1}\|_2$; see e.g., [31, Theorem 5.2.1]).

Notational comment: There are three distinct sources of randomness above dictating the law of the solution $X_t$ to (1.2): the law of the interaction matrix, $P_J$; the law of the Brownian motions, denoted $P_B$; and the law of the initial data, $\mu$. Each of these is a product measure, and we do not distinguish notationally between the laws of the individual entries of $\mathbf{J}$, $B$ or $X_0$ and the ensembles.
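A quick numerical sanity check of this scaling convention (an illustrative sketch; the flat Gaussian profile $m_{ij} \equiv 1$ is just for demonstration): with $\mathbf{J} = \mathbf{A}/\sqrt{N}$, the operator norm of $\mathbf{J}$ stays of order one as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
norms = []
for N in (100, 400, 800):
    A = rng.standard_normal((N, N))   # E[A_ij] = 0, E[A_ij^2] = m_ij = 1
    J = A / np.sqrt(N)                # rescaled interaction matrix
    norms.append(np.linalg.norm(J, 2))
# each operator norm is close to 2 (the spectral edge), uniformly in N
```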
In proving universality, we consider the difference between $P_J$ and $P_{\tilde{J}}$ induced by two different distributions $P_A$ and $P_{\tilde{A}}$ over mean-zero random matrices $\mathbf{A}, \tilde{\mathbf{A}}$ with independent entries (possibly up to symmetry), having matching variance profiles $m = \tilde{m}$. For ease of notation, we henceforth denote the corresponding expectations by $\mathbb{E}$ and $\tilde{\mathbb{E}}$ respectively.

Main results
We begin by describing the observables to which our universality results apply. The building blocks of these observables are chosen from a family $\mathcal{F}$ of vector-valued functions of the trajectory. We establish universality in the mean for weighted empirical averages of monomials in functions from $\mathcal{F}$ evaluated at a finite collection of times. Specifically, fix an $m$-tensor $a = (a_{i_1, \ldots, i_m})$ with entries bounded by $C_a$ and a $p$-tuple of times $t = (t_1, \ldots, t_p)$; for every $\ell \le m$, fix $p$ observables $Y^{(\ell,1)}, \ldots, Y^{(\ell,p)} \in \mathcal{F}$ which are to be evaluated at these $p$ times, and let $F$ denote the resulting weighted empirical average (1.5). We also need to add a sub-exponential tail constraint on $\mu$ and $P_A$ beyond the minimal assumptions of zero mean and matching variances of $P_A$ and $P_{\tilde{A}}$; this is henceforth referred to as Hypothesis 1.

Hypothesis 1
Assume that the law $\mu$ is a product of laws $\mu_i$ of $X_i(0)$ having finite moments of all orders, which are bounded uniformly over $i$ and $N$. That is, there exist constants $C_\mu(r) \ge 1$ such that for any finite $r$,
$$\sup_{i \le N} \mathbb{E}_\mu\big[|X_i(0)|^r\big] \le C_\mu(r).$$
Further assume $P_A$ has uniformly bounded exponential tails, i.e., there exists $C_A < \infty$ such that the following equivalent properties hold: $\sup_{i,j} \mathbb{E}[e^{|A_{ij}|/C_A}] < \infty$, or equivalently (up to a change in $C_A$), $\sup_{i,j} \mathbb{E}[|A_{ij}|^r] \le (C_A r)^r$ for all $r \ge 1$. (1.8) For ease of notation for dependencies on constants, we denote by $C_\star$ the maximum of the constants above, and state our first result, on universality at the level of the mean (hence also of moments), for observables of the form (1.5).
Theorem 1 Let $\mu, P_A, P_{\tilde{A}}$ satisfy Hypothesis 1 and suppose that $\mathbf{A}, \tilde{\mathbf{A}}$, symmetric or independent, are mean zero with matching variance profiles $m = (m_{ij})_{i,j}$. For any $T, m, p < \infty$ and $a \in \mathbb{R}^{N^m}$ with $\|a\|_\infty \le C_a$, there exists $C(T, m, p, C_a, C_\star, C_\mu) < \infty$ such that for every $N$ and $F$ as in (1.5),
$$\sup_{t \in [0,T]^p} \big| \mathbb{E}[F] - \tilde{\mathbb{E}}[\tilde{F}] \big| \le C N^{-1/2}.$$
Theorem 1 follows from a more general result bounding the difference in expectations for each individual monomial in $(Y^{(\ell,1)}, \ldots, Y^{(\ell,p)}) \in \mathcal{F}$. As a special case (see Proposition 2.1), we find that the moments of each spin $X_i(t)$ are universal: for every fixed $k$, the difference $\mathbb{E}[X_i(t)^k] - \tilde{\mathbb{E}}[\tilde{X}_i(t)^k]$ vanishes as $N \to \infty$, uniformly over $t \in [0,T]$. For a more restricted class of observables, with additional restrictions on the distributions $\mu$, $P_A$ and $P_{\tilde{A}}$, we extend the above to almost sure and $L^q$ convergence for the observable trajectories. Precisely, we restrict the observables of (1.5) to $m = 1$ and $p = 2$, leaving the following quadratic observables:
$$F = \frac{1}{N} \sum_{i \le N} a_i\, Y_i(t_1)\, \bar{Y}_i(t_2), \qquad Y, \bar{Y} \in \mathcal{F}. \qquad (1.10)$$
In order to extend Theorem 1 to convergence for the trajectories of these observables, we further need to assume that $\Sigma$ is constant, so that $M_t$ is just a scaled Brownian motion, and to assume the following concentration property on $\mu, P_A, P_{\tilde{A}}$, which we refer to as Hypothesis 2.
Hypothesis 2 A sequence of probability measures $(P^{(n)})_{n \ge 1}$ of random variables $Z_n$ in metric spaces $(\mathcal{X}_n, d)$ satisfies exponential concentration for Lipschitz functions if there exists some $C > 0$ such that for any sequence of 1-Lipschitz functions $f_n : (\mathcal{X}_n, d) \to (\mathbb{R}, |\cdot|)$ and all $\lambda > 0$,
$$P^{(n)}\big( |f_n(Z_n) - \mathbb{E}[f_n(Z_n)]| > \lambda \big) \le C e^{-\lambda / C}.$$
Assume that $\mu$ and $P_A$ respectively satisfy exponential concentration for Lipschitz functions on $\mathbb{R}^N$ and $\mathbb{R}^{N^2}$ (or $\mathbb{R}^{N(N+1)/2}$ if $\mathbf{A}$ is symmetric), equipped with their Euclidean norms, for some constants $C_\mu, C_A > 0$.

Remark 1.1
Recall, from the theory of measure concentration, that Hypothesis 2 holds for any distribution on $\mathbb{R}^n$ which satisfies a Poincaré inequality with constant $c > 0$ (independent of $n$), namely, for all nice $f$ one has
$$\operatorname{Var}(f) \le \frac{1}{c}\, \mathbb{E}\big[\|\nabla f\|_2^2\big]$$
(see [21]). By the tensorization of the Poincaré inequality, if $Z_n = (Z_1, \ldots, Z_n)$ and each of the laws of $Z_i$ satisfies this inequality, then the product also satisfies it, with the worst of the constants $c$. Since $\mu, P_A$ are product measures here, the marginal laws can come from any distribution satisfying a Poincaré inequality in $n = 1$. These include (see e.g., [39]):
- exponential, Gaussian, and log-concave measures of the form $\exp(-V(x))$ for $V(x)$ strictly convex,
- linear functionals of random variables satisfying a Poincaré inequality: e.g., the uniform measure on $[-1, 1]$.
The next theorem shows that under Hypothesis 2, any F of the form (1.10) concentrates around its mean.
Theorem 2 Suppose $\mu, P_A$ satisfy Hypotheses 1-2 and the diffusion coefficients $\Sigma$ are constant. Then every $F$ of the form (1.10) concentrates around its mean: for every $T < \infty$ and $\lambda > 0$, uniformly over $t_1, t_2 \in [0,T]$, the probability of a deviation of size $\lambda$ is exponentially small, at rate $\exp(-\Omega(\sqrt{N}))$ for fixed $\lambda$; see (1.12). (One might observe that the $\exp(-\Omega(\sqrt{N}))$ concentration in (1.12) differs from the more traditional $\exp(-\Omega(N))$ concentration in e.g., [2,3]; such differences, which recur throughout the paper, arise because our Hypothesis 2 allows for merely sub-exponential, as opposed to Gaussian, tails.) Combining Theorems 1 and 2, we obtain the following strong universality for such quadratic observables.

Corollary 3
Suppose $\mu, P_A, P_{\tilde{A}}$ satisfy Hypotheses 1-2, where $\mathbf{A}, \tilde{\mathbf{A}}$, symmetric or independent, are mean zero and have matching variance profiles $m = (m_{ij})_{i,j}$. Let $F(\cdot)$ and $\tilde{F}(\cdot)$ be as in (1.10), for $a \in \mathbb{R}^N$ such that $\|a\|_\infty \le C_a$, with respect to the corresponding solutions $X_t, \tilde{X}_t$ of (1.2) with constant $\Sigma$, i.e., $\sigma_{ij} = 0$ if $i \ne 0$. Then, for every $T < \infty$, the difference $Z_N := \sup_{t_1, t_2 \le T} |F - \tilde{\mathbb{E}}[\tilde{F}]|$ vanishes as $N \to \infty$, almost surely and in $L^q$ for every $q \ge 1$.

Proof The observables of (1.10) correspond to the $m = 1$ and $p = 2$ case of (1.5), so Theorem 1 applies here with some constant $C_1 = C(T, m, p, C_a, C_\star, C_\mu)$. For $N \ge (\lambda/C_1)^2$, combining the triangle inequality with Theorems 1-2 bounds the probability of a deviation of size $2\lambda$ by an exponentially small quantity $p_N(\lambda)$, and almost sure convergence follows. Similarly, using the triangle inequality for $\|\cdot\|_q$, we get from Theorems 1 and 2 a corresponding tail bound on $Z_N$. Further, $N \mapsto p_N(\cdot)$ decreases pointwise on $[C, \infty)$, while for any $q \ge 1$ the associated tail integral is finite for all $N$ large enough. With $\{Z_N^q\}_N$ uniformly integrable, it follows that $Z_N \to 0$ also in $L^q$.

Proof strategy
As mentioned in the introduction, traditional approaches to proving universality run into substantial difficulty when applied to diffusions with random coefficients. The dependence on specific entries of the random matrix is quite intricate, as each entry $J_{ij}$ enters the drift both directly and through its effect on the history of $X_t$, which evidently also depends on $J_{ij}$: this effect can exponentially amplify small differences; in fact, the exponential amplification is inherent to the problem at hand.
At a high level, our strategy for proving Theorem 1, and the main novelty of the paper, is to leverage the independence of $\mu$ from $P_J, P_{\tilde{J}}$ by pulling back $f(X_t)$ and $f(\tilde{X}_t)$ to properties of (time) derivatives of $f(X_t)$ evaluated at $t = 0$. At the level of expectations, these derivatives can be seen as iterates of the infinitesimal generator applied to the function $f$, which can then be controlled by combinatorial moment methods. The dominant contribution comes from terms that are polynomials of degree at most two in $(J_{ij})_{i,j}$. Since the first two moments of $P_A$ and $P_{\tilde{A}}$ match, these terms do not contribute to the difference in expectations. We emphasize that the approach does not rely on an explicit solution to the SDE of (1.2), nor does it use exponential control or large deviations theory as in [17], or refined estimates on the spectrum of $\mathbf{A}$ as in the setting of [2] where, crucially, the process has a rotational symmetry.
Recall that the SDE defined in Eq. (1.2) has an infinitesimal generator $L$ that we split (see e.g., [31]) as
$$L = L_J + L_\Lambda + L_h + L_\Sigma, \qquad (1.13)$$
where the four terms collect, respectively, the random interactions, the deterministic interactions, the external drift, and the diffusion part. By Ito's formula, we have for every $f$, say, in $C^\infty(\mathbb{R}^N)$, that $\mathbb{E}_B[f(X_t)] = P_t f(X_0)$, where $P_t = P_t(\mathbf{J})$ denotes the semigroup operator
$$P_t = e^{tL} \qquad (1.14)$$
in terms of the generator $L$. In order to reduce the problem to a combinatorial question, we wish to Taylor expand the semigroup operator $P_t f = e^{tL} f$. As long as $f$ is smooth and the Taylor expansion converges absolutely (shown in Sect. 2.2), this formal expansion is valid, and we can exchange expectations over $\mu, P_J, P_{\tilde{J}}$ with the sum and compute expectations of powers of the generator $L$ acting on $f$. Namely, the difference in expectations is bounded by controlling (1) the size in $N$, and (2) the growth in $k$, of
$$\big| \mathbb{E}_\mu \mathbb{E}\big[(L^k f)(X_0)\big] - \mathbb{E}_\mu \tilde{\mathbb{E}}\big[(\tilde{L}^k f)(X_0)\big] \big|. \qquad (1.15)$$
Expanding these terms as words in $L_J, L_\Lambda, L_h, L_\Sigma$, we observe that a non-zero difference between the two expectations in (1.15) can only come from the summands (monomials in $J, X, \Lambda, h, \sigma$) satisfying:

- every $J_{ij}$ that is present must appear at least twice, and
- at least one $J_{ij}$ must appear at least three times.

This is because the means of $P_A, P_{\tilde{A}}$ are zero, and the variances of $P_A$ and $P_{\tilde{A}}$ match. A careful analysis of this combinatorial problem for the monomials eventually yields that the contributions from these monomials are, together, $O(N^{-1/2})$ in $N$ and $o(k!)$ in $k$; this computation is carried out in Sect. 2.3.
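To illustrate the moment-matching mechanism numerically (an illustrative sketch, not the paper's computation): two entry laws with matching first and second moments agree on all monomials of degree at most two in the entries, and first differ at degree three, where the typical size is already $O(N^{-3/2})$. The centered-exponential law below is a hypothetical choice of a skewed, variance-one distribution.

```python
import numpy as np

N, n = 50, 200_000
rng = np.random.default_rng(0)

# two mean-zero, variance-one entry laws: Gaussian vs. centered exponential
a_gauss = rng.standard_normal(n)
a_cexp = rng.exponential(1.0, n) - 1.0

# rescaled entries J_ij = A_ij / sqrt(N)
J_gauss, J_cexp = a_gauss / np.sqrt(N), a_cexp / np.sqrt(N)
moments = {k: (np.mean(J_gauss**k), np.mean(J_cexp**k)) for k in (1, 2, 3)}
# degrees 1 and 2 match (0 and 1/N); degree 3 differs, but only at order N^{-3/2}
```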

Remark 1.2
One may notice that in the case where $\Sigma(X_t)$ is constant, so that $M_t$ is just a Brownian motion, we are left with a linear SDS, and one could use this linearity in a more central way, explicitly solving for expectations of monomials in $(X_i(t))_i$ as Gaussian integrals and time integrals over words in $e^{s\mathbf{J}}$ and $(X_i(0))_i$. If the system is invariant under rotations, one can work in the eigenbasis of $\mathbf{J}$, so that it is diagonal, and apply universality results for the spectrum of $\mathbf{J}$. Absent rotational symmetry, however, the natural step would be to Taylor expand $e^{s\mathbf{J}}$, at which point the expansion and the resulting combinatorics would be similar to, and perhaps less transparent than, our generator based approach. Of course, for non-constant $\Sigma(X_t)$ as in Theorem 1, the SDS is non-linear, and such an approach would not generalize.
In Sect. 3, we extend this bound on the difference in expectations of statistics $f$ to multi-time observables, then to statistics that contain the driving martingale terms, and finally establish universality at the level of expectations for observables of the form (1.5), as stated in Theorem 1. In Sect. 4, we adapt the approach of [3] to establish Theorem 2, namely, to show that the restricted class of observables of (1.10) concentrate around their expectations, by localizing to a set of large probability on which the observables are Lipschitz in the underlying randomness, and using Hypothesis 2.

Applications
In this section, we discuss systems for which Theorem 1-Corollary 3 imply concrete universality results. All the examples that follow will be in the context of $\Sigma$ that is constant, i.e., $\sigma_{ij} = 0$ if $i \ne 0$, where both Theorems 1-2 apply. Among the examples with non-constant $\Sigma$, one which may be of interest is a system of geometric Brownian motions interacting linearly through $\mathbf{J}$.
We next describe two well-studied families of Markov processes/dynamical systems to which our results apply: Langevin dynamics and gradient flows on various energy landscapes (Hamiltonians) or loss functions.

Langevin dynamics
In the case where $\mathbf{J}$ and $\Lambda$ are symmetric matrices and the $\sigma_{0j}$ are identically one, (1.2) corresponds exactly to the Langevin dynamics for a quadratic Hamiltonian $H(x)$ built from $\mathbf{J}$, $\Lambda$ and $h$; the linearity of the diffusion corresponds to the Hamiltonian being quadratic. The Langevin dynamics is a reversible Markov process designed such that, when non-degenerate, its invariant measure on $\mathbb{R}^N$ is the Gibbs measure proportional to $e^{-\beta H(x)}\,dx$ at inverse temperature $\beta$. For Hamiltonians coming from spin glass theory, the Langevin dynamics has been analyzed at length in the case of Gaussian disorder, and found to have a varied and rich behavior; in Sect. 1.4.1, we explore this further in the context of a simple spin glass model, the spherical SK model.
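As a sanity check on this reversibility, the following one-dimensional sketch (an illustrative toy with quadratic Hamiltonian $H(x) = x^2/2$, not the paper's $N$-dimensional system) runs Euler-Maruyama Langevin dynamics $dX = -H'(X)\,dt + \sqrt{2/\beta}\,dB$ and verifies equilibration to the Gibbs measure, here a centered Gaussian of variance $1/\beta$.

```python
import numpy as np

beta, dt, n_chains, n_steps = 2.0, 1e-2, 10_000, 5_000
rng = np.random.default_rng(0)

x = np.zeros(n_chains)
for _ in range(n_steps):
    # Euler--Maruyama for dX = -H'(X) dt + sqrt(2/beta) dB, with H(x) = x^2/2
    x += -x * dt + np.sqrt(2 * dt / beta) * rng.standard_normal(n_chains)

var = x.var()   # stationary law is N(0, 1/beta), so var should be close to 0.5
```

Total simulated time is 50, far exceeding the O(1) relaxation time, so the chains are well mixed.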

Gradient flows
The case where the $\sigma_{0j}$ are identically zero, i.e., where, besides the randomness of $\mathbf{J}$ and, possibly, the initial data, the dynamics is deterministic, also fits into the framework of the paper. Here, given $\mathbf{J}$ and $X_0$, the law of the dynamics is taken to be the delta function on the trajectory of the solution to the resulting system of ODE's. This corresponds to the gradient flow on $H(x)$: in optimization and learning settings, e.g., the examples of Sects. 1.4.2-1.4.3, gradient descent and its many variants are favored methods.
We now turn to a few well-studied concrete problems to which our results are applicable.

The (soft) spherical SK model
The dynamics of spin glasses are a canonical setting in which Markov processes with random coefficients are studied in their thermodynamic ($N \to \infty$) limit. The short-time ($N \to \infty$, then $T \to \infty$) behavior of Langevin dynamics, especially in the context of spin glasses, has been extensively studied in both the physics and mathematics literature [2-7,11,12,15,18]. Perhaps the most well-known mean-field spin glass is the Sherrington-Kirkpatrick (SK) spin glass, where $N$ spins taking values in $\{+1, -1\}$ interact pairwise with one another, and their interaction strengths are mediated by "coupling" parameters $J_{ij} = J_{ji}$ which are drawn i.i.d., say, Gaussian. We discuss a simplification of this known as the spherical SK model, which has been found to nevertheless exhibit some of the same phenomena.
Take a symmetric matrix $\mathbf{J} = (J_{ij})_{i,j}$ with i.i.d. entries (up to symmetry) and law $P_J$. The spherical SK model has Hamiltonian
$$H(x) = \langle x, \mathbf{J} x \rangle. \qquad (1.17)$$
To avoid differential geometry on the sphere, it is sometimes preferable to extend the Hamiltonian to all $x \in \mathbb{R}^N$ (note that the Hamiltonian is homogeneous, so that dividing $x$ by the normalized Euclidean norm $\|x\|/\sqrt{N}$ gives the same process on $S^{N-1}(\sqrt{N})$). Instead of adding a non-linear confining force as is done in, e.g., [2], we either add a linear confining force $F_K(x) = Kx$, or have no confinement ($K = 0$); the linearity of the system ensures no finite-time blowup. Consider now the Langevin dynamics at inverse temperature $\beta > 0$ for the Hamiltonian of (1.17), confined via $F_K$; this is the SDS denoted (1.18). We also consider the gradient flow, where we take $\beta = \infty$ so that the Brownian motion term drops out: $X_t$ is then the (deterministic) dynamical system following the (random) gradient vector field of $H(x) + F_K(\|x\|^2/N)$. The following universality for the above system is an immediate consequence of Corollary 3.
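The following sketch illustrates this universality numerically at $\beta = \infty$, $K = 0$ (illustrative parameter choices; the linear flow $dX = \mathbf{J}X\,dt$ stands in for the gradient flow of the quadratic Hamiltonian, up to a constant factor): the autocorrelation $C_N(s,t) = \langle X_s, X_t \rangle / N$ takes nearby values for Gaussian and Rademacher symmetric couplings with matching first two moments.

```python
import numpy as np

def sym_matrix(draw, N, rng):
    """Symmetric matrix with i.i.d. entries on/above the diagonal, scaled by 1/sqrt(N)."""
    A = draw((N, N), rng)
    A = np.triu(A) + np.triu(A, 1).T
    return A / np.sqrt(N)

def autocorrelation(J, x0, s, t, dt=1e-3):
    """C_N(s, t) = <X_s, X_t>/N along the linear flow dX = J X dt (Euler steps)."""
    si, ti = round(s / dt), round(t / dt)
    x, xs, xt = x0.copy(), None, None
    for step in range(max(si, ti) + 1):
        if step == si:
            xs = x.copy()
        if step == ti:
            xt = x.copy()
        x = x + dt * (J @ x)
    return xs @ xt / len(x0)

N, rng = 300, np.random.default_rng(0)
x0 = rng.choice([-1.0, 1.0], size=N)   # product initialization, independent of J
J_g = sym_matrix(lambda shape, r: r.standard_normal(shape), N, rng)
J_r = sym_matrix(lambda shape, r: r.choice([-1.0, 1.0], size=shape), N, rng)
c_g = autocorrelation(J_g, x0, 0.5, 1.0)
c_r = autocorrelation(J_r, x0, 0.5, 1.0)
# universality: both approach the same deterministic limit as N grows
```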

Corollary 1.3 Fix $\beta \in (0, \infty]$ and consider the SDS's $X_t$ and $\tilde{X}_t$ given by (1.18) for $\mathbf{A}$ and $\tilde{\mathbf{A}}$ having mean zero and matching variance profiles $m_{ij} = \mathbf{1}\{i \ne j\}$. Suppose $\mu$ is independent of $P_A, P_{\tilde{A}}$ and these satisfy Hypotheses 1-2. Then, for $F$ as in (1.10) with $Y, \bar{Y} \in \mathcal{F}$ and $\|a\|_\infty \le C_a$, the conclusions of Corollary 3 hold for every $T < \infty$.

As shown in [14], and rigorously proved in [2], when $\mathbf{J}$ is Gaussian, the spherical SK model, or the soft spherical SK model with confining potential $F$ satisfying $F(x)/x \to \infty$ as $x \to \infty$, exhibits a sharp aging transition. Informally, aging is the notion that the older a system gets, the more it remembers its past; formally, it corresponds to a transition in the behavior of the auto-correlation
$$C_N(s, t) := \frac{1}{N} \sum_{i \le N} X_i(s) X_i(t).$$
In [2], it was established that for $\mathbf{J}$ having a rotationally invariant law, e.g., a GOE matrix, $C_N(s,t)$ solves a non-linear equation [2, Eq. (2.16)], which exhibits exactly this type of transition at some $\beta_{\mathrm{ag}}$. Our results allow us to read off universality for this limiting behavior, as formalized in the following corollary. In the specific case of $\beta = \infty$, the conclusions of [2, Sect. 3.2.2] apply, and the solution exhibits aging: i.e., there is a $\gamma > 0$ (specified therein) such that for every $\lambda > 1$, the limiting normalized auto-correlation at the time pair $(s, \lambda s)$ decays, as $s \to \infty$, at a polynomial rate governed by $\gamma$.

Proof For the first statement, while [2, Theorem 2.6] is stated for confinement $F$ growing super-linearly, following the proof one sees that this is only used to localize the process, for which it suffices for $K$ to exceed $\|\mathbf{J}\|_{2 \to 2}$ (which for Wigner matrices is a.s. less than $2 + \epsilon$ for any $\epsilon > 0$). The first part of the corollary therefore follows from Corollary 1.3. The limiting behavior of $C_N(s,t)$ was analyzed in [2] only for a specific choice of quadratic $F$. One could in principle perform the same analysis with other choices of $F$, including the linear $F = F_K$ corresponding to the case we consider, and understand the limiting behavior of $C_N(s,t)$ as $N \to \infty$ then $s, t \to \infty$ as $\beta$ varies. We do not pursue this, and instead notice that in the specific case of $\beta = \infty$, the homogeneity allows us to disregard the choice of the confining potential and obtain universality for the zero-temperature aging behavior.
To see this, note that since $H(x)$ is a homogeneous polynomial, at $\beta = \infty$ the increment $dX_t$ is a constant multiple (with a constant depending only on $\|X_t\|$) of $d(X_t/\|X_t\|)$. Therefore, at $\beta = \infty$, the projection of the dynamics (1.18) onto the sphere $S^{N-1}(\sqrt{N})$ matches the projection of the Langevin SDS of [2], regardless of the choice of confining potential used therein. We apply Corollary 1.3 first to deduce that $\lim_{s \to \infty} \lim_{N \to \infty} C_N(s,s) =: C_\infty$ is the same for Gaussian and non-Gaussian $P_A$. Then, applying it to $C_N(s, \lambda s)$, we find that the $N \to \infty$ limit of the normalized auto-correlation is the same for Gaussian and non-Gaussian $P_A$, and is further independent of the choice of confining potential: as such, for any $P_A$, it has the same $N \to \infty$ limit as in [2].

Remark 1.5
It would be of interest to consider similar Langevin dynamics for the spherical or soft spherical $p$-spin glass models for $p > 2$. Permitting higher order interactions gives rise to a wealth of more complicated models and different behavior. At the level of the off-equilibrium Langevin dynamics, these lead to the famous Cugliandolo-Kurchan/Crisanti-Horner-Sommers limit of coupled integro-differential equations for $C_N(s,t)$ and an integrated response [3,11,12,15,16,18,23], as well as the evolution of other observables, e.g., the Hamiltonian and its square gradient [5]. Our combinatorial framework suggests that the differences in expectations (over $p$-tensors $\mathbf{J}$ and $\tilde{\mathbf{J}}$) of averaged observables are microscopic, as long as there is a non-linear confining potential to prevent finite-time blowup. The complication lies in the fact that the two non-linearities (from the interactions and from the confining potential) must cancel out, but these cancellations are not easily seen in the Taylor series obtained by expanding in powers of the generator; thus we are unable to show that this series is absolutely summable and to exchange the infinite sum with its expectation.

Symmetric and asymmetric Hopfield networks
Let us also mention a different context in which diffusions of the form (1.1) appear. Hopfield networks were introduced by [26] and have become one of the simplest and most fundamental examples of neural networks. In this model, each of $N$ neurons activates or de-activates according to whether its input, a weighted combination of the states of the other neurons, exceeds a deterministic threshold $h_i$. This model was introduced in the symmetric setting, but has since been analyzed extensively in both symmetric and asymmetric setups [13,25,41].
One typically initializes the neurons at some pre-determined state independent of $\mathbf{J}$, e.g., all inactive/active, or uniformly at random, and tracks their time-evolution, whereby each neuron activates/de-activates at some rate depending on the relationship between its input and its threshold. Though there are many ways this is implemented, one is to soften the problem to a continuous state space, either on the sphere or in full space, and to add stochasticity by running some Langevin dynamics. This is the approach pursued in [13] as well as, e.g., [41]. Then, with a linear confining force, our results imply universality for both the symmetric and asymmetric Langevin dynamics (and gradient flow) of general Hopfield networks: this includes universality for observables capturing the energy/loss in the network, its square gradient, and its "memory".

Rayleigh quotient minimization for random matrices
We conclude with a related optimization problem in high dimensions: optimizing the Rayleigh quotient of a random matrix $\mathbf{J}$ with a given mean and variance profile. Maximizing the Rayleigh quotient is an efficient way to find the top eigenvector and eigenvalue of the random matrix via local iteration, e.g., either gradient descent or Langevin dynamics at low temperatures (large $\beta$). To place this in the framework of (1.2), take $H(x) = \langle x, \mathbf{J} x \rangle$ and either no confining force, or $F_K$ with some $K > \|\mathbf{J}\|_{2 \to 2}$, in (1.18). In the situation where the matrix ensemble is rotationally invariant, e.g., the GOE, the limiting trajectories of, say, $H(X_t)$ for the gradient flow/Langevin dynamics can be explicitly solved (by diagonalization). Corollary 3 implies these limiting trajectories are universal, and thus match the limiting trajectories obtained when $\mathbf{J}$ is not Gaussian. In [1,10], similar universality results were described for an AMP approach to finding the top eigenvalue/eigenvector of $\mathbf{J}$.
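A minimal sketch of such a local iteration (illustrative step sizes and names, not the paper's dynamics): projected gradient ascent on the Rayleigh quotient over the unit sphere, which drives the iterate toward the top eigenpair of $\mathbf{J}$.

```python
import numpy as np

def rayleigh_ascent(J, x0, steps=3000, lr=0.02):
    """Projected gradient ascent on R(x) = <x, Jx>/<x, x>, constrained to the sphere."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        g = J @ x
        g -= (x @ g) * x              # project the gradient onto the sphere's tangent space
        x += lr * g
        x /= np.linalg.norm(x)
    return x @ (J @ x)                # Rayleigh quotient, approaching lambda_max(J)

N, rng = 400, np.random.default_rng(0)
A = rng.standard_normal((N, N))
J = (A + A.T) / np.sqrt(2 * N)        # Wigner-type scaling: lambda_max near 2
lam = rayleigh_ascent(J, rng.standard_normal(N))
```

Up to normalization, each step is power iteration with the matrix $I + c\,\mathbf{J}$ for a small $c$, so the Rayleigh quotient increases toward the top eigenvalue.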

Universality of expectations of monomial observables
In this section, we prove that for two solutions $X$ and $\tilde{X}$ of (1.2), driven by $\mathbf{J}$ and $\tilde{\mathbf{J}}$ respectively, the expectations of observables of the form (1.10) are universal, as long as $\mathbf{A}$ and $\tilde{\mathbf{A}}$ have the same variance profiles. As discussed in Sect. 1.3, we reduce differences in expectations to combinatorial calculations by expanding the Markov transition semigroup of the process $X_t$ in terms of its generator, an approach to proving universality in randomly driven dynamical systems which is the key contribution of this paper.
For the entirety of this paper, we take two distributions $P_A$ and $P_{\tilde{A}}$ on $\mathbf{A}$ and $\tilde{\mathbf{A}}$ that are mean zero and have the same, uniformly bounded, variance profiles $m = \tilde{m}$. Recall that $P_A$ and $P_{\tilde{A}}$ are either fully independent or symmetric ensembles. For conciseness, we present our results in the case of fully independent (in particular, not symmetric) entries. The symmetric case is handled mutatis mutandis and only induces a few constant factors in certain estimates (see Remark 2.8 for more on these minimal modifications).

Main result on difference in expectations
The observables in Theorem 1 are composed of polynomials in J and X, as well as M. We first establish the universality of expectations for general monomials in J and X via a combinatorial moment-matching type of argument. In Sect. 3, such universality for monomials that additionally involve the martingale is reduced to that of monomials only in J and X.
Then consider observables f α,γ (x) of the form (2.1). For an s-tuple of pairs α, let
- I α count the number of distinct pairs in α, i.e., I α = |{α 1 , . . . , α s }|,
- I α,1 count the number of (α k ) k which appear exactly once in α, and
- I + α,1 equal I α,1 plus the indicator that no pair appears more than twice in α.
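These three counts can be read off directly from the multiset of pairs; the following minimal Python sketch (with a hypothetical tuple α and a function name of our choosing) illustrates the definitions.

```python
from collections import Counter

def index_counts(alpha):
    """Given an s-tuple of index pairs alpha, return (I_alpha, I_alpha_1, I_alpha_1_plus):
    the number of distinct pairs, the number of pairs appearing exactly once,
    and the latter plus the indicator that no pair appears more than twice."""
    counts = Counter(alpha)
    I_alpha = len(counts)
    I_alpha_1 = sum(1 for c in counts.values() if c == 1)
    I_alpha_1_plus = I_alpha_1 + int(all(c <= 2 for c in counts.values()))
    return I_alpha, I_alpha_1, I_alpha_1_plus

alpha = [(1, 2), (3, 4), (1, 2), (5, 6)]
print(index_counts(alpha))  # (3, 2, 3): (1,2) twice, (3,4) and (5,6) once, max multiplicity 2
```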
Our bound on the distance between the expectations of f α,γ (X t ) and f α,γ (X̃ t ) depends on α, γ and the laws μ, P A , P Ã only through C , C μ , s, r and I + α,1 . More precisely, we derive here the following.
Observe that in the case s = 0, the right-hand side is C N −1/2 .

Remark 2.2
The above theorem shows that having more distinct J 's in the observable decreases the difference in expectations by more than the factor N −s/2 that would be expected from the typical size of J i j . This should be expected due to CLT-type cancellations: one way to motivate this scaling is by recalling averaged statistics which have J in them in the context of the spherical SK model (notice that such statistics are not rescaled by the number of order-one sized monomials, but remain on the O(1) scale due to additional cancellations from (J i j )). This gain in the scaling has to be visible at the level of the difference in expectations under P and P̃ in order to hope for universality for such statistics.
Recall from Sect. 1.3 that our high-level strategy is to reduce the expectations of statistics of the solution X t of the SDE to combinatorial calculations in terms of mixed moments of J and X 0 . This is possible by writing E B [ f (X t )] as P t f (X 0 ) and then Taylor expanding P t = e t L , where L is the generator of the process X t as defined in (1.13). In order for this expansion to be valid, and therefore our approach to be permissible, we need the Taylor expansion for e t L to converge absolutely, for each fixed N . In the next sub-section, we show that indeed with μ, P A , P Ã satisfying Hypothesis 1, for each fixed N , the infinite series corresponding to P t f converges absolutely, so we can follow this plan.
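The expansion P t f = Σ k t k L k f /k! can be sanity-checked in a toy setting. The sketch below uses a hypothetical 3-state Markov generator (a matrix) in place of the diffusion generator of (1.13), and shows the truncated Taylor sums converging rapidly in the truncation order.

```python
import numpy as np
from math import factorial

# Toy illustration (hypothetical 3-state chain, not the diffusion of the paper):
# the truncated sums sum_{k<=K} t^k L^k f / k! for P_t f = e^{tL} f converge
# absolutely, and rapidly, in the truncation order K.
L = np.array([[-1.0, 0.5, 0.5],
              [0.2, -0.7, 0.5],
              [0.3, 0.3, -0.6]])
f = np.array([1.0, -2.0, 3.0])
t = 0.8

def truncated_semigroup(L, f, t, K):
    # Accumulate sum_{k=0}^{K} (t^k / k!) L^k f.
    out = np.zeros_like(f)
    term = f.copy()
    for k in range(K + 1):
        out = out + (t ** k / factorial(k)) * term
        term = L @ term
    return out

reference = truncated_semigroup(L, f, t, 60)  # converged to machine precision
errors = [np.max(np.abs(truncated_semigroup(L, f, t, K) - reference))
          for K in (1, 3, 6, 12)]
print(errors)  # strictly decreasing in K
```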
Before proceeding further, we make the following notational remark. Notational comment on set and sequence differences. For sets {b 1 , . . . , b m } ⊂ {a 1 , . . . , a n }, we let {a 1 , . . . , a n } \ {b 1 , . . . , b m } denote the set difference as usual. Frequently we deal with tuples, or sequences in which the order does not matter. For two such tuples (a 1 , . . . , a n ) and (b 1 , . . . , b m ) (where of course there may be repetitions in each sequence), we denote by (a 1 , . . . , a n ) \ (b 1 , . . . , b m ) the difference wherein, for each b i appearing in {a 1 , . . . , a n }, we remove only one of its appearances, say the first one, from (a 1 , . . . , a n ). We also define the concatenation of (a 1 , . . . , a n ) and (b 1 , . . . , b m ) to be (a 1 , . . . , a n , b 1 , . . . , b m ).
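This sequence-difference convention (remove only the first appearance of each element) and the concatenation are easy to mirror in code; the following sketch uses hypothetical tuples.

```python
def tuple_difference(a, b):
    # Sequence difference: for each element of b found in a, remove only its
    # first appearance from a, keeping the order of the remaining entries.
    out = list(a)
    for x in b:
        if x in out:
            out.remove(x)  # list.remove deletes the first occurrence only
    return tuple(out)

a = (1, 2, 2, 3, 2)
b = (2, 3, 7)
print(tuple_difference(a, b))  # (1, 2, 2): one 2 and the 3 removed; 7 is absent
print(a + b)                   # concatenation: (1, 2, 2, 3, 2, 2, 3, 7)
```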

Switching the expectation and the infinite series
The goal of this sub-section is to prove the following absolute convergence result.
As a consequence of Proposition 2.3 and Fubini-Tonelli, we may use the following expansion.

Corollary 2.4
Suppose P A , P Ã , μ satisfy Hypothesis 1. Writing L and L̃ for the corresponding generators, we have that where for every x ∈ R N , (W f )(x) should be understood as (W k · · · W 2 W 1 f )(x). For every word W ∈ {L J , L , L h , L } k , let k J = k J (W ) denote the number of L J 's that appear in W , and similarly define k , k h , and k , so that k J + k + k h + k = k and the following structural decomposition of W f holds.
Claim For any word W ∈ {L J , L , L h , L } k with k J , k , k h , k occurrences of the corresponding symbols, W f can be expressed as a sum of (not necessarily distinct) monomials of the form (2.3), where β, β , ζ denote the collections of pairs (β ℓ ) ℓ≤k J , (β ℓ ) ℓ≤k , (ζ ℓ ) ℓ≤2k , while ζ , ξ denote the sequences (ζ ℓ ) ℓ≤k h , (ξ ℓ ) ℓ≤r ; hereupon we adopt the convention x 0 ≡ 1, allowing for ξ ℓ = 0 as well as ζ ℓ ∈ (0 j) j .
In view of Hypothesis 1 on P A we have that for every N , ℓ ≥ 0, and index pair α, Thus, if I α β distinct index pairs appear at multiplicities (n ℓ + 1) ℓ≤I α β in the sequence α β of length k J + s, then by the independence of (J α ) α , Consequently, with X 0 independent of J we have, in view of the assumed bounds on ( i j ) i, j , (σ i j ) i, j and (h i ) i , that for any term of the form (2.3) with I ζ entries such that ζ ℓ ∉ (0 j) j , using in the last inequality also (1.6) from Hypothesis 1 on μ, and the definition of C . Our next result is a first step in controlling the number of monomial terms that can appear in the expansion of each word W ∈ {L J , L , L h , L } k .

Lemma 2.5 For every k J , k , k h , k and every β, β , ζ , ζ , ξ , if we let φ = φ β,β ,ζ ,ζ ,ξ be as in (2.3), then L h φ, L J φ, L φ and L φ can each be expressed as a sum of at most r , r N , r N and r N 2 σ many such monomials, respectively, each of the same form as (2.3) (with possibly different β, β , ζ , ζ , ξ ), with the respective k J , k , k h or k increased by one.
Proof Fixing k J , k , k h , k which sum up to k, we proceed by separately considering the effect each of L h , L J , L and L has on the monomial φ. First, with non-zero contribution only from j ∈ ξ , yielding at most r non-zero terms. To each of these corresponds a monomial of the form of (2.3), with k h → k h + 1, ζ → ζ ( j) and ξ → (ξ \ ( j)) (0). Next, with non-zero contribution only when j ∈ ξ . With i ≤ N , the total number of resulting non-zero monomials is now at most r N , each having the stated form with k J → k J + 1, β → β (i j) and ξ → (ξ \ ( j)) (i). Likewise, we have that with non-zero contributions only for j ∈ ξ . Enumerating over i ≤ N now gives at most r N non-zero monomials, of the stated form, with k → k + 1, β → β (i j) and ξ → (ξ \ ( j)) (i). Finally, is non-zero only for the summands in which j ∈ ξ . Enumerating over 0 ≤ i, i ′ ≤ N (recalling the convention that x 0 ≡ 1), gives at most r N 2 σ non-zero monomials, of the stated form, with k → k + 1, ζ → ζ (i j) (i ′ j ′ ) and ξ → (ξ \ ( j, j ′ )) (i, i ′ ).
Fixing N , k, an s-tuple of pairs α, an r -tuple of indices γ and W ∈ {L J , L , L h , L } k , upon inductively applying Lemma 2.5 we are able to express W f as the sum of at most (2.9) many non-zero monomials of the form of (2.3). Recall that for a monomial φ, we use I ζ for the number of ζ ℓ ∉ (0 j) j , I α for the number of distinct pairs in α, I α β for the number of distinct pairs in α β, and introduce I = I α β − I α , which counts the number of distinct pairs in {β} \ {α}. A careful examination of the proof of Lemma 2.5 yields the following significant refinement upon the crude bound of (2.9).

Proposition 2.6 Fix N , r , s, k ≥ 0, an s-tuple of pairs α, an r -tuple of indices γ , and a word W ∈ {L J , L , L h , L } k .

Proof The first improvement in (2.10) over (2.9) comes from observing that the growth factor N σ applies only in those I ζ of the 2k applications of L within W which have led to an element ζ ℓ ∉ (0 j) j (see (2.8)), and that there are at most $\binom{2k}{I_\zeta}$ ways to choose which I ζ elements of ζ are not from the 0-th row of σ . Similarly, the growth factor N in counting the number of monomials after applying L J is only relevant during the I applications of L J within W in which a new pair (i j) is selected (see (2.6)). The left-most term in (2.10) counts the number of ways to select the locations of these I new elements within the k J -long sequence β, and thereafter to partition the remaining k J − I consistently with having the prescribed n ℓ ≥ 0 repeats for each of the I α β distinct pairs in question. Putting all this together yields the stated bound (2.10) on the number of relevant monomials in the expansion of W f .

Proof of Proposition 2.3.
Combining Proposition 2.6 with the bound (2.4), we deduce that for any word W of length k and any α whose I α distinct terms appear at multiplicities (c ℓ ) ℓ≤I α , where the inner sum is over all partitions of k J − I into I α β indistinguishable integers n ℓ ≥ 0. Since Σ ℓ c ℓ = s and n ℓ + c ℓ ≤ k J + s for all ℓ, the right-most product is at most (k J + s) s . Further, the number of (n ℓ ) considered here is at most the number of integer partitions of k J , which grows slower than e k J (cf. the Hardy-Ramanujan asymptotic partition formula [24]). Thus, we find that for some finite C(r , s, C μ , C ) and any word W of length k, Since k! ≥ k J ! (k − k J )!, the bounds (2.12) and (2.2) yield the stated absolute convergence of the infinite series. Specifically, fixing T < ∞ and setting which is finite for any fixed N > δ −2 , thereby concluding the proof.
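The counting input used above, namely that the number of integer partitions of k J grows far slower than e k J , is easy to verify numerically; the recursion below is a standard one (not from the paper).

```python
import math
from functools import lru_cache

# The number p(k) of integer partitions of k grows much slower than e^k
# (Hardy-Ramanujan: p(k) ~ exp(pi*sqrt(2k/3)) / (4k*sqrt(3))).
@lru_cache(maxsize=None)
def partitions(n, max_part=None):
    # Number of partitions of n into parts of size at most max_part.
    if max_part is None:
        max_part = n
    if n == 0:
        return 1
    if n < 0 or max_part == 0:
        return 0
    # Either use one part of size max_part, or use none of that size.
    return partitions(n - max_part, max_part) + partitions(n, max_part - 1)

for k in (10, 20, 40):
    hr = math.exp(math.pi * math.sqrt(2 * k / 3)) / (4 * k * math.sqrt(3))
    print(k, partitions(k), round(hr), partitions(k) < math.exp(k))
```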

Controlling the differences of the k'th order Taylor coefficients
By Corollary 2.4, we have that where the last sum is over φ appearing in the monomial decomposition of W f (x) per Claim 2.2. To bound the differences of expectations on the rhs of (2.14), we next control which monomials φ of the form (2.3) may have a non-vanishing difference in expectations, which for independent, zero-mean (J i j ) i j of matching variances (1/N ) m = (1/N ) m̃ requires that simultaneously:

No pair α ℓ appears exactly once in the concatenation α β. (2.16)
Some α ℓ appears more than twice in the concatenation α β. (2.17)

The condition (2.16) implies that each of the I distinct elements in {β} \ {α} must appear at least twice in {β}, to which end we need at least 2I applications of L J to select those elements. In addition, some other I α,1 of the k J applications of L J must align exactly with the pairs (α i j ) appearing only once in α, so necessarily k J ≥ 2I + I α,1 . Further, the condition (2.17) requires k J + s ≥ 3 and, when no pair appears more than twice in α, an extra application of L J beyond the preceding 2I + I α,1 is needed for producing the third appearance of some α ℓ , as stated in (2.15).
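For concreteness, the two conditions can be tested mechanically on a given concatenation; the sketch below (with hypothetical pairs, and a function name of our choosing) returns whether both hold simultaneously.

```python
from collections import Counter

def can_differ(alpha, beta):
    """Conditions (2.16)-(2.17): for independent zero-mean entries with
    matching variances, a monomial whose J-pairs form the concatenation of
    alpha and beta can contribute a non-zero difference in expectations only
    if no pair appears exactly once AND some pair appears more than twice."""
    counts = Counter(alpha + beta)
    no_singletons = all(c != 1 for c in counts.values())
    some_third_appearance = any(c > 2 for c in counts.values())
    return no_singletons and some_third_appearance

print(can_differ([(1, 2)], [(1, 2), (1, 2)]))          # True: (1,2) appears thrice
print(can_differ([(1, 2)], [(1, 2)]))                  # False: only matched second moments enter
print(can_differ([(1, 2), (3, 4)], [(1, 2), (1, 2)]))  # False: (3,4) appears once
```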
We are now able to prove that the expectations of monomials of the form f α,γ (X t ) are universal.
Proof of Proposition 2.1. Fixing α, γ , in view of Lemma 2.7 it suffices, when bounding the rhs of (2.14), to consider only words W and monomials φ for which (2.15) holds. Restricting attention to such monomials, we find as in (2.11) that for any α whose I α distinct terms appear at multiplicities (c ℓ ) ℓ≤I α , and every word W of length k such that k J + s ≥ 3, where as in (2.11) the inner sum runs over all partitions of k J − I into I α β indistinguishable integers n ℓ ≥ 0. Reasoning as we did leading up to (2.12), we find that Plugging (2.18) into (2.14), as in the derivation of (2.13), we get for δ = 1/(16T reC ) and N ≥ ρ : This completes the proof, as both series on the rhs of (2.19) are finite and independent of N .

Remark 2.8
In the case of symmetric random matrices A, Ã (where only the upper triangular and diagonal elements are independent), we identify the index pairs β ℓ = i j and β ℓ ′ = j i as being the same. We do so whenever considering I α , I α,1 , I + α,1 , I α β , I , and the multiplicities (n ℓ ), as well as in the restrictions (2.16)-(2.17) imposed on the multiplicities within α β. Once this is done, the only difference in our proof is to replace the weight r k in (2.10) by (2r ) k .

The extension to multi-time polynomial observables
In this section, we extend the results of Sect. 2 to more general observables, namely those that contain coefficients depending on the driving martingale, and those that depend on the trajectory through multiple times, rather than just one. We then use those extensions to prove Theorem 1. To this end, fix any l, any (α (1) , . . . , α (l) ) each consisting of s i pairs, any (γ (1) , . . . , γ (l) ) each consisting of r i indices, and also fix m indices ξ = (ξ 1 , . . . , ξ m ). Fix l times 0 ≤ t 1 ≤ · · · ≤ t l ≤ T and m times 0 ≤ u 1 ≤ · · · ≤ u m ≤ T . For f α (i) ,γ (i) as in (2.1), consider observables of the form (3.1). Let r̄ = Σ i r i + m and let ᾱ denote the concatenation α (1) · · · α (l) , of length s̄ := Σ i s i .

Proof of Proposition 3.1
We start with the case of m = 0 to which we will reduce the case of m > 0.
We express the expectation E B with respect to the Brownian motion of g(t) in terms of the (diffusion) semi-group operator as Expanding each semi-group operator in terms of powers of the generator L, the above is precisely Taking the difference in expectations between E and Ẽ, upon justifying the swap of the expectation with the infinite sum (as done in Sect. 2.2), and using the fact that for every k 1 , k 2 , . . . , k l such that k 1 + · · · + k l = k, we obtain that The following structural property of the words appearing above will allow us to reduce the analysis of multi-time observables to the combinatorial analysis of one-time observables fᾱ ,γ̄ = f (1) f (2) · · · f (l) , for ᾱ = α (1) · · · α (l) and γ̄ := γ (1) · · · γ (l) , which we have already completed.
Claim 3.1 Let W 1 , . . . , W l be words with k J , k , k h , k occurrences of each letter appearing in the concatenation W 1 · · · W l , respectively. Then the function (W 1 f (1) ) (W 2 f (2) ) · · · (W l f (l) ) consists of a sum of (not necessarily distinct) monomials of the form (2.3). Moreover, each monomial φ(x) appearing in this expansion must also appear in such a monomial expansion of W fᾱ ,γ̄ for W = W 1 · · · W l ∈ {L J , L , L h , L } k .

Proof
The structure of the monomials is evident. Every such monomial in (W 1 f (1) )(W 2 f (2) ) · · · (W l f (l) ) must also appear in the monomial expansion of [W 1 · · · W l ] fᾱ ,γ̄ , because a subset of the terms in the latter is obtained by applying the letters of W l to f (l) , then the letters of W l−1 to f (l−1) (W l f (l) ), and so on. Finally, observe that W 1 · · · W l is always a word in {L J , L , L h , L } k .
With Claim 3.1 in hand, we further get that (3.4) where the sums are over the monomials φ in the decomposition of W 1 f (1) · · · W l f (l) and that of W fᾱ ,γ̄ per Claim 3.1. Note that each summand on the rhs of (3.4) is at most some (k + 1) l l k times the corresponding summand of (2.14) for the choice f = fᾱ ,γ̄ , for which we have deduced the bound of (2.18). Utilizing the latter and the elementary bound k + 1 ≤ (k J + 1)(k + 1 − k J ), by proceeding as in the derivation of (2.19), we find that for C = C(r̄ , s̄, C μ (r̄ ), C ) finite and δ = 1/(16 l T r̄ e C ) positive, for some finite C̄ = C̄(l, r̄ , s̄, T , C , C μ (r̄ )). We now add in the driving martingale observables (i.e., m > 0) and conclude the proof of Proposition 3.1.
Proof of Proposition 3.1. We reduce the situation m > 0 to the combinatorial calculations of Lemma 3.2 by utilizing the following expansion from Itô's lemma: When expanding (3.1) in this manner, the terms containing only products of X ξ i (u i ) can be absorbed into γ , in which case their difference in expectations has already been handled in Lemma 3.2, so by linearity it suffices to focus on handling terms of the form Thus, fixing l, m, (α (i) ), (γ (i) ), ξ and letting h(t, u) = h (α (i) ),(γ (i) ),ξ (t, u), we obtain after swapping the expectation and the integrals that which thereby yields the following bound on the relevant difference in expectations Proceeding hereafter, wlog, to bound the difference in expectations for h(t, τ ), we suppose for ease of exposition that 0 ≤ t l = τ 0 ≤ τ 1 ≤ · · · ≤ τ m (the situation where the two groups intertwine is analyzed similarly, with the obvious modifications). As in the proof of Lemma 3.2, first expressing E B in terms of the semi-group operator and then expanding that in powers of the generator L, we find that At this point we proceed as in the derivation of (3.4), up to the transformations k → k + m =: k̄ , l → l + m =: l̄ , and ( f (l+1) , . . . , f (l̄) ) → (x ξ 1 , . . . , x ξ m ), with the sum running over the monomial decomposition of (W 1 f (1) · · · W l f (l) W ′ 1 x ξ 1 · · · W ′ m x ξ m )(x). Then, utilizing again Claim 3.1, as well as the bound k! ≥ k̄!/(k̄) m , we arrive at where as before ᾱ = α (1) · · · α (l) is of length s̄ = Σ i s i , while γ̄ , of length r̄ = Σ i r i + m, now has the additional elements (x ξ i ) i≤m . Up to this update of r̄ and the immaterial weight factor (k/(lT )) m of its summands, the expression on the rhs of (3.5) is the same as that in (3.4). We thus conclude, as in the proof of Lemma 3.2, that for some C(l, m, r̄ , s̄, T , C , C μ (r̄ )), all t ∈ [0, T ] l and u ∈ [0, T ] m ,

Proof of Theorem 1.
Fix T , m, p, C a , a ∈ R N m such that ‖a‖ ∞ ≤ C a , and t ∈ [0, T ] p . For every ℓ ≤ m, fix observables Y (ℓ,1) , . . . , Y (ℓ, p) ∈ F and let F(t) be as in (1.5) with those choices. By linearity of expectations and the uniform bound on ‖a‖ ∞ , it suffices to show (3.6) uniformly over i 1 , . . . , i m . We denote by s̄ the number of Y terms appearing in the preceding product which are a coordinate of G t . In case s̄ = 0, the bound (3.6) follows from Proposition 3.1 at s̄ = 0, in which case I + α,1 = 1. Otherwise, we expand every term in that product which is a coordinate of G t to obtain a sum of monomials of the form of (3.1). Each of these monomials has a sequence ᾱ of length s̄, and as a result of such expansion there are at most s̄ s̄ N Iᾱ monomials with precisely Iᾱ distinct pairs in the sequence ᾱ. Note that each pair which appears once in ᾱ is counted both in s̄ and in Iᾱ,1 , all other pairs are counted at least twice in s̄, and for any ᾱ of maximal multiplicity two, we have added one to I + ᾱ,1 . Consequently, the bound of Proposition 3.1 on the difference in expectations for each of these s̄ s̄ N Iᾱ many monomials is at most C N −Iᾱ−1/2 for some constant C (T , m, p, C , C μ ). From this, the bound (3.6) immediately follows upon enumerating over the at most s̄ many choices for Iᾱ.
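The conclusion of Theorem 1 can also be probed numerically. The following Euler-Maruyama sketch is not the exact system (1.2): it assumes a simple linear drift J x − x and standard Brownian noise, with parameter choices of our own. It compares an averaged quadratic observable under Gaussian versus Rademacher couplings with matching first and second moments.

```python
import numpy as np

# Euler-Maruyama for the toy interacting diffusion
#   dX_i = (sum_j J_ij X_j - X_i) dt + dB_i,   X_i(0) ~ N(0,1) i.i.d.,
# comparing the averaged observable (1/N) sum_i X_i(t)^2 at t = 1 under two
# entry distributions for J with matching mean 0 and variance 1/N.
def averaged_second_moment(make_entries, N=300, T=1.0, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    J = make_entries(rng, (N, N)) / np.sqrt(N)
    X = rng.standard_normal(N)
    for _ in range(int(T / dt)):
        X = X + dt * (J @ X - X) + np.sqrt(dt) * rng.standard_normal(N)
    return np.mean(X ** 2)

gauss = averaged_second_moment(lambda r, s: r.standard_normal(s))
rad = averaged_second_moment(lambda r, s: r.choice([-1.0, 1.0], size=s))
print(gauss, rad)  # close for large N, as universality predicts
```

The two averaged trajectories agree up to fluctuations that vanish as N grows, in line with the theorem.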

Concentration for quadratic observables: Proof of Theorem 2
Assuming henceforth that M t is a scaled Brownian motion (i.e., that σ i j are identically zero for i ≠ 0), our goal is to prove Theorem 2 about the concentration, uniformly over t ∈ [0, T ] 2 , of the quadratic observable of (1.10),

Localizing the process
Denote by ‖ · ‖ 2→2 the 2-to-2 matrix norm. We begin by bounding the probability that (X 0 , J, M) ∉ L N ,R .
Lemma 4.1 There exist C = C(T , C μ , C A , C σ ) > 0 and R 0 (T , C μ , C A , C σ ) < ∞ such that for every R ≥ R 0 , if μ, P A satisfy Hypotheses 1-2, then Proof We bound L c N ,R by the union of the events in which one of the three norms exceeds √ R N /3. First, since M t is a Brownian motion (scaled by (σ 0 j ) j ), by Doob's maximal inequality for the sub-martingale exp(δ‖M t ‖ 2 ), we have for some C(C σ ) > 0, any R ≥ T R 0 (C σ ) and all N , Next, since μ satisfies Hypotheses 1-2, the independent X i (0) have uniform (in i and N ) second moments and exponential tails. Hence, applying [30, Theorem 3] for the centered sum of i.i.d. variables that stochastically dominate X i (0) 2 , we have for some C(C μ ) > 0, any R ≥ R 0 (C μ ) and all N , It thus remains only to show that when P A satisfies Hypothesis 2, we have for some C(C A ) > 0, any R ≥ R 0 (C A ) and all N , To this end, recall [28, Theorem 2] that there exists a universal constant C such that for any matrix A with independent, zero-mean entries of second moments m i j and fourth moments b i j , For P A satisfying Hypothesis 1, b i j and m i j are bounded uniformly in i, j and N (see (1.8)). Hence, in the case where A is composed of independent entries, for some C(C A ) finite and all N , Likewise, representing a symmetric A as A = A + + A − , with A + the upper-triangle (including the diagonal) part of A and A − its lower-triangle part, [28, Theorem 2] holds for the matrices A − and A + of zero-mean, independent entries (with uniformly bounded fourth moments). Thus, (4.4) holds also in this case, up to a factor of 2. Thanks to (4.4), if √ R ≥ 4C then Recall that ‖A‖ 2→2 , which is the largest singular value of A, is 1-Lipschitz in its entries (endowed with the Euclidean norm; on A + when A is assumed symmetric).
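The operator-norm bound underlying (4.4), namely that ‖A‖ 2→2 = O(√N ) for independent, zero-mean entries with bounded fourth moments, is easy to observe empirically; the sketch below uses Rademacher entries.

```python
import numpy as np

# Empirical check that ||A||_{2->2} = O(sqrt(N)) for independent, zero-mean,
# bounded entries (Rademacher here), so ||A / sqrt(N)||_{2->2} stays O(1),
# consistent with the localization event L_{N,R}.
rng = np.random.default_rng(0)
ratios = []
for N in (100, 400, 1000):
    A = rng.choice([-1.0, 1.0], size=(N, N))
    top = np.linalg.norm(A, 2)       # ord=2: largest singular value
    ratios.append(top / np.sqrt(N))
print([round(r, 3) for r in ratios])  # each ratio stays bounded (near 2)
```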
In addition, for every a such that ‖a‖ ∞ ≤ C a (uniformly over N ) and every Y, Y ′ ∈ F, if F(t) is as in (1.10), we have for all k ≥ 1, Proof Setting e N (t) = N −1/2 ‖X t ‖, we get upon expanding (1.2) that From the definition of the 2-to-2 norm, evidently Hence, by Cauchy-Schwarz, ∫ 0 t e N (s) ds , where in the last inequality we rely on our assumption that the 1→1 and ∞→∞ norms are at most C , to deduce that the 2→2 norm is at most C . Combining these bounds on (I i ) i≤5 and dividing out by e N (t), we see that By Gronwall's inequality, using the localization to L N ,R , it then follows that for any t ∈ [0, T ], yielding the lhs of (4.5) as soon as R ≥ R 0 (T , C ) ≥ 1. From the lhs of (4.8) we know that ‖G t ‖ ≤ √ R ‖X t ‖ throughout L N ,R , hence after suitably increasing C 0 and R 0 , the rhs of (4.5) holds as well.
To deduce the uniformly bounded moment estimate of (4.6) for X t , recall first from the lhs of (4.5) that Combining the latter bound with that of Lemma 4.1, we arrive at (4.9) The rhs decreases in N and, for the relevant choice of f (·), yields the lhs of (4.6). The rhs of (4.6) follows by applying the same reasoning to the k-th moment of Z N ,G = N −1/2 sup t∈[0,T ] ‖G t ‖ while utilizing the rhs of (4.5).
Turning to (4.7), note that for any k ≥ 1 and F(t) of (1.10) with ‖a‖ ∞ ≤ C a , by Cauchy-Schwarz, Thus, yet another application of Cauchy-Schwarz results in If Y is 1, this latter expectation is simply 1. If Y is M, then using the tail bound of (4.2) in combination with (4.9) (now for f (R) = (R/3) k ), the latter expectation is uniformly bounded in N . Lastly, if Y is from {X, G}, the expectation above is uniformly bounded in N by (4.6). Combining these yields the desired (4.7).

A Lipschitz estimate on quadratic observables
Our next proposition shows that on L N ,R , all F(t) of the form (1.10) are Lipschitz in the triplet (X 0 , J, M). The key to Proposition 4.3 is to show that X t is O(1)-Lipschitz on L N ,R endowed with ‖ · ‖ mix . Specifically, denoting by X t (X 0 , J, M) the solution to (1.2) constructed from the triplet (X 0 , J, M), and by X ′ t (X ′ 0 , J ′ , M ′ ) the solution constructed from the triplet (X ′ 0 , J ′ , M ′ ), our next lemma establishes a Lipschitz bound on ‖X t − X ′ t ‖, uniform over L N ,R .
where G ′ (·) is defined as G(·) but constructed using J ′ instead of J. By Cauchy-Schwarz, Recalling (4.8), we similarly find that Turning to the terms involving G(·) or G ′ (·), observe first that (4.11) Using the localization to L N ,R , we thus find that where in the last inequality we further assumed R ≥ R 0 (T , C ), utilizing the lhs of (4.5). Further increasing R 0 so that T e C 0 √ R 0 T ≥ 1, upon combining the bounds on (I i ) i≤5 and dividing out by e N (t), we see that Recall that ‖J‖ 2 2→2 ≤ Σ i j J 2 i j , so by Gronwall's inequality there exists C(T , C ) such that for any R ≥ R 0 , every N and all t ∈ [0, T ], as claimed.
Proof of Proposition 4.3. Fix Y 1 , Y 2 ∈ F, a such that ‖a‖ ∞ ≤ C a and t = (t 1 , t 2 ) ∈ [0, T ] 2 . Equipped with Lemma 4.4 and (4.11), it remains to establish a Lipschitz control on differences of F(t; (X 0 , J, M)) in terms of differences of G t , X t and M t corresponding to any pair of triplets (X 0 , J, M) and (X ′ 0 , J ′ , M ′ ) in L N ,R . To this end, we start with the following bound on differences of F(t; ·): Since the two terms on the rhs can be bounded symmetrically, wlog we focus on the first one, which by Cauchy-Schwarz is at most where as before, X ′ t is constructed out of the triplet (X ′ 0 , J ′ , M ′ ). Now recall from (X 0 , J, M) ∈ L N ,R and Proposition 4.2 that the right-most term in (4.12) is at most exp(C 0 √ RT ) for all R ≥ R 0 , in which case the preceding bounds apply. Putting these all together, we deduce that there exist some other R 0 (T , C ) and C(T , C a , C ) such that for all R ≥ R 0 (T , C ), We conclude this subsection by combining the respective exponential concentrations of Lipschitz functions due to μ, P A and P B .

Lemma 4.5 Suppose that μ, P A satisfy Hypothesis 2. Then P = μ ⊗ P A ⊗ P B satisfies exponential concentration of Lipschitz functions with respect to (E N , ‖ · ‖ mix ).
Proof Fix any function f that is 1-Lipschitz on (E N , ‖ · ‖ mix ). Let us expand where the subscripts of the expectations indicate which random variables the expectation is taken over; call the above three differences I M , I J and I X 0 , say. For every fixed X 0 , J, the map M → f (X 0 , J, M) is 1-Lipschitz in M ∈ C([0, T ], R N ) endowed with the norm sup t≤T ‖ · ‖. As such, from the exponential concentration of Lipschitz functions satisfied by P B with respect to C([0, T ], R N ) endowed with sup t≤T ‖ · ‖ (see e.g., the discussion around [3, Hypothesis 1.1]), there exists C = C(C σ ) > 0 such that for every r > 0, Similarly, for every fixed X 0 , the map J → E B [ f (X 0 , J, M)] is 1-Lipschitz in J endowed with the rescaled Frobenius norm (Σ i, j ( √ N J i j ) 2 ) 1/2 , and finally, X 0 → E J,B [ f (X 0 , J, M)] is 1-Lipschitz in X 0 endowed with its ℓ 2 norm. Altogether, expanding we see that the exponential concentrations for 1-Lipschitz functions of μ, P A and P B lift to exponential concentration of P for functions that are 1-Lipschitz in the triplet (X 0 , J, M) on (E N , ‖ · ‖ mix ).

Proof of Theorem 2
We first prove a concentration estimate for F at a fixed pair of times t ∈ [0, T ] 2 , before extending this to the full trajectory (F(t)) t∈[0,T ] 2 by bounding the modulus of continuity of F.

Proof In proving [3, Lemma 2.5] it is shown, using a Lipschitz extension, that if P satisfies exponential concentration for Lipschitz functions as in (1.11) and V is Lipschitz on L N ,R , then a tail bound of the form (4.14) follows. For R = R 0 we can embed the constant factor 2D(R 0 ) into C and further adjust C 3 to bound the pre-exponent 2(C 2 + K (R 0 )) within the factor exp(− √ R 0 N /(2C 3 )) multiplying it, resulting in q N (λ; R 0 ) as in the top line on the rhs of (4.14). For a better tail decay, consider R λ = (η log λ) 2 ≥ R 0 , with η = 1/(2C 1 ), so that D(R λ ) = C 1 e C 1 η log λ ≤ C 1 λ/ log λ for all λ ≥ 4. In addition, once √ N /(2C 3 ) ≥ 4C 0 T , we can again embed the pre-exponent 2(C 2 + K (R λ ))/λ within the factor exp(− √ R λ N /(2C 3 )) multiplying it. Thus, upon adjusting the various constants, we end up with q N (λ; R λ ) as in the bottom line on the rhs of (4.14).
Setting hereafter R for the larger of the R 0 and R λ values from the preceding proof of Proposition 4.6, recall that the event L c N ,R was already ruled out as part of the derivation of (4.14). Thus, proceeding to prove Theorem 2, we fix ε = N −k , k > 1, and apply Proposition 4.6 at the M N = (T N k ) 2 grid points t i, j = (iε, jε) within [0, T ] 2 , to deduce, by the union bound, the estimate (4.15). It is easy to check that 2M N q N (λ) is further bounded by p N (3λ) of (1.12) once we suitably enlarge the constant C on the rhs of (1.12) relative to that of (4.14). In addition, since the right-most term in (4.15) exceeds one whenever E[|V |1 L c N ,R ] = E[|F(t)|1 L c N ,R ] ≥ λ/2, if that inequality holds for any t ∈ [0, T ] 2 , then q N (λ), and in turn p N (3λ) of (1.12), would exceed one. Thus, we may assume wlog that Restricting to λ > 1/ √ N (as otherwise p N (3λ) ≥ 1), and using p N (3λ) ≥ M N exp(−(λ 2 ∧ λ)N k /C ) (as k > 1) with the above, the stated bound of Theorem 2 follows from the following short-time estimates.

Proof Similarly to the computation leading to (4.13), we find that for any t + s ∈ [0, T ] 2 and F as in Theorem 2, evaluated on the solution X t (X 0 , J, M) that corresponds to some (X 0 , J, M) ∈ L N ,R , When Y = 1 this difference is zero, whereas in case Y = X and s i ≤ ε, assuming wlog that R 0 , C ≥ 1, we have on L N ,R , by (4.5) and the rhs of (4.8), that Further, similarly to the lhs of (4.11), on L N ,R , so up to an extra factor √ R the bound (4.19) applies for Y = G, and considering all cases we get for s ∈ [0, ε] 2 , (4.20) For some C > 0, when R = R 0 and λ ≥ C ε, the right-most term in (4.20) cannot exceed λ/2. The same applies for R = R λ = (η log λ) 2 provided η ≤ 1/(3C 0 T ). By the same reasoning, for such η and some C 4 (T , C a , R 0 ) > 0, the factor multiplying ‖M t i +s i − M t i ‖ in (4.20) is in both cases at most ( √ λ ∨ 1)/(2C 4 √ N ).
Recall from (4.2) and the stationarity of Brownian increments that there exists C(C σ ) such that for every L ≥ ε 2 L 0 (C σ ) and every N , Our assumption that λ ≥ ε 1/(2k) for some k > 1 guarantees that the right-most term is at most λ/2 (as soon as N ≥ N 0 ), thereby establishing (4.18).