Learning interaction kernels in stochastic systems of interacting particles from multiple trajectories

We consider stochastic systems of interacting particles or agents, with dynamics determined by an interaction kernel which only depends on pairwise distances. We study the problem of inferring this interaction kernel from observations of the positions of the particles, in either continuous or discrete time, along multiple independent trajectories. We introduce a nonparametric inference approach to this inverse problem, based on a regularized maximum likelihood estimator constrained to suitable hypothesis spaces adaptive to data. We show that a coercivity condition enables us to control the condition number of this problem and prove the consistency of our estimator, and that in fact it converges at a near-optimal learning rate, equal to the min-max rate of $1$-dimensional non-parametric regression. In particular, this rate is independent of the dimension of the state space, which is typically very high. We also analyze the discretization errors in the case of discrete-time observations, showing that it is of order $1/2$ in terms of the time gaps between observations. This term, when large, dominates the sampling error and the approximation error, preventing convergence of the estimator. Finally, we exhibit an efficient parallel algorithm to construct the estimator from data, and we demonstrate the effectiveness of our algorithm with numerical tests on prototype systems including stochastic opinion dynamics and a Lennard-Jones model.


Introduction
We consider a system of particles or agents interacting in a random environment, with their motion described by a first-order Stochastic Differential Equation in the form φ( x j,t − x i,t )(x j,t − x i,t )dt + σdB i,t , for i = 1, . . . , N , (1.1) where x i,t ∈ R d represents the position of particle i at time t, φ : R + → R is an interaction kernel dependent on the pairwise distance between particles, and B t is a standard Brownian motion in R N d , with σ > 0 representing the scale of the random noise. This is a gradient system, with the energy potential V φ : where X t = (x i,t ) i=1,...,N ∈ R dN is the state of the system. Letting we can write Eq.(1.1) in vector format as The particles interact with each other based on their pairwise distance, with dissipation of the total energy, with the system tending to a stable point of the energy potential, while the random noise injects energy to the system. Such systems of interacting particles arise in a wide variety of disciplines, including interacting physical particles [50,23] or granular media [3,2,4,9,13,14] in Physics, opinion aggregation on interacting networks in Social Science [25,47,44], and Monte Carlo sampling [39,37], to name just a few.
Motivated by these applications, the inference of such systems from data gains increasing attention. For deterministic multi-particles systems, various types of learning techniques have been developed (see e.g. [10,41,40,42,1,15] and the reference therein). When it comes to stochastic multi-particle systems, only a few efforts have been made, e.g. learning reduced Langevin equations on manifolds in [20] (without however assuming nor exploiting the structure of pairwise interactions), learning the parametric potential function in [11,16] from single trajectory data, and estimating the diffusion parameter in [20,27].
Our goal is to estimate the interaction kernel φ given discrete-time observation data from trajectories {X t0 } M m=1 are independent samples drawn from a distribution µ 0 on R dN , and t 0 : t L indicates times 0 = t 0 < t 1 < · · · < t l < · · · < t L = T , with with t l = l∆t.
Since in general, little information about the analytical form of the kernel is available, we infer it in a nonparametric fashion (e.g. [21,7,24]). We note that the problem we consider is to learn a latent function in the drift term given observations from multiple trajectories, which is different from the ample literature on the inference of stochastic differential equations (see e.g. [35,30]), focusing either on parameter estimation or inference for ergodic system. In particular, our learning approach is close in spirit to the nonparametric regression of the drift studied in [45] for ergodic system and in [18] from i.i.d paths. However, for systems of interacting particles one faces the curse of dimensionality when learning the high-dimensional drift directly as a general function on the high-dimensional state space R dN . Instead, we will exploit the structure of the system and learn the latent interaction kernel in the drift, which only depends on pairwise distances, and show that the curse of dimensionality may be avoided, when such inverse problem is well-conditioned.
We introduce a maximum likelihood estimator (MLE), along with an efficient algorithm that can be implemented in parallel over trajectories, with an hypothesis space adaptive to data to reach optimal accuracy. Under a coercivity condition, we prove that the MLE is consistent, and converges at the min-max rate for one-dimensional nonparametric regression. We also analyze the discretization errors due to discrete-time observations: we show it leads to an error in the estimator that is of order ∆t 1/2 (with ∆t = T /L = t l+1 − t l ), and as a result, it prevents us from obtaining the min-max learning rate in sample size. We demonstrate the effectiveness of our algorithm by numerical tests on prototype systems including opinion dynamics and a stochastic Lennard-Jones model (see Section 5). Numerical results verify our learning theory in the sense that that the min-max rate of convergence is achieved, and the bias due to the numerical error is close to the order ∆t 1/2 .

Overview of the main results
We consider an approximate maximum likelihood estimator (MLE), which is the maximizer of the approximate likelihood of the observed trajectories, over a suitable hypothesis space H:  We develop a systematic learning theory on the performance of this MLE. We propose first a coercivity condition that ensures the robust identifiability of the kernel φ, in the sense that the derivative of the pairwise potential defined in (1.2), Φ (r) = φ(r)r, can be uniquely identified in the function space L 2 (R + , ρ T ), where ρ T is the measure of all pairwise distances between particles. Then, we consider the convergence of the estimator, from both continuous-time and discrete-time observations, under the norm |||ϕ||| := ϕ(·) · L 2 (ρ T ) = R + |ϕ(r)r| 2 ρ T (dr) [0,T ] } M m=1 . We show that the MLE is consistent, that is, converges in probability to the true kernel under the norm |||·|||. Furthermore, we show that the MLE converges at a rate which is independent of the dimension of the state space of the system, and corresponds to the minmax rate for one-dimensional non-parametric regression ( [21,17,7,24]), when choosing the hypothesis space adaptively according to data, the Hölder continuity s of the true kernel, and with dimension increasing with the amount of observed data. With dim(H n ) ( M log M ) 1 2s+1 , and assuming that the coercivity condition holds on H n with a constant c Hn > 0, we have, with high probability and in expectation, The case of discrete-time observations (Section 4). In this case derivatives and statistics of the trajectories inbetween observations need to be approximated, while keeping the estimator efficiently computable: this leads to further approximations of the likelihood, and consequently of the MLE. This discretization error of the approximations we use is of order 1/2 in the observation time gap ∆t = T /L, using an approximation of the likelihood based on the Euler-Maruyama integration scheme. We show that for some C > 0, for any > 0, with high probability φ L,T,M,H − φ ≤ φ T,∞,H − φ + C n M + ∆t 1 2 , where φ T,∞,H is the projection of the true kernel to H. The discretization error will flatten the learning curve when the sample size is large, overshadowing the sampling error and the approximation error cause by working within the hypothesis space. for some positive constants c 2 and c 3 , where φ T,∞,H is the projection of the true kernel to H. The numerical error may overshadow the sampling error and the approximation error of the hypothesis space. Variable Definition x i,t ∈ R d position or opinion of particle i at time t, see (1.1) Xt = (x 1,t , . . . , x N,t ) ∈ R dN state vector: position of the N particles, see (1.4) · Euclidean norm in R d or operator norm of a matrix In both cases, we decompose the error in the MLE into sampling error from the trajectory data, and approximation error from the hypothesis space, as illustrated in the diagram in Figure 1. In the case of continuous-time observations, the sampling error is the error between φ T,M,H and the MLE from infinitely many trajectories (denoted by φ T,∞,H ): this will be controlled with concentration equalities. The approximation error φ T,∞,H − φ is adaptively controlled by a proper choice of hypothesis space. The analysis is carried out in the infinite-dimensional space L 2 (ρ T ). In the case of discrete-time observations, we provide a finite-dimensional analysis to study directly the MLE in our proposed algorithm, that is, analyzing the error of a L,T,M,H in (1.5) with proper conditions on the basis functions. The sampling error φ L,T,M,H − φ L,T,∞,H is analyzed through a L,T,M,H − a L,T,∞,H , and the discretization error between φ L,T,∞,H and φ T,∞,H is analyzed through a L,T,M,H − a T,∞,H . The discretization error comes from the discrete-time approximation of the likelihood, and it vanishes when the observation time gap ∆t reduces to zero, recovering the convergence of the MLE as in the case of the continuous-time observations.

Notation and outline
Throughout this paper, we use bold letters to denote vectors or vector-valued functions. We use the notation in Table 1 for variables in the system of interacting particles.
We restrict our attention to interaction kernels φ in the admissible set Let Ω be an arbitrary compact (or precompact) set of a Euclidean space (which may be R + , R d or R dN ), with the Lebesgue measure unless otherwise specified. We consider the following function spaces L 2 (ρ T )-based norm: |||φ||| = φ(·) · L 2 (ρ T ) , see (1.6) Table 2: Notation used for the estimator of the interaction kernel φ.
-C k,α (Ω) with k ∈ N, 0 < α ≤ 1: the space of functions whose k-th derivative is Hölder continuous of order α. In the special case of k = 0 and α = 1, g ∈ C 0,α (Ω) is called Lipchitz continuous on Ω; the Lipschitz constant of g ∈ Lip(Ω) is defined as Lip[g] := sup x =y We summarize the notation for the inference of the interaction kernel in Table 2. The function space in which we perform the estimation is the space of functions ϕ such that ϕ(·)· ∈ L 2 (R + , ρ T ), where ρ T is the measure of pairwise distances between all particles on the time interval [0, T ] (see (2.9)). We will focus on learning on the compact (finite-or infinite-dimensional) subset of L ∞ ([0, R]) (where [0, R] is the support of the functions in the admissible set K R,S ) in the theoretical analysis, however in the numerical implementation we will use finite-dimensional linear subspaces L 2 ([0, R], ρ T ) spanned by piecewise polynomials functions. While these linear subspaces are not compact, it is shown that the minimizers over the whole linear space are bounded and thus the compactness requirements are not essential (e.g., see Theorem 11.3 in [24]). We shall therefore assume the compactness of the hypothesis space in the theoretical analysis.
The remainder of the paper is organized as follows. We first provide an overview of our learning theory. In Section 2, we present a practical learning algorithm with theory-guided optimal settings on the choice of hypothesis spaces and with a practical assessment of the learning results. We then demonstrate the effectiveness of the algorithm on prototype systems including a stochastic model for opinion dynamics, and a stochastic Lennard-Jones model in Section 5. We establish a systematic learning theory analyzing the performance of the MLE, considering continuous-time observations in Section 3 and discrete-time observations in Section 4. We present in the appendix detailed proofs.

Nonparametric inference of the interaction kernel
We present in this section the nonparametric technique we study for the inference of the interaction kernel, and corresponding algorithms. We discuss assessment of the performance of the estimator and its performance in trajectory prediction. The proposed estimator is based on maximum likelihood estimation on data-adaptive hypothesis spaces so as to achieve optimal rate of convergence, guided by our learning theory in Section 3 -4.

The maximum likelihood estimator
As a variational approach, we set the error functional to be the negative log-likelihood of the data {X The error functional. Recall that by the Girsanov theorem, for a continuous trajectory X [0,T ] , its negative loglikelihood ratio between the measure induced by system (1.1), with an admissible kernel φ, and the Wiener measure is (2.1) As we do not know the interaction kernel φ that generated the trajectory X [0,T ] , we can let ϕ be any possible admissible interaction kernel, and upon replacing φ by ϕ in (2.1), observe that E X [0,T ] (ϕ) is the log-likelihood of seeing the trajectory X [0,T ] if the system (1.1) were driven by the interaction kernel ϕ. In this case E X [0,T ] (ϕ) may be interpreted as a error functional, which we wish to minimize over ϕ, in order to obtain an estimator for φ. Given only discrete-time observations X t0:t L , where (t l = l∆t, l = 0, . . . , L) with ∆t = T /L (the case of non-equispaced-in-time observation is a straightforward generalization), the error functional E X [0,T ] (ϕ) may be approximated as (2. 2) The corresponding approximate likelihood is equivalent to the likelihood based on the Euler-Maruyama (EM) scheme (whose transition probability density is Gaussian): Note that while higher-order approximations of the stochastic integral (or, equivalently, approximations based on higher order numerical schemes) may be more accurate than the EM scheme, they lead to nonlinear optimization problems in the computation of the MLE defined below, and we shall therefore avoid them. The EM-based approximation preserves the quadratic form of the error functional, and leads to an optimization problem that can may be solved by least squares. As we show in Theorem 4.2, this discrete-time approximation leads to an error term of order ∆t 1/2 in the MLE, which will be small in the regime on which we focus in this work.
Since the observed discrete-time trajectories {X (m) t0:t L } M m=1 are independent, since X 0 is drawn i.i.d. from µ 0 , the joint likelihood of the trajectories is the product of the likelihood of each trajectory. Therefore, the corresponding empirical error functional is defined to be A regularized Maximum Likelihood Estimator. The regularized MLE we consider is a minimizer of the above empirical error functional over a suitable hypothesis space H:

5)
This regularized MLE is well-defined when the minimizer exists and is unique over H. We shall discuss the uniqueness of the minimizer in Section 3.1, where we show it is guaranteed by a coercivity condition. When the hypothesis space H is a finite dimensional linear space, say, H = span{ψ i } n i=1 with basis functions ψ i : R + → R, the regularized MLE is the solution of a least squares problem. To see this, letting ϕ = n i=1 a(i)ψ i and a := (a(1), . . . , a(n)) ∈ R n , we have f ϕ (X) = n i=1 a(i)f ψi (X), due to the linear dependence of f ϕ on ϕ. Then, we can write the error functional in Eq.(2.2) for each trajectory as where the matrix A (m) ∈ R n×n and the vector b (m) ∈ R n are given by The normal equations (2.7) are solved by least squares, so the solution always exists. We will show in Section 4 that assuming a coercivity condition, the matrix A M,L ∈ R n×n is invertible with high probability when M and L are large, so the least squares estimator is the unique solution to the normal equations, and the regularized MLE is the unique minimizer of the empirical error functional over H.

Dynamics-adapted measures and function spaces
We will assess the estimation error in a suitable function space: L 2 (R + , ρ T ). Here ρ T is the distribution of pairwise distances between all particles: where δ is the Dirac δ distribution, so that E[δ r ii (t) (dr)] is the distribution of the random variable r ii (t) = ||x i,t − x i ,t ||, with x i,t being the position of particle i at time t.
The probability measure ρ T depends on both the distribution of initial conditions µ 0 and the measure determining the random noise on the path space, while it is independent of the observed data. The measure ρ T encodes the information about the dynamics marginalized to pairwise distances; regions with large ρ T -measure correspond to pairwise distances between particles that are often encountered during the dynamics.
With observations of M trajectories at L discrete-times each, we introduce a corresponding measure where r (m) i ,t || is from the m-th observed trajectory. We think of this as an approximation to ρ T , in two significantly different aspects. In L, because as L → +∞ our observations tends to be continuous in time, and in M , as ρ L,M T can be thought of, after letting M → +∞, as an empirical approximation to ρ T performed from data on the M independent trajectories.
Accuracy of the estimator. We measure the accuracy of our estimator φ L,T,M,H by the quantity The function φ(·)·, instead of φ, which at r ∈ R + takes value φ(r)r, appears naturally in our learning theory in Section 3, fundamentally because it is the derivative of the pairwise distance potential Φ in (1.2). For simplicity of notation, for a function ϕ in the hypothesis space, we let (2.11) Then the mean square error of the estimator is

Hypothesis spaces and nonparametric estimators
As standard in nonparametric estimation, we let the hypothesis space H grow in dimension with the number of observations, avoiding under-or over-parametrization, and leading to consistent estimators, that in fact reach an optimal minimax rate of convergence (see e.g. [21,24,22]). Similar to [41,40], we set the basis functions {ψ i } n i=1 to be piecewise polynomials on a partition of the support of the density function of the empirical probability measure ρ L,M T . Guided by the optimal rate convergence results in Section 3, we will set the dimension of the hypothesis space H to be where the number s is the Hölder index of continuity of the basis functions, and it is be chosen according the regularity of the true kernel. When T is large and when the system is ergodic, we set where N ess := M T τ , with τ denoting the auto-correlation time of the system, is the effective sample size of the data. Here the auto-correlation time τ is the equivalent of the mixing time for a reversible ergodic Markov chain [36].
We estimate the auto-correlation time by the sum of the temporal auto-correlation function of a pairwise distance r i,j (we refer to [51] for detailed discussion on the estimation of auto-correlation time, which is a whole subject by itself).
We will prove bounds, that hold with high probability, on the Mean Squared error (MSE) of the MLE φ L,T,M,Hn M in (2.8) for fixed and large M , for a fixed time T and for suitable hypothesis spaces H n M with dimension n M as in (2.13). When continuous-time trajectories are observed, the MSE is of the order ( log M M ) 2s 2s+1 with high probability, according to Theorem 3.2, and so is its expectation. In particular, this avoids the curse of dimensionality of the state space (dN ). When the observations are discrete-time trajectories with observation gap ∆t, the error is of the order ( log M M ) 2s 2s+1 + ∆t 1/2 with a high probability according to Theorem 4.2.

Algorithmic and computational considerations
The algorithm is summarized in Algorithm 1. Note that the normal matrices {A (m) } and vectors {b (m) } are defined trajectory-wise and therefore may be computed in parallel. When the size of the system is large (i.e. dN is large), this allows one to accelerate the computation of the estimator, by assembling these normal matrices and vectors for each trajectory in parallel, and updating the normal matrix A M,L and vector b M,L . The total computational cost of constructing our estimator, given ) when n is chosen optimally according to Theorem 3.2 and φ is at least in C 1,1 (corresponding to the index of regularity s ≥ 2 in the theorem).

Accurracy of trajectory prediction
One application of estimating the interaction kernel from data is to preform predictions of the dynamics. Given an estimator, the following Proposition bounds its accuracy in predicting the trajectories of the system driven by the true interaction kernel: Proposition 2.1 Let φ ∈ K R,S be an estimator of the true kernel φ , where K R,S is the admissible set defined in (1.7). Denote by X t and X t the solutions of the systems with kernels φ and φ respectively, starting from the same initial condition and with the same random noise. Then we have where the measure ρ T is defined by (2.9).
We postpone the proof to Section A.3.
is the negative log-likelihood of a trajectory X [0,T ] , with respect to the measure induced by the system with interaction kernel ϕ. Then, the negative log-likelihood of independent trajectories {X (m) The existence of the minimizer follows from the fact that the error functional E T,M (ϕ) is quadratic in ϕ, which in turn is a consequence of the linearity of f ϕ in ϕ. The uniqueness of the minimizer, however, requires a coercivity condition and is related to the learnability of the kernel, which we discuss in the next section.

Identifiability and learnability: a coercivity condition
The uniqueness of the minimizer of the error functional E T,M (ϕ) over the hypothesis space ensures that the kernel is identifiable. This is not granted, even when the number of observed trajectories is infinite: denote where E here, and in all that follows unless otherwise indicated, is the expectation over initial conditions, independently sampled from µ 0 , and over the Wiener measure underlying the random noise, and observe that can one ensure the uniqueness of minimizer. This motivates us to propose the following coercivity condition, introduced in [10] in the case of non-stochastic systems: Definition 3.1 (Coercivity condition) We say that the stochastic system defined in (1.1) satisfies a coercivity condition on a set H of functions on R + , with a constant 0 < c H , if for all ϕ ∈ H such that ϕ(·)· ∈ L 2 (ρ T ). Here |||·||| denotes the norm defined in (2.11). We will denote by c H the largest constant for which the inequality holds, and call it the coercivity constant.
The coercivity condition ensures identifiability of the kernel. We emphasize that the kernel is latent, in the sense that its values at {r ii = x i − x i } are undeterminable from data. In fact, to recover (φ(r ii )) ∈ R N (N −1) 2 from the observed trajectories, even if we ignore the stochastic noise in the system and assume to have access to f φ (x) ∈ R dN , which consists of a linear combination of (φ(r ii )) with coefficients r ii = x i − x i , we face a linear system that is underdetermined as soon as dN (=number of known quantities) ≤ N (N −1) 2 (=number of unknowns), i.e. for d < (N − 1)/2. Thus, in general the exact values of φ at locations {r ii } i,i can not be determined. Furthermore, we have stochastic noise in the system. This suggests that the inverse problem of estimating the interaction kernel in a space of continuous functions is ill-posed. We will see that the coercivity condition ensures well-posedness in L 2 (ρ T ), both in the sense of uniqueness and in the sense of stability.
The coercivity condition plays a key role in the learning of the kernel. Beyond ensuring learnability of kernels by ensuring the uniqueness of minimizer over any compact convex sets, it also enables us to control the error of the estimator by the discrepancy between the expectation of error functionals, as is shown in Proposition 3.1. We will use this property to establish the convergence of the estimators in later sections.
To simplify notation, we define a bilinear functional product over H by Proposition 3.1 Let H be a compact convex subset of L 2 (R + , ρ T ) and assume the coercivity condition (3.5) holds true on H with constant c H . Then, the error functional E T,∞ defined in (3.3) has a unique minimizer over H in Moreover, for all ϕ ∈ H, where the inequality follows from the coercivity condition. Then, Eq.(3.8) follows once we notice that H is a minimizer, and so, equivalently, Well-conditioning from coercivity. When the hypothesis space H is a finite-dimensional linear space, the coercivity constant provides a lower bound for the smallest eigenvalue of the limit of the normal equations matrix A M,L in Eq.(2.7) as M, L → +∞. Therefore, when the sample size M is large and when the observation frequency L is high, the matrix A M,L is invertible with a high probability (see Corollary 3 for details), and thus the coercivity condition ensures the uniqueness of the regularized MLE in Eq.(2.8): Proposition 3.2 Suppose that the coercivity condition holds on H = span{ψ 1 , · · · , ψ n }, where the basis functions satisfy ψ p (·)·, ψ p (·)· L 2 (ρ T ) = δ p,p . Let A ∞ = ψ p , ψ p p,p ∈ R n×n with the bilinear functional ·, · defined in (3.6). Then the smallest singular value of Proof For an arbitrary a ∈ R n , denoting ψ = n p=1 a p ψ p , we have where the first equality follows from that the functional ·, · is bilinear, and the inequality follows from the coercivity condition. Note that by the definition of the coercivity constant in (3.5), we have which is attained at some ψ * ∈ H since H is finite dimensional. Hence, the inequality in (3.9) becomes inequality for ψ * and the smallest eigenvalue of of A ∞ is c H .
Proposition 3.2 suggests that for the hypothesis space H, it is important to choose a basis that is orthonormal in L 2 (ρ T ), so as to make the matrix in the normal equations as well-conditioned as possible given the dynamics. In practice, the unknown ρ T is approximated by the empirical density ρ L,M T . Therefore, when using local basis functions, it is natural to use a partition of the support of ρ M T .
The coercivity condition and positive integral operators. The coercivity condition introduces constraints on the hypothesis spaces and on the distribution of the solutions of the system, and it is therefore natural that it depends on the distribution µ 0 of the initial condition X 0 , the true interaction kernel φ, and the random noise. We review below briefly the recent developments in [38], where the coercivity condition is proved to hold on any compact sets of L 2 (ρ T ) for special classes of systems, such as linear systems and nonlinear systems with a stationary distribution. As discussed in [10,41,40] for the deterministic cases, we believe that the coercivity condition is "generally" satisfied for "relevant" hypothesis spaces, with a constant independent of the number of particles N , thanks to the exchangeability of both the distribution of the initial conditions and that of the particles at any time t.
The coercivity condition is equivalent to the positiveness of integral operators that arise in the expectation in Eq.(3.5). More precisely, by the exchangeability of the distribution of X t , one can rewrite Eq.(3.5) as where the integral kernel K T : R + × R + → R is defined as with p t (u, v) denoting the joint density function of the random vector (r t 12 , r t 13 ) and S d−1 denoting the unit sphere in R d . The integral kernel K is symmetric semi-positive definite and leads to a semi-positive self-adjoint integral operator L K . Then, the coercivity condition holds on H if and only if H is contained in the eigen-space of L K with eigenvalues no less than c H . In particular, if L K is strictly positive, then the coercivity condition holds true for any compact H ⊂ L 2 (ρ T ). Using Müntz-type theorems, it is shown in [38] that L K is strictly positive definite, and therefore the coercivity condition holds, for a large class of systems with interaction kernels in form

Consistency and rate of convergence of the estimator
In this section, we consider using a family of finite dimensional linear spaces {L n : n ∈ N + } ⊂ C 1,1 [0, R] as hypothesis spaces and establish the consistency and rate of convergence of our estimators. We assume the spaces This condition has a long history and rich literature in classical approximation theory, where it is studied when function spaces satisfy (3.11) (e.g. see the survey paper [53]), which is an important step in establishing inverse approximation theorems. This kind of inequality holds true on many function spaces that are commonly used as approximation spaces in practice, including: This result dates back to Bernstein [6].
-L n : the polynomial space consisting of all polynomials with degree less than n − 1 on [0, R] (see Theorem 3.3 in [49]), for which ϕ ∞ ≤ 2 R (dim(L n ) + 1) 2 ϕ ∞ . As a result, (3.11) also holds true for polynomial splines; other extensions include rational functions. We refer to the reader to [31] for details.
If we choose a compact convex hypothesis set H M contained in some L n , with a suitable correspondence between n and M , such that the distance between H M and the true kernel φ vanishes as M increases, the following consistency result holds: Finally, suppose the coercivity condition holds true on ∪ n L n . Then we have If we know the explicit approximation rate of the family {L n : n ∈ N + }, then by carefully choosing the dimension of hypothesis spaces as a function of M , we can obtain a near-optimal rate of convergence of our estimators.
Theorem 3.2 (Convergence rate of estimators) Suppose φ ∈ K R,S , the admissible set defined in (1.7). Assume that there exits a sequence of linear spaces {L n : where C is a constant depending only on σ, N, T, R, S 0 .
It is fruitful to compare (up to log terms) the rate 2s/(2s + 1) to that for nonparametric 1-dimensional regression, where one can observe directly noisy values of the target function φ at sample points drawn i.i.d from ρ T . For the function space C k,α , this rate is min-max optimal. Our numerical examples in Section 5 empirically validate the desired convergence rate for s = 1, 2 where we use piecewise constant and linear polynomials. Note that in our setting, learning φ is an inverse problem, as we do not directly observe the values . We also do not require the underlying stochastic process satisfies certain mixing properties and starts from a stationary distribution. Obtaining this optimal convergence rate in M for short time trajectory observations is therefore satisfactory. For long trajectories and under ergodicity assumptions, rates in terms of M T are likely to be obtainable: in Section 5 we present numerical evidence that suggests that the error does decrease with M T at a near optimal rate.

Proof of the main theorems
In the following part, we present the proof for Theorem 3.2, which also yields the proof for Theorem 3.1. The main techniques includes the Itô formula, concentration inequalities of unbounded random variables, and a generalization of a novel covering argument in [52] that enables us to deal efficiently with the fluctuations in the data due to the stochastic noise in the dynamics of the system. One major obstacle in the non-asymptotic analysis of our regularized MLE estimators is the unboundness of stochastic integral of the form 1 T T 0 f ϕ (X t ), dX t dt appearing in the empirical error functional. Unlike the deterministic case σ = 0, our empirical error functional E T,M (·) is in general not continuous over H with respect to the · ∞ norm. In the following, we first leverage the general Itô formula described in Theorem A.3, to obtain a form of the empirical error functional that does not involve a stochastic integral and is amenable to analysis; we then show that it is continuous on C 1,1 ([0, R]) with respect to the · 1,1 norm. Therefore, in the following preliminary results for the proofs of the main theorems, we consider the following generic hypothesis space: with respect to the uniform norm · ∞ and bounded above by S 0 ≥ S.
then, we have, almost surely Using Itô's formula (Theorem A.3) for the Itô process g(X t ), the conclusion follows.
where we used which follows from its definition in (3.12). Combining the estimates for I 1 and I 2 , and using When M = ∞, i.e. we observe infinitely many trajectories, the expectation of our error functional E T,∞ , as in (3.3), does not involve the stochastic integral term. From the proof of Proposition 3.3, we see that it is continuous over H with respect to the · ∞ norm: We now analyze the discrepancy between the empirical minimizer φ T,M,H and φ T,∞,H , which we called Sampling Error (SE) in the diagram in Figure 1. We introduce a measurable function on the path space by for any ϕ ∈ H. From Proposition 3.1 we have so D ϕ in fact bounds (in expectation) the distance between ϕ and φ T,∞,H w.r.t. the |||·|||-norm. We now perform a non-asymptotic analysis of D ϕ . We shall show that the random variable D ϕ satisfies moment conditions, sufficient to guarantee strong concentration about its expectation (Proposition 3.4). To do this, we decompose D ϕ as the sum of a bounded component only involving time integrals and an unbounded component involving stochastic integrals: We prove moment conditions independently for each of these components in the next two Lemmata.
Proof First of all, note that . The conclusion then follows by adding in the scaling factor 1 σ 2 N T .
Proof From the inequality (3.16) and the linear dependence of f ϕ on ϕ, we have Therefore, Now we combine Lemma 3.2 and 3.3 to prove the moment condition for D ϕ .
where the constants are , and the last inequality is derived from the Stirling's lemma.
We now tie the discrepancy functionals for finite and infinite M : Then with probability at least 1 − δ 2 , we have c H + 4C 0 S 0 , and C 0 , C 1 as in (3.17), with c H the coercivity constant defined in (3.5).
Proof For each ϕ j ∈ H, recall that in Eq.(3.15), the coercivity condition on H implies that Then, Eq.(3.17) in Lemma 3.4 yields that Therefore the random variable D ϕj satisfies the moment condition in Corollary 5, and so ∀ > 0 . Taking a union bound on all these events, over j ∈ {1, 2, · · · , N }, we obtain that Setting the right-hand side to be δ 2 , we get M,δ,N = C M log( 2N δ ), where C := 2 Then there exists ϕ j M in the net such that ϕ j M − φ T,M,Hn ∞ ≤ η; by Corollary 1 On the other hand, since H n ⊂ L n ⊂ C 1,1 ([0, R]), thanks to the almost sure control in Proposition 3.3 and the uniformly bound sup ϕ∈Ln ϕ ∞ ϕ ∞ ≤ c 1 (dim(L n )) γ from the assumption (3.11), we have, almost surely, where Notice that D φ T ,M,Hn ,M ≤ 0, so the above inequality implies that The covering number of H n satisfies N (H n , η) ≤ 4S0 η c0n (e.g. Proposition 5 in [21]). By the triangle inequality, we split the error we want to control into Sampling Error (SE) and Approximation Error (AE) (see Figure 1) (3.24) From (3.23) and the coercivity condition (3.15), we obtain that, with probability at least 1 − δ 2 , Let φ Hn := argmin ψ∈Hn ψ − φ ∞ . By coercivity condition (3.15), we have where we used , and note that c L = c ∪nLn = c ∪nHn ≤ c Hn = c Ln ≤ 1 for all n. We obtain that, with probability at least 1 − δ 2 , the following estimate holds true: The bound in expectation is obtained by standard techniques, writing We obtain where C 6 is an absolute constant only depending on σ, N, T, S 0 , R.
Proof (of Theorem 3.1) In this proof, a b means there there exist a constant c such that a ≤ cb. For any > 0, we claim Strong consistency will then follow from the Borel-Cantelli Lemma. Notice that converges. The claim is proved.

Learning theory: discrete-time observations
In this section, we analyze the estimation error of the (regularized) MLE φ L,T,M,H , defined in (2.5) for finite dimensional linear space H and for discrete-time observations. We show that it is of order n M + ∆t 1/2 with high probability, where n is the dimension of H and ∆t is the observation gap. As a consequence, the MLE is consistent when M → ∞ and ∆t → 0; and the MLE converges at an optimal rate as when n is optimally chosen as in (2.13).
We shall first prove the main theorems on the error of the MLE in Section 4.1, postponing technical details, including concentration inequalities and discretization error bounds, to later subsections.
Recall that we denote X [0,T ] the solution to system (1.1) with the true interaction kernel φ, and denote {X  (a) {ψ p (·)(·)} n p=1 are orthonormal in L 2 (ρ T ); Item (a) aims to make the normal equations matrix nonsingular, as discussed in Proposition 3.2. In item (b), the uniform bound for the derivatives aims to control the discretization error due to discrete-time observations; the uniform boundless of the functions will be used for concentration inequalities. Item (c) states that the number of such orthonormal basis functions are bounded by the measure ρ T and the uniform bounds of the functions and their derivatives. Item (c) often follows from (a) − (b), with an intuition from examples including polynomials and trigonometric polynomials, and smoothed piecewise polynomials, if r 2 ρ T (dr) is equivalent to the Lebesgue measure on an interval [R 0 , R] ⊂ R + . Such an interval is where pairwise distances explores with a noticeable probability (see for example, in Figure 2 and Figure 6). It exists in general when the initial distribution spreads out the pairwise distances or when the system is ergodic, since the density of ρ T is smooth and nonnegative on R + .   , the discretization error will dominate the error of the estimator, preventing us from observing the min-max learning rate. This phenomenon is well-illustrated by the left plots in Figure 5 and 9 in our numerical experiments.

Remark 2
We assumed C 1 b regularity for the basis functions {ψ p } for the above numerical error analysis, stronger than that of piecewise polynomials (which may be discontinuous) used in the numerical tests. Such a difference between the regularity requirements stems from the numerical representation, and we can view the piecewise polynomials as numerical approximations of regular functions. This view is supported by the numerical experiments: the estimator has only small jumps at the discontinuities in the high probability region. where φ L,T,∞,H is the infinite-data estimator. We shall study the discretization error and the sampling error by analyzing the differences between their coefficient vectors.
All these coefficients are solutions to the corresponding normal equations (e.g. Eq.(2.7)). To facilitate the study of these normal matrices and vectors, we introduce the following notions. For any f, g ∈ C 1 b (R N d , R N d ), we define the following functionals of the observation paths: (4.7) Here the matrix A ∞ is invertible due the coercivity condition: its smallest eigenvalue is the coercivity constant c H (see Proposition 3.2). The matrix A ∞,L is invertible when L = T /(∆t) is large, with its smallest eigenvalue bounded below by c H − c 3 ∆t 1/2 , see Corollary 2. Furthermore, Corollary 3 shows that, with probability at 1 − δ, the matrix A M,L is invertible with its smallest eigenvalue bounded blow by c H − n M + c 3 ∆t where the second inequality holds with probability at least 1 − δ.
For the discretization error, since {ψ i (·)(·)} are orthonormal in L 2 (ρ T ), we have, by Eq.(4.7): 8) and the inequality for the discretization error follows. Similarly, for the sampling error, we have hold with probability at least 1 − δ; (iii) since c 3 ∆t Then, the inequality for the sampling error follows.

Concentration and discretization error of empirical functionals
We introduce concentration inequalities for the above empirical functionals on the path space of diffusion processes and a bound on the discretization error of the estimator due to discrete-time approximations. Our first lemma studies concentration of the discrete-time empirical functionals ξ M,L and η M,L . t0:t L } M m=1 be discrete-time observations, with t l = l∆t and L = T /∆t, of the system (1.1) with φ. Then for any f, g ∈ C b (R dN , R dN ), the error functionals defined in (4.5) satisfy the concentration inequalities: for any > 0, where Proof Note that η(f, g, X [t0:t L ] ) ≤ f ∞ g ∞ . Then, the exponential inequality for η M L follows from the Hoeffding inequality, which states that, for i.i.d. random variables {Z i } bounded above by K, one has To study ξ M L , we decompose ξ(f, X [t0:t L ] ) into two parts, a bounded part and a martingale part: where we denote f L (s) := We call Z T a bounded part because We call Y T a martingale part since T N Y T = T 0 f L (s), σdB s is a martingale. Correspondingly, we can write Then, denoting K 1 := 1 N f ∞ f φ ∞ and K 2 = 2σ f ∞ , and noticing that C 1 = K 1 + K 2 / √ T N , we can conclude the first concentration inequality in (4.9) from where the first exponential bound follows directly from Hoeffding inequality applied to {Z for any λ > 0. As a consequence, we have Lastly, Eq.(4.10) follows directly from Eq.(4.9).
We remark that here we focus on the case M → ∞ with finite time T . If the system is ergodic, one may extends the concentration inequalities to the case when T → ∞. The next lemma shows that the discretization error of the empirical functionals, as discrete-time approximation of the integrals, is at order ∆t   (R dN , R dN ). Let X t0:t L be a discretetime trajectory, with t l = l∆t and L = T /∆t, of the system (1.1) with φ. Then, the error functionals defined in (4.4) satisfy where the constants are Proof Note that since X [0,T ] is a solution to the system (1.1), we have for each l, where in the first inequality we have applied the mean value theorem to bound f (X t l ) − f (X s ): and in the second inequality, we used the fact that Thus, we obtain the bound in (4.11) by a summation over l: and the bound for η follows from the fact that

Error bounds for the normal matrix and vector
Proof Applying Lemma 4.2, in combination of the basic fact that b ≤ √ n max k=1,...,n |b(k)| for any b ∈ R n , and A ≤ √ n max k,k =1,...,n |A(k, k )| for any A ∈ R n×n , we obtain with constants C 1 and C 2 in form of To complete the proof, we are left to estimate f ψp ∞ and ∇f ψp ∞ . From the definition of f · , we have and f φ ∞ ≤ Rb 0 √ N as well. Note that for each i, i ∈ {1, · · · , N }, with notation r ji = x j −x i and r ji = |r ji |, we have, Thus, the norm this d × d matrix is uniformly bounded, and as a result, the norm of the dN × dN matrix is uniformly bounded, Combining the above estimates with f φ ∞ ≤ Rb 0 N (the same as f ψp ∞ ), we obtain that C 1 and C 2 are both bounded by C.
It follows directly that the matrix A ∞,L is invertible: The smallest eigenvalue of the matrix A ∞,L defined in (4.6) is bounded below by c H −c 3 ∆t 1/2 when c 3 ∆t 1/2 < c H , with c 3 defined in (4.2).
Proof Recall that from Proposition 3.2, we have a T A ∞ a ≥ c H |a| 2 for an arbitrary a ∈ R n . Then, by Proposition 4.1 with the bound of √ n in Assumption 4.1.
We prove next that the matrix A M,L is invertible, and concentrates around A ∞,L .
where the constant C is obtained from (4.12). In combination of the basic fact that b ≤ √ n max k=1,...,n |b(k)| for any b ∈ R n , and A ≤ √ n max k,k =1,...,n |A(k, k )| for any A ∈ R n×n , we obtain The third exponential inequality follows directly by combining the first two. Proof Note that for any a ∈ R n such that a = 1, we have, by Corollary 2, a T A ∞,L a ≥ c H − c 3 ∆t , for any > 0.
Remark 5 In practice, the minimum eigenvalue of A ∞ may be small due to the redundancy of the local basis functions or due to the coercivity constant on H being small. Thus, the smallest eigenvalue of A M,L may be zero. On the other hand, these matrices are always symmetric and nonnegative, so it is advisable to regularize the matrix by pseudo-inverse.

Examples and numerical simulation results
In this section, we performed numerical experiment to validate that our estimator defined in (2.5), and implemented by Algorithm 1, behaves in practice as predicted by the theory. We consider two examples: a stochastic opinion dynamical system and a stochastic Lennard-Jones system, using observations from simulated data.
The setup for the numerical simulations is as follows. We simulate sample paths on the time interval [0, T ] with the standard Euler-Maruyama scheme (see (2.3)), with a sufficiently small time step length dt. When observations are made at every time-step, i.e. ∆t = t l+1 −t l = dt for each l, we view X train,M := {X From the observations we construct the empirical probability measure ρ L,M T (defined in (2.10)), and let [R min , R max ] be its support. We choose the hypothesis spaces H consisting of piecewise constant or piecewise linear polynomials on interval-based partitions of [R min , R max ]. This choice is dictated by the ease of obtaining an orthonormal basis for H, ease and efficiency of computation, and ability to capture local features of the interaction kernel. To avoid discontinuities at the extremes of the intervals in the partition, and to reduce stiffness of the equations of the system with the estimated interaction kernels, we interpolate the estimator linearly on a fine grid and extrapolate it with a constant to the left of R min and the right of R max . This post-processing procedure ensures the Lipschitz continuity of the estimators. We use the post-processed estimators to predict and generate the dynamics with the estimated interaction kernels.
We mainly focus on the case where T is small and report on the results as follows: -Interaction kernel estimation. We compare φ and φ T,M,H , the true and estimated interaction kernels (after smoothing), by plotting them side-by-side, superimposed with an estimate of ρ T , obtained as in (2.10) by using M ρ T (M ρ T M ) independent trajectories. The estimated kernel is plotted in terms of its mean and standard deviation, computed over 10 independent learning trials. To demonstrate the dependence of the estimator on the sample size and the scale of the random noise, we report the above for different values of M and σ.
-Trajectory prediction. In the spirit of Proposition 2.1, we compare the discrepancy between the true trajectories (evolved using φ) and predicted trajectories (evolved using φ T,M,H ) on both the training time interval [0, T ] and on a future time interval [T, T f ], over two different sets of initial conditions -one taken from the training data, and one consisting of new samples from µ 0 . When simulating the trajectories for the systems driven by φ T,M,H using the EM scheme, we use the same initial conditions and the same realization of the random noise as in the trajectory of the system driven by φ. The mean trajectory error is estimated using M test trajectories (the same number as in the training data). -Rate of convergence. We report the convergence rate of φ T,M,H to φ in the |||·||| norm on L 2 (ρ T ) as the sample size M increases, with the dimension of H growing with M according to Theorem 3.2, for different scales σ of the random noise. We also investigate numerically the convergence rate when both T and M increase, with the dimension of the hypothesis space H set according to the effective sample size as discussed in Section 2.2. -Discretization errors from discrete-time observations. To study the discretization error due to discrete-time observations, we report the convergence rate (in M ) of estimators φ L,T,M,H obtained from data with different observation gaps ∆t = T /L. We also verify numerically that the |||·||| error of the estimators increases with ∆t as predicted by Theorem 4.2. These experiments are carried out for different values of the square root of the diffusion constant σ.
We will report the conclusions of our experiments in Section 5.3

Example 1: Stochastic opinion dynamics
We first consider a 1D system of stochastic opinion dynamics with interaction kernel It is straightforward to see that φ is in C 1,1 c ([0, 2]) and non-negative. Systems of this form are motivated in various applications, from Biology to in social science, where φ models how the opinions of people influence each other (see [34,8,44,12,19] and references therein), with one or a multiplicity of consensuses may be eventually reached.
In the system we consider, each agent tries to align its opinions more with its farther neighbours than with its closer neighbours: such interactions are called heterophilious. For deterministic systems of this type, [44] shows that the opinions of agents merge into clusters, with the number of clusters significantly smaller than the number of agents. This is natural, as increased alignment with farther neighbors increases mixing and consensus. In our stochastic setting, the random noise prevents the opinions from converging to single opinions. Instead, soft clusters form at large time, that are metastable states for the dynamics, i.e. states where agents dwell for long times, rarely switching between them.   We study the performance of our estimators of the interaction kernel, from trajectory data. Table 3 summarizes parameters of the setup. In this example, we choose H n M to be the function space consisting of piecewise constant functions on n M uniform partitions of the interval [0, 10]. Figure 2 shows that, as the number of trajectories increases, we obtain increasingly accurate approximations to the true interaction kernel, including at locations with sharp transitions of φ. The lack of artifacts at these locations is an advantage provided by the use of local basis. The estimators oscillate near 0, with amplitudes scaling with the level of noise. We believe that the reason for this phenomenon is that due to the structure of the equations, we have terms of the form φ(0)0 = 0 at, and near, 0, with subsequent loss of information about the interaction kernel about 0.
We then use the learned interaction kernels φ in Figure 2 to predict the dynamics, and summarize the results in Figure 3 and Table 4. Even with M = 32, our estimator produces very accurate approximations of the true trajectories both in the training time interval [0, 5] and the future time interval [5,50], including number and location of clusters, and the time of their formation. As M increases to 4096, we have more accurate predictions on the locations of clusters. We impute this improvement to the better reconstruction of estimators at locations near 0. divides the "training" interval [0, T ] from the "prediction" interval [5,50]. As M increases, our estimators achieve better approximation of the true kernel overall, and at regions near 0 (see Figure 2). As a result, they produced more faithful prediction of the number and location of clusters for large time. Statistics of trajectory prediction errors are reported in Table 4.
Next we investigate the convergence rate of estimators. It is well-known in approximation theory (see Theorem 6.1 in [49]) that inf ϕ∈Hn ϕ − φ ∞ ≤ Lip[φ]n −1 . With the dimension n being proportional to ( M log M ) 1 3 , Figure  4 shows that the learning rate in terms of M is around M −0.34 , which matches the optimal min-max rate M − 1 We also investigate the effects of the scale of the random noise, which is represented by the standard deviation σ. Figure 2 shows that the estimators for the system with σ = 0.5 have much large oscillations than those with σ = 0.1. The left plot in Figure 4 shows that the scale of the random noise does not affect the learning rate, matching our theory. We also see that the absolute L 2 (ρ T ) error of estimators increase as the system noise increases, this may indicate that the coercivity constant decreases as the level of noise in the system increases. The left plot in Figure 5 shows that the scale of the errors increase linearly in σ (in particular, when the observation gap is 1).
Finally, we study the discretization error due to approximation of the integral in the likelihood using discretetime observations. In the left plot of Figure 5, as the observation gap k increases, the learning rate curves become     flat, due to the error induced by discretization of the likelihood function (2.1). The right plot shows that the absolute error of the estimator is dominated by σO((∆t) 1/2 ).

Example 2: Stochastic Lennard Jones dynamics
In this example, we consider the Lennard-Jones type kernel φ(r) = Φ (r) r , with r m r q for some p > q ∈ N. The system of particles is assumed to be associated with a potential energy function only depending on the pairwise distance and Φ, and the evolution is driven by minimization of the energy function. In particular, represents the depth of the potential well, r is the distance between the particles, and r m is the distance at which the potential reaches its minimum. At r m , the potential function has the value − . The r −p term, which is the repulsive term, describes Pauli repulsion at short ranges due to overlapping electron orbitals, and the r −q term, which is the attractive long-range term. The corresponding system has wide applications in molecular dynamics and materials sciences where φ models atom-atom interactions. Note that φ is singular at r = 0: we truncate it at r trunc by connecting it with an exponential function of the form a exp(−br 12 ) so that it has a continuous derivative on R + . In this system, the particle-particle interactions are all short-range repulsions and long-range attractions. The short-range repulsion force prevents the particles to collide and long-range attractions keep the particles in the flock. In the deterministic setting, the system evolves to equilibrium configurations very quickly, which are crystallike structure, whose pairwise distance corresponds to the local minimizers of the associated energy function. Table  5 and 6 summarize the system and learning parameters.
Note that the true kernel φ is not compactly supported. But in our simulations, we observe the dynamics up to a time T which is a fraction of the equilibrium time. Since the particles only explore a bounded region due to the large-range attraction, ρ T is essentially compactly supported on a bounded region (see the histogram background of Figure 6), on which φ is in our admission space.
We use piecewise linear functions on n uniform partitions of the learning interval to approximate the true kernel φ. With M = 32, Figure 6 shows that we have already obtained faithful approximations to the true interaction kernel, except for on regions are close 0. Increasing number of observations improves the accuracy of estimators at locations near 0, which seems to be very helpful for the system with larger noise level.
In terms of the trajectory prediction, we use the learned interaction kernels φ in Figure 2. We summarize the results in Figure 7 and Table 7. In the experiments, we study two cases, one with small random noise where the particles still form an equilibrium configuration, and then this configuration have small fluctuation in the space; the other one with medium level of random noise, where the random noise begins to break the formation of a fixed equilibrium configuration and we see the transition between different configurations. We see that in both cases, our estimators produce good prediction of the true dynamics in both training and future time interval.
We plot the convergence rate of estimators in terms of M in the right plot of Figure 8. In this case, we have We choose a choice of dimension n proportional to ( M log M ) 1 5 , our numerical results show that the learning rate is around M −0.39 , which matches the optimal min-max rate M − 2 5 stated in Theorem 3.2.
We also study the convergence of the estimators as the length of the trajectory T increases. In this example with σ = 0.35, the estimated auto-correlation time is about τ = 10 time units. Therefore, we use relatively long trajectories up to T = 1200 time units, contributing up to about 120 effective samples. We set the dimension of the hypothesis space to be n = 4( M T /dt log(M T /dt) ) 1 5 for each pair (M, T ), where dt is the time step size of the Euler-Maruyama scheme. The right plot of Figure 8 shows that the rate is 0.39, indicating the equivalence between a single long trajectory and multiple short trajectories for inference.     Next, we investigate the effects of the scale of the random noise on learning. We observe phenomenon similar to those in Example 1. Figure 6 shows that the estimators for the system with σ = 0.25 oscillates more than the one with σ = 0.05 at locations near 0. The random noise also did not affect the learning rates, suggested by the   Table 7.     Figure 8. as the random noise increases, absolute L 2 (ρ T ) error of estimators also increases, suggesting that coercivity constant is getting smaller.
At last, we study the effects of discretization error induced by discrete observations. As the observation gap increases, the discretization errors flatten the learning rate curve of M , see left plot of Figure 8. Similar to Example 1, the right plot of Figure 8 shows that the absolute error of the estimator is of order close to the theoretical order σO((∆t) 1/2 ).

Conclusions from the numerical experiments
Numerical results show that in case of continuous-time observations, the algorithm effectively estimates the interaction kernel, achieves the near-optimal learning rate in M , is robust to different magnitudes of the random noise, and the system with the estimated kernels accurately predicts trajectories. In case of discrete-time observations, the estimator has an estimation error of order ∆t 1/2 , due to the discretization error in the approximation of the likelihood ratio. These numerical results are in full agreement with the learning theory in Section 3-4: -In case of continuous-time observations, the estimators in 10 trials are faithful approximations of the true interaction kernels, with a mean close to the truth. The standard deviation of the estimators decreases as the sample size increases, and gets larger as the diffusion constant increases. -The estimator from data achieves the min-max learning rate (log M/M ) s/(2s+1) in Theorem 3.2 by the appropriate choice of the hypothesis spaces and their dimension as a function of M . For φ in C k+α with k + α ≥ 2, the learning rate is around M − 1 3 when using piece-wise constant estimators (s = 1); and the learning rate is around M − 2 5 using the piecewise linear estimators (s = 2), which is the minmax optimal rate for the case k + α = 2.
-The estimators predict transient dynamics well in the training time interval, and the results validate Proposition 2.1: the trajectory discrepancy is controlled by L 2 (ρ T ) error of estimators, demonstrating the effectiveness of distances in L 2 (ρ T ) in quantifying the performance of estimators. In addition, the estimators even predict in a remarkably accurate fashion the collective behaviour of particles in larger future time intervals, indicating that the bound in Proposition 2.1 may be overly pessimistic in some cases. Our intuition is that this benign phenomenon benefits from the large support of ρ T , encouraged by the randomness of the initial conditions and presence of stochastic noise. -In case of discrete-time observations with observation gap ∆t, the estimation error of the estimator is of order ∆t 1/2 and depends linearly on σ, the square root of the diffusion constant. Therefore, as ∆t increases, the discretization error dominates the estimation error, consistently with the learning theory in Section 4, which leads to bounding the esitmation error of the estimator by M − s 2s+1 + σO(∆t 1/2 ).
-When the length T of the trajectories increases, the optimal learning rate (in M ) is still achieved. The estimation errors of the estimator exhibits a convergence rate around ( log(M T ) M T ) s/(2s+1) with s = 1, 2 respectively, demonstrating an equivalence of "information" between few long trajectories and many short trajectories initiated at suitably random initial conditions, as discussed above in Section 2.3.

Final remarks and future work
There are many venues in which the present work could be extended.
The first notable extension is to heterogeneous particle systems with multiple types of particles, which arise in many applications. In this case one assumes that there are different interaction kernels, modeling the nonsymmetric interactions between different types of particles. Examples of these systems are considered in [41] in the deterministic case, with the theoretical analysis achieved in [40], where the coercivity condition is generalized to the multiple-particle-types setting, and (near-)optimal convergence rates of the estimators where established. We believe a similar extension is possible in the stochastic case, combining the ideas of this work and [40].
Another notable extension is to second order differential systems of interacting particles, where interaction kernels of more general forms than those considered here arise. In the deterministic case [41] considers examples of such systems, with a forthcoming theoretical analysis. In the stochastic case the extension would require significant effort, especially if important cases of systems with degenerate diffusion (e.g. stochastic Langevin) were considered. We also remark again that in this work we do not observe velocities, as done in the works just cited in the case of deterministic systems: here we fully take into account the discretization (in time) error, and if we let σ → 0, the results here would imply similar results in the deterministic case. Extending these considerations to second-order systems would be valuable.
Further work is also needed to formalize the considerations we put forward in Section 2.3 regarding ergodic systems, and design robust and optimal algorithms in the regimes of observation a long trajectory or many independent trajectories.
We assume in this work that all particles are observed. A desirable extension is to the case of partial observations of a subset of particles or macroscopic observations of the population density, which is a practical concern when the system is large with millions of particles in high dimension. Since it is an ill-posed inverse problem to recover the missing trajectories of unobserved particles [54], a new formulation based on the corresponding mean field equations [44,28,29] is under investigation.
In this work we assume that the noise coefficient is a known constant: there has been of course significant work in estimating the noise coefficient, for example in the case of interacting particle systems see the recent work [27] and references therein, and for the case of model reduction for Langevin equations with state-dependent diffusion coefficient [20].
It is straightforward to show that f φ satisfy (A.2) and (A.3). Therefore, suppose µ 0 is independent of the underlying Brownian motion and has finite second moment, then there exists a unique strong solution up to time T for the system (1.1) for any X 0 drawn from µ 0 .
Theorem A.2 (Girsonov Theorem) Let Pσ be the probability measure induced by the solution of the SDEs (A.1) for t ∈ [T 0 , T ] and a fixed starting value at time T 0 , and let Wσ be the law of the respective driftless process. Suppose that Σ = σσ is invertible and V fulfills the Novikov condition The proof of Theorem A.2 can be found in [32,Chapter 3.5], [46,Chapter 8.6].
Theorem A.3 (The Itô formula, see Theorem 4.1.2 in [46]) Let g(X) = (g 1 (X), · · · , gp(X)) be a C 2 map from R dN into R p . Then the process Y (t) = g(Xt) is an Itô process with components given by where dB i dB j = δ ij dt, dB i dt = dtdB i = 0.

A.2 Useful inequalities
Theorem A.4 (Bernstein inequality for unbounded random variables) Let X 1 , X 2 , · · · , X M be independent random variables with E(X i ) = 0. If for some constants K 1 , v 1 > 0, the bound E|X i | p ≤ 1 2 p!K p−2 1 v 1 holds for every 2 ≤ p ∈ N, then For the proof of Theorem A.4 , we refer to [5] and David Pollard's book notes [48](page 14).