Abstract
We consider stochastic systems of interacting particles or agents, with dynamics determined by an interaction kernel, which only depends on pairwise distances. We study the problem of inferring this interaction kernel from observations of the positions of the particles, in either continuous or discrete time, along multiple independent trajectories. We introduce a nonparametric inference approach to this inverse problem, based on a regularized maximum likelihood estimator constrained to suitable hypothesis spaces adaptive to data. We show that a coercivity condition enables us to control the condition number of this problem and prove the consistency of our estimator, and that in fact it converges at a near-optimal learning rate, equal to the min–max rate of one-dimensional nonparametric regression. In particular, this rate is independent of the dimension of the state space, which is typically very high. We also analyze the discretization error in the case of discrete-time observations, showing that it is of order 1/2 in the time spacing between observations. This term, when large, dominates the sampling error and the approximation error, preventing convergence of the estimator. Finally, we exhibit an efficient parallel algorithm to construct the estimator from data, and we demonstrate the effectiveness of our algorithm with numerical tests on prototype systems including stochastic opinion dynamics and a Lennard-Jones model.
Introduction
We consider a system of particles or agents interacting in a random environment, with their motion described by a first-order stochastic differential equation of the form
where \(\varvec{x}_{i,t} \in {\mathbb {R}}^{d}\) represents the position of particle i at time t, \(\phi :{\mathbb {R}}^+\rightarrow {\mathbb {R}}\) is an interaction kernel dependent on the pairwise distance between particles, and \(\varvec{B}_t\) is a standard Brownian motion in \({\mathbb {R}}^{Nd}\), with \(\sigma >0\) representing the scale of the random noise. This is a gradient system, with the energy potential \(V_{\phi }:{\mathbb {R}}^{Nd} \rightarrow {\mathbb {R}}\)
where \(\varvec{X}_t=(\varvec{x}_{i,t})_{i=1,\dots ,N}\in {\mathbb {R}}^{dN}\) is the state of the system. Letting
we can write Eq.(1.1) in vector format as
The particles interact with each other based on their pairwise distances; the deterministic drift dissipates the total energy, driving the system toward a stable point of the energy potential, while the random noise injects energy into the system.
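For concreteness, the dynamics can be simulated with a short Euler–Maruyama sketch. This is an illustration only: the `1/N` normalization of the pairwise sum and the toy kernel `phi` are assumptions, since Eq. (1.1) is not reproduced here.

```python
import numpy as np

def f_phi(X, phi):
    # Drift of the gradient system: the i-th component sums the pairwise
    # terms phi(||x_j - x_i||)(x_j - x_i).  The 1/N normalization is an
    # assumption; the exact convention is fixed by Eq. (1.1).
    N, d = X.shape
    F = np.zeros_like(X)
    for i in range(N):
        r = X - X[i]                          # vectors x_j - x_i
        dist = np.linalg.norm(r, axis=1)
        dist[i] = 1.0                         # dummy value; r[i] = 0 anyway
        F[i] = (phi(dist)[:, None] * r).sum(axis=0) / N
    return F

def simulate(phi, X0, T, L, sigma, rng):
    # Euler-Maruyama discretization of dX_t = f_phi(X_t) dt + sigma dB_t.
    dt = T / L
    traj = np.empty((L + 1,) + X0.shape)
    traj[0] = X0
    for l in range(L):
        noise = sigma * np.sqrt(dt) * rng.standard_normal(X0.shape)
        traj[l + 1] = traj[l] + f_phi(traj[l], phi) * dt + noise
    return traj

rng = np.random.default_rng(0)
phi = lambda r: np.exp(-r)                    # toy interaction kernel
traj = simulate(phi, rng.standard_normal((6, 2)), T=1.0, L=100,
                sigma=0.1, rng=rng)
```

The returned array stacks the `L+1` observed states, matching the discrete-time observation setting described below.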
Such systems of interacting particles arise in a wide variety of disciplines, including interacting physical particles [22, 49] or granular media [1,2,3, 8, 12, 13] in Physics, opinion aggregation on interacting networks in Social Science [24, 43, 46], and Monte Carlo sampling [36, 39], to name just a few.
Motivated by these applications, the inference of such systems from data has gained increasing attention. For deterministic multi-particle systems, various types of learning techniques have been developed (see, e.g., [9, 14, 40, 41, 50, 55] and the references therein). When it comes to stochastic multi-particle systems, only a few efforts have been made, e.g., learning reduced Langevin equations on manifolds in [19] (without, however, assuming nor exploiting the structure of pairwise interactions), learning parametric potential functions in [10, 15] from single-trajectory data, and estimating the diffusion parameter in [26].
Our goal is to estimate the interaction kernel \(\phi \) given discrete-time observation data from trajectories \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\), where the initial conditions \(\{\varvec{X}^{(m)}_{t_0}\}_{m=1}^M\) are independent samples drawn from a distribution \(\mu _0\) on \({\mathbb {R}}^{dN}\), and \(t_0:t_L\) indicates the times \(0=t_0<t_1<\cdots<t_l<\cdots <t_L=T\), with \(t_l = l{\varDelta }t\).
Since, in general, little information about the analytical form of the kernel is available, we infer it in a nonparametric fashion (e.g., [6, 20, 23]). We note that the problem we consider is to learn a latent function in the drift term given observations from multiple trajectories, which is different from the ample literature on the inference of stochastic differential equations (see, e.g., [29, 34]), focusing either on parameter estimation or on inference for ergodic systems. In particular, our learning approach is close in spirit to the nonparametric regression of the drift studied in [44] for ergodic systems and in [17] from i.i.d. paths. However, for systems of interacting particles one faces the curse of dimensionality when learning the high-dimensional drift directly as a general function on the high-dimensional state space \({\mathbb {R}}^{dN}\). Instead, we will exploit the structure of the system and learn the latent interaction kernel in the drift, which only depends on pairwise distances, and show that the curse of dimensionality may be avoided when this inverse problem is well-conditioned.
We introduce a maximum likelihood estimator (MLE), along with an efficient algorithm that can be implemented in parallel over trajectories, with a hypothesis space adaptive to data so as to reach optimal accuracy. Under a coercivity condition, we prove that the MLE is consistent and converges at the min–max rate for one-dimensional nonparametric regression. We also analyze the discretization error due to discrete-time observations: we show it leads to an error in the estimator of order \({\varDelta }t^{1/2}\) (with \({\varDelta }t=T/L=t_{l+1}-t_l\)), and as a result, it prevents us from obtaining the min–max learning rate in the sample size. We demonstrate the effectiveness of our algorithm by numerical tests on prototype systems including opinion dynamics and a stochastic Lennard-Jones model (see Sect. 5). The numerical results verify our learning theory, in the sense that the min–max rate of convergence is achieved and the bias due to the discretization error is close to the order \({\varDelta }t^{1/2}\).
Overview of the Main Results
We consider an approximate maximum likelihood estimator (MLE), which is the maximizer of the approximate likelihood of the observed trajectories, over a suitable hypothesis space \({\mathcal {H}}\):
where \({\mathcal {E}}_{L,T,M}(\varphi )\) is an approximation of the negative log-likelihood of the discrete data \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\). Using the fact that the drift term \(\varvec{f}_{\phi }\) is linear in \(\phi \), and hence \({\mathcal {E}}_{L,T,M}(\varphi )\) is a quadratic functional, we propose an algorithm (see Algorithm 1) that efficiently computes this MLE by least squares. With a data-adaptive choice of the basis functions \(\{\psi _p\}_{p=1}^n\) for the hypothesis space \({\mathcal {H}}\), we obtain the MLE
by computing the coefficients \({\widehat{a}}_{L,T,M, {\mathcal {H}}}\in {\mathbb {R}}^n\) from normal equations. The algorithm may be implemented by building in parallel the equations for each trajectory.
We develop a systematic learning theory on the performance of this MLE. We first propose a coercivity condition that ensures the robust identifiability of the kernel \(\phi \), in the sense that the derivative of the pairwise potential defined in (1.2), \({\varPhi }'(r) = \phi (r)r\), can be uniquely identified in the function space \(L^2({\mathbb {R}}^+,\rho _T)\), where \(\rho _T\) is the measure of all pairwise distances between particles. Then, we consider the convergence of the estimator, from both continuous-time and discrete-time observations, under the norm
The case of continuous-time observations (Sect. 3). We consider the MLE
where \({\mathcal {E}}_{T,M}(\varphi )\) is the exact negative log-likelihood of the continuous-time trajectories \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\). We show that the MLE is consistent, that is, it converges in probability to the true kernel under the norm \(|||\cdot|||\). Furthermore, we show that the MLE converges at a rate which is independent of the dimension of the state space of the system and corresponds to the min–max rate for one-dimensional nonparametric regression ([6, 16, 20, 23]), when choosing the hypothesis space adaptively according to the data and the Hölder continuity s of the true kernel, with dimension increasing with the amount of observed data. With \(\mathrm {dim}({\mathcal {H}}) \asymp (\frac{M}{\log M})^{\frac{1}{2s+1}}\), and assuming that the coercivity condition holds on \({\mathcal {H}}\) with a constant \(c_{{\mathcal {H}}}>0\), we have, with high probability and in expectation,
The case of discrete-time observations (Sect. 4). In this case, derivatives and statistics of the trajectories in between observations need to be approximated, while keeping the estimator efficiently computable: this leads to further approximations of the likelihood and consequently of the MLE. The discretization error of the approximations we use, based on the Euler–Maruyama integration scheme, is of order 1/2 in the observation time gap \({\varDelta }t = T/L\). We show that for some \(C>0\), for any \(\epsilon >0\), with high probability
where \({\widehat{\phi }}_{T, \infty , {\mathcal {H}}} \) is the projection of the true kernel to \({\mathcal {H}}\) and n is the dimension of the hypothesis space \({\mathcal {H}}\). The discretization error flattens the learning curve when the sample size is large, overshadowing the sampling error and the approximation error caused by working within the hypothesis space.
In both cases, we decompose the error in the MLE into a sampling error from the trajectory data and an approximation error from the hypothesis space, as illustrated in the diagram in Fig. 1. In the case of continuous-time observations, the sampling error is the error between \({\widehat{\phi }}_{T,M,{\mathcal {H}}} \) and the MLE from infinitely many trajectories (denoted by \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}}\)): this will be controlled with concentration inequalities. The approximation error \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} - \phi \) is adaptively controlled by a proper choice of hypothesis space. The analysis is carried out in the infinite-dimensional space \(L^2(\rho _T)\). In the case of discrete-time observations, we provide a finite-dimensional analysis to study directly the MLE in our proposed algorithm, that is, analyzing the error of \({\widehat{a}}_{L,T,M, {\mathcal {H}}}\) in (1.5) under proper conditions on the basis functions. The sampling error \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}} - {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) is analyzed through \({\widehat{a}}_{L,T,M, {\mathcal {H}}} - {\widehat{a}}_{L,T,\infty , {\mathcal {H}}}\), and the discretization error between \({\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) and \({\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \) is analyzed through \({\widehat{a}}_{L,T,\infty , {\mathcal {H}}} - {\widehat{a}}_{T,\infty , {\mathcal {H}}}\). The discretization error comes from the discrete-time approximation of the likelihood, and it vanishes as the observation time gap \({\varDelta }t\) tends to zero, recovering the convergence of the MLE as in the case of continuous-time observations.
Notation and Outline
Throughout this paper, we use bold letters to denote vectors or vectorvalued functions. We use the notation in Table 1 for variables in the system of interacting particles.
We call the system of relative positions with respect to a reference particle, say \((\varvec{r}_{1i} = \varvec{x}_{i,t} - \varvec{x}_{1,t})_{i=2}^N\), the “relative position system”. The relative position system can be ergodic under suitable conditions on the potential [37], and these relative positions are the variables we need to learn the interaction kernel. We point out that the interacting particle system (1.1) itself is not ergodic, because the center \({\overline{\varvec{x}}}_{t} = \frac{1}{N}\sum _{i=1}^N \varvec{x}_{i,t} \) satisfies \(\mathrm{d}{\overline{\varvec{x}}}_{t} = \sigma \frac{1}{N}\sum _{i=1}^N \mathrm{d}\varvec{B}_{i,t} \).
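The cancellation behind this observation can be checked numerically: because \(\phi\) depends only on the distance, the pairwise drift terms are antisymmetric in the pair indices and sum to zero, so the center of mass feels no drift. A minimal sketch (the `1/N` drift convention and the toy kernel are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, dt, L, sigma = 8, 2, 0.01, 200, 0.5
phi = lambda r: 1.0 / (1.0 + r)          # toy kernel (an assumption)

X = rng.standard_normal((N, d))
max_total_drift = 0.0
for _ in range(L):
    R = X[None, :, :] - X[:, None, :]    # R[i, j] = x_j - x_i
    dist = np.linalg.norm(R, axis=2)
    np.fill_diagonal(dist, 1.0)          # dummy value; diagonal of R is 0
    F = (phi(dist)[:, :, None] * R).sum(axis=1) / N
    # Terms phi(d_ij)(x_j - x_i) and phi(d_ji)(x_i - x_j) cancel pairwise,
    # so the total drift on the center of mass is zero up to round-off:
    max_total_drift = max(max_total_drift, np.abs(F.sum(axis=0)).max())
    X = X + F * dt + sigma * np.sqrt(dt) * rng.standard_normal((N, d))
```

The center thus diffuses as an averaged Brownian motion, which is why ergodicity can only hold for the relative position system.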
We restrict our attention to interaction kernels \(\phi \) in the admissible set
Let \({\varOmega }\) be an arbitrary compact (or precompact) set of a Euclidean space (which may be \({\mathbb {R}}^{+}\), \({\mathbb {R}}^{d}\) or \({\mathbb {R}}^{dN}\)), with the Lebesgue measure unless otherwise specified. We consider the following function spaces

\(L^{\infty }({\varOmega })\): the space of bounded functions on \({\varOmega }\) with norm \(\Vert g\Vert _{\infty }=\mathrm{ess\,sup}_{x\in {\varOmega }}|g(x)|\);

\(C({\varOmega }):\) the closed subspace of \(L^{\infty }({\varOmega })\) consisting of continuous functions;

\(C_c({\varOmega }):\) the set of functions in \(C({\varOmega })\) with compact support;

\(C^{k,\alpha }({\varOmega })\) with \(k \in {\mathbb {N}}, 0<\alpha \le 1\): the space of functions whose kth derivative is Hölder continuous of order \(\alpha \). In the special case of \(k=0\) and \(\alpha =1\), \(g \in C^{0,1}({\varOmega })\) is called Lipschitz continuous on \({\varOmega }\); the Lipschitz constant of \(g \in \text {Lip}({\varOmega })\) is defined as \(\text {Lip}[g]:=\sup _{x\ne y} \frac{|g(x)-g(y)|}{\Vert x-y\Vert }.\)
We summarize the notation for the inference of the interaction kernel in Table 2. The function space in which we perform the estimation is the space of functions \(\varphi \) such that \(\varphi (\cdot )\cdot \in L^2({\mathbb {R}}^+, \rho _T)\), where \(\rho _T\) is the measure of pairwise distances between all particles on the time interval [0, T] (see (2.9)). In the theoretical analysis we will focus on learning on compact (finite- or infinite-dimensional) subsets of \(L^{\infty }([0,R])\) (where [0, R] is the support of the functions in the admissible set \({\mathcal {K}}_{R,S}\)); however, in the numerical implementation we will use finite-dimensional linear subspaces of \(L^2([0,R], \rho _T)\) spanned by piecewise polynomial functions. While these linear subspaces are not compact, it can be shown that the minimizers over the whole linear space are bounded, and thus the compactness requirement is not essential (e.g., see Theorem 11.3 in [23]). We shall therefore assume the compactness of the hypothesis space in the theoretical analysis.
The remainder of the paper is organized as follows: We first provide an overview of our learning theory. In Sect. 2, we present a practical learning algorithm with theory-guided optimal settings for the choice of hypothesis spaces and with a practical assessment of the learning results. We then demonstrate the effectiveness of the algorithm on prototype systems, including a stochastic model for opinion dynamics and a stochastic Lennard-Jones model, in Sect. 5. We establish a systematic learning theory analyzing the performance of the MLE, considering continuous-time observations in Sect. 3 and discrete-time observations in Sect. 4. We present detailed proofs in the “Appendix”.
Nonparametric Inference of the Interaction Kernel
We present in this section the nonparametric technique we study for the inference of the interaction kernel, together with the corresponding algorithms. We discuss the assessment of the estimator's accuracy and its performance in trajectory prediction. The proposed estimator is based on maximum likelihood estimation over data-adaptive hypothesis spaces, chosen so as to achieve the optimal rate of convergence, guided by our learning theory in Sects. 3–4.
The Maximum Likelihood Estimator
As a variational approach, we set the error functional to be the negative log-likelihood of the data \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) and compute the maximum likelihood estimator (MLE).
The error functional. Recall that by the Girsanov theorem, for a continuous trajectory \(\varvec{X}_{[0,T]}\), the negative log-likelihood ratio between the measure induced by system (1.1), with an admissible kernel \(\phi \), and the Wiener measure is
As we do not know the interaction kernel \(\phi \) that generated the trajectory \(\varvec{X}_{[0,T]}\), we let \(\varphi \) be any admissible interaction kernel, and upon replacing \(\phi \) by \(\varphi \) in (2.1), observe that \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) is the negative log-likelihood of observing the trajectory \(\varvec{X}_{[0,T]}\) if system (1.1) were driven by the interaction kernel \(\varphi \). In this light, \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) may be interpreted as an error functional, which we wish to minimize over \(\varphi \) in order to obtain an estimator for \(\phi \).
Given only discrete-time observations \(\varvec{X}_{t_0:t_L}\), where \(t_l = l{\varDelta }t\), \(l=0,\dots ,L\), with \({\varDelta }t = T/L\) (the case of observations non-equispaced in time is a straightforward generalization), the error functional \({\mathcal {E}}_{\varvec{X}_{[0,T]}}(\varphi )\) may be approximated as
The corresponding approximate likelihood is equivalent to the likelihood based on the Euler–Maruyama (EM) scheme (whose transition probability density is Gaussian):
Note that while higher-order approximations of the stochastic integral (or, equivalently, approximations based on higher-order numerical schemes) may be more accurate than the EM scheme, they lead to nonlinear optimization problems in the computation of the MLE defined below, and we shall therefore avoid them. The EM-based approximation preserves the quadratic form of the error functional and leads to an optimization problem that can be solved by least squares. As we show in Theorem 4.2, this discrete-time approximation leads to an error term of order \({\varDelta }t^{1/2}\) in the MLE, which will be small in the regime on which we focus in this work.
Since the observed discrete-time trajectories \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) are independent, as the \(\varvec{X}_{t_0}^{(m)}\)’s are drawn i.i.d. from \(\mu _0\), the joint likelihood of the trajectories is the product of the likelihoods of the individual trajectories. Therefore, the corresponding empirical error functional is defined to be
A regularized Maximum Likelihood Estimator. The regularized MLE we consider is a minimizer of the above empirical error functional over a suitable hypothesis space \({\mathcal {H}}\):
This regularized MLE is well defined when the minimizer exists and is unique over \({\mathcal {H}}\). We call this MLE “regularized” to emphasize the constraint \(\varphi \in {\mathcal {H}}\), and the fact that \({\mathcal {H}}\) will change with M, as in nonparametric statistics; this naming is, however, somewhat nonstandard. We shall discuss the uniqueness of the minimizer in Sect. 3.1, where we show it is guaranteed by a coercivity condition. When the hypothesis space \({\mathcal {H}}\) is a finite-dimensional linear space, say, \({\mathcal {H}}=\mathrm {span} \{\psi _i\}_{i=1}^n\) with basis functions \(\psi _i: {\mathbb {R}}^+\rightarrow {\mathbb {R}}\), the regularized MLE is the solution of a least squares problem. To see this, letting \(\varphi = \sum _{i=1}^{n} a(i) \psi _i\) and \(a :=(a(1),\dots ,a(n))\in {\mathbb {R}}^n\), we have \(\varvec{f}_{\varphi }(\varvec{X}) =\sum _{i= 1}^n a(i) \varvec{f}_{\psi _i}(\varvec{X}) \), due to the linear dependence of \(\varvec{f}_\varphi \) on \(\varphi \). Then, we can write the error functional in Eq.(2.2) for each trajectory as
where the matrix \(A^{(m)} \in {\mathbb {R}}^{n\times n}\) and the vector \(b^{(m)}\in {\mathbb {R}}^{n}\) are given by
Hence, corresponding to \(\nabla {\mathcal {E}}_{L,T,M}=0\) for the error functional in (2.4), we solve the normal equations for a to obtain the solution \({\widehat{a}}_{L,T,M,{\mathcal {H}}}\):
and corresponding desired MLE for the interaction kernel:
The normal equations (2.7) are solved by least squares, so the solution always exists. We will show in Sect. 4 that assuming a coercivity condition, the matrix \(A_{M,L}\in {\mathbb {R}}^{n\times n}\) is invertible with high probability when M and L are large, so the least squares estimator is the unique solution to the normal equations, and the regularized MLE is the unique minimizer of the empirical error functional over \({\mathcal {H}}\).
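The least-squares pipeline can be sketched end to end. This is a toy instance only: the `1/N` drift convention, the piecewise-constant basis, and the omission of normalization constants (which cancel in the normal equations) are assumptions made here for illustration; a noiseless path is used so the normal equations can be checked exactly.

```python
import numpy as np

def drift(X, psi):
    # f_psi(X) as a flat vector in R^{dN}; the 1/N pairwise-sum
    # convention is an assumption about Eq. (1.1).
    N, _ = X.shape
    F = np.zeros_like(X)
    for i in range(N):
        r = X - X[i]
        dist = np.linalg.norm(r, axis=1)
        dist[i] = 1.0
        F[i] = (psi(dist)[:, None] * r).sum(axis=0) / N
    return F.ravel()

def normal_equations(traj, dt, basis):
    # Per-trajectory normal system in the spirit of (2.5)-(2.6):
    #   A_pq = dt * sum_l <f_{psi_p}(X_l), f_{psi_q}(X_l)>,
    #   b_p  =      sum_l <f_{psi_p}(X_l), X_{l+1} - X_l>.
    n = len(basis)
    A, b = np.zeros((n, n)), np.zeros(n)
    for l in range(len(traj) - 1):
        F = np.stack([drift(traj[l], psi) for psi in basis])
        A += dt * F @ F.T
        b += F @ (traj[l + 1] - traj[l]).ravel()
    return A, b

# Piecewise-constant basis on [0, R]: a simple stand-in for the
# piecewise-polynomial hypothesis spaces used in the paper.
R, n = 6.0, 8
basis = [(lambda p: (lambda r: ((r >= p*R/n) & (r < (p+1)*R/n)).astype(float)))(p)
         for p in range(n)]

rng = np.random.default_rng(2)
a_true = rng.uniform(0.5, 1.5, n)
phi = lambda r: sum(a * psi(r) for a, psi in zip(a_true, basis))

dt, L = 0.02, 50
X, traj = rng.standard_normal((5, 2)), []
for _ in range(L + 1):                   # noiseless path for a clean check
    traj.append(X)
    X = X + drift(X, phi).reshape(5, 2) * dt
traj = np.array(traj)

A, b = normal_equations(traj, dt, basis)
a_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
```

On the noiseless path, `b` lies exactly in the range of `A`, so the least-squares solution satisfies the normal equations to machine precision; bins never visited by pairwise distances leave `A` singular, which is why `lstsq` (min-norm solution) rather than a direct solve is the robust choice.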
DynamicsAdapted Measures and Function Spaces
We will assess the estimation error in a suitable function space: \(L^2({\mathbb {R}}^+,\rho _T)\). Here \(\rho _T\) is the distribution of pairwise distances between all particles:
where \(\delta \) is the Dirac \(\delta \)-distribution, so that \({\mathbb {E}}[\delta _{r_{ii'}(t)}(\mathrm{d}r)]\) is the distribution of the random variable \(r_{ii'}(t)= \Vert \varvec{x}_{i,t} - \varvec{x}_{i',t}\Vert \), with \(\varvec{x}_{i,t}\) being the position of particle i at time t. Here the expectation is taken over the distribution \(\mu _0\) of the initial conditions and over the realizations of the system. The probability measure \(\rho _T\) depends on both \(\mu _0\) and the measure determining the random noise on the path space, while it is independent of the observed data. The measure \(\rho _T\) encodes the information about the dynamics marginalized to pairwise distances; regions with large \(\rho _T\)-measure correspond to pairwise distances between particles that are often encountered during the dynamics.
With observations of M trajectories at L discrete times each, we introduce a corresponding measure
where \(r^{(m)}_{ii'}(t)= \Vert \varvec{x}^{(m)}_{i,t} - \varvec{x}^{(m)}_{i',t}\Vert \) is computed from the mth observed trajectory. We think of this as an approximation to \(\rho _T\) in two significantly different aspects: in L, because as \(L\rightarrow +\infty \) our observations become continuous in time; and in M, because as \(M\rightarrow +\infty \), \(\rho ^{L,M}_T\) can be thought of as an empirical approximation to \(\rho _T\) obtained from the M independent trajectories.
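In practice, the empirical measure is just the pool of all observed pairwise distances; a sketch (the array shapes are illustrative assumptions):

```python
import numpy as np

def pairwise_distance_pool(trajs):
    # Pool the pairwise distances r^{(m)}_{ii'}(t_l) over trajectories,
    # observation times and particle pairs; a histogram of this pool
    # approximates the density of rho_T^{L,M}.
    pool = []
    for traj in trajs:                        # traj has shape (L+1, N, d)
        for X in traj:
            D = np.linalg.norm(X[None, :, :] - X[:, None, :], axis=2)
            pool.append(D[np.triu_indices(len(X), k=1)])
    return np.concatenate(pool)

rng = np.random.default_rng(3)
trajs = rng.standard_normal((4, 11, 6, 2))    # M=4 toy "trajectories"
pool = pairwise_distance_pool(trajs)
density, edges = np.histogram(pool, bins=20, density=True)
```

With N = 6 particles there are 15 unordered pairs, so the pool holds \(M \times (L+1) \times N(N-1)/2\) samples.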
Accuracy of the estimator. We measure the accuracy of our estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) by the quantity
The function \(\phi (\cdot )\cdot \), which at \(r\in {\mathbb {R}}_+\) takes the value \(\phi (r)r\), appears here instead of \(\phi \) itself; it arises naturally in our learning theory in Sect. 3, fundamentally because it is the derivative of the pairwise distance potential \({\varPhi }\) in (1.2). We define
Then, the mean squared error (MSE) of the estimator is
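The weighted error above can be approximated by Monte Carlo once samples of pairwise distances are available; a toy sketch in which uniform samples stand in for draws from \(\rho_T\):

```python
import numpy as np

def weighted_sq_error(phi_hat, phi, r_samples):
    # Monte Carlo version of the squared weighted norm: average
    # (phi_hat(r) r - phi(r) r)^2 over distances r drawn from the
    # (empirical) pairwise-distance measure rho_T.
    diff = (phi_hat(r_samples) - phi(r_samples)) * r_samples
    return np.mean(diff ** 2)

rng = np.random.default_rng(4)
r = rng.uniform(0.0, 3.0, 10_000)        # stand-in samples from rho_T
phi = lambda s: np.exp(-s)
same = weighted_sq_error(phi, phi, r)                     # exactly 0
off = weighted_sq_error(lambda s: phi(s) + 0.1, phi, r)   # ~ 0.01 * E[r^2]
```

The weight r in the integrand means that errors at large pairwise distances are penalized more, consistent with \(\phi(r)r\) being the quantity identified by the theory.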
Hypothesis Spaces and Nonparametric Estimators
As is standard in nonparametric estimation, we let the hypothesis space \({\mathcal {H}}\) grow in dimension with the number of observations, avoiding under- or over-parameterization and leading to consistent estimators that in fact attain an optimal min–max rate of convergence (see, e.g., [20, 21, 23]).
Similar to [40, 41], we set the basis functions \(\{\psi _i\}_{i=1}^n\) to be piecewise polynomials on a partition of the support of the density function of the empirical probability measure \(\rho ^{L,M}_T\).
Guided by the optimal convergence-rate results in Sect. 3, we will set the dimension of the hypothesis space \({\mathcal {H}}\) to be
where the number s is the Hölder index of continuity of the basis functions, and it is to be chosen according to the regularity of the true kernel. When T is large and when the relative position system is ergodic, we set
where \(N_{ess}:= M \frac{T}{\tau }\), with \(\tau \) denoting the autocorrelation time of the system, is the effective sample size of the data. Here the autocorrelation time \(\tau \) is the equivalent of the mixing time for a reversible ergodic Markov chain [35].
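A small helper makes the dimension choice concrete (the rounding rule is a choice made here, not prescribed by the paper):

```python
import numpy as np

def hypothesis_dimension(M, s, T=None, tau=None):
    # n ~ (M / log M)^(1/(2s+1)) as in (2.13); for large T with an
    # ergodic relative-position system, M is replaced by the effective
    # sample size N_ess = M * T / tau.
    size = M if T is None else M * T / tau
    return max(1, int(np.ceil((size / np.log(size)) ** (1.0 / (2 * s + 1)))))

n_iid = hypothesis_dimension(1000, s=1)
n_erg = hypothesis_dimension(1000, s=1, T=10.0, tau=1.0)
```

Longer trajectories with short correlation time increase the effective sample size and therefore allow a richer hypothesis space.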
We estimate the autocorrelation time by the sum of the temporal autocorrelation function of a pairwise distance \(r_{i,j}\). (We refer to [51] for detailed discussion on the estimation of autocorrelation time, which is a whole subject by itself.)
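A minimal sketch of this estimate, truncating the sum of empirical autocorrelations at the first non-positive lag (a common heuristic, not the only option; see [51]):

```python
import numpy as np

def autocorrelation_time(x):
    # Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k of a
    # scalar series (e.g. one pairwise distance r_ij over time), with
    # the sum truncated at the first non-positive empirical lag.
    x = np.asarray(x, dtype=float) - np.mean(x)
    c = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    rho = c / c[0]
    tau = 1.0
    for r in rho[1:]:
        if r <= 0:
            break
        tau += 2.0 * r
    return tau

rng = np.random.default_rng(5)
white = rng.standard_normal(5_000)       # tau ~ 1 for white noise
ar = np.empty(5_000)                     # AR(1), a = 0.9: tau ~ (1+a)/(1-a)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```

For white noise the estimate stays near 1, while for a strongly correlated AR(1) series it is much larger, shrinking the effective sample size accordingly.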
We will prove bounds, holding with high probability, on the mean squared error (MSE) of the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}_{n_M}}\) in (2.8) for fixed and large M, for a fixed time T and for suitable hypothesis spaces \({\mathcal {H}}_{n_M}\) with dimension \(n_M\) as in (2.13). When continuous-time trajectories are observed, the MSE is of the order \((\frac{\log M}{ M})^{\frac{2s}{2s+1}}\) with high probability, according to Theorem 3.2, and so is its expectation. In particular, this avoids the curse of dimensionality of the state space (dN). When the observations are discrete-time trajectories with observation gap \({\varDelta }t\), the error is of the order \((\frac{\log M}{ M})^{\frac{2s}{2s+1}} + {\varDelta }t^{1/2}\) with high probability, according to Theorem 4.2.
Algorithmic and Computational Considerations
The algorithm is summarized in Algorithm 1. Note that the normal matrices \(\{A^{(m)} \}\) and vectors \(\{b^{(m)} \}\) are defined trajectory-wise and therefore may be computed in parallel. When the size of the system is large (i.e., \(dN\) is large), this allows one to accelerate the computation of the estimator, by assembling these normal matrices and vectors for each trajectory in parallel, and then updating the global normal matrix \(A_{M,L}\) and vector \(b_{M,L}\). The total computational cost of constructing our estimator, given P CPUs, is \(O(L\frac{N^2d}{P}Mn^2+n^3)\). This becomes \(O(L\frac{N^2d}{P}M^{1+\frac{1}{2s+1}}+CM^{\frac{3}{2s+1}})\) when n is chosen optimally according to Theorem 3.2 and \(\phi \) is at least in \(C^{1,1}\) (corresponding to the regularity index \(s\ge 2\) in the theorem).
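The merge step of the trajectory-wise parallelism can be sketched as follows; the per-trajectory routine here is a toy stand-in for the actual assembly of \((A^{(m)}, b^{(m)})\), and the thread pool is just one of several possible executors:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def per_trajectory_system(traj):
    # Stand-in for assembling (A^{(m)}, b^{(m)}) from one trajectory;
    # a toy least-squares contribution keeps the merge step visible.
    G = traj[:-1]                        # "features" at times t_0..t_{L-1}
    y = traj[1:].sum(axis=1)             # toy "responses"
    return G.T @ G, G.T @ y

rng = np.random.default_rng(6)
trajs = [rng.standard_normal((20, 3)) for _ in range(8)]   # M = 8

# The trajectory-wise systems are independent, so they can be built in
# parallel and then summed into the global normal equations.
with ThreadPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(per_trajectory_system, trajs))
A_par = sum(p[0] for p in parts)
b_par = sum(p[1] for p in parts)
```

Since the merge is a plain sum, the parallel result agrees exactly with a serial pass over the trajectories.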
Accuracy of Trajectory Prediction
One application of estimating the interaction kernel from data is to perform predictions of the dynamics. Given an estimator, the following proposition bounds its accuracy in predicting the trajectories of the system driven by the true interaction kernel:
Proposition 2.1
Let \({\widehat{\phi }} \in {\mathcal {K}}_{R,S}\) be an estimator of the true kernel \(\phi \) , where \( {\mathcal {K}}_{R,S}\) is the admissible set defined in (1.7). Denote by \(\widehat{\varvec{X}}_t\) and \(\varvec{X}_t\) the solutions of the systems with kernels \({\widehat{\phi }}\) and \(\phi \), respectively, starting from the same initial condition and with the same random noise. Then, we have
where the measure \(\rho _T\) is defined by (2.9).
We postpone the proof to Sect. A.3.
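The setting of the proposition can be mimicked numerically by driving two copies of the system with the same initial condition and the same noise realization, one with the true kernel and one with a perturbed one; the gap between the paths then reflects the kernel error. A sketch (the `1/N` drift convention and the perturbation are assumptions):

```python
import numpy as np

def em_step(X, phi, dt, noise):
    # One Euler-Maruyama step; the 1/N pairwise-sum drift convention is
    # an assumption about Eq. (1.1).
    N, _ = X.shape
    F = np.zeros_like(X)
    for i in range(N):
        r = X - X[i]
        dist = np.linalg.norm(r, axis=1)
        dist[i] = 1.0
        F[i] = (phi(dist)[:, None] * r).sum(axis=0) / N
    return X + F * dt + noise

rng = np.random.default_rng(7)
phi = lambda r: np.exp(-r)               # "true" kernel (toy)
phi_hat = lambda r: np.exp(-r) + 0.01    # small kernel perturbation
sigma, dt, L = 0.1, 0.01, 100

X = Xh = rng.standard_normal((6, 2))     # same initial condition
gaps = []
for _ in range(L):
    noise = sigma * np.sqrt(dt) * rng.standard_normal((6, 2))  # shared noise
    X, Xh = em_step(X, phi, dt, noise), em_step(Xh, phi_hat, dt, noise)
    gaps.append(np.linalg.norm(X - Xh))
```

With a small kernel perturbation the trajectory gap stays small over the time horizon, consistent with the bound in Proposition 2.1.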
Learning Theory: ContinuousTime Observations
We analyze first the regularized MLE in the case of continuous-time observations \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\). We show that under a coercivity condition, the regularized MLE is consistent, and that with a proper choice of the hypothesis spaces, we achieve the optimal learning rate \((\frac{\log M}{ M})^{\frac{2s}{2s+1}}\).
Recall from Eq.(2.1) that
is the negative log-likelihood of a trajectory \(\varvec{X}_{[0,T]}\), with respect to the measure induced by the system with interaction kernel \(\varphi \). Then, the negative log-likelihood of the independent trajectories \(\{\varvec{X}^{(m)}_{[0,T]}\}_{m=1}^M\) is
and the regularized MLE over a hypothesis space \({\mathcal {H}}\) is
The existence of the minimizer follows from the fact that the error functional \({\mathcal {E}}_{T,M}(\varphi )\) is quadratic in \(\varphi \), which in turn is a consequence of the linearity of \(\varvec{f}_\varphi \) in \(\varphi \). The uniqueness of the minimizer, however, requires a coercivity condition and is related to the learnability of the kernel, which we discuss in the next section.
Identifiability and Learnability: A Coercivity Condition
The uniqueness of the minimizer of the error functional \({\mathcal {E}}_{T,M}(\varphi )\) over the hypothesis space ensures that the kernel is identifiable. This is not guaranteed, even when the number of observed trajectories is infinite: denote
where \({\mathbb {E}}\) here, and in all that follows unless otherwise indicated, is the expectation over initial conditions, independently sampled from \(\mu _0\), and over the Wiener measure underlying the random noise, and observe that
Only when \({\mathbb {E}}\int _{0}^{T} \Vert {\mathbf {f}}_{\varphi -\phi }(\varvec{X}_t)\Vert ^2 dt > 0\) for any \(\varphi -\phi \ne 0\) can one ensure the uniqueness of the minimizer. This motivates us to propose the following coercivity condition, introduced in [9] in the case of non-stochastic systems:
Definition 3.1
(Coercivity condition) We say that the stochastic system defined in (1.1) satisfies a coercivity condition on a set \({\mathcal {H}}\) of functions on \({\mathbb {R}}_+\), with a constant \(c_{{\mathcal {H}}}>0\), if
for all \(\varphi \in {\mathcal {H}}\) such that \(\varphi (\cdot )\cdot \in L^2(\rho _T)\). Here \(|||\cdot|||\) denotes the norm defined in (2.11). We will denote by \(c_{{\mathcal {H}}}\) the largest constant for which the inequality holds and call it the coercivity constant.
The coercivity condition ensures identifiability of the kernel. We emphasize that the kernel is latent, in the sense that its values at \(\{r_{ii'}=\Vert \varvec{x}_{i'}-\varvec{x}_i\Vert \}\) cannot be determined from data. In fact, to recover \((\phi (r_{ii'}) ) \in {\mathbb {R}}^{\frac{N(N-1)}{2}}\) from the observed trajectories, even if we ignore the stochastic noise in the system and assume to have access to \({\mathbf {f}}_\phi (\varvec{x})\in {\mathbb {R}}^{dN}\), which consists of linear combinations of \((\phi (r_{ii'}) )\) with coefficients \(\varvec{r}_{ii'}=\varvec{x}_{i'}-\varvec{x}_i\), we face a linear system that is underdetermined as soon as dN (=number of known quantities) \(\le \frac{N(N-1)}{2}\) (=number of unknowns), i.e., for \(d<(N-1)/2\). Thus, in general the exact values of \(\phi \) at the locations \(\{r_{ii'}\}_{i,i'}\) cannot be determined. Furthermore, we have stochastic noise in the system. This suggests that the inverse problem of estimating the interaction kernel in a space of continuous functions is ill-posed. We will see that the coercivity condition ensures well-posedness in \(L^2(\rho _T)\), both in the sense of uniqueness and in the sense of stability.
The coercivity condition plays a key role in the learning of the kernel. Beyond ensuring learnability by guaranteeing the uniqueness of the minimizer over any compact convex set, it also enables us to control the error of the estimator by the discrepancy between the expectations of the error functionals, as shown in Proposition 3.1. We will use this property to establish the convergence of the estimators in later sections.
To simplify notation, we define a bilinear functional over \({\mathcal {H}}\) by
Proposition 3.1
Let \({\mathcal {H}}\) be a compact convex subset of \(L^{2}({\mathbb {R}}^+,\rho _T)\) and assume the coercivity condition (3.5) holds true on \({\mathcal {H}}\) with constant \(c_{{\mathcal {H}}}\). Then, the error functional \({\mathcal {E}}_{T,\infty }\) defined in (3.3) has a unique minimizer over \({\mathcal {H}}\) in \(L^2(\rho _T)\):
Moreover, for all \(\varphi \in {\mathcal {H}}\),
Proof
From Eq. (3.4), we have \({\mathcal {E}}_{T,\infty }(\varphi )-{\mathcal {E}}_{T,\infty }(\phi )=\langle \langle {\varphi -\phi , \varphi -\phi }\rangle \rangle \). Then,
where the inequality follows from the coercivity condition. Then, Eq.(3.8) follows once we notice that \(\langle \langle {\varphi -\widehat{\phi }_{T, \infty ,{\mathcal {H}}}, \widehat{\phi }_{T,\infty ,{\mathcal {H}}}-\phi }\rangle \rangle \ge 0\) by the convexity of \({\mathcal {H}}\). In fact, since \(t\varphi +(1-t) \widehat{\phi }_{T,\infty ,{\mathcal {H}}} \in {\mathcal {H}}\), \(\forall t \in [0, 1]\), we have \({\mathcal {E}}_{T,\infty }(t\varphi +(1-t) \widehat{\phi }_{T,\infty ,{\mathcal {H}}} ) - {\mathcal {E}}_{T,\infty }(\widehat{\phi }_{T,\infty ,{\mathcal {H}}}) \ge 0\) since \(\widehat{\phi }_{T,\infty ,{\mathcal {H}}}\) is a minimizer, and so, equivalently,
Sending \(t \rightarrow 0^+\), we obtain \(\langle \langle {\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}, 2\widehat{\phi }_{T,\infty , {\mathcal {H}}}-2\phi }\rangle \rangle \ge 0.\) \(\square \)
Well-conditioning from coercivity. When the hypothesis space \({\mathcal {H}}\) is a finite-dimensional linear space, the coercivity constant provides a lower bound for the smallest eigenvalue of the limit of the normal equations matrix \(A_{M,L}\) in Eq. (2.7) as \(M,L\rightarrow +\infty \). Therefore, when the sample size M is large and the observation frequency L is high, the matrix \(A_{M,L}\) is invertible with high probability (see Corollary 4.2 for details), and thus the coercivity condition ensures the uniqueness of the regularized MLE in Eq. (2.8):
Proposition 3.2
Suppose that the coercivity condition holds on \({\mathcal {H}}= \mathrm {span}\{\psi _1,\ldots , \psi _n\}\), where the basis functions satisfy \(\langle \psi _p(\cdot )\cdot ,\psi _{p'}(\cdot )\cdot \rangle _{L^2{(\rho _T)}}=\delta _{p,p'}\). Let \(A_{\infty }=\big (\langle \langle {\psi _{p},\psi _{p'}}\rangle \rangle \big )_{p, p'} \in {\mathbb {R}}^{n \times n}\) with the bilinear functional \(\langle \langle {\cdot ,\cdot }\rangle \rangle \) defined in (3.6). Then the smallest eigenvalue of \(A_{\infty }\) is \(\lambda _{\min }(A_{\infty }) = c_{{\mathcal {H}}}\,.\)
Proof
For an arbitrary \(a\in {\mathbb {R}}^n\), denoting \(\psi = \sum _{p=1}^n a_p \psi _p\), we have
where the first equality follows from the bilinearity of the functional \(\langle \langle {\cdot ,\cdot }\rangle \rangle \), and the inequality follows from the coercivity condition. Note that by the definition of the coercivity constant in (3.5), we have
which is attained at some \(\psi _* \in {\mathcal {H}}\) since \({\mathcal {H}}\) is finite dimensional. Hence, the inequality in (3.9) becomes an equality at \(\psi _*\), and the smallest eigenvalue of \(A_{\infty }\) is \(c_{\mathcal {H}}\). \(\square \)
Proposition 3.2 suggests that for the hypothesis space \({\mathcal {H}}\), it is important to choose a basis that is orthonormal in \(L^2(\rho _T)\), so as to make the matrix in the normal equations as well-conditioned as possible given the dynamics. In practice, the unknown \(\smash {\rho _T}\) is approximated by the empirical density \(\smash {\rho _T^{L,M}}\). Therefore, when using local basis functions, it is natural to use a partition of the support of \(\smash {\rho _T^{L,M}}\).
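For local basis functions this orthonormalization is a simple rescaling. The sketch below (with a synthetic sample of pairwise distances standing in for \(\rho _T^{L,M}\), and a quantile-based partition as one possible data-adaptive choice) rescales indicators \(\psi _p = c_p{\mathbf {1}}_{I_p}\) so that the functions \(\psi _p(r)\,r\) become orthonormal in the empirical \(L^2\) sense:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical pooled sample of pairwise distances, standing in for the
# empirical measure rho_T^{L,M}.
r_samples = rng.uniform(0.2, 3.0, size=5000)

n = 10                                                    # number of local basis functions
edges = np.quantile(r_samples, np.linspace(0, 1, n + 1))  # data-adaptive partition

# Normalize each indicator psi_p = c_p * 1_{I_p} so that psi_p(r) * r has unit
# empirical L^2 norm: c_p^2 * mean(r^2 * 1_{I_p}(r)) = 1.
bins = np.clip(np.searchsorted(edges, r_samples, side="right") - 1, 0, n - 1)
c = np.array([1.0 / np.sqrt(np.mean(r_samples**2 * (bins == p))) for p in range(n)])

# Gram matrix of {psi_p(r) r} under the empirical measure: identity by construction
# (disjoint supports give exact off-diagonal zeros).
Phi = np.stack([c[p] * (bins == p) * r_samples for p in range(n)])
G = Phi @ Phi.T / len(r_samples)
print(np.max(np.abs(G - np.eye(n))))   # ~ 0 up to floating-point error
```

With such a basis, the empirical normal matrix inherits the conditioning promised by Proposition 3.2, up to sampling fluctuations.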
The coercivity condition and positive integral operators. The coercivity condition introduces constraints on the hypothesis spaces and on the distribution of the solutions of the system, and it therefore naturally depends on the distribution \(\mu _0\) of the initial condition \(\varvec{X}_0\), the true interaction kernel \(\phi \), and the random noise. We briefly review below the recent developments in [37, 38], where the coercivity condition is proved to hold on any compact subset of \(L^2(\rho _T)\) for special classes of systems, such as linear systems and nonlinear systems with a stationary distribution. As discussed in [9, 40, 41] for the deterministic case, we believe that the coercivity condition is “generally” satisfied for “relevant” hypothesis spaces, with a constant independent of the number of particles N, thanks to the exchangeability of both the distribution of the initial conditions and that of the particles at any time t.
The coercivity condition is ensured by the positivity of integral operators that arise in the expectation in Eq. (3.5). More precisely, recall that the drift of the SDE is cyclic in the indices. Thus, the distribution of \(\varvec{X}_t\) is exchangeable and
for \(i\ne i'\), one has
One can rewrite Eq.(3.5) as
where the integral kernel \({\overline{K}}_T:{\mathbb {R}}^+\times {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) is defined as
with \(p_t(u,v)\) denoting the joint density function of the random vector \((\varvec{r}_{12}^t, \varvec{r}_{13}^t)\) and \(S^{d-1}\) denoting the unit sphere in \({\mathbb {R}}^d\). It is shown in [37, 38] that the associated integral operator defined by \({\overline{K}}_T\) is strictly positive definite, and therefore the coercivity condition holds, for a large class of systems with interaction kernels of the form \(\phi (r)= (a+r^\theta )^{\gamma -1}r^{\theta -2}\) with \(a\ge 0\) and \(\{ (\theta , \gamma ) \in (0,1] \times (1,2]: \theta \gamma >1 \}\).
Consistency and Rate of Convergence of the Estimator
In this section, we use a family of finite-dimensional linear spaces \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) as hypothesis spaces and establish the consistency and rate of convergence of our estimators, which are our main theorems for continuous-time observations. We assume that these spaces satisfy a Markov–Bernstein-type inequality: there exist \(c_1, \gamma >0\) such that, for all \(\varphi \in {\mathcal {L}}_n\),
This condition has a long history and a rich literature in classical approximation theory, where one studies which function spaces satisfy (3.11) (see, e.g., the survey paper [53]); this is an important step in establishing inverse approximation theorems. Inequalities of this kind hold on many function spaces that are commonly used as approximation spaces in practice, including:

\({\mathcal {L}}_n:\) trigonometric polynomials of degree n on \([0, 2\pi ]\) (similarly on [0, R]), for which \(\Vert \varphi '\Vert _{\infty }\le \frac{1}{2}({\mathrm {dim}({\mathcal {L}}_n)-1})\Vert \varphi \Vert _{\infty }\). This result dates back to Bernstein [5].

\({\mathcal {L}}_n\): the polynomial space consisting of all polynomials of degree less than \(n-1\) on [0, R] (see Theorem 3.3 in [48]), for which \(\Vert \varphi '\Vert _{\infty }\le \frac{2}{R}{(\mathrm {dim}({\mathcal {L}}_n)+1)^2}\Vert \varphi \Vert _{\infty }\). As a result, (3.11) also holds true for polynomial splines; other extensions include rational functions. We refer the reader to [30] for details.
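Such bounds are easy to verify numerically. As a minimal sketch (using the classical Markov inequality \(\Vert p'\Vert _{\infty }\le n^2\Vert p\Vert _{\infty }\) on \([-1,1]\), from which a bound on [0, R] follows by an affine change of variables), the Chebyshev polynomials \(T_n\) attain the bound:

```python
import numpy as np

# Markov's inequality: ||p'||_inf <= n^2 ||p||_inf on [-1, 1] for any
# polynomial p of degree n; equality is attained by the Chebyshev
# polynomial T_n, whose derivative peaks at the endpoints.
x = np.linspace(-1.0, 1.0, 200001)
for n in (3, 5, 8):
    Tn = np.polynomial.chebyshev.Chebyshev.basis(n)
    ratio = np.max(np.abs(Tn.deriv()(x))) / np.max(np.abs(Tn(x)))
    print(n, ratio)                 # equals n^2 up to grid/rounding error
    assert abs(ratio - n**2) < 1e-6
```

The quadratic growth of the derivative bound is exactly the factor \((\mathrm {dim}({\mathcal {L}}_n))^\gamma \) with \(\gamma =2\) entering (3.11) for polynomial spaces.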
If we choose a compact convex hypothesis set \({\mathcal {H}}_M\) contained in some \({\mathcal {L}}_n\), with a suitable correspondence between n and M, such that the distance between \({\mathcal {H}}_M\) and the true kernel \(\phi \) vanishes as M increases, the following consistency result holds:
Theorem 3.1
(Strong consistency of estimators) Suppose \(\phi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Let \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) satisfy (3.11) and
Let \(S_0 \ge S\) and \({\mathcal {H}}_M={\mathcal {B}}_{2S_0}^{\infty }({\mathcal {L}}_{n_M})=\{\varphi \in {\mathcal {L}}_{n_M}\,:\,\Vert \varphi \Vert _\infty <2S_0\}\) with \(\mathrm {dim}({\mathcal {L}}_{n_M})=n_M\) and \(\lim _{M\rightarrow \infty }\frac{n_M\log n_M}{M} =0\). Finally, suppose the coercivity condition holds true on \( \cup _{n}{\mathcal {L}}_{n}\). Then, we have
If we know the explicit approximation rate of the family \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\}\), then by carefully choosing the dimension of hypothesis spaces as a function of M, we can obtain a nearoptimal rate of convergence of our estimators.
Theorem 3.2
(Convergence rate of estimators) Suppose \(\phi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Assume that there exists a sequence of linear spaces \(\{{\mathcal {L}}_n: n\in {\mathbb {N}}^+\} \subset C^{1,1}[0,R]\) satisfying (3.11) with the properties

(i)
\(\mathrm {dim}({\mathcal {L}}_n)\le c_0 n\) for \(n \in {\mathbb {N}}^+\),

(ii)
\(\inf _{\varphi \in {\mathcal {L}}_n}\Vert \varphi -\phi \Vert _{\infty }\le c_2n^{-s}\).
For example, when \(\phi \in C^{k,\alpha }\) with \(s=k+\alpha \ge 2\), we may choose \({\mathcal {L}}_n\) to consist of polynomial splines of degree \(\lfloor s-1\rfloor \) with uniform knots on [0, R]. Let \({\mathcal {H}}_n={\mathcal {B}}_{S_0}^{\infty }({\mathcal {L}}_{n})\) with \(S_0=c_2+S\) and \(n \asymp \left( {M}/{\log M}\right) ^{{1}/{(2s+1)}}\), and assume that the coercivity condition holds on \({\mathcal {L}}:=\cup _{n}{\mathcal {L}}_n\) with a constant \(c_{{\mathcal {L}}}>0\). Then, we have
where C is a constant depending only on \(\sigma ,N, T, R, S_0\).
It is fruitful to compare (up to logarithmic terms) the rate \({2s}/({2s+1})\) to that of nonparametric one-dimensional regression, where one observes directly noisy values of the target function \(\phi \) at sample points drawn i.i.d. from \(\rho _T\). For the function space \(C^{k,\alpha }\), this rate is min–max optimal. Our numerical examples in Sect. 5 empirically validate the desired convergence rate for \(s=1,2\), where we use piecewise constant and piecewise linear polynomials. Note that in our setting, learning \(\phi \) is an inverse problem, as we do not directly observe the values \(\{\phi (\Vert \varvec{x}_{i',t}^{(m)} - \varvec{x}_{i,t}^{(m)}\Vert )\}_{ i, i' =1,\ m=1}^{N,N,M}\). While our result is stated in such a way that knowledge of s is required, in fact an upper bound \({{\tilde{s}}}\) is sufficient, as choosing sufficiently regular splines, of degree \(\lfloor {\tilde{s}}-1\rfloor \), would give the optimal s-dependent rate, at the cost of possibly larger constants. We also remark that we require neither that the underlying stochastic process satisfy mixing properties nor that it start from a stationary distribution. Obtaining this optimal convergence rate in M for short-time trajectory observations is therefore satisfactory. For long trajectories and under ergodicity assumptions, rates in terms of MT are likely obtainable: in Sect. 5, we present numerical evidence suggesting that the error does decrease with MT at a near-optimal rate.
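In practice, the prescription of Theorem 3.2 translates into a simple rule for the dimension of the hypothesis space. The sketch below (with an arbitrary proportionality constant, which the theorem leaves unspecified) computes \(n \asymp (M/\log M)^{1/(2s+1)}\) and the corresponding error rate \((\log M/M)^{s/(2s+1)}\):

```python
import math

def hypothesis_dim(M, s, c=1.0):
    """Dimension n ~ c (M / log M)^{1/(2s+1)}; the constant c is arbitrary here."""
    return max(1, round(c * (M / math.log(M)) ** (1.0 / (2 * s + 1))))

def error_rate(M, s):
    """Near-optimal rate (log M / M)^{s/(2s+1)} for the L^2(rho_T) error."""
    return (math.log(M) / M) ** (s / (2 * s + 1))

for M in (10**3, 10**4, 10**5):
    print(M, hypothesis_dim(M, s=2), error_rate(M, s=2))
```

Note how slowly n grows with M: for \(s=2\), a hundredfold increase in the number of trajectories only roughly doubles the optimal number of basis functions.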
Proof of the Main Theorems
We now present the proof of Theorem 3.2, which also yields the proof of Theorem 3.1. The main techniques include the Itô formula, concentration inequalities for unbounded random variables, and a generalization of a novel covering argument in [52] that enables us to deal efficiently with the fluctuations in the data due to the stochastic noise in the dynamics of the system.
One major obstacle in the nonasymptotic analysis of our regularized MLE estimators is the unboundedness of stochastic integrals of the form \(\frac{1}{T}\int _{0}^{T}\langle {\mathbf {f}}_{\varphi }(\varvec{X}_t), \mathrm{d}\varvec{X}_t\rangle \) appearing in the empirical error functional. Unlike in the deterministic case \(\sigma =0\), our empirical error functional \({\mathcal {E}}_{T,M}(\cdot )\) is in general not continuous over \({\mathcal {H}}\) with respect to the \(\Vert \cdot \Vert _{\infty }\) norm. In the following, we first leverage the general Itô formula described in Theorem A.3 to obtain a form of the empirical error functional that does not involve a stochastic integral and is amenable to analysis; we then show that it is continuous on \(C^{1,1}([0,R])\) with respect to the \(\Vert \cdot \Vert _{1,1}\) norm. Therefore, in the following preliminary results for the proofs of the main theorems, we consider the following generic hypothesis space:
Assumption 3.3
The hypothesis space \({\mathcal {H}}\) is a compact convex subset of \(C^{1,1}([0,R])\) with respect to the uniform norm \(\Vert \cdot \Vert _\infty \), with functions uniformly bounded by \(S_0\ge S\).
Lemma 3.1
Suppose \(\varphi \in {\mathcal {K}}_{R,S}\), the admissible set defined in (1.7). Let
then, we have, almost surely
Proof
Let \(g(\varvec{X})=V_{\varphi }(\varvec{X})\). Note that g is \(C^2\), with derivatives
Using Itô’s formula (Theorem A.3) for the Itô process \(g(\varvec{X}_t)\), the conclusion follows. \(\square \)
Proposition 3.3
Suppose \(\varphi _1, \varphi _2 \in {\mathcal {H}}\), then it holds almost surely that
where \(C_1=\frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\) and \(C_2=\frac{R}{2}\).
Proof
Note that
\(I_1\) satisfies
since \(\Vert {\mathbf {f}}_{\varphi }(\varvec{X}_t)\Vert \le \sqrt{N}R\Vert \varphi \Vert _{\infty }\). For \(I_2\), Lemma 3.1 yields
where we used
which follows from its definition in (3.12). Combining the estimates for \(I_1\) and \(I_2\), and using \(\Vert \varphi _1 +\varphi _2\Vert _{\infty }\le 2S_0\), we obtain
where \(C_1=\frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\) and \(C_2=\frac{R}{2}\). \(\square \)
When \(M=\infty \), i.e., we observe infinitely many trajectories, the expectation of our error functional \({\mathcal {E}}_{T,\infty }\), as in (3.3), does not involve the stochastic integral term. From the proof of Proposition 3.3, we see that it is continuous over \({\mathcal {H}}\) with respect to the \(\Vert \cdot \Vert _{\infty }\) norm:
Corollary 3.1
Suppose \(\varphi _1, \varphi _2 \in {\mathcal {H}}\), then, with \(C_1=\frac{2R^2S_0}{\sigma ^2}\), we have
Proof
Using (3.4), we obtain that
\(\square \)
Recall the definition
We now analyze the discrepancy between the empirical minimizer \(\widehat{\phi }_{T,M, {\mathcal {H}}}\) and \(\widehat{\phi }_{T,\infty , {\mathcal {H}}}\), which we call the sampling error (SE) in the diagram in Fig. 1. We introduce a measurable function on the path space by
for any \(\varphi \in {\mathcal {H}}\). From Proposition 3.1, we have
so \(D_{\varphi }\) in fact bounds (in expectation) the distance between \(\varphi \) and \(\widehat{\phi }_{ T,\infty , {\mathcal {H}}}\) with respect to the norm \({\left| \left| \left| \cdot \right| \right| \right| }\). We now perform a nonasymptotic analysis of \(D_{\varphi }\). We shall show that the random variable \(D_{\varphi }\) satisfies moment conditions sufficient to guarantee strong concentration about its expectation (Proposition 3.4). To do this, we decompose \(D_{\varphi }\) as the sum of a bounded component involving only time integrals and an unbounded component involving stochastic integrals:
We prove moment conditions independently for each of these components in the next two Lemmata.
Lemma 3.2
(Bounds on \(D_{\varphi }^{\mathrm {ubd}}\)) For \(\varphi \in {\mathcal {H}}\) and \(p=2,3,4,\dots \),
where \(C=(\frac{p(p-1)}{2})^{\frac{p}{2}}\frac{R^{p-2}}{\sigma ^{2p}(NT)^{\frac{p+2}{2}}}\).
Proof
First of all, note that
Therefore \({\mathbf {f}}_{\varphi -\widehat{\phi }_{T,\infty , {\mathcal {H}}}}(\varvec{X}_t)\) is an \(L^2\)-integrable process. Applying Theorem A.5, we obtain
with \(C_{p,T}=\big (\frac{p(p-1)}{2}\big )^{\frac{p}{2}}T^{\frac{p-2}{2}}\). The conclusion then follows by adding in the scaling factor \(\frac{1}{\sigma ^2 NT}\). \(\square \)
Lemma 3.3
(Bounds on \(D_{\varphi }^{\mathrm {bd}}\)) For \(\varphi \in {\mathcal {H}}\) and \(p = 2,3,4,\dots \),
Proof
From inequality (3.16) and the linear dependence of \({\mathbf {f}}_{\varphi }\) on \(\varphi \), we have
Therefore,
\(\square \)
Now we combine Lemmas 3.2 and 3.3 to prove the moment condition for \(D_{\varphi }\).
Lemma 3.4
(Moment conditions) For \(\varphi \in {\mathcal {H}}\), and every \(p=2,3,\ldots \), we have
where
with \(C_0=\sqrt{\frac{2 e^2R^2}{\sigma ^4NT}}, \) \(C_1 =\max _{p\ge 2} \frac{1}{\sqrt{2\pi p}NTR^2}\bigg (1+ \frac{c_{\sigma ,S_0,R}}{C_0^p} \bigg ),\) and \( c_{\sigma ,S_0,R}=\max _{p\ge 2} (\frac{8S_0}{\sigma ^2e})^p \frac{R^{2p-2}}{\sqrt{2\pi }p^{p+\frac{1}{2}}}. \)
Proof
The proof is based on Jensen's inequality, Lemma 3.2, and Lemma 3.3.
where the constants are
and the last inequality follows from Stirling's formula. \(\square \)
We now tie the discrepancy functionals for finite and infinite M:
Proposition 3.4
(Sampling Error bound) Let \(0<\delta <1\) and let \(\{\varphi _j\}_{j=1}^{{\mathcal {N}}}\) be an \(\eta \)-net of a compact convex hypothesis space \({\mathcal {H}} \subset Ball_{S_0}(L^{\infty }[0,R])\); that is, for any function \(\varphi \in {\mathcal {H}}\), there exists some j such that \(\Vert \varphi -\varphi _j\Vert _{\infty } \le \eta \). Denote
Then with probability at least \(1-\frac{\delta }{2}\), we have
for all j, where \(\epsilon _{M,\delta ,{\mathcal {N}}}=\frac{C}{M}\log (\frac{2{\mathcal {N}}}{\delta })\), \(C=2\frac{C_0^2C_1}{c_{\mathcal {H}}}+4C_0S_0\), and \(C_0,C_1\) as in (3.17), with \(c_{{\mathcal {H}}}\) the coercivity constant defined in (3.5).
Proof
For each \(\varphi _j \in {\mathcal {H}}\), recall from Eq. (3.15) that the coercivity condition on \({\mathcal {H}}\) implies
Then, Eq.(3.17) in Lemma 3.4 yields that
Therefore, the random variable \({D}_{\varphi _j}\) satisfies the moment condition in Corollary A.2, and so \(\forall \epsilon >0\)
We have \(K_{\varphi _j,{\mathcal {H}}}\le 2C_0S_0\) where \(C_0\) is defined in (3.17). Taking a union bound on all these events, over \(j \in \{1,2,\ldots ,{\mathcal {N}}\}\), we obtain that
Setting the right-hand side to be \(\frac{\delta }{2}\), we get \(\epsilon _{M,\delta ,{\mathcal {N}}}=\frac{C}{M}\log (\frac{2{\mathcal {N}}}{\delta })\), where \( C:=2\frac{C_0^2C_1}{c_{\mathcal {H}}}+4C_0S_0 \). Using the inequality \( \sqrt{\epsilon _{M,\delta ,{\mathcal {N}}}( \epsilon _{M,\delta ,{\mathcal {N}}}+D_{\varphi _j,\infty })}\le \epsilon _{M,\delta ,{\mathcal {N}}} +\frac{1}{2}D_{\varphi _j,\infty }, \) we conclude that with probability at least \(1-\frac{\delta }{2}\)
\(\square \)
Proof of Theorem 3.2
For \({\mathcal {H}}_n={\mathcal {B}}_{S_0}^{\infty }({\mathcal {L}}_n)\), let \(\{\varphi _j: j=1,\ldots , {\mathcal {N}}\}\) be an \(\eta \)-net of \({\mathcal {H}}_n\). Let
Then there exists \(\varphi _{j_M}\) in the net such that \(\Vert \varphi _{j_M}-{\widehat{\phi }}_{T,M,{\mathcal {H}}_n}\Vert _{\infty } \le \eta \); by Corollary 3.1,
On the other hand, since \({\mathcal {H}}_n \subset {\mathcal {L}}_n \subset C^{1,1}([0,R])\), thanks to the almost sure bound in Proposition 3.3 and the uniform bound \(\sup _{\varphi \in {\mathcal {L}}_n}\frac{\Vert \varphi '\Vert _{\infty }}{\Vert \varphi \Vert _{\infty } }\le c_1(\mathrm {dim}({\mathcal {L}}_n))^\gamma \) from assumption (3.11), we have, almost surely,
where \(C_1 = \frac{ R^2S_0}{\sigma ^2}+\frac{ R^2 }{2\sigma ^2T}+\frac{d}{2}\), \(C_2 =\frac{R}{2}\).
By Lemma 3.4, for each \(\eta >0\), with probability at least \(1-\frac{\delta }{2}\), (3.18) holds for this \(\eta \)-net \(\{\varphi _j: j=1,\ldots ,{\mathcal {N}}\}\). Combining (3.18) with (3.21) and (3.22), we conclude that, with probability at least \(1-\frac{\delta }{2}\),
Notice that \({\mathcal {D}}_{\smash {{\widehat{\phi }}_{T,M,{\mathcal {H}}_n},M}}\le 0\), so the above inequality implies that
Let \({\mathcal {S}}\) be a metric space and \(\eta >0.\) We define the covering number \({\mathcal {N}}({\mathcal {S}},\eta )\) to be the minimal number of disks in \({\mathcal {S}}\) with radius \(\eta \) covering \({\mathcal {S}}\). The covering number of \({\mathcal {H}}_n\) satisfies \({\mathcal {N}}({\mathcal {H}}_n, \eta ) \le \left( \frac{4S_0}{\eta }\right) ^{c_0n}\) (e.g., Proposition 5 in [20]). By the triangle inequality, we split the error we want to control into sampling error (SE) and approximation error (AE) (see Fig. 1)
From (3.23) and the coercivity condition (3.15), we obtain that, with probability at least \(1-\frac{\delta }{2}\),
Let \(\phi _{{\mathcal {H}}_n}:=\mathrm {argmin}_{\psi \in {\mathcal {H}}_n}\Vert \psi -\phi \Vert _{\infty }\). By the coercivity condition (3.15), we have
where we used
Therefore, we have
Now we combine the estimates (3.24), (3.25), and (3.26), let \(\eta =n^{-2s-\gamma }\) and \(n=(\frac{M}{\log M})^{\frac{1}{2s+1}}\), and note that \(c_{{\mathcal {L}}}=c_{\cup _n{\mathcal {L}}_n}=c_{\cup _n{\mathcal {H}}_n}\le c_{{\mathcal {H}}_n}=c_{{\mathcal {L}}_n}\le 1\) for all n. We obtain that, with probability at least \(1-\frac{\delta }{2}\), the following estimate holds true:
where \(\epsilon _{M,\delta ,{\mathcal {N}}}\) and C are as in (3.18), \(\{C_i\}_{i=0}^2\) are defined in (3.18), (3.21), and (3.22), respectively, and
The bound in expectation is obtained by standard techniques, writing
and splitting the integration interval into \([0, \frac{C_5}{c_{{\mathcal {L}}}} (\frac{\log M}{M})^{\frac{2s}{2s+1}}]\) and \([\frac{C_5}{c_{{\mathcal {L}}}} (\frac{\log M}{M})^{\frac{2s}{2s+1}}, \infty )\). On the first interval, we use \( {\mathbb {P}}\left\{ \left| \left| \left| {\widehat{\phi }}_{T,M,{\mathcal {H}}_n}-\phi \right| \right| \right| ^2 >\epsilon \right\} \le 1\). On the second one, we use a change of variables and the probability estimate (3.27). We obtain
where \(C_6\) is a constant depending only on \(\sigma , N, T, S_0, R\). \(\square \)
Proof of Theorem 3.1
In this proof, \(a \lesssim b\) means that there exists a constant c such that \(a\le cb\). For any \(\epsilon >0\), we claim
Strong consistency will then follow from the Borel–Cantelli lemma. Notice that
and \( {\mathbb {P}}\left\{ \left| \left| \left| \widehat{\phi }_{T,\infty ,{\mathcal {H}}_M}-\phi \right| \right| \right| ^2 \ge \frac{\epsilon }{2}\right\} =0\) when M is large enough (see (3.26)). It suffices to prove
Let \(\{\varphi _j\}_{j=1}^{{\mathcal {N}}}\) be an \(\eta \)-net for \({\mathcal {H}}_M\). Consider the event
The bound (3.20) in Proposition 3.4 yields
Using the fact that there exists \(j_M \in \{1,2,\ldots ,{\mathcal {N}}\}\) such that \(\Vert \phi -\varphi _{j_M}\Vert _{\infty }\le \eta \), and following the same argument as in (3.21) and (3.22), we obtain,
Notice that \({\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},M}\le 0\) and \( {\mathcal {D}}_{\widehat{\phi }_{T,M,{\mathcal {H}}_M},\infty } \ge c_{{\mathcal {H}}_M} \left| \left| \left| \widehat{\phi }_{T,M,{\mathcal {H}}_M}-\widehat{\phi }_{T,\infty ,{\mathcal {H}}_M} \right| \right| \right| ^2\), so that we have
Let \(\eta n_{M}^{\gamma }=\epsilon \), i.e., \(\eta = n_{M}^{-\gamma }\epsilon \). By assumption, \(c_{{\mathcal {H}}_M}\ge c_{\cup _M {\mathcal {H}}_M}>0\) and \(\lim _{M\rightarrow \infty }\frac{n_M\log n_M}{M}= 0\), so we have
when M is large enough. By the comparison test, the series
converges. The claim is proved. \(\square \)
Learning Theory: Discrete-Time Observations
In this section, we analyze the estimation error of the (regularized) MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\), defined in (2.5), for a finite-dimensional linear space \({\mathcal {H}}\) and for discrete-time observations. We show that it is of order \(\sqrt{\frac{n}{M}} + {\varDelta }t^{1/2}\) with high probability, where n is the dimension of \({\mathcal {H}}\) and \({\varDelta }t \) is the observation gap. As a consequence, the MLE is consistent as \(M\rightarrow \infty \) and \({\varDelta }t\rightarrow 0\), and it converges at a near-optimal rate when n is chosen optimally as in (2.13).
This section is organized as follows: we shall first prove the main theorem, Theorem 4.2, on the error bounds of the MLE in Sect. 4.1. We postpone the technical details, including concentration inequalities and discretization error bounds, to Sects. 4.2–4.3.
Recall that we denote by \(\varvec{X}_{[0,T]}\) the solution to system (1.1) with the true interaction kernel \(\phi \), and by \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) independent trajectories observed at discrete times \(t_l= l{\varDelta }t\) with \({\varDelta }t = T/L\). Recall that when \({\mathcal {H}}= \mathrm {span}\{\psi _p\}_{p=1}^n\), the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) is given in (2.8).
Throughout this section, we assume
Assumption 4.1
(Basis functions) Assume that the basis functions \(\{ \psi _p \}_{p=1}^n \subset C^1_b({\mathbb {R}}^+,{\mathbb {R}}) \) satisfy the following conditions:

(a)
\(\{\psi _p(\cdot )(\cdot )\}_{p=1}^n\) are orthonormal in \(L^2(\rho _T)\);

(b)
\(\max _p \Vert \psi _p\Vert _\infty \le b_0\), \(\max _p \Vert \psi '_p(\cdot )(\cdot )\Vert _\infty \le b_1\);

(c)
there exists a constant \(c_{\rho _T}\) such that \(n\le c_{\rho _T} \min (b_0^2R,b_1R^{3/2})\).
Item (a) aims to make the normal equations matrix nonsingular, as discussed in Proposition 3.2. In item (b), the uniform bound on the derivatives serves to control the discretization error due to discrete-time observations; the uniform boundedness of the functions will be used for concentration inequalities. Item (c) states that the number of such orthonormal basis functions is bounded in terms of the measure \(\rho _T\) and the uniform bounds on the functions and their derivatives. Item (c) often follows from (a)–(b), with intuition from examples including polynomials, trigonometric polynomials, and smoothed piecewise polynomials, if \(r^2\rho _T(\mathrm{d}r)\) is equivalent to the Lebesgue measure on an interval \([R_0,R]\subset {\mathbb {R}}^+\). Such an interval is the region that pairwise distances explore with noticeable probability (see, for example, Figs. 2 and 6). It exists in general when the initial distribution spreads out the pairwise distances or when the relative-position system is ergodic, since the density of \(\rho _T\) is smooth and nonnegative on \({\mathbb {R}}^+\).
Error Bounds for the MLE
We show first that the \(L^2(\rho _T)\) error of the estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) in (2.8) converges as \(M\rightarrow \infty \) and \({\varDelta }t : = T/L\rightarrow 0\), with high probability.
Theorem 4.2
(Error bounds for the MLE) Let the hypothesis space be \({\mathcal {H}}=\mathrm {span} \{\psi _i\}_{i=1}^n \), where the functions \(\{\psi _i\}_{i=1}^n\) satisfy Assumption 4.1 and are orthonormal in \(L^2(\rho _T)\) with respect to the norm \({\left| \left| \left| \cdot \right| \right| \right| }\) defined in Eq. (2.11). Suppose that the coercivity condition holds on \({\mathcal {H}}\) with a constant \(c_{\mathcal {H}}>0\). Then, with probability at least \(1-{(4n+2n^2) }\exp {\left( -\frac{\epsilon ^2 }{8c_1^2} \right) }\), the error of the estimator \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) in (2.8) satisfies
where \({\widehat{\phi }}_{T, \infty , {\mathcal {H}}} \) is the projection of the true kernel to \({\mathcal {H}}\), \({\varDelta }t = T/L \le c_{\mathcal {H}}/(2c_3)\), and the constants are
Remark 1
(The discretization error may dominate the statistical error) In the limit of vanishing observation gap \({\varDelta }t\rightarrow 0\), we recover the min–max learning rate \(M^{-\frac{s}{2s+1}}\) proved in the previous section, if we choose the optimal dimension \(n=C ( M/\log {M})^{1/(2s+1)}\) for the hypothesis space. However, when \({\varDelta }t>0\), once \(M^{-\frac{s}{2s+1}}(\log {M})^{-\frac{1}{2(2s+1)}} \lesssim {\varDelta }t ^{\frac{1}{2}}\), the discretization error dominates the error of the estimator, preventing us from observing the min–max learning rate. This phenomenon is well illustrated by the left plots in Figs. 5 and 9 in our numerical experiments.
Remark 2
We assumed \(C^1_b\) regularity for the basis functions \(\{\psi _p\}\) in the above error analysis, which is stronger than the regularity of the piecewise polynomials (possibly discontinuous) used in the numerical tests. This difference in regularity requirements stems from the numerical representation, and we may view the piecewise polynomials as numerical approximations of regular functions. This view is supported by the numerical experiments: the estimator has only small jumps at the discontinuities in the high-probability region.
Remark 3
A smaller coercivity constant \(c_{\mathcal {H}}\) corresponds to a worse-conditioned problem (Proposition 3.2), so it is natural that the condition \(L\gtrsim 1/c_{\mathcal {H}}\) requires L to be larger when \(c_{\mathcal {H}}\) is small.
We shall prove Theorem 4.2 as follows: we first outline the main idea and introduce the key elements, such as the normal matrices and vectors and the empirical error functionals in their entries; then, we provide a proof with key but technical estimations, including the concentration inequalities and discretization error bounds, postponed to Sects. 4.2–4.3.
The error of the MLE \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}} \) consists of three parts: approximation error, discretization error and sampling error:
where \( {\widehat{\phi }}_{L,T,\infty ,{\mathcal {H}}} \) is the infinitedata estimator. We shall study the discretization error and the sampling error by analyzing the differences between their coefficient vectors.
All these coefficients are solutions to the corresponding normal equations (e.g., Eq. (2.7)). To facilitate the study of these normal matrices and vectors, we introduce the following notions. For any \(f, g\in C^1_b({\mathbb {R}}^{Nd},{\mathbb {R}}^{Nd})\), we define the following functionals of the observation paths:
Correspondingly, we define the empirical functionals
Using the notation of the empirical functionals introduced in (4.4)–(4.5), we consider the following normal matrices and vectors:
It is clear that (here, to ease the notation, we denote the coefficient \( {\widehat{a}}_{L,T,M,{\mathcal {H}}}\) in Fig. 1 as \(a_{M,L}\), and similarly for others)
Here the matrix \(A_\infty \) is invertible due to the coercivity condition: its smallest eigenvalue is the coercivity constant \(c_{\mathcal {H}}\) (see Proposition 3.2). The matrix \(A_{\infty ,L}\) is invertible when \(L= T/({\varDelta }t)\) is large, with its smallest eigenvalue bounded below by \(c_{\mathcal {H}} - c_3 {\varDelta }t^{1/2}\); see Corollary 4.1. Furthermore, Corollary 4.2 in Sect. 4.3 shows that, with probability at least \(1-\delta \), the matrix \(A_{M,L}\) is invertible with its smallest eigenvalue bounded below by \(c_{\mathcal {H}} - \left( \sqrt{\frac{n}{M}}\epsilon + c_3{\varDelta }t ^{\frac{1}{2}}\right) \).
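To make this construction concrete, the following sketch assembles \(A_{M,L}\) and \(b_{M,L}\) from discrete-time trajectories and solves the normal equations; the \(1/(MTN)\) normalization and the 1/N factor in the drift are assumptions standing in for the exact conventions of Eqs. (1.1) and (2.7). In the noiseless Euler-discretized test at the end, the coefficients of the true kernel are recovered:

```python
import numpy as np

def f_drift(X, psi):
    """Drift f_psi(X): f_i = (1/N) sum_{i' != i} psi(r_ii') (x_i' - x_i).
    (The 1/N normalization is an assumed convention for Eq. (1.1).)"""
    N = X.shape[0]
    F = np.zeros_like(X)
    for i in range(N):
        for j in range(N):
            if i != j:
                r = X[j] - X[i]
                F[i] += psi(np.linalg.norm(r)) * r
    return F / N

def solve_normal_equations(trajs, dt, basis):
    """Assemble A_{M,L} and b_{M,L} from trajectories of shape (M, L+1, N, d)
    and solve A a = b for the coefficients of the estimated kernel."""
    M, Lp1, N, _ = trajs.shape
    L, n = Lp1 - 1, len(basis)
    A, b = np.zeros((n, n)), np.zeros(n)
    scale = 1.0 / (M * L * dt * N)          # assumed 1/(MTN) with T = L dt
    for m in range(M):
        for l in range(L):
            X, dX = trajs[m, l], trajs[m, l + 1] - trajs[m, l]
            F = [f_drift(X, psi) for psi in basis]
            for p in range(n):
                b[p] += scale * np.sum(F[p] * dX)
                for q in range(n):
                    A[p, q] += scale * dt * np.sum(F[p] * F[q])
    return np.linalg.solve(A, b)

# Noiseless (sigma = 0) Euler-discretized trajectories with phi(r) = 1 + 0.5 r:
rng = np.random.default_rng(0)
M, L, N, d, dt = 4, 10, 5, 2, 0.02
phi = lambda r: 1.0 + 0.5 * r
trajs = np.zeros((M, L + 1, N, d))
for m in range(M):
    trajs[m, 0] = rng.standard_normal((N, d))
    for l in range(L):
        trajs[m, l + 1] = trajs[m, l] + f_drift(trajs[m, l], phi) * dt
a_hat = solve_normal_equations(trajs, dt, [lambda r: 1.0, lambda r: r])
print(a_hat)   # ~ [1.0, 0.5] up to rounding: no noise, no model error
```

With \(\sigma >0\), the same solver yields the discrete-time MLE, and its error then exhibits the \(\sqrt{n/M}+{\varDelta }t^{1/2}\) behavior established in Theorem 4.2.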
Proof of Theorem 4.2
By Eq. (4.3), it suffices to prove upper bounds for the discretization error and the sampling error separately:
where the second inequality holds with probability at least \(1-\delta \).
For the discretization error, since \(\{\psi _i(\cdot )(\cdot )\}\) are orthonormal in \(L^2(\rho _T)\), we have, by Eq. (4.7):
Using the formula \(A_{\infty ,L}^{-1} - A_{\infty }^{-1} = A_{\infty ,L}^{-1}(A_\infty - A_{\infty ,L})A_\infty ^{-1} \) (see, e.g., [25, Appendix B9]), we have
Note that (i) \(\left\Vert A_{\infty ,L}^{-1} \right\Vert \le 2 c_{\mathcal {H}}^{-1}\), since \(c_3{\varDelta }t ^{1/2} < \frac{1}{2}c_{\mathcal {H}}\); (ii) by Proposition 4.1 (in Sect. 4.3) in combination with Assumption 4.1,
and (iii) \(\left\Vert A_\infty ^{-1} b_\infty \right\Vert = \left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| \). Then,
and the inequality for the discretization error follows.
Similarly, for the sampling error, we have
Note that (i) \(\left\Vert A_{M,L}^{-1} \right\Vert \le 2 c_{\mathcal {H}}^{-1}\) when M and L are large enough that \(\sqrt{\frac{n}{M}}\epsilon + c_3{\varDelta }t ^{\frac{1}{2}} < \frac{1}{2}c_{\mathcal {H}}\); (ii) we have, by Proposition 4.2 (in Sect. 4.3), that
hold with probability at least \(1-\delta \); (iii) since \( c_3{\varDelta }t ^{\frac{1}{2}} < \frac{1}{2}c_{\mathcal {H}}\) and \(\left\Vert A_{\infty }^{-1} b_{\infty } \right\Vert = \left| \left| \left| {\widehat{\phi }}_{T,\infty ,{\mathcal {H}}} \right| \right| \right| \), we have
Then, the inequality for the sampling error follows. \(\square \)
In the next two subsections, we prove Propositions 4.1–4.2, which we used in the above proof. Section 4.2 studies the concentration inequalities and discretization error bounds for the empirical error functionals in (4.5), which are the entries of the normal matrices and vectors. Based on them, Sect. 4.3 provides error bounds for the normal matrices and vectors in Propositions 4.1–4.2.
Concentration and Discretization Error of Empirical Functionals
We introduce concentration inequalities for the above empirical functionals on the path space of diffusion processes, and a bound on the discretization error of the estimator due to discrete-time approximations. Our first lemma studies the concentration of the discrete-time empirical functionals \(\xi _{M,L}\) and \(\eta _{M,L}\).
Lemma 4.1
(Concentration of empirical functionals) Let \(\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) be discrete-time observations, with \(t_l= l{\varDelta }t\) and \(L= T/{\varDelta }t\), of the system (1.1) with kernel \(\phi \). Then for any \(f,g\in C_b({\mathbb {R}}^{dN}, {\mathbb {R}}^{dN} )\), the error functionals defined in (4.5) satisfy the concentration inequalities:
for any \(\epsilon >0\), where \(C_1 = \frac{1}{N}\Vert f\Vert _\infty \max (\frac{2\sigma \sqrt{N}}{\sqrt{T}}, \, \Vert f_\phi \Vert _\infty )\), and \(C_2 =\frac{1}{N} \Vert f\Vert _\infty \Vert g\Vert _\infty \). Furthermore,
where \(C= \frac{1}{N}\Vert f\Vert _\infty \max (\frac{2\sigma }{\sqrt{T/N}}, \Vert f_\phi \Vert _\infty ,\, \Vert g\Vert _\infty )\).
Proof
Note that \(\left| \eta (f, g,\varvec{X}_{[t_0:t_L]})\right| \le \Vert f\Vert _\infty \Vert g\Vert _\infty \). Then, the exponential inequality for \(\eta _{M,L}\) follows from the Hoeffding inequality, which states that, for i.i.d. random variables \(\{Z_m\}\) bounded in absolute value by K, one has \( {\mathbb {P}}\left\{ {\left| \frac{1}{M}\sum _{m=1}^M (Z_m-{\mathbb {E}}Z_m) \right| > \epsilon }\right\} \le 2 \exp {\left( -\frac{M\epsilon ^2}{2K^2}\right) } \).
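The Hoeffding bound invoked here is easy to verify numerically. The following minimal sketch (with an illustrative bounded distribution, not the empirical functionals of the paper) compares the Monte Carlo tail probability with the bound \(2\exp (-M\epsilon ^2/(2K^2))\); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, eps = 200, 1.0, 0.3           # sample size, bound |Z| <= K, deviation level
trials = 20000

# Z uniform on [-K, K], so |Z| <= K and E Z = 0
Z = rng.uniform(-K, K, size=(trials, M))
dev = np.abs(Z.mean(axis=1))        # |1/M sum_m (Z_m - E Z_m)|

empirical = np.mean(dev > eps)      # Monte Carlo estimate of the tail probability
hoeffding = 2 * np.exp(-M * eps**2 / (2 * K**2))

print(empirical, hoeffding)         # the empirical tail sits below the bound
```

The bound is not tight for light-tailed summands, but it holds uniformly over all distributions bounded by K, which is all that is used in the proof.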
To study \(\xi _{ML}\), we decompose \(\xi (f,\varvec{X}_{[t_0:t_L]})\) into two parts, a bounded part and a martingale part:
where we denote \(f^L(s) :=\sum _{l=0}^{L-1} f(\varvec{X}_{t_l}) {\mathbf {1}}_{[t_l,t_{l+1}]}(s) \). We call \(Z_T\) a bounded part because
We call \(Y_T\) a martingale part since \(TNY_T= \int _0^T\langle f^L(s),\sigma \mathrm{d}\varvec{B}_s \rangle \) is a martingale. Correspondingly, we can write
Then, denoting \(K_1:=\frac{1}{N} \Vert f\Vert _\infty \Vert f_\phi \Vert _\infty \) and \(K_2= 2\sigma \Vert f\Vert _{\infty }\), and noticing that \(C_1= K_1 + K_2 /\sqrt{TN}\), we can conclude the first concentration inequality in (4.9) from
where the first exponential bound follows directly from the Hoeffding inequality applied to \(\{Z_T^{(m)}\}\), and the second exponential bound \( {\mathbb {P}}\left\{ { \left| \frac{1}{M}\sum _{m=1}^M Y_T^{(m)} \right| \ge \frac{\epsilon }{2}}\right\} \le e^{-\frac{M\epsilon ^2}{8K_2^2}}\) is proved as follows.
Note that \({\mathbb {E}}Y_T=0\) and \(TNY_T= \int _0^T\langle f^L(s),\sigma \mathrm{d}\varvec{B}_s \rangle \) is a martingale satisfying \({\mathbb {E}}[e^{\sigma ^2\int _0^T\left\Vert f^L(s)\right\Vert ^2\mathrm{d}s}] < \infty \) because \(\left\Vert f^L(s)\right\Vert ^2\le \Vert f\Vert _\infty ^2\). By the Novikov theorem, the process \(( e^{\alpha TNY_T - \frac{\alpha ^2}{2} \sigma ^2 \int _0^T \left\Vert f^L(s)\right\Vert ^2\mathrm{d}s }, T\ge 0)\) is a martingale for any \(\alpha \in {\mathbb {R}}\) (see, e.g., [31, Corollary 5.13]). Therefore, with \(\alpha = \frac{\lambda }{MTN}\), we have
for any \(\lambda >0\). As a consequence, we have
Lastly, Eq. (4.10) follows directly from Eq. (4.9). \(\square \)
We remark that here we focus on the case \(M\rightarrow \infty \) with finite time T. If the relative position system is ergodic, one can extend the concentration inequalities to the case when \(T\rightarrow \infty \).
The next lemma shows that the discretization error of the empirical functionals, as discrete-time approximations of the integrals, is of order \({\varDelta }t^{\frac{1}{2}}\).
Lemma 4.2
(Discretization error of empirical functionals) Let \(f,g\in C^1_b({\mathbb {R}}^{dN}, {\mathbb {R}}^{dN}) \). Let \(\varvec{X}_{t_0:t_L}\) be a discrete-time trajectory, with \(t_l=l{\varDelta }t\) and \(L= T/{\varDelta }t\), of the system (1.1) with \(\phi \). Then, the error functionals defined in (4.4) satisfy
where the constants are
Proof
Note that since \(\varvec{X}_{[0,T]}\) is a solution to system (1.1), we have for each l,
where in the first inequality we have applied the mean value theorem to bound \(f(\varvec{X}_{t_l}) - f(\varvec{X}_s)\):
and in the second inequality, we used the fact that
Thus, we obtain the bound in (4.11) by a summation over l:
Similarly, we have
and the bound for \(\eta \) follows from the fact that
\(\square \)
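The \({\varDelta }t^{1/2}\) order in Lemma 4.2 can also be observed numerically. The sketch below uses a simplified stand-in for the system: scalar Brownian paths in place of \(\varvec{X}_t\) and \(f=\sin \), comparing the Riemann–Itô sums defining \(\xi \) at several observation gaps against the finest discretization; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt, M = 1.0, 1e-3, 400
n = int(round(T / dt))
f = np.sin                        # a bounded C^1 test function

# Brownian paths as a simplified stand-in for X_t (pure diffusion, sigma = 1)
dB = rng.normal(0.0, np.sqrt(dt), size=(M, n))
X = np.concatenate([np.zeros((M, 1)), np.cumsum(dB, axis=1)], axis=1)

def xi(X, dB, stride):
    """Riemann-Ito sum (1/T) sum_l f(X_{t_l}) (X_{t_{l+1}} - X_{t_l})
    with observation gap stride*dt."""
    inc = dB.reshape(M, -1, stride).sum(axis=2)   # coarse increments of X
    left = X[:, :-1:stride]                       # left endpoints X_{t_l}
    return (f(left) * inc).sum(axis=1) / T

ref = xi(X, dB, 1)                # finest discretization, proxy for the integral
errs = [np.mean(np.abs(xi(X, dB, s) - ref)) for s in (10, 40, 200)]
print(errs)                       # grows roughly like (stride*dt)^(1/2)
```

Doubling the observation gap increases the mean discrepancy by roughly \(\sqrt{2}\), consistent with the \({\varDelta }t^{1/2}\) rate; the constants of Lemma 4.2 are not reproduced by this toy setting.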
Error Bounds for the Normal Matrix and Vector
Proposition 4.1
(Discretization error) For the normal matrix \(A_{\infty ,L}\) and vector \(b_{\infty ,L}\) defined in (4.6) with \(\{ \psi _p \}_{p=1}^n\) satisfying Assumption 4.1, we have
where the constant C is \( C = dN(b_1+ b_0) Rb_0 (Rb_0 {\varDelta }t ^{\frac{1}{2}} + \sqrt{d}\sigma ) \).
Proof
Applying Lemma 4.2, in combination with the basic facts that \(\Vert b\Vert \le \sqrt{n} \max _{k=1,\dots ,n} |b(k)|\) for any \(b\in {\mathbb {R}}^n\) and \(\Vert A\Vert \le \sqrt{n} \max _{k,k'=1,\dots ,n} |A(k,k')|\) for any \(A\in {\mathbb {R}}^{n\times n}\), we obtain
with constants \(C_1\) and \(C_2\) in the form of
To complete the proof, it remains to estimate \(\left\Vert f_{\psi _p} \right\Vert _\infty \) and \( \left\Vert \nabla f_{\psi _p} \right\Vert _\infty \). From the definition of \(f_\cdot \), we have
and \(\left\Vert f_{\phi } \right\Vert _\infty \le R b_0\sqrt{N}\) as well. Note that for each \(i,i'\in \{1,\ldots , N\}\), with the notation \(\varvec{r}_{ji}= \varvec{x}_j-\varvec{x}_i\) and \(r_{ji}= \left\Vert \varvec{r}_{ji} \right\Vert \), we have
Thus, the norm of this \(d\times d\) matrix is uniformly bounded,
and as a result, the norm of the \(dN \times dN\) matrix is uniformly bounded,
Combining the above estimates with \(\Vert f_\phi \Vert _\infty \le Rb_0\sqrt{N}\) (the same bound as for \(\Vert f_{\psi _p}\Vert _\infty \)), we obtain that \(C_1\) and \(C_2\) are both bounded by C. \(\square \)
It follows directly that the matrix \(A_{\infty ,L}\) is invertible:
Corollary 4.1
The smallest eigenvalue of the matrix \(A_{\infty ,L}\) defined in (4.6) is bounded below by \(c_{\mathcal {H}} - c_3{\varDelta }t^{1/2}\) when \(c_3 {\varDelta }t^{1/2}< c_{\mathcal {H}}\), with \(c_3\) defined in (4.2).
Proof
Recall that from Proposition 3.2, we have \(a^T A_{\infty } a \ge c_{{\mathcal {H}}} \Vert a\Vert ^2\) for an arbitrary \(a\in {\mathbb {R}}^n\). Then,
by Proposition 4.1 with the bound of \(\sqrt{n}\) in Assumption 4.1. \(\square \)
We prove next that the matrix \(A_{M,L}\) is invertible and concentrates around \(A_{\infty , L}\).
Proposition 4.2
(Concentration of the normal matrix and vector) Suppose that the coercivity condition holds on \({\mathcal {H}}=\mathrm {span} \{\psi _i\}_{i=1}^n\) with a constant \(c_{{\mathcal {H}}}>0\), where \(\{ \psi _p \}_{p=1}^n\) satisfy Assumption 4.1. Then, the normal matrix \(A_{M,L}\) and vector \(b_{M,L}\) defined in (4.6) satisfy concentration inequalities in the sense that for any \(\epsilon >0\),
where the constant C is \(C= R b_0 (R S_0+2\sigma /\sqrt{T})\).
Proof
Recall that by the definition in (4.6), \(b_{M,L}(k) = \xi _{M,L}(f_{\psi _k}) \) with \( {\mathbb {E}}[b_{M,L}] =b_{\infty ,L}\) and \(A_{M,L}(k,k') = \eta _{M,L}(f_{\psi _k}, f_{\psi _{k'} })\) with \( {\mathbb {E}}[A_{M,L}]=A_{\infty ,L} \). Lemma 4.1 implies that each of these entries concentrates around its mean:
where the constant C is obtained from (4.12). In combination with the basic facts that \(\Vert b\Vert \le \sqrt{n} \max _{k=1,\dots ,n} |b(k)|\) for any \(b\in {\mathbb {R}}^n\) and \(\Vert A\Vert \le \sqrt{n} \max _{k,k'=1,\dots ,n} |A(k,k')|\) for any \(A\in {\mathbb {R}}^{n\times n}\), we obtain
The third exponential inequality follows directly by combining the first two. \(\square \)
Corollary 4.2
Denote by \(\lambda _{\mathrm {min}}(A_{ML})\) the smallest eigenvalue of the normal matrix \(A_{ML}\) defined in (4.6). We have
with \(\delta =2n^2 \exp {\left( -\frac{M\epsilon _1^2 }{2nc_1^2} \right) } \), for any \(\epsilon _1>0\) and any \({\varDelta }t = T/L\) such that \( \epsilon _1 + c_3 {\varDelta }t ^{\frac{1}{2}}= \epsilon < c_{{\mathcal {H}}} \), where \(c_1 \) and \(c_3\) are defined in (4.2).
Proof
Note that for any \(a\in {\mathbb {R}}^n\) such that \(\Vert a\Vert =1\), we have, by Corollary 4.1, \(a^T A_{\infty , L}a \ge c_{{\mathcal {H}}} - c_3 {\varDelta }t ^{\frac{1}{2}} \). Meanwhile, Proposition 4.2 implies that
with probability at least \(1\delta \). Thus,
and the corollary follows. \(\square \)
Remark 4
The above corollary requires \(\epsilon > c_3 {\varDelta }t ^{\frac{1}{2}}\). This condition can be removed if the coercivity condition holds for the discrete-time observations on \({\mathcal {H}}\) with a constant \(c_{{\mathcal {H}},T,L}\), which can be tested numerically from a data set with a large M. In fact, we obtain directly from the above proof that \( {\mathbb {P}}\left\{ {\lambda _{\mathrm {min}}(A_{ML})> c_{{\mathcal {H}},T,L} - \epsilon }\right\} >1-\delta \) with \(\delta =2n^2 \exp {\left( -\frac{M\epsilon _1^2 }{2nc_1^2} \right) } \), for any \(\epsilon >0\).
Remark 5
In practice, the minimum eigenvalue of \(A_\infty \) may be small, due to redundancy of the local basis functions or because the coercivity constant on \({\mathcal {H}}\) is small. The smallest eigenvalue of \(A_{M,L}\) may then even vanish. On the other hand, these matrices are always symmetric and positive semi-definite, so it is advisable to regularize the inversion by using the pseudoinverse.
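A minimal illustration of this regularization, using NumPy's pseudoinverse on a rank-deficient but symmetric positive semi-definite matrix; the matrix here is synthetic, not a normal matrix produced by Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# A rank-deficient symmetric PSD "normal matrix", as may arise from
# redundant local basis functions: A = G G^T has rank 3 < 5
G = rng.normal(size=(5, 3))
A = G @ G.T
b = A @ rng.normal(size=5)        # consistent right-hand side (b in range(A))

# pinv gives the minimum-norm least-squares solution even though A is singular
a_pinv = np.linalg.pinv(A) @ b

# lstsq (SVD-based, like pinv) returns the same minimum-norm solution
a_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.linalg.matrix_rank(A))   # 3
```

A plain `np.linalg.solve(A, b)` would fail (or be wildly inaccurate) on such a matrix, whereas the pseudoinverse simply zeroes out the directions with vanishing singular values.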
Examples and Numerical Simulation Results
In this section, we present numerical experiments validating that our estimator, defined in (2.5) and implemented in Algorithm 1, behaves in practice as predicted by the theory. We consider two examples: a stochastic opinion dynamical system and a stochastic Lennard-Jones system, using observations from simulated data.
The setup for the numerical simulations is as follows. We simulate sample paths on the time interval [0, T] with the standard Euler–Maruyama scheme (see (2.3)), with a sufficiently small time step length \(\mathrm{d}t\). When observations are made at every time step, i.e., \({\varDelta }t = t_{l+1}-t_l = \mathrm{d}t\) for each l, we view \(\varvec{X}_{\mathrm {train},M}:=\{\varvec{X}^{(m)}_{t_0:t_L}\}_{m=1}^M\) as continuous-time trajectories. When observations are spaced in time with an observation gap \({\varDelta }t\) equal to an integer multiple of \(\mathrm{d}t\), we refer to them as discrete-time observations.
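For concreteness, the data-generation step can be sketched as below: an Euler–Maruyama simulation of a system of the form (1.1), with drift \(\frac{1}{N}\sum _j \phi (\Vert \varvec{x}_j-\varvec{x}_i\Vert )(\varvec{x}_j-\varvec{x}_i)\). The kernel, parameter values, and initial distribution are illustrative, not those of the experiments.

```python
import numpy as np

def simulate_em(phi, N=10, d=2, sigma=0.1, T=5.0, dt=0.001, seed=0):
    """Euler-Maruyama simulation of the first-order system
    dx_i = (1/N) sum_j phi(|x_j - x_i|)(x_j - x_i) dt + sigma dB_i."""
    rng = np.random.default_rng(seed)
    L = int(round(T / dt))
    X = rng.normal(size=(N, d))                  # illustrative mu_0: i.i.d. Gaussian
    traj = np.empty((L + 1, N, d))
    traj[0] = X
    for l in range(L):
        diff = X[None, :, :] - X[:, None, :]     # diff[i, j] = x_j - x_i
        r = np.linalg.norm(diff, axis=2)         # pairwise distances
        np.fill_diagonal(r, 1.0)                 # avoid r=0 on the diagonal; diff[i,i]=0
        drift = (phi(r)[:, :, None] * diff).mean(axis=1)
        X = X + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=(N, d))
        traj[l + 1] = X
    return traj

traj = simulate_em(lambda r: np.exp(-r))         # an illustrative bounded kernel
print(traj.shape)                                # (5001, 10, 2)
```

Subsampling `traj` at every k-th step then yields discrete-time observations with gap \({\varDelta }t = k\,\mathrm{d}t\), matching the convention above.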
From the observations, we construct the empirical probability measure \(\rho ^{L,M}_T\) (defined in (2.10)), and let \([R_{\min },R_{\max }]\) be its support. We choose hypothesis spaces \({\mathcal {H}}\) consisting of piecewise constant or piecewise linear polynomials on interval-based partitions of \([R_{\min },R_{\max }]\). This choice is dictated by the ease of obtaining an orthonormal basis for \({\mathcal {H}}\), the ease and efficiency of computation, and the ability to capture local features of the interaction kernel. To avoid discontinuities at the endpoints of the intervals in the partition and to reduce the stiffness of the equations of the system with the estimated interaction kernels, we interpolate the estimator linearly on a fine grid and extrapolate it with a constant to the left of \(R_{\min }\) and to the right of \(R_{\max }\). This post-processing procedure ensures the Lipschitz continuity of the estimators. We use the post-processed estimators to predict and generate the dynamics with the estimated interaction kernels.
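The post-processing step can be sketched as follows: given the coefficients of a piecewise-constant estimator on the bins of \([R_{\min },R_{\max }]\) (illustrative values below), we interpolate linearly between bin midpoints and extrapolate by constants; `np.interp` does the latter automatically by clamping to the endpoint values, so the result is Lipschitz.

```python
import numpy as np

# Illustrative coefficients of a piecewise-constant estimator on n uniform
# bins of the estimated support [R_min, R_max] (not values from the paper)
R_min, R_max, n = 0.5, 4.5, 8
phi_hat_vals = np.array([1.2, 0.9, 0.4, 0.1, -0.2, -0.1, 0.0, 0.05])

# bin midpoints serve as interpolation nodes
mids = R_min + (np.arange(n) + 0.5) * (R_max - R_min) / n

def phi_hat_smooth(r):
    """Linear interpolation between bin midpoints; constant extrapolation
    outside the support (np.interp clamps to the endpoint values)."""
    return np.interp(r, mids, phi_hat_vals)

fine = np.linspace(R_min - 1, R_max + 1, 1000)
vals = phi_hat_smooth(fine)
print(vals[0], vals[-1])          # constant extrapolation on both sides
```

The smoothed estimator is then used in place of the raw piecewise-constant one when simulating the dynamics with the estimated kernel.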
We mainly focus on the case where T is small and report on the results as follows:

Interaction kernel estimation. We compare \(\phi \) and \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\), the true and estimated interaction kernels (after smoothing), by plotting them side by side, superimposed with an estimate of \(\rho _T\), obtained as in (2.10) by using \(\smash {M_{\rho _T}}\) (\(\smash {M_{\rho _T}}\gg M\)) independent trajectories. The estimated kernel is plotted in terms of its mean and standard deviation, computed over 10 independent learning trials. To demonstrate the dependence of the estimator on the sample size and the scale of the random noise, we report the above for different values of M and \(\sigma \).

Trajectory prediction. In the spirit of Proposition 2.1, we compare the discrepancy between the true trajectories (evolved using \(\phi \)) and predicted trajectories (evolved using \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\)) on both the training time interval [0, T] and a future time interval \([T, T_{f}]\), over two different sets of initial conditions—one taken from the training data, and one consisting of new samples from \(\mu _0\). When simulating the trajectories for the systems driven by \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) using the EM scheme, we use the same initial conditions and the same realization of the random noise as in the trajectory of the system driven by \(\phi \). The mean trajectory error is estimated using M test trajectories (the same number as in the training data).

Rate of convergence. We report the convergence rate of \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) to \(\phi \) in the norm \(\left| \left| \left| \cdot \right| \right| \right| \) on \(\smash {L^2(\rho _T)}\) as the sample size M increases, with the dimension of \({\mathcal {H}}\) growing with M according to Theorem 3.2, for different scales \(\sigma \) of the random noise. We also investigate numerically the convergence rate when both T and M increase, with the dimension of the hypothesis space \({\mathcal {H}}\) set according to the effective sample size as discussed in Sect. 2.2.

Discretization errors from discrete-time observations. To study the discretization error due to discrete-time observations, we report the convergence rate (in M) of the estimators \({\widehat{\phi }}_{L,T,M,{\mathcal {H}}}\) obtained from data with different observation gaps \({\varDelta }t=T/L\). We also verify numerically that the \(\left| \left| \left| \cdot \right| \right| \right| \) error of the estimators increases with \({\varDelta }t\) as predicted by Theorem 4.2. These experiments are carried out for different values of the square root of the diffusion constant \(\sigma \).
We will report the conclusions of our experiments in Sect. 5.3.
Example 1: Stochastic Opinion Dynamics
We first consider a 1D system of stochastic opinion dynamics with interaction kernel
It is straightforward to see that \(\phi \) is in \(C_c^{1,1}([0,2])\) and nonnegative. Systems of this form are motivated by various applications, from biology to social science, where \(\phi \) models how the opinions of people influence each other (see [7, 11, 18, 33, 43] and references therein), and where one or several consensuses may eventually be reached. In the system we consider, each agent aligns its opinion more with its farther neighbors than with its closer neighbors: such interactions are called heterophilious. For deterministic systems of this type, [43] shows that the opinions of agents merge into clusters, with the number of clusters significantly smaller than the number of agents. This is natural, as increased alignment with farther neighbors increases mixing and consensus. In our stochastic setting, the random noise prevents the opinions from converging to single opinions. Instead, soft clusters form at large times; these are metastable states of the dynamics, i.e., states in which agents dwell for long times, rarely switching between them.
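A minimal simulation sketch of such heterophilious stochastic opinion dynamics in 1D follows; the kernel below is an illustrative heterophilious choice (stronger alignment with farther neighbors, support [0, 2]), not the kernel of the experiments, and all parameter values are illustrative.

```python
import numpy as np

def phi_op(r):
    """Illustrative heterophilious kernel: interactions with farther
    neighbors (1 <= r < 2) are stronger than with closer ones (r < 1)."""
    return np.where(r < 1.0, 0.2, 1.0) * (r < 2.0)

def opinions(N=20, sigma=0.05, T=20.0, dt=0.01, seed=3):
    """Euler-Maruyama for dx_i = (1/N) sum_j phi(|x_j-x_i|)(x_j-x_i) dt + sigma dB_i."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 4.0, size=N)           # initial opinions
    x = x0.copy()
    for _ in range(int(round(T / dt))):
        diff = x[None, :] - x[:, None]           # diff[i, j] = x_j - x_i
        drift = (phi_op(np.abs(diff)) * diff).mean(axis=1)
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
    return x0, x

x0, xf = opinions()
print(np.sort(xf))                # opinions contract into a few soft clusters
```

Since the kernel is nonnegative, the spread of opinions is (up to the small noise) non-increasing, and the final opinions concentrate around a few cluster centers.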
We study the performance of our estimators of the interaction kernel, from trajectory data. Table 3 summarizes parameters of the setup. In this example, we choose \({\mathcal {H}}_{n_M}\) to be the function space consisting of piecewise constant functions on \(n_M\) uniform partitions of the interval [0, 10].
Figure 2 shows that, as the number of trajectories increases, we obtain increasingly accurate approximations to the true interaction kernel, including at locations with sharp transitions of \(\phi \). The lack of artifacts at these locations is an advantage provided by the use of local basis functions. The estimators oscillate near 0, with amplitudes scaling with the level of noise. We believe the reason for this phenomenon is that, due to the structure of the equations, terms of the form \(\phi (0)\mathbf {0} = \mathbf {0}\) appear at, and near, 0, with a consequent loss of information about the interaction kernel near 0.
We then use the learned interaction kernels \({\widehat{\phi }}\) in Fig. 2 to predict the dynamics, summarizing the results in Fig. 3 and Table 4. Even with \(M=32\), our estimator produces very accurate approximations of the true trajectories, in both the training time interval [0, 5] and the future time interval [5, 50], including the number and locations of clusters and the time of their formation. As M increases to 4096, we obtain more accurate predictions of the locations of the clusters. We attribute this improvement to the better reconstruction of the estimators at locations near 0.
Next, we investigate the convergence rate of the estimators. It is well known in approximation theory (see Theorem 6.1 in [48]) that \(\inf _{\varphi \in {\mathcal {H}}_n}\Vert \varphi -\phi \Vert _{\infty } \le \text {Lip}[\phi ]n^{-1}\). With the dimension n proportional to \(\smash {(\frac{M}{\log M})^{\frac{1}{3}}}\), Fig. 4 shows that the learning rate in terms of M is around \(M^{-0.34}\), which matches the optimal min–max rate \(M^{-\frac{1}{3}}\) stated in Theorem 3.2 with \(s=1\).
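The dimension choice driving this rate can be computed as below; the proportionality constant 4 is the one used in our T-experiments, while for other runs it is illustrative.

```python
import numpy as np

def n_dim(M, s=1, c=4):
    """Hypothesis-space dimension n proportional to (M / log M)^(1/(2s+1)):
    s = 1 for piecewise-constant spaces, s = 2 for piecewise-linear ones."""
    return max(1, round(c * (M / np.log(M)) ** (1.0 / (2 * s + 1))))

for M in (32, 128, 512, 4096):
    print(M, n_dim(M, s=1), n_dim(M, s=2))
```

Balancing the approximation error \(n^{-s}\) against the sampling error \(\sqrt{n \log M / M}\) in this way yields the \(M^{-s/(2s+1)}\) rate of Theorem 3.2.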
We also study the convergence of the estimator as the length of the trajectory T increases, for the estimator \({\widehat{\phi }}_{T,M,{\mathcal {H}}}\) from continuoustime trajectories (i.e., without gaps between observations). The autocorrelation time for this system is estimated to be about \(\tau = 10\) time units. Therefore, we use relatively long trajectories up to \(T=1500\) time units to test the convergence, contributing up to about 150 effective samples. We set the dimension of the hypothesis space to be \(n=4(\frac{MT/\mathrm{d}t}{\log (MT/\mathrm{d}t)})^{\frac{1}{3}} \) for each pair (M, T), where \(\mathrm{d}t\) is the time step size of the Euler–Maruyama scheme. The convergence rate of the estimators in terms of MT is about 0.33, showing the equivalence of learning from a single long trajectory with multiple short trajectories when the underlying process is ergodic.
We also investigate the effects of the scale of the random noise, represented by the standard deviation \(\sigma \). Figure 2 shows that the estimators for the system with \(\sigma =0.5\) have much larger oscillations than those with \(\sigma =0.1\). The left plot in Fig. 4 shows that the scale of the random noise does not affect the learning rate, matching our theory. We also see that the absolute \(L^2(\rho _T)\) error of the estimators increases as the system noise increases; this may indicate that the coercivity constant decreases as the level of noise in the system increases. The left plot in Fig. 5 shows that the scale of the errors increases linearly in \(\sigma \) (in particular, when the observation gap is 1).
Finally, we study the discretization error due to approximation of the integral in the likelihood using discretetime observations. In the left plot of Fig. 5, as the observation gap k increases, the learning rate curves become flat, due to the error induced by discretization of the likelihood function (2.1). The right plot shows that the absolute error of the estimator is dominated by \(\sigma O(({\varDelta }t)^{1/2})\).
Example 2: Stochastic Lennard-Jones Dynamics
In this example, we consider the Lennard-Jones-type kernel \(\phi (r)=\frac{{\varPhi }'(r)}{r}\), with
for some \(p>q \in {\mathbb {N}}\). The system of particles is associated with a potential energy function depending only on the pairwise distances through \({\varPhi }\), and the evolution is driven by the minimization of this energy function. In particular, \(\epsilon \) represents the depth of the potential well, r is the distance between the particles, and \(r_m\) is the distance at which the potential reaches its minimum, where it takes the value \(-\epsilon \). The \(r^{-p}\) term is the repulsive term, describing Pauli repulsion at short ranges due to overlapping electron orbitals, while the \(r^{-q}\) term is the attractive long-range term. The corresponding system has wide applications in molecular dynamics and material science, where \(\phi \) models atom–atom interactions. Note that \(\phi \) is singular at \(r=0\): we truncate it at \(r_{\text {trunc}}\) by connecting it with an exponential function of the form \(a\exp (br^{12})\) so that it has a continuous derivative on \({\mathbb {R}}^+\).
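As a concrete instance, the classical 12–6 case (\(p=12\), \(q=6\)) in well-depth form, together with the induced kernel \(\phi (r)={\varPhi }'(r)/r\), can be sketched as below; the paper's general \((p,q)\) form is analogous, and the truncation near 0 described above is omitted here.

```python
import numpy as np

# Classical 12-6 Lennard-Jones potential in well-depth form:
# minimum value -eps attained at r = r_m (values illustrative)
eps, r_m = 1.0, 1.5

def Phi(r):
    s = r_m / r
    return eps * (s**12 - 2 * s**6)

def dPhi(r):
    """Analytic derivative Phi'(r)."""
    s = r_m / r
    return eps * (-12 * s**12 + 12 * s**6) / r

def phi(r):
    """Interaction kernel phi(r) = Phi'(r)/r; singular at r = 0
    (the paper truncates it near 0, not done here)."""
    return dPhi(r) / r

print(Phi(r_m), dPhi(r_m))        # well depth -eps, vanishing force at r_m
```

With this sign convention, \(\phi <0\) for \(r<r_m\) (repulsion) and \(\phi >0\) for \(r>r_m\) (attraction), matching the short-range repulsion and long-range attraction described above.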
In this system, the particle–particle interactions consist of short-range repulsions and long-range attractions. The short-range repulsion prevents the particles from colliding, while the long-range attraction keeps the particles in the flock. In the deterministic setting, the system evolves very quickly to equilibrium configurations, which are crystal-like structures whose pairwise distances correspond to local minimizers of the associated energy function. Tables 5 and 6 summarize the system and learning parameters.
Note that the true kernel \(\phi \) is not compactly supported. However, in our simulations, we observe the dynamics up to a time T which is a fraction of the equilibration time. Since the particles only explore a bounded region due to the long-range attraction, \(\rho _T\) is essentially compactly supported on a bounded region (see the histogram background of Fig. 6), on which \(\phi \) lies in our admissible space.
We use piecewise linear functions on n uniform partitions of the learning interval to approximate the true kernel \(\phi \). Figure 6 shows that, already with \(M=32\), we obtain faithful approximations to the true interaction kernel, except in regions close to 0. Increasing the number of observations improves the accuracy of the estimators at locations near 0, which appears to be particularly helpful for the system with the larger noise level.
In terms of trajectory prediction, we use the learned interaction kernels \({\widehat{\phi }}\) in Fig. 6 and summarize the results in Fig. 7 and Table 7. In the experiments, we study two cases: one with small random noise, where the particles still form an equilibrium configuration that fluctuates slightly in space; and one with a medium level of random noise, where the noise begins to break the formation of a fixed equilibrium configuration and we observe transitions between different configurations. In both cases, our estimators produce good predictions of the true dynamics in both the training and future time intervals.
We plot the convergence rate of the estimators in terms of M in the right plot of Fig. 8. In this case, we have \(\inf _{\varphi \in {\mathcal {H}}_n}\Vert \varphi -\phi \Vert _{\infty } \le \text {Lip}[\phi ']n^{-2}\). With the dimension n chosen proportional to \((\frac{M}{\log M})^{\frac{1}{5}}\), our numerical results show that the learning rate is around \(M^{-0.39}\), which matches the optimal min–max rate \(M^{-\frac{2}{5}}\) stated in Theorem 3.2.
We also study the convergence of the estimators as the length of the trajectory T increases. In this example with \(\sigma = 0.35\), the estimated autocorrelation time is about \(\tau = 10\) time units. Therefore, we use relatively long trajectories up to \(T=1200\) time units, contributing up to about 120 effective samples. We set the dimension of the hypothesis space to be \(n=4(\frac{MT/\mathrm{d}t}{\log (MT/\mathrm{d}t)})^{\frac{1}{5}} \) for each pair (M, T), where \(\mathrm{d}t\) is the time step size of the Euler–Maruyama scheme. The right plot of Fig. 8 shows that the rate is 0.39, indicating the equivalence between a single long trajectory and multiple short trajectories for inference.
Next, we investigate the effects of the scale of the random noise on learning. We observe phenomena similar to those in Example 1. Figure 6 shows that the estimators for the system with \(\sigma =0.25\) oscillate more than those with \(\sigma =0.05\) at locations near 0. The random noise does not affect the learning rates, as suggested by the left plot of Fig. 8. As the random noise increases, the absolute \(L^2(\rho _T)\) error of the estimators also increases, suggesting that the coercivity constant becomes smaller.
Finally, we study the effects of the discretization error induced by discrete-time observations. As the observation gap increases, the discretization errors flatten the learning rate curve in M; see the left plot of Fig. 8. Similarly to Example 1, the right plot of Fig. 8 shows that the absolute error of the estimator is of order close to the theoretical order \(\sigma O(({\varDelta }t)^{1/2})\) (Fig. 9).
Conclusions from the Numerical Experiments
The numerical results show that, in the case of continuous-time observations, the algorithm effectively estimates the interaction kernel, achieves the near-optimal learning rate in M, is robust to different magnitudes of the random noise, and the system with the estimated kernels accurately predicts trajectories. In the case of discrete-time observations, the estimator has an estimation error of order \({\varDelta }t^{1/2}\), due to the discretization error in the approximation of the likelihood ratio. These numerical results are in full agreement with the learning theory in Sects. 3–4:

In the case of continuous-time observations, the estimators in the 10 trials are faithful approximations of the true interaction kernels, with a mean close to the truth. The standard deviation of the estimators decreases as the sample size increases and grows as the diffusion constant increases.

The estimator from data achieves the min–max learning rate \((\log {M}/ M)^{s/(2s+1)}\) of Theorem 3.2 under the appropriate choice of the hypothesis spaces and of their dimension as a function of M. For \(\phi \) in \(C^{k+\alpha }\) with \(k+\alpha \ge 2\), the learning rate is around \(M^{-\frac{1}{3}}\) when using piecewise constant estimators (\(s=1\)) and around \(M^{-\frac{2}{5}}\) when using piecewise linear estimators (\(s=2\)), the latter being the min–max optimal rate for the case \(k+\alpha =2\).

The estimators predict the transient dynamics well in the training time interval, and the results validate Proposition 2.1: the trajectory discrepancy is controlled by the \(L^2({\rho _T})\) error of the estimators, demonstrating the effectiveness of the \(L^2(\rho _T)\) distance in quantifying the performance of estimators. In addition, the estimators predict in a remarkably accurate fashion the collective behavior of the particles over larger future time intervals, indicating that the bound in Proposition 2.1 may be overly pessimistic in some cases. Our intuition is that this benign phenomenon benefits from the large support of \(\rho _T\), encouraged by the randomness of the initial conditions and the presence of stochastic noise.

In the case of discrete-time observations with observation gap \({\varDelta }t\), the estimation error of the estimator is of order \({\varDelta }t^{1/2}\) and depends linearly on \(\sigma \), the square root of the diffusion constant. Therefore, as \({\varDelta }t\) increases, the discretization error dominates the estimation error, consistently with the learning theory in Sect. 4, which bounds the estimation error of the estimator by \(M^{-\frac{s}{2s+1}}+\sigma {\mathcal {O}}({\varDelta }t^{1/2})\).

When the length T of the trajectories increases, the optimal learning rate (in M) is still achieved. The estimation errors of the estimator exhibit a convergence rate around \((\frac{\log (MT)}{MT})^{s/(2s+1)}\) with \(s=1,2\) respectively, demonstrating an equivalence of “information” between few long trajectories and many short trajectories initiated at suitably random initial conditions, as discussed above in Sect. 2.3.
Final Remarks and Future Work
There are many avenues along which the present work could be extended.
The first notable extension is to heterogeneous particle systems with multiple types of particles, which arise in many applications. In this case, one assumes that there are different interaction kernels, modeling the nonsymmetric interactions between different types of particles. Examples of such systems are considered in [41] in the deterministic case, with the theoretical analysis carried out in [40], where the coercivity condition is generalized to the multiple-particle-type setting and (near-)optimal convergence rates of the estimators are established. We believe a similar extension is possible in the stochastic case, combining the ideas of this work and of [40].
Another notable extension is to second-order differential systems of interacting particles, or systems with external potentials, where interaction kernels of more general forms than those considered here arise. In the deterministic case, [41] considers examples of such systems, with a forthcoming theoretical analysis. In the stochastic case, the extension would require significant effort, especially for the important cases of systems with degenerate diffusion (e.g., stochastic Langevin equations). We also remark that in this work we do not observe velocities, as is done in the works just cited in the case of deterministic systems: here we fully take into account the discretization (in time) error, and if we let \(\sigma \rightarrow 0\), the results here would imply similar results in the deterministic case. Extending these considerations to second-order systems would be valuable.
Further work is also needed to formalize the considerations we put forward in Sect. 2.3 regarding ergodic systems, and to design robust and optimal algorithms in the regimes of observing a single long trajectory or many independent trajectories.
We assume in this work that all particles are observed. A desirable extension is to the case of partial observations of a subset of particles, or of macroscopic observations of the population density, which is a practical concern when the system is large, with millions of particles in high dimension. Since recovering the missing trajectories of unobserved particles is an ill-posed inverse problem [54], a new formulation based on the corresponding mean-field equations [27, 28, 43] is under investigation.
In this work, we assume that the noise coefficient is a known constant. There has, of course, been significant work on estimating the noise coefficient: for the case of interacting particle systems, see the recent work [26] and references therein; for model reduction of Langevin equations with state-dependent diffusion coefficients, see [19].
References
A. S. Baumgarten and K. Kamrin. A general constitutive model for dense, fineparticle suspensions validated in many geometries. Proc Natl Acad Sci USA, 116(42):20828–20836, 2019.
N. Bell, Y. Yu, and P. J. Mucha. Particlebased simulation of granular materials. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation  SCA ’05, page 77, Los Angeles, California, 2005. ACM Press.
S. Benachour, B. Roynette, D. Talay, and P. Vallois. Nonlinear selfstabilizing processes – I Existence, invariant probability, propagation of chaos. Stochastic Processes and their Applications, 75(2):173–201, 1998.
G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
S. Bernstein. Sur l’ordre de la meilleure approximation des fonctions continues par des polynômes de degré donné, volume 4. Hayez, imprimeur des académies royales, 1912.
P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov. Universal algorithms for learning theory part i: piecewise constant functions. J. Mach. Learn. Res., 6(Sep):1297–1321, 2005.
V. D. Blondel, J. M. Hendrickx, and J. N. Tsitsiklis. On Krause’s multi-agent consensus model with state-dependent connectivity. IEEE Transactions on Automatic Control, 54(11):2586–2597, 2009.
F. Bolley, I. Gentil, and A. Guillin. Uniform Convergence to Equilibrium for Granular Media. Arch Rational Mech Anal, 208(2):429–445, 2013.
M. Bongini, M. Fornasier, M. Hansen, and M. Maggioni. Inferring interaction rules from observations of evolutive systems I: The variational approach. Math. Models Methods Appl. Sci., 27(05):909–951, 2017.
D. R. Brillinger. Learning a potential function from a trajectory. In Selected Works of David Brillinger, pages 361–364. Springer, 2012.
C. Brugna and G. Toscani. Kinetic models of opinion formation in the presence of personal conviction. Phys. Rev. E, 92(5):052818, 2015.
J. Carrillo, R. McCann, and C. Villani. Kinetic equilibration rates for granular media and related equations: Entropy dissipation and mass transportation estimates. Rev. Mat. Iberoamericana, pages 971–1018, 2003.
P. Cattiaux, A. Guillin, and F. Malrieu. Probabilistic approach for granular media equations in the non-uniformly convex case. Probab. Theory Relat. Fields, 140(1–2):19–40, 2007.
D. Chen, Y. Wang, G. Wu, M. Kang, Y. Sun, and W. Yu. Inferring causal relationship in coordinated flight of pigeon flocks. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(11):113118, 2019.
X. Chen. Maximum likelihood estimation of potential energy in interacting particle systems from singletrajectory data. arXiv preprint arXiv:2007.11048, 2020.
A. Cohen, M. A. Davenport, and D. Leviatan. On the stability and accuracy of least squares approximations. Foundations of computational mathematics, 13(5):819–834, 2013.
F. Comte and V. Genon-Catalot. Nonparametric drift estimation for i.i.d. paths of stochastic differential equations. Accepted for publication in The Annals of Statistics, 2019.
I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433(7025):513–516, 2005.
M. C. Crosskey and M. Maggioni. ATLAS: A geometric approach to learning high-dimensional stochastic systems near manifolds. Journal of Multiscale Modeling and Simulation, 15(1):110–156, 2017. arXiv:1404.0667.
F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002.
F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
M. R. D’Orsogna, Y.-L. Chuang, A. L. Bertozzi, and L. Chayes. Self-propelled particles with soft-core interactions: patterns, stability, and collapse. Phys. Rev. Lett., 96:104302, 2006.
L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer, New York, 2002.
R. Hegselmann and U. Krause. Opinion dynamics and bounded confidence: models, analysis, and simulation. JASSS, 5(3):33, 2002.
N. J. Higham. Functions of matrices: theory and computation, volume 104. SIAM, 2008.
H. Huang, J.G. Liu, and J. Lu. Learning interacting particle systems: Diffusion parameter estimation for aggregation equations. Mathematical Models and Methods in Applied Sciences, 29(01):1–29, 2019.
P.-E. Jabin and Z. Wang. Mean field limit and propagation of chaos for Vlasov systems with bounded forces. Journal of Functional Analysis, 271(12):3588–3627, 2016.
P.-E. Jabin and Z. Wang. Quantitative estimates of propagation of chaos for stochastic systems with \({W}^{1,\infty }\) kernels. Invent. math., 214(1):523–591, 2018.
J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Number v. 160 in Applied Mathematical Sciences. Springer, New York, 2005.
S. Kalmykov, B. Nagy, and V. Totik. Bernstein- and Markov-type inequalities for rational functions. Acta Mathematica, 219(1):21–63, 2017.
I. Karatzas and S. E. Shreve. Brownian motion. In Brownian Motion and Stochastic Calculus, pages 47–127. Springer, 1998.
F. C. Klebaner. Introduction to stochastic calculus with applications. World Scientific Publishing Company, 2005.
U. Krause. A discrete nonlinear and nonautonomous model of consensus formation. Communications in difference equations, 2000:227–236, 2000.
Y. A. Kutoyants. Statistical Inference for Ergodic Diffusion Processes. Springer London, 2004.
D. A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
L. Li, Y. Li, J.-G. Liu, Z. Liu, and J. Lu. A stochastic version of Stein Variational Gradient Descent for efficient sampling. Commun. Appl. Math. Comput. Sci., 15(1):37–63, 2020.
Z. Li and F. Lu. On the coercivity condition in the learning of interacting particle systems. arXiv preprint arXiv:2011.10480, 2020.
Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. arXiv preprint arXiv:1608.04471, 2019.
F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in heterogeneous systems of agents from multiple trajectories. Journal of Machine Learning Research, 22(32):1–67, 2021.
F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc Natl Acad Sci USA, 116(29):14424–14433, 2019.
X. Mao. Stochastic differential equations and applications. Elsevier, 2007.
S. Motsch and E. Tadmor. Heterophilious Dynamics Enhances Consensus. SIAM Rev., 56(4):577–621, 2014.
R. Nickl and K. Ray. Nonparametric statistical inference for drift vector fields of multidimensional diffusions. Annals of Statistics, 48(3):1383–1408, 2020.
B. Øksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 6th edition, 2013.
R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, 49(9):1520–1533, 2004.
D. Pollard. Mini-book notes, 2000. http://www.stat.yale.edu/~pollard/Books/Mini/Basic.pdf.
L. Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.
A. V. Skorokhod. On the regularity of many-particle dynamical systems perturbed by white noise. Journal of Applied Mathematics and Stochastic Analysis, 9(4):427–437, 1996.
S. Almi, M. Fornasier, and R. Huber. Data-driven evolutions of critical points. Foundations of Data Science, 2(3):207–255, 2020.
M. B. Thompson. A comparison of methods for computing autocorrelation time. arXiv preprint arXiv:1011.0175, 2010.
C. Wang and D.-X. Zhou. Optimal learning rates for least squares regularized regression with unbounded sampling. Journal of Complexity, 27(1):55–67, 2011.
J. P. Ward. \(L^p\) Bernstein inequalities and inverse theorems for RBF approximation on \({\mathbb {R}}^d\). Journal of Approximation Theory, 164(12):1577–1593, 2012.
Z. Zhang and F. Lu. Cluster prediction for opinion dynamics from partial observations. IEEE Transactions on Signal and Information Processing over Networks, 2020.
M. Zhong, J. Miller, and M. Maggioni. Data-driven discovery of emergent behaviors in collective dynamics. Physica D: Nonlinear Phenomena, 411:132542, 2020.
Acknowledgements
We thank the anonymous referee for their helpful comments. FL and MM are grateful for partial support from NSF-1913243, FL for NSF-1821211; MM for NSF-1837991, NSF-1546392, AFOSR FA9550-17-1-0280 and the Simons Fellowship; ST for the AMS-Simons travel grant and a startup fund sponsored by the University of California, Santa Barbara.
Communicated by Hans Munthe-Kaas.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Preliminaries for SDEs
Let \((\varvec{X}_t,t\ge 0) \) be a stochastic process on \({\mathbb {R}}^n\) satisfying
$$\begin{aligned} \mathrm{d}\varvec{X}_t=V(\varvec{X}_t,t)\mathrm{d}t+\sigma (\varvec{X}_t,t)\mathrm{d}\varvec{B}_t. \end{aligned}$$(A.1)
We first review the existence and uniqueness of strong solutions for SDEs (see Theorem 5.4 in [32]).
Theorem A.1
(Existence and Uniqueness) If the following conditions are satisfied:

The coefficients V and \(\sigma \) are locally Lipschitz in \(\varvec{x}\), uniformly in t; that is, for every T and K, there is a constant C depending only on T and K such that for all \(\Vert \varvec{x}\Vert , \Vert \varvec{y}\Vert \le K\) and all \(0\le t \le T\)
$$\begin{aligned} \Vert V(\varvec{x},t)-V(\varvec{y},t)\Vert +\Vert \sigma (\varvec{x},t)-\sigma (\varvec{y},t)\Vert \le C\Vert \varvec{x}-\varvec{y}\Vert . \end{aligned}$$(A.2)
The coefficients satisfy the linear growth condition
$$\begin{aligned} \Vert V(\varvec{x},t)\Vert +\Vert \sigma (\varvec{x},t)\Vert <C(1+\Vert \varvec{x}\Vert ). \end{aligned}$$(A.3) 
\(\varvec{X}_0\) is independent of \((\varvec{B}_t,0\le t\le T),\) and \({\mathbb {E}}\Vert \varvec{X}_0\Vert ^2<\infty \).
Then, there exists a unique strong solution \(\varvec{X}_t\) of the SDE (A.1). \(\varvec{X}_t\) has continuous paths; moreover,
where the constant \(C_1\) depends only on C and T.
It is straightforward to show that \({\mathbf {f}}_{\phi }\) satisfies (A.2) and (A.3). Therefore, if \(\mu _0\) is independent of the underlying Brownian motion and has a finite second moment, then there exists a unique strong solution of system (1.1) up to time T for any \(\varvec{X}_0\) drawn from \(\mu _0\).
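Since system (1.1) admits a unique strong solution, its trajectories can be approximated numerically, for instance by the Euler–Maruyama scheme. The sketch below is illustrative only: the function name `simulate`, the choice \(\mu _0={\mathcal {N}}(0,I_d)\) for the initial distribution, and the \(1/N\) normalization of the interaction term are our assumptions, not prescribed by the text.

```python
import numpy as np

def simulate(phi, N=8, d=2, sigma=0.1, T=1.0, n_steps=200, seed=0):
    """Euler-Maruyama sketch of the first-order system (1.1),
    dx_i = (1/N) sum_j phi(||x_j - x_i||)(x_j - x_i) dt + sigma dB_i,
    assuming a 1/N normalization and X_0 drawn from a standard Gaussian."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal((N, d))               # initial condition ~ mu_0
    traj = [x.copy()]
    for _ in range(n_steps):
        diff = x[None, :, :] - x[:, None, :]      # diff[i, j] = x_j - x_i
        r = np.linalg.norm(diff, axis=2)          # pairwise distances
        w = np.where(r > 0, phi(np.maximum(r, 1e-12)), 0.0)  # no self-interaction
        drift = (w[:, :, None] * diff).sum(axis=1) / N
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal((N, d))
        traj.append(x.copy())
    return np.array(traj)                          # shape (n_steps + 1, N, d)
```

A bounded Lipschitz kernel such as `phi = lambda r: 1.0 / (1.0 + r**2)` satisfies the hypotheses of Theorem A.1, so the discrete trajectories stay well behaved as the step size shrinks.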
Theorem A.2
(Girsanov theorem) Let \(P_{\sigma }\) be the probability measure induced by the solution of the SDE (A.1) for \(t \in [T_0, T]\) and a fixed starting value at time \(T_0\), and let \(W_{\sigma }\) be the law of the corresponding driftless process. Suppose that \({\varSigma }=\sigma \sigma '\) is invertible and that V fulfills the Novikov condition
Then, \(P_{\sigma }\) and \(W_{\sigma }\) are equivalent measures with Radon–Nikodym derivative given by Girsanov’s formula
for all \(s \in [T_0, t]\) and \(\varvec{X}_{[T_0,s]}=(\varvec{X}_{t})_{t \in [T_0, s]}.\)
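For concreteness, Girsanov’s formula for this density can be written in the following standard form (see [31, Chapter 3.5], [45, Chapter 8.6]); the precise notation here is ours:

```latex
\frac{\mathrm{d}P_{\sigma}}{\mathrm{d}W_{\sigma}}\left(\varvec{X}_{[T_0,s]}\right)
  = \exp\left( \int_{T_0}^{s} \left\langle \varSigma^{-1} V(\varvec{X}_t,t),
      \mathrm{d}\varvec{X}_t \right\rangle
  - \frac{1}{2} \int_{T_0}^{s} \left\langle V(\varvec{X}_t,t),
      \varSigma^{-1} V(\varvec{X}_t,t) \right\rangle \mathrm{d}t \right)
```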
The proof of Theorem A.2 can be found in [31, Chapter 3.5], [45, Chapter 8.6].
Theorem A.3
(The Itô formula, see Theorem 4.1.2 in [45]) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a \(C^{2}\) map and \((\varvec{X}_t)\) a solution to (A.1) with constant \(\sigma \). Then, the process \(Y(t)=g(\varvec{X}_t)\) is an Itô process satisfying
Useful Inequalities
Theorem A.4
(Bernstein inequality for unbounded random variables) Let \(X_1, X_2,\ldots ,X_M\) be independent random variables with \({\mathbb {E}}(X_i)=0\). If for some constants \(K_1, v_1>0\), the bound \({\mathbb {E}}|X_i|^p \le \frac{1}{2} p! K_1^{p-2}v_1\) holds for every \(2 \le p \in {\mathbb {N}}\), then
For the proof of Theorem A.4, we refer to [4] and David Pollard’s book notes [47] (page 14).
Corollary A.1
Denote \({\mathcal {E}}_{M}(g)=\frac{1}{M} \sum _{m=1}^{M}g(X_m)\) for a measurable function g. If for some \(K_2, v_2 >0\), the bound
holds for \(2\le p \in {\mathbb {N}}\), then there holds
Proof
Applying Theorem A.4 to the random variable \({\mathbb {E}}g-g\), we immediately obtain the desired bound. \(\square \)
Corollary A.2
If for some \(K_3, v_3>0\), the bound
holds for \(2\le p \in {\mathbb {N}}\), then
Proof
If we replace \(\epsilon \) with \(\sqrt{\epsilon (\epsilon +{\mathbb {E}}g)}\) in (A.5) and set \(K_2=K_3\), \(v_2=v_3{\mathbb {E}}g\), the desired bound follows from the inequality
where the last inequality is true since \(\sqrt{\epsilon (\epsilon +{\mathbb {E}}g)}\le \epsilon +{\mathbb {E}}g\) for all \(\epsilon \ge 0\). \(\square \)
We also refer to [52] (see its Lemmas 3 and 5) for analogs of Corollaries A.1 and A.2.
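A Bernstein-type bound of this kind can be sanity-checked numerically. The sketch below is our illustration, not the paper's statement: we use the common tail form \(P(|\frac{1}{M}\sum _i X_i| \ge \epsilon ) \le 2\exp (-M\epsilon ^2/(2(v_1+K_1\epsilon )))\) and centered Exp(1) variables, for which the moment condition of Theorem A.4 holds with \(K_1=1\), \(v_1=3\) (since \({\mathbb {E}}|X|^p \le p!+1 \le \frac{3}{2}p!\) for \(p\ge 2\)).

```python
import numpy as np

def bernstein_bound(M, eps, K1, v1):
    # Common Bernstein tail form (our assumed shape of the bound):
    # P(|(1/M) sum_i X_i| >= eps) <= 2 exp(-M eps^2 / (2 (v1 + K1 eps)))
    return 2.0 * np.exp(-M * eps**2 / (2.0 * (v1 + K1 * eps)))

rng = np.random.default_rng(1)
M, eps, trials = 200, 0.5, 2000
samples = rng.exponential(1.0, size=(trials, M)) - 1.0   # centered Exp(1)
emp_tail = np.mean(np.abs(samples.mean(axis=1)) >= eps)  # empirical tail frequency
bound = bernstein_bound(M, eps, K1=1.0, v1=3.0)
```

On these parameters the empirical tail frequency stays (far) below the theoretical bound, as the theorem guarantees.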
Theorem A.5
(Moment inequality for stochastic integrals, see Theorem 7.1 in [42]) Let \({\mathcal {M}}^2([0, T]; {\mathbb {R}}^{n \times m})\) denote the family of all \(n\times m\)-matrix-valued measurable \(\{{\mathcal {F}}_t\}_{t\ge 0}\)-adapted processes \(f=\{(f_{ij}(t))_{n \times m}\}_{0\le t\le T}\) such that \({\mathbb {E}}\int _{0}^{T}\Vert f(t)\Vert ^2\mathrm{d}t <\infty .\) If \(p \ge 2\) and \( f \in {\mathcal {M}}^2([0, T]; {\mathbb {R}}^{n \times m} )\) is such that
then
In particular, for \(p=2\), there is equality.
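The \(p=2\) equality is the Itô isometry, \({\mathbb {E}}|\int _0^T f\,\mathrm{d}B_t|^2={\mathbb {E}}\int _0^T |f(t)|^2\mathrm{d}t\), which is easy to verify by Monte Carlo. The sketch below is an illustration with the deterministic integrand \(f(t)=t\) on \([0,1]\); the discretization and sample sizes are our choices.

```python
import numpy as np

# Monte Carlo check of the p = 2 case (Ito isometry) for f(t) = t on [0, 1]:
# E |int_0^1 t dB_t|^2 should match int_0^1 t^2 dt = 1/3.
rng = np.random.default_rng(2)
T, n_steps, n_paths = 1.0, 400, 5000
dt = T / n_steps
t = np.arange(n_steps) * dt                   # left endpoints (Ito convention)
dB = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
stoch_int = (t[None, :] * dB).sum(axis=1)     # discretized stochastic integral
lhs = np.mean(stoch_int**2)                   # Monte Carlo estimate of E |int t dB|^2
rhs = (t**2).sum() * dt                       # Riemann sum for int_0^1 t^2 dt
```

With 5000 sample paths the two sides agree to within a few percent, consistent with the stated equality.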
Proof of Proposition 2.1
Proof of Proposition 2.1
For ease of notation, in this proof we use \({\mathbb {E}}\) to represent \({\mathbb {E}}_{\mu _0,\varvec{B}}\). For every \(t \in [0,T]\), we have
Letting \( \varvec{x}_{ji}(s) := \varvec{x}_j(s) - \varvec{x}_i(s)\), \( {\widehat{\varvec{x}}}_{ji}(s) := {\widehat{\varvec{x}}}_j(s)-{\widehat{\varvec{x}}}_i(s)\), and \(F_{\varphi }(\varvec{x})=\varphi (\Vert \varvec{x}\Vert )\varvec{x}\), for \(\varphi \in {\mathcal {K}}_{R,S}\) and \(\varvec{x}\in {\mathbb {R}}^d\),
Then, an application of Gronwall’s inequality yields the estimate
Note that by Jensen’s inequality,
Then, the conclusion follows by combining with the estimate \(\text {Lip}(F_{[\widehat{\phi }]})\le (R+1)S\). \(\square \)
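The Gronwall-type estimate of Proposition 2.1 can be probed numerically by evolving two copies of the system with kernels \(\phi \) and \(\widehat{\phi }\), coupled through the same Brownian increments and initial condition, and monitoring their distance. The sketch below is our illustration: the function name `coupled_error`, the \(1/N\) normalization, and all parameter values are assumptions.

```python
import numpy as np

def coupled_error(phi, phi_hat, N=6, d=2, sigma=0.1, T=1.0, n_steps=200, seed=3):
    """Evolve two copies of the system with kernels phi and phi_hat, driven by
    the SAME Brownian increments and initial condition; return
    max_t (1/N) sum_i ||x_i(t) - xhat_i(t)||, a Monte Carlo proxy for the
    Gronwall-type estimate in Proposition 2.1."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal((N, d))
    xh = x.copy()                                  # identical initial condition
    max_err = 0.0
    for _ in range(n_steps):
        dB = np.sqrt(dt) * rng.standard_normal((N, d))   # shared noise increment
        for state, ker in ((x, phi), (xh, phi_hat)):
            diff = state[None, :, :] - state[:, None, :]  # diff[i, j] = x_j - x_i
            r = np.maximum(np.linalg.norm(diff, axis=2), 1e-12)
            drift = (ker(r)[:, :, None] * diff).sum(axis=1) / N
            state += drift * dt + sigma * dB       # in-place Euler-Maruyama step
        max_err = max(max_err, np.linalg.norm(x - xh, axis=1).mean())
    return max_err
```

With identical kernels the coupled trajectories coincide exactly, while a small uniform perturbation of the kernel produces a proportionally small trajectory error, in line with the proposition.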
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Lu, F., Maggioni, M. & Tang, S. Learning Interaction Kernels in Stochastic Systems of Interacting Particles from Multiple Trajectories. Found Comput Math 22, 1013–1067 (2022). https://doi.org/10.1007/s1020802109521z
Keywords
 Inverse problems
 Interacting particle systems
 Statistical and machine learning
Mathematics Subject Classification
 70F17
 62G05
 62M05