1 Introduction

Physical, biological, and social systems across all scales of complexity and size can often be described as dynamical systems written in terms of interacting agents (e.g. particles, cells, humans, planets,...). Rich theories have been developed to explain the collective behavior of these interacting agents across many fields including astronomy, particle physics, economics, social science, and biology. Examples include predator–prey systems, molecular dynamics, coupled harmonic oscillators, flocking birds or milling fish, human social interactions, and celestial mechanics, to name a few. In order to encompass many of these examples, we will consider a rather general family of second-order, heterogeneous (the agents can be of different types), interacting (the acceleration of an agent is a function of properties of the other agents) agent systems that includes external forces, masses of the agents, multivariable interaction kernels, and an additional environment variable that is a dynamical property of the agent (for example, a firefly having its luminescence varying in time). We propose a learning approach that combines machine learning and dynamical systems in order to provide highly accurate dynamical models of the observation data from these systems.

The model and learning framework presented in Sects. 2–4 includes a very large number of relevant systems and allows for their modeling. Opinion dynamics [8, 21, 41, 53] is a simple first-order case that exhibits clustering. Flocking of birds [22, 23, 26] can be modeled as the behavior of a second-order system that exhibits an emergent shared velocity of all agents. Milling of fish [1, 2, 19, 20] may be modeled as the large-time behavior of a second-order system (in 2 or 3 dimensions) with a non-collective force from the environment. A model of oscillators (fireflies) that sync and swarm together, and have their dynamics governed by their positions and a phase variable \(\xi \), was studied in [54, 55, 56, 66]. There are also models that include both energy and alignment interaction kernels; a particular case is the anticipation dynamics model from [65], which we also consider in this work. These dynamics exhibit a wide range of emergent behaviors, and as shown in [4, 18, 23, 35, 53, 67, 72], the behaviors can be studied when the governing equations are known. However, if the equations are not known and the data consists only of trajectories, we still wish to develop a model that can make accurate predictions of the trajectories and discover a dynamical form that accurately reflects their emergent properties. To achieve this, we present a provably optimal learning algorithm that is accurate, captures emergent behavior for large time, and, by exploiting the structure of the collective dynamical system, avoids the curse of dimensionality.

Our learning approach discovers the governing laws of a particular subset of dynamical systems of the form,

$$\begin{aligned} {{\dot{\varvec{Y}}}}(t) = \varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}(\varvec{Y}(t)), \quad \varvec{Y}(0) = \varvec{Y}_0 \in {\mathbb {R}}^D, \quad t \in [0, T]. \end{aligned}$$

The learning problem is to infer the right-hand side function \(\varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}\) from observations \(\{\varvec{Y}^{(m)}_{t_l}, \smash {{{\dot{\varvec{Y}}}}^{(m)}_{t_l}}\}_{m=1,l= 1}^{M,L}\) of the dynamical system, where m indexes different trajectories, started from initial conditions (IC) sampled i.i.d. from a measure \(\varvec{\mu ^{\varvec{Y}}}\) on the state space. Here M is the total number of trajectories observed, with each trajectory forming a single observation (M plays a fundamental role in the learning theory, where we study convergence as M varies); L refers to the number of observations at different times along each trajectory. Throughout this work, m will index the trajectories \(1,\ldots , M\) and l will index the points in time \(1, \ldots , L\). The main difficulties in establishing an effective theory of learning \(\varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}\) are the curse of dimensionality caused by the dimension of \(\varvec{Y}\), namely \(D = N(2d + 1)\), where N is the number of agents and d the dimension of physical space, and the dependence within the observation data: for example, \(\varvec{Y}(t_{l + 1})\) is a deterministic function of \(\varvec{Y}(t_{l})\).

We present a learning approach based on exploiting the structure of collective dynamical systems and nonparametric estimation techniques (see [6, 24, 31, 36, 69]) where we recover the interaction kernels \(\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }\). A simplified form of our model equations, generalizing the first order models, is derived from Newton’s second law and given by: for \(i = 1, \ldots , N\)

$$\begin{aligned} m_i\ddot{\varvec{x}}_i(t)&= F^{{{\dot{\varvec{x}}}}}(\varvec{x}_i, {{\dot{\varvec{x}}}}_i) + \frac{1}{N} \sum _{i' = 1}^N \Big [ \phi ^E(\left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| )(\varvec{x}_{i'}(t) - \varvec{x}_i(t)) \nonumber \\&\quad + \phi ^A(\left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| )({\dot{\varvec{x}}}_{i'}(t) - {\dot{\varvec{x}}}_i(t)) \Big ] . \end{aligned}$$
(1.1)

Here, \(m_i\) is the mass of the ith agent, \(\varvec{x}_i\) is its position, \(F^{{{\dot{\varvec{x}}}}}\) is a non-collective force, and \(\phi ^E, \phi ^A: {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) are known as the interaction kernels.
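To make the structure of (1.1) concrete, the following is a minimal simulation sketch with forward Euler, assuming simple illustrative choices of \(\phi ^E\), \(\phi ^A\), and \(F^{{{\dot{\varvec{x}}}}}\) (constant attraction, constant alignment, no external force); these choices are not taken from the paper's examples.

```python
import numpy as np

# Minimal forward-Euler sketch of the simplified model (1.1). The kernel
# choices below (constant attraction phi_E, constant alignment phi_A, zero
# non-collective force F) are illustrative, not the paper's examples.
def step(X, V, m, phi_E, phi_A, F, dt):
    """One Euler step for m_i x_i'' = F + (1/N) sum_{i'} [phi_E*dx + phi_A*dv]."""
    N = X.shape[0]
    A = F(X, V).copy()                     # non-collective force F^{xdot}
    for i in range(N):
        dx = X - X[i]                      # x_{i'} - x_i for all i'
        dv = V - V[i]                      # xdot_{i'} - xdot_i
        r = np.linalg.norm(dx, axis=1)     # pairwise distances r_{ii'}
        # the i' = i summand vanishes since dx[i] = dv[i] = 0
        A[i] += (phi_E(r)[:, None] * dx + phi_A(r)[:, None] * dv).sum(0) / N
    A /= m[:, None]                        # divide by the masses m_i
    return X + dt * V, V + dt * A

rng = np.random.default_rng(0)
N, d, dt = 5, 2, 0.01
X0 = rng.normal(size=(N, d))
V0 = rng.normal(size=(N, d))
m = np.ones(N)
phi_E = lambda r: np.ones_like(r)          # linear attraction to the group
phi_A = lambda r: np.ones_like(r)          # Cucker-Smale-style alignment
F = lambda X, V: np.zeros_like(V)          # no external force
X, V = X0.copy(), V0.copy()
for _ in range(200):
    X, V = step(X, V, m, phi_E, phi_A, F, dt)
```

With these antisymmetric interactions the mean velocity is conserved, while the agents' deviations from the group mean decay, a toy instance of flocking.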

To use the trajectory data to derive estimators, we consider appropriate hypothesis spaces in which to build our estimators, measures adapted to the dynamics, norms, and other performance metrics, and ultimately an inverse problem built from these tools. More specifically, let \(\widehat{\varvec{\phi }}^{EA}\) denote the direct sum of the kernels \(\widehat{\varvec{\phi }}^{E}\oplus \widehat{\varvec{\phi }}^{A}\) (for the notation, see Sect. 3), and define our estimator as

$$\begin{aligned} \widehat{\varvec{\phi }}^{EA}:= \underset{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}}{{\text {arg}}{\text {min}}}\;\varvec{{\mathcal {E}}}_{M}^{EA}(\varvec{\varphi }^{EA}), \end{aligned}$$

where \(\varvec{{\mathcal {E}}}_{M}^{EA}\) is an empirical error functional depending on the observation data, \(\varvec{{\mathcal {H}}}^{EA}\) is a hypothesis space over which to search for our estimators, and, due to the form of the error functional, the estimator is computed as the solution of a constrained least squares problem. Once we have obtained this estimated interaction kernel, we study its properties as a function of the amount of trajectory data we receive, namely the M trajectories sampled from different initial conditions of the same underlying system, each consisting of L time observations along the trajectory. Here we study properties of the error functional, establish the uniqueness of its minimizers, and use the probability measures to define a dynamics-adapted norm to measure the error of our estimators over the hypothesis spaces. In comparing the estimators to the true interaction kernels, we first establish concentration estimates over the hypothesis space.
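A small sketch of this pipeline, assuming a first-order special case with a single radial kernel and a piecewise-constant hypothesis space (all concrete choices are illustrative, not the paper's implementation): since the empirical error is quadratic in the basis coefficients, the minimization above reduces to a linear least squares solve.

```python
import numpy as np

# Sketch of the least-squares viewpoint: for the first-order special case
#   xdot_i = (1/N) sum_{i'} phi(||x_{i'} - x_i||)(x_{i'} - x_i),
# expand the candidate kernel in a piecewise-constant basis; the empirical
# error is then quadratic in the coefficients and is minimized by lstsq.
rng = np.random.default_rng(1)
N, d, n_basis, R = 8, 2, 12, 4.0
edges = np.linspace(0.0, R, n_basis + 1)

def basis(j, r):
    """Indicator of the j-th distance bin [edges[j], edges[j+1])."""
    return ((edges[j] <= r) & (r < edges[j + 1])).astype(float)

def rhs(X, phi):
    """Velocities generated by the kernel phi at configuration X."""
    V = np.zeros_like(X)
    for i in range(X.shape[0]):
        diff = X - X[i]
        r = np.linalg.norm(diff, axis=1)
        V[i] = (phi(r)[:, None] * diff).sum(0) / X.shape[0]
    return V

# True kernel chosen inside the hypothesis space, so recovery is exact here.
phi_true = lambda r: 0.5 * basis(2, r) - 0.3 * basis(4, r)

rows, targets = [], []
for _ in range(50):                      # M = 50 observed configurations
    X = rng.uniform(-1.0, 1.0, size=(N, d))
    targets.append(rhs(X, phi_true).ravel())
    rows.append(np.column_stack(
        [rhs(X, lambda r, j=j: basis(j, r)).ravel() for j in range(n_basis)]))
A, b = np.vstack(rows), np.concatenate(targets)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes the empirical error
```

Because the true kernel lies in the hypothesis space here, the populated coefficients are recovered exactly up to numerical precision; in general the solve is additionally constrained to the admissible space.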

Our first main result is the strong asymptotic consistency of our learned estimators, as the number M of trajectories increases, which for the model (1.1) yields:

$$\begin{aligned} \lim _{M\rightarrow \infty }\Vert \widehat{\varvec{\phi }}^{EA}- \varvec{\phi }^{EA}\Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} =0\quad \text { with probability one,} \end{aligned}$$
(1.2)

where \(\varvec{\rho }_T^{EA,L}\) is a dynamics-adapted probability measure on pairwise distances, and we use a weighted \(\varvec{L}^2\) space (see Sect. 4, particularly (4.3)); see Sect. 3 for the required definitions and Sect. 4.3.2 for the full theorem. In fact, we also prove a stronger result that provides the rate of convergence. We achieve the minimax rate of convergence for any number of effective variables \({\mathcal {V}}\) in the interaction kernels. See Sect. 4.4 for the full theorem (see Sect. 3 for relevant definitions) which is given by:

$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert \widehat{\varvec{\phi }}^{EA}-\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \Big ] \le C \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}}}. \end{aligned}$$
(1.3)

In the case of model (1.1), we have \({\mathcal {V}}=1\). Our result recovers the results for first-order systems [45, 47].
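For a concrete reading of the exponent in (1.3), take kernels of regularity \(s = 1\) (Lipschitz; see Sect. 4.4) in the setting \({\mathcal {V}}= 1\) of model (1.1); the rate then specializes to

$$\begin{aligned} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}}} = \left( \frac{\log M}{M}\right) ^{\frac{2}{3}}, \end{aligned}$$

the classical one-dimensional nonparametric regression rate, independent of the ambient dimension \(D = N(2d+1)\).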

This means that our estimators converge at the same rate in M (up to a logarithmic factor) as the best possible estimator one could construct when the initial conditions are randomly sampled from some underlying initial condition distribution, denoted \(\varvec{\mu ^{\varvec{Y}}}\) throughout this work (see Sect. 4.3).

To solve the inverse problem, we give a detailed discussion of an essential link between these aspects: the notion of coercivity of the system, detailed in Sect. 4.2. Coercivity plays a key role in the approximation properties, the algorithm design, and the learning theory. We also present numerical examples in Sect. 6 (see also the detailed numerical study in [75]), which help to explain why the particular norms we define are the right choice, and which show excellent performance on complex dynamical systems.

Our paper is structured as follows. The first part of the paper describes the model, learning framework, inference problem, and the basic tools needed for the learning theory. These ideas are all explained in detail in Sects. 2–4. If one wishes to jump quickly to the theoretical sections, and then refer back to the definitions as needed, we have provided Tables 1–3, which explain the model equations and outline the definitions and concepts needed for the learning theory and the general theoretical results, respectively. The theoretical part of the paper (Sects. 4.2–4.5) discusses fundamental questions of identifiability and solvability of the inverse problem, consistency and rate of convergence of the estimators, and the ability to control the error of the trajectories evolved using our estimators. Some key highlights of our theoretical contributions are described in Sect. 3.4, with full details in the corresponding sections. Lastly, we consider applications in Sect. 6, with many additional proofs and details in the appendices.

2 Model description

In order to motivate the choice of second-order models considered in this paper, we begin our discussion with a simple second-order model derived from classical mechanics. Consider a closed system of N homogeneous agents (or particles) with Lagrangian L(t) of the form

$$\begin{aligned} L(t) = \frac{1}{2}\sum _{i = 1}^N m_i\left\| {{\dot{\varvec{x}}}}_i(t) \right\| ^2 - \frac{1}{2N}\sum _{i, i' = 1}^NU(\left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| ). \end{aligned}$$

Here U is a potential energy depending on pairwise distance. From the Lagrange equation, \(\frac{d}{dt}\partial _{{{\dot{\varvec{x}}}}_i}L = \partial _{\varvec{x}_i}L\), we obtain the second-order collective dynamics model

$$\begin{aligned} m_i\ddot{\varvec{x}}_i(t) = \frac{1}{N}\sum _{i' = 1}^N\phi ^{E}(\left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| )(\varvec{x}_{i'}(t) - \varvec{x}_i(t)), \quad i = 1, \ldots , N.\qquad \end{aligned}$$
(2.1)

Here, \(\phi ^{E}(r) = \frac{U'(r)}{r}\) represents an energy-based interaction between agents. We assume a regularity condition on \(\phi ^{E}\) at the origin, namely that the self-interaction summand vanishes: \(\phi ^{E}(0)\cdot 0 = 0\). For example, the choice \(U(r) = \frac{NGm_{i'}m_i}{r}\) corresponds to Newton’s gravity model.
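As a quick numerical check of the relation \(\phi ^{E}(r) = U'(r)/r\), the following sketch folds all physical constants into a single illustrative value; it is not the paper's code.

```python
# Numeric sanity check of phi^E(r) = U'(r)/r for the gravity-type potential
# U(r) = N*G*m_{i'}*m_i / r, with all constants folded into NG = 1.
def U(r, NG=1.0):
    return NG / r

def phi_E(r, NG=1.0):
    # closed form of U'(r)/r for this U: phi^E(r) = -NG / r**3
    return -NG / r**3

r, h = 1.7, 1e-6
dU = (U(r + h) - U(r - h)) / (2.0 * h)   # central difference for U'(r)
assert abs(dU / r - phi_E(r)) < 1e-6     # matches the closed form
```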

In order to incorporate a wider spectrum of behaviors, we add alignment-based interactions, which enable the alignment of velocities (so that short-range repulsion, mid-range alignment, and long-range attraction are all present), auxiliary state variables describing internal states of agents (emotion, excitation, phases, etc.), and non-collective forces (interaction with the environment). We also allow for heterogeneous systems, consisting of agents belonging to \(K\) disjoint types: in this case we partition the agents into \(K\) disjoint subsets \(\{C_{k}\}_{k= 1}^{K}\), with \(N_{k}\) being the number of agents of type \(k\), grouped in the index subset \(C_k\). These systems will be modeled by equations of the form

$$\begin{aligned} {\left\{ \begin{array}{ll} m_i\ddot{\varvec{x}}_i(t) = F^{{{\dot{\varvec{x}}}}}(\varvec{x}_i, {{\dot{\varvec{x}}}}_i, \xi _i) + \displaystyle \sum _{i' = 1}^N \frac{1}{N_{k_{i'}}}\Big [\phi ^{E}_{k_ik_{i'}}(r_{ii'}, \varvec{s}^E_{ii'})(\varvec{x}_{i'} - \varvec{x}_i) + \phi ^{A}_{k_ik_{i'}}(r_{ii'}, \varvec{s}^A_{ii'})({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i)\Big ] \\ {{\dot{\xi }}}_i(t) = F^{\xi }(\varvec{x}_i, {{\dot{\varvec{x}}}}_i, \xi _i) + \displaystyle \sum _{i' = 1}^N \frac{1}{N_{k_{i'}}}\phi ^{\xi }_{k_ik_{i'}}(r_{ii'}, \varvec{s}^{\xi }_{ii'})(\xi _{i'} - \xi _i) \end{array}\right. } \end{aligned}$$
(2.2)

for \(i = 1, \ldots , N\), where \(k_i\) denotes the type of agent i. The interaction kernels \(\phi ^{E}_{kk'}, \phi ^{A}_{kk'}, \phi ^{\xi }_{kk'}\) are in general different for each directed pair of interacting agent types; moreover, they depend not only on the pairwise distance \(r_{ii'}(t) = \left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| \), but also on other (known) pairwise features, \(\varvec{s}^E_{i i'}, \varvec{s}^A_{i i'}, \varvec{s}^{\xi }_{i i'}\), which are vector-valued functions of \(\varvec{x}_i(t), {{\dot{\varvec{x}}}}_i(t), \xi _i(t), \varvec{x}_{i'}(t), {{\dot{\varvec{x}}}}_{i'}(t), \xi _{i'}(t)\). For example, the interactions between birds or fish may depend on the field of vision, not just the pairwise distance. We will often suppress the explicit dependence on time t when it is clear from the context. These feature maps are modeled as the composition of a vector-valued feature map \({\mathcal {F}}\) shared among all types of agents, and a projection, for each ordered pair of agent types, onto a subset of coordinates in the range of this map; see Table 1. The unknowns in these equations, for which we will construct estimators, are the functions \(\phi ^{E}_{kk'}\), \(\phi ^{A}_{kk'}\), and \(\phi ^{\xi }_{kk'}\); everything else is assumed given.

We note that in what follows, the notation \({\{E,A,\xi \}}\) attached to a map/function/etc. means that there is one of those maps/functions/etc. for each element in the set \({\{E,A,\xi \}}\). It is a convenient way to avoid excessive repetition of similar definitions.

Table 1 Notation for the variables in (2.2)

The specific instances of the feature map \({\mathcal {F}}\), together with the corresponding projections \(\pi _{kk'}^{{\{E,A,\xi \}}}\), include a variety of systems that have found a wide range of applications in physics, biology, ecology, and social science; see the examples in Table 2 below. We assume that the function \({\mathcal {F}}\) is Lipschitz and known, together with all the \(\pi _{kk'}^{{\{E,A,\xi \}}}\)’s. The Lipschitz assumption is sufficient to ensure the well-posedness of the system and will also be used to control the trajectory error; it implies, of course, that the feature maps \(\varvec{s}_{(k, k')}^{{\{E,A,\xi \}}}\) are all Lipschitz. The function \({\mathcal {F}}\) is a uniform way to collect all of the different variables (functions of the inputs) used across any of the \((k,k')\) pairs over all of the \(E,A,\xi \) functions in the system. This uniformity is helpful when discussing the rate of convergence, among other places. Examples where this generality matters emerge naturally, say when one has a different number of variables across interaction kernels for different pairs \((k,k')\), or when the energy and alignment kernels depend on r and then on additional but distinct other variables. From this uniform set of variables, we then project to arrive at the relevant function \(\varvec{s}_{(k, k')}^{{\{E,A,\xi \}}}\) for each pair (and each of the elements of the wildcard). Lastly, we evaluate this map at the specific pair of agents \((i,i')\), which leads to the feature evaluation \(\varvec{s}_{ii'}^{{\{E,A,\xi \}}}\), the expression used in the model equation (2.2).

The model class (2.2) is quite large. We will consider several different concrete examples in Sect. 6.1. We summarize how those examples, and others, map to the model class in Table 2, with a shaded (respectively: empty) cell indicating that the model has (respectively: does not have) that characteristic. A numeric value indicates the number of unique variables, \({\mathcal {V}}\) or \({\mathcal {V}}^{\xi }\), used within the EA or \(\xi \) portions of the system, respectively. The number of these unique variables specifies the dimension in the minimax convergence rate; see Sect. 4.4.

Table 2 Summary of the models studied in this work and in [45, 47, 49, 75]. The shaded boxes indicate the presence of the corresponding terms

3 Inference problem and learning approach

In this section, we first introduce the problem of inferring the interaction kernels from observations of trajectory data and give a brief review and generalization of the learning approach proposed in the works [47] and [75].

3.1 Preliminaries and notation

We vectorize the model in (2.2) in order to obtain a more compact description. We let \(\varvec{v}_i(t):= {{\dot{\varvec{x}}}}_i(t)\) and

$$\begin{aligned} \varvec{X}_t:= \begin{bmatrix} \varvec{x}_1(t) \\ \vdots \\ \varvec{x}_N(t) \end{bmatrix} \in {\mathbb {R}}^{Nd}, \quad \varvec{V}_t:= \begin{bmatrix} \varvec{v}_1(t) \\ \vdots \\ \varvec{v}_N(t) \end{bmatrix} \in {\mathbb {R}}^{Nd}, \quad \varvec{\Xi }_t:= \begin{bmatrix} \xi _1(t) \\ \vdots \\ \xi _N(t) \end{bmatrix} \in {\mathbb {R}}^N. \end{aligned}$$

We introduce the weighted norm

$$\begin{aligned} \left\| \varvec{Z}\right\| _{{\mathcal {S}}}^2:= \sum _{i = 1}^N \frac{1}{N_{k_i}}\left\| \varvec{z}_i \right\| ^2, \end{aligned}$$
(3.1)

for \(\varvec{Z}= \begin{bmatrix} \varvec{z}_1^T,&\ldots ,&\varvec{z}_N^T\end{bmatrix}^T\) with each \(\varvec{z}_i \in {\mathbb {R}}^{d}\) or \({\mathbb {R}}\). Here \(\left\| \cdot \right\| \) is the same norm used in the construction of pairwise distance data for the interaction kernels (typically, the Euclidean norm). The weight factor is introduced so that agents of different types are overall weighted equally, which is important in the estimation phase, especially when the numbers of agents of the different types are highly non-uniform. The model (2.2) becomes

$$\begin{aligned} {\left\{ \begin{array}{ll} \vec {m} \circ \ddot{\varvec{X}}_t = {\textbf{f}}^{\text {nc}, {{\dot{\varvec{x}}}}}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t) + {\textbf{f}}^{\varvec{\phi }^E}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t) + {\textbf{f}}^{\varvec{\phi }^A}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t) \\ {{\dot{\varvec{\Xi }}}}_t = {\textbf{f}}^{\text {nc}, \xi }(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t) + {\textbf{f}}^{{\varvec{\phi }}^{\xi }}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t). \end{array}\right. } \end{aligned}$$

Here \(\vec {m} = \begin{bmatrix} m_1,&\ldots ,&m_N \end{bmatrix}^T\in {\mathbb {R}}^N\), \(\circ \) is the Hadamard product, and we use boldface fonts to denote the vectorized form of our estimators (with some once-for-all-fixed ordering of the pairs \((k,k')_{k,k'=1,\dots ,K}\)):

$$\begin{aligned} \varvec{\phi }^E = [\phi ^{E}_{kk'}]_{k, k' = 1}^{K}, \quad \varvec{\phi }^A = [\phi ^{A}_{kk'}]_{k, k'=1}^{K}, \quad \varvec{\phi }^{\xi } = [\phi ^{\xi }_{kk'}]_{k, k' = 1}^{K}, \end{aligned}$$
(3.2)

and of the non-collective force and the vectorized interaction term:

$$\begin{aligned} {\textbf{f}}^{\text {nc}, {{\dot{\varvec{x}}}}}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t):= \big [F^{{{\dot{\varvec{x}}}}}(\varvec{x}_i, {{\dot{\varvec{x}}}}_i, \xi _i)\big ]_{i = 1}^N, \quad {\textbf{f}}^{\varvec{\phi }^E}(\varvec{X}_t, \varvec{V}_t, \varvec{\Xi }_t):= \Big [\sum _{i' = 1}^N \tfrac{1}{N_{k_{i'}}}\phi ^{E}_{k_ik_{i'}}(r_{ii'}, \varvec{s}^E_{ii'})(\varvec{x}_{i'} - \varvec{x}_i)\Big ]_{i = 1}^N, \end{aligned}$$

both of which are vectors in \({\mathbb {R}}^{Nd}\). We omit the analogous definitions for \({\textbf{f}}^{\varvec{\phi }^A}\) and \({\textbf{f}}^{\varvec{\phi }^{\xi }}\). We also use the shorthand:

$$\begin{aligned} \varvec{\phi }^{EA}:= \varvec{\phi }^{E}\oplus \varvec{\phi }^{A}, \end{aligned}$$
(3.3)

to denote the element of the direct sum of the function spaces containing \(\varvec{\phi }^{E}, \varvec{\phi }^{A}\).
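The weighted norm \(\Vert \cdot \Vert _{{\mathcal {S}}}\) of (3.1) can be sketched directly; the specific form used below, \(\Vert \varvec{Z}\Vert _{{\mathcal {S}}}^2 = \sum _i \Vert \varvec{z}_i\Vert ^2 / N_{k_i}\), is our reading of the equal-type-weighting requirement and should be treated as an assumption of this sketch.

```python
import numpy as np

def s_norm(Z, types):
    """Weighted norm: Z is (N, d) stacked agent vectors, types length-N labels."""
    counts = np.bincount(types)        # N_k = number of agents of type k
    w = 1.0 / counts[types]            # weight 1/N_{k_i} for agent i
    return float(np.sqrt((w * (Z**2).sum(axis=1)).sum()))

types = np.array([0, 0, 0, 1])         # three agents of type 0, one of type 1
Z = np.ones((4, 2))                    # each ||z_i||^2 = 2
# type 0 contributes 3 * (2/3) = 2, type 1 contributes 1 * 2 = 2, total 4
val = s_norm(Z, types)                 # sqrt(4) = 2
```

Note that each type contributes the same total here despite the different class sizes, which is exactly the balancing the weight is meant to achieve.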

3.2 Problem setting

Our observation data is given by approximations of \(\{\varvec{Y}^{(m)}_{t_l}, \smash {{{\dot{\varvec{Y}}}}^{(m)}_{t_l}}\}_{m=1,l= 1}^{M,L}\) for \(0 = t_1< t_2< \cdots < t_L = T\). Here \(\varvec{Y}_t=[\varvec{y}_1^T(t),\ldots , \varvec{y}_N^T(t)]\) and \( \varvec{y}_i(t) = \begin{bmatrix}\varvec{x}_i^T(t), {{\dot{\varvec{x}}}}_i^T(t), \xi _i(t)\end{bmatrix}^T\), and m indexes the M different trajectories, each generated by the system (2.2) with the unknown set of interaction kernels \(\varvec{\phi }^E, \varvec{\phi }^A, \varvec{\phi }^{\xi }\), with initial conditions \(\{\varvec{Y}^{(m)}(0)\}_{m=1,\ldots ,M}\) drawn i.i.d. from \(\varvec{\mu ^{\varvec{Y}}}\), a probability measure defined on the space \({\mathbb {R}}^{N(2d+1)}\). We use a superscript (m) to denote that the variable is calculated from the data of the \(m^{\text {th}}\) trajectory. The objective is to construct estimators \(\widehat{\varvec{\phi }}^{E}, \widehat{\varvec{\phi }}^{A}, \widehat{\varvec{\phi }}^{\xi }\) of the unknown interaction kernels given these observations.

3.3 Loss functionals

For simplicity, we only consider equidistant observation times: \(t_{l} - t_{l - 1} = h\) for \(l = 2, \ldots , L\); the proposed estimator is easily extended to the case of non-equispaced times. Following and extending [45, 47, 75], we consider the empirical error functionals (recall the shorthand notation (3.3))

$$\begin{aligned} \begin{aligned} \varvec{{\mathcal {E}}}_{M}^{EA}(\varvec{\varphi }^{EA})&:=\frac{1}{LM}\sum _{l=1, m=1}^{L,M}\Big \Vert \ddot{\varvec{X}}^{(m)}_{t_l} - {\textbf{f}}^{\text {nc}, {{\dot{\varvec{x}}}}}(\varvec{X}^{(m)}_{t_l},\varvec{V}^{(m)}_{t_l},\varvec{\Xi }^{(m)}_{t_l}) \\&\quad - {\textbf{f}}^{\varvec{\varphi }^{E}}(\varvec{X}^{(m)}_{t_l},\varvec{V}^{(m)}_{t_l},\varvec{\Xi }^{(m)}_{t_l}) - {\textbf{f}}^{\varvec{\varphi }^{A}}(\varvec{X}^{(m)}_{t_l},\varvec{V}^{(m)}_{t_l},\varvec{\Xi }^{(m)}_{t_l})\Big \Vert ^2_{{\mathcal {S}}}, \\ \varvec{{\mathcal {E}}}_{M}^{\xi }(\varvec{\varphi }^{\xi })&:= \frac{1}{LM} \sum _{l=1, m=1}^{L,M} \Big \Vert \dot{\varvec{\Xi }}^{(m)}_{t_l} - {\textbf{f}}^{\text {nc}, \xi }(\varvec{X}^{(m)}_{t_l},\varvec{V}^{(m)}_{t_l},\varvec{\Xi }^{(m)}_{t_l}) - {\textbf{f}}^{\varvec{\varphi }^{\xi }}(\varvec{X}^{(m)}_{t_l},\varvec{V}^{(m)}_{t_l},\varvec{\Xi }^{(m)}_{t_l})\Big \Vert ^2_{{\mathcal {S}}}. \end{aligned} \end{aligned}$$
(3.4)

where \(\Vert \cdot \Vert _{{\mathcal {S}}}\) is the weighted norm (3.1), introduced to balance the contributions from the different types of agents. The estimators of the interaction kernels are defined as the minimizers of the error functionals \(\varvec{{\mathcal {E}}}_{M}^{EA}\) and \(\varvec{{\mathcal {E}}}_{M}^{\xi }\) over suitably chosen finite-dimensional function spaces \(\varvec{{\mathcal {H}}}^{EA}\) and \(\varvec{{\mathcal {H}}}^{\xi }\):

$$\begin{aligned} \widehat{\varvec{\phi }}^{EA}= \underset{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}}{{\text {arg}}{\text {min}}}\;\varvec{{\mathcal {E}}}_{M}^{EA}(\varvec{\varphi }^{EA}), \qquad {\widehat{\varvec{\phi }}}^{\xi }= \underset{{\varvec{\varphi }}^{\xi }\in \varvec{{\mathcal {H}}}^{\xi }}{{\text {arg}}{\text {min}}}\;\varvec{{\mathcal {E}}}_{M}^{\xi }({\varvec{\varphi }}^{\xi }). \end{aligned}$$
(3.5)
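To illustrate how (3.4)–(3.5) operate, the following sketch evaluates the \(\xi \)-part of the empirical error functional for a single agent type (so the \({\mathcal {S}}\)-norm weight is a constant, dropped here), with synthetic snapshot data standing in for observed trajectories; all concrete choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, M = 6, 4, 3
phi_xi_true = lambda r: 1.0 / (1.0 + r**2)   # illustrative true kernel

def f_phi_xi(X, Xi, phi):
    """Collective term of the xi-equation at one snapshot (single type)."""
    out = np.zeros(N)
    for i in range(N):
        r = np.linalg.norm(X - X[i], axis=1)
        out[i] = (phi(r) * (Xi - Xi[i])).sum() / N
    return out

# Synthetic "observations": snapshots (X, Xi) whose xi-derivatives are
# generated by the true kernel; in practice these come from trajectories.
snaps = [(rng.normal(size=(N, 2)), rng.normal(size=N)) for _ in range(M * L)]
obs = [(X, Xi, f_phi_xi(X, Xi, phi_xi_true)) for X, Xi in snaps]

def error_functional(phi):
    """Empirical error, cf. (3.4), averaged over the M*L snapshots."""
    return float(np.mean([((dXi - f_phi_xi(X, Xi, phi))**2).sum()
                          for X, Xi, dXi in obs]))

err_true = error_functional(phi_xi_true)             # zero by construction
err_zero = error_functional(lambda r: np.zeros_like(r))
```

The true kernel attains zero empirical error, while the zero kernel does not, which is the separation the minimization in (3.5) exploits.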

3.4 Overview of contributions

We focus on the regime where L is fixed but \(M \rightarrow \infty \). We provide a learning theory that answers the fundamental questions:

  • Quantitative description of estimator errors. We introduce measures describing how close the estimators are to the true interaction kernels, which lead to novel dynamics-adapted norms. See Sect. 4.

  • Identifiability of kernels. We will establish the existence and uniqueness of the estimators as well as relate the solvability of our inverse problem to a fundamental coercivity property. See Sect. 4.2.

  • Consistency and optimal convergence rate of the estimators. We prove theorems on the strong consistency and optimal minimax rates of convergence of the estimators, as the number of observed trajectories increases. These results exploit the separability of the learning of the energy and alignment kernels from the learning of the environment variable. See Sect. 4.3.

  • Trajectory Prediction. We prove a theorem showing that the expected (over initial conditions) supremum error (over the entire time interval) of the trajectories evolved with the estimated interaction kernels is controlled by the norm of the difference between the true and estimated kernels, which in turn is controlled by the result above. This further justifies our choice of norms and estimation procedure. See Sect. 4.5.

  • Applications. Our generalized model demonstrates unique features that were not explored in previous work on estimating interaction kernels. Specifically, we showcase applications of the anticipation dynamics (AD) model introduced in [65] that go beyond the scope of previous research. Our numerical results support the effectiveness of our method and are in line with our theoretical findings. In Sect. 5 and Appendix D we discuss computational complexity and algorithmic implementation.

3.5 Comparison with existing work

Nonparametric inference of radial interaction kernels. We have studied the nonparametric inference of radial interaction kernels from trajectory data in special cases of model (2.2). In [10], a convergence study for first-order models of homogeneous agents was carried out as N, the number of agents, increases. The estimation problem with N fixed, but the number of trajectories M varying, for first-order and second-order models of heterogeneous agents was numerically studied in [47], and a learning theory for these first-order models was developed in [32, 42, 45]. A big-data application to real celestial motion ephemerides is developed and discussed in [49]. In [75], we numerically examine the proposed inference approach on second-order systems, with particular emphasis on emergent collective behaviors.

Novelty of our work. In this paper, we provide a rigorous learning theory for the second-order heterogeneous models (2.2) with interaction kernels depending on pairwise higher-dimensional features. Our second-order model equations cover the first-order models considered in [32, 45, 47, 75] as special cases, but they form a significantly larger class: even when written as a first-order system in more variables, they are a strict generalization of the previous first-order models. Furthermore, the dynamical characteristics produced by second-order models are much richer and can model more complicated collective motions and emergent behaviors of the agents.

Our theoretical results also focus on the joint learning of \({\varvec{\phi }}^{E}, {\varvec{\phi }}^{A}\), taking into account their natural weighted direct-sum structure described in the following sections, whereas previous learning theories treat a single \({\varvec{\phi }}^{E}\). We carefully discuss the identifiability and separability of \(\phi ^{E}\) and \(\phi ^{A}\) from their sum. In general, the current theoretical framework cannot conclusively show that \({\varvec{\phi }}^{E}\) and \({\varvec{\phi }}^{A}\) can be learned separately. However, we show that a structured sum of \(\phi ^{E}\) and \(\phi ^{A}\) can be learned at an optimal rate depending only on the intrinsic dimension, which is sufficient to conclude that the force field on the whole state space \({\mathbb {R}}^{2Nd+N}\) can be learned without the curse of the ambient dimension. This is also demonstrated in various numerical experiments.

Other works on data-driven discovery of collective dynamics. The majority of the earliest work in inferring interaction kernels in systems of the type (1.1), (2.2) occurred in the Physics literature, going back to the works of Newton. From the viewpoint of purely data-driven analysis of the equations, requiring limited or no physical reasoning, foundational work on estimating interaction laws includes [40, 48]. One can also refer to [50] for the recent development of the Weak SINDy algorithm that leverages the weak form of the differential equation and sparse parametric regression, with applications to cellular dynamics [51]. In these works, the interaction kernels are assumed to be in the span of a known family of functions and parameters are estimated. In statistics, the problem of parameter estimation in dynamical systems from observations is classical, e.g. [12, 15, 43, 59, 71]. The question of identifiability of the parameter emerges, see e.g. [28, 52]. Our work is closely related to this viewpoint but our parameter is now infinite-dimensional; identifiability is discussed in Sect. 4.2.

Another highly active area of research in recent years is stochastic interacting agent systems. The maximum likelihood approach is the most frequently studied approach in recent works, including parameter estimation [7, 16, 34, 39, 64], nonparametric estimation of the drift in stochastic McKean–Vlasov equations [30, 33, 73], and radial interaction kernel learning [46].

Machine learning of dynamical systems. A vast literature exists in the context of learning dynamical systems [3, 5, 9, 13, 14, 37, 38, 44, 57, 58, 60, 62, 74]. There are many techniques that can be used to tackle the high dimensionality of the data set: sparsity assumptions [11, 14, 61, 68], dimension reduction, and reduced-order modeling. The dependent nature of the data prevents traditional regression-based approaches (see the discussion in [45]), but many of the approaches above successfully address this. Our work, however, exploits the interacting-agent structure of collective dynamical systems, which are driven by a collection of two-body interactions where each interaction depends only on pairwise data between the states of agents, as in (1.1). With this structure in mind, we are able to reduce the ambient dimension of the data, \(N(2d + 1)\), to the number of variables in the interaction kernels, which is independent of N. We also naturally incorporate the dependence in the data in an appropriate manner by considering trajectories generated from different initial conditions.

3.6 Function spaces

We begin by describing some basic ideas about measures and function spaces. Consider a compact or precompact set \({\mathcal {U}} \subset {\mathbb {R}}^{p}\) for some p; the infinity norm is defined as \(\Vert h\Vert _{\infty }:={\text {ess}} \sup _{x \in {\mathcal {U}}}|h(x)|\), and \(L^{\infty }({\mathcal {U}})\) is the space of real-valued functions defined on \({\mathcal {U}}\) with finite \(\infty \)-norm. A key function space we need to consider is \(C_{c}^{k, \alpha }({\mathcal {U}})\), for \(k \in {\mathbb {N}}\), \(0<\alpha \le 1\), defined as the space of compactly supported, k-times continuously differentiable functions whose k-th derivative is Hölder continuous of order \(\alpha \). We can then consider vectorizations of these spaces over agent types as

$$\begin{aligned} {\varvec{L}}^{\infty }({\mathcal {U}}):=\bigoplus _{k, k^{\prime }=1,1}^{K, K} L^{\infty }({\mathcal {U}}),\text { endowed with the norm } \Vert {\varvec{f}}\Vert _{\infty }:=\max _{k, k^{'}}\left\| f_{k k^{\prime }}\right\| _{\infty }, \forall {\varvec{f}} \in {\varvec{L}}^{\infty }({\mathcal {U}}). \end{aligned}$$

Similarly, we consider direct sums of measures, with corresponding vectorized function spaces, in particular \(L^2\) (see Sect. 4.1).
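A direct sketch of the vectorized \(\varvec{L}^{\infty }\) norm above, approximated on a grid and with illustrative component kernels for \(K = 2\) types:

```python
import numpy as np

# Grid approximation of the vectorized sup-norm: max over ordered type pairs
# (k, k') of the sup of |f_{kk'}|. Component kernels are illustrative, K = 2.
grid = np.linspace(0.0, 5.0, 1001)
f = {(1, 1): lambda r: np.exp(-r),
     (1, 2): lambda r: 2.0 / (1.0 + r),
     (2, 1): lambda r: -3.0 * np.exp(-r**2),
     (2, 2): lambda r: np.sin(r)}

def vec_sup_norm(f, grid):
    """max_{k,k'} ||f_{kk'}||_infty, approximated on the grid."""
    return max(float(np.abs(fk(grid)).max()) for fk in f.values())

val = vec_sup_norm(f, grid)   # dominated by |f_{21}(0)| = 3
```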

We now define a suitable function class for the interaction kernels in the model (2.2). A simple modeling assumption is that as agents get farther and farther apart, they eventually have no influence on each other. For each pair \((k,k')\), \(k,k'=1,\ldots ,K\), we define the admissible space

$$\begin{aligned} {\mathcal {K}}_{kk'}^{\{E,A,\xi \}}:= L^{\infty }([R_{kk'}^{\min }, R_{kk'}^{\max }] \times {\mathbb {S}}^{{\{E,A,\xi \}}}_{kk'})\quad ,\quad \varvec{{\mathcal {K}}}^{{\{E,A,\xi \}}}:=\bigoplus _{k,k'=1,1}^{K,K} {\mathcal {K}}_{kk'}^{\{E,A,\xi \}},\nonumber \\ \end{aligned}$$
(3.6)

where we remind the reader that the \({\{E,A,\xi \}}\) notation means, in this case, that there is an admissible space for each element of the set \({\{E,A,\xi \}}\). Here, \(R_{kk'}^{\min },R_{kk'}^{\max }\) are the minimum and maximum possible interaction radii, respectively, for agents in \(C_{k'}\) influencing agents in \(C_{k}\). Similarly, \({\mathbb {S}}^{E}_{kk'}, {\mathbb {S}}^{A}_{kk'}, {\mathbb {S}}^{\xi }_{kk'}\) are compact sets in \({\mathbb {R}}^{p_{kk'}^E}, {\mathbb {R}}^{p_{kk'}^A}, {\mathbb {R}}^{p_{kk'}^{\xi }}\) which contain the ranges of the feature maps \(\varvec{s}^E_{kk'}, \varvec{s}^A_{kk'}\), and \(\varvec{s}^{\xi }_{kk'}\). We will also need the sets:

$$\begin{aligned} {\textbf{S}}^{\{E,A,\xi \}}:= \prod _{k,k'}{\mathbb {S}}_{kk'}^{\{E,A,\xi \}}\quad ,\quad {\textbf{R}}:= \prod _{k,k'}[R_{kk'}^{\min },R_{kk'}^{\max }]\quad ,\quad R:= \max _{k,k'}R_{kk'}^{\max }.\qquad \end{aligned}$$
(3.7)

Notice that all interaction kernels are supported on the interval of pairwise distance [0, R].

We denote the distribution of the initial conditions by \(\varvec{\mu ^{\varvec{Y}}}\). This measure is unknown and is the source of randomness in our system. It is a product measure of three measures \(\varvec{\mu }^{\varvec{X}}\), \(\varvec{\mu }^{\varvec{V}}\), \(\varvec{\mu }^{\varvec{\Xi }}\), all also unknown, that represent the distribution on the initial positions, velocities, and environment variables, respectively. Specifically, we define,

$$\begin{aligned} \varvec{\mu ^{\varvec{Y}}}:= \begin{bmatrix}\mu ^{\varvec{X}}\\ \mu ^{\varvec{V}} \\ \mu ^{\varvec{\Xi }} \end{bmatrix} \end{aligned}$$
(3.8)

This reflects that we observe trajectories which start from different initial conditions but evolve under the same dynamical system. For example, in our numerical experiments we will choose \(\varvec{\mu ^{\varvec{Y}}}\) to be uniform over a system-dependent compact set.
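For instance, drawing initial conditions from a product measure \(\varvec{\mu ^{\varvec{Y}}}\) that is uniform over a compact box can be sketched as follows; the box sizes and array shapes are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def sample_initial_conditions(M, N, d, rng, x_box=1.0, v_box=0.5, xi_box=0.1):
    """Draw M i.i.d. initial states Y(0) = (X, V, Xi) from the product measure
    mu^Y = mu^X x mu^V x mu^Xi, each factor uniform on a compact box
    (the box half-widths are hypothetical)."""
    X  = rng.uniform(-x_box,  x_box,  size=(M, N, d))   # positions
    V  = rng.uniform(-v_box,  v_box,  size=(M, N, d))   # velocities
    Xi = rng.uniform(-xi_box, xi_box, size=(M, N, 1))   # environment variables
    return X, V, Xi
```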

We let

$$\begin{aligned} R_{{\dot{x}}}&:= \sup _{\varvec{Y}(0) \sim \varvec{\mu ^{\varvec{Y}}}} \sup _{t \in [0,T]} \max _{i,i'} \Vert {\dot{\varvec{x}}}_i(t) - {\dot{\varvec{x}}}_{i'}(t) \Vert , \end{aligned}$$
(3.9)
$$\begin{aligned} R_{\xi }&:= \sup _{\varvec{Y}(0) \sim \varvec{\mu ^{\varvec{Y}}}} \sup _{t \in [0,T]} \max _{i,i'} \Vert \xi _i(t) - \xi _{i'}(t) \Vert , \end{aligned}$$
(3.10)

and we assume that both of these quantities are finite. A sufficient condition is that the measures \(\varvec{\mu ^{\varvec{V}}}, \varvec{\mu }^{\xi }\) (specifying the distributions of the initial velocities and environment variables) are compactly supported; finiteness then follows from the assumptions on the interaction kernels below and the fact that we only consider a finite final time T.
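The quantities \(R_{{\dot{x}}}\) and \(R_{\xi }\) can be lower-bounded empirically by maximizing the pairwise differences over sampled trajectories, observation times, and agent pairs; a Monte Carlo sketch mirroring (3.9)-(3.10), with array shapes as our own assumptions:

```python
import numpy as np

def empirical_ranges(V, Xi):
    """Monte Carlo lower bounds for R_xdot and R_xi in (3.9)-(3.10):
    max over sampled trajectories, times, and agent pairs of the pairwise
    velocity / environment-variable differences.
    V: (M, L, N, d) velocities, Xi: (M, L, N) environment variables."""
    dV = V[:, :, :, None, :] - V[:, :, None, :, :]       # pairwise velocity diffs
    R_xdot = np.sqrt((dV ** 2).sum(-1)).max()
    R_xi = np.abs(Xi[:, :, :, None] - Xi[:, :, None, :]).max()
    return R_xdot, R_xi
```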

First, we define the following vectorized function spaces, which we call admissible sets,

$$\begin{aligned} \varvec{{\mathcal {K}}}_{S_{{\{E,A,\xi \}}}}^{{\{E,A,\xi \}}}:= \Big \{\left( \phi ^{{\{E,A,\xi \}}}_{kk'}\right) _{k,k' = 1,1}^{K,K}: \phi ^{{\{E,A,\xi \}}}_{kk'} \in C^{0,1}\left( [R_{kk'}^{\min }, R_{kk'}^{\max }] \times {\mathbb {S}}^{{\{E,A,\xi \}}}_{kk'}\right) , \left\| \phi ^{{\{E,A,\xi \}}}_{kk'} \right\| _{\infty } + {\text {Lip}}\left[ \phi ^{{\{E,A,\xi \}}}_{kk'}\right] \le S_{{\{E,A,\xi \}}} \text { for all } k,k'=1,\ldots ,K \Big \}. \end{aligned}$$
(3.11)

We will assume that the interaction kernels are in corresponding admissible sets:

$$\begin{aligned} {\varvec{\phi }}^{E}\in \varvec{{\mathcal {K}}}_{S_{E}}^{E}, \qquad {\varvec{\phi }}^{A}\in \varvec{{\mathcal {K}}}_{S_{A}}^{A}, \qquad {\varvec{\phi }}^{\xi }\in \varvec{{\mathcal {K}}}_{S_{\xi }}^{\xi }. \end{aligned}$$
(3.12)

The admissibility assumptions (3.12) allow us to establish properties such as existence and uniqueness of solutions to (2.2), as well as to control the trajectory errors in finite time [0, T]. They further allow us to show regularity and absolute continuity, with respect to the Lebesgue measure, of the appropriate performance measures defined in Sect. 4.1.
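A simple numerical check of the admissibility bound in (3.11), for a kernel of the single variable r, approximates the sup-norm and Lipschitz constant on a grid; this is only a heuristic sketch, since grid-based estimates lower-bound the true quantities:

```python
import numpy as np

def admissible_on_grid(phi, r_grid, S):
    """Heuristic check of the admissibility bound (3.11) for a kernel phi(r):
    sup|phi| + Lip[phi] <= S, with both quantities approximated on a grid
    (finite differences for the Lipschitz constant)."""
    vals = phi(r_grid)
    sup = np.max(np.abs(vals))
    lip = np.max(np.abs(np.diff(vals) / np.diff(r_grid)))
    return sup + lip <= S

r = np.linspace(0.0, 5.0, 501)
phi = lambda rr: np.exp(-rr)   # a hypothetical decaying kernel
```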

When estimating the EA part of the system, we will consider the direct sum admissible space, for \(S_{EA}\ge \max \{S_E,S_A\}\),

$$\begin{aligned} \varvec{{\mathcal {K}}}_{S_{EA}}^{EA}:= \varvec{{\mathcal {K}}}_{S_{E}}^{E} \oplus \varvec{{\mathcal {K}}}_{S_{A}}^{A} \end{aligned}$$
(3.13)

In the learning approach, we will consider hypothesis spaces that we will search in order to estimate the various interaction kernels. The hypothesis spaces corresponding to \(\{\phi _{kk'}^{{\{E,A,\xi \}}}\}\) are denoted as \(\{{\mathcal {H}}_{kk'}^{{\{E,A,\xi \}}}\}\) and we vectorize them as,

$$\begin{aligned} \varvec{{\mathcal {H}}}^{{\{E,A,\xi \}}}:= \bigoplus _{k,k'=1,1}^{K,K}{\mathcal {H}}_{kk'}^{{\{E,A,\xi \}}}. \end{aligned}$$
(3.14)

Analogous to our simplified notation for \(\varvec{\phi }^{EA}, \varvec{\varphi }^{EA}\) described in (3.3), we define the direct sum of the hypothesis spaces as,

$$\begin{aligned} \varvec{{\mathcal {H}}}^{EA}:= \varvec{{\mathcal {H}}}^E\oplus \varvec{{\mathcal {H}}}^A \end{aligned}$$
(3.15)

We will consider specific choices for the hypothesis spaces in the learning theory and numerical algorithm sections.

4 Learning theory

4.1 Probability measures and weighted \(L^2\) for measuring learning performance

The interaction kernels depend on (\(r,{\dot{r}},\varvec{s}^E,\varvec{s}^A,\varvec{s}^{\xi }\)), and to measure distances between estimated interaction kernels and true interaction kernels, we consider a natural set of probability measures and corresponding weighted \(L^2\) spaces. These generalize the constructions of [10, 45, 47]. For each interacting pair \((k,k')\), we let

$$\begin{aligned} \begin{aligned} \rho _T^{EA, k, k'}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)&:= {\mathbb {E}}_{\varvec{Y}_0 \sim \varvec{\mu ^{\varvec{Y}}}}\frac{1}{TN_{kk'}}\int _{t = 0}^T\sum _{\begin{array}{c} i \in C_{k}, i' \in C_{k'} \\ i \ne i' \end{array}}\delta _{i i', t}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)\, dt \\ \rho _T^{EA, L, k,k'}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)&:= {\mathbb {E}}_{\varvec{Y}_0 \sim \varvec{\mu ^{\varvec{Y}}}}\frac{1}{LN_{kk'}}\sum _{l = 1}^L\sum _{\begin{array}{c} i \in C_{k}, i' \in C_{k'} \\ i \ne i' \end{array}}\delta _{i i', t_l}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A) \\ \rho _T^{EA, L, M, k, k'}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)&:= \frac{1}{MLN_{kk'}}\sum _{l, m = 1}^{L, M}\sum _{\begin{array}{c} i \in C_{k}, i' \in C_{k'} \\ i \ne i' \end{array}}\delta _{i i', t_l, m}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A) \end{aligned} \end{aligned}$$
(4.1)

where \(N_{kk'} = N_kN_{k'}\) for \(k\ne k'\) and \(N_{k k^{\prime }}=\binom{N_{k}}{2}\) for \(k=k'\), and we used the following shorthand notation for the Dirac measures:

$$\begin{aligned} \begin{aligned} \delta _{i i', t}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)&:= \delta _{r_{i i'}(t), \varvec{s}^E_{i i'}(t), {\dot{r}}_{i i'}(t), \varvec{s}^A_{i i'}(t)}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A) \\ \delta _{i i', t, m}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A)&:= \delta _{r_{i i'}^{(m)}(t), \varvec{s}^{E, (m)}_{i i'}(t), {\dot{r}}_{i i'}^{(m)}(t), \varvec{s}^{A, (m)}_{i i'}(t)}(r, \varvec{s}^E, {\dot{r}}, \varvec{s}^A). \end{aligned} \end{aligned}$$

The measure \(\rho _T^{EA, L, k, k'}\) is the discrete counterpart of \(\rho _T^{EA, k, k'}\) with the continuous average over [0, T] replaced by the average over the observation times \(0=t_1<\ldots <t_L=T\). \(\rho _T^{EA, L, M, k, k'}\) can be computed from observations and converges to \(\rho _T^{EA, L, k, k'}\) as \(M \rightarrow \infty \).
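A sketch of how the fully empirical measure can be assembled from data, for a single-type system and the pairwise-distance variable only (a histogram approximation; the array shapes and binning are our own choices):

```python
import numpy as np

def empirical_rho_E(X, bins=20):
    """Empirical pairwise-distance measure: a histogram approximation of the
    fully observable measure rho_T^{E,L,M} for a single-type system, built
    from the Dirac masses delta_{r_{ii'}(t_l)} averaged over trajectories m,
    observation times l, and distinct pairs (i, i').
    X: (M, L, N, d) positions."""
    M, L, N, d = X.shape
    diff = X[:, :, :, None, :] - X[:, :, None, :, :]
    r = np.sqrt((diff ** 2).sum(-1))                  # (M, L, N, N) distances
    iu = np.triu_indices(N, k=1)                      # distinct pairs i < i'
    samples = r[:, :, iu[0], iu[1]].ravel()
    weights, edges = np.histogram(samples, bins=bins, density=True)
    return weights, edges
```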

We also consider the marginal distributions

$$\begin{aligned} \rho _T^{E, k, k'}(r, \varvec{s}^E):= \int _{{\dot{r}}}\int _{\varvec{s}^A} \rho _T^{EA, k, k'} \, d\varvec{s}^A\, d{\dot{r}}\quad ,\quad \rho _T^{A, k, k'}(r, {\dot{r}}, \varvec{s}^A):= \int _{\varvec{s}^E} \rho _T^{EA, k, k'} \, d\varvec{s}^E\nonumber \\ \end{aligned}$$
(4.2)

and \(\rho _T^{E, L, k, k'}(r, \varvec{s}^E) \), \(\rho _T^{E, L, M, k, k'}(r, \varvec{s}^E)\), \(\rho _T^{A, L, k, k'}(r, {\dot{r}}, \varvec{s}^A)\), \(\rho _T^{A, L, M, k, k'}(r, {\dot{r}}, \varvec{s}^A)\) defined analogously as above. The empirical measures, \(\rho _T^{E, L, M, k, k'}, \rho _T^{A, L, M, k,k'}\), are the ones used in the actual algorithm to quantify the learning performances of the estimators \({\widehat{\phi }}^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'}\) respectively. They are also crucial in discussing the separability of \({\widehat{\phi }}^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'}\).

For ease of notation, we introduce the following measures, which handle the heterogeneity of the system and are used to describe the error over all of the pairs \((k,k')\):

$$\begin{aligned} \varvec{\rho }_T^{EA,L}=\bigoplus _{k, k^{\prime }=1,1}^{K, K} \rho _{T}^{EA,L, k k^{\prime }}, \quad \varvec{\rho }_T^{EA}=\bigoplus _{k, k^{\prime }=1,1}^{K, K} \rho _{T}^{EA,k k^{\prime }}, \quad \varvec{L}^2\left( \varvec{\rho }_T^{EA,L}\right) =\bigoplus _{k, k^{\prime }=1,1}^{K, K} L^2\left( \rho _{T}^{EA,L, k k^{\prime }}\right) \end{aligned}$$
(4.3)

Similar definitions apply for measures related to learning the \(\xi \)-based interaction kernels, see supplement C. We discuss some key properties of the measures in supplement B.

We now discuss the performance measures for the estimated interaction kernels. We use weighted \(L^2\)-norms (with a mild abuse of notation, we omit the weight) based on the measures introduced above (with analogous definitions for the measures corresponding to finite L) that are adapted to the underlying dynamics:

$$\begin{aligned} \left\| {\widehat{\phi }}^{E}_{kk'} - \phi ^{E}_{kk'} \right\| _{L^2(\rho _T^{E, k, k'})}^2&:= \int _{(r,\varvec{s}^E)}({\widehat{\phi }}^{E}_{kk'}(r, \varvec{s}^E) - \phi ^{E}_{kk'}(r, \varvec{s}^E))^2r^2 \, d\rho _T^{E, k, k'} \\ \left\| {\widehat{\phi }}^{A}_{kk'} - \phi ^{A}_{kk'} \right\| _{L^2(\rho _T^{A, k, k'})}^2&:= \int _{(r,{\dot{r}},\varvec{s}^A)}({\widehat{\phi }}^{A}_{kk'}(r, {\dot{r}}, \varvec{s}^A) - \phi ^{A}_{kk'}(r, {\dot{r}}, \varvec{s}^A))^2{\dot{r}}^2 \, d\rho _T^{A, k, k'} \\ \left\| {\widehat{\phi }}_{kk'}^{EA} - \phi _{kk'}^{EA} \right\| _{L^2(\rho _T^{EA, k, k'})}^2&:= \int _{(r,\varvec{s}^E,{\dot{r}},\varvec{s}^A)} \Big [({\widehat{\phi }}^{E}_{kk'}(r, \varvec{s}^E) - \phi ^{E}_{kk'}(r, \varvec{s}^E))r + ({\widehat{\phi }}^{A}_{kk'}(r, {\dot{r}}, \varvec{s}^A) - \phi ^{A}_{kk'}(r, {\dot{r}}, \varvec{s}^A)){\dot{r}}\Big ]^2 \, d\rho _T^{EA, k, k'}. \end{aligned}$$
(4.4)

Our learning theory focuses on minimizing the difference between \({\widehat{\phi }}^{E}_{kk'} \oplus {\widehat{\phi }}^{A}_{kk'}\) and \(\phi ^{E}_{kk'} \oplus \phi ^{A}_{kk'}\) in the joint norm given by (4.4). As long as the joint norm is small, our estimators produce faithful approximations of the right-hand side of the original system and of its trajectories. However, a small joint norm does not necessarily imply that both \({\widehat{\phi }}^{E}_{kk'} - \phi ^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'} - \phi ^{A}_{kk'}\) are small in their corresponding energy- and alignment-based norms, since the joint norm is a weaker norm. It would be interesting to study whether there is any equivalence between these norms, but the problem appears to be quite delicate; its theoretical investigation is ongoing.
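A Monte Carlo version of the first (energy) norm in (4.4), with the measure replaced by empirical pairwise-distance samples, can be sketched as follows (replacing the integral by a sample mean is an assumption of this illustration):

```python
import numpy as np

def weighted_L2_error_E(phi_hat, phi_true, r_samples):
    """Monte Carlo estimate of the energy-kernel error in (4.4):
    int (phi_hat - phi_true)^2 r^2 d rho, with rho replaced by the empirical
    distribution of the pairwise-distance samples r_samples."""
    diff = phi_hat(r_samples) - phi_true(r_samples)
    return np.mean((diff * r_samples) ** 2)
```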

Now, we have all the tools needed to establish a theoretical framework: dynamics induced probability measures, performance measurements in appropriate norms, and loss functionals. These will allow us to discuss the convergence properties of our estimators. Full details of the numerical algorithm are given in supplement D.

4.2 Notational summary

A summary of the learning theory notation introduced in Sects. 3.1–3.3, and the notation above, is given below in Table 3.

Table 3 Notation used throughout the paper

4.3 Identifiability of kernels from data

In this section we introduce a technical condition on the dynamical system, called the coercivity condition, that relates to the well-posedness (solvability and uniqueness of the solution) of the inverse problem and plays a key role in the learning theory. It generalizes the previous work [42, 45, 47]. In fact, for the second-order systems considered here, we will have two coercivity conditions, one for the energy and alignment terms and the other for the \(\xi \) variable. These conditions ensure, first, that the minimizers of the error functionals are unique and, second, that when the expected error functional is small, the distance from the estimator to the true kernels is small in the appropriate \(\varvec{\rho }_T\) norm.

Definition 1

(Coercivity condition) For the dynamical system (2.2) observed at time instants \(0=t_1<t_2<\dots <t_L=T\) and with initial condition distributed according to \(\varvec{\mu ^{\varvec{Y}}}\) on \({\mathbb {R}}^{(2d+1)N}\), the system satisfies the coercivity condition on the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}\) with constant \(c_{\varvec{{\mathcal {H}}}^{EA}}\) if

$$\begin{aligned} c_{\varvec{{\mathcal {H}}}^{EA}} := \inf _{ \varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\backslash \{\varvec{0} \}} \,\frac{ \frac{1}{L}\sum _{l=1}^{L}{\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}} \bigg [ \big \Vert {\textbf{f}}_{\varvec{\varphi }^{EA}}(\varvec{X}_{t_l},\varvec{V}_{t_l},\varvec{\Xi }_{t_l}) \big \Vert _{{\mathcal {S}}}^2\bigg ] }{\Vert \varvec{\varphi }^{EA}\Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA,L})}^2} >0 . \end{aligned}$$
(4.5)

Similarly, the system satisfies the coercivity condition on the hypothesis space \(\varvec{{\mathcal {H}}}^{\xi }\) with constant \(c_{\varvec{{\mathcal {H}}}^{\xi }}\) if

$$\begin{aligned} c_{\varvec{{\mathcal {H}}}^{\xi }} := \inf _{ \varvec{\varphi }^{\xi }\in \varvec{{\mathcal {H}}}^{\xi }\backslash \{\varvec{0} \}} \,\frac{ \frac{1}{L}\sum _{l=1}^{L}{\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}} \bigg [ \big \Vert {\textbf{f}}_{\varvec{\varphi }^{\xi }}(\varvec{X}_{t_l},\varvec{V}_{t_l},\varvec{\Xi }_{t_l}) \big \Vert _{{\mathcal {S}}}^2\bigg ] }{\Vert \varvec{\varphi }^{\xi }\Vert _{\varvec{L}^2(\varvec{\rho }_T^{\xi ,L})}^2} >0. \end{aligned}$$
(4.6)

Analogous definitions hold for continuous observations over the time interval [0, T], obtained by replacing the average over observations at discrete times with an integral average over [0, T].
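To build intuition, the coercivity ratio in Definition 1 can be probed numerically for a given candidate kernel; the sketch below uses a simplified single-type, first-order-style collective force and an \(r^2\)-weighted empirical \(\rho \)-norm, both our own simplifying assumptions rather than the paper's full construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def collective_force(phi, X):
    """f_phi(X)_i = (1/N) sum_{i'!=i} phi(r_ii') (x_i' - x_i): a simplified
    single-type, first-order-style right-hand side used to probe coercivity."""
    N = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]             # entry [i, i'] = x_i' - x_i
    r = np.linalg.norm(diff, axis=-1)
    w = np.where(np.eye(N, dtype=bool), 0.0, phi(np.maximum(r, 1e-12)))
    return (w[:, :, None] * diff).sum(axis=1) / N

def coercivity_ratio(phi, X_samples):
    """Empirical ratio E||f_phi||_S^2 / ||phi||^2_{L2(rho)}, with the rho-norm
    carrying the r^2 weight (an assumption mirroring (4.4))."""
    num, r_all = 0.0, []
    for X in X_samples:
        N = X.shape[0]
        f = collective_force(phi, X)
        num += (f ** 2).sum() / N                    # (1/N)-weighted S-norm
        r = np.linalg.norm(X[None] - X[:, None], axis=-1)
        r_all.append(r[np.triu_indices(N, 1)])
    r_all = np.concatenate(r_all)
    den = np.mean((phi(r_all) * r_all) ** 2)
    return (num / len(X_samples)) / den
```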

4.4 Consistency and optimal convergence rate of estimators

4.4.1 Concentration

Our first main result is a concentration estimate that relates the coercivity condition to an appropriate bias-variance tradeoff in our setting. Let \({\mathcal {N}}(\varvec{{\mathcal {H}}},\delta )\) be the \(\delta \)-covering number, with respect to the \(\infty \)-norm, of the set \(\varvec{{\mathcal {H}}}\).

Theorem 2

(Concentration) Suppose that \({\varvec{\phi }}^{\{E,A,\xi \}}\in \varvec{{\mathcal {K}}}_{S_{\{E,A,\xi \}}}^{\{E,A,\xi \}}\). Consider convex, compact (with respect to the \(\infty \)-norm) hypothesis spaces

$$\begin{aligned} \varvec{{\mathcal {H}}}_M^{EA} \subset \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^A),\quad \varvec{{\mathcal {H}}}_M^{\xi } \subset \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^{\xi }), \end{aligned}$$

each bounded above by \(S_0 \ge \max \{S_E, S_A,S_{\xi }\}\). Additionally, assume that the coercivity conditions (4.5), (4.6) hold on \(\varvec{{\mathcal {H}}}_M^{EA}\) and \(\varvec{{\mathcal {H}}}_M^{\xi }\), respectively.

Then for all \(\epsilon >0\), with probability (with respect to \(\varvec{\mu ^{\varvec{Y}}}\)) at least \(1-\delta \), we have the estimates

$$\begin{aligned} \begin{aligned} c_{\varvec{{\mathcal {H}}}_M^{EA}}\Vert \widehat{\varvec{\phi }}^{EA}_{M} - \varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})}&\le 2\inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}_M^{EA}}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} +2\epsilon ,\\ c_{\varvec{{\mathcal {H}}}_M^{\xi }}\Vert {\widehat{{\varvec{\phi }}}}_{M}^{\xi } - \varvec{\phi }^{\xi }\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})}&\le 2\inf _{\varvec{\varphi }^{\xi }\in \varvec{{\mathcal {H}}}_M^{\xi }}\Vert \varvec{\varphi }^{\xi }-\varvec{\phi }^{\xi }\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})} + 2\epsilon , \end{aligned} \end{aligned}$$
(4.7)

provided that, for the first bound to hold,

$$\begin{aligned} M \ge \frac{1152S_{0}^2\max \{R, R_{{\dot{x}}} \}^2K^4}{\epsilon c_{\varvec{{\mathcal {H}}}_M^{EA}}}\bigg (\log \Big ({\mathcal {N}}\Big (\varvec{{\mathcal {H}}}^{EA}_M,\frac{\epsilon }{48S_{0}\max \{R, R_{{\dot{x}}} \}^2K^4} \Big )\Big )+\log \Big (\frac{1}{\delta }\Big ) \bigg ), \end{aligned}$$

and similarly for the second inequality, using \(\varvec{{\mathcal {H}}}_M^{\xi }\).

4.4.2 Consistency

In the regime where \(M \rightarrow \infty \), we can choose a sequence of \(\varvec{{\mathcal {H}}}_M^{EA}\)’s such that the approximation error goes to 0. This enables us to control the infimum on the right hand side of (4.7).

Theorem 3

(Strong consistency) Suppose that

$$\begin{aligned} \{\varvec{{\mathcal {H}}}_M^{EA}\}_{M=1}^{\infty } \subset \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^A) \end{aligned}$$

is a family of compact and convex subsets such that the approximation error goes to zero,

$$\begin{aligned} \inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}_M^{EA}}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert _{\infty } \xrightarrow {M\rightarrow \infty } 0. \end{aligned}$$

Further suppose that the coercivity condition holds on \(\bigcup _{M}\varvec{{\mathcal {H}}}_M^{EA}\), and that \( \bigcup _{M}\varvec{{\mathcal {H}}}_M^{EA}\) is compact in \(\varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^A)\). Then the estimator is strongly consistent with respect to the \(\varvec{L}^2(\varvec{\rho }_T^{EA,L})\) norm:

$$\begin{aligned} \lim _{M\rightarrow \infty }\Vert \widehat{\varvec{\phi }}^{EA}_{M} - \varvec{\phi }^{EA}\Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} =0 \quad \text { with probability one.} \end{aligned}$$

An analogous consistency result holds for the estimator in the \(\xi \) variable.

These two results together provide a consistency result on the full estimation of the triple \((\widehat{{\varvec{\phi }}^{\xi }}, \widehat{{\varvec{\phi }}^{E}}, \widehat{{\varvec{\phi }}^{A}})\) and thus consistency of our estimation procedure on the full system (2.2).

4.5 Rate of convergence

Theorem 2 highlights the classical bias-variance tradeoff in our setting. Given data collected from M trajectories, we would like to choose the best hypothesis space to maximize the accuracy of the estimators. On the one hand, we would like the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}_M\) to be large so that the bias

$$\begin{aligned} \inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}_M}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})},\text { or }\,\,\inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert ^2_{\infty }, \end{aligned}$$

is small. Simultaneously, we would like \(\varvec{{\mathcal {H}}}^{EA}_M\) to be small enough so that the covering number \({\mathcal {N}}(\varvec{{\mathcal {H}}}^{EA}_M,\epsilon ) \) is small. Just as in nonparametric regression, our rate of convergence depends on a regularity condition of the true interaction kernels and corresponding approximation properties of the hypothesis space, as is demonstrated in the following theorem. We establish the optimal (up to a log factor) min–max rate of convergence by choosing a hypothesis space of an optimal dimension as a function of the sample size M.

The dimension of the space supporting \(\varvec{\rho }_T^{EA,L}\) is typically large: it is equal to \(1+\sum _{kk'}p_{(k,k')}^E + \sum _{(k,k')}p_{(k,k')}^A\), see Table 1 for the definition of the \(p_{(k,k')}\). However, we can exploit the structure of the system in such a way that our convergence rates only depend on the maximum number of unique variables in a pair \((k,k')\) across the EA portions of the system. A similar result holds for \(\varvec{\rho }_T^{\xi , L}\) and its convergence rate. For the system (2.2), let \({\mathcal {V}}^{E,kk'}\) be the number of distinct variables in the function \(\phi ^{E}_{kk'}(r,\varvec{s}^E_{(k,k')})\), and define \({\mathcal {V}}^{A,kk'}\), \({\mathcal {V}}^{\xi ,kk'}\) similarly; more precisely, recalling the notation in Table 1:

$$\begin{aligned} {\mathcal {V}}^{E,kk'} := 1 + p_{(k,k')}^E,\qquad {\mathcal {V}}^{A,kk'} := 1 + p_{(k,k')}^A,\qquad {\mathcal {V}}^{\xi ,kk'} := 1 + p_{(k,k')}^{\xi }. \end{aligned}$$
(4.8)

Using these, we get the dimensions needed for the minimax rates:

$$\begin{aligned} {\mathcal {V}}:= \max _{k,k'} \{{\mathcal {V}}^{E,kk'} ,{\mathcal {V}}^{A,kk'} \},\quad {\mathcal {V}}^{\xi } := \max _{k,k'} {\mathcal {V}}^{\xi ,kk'} \end{aligned}$$
(4.9)

The dimension for the minimax rates on the energy and alignment inference is given by \({\mathcal {V}}\), representing the maximum number of unique variables used in any one of the \(\phi ^{E}_{kk'}, \phi ^{A}_{kk'}\) pairs. Analogously, \({\mathcal {V}}^{\xi }\) is used for the minimax convergence rate for the inference of \({\varvec{\phi }}^{\xi }\). As an extreme example, consider a problem with 10 different types of agents, leading to 100 distinct interaction kernels, each depending on r and one additional variable that is unique to each function. In this case, we only pay the 2-dimensional rate, rather than the 101-dimensional rate in the ambient space of the 101 unique variables, although the heterogeneity affects the constant in the convergence rate. We note that we are not estimating the number of variables or their form: these are assumed known. In Table 2 we report the values of \({\mathcal {V}}\) and \({\mathcal {V}}^{\xi }\) for a variety of prototypical systems. We are now ready to state our main result:
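The bookkeeping in (4.8)-(4.9) is elementary; a sketch (the input layout is our own choice), with the extreme example above recovering \({\mathcal {V}}=2\):

```python
def minimax_dims(pE, pA, pxi):
    """V and V^xi from (4.8)-(4.9): pE[k][k'], pA[k][k'], pxi[k][k'] hold the
    numbers of auxiliary variables p^E_(k,k'), p^A_(k,k'), p^xi_(k,k')."""
    K = len(pE)
    V = max(1 + max(pE[k][kp], pA[k][kp]) for k in range(K) for kp in range(K))
    Vxi = max(1 + pxi[k][kp] for k in range(K) for kp in range(K))
    return V, Vxi
```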

Theorem 4

(Rate of convergence) Let \(\widehat{\varvec{\phi }}^{EA}:={{\widehat{{\varvec{\phi }}}}}_M^E \oplus {{\widehat{{\varvec{\phi }}}}}_M^A\) denote the minimizer of the empirical error functional \(\varvec{{\mathcal {E}}}_{M}^{EA}\) (defined in (3.4)) over the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}_M\).

  1. (a)

    Let the hypothesis space be chosen as the direct sum of the admissible spaces, namely \(\varvec{{\mathcal {H}}}^{EA}= \varvec{{\mathcal {K}}}_{S_E}^E \oplus \varvec{{\mathcal {K}}}_{S_A}^A,\) and assume that the coercivity condition (4.5) holds on \(\varvec{{\mathcal {H}}}^{EA}\). Then, there exists a constant C depending only on \(K,S_{EA},R, R_{{\dot{x}}}\) such that

    $$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert \widehat{\varvec{\phi }}^{EA}_M-\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \Big ]\le \frac{C}{c_{\varvec{{\mathcal {H}}}^{EA}}} M^{-\frac{1}{{\mathcal {V}}+1}}. \end{aligned}$$
  2. (b)

    Assume that \(\{\varvec{{\mathcal {L}}}_n\}_{n=1}^{\infty }\) is a sequence of finite-dimensional linear subspaces of \(\varvec{L}^{\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L}^{\infty }({\textbf{R}}\times {\textbf{S}}^A)\) satisfying the dimension and approximation constraints

    $$\begin{aligned} \text {dim}(\varvec{{\mathcal {L}}}_n) \le c_0K^2n^{{\mathcal {V}}},\quad \inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {L}}}_n}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert _{\infty } \le c_1 n^{-s}, \end{aligned}$$
    (4.10)

    for some fixed constants \(c_0,c_1\) representing dimension-independent approximation characteristics of the linear subspaces, and \(s>0\) related to the regularity of the interaction kernels. The value n can be thought of as the number of basis functions along each of the (up to) \({\mathcal {V}}\) axes for each \((k,k')\). Suppose the coercivity condition holds true on the set \(\varvec{{\mathcal {L}}}:=\cup _n\varvec{{\mathcal {L}}}_n\), and let \(c_{\varvec{{\mathcal {L}}}}^{EA}\) be the coercivity constant of \(\varvec{{\mathcal {L}}}\). Define \(\varvec{{\mathcal {B}}}_n\) to be the closed ball centered at the origin of radius \((c_1+S_{EA})\) in \(\varvec{{\mathcal {L}}}_{n}\). If we choose the hypothesis space as \(\varvec{{\mathcal {H}}}_M=\varvec{{\mathcal {B}}}_{k(M)}\), where \(k(M) \asymp (\frac{M}{\log M})^{\frac{1}{2\,s+{\mathcal {V}}}}\), then there exists a constant C depending on \(K,R, R_{{\dot{x}}}, S_{EA},c_0, c_1, s\) such that we achieve the convergence rate,

    $$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert \widehat{\varvec{\phi }}^{EA}_{M} -\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \Big ] \le \frac{C}{c_{\varvec{{\mathcal {L}}}}^{EA}} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}}}. \end{aligned}$$
    (4.11)
  3. (c)

    Under the corresponding assumptions as in (a), there exists a constant C depending only on \(K,S_{\xi },R\) such that

    $$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert {{\widehat{{\varvec{\phi }}}}}_M^{\xi }-{\varvec{\phi }}^{\xi }\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})} \Big ]\le \frac{C}{c_{\varvec{{\mathcal {H}}}^{\xi }}} M^{-\frac{1}{{\mathcal {V}}^{\xi }+1}}. \end{aligned}$$
  4. (d)

    Under the corresponding assumptions as in (b), and with \(c^{\xi }\) denoting the coercivity constant of the corresponding linear space, there exists a constant C depending only on \(K,R, S_{\xi },c_0, c_1, s\) such that

    $$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert {{\widehat{{\varvec{\phi }}}}}^{\xi }_{M} -{\varvec{\phi }}^{\xi } \Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})} \Big ] \le \frac{C}{c^{\xi }} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}^{\xi }}}. \end{aligned}$$
    (4.12)

We in fact prove bounds not only in expectation but also with high probability, for every fixed large-enough M, as the proof will show. In addition, we remark that when the coercivity constant is independent of the number of agents N, not only is the convergence rate of our estimators independent of the dimension \((2d+1)N\) of the phase space, but even the constants in front of the rate term are independent of N. We have proven this to be the case for some initial conditions [45] for first-order systems. The analysis could be extended to some special cases of the second-order systems considered in this paper. In general, we do expect the coercivity constant to depend on N. The numerical experiments on some second-order systems [75] support the idea that the coercivity condition is satisfied by large classes of second-order systems; we leave its further theoretical investigation as future work.

In both parts of Theorem 4, the convergence rates \(\frac{2s}{2s+{\mathcal {V}}}\) and \(\frac{2s}{2s+{\mathcal {V}}^{\xi }}\) coincide with the minimax rate of convergence \(\frac{2s}{2s+d}\) for nonparametric regression in the corresponding dimension d, up to the logarithmic factor. (This logarithmic factor may be removable, using, e.g., the techniques in Chapters 11–15 of [36], but at the cost of additional complexity in the proofs.) Achieving the same rate of convergence as if we had observed the noisy values of the interaction kernels directly, rather than through the dynamics, demonstrates the optimality of our approach. The strong consistency results show the asymptotic optimality of our method, and for wide classes of systems the assumptions of the theorems apply. Specifically, for part (b) of the theorems, the dimension and approximation conditions can be explicitly achieved by piecewise polynomials or splines appropriately adapted to the regularity of the interaction kernel. In the conditions of Theorem 4, n can be taken as the number of partitions along each axis of the variables in \({\mathcal {V}}\). Then, by using multivariate splines or piecewise polynomials, we have a fixed constant \(c_0\) (corresponding to the number of parameters to estimate for each function) times \(K^2n^{{\mathcal {V}}}\) as the dimension of the linear space. Furthermore, by standard approximation theory results, see [63] (Chapters 12, 13), [27, 29], for s the regularity of the interaction kernels we achieve the desired approximation condition with piecewise polynomials of degree \(\lfloor s\rfloor \). In our admissible spaces we have \(s=1\); note that the rate of convergence is faster if the kernels have higher regularity.
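The choice of hypothesis-space resolution in Theorem 4(b) and the resulting rate can be evaluated numerically, ignoring all constants; this is a back-of-the-envelope illustration, not the algorithm itself:

```python
import math

def hypothesis_dimension(M, s, V):
    """Per-axis resolution n(M) ~ (M / log M)^(1/(2s+V)) from Theorem 4(b),
    together with the resulting rate (log M / M)^(2s/(2s+V)),
    with all multiplicative constants dropped."""
    n = (M / math.log(M)) ** (1.0 / (2 * s + V))
    rate = (math.log(M) / M) ** (2 * s / (2 * s + V))
    return n, rate
```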

4.5.1 Examples of convergence rates in prototype systems

We next briefly examine the convergence rate on a few systems of interest. Recall that in Table 2 we have as the final two columns the values \({\mathcal {V}}, {\mathcal {V}}^{\xi }\), which dictate the rate of convergence of our estimators in each system. Some specific highlights:

  • For Anticipation Dynamics (AD) (see (6.2)), we have \({\mathcal {V}}=2\) unique variables shared across both the energy- and alignment-based kernels, so we learn at the 2-dimensional rate.

  • For the Synchronized Oscillator dynamics (see (11) in [75]), each agent is indexed by i, \(\xi _i\) is its phase, \(\varvec{x}_i\) is (as usual) its position, \(\omega _i\) is the fixed natural frequency, \(\varvec{v}_i\) is the fixed self-propulsion velocity. The dynamics of \(\varvec{x}_i\) and \(\xi _i\) are governed by the following equations,

    $$\begin{aligned} {\left\{ \begin{array}{ll} {{\dot{\varvec{x}}}}_i &{}= \varvec{v}_i + \frac{1}{N}\sum _{i'=1}^{N}\left( \frac{\varvec{x}_{i'} - \varvec{x}_i}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| }(A+J \cos (\xi _{i'} - \xi _i)) - B\frac{\varvec{x}_{i'} - \varvec{x}_i}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| ^2} \right) \\ {{\dot{\xi }}}_i &{}= \omega _i + \frac{K}{N} \sum _{i'=1}^{N} \frac{\sin (\xi _{i'} - \xi _i)}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| } \end{array}\right. }, \nonumber \\ \end{aligned}$$
    (4.13)

    where A, J, B, K are constants. Casting (4.13) into our model, we see that we achieve the 2-dimensional optimal learning rate on each of the EA and \(\xi \) portions (rather than a 4-dimensional rate) due to the decoupled nature of the system; similarly, we only pay the 1-dimensional rate twice for the Phototaxis system (see section D of the SI in [47]). This is a key reason for splitting our learning theory between EA- and \(\xi \)-interaction kernels and accounting for shared and non-shared variables.

  • The rates of convergence of our estimators for all previously-studied first-order systems (see [45, 47, 75]) can be derived from Theorem 4.
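The oscillator system (4.13) above can be simulated directly; the following forward-Euler step is a minimal sketch (the step size, and the name Kc standing in for the constant K, are our own choices):

```python
import numpy as np

def sync_oscillator_step(x, xi, v, omega, A, J, B, Kc, dt):
    """One forward-Euler step of the swarming-oscillator dynamics (4.13).
    x: (N, d) positions, xi: (N,) phases, v: (N, d) self-propulsion velocities,
    omega: (N,) natural frequencies; A, J, B, Kc are the model constants."""
    N = x.shape[0]
    diff = x[None, :, :] - x[:, None, :]             # entry [i, i'] = x_i' - x_i
    r = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(r, np.inf)                      # exclude the i = i' term
    dphase = xi[None, :] - xi[:, None]               # xi_i' - xi_i
    attract = (A + J * np.cos(dphase))[:, :, None] * diff / r[:, :, None]
    repel = B * diff / (r ** 2)[:, :, None]
    dx = v + (attract - repel).sum(axis=1) / N
    dxi = omega + Kc * (np.sin(dphase) / r).sum(axis=1) / N
    return x + dt * dx, xi + dt * dxi
```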

One downside of the results above is the lack of dependence on L: it seems natural that finer time sampling of each trajectory should improve the results, at least up to a point. Indeed, the numerical experiments of [45, 47, 75] demonstrate that more data in L can improve performance. One technique used in [75] for very long trajectory data (large L, medium to small M) is to split each trajectory into many shorter ones, yielding a larger M with a smaller L for each. The dependence on the number of agents N is not the objective of this work; it was considered in [10] in the case of first-order systems, and further study in this mean-field regime is of interest to the authors, with work ongoing.
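The trajectory-splitting device from [75] mentioned above amounts to a reshape; a minimal sketch, with the array layout as our own assumption:

```python
import numpy as np

def split_trajectories(X, pieces):
    """Split M long trajectories of L frames into M * pieces shorter ones,
    trading trajectory length L for sample count M (as used in [75]).
    X: (M, L, ...) with L divisible by `pieces`."""
    M, L = X.shape[:2]
    assert L % pieces == 0
    return X.reshape(M * pieces, L // pieces, *X.shape[2:])
```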

4.6 Performance of trajectory prediction

Once estimators \(\widehat{\varvec{\varphi }}^{EA},\widehat{\varvec{\varphi }}^{\xi }\) are obtained, a natural question is the accuracy of the trajectories evolved with these estimated kernels. We compare the observed trajectories to the estimated trajectories evolved from the same initial conditions but with the estimated interaction kernels. Recall that \(\varvec{Y}_t = [ \varvec{X}^T_t, \varvec{V}^T_t, \varvec{\Xi }^T_t ]^T\) is the trajectory of the dynamics generated by the true and unknown interaction kernels with initial condition \(\varvec{Y}_0\), and \({{\hat{\varvec{Y}}}}_t = [ {{\hat{\varvec{X}}}}^T_t, {{\hat{\varvec{V}}}}^T_t, {{\hat{\varvec{\Xi }}}}^T_t]^T\) is the trajectory of the dynamics generated, with the same initial condition \({{\hat{\varvec{Y}}}}_0 = \varvec{Y}_0\), by the interaction kernels estimated from observations at times \(\{t_l\}_{l = 1}^L\). We let

$$\begin{aligned} \left\| \varvec{Y}_{t} - {{\hat{\varvec{Y}}}}_{t} \right\| _{{\mathcal {Y}}}^2:= \left\| \varvec{X}_{t} - {{\hat{\varvec{X}}}}_{t} \right\| _{{\mathcal {S}}}^2 + \left\| \varvec{V}_{t} - {{\hat{\varvec{V}}}}_{t} \right\| _{{\mathcal {S}}}^2 + \left\| \varvec{\Xi }_{t} - {{\hat{\varvec{\Xi }}}}_{t} \right\| _{{\mathcal {S}}}^2. \end{aligned}$$
(4.14)
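For concreteness, the norm (4.14) might be computed as follows. We assume here that the \({\mathcal {S}}\)-norm is the agent-averaged squared Euclidean norm, a common convention in this line of work; the paper's exact weighting (e.g. by agent masses) may differ.

```python
import numpy as np

def S_norm_sq(A, A_hat):
    """Assumed S-norm: mean over the N agents of the squared Euclidean error.
    A, A_hat: arrays of shape (N, d), or (N, 1) for the xi variable."""
    return np.mean(np.sum((A - A_hat) ** 2, axis=1))

def Y_norm_sq(X, V, Xi, X_hat, V_hat, Xi_hat):
    """The Y-norm (4.14): sum of the S-norm errors in position, velocity, xi."""
    return S_norm_sq(X, X_hat) + S_norm_sq(V, V_hat) + S_norm_sq(Xi, Xi_hat)
```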

The next theorem shows that the prediction error is (i) bounded trajectory-wise by a continuous-time version of the error functional, and (ii) bounded on average by the \(\varvec{L}^2(\varvec{\rho }_T^{EA})\) and \(\varvec{L}^2(\varvec{\rho }_T^{\xi })\) errors of the estimators, respectively. This further validates the use of our error functional and the \(\varvec{L}^2(\varvec{\rho }_T)\)-metric to assess the quality of the estimators. In particular, it shows that although the system is a coupled system of ODEs, our decoupled learning procedure, with our choice of norm, controls the expected supremum error as long as we minimize the \(\varvec{L}^2(\varvec{\rho }_T^{EA})\) and \(\varvec{L}^2(\varvec{\rho }_T^{\xi })\) norms in obtaining our estimators.

Theorem 5

Suppose that \({{\widehat{{\varvec{\phi }}}}}^E \in \varvec{{\mathcal {K}}}_{S_E}^E\), \({{\widehat{{\varvec{\phi }}}}}^A \in \varvec{{\mathcal {K}}}_{S_A}^A\) and \({{\widehat{{\varvec{\phi }}}}}^{\xi } \in \varvec{{\mathcal {K}}}_{S_{\xi }}^{\xi }\). Denote by \({\widehat{\varvec{Y}}}(t)\) and \(\varvec{Y}(t)\) the solutions of the systems with kernels \({{\widehat{{\varvec{\phi }}}}}^E=({{\widehat{\phi }}}_{kk'}^{E})_{k,k'=1}^{K,K}, {{\widehat{{\varvec{\phi }}}}}^A =({{\widehat{\phi }}}_{kk'}^{A})_{k,k'=1}^{K,K}\), and \({{\widehat{{\varvec{\phi }}}}}^{\xi } =({{\widehat{\phi }}}_{kk'}^{\xi })_{k,k'=1}^{K,K}\) and \({\varvec{\phi }}^E, {\varvec{\phi }}^A,{\varvec{\phi }}^{\xi }\) respectively, both with the same initial condition. Then

$$\begin{aligned} \sup _{t\in [0,T]}\Vert {\widehat{\varvec{Y}}}_t- \varvec{Y}_t\Vert _{{\mathcal {Y}}}^2&\le g(T) \bigg [2T^2 \int _{u=0}^t \int _{s=0}^u \Vert \ddot{\varvec{X}}_s \\&\quad - {\textbf{f}}^{\text {nc}, {{\dot{\varvec{x}}}}}(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) - {\textbf{f}}^{\widehat{\varvec{\phi }}^{EA}}(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) \Vert _{{\mathcal {S}}}^2 \,ds \,du \\&\quad + 2T \int _{s=0}^t \Vert \ddot{\varvec{X}}_s - {\textbf{f}}^{\text {nc}, {{\dot{\varvec{x}}}}}(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) - {\textbf{f}}^{\widehat{\varvec{\phi }}^{EA}}(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) \Vert _{{\mathcal {S}}}^2 \,ds \\&\quad + 2T\int _{s=0}^t \Vert {\dot{\varvec{\Xi }}}_s - {\textbf{f}}^{\text {nc}, \xi }(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) - {\textbf{f}}^{{\widehat{{\varvec{\phi }}}}_{\xi }}(\varvec{X}_s,\varvec{V}_s,\varvec{\Xi }_s) \Vert _{{\mathcal {S}}}^2 \,ds \bigg ] \end{aligned}$$

where \(g(T) =1+(1+B_1T)T\exp (A_1T+T^2/2)\). The constants are \(A_1 = 2T(8KP + {\mathcal {L}} +8QK + {\mathcal {L}}^{\xi })\) and \(B_1 = 2T^2(8KP + {\mathcal {L}})\); any unspecified constants are made precise in the proof and depend only on the Lipschitz constants of the noncollective forces and the feature maps, as well as on the values \(S^E, S^A, S^{\xi }\) coming from the admissible spaces. The supremum error is bounded on average, with respect to the initial distribution \(\varvec{\mu ^{\varvec{Y}}}\), by

$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}[\sup _{t\in [0,T]} \Vert {\widehat{\varvec{Y}}}_t- \varvec{Y}_t\Vert _{{\mathcal {Y}}}^2]&\le g(T)\bigg ((T^2K^2 + TK^2)\Vert \widehat{\varvec{\phi }}^{EA}- \varvec{\phi }^{EA}\Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA})}^2 \nonumber \\&\quad + TK^2\Vert \widehat{\varvec{\phi }}^{\xi } - \varvec{\phi }^{\xi } \Vert _{\varvec{L}^2(\varvec{\rho }_T^{\xi })}^2 \bigg ) \end{aligned}$$
(4.15)

with the measures \(\varvec{\rho }_T^{EA},\varvec{\rho }_T^{\xi }\) defined in (4.1) and (C.1). Expression (4.15) shows that, by minimizing the right-hand side, we can control the expected \({\mathcal {Y}}\)-supremum error of the estimated trajectories.

We postpone the somewhat lengthy proof to Appendix E.

5 Algorithm for constructing the interaction kernel estimators

Let \({\mathcal {H}}^{E}_{kk'}\) be a finite-dimensional function space of dimension \(n_{kk'}^{E}\) with basis functions given by piecewise polynomials whose degree will be chosen later (other types of basis functions are also possible, e.g., clamped B-splines as shown in [47]). It is built on uniform partitions of \([R^{\min , \text {obs}}_{kk'}, R^{\max , \text {obs}}_{kk'}]\), where \(R^{\min , \text {obs}}_{kk'}\)/\(R^{\max , \text {obs}}_{kk'}\) is the minimum/maximum interaction radius for agents of type \(k'\) influencing agents of type \(k\), derived from the observation data. A similar construction is done for \({\mathcal {H}}^{A}\) with dimension \(n_{kk'}^{A}\). We write the candidates \(\varphi ^{E}_{kk'}, \varphi ^{A}_{kk'}\) as linear combinations of the basis functions:

$$\begin{aligned} \varphi ^{E}_{kk'}(r, \varvec{s}^E)&= \sum _{\eta _{kk'}^{E} = 1}^{n_{kk'}^{E}} \alpha _{k, k', \eta _{kk'}^{E}}^{E}\psi ^{\varvec{x}}_{k, k', \eta _{kk'}^{E}}(r, \varvec{s}^E), \\ \varphi ^{A}_{kk'}(r, {\dot{r}}, \varvec{s}^A)&= \sum _{\eta _{kk'}^{A} = 1}^{n_{kk'}^{A}} \alpha _{k, k', \eta _{kk'}^{A}}^{A}\psi ^{\varvec{x}}_{k, k', \eta _{kk'}^{A}}(r, {\dot{r}}, \varvec{s}^A). \end{aligned}$$
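As an illustration of such a basis, piecewise-linear hat functions on a uniform partition can be built as follows; the paper's actual basis (degree, continuity, and tensor-grid construction for multivariable kernels) may differ in details, and the function name is ours.

```python
import numpy as np

def hat_basis(r, r_min, r_max, n):
    """Evaluate n piecewise-linear hat functions, centered at the knots of a
    uniform partition of [r_min, r_max], at the points r.
    Returns an array of shape (len(r), n)."""
    knots = np.linspace(r_min, r_max, n)
    h = knots[1] - knots[0]
    r = np.atleast_1d(r)[:, None]          # shape (len(r), 1) for broadcasting
    # each hat is 1 at its knot, decays linearly to 0 at the neighboring knots
    return np.clip(1.0 - np.abs(r - knots[None, :]) / h, 0.0, None)
```

At any point in the interval these hats sum to 1, so a linear combination with coefficients \(\alpha \) interpolates the values \(\alpha \) at the knots.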

Substituting these linear combinations back into (3.4), we obtain a system of linear equations,

$$\begin{aligned} A^{EA}_M\vec {\alpha }^{EA} = \vec {b}^{EA}_M, \end{aligned}$$

and minimizing the empirical loss functional corresponds to solving this system in the least-squares sense. Here, \(\vec {\alpha }^{EA} = \begin{bmatrix} (\vec {\alpha }^E)^T&(\vec {\alpha }^A)^T\end{bmatrix}^T \in {\mathbb {R}}^{n^{EA}}\), with \(\vec {\alpha }^E\) and \(\vec {\alpha }^A\) being the collections of the coefficients \(\alpha _{k, k', \eta _{kk'}^{E}}^{E}\) and \(\alpha _{k, k', \eta _{kk'}^{A}}^{A}\), respectively. Moreover, \(A^{EA}_M \in {\mathbb {R}}^{n^{EA} \times n^{EA}}\) and \(\vec {b}^{EA}_M \in {\mathbb {R}}^{n^{EA}}\). See Appendix D for full details.
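As a minimal sketch of this final step, once \(A^{EA}_M\) and \(\vec {b}^{EA}_M\) have been assembled from the data, the coefficients can be recovered with a least-squares solve that tolerates a (possibly) rank-deficient matrix:

```python
import numpy as np

def solve_coefficients(A, b):
    """Solve A alpha = b in the least-squares sense; lstsq handles
    rank-deficient A, which can occur when some basis functions see no data."""
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha
```

In practice one might instead regularize (e.g. ridge) when the data distribution leaves parts of the basis poorly constrained.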

The total computational complexity is as follows: \(MLN^2\) for computing the pairwise data, \(MLd(n^{EA})^2\) for constructing the learning matrix and right-hand side vector, and \((n^{EA})^3\) for solving the linear system; hence the total computing time is \(MLN^2 + MLd(n^{EA})^2 + (n^{EA})^3\). Assuming that a tensor-grid construction of \({\mathcal {H}}^{E}_{kk'}\) is used, and the optimal \(n_*\) with

$$\begin{aligned} n_* = {\mathcal {O}}\left( \left( \frac{M}{\log M}\right) ^{\frac{1}{2s + {\mathcal {V}}}}\right) \approx M^{\frac{1}{2s + {\mathcal {V}}}}. \end{aligned}$$

is used in each dimension, we have \(n^{EA} = 2n_*^{{\mathcal {V}}} \approx 2M^{\frac{{\mathcal {V}}}{2s + {\mathcal {V}}}}\); then we obtain the total computing time in terms of M as follows:

$$\begin{aligned} \text {Comp. Time} = MLN^2 + 4LdM^{\frac{2{\mathcal {V}}}{2s + {\mathcal {V}}} + 1} + 8M^{\frac{3{\mathcal {V}}}{2s + {\mathcal {V}}}} \end{aligned}$$

In the special case of \(s = 1\) (Lipschitz functions) and \({\mathcal {V}}= 1\), we have

$$\begin{aligned} \text {Comp. Time} = MLN^2 + 4LdM^{\frac{2}{3} + 1} + 8M \approx M^{\frac{2}{3} + 1}. \end{aligned}$$

It is slightly super-linear in M.

A similar computational complexity analysis for solving \(A^{\xi }\vec {\alpha }^{\xi } = \vec {b}^{\xi }\) shows that the computational cost is likewise slightly super-linear in M when \(s = 1\) and \({\mathcal {V}}= 1\).
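The operation count above can be sanity-checked with a short calculator. This is an illustrative sketch (the function name is ours): it drops the \(\log M\) factor in \(n_*\) and counts operations only up to constants.

```python
def comp_time(M, L, N, d, s=1.0, V=1.0):
    """Estimate of total operations: M L N^2 (pairwise data)
    + M L d (n_EA)^2 (matrix assembly) + (n_EA)^3 (linear solve),
    with n_EA = 2 n_*^V and n_* ~ M^{1/(2s+V)} per tensor-grid dimension."""
    n_star = M ** (1.0 / (2 * s + V))   # optimal per-dimension basis size
    n_EA = 2 * n_star ** V              # total tensor-grid dimension
    return M * L * N ** 2 + M * L * d * n_EA ** 2 + n_EA ** 3
```

For \(s = {\mathcal {V}} = 1\) this reduces to \(MLN^2 + 4LdM^{5/3} + 8M\), matching the display above.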

The overall memory needed for the learning problem is \(MLN(d(5 + n^{EA} + n^{\xi }) + 3)\): \(MLN(4d + 2)\) for storing the trajectory data, \(MLNd(n^{EA} + n^{\xi })\) (here \(n^{EA} = n^E + n^A\)) for the learning matrices, and \(MLN(d + 1)\) for the right-hand side vectors. Hence, if \(M \gg {\mathcal {O}}(1)\), we can parallelize over m to reduce the overhead memory, ending up with \(M_{\text {per core}}\big (LN(d(5 + n^{EA} + n^{\xi }) + 3)\big )\), where \(M_{\text {per core}} = \frac{M}{n_{\text {cores}}}\). The final storage of A and \(\vec {b}\) only requires \(n^{EA}(n^{EA} + 1) + n^{\xi }(n^{\xi } + 1)\).
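The per-core memory count can likewise be sketched (an illustrative helper of our own, counting scalars rather than bytes):

```python
def memory_per_core(M, L, N, d, n_EA, n_xi, n_cores=1):
    """Scalars held per core when the M trajectories are split evenly:
    trajectory data MLN(4d+2), learning matrices MLNd(n_EA+n_xi),
    right-hand side vectors MLN(d+1) -- combined as in the text."""
    return (M // n_cores) * L * N * (d * (5 + n_EA + n_xi) + 3)
```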

6 Applications

Our learning theory, as well as the associated measures, norms, and functionals, can be applied to all the examples considered in [45, 47, 75]. These examples, particularly those of [75], can thus be regarded as applications of the theoretical results as well as of the algorithm in Appendix D. We choose to study one new dynamics, not considered in [47, 75], since it exhibits some unique features of our generalized model; in particular, we choose it because of its special form, having both energy-based and alignment-based interactions. It is the anticipation dynamics (AD) model of [65]. Table 4 shows the values of the learning parameters for these dynamics.

Table 4 Values of parameters for the learning

The setup of the learning experiment is as follows. We use \(M_{\rho }\) different initial conditions to evolve the dynamicsFootnote 1 from 0 to T, for the sole purpose of obtaining a good approximation to \(\rho _T^{L, EA}, \rho _T^{L, E}\) and \(\rho _T^{L, A}\). We then use another set of M (\(M = 500\) for FwEP and \(M = 750\) for AD) initial conditions to generate training data and learn the corresponding \(\phi ^{E}\) and \(\phi ^{A}\) from the empirical distributions \(\rho _T^{L, M, EA}\), etc. We report the relative learning errors, calculated via (4.4), for \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\), \({\widehat{\phi }}^{E}\), and \({\widehat{\phi }}^{A}\), along with pictorial comparisons of the interaction kernels and a visualization of the pairwise data used to learn the estimated kernels. We then evolve the dynamics, either from the training set of M initial conditions or from another set of M randomly chosen initial conditions, with both \(\phi ^{E}\oplus \phi ^{A}\) and \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\), from 0 to \(T_f > T\), and report the trajectory errors calculated using (6.1) on \(\varvec{y}\) (the whole system), \(\varvec{x}\) (the position), and \(\varvec{v}\) (the velocity); pictorial comparisons of the trajectories are also shown. We report the trajectory errors over [0, T] and \([T, T_f]\). The learning results are shown in the following sections. We consider a related norm on the trajectory \(\varvec{Y}_{[0, T]} = \{\varvec{Y}_{t_l}\}_{l = 1}^L\) (\(0 = t_1< \cdots < t_L = T\)):

$$\begin{aligned} \left\| \varvec{Y}_{[0, T]} - {{\hat{\varvec{Y}}}}_{[0, T]} \right\| _{\text {traj}} = \max _{l = 1, \ldots , L} \left\| \varvec{Y}(t_l) - {{\hat{\varvec{Y}}}}(t_l) \right\| _{{\mathcal {Y}}}. \end{aligned}$$
(6.1)

We also consider a relative version, invariant under changes of units of measure:

$$\begin{aligned} \left\| \varvec{Y}_{[0, T]} - {{\hat{\varvec{Y}}}}_{[0, T]} \right\| _{\text {traj}^*} = \frac{\left\| \varvec{Y}_{[0, T]} - {{\hat{\varvec{Y}}}}_{[0, T]} \right\| _{\text {traj}}}{\left\| \varvec{Y}_{[0, T]} \right\| _{\text {traj}}}. \end{aligned}$$

Lastly, we report errors between \(\varvec{X}_{[0, T]}\) and \({{\hat{\varvec{X}}}}_{[0, T]}\),

$$\begin{aligned} \left\| \varvec{X}_{[0, T]} - {{\hat{\varvec{X}}}}_{[0, T]} \right\| _{{\mathcal {S}}^*} = \frac{\max _{l = 1,\ldots ,L}\{\left\| \varvec{X}(t_l) - {{\hat{\varvec{X}}}}(t_l) \right\| _{{\mathcal {S}}}\}}{\max _{l = 1, \ldots ,L}\{\left\| \varvec{X}(t_l) \right\| _{{\mathcal {S}}}\}}. \end{aligned}$$

Similar re-scaled norms are used for the difference between \(\varvec{V}_{[0, T]}\) and \({{\hat{\varvec{V}}}}_{[0, T]}\), and for the difference between \(\varvec{\Xi }_{[0, T]}\) and \({{\hat{\varvec{\Xi }}}}_{[0, T]}\).
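The norms (6.1) and their relative versions might be implemented as follows, assuming the per-time norm is the plain Euclidean norm of the stacked state; the paper's agent-averaged \({\mathcal {S}}\)-norms differ only by constant factors, which cancel in the relative version.

```python
import numpy as np

def traj_norm(Y, Y_hat):
    """Norm (6.1): max over snapshots of the per-snapshot state error.
    Y, Y_hat: arrays of shape (L, D) -- L snapshots of a D-dimensional state."""
    return np.max(np.linalg.norm(Y - Y_hat, axis=1))

def traj_norm_rel(Y, Y_hat):
    """Relative version: invariant under changes of units of measure."""
    return traj_norm(Y, Y_hat) / np.max(np.linalg.norm(Y, axis=1))
```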

6.1 Learning results for anticipation dynamics with \(U(r) = \frac{r^p}{p}\)

The energy-based interactions are constant in the FwEP models; to consider more complicated interactions, depending on the pairwise distance and more, the AD models are suitable candidates. The dynamics of the AD model are given as follows:

$$\begin{aligned} \ddot{\varvec{x}}_i&= \frac{1}{N}\sum _{i' = 1, i' \ne i}^N \frac{\tau U'(\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| )}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| }({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i) \nonumber \\&\quad + \frac{1}{N}\sum _{i' = 1, i' \ne i}^N\Big \{\frac{-\tau U'(\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| )(\varvec{x}_{i'} - \varvec{x}_i)\cdot ({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i)}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| ^3} \nonumber \\&\quad + \frac{\tau U''(\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| )(\varvec{x}_{i'} - \varvec{x}_i)\cdot ({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i)}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| ^2} + \frac{U'(\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| )}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| }\Big \}(\varvec{x}_{i'} - \varvec{x}_i). \end{aligned}$$
(6.2)

Here \(\tau \) measures the amount (in time) of anticipation. In order to fit the model into our learning regime, we take

$$\begin{aligned} \phi ^{A}(r):= \frac{\tau U'(r)}{r} \quad \text {and} \quad \phi ^{E}(r, s):= \frac{-\tau U'(r)s}{r^3} + \frac{\tau U''(r)s}{r^2} + \frac{U'(r)}{r}. \end{aligned}$$

Here we have no \(\xi _i\), \(K= 1\), \(m_i = 1\), and

$$\begin{aligned} s^E_{i, i'} = s^A_{i, i'}:= (\varvec{x}_{i'} - \varvec{x}_i)\cdot ({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i). \end{aligned}$$

We also use \(\tau = 0.1\).
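For \(U(r) = \frac{r^p}{p}\) (the potential of the next subsection), we have \(U'(r) = r^{p-1}\) and \(U''(r) = (p-1)r^{p-2}\), so the kernels above take closed forms. The following is a direct transcription of those formulas with \(\tau = 0.1\) and \(p = 1.5\) as defaults; the function names are ours, not the paper's code.

```python
def phi_A(r, tau=0.1, p=1.5):
    # alignment kernel: tau * U'(r) / r
    return tau * r ** (p - 1) / r

def phi_E(r, s, tau=0.1, p=1.5):
    # energy kernel: -tau U'(r) s / r^3 + tau U''(r) s / r^2 + U'(r) / r,
    # with s the pairwise feature (x_{i'} - x_i) . (xdot_{i'} - xdot_i)
    dU = r ** (p - 1)
    ddU = (p - 1) * r ** (p - 2)
    return -tau * dU * s / r ** 3 + tau * ddU * s / r ** 2 + dU / r
```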

It is shown in [65] that if \(U''\) is bounded as \(r \rightarrow \infty \) and \(U(0) = U'(0) = 0\), then unconditional flocking occurs. We take \(U(r) = \frac{r^p}{p}\) for \(1 < p \le 2\), for which the system exhibits unconditional flocking. We choose \(p = 1.5\) for our learning trials.Footnote 2 We use a tensor grid of \(1^{st}\)-degree piecewise standard polynomials with \(n^E = 28^2\) for learning \(\phi ^{E}(r, s)\), and a set of \(1^{st}\)-degree piecewise standard polynomials with \(n^A = 138\) for learning \(\phi ^{A}(r)\). For the energy-based interactions we have the following results.

Fig. 1
figure 1

The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data

As shown in Fig. 1b, the pairwise distance data concentrate away from 0, making the estimation of \(\phi ^{E}(r, s)\) for r close to 0 extremely difficult; moreover, since \(\phi ^{E}\) is also weighted by the pairwise difference \(\varvec{x}_{i'} - \varvec{x}_i\), information is also lost when \(r_{i, i'}\) is close to 0. Next, we present the alignment-based interaction kernels in Fig. 2a.

Fig. 2
figure 2

The lines shown in blue are the estimated interaction kernels, and the lines shown in black are the true interaction kernels. The colored areas shown in the background are the learned distributions of pairwise distance data

As shown in Fig. 2, the behavior of \(\phi ^{A}\) at \(r = 0\) is learned accurately. Less accurate is the estimation of \(\phi ^{A}\) for large r: since the agents have aligned their velocities, the weight \(\varvec{v}_{i'} - \varvec{v}_i\) is close to the zero vector. The overall learning performance for estimating \(\phi ^{A}\) is better than that for \(\phi ^{E}\). The \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\) error is \(6 \cdot 10^{-1} \pm 3.0 \cdot 10^{-1}\). A comparison of trajectories between the true kernels (LHS) and the estimators (RHS) is shown in Fig. 3.

Fig. 3
figure 3

\(U(r) = \frac{r^{1.5}}{1.5}\): trajectory comparison

As shown in Fig. 3, visually, there is no difference between the true dynamics and the estimated dynamics. We offer more quantitative insight into the difference between the two in Table 5.

Table 5 \(U(r) = \frac{r^{1.5}}{1.5}\): trajectory errors

We maintain 3-digit relative accuracy in estimating the positions and velocities of the agents, even though for the interaction kernels we only maintain 1-digit relative accuracy.

7 Conclusion and further directions

We have described a second-order model of interacting agents that incorporates multiple agent types, an environment, external forces, and multivariable interaction kernels. The inference procedure described exploits the structure of the system to achieve a learning rate that depends only on the dimension of the interaction kernels, which is much smaller than the full ambient dimension \((2d+1)N\). Our estimators are strongly consistent and, in fact, achieve learning rates that are minimax optimal within the nonparametric class, under mild assumptions on the interaction kernels and the system. We described how to relate the expected supremum error of the trajectories of the system driven by the estimated interaction kernels to the difference between the true and estimated interaction kernels; this result gives strong support to the use of our weighted \(L^2\) norms as the correct way to measure performance and derive estimators. A detailed discussion of the full numerical algorithm, including the inverse problem derived from data and a coercivity condition ensuring learnability, along with complex examples, was presented, and we showed how the formulation covers a very wide range of systems from many disciplines.

There are various ways that one could build on this work to handle different systems and for many of these further directions, the theoretical framework, techniques, and theorems presented here would be directly useful. In particular, one could consider second-order stochastic systems or a similar system but on a manifold, more complex environments, having more unknowns within the model beyond just the interaction kernels (say estimating the non-collective forces as well), identifying the best feature maps to model the data, and considering semiparametric problems where there are hidden parameters within the interaction kernels or other parts of the model that we wish to estimate along with the interaction kernels. The generality of the model and its broad coverage of models across the sciences, together with the scalability and performance of the algorithm, could inspire new models—both explicit equations and nonparametric estimators learned from data—which are theoretically justified and highly practical.