Abstract
Modeling the complex interactions of systems of particles or agents is a fundamental problem across the sciences, from physics and biology to economics and the social sciences. In this work, we consider second-order, heterogeneous, multivariable models of interacting agents or particles, within simple environments. We describe a nonparametric inference framework to efficiently estimate the latent interaction kernels which drive these dynamical systems. We develop a learning theory which establishes strong consistency and optimal nonparametric minimax rates of convergence for the estimators, as well as provably accurate predicted trajectories. The optimal rates depend only on the intrinsic dimension of the interactions, which is typically much smaller than the ambient dimension. Our arguments are based on a coercivity condition which ensures that the interaction kernels can be estimated in a stable fashion. The numerical algorithm presented to build the estimators is parallelizable, performs well on high-dimensional problems, and its performance is tested on a variety of complex dynamical systems.
1 Introduction
Physical, biological, and social systems across all scales of complexity and size can often be described as dynamical systems written in terms of interacting agents (e.g. particles, cells, humans, planets,...). Rich theories have been developed to explain the collective behavior of these interacting agents across many fields including astronomy, particle physics, economics, social science, and biology. Examples include predator–prey systems, molecular dynamics, coupled harmonic oscillators, flocking birds or milling fish, human social interactions, and celestial mechanics, to name a few. In order to encompass many of these examples, we will consider a rather general family of second-order, heterogeneous (the agents can be of different types), interacting (the acceleration of an agent is a function of properties of the other agents) agent systems that includes external forces, masses of the agents, multivariable interaction kernels, and an additional environment variable that is a dynamical property of the agent (for example, a firefly having its luminescence varying in time). We propose a learning approach that combines machine learning and dynamical systems in order to provide highly accurate dynamical models of the observation data from these systems.
The model and learning framework presented in Sects. 2–4 includes a very large number of relevant systems and allows for their modeling. Clustering of opinions [8, 21, 41, 53] is a simple first-order example. Flocking of birds [22, 23, 26] can be modeled as the behavior of a second-order system that exhibits an emergent shared velocity of all agents. Milling of fish [1, 2, 19, 20] may be modeled as the large-time behavior of a second-order system (in 2 or 3 dimensions), with a non-collective force from the environment. A model of oscillators (fireflies) that sync and swarm together, and have their dynamics governed by their positions and a phase variable \(\xi \), was studied in [54,55,56, 66]. There are also models that include both energy and alignment interaction kernels; a particular case is the anticipation dynamics model from [65], which we also consider in this work. These dynamics exhibit a wide range of emergent behaviors, and as shown in [4, 18, 23, 35, 53, 67, 72], the behaviors can be studied when the governing equations are known. However, if the equations are not known and the data consists only of trajectories, we still wish to develop a model that can make accurate predictions of the trajectories and discover a dynamical form that accurately reflects their emergent properties. To achieve this, we present a provably optimal learning algorithm that is accurate, captures emergent behavior for large time, and, by exploiting the structure of the collective dynamical system, avoids the curse of dimensionality.
Our learning approach discovers the governing laws of a particular subset of dynamical systems of the form

$$\dot{\varvec{Y}}(t) = \varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}(\varvec{Y}(t)).$$
The learning problem is to infer the right hand side function \(\varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}\) from observations \(\{\varvec{Y}^{(m)}_{t_l}, \smash {{{\dot{\varvec{Y}}}}^{(m)}_{t_l}}\}_{m=1,l= 1}^{M,L}\) of the dynamical system, where m indexes different trajectories, started from initial conditions (IC) sampled i.i.d. from a measure \(\varvec{\mu ^{\varvec{Y}}}\) on the state space. Here M is the total number of trajectories observed, with each trajectory forming a single observation (M plays a fundamental role in the learning theory, where we study convergence as M varies); L refers to the number of observations at different times along each trajectory. Throughout this work, m will index the trajectories \(1,\ldots , M\) and l will index the points in time \(1, \ldots , L\). The main difficulties in establishing an effective theory of learning \(\varvec{F}_{\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }}\) are the curse of dimensionality caused by the dimension of \(\varvec{Y}\), which is \(D = N(2d + 1)\), where N is the number of agents and d the dimension of physical space; and the dependence within the observation data: for example, \(\varvec{Y}(t_{l + 1})\) is a deterministic function of \(\varvec{Y}(t_{l})\).
We present a learning approach based on exploiting the structure of collective dynamical systems and nonparametric estimation techniques (see [6, 24, 31, 36, 69]) where we recover the interaction kernels \(\varvec{\phi }^{EA}, \varvec{\phi }^{\xi }\). A simplified form of our model equations, generalizing the first order models, is derived from Newton’s second law and given by: for \(i = 1, \ldots , N\),

$$m_i \ddot{\varvec{x}}_i = F^{{{\dot{\varvec{x}}}}}(\varvec{x}_i, {{\dot{\varvec{x}}}}_i) + \frac{1}{N} \sum _{i'=1}^{N} \Big [ \phi ^{E}(\Vert \varvec{x}_{i'} - \varvec{x}_i \Vert )\, (\varvec{x}_{i'} - \varvec{x}_i) + \phi ^{A}(\Vert \varvec{x}_{i'} - \varvec{x}_i \Vert )\, ({{\dot{\varvec{x}}}}_{i'} - {{\dot{\varvec{x}}}}_i) \Big ]. \qquad (1.1)$$
Here, \(m_i\) is the mass of the ith agent, \(\varvec{x}_i\) is its position, \(F^{{{\dot{\varvec{x}}}}}\) is a non-collective force, and \(\phi ^E, \phi ^A: {\mathbb {R}}^+ \rightarrow {\mathbb {R}}\) are known as the interaction kernels.
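To make the data-generation setting concrete, the following is a minimal sketch (not the authors' code) of simulating a system of this simplified form, under illustrative assumptions: homogeneous agents, unit masses, zero non-collective force, a 1/N averaging of pairwise interactions, forward-Euler time stepping, and hypothetical kernels supplied by the caller.

```python
import numpy as np

def accelerations(x, v, phi_E, phi_A):
    """Pairwise energy/alignment interactions, averaged over agents.
    x, v: arrays of shape (N, d); phi_E, phi_A: vectorized kernels of r."""
    N = x.shape[0]
    a = np.zeros_like(x)
    for i in range(N):
        dx = x - x[i]                        # x_{i'} - x_i, shape (N, d)
        dv = v - v[i]                        # velocity differences
        r = np.linalg.norm(dx, axis=1)       # pairwise distances r_{ii'}
        mask = np.arange(N) != i             # exclude self-interaction
        a[i] = (phi_E(r[mask])[:, None] * dx[mask]
                + phi_A(r[mask])[:, None] * dv[mask]).sum(axis=0) / N
    return a

def simulate(x0, v0, phi_E, phi_A, h=0.01, L=100):
    """Forward-Euler integration; returns L snapshots of positions/velocities."""
    x, v = x0.copy(), v0.copy()
    X, V = [x.copy()], [v.copy()]
    for _ in range(L - 1):
        a = accelerations(x, v, phi_E, phi_A)
        x, v = x + h * v, v + h * a
        X.append(x.copy()); V.append(v.copy())
    return np.array(X), np.array(V)
```

Running M such simulations from i.i.d. initial conditions produces an observation set of the kind described below; in practice a higher-order integrator would replace forward Euler.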
To use the trajectory data to derive estimators, we consider appropriate hypothesis spaces in which to build our estimators, measures adapted to the dynamics, norms, and other performance metrics, and ultimately an inverse problem built from these tools. More specifically, let \(\widehat{\varvec{\phi }}^{EA}\) denote the direct sum of the kernels \(\widehat{\varvec{\phi }}^{E}\oplus \widehat{\varvec{\phi }}^{A}\) (for the notation, see Sect. 3), and define our estimator as

$$\widehat{\varvec{\phi }}^{EA} \in \mathop {\text {arg}\,\text {min}}\limits _{\varvec{\varphi }^{EA} \in \varvec{{\mathcal {H}}}^{EA}} \varvec{{\mathcal {E}}}_{M}^{EA}(\varvec{\varphi }^{EA}),$$
where \(\varvec{{\mathcal {E}}}_{M}^{EA}\) is an empirical error functional depending on the observation data and \(\varvec{{\mathcal {H}}}^{EA}\) is a hypothesis space in which to search for our estimators; given the form of the error functional, the estimator is calculated as the solution of a constrained least squares problem. Once we have obtained this estimated interaction kernel, we study its properties as a function of the amount of trajectory data, namely the M trajectories sampled from different initial conditions of the same underlying system, each consisting of L time observations along the trajectory. Here we study properties of the error functional, establish the uniqueness of its minimizers, and use the probability measures to define a dynamics-adapted norm to measure the error of our estimators over the hypothesis spaces. In comparing the estimators to the true interaction kernels, we first establish concentration estimates over the hypothesis space.
Our first main result is the strong asymptotic consistency of our learned estimators, as the number M of trajectories increases, which for the model (1.1) yields:

$$\lim _{M \rightarrow \infty } \big \Vert \widehat{\varvec{\phi }}^{EA} - \varvec{\phi }^{EA} \big \Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} = 0 \quad \text {almost surely,}$$
where \(\varvec{\rho }_T^{EA,L}\) is a dynamics-adapted probability measure on pairwise distances, and we use a weighted \(\varvec{L}^2\) space (see Sect. 4, particularly (4.3)); see Sect. 3 for the required definitions and Sect. 4.3.2 for the full theorem. In fact, we also prove a stronger result that provides the rate of convergence. We achieve the minimax rate of convergence for any number of effective variables \({\mathcal {V}}\) in the interaction kernels. See Sect. 4.4 for the full theorem (see Sect. 3 for relevant definitions), whose rate, for kernels of Hölder regularity \(s\), is given by:

$${\mathbb {E}} \left[ \big \Vert \widehat{\varvec{\phi }}^{EA} - \varvec{\phi }^{EA} \big \Vert _{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \right] \le C \left( \frac{\log M}{M} \right) ^{\frac{s}{2s + {\mathcal {V}}}}.$$
In the case of model (1.1), we have \({\mathcal {V}}=1\). Our result recovers the results for first-order systems [45, 47].
This means that our estimators converge in M, up to a logarithmic factor, at the same rate as the best possible estimator one could construct when the initial conditions are randomly sampled from some underlying initial condition distribution, denoted \(\varvec{\mu ^{\varvec{Y}}}\) throughout this work (see Sect. 4.3).
To solve the inverse problem, we give a detailed discussion of the notion of coercivity of the system (detailed in Sect. 4.2), an essential link between the approximation properties, the algorithm design, and the learning theory, in all of which coercivity plays a key role. We also present numerical examples in Sect. 6 (see also the detailed numerical study in [75]), which help to explain why the particular norms we define are the right choice, and show excellent performance on complex dynamical systems.
Our paper is structured as follows. The first part of the paper describes the model, learning framework, inference problem, and the basic tools needed for the learning theory. These ideas are all explained in detail in Sects. 2–4. If one wishes to quickly jump to the theoretical sections, and then refer back to the definitions as needed, we have provided Tables 1 and 3, which explain the model equations and outline the definitions and concepts needed for the learning theory and general theoretical results, respectively. The theoretical part of the paper (Sects. 4.2–4.5) discusses fundamental questions of identifiability and solvability of the inverse problem, consistency, and rate of convergence of the estimators, and the ability to control trajectory error of the evolved trajectories using our estimators. Some key highlights of our theoretical contributions are described in Sect. 3.4, with full details in the corresponding sections. Lastly, we consider applications in Sect. 6, and provide many additional proofs and details in the appendices.
2 Model description
In order to motivate the choice of second-order models considered in this paper, we begin our discussion with a simple second-order model derived from classical mechanics. Let us consider a closed system of N homogeneous agents (or particles) equipped with a certain type of Lagrangian L(t) in the form
Here U is a potential energy depending on pairwise distance. From the Lagrange equation, \(\frac{d}{dt}\partial _{{{\dot{\varvec{x}}}}_i}L = \partial _{\varvec{x}_i}L\), we obtain the second-order collective dynamics model
Here, \(\phi ^{E}(r) = \frac{U'(r)}{r}\) represents an energy-based interaction between agents. We assume a regularity condition on \(\phi ^{E}\), i.e. \(\phi ^{E}(0)\cdot 0 = 0\), so that the interaction term is well defined at \(r = 0\). For example, the choice \(U(r) = \frac{NGm_{i'}m_i}{r}\) corresponds to Newton’s gravity model.
In order to incorporate a wider spectrum of behaviors, we add alignment-based interactions, which enable the alignment of velocities (so that short-range repulsion, mid-range alignment, and long-range attraction are all present), auxiliary state variables describing internal states of agents (emotion, excitation, phases, etc.), and non-collective forces (interaction with the environment). We also allow for heterogeneous systems, consisting of agents belonging to \(K\) disjoint types: in this case we partition the agents into \(K\) disjoint subsets \(\{C_{k}\}_{k= 1}^{K}\), with \(N_{k}\) being the number of agents of type \(k\), grouped in the index subset \(C_k\). These systems will be modeled by equations of the form
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Figa_HTML.png)
for \(i = 1, \ldots , N\), where \(k_i\) is the type of agent i. The interaction kernels \(\phi ^{E}_{kk'}, \phi ^{A}_{kk'}, \phi ^{\xi }_{kk'}\) are in general different for each directed pair of interacting agent types; moreover, they not only depend on the pairwise distance \(r_{ii'}(t) = \left\| \varvec{x}_{i'}(t) - \varvec{x}_i(t) \right\| \), but also on other (known) pairwise features, \(\varvec{s}^E_{i i'}, \varvec{s}^A_{i i'}, \varvec{s}^{\xi }_{i i'}\), which are vector-valued functions of \(\varvec{x}_i(t), {{\dot{\varvec{x}}}}_i(t), \xi _i(t), \varvec{x}_{i'}(t), {{\dot{\varvec{x}}}}_{i'}(t), \xi _{i'}(t)\). For example, the interactions between birds or fish may depend on the field of vision, not just the distance between pairs of birds or fish. We will often suppress the explicit dependence on time t when it is clear from the context. These feature maps are modeled as being the composition of a vector-valued feature map \({\mathcal {F}}\) shared among all types of agent, and a projection, for each ordered pair of agent types, onto a subset of coordinates in the range of this map—see Table 1. The unknowns in these equations, for which we will construct estimators, are the functions
\(\{\phi ^{E}_{kk'}\}_{k,k'=1}^{K}\), \(\{\phi ^{A}_{kk'}\}_{k,k'=1}^{K}\), and \(\{\phi ^{\xi }_{kk'}\}_{k,k'=1}^{K}\); everything else is assumed given.
We note that in what follows, the notation \({\{E,A,\xi \}}\) attached to a map/function/etc. means that there is one of those maps/functions/etc. for each element in the set \({\{E,A,\xi \}}\). It is a convenient way to avoid excessive repetition of similar definitions.
The specific instances of the feature map \({\mathcal {F}}\) together with corresponding projections \(\pi _{kk'}^{{\{E,A,\xi \}}}\) include a variety of systems that have found a wide range of applications in physics, biology, ecology, and social science; see the examples in Table 2 below. We assume that the function \({\mathcal {F}}\) is Lipschitz, and known, together with all the \(\pi _{kk'}^{{\{E,A,\xi \}}}\)’s. The Lipschitz assumption is sufficient to ensure the well-posedness of the system and will also be used to control the trajectory error; of course it implies that the feature maps \(\varvec{s}_{(k, k')}^{{\{E,A,\xi \}}}\) are all Lipschitz. The function \({\mathcal {F}}\) is a uniform way to collect all of the different variables (functions of the inputs) used across any of the \((k,k')\) pairs over all of the \(E,A,\xi \) functions in the system. This uniformity is helpful when discussing the rate of convergence, among other places. Examples where this generality matters arise naturally, e.g. when one has a different number of variables across interaction kernels for different pairs \((k,k')\), or when the energy and alignment kernels depend on r and on additional, distinct variables. From this uniform set of variables, we then project to arrive at the relevant function \(\varvec{s}_{(k, k')}^{{\{E,A,\xi \}}}\) for each pair (and each of the elements of the wildcard). Lastly, we evaluate this map at the specific pair of agents \((i,i')\), which leads to the feature evaluation \(\varvec{s}_{ii'}^{{\{E,A,\xi \}}}\), the expression used in the model equation (2.2).
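As an illustration of the composition just described, the sketch below builds pairwise features by composing a shared map with per-type-pair coordinate projections. The concrete choices here are hypothetical: the shared map collects the distance \(r\), its time derivative \({\dot{r}}\), and the difference of environment variables, which need not match the paper's \({\mathcal {F}}\).

```python
import numpy as np

def feature_map(state_i, state_j):
    """Hypothetical shared feature map F: collects candidate pairwise
    variables in a fixed order.  Each state is a tuple (x, v, xi)."""
    xi_x, xi_v, xi_e = state_i
    xj_x, xj_v, xj_e = state_j
    r = np.linalg.norm(xj_x - xi_x)
    # d/dt of the distance: (x_j - x_i) . (v_j - v_i) / r
    rdot = 0.0 if r == 0 else float((xj_x - xi_x) @ (xj_v - xi_v)) / r
    dxi = xj_e - xi_e
    return np.array([r, rdot, dxi])

# Projections: each (kind, k, k') selects a subset of coordinates of F.
projections = {("E", 1, 1): [0],       # energy kernel of pair (1,1) uses r only
               ("A", 1, 1): [0, 1]}    # alignment kernel also uses rdot

def pairwise_features(kind, k, kp, state_i, state_j):
    """Feature evaluation s_{ii'}: shared map followed by the projection."""
    return feature_map(state_i, state_j)[projections[(kind, k, kp)]]
```

The design mirrors the text: one uniform variable list, then a projection per ordered type pair and per kernel kind.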
The model class (2.2) is quite large. We will consider several different concrete examples in Sect. 6.1. We summarize how those examples, and others, map to the model class in Table 2, with a shaded (respectively: empty) cell indicating that the model has (respectively: does not have) that characteristic. A numeric value indicates the number of unique variables, \({\mathcal {V}}, {\mathcal {V}}^{\xi }\), used within the EA or \(\xi \) portions of the system. The number of these unique variables specifies the dimension in the minimax convergence rate, see Sect. 4.4.
3 Inference problem and learning approach
In this section, we first introduce the problem of inferring the interaction kernels from observations of trajectory data and give a brief review and generalization of the learning approach proposed in the works [47] and [75].
3.1 Preliminaries and notation
We vectorize the model in (2.2) in order to obtain a more compact description. We let \(\varvec{v}_i(t):= {{\dot{\varvec{x}}}}_i(t)\) and
We introduce the weighted norm
![](http://media.springernature.com/lw157/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ5_HTML.png)
for \(\varvec{Z}= \begin{bmatrix} \varvec{z}_1^T,&\ldots ,&\varvec{z}_N^T\end{bmatrix}^T\) with each \(\varvec{z}_i \in {\mathbb {R}}^{d}\) or \({\mathbb {R}}\). Here \(\left\| \cdot \right\| \) is the same norm used in the construction of pairwise distance data for the interaction kernels (typically, the Euclidean norm). The weight factor is introduced so that agents of different types are overall weighted equally, which is important in the estimation phase, especially when the number of agents of different types is highly non-uniform. The model (2.2) becomes
Here \(\vec {m} = \begin{bmatrix} m_1,&\ldots ,&m_N \end{bmatrix}^T\in {\mathbb {R}}^N\), \(\circ \) is the Hadamard product, and we use boldface fonts to denote the vectorized form of our estimators (with some once-for-all-fixed ordering of the pairs \((k,k')_{k,k'=1,\dots ,K}\)):
and of the non-collective force:
![](http://media.springernature.com/lw427/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ181_HTML.png)
both of which are vectors in \({\mathbb {R}}^{Nd}\). We omit the analogous definitions for \({\textbf{f}}^{\varvec{\phi }^A}\) and \({\textbf{f}}^{\varvec{\phi }^{\xi }}\). We also use the shorthand:
to denote the element of the direct sum of the function spaces containing \(\varvec{\phi }^{E}, \varvec{\phi }^{A}\).
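For concreteness, here is a small sketch of computing a norm of this kind. It assumes, as our reading of the weight-factor discussion above (an assumption, since the display is not reproduced here), that agent i's contribution is scaled by \(1/N_{k_i}\), the number of agents of its type.

```python
import numpy as np

def S_norm(Z, types):
    """Weighted norm ||Z||_S: each agent's squared contribution is scaled
    by 1/N_{k_i} (assumed weight), so the K types contribute comparably.
    Z: array of shape (N, d) or (N,); types: length-N list of type labels."""
    types = np.asarray(types)
    counts = {k: int((types == k).sum()) for k in set(types.tolist())}
    w = np.array([1.0 / counts[k] for k in types])          # 1 / N_{k_i}
    sq = np.array([float(np.dot(z, z)) for z in np.atleast_2d(Z.T).T])
    return float(np.sqrt((w * sq).sum()))
```

With this weighting, a type with many agents and a type with few contribute on the same scale, as the estimation phase requires.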
3.2 Problem setting
Our observation data is given by approximations of \(\{\varvec{Y}^{(m)}_{t_l}, \smash {{{\dot{\varvec{Y}}}}^{(m)}_{t_l}}\}_{m=1,l= 1}^{M,L}\) for \(0 = t_1< t_2< \cdots < t_L = T\). Here \(\varvec{Y}_t=[\varvec{y}_1^T(t),\ldots , \varvec{y}_N^T(t)]\) and \( \varvec{y}_i(t) = \begin{bmatrix}\varvec{x}_i^T(t), {{\dot{\varvec{x}}}}_i^T(t), \xi _i(t)\end{bmatrix}^T\), and m indexes the M different trajectories, each generated by the system (2.2) with the unknown set of interaction kernels \(\varvec{\phi }^E, \varvec{\phi }^A, \varvec{\phi }^{\xi }\), with initial conditions \(\{\varvec{Y}^{(m)}(0)\}_{m=1,\ldots ,M}\) drawn i.i.d. from \(\varvec{\mu ^{\varvec{Y}}}\), a probability measure defined on the space \({\mathbb {R}}^{N(2d+1)}\). We use a superscript (m) to denote that the variable is calculated from the data of the \(m^{\text {th}}\) trajectory. The objective is to construct estimators \(\widehat{\varvec{\phi }}^{E}, \widehat{\varvec{\phi }}^{A}, \widehat{\varvec{\phi }}^{\xi }\) of the unknown interaction kernels given these observations.
3.3 Loss functionals
For simplicity, we only consider equidistant observation points: \(t_{l} - t_{l - 1} = h\) for \(l = 2, \ldots , L\); the proposed estimator is easily extended to the case of non-equispaced time points. Following and extending [45, 47, 75], we consider the empirical error functional (recall the shorthand notation (3.3))
where we recall \(\Vert \cdot \Vert _{{\mathcal {S}}}\) in (3.1); this norm is introduced to balance the contributions from different species. The estimators of the interaction kernels are defined as the minimizers of the error functionals \(\varvec{{\mathcal {E}}}_{M}^{EA}\) and \(\varvec{{\mathcal {E}}}_{M}^{\xi }\) over suitably chosen finite-dimensional function spaces \(\varvec{{\mathcal {H}}}^{EA}\) and \(\varvec{{\mathcal {H}}}^{\xi }\):

$$\widehat{\varvec{\phi }}^{EA} \in \mathop {\text {arg}\,\text {min}}\limits _{\varvec{\varphi }^{EA} \in \varvec{{\mathcal {H}}}^{EA}} \varvec{{\mathcal {E}}}_{M}^{EA}(\varvec{\varphi }^{EA}), \qquad \widehat{\varvec{\phi }}^{\xi } \in \mathop {\text {arg}\,\text {min}}\limits _{\varvec{\varphi }^{\xi } \in \varvec{{\mathcal {H}}}^{\xi }} \varvec{{\mathcal {E}}}_{M}^{\xi }(\varvec{\varphi }^{\xi }).$$
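Since the hypothesis spaces are finite dimensional, minimizing such an empirical error functional reduces to linear least squares. The sketch below (not the authors' implementation) illustrates this for a toy homogeneous system with unit masses, no non-collective force, no alignment term, and a kernel expanded as \(\phi (r) = \sum _p c_p b_p(r)\) in user-supplied basis functions.

```python
import numpy as np

def design_block(x, basis, i):
    """Rows of the least-squares system for agent i: the model predicts
    a_i = (1/N) * sum_{i'} sum_p c_p b_p(r_{ii'}) (x_{i'} - x_i),
    which is linear in the coefficients c.  x: (N, d); returns (d, P)."""
    N, d = x.shape
    cols = []
    for b in basis:
        dx = x - x[i]
        r = np.linalg.norm(dx, axis=1)
        w = b(r)
        w[i] = 0.0                                       # no self-interaction
        cols.append((w[:, None] * dx).sum(axis=0) / N)   # shape (d,)
    return np.stack(cols, axis=1)

def fit_kernel(positions, accels, basis):
    """Stack all agents and snapshots into one system; solve by least squares."""
    A = np.concatenate([design_block(x, basis, i)
                        for x, a in zip(positions, accels)
                        for i in range(x.shape[0])])
    y = np.concatenate([a[i]
                        for x, a in zip(positions, accels)
                        for i in range(x.shape[0])])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return c
```

A constrained variant would additionally impose the admissibility bounds on the coefficients, matching the constrained least squares formulation in the text.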
3.4 Overview of contributions
We focus on the regime where L is fixed but \(M \rightarrow \infty \). We provide a learning theory that answers the fundamental questions:
-
Quantitative description of estimator errors. We will introduce measures to describe how close the estimators are to the true interaction kernels, that lead to novel dynamics-adapted norms. See Sect. 4.
-
Identifiability of kernels. We will establish the existence and uniqueness of the estimators as well as relate the solvability of our inverse problem to a fundamental coercivity property. See Sect. 4.2.
-
Consistency and optimal convergence rate of the estimators. We prove theorems on the strong consistency and optimal minimax rates of convergence of the estimators, as the number of observed trajectories increases. These results exploit the separability of the learning on the energy and alignment from the learning on the environment variable. See Sect. 4.3.
-
Trajectory prediction. We prove a theorem showing that the expected (over initial conditions) supremum error (over the entire time interval) of the trajectories obtained via the estimated interaction kernels is controlled by the norm of the difference between the true and estimated kernels, which in turn is controlled by the result above. This further justifies our choice of norms and estimation procedure. See Sect. 4.5.
-
Applications. Our generalized model demonstrates unique features that were not explored in previous work on estimating interaction kernels. Specifically, we showcase applications of the anticipation dynamics (AD) model introduced in [65] that go beyond the scope of previous research. Our numerical results support the effectiveness of our method and are in line with our theoretical findings. In Sect. 5 and Appendix D we discuss computational complexity and algorithmic implementation.
3.5 Comparison with existing work
Nonparametric inference of radial interaction kernels. We have studied the nonparametric inference of radial interaction kernels from trajectory data in special cases of model (2.2). In [10], a convergence study for first-order models of homogeneous agents was done for increasing N, the number of agents. The estimation problem with N fixed, but the number of trajectories M varying, for first-order and second-order models of heterogeneous agents was numerically studied in [47], and learning theory on these first-order models was developed in [32, 42, 45]. A big data application to real celestial motion ephemerides is developed and discussed in [49]. In [75], we numerically examine the proposed inference approach on second-order systems, with particular emphasis on emergent collective behaviors.
Novelty of our work. In this paper, we provide a rigorous learning theory for second-order heterogeneous models (2.2) with interaction kernels depending on pairwise higher-dimensional features. Our second-order model equations cover the first-order models considered in [32, 45, 47, 75] as special cases, but they are a significantly larger class: even when written as a first-order system in more variables, they are a strict generalization of the previous first-order models. Furthermore, the dynamical characteristics produced by second-order models are much richer and can model more complicated collective motions and emergent behavior of the agents.
Our theoretical results also focus on the joint learning of \({\varvec{\phi }}^{E}, {\varvec{\phi }}^{A}\) that takes into account their natural weighted direct sum structure, described in the following sections, whereas previous learning theories treated a single \({\varvec{\phi }}^{E}\). We carefully discuss the identifiability and separability of \(\phi ^{E}\) and \(\phi ^{A}\) from the sum. In general, the current theoretical framework is not able to conclusively show that \({\varvec{\phi }}^{E}\) and \({\varvec{\phi }}^{A}\) can be learned separately. However, we show that a structured sum of \(\phi ^{E}\) and \(\phi ^{A}\) can be learned at an optimal rate depending only on the intrinsic dimension, which is sufficient to conclude that the force field on the whole state space \({\mathbb {R}}^{2Nd+N}\) can be learned without the curse of the ambient dimensionality. This is also demonstrated in various numerical experiments.
Other works on data-driven discovery of collective dynamics. The majority of the earliest work in inferring interaction kernels in systems of the type (1.1), (2.2) occurred in the physics literature, going back to the works of Newton. From the viewpoint of purely data-driven analysis of the equations, requiring limited or no physical reasoning, foundational work on estimating interaction laws includes [40, 48]. One can also refer to [50] for the recent development of the Weak SINDy algorithm, which leverages the weak form of the differential equation and sparse parametric regression, with applications to cellular dynamics [51]. In these works, the interaction kernels are assumed to be in the span of a known family of functions and parameters are estimated. In statistics, the problem of parameter estimation in dynamical systems from observations is classical, e.g. [12, 15, 43, 59, 71]. The question of identifiability of the parameter emerges, see e.g. [28, 52]. Our work is closely related to this viewpoint but our parameter is now infinite-dimensional; identifiability is discussed in Sect. 4.2.
Another highly active area of research in recent years is on stochastic interacting agent systems. The maximum likelihood approach is the most frequently studied approach in recent works, including the parameter estimation [7, 16, 34, 39, 64] and nonparametric estimation of drift in stochastic McKean–Vlasov equation [30, 33, 73], and radial interaction kernel learning in [46].
Machine learning of dynamical systems. A vast literature exists in the context of learning dynamical systems [3, 5, 9, 13, 14, 37, 38, 44, 57, 58, 60, 62, 74]. There are many techniques which can be used to tackle the high dimensionality of the data: sparsity assumptions [11, 14, 61, 68], dimension reduction, and reduced-order modeling. The dependent nature of the data prevents traditional regression-based approaches, see the discussion in [45], but many of the approaches above successfully address this. Our work, however, exploits the interacting-agent structure of collective dynamical systems, which is driven by a collection of two-body interactions where each interaction depends only on pairwise data between the states of agents, as in (1.1). With this structure in mind, we are able to reduce the ambient dimension of the data, \(N(2d + 1)\), to the dimension of the variables in the interaction kernels, which is independent of N. We also naturally incorporate the dependence in the data in an appropriate manner by considering trajectories generated from different initial conditions.
3.6 Function spaces
We begin by describing some basic ideas about measures and function spaces. Consider a compact or precompact set \({\mathcal {U}} \subset {\mathbb {R}}^{p}\) for some p; the infinity norm is defined as \(\Vert h\Vert _{\infty }:={\text {ess}} \sup _{x \in {\mathcal {U}}}|h(x)|\), and \(L^{\infty }({\mathcal {U}})\) as the space of real valued functions defined on \({\mathcal {U}}\) with finite \(\infty \)-norm. A key function space we need to consider is \(C_{c}^{k, \alpha }({\mathcal {U}})\), for \(k \in {\mathbb {N}}\), \(0<\alpha \le 1\), defined as the space of compactly supported, k-times continuously differentiable functions whose k-th derivative is Hölder continuous of order \(\alpha \). We can then consider vectorizations of these spaces over agent types as
Similarly, we consider direct sums of measures, with corresponding vectorized function spaces, in particular \(L^2\) (see Sect. 4.1).
We now define a suitable function class for the interaction kernels in the model (2.2). A simple modeling assumption is that, as agents get farther and farther apart, they eventually have no influence on each other. For each pair \((k,k')\), \(k,k'=1,\ldots ,K\), we define the admissible space
where we remind the reader that the \({\{E,A,\xi \}}\) notation means, in this case, that there is an admissible space for each element of the set \({\{E,A,\xi \}}\). Here, \(R_{kk'}^{\min },R_{kk'}^{\max }\) are the minimum or maximum, respectively, possible interaction radius for agents in \(C_{k'}\) influencing agents in \(C_{k}\). Similarly, \({\mathbb {S}}^{E}_{kk'}, {\mathbb {S}}^{A}_{kk'}, {\mathbb {S}}^{\xi }_{kk'}\) are compact sets in \({\mathbb {R}}^{p_{kk'}^E}, {\mathbb {R}}^{p_{kk'}^A}, {\mathbb {R}}^{p_{kk'}^{\xi }}\) which contain the ranges of the feature maps, \(\varvec{s}^E_{kk'}, \varvec{s}^A_{kk'}\) and \(\varvec{s}^{\xi }_{kk'}\). We will also need the sets:
Notice that all interaction kernels are supported on the interval of pairwise distance [0, R].
We denote the distribution of the initial conditions by \(\varvec{\mu ^{\varvec{Y}}}\). This measure is unknown and is the source of randomness in our system. It is a product measure of three measures \(\varvec{\mu }^{\varvec{X}}\), \(\varvec{\mu }^{\varvec{V}}\), \(\varvec{\mu }^{\varvec{\Xi }}\), all also unknown, that represent the distribution on the initial positions, velocities, and environment variables, respectively. Specifically, we define,
It reflects that we will observe trajectories which start at different initial conditions, but that evolve from the same dynamical system. For example, in our numerical experiments we will choose \(\varvec{\mu ^{\varvec{Y}}}\) to be uniform over a system-dependent compact set.
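A sketch of drawing i.i.d. initial conditions from such a product measure follows; the uniform ranges are illustrative placeholders (hypothetical choices) standing in for the system-dependent compact sets.

```python
import numpy as np

def sample_initial_conditions(M, N, d, rng=None):
    """Draw M i.i.d. initial conditions from a product measure
    mu^Y = mu^X x mu^V x mu^Xi.  The uniform ranges below are
    placeholders for the system-dependent compact sets."""
    if rng is None:
        rng = np.random.default_rng()
    X0 = rng.uniform(-1.0, 1.0, size=(M, N, d))    # positions  ~ mu^X
    V0 = rng.uniform(-0.5, 0.5, size=(M, N, d))    # velocities ~ mu^V
    Xi0 = rng.uniform(0.0, 1.0, size=(M, N))       # environment vars ~ mu^Xi
    return X0, V0, Xi0
```

Each of the M draws seeds one observed trajectory of the same underlying system.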
We let
and we assume that both of these quantities are finite. A sufficient condition is that the measures \(\varvec{\mu ^{\varvec{V}}}, \varvec{\mu }^{\xi }\) (specifying the distribution of the initial conditions on the velocities and the environment variables) are compactly supported; finiteness then follows from the assumptions on the interaction kernels below and the fact that we only consider the finite final time T.
First, we define the following vectorized function spaces, which we call admissible sets,
We will assume that the interaction kernels are in corresponding admissible sets:
The admissibility assumptions (3.12) allow us to establish properties such as existence and uniqueness of solutions to (2.2) as well as to have control on the trajectory errors in finite time [0, T]. It further allows us to show regularity and absolute continuity with respect to Lebesgue measure of the appropriate performance measures defined in Sect. 4.1.
When estimating the EA part of the system, we will consider the direct sum admissible space, for \(S_{EA}\ge \max \{S_E,S_A\}\),
In the learning approach, we will consider hypothesis spaces that we will search in order to estimate the various interaction kernels. The hypothesis spaces corresponding to \(\{\phi _{kk'}^{{\{E,A,\xi \}}}\}\) are denoted as \(\{{\mathcal {H}}_{kk'}^{{\{E,A,\xi \}}}\}\) and we vectorize them as,
Analogous to our simplified notation for \(\varvec{\phi }^{EA}, \varvec{\varphi }^{EA}\) described in (3.3), we define the direct sum of the hypothesis spaces as,
We will consider specific choices for the hypothesis spaces during the learning theory and numerical algorithm sections.
4 Learning theory
4.1 Probability measures and weighted \(L^2\) for measuring learning performance
The interaction kernels depend on (\(r,{\dot{r}},\varvec{s}^E,\varvec{s}^A,\varvec{s}^{\xi }\)), and to measure distances between estimated interaction kernels and true interaction kernels, we consider a natural set of probability measures and corresponding weighted \(L^2\) spaces. These generalize the constructions of [10, 45, 47]. For each interacting pair \((k,k')\), we let
where \(N_{kk'} = N_kN_{k'}\) for \(k\ne k'\) and \(N_{k k^{\prime }}={N_{k} \atopwithdelims ()2}\) for \(k=k'\), and we used the following shorthand notation for the Dirac measures:
The measure \(\rho _T^{EA, L, k, k'}\) is the discrete counterpart of \(\rho _T^{EA, k, k'}\) with the continuous average over [0, T] replaced by the average over the observation times \(0=t_1<\ldots <t_L=T\). \(\rho _T^{EA, L, M, k, k'}\) can be computed from observations and converges to \(\rho _T^{EA, L, k, k'}\) as \(M \rightarrow \infty \).
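For a homogeneous system (a single type, hence one pair \((k,k')\)), the empirical measure of pairwise distances can be approximated by a histogram over all agents, observation times, and trajectories, as in the following sketch (an illustrative approximation, not the authors' code).

```python
import numpy as np

def empirical_pairwise_measure(trajs, bins=20, r_max=None):
    """Histogram approximation of the empirical measure: average of Dirac
    masses at the pairwise distances r_{ii'} over unordered agent pairs
    (i < i'), observation times l, and trajectories m."""
    dists = []
    for X in trajs:                               # X: shape (L, N, d)
        for x in X:
            diff = x[:, None, :] - x[None, :, :]  # all pairwise differences
            r = np.linalg.norm(diff, axis=-1)
            iu = np.triu_indices(x.shape[0], k=1) # unordered pairs i < i'
            dists.append(r[iu])
    dists = np.concatenate(dists)
    if r_max is None:
        r_max = float(dists.max())
    hist, edges = np.histogram(dists, bins=bins,
                               range=(0.0, r_max), density=True)
    return hist, edges
```

As M grows, this histogram stabilizes, reflecting the convergence of the empirical measure noted above.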
We also consider the marginal distributions
and \(\rho _T^{E, L, k, k'}(r, \varvec{s}^E) \), \(\rho _T^{E, L, M, k, k'}(r, \varvec{s}^E)\), \(\rho _T^{A, L, k, k'}(r, {\dot{r}}, \varvec{s}^A)\), \(\rho _T^{A, L, M, k, k'}(r, {\dot{r}}, \varvec{s}^A)\) defined analogously as above. The empirical measures, \(\rho _T^{E, L, M, k, k'}, \rho _T^{A, L, M, k,k'}\), are the ones used in the actual algorithm to quantify the learning performances of the estimators \({\widehat{\phi }}^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'}\) respectively. They are also crucial in discussing the separability of \({\widehat{\phi }}^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'}\).
For ease of notation, we introduce the following measures to handle the heterogeneity of the system; these are used to describe the error over all pairs \((k,k')\).
Similar definitions apply for measures related to learning the \(\xi \)-based interaction kernels, see supplement C. We discuss some key properties of the measures in supplement B.
We now discuss the performance measures for the estimated interaction kernels. We use weighted \(L^2\)-norms (with mild abuse of notation, we omit the weight from the notation) based on measures introduced above (with analogous definitions for the measures corresponding to finite L) that are adapted to the underlying dynamics:
Our learning theory focuses on minimizing the difference between \({\widehat{\phi }}^{E}_{kk'} \oplus {\widehat{\phi }}^{A}_{kk'}\) and \(\phi ^{E}_{kk'} \oplus \phi ^{A}_{kk'}\) in the joint norm given by (4.4). As long as the joint norm is small, our estimators produce faithful approximations of the right-hand side of the original system, and hence of its trajectories. However, a small joint norm does not necessarily imply that both \({\widehat{\phi }}^{E}_{kk'} - \phi ^{E}_{kk'}\) and \({\widehat{\phi }}^{A}_{kk'} - \phi ^{A}_{kk'}\) are small in their corresponding energy- and alignment-based norms, since the joint norm is weaker. It would be interesting to study whether these norms are equivalent, but the problem appears to be quite delicate; this theoretical investigation is ongoing.
Now, we have all the tools needed to establish a theoretical framework: dynamics induced probability measures, performance measurements in appropriate norms, and loss functionals. These will allow us to discuss the convergence properties of our estimators. Full details of the numerical algorithm are given in supplement D.
4.2 Notational summary
A summary of the learning theory notation introduced in Sects. 3.1, 3, and the notation above, is given below in Table 3.
4.3 Identifiability of kernels from data
In this section we introduce a technical condition on the dynamical system, called the coercivity condition, that relates to the well-posedness (solvability and uniqueness of the solution) of the inverse problem and plays a key role in the learning theory. It generalizes the previous work [42, 45, 47]. In fact, for the second-order systems considered here, we will have two coercivity conditions: one for the energy and alignment terms and one for the \(\xi \) variable. These conditions ensure, first, that the minimizers of the error functionals are unique, and second, that when the expected error functional is small, the distance from the estimator to the true kernels is small in the appropriate \(\varvec{\rho }_T\) norm.
Definition 1
(Coercivity condition) The dynamical system (2.2), observed at time instants \(0=t_1<t_2<\dots <t_L=T\) and with initial condition distributed according to \(\varvec{\mu ^{\varvec{Y}}}\) on \({\mathbb {R}}^{(2d+1)N}\), satisfies the coercivity condition on the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}\) with constant \(c_{\varvec{{\mathcal {H}}}^{EA}}\) if
Similarly, the system satisfies the coercivity condition on the hypothesis space \(\varvec{{\mathcal {H}}}^{\xi }\) with constant \(c_{\varvec{{\mathcal {H}}}^{\xi }}\) if
Analogous definitions hold for continuous observations over the time interval [0, T], obtained by replacing the average over observations at discrete times with an integral average over [0, T].
4.4 Consistency and optimal convergence rate of estimators
4.4.1 Concentration
Our first main result is a concentration estimate that relates the coercivity condition to an appropriate bias-variance tradeoff in our setting. Let \({\mathcal {N}}(\varvec{{\mathcal {H}}},\delta )\) be the \(\delta \)-covering number, with respect to the \(\infty \)-norm, of the set \(\varvec{{\mathcal {H}}}\).
Theorem 2
(Concentration) Suppose that \({\varvec{\phi }}^{\{E,A,\xi \}}\in \varvec{{\mathcal {K}}}_{S_{\{E,A,\xi \}}}^{\{E,A,\xi \}}\). Consider convex, compact (with respect to the \(\infty \)-norm) hypothesis spaces
bounded above by \(S_0 \ge \max \{S_E, S_A,S_{\xi }\}\) respectively. Additionally, assume that the coercivity conditions (4.5), (4.6) hold on \(\varvec{{\mathcal {H}}}_M^{EA}\) and \(\varvec{{\mathcal {H}}}_M^{\xi }\), respectively.
Then for all \(\epsilon >0\), with probability (with respect to \(\varvec{\mu ^{\varvec{Y}}}\)) at least \(1-\delta \), we have the estimates
provided that, for the first bound to hold,
and similarly for the second inequality, using \(\varvec{{\mathcal {H}}}_M^{\xi }\).
4.4.2 Consistency
In the regime where \(M \rightarrow \infty \), we can choose a sequence of \(\varvec{{\mathcal {H}}}_M^{EA}\)’s such that the approximation error goes to 0. This enables us to control the infimum on the right hand side of (4.7).
Theorem 3
(Strong consistency) Suppose that
is a family of compact and convex subsets such that the approximation error goes to zero,
Further suppose that the coercivity condition holds on \(\bigcup _{M}\varvec{{\mathcal {H}}}_M^{EA}\), and that \( \bigcup _{M}\varvec{{\mathcal {H}}}_M^{EA}\) is compact in \(\varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L^\infty }({\textbf{R}}\times {\textbf{S}}^A)\). Then the estimator is strongly consistent with respect to the \(\varvec{L}^2(\varvec{\rho }_T^{EA,L})\) norm:
An analogous consistency result holds for the estimator in the \(\xi \) variable.
These two results together provide a consistency result on the full estimation of the triple \((\widehat{{\varvec{\phi }}^{\xi }}, \widehat{{\varvec{\phi }}^{E}}, \widehat{{\varvec{\phi }}^{A}})\) and thus consistency of our estimation procedure on the full system (2.2).
4.5 Rate of convergence
Theorem 2 highlights the classical bias-variance tradeoff in our setting. Given data collected from M trajectories, we would like to choose the best hypothesis space to maximize the accuracy of the estimators. On the one hand, we would like the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}_M\) to be large so that the bias
is small. Simultaneously, we would like \(\varvec{{\mathcal {H}}}^{EA}_M\) to be small enough so that the covering number \({\mathcal {N}}(\varvec{{\mathcal {H}}}^{EA}_M,\epsilon ) \) is small. Just as in nonparametric regression, our rate of convergence depends on a regularity condition of the true interaction kernels and corresponding approximation properties of the hypothesis space, as is demonstrated in the following theorem. We establish the optimal (up to a log factor) min–max rate of convergence by choosing a hypothesis space of an optimal dimension as a function of the sample size M.
The dimension of the space supporting \(\varvec{\rho }_T^{EA,L}\) is typically large: it is equal to \(1+\sum _{(k,k')}p_{(k,k')}^E + \sum _{(k,k')}p_{(k,k')}^A\), see Table 1 for the definition of the \(p_{(k,k')}\). However, we can exploit the structure of the system so that our convergence rates depend only on the maximum number of unique variables in a pair \((k,k')\) across the E, A portions of the system. A similar result holds for \(\varvec{\rho }_T^{\xi , L}\) and its convergence rate. For the system (2.2), let \({\mathcal {V}}^{E,kk'}\) be the number of distinct variables in the function \(\phi ^{E}_{kk'}(r,\varvec{s}^E_{(k,k')})\), and define \({\mathcal {V}}^{A,kk'}\), \({\mathcal {V}}^{\xi ,kk'}\) similarly; more precisely, recalling the notation in Table 1:
Using these, we get the dimensions needed for the minimax rates:
The dimension for the minimax rates of the energy and alignment inference is given by \({\mathcal {V}}\), the maximum number of unique variables used in any one of the \(\phi ^{E}_{kk'}, \phi ^{A}_{kk'}\) pairs. Analogously, \({\mathcal {V}}^{\xi }\) is used for the minimax convergence rate of the inference of \({\varvec{\phi }}^{\xi }\). As an extreme example, consider a problem with 10 different types of agents, leading to 100 distinct interaction kernels, each depending on r and one additional variable that is unique to each function. In this case, we only pay the 2-dimensional rate, rather than the 101-dimensional rate in the ambient space of the 101 unique variables, although the heterogeneity affects the constant in the convergence rate. We note that we are not predicting the number of variables nor their form: these are assumed known. In Table 2 we report the values of \({\mathcal {V}}\) and \({\mathcal {V}}^{\xi }\) for a variety of prototypical systems. We are now ready to state our main result:
Theorem 4
(Rate of convergence) Let \(\widehat{\varvec{\phi }}^{EA}:={{\widehat{{\varvec{\phi }}}}}_M^E \oplus {{\widehat{{\varvec{\phi }}}}}_M^A\) denote the minimizer of the empirical error functional \(\varvec{{\mathcal {E}}}_{M}^{EA}\) (defined in (3.4)) over the hypothesis space \(\varvec{{\mathcal {H}}}^{EA}_M\).
-
(a)
Let the hypothesis space be chosen as the direct sum of the admissible spaces, namely \(\varvec{{\mathcal {H}}}^{EA}= \varvec{{\mathcal {K}}}_{S_E}^E \oplus \varvec{{\mathcal {K}}}_{S_A}^A,\) and assume that the coercivity condition (4.5) holds on \(\varvec{{\mathcal {H}}}^{EA}\). Then, there exists a constant C depending only on \(K,S_{EA},R, R_{{\dot{x}}}\) such that
$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert \widehat{\varvec{\phi }}^{EA}_M-\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \Big ]\le \frac{C}{c_{\varvec{{\mathcal {H}}}^{EA}}} M^{-\frac{1}{{\mathcal {V}}+1}}. \end{aligned}$$ -
(b)
Assume that \(\{\varvec{{\mathcal {L}}}_n\}_{n=1}^{\infty }\) is a sequence of finite-dimensional linear subspaces of \(\varvec{L}^{\infty }({\textbf{R}}\times {\textbf{S}}^E) \oplus \varvec{L}^{\infty }({\textbf{R}}\times {\textbf{S}}^A)\) satisfying the dimension and approximation constraints
$$\begin{aligned} \text {dim}(\varvec{{\mathcal {L}}}_n) \le c_0K^2n^{{\mathcal {V}}},\quad \inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {L}}}_n}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert _{\infty } \le c_1 n^{-s}, \end{aligned}$$(4.10)for some fixed constants \(c_0,c_1\) representing dimension-independent approximation characteristics of the linear subspaces, and \(s>0\) related to the regularity of the interaction kernels. The value n can be thought of as the number of basis functions along each of the (up to) \({\mathcal {V}}\) axes for each \((k,k')\). Suppose the coercivity condition holds true on the set \(\varvec{{\mathcal {L}}}:=\cup _n\varvec{{\mathcal {L}}}_n\), and let \(c_{\varvec{{\mathcal {L}}}}^{EA}\) be the coercivity constant of \(\varvec{{\mathcal {L}}}\). Define \(\varvec{{\mathcal {B}}}_n\) to be the closed ball centered at the origin of radius \((c_1+S_{EA})\) in \(\varvec{{\mathcal {L}}}_{n}\). If we choose the hypothesis space as \(\varvec{{\mathcal {H}}}_M=\varvec{{\mathcal {B}}}_{k(M)}\), where \(k(M) \asymp (\frac{M}{\log M})^{\frac{1}{2\,s+{\mathcal {V}}}}\), then there exists a constant C depending on \(K,R, R_{{\dot{x}}}, S_{EA},c_0, c_1, s\) such that we achieve the convergence rate,
$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert \widehat{\varvec{\phi }}^{EA}_{M} -\varvec{\phi }^{EA}\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{EA,L})} \Big ] \le \frac{C}{c_{\varvec{{\mathcal {L}}}}^{EA}} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}}}. \end{aligned}$$(4.11) -
(c)
Under the same assumptions as in (a), there exists a constant C depending only on \(K,S_{\xi },R\) such that
$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert {{\widehat{{\varvec{\phi }}}}}_M^{\xi }-{\varvec{\phi }}^{\xi }\Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})} \Big ]\le \frac{C}{c_{\varvec{{\mathcal {H}}}^{\xi }}} M^{-\frac{1}{{\mathcal {V}}^{\xi }+1}}. \end{aligned}$$ -
(d)
Under the same assumptions as in (b), there exists a constant C depending only on \(K,R, S_{\xi },c_0, c_1, s\) such that, with \(c^{\xi }\) denoting the coercivity constant of the corresponding linear space,
$$\begin{aligned} {\mathbb {E}}_{\varvec{\mu ^{\varvec{Y}}}}\Big [\Vert {{\widehat{{\varvec{\phi }}}}}^{\xi }_{M} -{\varvec{\phi }}^{\xi } \Vert ^2_{\varvec{L}^2(\varvec{\rho }_T^{\xi , L})} \Big ] \le \frac{C}{c^{\xi }} \left( \frac{\log M}{M}\right) ^{\frac{2s}{2s+{\mathcal {V}}^{\xi }}}. \end{aligned}$$(4.12)
We in fact prove bounds not only in expectation, but also with high probability, for every fixed large-enough M, as the proof will show. In addition, we remark that when the coercivity constant is independent of the number of agents N, not only is the convergence rate of our estimators independent of the dimension \((2d+1)N\) of the phase space, but even the constants in front of the rate term are independent of N. We proved that this holds for some initial conditions [45] in first-order systems, and the analysis could be extended to some special cases of the second-order systems considered in this paper. In general, we do expect the coercivity constant to depend on N. Empirical numerical experiments on some second-order systems [75] support the idea that the coercivity condition is satisfied by large classes of second-order systems; we leave further theoretical investigation to future work.
In both theorems, the convergence rates \(\frac{2s}{2s+{\mathcal {V}}}\) and \(\frac{2s}{2s+{\mathcal {V}}^{\xi }}\) coincide, up to the logarithmic factor, with the minimax rate of convergence \(\frac{2s}{2s+d}\) for nonparametric regression in the corresponding dimension d. (The logarithmic factor may be removable, e.g., using the techniques in Chapters 11–15 of [36], at the cost of additional complexity in the proofs.) Achieving the same rate of convergence as if we had observed noisy values of the interaction kernels directly, rather than through the dynamics, demonstrates the optimality of our approach. The strong consistency results show the asymptotic optimality of our method, and the assumptions of the theorems apply to wide classes of systems. Specifically, for part (b) of the theorems, the dimension and approximation conditions can be explicitly achieved by piecewise polynomials or splines appropriately adapted to the regularity of the interaction kernel. In the conditions of Theorem 4, n can be the number of partitions along each axis of the variables in \({\mathcal {V}}\). Then, using multivariate splines or piecewise polynomials, the dimension of the linear space is a fixed constant \(c_0\) (corresponding to the number of parameters to estimate for each function) times \(Kn^{{\mathcal {V}}}\). Furthermore, by standard approximation theory results (see [63], Chapters 12–13, and [27, 29]), for s the regularity of the interaction kernels we achieve the desired approximation condition with piecewise polynomials of degree \(\lfloor s\rfloor \). In our admissible spaces we have \(s=1\); note that the rate of convergence is faster for kernels of higher regularity.
4.5.1 Examples of convergence rates in prototype systems
We next briefly examine the convergence rate on a few systems of interest. Recall that in Table 2 we have as the final two columns the values \({\mathcal {V}}, {\mathcal {V}}^{\xi }\), which dictate the rate of convergence of our estimators in each system. Some specific highlights:
-
For Anticipation Dynamics (AD) (see (6.2)), we have \({\mathcal {V}}=2\) unique variables shared across both the energy- and alignment-based kernels, so we learn at the 2-dimensional rate.
-
For the Synchronized Oscillator dynamics (see (11) in [75]), each agent is indexed by i, \(\xi _i\) is its phase, \(\varvec{x}_i\) is (as usual) its position, \(\omega _i\) is the fixed natural frequency, \(\varvec{v}_i\) is the fixed self-propulsion velocity. The dynamics of \(\varvec{x}_i\) and \(\xi _i\) are governed by the following equations,
$$\begin{aligned} {\left\{ \begin{array}{ll} {{\dot{\varvec{x}}}}_i &{}= \varvec{v}_i + \frac{1}{N}\sum _{i'=1}^{N}\left( \frac{\varvec{x}_{i'} - \varvec{x}_i}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| }(A+J \cos (\xi _{i'} - \xi _i)) - B\frac{\varvec{x}_{i'} - \varvec{x}_i}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| ^2} \right) \\ {{\dot{\xi }}}_i &{}= \omega _i + \frac{K}{N} \sum _{i'=1}^{N} \frac{\sin (\xi _{i'} - \xi _i)}{\left\| \varvec{x}_{i'} - \varvec{x}_i \right\| } \end{array}\right. }, \nonumber \\ \end{aligned}$$(4.13)where A, J, B, K are constants. Formatting (4.13) into our model, we can see that we achieve the 2-dimensional optimal learning rate on each of the EA and \(\xi \) portions (rather than a 4-dimensional rate) due to the decoupled nature of the system; similarly we only pay the 1-dimensional rate twice for the Phototaxis system (see section D of SI in [47]). This is a key reason for splitting our learning theory between EA- and \(\xi \)-interaction kernels and accounting for shared and non-shared variables.
-
The rates of convergence of our estimators for all previously-studied first-order systems (see [45, 47, 75]) can be derived from Theorem 4.
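For concreteness, the right-hand side of the synchronized-oscillator system (4.13) is simple to implement. The sketch below is our own minimal version (function name and array shapes are our choices, not from the paper's software), with the singular \(i'=i\) term excluded from the sums:

```python
import numpy as np

def oscillator_rhs(x, xi, v, omega, A, J, B, K):
    """Right-hand side of the synchronized-oscillator system (4.13).
    x: (N, d) positions, xi: (N,) phases, v: (N, d) fixed self-propulsion
    velocities, omega: (N,) fixed natural frequencies."""
    N = x.shape[0]
    diff = x[None, :, :] - x[:, None, :]     # diff[i, j] = x_j - x_i
    r = np.linalg.norm(diff, axis=-1)        # pairwise distances (N, N)
    np.fill_diagonal(r, np.inf)              # drop the i' = i term
    dphase = xi[None, :] - xi[:, None]       # xi_j - xi_i
    # coefficient of (x_j - x_i): (A + J cos)/r from the unit-vector term,
    # minus B/r^2 from the repulsion term
    coef = (A + J * np.cos(dphase)) / r - B / r**2
    dx = v + (coef[:, :, None] * diff).sum(axis=1) / N
    dxi = omega + (K / N) * (np.sin(dphase) / r).sum(axis=1)
    return dx, dxi
```

Note that `dx - v` depends only on pairwise differences, so the interaction part is translation invariant, as the form of (4.13) requires.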
One downside of the results above is the lack of dependence on L: it seems natural that finer time sampling of each trajectory should improve the results, at least up to a point. Indeed, the numerical experiments of [45, 47, 75] demonstrate that increasing L can improve performance. One technique used in [75] for very long trajectories (large L, medium to small M) is to split each trajectory into several shorter ones, increasing M while decreasing L. The dependence on the number of agents N is not the objective of this work; it was considered in [10] in the case of first-order systems, but further study in this mean-field regime is of interest to the authors and work is ongoing.
4.6 Performance of trajectory prediction
Once estimators \(\widehat{\varvec{\varphi }}^{EA},\widehat{\varvec{\varphi }}^{\xi }\) are obtained, a natural question is the accuracy of the trajectories evolved using these estimated kernels. We compare the observed trajectories to the estimated trajectories evolved from the same initial conditions but with the estimated interaction kernels. Recall that \(\varvec{Y}_t = [ \varvec{X}^T_t, \varvec{V}^T_t, \varvec{\Xi }^T_t ]^T\) is the trajectory of the dynamics generated by the true and unknown interaction kernels with initial condition \(\varvec{Y}_0\), and \({{\hat{\varvec{Y}}}}_t = [ {{\hat{\varvec{X}}}}^T_t, {{\hat{\varvec{V}}}}^T_t, {{\hat{\varvec{\Xi }}}}^T_t]^T\) is the trajectory of the dynamics generated, from the same initial condition \({{\hat{\varvec{Y}}}}_0 = \varvec{Y}_0\), by the interaction kernels estimated from observations at times \(\{t_l\}_{l = 1}^L\). We let
The next theorem shows that the error in prediction is (i) bounded trajectory-wise by a continuous-time version of the error functional, and (ii) bounded on average by the \(\varvec{L}^2(\varvec{\rho }_T^{EA})\) and \(\varvec{L}^2(\varvec{\rho }_T^{\xi })\) errors of the estimators, respectively. This further validates the effectiveness of our error functional and the \(\varvec{L}^2(\varvec{\rho }_T)\)-metric to assess the quality of the estimator. In particular, this emphasizes that, although the system is a coupled system of ODEs, our decoupled learning procedure with our choice of norm leads to control of the expected supremum error, as long as we minimize the \(\varvec{L}^2(\varvec{\rho }_T^{EA}), \varvec{L}^2(\varvec{\rho }_T^{\xi })\) norms in obtaining our estimators.
Theorem 5
Suppose that \({{\widehat{{\varvec{\phi }}}}}^E \in \varvec{{\mathcal {K}}}_{S_E}^E\), \({{\widehat{{\varvec{\phi }}}}}^A \in \varvec{{\mathcal {K}}}_{S_A}^A\) and \({{\widehat{{\varvec{\phi }}}}}^{\xi } \in \varvec{{\mathcal {K}}}_{S_{\xi }}^{\xi }\). Denote by \({\widehat{\varvec{Y}}}(t)\) and \(\varvec{Y}(t)\) the solutions of the systems with kernels \({{\widehat{{\varvec{\phi }}}}}^E=({{\widehat{\phi }}}_{kk'}^{E})_{k,k'=1}^{K,K}, {{\widehat{{\varvec{\phi }}}}}^A =({{\widehat{\phi }}}_{kk'}^{A})_{k,k'=1}^{K,K}\), and \({{\widehat{{\varvec{\phi }}}}}^{\xi } =({{\widehat{\phi }}}_{kk'}^{\xi })_{k,k'=1}^{K,K}\) and \({\varvec{\phi }}^E, {\varvec{\phi }}^A,{\varvec{\phi }}^{\xi }\) respectively, both with the same initial condition. Then
where \(g(T) =1+(1+B_1T)T\exp (A_1T+T^2/2)\). The constants are \(A_1 = 2T(8KP + {\mathcal {L}} +8QK + {\mathcal {L}}^{\xi })\) and \(B_1 = 2T^2(8KP + {\mathcal {L}})\), with any unspecified constants made precise in the proof and only depending on the Lipschitz constants of the noncollective forces and the feature maps, as well as the values \(S^E, S^A, S^{\xi }\) coming from the admissible spaces. It is bounded on average, with respect to the initial distribution \(\varvec{\mu ^{\varvec{Y}}}\), by
with the measures \(\varvec{\rho }_T^{EA},\varvec{\rho }_T^{\xi }\) defined in [(4.1), (C.1)]. Expression (4.15) shows that by minimizing the right hand side, we can control the expected \({\mathcal {Y}}\)-supremum error of the estimated trajectories.
We postpone the somewhat lengthy proof to Appendix E.
5 Algorithm for constructing the interaction kernel estimators
Let \({\mathcal {H}}^{E}_{kk'}\) be a finite-dimensional function space of dimension \(n_{kk'}^{E}\) whose basis functions are piecewise polynomials of a degree to be chosen later (other types of basis functions are also possible, e.g., clamped B-splines as shown in [47]). It is built on uniform partitions of \([R^{\min , \text {obs}}_{kk'}, R^{\max , \text {obs}}_{kk'}]\), where \(R^{\min , \text {obs}}_{kk'}\)/\(R^{\max , \text {obs}}_{kk'}\) is the minimum/maximum interaction radius, derived from the observation data, for agents of type \(k'\) influencing agents of type \(k\). A similar construction yields \({\mathcal {H}}^{A}\) with dimension \(n_{kk'}^{A}\). We write the candidates \(\varphi ^{E}_{kk'}, \varphi ^{A}_{kk'}\) as linear combinations of the basis functions:
Substituting this linear combination back into (3.4), we obtain a system of linear equations,
and minimizing the empirical loss functional corresponds to solving this system in the least-squares sense. Here, \(\vec {\alpha }^{EA} = \begin{bmatrix} (\vec {\alpha }^E)^T&(\vec {\alpha }^A)^T\end{bmatrix}^T \in {\mathbb {R}}^{n^{EA}}\), with \(\vec {\alpha }^E\) and \(\vec {\alpha }^A\) being the collections of the \(\alpha _{k, k', \eta _{kk'}^{E}}^{E}\) and \(\alpha _{k, k', \eta _{kk'}^{A}}^{A}\), respectively. Moreover, \(A^{EA}_M \in {\mathbb {R}}^{n^{EA} \times n^{EA}}\) and \(\vec {b}^{EA}_M \in {\mathbb {R}}^{n^{EA}}\). See Sec. D for full details.
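A minimal sketch of this least-squares step, under our own naming conventions (the paper's released code is in MATLAB): `Psi` stands for the stacked basis-function features and `y` for the stacked observed data on the left-hand side of the system.

```python
import numpy as np

def fit_kernel_coefficients(Psi, y, reg=0.0):
    """Least-squares fit of basis coefficients: forms the normal
    equations A alpha = b with A = Psi^T Psi and b = Psi^T y, and
    solves them (optionally Tikhonov-regularized when A is
    ill-conditioned)."""
    A = Psi.T @ Psi
    b = Psi.T @ y
    if reg > 0.0:
        A = A + reg * np.eye(A.shape[0])
    # lstsq handles the rank-deficient case gracefully
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha
```

When the true function lies in the span of the basis and the features are well-conditioned, the coefficients are recovered exactly up to numerical precision.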
The total computational complexity is as follows: \(MLN^2\) for computing pairwise data, \(MLd(n^{EA})^2\) for constructing the learning matrix and right-hand side vector, and \((n^{EA})^3\) for solving the linear system; hence the total computing time is \(MLN^2 + MLd(n^{EA})^2 + (n^{EA})^3\). Assuming that a tensor-grid construction of \({\mathcal {H}}^{E}_{kk'}\) is used, and the optimal \(n_*\) with
is used in each dimension, we have \(n^{EA} = 2n_*^{{\mathcal {V}}} \approx 2\,M^{\frac{{\mathcal {V}}}{2\,s + {\mathcal {V}}}}\); then we obtain the total computing time in terms of M as follows
In the special case of \(s = 1\) (Lipschitz functions) and \({\mathcal {V}}= 1\), we have
It is slightly super-linear in M.
A similar computational complexity analysis for solving \(A^{\xi }\vec {\alpha }^{\xi } = \vec {b}^{\xi }\) shows that the computational cost is also slightly super-linear in M when \(s = 1\) and \({\mathcal {V}}= 1\).
The overall memory needed for the learning problem is \(MLN(d(5 + n^{EA} + n^{\xi }) + 3)\): \(MLN(4d + 2)\) for storing the trajectory data, \(MLNd(n^{EA} + n^{\xi })\) (here \(n^{EA} = n^E + n^A\)) for the learning matrices, and \(MLN(d + 1)\) for the right-hand side vectors. Hence if \(M \gg {\mathcal {O}}(1)\), we can parallelize over the trajectory index m to reduce the memory overhead, ending up with \(M_{\text {per core}}\big (LN(d(5 + n^{EA} + n^{\xi }) + 3)\big )\), where \(M_{\text {per core}} = \frac{M}{n_{\text {cores}}}\). The final storage of A and \(\vec {b}\) only needs \(n^{EA}(n^{EA} + 1) + n^{\xi }(n^{\xi } + 1)\).
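The trajectory-wise accumulation behind this parallelization can be sketched as follows. This is our own illustration, not the paper's code; `featurize` is a hypothetical helper mapping one trajectory batch to its feature matrix and target vector:

```python
import numpy as np

def accumulate_normal_equations(batches, featurize):
    """Accumulate the normal equations A = sum_m Psi_m^T Psi_m and
    b = sum_m Psi_m^T y_m one trajectory batch at a time, so only the
    small n x n matrix and length-n vector stay in memory. Independent
    batches can be processed on separate cores and the partial (A, b)
    pairs summed afterwards."""
    A, b = None, None
    for batch in batches:
        Psi, y = featurize(batch)
        if A is None:
            n = Psi.shape[1]
            A, b = np.zeros((n, n)), np.zeros(n)
        A += Psi.T @ Psi
        b += Psi.T @ y
    return A, b
```

Summing per-batch contributions gives exactly the same normal equations as assembling the full feature matrix at once, which is what makes the parallelization exact rather than approximate.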
6 Applications
Our learning theory, as well as the measures, norms, functionals, etc., can be applied to study all the examples considered in [45, 47, 75]. These examples, particularly those of [75], can thus be considered applications of the theoretical results as well as of the algorithm in Sect. D. We choose to study one new dynamics, not considered in [47, 75], since it exhibits some unique features of our generalized model; in particular, we choose it for its special form, having both energy-based and alignment-based interactions. It is the anticipation dynamics (AD) model of [65]. Table 4 shows the values of the learning parameters for this dynamics.
The setup of the learning experiment is as follows. We use \(M_{\rho }\) different initial conditions to evolve the dynamics (see Footnote 1) from 0 to T, for the sole purpose of obtaining a good approximation to \(\rho _T^{L, EA}, \rho _T^{L, E}\) and \(\rho _T^{L, A}\). Then we use another set of M (\(M = 500\) for FwEP and \(M = 750\) for AD) initial conditions to generate training data and learn the corresponding \(\phi ^{E}\) and \(\phi ^{A}\) from the empirical distributions \(\rho _T^{L, M, EA}\), etc. We report the relative learning errors calculated via (4.4) for \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\), for \({\widehat{\phi }}^{E}\), and for \({\widehat{\phi }}^{A}\), along with pictorial comparisons of the interaction kernels and a visualization of the pairwise data used to learn the estimated kernels. Then we evolve the dynamics, either from the training set of M initial conditions or from another set of M randomly chosen initial conditions, with \(\phi ^{E}\oplus \phi ^{A}\) and with \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\), from 0 to \(T_f > T\), and report the trajectory errors calculated using (6.1) on \(\varvec{y}\) (the whole system), on \(\varvec{x}\) (the position), and on \(\varvec{v}\) (the velocity). Pictorial comparisons of the trajectories are also shown. We report the trajectory errors over [0, T] and \([T, T_f]\). The learning results are shown in the following sections. We consider a related norm on the trajectory \(\varvec{Y}_{[0, T]} = \{\varvec{Y}_{t_l}\}_{l = 1}^L\) (\(0 = t_1< \cdots < t_L = T\)):
We also consider a relative version, invariant under changes of units of measure:
Lastly, we report errors between \(\varvec{X}_{[0, T]}\) and \({{\hat{\varvec{X}}}}_{[0, T]}\),
Similar re-scaled norms are used for the difference between \(\varvec{V}_{[0, T]}\) and \({{\hat{\varvec{V}}}}_{[0, T]}\), and for the difference between \(\varvec{\Xi }_{[0, T]}\) and \({{\hat{\varvec{\Xi }}}}_{[0, T]}\).
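Since the precise trajectory norms are given in display equations not reproduced here, the following is only a representative sketch of a unit-invariant discrete trajectory error of this type; the specific choice of supremum over time and average over agents is ours:

```python
import numpy as np

def relative_trajectory_error(Y_true, Y_hat):
    """A representative relative error between two discrete trajectories
    of shape (L, N, d): the maximum over observation times of the mean
    per-agent error, normalized by the same functional of the true
    trajectory so the value is invariant under a change of units."""
    err = np.linalg.norm(Y_true - Y_hat, axis=-1)   # (L, N)
    ref = np.linalg.norm(Y_true, axis=-1)           # (L, N)
    return np.max(err.mean(axis=1)) / np.max(ref.mean(axis=1))
```

Because both the numerator and denominator scale identically under a change of units, rescaling both trajectories by the same constant leaves the error unchanged.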
6.1 Learning results for anticipation dynamics with \(U(r) = \frac{r^p}{p}\)
The energy-based interactions are constant in the FwEP models; to consider more complicated models, i.e., interactions depending on the pairwise distance and more, the AD models are suitable candidates. The dynamics of the AD model is given as follows:
Here \(\tau \) measures the amount (in time) of anticipation. In order to fit the model into our learning regime, we take
Here we have no \(\xi _i\), \(K= 1\), \(m_i = 1\), and
We also use \(\tau = 0.1\).
It is shown in [65] that if \(U''\) is bounded as \(r \rightarrow \infty \) and \(U(0) = U'(0) = 0\), then unconditional flocking occurs. We take \(U(r) = \frac{r^p}{p}\) for \(1 < p \le 2\), so the system exhibits unconditional flocking. We choose \(p = 1.5\) for our learning trials (see Footnote 2). We use a tensor grid of \(1^{st}\) degree piecewise standard polynomials with \(n^E = 28^2\) for learning \(\phi ^{E}(r, s)\), and a set of \(1^{st}\) degree piecewise standard polynomials with \(n^A = 138\) for learning \(\phi ^{A}(r)\). For the energy-based interactions we have the following results.
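The one-dimensional building block of such a first-degree piecewise-polynomial space can be sketched as a "hat" basis on a uniform partition (the tensor grid for \(\phi^{E}(r,s)\) is then the product of two such families). This is our own illustrative construction, not the basis used in the paper's software:

```python
import numpy as np

def hat_basis(r, r_min, r_max, n):
    """Evaluate n piecewise-linear 'hat' basis functions on a uniform
    partition of [r_min, r_max] at the points r (any array shape).
    Returns an array of shape r.shape + (n,)."""
    knots = np.linspace(r_min, r_max, n)
    h = knots[1] - knots[0]
    # distance of each evaluation point from each knot, in mesh units
    t = np.abs(np.asarray(r)[..., None] - knots) / h
    return np.clip(1.0 - t, 0.0, None)
```

On the interval \([r_{\min}, r_{\max}]\) these functions form a partition of unity, so a linear combination of them reproduces any continuous piecewise-linear function on the partition exactly.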
As shown in Fig. 1b, the pairwise-distance data concentrates away from 0, making estimation of the behavior of \(\phi ^{E}(r, s)\) for r close to 0 extremely difficult; moreover, since \(\phi ^{E}\) is weighted by the pairwise difference \(\varvec{x}_{i'} - \varvec{x}_i\), information is also lost when \(r_{i, i'}\) is close to 0. Next, we present the alignment-based interaction kernels in Fig. 2a.
As shown in Fig. 2, the behavior of \(\phi ^{A}\) at \(r = 0\) is learned accurately. Less accurate is the estimation of \(\phi ^{A}\) for large r: since the agents have aligned their velocities, the weight \(\varvec{v}_{i'} - \varvec{v}_i\) is close to a zero vector. The overall learning performance for estimating \(\phi ^{A}\) is better compared to that of estimating \(\phi ^{E}\). The \({\widehat{\phi }}^{E}\oplus {\widehat{\phi }}^{A}\) error is: \(6 \cdot 10^{-1} \pm 3.0 \cdot 10^{-1}\). The comparison of trajectories between the true kernels (LHS) and the estimators (RHS) is shown in Fig. 3.
As shown in Fig. 3, visually, there is no difference between the true dynamics and the estimated dynamics. We offer more quantitative insight into the difference between the two in Table 5.
We maintain a 3-digit relative accuracy in estimating the position/velocity of the agents, even though for the interaction kernels, we are only able to maintain a 1-digit relative accuracy.
7 Conclusion and further directions
We have described a second-order model of interacting agents that incorporates multiple agent types, an environment, external forces, and multivariable interaction kernels. The inference procedure described exploits the structure of the system to achieve a learning rate that only depends on the dimension of the interaction kernels, which is much smaller than the full ambient dimension \((2d+1)N\). Our estimators are strongly consistent, and in fact have learning rates that are min–max optimal within the nonparametric class, under mild assumptions on the interaction kernels and the system. We described how one can relate the expected supremum error of the trajectories for the system driven by the estimated interaction kernels to the difference between the true interaction kernels and the estimated ones—this result gives strong support to the use of our weighted \(L^2\) norms as the correct way to measure performance and derive estimators. A detailed discussion of the full numerical algorithm, including the inverse problem derived from data and a coercivity condition to ensure learnability, along with complex examples, were presented and we showed how the formulation presented covers a very wide range of systems coming from many disciplines.
There are various ways that one could build on this work to handle different systems and for many of these further directions, the theoretical framework, techniques, and theorems presented here would be directly useful. In particular, one could consider second-order stochastic systems or a similar system but on a manifold, more complex environments, having more unknowns within the model beyond just the interaction kernels (say estimating the non-collective forces as well), identifying the best feature maps to model the data, and considering semiparametric problems where there are hidden parameters within the interaction kernels or other parts of the model that we wish to estimate along with the interaction kernels. The generality of the model and its broad coverage of models across the sciences, together with the scalability and performance of the algorithm, could inspire new models—both explicit equations and nonparametric estimators learned from data—which are theoretically justified and highly practical.
Data Availability
Data and software package can be found at: https://github.com/mingjzhong/LearningDynamics/tree/master/LearningDynamics2ndOrder.
Notes
We use the built-in MATLAB integrating routine, ode15s, with relative tolerance at \(10^{-8}\) and absolute tolerance at \(10^{-11}\).
\(p = 2\) induces constant forces on the dynamics.
A mixture of basis function types across dimensions is possible; the algorithm does not require the basis functions in each dimension to be of the same kind. We make this assumption for the sake of simplicity.
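The integration settings from the first note can be mirrored outside MATLAB. The sketch below is ours, not part of the released package: it uses SciPy's BDF method, which plays a role similar to ode15s for stiff problems, with the same tolerances; the small linear system is purely illustrative, not the paper's agent system.

```python
# Illustrative sketch: a SciPy analogue of the MATLAB ode15s call in the notes.
# ode15s is a variable-order solver for stiff ODEs; SciPy's "BDF" method plays
# a similar role. Tolerances match those stated in the text.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-1.0, 1.0],
              [0.0, -1000.0]])  # mildly stiff linear test system y' = A y

def rhs(t, y):
    return A @ y

sol = solve_ivp(rhs, (0.0, 1.0), y0=np.array([1.0, 1.0]),
                method="BDF", rtol=1e-8, atol=1e-11)
```

For the actual agent systems, the right-hand side would instead assemble the kernel-weighted interaction sums and non-collective forces.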
References
Abaid, N., Porfiri, M.: Fish in a ring: spatio-temporal pattern formation in one-dimensional animal groups. J. R. Soc. Interface 7, 1441–1453 (2010)
Albi, G., Balagué, D., Carrillo, J.A., Brecht, J.V.: Stability analysis of flock and mill rings for second order models in swarming. SIAM J. Appl. Math. 74, 794–818 (2014)
Ballerini, M., Cabibbo, N., Candelier, R., Cavagna, A., Cisbani, E., Giardina, I., Lecomte, V., Orlandi, A., Parisi, G., Procaccini, A., Viale, M., Zdravkovic, V.: Interaction ruling animal collective behavior depends on topological rather than metric distance: evidence from a field study. Proc. Natl. Acad. Sci. U.S.A. 105, 1232–1237 (2008)
Bellomo, N., Degond, P., Tadmor, E. (eds.): Active Particles, vol. 1. Springer International Publishing AG, Cham (2017)
Bialek, W., Cavagna, A., Giardina, I., Mora, T., Silvestri, E., Viale, M., Walczak, A.M.: Statistical mechanics for natural flocks of birds. Proc. Natl. Acad. Sci. U.S.A. 109, 4786–4791 (2012)
Binev, P., Cohen, A., Dahmen, W., DeVore, R., Temlyakov, V.: Universal algorithms for learning theory part I: piecewise constant functions. J. Mach. Learn. Res. 6, 1297–1321 (2005)
Bishwal, J.P.N., et al.: Estimation in interacting diffusions: continuous and discrete sampling. Appl. Math. 2, 1154–1158 (2011)
Blondel, V.D., Hendrickx, J.M., Tsitsiklis, J.N.: On Krause’s multi-agent consensus model with state-dependent connectivity. IEEE Trans. Autom. Control 54, 2586–2597 (2009)
Bongard, J., Lipson, H.: Automated reverse engineering of nonlinear dynamical systems. Proc. Natl. Acad. Sci. U.S.A. 104, 9943–9948 (2007)
Bongini, M., Fornasier, M., Hansen, M., Maggioni, M.: Inferring interaction rules from observations of evolutive systems I: the variational approach. Math. Models Methods Appl. Sci. 27, 909–951 (2017)
Boninsegna, L., Nüske, F., Clementi, C.: Sparse learning of stochastic dynamical equations. J. Chem. Phys. 148, 241723 (2018)
Brunel, N.: Parameter estimation of ODEs via nonparametric estimators. Electron. J. Stat. 2, 1242–1267 (2008)
Brunton, S., Kutz, N., Proctor, J.: Data-driven discovery of governing physical laws. SIAM News, 50 (2017)
Brunton, S., Proctor, J., Kutz, J.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. U.S.A. 113, 3932–3937 (2016)
Cao, J., Wang, L., Xu, J.: Robust estimation for ordinary differential equation models. Biometrics 67, 1305–1313 (2011)
Chen, X.: Maximum likelihood estimation of potential energy in interacting particle systems from single-trajectory data. Electron. Commun. Probab. 26, 1–13 (2021)
Cho, Y., Sever, S., Kim, Y.-H.: On some Gronwall type inequalities with iterated integrals. Math. Commun. 12, 63–73 (2007)
Chuang, Y., Huang, Y., D’Orsogna, M., Bertozzi, A.: Multi-vehicle flocking: scalability of cooperative control algorithms using pairwise potentials. In: IEEE International Conference on Robotics and Automation, pp. 2292–2299 (2007)
Chuang, Y.-L., Chou, T., D’Orsogna, M.R.: Swarming in viscous fluids: three-dimensional patterns in swimmer- and force-induced flows. Phys. Rev. E 93, 1–12 (2016)
Chuang, Y.-L., D’Orsogna, M.R., Marthaler, D., Bertozzi, A.L., Chayes, L.S.: State transitions and the continuum limit for a 2D interacting, self-propelled particle system. Physica D Nonlinear Phenom. 232, 33–47 (2007)
Couzin, I., Krause, J., Franks, N., Levin, S.: Effective leadership and decision-making in animal groups on the move. Nature 433, 513–516 (2005)
Cucker, F., Dong, J.-G.: A general collision-avoiding flocking framework. IEEE Trans. Autom. Control 56, 1124–1129 (2011)
Cucker, F., Mordecki, E.: Flocking in noisy environments. J. Math. Pures Appl. 89, 278–296 (2008)
Cucker, F., Smale, S.: On the mathematical foundations of learning. Bull. Am. Math. Soc. 39, 1–49 (2002)
Cucker, F., Smale, S.: Emergent behavior in flocks. IEEE Trans. Autom. Control 52, 852 (2007)
Dahmen, W., DeVore, R., Scherer, K.: Multi-dimensional spline approximation. SIAM J. Numer. Anal. 17, 380–402 (1980)
Dattner, I., Klaassen, C.: Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electron. J. Stat. 9, 1939–1973 (2015)
de Boor, C., DeVore, R.: Approximation by Smooth Multivariate Splines. Trans. Am. Math. Soc. 276, 775 (1983)
Della Maestra, L., Hoffmann, M.: The Lan property for Mckean–Vlasov models in a mean-field regime, arXiv preprint arXiv:2205.05932 (2022)
DeVore, R., Kerkyacharian, G., Picard, D., Temlyakov, V.: Approximation methods for supervised learning. Found. Comput. Math. 6, 3–58 (2006)
Feng, J., Ren, Y., Tang, S.: Data-driven discovery of interacting particle systems using gaussian processes, arXiv preprint arXiv:2106.02735 (2021)
Genon-Catalot, V., Larédo, C.: Inference for ergodic Mckean–Vlasov stochastic differential equations with polynomial interactions (2022)
Gomes, S.N., Stuart, A.M., Wolfram, M.-T.: Parameter estimation for macroscopic pedestrian dynamics models from microscopic data. SIAM J. Appl. Math. 79, 1475–1500 (2019)
Grégoire, G., Chaté, H.: Onset of collective and cohesive motion. Phys. Rev. Lett. 92 (2004)
Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, New York (2002)
Han, X., Shen, Z., Wang, W., Di, Z.: Robust reconstruction of complex networks from sparse data. Phys. Rev. Lett. 114, 028701 (2015)
Kang, S., Liao, W., Liu, Y.: Ident: identifying differential equations with numerical time evolution, arXiv preprint arXiv:1904.03538 (2019)
Kasonga, R.A.: Maximum likelihood theory for large interacting systems. SIAM J. Appl. Math. 50, 865–875 (1990)
Katz, Y., Tunstrom, K., Ioannou, C., Huepe, C., Couzin, I.: Inferring the structure and dynamics of interactions in schooling fish. Proc. Natl. Acad. Sci. U.S.A. 108, 18720–18725 (2011)
Krause, U.: A discrete nonlinear and non-autonomous model of consensus formation. Commun. Differ. Equ. 2000, 227–236 (2000)
Li, Z., Lu, F., Maggioni, M., Tang, S., Zhang, C.: On the identifiability of interaction functions in systems of interacting particles, arXiv preprint arXiv:1912.11965 (2019)
Liang, H., Wu, H.: Parameter estimation for differential equation models using a framework of measurement error in regression models. J. Am. Stat. Assoc. 103, 1570–1583 (2008)
Long, Z., Lu, Y., Ma, X., Dong, B.: PDE-net: learning PDEs from data, arXiv preprint arXiv:1710.09668 (2017)
Lu, F., Maggioni, M., Tang, S.: Learning interaction kernels in heterogeneous systems of agents from multiple trajectories (2019)
Lu, F., Maggioni, M., Tang, S.: Learning interaction kernels in stochastic systems of interacting particles from multiple trajectories. Found. Comput. Math., 1–55 (2021)
Lu, F., Zhong, M., Tang, S., Maggioni, M.: Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. U.S.A. 116, 14424–14433 (2019)
Lukeman, R., Li, Y., Edelstein-Keshet, L.: Inferring individual rules from collective behavior. Proc. Natl. Acad. Sci. U.S.A. 107, 12576–12580 (2010)
Maggioni, M., Miller, J., Zhong, M.: Agent-based learning of celestial dynamics from ephemerides (2020) (in preparation)
Messenger, D.A., Bortz, D.M.: Learning mean-field equations from particle data using wsindy, arXiv preprint arXiv:2110.07756 (2021)
Messenger, D.A., Wheeler, G.E., Liu, X., Bortz, D.M.: Learning anisotropic interaction rules from individual trajectories in a heterogeneous cellular population, arXiv preprint arXiv:2204.14141 (2022)
Miao, H., Xia, X., Perelson, A., Wu, H.: On identifiability of nonlinear ODE models and applications in viral dynamics. SIAM Rev. 53, 3–39 (2011)
Motsch, S., Tadmor, E.: Heterophilious dynamics enhances consensus. SIAM Rev. 56, 577–621 (2014)
O’Keeffe, K., Bettstetter, C.: A review of swarmalators and their potential in bio-inspired computing, p. 85 (2019)
O’Keeffe, K.P., Evers, J.H., Kolokolnikov, T.: Ring states in swarmalator systems. Phys. Rev. E 98 (2018)
O’Keeffe, K.P., Hong, H., Strogatz, S.H.: Oscillators that sync and swarm. Nat. Commun. 8, 1–12 (2017)
Raissi, M.: Deep hidden physics models: deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19, 932–955 (2018)
Raissi, M., Karniadakis, G.: Hidden physics models: machine learning of nonlinear partial differential equations. J. Comput. Phys. 357, 125–141 (2018)
Ramsay, J., Hooker, G., Campbell, D., Cao, J.: Parameter estimation for differential equations: a generalized smoothing approach. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 69, 741–796 (2007)
Rudy, S., Brunton, S., Proctor, J., Kutz, N.: Data-driven discovery of partial differential equations. Sci. Adv. 3, e1602614 (2017)
Schaeffer, H., Tran, G., Ward, R.: Extracting sparse high-dimensional dynamics from limited data. SIAM J. Appl. Math. 78, 3279–3295 (2018)
Schmidt, M., Lipson, H.: Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009)
Schumaker, L.: Spline Functions: Basic Theory, 3rd edn. Cambridge University Press, Cambridge (2007)
Sharrock, L., Kantas, N., Parpas, P., Pavliotis, G.A.: Parameter estimation for the mckean-vlasov stochastic differential equation, arXiv preprint arXiv:2106.13751 (2021)
Shu, R., Tadmor, E.: Anticipation breeds alignment, arXiv preprint arXiv:1905.00633 (2019)
Strogatz, S.H.: From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators. Physica D 143, 1–20 (2000)
Tunstrøm, K., Katz, Y., Ioannou, C.C., Huepe, C., Lutz, M.J., Couzin, I.D.: Collective states, multistability and transitional behavior in schooling fish. PLoS Comput. Biol. 9 (2013)
Tran, G., Ward, R.: Exact recovery of chaotic systems from highly corrupted data. Multiscale Model. Simul. 15, 1108–1129 (2017)
Tsybakov, A.: Introduction to Nonparametric Estimation, 1st edn. Springer, New York (2008)
van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes with Applications to Statistics, 1st edn. Springer, New York (1996)
Varah, J.: A spline least squares method for numerical parameter estimation in differential equations. SIAM J. Sci. Stat. Comput. 3, 28–46 (1982)
Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I., Shochet, O.: Novel type of phase transition in a system of self-driven particles. Phys. Rev. Lett. 75, 1226–1229 (1995)
Yao, R., Chen, X., Yang, Y.: Mean-field nonparametric estimation of interacting particle systems, arXiv preprint arXiv:2205.07937 (2022)
Zhang, S., Lin, G.: Robust data-driven discovery of governing physical laws with error bars. Proc. R. Soc. A Math. Phys. Eng. Sci. 474, 20180305 (2018)
Zhong, M., Miller, J., Maggioni, M.: Data-driven discovery of emergent behaviors in collective dynamics. Physica D Nonlinear Phenom., 132542 (2020)
Acknowledgements
MM is grateful for discussions with Fei Lu and Yannis Kevrekidis, and for partial support from NSF-1837991, NSF-1913243, NSF-1934979, NSF-Simons-2031985, AFOSR-FA9550-17-1-0280 and FA9550-20-1-0288, ARO W911NF-18-C-0082, and to the Simons Foundation for the Simons Fellowship for the year ’20-’21; ST was partially supported by Regents Junior Faculty fellowship, Faculty Early Career Acceleration grant sponsored by University of California Santa Barbara, Hellman Family Faculty Fellowship, and the NSF under Award No. DMS-2111303; JM for support from NIH-T32GM11999; MZ for support from NSF-AoF-2225507. Please direct correspondence to JM and ST. All authors jointly designed research and wrote the manuscript; JM and ST derived theoretical results; MZ developed algorithms and applications; JM and MZ analyzed data.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Mark Iwen.
Appendices
Appendix A: Continuity of the error functionals
For any \(t\in [0,T]\), consider the two random variables,
These will be used in various places throughout the technical proofs and easily relate to the error functionals
which by the Strong Law of Large Numbers satisfy
Indeed
We begin by establishing basic continuity results for our error functionals over the hypothesis space. The specific structure of the governing equations plays a critical role in the analysis. We adapt the analysis in [45] to second-order systems, presenting the key proofs and skipping the proofs of the main technical lemmas.
1.1 Alignment and energy based kernels
Proposition 6
Recall that \(\varvec{\phi }^{EA}\) are the true interaction kernels. For \(\widehat{\varvec{\varphi }}^{EA}, \widehat{\varvec{\phi }}^{EA}\in \varvec{{\mathcal {H}}}^{EA}\) the true and empirical error functionals are bounded as follows,
Recall the definitions of \(R, R_{{\dot{x}}}\) in Eqs. (3.7) and (3.9).
Proof
Using Jensen’s inequality,
where
and \({{\widehat{\varvec{\rho }}}}_{T}^{t}=\bigoplus _{k,k'=1,1}^{K,K}{{\widehat{\rho }}}_{T}^{t, kk'}\). Therefore, we have that
Taking the expectation with respect to \(\varvec{\mu ^{\varvec{Y}}}\) on each side of (A.8), we get the first inequality. The second inequality follows by noticing that,
\(\square \)
1.2 Environment interaction kernels
Here we show a result analogous to the alignment and energy result above. The techniques are similar, and the result serves an identical purpose in the theory. Recall the definition of \(R_{\xi }\) in (3.10).
Proposition 7
For \({{\widehat{{\varvec{\varphi }}}}},{{\widehat{{\varvec{\phi }}}}} \in \varvec{{\mathcal {H}}}^{\xi }\), we have
The following lemma can be immediately deduced using (A.4), (A.5).
Lemma 8
For all \(\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\), define the defect function \(L_M^{EA}(\varvec{\varphi }^{EA})\) as
Then, given two functions \(\varvec{\varphi }^{EA}_1, \varvec{\varphi }^{EA}_2\in \varvec{{\mathcal {H}}}^{EA}\), the defect function is bounded by
almost surely with respect to \(\varvec{\mu ^{\varvec{Y}}}\).
A similar lemma can be immediately deduced on the \(\xi \) variable.
Lemma 9
For all \(\varvec{\varphi }^{\xi }\in \varvec{{\mathcal {H}}}^{\xi }\), define the defect function \(L_M^{\xi }(\varvec{\varphi }^{\xi })\) as
Then, given two functions \(\varvec{\varphi }^{\xi }_1, \varvec{\varphi }^{\xi }_2 \in \varvec{{\mathcal {H}}}^{\xi }\), the defect function is bounded by
almost surely with respect to \(\varvec{\mu ^{\varvec{Y}}}\).
1.3 Uniqueness of minimizers over a compact convex space
Recall the energy and alignment bilinear functional \(\langle \langle {\cdot , \cdot }\rangle \rangle _{EA}\)
for any \(\varvec{\varphi }^{EA}_1, \varvec{\varphi }^{EA}_2 \in \varvec{{\mathcal {H}}}^{EA}\). The \({\mathcal {S}}\)-inner product is the inner product induced by the \(\Vert \cdot \Vert _{{\mathcal {S}}}\) norm via the polarization identity, which holds since we are working in an \(L^2\) space, where the parallelogram law holds. Then our coercivity condition (4.5) can be written in terms of this bilinear functional as: for all \(\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\)
Proposition 10
Let the minimizer of the error functional be denoted
then for all \(\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\), the difference of the error functional at this element of \( \varvec{{\mathcal {H}}}^{EA}\) and the minimizer is lower bounded as,
Thus, the minimizer of \(\varvec{{\mathcal {E}}}_{\infty }^{EA}\) over \(\varvec{{\mathcal {H}}}^{EA}\) is unique in \(\varvec{L}^2(\varvec{\rho }_T^{EA,L})\).
Proof
The proof leverages the convexity of the risk functional and of the hypothesis function space; it is the same as that of Proposition 18 in [45]. \(\square \)
Proposition 11
Let the minimizer of the error functional be denoted
then for all \(\varvec{\varphi }^{\xi }\in \varvec{{\mathcal {H}}}^{\xi }\), the difference of the error functional at this element of \( \varvec{{\mathcal {H}}}^{\xi }\) and the minimizer is lower bounded as,
Thus, the minimizer of \(\varvec{{\mathcal {E}}}_{\infty }^{\xi }\) over \(\varvec{{\mathcal {H}}}^{\xi }\) is unique in \(\varvec{L}^2(\varvec{\rho }_T^{\xi , L})\).
1.4 Uniform estimates on defect functions
We start this section by introducing normalized errors of the estimators. Denote the minimizer of \(\varvec{{\mathcal {E}}}_{\infty }^{EA}(\cdot )\) over \(\varvec{{\mathcal {H}}}^{EA}\) by
For any \( \varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\), define the normalized errors as
These quantities capture the difference between the expected/empirical errors of the estimator and the function in the hypothesis space minimizing the expected error functional. We begin by proving a lemma which assumes that the distance between the expected and empirical normalized errors is small for a given estimator. We then show that we have similar control on these distances for all points in a neighborhood of this particular one. This control enables us to apply a covering argument in the main proposition of this section, thanks to the compactness of the hypothesis space.
Remark 12
Exactly analogous definitions hold for the \(\xi \) variable and we will simply state the results in that case.
We skip the proof of Lemma 13, Lemma 14, as they follow in the same way as Lemma 19 in [45].
Lemma 13
For all \(\epsilon >0\) and \(0<\alpha <1\), if the function \(\varvec{\varphi }^{EA}_1 \in \varvec{{\mathcal {H}}}^{EA}\) satisfies
then for all \( \varvec{\varphi }^{EA}_2 \in \varvec{{\mathcal {H}}}^{EA}\) such that \(\Vert \varvec{\varphi }^{EA}_1 - \varvec{\varphi }^{EA}_2 \Vert _{\infty }\le \frac{\alpha \epsilon }{8S_{EA}\max \{R, R_{{\dot{x}}}\}^2K^4}\), where \(S_{EA} = \max \{S_E,S_A\}\), we have
Arguing in the same way as above, we can derive the lemma below using Eq. (A.11). We define \({\mathcal {D}}_{\infty }^{\xi }, {\mathcal {D}}_M^{\xi }\) similarly to (A.17), (A.18) using \(\varvec{{\mathcal {E}}}_{\infty }^{\xi },\varvec{{\mathcal {E}}}_{M}^{\xi }\) in the obvious way.
Lemma 14
For all \(\epsilon >0\) and \(0<\alpha <1\), if the function \(\varvec{\phi }^{\xi }_1 \in \varvec{{\mathcal {H}}}^{\xi }\) satisfies
then for all \( \varvec{\phi }^{\xi }_2 \in \varvec{{\mathcal {H}}}^{\xi }\) such that \(\Vert \varvec{\phi }^{\xi }_1 - \varvec{\phi }^{\xi }_2 \Vert _{\infty }\le \frac{\alpha \epsilon }{8S_{0}R_{\xi }^2K^4}\), we have, for \(S_0\ge S_{\xi }\),
1.5 Concentration
Proposition 15
For all \(\epsilon > 0\), \(0<\alpha <1\), \(\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}^{EA}\), the following concentration bound holds
Proof
Consider the random variable \(\Theta \) (with randomness coming from the random initial condition, distributed according to \(\varvec{\mu ^{\varvec{Y}}}\)), and to ease the notation let \({\widehat{{\varvec{\phi }}}}^{EA}:= {\widehat{{\varvec{\phi }}}}_{L,\infty , \varvec{{\mathcal {H}}}^{EA}}^{EA}\),
The coercivity condition given in Definition 1, Proposition 10 and (A.4) allow us to bound the variance, denoted \(\sigma ^2\), of \(\Theta \) as follows.
By applying Eq. (A.9) from the proof of Proposition 6, we have that \(\Theta \le 8S_{EA}^2\max \{R, R_{{\dot{x}}} \}^2K^4\) almost surely. We then apply the one-sided Bernstein inequality to \(\Theta \) and, recalling the definitions (A.1) together with the definitions of the normalized errors in (A.17), (A.18), we get that:
Now we provide a lower bound for the exponent to simplify the dependencies. We show that,
or equivalently,
By the estimate (A.19), since \(0< \alpha \le 1\), and \(0<c_{L, N, \varvec{{\mathcal {H}}}^{EA}}<K^2\) it is sufficient to show that
This follows from Young’s inequality as \(2{\mathcal {D}}_{\infty }(\varvec{\varphi }^{EA})\epsilon +\epsilon ^2 \le ({\mathcal {D}}_{\infty }(\varvec{\varphi }^{EA})+\epsilon )^2\), and together these results give the desired bound of the proposition. \(\square \)
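For completeness, the one-sided Bernstein inequality invoked in the proof above can be stated as follows (a standard form for i.i.d. variables bounded on one side; the notation \(\Theta_m, B\) is ours):

```latex
% One-sided Bernstein inequality: \Theta_1,\dots,\Theta_M i.i.d. copies of
% \Theta, with \Theta - \mathbb{E}[\Theta] \le B almost surely and variance
% \sigma^2. Then, for every \epsilon > 0,
\mathbb{P}\left\{ \frac{1}{M}\sum_{m=1}^{M}\Theta_m - \mathbb{E}[\Theta] \ge \epsilon \right\}
\;\le\; \exp\left( -\frac{M\epsilon^2}{2\left(\sigma^2 + \tfrac{1}{3}B\epsilon\right)} \right).
```

The deviation in the other direction, as needed in the proof, follows by applying this bound to \(-\Theta\).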
We can now easily derive the desired supremum bound by a covering argument. The estimation of the covering numbers involved will play a critical role in the main theorems and will be carried out in a dimension-dependent way in order to obtain optimal min–max rates.
Proposition 16
In the notation of Proposition 15,
where \({\mathcal {N}}\bigg (\varvec{{\mathcal {H}}}^{EA}, \frac{\alpha \epsilon }{8S_{EA}\max \{R, R_{{\dot{x}}}\}^2K^4}\bigg )\) denotes the covering number of \(\varvec{{\mathcal {H}}}^{EA}\) with radius \(\frac{\alpha \epsilon }{8S_{EA}\max \{R, R_{{\dot{x}}}\}^2K^4}\).
Proof
Let \(\varvec{\varphi }^{EA}_i = \varvec{\varphi }^{E}_i \oplus \varvec{\varphi }^{A}_i \in \varvec{{\mathcal {H}}}^{EA}\), for \(i=1, \ldots , {\mathcal {N}}\bigg (\varvec{{\mathcal {H}}}^{EA}, \frac{\alpha \epsilon }{8S_{EA}\max \{R, R_{{\dot{x}}}\}^2K^4}\bigg )\), denote the centers of the disks \(D_i\) of radius \(\frac{\alpha \epsilon }{8S_{EA}\max \{R, R_{{\dot{x}}}\}^2K^4}\) covering \(\varvec{{\mathcal {H}}}^{EA}\). The covering number is finite by the compactness assumption on the hypothesis space. By Lemma 13,
Now, by Proposition 15, for each i,
By definition, \(\varvec{{\mathcal {H}}}^{EA}\subseteq \bigcup _{i}D_i\), so that
\(\square \)
Finally, we state the results for the \(\xi \) variable; the proofs are analogous. The advantage of splitting the theorems now becomes apparent: it allows us to control the covering numbers of the EA and \(\xi \) hypothesis spaces separately, enabling faster rates than if we viewed the task as estimating all functions simultaneously. This is possible due to the fundamentally decoupled nature of the dynamical system.
Proposition 17
For all \(\epsilon > 0\), \(0<\alpha <1\), \(\varvec{\varphi }^{\xi }\in \varvec{{\mathcal {H}}}^{\xi }\), the following concentration bound holds
Proposition 18
In the notation of Proposition 17,
where \({\mathcal {N}}\bigg (\varvec{{\mathcal {H}}}^{\xi }, \frac{\alpha \epsilon }{8S_{0}R_{\xi }^2K^4}\bigg )\) denotes the covering number of \(\varvec{{\mathcal {H}}}^{\xi }\) with radius \(\frac{\alpha \epsilon }{8S_{0}R_{\xi }^2K^4}\).
We are now ready to present the proofs of the main concentration theorems.
Proof of Theorem 2
We start out by setting \(\alpha =\frac{1}{6}\) in Proposition 16, which yields the tightest bound in the argument below. To ease the notation we let \(\widehat{\varvec{\phi }}_{L,M,\varvec{{\mathcal {H}}}^{EA}}^{EA} = \widehat{\varvec{\phi }}_{L,M,\varvec{{\mathcal {H}}}^{EA}}^E \oplus \widehat{\varvec{\phi }}_{L,M,\varvec{{\mathcal {H}}}^{EA}}^A\) and similarly for \(\widehat{\varvec{\phi }}_{L,\infty ,\varvec{{\mathcal {H}}}^{EA}}^{EA}\). From the Proposition, we have that
holds true with probability
This immediately implies, by choosing \(\varvec{\varphi }^{EA}=\widehat{\varvec{\phi }}_{L,M,\varvec{{\mathcal {H}}}^{EA}}^{EA}\), that with probability \({\mathcal {P}}\)
By definition of \({{\widehat{{\varvec{\phi }}}}}_{L,M,\varvec{{\mathcal {H}}}^{EA}}^{EA}\) as the minimizer of the empirical error functional \(\varvec{{\mathcal {E}}}_{M}^{EA}\), we see that
and combining this result with equation (A.14) from Proposition 10, we have
with probability \({\mathcal {P}}\). With the same probability,
The first inequality follows from the coercivity condition (4.5) and the definition of \(\widehat{\varvec{\phi }}^{EA}_{\infty }\). The second follows from the definition of the norms. Now, for a chosen \(0<\delta <1\), let
and solve for M. The proof for the part of the system involving \(\xi \) is similar. \(\square \)
Proof of Theorem 3
To simplify the notation, we use the same conventions as the proof of Theorem 2 and let \({\mathcal {D}}_{\infty } = {\mathcal {D}}_{L, \infty , \varvec{{\mathcal {H}}}^{EA}_M}\). By definition of the coercivity constant in (4.5), we have the inequality \(c_{\cup _{M}{\varvec{{\mathcal {H}}}^{EA}_M}} \le c_{\varvec{{\mathcal {H}}}^{EA}_M} \). From an argument analogous to the one used to arrive at Eq. (A.21) in the proof of Theorem 2, we obtain that
For \(\epsilon > 0\), inequality (A.22) implies
We now bound the two terms in the above expression separately. For the first term, the proof of Theorem 2 shows that
where \(C_1 = 96S_{EA}^2\max \{R, R_{{\dot{x}}} \}^2K^4\), \(C_2 = 2304S_{EA}^2\max \{R, R_{{\dot{x}}} \}^2K^4\), and \({\mathcal {N}}(\cup _{M}{\varvec{{\mathcal {H}}}^{EA}_M},\frac{\epsilon }{C_1})\) is finite because of the compactness assumption on \(\cup _{M}\varvec{{\mathcal {H}}}^{EA}_M\). Summing this bound in M,
For the second term, the bound (A.4) yields
Since \(\epsilon \) is fixed, the above result, together with our assumption on the sequence of hypothesis spaces, implies that \(P_{\varvec{\mu ^{\varvec{Y}}}}\Big \{\varvec{{\mathcal {E}}}_{\infty }^{EA}(\widehat{\varvec{\phi }}^{EA}_{\infty }) \ge \frac{\epsilon }{2}\Big \}=0\) for M sufficiently large. So we have \(\sum _{M=1}^{\infty }P_{\varvec{\mu ^{\varvec{Y}}}}\{\varvec{{\mathcal {E}}}_{\infty }^{EA}(\widehat{\varvec{\phi }}^{EA}_{\infty }) \ge \frac{\epsilon }{2}\} < \infty \). The finiteness of the two sums above implies, by the first Borel–Cantelli Lemma, that
Since \(\epsilon \) was arbitrary, we have the desired strong consistency of the estimator. An exactly analogous argument gives the result on the part of the system involving \(\xi \). \(\square \)
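The first Borel–Cantelli Lemma used in the proof above states that, for any sequence of events \((A_M)_{M\ge 1}\) (here, the events that the error functional exceeds \(\epsilon\)):

```latex
% First Borel--Cantelli lemma:
\sum_{M=1}^{\infty} \mathbb{P}(A_M) < \infty
\quad \Longrightarrow \quad
\mathbb{P}\Big( \limsup_{M \to \infty} A_M \Big) = 0,
```

i.e., almost surely only finitely many of the events occur, which yields the claimed almost-sure convergence as \(\epsilon\) runs over a countable sequence tending to zero.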
Proof of Theorem 4
For part (a), let \(\varvec{{\mathcal {H}}}=\varvec{{\mathcal {K}}}_{S_E}^E \oplus \varvec{{\mathcal {K}}}_{S_A}^A\). Standard results on covering numbers of function spaces (see Theorem 2.7.1 of [70]) give that the covering number of \(\varvec{{\mathcal {H}}}\) satisfies
for some absolute constant \(C_{\varvec{{\mathcal {H}}}}\) depending only on \(\varvec{{\mathcal {H}}}\) and \({\mathcal {V}}\). By assumption on the hypothesis space, we have that \( \inf _{\varvec{\varphi }^{EA}\in \varvec{{\mathcal {H}}}}\Vert \varvec{\varphi }^{EA}-\varvec{\phi }^{EA}\Vert ^2_{\infty }=0 \). From this, the concentration estimate (4.7) together with the covering number bound imply that,
where \(C_1 = \frac{c_{\varvec{{\mathcal {H}}}}}{48S_{EA}\max \{R, R_{{\dot{x}}} \}^2K^4}\) and \(C_2 = \frac{c_{\varvec{{\mathcal {H}}}}}{1152S_{EA}^2\max \{R, R_{{\dot{x}}} \}^2K^4}\). Next, define the function
which we will minimize to achieve the desired probability bound. By direct calculation, \(g(\epsilon )=0\) if we choose \(\epsilon =\epsilon _{M}=(\frac{C_3}{M})^{\frac{1}{{\mathcal {V}}+1}}\), where \(C_3=\Big (\frac{4K^2}{C_2C_1^{{\mathcal {V}}}}\Big )^{\frac{1}{{\mathcal {V}}+1}}\); moreover the derivative of \(g(\epsilon )\) is \(\le 0\) for all \(\epsilon \ge \epsilon _M\). Therefore, the bound (A.23) implies
Integrating over \(\epsilon \in (0,+\infty )\) and using \(e^{-x}\le x+1\) for all \(x\ge 0\), we obtain
Using coercivity and (4.7), we conclude that
where \(C_4\) is an absolute constant that only depends on \(K, S_{EA}, R, R_{{\dot{x}}}\).
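The integration step in part (a) rests on the layer-cake (tail-integration) identity for a nonnegative random variable \(Z\):

```latex
\mathbb{E}[Z] \;=\; \int_{0}^{\infty} \mathbb{P}\{ Z > \epsilon \}\, d\epsilon .
```

A standard way to carry out the step is to split this integral at \(\epsilon_M\), bound the integrand by 1 on \((0,\epsilon_M)\), and use the exponential tail bound for \(\epsilon \ge \epsilon_M\), where \(g(\epsilon)\le 0\).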
For part (b), we recall (see [25, Proposition 5]) that
Using (4.7), and the approximation assumption, we bound the probability as
where \(c_2=\frac{1}{c_{\cup _n\varvec{{\mathcal {L}}}_n}}c_1\), \( c_3' = \frac{c_{\cup _n\varvec{{\mathcal {L}}}_n}}{48(S_{EA}+c_1)\max \{R, R_{{\dot{x}}}\}^2K^4}\), \(c_3=\frac{192 (S_{EA}+c_1)^2\max \{R, R_{{\dot{x}}}\}^2K^4}{c_{\varvec{{\mathcal {L}}}}^{EA}}\), and \(c_4=\frac{c_{\cup _n\varvec{{\mathcal {L}}}_n}}{1152(S_{EA}+c_1)^2\max \{R, R_{{\dot{x}}}\}^2K^4}\) are absolute constants independent of M. Define
To find the optimal n in terms of M, we minimize g in n. By taking a derivative, and solving the corresponding equation, we see that the optimal n is
with a constant depending on \(c_3, c_4, c_2\) but not on M. We choose \(n_*= \lfloor (\frac{M}{\log M})^{\frac{1}{2s+{\mathcal {V}}}}\rfloor \), let \(\epsilon _M = (\frac{M}{\log M})^{-\frac{2s}{2s+{\mathcal {V}}}}\) and
As above, let \(\epsilon = tn_*^{-2s} = t\epsilon _M\) and consider \(h(t\epsilon _M)\). It is easy to see that \(\lim _{t \rightarrow 0^+} h(t\epsilon _M) = \infty \) and \(\lim _{t \rightarrow \infty }h(t \epsilon _M)=-\infty \). Together with the continuity of h, these facts imply that there exists a constant \(c_5\), depending on \(K,c_0, c_2,c_3,c_4\), such that \(h(c_5\epsilon _M) = 0\). We further need \(h'(\epsilon ) \le 0\) for all \(\epsilon \ge c_5\epsilon _M\); taking the derivative of h, basic calculus shows that this condition eventually holds. Therefore, if needed to satisfy the derivative condition, we can enlarge the constant \(c_5\) to a constant \(c_6\) (independent of M) such that \(h(\epsilon ) \le 0\) and \( h'(\epsilon ) \le 0\) for all \(\epsilon \ge c_6\epsilon _M\). These results imply
and therefore
where \(C_1\) is a constant depending on \(c_0,c_1,s, K, S_{EA}, R, R_{{\dot{x}}}\).
Now, with \(\varvec{{\mathcal {H}}}^{EA}_M = \varvec{{\mathcal {B}}}_{n_*}\) and using (4.7), we have shown the convergence rate
where \(c_7\) is an absolute constant that only depends on \(s,K,c_0, c_1, S_{EA}, R, R_{{\dot{x}}}\). \(\square \)
Appendix B: Existence, uniqueness and properties of the measures
In this section, we provide technical details on the analytic properties of the collective system under consideration, as well as of the measures that we defined in Sect. 4.1. We emphasize that, for the analytic portion of the theory, as we saw with the trajectory prediction result, we view the system (2.2) as coupled (whereas for the learning theory we leverage the fact that it can be decoupled to improve the estimation performance). We begin by showing that, under the assumption that the interaction kernels lie in the corresponding admissible spaces, the system is well-posed.
1.1 Well-posedness of second-order heterogeneous systems
Proposition 19
Suppose the interaction kernels \({\varvec{\phi }}^E=(\phi _{kk'}^{E})_{k,k'=1}^{K,K}\), \({\varvec{\phi }}^A=(\phi _{kk'}^A)_{k,k'=1}^{K,K}\), \({\varvec{\phi }}^{\xi }=(\phi _{kk'}^{\xi })_{k,k'=1}^{K,K}\) lie in the admissible sets \(\varvec{{\mathcal {K}}}_{S_{E}}^E\), \(\varvec{{\mathcal {K}}}_{S_{A}}^A\), \(\varvec{{\mathcal {K}}}_{S_{\xi }}^{\xi }\), respectively, where the admissible spaces are defined in (3.11). Then the second-order heterogeneous system (2.2) admits a unique global solution on [0, T] for every initial datum \(\varvec{X}(0),\dot{\varvec{X}}(0) \in {\mathbb {R}}^{dN}\), \(\varvec{\Xi }(0) \in {\mathbb {R}}^N\), and the solution depends continuously on the initial conditions.
The proof of Proposition 19 uses Lemma 20 and techniques similar to those used to prove the well-posedness of the first-order homogeneous system (see Section 6 in [10]): one rewrites the second-order system as a first-order system and then applies standard Carathéodory ODE results.
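Schematically, the reduction to first order stacks positions, velocities, and environment variables into a single state variable (this display is our shorthand; the precise right-hand side is the one given by (2.2)):

```latex
\varvec{Y} \;=\; \begin{pmatrix} \varvec{X} \\ \dot{\varvec{X}} \\ \varvec{\Xi} \end{pmatrix}
\in {\mathbb {R}}^{(2d+1)N},
\qquad
\dot{\varvec{Y}} \;=\;
\begin{pmatrix} \dot{\varvec{X}} \\ \ddot{\varvec{X}} \\ \dot{\varvec{\Xi}} \end{pmatrix}
\;=\; \varvec{F}(t, \varvec{Y}),
```

where \(\varvec{F}\) collects the non-collective forces and the kernel-weighted interaction sums; Lemma 20 supplies the Lipschitz continuity of \(\varvec{F}\) needed for the Carathéodory existence and uniqueness theory.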
Lemma 20
For any \(\varphi ^E \in {\mathcal {K}}_{S_E}^E\), \(\varphi ^A \in {\mathcal {K}}_{S_A}^A\), the function
for \(\varvec{x}, {\dot{\varvec{x}}} \in {\mathbb {R}}^d\) is Lipschitz continuous on \({\mathbb {R}}^{2d+p^E+p^A}\) where \(p^E,p^A\) are the dimensions of the range of the functions \(s^E,s^A\), respectively. Additionally, for any \(\varphi ^{\xi } \in {\mathcal {K}}_{S_{\xi }}^{\xi }\), the function
is Lipschitz continuous on \({\mathbb {R}}^{d+1 +p^{\xi }}\), where \(p^{\xi }\) is the dimension of the range of \(s^{\xi }\).
1.2 Properties of measures
In this section we state and prove some technical properties of the measures described in Sect. 4.1.
Lemma 21
Suppose each of the interaction kernels lies in the respective admissible space, namely \({\varvec{\phi }}^E \in \varvec{{\mathcal {K}}}_{S_E}^E\), \({\varvec{\phi }}^A \in \varvec{{\mathcal {K}}}_{S_A}^A\), \({\varvec{\phi }}^{\xi } \in \varvec{{\mathcal {K}}}_{S_{\xi }}^{\xi }\). Then, for each \((k,k')\), the measures \(\rho _T^{EA,k,k'},\rho _T^{EA,L,k,k'}\) and \(\rho _T^{\xi ,k,k'}, \rho _T^{\xi ,L,k,k'}\), defined in Sect. 4.1, are regular Borel probability measures.
The proof is similar to that of Lemma 1 in [45].
Proposition 22
Suppose the distribution \(\varvec{\mu ^{\varvec{Y}}}\) of the initial condition is compactly supported. Then for each \((k, k')\), the support of the measures \(\rho _T^{EA,kk'}, \rho _T^{\xi ,kk'}\) (and therefore \(\rho _T^{EA,L,kk'}, \rho _T^{\xi ,L,kk'}\)) is also compact.
Proof
The compact support of the variables \(r,{\dot{r}}, \xi \) and of the feature maps follows from the global well-posedness of the system on a finite time interval, together with the Lipschitz assumptions on the non-collective forces. This compact support over a fixed, finite time is precisely what is claimed in Proposition 22. \(\square \)
The main point is that, under reasonable assumptions on the non-collective forces, feature maps, interaction kernels, and time interval, together with the assumption that the agents' initial conditions cannot be arbitrarily far apart, the pairwise distances, velocities, and \(\xi \) remain controlled. Thus, given enough trajectories, the measures in Sect. 4.1 are well approximated by their discretized versions computed with the numerical approach described in Sect. 6. In other words, with a reasonable number of trajectories we can examine the set of pairwise distances, velocities, etc. that the agents explore and bin them to set the support of the interaction kernels. Explicit values for the constants claimed in the proposition depend on the properties of the non-collective forces, the support and sup-norm of the interaction kernels, the time interval T, and the number of agents.
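As a concrete illustration of the binning procedure just described, the following sketch estimates the empirical range of pairwise distances from observed trajectories and partitions it into uniform bins; the function name and array layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def empirical_pairwise_range(X, n_bins=20):
    """Estimate the empirical support [r_min, r_max] of pairwise distances
    from observed positions X of shape (L, N, d) (L snapshots, N agents
    in R^d), and bin it uniformly to set the support of the kernels."""
    L, N, d = X.shape
    dists = []
    for l in range(L):
        diff = X[l][:, None, :] - X[l][None, :, :]   # (N, N, d) pairwise differences
        r = np.linalg.norm(diff, axis=-1)            # (N, N) pairwise distances
        iu = np.triu_indices(N, k=1)                 # distinct pairs only
        dists.append(r[iu])
    dists = np.concatenate(dists)
    r_min, r_max = dists.min(), dists.max()
    edges = np.linspace(r_min, r_max, n_bins + 1)    # uniform bins over the support
    return r_min, r_max, edges
```

With more trajectories the empirical range \([r_{\min}, r_{\max}]\) becomes a better approximation of the support of the underlying measures.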
Appendix C: Additional performance measures
For measures related to learning the \(\xi \)-based interaction kernels, we take
Then, the measures are given by
and similarly for \(\rho _T^{\xi , L, k, k'}(r, \varvec{s}^{\xi }, \xi )\) and \(\rho _T^{\xi , L, M, k, k'}(r, \varvec{s}^{\xi }, \xi )\). As before, \(\rho _T^{\xi , k, k'}\) and its time-discretized version \(\rho _T^{\xi , L, k, k'}\) are used only in the theoretical analysis, whereas the empirical \(\rho _T^{\xi , L, M, k, k'}\) is used in the actual algorithm. For ease of notation, we consider direct sums of the measures for the phase variable.
Lastly, for the \(\xi \)-based interaction kernels, i.e., \({\widehat{\phi }}^{\xi }_{kk'}\) versus \(\phi ^{\xi }_{kk'}\), we consider the following norm,
Appendix D: Numerical algorithm
In this section, we will detail the construction of the linear systems to learn \(\vec {\alpha }^{EA}\) and \(\vec {\alpha }^{\xi }\).
We start with the procedure for solving for \(\vec {\alpha }^{EA}\). First, we build the basis functions for the finite dimensional hypothesis spaces \({\mathcal {H}}_{kk'}^E,{\mathcal {H}}_{kk'}^A\) using piecewise polynomials or clamped B-splines (see Sect. 5); altogether these are represented as \(\varvec{{\mathcal {H}}}^{EA}\) as in (3.15).
Remark 23
The support of the unknown interaction kernels is not assumed to be known. We build our finite dimensional subspaces, \({\mathcal {H}}_{kk'}^{E},{\mathcal {H}}_{kk'}^{A}\), based on the empirical observation data. For the support-detection capability of our estimators, see the examples of opinion dynamics in [47, 75].
We use a tensor grid of basis functions, i.e., the tensor product of basis functions in each dimension of \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}] \times {\mathbb {S}}^{E, L, M}_{kk'}\) or \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}] \times {\mathbb {S}}^{A, L, M}_{kk'}\), where \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}]\) is the empirical range of r given by the observation data, and \({\mathbb {S}}^{E, L, M}_{kk'}\) and \({\mathbb {S}}^{A, L, M}_{kk'}\) are the empirical ranges of \(\varvec{s}^E_{kk'}\) and \(\varvec{s}^A_{kk'}\), respectively. In each dimension of these product domains, the basis functions are piecewise standard polynomials (or other functions, such as clamped B-splines, Fourier bases, etc.) on a uniform partition, with the number of basis functions being \(n^{E, j}_{kk'}\) or \(n^{A, j}_{kk'}\); hence \(n^{E}_{kk'} = \prod _{j=1}^{1 + p^E_{k, k'}}n^{E, j}_{kk'}\) and \(n^{A}_{kk'} = \prod _{j=1}^{1 + p^A_{k, k'}} n^{A, j}_{kk'}\). Then, we assemble \(\vec {d}^{(m)}\) as follows,
![](http://media.springernature.com/lw196/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ182_HTML.png)
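The tensor-grid basis construction above can be sketched as follows, using the simplest (degree-0) piecewise-polynomial basis in each dimension; the helper names are illustrative, and any 1-D family (higher-degree piecewise polynomials, clamped B-splines, etc.) could be substituted.

```python
import numpy as np
from itertools import product

def piecewise_const_basis_1d(lo, hi, n):
    """Return n indicator functions of a uniform partition of [lo, hi]:
    the degree-0 piecewise-polynomial basis in one dimension."""
    edges = np.linspace(lo, hi, n + 1)
    def make(j):
        # right-closed last bin so the endpoint hi is covered
        return lambda x: float(edges[j] <= x < edges[j + 1] or (j == n - 1 and x == hi))
    return [make(j) for j in range(n)]

def tensor_basis(ranges, counts):
    """Tensor-product basis over a box prod_j [lo_j, hi_j]; the total number
    of basis functions is prod_j counts[j], matching n_{kk'} = prod_j n^j_{kk'}."""
    basis_1d = [piecewise_const_basis_1d(lo, hi, n)
                for (lo, hi), n in zip(ranges, counts)]
    return [
        (lambda fs: (lambda x: np.prod([f(xi) for f, xi in zip(fs, x)])))(combo)
        for combo in product(*basis_1d)
    ]
```

The box here plays the role of the empirical domain \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}] \times {\mathbb {S}}^{E, L, M}_{kk'}\).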
If \(\ddot{\varvec{x}}_i(t_l)\) is not given, a finite difference scheme applied to \(\varvec{x}_i(t_l)\) or \({{\dot{\varvec{x}}}}_i(t_l)\) is used to approximate it. Next, we build \(\vec {f}^{(m)}\) as follows,
![](http://media.springernature.com/lw340/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ183_HTML.png)
The learning matrix \(\Psi ^{EA, (m)} \in {\mathbb {R}}^{LNd \times n}\), with \(n = n^E + n^A\), is a concatenation of two submatrices, \(\Psi ^{E, (m)}\) and \(\Psi ^{A, (m)}\), i.e.,
For the energy-based learning matrix, \(\Psi ^{E, (m)}\), we use a lexicographical order on \((k, k')\) for \(k, k' = 1, \ldots , K\). We define \(n^{E}_{k, k', \text {prev}} = \sum _{(k_1, k_2) < (k, k')} n^{E}_{k_1, k_2}\); if \((k, k') = (1, 1)\), we take \(n^{E}_{1, 1, \text {prev}} = 0\). Then for \(\eta _{kk'}^E = 1, \ldots , n^E_{k, k'}\), \(\Psi ^{E, (m)}\) is given as follows,
![](http://media.springernature.com/lw436/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ184_HTML.png)
and for \(l = 1, \ldots , L\). An analogous construction yields \(\Psi ^{A, (m)}\). Then we define,
And lastly,
Then, \(\vec {\alpha }^{EA} = \begin{bmatrix} (\vec {\alpha }^E)^T&(\vec {\alpha }^A)^T\end{bmatrix}^T\), is obtained by solving
Then, we assemble
A similar assembly from \(\vec {\alpha }^A\) yields \({\widehat{\phi }}^{A}_{kk'}\). When a finite difference approximation is used for the second derivatives of \(\varvec{x}_i\), we end up with
where \(\vec {\zeta } = {\mathcal {O}}(\frac{T}{L})\) when a first-order finite difference scheme is used.
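The least-squares solve for the coefficient vector described above can be sketched as follows; the averaging over trajectories and the optional regularization are assumptions for illustration, and the exact normalization in the paper's linear system may differ.

```python
import numpy as np

def solve_alpha(Psis, ds, reg=0.0):
    """Solve for the coefficient vector alpha from per-trajectory learning
    matrices Psi^{(m)} and right-hand sides d^{(m)} via (regularized)
    normal equations for the least-squares problem."""
    n = Psis[0].shape[1]
    A = sum(P.T @ P for P in Psis) / len(Psis)            # (n, n) Gram matrix
    b = sum(P.T @ d for P, d in zip(Psis, ds)) / len(Psis)
    return np.linalg.solve(A + reg * np.eye(n), b)
```

In practice a pivoted QR or `numpy.linalg.lstsq` call on the stacked system is a numerically safer alternative to explicitly forming the Gram matrix.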
Next, for \(\vec {\alpha }^{\xi }\), we build the basis functions for each of the finite dimensional spaces \({\mathcal {H}}^{\xi }_{kk'}\) using piecewise polynomials or clamped B-splines (as in the EA case; many other bases also work well in this algorithm); this gives an explicit example of \(\varvec{{\mathcal {H}}}^{\xi }\). We use a tensor grid of basis functions, i.e., the tensor product of basis functions in each dimension of \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}] \times {\mathbb {S}}^{\xi , L, M}_{kk'}\), where \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}]\) is the empirical range of r given by the observation data and \({\mathbb {S}}^{\xi , L, M}_{kk'}\) is the empirical range of \(\varvec{s}^{\xi }_{kk'}\). In each dimension of \([R_{kk'}^{\min , L, M}, R_{kk'}^{\max , L, M}] \times {\mathbb {S}}^{\xi , L, M}_{kk'}\), the basis functions are piecewise standard polynomials (or other functions, such as clamped B-splines, Fourier bases, etc.) on a uniform partition, with the number of basis functions being \(n^{\xi , j}_{kk'}\); hence \(n^{\xi }_{kk'} = \prod _{j=1}^{1 + p^{\xi }_{k, k'}}n^{\xi , j}_{kk'}\). We let
![](http://media.springernature.com/lw536/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ185_HTML.png)
and
![](http://media.springernature.com/lw599/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ186_HTML.png)
Finally we define,
and
Thus, \(\vec {\alpha }^{\xi }\) is obtained by solving
Then, we assemble
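Once the coefficients are learned, the estimated interaction kernel is assembled as the corresponding linear combination of basis functions, \({\widehat{\phi }}(z) = \sum _{\eta } \alpha _{\eta } \psi _{\eta }(z)\); a minimal sketch, with illustrative names:

```python
def assemble_kernel(alphas, basis):
    """Assemble the estimated interaction kernel as a linear combination
    of the basis functions: phi_hat(z) = sum_eta alpha_eta * psi_eta(z).
    `basis` is a list of callables on the (tensor-grid) domain."""
    def phi_hat(*z):
        return sum(a * psi(*z) for a, psi in zip(alphas, basis))
    return phi_hat
```

The same assembly applies to each of \({\widehat{\phi }}^{E}_{kk'}\), \({\widehat{\phi }}^{A}_{kk'}\), and \({\widehat{\phi }}^{\xi }_{kk'}\), with the corresponding block of coefficients.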
Appendix E: Control of trajectory error
In this section, for the convenience of the reader, we gather a few of the technical tools used in the analysis of the system. These are fundamental results necessary for developing the trajectory prediction, the measure support, and the existence and uniqueness results. We also include some of the necessary results on covering numbers of function spaces used for the learning theory.
The first theorem we present is an iterated Grönwall-type result that allows us to analyze the trajectory error of the full system \(\varvec{Y}(t)\).
Theorem 24
Let u(t), a(t), and b(t) be nonnegative continuous functions in \(J=[\alpha , \beta ],\) and suppose that
for all \(t \in J\), where \(k_{i}(t, t_{1}, \ldots , t_{i})\) are nonnegative continuous functions in \(J_{i+1}\), \(i=1,2,\ldots ,n\), which are nondecreasing in \(t \in J\) for all fixed \((t_{1}, \ldots , t_{i}) \in J_{i}\), \(i=1,2,\ldots ,n\). Then, for all \(t \in J\)
where, for all \((t, s) \in J_{2}\)
for each continuous function w(t) in J.
Proof
See [17]. \(\square \)
Now we are ready to present the results for trajectory prediction.
Proof of Theorem 5
We introduce the function
defined on \({\mathbb {R}}^{2d+p^E+p^A}\) for functions \(\varphi ^E \in L^{\infty }([0,R]\times {\mathbb {S}}^E), \varphi ^A \in L^{\infty }([0,R]\times {\mathbb {S}}^A)\). Similarly, let \(F[\varphi ^{\xi }](\varvec{x}, \xi , s^{\xi }):= \varphi ^{\xi }(||\varvec{x}||, \varvec{s}^{\xi })\xi \). By assumption, \({\widehat{\varvec{Y}}}_0 = \varvec{Y}_0\) and \( \dot{{\widehat{\varvec{Y}}}}_0 = {\dot{\varvec{Y}}}_0\). For every \(t\in [0,T]\), by the fundamental theorem of calculus and the triangle inequality, we have
First, we introduce the convenient notation
![](http://media.springernature.com/lw524/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ68_HTML.png)
with analogous formulae for \(\varvec{s}_{{\widehat{i}} i'}^{E},\varvec{s}_{{\widehat{i}} i'}^{A},\varvec{s}_{i \widehat{i'}}^{A},\varvec{s}_{i \widehat{i'}}^{\xi }, \varvec{s}_{{\widehat{i}} i'}^{\xi },{\widehat{\varvec{s}^A_{ii'}}}, {\widehat{\varvec{s}^{\xi }_{ii'}}} \).
Above we have introduced the term
which can be expressed explicitly as
Note that in I, j is the index of the type among \(\{1,\ldots ,K\}\) and i indexes agents within each type \(C_j\); the same holds in the later expressions \(I_1,I_2\). For the third term of (E.1), we exploit the Lipschitz property of the non-collective force:
So that we have the bound
Now we break up I using the triangle inequality and get that \(I \le I_1 + I_2\) where
Using the Lipschitz property of \(F[{\widehat{\phi }}_{jj'}^{EA}]\) we get that, since
then,
By the assumptions on the feature maps, we have that
![](http://media.springernature.com/lw550/springer-static/image/art%3A10.1007%2Fs43670-023-00055-9/MediaObjects/43670_2023_55_Equ187_HTML.png)
Combining these bounds we see that,
Let \({\tilde{S}}= \max (S_E, S_A)^2\), \(J=(\max _{j,j'}\text {Lip}[\varvec{s}^E_{(j,j')},\varvec{s}^A_{(j,j')}]+1)^2\), and \(P={\tilde{S}}J\); then by Young's inequality we get that
and performing a similar analysis we get that
So, gathering terms, we can re-express (E.1) as
where \(F = \max _{i}\text {Lip}[F^{{{\dot{\varvec{x}}}}}_i]\). Performing an analogous analysis on \(\Vert \varvec{V}_t - {\widehat{\varvec{V}}}_t\Vert _{{\mathcal {S}}}^2,\Vert \varvec{\Xi }_t - {\widehat{\varvec{\Xi }}}_t\Vert _{{\mathcal {S}}}^2\), with some additional effort, one can get the following result on the phase variable
where \(F^{\xi } = \max _{i}\text {Lip}[F^{\xi }_i]\) and \(Q = \max (H, S^{\xi })\) where \(H = \max _{j,j'}\text {Lip}[\varvec{s}^E_{(j,j')},\varvec{s}^A_{(j,j')}]\). Similarly, we have that,
Gathering the bounds (E.9, E.10, E.11), we have that
where we denote the last three lines by a(t), and notice that a(t) is nondecreasing in t. We also denote \(A_1 = 2T(8KP + F + 8QK + F^{\xi })\) and \(B_1 = 2T^2(8KP + F)\). We now apply Theorem 24, which is stated in [17] and is originally due to Bainov and Simeonov. With this notation, we can rewrite the above bound as
And so in the notation of Theorem 24 we have \(u(t) = \Vert {\widehat{\varvec{Y}}}_t - \varvec{Y}_t \Vert _{{\mathcal {Y}}}^2\), \(b(t) = 1\), \(k_1(t,t_1) = A_1\) and \(k_2(t,t_1,t_2) = B_1\), so that for all t we have
and we have the simple bounds
Hence,
from which we immediately conclude the first assertion of the theorem,
Lastly, we use the results of Sect. A to obtain the key result on the expected supremum error. We take the expectation of each of the three terms of a(T) and normalize them so they are in the form of the results of Sect. A.
We similarly get that,
and an analogous bound holds for the remaining term of a(T). These bounds together lead to
which implies the desired result. \(\square \)
About this article
Cite this article
Miller, J., Tang, S., Zhong, M. et al. Learning theory for inferring interaction kernels in second-order interacting agent systems. Sampl. Theory Signal Process. Data Anal. 21, 21 (2023). https://doi.org/10.1007/s43670-023-00055-9
Keywords
- Machine learning
- Dynamical systems
- Interacting particle systems
- Agent-based dynamics
- Inverse problems
- Regularized least squares
- Nonparametric statistics