Parameter calibration with stochastic gradient descent for interacting particle systems driven by neural networks

We propose a neural network approach to model general interaction dynamics and an adjoint-based stochastic gradient descent algorithm to calibrate its parameters. The parameter calibration problem is considered as optimal control problem that is investigated from a theoretical and numerical point of view. We prove the existence of optimal controls, derive the corresponding first-order optimality system and formulate a stochastic gradient descent algorithm to identify parameters for given data sets. To validate the approach, we use real data sets from traffic and crowd dynamics to fit the parameters. The results are compared to forces corresponding to well-known interaction models such as the Lighthill–Whitham–Richards model for traffic and the social force model for crowd motion.


Introduction
In the recent years, many models for interaction dynamics with various applications such as swarming, sheep and dogs, crowd motion, traffic and opinion dynamics have been proposed, see, e.g.[1,[3][4][5]8,9,21,26,27] for an overview.Typically, the models are based on ordinary differential equations (ODEs) which describe the dynamics of each particle (or agent) in the system by interaction forces.A common approach is to model forces that replicate observations made in nature.For example in swarming, there exists the well-known three-phase model that consists of a short-range repulsion, a neutral/comfort zone and a third zone with attraction for long-range interactions [5].The ranges can be adjusted with the help of parameters that need to be fitted for every application.Certainly, this is a smart approach and leads to promising results.Nevertheless, it is interesting to overboard all those assumptions and instead use a neural network to model the interactions with parameters based on data.This is the path we will follow in this paper.Actually, a similar viewpoint was taken for example in [2,19,28] where interaction kernels were inferred with a least squares approach from given data.The theoretical basis was discussed in [2], where well-posedness under a certain coercivity condition was shown.Moreover, the convergence of the approach w.r.t. to the mean-field limit is discussed.In [28], the methodology was generalized and applications with complicated dynamics allowing for pattern formation such as clustering and milling were studied.The learned interaction kernels are compared to well-known physical models with the help of confusion matrices and pattern indicator scores.
Clearly, the idea of using neural networks as substitutes for static/dynamic models or observers is well-known, see e.g [6,14] in the general context of ODEs.Furthermore, in [23] neural networks are discussed as alternative to estimate friction in automotive brakes.See [20] for a survey of artificial neural networks in energy systems.Moreover, in [12] deep neural networks are used to approximate Lyapunov functionals for systems of ordinary differential equations and in [13] the same author proposes to store approximate Lyapunov functions in a deep neural network in order to overcome the curse of dimensionality.
We intend to follow a similar approach for interacting particle systems in this paper.We first propose a framework to model very general particle interactions with the help of neural networks.Then, we state the corresponding parameter calibration problem.It enables us to identify the parameters using techniques from optimal control.We prove the well-posedness of the identification problem and derive the corresponding first-order optimality conditions that are used to compute the gradient for the stochastic descent algorithm.
In addition to the neural network model, the proposed parameter calibration approach can be used for general interacting particle systems with explicitly given, differentiable interaction forces, for example in terms of gradients of a potential.For our experiments, we apply the stochastic descent algorithm to real data sets from traffic and crowd experiments [18,24] to fit the neural network parameters.For comparison, we use the same algorithm to fit as well the parameters of well-known interaction models, such as the Lighthill-Whitham-Richards (LWR) model for traffic [17] and the social force model for crowd dynamics [15].The two applications discussed here are prototypical examples for a first-order approach in 1d and second-order approach in 2d, respectively.Moreover, in both cases real data are available for our calibration purposes.
Similar parameter identification studies for pedestrian models have been recently introduced in [7,10] using a Bayesian probabilistic method and in [11,25] using neural networks.In contrast to [11,25], where the pedestrian speed or unknown interaction forces have been estimated, we address a more general setting that also allows for theoretical investigations and a rigorous numerical treatment.More precisely, for the neural networks we make some basic assumptions, but it is important to note that we do not prescribe any physical interaction assumptions.Using techniques from optimal control, we derive and analyse a parameter identification procedure that is based on stochastic gradient descent in Sect. 2. In Sect.3, we begin with well-posedness results concerning a general interacting particle system that is driven by an artificial neural network.Then, we prove the existence of an optimal control for the parameter identification problem.Moreover, in Sect.4, we derive the first-order optimality system for the identification problem in order to state the corresponding algorithm, an adjointbased stochastic gradient descent method.A proof of existence of the adjoint concludes the theory part.
To validate our approach, we present an extended parameter estimation study for the traffic and the pedestrian model in Sects.5 and 6, respectively.We apply the algorithm to a traffic scenario and estimate the interaction force as well as the speed of cars.The second application is based on pedestrian data, here again we train the artificial neural network with the help of our algorithm and compare the results to optimal parameters resulting from interactions that involve the social model for pedestrian interaction.The numerical results for both applications offer interesting insights and give rise for future considerations.

Optimal Control problem
The parameter identification is cast as optimal control problem.Let u denote the control variables, or, in terms of the application, the parameters to be identified and z the reference data set.We denote the space of controls by U = R K .Then, we are interested in min u∈U J (x(u); z) for some tailored cost function J .As the parameters enter the cost function implicitly through the state x, we propose to compute the gradient of the cost functional used for a stochastic gradient decent method with the help of an adjoint-based approach.
From now on, it is clear that x depends on the parameters u, so for notational convenience we drop the dependence in some equations.Let us assume that for each cost evaluation, we consider N agents with corresponding trajectories z i : [0, T ] → R d for i = 1, . . ., N .Based on the applications we have in mind, we obtain parameterdependent trajectories x u i : [0, T ] → R d for i = 1, . . ., N .We focus on the cost functional The state x = x(u) is a solution to an ODE system that is driven by a feed-forward artificial neural network (NN) as follows where W u models the interaction of the agents and is the output of a neural network parametrized by u.
Remark 1 Throughout the work, we assume that every agent is driven by the same artificial neural network.We need the index W i, j only for the applications.For example in the traffic dynamic, the cars only interact with the car in front.This leads to W i, j being nonzero only for j = i + 1.In contrast, the car in front drives with fixed velocity, this can be represented with fixed weights, not involved in the parameter identification.Moreover, pedestrians interact with every other person which yields W i, j = W ; indeed, we can use the same artificial network to model the interaction forces for all the pedestrians.
To summarize, the optimal control problem we propose for the parameter identification driven by neural networks reads (1).

Remark 2
We emphasize that in contrast to [11] the cost functional is not of the usual structure given by where h u (x i ) denotes the output of a neural network defined by the parameters u.Indeed, in the present article, the trajectories z i (u), i = 1, . . ., N are not the output of the neural network, but solutions of ODEs that are driven by the neural network.
As mentioned above, we consider feed-forward artificial neural networks.For later reference, we define these as follows.
Definition 1 A feed-forward artificial neural network (NN) is characterized by the following structure: -Input layer: where x ∈ R n (1) is the input (feature) and n (1) ) is the number of neurons without the bias unit a -Hidden layers: Note that the output layer has no bias unit.The entry u j,k of the weight matrix u ( ) ∈ R n ( −1) ×n ( ) describes the weight from neuron a ( −1) j to the neuron a ( )  k .For notational convenience, we assemble all entries u ( ) j,k in a vector R K with K := n (1) • n (2) + n (2) • n (3) For the numerical experiment, we use g ( ) (x) = log(1 + e x ) for = 2, . . ., N − 1 and g (L) (x) = x.An illustration of an NN with L = 3, four inputs and six units in the hidden layer can be found in Fig. 1.
Fig. 1 Illustration of a feed-forward artificial network with 4 inputs and one bias input, one hidden layer with one bias unit and 5 neurons and two outputs.This corresponds to n (1) = 4, n (2) = 5, n (3) = 2, L = 3

Analysis of the optimal control problem
In this section, we analyse the parameter estimation problem in terms of wellposedness.We begin with the state system and discuss then the well-posedness of the parameter identification problem.Finally, we show the existence of an adjoint state.

Well-posedness of the ODE system
Under some assumption on the activation functions g ( ) , = 2, . . ., L we can establish a well-posedness result for system (1).As the output of the neural network is the righthand side of the ODE, we assume n (L) = d for the following considerations.
Theorem 1 (Well-posedness of the state equation) Let the activation functions, g ( ) of the NN defined in Definition 1 with n (L) = d be globally Lipschitz for = 2, . . ., L. Then, there exists a unique solution x ∈ C 1 ([0, T ], R d ) to (1).

Proof
The proof exploits the recursive structure of the neural network defined in Definition 1.In fact, let x ∈ R n (1) and a (L) ∈ R d then Note that all relations are linear except for the activation functions g ( ) , = 2, . . ., L.
Using the globally Lipschitz assumption on g and the aforementioned linearity, we obtain where the constant C g depends on all the Lipschitz constants of g ( ) , = 2, . . ., L. This allows us to conclude the global Lipschitz property of the right-hand side of the ODE.An application of the Picard-Lindelöf theorem yields the well-posedness of the ODE as desired.

Remark 3
The SmoothReLU activation function given by g(x) = ln(1 + e x ) is one example satisfying the assumptions of Theorem 1.Another suitable activation function is the identity g(x) = x.
The analysis of the optimal control problem is established in a Hilbert space framework.That is why we use the embedding define the control to state map which is well-defined thanks to Theorem 1.Moreover, we define the reduced cost functional Ĵ (u) := J (S(u); z).
The next theorem provides us the continuous dependence on the data for the ODE solution.It will help us in the proof of the existence of a minimizer below (Theorem 3).
Theorem 2 (Continuous dependence on the data) Let the assumptions of Theorem 1 hold and, additionally, let U be bounded and g ( ) be bounded for = 2, . . ., L − 1.Then, the solution to (1) depends continuously on the data, i.e.
Proof Let us consider the i-th component of the difference of two solutions corresponding to different data u and ū given by We therefore need to estimate the m-th entry of the difference of the interaction forces.Let m ∈ {1, . . ., d} be arbitrary, we find where M u denotes the upper bound of U, M g ( −1) denotes the bound for the activation functions g ( ) with = 2, . . ., L − 1 and L g ( ) denotes the global Lipschitz constants of the activation functions.
The last step of the previous estimate can be recursively applied to For the first layer, we have The two estimates together yield This implies An application of Gronwall's theorem gives us Using ( 2) and (3), we obtain for the time derivative Summing over i = 1, . . ., N , we get the bound Combining the estimates in (3) and ( 4) leads to the desired result.
In the following, we are concerned with the existence of a minimizer to the control problem and the existence of an adjoint state.Latter will help us to state the firstorder optimality conditions and to compute the gradient for the descent algorithm.For notational convenience, we define the operator e(x, u) : Theorem 3 (Existence of a minimizer) Let W i, j u be weakly continuous for all i, j = 1, . . ., N and g (L) (0) = 0.Then, there exists a minimizer u * ∈ U ad for Problem 1.
Proof Equipped with the results of Theorem 1 and Theorem 2, the proof can be established by standard arguments.See Theorem 1.45 in [16] for the main ideas.
Remark 4 Note that the minimizer of the parameter identification problem may not be unique due to the nonlinearity in the state problem.
Remark 5 Again, the SmoothReLU activation functions satisfy the requirements of the existence result in Theorem 3. Indeed, g(x) = ln(1 + e x ) and g(x) = x have uniformly bounded derivatives.Hence, we apply the mean-value theorem to obtain Moreover, choosing g (L) to be the identity leads to g (L) (0) = 0. Theorem 4 (Existence of adjoint state) Let g ( ) ∈ C 1 with (g ( ) ) ∈ Lip loc (R) globally bounded and let ( x, ū) an optimal solution of Problem 1.Then, there exists an adjoint state, p, such that the following optimality condition holds: 123 Proof We aim to apply Corollary 1.3 in [16] and check its four requirements.We begin with the set of admissible controls.
is nonempty, convex and closed.This verifies the first requirement.
Next, we have to show that J : Y × U → R and e : Y × U → Z are continuous Fréchet differentiable.We set Note that these are all Banach spaces.The cost functional J is Fréchet differentiable by standard arguments, see for example [16], it holds For the state operator e, we find Using the assumptions on the activation function g ( ) , this yields e y (y, u) ∈ L(Y , Z ).Third, Theorem 1 assures that e(y, u) = 0 admits a unique solution We are left to show that e y (y(u), u) ∈ L(Y , Z ) has a bounded inverse for all u ∈ V in a neighbourhood of U ad .In order to see that we consider e y (y(u), u)[h] = r for arbitrary r ∈ Z = L 2 ((0, T ), R d ).By the Caratheodory theory for ODEs, there exists a unique solution to this equation for every r ∈ Z .Using the explicit expression of e y (y(u), u), we find for some constant C DW > 0 depending on the global bounds on the derivatives of the activation functions.An application of Gronwall's Lemma yields the boundedness of the inverse of e y (y(u), u).
Hence, the requirements of Corollary 1.3 in [16] are satisfied and we find an adjoint state p such that the optimality condition holds as desired.
These results assure the well-posedness of the optimization problem and its firstorder optimality system.We are well-equipped to state the stochastic gradient descent algorithm that we propose for the treatment of general neural network-driven optimization problems.The numerical results below are computed with this algorithm as well.

Stochastic gradient descent algorithm
Having the results of the previous section at hand, we can now propose the stochastic gradient descent algorithm.In this section, we assume that p ∈ H 1 ((0, T ), R d ) ⊂ Z .This allows us to formulate the adjoint system in strong form given by supplemented with the terminal condition p(T ) = 0, where x is a solution to (1) with initial condition x 0 = z 0 .The gradient is based on this strong form of the adjoint.

Gradient of the reduced cost functional
We compute the gradient of the reduced cost functional as follows Let e be the operator defined above, we obtain where we used the adjoint equation e y (S(u), u) * p = −(z − S(u)).Altogether, the gradient of the reduced cost functional can be expressed as Based on these considerations, we may establish a gradient descent algorithm.Moreover, for U ad bounded, we may employ a projected gradient descent method.Note that in our application we face big data sets and therefore, the evaluation of the full cost function is very costly.This is why we use a mini-batch algorithm.Its details are discussed in the following section.
Remark 6 Note that we exploited the recursive structure of the artificial neural network in the previous derivations.The approach can be generalized to a wider class of neural networks, in fact, all neural networks that allow for backpropagation can be used.Certainly, the computation of ∇ x W i, j u and ∇ u W i, j u can be very complicated in general.

Stochastic descent
In many applications, we expect the cost functions to have several local minima, where usual gradient descent algorithms may get stuck.To prevent this issue when training the neural networks, we use a mini-batch gradient descent scheme.Indeed, we compute ADADELTA updates as proposed in [11].This leads to the following algorithm: Let α k denote the k-th iterate of the gradient descent.We define where ρ ∈ (0, 1) is the rate for the adaption of the squared gradient information and > 0 avoids singular values by division.Both are fixed parameters.
Note that these updates are still deterministic.In order to incorporate noise that help us to escape from local minima, we add a multivariate normal distributed random vector N k with N k ∼ N (0, k ) to the gradient in each iteration.Here, k denotes the variance matrix.We choose for some constants η 1 , η 2 > 0 and ( k ) i j = 0 whenever i = j.Note that the noise diminishes as the number of iteration, k, increases.For the numerical simulations, we set η 1 = 1 and η 2 = 0.55, the adaption rate ρ = 0.95 and = 10 −6 .

Parameter estimation based on traffic data
The stochastic gradient descent method proposed above can be used to treat general neural network driven optimization or parameter identification problem.In the following, we apply it to two applications for which we have real data available.We begin with a one-dimensional traffic dynamic modelled with the help of a first-order ODE system, then we consider a two-dimensional pedestrian dynamic modelled with a second-order dynamic.Moreover, we use the stochastic gradient descent algorithm to estimate parameters of well-known interaction forces for the two scenarios and compare the output and the cost.We want to emphasize that we treat both approaches, the NN and the physically inspired models, as equally admissible.In particular, we do not choose one of them to be the ground truth.

Remark 7
We noticed in discussions that often physical models are accepted as ground truth without question.Even though these models are well-motivated, we cannot blindly assume they represent the whole truth.

Traffic models
For the traffic dynamic, we assume to have a follower-leader dynamic.Therefore, we choose two microscopic versions of the well-known LWR-model [17] with logarithmic and linear velocity function, respectively, as reference models.

Traffic dynamic with LWR-model
Let x i (t) denote the position of the i-th car at time t ∈ [0, T ].Then, the evolution of the cars is given by The parameters are v 0 the velocity of the leading car and L the length of the cars.We consider

Traffic dynamic with artificial neural network
The model driven by the feed-forward neural network is given as follows.We parametrize the interactions of the follower-leader dynamic by W u , where u = (v 0 , u Net ) and u Net is assumed to contain all the information of the neural network.The dynamic is then given by supplemented with initial data x(0) = x 0 .
Remark 8 Note that we have to prescribe some value for the first vehicle even in this case, as we assume that the behaviour of cars depends on the distance to the vehicle in front and the first vehicle has no one in front.

Data processing of traffic data set
We use the microscopic traffic data set that was recorded within the project ESIMAS [18].It contains vehicle data from 5 cameras that were placed in a 1km tunnel section on the German motorway A3 nearby Frankfurt / Main.For the parameter estimation, we extract sequences from the data that contain three or more vehicles in one lane.For simplicity, we restrict our considerations to the middle lane data and neglect the data of the y−coordinate.Thus, we have one-dimensional traffic data for the parameter estimation.First, we interpolate the position data that is supplemented with time stamps to a reference time discretization.Having all the data aligned to the reference time discretization, we filter sequences of data where two or more vehicles are present in the camera frame.This yields a database with various sequences of different length and with different number of vehicles that we use for the parameter identification.
Note that after this data extraction, the vectors containing the positions of the cars for each sequence are ordered.In fact, we pass this ordered vector to the neural network and hold on to the assumption that the interaction of the vehicles depends on the distance to the vehicle in front.This is why we have to prescribe some velocity for the first car in the data set, even in the case of the neural network without physical parameters, see (6).
The ideas behind this data processing are the following.First, to solve the ODE it is convenient to have a fixed number of particle in the scene.We therefore cut the data into smaller time slices to ensure that the cars in the scene do not change within one sequence.Another argument in favour of the small time slices is that the prediction of the trajectories is more likely to be successful on short time intervals than predicting the whole journey of the particles.In fact, the ODE interaction models do react to the current situation and therefore the learning based on short time interval is justified.The preprocessed data with position and timestamp information are available online. 1emark 9 Note that even though the parameter set gives data on the vehicle type, we do not use this information in the approach.Therefore, we expect to obtain an averaged length L as output of the parameter estimation for the LWR-model with logarithmic and linear velocity function.

Numerical schemes
For both approaches, the LWR and the NN model, we solve all the parameter identification problems with the stochastic gradient descent method discussed in Sect.4.2.The forward and adjoint models are integrated using an Explicit Euler scheme.The traffic data set has time step dt data = 0.2 for the simulation we use a finer discretization, i.e. dt = 0.002.

Numerical results
For the numerical results, we use the two data sets of "day 1" of all five cameras.We test three different neural network settings with two, four and ten neurons in the hidden layer and call them, N2, N4, N10, respectively.The results corresponding to the LWR-model with logarithmic velocity function are denoted by "Log", and the ones obtained with the linear velocity function are denoted by "Lin".Note that the provided data units are meter m and seconds s.We initialize the velocity v 0 and car length L for the LWR cases with v 0 = 30 m s and L = 5m.The same initial velocity is used for all neural network dynamics.The weights for the neural networks are initialized with random values uniformly distributed in [−1, 1].We set a lower bound for the car length L min = 2m.
Figure 2 shows the results of the identification for the different models.The circles show the identified velocities of the leading car for the different methods and the different data sets.The lines show the mean velocities taken over all data sets for the different methods.The mean velocity of the LWR-model with linear velocity function is slightly larger than the other mean velocities which are close to 80 km h .The maximal speed allowed on the highway is 100 km h .We see differences between the models for single data sets.The car lengths identified for the two LWR approaches are shown in Table 1.
The average car length estimated with the linear velocity approach is 4.69m, and the one for the logarithmic velocity function is 8.43m.These numbers may indicate that the linear model estimates the true car length, while the logarithmic approach includes the distance to the next car into this value.For test case 9, the car length of the linear model hits the lower bound L min .Out of curiosity, we dropped the lower bound assumption in this case, leading to a value of 0.6m and an over all average velocity close to 80 km h similar to the other models.Table 2 shows the cost for the different approaches using the interaction forces and parameters identified by the algorithm.In average, the neural network approach with four neurons in the hidden layer performs best.In fact, all the neural network approaches perform better than the LWR-model with linear or logarithmic velocity function.The least cost for each data set and for the average is highlighted.N4 has the least cost in 60% of the test cases.N10 approximates the third data set best, and Lin gives the best result for three of the data sets.The linear model is closer to the NN approaches and outperforms the logarithmic model.We can see that the smallest network with only two neurons in the hidden layer is not able to reproduce the interaction forces (see Table 2).On the other hand, the bigger the neural network, the more likely we run into problems of overfitting.This could be a reason why N4 outperforms N10.Nevertheless, this paper can only serve as a proof of concept of the methodology.An investigation of the robustness of the results with respect to noise in the data or the network architecture would be interesting future work.
Figure 3 illustrates the different interaction forces resulting from the parameter estimation.The interaction forces of the neural network models are rather linear except for the range 2m − 5m.The interaction forces of the linear and the logarithmic LWRmodel behave different in this region; in fact, they have very steep gradients and enter a negative regime, which corresponds to slowing down.For distances in the range of 15m-50m, the linear LWR-model is close to the interaction forces of the neural networks, whereas the logarithmic LWR-model admits larger values.Due to the lack of data in the short distance regime, we cannot expect the NN approaches to reproduce the behaviour of slowing down.
The learning process is visualized exemplarily for data set 4 in Fig. 4 which shows the norm of the gradient for different iterations of the learning process.The stochasticity of our algorithms is visible; at the first iterations, we see a high variance which is then diminishing when the learning process evolves.We choose a maximal number of 2500 iterations for the learning process.

Parameter estimation based on crowd data
The following parameter estimation is based on a data set provided by the Institute for Advanced Simulation: Civil Safety Research of Forschungszentrum Jülich [24].In particular, the data set of a bidirectional flow in a corridor is used.We employ the stochastic gradient descent algorithm to fit the parameters of the neural network as well as the social force model.123 Fig. 3 Interaction forces resulting from the parameter identifications Fig. 4 Illustration of the learning process.At the first iterations, our learning algorithms has a high variance, and the variance diminishes as the learning process evolves.We stop at a maximal number of 2500 iterations

Crowd data with social force model
We assume that the social force model proposed by Helbing and Molnár [15] is a good fit to work with the crowd data set.It reads supplemented with initial data x(0) = x 0 and v(0) = v 0 .The relevant parameters are given in Table 3.
The relaxation term towards the desired velocity v des i is constructed from the given trajectories as follows Here, we assume that each pedestrian tries to head towards his or her destination which is given by the last position of the data sequence and keeps the current speed.The other force terms are assumed to be given as with 123 being the distance of pedestrian i and pedestrian j, the normalized vector pointing from pedestrian j to pedestrian i, the tangential direction and the tangential velocity difference, respectively, and, h(y) = H (y) y with Heaviside function H .The interaction with the walls is parametrized using We assume here and in the following that the walls consist of stationary points, such that d iw , n iw and t iw are easy to compute.We fix the radius r = 0.25, the scaling B = 0.1, the relaxation parameter τ = 0.5 and the mass m = 1 for all simulations.
To summarize, the relevant parameters to find via the estimation procedure are u = (A, k, κ).

Crowd data with neural networks
As we aim to compare the results of the model-based approach to the neural network approach, we pass the same data to neural network that models the acceleration.Indeed, we assume to neural network dynamic to be given by supplemented with initial data x(0) = x 0 and v(0) = v 0 .Here, we refer with x k to the positions of the discretization points of the wall and with v k to artificial velocity vectors of the wall points for k = 1, . . ., N wall .Using separate neural networks for the interaction and the walls allows us to compare the structure of the resulting terms one by one and to understand the two approaches in more detail.Note that we use the same neural network with different normal vectors for the different walls.As for the social force approach, we fix the relaxation parameter τ = 0.5.The other parameters need to be estimated.For notational convenience, we define where u int NN denotes all the parameters involved in the neural network modelling the pairwise interaction between pedestrians and u wall NN the parameters of the neural network modelling the interaction of with the walls, respectively.

Processing of the crowd data set
The crowd data set is processed with the same approach as the traffic data sets.Indeed, we split it into sequences of fixed length T seq .Then, we consider only the pedestrians that are present during the whole sequence.In fact, we obtain T /T seq sequences with a different number of pedestrians present.This allows us to use a fixed number of pedestrians in the models for each computation of the gradient in the parameter estimation.We expect to have only small errors raised by the fact that some of the pedestrians are neglected.
After the data processing, we have the trajectories of all pedestrians present in each sequence.We use the first and the last point of these trajectories to compute the relaxation term (8) in each time step.Moreover, we compute the velocities of the pedestrians using a finite difference approximation.This applies to both, the social force and the NN approach.

Numerical schemes and parameters
We use the Explicit Euler scheme to solve the state system and the adjoint systems.The information is then passed to the stochastic descent algorithm, see Sect.4.2.
Each sequence of the preprocessing involved 25 time steps of length dt = 0.04, leading to T seq = 1.We use the same time step dt = 0.04 for the Euler method.The parameter of the stochastic descent algorithms is given in Sect.4.2.The values for the initial neural networks are a random sample that is chosen independently and uniformly distributed on [−1, 1] K .We have four input variables, a hidden layer with four neurons and two output variables representing the interaction force in x-and y-direction.
The initial values for the social force model are a random sample uniformly distributed in the interval [0, 50] 3 .As the parameters of the social force model are assumed to be non-negative, we set them to zero, if they became negative in some iteration of the gradient descent algorithm.

Numerical results
In the following, we discuss the results of the parameter identification that we computed with the help of the stochastic gradient descent methods derived in Sect.4.2.

Results for social force model
First, we discuss the numerical results obtained for the social force model.We begin with plots similar to [11], where we show the results for the interaction forces by four contour plots for the x-direction and for the y-direction.For each of the plots, we fix a vector v = (v 1 , v 2 ) which describes the difference of the velocity vectors of two interaction pedestrians.For example, a pedestrian with velocity vector (1, 0) interacting with a pedestrian with velocity vector (−1, 0) leads to a difference vector  The optimal parameters estimated are A = 0.0044, k = 34.9539,κ = 9.8894.It is interesting to note that A, the prefactor of the exponential term, is almost switched off by the optimization.In other words, only the k-and κ-term are active, this reduces the interaction to a very short range, as both terms are multiplied by the Heaviside function H (2r − d), compare to (9).We therefore restrict the area of the contour plot in Figs. 5  and 6 to [−0.5, 0.5] 2 .In fact, we even see in Figs. 5 and 6 that strong forces occur in even shorter ranges around zero.In addition to these plots, we illustrate interaction forces for six settings in Fig. 7. Every subplot shows two interacting pedestrians at the positions marked with a blue and a red dot, respectively.Moreover, the velocity vectors of the pedestrians are shown in blue and red as well.The black arrow is the force vector resulting from the interaction of the two.
The data of the different settings are given in Table 4. Setting S1 and S2 show a well-known problem of interaction schemes with radially symmetric forces.The interaction forces of two pedestrians encountering each other are aligned along the vector of the differences of the pedestrians positions.Therefore, there is no evasive behaviour visible in this study.Nevertheless, in combination with the relaxation term that drives the pedestrian towards their desired destination, we expect to have a reasonable model.This problem for radially symmetric interactions is well-known and reported for example in [27].Therein, an approach to solve this issue is presented as well.Here, we see in practice what Remark 7 is referring to.Although physically inspired models are well-motivated, they can fail to represent real-life behaviour.
For slightly shifted positions (see S3 and S4), we see evasion behaviour of the pedestrians.The pedestrians are slowing down and slightly moving to the left or right, respectively.The force directions are similar for the case where the velocity vectors point at each other and also when the velocity vectors are aligned.In the settings S5 and S6, we see how two pedestrians walking next to each other interact.In both settings, there is a strong repulsion that pushes the pedestrians away from each other.The walls have no influence on this behaviour, as their influence is only in very short range and the positions are centred in the domain.Altogether, the resulting forces are reasonable.They model some kind of evasive behaviour, which is expected as they become active only if the two interacting pedestrians are very close together.

Results for NN model
Let us now discuss the interactions based on the neural network model.We consider the same plots as for the social force model above and combine the forces resulting from the neural network modelling the interaction with the forces resulting from the walls.Figures 8 and 9 show contour plots of the forces resulting from the neural network approach when we sum up the interaction force and the wall force.Comparing these Figures to the ones resulting from the social force model, we see that the interaction strength is smaller, but the interaction range is longer for the NN model.We analysed the forces resulting from the interactions and the walls separately as well, it turns out that the forces to not represent pairwise interaction between pedestrians and walls, as we intended to model.
The observation that the interaction range is longer in the NN approach motivates to adapt the ranges of the interaction settings.For a similar study as before for the social force model, we use the values given in Table 5.The arrows in Fig. 10 show the sum of the interaction and wall forces, i.e. they correspond to Figs. 8 and 9.The force vectors in the first and second row coincide.This indicates that the NN model does not have the same problem as forces resulting from radially symmetric potentials.It is very interesting to see that the NN model easily reads the evasion behaviour of pedestrians from the data.It therefore outperforms the social force model in this point.Moreover, the plots indicate that the NN model reacts with evasive behaviour to pedestrians approaching in a longer range (row one and two in Fig. 10).In contrast, pedestrians that are close to each other encounter forces in similar directions.Especially, when they head into the same direction (row three in Fig. 10).We summarize the findings and comparison of the two models in the conclusion.

Comparison of the two approaches
The numerical results show that the optimized forces on the social model have a very short range and show expected evasive behaviour for two pedestrians facing each other.On the other hand, the optimized forces of the neural network approach are long ranging.The splitting into a neural network for pairwise interactions and wall interactions is overturned by the optimization.Interestingly, it turns out that a familiar evasive behaviour is recovered when we add the forces resulting from the neural network for pairwise interaction and the neural network modelling the walls.
The comparison of pairwise interactions show that in case of the social force model, the pedestrians encounter a strong force slowing them down, in particular, the component of force vector pointing in the opposite direction of the velocity vector is very strong.For the neural network, the force vector is almost perpendicular to the velocity vector.Hence, the pedestrians are changing direction but not slowing down as much as in the social force case.The cost functional values for the two approaches averaged over all samples of the data set are J SF = 5.0841, J NN = 5.5979.
We conclude that despite the qualitative differences the two models perform similar in terms of cost.The social force model with short-range interaction performs slightly better in this measure as the neural network approach with long-range interaction.
On the other hand, the neural network approach delivers forces which imply evasive behaviour and thus seem to be closer to reality than the ones of the social force model.5 for more details.The positions of the two interacting pedestrians are marked by the blue and red dot, respectively.Their velocity vectors are the blue and red arrows, and the resulting forces are depicted as black arrows.On the left side, the pedestrians face each other; on the right side, the pedestrians walk in the same direction

Conclusion
We proposed to use neural networks to model forces in general interacting particle systems and derived an algorithm to estimate parameters with techniques from optimal control.For validation, the algorithm is applied to a traffic data set and a pedestrian data set.The results are compared to the well-known LWR-model and social force model, respectively.For the traffic data, it turns out that the neural network approach leads to almost linear interaction forces that average the interaction forces resulting for the LWRmodel with linear and logarithmic velocity ansatz.The cost functional values are best for the neural network with 4 neurons in the hidden layer.
In case of the pedestrian dynamics, the interaction forces of the social force model and the neural network approach differ in the range of interaction and the strength.The optimized social force model has strong interaction forces that act on a very short range.In contrast, the forces resulting from the neural network approach act on a longer range with less strength.Parameter identification with social model performs slightly better than the one based on neural networks in terms of the cost functional value.It is interesting to see that the NN approach is able to read evasion behaviour of the pedestrians from the data.
Future work includes the investigation of parameter identification problems using neural networks for partial differential equations arising in the context of crowd motion and traffic flow.A performance comparison of the stochastic gradient descent method to global optimization methods, such as Consensus-based global Optimization [22], using real data based parameter calibration for interacting particle models is planned as well.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2
Fig. 2 The circles show the identified velocities of the leading car for the different methods (colour coded) and the different data sets.The lines show the mean velocities taken over all data sets for the different methods (Color figure online)

Fig. 5 Fig. 6
Fig. 5 First component of the interaction force for social force approach with optimized parameters

Fig. 7
Fig. 7 Interaction forces resulting from social force model for different settings, see Table4for more details.The positions of the two interacting pedestrians is marked by the blue and red dot, respectively.Their velocity vectors are the blue and red arrows, and the resulting forces are depicted as black arrows.On the left side, the pedestrians face each other; on the right side, the pedestrians walk in the same direction (Color figure online)

Fig. 8 Fig. 9 Fig. 10
Fig.8 First component of the interaction and wall force for NN approach with optimized parameters

Table 1
Car lengths (in m) estimated with the algorithm for the 10 data sets with the LWR-model with linear and logarithmic velocity

Table 2
Evaluations of the cost functional with the interaction forces identified by the algorithm for the 10 data sets.The least cost value for each column is highlighted