1 Introduction

A new class of numerical methods for global optimization based on particle dynamics has been introduced in some recent articles [12,13,14, 21,22,23, 44, 47]. These methods, referred to as consensus based optimization (CBO) methods because of the similarities between the particle dynamics of the minimizer and consensus dynamics in opinion formation, fall within the large class of metaheuristic methods [1, 6, 11, 26]. Among popular metaheuristic methods we recall the simplex heuristics [41], evolutionary programming [20], the Metropolis-Hastings sampling algorithm [29], genetic algorithms [31], particle swarm optimization (PSO) [36, 45], ant colony optimization (ACO) [19], simulated annealing (SA) [32, 37].

In contrast to classic metaheuristic methods, for which it is quite difficult to provide rigorous convergence to global minimizers (especially for those methods that combine instantaneous decisions with memory mechanisms), CBO methods, thanks to the instantaneous nature of the dynamics, make it possible to exploit mean-field techniques to prove global convergence for a large class of optimization problems [12, 14, 23, 24]. Despite their simplicity, CBO methods seem to be powerful and robust enough to tackle many interesting high dimensional non-convex optimization problems of interest in machine learning [14, 17, 23].

As shown in [14, 23], in practical applications the methods benefit from the use of small batches of interacting particles, since the global collective decision mechanism may otherwise cause the dynamics to be more easily trapped in local minima. For these CBO methods based on small batches, however, a robust mathematical theory is still missing. We also mention that, recently, a continuous description of PSO methods based on a system of stochastic differential equations was proposed in [28] and its connections with CBO methods were analyzed through the corresponding mean-field descriptions. Rigorous results concerning the mean-field limit of PSO methods and the corresponding CBO dynamics have been subsequently presented in [33]. We refer the reader to the recent surveys [27, 46] for a more complete overview.

Motivated by this, in the present paper we introduce a new class of kinetic theory based optimization (KBO) methods algorithmically solved by particle dynamics to address the following optimization problem

$$\begin{aligned} v^\star \in \mathrm{arg}\!\min \limits _{v\in \mathbb {R}^d}{\mathcal {E}}(v)\,, \end{aligned}$$
(1.1)

where \({\mathcal {E}}(v):{\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) is a given continuous cost function, which we wish to minimize. In the following, we will assume that the minimizing argument \(v^\star \) of (1.1) exists and is unique.

Both statistical estimation and machine learning consider the problem of minimizing an objective function in the form of a sum

$$\begin{aligned} {\mathcal {E}}(v)=\frac{1}{n}\sum _{i=1}^n {\mathcal {E}}_i(v), \end{aligned}$$
(1.2)

where each summand function \({\mathcal {E}}_i\) is typically associated with the i-th observation in the data set, for example used for training [10]. In statistics, such sum-minimization problems arise in least squares, in maximum likelihood estimation (for independent observations), and more generally in M-estimators [25]. The problem of sum minimization also arises in the minimization of the empirical risk in statistical learning [48]. In this case, \({\mathcal {E}}_i\) is the value of the loss function at the i-th example, and \({\mathcal {E}}\) is the empirical risk.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. First order methods, such as (stochastic) gradient descent methods, are preferred both because of speed and scalability and because they are considered generically able to escape the trap of critical points. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients and/or some of the functions may be noisy or discontinuous. Additionally, most gradient-based optimizers are not designed to handle multi-modal problems or discrete and mixed discrete-continuous design variables. Gradient-free methods, such as the metaheuristics approaches mentioned before, may therefore represent a valid alternative.

In contrast to previous CBO approaches, where the dynamics was of mean-field type, the new KBO methods are based on binary interactions between agents, which estimate the best position according to a combination of a local interaction and a global alignment process. Binary interactions are inspired by similar processes of social alignment in kinetic models for opinion formation, where agents modify their opinions according to a process of local compromise with other agents and the global influence of external media [2, 4, 5, 7, 8, 30, 43]. The corresponding dynamics is therefore described by a multidimensional Boltzmann equation that is solved by adapting the well-known direct simulation Monte Carlo methods [9, 40, 42] to the present case. We emphasize that the resulting schemes present some analogies with the recently introduced random batch methods in the case of small batches of size two [3, 35, 38].

In particular, we show that, in a suitable scaling derived from the quasi-invariant limit in opinion dynamics, the corresponding mean-field dynamics is governed by CBO methods. Notably, the resulting CBO methods generalize the classical CBO approach in [14, 44] by preserving memory of the microscopic interaction dynamics. As shown by the numerical experiments, an interesting aspect in this direction is that the kinetic optimization model is able to capture the global minimum even in the absence of the global alignment process used in the original CBO models, that is, with only a local alignment process in which information is shared between pairs of particles.

The rest of the paper is organized as follows. In the next section, we introduce the kinetic model and the corresponding Boltzmann equation. Section 3 is then devoted to analyzing the main properties of the kinetic model and to a suitable scaling limit which permits the derivation of the analogous mean-field optimizers of CBO type. Convergence to the global optimum for KBO methods is then studied in Sect. 4, where we demonstrate exponentially fast convergence to the minimum, with a constraint on the parameters independent of the dimension for binary interactions with anisotropic noise. Finally, in Sect. 5 we present several numerical experiments, including an application to a machine learning problem. Some concluding remarks are given at the end of the manuscript.

2 A Kinetic Model for Global Optimization

In analogy to some key concepts of metaheuristic optimization methods based on particle dynamics, in the following we introduce an optimization process based on binary interaction dynamics inspired by kinetic models in social sciences described by spatially homogeneous Boltzmann-type equations (see [43]). To this aim, let us denote by \(f(v,t) \ge 0\), \(v\in {\mathbb {R}}^d\), the distribution of particles in position v at time \(t \ge 0\). Note that, by analogy with the classical space homogeneous Boltzmann description, we keep the notation v. However, we choose to refer to this as 'position' in the search space instead of 'velocity', to employ a standard terminology in optimization algorithms. Without loss of generality we assume \(\int _{{\mathbb {R}}^d} f(v,t)\,dv=1\), so that f(v,t) is a probability density function.

2.1 The Binary Interaction Process

For a given pair of particles with positions \((v,v_*)\) we consider a binary interaction process generating the new positions \((v',v'_*)\) according to relations

$$\begin{aligned} \begin{aligned} v'&= v + \lambda _1(v_{\beta ,{\mathcal {E}}}(v,v_*)-v)+\lambda _2(v_{\alpha ,{\mathcal {E}}}(t)-v)+\sigma _1 D_1(v,v_*)\xi _1+\sigma _2 D_2(v)\xi _2 \\ v_*'&= v_* + \lambda _1(v_{\beta ,{\mathcal {E}}}(v_*,v)-v_*)+\lambda _2(v_{\alpha ,{\mathcal {E}}}(t)-v_*)+\sigma _1 D_1(v_*,v)\xi ^*_1+\sigma _2 D_2(v_*)\xi ^*_2 \end{aligned} \end{aligned}$$
(2.1)

where \(v_{\beta ,{\mathcal {E}}}(v,v_*)\), \(\beta > 0\), is the microscopic local estimate of the best position

$$\begin{aligned} v_{\beta ,{\mathcal {E}}}(v,v_*) = \frac{\omega _\beta ^{\mathcal {E}}(v) v + \omega _\beta ^{\mathcal {E}}(v_*)v_*}{\omega _\beta ^{\mathcal {E}}(v)+\omega _\beta ^{\mathcal {E}}(v_*)}\,, \qquad \omega _\beta ^{\mathcal {E}}(v):=e^{-\beta {\mathcal {E}}(v)}, \end{aligned}$$
(2.2)

and \(v_{\alpha ,{\mathcal {E}}}(t)\), \(\alpha > 0\), is the macroscopic global estimate of the best position

$$\begin{aligned} v_{\alpha ,{\mathcal {E}}}(t)=\frac{\int _{\mathbb R^{d}}v\omega _\alpha ^{\mathcal {E}}(v)f(v,t)\,dv}{\int _{\mathbb R^{d}}\omega _\alpha ^{\mathcal {E}}(v)f(v,t)\,dv}\,, \qquad \omega _\alpha ^{\mathcal {E}}(v):=e^{-\alpha {\mathcal {E}}(v)}\,. \end{aligned}$$
(2.3)

The choice of the weight function \(\omega _\alpha ^{\mathcal {E}}\) in (2.3) comes from the well-known Laplace principle [18, 39, 44], a classical asymptotic method for integrals, which states that for any probability density f(v,t) it holds

$$\begin{aligned} \lim \limits _{\alpha \rightarrow \infty }\left( -\frac{1}{\alpha }\log \left( \int _{{\mathbb {R}}^d}e^{-\alpha {\mathcal {E}}(v)}f(v,t)\,dv \right) \right) =\inf \limits _{v\,\in \, \mathrm{supp}\, f(v,t)} {\mathcal {E}}(v)\,. \end{aligned}$$
(2.4)
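
As a quick numerical illustration of (2.4), one can replace the integral by a Monte Carlo average over a sample of f and observe the convergence to the sample minimum as \(\alpha \) grows. The following minimal Python sketch (the quadratic cost and the Gaussian sample are purely illustrative assumptions) uses the standard shift trick to avoid underflow of the exponential weights.

```python
import numpy as np

rng = np.random.default_rng(0)
E = lambda v: np.sum(v**2, axis=-1)                      # illustrative cost with minimum 0 at the origin
v = rng.normal(loc=1.0, scale=0.5, size=(10_000, 2))     # Monte Carlo sample playing the role of f

for alpha in (1, 10, 100, 1000):
    e = E(v)
    shift = e.min()                                       # shift to avoid underflow of exp(-alpha*E)
    laplace = shift - np.log(np.mean(np.exp(-alpha * (e - shift)))) / alpha
    print(alpha, laplace, e.min())                        # the first value approaches the second as alpha grows
```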

Similarly, in (2.2) as \(\beta \rightarrow \infty \) the value \(v_{\beta ,{\mathcal {E}}}(v,v_*)\) concentrates on the particle in the best position, namely

$$\begin{aligned} \lim _{\beta \rightarrow \infty } v_{\beta ,{\mathcal {E}}}(v,v_*) = { \underset{w \in \{v,v_*\}}{{{\,\mathrm{argmin}\,}}}\,{\mathcal {E}}(w) \,,} \end{aligned}$$
(2.5)

if \({\mathcal {E}}(v) \ne {\mathcal {E}}(v_*)\). Note that \(v_{\beta ,{\mathcal {E}}}(v,v_*)\) depends on the interacting pair \((v,v_*)\), whereas \(v_{\alpha ,{\mathcal {E}}}(t)\) is the same for all particles. These quantities characterize two different dynamics: on the one hand, the particle pair aligns locally to \(v_{\beta ,{\mathcal {E}}}(v,v_*)\), in agreement with its weighted best position; on the other hand, it aligns globally to \(v_{\alpha ,{\mathcal {E}}}(t)\), according to the weighted best position among all particles.

In (2.1) the scalar values \(\lambda _k\ge 0\) and \(\sigma _k \ge 0\), \(k=1,2\), define, respectively, the strength of the relative alignment and diffusion processes, whereas the terms \(\xi _k, \xi ^*_k\in {\mathbb {R}}^d\), \(k=1,2\), are vectors of i.i.d. random variables (with arbitrary distribution) with zero mean and unit variance. Finally, \(D_k(\cdot ,\cdot )\), \(k=1,2\), denote \(d\times d\) diagonal matrices characterizing the stochastic exploration process. Isotropic exploration has been introduced in [44] and is defined by

$$\begin{aligned} D_1(v,v_*)={|}v_{\beta ,{\mathcal {E}}}(v,v_*)-v{|} I_d,\quad D_2(v)={|}v_{\alpha ,{\mathcal {E}}}(t)-v{|} I_d, \end{aligned}$$
(2.6)

with \(I_d\) denoting the d-dimensional identity matrix and \(|\cdot |\) the Euclidean norm, whereas in the anisotropic case, introduced in [14], we have

$$\begin{aligned} \begin{aligned} D_1(v,v_*)&={\mathrm{diag}}\left\{ (v_{\beta ,{\mathcal {E}}}(v,v_*)-v)_1,\ldots , (v_{\beta ,{\mathcal {E}}}(v,v_*)-v)_d\right\} ,\\ D_2(v)&={\mathrm{diag}}\left\{ (v_{\alpha ,{\mathcal {E}}}(t)-v)_1,\ldots ,(v_{\alpha ,{\mathcal {E}}}(t)-v)_d\right\} . \end{aligned} \end{aligned}$$
(2.7)
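
For concreteness, the binary rule (2.1), together with the best estimates (2.2)-(2.3) and the exploration matrices (2.6)-(2.7), can be sketched in a few lines of Python. The helper names, the Gaussian choice for the random vectors \(\xi \), and the representation of the diagonal matrices as vectors applied componentwise are illustrative assumptions, not part of the model.

```python
import numpy as np

def local_best(v, vs, E, beta):
    """Weighted best position v_{beta,E}(v, v_*) of an interacting pair, Eq. (2.2),
    computed with a shift of the exponents to avoid underflow for large beta."""
    e = np.array([E(v), E(vs)])
    w = np.exp(-beta * (e - e.min()))
    return (w[0] * v + w[1] * vs) / w.sum()

def global_best(V, E, alpha):
    """Weighted best position v_{alpha,E}(t) over the ensemble V (N x d), Eq. (2.3)."""
    e = np.apply_along_axis(E, 1, V)
    w = np.exp(-alpha * (e - e.min()))
    return (w[:, None] * V).sum(axis=0) / w.sum()

def binary_interaction(v, vs, v_alpha, E, lam1, lam2, sig1, sig2, beta,
                       anisotropic=True, rng=np.random.default_rng()):
    """One binary interaction (2.1); returns the post-interaction pair (v', v'_*).
    The diagonal matrices D_1, D_2 are stored as vectors and applied componentwise."""
    out = []
    for x, y in ((v, vs), (vs, v)):
        vb = local_best(x, y, E, beta)                    # local best estimate (2.2)
        d1 = (vb - x) if anisotropic else np.linalg.norm(vb - x) * np.ones_like(x)
        d2 = (v_alpha - x) if anisotropic else np.linalg.norm(v_alpha - x) * np.ones_like(x)
        out.append(x + lam1 * (vb - x) + lam2 * (v_alpha - x)
                     + sig1 * d1 * rng.standard_normal(x.shape)
                     + sig2 * d2 * rng.standard_normal(x.shape))
    return out[0], out[1]
```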

2.2 A Boltzmann Description

A fundamental aspect in the derivation of the corresponding evolution equation for the probability density of particles f(v,t) is the so-called Boltzmann collision term describing the instantaneous variations in the particle distribution. This term results exclusively from the binary interactions (2.1) between particles, which are assumed to be uncorrelated prior to the interaction. Under this assumption, known as molecular chaos, the collision term can be written as a multidimensional integral over the product of the one-particle distribution functions (see [15, 16] for further details).

Thus, formally, the particle distribution satisfies a Boltzmann-type equation, which can be conveniently written in weak form as

$$\begin{aligned} \frac{\partial }{\partial t} \int _{{\mathbb {R}}^d} f(v,t)\phi (v)\,dv= & {} \frac{1}{2}\left\langle \int _{{\mathbb {R}}^{2d}}\left( \phi (v')+\phi (v'_*)-\phi (v)-\phi (v_*)\right) f(v,t)f(v_*,t)\,dv\,dv_*\right\rangle \nonumber \\= & {} \left\langle \int _{{\mathbb {R}}^{2d}}\left( \phi (v')-\phi (v)\right) f(v,t) f(v_*,t)\,dv\,dv_*\right\rangle \end{aligned}$$
(2.8)

where \(\phi (v)\in {C}^\infty ({\mathbb {R}}^{d})\) is a smooth function, such that

$$\begin{aligned} \lim _{t\rightarrow 0}\int _{{\mathbb {R}}^d} \phi (v)f(v,t)\,dv = \int _{{\mathbb {R}}^d} \phi (v) f_0(v)\,dv \end{aligned}$$

with \(f_0(v)\) the initial density satisfying

$$\begin{aligned} \int _{{\mathbb {R}}^d} f_0(v)\,dv =1. \end{aligned}$$

In (2.8) we use the standard notation

$$\begin{aligned} \left\langle g(\xi )\right\rangle = \int _{{\mathbb {R}}^{4d}}g(\xi )p(\xi )\,d\xi , \end{aligned}$$
(2.9)

where we used the shortcut \(\xi =(\xi _1,\xi _2,\xi _1^*,\xi _2^*)\), to denote the mathematical expectation with respect to the i.i.d. random vectors \(\xi _k, \xi ^*_k\), \(k=1,2\), entering the definitions of \(v'\) and \(v'_*\) in (2.1). As a consequence \(p(\xi )=p_\xi (\xi _1)p_\xi (\xi _2)p_\xi (\xi ^*_1)p_\xi (\xi ^*_2)\), where \(p_\xi (\cdot )\) is the common probability density function of the random vectors.

The Boltzmann interaction term in (2.8) quantifies the variation in the probability density, at a given time, due to particles that modify their position from v to \(v'\) (r.h.s. with negative sign) and particles that change their position from \(v'\) to v (r.h.s. with positive sign). Here, the expectation \(\langle \cdot \rangle \) takes into account the presence of the random parameters in the microscopic interaction (2.1).

First of all, let us remark that from the binary dynamic (2.1) we get

$$\begin{aligned} \begin{aligned} \langle v'+v'_*\rangle&= (1-\lambda _1-\lambda _2) (v+v_*)+2\lambda _1 v_{\beta ,{\mathcal {E}}} + 2 \lambda _2 v_{\alpha ,{\mathcal {E}}}(t),\\ \langle v'-v'_*\rangle&= (1-\lambda _1-\lambda _2) (v-v_*). \end{aligned} \end{aligned}$$
(2.10)

The first equality describes the variation in the expected value of the particles positions. The second, under the assumption \(\lambda _1+\lambda _2 \le 1\), refers to the tendency of the interaction to decrease (in mean) the distance between positions after the interaction. This tendency is a universal consequence of the rule (2.1), in that it holds whatever distribution one assigns to \(\xi \), namely to the random variable which accounts for the exploration effects.

Before entering into a detailed analysis of the model, let us fix some notations. Throughout the paper, we will denote by m(t) and E(t) the first two moments of f(v,t)

$$\begin{aligned} {m(t) := \int _{{\mathbb {R}}^d} v \,f(v,t)\,dv \,, \quad E(t) := \int _{{\mathbb {R}}^d} |v|^2\, f(v,t)\,dv\, , } \end{aligned}$$
(2.11)

and the variance as

$$\begin{aligned} {V(t) := \frac{1}{2} \int _{{\mathbb {R}}^d} | v- m(t)|^2\, f(v,t)\,dv = \frac{1}{2} \left( E(t) - |m(t)|^2\right) \,.} \end{aligned}$$
(2.12)

Furthermore, we will assume \(\kappa \) to be a constant equal to the dimension d if the isotropic exploration (2.6) is considered, and equal to one when the anisotropic exploration (2.7) is employed.
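
Algorithmically, the Boltzmann dynamics (2.8) can be simulated with a Nanbu-type direct simulation Monte Carlo scheme: at every interaction round, the N particles are randomly grouped into disjoint pairs and each selected pair performs the binary interaction (2.1). The sketch below reuses the hypothetical binary_interaction and global_best helpers introduced above; the fraction p of interacting pairs per round and the freezing of \(v_{\alpha ,{\mathcal {E}}}\) within a round are discretization choices not fixed by the model.

```python
import numpy as np

def kbo_boltzmann_step(V, E, params, p=1.0, rng=np.random.default_rng()):
    """One Monte Carlo interaction round for the ensemble V (N x d):
    a fraction p of randomly formed disjoint pairs applies the binary rule (2.1)."""
    N = V.shape[0]
    v_alpha = global_best(V, E, params["alpha"])      # macroscopic best, kept frozen in the round
    idx = rng.permutation(N)
    for k in range(int(p * (N // 2))):
        i, j = idx[2 * k], idx[2 * k + 1]
        V[i], V[j] = binary_interaction(V[i], V[j], v_alpha, E,
                                        params["lambda1"], params["lambda2"],
                                        params["sigma1"], params["sigma2"],
                                        params["beta"], rng=rng)
    return V
```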

3 Main Properties and Mean-Field Limit

3.1 The Case with Only the Microscopic Best Estimate

Let us first consider the case where in the binary interaction rules (2.1) we assume \(\lambda _2=0\) and \(\sigma _2=0\). This case is particularly interesting since the dynamics is fully microscopic and therefore convergence to the global minimum will emerge from a sequence of binary interactions which are not influenced by any macroscopic information concerning the global minimum.

The binary interactions can be rewritten as

$$\begin{aligned} \begin{aligned} v'&= v + \lambda \gamma ^{\mathcal {E}}_\beta (v,v_*)(v_*-v)+\sigma D(v,v_*)\xi _1 \\ v_*'&= v_* + \lambda \gamma ^{\mathcal {E}}_\beta (v_*,v)(v-v_*)+\sigma D(v_*,v)\xi ^*_1 \end{aligned} \end{aligned}$$
(3.1)

where, for notational simplicity, we have set \(\lambda =\lambda _1\), \(\sigma =\sigma _1\), \(D(v,v_*)=D_1(v,v_*)\) and

$$\begin{aligned} \gamma ^{\mathcal {E}}_\beta (v,v_*) = \frac{\omega _\beta ^{\mathcal {E}}(v_*)}{\omega _\beta ^{\mathcal {E}}(v)+\omega _\beta ^{\mathcal {E}}(v_*)}\,. \end{aligned}$$

Note that \(\gamma ^{\mathcal {E}}_\beta (v,v_*)+\gamma ^{\mathcal {E}}_\beta (v_*,v)=1\) and, since \(\gamma ^{\mathcal {E}}_\beta (v,v_*)\in (0,1)\), for \(\lambda \le 1\) the expected support of the positions is decreasing

$$\begin{aligned} |\langle v' \rangle | \le (1-\lambda \gamma ^{\mathcal {E}}_\beta (v,v_*)) |v| + \lambda \gamma ^{\mathcal {E}}_\beta (v,v_*)|v_*| < \max \left\{ |v|,|v_*|\right\} . \end{aligned}$$
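
In a particle implementation the weights \(\omega _\beta ^{\mathcal {E}}\) underflow in floating point for large \(\beta \); it is therefore convenient to evaluate \(\gamma ^{\mathcal {E}}_\beta \) directly as a logistic function of \(\beta ({\mathcal {E}}(v)-{\mathcal {E}}(v_*))\). Below is a minimal sketch of the rewritten interaction (3.1) with anisotropic exploration; the helper names and the Gaussian noise are illustrative assumptions.

```python
import numpy as np

def gamma_beta(v, vs, E, beta):
    """gamma_beta^E(v, v_*) = omega_beta(v_*)/(omega_beta(v) + omega_beta(v_*)),
    evaluated as a logistic function via logaddexp to remain stable for large beta."""
    return np.exp(-np.logaddexp(0.0, -beta * (E(v) - E(vs))))

def micro_interaction(v, vs, E, lam, sig, beta, rng=np.random.default_rng()):
    """One binary interaction (3.1) with anisotropic exploration and Gaussian noise."""
    g, gs = gamma_beta(v, vs, E, beta), gamma_beta(vs, v, E, beta)   # g + gs = 1
    v_new  = v  + lam * g  * (vs - v) + sig * g  * (vs - v) * rng.standard_normal(v.shape)
    vs_new = vs + lam * gs * (v - vs) + sig * gs * (v - vs) * rng.standard_normal(vs.shape)
    return v_new, vs_new
```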

Consider now the time evolution of the expected position m(t). From the weak formulation (2.8) with \(\phi (v)=v\) we have

$$\begin{aligned} \begin{aligned} {\frac{d m(t)}{dt}}&= {\left\langle \int _{{\mathbb {R}}^{2d}} ( v' - v) f(v,t)f(v_*,t)\,dv_*\,dv \right\rangle }\\&= \lambda \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)(v_*-v)f(v,t)f(v_*,t)\,dv_* dv\\&= 2\lambda \int _{{\mathbb {R}}^{2d}}\gamma ^{\mathcal {E}}_\beta (v,v_*)f(v,t)f(v_*,t)v_*\,dv_*\,dv-\lambda m(t), \end{aligned} \end{aligned}$$
(3.2)

where we made use of the fact that \(\gamma _\beta ^{\mathcal {E}}(v, v_*) + \gamma _\beta ^{\mathcal {E}}(v_*,v) = 1\), from which follows

$$\begin{aligned} m(t)= & {} \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)\,v_*\,f(v,t)f(v_*,t)\,dv_* dv \\&+ \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v_*,v)\,v_*\,f(v,t)f(v_*,t)\,dv_* dv\,. \end{aligned}$$

It is easy to verify that the above equation admits as steady state any Dirac delta distribution of the form \(f^\infty (v)=\delta (v-\bar{v})\), since \(\gamma ^{\mathcal {E}}_\beta (\bar{v},\bar{v})=1/2\), \(\forall \,\, \bar{v}\in {\mathbb {R}}^d\). In general, any symmetric function \(\gamma ^{\mathcal {E}}_\beta (v,v_*) {=\gamma ^{\mathcal {E}}_\beta (v_*,v)}\) would preserve the average position; it is therefore the asymmetric behavior of this function, based on the choice of the best value in the binary interaction, that will asymptotically lead the system to the global minimum. Note that Eq. (3.2) is not closed.

In order to analyze the large time behavior of f(v,t), we introduce the following boundedness assumption on \({\mathcal {E}}(v)\).

Assumption 3.1

Let us assume that \({\mathcal {E}}\) is positive and bounded, that is, for all \(w \in {\mathbb {R}}^d\)

$$\begin{aligned} {\underline{{\mathcal {E}}}} := \inf _{v\in {\mathbb {R}}^d}{\mathcal {E}}(v) \le {\mathcal {E}}(w) {\le } \sup _{v \in {\mathbb {R}}^d}{\mathcal {E}}(v)=: {\overline{{\mathcal {E}}}} \,. \end{aligned}$$

Under this assumption, it is possible to show that, when the alignment and exploration strengths satisfy suitable conditions, the particle system concentrates as it evolves.

Proposition 3.1

Let f(v,t) be a weak solution of Eq. (2.8) with initial data \(f_0\) and binary interaction described by the system (3.1). If \({\mathcal {E}}\) satisfies Assumption 3.1 and \(\beta \) is sufficiently large, it holds

$$\begin{aligned} \frac{d V(t)}{dt} \le - \left( \frac{\lambda }{C_{\beta ,{\mathcal {E}}}} - \lambda ^2 - \sigma ^2 \kappa \right) V(t) \,, \end{aligned}$$
(3.3)

for all \(t>0\), where \(C_{\beta ,{\mathcal {E}}} := e^{\beta ({\overline{{\mathcal {E}}}} -\underline{{\mathcal {E}}})}\).

We start the proof by presenting an auxiliary result.

Lemma 3.1

If \(\beta \) is sufficiently large, it holds

$$\begin{aligned} \left( \gamma _\beta ^{\mathcal {E}}(v,v_*) \right) ^2 \le {\left( 1- \frac{1}{C_{\beta , {\mathcal {E}}}} \right) } \gamma _{2\beta }^{\mathcal {E}}\,{(v,v_*)} \,, \end{aligned}$$
(3.4)

where \(C_{\beta ,{\mathcal {E}}} := e^{\beta ({\overline{{\mathcal {E}}}} -{\underline{{\mathcal {E}}}})}\).

Proof

We start by rewriting \(( \gamma _\beta ^{\mathcal {E}}(v,v_*))^2 \) as

$$\begin{aligned} \begin{aligned} \left( \gamma _\beta ^{\mathcal {E}}(v,v_*) \right) ^2&= \frac{e^{-2\beta {\mathcal {E}}(v_*)}}{\left( e^{-\beta {\mathcal {E}}(v)} + e^{-\beta {\mathcal {E}}(v_*)}\right) ^2} = \frac{e^{-2\beta {\mathcal {E}}(v_*)}}{e^{-2\beta {\mathcal {E}}(v)} + e^{-2\beta {\mathcal {E}}(v_*)}} \frac{e^{-2\beta {\mathcal {E}}(v)} + e^{-2\beta {\mathcal {E}}(v_*)}}{\left( e^{-\beta {\mathcal {E}}(v)} + e^{-\beta {\mathcal {E}}(v_*)}\right) ^2}\\&= \;\gamma _{2\beta }^{\mathcal {E}}{(v,v_*)} \frac{e^{-2\beta {\mathcal {E}}(v)} + e^{-2\beta {\mathcal {E}}(v_*)}}{\left( e^{-\beta {\mathcal {E}}(v)} + e^{-\beta {\mathcal {E}}(v_*)}\right) ^2} {=: \gamma _{2\beta }^{\mathcal {E}}(v,v_*)\zeta _\beta ^{\mathcal {E}}(v,v_*) } \end{aligned} \end{aligned}$$

and further rewrite \(\zeta _\beta ^{\mathcal {E}}(v,v_*) \) as

$$\begin{aligned} \begin{aligned} {\zeta _\beta ^{\mathcal {E}}(v,v_*) }&= \frac{e^{-2\beta {\mathcal {E}}(v)} + e^{-2\beta {\mathcal {E}}(v_*)}}{\left( e^{-\beta {\mathcal {E}}(v)} + e^{-\beta {\mathcal {E}}(v_*)}\right) ^2} = \frac{e^{-2\beta {\mathcal {E}}(v_*)}\left( 1+ e^{-2\beta ( {\mathcal {E}}(v) - {\mathcal {E}}(v_*))} \right) }{e^{-2\beta {\mathcal {E}}(v_*)}\left( 1+ e^{-\beta ( {\mathcal {E}}(v) - {\mathcal {E}}(v_*)) } \right) ^2} \\&{=\; \frac{1+ e^{-2\beta ( {\mathcal {E}}(v) - {\mathcal {E}}(v_*))}}{\left( 1+ e^{-\beta ( {\mathcal {E}}(v) - {\mathcal {E}}(v_*)) } \right) ^2}\,.} \end{aligned} \end{aligned}$$

One can verify that \(\zeta _\beta ^{\mathcal {E}}(v,v_*)\) attains its maximum value when the difference \(|{\mathcal {E}}(v) - {\mathcal {E}}(v_*)|\) is maximized, from which follows

$$\begin{aligned} \zeta _\beta ^{\mathcal {E}}(v,v_*)\, {\le }\, \frac{1 + e^{-2\beta ({\overline{{\mathcal {E}}}} - {\underline{{\mathcal {E}}}})}}{\left( 1+e^{-\beta ({\overline{{\mathcal {E}}}} - \underline{{\mathcal {E}}})} \right) ^2} {\,= \frac{1 + C^2}{(1+C)^2}\,,} \end{aligned}$$

where we denoted for simplicity \((C_{\beta ,{\mathcal {E}}})^{-1}=:C\). We note that \(C \rightarrow 0\) as \(\beta \rightarrow \infty \). Finally, as \(\beta \rightarrow \infty \)

$$\begin{aligned} \begin{aligned} 1- \zeta _\beta ^{\mathcal {E}}(v,v_*) - (C_{\beta ,{\mathcal {E}}})^{-1}&= 1- \frac{1 + C^2}{(1+C)^2} - C = \frac{C + o(C) }{(1+C)^2} \ge 0, \end{aligned} \end{aligned}$$

if \(\beta \) is sufficiently large. This proves the assertion. \(\square \)

Proof of Proposition 3.1

From the definition of E(t), and the weak formulation (2.8), we can compute

$$\begin{aligned} \begin{aligned} \frac{d E(t)}{dt} =&\left\langle \int _{{\mathbb {R}}^{2d}}\left( {{|}v'{|}}^2-{|}v{|}^2\right) f(v,t)f(v_*,t)\,dv\,dv_*\right\rangle \\ =&\lambda ^2 \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)^2{|}v_*-v{|}^2 f(v,t)f(v_*,t) \, dv\, dv_*\\&+2\lambda \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*) v{\cdot } (v_*-v) f(v,t)f(v_*,t) \, dv\, dv_*\\&+ \sigma ^2 \sum _{i=1}^d \int _{{\mathbb {R}}^{2d}} D_{ii}(v,v_*)^2 f(v,t)f(v_*,t) \, dv\, dv_*\,, \end{aligned} \end{aligned}$$
(3.5)

where \(D_{ii}\) denotes the i-th diagonal element of the matrix D. From

$$\begin{aligned} \frac{d}{dt}V(t) = \frac{1}{2} \frac{d}{dt} {E}(t) - m(t)\cdot \frac{d}{dt} m(t)\end{aligned}$$

and the moment derivative (3.2), we recover

$$\begin{aligned} \begin{aligned} \frac{d V(t)}{dt} =&\frac{1}{2}\left\langle \int _{{\mathbb {R}}^{2d}}\left( {|}v'{|}^2-{|}v{|}^2\right) f(v,t)f(v_*,t)\,dv\,dv_*\right\rangle {\,-\, m(t)\cdot \frac{d}{dt}m(t)} \\ =&\frac{\lambda ^2}{2} \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)^2 {|} v_*-v {|}^2 f(v,t)f(v_*,t) \, dv\, dv_*\\&+\lambda \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*) (v- m(t)) {\cdot } (v_*-v) f(v,t)f(v_*,t) \, dv\, dv_*\\&+\frac{\sigma ^2}{2} \sum _{i=1}^d \int _{{\mathbb {R}}^{2d}} D_{ii}(v,v_*)^2 f(v,t)f(v_*,t) \, dv\, dv_* =: I_1 + I_2 + I_3. \end{aligned} \end{aligned}$$
(3.6)

Thanks to the relation \(\gamma _\beta ^{\mathcal {E}}(v,v_*) + \gamma _\beta ^{\mathcal {E}}(v_*,v) = 1\), we note that for any symmetric function \(\psi (v, v_*)=\psi (v_*,v)\) it holds

$$\begin{aligned}&\int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*) \psi (v,v_*) f(v,t)f(v_*,t) \, dv\, dv_*\nonumber \\&\quad = \frac{1}{2} \int _{{\mathbb {R}}^{2d}} \psi (v,v_*) f(v,t)f(v_*,t) \, dv\, dv_*\,. \end{aligned}$$
(3.7)

It follows that \(I_1\) and \(I_3\) can be bounded as

$$\begin{aligned} I_1&\le \frac{\lambda ^2}{2} \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*) {|}v_*-v{|}^2 f(v,t)f(v_*,t) \, dv\, dv_* = \lambda ^2 V(t) \end{aligned}$$
(3.8)
$$\begin{aligned} I_3&\le \frac{\sigma ^2}{2} \kappa \int _{{\mathbb {R}}^{2d}}{ \gamma ^{\mathcal {E}}_\beta (v,v_*)} |v_* - v|^2 f(v,t)f(v_*,t) \, dv\, dv_* {=} \sigma ^2\kappa V(t)\,, \end{aligned}$$
(3.9)

where we recall that \(\kappa =d\) in the isotropic case (2.6), and \(\kappa =1\) in the anisotropic case (2.7). By means of Young's inequality we compute

$$\begin{aligned} \begin{aligned} I_2=&\lambda \int _{{\mathbb {R}}^{2d}} \gamma ^{\mathcal {E}}_\beta (v,v_*) (v- m(t)) {\cdot } (v_*-v) f(v,t)f(v_*,t) \, dv\, dv_*\\ \le&- \lambda \int _{{\mathbb {R}}^{2d}} \gamma _\beta ^{\mathcal {E}}(v,v_*) |v-v_*|^2 f(v,t) f(v_*,t) dv dv_* \\&+ \frac{\lambda }{2} \int _{{\mathbb {R}}^{2d}} |v_*-m(t)|^2 f(v,t) f(v_*,t) \,dv\, dv_* \\&+\frac{\lambda }{2} \int _{{\mathbb {R}}^{2d}} \left( \gamma _{\beta }^{\mathcal {E}}(v,v_*)\right) ^2 |v-v_*|^2 f(v,t) f(v_*,t) \,dv \,dv_*\,. \end{aligned} \end{aligned}$$
(3.10)

By applying Lemma 3.1 one can bound the last term as

$$\begin{aligned}&\int _{{\mathbb {R}}^{2d}} \left( \gamma _{\beta }^{\mathcal {E}}(v,v_*)\right) ^2 |v-v_*|^2 f(v,t) f(v_*,t) dv dv_* \\&\quad \le {\left( 1- \frac{1}{C_{\beta , {\mathcal {E}}}}\right) }\int _{{\mathbb {R}}^{2d}} { \gamma _{2\beta }^{\mathcal {E}}(v,v_*)} |v-v_*|^2 f(v,t) f(v_*,t) dv dv_*\,. \end{aligned}$$

Finally, we use again relation (3.7) to obtain

$$\begin{aligned} I_2 \le -2 \lambda V(t) +\lambda V(t) +\lambda {\left( 1- \frac{1}{C_{\beta , {\mathcal {E}}}}\right) } V(t) = -\frac{\lambda }{ C_{\beta ,{\mathcal {E}}}} V(t) \end{aligned}$$

and hence, together with (3.8) and (3.9), we get (3.3). \(\square \)

Corollary 3.1

Under the assumptions of Proposition 3.1, if \(\lambda \) and \(\sigma \) satisfy the condition

$$\begin{aligned} \frac{\lambda }{C_{\beta ,{\mathcal {E}}}} - \lambda ^2 - \sigma ^2 \kappa >0 \end{aligned}$$
(3.11)

then there exists \({\tilde{v}} \in {\mathbb {R}}^d\) such that \(m(t) \rightarrow {\tilde{v}}\), \(V(t) \rightarrow 0\) as \(t \rightarrow \infty .\)

Proof

By applying Grönwall’s inequality to Eq. (3.3), we obtain the decay estimate

$$\begin{aligned} V(t) \le V(0) e^{-\mu t} \quad \text {with}\quad \mu :=\frac{\lambda }{C_{\beta ,{\mathcal {E}}}} - \lambda ^2 - \sigma ^2 \kappa >0\,, \end{aligned}$$
(3.12)

which implies \(V(t) \rightarrow 0\) as \(t \rightarrow \infty \). From the weak formulation (2.8),

$$\begin{aligned} \begin{aligned} \left| \frac{d m(t)}{dt}\right|&= \left| \lambda \int _{{\mathbb {R}}^{2d}} \gamma _\beta ^{\mathcal {E}}(v,v_*) (v_* - v) f(v,t)f(v_*,t)\,dv\,dv_* \right| \\&\le \lambda \int _{{\mathbb {R}}^{2d}} | v_* - v| f(v,t)f(v_*,t)\,dv\,dv_* \\&\le \lambda \left( \int _{{\mathbb {R}}^{2d}} | v_* - v|^2 f(v,t)f(v_*,t)\,dv\,dv_*\right) ^{\frac{1}{2}} \le 2\lambda \sqrt{V(t)} \le 2\lambda \sqrt{V(0)}e^{-\frac{1}{2}\mu t}\,, \end{aligned} \end{aligned}$$

where we used Jensen’s inequality to have an estimate in terms of the variance. The above proves that \(dm(t)/dt \in L^1(0, \infty )\) and, hence, that there exists a point \({\tilde{v}}\in {\mathbb {R}}^d\) such that

$$\begin{aligned} {\tilde{v}} = m(0) + \int _0^{\infty } \frac{d m(t)}{dt} \, dt = \lim _{t\rightarrow \infty } m(t)\,. \end{aligned}$$
(3.13)

\(\square \)

Remark 3.1

Clearly, the asymptotic value \({\tilde{v}}\) is in general not known. We will discuss in Sect. 4 appropriate conditions under which \({\mathcal {E}}({\tilde{v}})\) can be considered a good approximation of \(\inf _{v\in {\mathbb {R}}^d} {\mathcal {E}}(v)\). It should be noted that condition (3.11) becomes rather restrictive for large values of \(\beta \). However, in the mean-field scaling such a condition becomes less stringent, as observed in Remark 3.3. Additionally, when both processes for localizing the minimum, microscopic best and macroscopic best, are activated simultaneously, the convergence conditions are much less stringent and correspond to those of the macroscopic best dynamics, as shown at the end of Sect. 4 (see Theorem 4.3). From a physical point of view, this reflects the tendency of the binary dynamics based on the microscopic best to favor exploration over concentration when compared to the corresponding binary dynamics based on the macroscopic best.
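
To get a quantitative feeling for how restrictive (3.11) is, one can evaluate \(\mu = \lambda /C_{\beta ,{\mathcal {E}}} - \lambda ^2 - \sigma ^2\kappa \) for a few values of \(\beta ({\overline{{\mathcal {E}}}}-{\underline{{\mathcal {E}}}})\); the parameter values in the short check below are purely illustrative.

```python
import numpy as np

lam, sig, kappa = 0.1, 0.05, 1.0                    # illustrative parameters, anisotropic case
for beta_gap in (1.0, 5.0, 10.0):                   # beta * (sup E - inf E)
    mu = lam * np.exp(-beta_gap) - lam**2 - sig**2 * kappa
    print(beta_gap, mu, mu > 0)                     # condition (3.11) fails already for moderate gaps
```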

3.2 The Case with Only the Macroscopic Best Estimate

The case where the macroscopic best estimate contributes alone to the particle search dynamics can be analyzed following the same methodology of the previous section.

The binary interactions now read

$$\begin{aligned} \begin{aligned} \quad v'&= v + \lambda ({v_{\alpha ,{\mathcal {E}}}(t)}-v)+\sigma {D(v)\xi _2} \\ \quad v_*'&= v_* + \lambda ({v_{\alpha ,{\mathcal {E}}}(t)}-v_*)+\sigma {D(v_*)\xi ^*_2} \end{aligned} \end{aligned}$$
(3.14)

where we have set \(\lambda =\lambda _2\), \(\sigma =\sigma _2\), \({D(v)=D_2(v)}\) and \(\lambda _1 = \sigma _1 = 0\).

Again, the expected position is not conserved by the dynamics

$$\begin{aligned} {\frac{d m(t)}{dt}} = \lambda ({v_{\alpha ,{\mathcal {E}}}(t)}-m(t)), \end{aligned}$$
(3.15)

and describes a relaxation towards the estimated global minimum \({v_{\alpha ,{\mathcal {E}}}(t)}\).
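
In this regime the pairing of particles plays no role, since each particle is updated independently of its partner; one round of (3.14) can therefore be vectorized over the whole ensemble. A minimal sketch follows, where the function name, the Gaussian noise and the shifted softmax weights are illustrative assumptions.

```python
import numpy as np

def macro_step(V, E, lam, sig, alpha, anisotropic=True, rng=np.random.default_rng()):
    """One update of the whole ensemble V (N x d) with the macro-only rule (3.14)."""
    e = np.apply_along_axis(E, 1, V)
    w = np.exp(-alpha * (e - e.min()))                    # shifted weights for numerical stability
    v_alpha = (w[:, None] * V).sum(axis=0) / w.sum()      # global best estimate (2.3)
    drift = v_alpha - V                                   # broadcast over the ensemble
    if anisotropic:
        noise = drift * rng.standard_normal(V.shape)      # D_2(v) xi, Eq. (2.7)
    else:
        noise = np.linalg.norm(drift, axis=1, keepdims=True) * rng.standard_normal(V.shape)  # Eq. (2.6)
    return V + lam * drift + sig * noise
```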

As in the case with only microscopic interaction, we can derive an upper bound for the variance derivative.

Proposition 3.2

Let \({\mathcal {E}}\) satisfy Assumption 3.1 and let f(v,t) be a weak solution of the Boltzmann equation (2.8) where the binary interaction is described by (3.14). For all \(\alpha >0\) and \(t>0\),

$$\begin{aligned} \frac{d V(t)}{dt} \le - \left( 2 \lambda - 2\frac{e^{-\alpha \underline{{\mathcal {E}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}} (\lambda ^2 + \kappa \sigma ^2) \right) V(t)\,. \end{aligned}$$
(3.16)

Proof

We start by noting that, according to (3.14),

$$\begin{aligned} \left\langle |v'|^2 \right\rangle= & {} \left\langle | v + \lambda ({v_{\alpha ,{\mathcal {E}}}}(t)-v)+\sigma D(v)\xi _2 |^2\right\rangle \nonumber \\= & {} |v|^2 + \lambda ^2 |{v_{\alpha ,{\mathcal {E}}}}(t)-v|^2+2 \lambda \, v \cdot ({v_{\alpha ,{\mathcal {E}}}}(t)-v) + \sigma ^2 \sum _{i=1}^d D_{ii}(v)^{2},\qquad \qquad \end{aligned}$$
(3.17)

where we used that \(\langle \xi _2 \rangle =0\) and that the components of \(\xi _2\) are uncorrelated with unit variance. As before, we compute

$$\begin{aligned} \frac{dE(t)}{dt}= & {} \left\langle \int _{{\mathbb {R}}^{2d}} (| v'|^2 - |v|^2) f(v,t) f(v_*,t)\,dv\, dv_* \right\rangle \nonumber \\= & {} \int _{{\mathbb {R}}^{2d}}\left( \lambda ^2|{v_{\alpha ,{\mathcal {E}}}}(t)-v|^2 +2\lambda v \cdot ({v_{\alpha ,{\mathcal {E}}}}(t)-v) \right. \nonumber \\&\left. + \sigma ^2 \sum _{i=1}^d D_{ii}(v)^2 \right) f (v,t)f(v_*,t) \, dv\, dv_*\,, \end{aligned}$$
(3.18)

and the variance time evolution

$$\begin{aligned} \frac{d V(t)}{dt}= & {} \frac{\lambda ^2}{2}\int _{{\mathbb {R}}^{2d}} {|v_{\alpha ,{\mathcal {E}}}}(t)-v{|}^2 f(v,t)f(v_*,t) \, dv\, dv_* \nonumber \\&+\lambda \int _{{\mathbb {R}}^{2d}} (v-m(t)){\cdot }({v_{\alpha ,{\mathcal {E}}}}(t)-v) f(v,t)f(v_*,t) \, dv\, dv_* \nonumber \\&+ \frac{\sigma ^2}{2} \sum _{i=1}^d \int _{{\mathbb {R}}^{2d}} D_{ii}{(v)}^2 f(v,t)f(v_*,t) \, dv\, dv_*\,. \end{aligned}$$
(3.19)

Thanks to the identity

$$\begin{aligned} \begin{aligned}&\int _{{\mathbb {R}}^{d}}(v-m(t)) {\cdot } ({v_{\alpha ,{\mathcal {E}}}}(t)-v) f(v,t)\,dv \\&\quad = { \int _{{\mathbb {R}}^d} ( v \cdot {v_{\alpha ,{\mathcal {E}}}}(t) - m(t) \cdot v_{\alpha ,{\mathcal {E}}}(t) - |v|^2 + v \cdot m(t) ) f(v,t)\,dv } \\&\quad = \int _{{\mathbb {R}}^{d}} ( - |v|^2 + {|m(t)|^2}) f(v,t)\,dv, \end{aligned} \end{aligned}$$

we note that the second term of (3.19) is equal to \(-2\lambda V(t)\).

We recall that, from (2.3), \(v_{\alpha ,{\mathcal {E}}}(t)\) is defined as

$$\begin{aligned} v_{\alpha ,{\mathcal {E}}}(t) = \frac{\int _{\mathbb R^{d}}v\omega _\alpha ^{\mathcal {E}}(v)f(v,t)\,dv}{\int _{\mathbb R^{d}}\omega _\alpha ^{\mathcal {E}}(v)f(v,t)\,dv} = \int _{{\mathbb {R}}^{d}}v\, \frac{e^{-\alpha {\mathcal {E}}(v)}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}}f(v,t)\,dv.\end{aligned}$$

The remaining terms in (3.19) can then be estimated by pointing out that

$$\begin{aligned} \begin{aligned} \int _{{\mathbb {R}}^d}|v_{\alpha ,{\mathcal {E}}}{(t)} - v|^2 f(v,t) dv&{\le \int _{{\mathbb {R}}^{2d}} |v-w|^2 \frac{e^{-\alpha {\mathcal {E}}(w)}}{{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}}}f(v,t) f(w,t)dvdw}\\&\le {2\frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}}}\int _{{\mathbb {R}}^d} |v-m(t)|^2 f(v,t) dv \end{aligned} \end{aligned}$$
(3.20)

thanks to Jensen’s inequality. Lastly, we obtain the desired upper bound

$$\begin{aligned} \begin{aligned} \frac{dV(t)}{dt} \le&{\lambda ^2 \frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}}} \int _{{\mathbb {R}}^d} |v-m(t)|^2 f(v,t) dv - \lambda \int _{{\mathbb {R}}^d} |v-m(t)|^2 f(v,t) dv\\&+ {\sigma ^2} \kappa {\frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}}} \int _{{\mathbb {R}}^d} |v-m(t)|^{{2}}f(v,t) dv \\ \le&-\left( 2\lambda - {2\frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}} (\lambda ^2 + \kappa \sigma ^2)}\right) V(t)\,. \end{aligned} \end{aligned}$$
(3.21)

\(\square \)

Remark 3.2

We note that, by applying \(\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))} \ge e^{-\alpha {\overline{{\mathcal {E}}}}}\) to (3.16) one gets an analogous condition, as in Corollary 3.1 with \(C_{\beta , {\mathcal {E}}}\) replaced by \(C_{\alpha , {\mathcal {E}}}:= e^{\alpha ({\overline{{\mathcal {E}}}} - {\underline{{\mathcal {E}}}})}\), under which the solution f concentrates around a point \({\tilde{v}} \in {\mathbb {R}}^d\). However, as we will see in Sect. 4, taking into account the time evolution of \(\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}\) a weaker condition can be obtained, which avoids the limitations induced by large values of \(\alpha \).

3.3 The Mean-Field Scaling Limit

Let us consider, for the sake of notational simplicity, the case with only the microscopic binary estimate. We introduce the following scaling

$$\begin{aligned} t \rightarrow \frac{t}{\varepsilon },\quad \lambda \rightarrow \lambda \varepsilon ,\; \sigma \rightarrow \sigma \sqrt{\varepsilon }. \end{aligned}$$
(3.22)

The scaling (3.22) allows one to recover in the limit the contributions due both to alignment and to random exploration by diffusion. Other scaling limits can be considered, which are diffusion dominated or alignment dominated. As we shall see, the derivation of mean-field CBO models is possible only under this choice of scaling.

To illustrate this, let us consider the decay of the variance which is given by (3.3). If we now rescale time as \(t \rightarrow {t}/{\varepsilon }\) we get

$$\begin{aligned} \frac{dV(t)}{dt} \le - \frac{1}{\varepsilon }\left( \frac{\lambda }{C_{\beta ,{\mathcal {E}}}}-{\lambda ^2}-{\sigma ^2} \kappa \right) V(t). \end{aligned}$$
(3.23)

Letting now \(\varepsilon \rightarrow 0\), in order to preserve the behavior of the variance together with both the alignment and the diffusion dynamics we need to assume that both \(\lambda \) and \(\sigma ^2\) are \(O(\varepsilon )\). This argument shows that the choice of the scaling (3.22) is of paramount importance to get mean-field asymptotics which maintain memory of the microscopic interactions and of the concentration effects.

In the remainder of this section, we shall present the formal derivation of the mean-field limit, starting from the weak form of the Boltzmann equation (2.8) under the scaling (3.22), which leads to the microscopic binary interactions

$$\begin{aligned} \begin{aligned} v'&= v + \varepsilon \lambda \gamma ^{\mathcal {E}}_\beta (v,v_*)(v_*-v)+\sqrt{\varepsilon }\sigma D(v,v_*)\xi _1 \\ v_*'&= v_* + \varepsilon \lambda \gamma ^{\mathcal {E}}_\beta (v_*,v)(v-v_*)+\sqrt{\varepsilon }\sigma D(v_*,v)\xi ^*_1. \end{aligned} \end{aligned}$$
(3.24)

For small values of \(\varepsilon >0\) we have \(v' \approx v\) and we can consider the multidimensional Taylor expansion

$$\begin{aligned} \phi (v')=\phi (v)+(v'-v)\cdot \nabla _v \phi (v) + \sum _{|\eta |=2} (v'-v)^\eta \frac{\partial ^\eta \phi (v)}{\eta !}+\sum _{|\eta |=3} (v'-v)^\eta \frac{\partial ^\eta \phi ({{\hat{v}}})}{\eta !}, \end{aligned}$$

where we used the multi-index notation \(|\eta |=\eta _1+\ldots +\eta _d\), \(\eta !=\eta _1!\ldots \eta _d!\),

$$\begin{aligned} \partial ^\eta \phi (v)= \frac{\partial ^{|\eta |}}{\partial ^{\eta _1} v_1\ldots \partial ^{\eta _d} v_d}{\phi (v)}, \quad (v'-v)^\eta = (v_1'-v_1)^{\eta _1}\cdots (v'_d-v_d)^{\eta _d}, \end{aligned}$$

and \({{\hat{v}}} = \theta v + (1-\theta ) v'\), for some \(\theta \in (0,1)\). We refer to [43] for an extensive discussion on this kind of asymptotic limits leading from a Boltzmann dynamics to the corresponding mean-field behavior. Here, we limit ourselves to observing that, from an algorithmic viewpoint, this corresponds to increasing the frequency of binary interactions while reducing the strength of each single interaction.
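
In practice this amounts to running the Monte Carlo scheme with the rescaled strengths \(\varepsilon \lambda \) and \(\sqrt{\varepsilon }\sigma \) for \(O(1/\varepsilon )\) interaction rounds. A minimal sketch, reusing the hypothetical micro_interaction helper above and the same random pairing as before:

```python
import numpy as np

def quasi_invariant_run(V, E, lam, sig, beta, eps, T=1.0, rng=np.random.default_rng()):
    """Simulate the micro-only Boltzmann dynamics up to time T under the scaling (3.22):
    interaction strengths eps*lam and sqrt(eps)*sig, and about T/eps interaction rounds."""
    N = V.shape[0]
    for _ in range(int(T / eps)):
        idx = rng.permutation(N)
        for k in range(N // 2):
            i, j = idx[2 * k], idx[2 * k + 1]
            V[i], V[j] = micro_interaction(V[i], V[j], E,
                                           eps * lam, np.sqrt(eps) * sig, beta, rng)
    return V
```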

Now (2.8), under the scaling (3.22), can be written as

$$\begin{aligned}&\frac{\partial }{\partial t} \int _{{\mathbb {R}}^d} f(v,t)\phi (v)\,dv \nonumber \\&\quad = \frac{1}{\varepsilon }\left\langle \int _{{\mathbb {R}}^{2d}} \left( \phi (v')-\phi (v)\right) f(v,t)f(v_*,t)\,dv\,dv_*\right\rangle \nonumber \\&\quad = \lambda \int _{{\mathbb {R}}^{2d}}\gamma ^{\mathcal {E}}_\beta (v,v_*) \nabla _v \phi (v)\cdot (v_*-v)f(v,t)f(v_*,t)\,dv\,dv_*\nonumber \\&\qquad + \varepsilon \frac{\lambda ^2}{2} \int _{{\mathbb {R}}^{2d}}(\gamma ^{\mathcal {E}}_\beta (v,v_*))^2\sum _{|\eta |=2} (v_*-v)^\eta \frac{\partial ^\eta \phi (v)}{\eta !}f(v,t)f(v_*,t)\,dv\,dv_* \nonumber \\&\qquad + \frac{\sigma ^2}{2} \int _{{\mathbb {R}}^{2d}}\sum _{i=1}^d D_{ii}^2(v,v_*)\frac{\partial ^2 \phi (v)}{\partial v_i^2}f(v,t)f(v_*,t)\,dv\,dv_* \nonumber \\&\qquad + O(\sqrt{\varepsilon }). \end{aligned}$$
(3.25)

Under suitable boundedness assumptions on moments up to order three, we can formally pass to the limit \(\varepsilon \rightarrow 0\) to get the weak form

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial t} \int _{{\mathbb {R}}^d} f(v,t)\phi (v)\,dv =&\, \lambda \int _{{\mathbb {R}}^{2d}}\gamma ^{\mathcal {E}}_\beta (v,v_*) \nabla _v \phi (v)\cdot (v_*-v)f(v,t)f(v_*,t)\,dv\,dv_*\\&+ \frac{\sigma ^2}{2} \int _{{\mathbb {R}}^{2d}}\sum _{i=1}^d D_{ii}^2(v,v_*)\frac{\partial ^2 \phi (v)}{\partial v_i^2}f(v,t)f(v_*,t)\,dv\,dv_*. \end{aligned}\nonumber \\ \end{aligned}$$
(3.26)

This implies that f satisfies the mean-field limit equation

$$\begin{aligned} \begin{aligned}&\frac{\partial f(v,t)}{\partial t} + \lambda \nabla _v \cdot \left( f(v,t)\int _{{\mathbb {R}}^{d}}\gamma ^{\mathcal {E}}_\beta (v,v_*)(v_*-v)f(v_*,t)\,dv_* \right) \\&\quad = \frac{\sigma ^2}{2}\sum _{i=1}^d \frac{\partial ^2}{\partial v_i^2} \left( f(v,t)\int _{{\mathbb {R}}^{d}} D_{ii}^2(v,v_*) f(v_*,t)\,dv_*\right) . \end{aligned} \end{aligned}$$
(3.27)

The explicit expressions of the diffusion terms are given below for the isotropic case

$$\begin{aligned} \int _{{\mathbb {R}}^{d}} D_{ii}^2(v,v_*) f(v_*,t)\,dv_* = \sum _{j=1}^d \int _{{\mathbb {R}}^{d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)^2 (v_{*,j}-v_j)^2 f(v_*,t)\,dv_*\nonumber \\ \end{aligned}$$
(3.28)

and the anisotropic one

$$\begin{aligned} \int _{{\mathbb {R}}^{d}} D_{ii}^2(v,v_*) f(v_*,t)\,dv_* = \int _{{\mathbb {R}}^{d}} \gamma ^{\mathcal {E}}_\beta (v,v_*)^2 (v_{*,i}-v_i)^2 f(v_*,t)\,dv_*. \end{aligned}$$
(3.29)

In the general case, by analogous computations, under boundedness assumptions on moments, in the limit \(\varepsilon \rightarrow 0\) we get the weak form

$$\begin{aligned} \frac{\partial }{\partial t} \int _{{\mathbb {R}}^d} f(v,t)\phi (v)\,dv= & {} \lambda _1 \int _{{\mathbb {R}}^{2d}}\gamma ^{\mathcal {E}}_\beta (v,v_*) \nabla _v \phi (v)\cdot (v_*-v)f(v,t)f(v_*,t)\,dv\,dv_* \nonumber \\&+ \lambda _2 \int _{{\mathbb {R}}^{d}}\nabla _v \phi (v)\cdot ({v_{\alpha ,{\mathcal {E}}}(t)}-v)f(v,t)\,dv \nonumber \\&+ \frac{\sigma _1^2}{2} \int _{{\mathbb {R}}^{2d}}\sum _{i=1}^d D_{1,ii}^2(v,v_*)\frac{\partial ^2 \phi (v)}{\partial v_i^2}f(v,t)f(v_*,t)\,dv\,dv_* \nonumber \\&+ \frac{\sigma _2^2}{2} \int _{{\mathbb {R}}^{d}}\sum _{i=1}^d D_{2,ii}^2(v)\frac{\partial ^2 \phi (v)}{\partial v_i^2} f(v,t)\,dv, \end{aligned}$$
(3.30)

which corresponds to the mean-field limit equation

$$\begin{aligned}&\frac{\partial f(v,t)}{\partial t} + \lambda _1 \nabla _v \cdot \left( f(v,t)\int _{{\mathbb {R}}^{d}}\gamma ^{\mathcal {E}}_\beta (v,v_*)(v_*-v)f(v_*,t)\,dv_* \right) \nonumber \\&\qquad + \lambda _2 \nabla _v \cdot \left( f(v,t)({v_{\alpha ,{\mathcal {E}}}(t)}-v) \right) \nonumber \\&\quad = \frac{\sigma _1^2}{2}\sum _{i=1}^d \frac{\partial ^2}{\partial v_i^2} \left( f(v,t)\int _{{\mathbb {R}}^{d}} D_{1,ii}^2(v,v_*) f(v_*,t)\,dv_*\right) \nonumber \\&\qquad +\frac{\sigma _2^2}{2}\sum _{i=1}^d \frac{\partial ^2}{\partial v_i^2} \left( f(v,t)D_{2,ii}^2(v)\right) . \end{aligned}$$
(3.31)

Remark 3.3

System (3.31) generalizes the notion of CBO model to the case where a local interaction is taken into account. Additionally, let us remark that, thanks to the scaling (3.22), in the mean-field limit we have the analogues of Propositions 3.1 and 3.2 where now the \(\lambda ^2\) terms disappear, making the corresponding concentration conditions less restrictive.
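
A direct way to read (3.31) algorithmically is through its interacting-particle approximation, in which f is replaced by the empirical measure of N particles and the local integrals become averages over the ensemble. The following Euler-Maruyama sketch for the anisotropic case is only one possible discretization, under the assumptions of Gaussian noise and a simple explicit time step; the function name and parameter layout are illustrative.

```python
import numpy as np

def mean_field_step(V, E, lam1, lam2, sig1, sig2, alpha, beta, dt,
                    rng=np.random.default_rng()):
    """One Euler-Maruyama step of an interacting-particle approximation of (3.31)
    with anisotropic exploration; f is replaced by the empirical measure of V (N x d)."""
    N, d = V.shape
    e = np.apply_along_axis(E, 1, V)
    w = np.exp(-alpha * (e - e.min()))                           # shifted weights, Eq. (2.3)
    v_alpha = (w[:, None] * V).sum(axis=0) / w.sum()
    # pairwise weights gamma_beta^E(V_i, V_j), stable logistic evaluation
    G = np.exp(-np.logaddexp(0.0, -beta * (e[:, None] - e[None, :])))
    diff = V[None, :, :] - V[:, None, :]                         # diff[i, j] = V_j - V_i
    local_drift = (G[:, :, None] * diff).mean(axis=1)            # int gamma_beta (v_*-v) f dv_*
    local_var = ((G ** 2)[:, :, None] * diff ** 2).mean(axis=1)  # int D_{1,ii}^2 f dv_*, Eq. (3.29)
    return (V + dt * (lam1 * local_drift + lam2 * (v_alpha - V))
              + np.sqrt(dt) * (sig1 * np.sqrt(local_var) * rng.standard_normal((N, d))
                               + sig2 * (v_alpha - V) * rng.standard_normal((N, d))))
```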

4 Convergence to the Global Minimum

In this section, we will attempt to understand under which conditions we can assume \(\displaystyle \lim _{t\rightarrow \infty } {\mathcal {E}}\left( m(t)\right) \) to be a good approximation of \(\underline{{\mathcal {E}}}:= \min _{v\in {\mathbb {R}}^d}{\mathcal {E}}(v)\).

In order to do so, we will investigate the large-time behavior of the solution f(v,t) to the Boltzmann equation (2.8). Here, we will first limit ourselves to the case where only the microscopic best estimate occurs during the interactions and then study the case where only the macroscopic best estimate occurs.

4.1 The Case with Only the Microscopic Best Estimate

In order to study the fully microscopic dynamics, let us set \(\lambda _2 = \sigma _2 =0\) and \(\lambda = \lambda _1\), \(\sigma =\sigma _1\). Throughout this section we assume \({\mathcal {E}}\) to satisfy Assumption 3.1 and the following additional regularity assumptions.

Assumption 4.1

\({\mathcal {E}}\in \mathcal {C}^2({\mathbb {R}}^d)\) and there exist \(c_1,c_2 >0\) such that

  1. \(\displaystyle \sup _{v \in {\mathbb {R}}^d} |\nabla {\mathcal {E}}(v) | \le c_1\;;\)

  2. \(\displaystyle \sup _{v \in {\mathbb {R}}^d} {\Vert \nabla ^2{\mathcal {E}}(v)\Vert _2 }\le c_2\,.\)

Under these assumptions on the objective function \({\mathcal {E}}\), the following result holds.

Theorem 4.1

Let f(v,t) satisfy the Boltzmann equation (2.8) with initial datum \(f_0(v)\) and binary interaction described by (3.1). Let also Assumptions 3.1 and 4.1 hold for \({\mathcal {E}}\). If the model parameters \(\{\lambda ,\sigma ,\beta \}\) and \(f_0(v)\) satisfy

$$\begin{aligned}&\mu := \frac{\lambda }{C_{\beta ,{\mathcal {E}}}} {- \lambda ^2} - \sigma ^2 \kappa > 0 \end{aligned}$$
(4.1)
$$\begin{aligned}&\nu := \frac{2({\sqrt{2}}\lambda c_1 + {(\lambda ^2+ \sigma ^2 \kappa ) c_2} )\beta e^{-\beta {\underline{{\mathcal {E}}}}}}{\mu \Vert \omega _\beta ^{\mathcal {E}}\Vert _{L^1(f_0)}} {m_{V(0)}} < \frac{1}{2} \end{aligned}$$
(4.2)

where \(m_{V(0)}=\max \{\sqrt{V(0)},V(0)\}\), then there exists \({\tilde{v}} \in {\mathbb {R}}^d\) such that \(m(t) \longrightarrow \tilde{v}\) as \(t \rightarrow \infty \). Moreover, the following estimate holds

$$\begin{aligned} {\mathcal {E}}({\tilde{v}}) \le {\underline{{\mathcal {E}}}} + r(\beta ) + \frac{\log 2}{\beta } \end{aligned}$$
(4.3)

where, if a minimizer \(v^\star \) of \({\mathcal {E}}\) belongs to \(\text {supp}(f_0)\), then \(r(\beta ):= -\frac{1}{\beta }\log \Vert \omega _\beta ^{\mathcal {E}}\Vert _{L^1(f_0)} - {\underline{{\mathcal {E}}}} \longrightarrow 0\) as \(\beta \rightarrow \infty \) thanks to the Laplace principle (2.4).

Proof

Similar to what we did to derive the mean-field scaling limit, we consider the multidimensional Taylor expansion for \(\omega _\beta ^{\mathcal {E}}\)

$$\begin{aligned} \left\langle \omega _\beta ^{\mathcal {E}}(v') - \omega _\beta ^{\mathcal {E}}(v) \right\rangle = \left\langle \nabla \omega _\beta ^{\mathcal {E}}(v) \cdot (v'-v) +\frac{1}{2} (v'-v) \cdot \nabla ^2 \omega _\beta ^{\mathcal {E}}({\hat{v}})(v'-v) \right\rangle \end{aligned}$$
(4.4)

where \({\hat{v}} = \theta v +(1-\theta )v'\) for some \(\theta \in (0,1)\). Thanks to Assumption 4.1, one can bound the above terms as

$$\begin{aligned} \left\langle \nabla \omega _\beta ^{\mathcal {E}}(v) \cdot (v'-v) \right\rangle= & {} -\beta e^{-\beta {\mathcal {E}}(v)} \lambda \nabla {\mathcal {E}}(v) \cdot (v_{\beta ,{\mathcal {E}}}(v,v_*)-v)\\\ge & {} - \beta e^{-\beta \underline{{\mathcal {E}}}}\lambda c_1|v_{\beta ,{\mathcal {E}}}(v,v_*) - v|\,. \end{aligned}$$

By computing the Hessian of \(\omega _\beta ^{\mathcal {E}}(v)\)

$$\begin{aligned} \nabla ^2 \omega _\beta ^{\mathcal {E}}= \beta ^2 e^{-\beta {\mathcal {E}}} \nabla {\mathcal {E}}\otimes \nabla {\mathcal {E}}- \beta e^{-\beta {\mathcal {E}}} \nabla ^2 {\mathcal {E}}\,, \end{aligned}$$

we obtain

$$\begin{aligned}&\frac{1}{2}\left\langle (v'-v)\cdot \nabla ^2 \omega _\beta ^{\mathcal {E}}({\hat{v}}) (v'-v) \right\rangle \\&\quad = \left\langle \frac{1}{2} \beta ^2 e^{-\beta {\mathcal {E}}({\hat{v}})} |\nabla {\mathcal {E}}({\hat{v}}) \cdot (v'-v)|^2 - \frac{\beta }{2} e^{-\beta {\mathcal {E}}({\hat{v}})} (v'-v) \cdot \nabla ^2{\mathcal {E}}({\hat{v}}) (v'-v) \right\rangle \\&\quad \ge - \frac{\beta }{2} e^{-\beta {\underline{{\mathcal {E}}}}} \Vert \nabla ^2 {\mathcal {E}}({\hat{v}})\Vert _2 \left\langle |v'-v|^2 \right\rangle \\&\quad \ge - \frac{\beta }{2} e^{-\beta {\underline{{\mathcal {E}}}}} (\lambda ^2 + \sigma ^2\kappa ) c_2| v_{\beta , {\mathcal {E}}}(v,v_*) - v |^2 \end{aligned}$$

where in the last inequality we used Assumption 4.1 and the fact that

$$\begin{aligned}\left\langle |v'-v|^2 \right\rangle\le & {} \lambda ^2 |v_{\beta , {\mathcal {E}}}(v,v_*) - v|^2 + \sigma ^2 \left\langle |D(v,v_*) \xi _1|^2 \right\rangle \\\le & {} (\lambda ^2 + \sigma ^2 \kappa ) |v_{\beta , {\mathcal {E}}}(v,v_*) - v|^2\,, \end{aligned}$$

by definition of \(D(v,v_*)\) and \(\xi _1\).

We introduce

$$\begin{aligned} M_\beta (t) := \int _{{\mathbb {R}}^d} \omega _{\beta }^{\mathcal {E}}(v) f(v, t) \,dv = \Vert \omega _\beta ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))} \end{aligned}$$
(4.5)

and apply the weak formulation (2.8) to \(\phi (v)= \omega _{\beta }^{\mathcal {E}}(v)\) to obtain

$$\begin{aligned} \begin{aligned} \frac{d M_\beta (t)}{dt} =&\left\langle \int _{{\mathbb {R}}^{2d}} \left( \omega _\beta ^{\mathcal {E}}(v') - \omega _\beta ^{\mathcal {E}}(v) \right) f(v,t) f(v_*,t) \,dv\, dv_* \right\rangle \\ \ge&- \beta e^{-\beta {\underline{{\mathcal {E}}}}} \lambda c_1 \int _{{\mathbb {R}}^{2d}} |v_{\beta ,{\mathcal {E}}}(v,v_*) - v|\,f(v,t) f(v_*,t)\, dv\, dv_*\\&- \frac{\beta }{2} e^{-\beta {\underline{{\mathcal {E}}}}} (\lambda ^2 + \sigma ^2\kappa ) c_2 \int _{{\mathbb {R}}^{2d}} |v_{\beta ,{\mathcal {E}}}(v,v_*) - v|^2\,f(v,t) f(v_*,t)\, dv\, dv_*\,. \end{aligned} \end{aligned}$$
(4.6)

We recall that

$$\begin{aligned}&\int _{{\mathbb {R}}^{2d}} |v_{\beta ,{\mathcal {E}}}(v,v_*) - v|^2\,f(v,t) f(v_*,t)\, dv\, dv_* \\&\quad \le \int _{{\mathbb {R}}^{2d}} \gamma _{\beta }^{\mathcal {E}}(v,v_*)|v_* - v|^2\,f(v,t) f(v_*,t) \,dv\, dv_* = 2 V(t)\, \end{aligned}$$

from which it also follows, by Jensen's inequality, that

$$\begin{aligned} \int _{{\mathbb {R}}^{2d}} |v_{\beta ,{\mathcal {E}}}(v,v_*) - v|\,f(v,t) f(v_*,t)\, dv\, dv_* \le \sqrt{2V(t)}\,. \end{aligned}$$

Finally, we obtain

$$\begin{aligned} \begin{aligned} \frac{d M_\beta (t)}{dt}&\ge - \beta e^{-\beta {\underline{{\mathcal {E}}}}}\lambda c_1\sqrt{2V(t)} - \beta e^{-\beta {\underline{{\mathcal {E}}}}} (\lambda ^2 + \sigma ^2\kappa ) c_2 V(t)\, \\&\ge - \beta e^{-\beta {\underline{{\mathcal {E}}}}} \left( \sqrt{2}\lambda c_1 + (\lambda ^2 + \sigma ^2 \kappa )c_2 \right) \max \{\sqrt{V(t)}, V(t)\}\,. \end{aligned} \end{aligned}$$
(4.7)

Now, by definition of \(\mu \) it holds \(d V(t)/dt \le - \mu V(t)\) thanks to Proposition 3.1. As we did in the proof of Corollary 3.1, we apply Grönwall’s inequality to obtain an exponential decay of the variance from which follows

$$\begin{aligned} { \max \{\sqrt{V(t)}, V(t)\} \le \max \{\sqrt{V(0)}, V(0)\} e^{-\frac{1}{2} \mu t}\, \quad \text {for all}\;\; t>0\,.} \end{aligned}$$

This leads to a lower bound for \(M_\beta (t)\) in terms of \(M_\beta (0)\):

$$\begin{aligned} \begin{aligned} {M_\beta (t)}\ge&{M_\beta (0)} - \beta e^{-\beta \underline{{\mathcal {E}}}} ({\sqrt{2}}\lambda c_1 + {(\lambda ^2 + \sigma ^2 \kappa )} c_2) {\max \{\sqrt{V(0)}, V(0)\}} \int _0^t e^{-\frac{1}{2}\mu s} ds \\ \ge&{M_\beta (0)} - \frac{2({\sqrt{2}}\lambda c_1 + {(\lambda ^2 + \sigma ^2 \kappa )} c_2)\beta e^{-\beta \underline{{\mathcal {E}}}}}{\mu } { \max \{\sqrt{V(0)}, V(0)\}} \\ =&{M_\beta (0)} (1-\nu )\,. \end{aligned} \end{aligned}$$
(4.8)

By definition of \(\nu \) and condition (4.2), it holds

$$\begin{aligned} \begin{aligned} {M_\beta (t) >} \frac{1}{2}{M_\beta (0)}\,. \end{aligned} \end{aligned}$$
(4.9)

Let us now consider the limit of the above inequality as \(t\rightarrow \infty \). Since \(m(t) \rightarrow {\tilde{v}}\) and \(V(t) \rightarrow 0\), it holds

$$\begin{aligned} {M_\beta (t) = \int \omega _\beta ^{{\mathcal {E}}} (v) f(v,t)\, dv\; \longrightarrow \; \omega _\beta ^{\mathcal {E}}({\tilde{v}}) = e^{-\beta {\mathcal {E}}(\tilde{v})}\quad \text {as} \quad t \rightarrow \infty \,. }\end{aligned}$$
(4.10)

The above limit is a consequence of Chebyshev's inequality; we refer to the proof of [12, Lemma 4.2] for more details. Considering the limit of inequality (4.9) as \(t \rightarrow \infty \), we have

$$\begin{aligned} e^{-\beta {\mathcal {E}}({\tilde{v}}) }> \frac{1}{2}{M_\beta (0)} \, . \end{aligned}$$
(4.11)

Finally, we take the logarithm of both sides of the above inequality to obtain

$$\begin{aligned} \begin{aligned} {\mathcal {E}}({\tilde{v}})&{<} - \frac{1}{\beta }\log {M_\beta (0)} + \frac{\log 2}{\beta } {\,=\,} {\underline{{\mathcal {E}}}} + r(\beta ) + \frac{\log 2}{\beta }\,, \end{aligned} \end{aligned}$$
(4.12)

where \(r(\beta ) :={\,- \frac{1}{\beta }\log M_\beta (0) - {\underline{{\mathcal {E}}}} =} - \frac{1}{\beta }\log \Vert \omega _\beta ^{\mathcal {E}}\Vert _{L^1(f_0)} - {\underline{{\mathcal {E}}}} \). \(\square \)

4.2 The Case with Only the Macroscopic Best Estimate

We now consider the case where only the macroscopic dynamics occurs and the interaction is determined by (3.14).

Theorem 4.2

Let f(v,t) satisfy the Boltzmann equation (2.8) with initial datum \(f_0(v)\) and binary interaction described by (3.14). Let also Assumptions 3.1 and 4.1 hold for \({\mathcal {E}}\). If the model parameters \(\{\lambda ,\sigma ,\alpha \}\) and \(f_0(v)\) satisfy

$$\begin{aligned} \mu :=&2 \lambda - 4 \frac{e^{-\alpha \underline{{\mathcal {E}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} } (\lambda ^2 + \kappa \sigma ^2) >0 \end{aligned}$$
(4.13)
$$\begin{aligned} \nu :=&\frac{4(2\lambda + {\lambda ^2 +} \sigma ^2 \kappa )c_2 \alpha e^{-2\alpha {\underline{{\mathcal {E}}}}}}{\mu \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} ^2} V(0) < \frac{3}{4} \end{aligned}$$
(4.14)

then there exists \({\tilde{v}} \in {\mathbb {R}}^d\) such that \(m(t) \longrightarrow \tilde{v}\) as \(t \rightarrow \infty \). Moreover, the following estimate holds

$$\begin{aligned} {\mathcal {E}}({\tilde{v}}) \le {\underline{{\mathcal {E}}}} + r(\alpha ) + \frac{\log 2}{\alpha } \end{aligned}$$
(4.15)

where, if a minimizer \(v^\star \) of \({\mathcal {E}}\) belongs to \(\text {supp}(f_0)\), then \(r(\alpha ):= -\frac{1}{\alpha }\log \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} - {\underline{{\mathcal {E}}}} \longrightarrow 0\) as \(\alpha \rightarrow \infty \) thanks to the Laplace principle (2.4).

Proof

Similar to the proof of Theorem 4.1, we consider the Taylor expansion of \(\omega _\alpha ^{\mathcal {E}}\) which reads as

$$\begin{aligned} \left\langle \omega _\alpha ^{\mathcal {E}}(v') - \omega _\alpha ^{\mathcal {E}}(v) \right\rangle =\lambda \nabla \omega _\alpha ^{\mathcal {E}}(v) \cdot (v_{\alpha ,{\mathcal {E}}}(t) - v) + \frac{1}{2} \left\langle (v'-v) \cdot \nabla ^2\omega _\alpha ^{\mathcal {E}}({\hat{v}})(v'-v) \right\rangle \nonumber \\ \end{aligned}$$
(4.16)

for some \({\hat{v}} \in {\mathbb {R}}^d\). As before, by using Assumption 4.1 and the definition of \(\nabla ^2 \omega _{\alpha }^{\mathcal {E}}\), the second term can be bounded as

$$\begin{aligned} \begin{aligned} \frac{1}{2} \left\langle (v'-v) \cdot \nabla ^2\omega _\alpha ^{\mathcal {E}}({\hat{v}})(v'-v) \right\rangle&\ge - \frac{\alpha }{2} e^{- \alpha {\mathcal {E}}({\hat{v}})} (v' - v)\cdot \nabla ^2{\mathcal {E}}({\hat{v}}) (v' - v) \\&\ge - \frac{\alpha }{2} e^{-\alpha {\underline{{\mathcal {E}}}}} (\lambda ^2 + \sigma ^2 \kappa ) c_2 | v_{\alpha , {\mathcal {E}}}(t) - v |^2\,. \end{aligned} \end{aligned}$$

For the first term of the expansion, it holds

$$\begin{aligned} \begin{aligned}&\int _{{\mathbb {R}}^{2d}} \lambda \nabla \omega _\alpha ^{\mathcal {E}}(v) \cdot (v_{\alpha ,{\mathcal {E}}}(t) - v) \,f(v,t)f(v_*,t)\,dv\,dv_* \\&\quad = - \alpha \lambda \int _{{\mathbb {R}}^{2d}} e^{-\alpha {\mathcal {E}}(v)} \nabla {\mathcal {E}}(v) \cdot (v_{\alpha ,{\mathcal {E}}}(t) - v) \,f(v,t)f(v_*,t)\,dv\,dv_* \\&\quad = \!-\! \alpha \lambda \int _{{\mathbb {R}}^{2d}} e^{-\alpha {\mathcal {E}}(v)} \left( \nabla {\mathcal {E}}(v) \!-\! \nabla {\mathcal {E}}(v_{\alpha ,{\mathcal {E}}}(t)) \right) \cdot (v_{\alpha ,{\mathcal {E}}}(t) \!-\! v) \,f(v,t)f(v_*,t)\,dv\,dv_*\\&\quad \ge - \alpha e^{-\alpha {\underline{{\mathcal {E}}}}}\lambda c_2 \int _{{\mathbb {R}}^{2d}} |v_{\alpha ,{\mathcal {E}}}(t) - v|^2 \,f(v,t)f(v_*,t)\,dv\,dv_*\,, \end{aligned} \end{aligned}$$

where we used that

$$\begin{aligned}\int _{{\mathbb {R}}^{2d}} e^{-\alpha {\mathcal {E}}(v)} \nabla {\mathcal {E}}(v_{\alpha ,{\mathcal {E}}}(t)) \cdot (v_{\alpha ,{\mathcal {E}}}(t) - v)f(v,t)f(v_*,t)\,dv\,dv_* = 0\,.\end{aligned}$$

As before, we denote \(M_\alpha (t):= \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f(\cdot ,t))}\). By the weak formulation (2.8) it then follows

$$\begin{aligned} \frac{d}{dt}M_\alpha ^2(t)\!= & {} \!2 M_\alpha (t) \frac{d}{dt}M_\alpha (t) \!=\!2 M_\alpha (t) \left\langle \!\int _{{\mathbb {R}}^{2d}} \omega _\alpha ^{\mathcal {E}}(v') \!-\! \omega _\alpha ^{\mathcal {E}}(v) \,f(v,t)f(v_*,t)\,dvdv_* \right\rangle \nonumber \\\ge & {} - 4\alpha c_2 \left( 2\lambda + \lambda ^2 + \sigma ^2\kappa \right) e^{-2\alpha {\underline{{\mathcal {E}}}}} V(t) \end{aligned}$$
(4.17)

where we used (3.20) to bound the expectation of \(|v_{\alpha , {\mathcal {E}}}(t) - v|^2\).

We now define the time

$$\begin{aligned} T := \sup \left\{ t\,:\, M_\alpha (s) >\frac{1}{2} M_\alpha (0), \; \forall \; s \in [0,t] \right\} \end{aligned}$$
(4.18)

and assume that \(T< \infty \). By assumption (4.13) on \(\mu \), for all \(t \in [0,T]\)

$$\begin{aligned} 2 \lambda - 2 \frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{M_\alpha (t)} (\lambda ^2 + \kappa \sigma ^2) \ge 2 \lambda - 4 \frac{e^{-\alpha {\underline{{\mathcal {E}}}}}}{M_\alpha (0)} (\lambda ^2 + \kappa \sigma ^2) = \mu >0\,, \end{aligned}$$

which leads to

$$\begin{aligned}\frac{dV(t)}{dt} \le - \mu V(t)\end{aligned}$$

thanks to Proposition 3.2. Due to Grönwall’s inequality one has \(V(t) \le V(0) \exp (-\mu t)\) for all \(t \in [0,T]\). By assumption (4.14),

$$\begin{aligned} \begin{aligned} M_\alpha ^2(t) \ge M^2_\alpha (0) - 4 (2\lambda + \lambda ^2 + \sigma ^2\kappa ) c_2 \alpha e^{-2\alpha {\underline{{\mathcal {E}}}}} V(0) \int _0^t e^{-\mu s} ds \\ > M^2_\alpha (0) - \frac{4 (2\lambda + \lambda ^2 + \sigma ^2\kappa )c_2 \alpha e^{-2\alpha {\underline{{\mathcal {E}}}}}}{\mu } V(0) \ge \frac{1}{4} M_\alpha ^2(0) \end{aligned} \end{aligned}$$
(4.19)

which implies that for all \(t \in [0,T]\),

$$\begin{aligned} M_\alpha (t) > \frac{1}{2} M_\alpha (0)\,. \end{aligned}$$
(4.20)

This means that, by continuity, there exists \(\delta >0\) such that \(M_\alpha (t) > \frac{1}{2} M_\alpha (0)\) for all \(t\in [T, T+\delta )\), which contradicts the definition of T. Consequently, \(T=\infty \) and hence (4.20) holds for all \(t>0\). As a consequence, we obtain the exponential decay of the variance

$$\begin{aligned} V(t) \le V(0) e^{-\mu t} \quad \text {for all} \; t>0\,. \end{aligned}$$
(4.21)

As we showed in the proof of Corollary 3.1, there exists a \({\tilde{v}} \in {\mathbb {R}}^d\) such that \(m(t) \rightarrow {\tilde{v}}\) as \(t\rightarrow \infty \) with exponential rate, from which it follows that \(M_\alpha (t) \rightarrow e^{-\alpha {\mathcal {E}}({\tilde{v}})}\). By taking the limit as \(t \rightarrow \infty \) of (4.20), we obtain

$$\begin{aligned} e^{-\alpha {\mathcal {E}}({\tilde{v}})} > \frac{1}{2} M_\alpha (0) \end{aligned}$$
(4.22)

and we conclude that

$$\begin{aligned} {\mathcal {E}}({\tilde{v}}) < - \frac{1}{\alpha }\log M_\alpha (0) + \frac{\log 2}{\alpha }= {\underline{{\mathcal {E}}}} + r(\alpha ) + \frac{\log 2}{\alpha }\,, \end{aligned}$$
(4.23)

where \(r(\alpha ) := - \frac{1}{\alpha }\log M_\alpha (0) - {\underline{{\mathcal {E}}}} = - \frac{1}{\alpha }\log \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} - {\underline{{\mathcal {E}}}} \). \(\square \)

Finally, it is possible to prove a general convergence result to the global minimum for the case where both the local and the global best alignments occur in the particle interaction. In the following, for simplicity, we will set \(\beta = \alpha \).

Theorem 4.3

Let f(v, t) satisfy the Boltzmann equation (2.8) with initial datum \(f_0(v)\) and binary interaction described by (2.1). Let also Assumptions 3.1 and 4.1 hold for \({\mathcal {E}}\). If the model parameters \(\{\lambda _1,\lambda _2,\sigma _1, \sigma _2,\alpha \}\) and \(f_0(v)\) satisfy

$$\begin{aligned} \mu :=&2 \lambda _2 - 4 \frac{e^{-\alpha \underline{{\mathcal {E}}}}}{\Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} } (\lambda _2^2 + \kappa \sigma _2^2) - (\lambda ^2_1 +\sigma ^2_1 \kappa ) > 0 \end{aligned}$$
(4.24)
$$\begin{aligned} \nu :=&\frac{8\left( \sqrt{2}\lambda _1 c_1 \!+\! (\lambda _1^2 \!+\! \sigma _1^2\kappa ) c_2+ (2 \lambda _2 \!+\! \lambda _2^2 \!+\! \sigma _2^2 \kappa )c_2 \right) \alpha e^{-2\alpha {\underline{{\mathcal {E}}}}}}{\mu \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} ^2} \max \{\sqrt{V(0)}, V(0)\}< \frac{3}{4} \end{aligned}$$
(4.25)

then there exists \({\tilde{v}} \in {\mathbb {R}}^d\) such that \(m(t) \longrightarrow \tilde{v}\) as \(t \rightarrow \infty \). Moreover, the following estimate holds

$$\begin{aligned} {\mathcal {E}}({\tilde{v}}) \le {\underline{{\mathcal {E}}}} + r(\alpha ) + \frac{\log 2}{\alpha } \end{aligned}$$
(4.26)

where, if a minimizer \(v^\star \) of \({\mathcal {E}}\) belongs to \(\text {supp}(f_0)\), then \(r(\alpha ):= -\frac{1}{\alpha }\log \Vert \omega _\alpha ^{\mathcal {E}}\Vert _{L^1(f_0)} - {\underline{{\mathcal {E}}}} \longrightarrow 0\) as \(\alpha \rightarrow \infty \) thanks to the Laplace principle (2.4).

The proof closely follows the proofs of Theorems 4.1 and 4.2 and is omitted for brevity. It is interesting to remark, however, that condition (4.24) is far less restrictive than the corresponding condition (4.1) where only the local best is used. This suggests using the local best in practical applications only in combination with the global best.

Before concluding our theoretical analysis, a few remarks are in order.

Remark 4.1

  • The assumptions in Theorems 4.1, 4.2 and 4.3 depend strongly on \(\beta \) and \(\alpha \), which have to be considered as fixed parameters. Therefore, the limits \(t\rightarrow \infty \) and \(\beta \), or \(\alpha \rightarrow \infty \) are not interchangeable. Furthermore, for a given \(\beta \), or \(\alpha \), a choice of \(\lambda _1, \lambda _2\) and \(\sigma _1, \sigma _2\) satisfying the assumptions is always possible, at the cost of taking V(0) sufficiently small.

  • Under the mean field scaling (3.22), one can directly derive the equivalent of Theorem 4.3 for the mean-field limit dynamics (3.31). For small values of the scaling parameter \(\varepsilon \), the quadratic terms in \(\lambda _1, \lambda _2\) vanish and, in the case of the global best only, we recover the same convergence result as for CBO methods (see for instance [14, Theorem 3.1]).

  • Finally, in the case where the diffusion process in binary interactions is anisotropic, namely (2.7) holds and therefore \(\kappa =1\), convergence to the global minimum is guaranteed with parameter constraints independent of the problem dimensionality. For this reason, in all numerical examples of the next section only anisotropic noise has been considered.

5 Numerical Examples and Applications

This section is devoted to discussing the implementation of the proposed methods and to testing their performance through several numerical experiments. The first experiment, in Sect. 5.2, checks the quality of the macroscopic best estimate (2.3) when both terms (2.2) and (2.3) are employed in the evolution of the dynamics, in comparison with the use of only one of the two. The second experiment, presented in Sect. 5.3, shows how even simple 1-dimensional problems may pose serious issues to classical descent methods, whilst the proposed procedure achieves a high success rate. Finally, the last section presents an application to a classical machine learning problem, showing that KBO methods have the potential to outperform classical approaches. It should be noted that, in the numerical experiments, to facilitate comparison with the literature, we will denote by f(x) the function to be minimized and by x the variable in the search space, instead of \({\mathcal {E}}(v)\) and v as in the description of the KBO method.

5.1 Implementation

The numerical implementation of KBO relies on two different algorithms inspired by Nanbu's and Bird's direct simulation Monte Carlo methods in rarefied gas dynamics [9, 40, 42]. The former considers at each time step the evolution of distinct pairs of particles, while the latter allows for multiple interactions between pairs of particles in a time step. The methods are summarized in Algorithms 1 and 2; the interested reader can find additional details on similar algorithms used in particle swarming in [3, 43]. Mathematically, let us remark that in the limit of a large number of particles Nanbu's method converges to a discrete-time formulation of (2.8), while Bird's method converges to the continuous-time formulation (2.8).

In the algorithms reported, the parameters \(\delta _{ \text{ stall }}\) and \(n_{ \text{ stall }}\) check whether consensus has been reached in the last \(n_{ \text{ stall }}\) iterations within a tolerance \(\delta _{ \text{ stall }}\): in such a case, the evolution is stopped without reaching the total number of iterations. The initial particles are drawn from a given distribution, typically uniform in the search space unless one has additional information on the location of the global minimum. Note that, in contrast to Nanbu's method, interactions in Bird's algorithm take place without any time counter. As a consequence, the total number of interactions as well as the parameter \(n_{ \text{ stall }}\) have to be adjusted according to the overall number of particles.

[Algorithm 1: KBO based on Nanbu's method]
[Algorithm 2: KBO based on Bird's method]
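To complement the pseudocode, a minimal Python sketch of one Nanbu-type iteration is reported below. The interaction rule written here (alignment toward a pairwise weighted best and toward the population weighted best, with componentwise, i.e. anisotropic, multiplicative noise) follows the verbal description of the method and should be read as an illustrative assumption rather than a verbatim transcription of (2.1); the stall-based stopping check of Algorithms 1 and 2 would live in the outer loop and is omitted.

```python
import numpy as np

def weighted_best(V, E, alpha):
    """Gibbs-type weighted average of the rows of V with weights exp(-alpha*E);
    the shift by E.min() keeps the exponentials finite for very large alpha."""
    w = np.exp(-alpha * (E - E.min()))
    return (w[:, None] * V).sum(axis=0) / w.sum()

def kbo_nanbu_step(V, cost, lam1, sig1, lam2, sig2, alpha, beta, rng=None):
    """One Nanbu-type KBO iteration: particles are paired at random and every
    pair interacts exactly once; each particle is aligned toward the pairwise
    weighted best (local) and the population weighted best (global), with
    anisotropic noise acting componentwise."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = V.shape
    E = np.array([cost(v) for v in V])
    v_glob = weighted_best(V, E, alpha)                 # macroscopic (global) best
    perm = rng.permutation(N)
    V_new = V.copy()
    for i, j in zip(perm[: N // 2], perm[N // 2:]):
        for a, b in ((i, j), (j, i)):
            v_loc = weighted_best(V[[a, b]], E[[a, b]], beta)   # pairwise (local) best
            xi1 = rng.standard_normal(d)
            xi2 = rng.standard_normal(d)
            V_new[a] = (V[a]
                        + lam1 * (v_loc - V[a]) + sig1 * np.abs(v_loc - V[a]) * xi1
                        + lam2 * (v_glob - V[a]) + sig2 * np.abs(v_glob - V[a]) * xi2)
    return V_new, v_glob
```

Setting `lam2 = sig2 = 0` (or `lam1 = sig1 = 0`) reproduces, in this sketch, the "local best only" (respectively "global best only") configurations tested below.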
Fig. 1

Minimization of Rastrigin function for KBO based on Nanbu’s algorithm. From left to right: success rate and average iterations number. Top row refers to the local best only, while the bottom one refers to the global best only

Fig. 2

Minimization of Rastrigin function for KBO based on Bird’s algorithm. From left to right: success rate and average iterations number. Top row refers to the local best only, while the bottom one refers to the global best only

5.2 Validation of the Algorithms

The validation of the KBO algorithms is pursued initially on a classical benchmark function for global optimization, the Rastrigin function [34] in dimension \(d=20\), with global minimum \(f(x^\star )=0\) at \(x^\star =0\) (see Appendix A). As shown in [14, 23, 28, 44], compared to other benchmark functions the Rastrigin function in high dimension has proven to be quite challenging for CBO-type methods if one is interested in computing the precise value \(x^\star \) at which the function attains its global minimum. In fact, the Rastrigin function contains multiple similar minima located at different positions, and the minimizer can easily get trapped in a local minimum without being able to compute the global optimum. This test is used to analyze the performance of the two different algorithmic implementations of the method and the effects of the parameters related to the alignment and exploration processes, based on the local best and the global best respectively.

The computational parameters are fixed as \(N=200\), \(N_t=10{,}000\), \(n_{ \text{ stall }}=1000\), \(\delta _{ \text{ stall }}=10^{-4}\) and the particles are initially distributed following a uniform distribution in the hypercube \([-3.12, 3.12]^d\), \(d=20\). Figures 1 and 2 show the performance of the KBO algorithms when only the local best or only the global best is used. In both figures the first row refers to the case in which only the microscopic estimate has been used, i.e. \(\lambda _2=\sigma _2=0\), with \(\lambda _1=1\) and \(\sigma _1\in (0, 2]\), while the second row refers to the use of the sole macroscopic estimate, i.e. \(\lambda _1=\sigma _1=0\), \(\lambda _2=1\) and \(\sigma _2\in (0,11]\). Two measures are used for the validation: the first is the success rate, the second the number of iterations. In agreement with [14, 44], a simulation is considered successful if and only if

$$\begin{aligned} \Vert x^*_\alpha -x^\star \Vert _\infty <0.25 \end{aligned}$$
(5.1)

where \(x^*_\alpha \) is the macroscopic best estimate (provided by (2.3)), while \(x^\star \) is the actual minimizer of the Rastrigin function. Note that, in the case where only the local best has been used, we still use the global best as an estimate of the global minimizer computed by the algorithm. The algorithms have been tested for three different choices of \(\varepsilon \), namely 1, 0.1 and 0.01. Each setting has been tested over 100 simulations. The local and global best estimates have been evaluated using \(\alpha =\beta =5\times 10^6\). For the numerical implementation, we refer to the algorithm introduced in [21], which permits the use of arbitrarily large values of \(\alpha \) and \(\beta \).
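To make the validation protocol concrete, the following is a small, hypothetical Python sketch of how the two measures (success rate and average number of iterations) could be collected over repeated runs; `run_kbo` stands for any driver that iterates a KBO step (such as the Nanbu sketch above) until the stall criterion or the maximum number of iterations is met, and is not part of the reference implementation.

```python
import numpy as np

def rastrigin(x):
    """Standard Rastrigin benchmark: global minimum f(x*) = 0 at x* = 0."""
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

def is_success(x_est, x_star, tol=0.25):
    """Success criterion (5.1): sup-norm distance from the minimizer below tol."""
    return np.max(np.abs(x_est - x_star)) < tol

def success_rate(run_kbo, cost=rastrigin, d=20, runs=100, tol=0.25):
    """Repeat the experiment `runs` times; run_kbo(cost, d) is assumed to return
    the final macroscopic best estimate and the number of iterations performed."""
    x_star = np.zeros(d)
    wins, iters = 0, []
    for _ in range(runs):
        x_est, n_it = run_kbo(cost, d)
        wins += int(is_success(x_est, x_star, tol))
        iters.append(n_it)
    return wins / runs, float(np.mean(iters))
```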

The results for the local best only, in the first row of Figures 1 and 2, suggest that there are no great differences in terms of success rate between the two algorithms, even if the choice \(\varepsilon =0.1\) seems to be the best compromise. On the other hand, for \(\varepsilon =1\) and \(\varepsilon =0.1\) Bird's algorithm needs a slightly smaller number of iterations to reach convergence. The second row presents the results for the global best only. In general, decreasing the value of \(\varepsilon \) enlarges the interval in which the parameter \(\sigma _2\) can be chosen, but at the same time this interval is shifted to the right, meaning that the algorithm needs more noise in order to explore the search domain and identify the global minimum. It is also clear from Figures 1 and 2 that the convergence basin with only the local best is significantly smaller than that with only the global best. This is in agreement with the theoretical results of Sect. 4.

Fig. 3

Minimization of Rastrigin function for KBO based on Nanbu’s algorithm. From left to right: success rate and average iterations number using both local and global best. Top row refers to the optimal value for the global best, while the bottom one refers to the optimal value for the local best

Fig. 4

Minimization of Rastrigin function for KBO based on Bird’s algorithm. From left to right: success rate and average iterations number using both local and global best. Top row refers to the optimal value for the global best, while the bottom one refers to the optimal value for the local best

Note that the convergence region for Nanbu's algorithm is slightly wider and that, as in the previous case, Bird's algorithm needs a smaller number of iterations to reach convergence.

Figures 3 and 4 refer to the case in which both the microscopic and the macroscopic estimates are used in the procedure. In the first row, \(\sigma _2\) has been chosen as the optimal value that provided the best success rate in the previous experiment; in the second row the same strategy is applied to \(\sigma _1\). The plots show that the performance drastically improves for certain options (see in particular Fig. 3a) and that the required number of iterations decreases too. This result is also in agreement with the theoretical analysis at the end of Sect. 4, which indicates an enlargement of the basin of convergence of the method based on the microscopic best when used in combination with the macroscopic best. As a final comment, we mention that Bird's algorithm, thanks to the multiple interactions, produces fewer fluctuations in the numerical solution compared to Nanbu's algorithm. This is well known in rarefied gas dynamics, where the algorithms have their origins [42]. In our specific case, this translates into slightly narrower convergence regions and slightly faster convergence rates.

5.3 Comparison with Stochastic Gradient Descent

Next, we consider a test case to compare the proposed KBO algorithms with the classical Stochastic Gradient Descent (SGD). While the main interest in a gradient-free method lies in situations where gradient computation is either not possible or particularly expensive, the purpose of this simple numerical test, originally introduced in [14], is to illustrate the potential advantages of a consensus-based method even in circumstances where the gradient is available but easily gets trapped in local minima, preventing the identification of the global minimum.

Following [14], we want to minimize the function

$$\begin{aligned} L(x) = \frac{1}{n}\sum _{i=1}^n f(x,\xi _i) \end{aligned}$$
(5.2)

where

$$\begin{aligned} f(x,\xi _i) = \exp \left( \sin (2x^2)\right) + \frac{1}{10}\left( x-\xi _i-\frac{\pi }{2}\right) ^2, \quad \xi _i\sim \mathcal {N}(0,0.01) \end{aligned}$$

The plot of (5.2) together with its minimum \(f(x^\star )\) at \(x^\star =1.5353\) (with \(n=10{,}000\)) is shown in Fig. 5.

Fig. 5

Plot of (5.2). The orange dot refers to the minimum of the function, the shaded area to the basin of attraction for SGD, and \(x_1\) and \(x_2\) to the position of the peaks of the basin

The SGD procedure is shown in Algorithm 3: this algorithm implements the idea of minibatches, which consists of dividing the set \(\{\xi _i\}_{i=1,\ldots ,n}\) (the equivalent of a training set in Machine Learning problems) into n/m smaller subsets, where m is the size of each subset, and then using the descent direction given by the average of the m gradients computed at the current iterate. Exploring the whole set \(\{\xi _i\}_{i=1,\ldots ,n}\) is called an epoch, and one can decide to iterate the procedure for several epochs. The parameter \(\gamma \) chosen in Algorithm 3 is the stepsize, called the learning rate in the Machine Learning framework.

[Algorithm 3: Stochastic Gradient Descent with minibatches]
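A self-contained Python sketch of this minibatch SGD procedure applied to (5.2) is reported below; the analytic gradient of f is spelled out, while the stopping rule on the minibatch gradient and the reading of 0.01 as the variance of the \(\xi _i\) are our assumptions, so the routine is illustrative rather than the exact code used in the experiments.

```python
import numpy as np

def grad_f(x, xi):
    """Gradient in x of f(x, xi) = exp(sin(2x^2)) + (x - xi - pi/2)^2 / 10."""
    return np.exp(np.sin(2 * x**2)) * np.cos(2 * x**2) * 4 * x + (x - xi - np.pi / 2) / 5

def sgd(xi, x0, gamma=0.1, m=100, tol=0.01, max_epochs=1, rng=None):
    """Minibatch SGD: the samples xi are shuffled and split into batches of size m;
    the iterate follows the averaged minibatch gradient and stops when this
    averaged gradient falls below tol (illustrative stopping rule)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0
    for _ in range(max_epochs):
        for batch in np.array_split(rng.permutation(xi), max(1, len(xi) // m)):
            g = np.mean(grad_f(x, batch))
            if abs(g) < tol:
                return x
            x -= gamma * g
    return x

# Example usage matching the test setup: n = 10,000 samples, xi_i ~ N(0, 0.01)
# (0.01 read as the variance), starting point uniform in [-3, 3].
rng = np.random.default_rng(1)
xi = rng.normal(0.0, 0.1, 10_000)
x_final = sgd(xi, x0=rng.uniform(-3.0, 3.0), gamma=0.1, m=100)
```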

We minimize the function given in (5.2) with \(n=10{,}000\) by using both SGD and the proposed KBO algorithm: the former is set with \(\gamma =0.1\), \(m=100\), a number of epochs equal to one, and the procedure is stopped when \(|\nabla f(x^k)|<\varepsilon \), with \(\varepsilon =0.01\); the settings for KBO can be found in Table 1, with the additional parameter \(\delta _{ \text{ stall }}=10^{-4}\). For SGD the starting point is uniformly chosen in \([-3,3]\); for KBO the initial 20 particles are chosen in the same interval. We run 1000 simulations for SGD and 50 simulations for KBO: this is due to the equivalence of 20 runs of SGD to one run of KBO. Indeed, the former case is equivalent to considering 20 different particles and then pursuing the minimization of the function independently on each particle. A simulation is considered successful for SGD if and only if the final iterate \(x_\alpha ^*\) satisfies \(|x_\alpha ^*-x^\star |<0.25\); for each simulation of KBO we count how many particles (in percentage) lie in the open ball \(\mathcal {B}_{0.25}(x^\star )\), i.e. how many particles reached a consensus around the actual solution. Table 1 collects the average of this consensus among the simulations.

Table 1 Performances of SGD and KBO

As shown, for this test case the KBO algorithms outperform the SGD method: even with a small number of particles (\(N_p=20\)), the minimum of the function is well recovered. The low success rate of SGD is not surprising: being a descent method without momentum, it hugely suffers from the presence of many local minima and from the choice of the initial position. The success rate of 18% is very close to the probability of randomly choosing the initial iterate in the interval containing the actual minimum, shaded in gray in Fig. 5: \(|x_2-x_1|/6=0.1833\). Enlarging or reducing the interval in which the initial point is chosen increases or decreases the success rate of SGD accordingly, while KBO does not seem to suffer from this problem. In conclusion, we observe that the implementation via Nanbu's method leads to a higher success rate and that, in general, Bird's method requires larger exploration parameters \(\sigma _1\) and \(\sigma _2\). The latter aspect is in agreement with the lower statistical fluctuation of Bird's method, already observed in the previous test case: this property is advantageous in the simulation of physical particles in the context of rarefied gas dynamics, but can prove counterproductive in the case of minimum search problems. For this reason, in the following, we will limit the presentation of subsequent numerical tests to the use of Nanbu's algorithm.

5.4 Results on High Dimensional Benchmark Functions

This section is devoted to test the performance of the KBO approach on classical benchmark functions in a high dimensional framework (\(d=50\)). The related optimization problems have been solved by using a common set of parameters for KBO algorithm

$$\begin{aligned} \lambda _1=\lambda _2=1,\quad \sigma _1=0.1,\quad \sigma _2=6,\quad \varepsilon =0.01,\quad n_{ \text{ stall }}=500,\quad \delta _{ \text{ stall }} = 10^{-4} \end{aligned}$$
(5.3)

and the maximum number of iterations is fixed to 10,000. The numerical implementation of the KBO approach relies on Algorithm 1. Table 2 presents the results obtained on the functions listed in Appendix A.

Table 2 presents the success rate defined as in (5.1)

$$\begin{aligned} {\Vert x_\alpha ^*-x^\star \Vert _\infty \le \delta }, \end{aligned}$$

where \(\delta \) controls the severity of the criterion. We chose two different values, namely 0.25 and 0.1. We also computed the average number of iterations needed to achieve convergence. These results are obtained via 100 runs of each instance of the optimization problems. Two further performance measures are reported: the former is the expected error in Euclidean norm, defined as \( \mathbb {E}\left[ | x^*_\alpha -x^\star |\right] \), where \(x^\star \) is the solution and \(x^*_\alpha \) is the global estimate given by the KBO procedure for a successful run; the latter is the function value obtained at \(x^*_\alpha \).

Table 2 Performance of KBO on benchmark functions in dimension \(d=50\)

We employed here a strategy to dynamically reduce the number of particles used in the procedure. Indeed, as observed in [23], a constant number of particles is not optimal: as the dynamics evolves, the variance of the system diminishes due to consensus. We may then reduce the number of particles according to this decrease in variance, using the following strategy: compute the variance \(S_t\) of the system at time t

$$\begin{aligned} S_t = \frac{1}{N_t}\sum _{i=1}^{N_t}{|}v_i^{(t)} - \bar{v}{|}^2, \quad \bar{v}=\frac{1}{N_t}\sum _{i=1}^{N_t}v_i^{(t)} \end{aligned}$$

where \(N_t\) is the number of particles at time t. As the consensus increases, the variance decreases, \(S_{t+1}\le S_t\), and the number of particles can be reduced according to the ratio \(\hat{S}_{t+1}/S_t\le 1\), using the formula

$$\begin{aligned} N_{t+1} = \left\lfloor N_t\left( 1 + \mu \,\frac{\hat{S}_{t+1} - S_t}{S_t}\right) \right\rfloor \end{aligned}$$
(5.4)

with \(\lfloor x \rfloor \) denoting the integer part of x and

$$\begin{aligned} \hat{S}_{t+1} = \frac{1}{N_t}\sum _{i=1}^{N_t}{|}v_i^{(t+1)} - \hat{v}|^2, \quad \hat{v}=\frac{1}{N_t}\sum _{i=1}^{N_t}v_i^{(t+1)}. \end{aligned}$$

For \(\mu =0\) the discarding procedure is not employed, while for \(\mu =1\) the maximum speed up is achieved. For \(\mu >0\), a minimum number of particles \(N_{\min }\) is set and the reduction procedure is applied every \(t_r\) iterations. For further practical details, the interested reader may refer to [23]. In the experiments presented in Table 2, we set \(\mu =0.1\) for \(\delta =0.25\) and \(\mu =0.03\) for \(\delta =0.1\), \(t_r=10\) and \(N_{\min }=10\).
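A possible NumPy rendering of this reduction rule is sketched below; the update follows (5.4), while the choice of which particles to discard (here: uniformly at random) and the function names are our own assumptions, and the routine would be invoked only every \(t_r\) iterations.

```python
import numpy as np

def reduce_particles(V_old, V_new, mu=0.1, n_min=10, rng=None):
    """Variance-based particle reduction in the spirit of (5.4): compare the
    variance of the current ensemble with that of the updated one (computed on
    the same number of particles) and shrink the ensemble accordingly."""
    rng = np.random.default_rng() if rng is None else rng
    N = V_old.shape[0]
    S_t = np.mean(np.sum((V_old - V_old.mean(axis=0))**2, axis=1))
    S_hat = np.mean(np.sum((V_new - V_new.mean(axis=0))**2, axis=1))
    if S_t == 0.0:                                   # full consensus already reached
        return V_new[:n_min]
    N_next = int(N * (1.0 + mu * (S_hat - S_t) / S_t))   # integer part as in (5.4)
    N_next = min(max(N_next, n_min), N)
    keep = rng.choice(N, size=N_next, replace=False)     # drop the excess particles at random
    return V_new[keep]
```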

The initial distribution of the particles is uniform in the cube \([-1,1]^d\), while the initial number of particles is set to 2000. A rescaling strategy is adopted for the evolution of the dynamics: before computing the function values, the particles are rescaled into the benchmark search domain. For example, in the case of the Griewank function the candidates are initially drawn uniformly from \([-1,1]^d\): to compute the function values at these candidates, the latter are rescaled into \([-600,600]^d\), and these values are then used in the subsequent computation of \(v_\alpha \) and \(v_\beta \).
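For instance, the rescaling step can be a simple affine map from the normalized cube to the benchmark box; a minimal sketch (the Griewank bounds are used as an example, names are illustrative) is

```python
def rescale(v, lo=-600.0, hi=600.0):
    """Map a particle (or an array of particles) from the normalized cube
    [-1, 1]^d to the benchmark search box [lo, hi]^d; only these rescaled copies
    are passed to the objective, the dynamics itself stays in [-1, 1]^d."""
    return lo + (v + 1.0) * (hi - lo) / 2.0
```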

Table 2 shows that the success rate is very high and the error is very low for almost all of the benchmark functions. The average number of particles decreases, reaching one tenth of the initial number in some cases, reducing both the overall computational cost and time. Nonetheless, lowering the parameter \(\mu \) induces a higher success rate even with a stricter criterion (\(\delta =0.1\)): this amounts to using a larger number of particles, but at the same time it lowers the number of iterations in most cases. The trade–off to be considered is between computational time and computational cost: this assessment should be made case by case, since it depends on the function to minimize.

5.5 Application to a Machine Learning Problem

In the last test case, we apply the KBO technique to a classical problem of Machine Learning: the goal is to recognize the handwritten digits contained in the images of the MNIST data set, by using a shallow network

$$\begin{aligned} f(x;W,b) = \mathrm{softmax} \left( \mathrm{ReLU} \left( Wx+b\right) \right) \end{aligned}$$

where \(x\in {\mathbb {R}}^{784}, W\in {\mathbb {R}}^{10\times 784}\), \(b\in {\mathbb {R}}^{10}\). Moreover

$$\begin{aligned} \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum _j e^{x_j}}\,, \quad \mathrm{ReLU}(x) = \max (0,x)\,, \end{aligned}$$

where ReLU denotes the well–known Rectified Linear Unit function. The training of the shallow network consists in minimizing the following function

$$\begin{aligned} L(X,y;f)=\frac{1}{n}\sum _{i=1}^n \ell \left( f(X^{(i)};W,b), y^{(i)}\right) , \quad \ell (x,y) = -\sum _{i=1}^{10}y_i\log (x_i) \end{aligned}$$

where X is the training dataset, whose images are vectorized (\(\mathbb {R}^{28\times 28}\rightarrow \mathbb {R}^{784}\)) and stacked column–wise. The function \(\ell \) is the cross entropy.
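For reference, a direct NumPy transcription of the model and of the empirical risk above could look as follows (shapes follow the text: \(W\in {\mathbb {R}}^{10\times 784}\), \(b\in {\mathbb {R}}^{10}\), images stacked column-wise; the numerical-stability shift in the softmax and the small constant in the logarithm are our additions).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    """Column-wise softmax with the usual max-shift for numerical stability."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(X, W, b):
    """Shallow network f(x; W, b) = softmax(ReLU(W x + b)) applied to a batch:
    X has shape (784, n), W is (10, 784), b is (10,)."""
    return softmax(relu(W @ X + b[:, None]))

def cross_entropy(P, Y, eps=1e-12):
    """Empirical risk L = -(1/n) sum_i sum_k Y[k, i] log P[k, i], with Y the
    one-hot labels; eps guards against log(0)."""
    return -np.mean(np.sum(Y * np.log(P + eps), axis=0))
```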

We adopt a minibatch strategy both for the training set and for the particles used in KBO. The former follows the classical strategy, depicted also in Algorithm 3, while the latter divides the particle set into \(N_p/{m_p}\) minibatches, where \(N_p\) is the total number of particles and \({m_p}\) is the number of particles in each batch. The KBO procedure is then iterated over the training batches. The final strategy is depicted in Algorithm 4.

[Algorithm 4: KBO with minibatches of training data and particles]
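A schematic sketch of this double minibatch loop is reported below; the KBO update acting on a particle batch is abstracted into a single callable (for instance, a binary interaction step as in Sect. 5.1), \(m_p\) is read here as the number of particles per batch as described above, and all names are illustrative rather than the reference implementation of Algorithm 4.

```python
import numpy as np

def kbo_train(particles, X, Y, kbo_update, loss, epochs=20, m=128, m_p=5, rng=None):
    """Double minibatch strategy: at each epoch the training data are shuffled and
    split into batches of size m; for every data batch the (shuffled) particle
    ensemble is split into batches of m_p particles, and each particle batch is
    evolved by one KBO update using the loss on the current data batch."""
    rng = np.random.default_rng() if rng is None else rng
    n, N_p = X.shape[1], particles.shape[0]
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // m)):
            Xb, Yb = X[:, idx], Y[:, idx]
            cost = lambda theta: loss(theta, Xb, Yb)   # objective on the current data batch
            for p in np.array_split(rng.permutation(N_p), max(1, N_p // m_p)):
                particles[p] = kbo_update(particles[p], cost)
    return particles
```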

At each epoch, the training dataset is shuffled in order to have different elements inside the batches. When exploring the current training batch, the particles are shuffled too. For our experiment, we used a dataset (see Footnote 1) with 10,000 images, 1000 per class, for training and 10,000 images, 1000 per class, for validation. We compared the SGD method and KBO, both set with 20 epochs and a minibatch size of 128; all the images in the training set have been normalized via zero centering and dividing by the standard deviation computed over the entire dataset. The learning rate for SGD is set to \(\gamma =0.1\), without momentum, with the starting point randomly selected from a Gaussian distribution of zero mean and unit variance. The settings for KBO are \(\sigma _1 =\sigma _2=1, \lambda _1=\lambda _2=1\), \(\varepsilon =\mathrm{d}t = 0.1, \alpha =\beta =5\cdot 10^6\), and we selected \({m_p=5}\) batches and \(N_p = 500\) particles. The initial candidates are randomly picked from a Gaussian distribution with zero mean and unit variance.

Fig. 6

Performance comparison between SGD and KBO. The line referring to SGD shows the average over 500 simulations. The orange line refers to the KBO where both the microscopic and the macroscopic estimate are employed. The plot on the left depicts the performance of the KBO approach using \(N_p=500\) without any particle reduction strategy (the solid line is a smooth representation of the shaded one), while the plot on the right refers to the adoption of Eq. 5.4 with \(\mu =0.1\) and different choices of particle numbers \(N_p\) and particle batches \(m_p\). The average number of particles is denoted by \(N_a\) (Color figure online)

We run 500 simulations for SGD, since these runs are equivalent to one simulation of KBO with 500 particles. Figure 6a shows the accuracy obtained on the validation set over the epochs. For computing the accuracy achieved by KBO, the parameters of the neural network are set to the macroscopic estimate reached at each iteration. The line referring to SGD corresponds to the average accuracy over the 500 simulations. In the numerical tests, the results obtained through the KBO method were shown to be superior in terms of accuracy to those obtained with classical SGD. A further test shows that the diminishing particle strategy depicted in Eq. 5.4 is very effective even in this context: starting with 500 particles and setting \(\mu =0.1\), the entire computation ends with just 270 particles, yielding a remarkable speed up in terms of computational time (see Fig. 6b). Besides the reduction strategy, several couplings of particle number and batch size have been tested in Fig. 6b: all of these settings lead to reliable results. Moreover, as already observed in Sect. 5.3, SGD is quite sensitive to the starting point, whereas KBO is able to reach similar performances with different initializations, as shown in Fig. 7.

Fig. 7

Comparison of SGD and KBO performances when the starting point and the particles are randomly chosen as realizations of a Gaussian distribution of zero mean and standard deviation equal to 10. KBO is set to employ the strategy depicted in Eq. 5.4 with \(\mu =0.1\). The initial number of particles is \(N_p=500\)

6 Conclusions

In this work we have presented a new gradient-free method based on a kinetic dynamics characterized by binary interactions between particles. Unlike previously introduced consensus-based optimization (CBO) methods, the binary interaction process in the limit of a large number of particles does not correspond to a mean-field dynamics but to a Boltzmann-type dynamics inspired by classical kinetic theory. To our knowledge these are the first metaheuristic algorithms based on a Boltzmann-like dynamics for the identification of the global minimum. Compared to CBO methods, the kinetic theory based optimization (KBO) method introduced here can be seen as a mathematical formalism related to the use of mini-batches of interacting particles of size 2. The KBO method uses both local binary information and global information to explore the search space. In both cases, we have been able to prove convergence to the global minimum under reasonable assumptions on the objective function, using techniques inspired by those introduced in [14].

The numerical experiments reported have demonstrated the excellent performance of the KBO technique both in the case of high dimensional problems with benchmark test functions and in the case of applications to machine learning. It is remarkable that the method can achieve good success rates also in the cases where no global information is used in the dynamics, namely when communication is limited to particles interacting in pairs. In this case, convergence to the global minimum can be seen as an emergent phenomenon of a very simple dynamics where particles are not forced to converge towards a collective estimate of the global minimum.

On the other hand, from a mathematical viewpoint, the case with only local information is more difficult, and convergence to the global minimum requires more restrictive conditions on the parameters. These restrictions, however, become less stringent as soon as the method is used in combination with global information. In future work we plan to turn our attention more specifically to the analysis of the Monte Carlo algorithms used in the KBO implementation and to the possible extension of the present methodology to non-homogeneous dynamics in the spirit of particle swarm optimization, as in [28].