1 Introduction

Partially motivated by the success of machine learning methods, which involve the minimization of high-dimensional and strongly non-convex objectives, the interest in consensus-based methods which do not rely on first-order gradient information has steadily increased in recent years. Typically, zero-order optimization methods either construct a surrogate of the gradient and then perform a gradient-descent-type update [1] or use a swarm model [2, 3] where particles are attracted to the particle in the swarm with the lowest objective value. Notably, the former approach can lead to problems for strongly non-convex objectives with many critical points. This issue is circumvented by swarm models; however, they typically do not admit a mean-field description, which makes their mathematical analysis difficult.

In contrast, consensus-based methods aim to achieve consensus by letting particles \(\{x^{(i)}\}_{i=1}^J\) explore the objective landscape while attracting them to the weighted average of their positions with respect to the Gibbs measure \(\pi \propto \exp (-\beta V)\) of the objective function \(V:\mathbb {R}^d\rightarrow \mathbb {R}\), where \(\beta >0\) is an inverse heat parameter. This method was first introduced in [4] as consensus-based optimization (CBO). As opposed to most other swarm methods, CBO has a mean-field formulation involving a nonlinear Fokker–Planck equation. In [5] convergence to consensus of this equation was first proved, and [6] showed consensus formation of the discrete particle method. The key property for showing that consensus is achieved close to the global minimum of the objective V is that \(\exp (-\beta V)\) concentrates around the global minimizer of V as \(\beta \rightarrow \infty \). More recently, [7] presented an improved convergence analysis which weakens some of the assumptions in [5] and directly proves convergence to a point close to the global minimum in the Wasserstein-2 distance.

Fig. 1: Dynamics of standard and the proposed polarized CBO for minimizing the Himmelblau function. The points mark particle locations, the arrows the drift field towards the weighted means

Following up on the original formulation of CBO, numerous extensions were suggested. In [8] an additional drift term, modelling the time-average of the personal best of every particle, is proposed. In [9] an extension of the CBO method for constrained problems is suggested, and [10] adapted the method for high-dimensional problems from machine learning by introducing random batching and changing the noise model. In a different line of work, CBO methods on hypersurfaces were studied in [11] and applications to machine learning were investigated in [12]. In [13] CBO was enriched by ensemble-based gradient information which comes at low computational cost and can improve upon the baseline method. For an overview of recent developments we also refer to the review [14]. Consensus-based methods have also been transferred to sampling. The work [15] proposes a consensus-based sampling (CBS) method by changing the noise term in CBO to include a weighted sample covariance. This prevents a collapse of the ensemble to full consensus and under suitable assumptions the method was shown to converge to a Gaussian approximation of the distribution \(\exp (-V)\).

While existing consensus-based methods have proven to work very well for non-convex objectives with many spurious local extreme points, they all suffer from the conceptual drawback that by design they can approximate at most one minimum, or one mode in the context of sampling. Therefore, the goal of this paper is to design consensus-based particle dynamics which support multiple consensus points, or in other words, polarization.

The central idea for the method which we present in this paper is based on the following thought experiment: Assuming that two clusters of particles have formed, each one centered around a global minimum, we do not want to compute a shared weighted mean of their positions, since this would pull one of the clusters into the other one. Therefore, we replace the weighted mean—being an integral component of current consensus-based methods—by a collection of localized means which additionally weight particle positions by their proximity to the considered particle. The localization is achieved by a suitable kernel function. This leads to polarized dynamics where multiple consensus points can be reached, which opens up the possibility of finding multiple global minima or multiple modes, respectively. Standard CBO, on the other hand, is bound to converge to a single minimum.

As we shall discuss in more detail later, our approach carries strong similarities to bounded confidence models of opinion formation, introduced in [16]; see also [17,18,19]. There, agents only interact with each other if their opinions are sufficiently close, which can eventually lead to the formation of multiple consensus points and is referred to as polarization of opinions in [17].

Main contributions

The focus of this paper is the development of the novel polarized consensus-based dynamics and an extensive numerical evaluation of the method. The mathematical models we propose also raise a number of new theoretical questions, which will be the subject of future investigations. The main contributions and structure of this paper can be summarized as follows:

  • Section 2.1: We propose a novel polarized computation of weighted means for CBO methods.

  • Section 2.2: We propose an algorithmic variant which uses a predetermined number of cluster points to compute weighted means.

  • Section 2.3: We propose a novel polarized computation of weighted covariances for the CBS method and prove that it is unbiased for Gaussian targets.

  • Section 2.4: We prove convergence of the mean-field dynamics in the Wasserstein-2 distance for sufficiently well-behaved objective functions (Theorem 4).

  • Section 3: We conduct extensive numerical and statistical evaluations of our polarized optimization method, showing that it can find multiple global minima in low- and high-dimensional optimization problems and can even improve upon standard CBO for the detection of a single minimum. We also test our polarized CBS method for sampling from Gaussian mixtures and from a non-Gaussian distribution, where it exhibits better performance than standard CBS.

2 Models

In this section we describe in detail how standard consensus-based models work and how we generalize them with our polarized approach. For this, we let \(V:\mathbb {R}^d\rightarrow [0,\infty )\) be a (possibly non-convex) objective function and \(\beta >0\) an inverse heat parameter.

2.1 Polarized consensus-based optimization

Given a measure \(\rho \in {\mathcal {M}}(\mathbb {R}^d)\), for standard CBO one defines a weighted mean as

$$\begin{aligned} \textsf{m}_{\beta }[\rho ] := \frac{\int y \exp (-\beta V(y))\,\textrm{d}\rho (y)}{\int \exp (-\beta V(y))\,\textrm{d}\rho (y)}. \end{aligned}$$
(1)

Here and in the rest of the paper all integrals will be over \(\mathbb {R}^d\). The consensus-based optimization method from [4] then amounts to solving the following system of stochastic differential equations (SDEs) for particles \(\{x^{(i)}\}_{i=1,\dots ,J}\):

$$\begin{aligned} \,\textrm{d}x^{(i)} = -(x^{(i)} - \textsf{m}_{\beta }[\rho ])\,\textrm{d}t + \sigma \left| x^{(i)} - \textsf{m}_{\beta }[\rho ] \right| \,\textrm{d}W^{(i)}, \qquad \rho := \frac{1}{J}\sum _{i=1}^J\delta _{x^{(i)}}, \end{aligned}$$
(2)

where \(\{W^{(i)}\}_{i=1}^J\) denote independent Brownian motions and \(\sigma \ge 0\) determines the strength of randomness in the model. The Fokker–Planck equation associated to (2) is the following PDE

$$\begin{aligned} \partial _t \rho _t(x) = {{\,\textrm{div}\,}}\Big (\rho _t(x)(x-\textsf{m}_{\beta }[\rho _t])\Big ) + \frac{\sigma ^2}{2}\Delta \left( \rho _t(x)\left| x - \textsf{m}_{\beta }[\rho _t] \right| ^2\right) . \end{aligned}$$
(3)

As explained above, this dynamical system forces particles to collapse to consensus, meaning that under certain conditions on V the empirical measures \(\rho _t\) converge to \(\delta _{{\tilde{x}}}\) as \(t\rightarrow \infty \), where for \(\beta \rightarrow \infty \) the consensus point \({\tilde{x}}\) converges to the global minimizer of V, see [5, 7].

We now explain our polarized modification of CBO for optimizing objective functions with multiple global minima. Given a measure \(\rho \in {\mathcal {M}}(\mathbb {R}^d)\) and a kernel function \({{\,\mathrm{\textsf{k}}\,}}:\mathbb {R}^d\times \mathbb {R}^d\rightarrow [0,\infty )\) we define the weighted mean

$$\begin{aligned} { \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x) := \frac{\int {{\,\mathrm{\textsf{k}}\,}}(x,y) y \exp (-\beta V(y))\,\textrm{d}\rho (y)}{\int {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho (y)}, \quad x\in \mathbb {R}^d.} \end{aligned}$$
(4)

The corresponding polarized optimization dynamics take the form

$$\begin{aligned} \boxed { \,\textrm{d}x^{(i)} = -(x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)}))\,\textrm{d}t + \sigma \left| x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)}) \right| \,\textrm{d}W^{(i)},} \end{aligned}$$
(5)

where \(\rho := \frac{1}{J}\sum _{i=1}^J\delta _{x^{(i)}}\). The Fokker–Planck equation associated to (5) is the following PDE

$$\begin{aligned} \partial _t \rho _t(x) = {{\,\textrm{div}\,}}\Big (\rho _t(x)(x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x))\Big ) + \frac{\sigma ^2}{2}\Delta \left( \rho _t(x)\left| x - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x) \right| ^2\right) . \end{aligned}$$
(6)

Note that the idea of letting the dynamics of particle \(x^{(i)}\) mainly depend on spatially close particles, as modelled through the kernel \({{\,\mathrm{\textsf{k}}\,}}\), has strong similarities to bounded confidence models of opinion dynamics introduced in [16]. In these models, typically there is no objective function to be minimized and the kernel is of the form \({{\,\mathrm{\textsf{k}}\,}}(x,y):= 1_{\left| x-y \right| \le \kappa }(x,y)\), where \(\kappa >0\) is a so-called confidence level. The simplest such dynamics then take the form

$$\begin{aligned} \frac{\,\textrm{d}x^{(i)}}{\,\textrm{d}t} = - \left( x^{(i)} - \frac{1}{N(x^{(i)})}\sum _{j=1}^J 1_{\left| x^{(i)}-x^{(j)} \right| \le \kappa }x^{(j)}\right) , \end{aligned}$$
(7)

where \(N(x^{(i)}):= \#\{1\le j \le J \,:\,\left| x^{(i)}-x^{(j)} \right| \le \kappa \}\) denotes the number of points which are not farther than \(\kappa \) away from \(x^{(i)}\). Notably, (7) coincides with (5) for the special case of \({{\,\mathrm{\textsf{k}}\,}}(x,y):= 1_{\left| x-y \right| \le \kappa }(x,y)\), a constant objective \(V\equiv const\), and \(\sigma =0\). More generally, dynamics of the form (5) and (7) can be viewed as processes on co-evolving networks or graphs where the weights between different particles \(x^{(i)}\) and \(x^{(j)}\) depend on the kernel \({{\,\mathrm{\textsf{k}}\,}}\) and the loss function V. We refer to [20, 21] for a unified description and the study of mean-field equations for general processes of this form. Furthermore, for \(V=const\) and \(\sigma =0\), a time discretization of Equation (5) via an Euler–Maruyama scheme with stepsize \(\,\textrm{d}t =1\) yields the update

$$\begin{aligned} x^{(i)} \leftarrow x^{(i)} -(x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)})) = \frac{\sum _{j=1}^J {{\,\mathrm{\textsf{k}}\,}}(x^{(i)},x^{(j)})\, x^{(j)}}{\sum _{j=1}^J {{\,\mathrm{\textsf{k}}\,}}(x^{(i)},x^{(j)})}, \end{aligned}$$

which is known as the mean-shift algorithm [22, 23]. In the following, we would like to discuss two important special cases of our model: If one chooses the kernel \({{\,\mathrm{\textsf{k}}\,}}(x,y)=1\) for all \(x,y\in \mathbb {R}^d\), then (4) to (6) simply reduce to the standard CBO setup (1) to (3). Hence, our method is a generalization of CBO. On the other hand, if one chooses the Gaussian kernel \({{\,\mathrm{\textsf{k}}\,}}(x,y):= \exp \left( -\frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \), then the weighted mean \(\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x)\) can be rewritten as

$$\begin{aligned} \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x) := \frac{\int y \exp \left( -\beta V(y) - \frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \,\textrm{d}\rho (y)}{\int \exp \left( -\beta V(y) - \frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \,\textrm{d}\rho (y)}. \end{aligned}$$
(8)

In this case our method can be regarded as standard CBO applied to a spatially varying quadratic regularization of the objective function \(y\mapsto V_x(y)\), defined as

$$\begin{aligned} V_x(y) := V(y) + \frac{1}{2\kappa ^2\beta }\left| x-y \right| ^2. \end{aligned}$$

Note that the central difference between the standard and our method is that the weighted mean (4) depends on the particle position x and is not the same for all particles. Especially for high-dimensional problems it was shown in [10] that the performance of CBO can be significantly improved when using a coordinate-wise noise model. To be precise, they suggest the following replacement in (5)

$$\begin{aligned} \left| x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)}) \right| \,\textrm{d}W^{(i)} \longrightarrow \sum _{n=1}^d \vec e_n \left( x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)})\right) _n \,\textrm{d}W^{(i)}_n, \end{aligned}$$
(9)

where \(\vec e_n\) denotes the nth unit vector in \(\mathbb {R}^d\). This changes the Laplacian term in the corresponding Fokker–Planck equation (6) to

$$\begin{aligned} \Delta \left( \rho _t(x)\left| x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x) \right| ^2\right) \longrightarrow \sum _{n=1}^d \partial _{nn}^2\Big (\rho _t(x)\big (x - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x)\big )_n^2\Big ). \end{aligned}$$

For computing an approximate solution of the stochastic CBO dynamics with either of the two discussed noise models, we employ a standard Euler–Maruyama scheme which we sum up in Algorithm 1. There the function ComputeMean determines the precise variant of CBO that is being used. In Algorithm 2 we specify this function for the proposed polarized CBO which reduces to standard CBO for a constant kernel \({{\,\mathrm{\textsf{k}}\,}}\equiv 1\). In the next section we will discuss an algorithmic variant (see Algorithm 3) of the mean computation (4) which is computationally more efficient and exhibits empirical advantages in high dimensions.

Algorithm 1: General consensus-based optimization

Algorithm 2: ComputeMean for polarized CBO
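
For concreteness, the following is a minimal NumPy sketch of one Euler–Maruyama step of (5) with the Gaussian-kernel polarized mean (8), in the spirit of Algorithms 1 and 2. The function names, the log-space weight computation, and the toy objective are illustrative choices of this sketch and not the reference implementation from the repository mentioned in Sect. 3.

```python
import numpy as np

def polarized_mean(x, V, beta=1.0, kappa=1.0):
    """Polarized weighted means (4) with the Gaussian kernel (8), evaluated at
    all J particles at once; x has shape (J, d), V maps (J, d) -> (J,)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)   # (J, J)
    log_w = -sq_dists / (2 * kappa ** 2) - beta * V(x)[None, :]        # log weights
    log_w -= log_w.max(axis=1, keepdims=True)                          # numerical stability
    w = np.exp(log_w)
    return (w @ x) / w.sum(axis=1, keepdims=True)                      # (J, d)

def polarized_cbo_step(x, V, dt=0.01, sigma=1.0, beta=1.0, kappa=1.0,
                       coordinatewise=False):
    """One Euler-Maruyama step of (5); coordinatewise=True uses the noise model (9)."""
    m = polarized_mean(x, V, beta, kappa)
    dW = np.sqrt(dt) * np.random.randn(*x.shape)
    if coordinatewise:
        noise = sigma * (x - m) * dW                                   # replacement (9)
    else:
        noise = sigma * np.linalg.norm(x - m, axis=1, keepdims=True) * dW
    return x - (x - m) * dt + noise

# toy usage: J = 50 particles in d = 2 on a simple quadratic objective
if __name__ == "__main__":
    V = lambda x: np.sum(x ** 2, axis=-1)
    x = 3.0 * np.random.randn(50, 2)
    for _ in range(500):
        x = polarized_cbo_step(x, V)
```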

2.2 Cluster-based model

In this section we propose an algorithmic alternative to the weighted mean defined in (4), with the motivation of making the computation of the weighted means more efficient and of encouraging polarizing effects between the different means. Given particles \(x^{(i)}\) for \(i=1,\ldots ,J\), standard CBO uses one weighted mean, whereas our polarized version uses J weighted means.

As an alternative model we consider cluster means \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) for \(j=1,\ldots , J_c\), where \(J_c\le J\). We encode the probability that the particle \(x^{(i)}\) belongs to the cluster mean \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) with \(p_{ij}>0\). Given initial probabilities \(p_{ij}>0\), we perform the following iterative update for each \(j=1,\ldots , J_c\),

$$\begin{aligned} p_{i}^{\text {max}}&:= \max _{j =1,\ldots ,J_c} p_{ij}\quad{} & {} \text { for }i=1,\ldots , J, \end{aligned}$$
(10a)
$$\begin{aligned} r_{ij}&:= \left( \frac{p_{ij}}{p_{i}^{\text {max}}}\right) ^\alpha \quad{} & {} \text { for }i=1,\ldots , J, \end{aligned}$$
(10b)
$$\begin{aligned} {\tilde{p}}_{ij}&:= r_{ij} {{\,\mathrm{\textsf{k}}\,}}(x^{(i)}, {{\,\mathrm{\textsf{c}}\,}}^{(j)}).{} & {} \end{aligned}$$
(10c)

The new values of the probabilities \(p_{ij}\) are then obtained by renormalization over j, i.e.,

$$\begin{aligned} p_{ij} \leftarrow \frac{\tilde{p}_{ij}}{\sum _{j=1}^{J_c} \tilde{p}_{ij}} \end{aligned}$$
(11)

and the cluster means are then updated via

$$\begin{aligned} {{\,\mathrm{\textsf{c}}\,}}^{(j)} \leftarrow \frac{\sum _{i=1}^J x^{(i)} p_{ij} \exp (-\beta V(x^{(i)}))}{\sum _{i=1}^J p_{ij} \exp (-\beta V(x^{(i)}))}. \end{aligned}$$
(12)

Here, \(\alpha \ge 0\) is a discounting coefficient. The interpretation of the above scheme is as follows: If the particle \(x^{(i)}\) identifies the cluster center \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) as “belonging to it the most”, then \(r_{ij}=1\) and the probability that \(x^{(i)}\) indeed belongs to \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) is only determined by the spatial proximity, encoded through \({{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)})\). If, however, it “feels more dedicated” to another cluster mean \({{\,\mathrm{\textsf{c}}\,}}^{(k)}\) (different from \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\)), then \(p_{ij} < p_{ik}\) and thus \(r_{ij} < 1\). This results in a correction of the particle–cluster correspondence, weakening the bond of particle i to cluster j in favour of its preferred cluster. The exponent \(\alpha \) determines the strength of this additional polarization incentive, with larger \(\alpha \) leading to greater polarization. In order to obtain the individual mean for each particle (like in the polarized CBO scheme) we simply compute

$$\begin{aligned} \textsf{m}^{(i)} := \sum _{j=1}^{J_c} p_{ij} {{\,\mathrm{\textsf{c}}\,}}^{(j)}. \end{aligned}$$
(13)

We summarize the cluster-based mean computations in Algorithm 3, which can be used in place of Algorithm 2 in the CBO scheme Algorithm 1.

Algorithm 3: ComputeMean for cluster-based method
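
The following is a minimal NumPy sketch of one pass of the cluster-based mean computation (10)–(13), assuming a Gaussian kernel; the helper names and the random initialization routine (which mirrors the initialization described below) are illustrative and kept deliberately close to the formulas above.

```python
import numpy as np

def cluster_compute_mean(x, V, c, p, beta=1.0, kappa=1.0, alpha=1.0):
    """One pass of (10)-(13) with a Gaussian kernel.
    x: particles (J, d), c: cluster means (Jc, d), p: probabilities (J, Jc).
    Returns the individual means m (J, d) and the updated (c, p)."""
    r = (p / p.max(axis=1, keepdims=True)) ** alpha                    # (10a), (10b)
    sq_dists = np.sum((x[:, None, :] - c[None, :, :]) ** 2, axis=-1)   # (J, Jc)
    p_tilde = r * np.exp(-sq_dists / (2 * kappa ** 2))                 # (10c)
    p = p_tilde / p_tilde.sum(axis=1, keepdims=True)                   # (11)
    w = p * np.exp(-beta * V(x))[:, None]                              # Gibbs weights
    c = (w.T @ x) / w.sum(axis=0)[:, None]                             # (12), (Jc, d)
    m = p @ c                                                          # (13), (J, d)
    return m, c, p

def init_clusters(x, V, Jc, beta=1.0):
    """Random initialization of the probabilities and cluster means."""
    p_tilde = np.random.uniform(size=(x.shape[0], Jc))
    p = p_tilde / p_tilde.sum(axis=1, keepdims=True)
    w = p * np.exp(-beta * V(x))[:, None]
    c = (w.T @ x) / w.sum(axis=0)[:, None]
    return c, p
```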

Note that for \(\alpha =\infty \) we have that

$$\begin{aligned} r_{ij} := {\left\{ \begin{array}{ll} 1,\qquad &{}\text {if }j\in {{\,\mathrm{arg\,max}\,}}_{j=1,\dots ,J_c}p_{ij},\\ 0,\qquad &{}\text {else}, \end{array}\right. } \end{aligned}$$

meaning that the only probability which survives is the one for the likeliest cluster centers. Another interesting special case arises when choosing the Gaussian kernel \({{\,\mathrm{\textsf{k}}\,}}(x,y):= \exp \left( -\frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \). In this case one gets that

$$\begin{aligned} p_{ij}&:= \frac{\exp \left( -\frac{1}{2\kappa ^2}\left| x^{(i)}-{{\,\mathrm{\textsf{c}}\,}}^{(j)} \right| ^2 + \log r_{ij}\right) }{\sum _{j=1}^{J_c} \exp \left( -\frac{1}{2\kappa ^2}\left| x^{(i)}-{{\,\mathrm{\textsf{c}}\,}}^{(j)} \right| ^2 + \log r_{ij}\right) } \\&\longrightarrow {\left\{ \begin{array}{ll} 1\quad &{}\text {if }\left| x^{(i)}-{{\,\mathrm{\textsf{c}}\,}}^{(j)} \right| \le \left| x^{(i)}-{{\,\mathrm{\textsf{c}}\,}}^{(j')} \right| \,\forall j'=1,\dots ,J_c,\\ 0\quad &{}\text {else}, \end{array}\right. } \end{aligned}$$

as \(\kappa \rightarrow 0\). So for very small values of \(\kappa \) a hard assignment of the points \(x^{(i)}\) to the clusters \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) based on spatial proximity is performed which is reminiscent of the k-means algorithm.

It is important to initialize all quantities correctly in order to obtain a meaningful algorithm. If one naively initializes \(p_{ij}:= \frac{1}{J_c}\) for all \(i=1,\dots ,J\) and \(j=1,\dots ,J_c\) and computes initial cluster centers via (12), then \({{\,\mathrm{\textsf{c}}\,}}^{(j)} = \frac{\sum _{i=1}^J x^{(i)} \exp (-\beta V(x^{(i)}))}{\sum _{i=1}^J\exp (-\beta V(x^{(i)}))}\) equals the standard CBO weighted mean for all j. Correspondingly, the probability updates (10) and (11) will leave \(p_{ij}\) untouched. In this case the method reduces precisely to standard CBO. Therefore, we initialize the probabilities randomly by drawing \({\tilde{p}}_{ij} \sim {\text {Unif}}(0,1)\) and normalizing with (11). Then we compute initial cluster means using (12).

Let us also mention that the complexity of CBO where means are computed with Algorithm 3 is of order \(O(J\cdot J_c)\), which is significantly smaller than the \(O(J^2)\) complexity of CBO with the baseline polarized mean computation from Algorithm 2. This is due to the fact that the cluster-based method models consensus and polarization of individuals with respect to existing opinionated groups, with the cluster center as a surrogate for these groups, whereas the polarized CBO approach tracks all particles’ interactions with each other.

Last, we demonstrate how to obtain an SDE and mean-field interpretation of Algorithm 3, where for conciseness we restrict ourselves to the case \(\alpha =0\). In this case (11) and (12) reduce to

$$\begin{aligned} p_{ij} \leftarrow \frac{{{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)})}{\sum _{j=1}^{J_c}{{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)})} , \qquad {{\,\mathrm{\textsf{c}}\,}}^{(j)} \leftarrow \frac{\sum _{i=1}^J x^{(i)} {{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)}) \exp (-\beta V(x^{(i)}))}{\sum _{i=1}^J {{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)}) \exp (-\beta V(x^{(i)}))}. \end{aligned}$$

This allows us to express the cluster-based mean computation with \(\alpha =0\) as the coupled system:

$$\begin{aligned} \,\textrm{d}x^{(i)}{} & {} = -(x^{(i)} - \textsf{m}^{(i)})\,\textrm{d}t + \sigma \left| x^{(i)} -\textsf{m}^{(i)} \right| \,\textrm{d}W^{(i)},\end{aligned}$$
(14a)
$$\begin{aligned} \textsf{m}^{(i)}{} & {} =\frac{\sum _{j=1}^{J_c}{{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)}){{\,\mathrm{\textsf{c}}\,}}^{(j)}}{\sum _{j=1}^{J_c}{{\,\mathrm{\textsf{k}}\,}}(x^{(i)},{{\,\mathrm{\textsf{c}}\,}}^{(j)})}, \end{aligned}$$
(14b)
$$\begin{aligned} {{\,\mathrm{\textsf{c}}\,}}^{(j)}{} & {} = \frac{\int x {{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)})\exp (-\beta V(x))\,\textrm{d}\rho _t(x)}{\int {{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)})\exp (-\beta V(x))\,\textrm{d}\rho _t(x)}, \end{aligned}$$
(14c)
$$\begin{aligned} \rho _t{} & {} := \frac{1}{J}\sum _{i=1}^J \delta _{x^{(i)}}. \end{aligned}$$
(14d)

Note that Algorithm 3 approximates the fixed point equation for \({{\,\mathrm{\textsf{c}}\,}}^{(j)}\) with one iteration of the fixed point map. The corresponding mean-field system is readily obtained as:

$$\begin{aligned} \partial _t \rho _t(x)&= {{\,\textrm{div}\,}}(\rho _t(x)(x - \textsf{m}[\rho _t](x))) + \frac{\sigma ^2}{2}\Delta \left( \rho _t(x)\left| x - \textsf{m}[\rho _t](x) \right| ^2\right) , \end{aligned}$$
(15a)
$$\begin{aligned} \textsf{m}[\rho _t](x)&= \frac{\sum _{j=1}^{J_c}{{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)}){{\,\mathrm{\textsf{c}}\,}}^{(j)}}{\sum _{j=1}^{J_c}{{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)})}, \end{aligned}$$
(15b)
$$\begin{aligned} {{\,\mathrm{\textsf{c}}\,}}^{(j)}&= \frac{\int x {{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)})\exp (-\beta V(x))\,\textrm{d}\rho _t(x)}{\int {{\,\mathrm{\textsf{k}}\,}}(x,{{\,\mathrm{\textsf{c}}\,}}^{(j)})\exp (-\beta V(x))\,\textrm{d}\rho _t(x)}. \end{aligned}$$
(15c)

We expect that one can similarly derive an SDE and mean-field interpretation of the model with \(\alpha >0\), but we do not pursue this here.

2.3 Polarized consensus-based sampling

The last model variant that we consider here is an application to sampling. In [15] a sampling version of CBO was proposed and termed consensus-based sampling (CBS). It relies on the weighted covariance matrix

$$\begin{aligned} \textsf{C}_{\beta }[\rho ]&:= \frac{\int (y-\textsf{m}_{\beta }[\rho ])\otimes (y-\textsf{m}_{\beta }[\rho ]) \exp (-\beta V(y))\,\textrm{d}\rho (y)}{\int \exp (-\beta V(y))\,\textrm{d}\rho (y)}. \end{aligned}$$
(16)

CBS aims to sample from the measure \(\exp (-V)\) by solving the following system of SDEs:

$$\begin{aligned} \,\textrm{d}x^{(i)} = -(x^{(i)} - \textsf{m}_{\beta }[\rho ])\,\textrm{d}t + \sqrt{2\lambda ^{-1}\textsf{C}_{\beta }[\rho ]}\,\textrm{d}W^{(i)}, \qquad \rho := \frac{1}{J}\sum _{i=1}^J\delta _{x^{(i)}}. \end{aligned}$$
(17)

Here the parameter \(\lambda \) interpolates between an optimization method (\(\lambda =1\)) and a sampling method (\(\lambda =(1+\beta )^{-1}\)). For the latter scaling of \(\lambda \) a collapse of \(\rho \) is avoided and \(\rho \) samples from \(\exp (-V)\) if this measure is Gaussian. The Fokker–Planck equation associated with (17) is given by

$$\begin{aligned} \partial _t \rho _t(x) = {{\,\textrm{div}\,}}\Big (\rho _t(x)(x-\textsf{m}_{\beta }[\rho _t])\Big )+ \lambda ^{-1}{{\,\textrm{div}\,}}\Big (\textsf{C}_{\beta }[\rho _t]\nabla \rho _t(x)\Big ). \end{aligned}$$
(18)

We can polarize CBS by using the mean from (4) to define a weighted covariance, as follows:

$$\begin{aligned} { \textsf{C}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x) := \frac{\int {{\,\mathrm{\textsf{k}}\,}}(x,y)(y-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x))\otimes (y-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x)) \exp (-\beta V(y))\,\textrm{d}\rho (y)}{\int {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho (y)},} \end{aligned}$$
(19)

for \(x\in \mathbb {R}^d\). The corresponding CBS dynamics are then given by

$$\begin{aligned} \boxed { \,\textrm{d}x^{(i)} = -(x^{(i)} - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)}))\,\textrm{d}t + \sqrt{2\lambda ^{-1}\textsf{C}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x^{(i)})}\,\textrm{d}W^{(i)},} \end{aligned}$$
(20)

where \(\rho := \frac{1}{J}\sum \nolimits _{i=1}^J\delta _{x^{(i)}}\). The associated Fokker–Planck equation is

$$\begin{aligned} {\partial _t \rho _t(x) = {{\,\textrm{div}\,}}\big (\rho _t(x)(x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x))\big )+\lambda ^{-1}{{\,\textrm{div}\,}}\big (\textsf{C}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x)\nabla \rho _t(x)\big ).} \end{aligned}$$
(21)
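
As an illustration, the following NumPy sketch performs one Euler–Maruyama step of the polarized CBS dynamics (20) with a Gaussian kernel; the eigendecomposition-based matrix square root and all parameter defaults are implementation choices of this sketch, not prescribed by the method.

```python
import numpy as np

def polarized_cbs_step(x, V, dt=0.01, beta=1.0, kappa=1.0, sampling=True):
    """One Euler-Maruyama step of (20); x has shape (J, d).
    sampling=True uses lambda = (1 + beta)^{-1}, otherwise lambda = 1."""
    J, d = x.shape
    lam = 1.0 / (1.0 + beta) if sampling else 1.0
    # normalized kernel-Gibbs weights, computed in log-space for stability
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    log_w = -sq_dists / (2 * kappa ** 2) - beta * V(x)[None, :]
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)                                  # (J, J)
    m = w @ x                                                          # polarized means (4)
    diff = x[None, :, :] - m[:, None, :]                               # y - m_i, (J, J, d)
    C = np.einsum('ij,ijk,ijl->ikl', w, diff, diff)                    # covariances (19)
    x_new = np.empty_like(x)
    for i in range(J):
        # matrix square root of 2 lambda^{-1} C_i via an eigendecomposition
        evals, evecs = np.linalg.eigh(2.0 / lam * C[i])
        root = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        x_new[i] = x[i] - (x[i] - m[i]) * dt + np.sqrt(dt) * root @ np.random.randn(d)
    return x_new
```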

Perhaps a little unexpectedly, we can prove that, just like the Fokker–Planck equation (18) of CBS, our version (21) leaves Gaussian measures invariant for Gaussian kernels of arbitrary width, which is an important consistency property.

Proposition 1

The polarized CBS dynamics in sampling mode with a Gaussian kernel leaves a Gaussian target measure invariant. More precisely, let

\({{\,\mathrm{\textsf{k}}\,}}(x,y):= \exp \left( -\frac{1}{2}(x - y)^T \Sigma _1^{-1}(x - y)\right) \) for a symmetric and positive definite covariance matrix \(\Sigma _1\in {\mathbb {R}}^{d\times d}\), and let

$$\begin{aligned} V(y) := \frac{1}{2}(y - m)^T\Sigma _2^{-1}(y - m) \end{aligned}$$

for some \(m\in \mathbb {R}^d\) and a symmetric and positive definite covariance matrix \(\Sigma _2\in \mathbb {R}^{d\times d}\). Then \(\rho ^\star \) defined as

$$\begin{aligned} \rho ^\star (x) := \exp (-V(x)) \end{aligned}$$

is a stationary solution of the Fokker–Planck equation (21) for any \(\beta >0\) and for \(\lambda =(1+\beta )^{-1}\).

Proof

We use the following formula for the product of two Gaussians, which can be found, for instance, in [24], to obtain

$$\begin{aligned} {{\,\mathrm{\textsf{k}}\,}}(x,y)\exp (-(1+\beta )V(y))&= \exp \left( -\frac{1}{2}(x - y)^T \Sigma _1^{-1}(x - y)\right) \\&\exp \left( -\frac{1+\beta }{2}(y - m)^T\Sigma _2^{-1}(y - m)\right) \\&= c_x \exp \left( -\frac{1}{2}(y - m_x)^T \Sigma _3^{-1} (y-m_x)\right) , \end{aligned}$$

where \(c_x>0\) is a normalization constant and

$$\begin{aligned} \Sigma _3&:= \left( \Sigma _1^{-1} + (1+ \beta )\Sigma _2^{-1}\right) ^{-1}, \\ m_x&:= \Sigma _3\left( \Sigma _1^{-1}x + (1+\beta )\Sigma _2^{-1}m\right) . \end{aligned}$$

Using this we obtain

$$\begin{aligned} \int {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho ^\star (y)&= \int {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-(1+\beta )V(y))\,\textrm{d}y \\&= c_x \int \exp \left( -\frac{1}{2}(y - m_x)^T \Sigma _3^{-1} (y-m_x)\right) \,\textrm{d}y \\&= c_x (2\pi )^\frac{d}{2}\det (\Sigma _3)^\frac{1}{2}. \end{aligned}$$

Similarly, we obtain

$$\begin{aligned} \int y {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho ^\star (y)&= c_x \int y \exp \left( -\frac{1}{2}(y - m_x)^T \Sigma _3^{-1} (y-m_x)\right) \,\textrm{d}y \\&= c_x (2\pi )^\frac{d}{2}\det (\Sigma _3)^\frac{1}{2} m_x. \end{aligned}$$

Combining the two we obtain

$$\begin{aligned} \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x)&= m_x. \end{aligned}$$

Similarly, we can compute the weighted covariance as

$$\begin{aligned} \textsf{C}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x)&= \frac{\int (y-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x))\otimes (y-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x)) {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho ^\star (y) }{\int {{\,\mathrm{\textsf{k}}\,}}(x,y) \exp (-\beta V(y))\,\textrm{d}\rho ^\star (y)} \\&= \frac{c_x \int (y-m_x) \otimes (y-m_x) \exp \left( -\frac{1}{2}(y - m_x)^T \Sigma _3^{-1} (y-m_x)\right) \,\textrm{d}y }{c_x(2\pi )^\frac{d}{2}\det (\Sigma _3)^\frac{1}{2}} = \Sigma _3. \end{aligned}$$

Hence, we get for \(\lambda = (1+\beta )^{-1}\):

$$\begin{aligned}&\rho ^\star (x) (x - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x)) + \lambda ^{-1}\textsf{C}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ^\star ](x)\nabla \rho ^\star (x) \\&\quad = \rho ^\star (x) (x-m_x) + (1+\beta ) \Sigma _3 \nabla \rho ^\star (x) \\&\quad = \rho ^\star (x) (x-m_x) - (1+\beta ) \Sigma _3 \nabla V(x)\rho ^\star (x) \\&\quad = \rho ^\star (x) \left( (x-m_x) - (1+\beta ) \Sigma _3\Sigma _2^{-1}(x-m) \right) . \end{aligned}$$

Now we use the definition of \(m_x = \Sigma _3(\Sigma _1^{-1}x+(1+\beta )\Sigma _2^{-1}m)\) to obtain

$$\begin{aligned}&x - m_x - (1+\beta ) \Sigma _3\Sigma _2^{-1}(x-m) \\&\quad = x - \Sigma _3(\Sigma _1^{-1}x+(1+\beta )\Sigma _2^{-1}m) - (1+\beta ) \Sigma _3 \Sigma _2^{-1}(x-m) \\&\quad = x - \Sigma _3\left( \Sigma _1^{-1} + (1+\beta )\Sigma _2^{-1}\right) x =0, \end{aligned}$$

using that \(\Sigma _3 = \left( \Sigma _1^{-1} + (1+\beta )\Sigma _2^{-1}\right) ^{-1}\). This proves that \(\rho ^\star \) is a stationary solution of the Fokker–Planck equation (21). \(\square \)
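
The final cancellation can also be checked numerically. The following snippet is only a sanity check under randomly drawn positive definite \(\Sigma _1,\Sigma _2\) and arbitrary m, x, not part of the argument:

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 3, 2.0

# random symmetric positive definite Sigma_1, Sigma_2 and arbitrary m, x
A1, A2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Sigma1, Sigma2 = A1 @ A1.T + np.eye(d), A2 @ A2.T + np.eye(d)
m, x = rng.standard_normal(d), rng.standard_normal(d)

Sigma3 = np.linalg.inv(np.linalg.inv(Sigma1) + (1 + beta) * np.linalg.inv(Sigma2))
m_x = Sigma3 @ (np.linalg.inv(Sigma1) @ x + (1 + beta) * np.linalg.inv(Sigma2) @ m)

# the drift/diffusion balance from the last step of the proof should vanish
residual = x - m_x - (1 + beta) * Sigma3 @ np.linalg.inv(Sigma2) @ (x - m)
print(np.linalg.norm(residual))  # ~ 1e-15, i.e. zero up to floating point error
```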

2.4 Towards mean-field analysis

In this section we will make some remarks on the analysis of the Fokker–Planck equation (6) for polarized CBO which we repeat here for convenience:

$$\begin{aligned} \partial _t \rho _t(x) = {{\,\textrm{div}\,}}\Big (\rho _t(x)(x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x))\Big ) + \frac{\sigma ^2}{2}\Delta \left( \rho _t(x)\left| x - \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x) \right| ^2\right) . \end{aligned}$$

We shall first explain why current analytical frameworks for the mean-field analysis of consensus-based optimization do not carry over to this equation. Then, we shall prove convergence as \(t\rightarrow \infty \) of solutions to this equation to a singular measure located at a minimizer, restricting ourselves to the zero temperature limit \((\beta =\infty )\) and to sufficiently nice objective functions V.

Weak solutions of this equation are continuous curves of probability measures \(t\mapsto \rho _t\) such that

$$\begin{aligned} \frac{\,\textrm{d}}{\,\textrm{d}t} \int \phi (x) \,\textrm{d}\rho _t(x)&= - \int \nabla \phi (x)\cdot (x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x)) \,\textrm{d}\rho _t(x)\\&\quad + \frac{\sigma ^2}{2} \int \Delta \phi (x) \left| x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x) \right| ^2\,\textrm{d}\rho _t(x) \end{aligned}$$

holds true for all smooth and compactly supported test functions \(\phi \in C^\infty _c(\mathbb {R}^d)\).

Using the Leray–Schauder fixed point theorem, existence proofs for this equation without a kernel, i.e., \({{\,\mathrm{\textsf{k}}\,}}(x,y)=1\), were given in [5, 7] under mild assumptions on the objective V and for initial distributions with finite fourth-order moment. Taking into account that

$$\begin{aligned} {{\,\mathrm{\textsf{k}}\,}}(x,y)\exp (-\beta V(y)) = \exp \left( -\beta \left( V(y) - \frac{1}{\beta } \log {{\,\mathrm{\textsf{k}}\,}}(x,y)\right) \right) \end{aligned}$$

we expect that under reasonable Lipschitz-like assumptions on the logarithm of the kernel these arguments translate to our case, but we leave this for future work. The biggest challenge is that for standard CBO \(t\mapsto \textsf{m}_{\beta }[\rho _t]\) is a continuous curve in \(\mathbb {R}^d\), whereas \(t\mapsto \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](\cdot )\) is not. Rather, it is a curve in a space of vector fields. This requires more sophisticated compactness arguments than those made in [5, 7].

Besides existence, in the literature there exist two different approaches to proving formation of consensus around the global minimizer of V for the Fokker–Planck equation (3) associated to standard CBO. The first one was presented in [5] and constitutes a two-step approach. First, they prove that the non-weighted standard variance

$$\begin{aligned} V(\rho _t) := \int \left| x - E(\rho _t) \right| ^2\,\textrm{d}\rho _t(x),\qquad \text {where }E(\rho _t) := \int y\,\textrm{d}\rho _t(y), \end{aligned}$$

decreases to zero along solutions \(\rho _t\) of (3). This implies that \(\rho _t\) converges to a Dirac measure \(\delta _{{\tilde{x}}}\) concentrated on some point \({\tilde{x}} \in \mathbb {R}^d\). Second, the Laplace principle is invoked in order to conclude that \({\tilde{x}}\) lies close to the global minimizer of V if \(\beta \) is chosen sufficiently large. This analytical approach heavily uses that the weighted mean \(\textsf{m}_{\beta }[\rho _t]\) for CBO does not depend on the spatial variable x, which is a linearity property that our weighted mean \(\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x)\) does not enjoy. Furthermore, by design our method does not, in general, converge to a single Dirac mass and hence its classical variance does not converge to zero.

A different approach is presented in [7] where the authors propose a more unified strategy. For this they fix a point \(\hat{x}\in \mathbb {R}^d\) (which later will be the global minimizer of V) and define the variance-type function

$$\begin{aligned} \textsf{V}[\rho _t] := \frac{1}{2}\int \left| x-\hat{x} \right| ^2\,\textrm{d}\rho _t(x). \end{aligned}$$
(22)

The authors note that \(\textsf{V}[\rho _t] = \frac{1}{2}{\mathcal {W}}_2^2(\rho _t,\delta _{\hat{x}})\) and so convergence of the variance implies convergence of \(\rho _t\) to \(\delta _{\hat{x}}\) in the Wasserstein-2 distance. They derive a differential inequality for \(\textsf{V}[\rho _t]\) which in combination with the Laplace principle allows them to show some kind of semi-convergence behavior, i.e., for every \(\varepsilon >0\) there exists \(\beta >0\) such that \(\textsf{V}[\rho _t]\) decreases exponentially fast until it hits the threshold \(\textsf{V}[\rho _t]\le \varepsilon \).

When it comes to generalizing this approach to our setting there are two main obstacles. First, in the case of several global minimizers \(\{\hat{x}_i\,:\,1\le i \le N\}\) the Wasserstein-2 distance between \(\rho _t\) and the empirical measure \(\frac{1}{N}\sum _{i=1}^N\delta _{\hat{x}_i}\) does not equal \(\frac{1}{N}\sum _{i=1}^N\int \left| x-\hat{x}_i \right| ^2\,\textrm{d}\rho _t(x)\). Indeed, the latter quantity is a very bad upper bound for the desired Wasserstein-2 distance since it is bounded from below by a positive number.

In the following, we therefore present an alternative approach for analyzing convergence of the Fokker–Planck equation (6) which is based on choosing the Lyapunov function

$$\begin{aligned} \textsf{L}_\beta [\rho _t] := \frac{1}{2}\int \left| x-\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho _t](x) \right| ^2\,\textrm{d}\rho _t(x) \end{aligned}$$

and the reasons for this choice will become clear in a moment. We consider the setting of a Gaussian kernel \({{\,\mathrm{\textsf{k}}\,}}(x,y):= \exp \left( -\beta \frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \) of variance \(\kappa ^2/\beta \) for \(\kappa >0\) and study the limit as \(\beta \rightarrow \infty \). The case of \(\beta <\infty \) can then be treated using the quantitative Laplace principle as, for instance, in [7]. Let us assume that the support of \(\rho \) equals \(\mathbb {R}^d\). In this setting the Laplace principle implies that for \(\beta \rightarrow \infty \) one has

$$\begin{aligned} \textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x) = \frac{\int y\exp \left( -\beta \left( \frac{\left| x-y \right| ^2}{2\kappa ^2}+V(y)\right) \right) \,\textrm{d}\rho (y)}{\int \exp \left( -\beta \left( \frac{\left| x-y \right| ^2}{2\kappa ^2}+V(y)\right) \right) \,\textrm{d}\rho (y)} \rightarrow \mathop {\mathrm {arg\,min}}\limits _{y \in \mathbb {R}^d}\frac{\left| x-y \right| ^2}{2\kappa ^2}+V(y), \end{aligned}$$

where we assume that \(\kappa \) is sufficiently small such that \(y\mapsto \frac{\left| x-y \right| ^2}{2\kappa ^2}+V(y)\) has a unique minimizer for every \(x\in \mathbb {R}^d\). Note that this is possible, for instance, if V is \(C^2\) and the smallest eigenvalue of its Hessian matrix is uniformly bounded from below. Therefore, we consider the limiting dynamics as \(\beta \rightarrow \infty \), governed by the drift field \(x - p(x)\), where

$$\begin{aligned} p(x) := \mathop {\mathrm {arg\,min}}\limits _{y \in \mathbb {R}^d}\frac{\left| x-y \right| ^2}{2\kappa ^2}+V(y) \end{aligned}$$
(23)

is known as the proximal operator of \(\kappa ^2 V\). Correspondingly, the Lyapunov function becomes

$$\begin{aligned} \textsf{L}[\rho _t] := L_\infty [\rho _t] := \frac{1}{2}\int \left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x) \end{aligned}$$

and we would like to emphasize that, using properties of the proximal operator, one has

$$\begin{aligned} \textsf{L}[\rho ] = 0 \iff x\in \mathop {\mathrm {arg\,min}}\limits V\quad \text {for } \rho \text {-almost every }x\in \mathbb {R}^d. \end{aligned}$$
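
For intuition, here is a small sketch that evaluates the proximal operator (23) numerically and computes \(\textsf{L}\) on an empirical measure; the use of scipy.optimize.minimize and the quadratic test loss are assumptions of this sketch only.

```python
import numpy as np
from scipy.optimize import minimize

def prox(x, V, kappa=1.0):
    """Proximal operator p(x) from (23), computed numerically
    (assumes a unique minimizer, e.g. V strongly convex and kappa small)."""
    obj = lambda y: np.sum((x - y) ** 2) / (2 * kappa ** 2) + V(y)
    return minimize(obj, x0=np.asarray(x, dtype=float)).x

def lyapunov(X, V, kappa=1.0):
    """L[rho] = 1/2 * int |x - p(x)|^2 d rho(x) for the empirical measure of X (J, d)."""
    return 0.5 * np.mean([np.sum((x - prox(x, V, kappa)) ** 2) for x in X])

# example: for V(y) = |y|^2 / 2 the proximal operator is p(x) = x / (1 + kappa^2)
V = lambda y: 0.5 * np.sum(y ** 2)
X = np.random.randn(20, 2)
print(lyapunov(X, V, kappa=1.0))
```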

The following analysis treats the model case where the loss function V is strongly convex, sufficiently smooth, and additionally satisfies a certain derivative bound in case we consider a Fokker–Planck equation with non-vanishing diffusion term. While this is just a first step toward a comprehensive analysis of the polarized CBO method for much larger classes of functions, the following results introduce a set of important techniques which—we believe—will be a cornerstone for future analysis. The core ideas of the following analysis are based on discussions of the first author with Massimo Fornasier and Oliver Tse, and an extension to non-convex loss functions and \(\beta <\infty \) is ongoing work.

We start with the following decay property of \(\textsf{L}[\rho _t]\) for solutions of the associated Fokker–Planck equation:

Proposition 2

(Exponential decay of the Lyapunov function) Let \(t\mapsto \rho _t\) be a weak solution of the Fokker–Planck equation

$$\begin{aligned} \partial _t \rho _t(x) = {{\,\textrm{div}\,}}\Big (\rho _t(x)(x-p(x))\Big ) + \frac{\sigma ^2}{2}\Delta \left( \rho _t(x)\left| x - p(x) \right| ^2\right) . \end{aligned}$$
(24)

We pose the following assumptions on the loss function V:

  • If \(\sigma =0\) assume that \(V\in C^2(\mathbb {R}^d)\) and there exists \(\mu >0\) such that

    $$\begin{aligned} D^2 V \succcurlyeq \mu {\mathbb {I}} \end{aligned}$$
  • If \(\sigma >0\) assume additionally that \(V\in C^3(\mathbb {R}^d)\) and satisfies

    $$\begin{aligned} \sup _{x\in \mathbb {R}^d}\left| \nabla V(x) \right| \left| D^3V(x) \right| <\infty . \end{aligned}$$

Then there exist constants \(C_1,C_2>0\) such that, if \(\sigma <C_1\), it holds for all \(t>0\) that

$$\begin{aligned} \frac{\,\textrm{d}}{\,\textrm{d}t}\textsf{L}[\rho _t] \le -C_2\textsf{L}[\rho _t] \end{aligned}$$
(25)

and consequently

$$\begin{aligned} \textsf{L}[\rho _t] \le \textsf{L}[\rho _0]\exp (-C_2 t). \end{aligned}$$
(26)

Proof

Differentiating \(\textsf{L}[\rho _t]\) in time and using the weak form of the PDE implies

$$\begin{aligned} \begin{aligned} \frac{\,\textrm{d}}{\,\textrm{d}t}\textsf{L}[\rho _t]&= -\int \nabla \frac{1}{2}\left| x-p(x) \right| ^2\cdot (x-p(x))\,\textrm{d}\rho _t(x) \\&\qquad + \frac{\sigma ^2}{4} \int \Delta \left| x-p(x) \right| ^2\left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x). \end{aligned} \end{aligned}$$
(27)

We continue by computing the derivatives that appear in this expression. First, it holds

$$\begin{aligned} \partial _i \frac{1}{2}\left| x-p(x) \right| ^2 = \partial _i \sum _j\frac{1}{2}\left| x_j-p_j(x) \right| ^2 = \sum _j(\delta _{ij}-\partial _i p_j(x))(x_j-p_j(x)). \end{aligned}$$
(28)

Next we observe that by definition of the proximal operator p(x) it holds

$$\begin{aligned} 0 = p(x) - x + \kappa ^2\nabla V(p(x)). \end{aligned}$$
(29)

Differentiating this equation yields

$$\begin{aligned} 0 = Dp(x) - {\mathbb {I}} + \kappa ^2 D^2V(p(x)) Dp(x) \end{aligned}$$
(30)

and therefore

$$\begin{aligned} Dp(x)&= \left( {\mathbb {I}} + \kappa ^2 D^2V(p(x))\right) ^{-1}, \end{aligned}$$
(31)

which is a symmetric matrix. Using (28) we get

$$\begin{aligned} -\nabla \frac{1}{2}\left| x-p(x) \right| ^2\cdot (x-p(x)) = -({\mathbb {I}} - Dp(x))(x-p(x)) \cdot (x-p(x)) \end{aligned}$$

For estimating this expression from above it suffices to bound the eigenvalues of \(M:={\mathbb {I}} - Dp(x)\) from below. By assumption we have \(D^2 V \succcurlyeq \mu {\mathbb {I}}\), which implies that \(M\succcurlyeq \left( 1-\frac{1}{1+\kappa ^2\mu }\right) {\mathbb {I}} = \frac{\kappa ^2\mu }{1+\kappa ^2\mu }{\mathbb {I}}\) and therefore we can bound the first term in (27)

$$\begin{aligned} -\nabla \frac{1}{2}\left| x-p(x) \right| ^2\cdot (x-p(x)) \le -\frac{\kappa ^2\mu }{1+\kappa ^2\mu } \left| x-p(x) \right| ^2. \end{aligned}$$
(32)

If \(\sigma =0\) we can already conclude the proof, using Gronwall’s inequality. For \(\sigma >0\) we have to bound the second term in (27), coming from the diffusion. Using (28) and the product rule it follows

$$\begin{aligned} \begin{aligned} \partial _i^2\frac{1}{2}\left| x-p(x) \right| ^2&=-\sum _j\partial _i^2 p_j(x)(x_j-p_j(x)) \\&\qquad + \sum _j(\delta _{ij}-\partial _i p_j(x))(\delta _{ij}-\partial _i p_j(x)). \end{aligned} \end{aligned}$$
(33)

Consequently, we obtain

$$\begin{aligned} \Delta \frac{1}{2}\left| x-p(x) \right| ^2&= \sum _i \partial _i^2\frac{1}{2}\left| x-p(x) \right| ^2 \\&= -\sum _{i,j}\partial _i^2 p_j(x)(x_j-p_j(x)) + \sum _{i,j}(\delta _{ij}-\partial _i p_j(x))(\delta _{ij}-\partial _i p_j(x)) \\&= -\sum _{i,j}\partial _i^2 p_j(x)(x_j-p_j(x)) + d -2\sum _{i}\partial _i p_i(x) + \sum _{i,j} \partial _i p_j(x) \partial _i p_j(x) \\&= -\sum _{i,j}\partial _i^2 p_j(x)(x_j-p_j(x)) + d -2{{\,\textrm{trace}\,}}(Dp(x)) + {{\,\textrm{trace}\,}}(Dp(x)Dp(x)^T) \\&= -\sum _{i,j}\partial _i^2 p_j(x)(x_j-p_j(x)) + d + {{\,\textrm{trace}\,}}(Dp(x)(Dp(x)^T-2{\mathbb {I}})). \end{aligned}$$

It also holds \(0\preccurlyeq Dp(x) \preccurlyeq {\mathbb {I}}\) and therefore \(2{\mathbb {I}} - Dp(x)\succcurlyeq {\mathbb {I}}\). This allows us to estimate

$$\begin{aligned} {{\,\textrm{trace}\,}}(Dp(x)(Dp(x)^T-2{\mathbb {I}})) = -{{\,\textrm{trace}\,}}(Dp(x)(2{\mathbb {I}}-Dp(x))) \le -{{\,\textrm{trace}\,}}(Dp(x)) \le 0. \end{aligned}$$

Going back to the previous formula for the Laplacian, we obtain the estimate

$$\begin{aligned} \Delta \frac{1}{2}\left| x-p(x) \right| ^2 \le -\sum _{i,j}\partial _i^2 p_j(x)(x_j-p_j(x)) + d \end{aligned}$$
(34)

and it remains to bound the first term. For this we need second derivatives of p(x). Writing (30) in coordinates gives

$$\begin{aligned} \partial _i p_j(x) = \delta _{ij} - \kappa ^2 \sum _{r}\partial _{r}\partial _j V(p(x))\partial _i p_r(x). \end{aligned}$$

Taking another derivative with respect to the ith variable and using the product rule yields

$$\begin{aligned} \partial _i^2 p_j(x) = -\kappa ^2\sum _r\partial _r\partial _j V(p(x))\partial _i^2 p_r(x) - \kappa ^2 \sum _{r,s}\partial _s\partial _r\partial _jV(p(x))\partial _i p_s(x)\partial _i p_r(x). \end{aligned}$$

We have to solve this equation for the second derivatives of p for which we define the matrix \(A=A_{ij}:= \partial _i^2 p_j(x)\) and the matrix \(B = B_{ij}:= \sum _{r,s}\partial _s\partial _r\partial _jV(p(x))\partial _i p_s(x)\partial _i p_r(x)\). Then the previous equation is equivalent to the linear system

$$\begin{aligned} A = -\kappa ^2 A D^2V(p(x)) - \kappa ^2 B \end{aligned}$$

which is solved by

$$\begin{aligned} A = -\kappa ^2 B ({\mathbb {I}} + \kappa ^2 D^2V(p(x)))^{-1}. \end{aligned}$$
(35)

Using the definition of A and (34) we get

$$\begin{aligned} \Delta \frac{1}{2}\left| x-p(x) \right| ^2 \le -\kappa ^2 \sum _{i,j} A_{ij}\partial _j V(p(x)) + d \end{aligned}$$

and it remains to uniformly bound the first term.

Using (35), the definition of B as well as (31), we get the following estimate in terms of the matrix/tensor norms

$$\begin{aligned} \Delta \frac{1}{2}\left| x-p(x) \right| ^2 \le C_d \kappa ^4 \left| D^3V(p(x)) \right| \left| \nabla V(p(x)) \right| \left| \left( {\mathbb {I}}+\kappa ^2 D^2V(p(x))\right) ^{-1} \right| ^3 + d, \end{aligned}$$

where \(C_d\) is a dimensional constant. By assumption the right hand side is uniformly bounded by some constant \(C>0\) and going back to (27) we obtain

$$\begin{aligned} \frac{\,\textrm{d}}{\,\textrm{d}t}\textsf{L}[\rho _t]&\le -\frac{\kappa ^2\mu }{1+\kappa ^2\mu }\int \left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x) + \frac{\sigma ^2}{2} C \int \left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x) \\&= -\left( \frac{2\kappa ^2\mu }{1+\kappa ^2\mu }-\sigma ^2 C\right) \textsf{L}[\rho _t]. \end{aligned}$$

Since \(\mu >0\), we can choose \(\sigma >0\) sufficiently small—for instance, \(\sigma ^2<\frac{1}{C}\frac{\kappa ^2\mu }{1+\kappa ^2\mu }\)—such that the brackets are strictly positive. Then we can conclude the proof using Gronwall’s inequality. \(\square \)

Example 1

(The one-dimensional case) In the case of one spatial dimension \(d=1\), we can bound the Laplacian \(\Delta \frac{1}{2}\left| x-p(x) \right| ^2\)—which in this case is just the second derivative—more accurately. In this case \(B=V'''(p(x))(p'(x))^2\) and hence \(A=-\kappa ^2\frac{V'''(p(x))(p'(x))^2}{1+\kappa ^2V''(p(x))}\). Plugging in \(p'(x)=\frac{1}{1+\kappa ^2V''(p(x))}\) we obtain

$$\begin{aligned} A = -\kappa ^2\frac{V'''(p(x))}{(1+\kappa ^2V''(p(x)))^3}. \end{aligned}$$

Hence, we obtain the estimate

$$\begin{aligned} \Delta \frac{1}{2}\left| x-p(x) \right| ^2 \le \kappa ^4 \frac{V'''(p(x))V'(p(x))}{(1+\kappa ^2V''(p(x)))^3} + 1. \end{aligned}$$

We give three examples of loss functions V which satisfy all assumptions of Proposition 2. The first one is the quadratic loss \(V_1(x)=ax^2+bx+c\) with \(a>0,b,c\in \mathbb {R}\) for which it holds \(V_1'''=0\) and hence the Laplacian is bounded by 1. The second one is \(V_2(x)=x^2+\ln (x+\sqrt{1+x^2})\) for which we have (the loose bound) \(\left| V'''V' \right| \le 1\) and so a valid bound for the Laplacian is \(\kappa ^4+1\). The third loss is \(V_3(x)=x^2+x+1-(x+1)\ln (x+1)\) for \(x\ge 0\), extended to an even function on \(\mathbb {R}\). Also here one has (the loose) bound \(\left| V'''V' \right| \le 1\) and again a valid bound for the Laplacian is \(\kappa ^4+1\).

Next we prove a compactness property for measures for which \(\textsf{L}[\rho _t]\) is uniformly bounded in time. For this we also require that the loss V has a Lipschitz continuous gradient, which is a standard assumption in nonlinear optimization.

Proposition 3

Let \(V \in C^2(\mathbb {R}^d)\) satisfy \(\mu {\mathbb {I}}\preccurlyeq D^2 V \preccurlyeq L{\mathbb {I}}\) for some \(0<\mu \le L<\infty \) and assume that

$$\begin{aligned} \sup _{t>0}\textsf{L}[\rho _t] < \infty . \end{aligned}$$

Then there exists a subsequence \((\rho _{t_n})_{n\in \mathbb {N}}\) for \(t_n\rightarrow \infty \) as \(n\rightarrow \infty \) which converges to a probability measure \(\rho _\infty \) in the Wasserstein-2 distance.

Proof

Let \(x^*:=\mathop {\mathrm {arg\,min}}\limits V\). By assumption it holds

$$\begin{aligned} \frac{\mu }{2}\left| p(x)-x^* \right| ^2 + \langle \nabla V(p(x)), x^*- p(x)\rangle + V(p(x)) \le V(x^*) \le V(p(x)) \end{aligned}$$

which implies

$$\begin{aligned} \frac{\mu \kappa ^2}{2}\left| p(x)-x^* \right| ^2 \le \kappa ^2\langle \nabla V(p(x)), x^*- p(x)\rangle . \end{aligned}$$

By definition of p(x) it holds \(\kappa ^2\nabla V(p(x))=x-p(x)\) and therefore

$$\begin{aligned} \frac{\mu \kappa ^2}{2}\left| p(x)-x^* \right| ^2 \le \langle x-p(x),x^*-p(x)\rangle \le \left| x-p(x) \right| \left| x^*-p(x) \right| . \end{aligned}$$

Hence, we have shown that

$$\begin{aligned} \left| p(x)-x^* \right| \le \frac{2}{\mu \kappa ^2} \left| x-p(x) \right| . \end{aligned}$$

Using this as well as the inequality \(\left| a+b \right| ^2\le 2\left| a \right| ^2+2\left| b \right| ^2\) we obtain

$$\begin{aligned} \begin{aligned} \int \left| p(x) \right| ^2\,\textrm{d}\rho _t(x)&\le 2\int \left| x^*-p(x) \right| ^2\,\textrm{d}\rho _t(x) + 2\left| x^* \right| ^2 \\&\le \frac{8}{\mu ^2\kappa ^4} \int \left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x) + 2\left| x^* \right| ^2. \end{aligned} \end{aligned}$$
(36)

By assumption one has

$$\begin{aligned} \sup _{t>0}\int \left| x-p(x) \right| ^2\,\textrm{d}\rho _t(x) < \infty \end{aligned}$$

which, together with change of variables and (36), implies

$$\begin{aligned} \sup _{t>0}\int \left| y \right| ^2\,\textrm{d}(p_\sharp \rho _t)(y)= \sup _{t>0}\int \left| p(x) \right| ^2\,\textrm{d}\rho _t(x)<\infty . \end{aligned}$$

Therefore, by compactness a subsequence \((p_\sharp \rho _{t_n})_{n\in \mathbb {N}}\) of the push-forward measures converges to some probability measure \(\sigma _\infty \) in the Wasserstein-2 distance as \(n\rightarrow \infty \).

First, we claim that under our assumptions the map \(x\mapsto p(x)\) is a biLipschitz homeomorphism with Lipschitz continuous inverse \(y\mapsto p^{-1}(y):=y+\kappa ^2\nabla V(y)\). The fact that these two maps are Lipschitz continuous is obvious, noting that the proximal operator of a convex function is 1-Lipschitz and the map \(p^{-1}\) is Lipschitz because \(\nabla V\) is Lipschitz. Furthermore, we observe that

$$\begin{aligned} p^{-1}(p(x)) = p(x) + \kappa ^2\nabla V(p(x)) = x \end{aligned}$$

by definition of the proximal operator so \(p^{-1}\) is a left inverse. To see that also \(p(p^{-1}(x))=x\) we note that

$$\begin{aligned} p(p^{-1}(x)) = \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^d}\frac{\left| y-x-\kappa ^2\nabla V(x) \right| ^2}{2\kappa ^2}+V(y). \end{aligned}$$

This is a strongly convex optimization problem and hence the optimality conditions

$$\begin{aligned} 0 = y-x-\kappa ^2\nabla V(x) + \kappa ^2\nabla V(y) \end{aligned}$$

are necessary and sufficient. They have the (unique) solution \(y=x\) and therefore \(p(p^{-1}(x))=x\), which shows that \(p^{-1}\) is truly the inverse of p.

Next, we claim that the measures \(\rho _{t_n}\) converge to \(\rho _\infty := p^{-1}_\sharp \sigma _\infty \) in the Wasserstein-2 distance as \(n\rightarrow \infty \). To see this, let \(\pi _n\) denote an optimal coupling of \(p_\sharp \rho _{t_n}\) and \(\sigma _\infty \) such that

$$\begin{aligned} W_2^2(p_\sharp \rho _{t_n},\sigma _\infty ) = \int \int \left| x-y \right| ^2\,\textrm{d}\pi _n(x,y). \end{aligned}$$

We define \({\tilde{\pi }}_n:= (p^{-1}\times p^{-1})_\sharp \pi _n\) and first argue that this is a coupling of \(\rho _{t_n}\) and \(\rho _\infty \). To see this, we compute

$$\begin{aligned} {\tilde{\pi }}_n(A\times \mathbb {R}^d) =\pi _n(p(A)\times p(\mathbb {R}^d)) =\pi _n(p(A)\times \mathbb {R}^d) = p_\sharp \rho _{t_n}(p(A)) = \rho _{t_n}(A) \end{aligned}$$

for every Borel set \(A\subset \mathbb {R}^d\), where we used that p is invertible and that the first marginal of \(\pi _n\) is \(p_\sharp \rho _{t_n}\). Hence, the first marginal of \({\tilde{\pi }}_n\) is \(\rho _{t_n}\). Similarly, the second marginal can be shown to be \(\rho _\infty \) and hence \({\tilde{\pi }}_n\) is a coupling. We therefore obtain

$$\begin{aligned} W_2^2(\rho _{t_n},\rho _\infty )&\le \int \int \left| x-y \right| ^2\,\textrm{d}{\tilde{\pi }}_n(x,y) \\&= \int \int \left| p^{-1}(x)-p^{-1}(y) \right| ^2\,\textrm{d}\pi _n(x,y) \\&\le (1+\kappa ^2 L)^2 \int \int \left| x-y \right| ^2\,\textrm{d}\pi _n(x,y) \\&= (1+\kappa ^2 L)^2\, W_2^2(p_\sharp \rho _{t_n},\sigma _\infty )\rightarrow 0 ,\qquad \text {as }n\rightarrow \infty , \end{aligned}$$

where we used the Lipschitzness of \(\nabla V\) which implies Lipschitzness of \(p^{-1}\). \(\square \)

Theorem 4

Under the conditions of Propositions 2 and 3 and assuming \(\textsf{L}[\rho _0]<\infty \) it holds that

$$\begin{aligned} W_2(\rho _t,\delta _{x^*}) \rightarrow 0\quad \text {as }t\rightarrow \infty , \end{aligned}$$

where \(x^*:=\mathop {\mathrm {arg\,min}}\limits V\).

Proof

By Proposition 2 it holds \(\textsf{L}[\rho _t]\rightarrow 0\) and, in particular, \(t\mapsto \textsf{L}[\rho _t]\) is uniformly bounded. Hence, one can apply Proposition 3 to obtain that a subsequence of \(\rho _t\) converges to some probability measure \(\rho _\infty \) in the Wasserstein-2 distance and hence also weakly. Since \(\rho \mapsto \textsf{L}[\rho ]\) is an integral of the continuous and lower-bounded function \(x\mapsto \frac{1}{2}\left| x-p(x) \right| ^2\) against \(\rho \), it is lower semicontinuous with respect to weak convergence of measures. It follows \(\textsf{L}[\rho _\infty ]=0\) and hence \(\rho _\infty =\delta _{x^*}\), where \(x^*\) is the global minimizer of V. The uniqueness of the minimizer implies that the whole sequence \(\rho _t\) converges to \(\delta _{x^*}\) and this concludes the proof. \(\square \)

3 Numerical examples

In this section we evaluate the numerical performance of the proposed algorithms. In all our experiments we chose a time step parameter of \(\,\textrm{d}t=0.01\). The code to reproduce all numerical experiments is available on GitHub (see Footnote 1).

3.1 Unimodal Ackley function

Fig. 2: Dynamics of standard CBO for minimizing the Ackley function. The points mark particle locations, the arrows the drift field towards the shared weighted mean

Fig. 3: Dynamics of the proposed polarized CBO for minimizing the Ackley function. The points mark particle locations, the arrows the drift field towards the individual weighted means

In the first example we perform a consistency check for our method for finding the unique global minimum of the Ackley function [26], defined as

$$\begin{aligned} A(x) := -20 \exp \left( -\frac{0.2}{\sqrt{d}} \left| x \right| \right) - \exp \left( \frac{1}{d}\sum _{n=1}^d\cos (2\pi x_n)\right) + e + 20. \end{aligned}$$
(37)

This function has a global minimum at \(0 \in \mathbb {R}^d\) with \(A(0)=0\), and in this experiment we choose \(d=2\) and minimize the shifted version \(V(x):= A(x-(3,2))\) which has its global minimum at \((3,2)\in \mathbb {R}^2\). We compare the dynamics of standard CBO with our proposed polarized variant at three different time points in Figs. 2 and 3. We use the Gaussian kernel \({{\,\mathrm{\textsf{k}}\,}}(x,y):= \exp \left( -\frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \) with standard deviation \(\kappa =\infty \) for standard CBO and \(\kappa =1\) for polarized CBO. Furthermore, we choose \(\beta =1\). In this simple situation with a unique global minimum we observe that both standard and polarized CBO find the global minimum and do not get stuck in local minima. Notably, the polarized variant converges slightly slower than standard CBO, which is due to the localization effect of the kernel with a relatively small standard deviation.
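
For reference, a vectorized NumPy implementation of (37), which can be passed directly to the particle sketches from Sect. 2, reads as follows; the shift to \((3,2)\) matches the experiment above.

```python
import numpy as np

def ackley(x):
    """Ackley function (37); x has shape (..., d)."""
    d = x.shape[-1]
    return (-20.0 * np.exp(-0.2 / np.sqrt(d) * np.linalg.norm(x, axis=-1))
            - np.exp(np.mean(np.cos(2 * np.pi * x), axis=-1))
            + np.e + 20.0)

# shifted objective used in this experiment, global minimum at (3, 2)
V = lambda x: ackley(x - np.array([3.0, 2.0]))
```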

3.2 Multimodal Rastrigin function

In this example we evaluate different choices of kernel functions for minimizing a Rastrigin-type function with three global minima. The original Rastrigin function [27] on \(\mathbb {R}^d\) is defined as

$$\begin{aligned} R(x) := 10 d + \sum _{n=1}^d(x_n^2 - 10\cos (2\pi x_n)) \end{aligned}$$
(38)

and has a global minimum at \(x=0\) with \(R(0)=0\). In this experiment we choose \(d=2\) and minimize the product

$$\begin{aligned} V(x) = \frac{1}{8} R(x) \, R(x-(3,2)) \, R(x+(1,3.5)) \end{aligned}$$

which has three global minima. Note that this function is very non-convex and at the same time extremely flat around its minima, see Fig. 4 for a surface plot.

Fig. 4: A variant of the Rastrigin function with three global minima
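
The Rastrigin-type objective used here can be implemented along the same lines; the vectorization convention matches the Ackley sketch above.

```python
import numpy as np

def rastrigin(x):
    """Rastrigin function (38); x has shape (..., d)."""
    d = x.shape[-1]
    return 10.0 * d + np.sum(x ** 2 - 10.0 * np.cos(2 * np.pi * x), axis=-1)

# product objective with three global minima at (0, 0), (3, 2), and (-1, -3.5)
V = lambda x: (rastrigin(x) * rastrigin(x - np.array([3.0, 2.0]))
               * rastrigin(x + np.array([1.0, 3.5]))) / 8.0
```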

We consider the three different kernels

$$\begin{aligned} {{\,\mathrm{\textsf{k}}\,}}(x,y)&= \exp \left( -\frac{\left| x-y \right| ^2}{2\kappa ^2}\right) \qquad \text {Gaussian kernel}, \end{aligned}$$
(39)
$$\begin{aligned} {{\,\mathrm{\textsf{k}}\,}}(x,y)&= \exp \left( -\frac{\left| x-y \right| }{\kappa }\right) \qquad \text {Laplace kernel}, \end{aligned}$$
(40)
$$\begin{aligned} {{\,\mathrm{\textsf{k}}\,}}(x,y)&= 1_{\left| x-y \right| \le \kappa }(x,y) \qquad \text {bounded confidence kernel}, \end{aligned}$$
(41)

and the corresponding results of our method are depicted in Fig. 5. Again, we choose \(\beta =1\). The kernel parameters \(\kappa \) were chosen sufficiently small for the methods to detect all three minima, which are marked by diamonds in Fig. 5. While the Gaussian and the Laplace kernel work similarly well (note, however, that the Laplace kernel needs a much smaller value of \(\kappa \) than the Gaussian), the bounded confidence kernel (cf. the discussion in Sect. 2.1) works suboptimally for the task of minimization. While it manages to detect all three minima, a lot of particles get stuck in suboptimal consensus points, which can be explained by the fact that the kernel has compact support.
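In code, the three kernels (39)–(41) differ only in how the pairwise distance \(r=\left| x-y \right|\) is mapped to a weight; a minimal sketch that could be swapped into the kernel matrix computation of the step sketched in Sect. 3.1 (names illustrative):

```python
import numpy as np

def gaussian_kernel(r, kappa):
    return np.exp(-r**2 / (2.0 * kappa**2))   # (39)

def laplace_kernel(r, kappa):
    return np.exp(-r / kappa)                 # (40)

def bounded_confidence_kernel(r, kappa):
    return (r <= kappa).astype(float)         # (41), compact support
```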

Fig. 5: Dynamics of polarized CBO with different kernels for minimizing a Rastrigin-type function with three global minima, marked by red diamonds. For all kernels all minima were detected. Gaussian and Laplace kernel work especially well, whereas the bounded confidence kernel generates too many consensus points due to its compact support

3.3 Multimodal Ackley function

We proceed with quantitative evaluations of our method for a multimodal version of the Ackley function (37), defined as

$$\begin{aligned} V(x) := \prod _{i=1}^N A(x - z_i), \end{aligned}$$

where \(\{z_i\in \mathbb {R}^d\,:\,i=1,\dots ,N\}\) are points which constitute the global minimizers of V, and A is the standard Ackley function defined in (37). In Table 1 we report how many of the three minima in dimension \(d=2\) were detected by the proposed polarized CBO method with a Gaussian kernel and different values of the standard deviation \(\kappa \). That is, we show the percentage of runs that detected at least 1, 2, or 3 minima. Here, we employed the standard noise model as specified in (5) with \(\sigma =1.0\) and \(\beta =1.0\). For completeness we also include results for standard CBO (i.e., \(\kappa =\infty \)), which by definition can detect at most one minimum.

We say that a method detects a minimum if at convergence there exists a weighted mean \(\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x)\) which is closer than 0.25 in the infinity norm to the minimum. For standard CBO, where \(\textsf{m}_{\beta ,{{\,\mathrm{\textsf{k}}\,}}}[\rho ](x) = \textsf{m}_{\beta }[\rho ]\) for all x, this coincides with the definition of success in [4, 15]. For our experiments we employ \(N=3\) different minima \(z_1,z_2,z_3\in \mathbb {R}^{d}\) with

$$\begin{aligned}&(z_1)_i := {\left\{ \begin{array}{ll} -2 &{} \text { if } i \bmod 2 = 0,\\ 1 &{} \text { else, } \end{array}\right. } \quad (z_2)_i := {\left\{ \begin{array}{ll} 2 &{} \text { if } i \bmod 2 = 0,\\ -1 &{} \text { else, } \end{array}\right. }\\&\quad (z_3)_i := {\left\{ \begin{array}{ll} -1 &{} \text { if } i \bmod 2 = 0,\\ -3 &{} \text { else} \end{array}\right. } \end{aligned}$$

for \(i=1,\ldots , d\). While polarized CBO works very well in the two-dimensional example, Table 2 shows that in dimension \(d=10\) it fails to detect more than one minimum. However, the cluster method from Algorithm 3 exhibits significantly improved behavior and manages to detect all three minima frequently. Here, we employed the coordinate-wise noise model from (9) with \(\sigma =7.5\). Additionally, we employed a simple scheduling for the parameter \(\beta \), where we start with \(\beta =30\) and increase it in each step via

$$\begin{aligned} \beta \leftarrow 1.01\cdot \beta \end{aligned}$$

up to a limit of \(\beta _\text {max}=10^7\). Here, one could potentially employ more sophisticated approaches as proposed in [15]. For the cluster-based methods we chose \(\alpha =5.0\) in Algorithm 3.
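To make the experimental setup above concrete, the following sketch spells out the multimodal objective, the minima \(z_1,z_2,z_3\), the detection criterion, and the \(\beta \) schedule. It reuses the ackley function from the sketch in Sect. 3.1; the cluster assignment of Algorithm 3 itself is not reproduced here, and all names are illustrative.

```python
import numpy as np

def multimodal_ackley(x, minima):
    # V(x) = prod_i A(x - z_i), with A the standard Ackley function (37);
    # reuses ackley(x, shift=...) from the earlier sketch.
    return np.prod([ackley(x, shift=z) for z in minima], axis=0)

def make_minima(d):
    # z_1, z_2, z_3 as defined above, using the 1-based index i = 1, ..., d.
    i = np.arange(1, d + 1)
    z1 = np.where(i % 2 == 0, -2.0, 1.0)
    z2 = np.where(i % 2 == 0, 2.0, -1.0)
    z3 = np.where(i % 2 == 0, -1.0, -3.0)
    return [z1, z2, z3]

def detected(means, z, tol=0.25):
    # A minimum z counts as detected if some weighted mean (rows of `means`)
    # is closer than tol to z in the infinity norm.
    return np.any(np.max(np.abs(means - z), axis=1) < tol)

# Beta schedule: multiplicative increase by 1.01 per iteration, capped at 1e7.
beta, beta_max = 30.0, 1e7
for _ in range(1000):
    # ... particle update of the chosen method would go here ...
    beta = min(1.01 * beta, beta_max)
```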

Table 1 Performance of polarized CBO for minimizing the multimodal Ackley function with \(d=2\) and three global minima: averaging over 100 independent runs of 1000 iterations each, we report how many percent of the runs succeeded in detecting at least 1, 2, or 3 of the minima

Even for \(\kappa =\infty \), where the kernel no longer has any influence, the cluster method works very well. Note that this case does not correspond to standard CBO. Furthermore, Table 2 shows that our polarized method can outperform standard CBO at finding at least one minimum, albeit at the cost of higher complexity.

We also test the cluster method in \(d=30\) dimensions; the results can be found in Table 3. We observe that it is harder to find multiple minima; however, for \(J=1600\) the method finds at least two minima in over \(50\%\) of the runs for \(\kappa \ge 1\). For smaller particle numbers the algorithm performs better for \(\kappa \le 1\), although the percentage of runs where multiple minima are found is very low.

Table 2 Performance of polarized and cluster CBO for minimizing the multimodal Ackley function with \(d=10\), ceteris paribus
Table 3 Performance of cluster CBO for minimizing the multimodal Ackley function with \(d=30\), ceteris paribus

3.4 Multimodal sampling

Fig. 6: Dynamics of standard CBS and our polarized version, sampling from a mixture of Gaussians (top row: far apart, bottom row: closer together). The points mark particle locations, the arrows the drift field towards the weighted mean(s)

In this section we consider the task of sampling from a bimodal mixture of Gaussians, given by

$$\begin{aligned} \exp (-V(x))&:= \exp \left( -(x_1-a_1)^2 - \frac{(x_2-a_2)^2}{0.2}\right) \\&\quad + \frac{1}{2}\exp \left( -\frac{(x_1-b_1)^2}{8} - \frac{(x_2-b_2)^2}{0.5}\right) , \end{aligned}$$

where the tuples \(a=(a_1,a_2)\) and \(b=(b_1,b_2)\) determine the centers of the clusters. In Fig. 6 we plot the results of standard CBS and our polarized variant using Gaussian kernels with three different standard deviations \(\kappa \in \{0.4,0.6,0.8\}\). For both methods we choose \(\beta =1\). Note that, at least for convex potentials V with bounded Hessian, standard CBS is known to exhibit a Gaussian steady state. Since CBS is designed as a method for unimodal sampling, there is not much hope that it can work in this multimodal situation; still, we include its results for comparison.
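For reference, the unnormalized target \(\exp (-V)\) can be coded directly from the display above; a minimal sketch, where the default centers a and b are purely illustrative placeholders (the values used for the two rows of Fig. 6 are not restated here):

```python
import numpy as np

def bimodal_density(x, a=(0.0, 3.0), b=(0.0, -3.0)):
    # Unnormalized target exp(-V(x)) for x of shape (J, 2); a, b are the cluster centers
    # (default values here are illustrative only).
    g1 = np.exp(-(x[:, 0] - a[0])**2 - (x[:, 1] - a[1])**2 / 0.2)
    g2 = 0.5 * np.exp(-(x[:, 0] - b[0])**2 / 8.0 - (x[:, 1] - b[1])**2 / 0.5)
    return g1 + g2

def V(x):
    # Potential fed to the sampler, V = -log(exp(-V)).
    return -np.log(bimodal_density(x))
```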

In contrast, our polarized modification of CBS manages to isolate the two modes. Note that in Proposition 1 we proved that polarized CBS with a Gaussian kernel is unbiased when the target measure is a Gaussian. We do not expect this to be true for target measures which are a mixture of Gaussians. Still, our results in the first row of Fig. 6 show that, if the two Gaussians are sufficiently far apart, our method (second to fourth column) seems close to being unbiased. Standard CBS (left column) can, by design, find at most one mode and here it successfully detects the lower Gaussian. Note that we use the same number of particles for both algorithms, namely \(J=400\), which is why the bottom mode for CBS looks more densely sampled.

The situation is different when the two modes are closer together, as shown in the bottom row of Fig. 6. Here, standard CBS fails to detect even one mode, whereas our polarized version does. However, as expected in this multimodal situation, the result is not a perfect sample but appears to be biased. Furthermore, if the clusters are close, the sensitivity with respect to the choice of the kernel width \(\kappa \) is larger, and \(\kappa \) has to be chosen sufficiently small in order to generate enough samples for the lower cluster.

3.5 Non-Gaussian sampling

Fig. 7: Standard and polarized CBS for sampling from a non-Gaussian distribution. The orange crosses mark the position of the weighted means

Our final experiment deals with sampling from a non-Gaussian distribution. Again, we use \(\beta =1\). According to the theoretical analysis in [15], standard CBS is not expected to correctly sample from such a distribution but rather from a Gaussian approximation. While we do not claim that our polarized method generates an exact sample, our results in Fig. 7 show that polarized CBS approximates the non-Gaussian distribution much better than standard CBS. This is due to the fact that the weighted means, indicated by orange crosses, do not collapse to a single point but concentrate in the region of high probability mass.

4 Conclusion and outlook

In this article we presented a polarized version of consensus-based optimization and sampling dynamics for objectives with multiple global minima or modes. For this we localized the dynamics such that every particle is primarily influenced by close-by particles. We proved that in the case of sampling from a Gaussian distribution this does not introduce a bias. We also suggested a cluster-based version of our polarized dynamics which is computationally more efficient. Our extensive numerical experiments suggested a large potential of our method for detecting multiple global minima or modes, improving over standard consensus-based methods.

There is a lot of room for future work regarding well-definedness of the Fokker–Planck equation derived above, and stability and convergence of both the mean-field and particle system to consensus using less restrictive assumptions than the ones used in this paper. We are convinced that the Lyapunov function \(\textsf{L}[\rho _t]\) which we studied in Sect. 2.4 will be very helpful in this endeavour. Future work will also focus on further numerical improvements of our method, in particular, incorporating a batching strategy similar to the one from [10] to further improve performance in high-dimensional optimization. Finally, a long-term goal will be to find multimodal sampling methods which are provably consistent for multimodal and non-Gaussian distributions. Given that even gradient-free sampling from unimodal non-Gaussian distributions is still relatively poorly understood, with [13] being a promising approach, this will be a challenging task for future work.