1 Introduction

Motivated by problems in which two or more objectives must be considered at the same time, in this work we are interested in the design of stochastic algorithms for multi-objective optimization. This type of problem is commonly found in everyday life, for example, in physics, engineering, the social sciences, economics, biology, and many other fields [17, 22, 44, 45, 53]. Investing in the financial market while maximizing profit and minimizing risk, or building a vehicle while maximizing performance and minimizing fuel consumption and pollutant emissions are examples of multi-objective optimization problems, where the objectives typically conflict with each other.

From a mathematical viewpoint, the problem can be formulated through a variable \(x\in {\mathbb {R}}^d\) describing a possible decision and by assuming that \(g_i(x)\) is the i-th objective for \(i=1, \ldots , m\), with \(m \in {\mathbb {N}}\) being the total number of objectives. A multi-objective problem then requires solving, for a decision x,

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} g(x) \end{aligned}$$
(1.1)

where \(g(x) = (g_1(x), \ldots , g_m(x))^\top \). A solution to (1.1) corresponds to several optimal decisions. Here, we consider optimality in the sense of Pareto [45], i.e., no objective can be improved without necessarily degrading another objective. Without additional information about subjective preferences, there may be a (possibly infinite) number of Pareto optimal solutions, all of which are considered equally good. Therefore, the optimization task consists of providing a set of optimal decisions. To this end, it is also desirable to have a diverse set, that is, addressing the problem not only by optimizing the objectives, but also by aiming to best describe the (possibly) broad set of optimal decisions.

Several methods have been proposed to numerically solve (1.1) and, as for single-objective optimization, they typically belong to either the class of metaheuristic algorithms or that of mathematical programming methods [59]. Among metaheuristics, multi-objective evolutionary algorithms [17], such as NSGA-II [18] and MOEA/D [64], have gained popularity among practitioners due to their flexibility and ease of use. At the same time, they usually lack a convergence analysis compared to mathematical programming methods. Some of these algorithms, like MOEA/D, make use of scalarization strategies [45] which allow one to translate problem (1.1) into a set of parametrized single-objective problems. With this detour, all classical, non-heuristic methods can also be employed in multi-objective optimization, at the cost of solving a possibly large number of single-objective sub-problems. Many well-known techniques such as descent methods, trust-region methods, and Newton's method have been successfully adapted to directly solve multi-objective problems avoiding scalarization, and they typically guarantee finding locally Pareto optimal solutions, see e.g. [30, 37]. Among deterministic algorithms for non-convex problems we mention branch-and-bound methods [27], DIRECT methods [11], and worst-case optimal algorithms [65]. For more details on mathematical programming methods and evolutionary algorithms in multi-objective optimization we refer to the recent surveys [15, 22].

We are interested in a particular class of gradient-free stochastic particle optimization methods, called consensus-based optimization (CBO), which has recently gained popularity due to the use of techniques developed in statistical physics that can provide them with a rigorous mathematical foundation. Such methods consider interacting particle systems which combine a consensus mechanism towards the estimated minimum and random exploration of the search space  [2, 12, 13, 56, 61]. From a mathematical viewpoint, this class of metaheuristic methods is inspired by the corresponding mean-field dynamics based on particle swarming and multi-agent social interactions, which have been widely used to study complex systems in life sciences, social sciences and economics [19, 50, 51, 54, 63]. In the context of stochastic particle optimization, mean-field techniques first approximate the heuristic iterative strategy as continuous-in-time dynamics by means of a system of Stochastic Differential Equations (SDEs). Then, the system is further approximated by a mono-particle process, the corresponding mean-field model, which is more amenable to mathematical analysis. Quantitative estimates of such approximations can eventually provide precise error estimates for the original heuristic method, see e.g. [32]. This approach has proven fruitful to demonstrate convergence towards a global minimum for single-objective problems, not only in the case of CBO methods, but also for the popular Particle Swarm Optimization (PSO) algorithm [36, 43], thus paving the way to provide a mathematical foundation for other metaheuristics.

In the same spirit, the authors proposed in [7] a multi-objective optimization algorithm (M-CBO) by prescribing a CBO-type interaction among several particles making use of a scalarization strategy. As already mentioned, scalarization allows one to reduce (1.1) to many single-objective sub-problems, which can be solved simultaneously in the case of particle-based optimization methods. In this paper, we provide a convergence analysis for the method proposed in [7] based on the mean-field description of the M-CBO dynamics. Furthermore, we improve the method in order to capture with uniform accuracy the shape of the Pareto front. This is done by iteratively updating the parameters of the method to minimize specific diversity measures based on two-body energy potentials. Mathematically, this last feature is achieved by enlarging the phase space of the particles and introducing Vlasov-type dynamics in the space of the parameters. A detailed analysis of the extended model is also presented by studying a mean-field approximation of the particle dynamics which allows us to recover convergence guarantees towards optimal points. We further analyze the adaptive strategy for bi-objective problems in simplified settings where the Pareto front is the unit simplex, that is, a linear segment for \(m=2\). Under suitable regularity assumptions on the potential, the particle dynamics takes a gradient-flow structure in the space of the parameters. The latter, in particular, suggests that the heuristic strategy leads to a better quality approximation of the Pareto front, where particles are uniformly spread over it. We remark that another CBO algorithm for multi-objective optimization has been recently proposed in [47]. This variant makes use of a multi-swarm approach rather than a one-swarm approach, as is the case for the algorithm proposed in [7] and extended in this work.

The rest of the paper is organized as follows. In Sect. 2, we formally introduce the concept of optimality for (1.1) and present the scalarization strategy. Next, in Sect. 3, we illustrate the particle dynamics both in the search space and in the space of parameters. Section 4 is devoted to the mathematical analysis of the system evolution using a mean-field description. Finally, in Sect. 5 numerical examples on convex and non-convex as well as disjoint Pareto fronts are presented, which confirm the theoretical results and the good performance of the new method. Some concluding remarks are discussed in the last section.

2 Problem Definition and Scalarization

We will use the following notation. For \(a \in {\mathbb {R}}^n\), |a| indicates its Euclidean norm and \((a)_l\) its l-th component. For given vectors \(a, b \in {\mathbb {R}}^d\), \(a \cdot b\) denotes the scalar product. For a Borel set \(A \subset {\mathbb {R}}^d\), |A| indicates its Lebesgue measure. \({\mathcal {P}}({\mathbb {R}}^d)\) is the set of Borel probability measures over \({\mathbb {R}}^d\) and \({\mathcal {P}}_q ({\mathbb {R}}^d) = \{\mu \in {\mathcal {P}}({\mathbb {R}}^d) \,|\, \int |x|^q d\mu <\infty \}\) the set of Borel probability measures with finite q-th moment. The symbols \(\prec \) and \(\preceq \) indicate the partial ordering with respect to the cones \({\mathbb {R}}^m_{>0}\) and \({\mathbb {R}}^m_{\ge 0}\), respectively.

2.1 Pareto Optimality and Diversity

When dealing with a vector-valued objective function \(g: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^m\)

$$\begin{aligned} g(x) = \left( g_1(x), \cdots , g_m(x) \right) ^{\top } \end{aligned}$$
(2.1)

with \(m\ge 2\), the interpretation of the minimization problem (1.1) is not unique, as the image space \({\mathbb {R}}^m\) is not fully ordered. We consider the notions of strong and weak Edgeworth-Pareto optimality which rely on the natural, component-wise, partial ordering on \({\mathbb {R}}^m\) [45].

Definition 2.1

(Edgeworth-Pareto optimality) A point \({\bar{x}} \in {\mathbb {R}}^d\) is (strong) Edgeworth-Pareto (EP) optimal, or simply optimal, if \(g({\bar{x}})\) is a minimal element of the image set \(g({\mathbb {R}}^d)\) with respect to the natural partial ordering, that is if there is no \(x \in {\mathbb {R}}^d\) such that

$$\begin{aligned} g_i(x) \le g_i({\bar{x}}) \;\; \text {for all}\;\; i=1, \ldots , m, \quad g(x) \ne g({\bar{x}}). \end{aligned}$$

Likewise, \({\bar{x}}\) is weakly EP optimal if there is no \(x \in {\mathbb {R}}^d\) such that

$$\begin{aligned} g_i(x) < g_i({\bar{x}})\; \; \text {for all}\;\; i=1, \ldots , m. \end{aligned}$$

The set \( F_x = \{ {\bar{x}} \in {\mathbb {R}}^d \, |\, {\bar{x}} \; \text {is EP optimal} \} \) constitutes the set of EP optimal points, while

$$\begin{aligned} F = \{ g({\bar{x}}) \in {\mathbb {R}}^m \, |\, {\bar{x}} \; \text {is EP optimal} \} \end{aligned}$$

is the Pareto front.

The multi-objective optimization problem (1.1) consists of finding the set of EP optimal points. Unlike single-objective problems, this set is typically uncountable and the optimization task involves finding a finite subset of optimal points. The image of those points should ideally cover F, and the concept of diversity is introduced to distinguish between different approximations [17]. Intuitively, if the points on the Pareto front are farther apart from one another, the diversity is higher. In view of the minimization problem, having a diverse approximation is desirable as it provides, at the same cost, a broader variety of possible solutions.

The most diverse approximation is arguably given by a set of points which is uniformly distributed over the Pareto front. Quantifying the diversity of an optimal set is of paramount importance, both to assess the performance of optimization methods and to design them. Indeed, oftentimes the heuristic of a method is constructed specifically to minimize, or maximize, a given diversity measure [15]. Without knowledge of the exact Pareto front, popular diversity measures are given by the hypervolume contribution [66], the crowding distance [18] and, recently, by the Riesz s-energy [25, 48]. Our proposed algorithm will aim to minimize the latter (or similar energy-based measures) as it can be embedded in a mean-field framework. The exact definitions are introduced later.

To sum up, the multi-objective optimization problem we consider is a two-objective task itself, as one needs to find a set of points which are both EP optimal and optimize a suitable diversity measure.

2.2 Scalarization Strategy

A popular way to approach (1.1) is to use a scalarization strategy [21, 45] which reduces the multi-objective problem to an (infinite) number of single-objective sub-problems.

Among the possible scalarization strategies, we consider approximation sub-problems with weighted Chebyshev semi-norms [45], where the single-objective sub-problems are given by

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} G (x,w):= \max _{k\in \{ 1, \ldots , m\}}\, w_k \,|g_k(x)| \end{aligned}$$
(2.2)

and are parametrized by a vector of weights w which belongs to the unit, or probability, simplex

$$\begin{aligned} \varOmega :=\left\{ w \in {\mathbb {R}}^m_{\ge 0}\; | \;\sum _{\ell =1}^m w_\ell = 1 \right\} . \end{aligned}$$

The link between the scalarized problems and the original multi-objective problem is given by the following result.

Theorem 2.1

([45, Corollaries 5.25, 11.21]) Assume g is component-wise positive.

  (a) A point \({\bar{x}}\) is weakly EP optimal if and only if \({\bar{x}}\) is a solution to (2.2) for some \(w \in \varOmega \).

  (b) Assume all sub-problems (2.2) attain a unique minimum. Then, \({\bar{x}}\) is EP optimal if and only if \({\bar{x}}\) is the solution to (2.2) for some \(w \in \varOmega \).

Theorem 2.1 shows the strength of the Chebyshev scalarization strategy, which allows one to find all weakly EP optimal points, contrary to other strategies like linear scalarization [45]. We remark that the proposed algorithm can also be applied to solve any other scalarized problem of the form (2.2) where the parameters are taken from the unit simplex \(\varOmega \).

In practice, it will be sufficient to solve a finite number \(N \in {\mathbb {N}}\) of sub-problems to obtain a good approximation of the Pareto front. Even though solving N sub-problems with corresponding weights vectors \(\{ W^i\}_{i=1}^N \subset \varOmega \) ensures that N optimal points are found, we note that there is no guarantee of obtaining a diverse approximation. Therefore, scalarization targets only one of the two aspects of the problem, without addressing the diversity of the solution. In the following, we introduce an algorithm where the parameters \(W^i\) dynamically change during the computation to obtain a set of EP points which is also diverse.
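
For illustration, here is a minimal sketch (ours; the bi-objective functions, the candidate decision and all sizes are hypothetical, not taken from the paper) of the Chebyshev scalarization (2.2) evaluated for a uniform grid of N weight vectors on \(\varOmega \) with \(m=2\).

```python
import numpy as np

def chebyshev_scalarization(x, w, objectives):
    """Weighted Chebyshev semi-norm G(x, w) = max_k w_k |g_k(x)|, cf. (2.2)."""
    g = np.array([gk(x) for gk in objectives])
    return np.max(w * np.abs(g))

# Hypothetical bi-objective problem (m = 2), used only for illustration.
g1 = lambda x: np.sum((x - 0.5) ** 2)
g2 = lambda x: np.sum((x + 0.5) ** 2)

# Uniform grid of N weight vectors on the probability simplex (m = 2).
N = 5
t = np.linspace(0.0, 1.0, N)
weights = np.stack([t, 1.0 - t], axis=1)

x = np.zeros(3)  # a candidate decision in R^d with d = 3
for w in weights:
    print(w, chebyshev_scalarization(x, w, [g1, g2]))
```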

3 Adaptive Multi-objective Consensus Based Optimization

We propose a dynamics where \(N\in {\mathbb {N}}\) particles interact with each other to solve N scalar sub-problems given in the form (2.2).

At the iteration step \(k = 0,1, \ldots \), every particle is described by its position \(X_k^i \in {\mathbb {R}}^d\) and its vector of weights \(W_k^i \in \varOmega \) which determines the optimization sub-problem the particle aims to solve. As a result, particles are described by N tuples

$$\begin{aligned} (X_k^i, W_k^i) \quad \text {for}\quad i= 1, \ldots , N\, \quad \text {for} \quad k = 0,1,\ldots \end{aligned}$$

in the augmented space \({\mathbb {R}}^d \times \varOmega \).

The initial configuration is generated by sampling the positions \(X^i_0\) from a common distribution \(\rho _0 \in {\mathcal {P}}({\mathbb {R}}^d)\) and by taking uniformly distributed weights vectors \(W^i_0\) over \(\varOmega \). The iteration is prescribed to solve the multi-objective optimization task. We recall that (1.1) not only requires to find optimal points, but also points that are diverse, that is, well-distributed over the Pareto front. To this end, the optimization process is made of two mechanisms which address these two aims separately.

3.1 A Consensus Based Particle Dynamics in the Search Space

The first mechanism prescribes the update of the positions \(\{X_k^i\}_{i=1}^N\), such that they converge towards EP optimal points. As in [7], this is done by introducing a CBO-type interaction between the particles. To illustrate the CBO update rule, let us consider for the moment a fixed single-objective sub-problem parametrized by \(w \in \varOmega \). Similar to Particle Swarm Optimization methods, in CBO methods the particles at step k move towards an attracting point \(Y_k^\alpha (w)\), which is given by a weighted average of their positions:

$$\begin{aligned} Y^\alpha _k(w) = \frac{ \sum _{j=1}^N X^j_k\, \exp \left( -\alpha G(X^j_k, w)\right) }{\sum _{j=1}^N \exp \left( -\alpha G(X^j_k, w)\right) }. \end{aligned}$$
(3.1)

Due to the coefficients used in (3.1), if \(\alpha \gg 1\), \(Y^\alpha _k(w)\) is closer to the particles with low values of the objective function \(G(\cdot , w)\) and, in the limiting case, it holds

$$\begin{aligned} Y^\alpha _k(w) \longrightarrow \underset{X^j_k,\,j=1,\ldots , N}{\text {argmin}} G(X_k^j, w)\quad \text {as} \quad \alpha \rightarrow \infty , \end{aligned}$$

if the above minimum uniquely exists. This promotes the concentration of the particles in areas of the search space where the objective function \(G(\cdot ,w)\) attains low values and hence, more likely, a global minimum. We remark that the exponential coefficients correspond to the Gibbs distribution associated with the objective function and, moreover, that this choice is justified by the Laplace principle [20]. The latter is an essential result to study the convergence of CBO methods [33] and it states that for any absolutely continuous probability density \( \rho \in {\mathcal {P}}({\mathbb {R}}^d)\) we have

$$\begin{aligned} \lim _{\alpha \rightarrow \infty }\left( - \frac{1}{\alpha }\log \left( \int e^{-\alpha G(x, w)} d\rho (x) \right) \right) = \inf _{x \in \text {supp}(\rho )} G(x, w). \end{aligned}$$
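
To make the role of the Gibbs weights concrete, here is a minimal sketch (ours, assuming NumPy; the test objective and parameter values are hypothetical) of the consensus point (3.1). Since the exponentials underflow for large \(\alpha \), the values \(G(X^j_k, w)\) are shifted by their minimum before exponentiating; this leaves (3.1) unchanged and mirrors the Laplace-principle scaling above.

```python
import numpy as np

def consensus_point(X, G_vals, alpha):
    """Consensus point (3.1): Gibbs-weighted average of the particle positions.

    X      : (N, d) array of particle positions X^j
    G_vals : (N,) array of scalarized objective values G(X^j, w)
    alpha  : large alpha concentrates the average on the best particles
    """
    # Shifting by the minimum does not change (3.1) and avoids underflow for large alpha.
    weights = np.exp(-alpha * (G_vals - G_vals.min()))
    weights /= weights.sum()
    return weights @ X  # (d,) weighted average

# Tiny usage example with a hypothetical objective G(x, w) = max_k w_k |x_k| in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w = np.array([0.3, 0.7])
G_vals = np.max(w * np.abs(X), axis=1)
print(consensus_point(X, G_vals, alpha=1e3))
```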

Since in the multi-objective optimization dynamics each particle addresses a different sub-problem, each of them moves towards a different attracting point given by \(Y^\alpha _k(W^i_k)\). The attraction strength is given by \(\lambda >0\), while another parameter \(\sigma >0\) determines the strength of an additional stochastic component. The update rule reads

$$\begin{aligned} X^i_{k+1} = X_k^i + \lambda \varDelta t \left( Y_k^\alpha (W^i_k) - X_k^i \right) + \sigma \sqrt{\varDelta t}\, D^i_k(X_k^i,W^i_k) B_k^i \end{aligned}$$
(3.2)

for all \(i = 1, \ldots , N\) where \(B_k^i\) are multivariate independent random vectors, \(B_k^i \sim {\mathcal {N}}(0,I_d)\), \(I_d\) being the d-dimensional identity matrix, and \(\varDelta t>0\) is the step-length. The update rule (3.2) is overparametrized and in CBO optimization schemes typically \(\lambda =1\) is used. The matrices \(D_k^i(X_k^i,W^i_k)\) characterize the random exploration process which might be isotropic [56]

$$\begin{aligned} D_{k,\text {iso}}^i (X_k^i,W^i_k)= | Y^\alpha _k(W^i_k) - X_k^i |\, I_d, \end{aligned}$$
(3.3)

or anisotropic [13]

$$\begin{aligned} D_{k,\text {aniso}}^i (X_k^i,W^i_k)= \text {diag} \left( Y^\alpha _k(W^i_k) - X_k^i \right) . \end{aligned}$$
(3.4)

Here, \(\text {diag}(a)\) denotes the diagonal matrix with the elements of \(a \in {\mathbb {R}}^d\) on the main diagonal.

Both explorations depend on the distance between \(X^i_k\) and the corresponding attracting point, making the variance of the stochastic component larger if the particle is far from \(Y_k^\alpha (W^i_k)\). The difference lies in the direction of the random component: while in the isotropic exploration all dimensions are equally explored, the anisotropic one explores each dimension with a different magnitude.
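
As an illustration, the following sketch (ours; all parameter values and array shapes are hypothetical) performs one update step (3.2) for the whole particle system, covering both the isotropic exploration (3.3) and the anisotropic one (3.4).

```python
import numpy as np

def cbo_position_step(X, Y, dt, lam=1.0, sigma=0.8, anisotropic=True, rng=None):
    """One Euler-Maruyama step of the position update (3.2).

    X : (N, d) particle positions X^i_k
    Y : (N, d) attractors Y^alpha_k(W^i_k), one per particle
    """
    rng = rng or np.random.default_rng()
    B = rng.standard_normal(X.shape)               # B^i_k ~ N(0, I_d)
    drift = lam * dt * (Y - X)
    if anisotropic:
        noise = sigma * np.sqrt(dt) * (Y - X) * B  # diag(Y - X) B, cf. (3.4)
    else:
        dist = np.linalg.norm(Y - X, axis=1, keepdims=True)
        noise = sigma * np.sqrt(dt) * dist * B     # |Y - X| I_d B, cf. (3.3)
    return X + drift + noise

# Illustrative call with hypothetical sizes N = 100, d = 5 and step dt = 0.01.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y = np.zeros_like(X)          # pretend all attractors sit at the origin
X_next = cbo_position_step(X, Y, dt=0.01, rng=rng)
```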

We note that (3.2) can be interpreted as a simple Euler-Maruyama discretization [41] of the following system of SDEs

$$\begin{aligned} dX^{i}_{t} = \lambda \left( Y_t^\alpha (W^i_t) - X_t^i \right) dt + \sigma D^i_t(X_t^i,W_t^i) dB_t^i\,, \end{aligned}$$
(3.5)

for all \(i = 1, \ldots , N\), where \(B_t^i\) are d-dimensional independent Brownian processes and \(Y_t^\alpha (W_t^i), D_t^i(X_t^i,W_t^i)\) are defined as in (3.1), (3.3) and (3.4), depending on whether isotropic or anisotropic diffusion is used. In Sect. 4 we will use this interpretation to obtain convergence results for (3.2).

The expected outcome of the position update rule (3.2) is that every particle will find a minimizer of a sub-problem and hence, by Theorem 2.1, a weakly EP optimal point. We have already mentioned that if the weights vectors are fixed to the initial uniform distribution, i.e. \(W_k^i = W_0^i\) for all \(i = 1, \ldots , N\), there is no guarantee to obtain equidistant points on the front. Since it is impossible to determine beforehand the optimal distribution on \(\varOmega \), we propose a heuristic strategy which updates the weights vectors, promoting diversity.

Remark 3.1

Assume the particle position \(X^i_{k}\) attains a good value for a different sub-problem \(W_k^j\). Given that the i-th particle aims to optimize a different objective, it will probably leave such a good location in the following iterations, possibly resulting in a deterioration of the weighted average for the j-th particle. As we will see, a key role is played by the Gaussian noise, which ensures that the probability of finding another particle in such a good position \(X^i_k\) is non-zero throughout the whole computation. Similarly to Particle Swarm Optimization methods, one could also avoid losing valuable information by adding memory mechanisms to the CBO dynamics, see e.g. [6, 58].

3.2 Uniform Approximation of the Pareto Front

A popular diversity metric in multi-objective optimization is the hypervolume contribution metric [66], which has the drawbacks of being computationally expensive [3] and, by definition, dependent on estimates of upper bounds of g. Motivated by this and by the objective of designing algorithms which perform well for any shape of the Pareto front [48], new energy-based diversity measures have recently gained popularity [15, 25]. Such measures quantify the diversity of a given empirical distribution \(\rho ^N \in {\mathcal {P}}({\mathbb {R}}^d)\) by considering the pairwise interaction given by a two-body potential \(U: {\mathbb {R}}^m \rightarrow (-\infty , \infty ]\) on the image space

$$\begin{aligned} {\mathcal {U}}[ g_\# \rho ^N ]:= \iint U\left( g(x) - g(y) \right) \, d\rho ^N(y)\, d\rho ^N(x), \end{aligned}$$
(3.6)

\(g_\# \rho ^N\) being the push-forward measure of \(\rho ^N\).

The problem of finding well-spread points over the Pareto front is then equivalent to finding a configuration which is minimal with respect to the given energy \({\mathcal {U}}\), where we recall that \(F_x\) is the set of EP optimal points:

$$\begin{aligned} \min _{\rho ^N \in {\mathcal {P}}(F_x)}\; {\mathcal {U}}\, [ g_\# \rho ^N]. \end{aligned}$$

We refer to Fig. 1 for an intuition on the difference between low and high energy configurations of particles over the Pareto front.

A distribution \(\nu ^N\) is called diverse if and only if

$$\begin{aligned} \nu ^N \in \underset{\rho ^N \in {\mathcal {P}}(F_x)}{\text {argmin}}\; {\mathcal {U}}\, [ g_\# \rho ^N]. \end{aligned}$$
Fig. 1 Illustration of two particle systems over the Pareto front (in red): a high-energy configuration (left) and a low-energy one (right). The configuration Riesz energy ((3.6) with (3.7)) is also displayed, together with the histogram illustrating the corresponding particle distribution on the simplex (Color figure online)

Any energy \( {\mathcal {U}}\) describing short range repulsion between particles, like Monge energy or repulsive-attractive power-law energy, is in principle a candidate to be a diversity measure. The Riesz s-energy, given by

$$\begin{aligned} U_R(z) = \frac{1}{|z|^s} \quad \text {with} \quad s = m-1, \end{aligned}$$
(3.7)

is a popular choice [25] due to its theoretical guarantees of being a good measure of the uniformity of points. Indeed, if F is an \((m-1)\)-dimensional manifold, the minimal energy configuration \(\nu ^N\) converges to the uniform Hausdorff distribution over F as \(N\rightarrow \infty \). We refer to [40] for the precise statement of the result and more details. Inspired by the electrostatic potential between charged particles, the authors in [10] used a Newtonian potential which is also empirically proven to be a suitable diversity measure [9, 10]. See [25] for a numerical comparison between two-body potentials as diversity measures in evolutionary algorithms. We will also compare different energies in Sect. 5 and consider \(U\in {\mathcal {C}}^1({\mathbb {R}}^m {\setminus } \{ 0\})\) to be any of the above. The exact computation of minimal energy configurations of a system of N particles is a well-studied problem as it is connected to, e.g., crystallization phenomena [4]. We note that, in our setting, the configuration \(\rho ^N\) is additionally mapped to the image space in (3.6), making the task even harder. Therefore, we propose a heuristic strategy that is expected to find only suboptimal configurations.
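
A short sketch (ours) of the energy (3.6) with the Riesz kernel (3.7) for an empirical configuration in the image space: the double integral reduces to a double sum over particles, and the diagonal terms i = j are excluded since \(U_R\) is singular at 0. The example front used below is hypothetical.

```python
import numpy as np

def riesz_energy(G, s=None):
    """Riesz s-energy (3.6)-(3.7) of the image-space configuration G = g(X).

    G : (N, m) array whose rows are g(X^i); the diagonal i = j is excluded
        because U_R(z) = |z|^{-s} is singular at z = 0.
    """
    N, m = G.shape
    s = (m - 1) if s is None else s
    diff = G[:, None, :] - G[None, :, :]              # pairwise differences g(X^i) - g(X^j)
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                    # drop the singular diagonal
    return np.sum(dist ** (-s)) / N**2

# Example: 20 points on the front {(t, 1 - t)}; clustered points yield a higher energy.
t_uniform = np.linspace(0.01, 0.99, 20)
t_clustered = np.linspace(0.01, 0.2, 20)
for t in (t_uniform, t_clustered):
    print(riesz_energy(np.stack([t, 1 - t], axis=1)))
```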

Fig. 2 Illustration of dynamics (3.10) in case of two particles \((X_t^i, W_t^i)\), \(i=1,2\). Forces \({\nabla U(g(X_t^i) - g(X_t^j))}\) are collected in the image space and applied to the weights vectors \({W_t^i}\) in the parameters space. The projection operator \(P_{W_t^i}(\cdot )\) ensures that the dynamics is confined in the simplex \(\varOmega \)

To promote diversity, we let the particles follow a vector field associated with \({\mathcal {U}}\). The movement takes place only in the parameter space, acting on \(\{ W_k^i\}_{i=1}^N\), in order not to interfere with the CBO optimization dynamics acting on the positions \(\{X_k^i\}_{i=1}^N\). Intuitively, if two particles are close to each other in the image space \(g({\mathbb {R}}^d)\), their weights vectors are pulled apart, see Fig. 2. This resembles the short-range repulsion induced by U. To ensure that \(W_k^i\) remains in the unit simplex \(\varOmega \), a projection \(P_{\varOmega }\) is required, where

$$\begin{aligned} P_\varOmega (v) = \text {argmin}_{w \in \varOmega } |v-w| \quad \text {for all}\; v \in {\mathbb {R}}^m, \end{aligned}$$

see, e.g., [16] for an implementation. A parameter \(\tau \ge 0\) determines the time scale of the weights adaptation process with respect to the CBO dynamics (3.2). The process can be turned off by setting \(\tau =0\).
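
Since the projection \(P_\varOmega \) is applied at every step, we include a minimal sketch of it (ours, using the standard sorting-based algorithm; it is not the implementation referenced in [16]).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v in R^m onto the probability simplex Omega.

    Sorting-based algorithm: find the largest k with u_k - (cumsum(u)_k - 1)/k > 0,
    where u is v sorted in decreasing order, then shift and clip.
    """
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    k = np.max(ks[u - css / ks > 0])
    theta = css[k - 1] / k
    return np.maximum(v - theta, 0.0)

# Example: a weight vector pushed outside Omega by the adaptation step is projected back.
print(project_simplex(np.array([0.7, 0.6, -0.1])))   # non-negative entries summing to 1
```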

In the case of bi-objective problems, where \(m=2\), we therefore obtain the update rule

$$\begin{aligned} W^{i}_{k+1} = P_{\varOmega }\Bigg (W^i_k + \frac{\tau }{N}\sum _{j=1}^N \nabla U\left( g(X_k^i) - g(X_k^j)\right) \varDelta t \Bigg ) , \end{aligned}$$
(3.8)

which is well-defined as the parameters space is embedded in the image space \({\mathbb {R}}^m\). If U has a singularity in 0, we set \(\nabla U(0) = 0\). We note that the lack of a minus sign in (3.8) is due to the explicit form of the relation, determined by Theorem 2.1, between the Pareto front and \(\varOmega \). This will become clear in the next section, as this choice gives a gradient flow structure to the parameters dynamics, under suitable assumptions.

For \(m> 2\), the relation between a weight vector \(w \in \varOmega \) and the corresponding (weakly) EP optimal point is more involved. Indeed, (3.8) implicitly makes use of the fact that parameters can only move in one direction when \(m=2\), as the simplex is given by a segment. This is no longer the case for \(m\ge 3\) and there is no straightforward way to pull back the vector field \(\nabla U\) from the objective space to the simplex. Nevertheless, we prescribe a suitable heuristic dynamics as follows:

$$\begin{aligned} W^{i}_{k+1} = P_{\varOmega } \Bigg (W^i_k - \frac{\tau }{N}\sum _{j=1}^N \frac{W_k^i - W_k^j}{|W_k^i - W_k^j|}\; \left| \nabla U \left( g(X_k^i) - g(X_k^j)\right) \right| \varDelta t \Bigg ) , \end{aligned}$$
(3.9)

for all \(i = 1, \ldots , N\). The term \(|\nabla U (g(X_k^i) - g(X_k^j)) |\) determines the strength of the interaction, while \((W_k^i - W_k^j)/|W_k^i - W_k^j|\) determines the direction of movement. We note that this second adaptive strategy can also be used in bi-objective problems, \(m=2\).
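
A sketch (ours) of one step of the weight adaptation, implementing (3.8) for \(m=2\) and (3.9) for general m. It assumes the hypothetical project_simplex helper from the previous sketch, a user-supplied gradient grad_U of the two-body potential with the convention \(\nabla U(0) = 0\), and illustrative parameters; the Riesz kernel gradient grad_riesz is included as one possible choice.

```python
import numpy as np

def weight_step(W, GX, grad_U, tau, dt, bi_objective=True):
    """One step of the weight adaptation: (3.8) if bi_objective, otherwise (3.9).

    W  : (N, m) weight vectors W^i_k on the simplex
    GX : (N, m) image points g(X^i_k)
    """
    dG = GX[:, None, :] - GX[None, :, :]                  # g(X^i) - g(X^j), shape (N, N, m)
    gradU = np.apply_along_axis(grad_U, -1, dG)           # grad U of each pairwise difference
    if bi_objective:
        drift = gradU.mean(axis=1)                        # (1/N) sum_j grad U(...), cf. (3.8)
    else:
        dW = W[:, None, :] - W[None, :, :]
        norm_dW = np.linalg.norm(dW, axis=-1, keepdims=True)
        dirs = np.divide(dW, norm_dW, out=np.zeros_like(dW), where=norm_dW > 0)
        strength = np.linalg.norm(gradU, axis=-1, keepdims=True)
        drift = -(dirs * strength).mean(axis=1)           # cf. (3.9)
    W_new = W + tau * dt * drift
    return np.apply_along_axis(project_simplex, 1, W_new)

def grad_riesz(z, s=1.0):
    """Gradient of the Riesz kernel U_R(z) = |z|^{-s}, with grad U(0) := 0 as in the text."""
    r = np.linalg.norm(z)
    return np.zeros_like(z) if r == 0 else -s * z / r ** (s + 2)
```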

To the best of our knowledge, energy-based diversity metrics have only been used as a selection criterion between candidate approximations of the Pareto front [26, 48], and this is the first time the vector field associated with \({\mathcal {U}}\) is used to guide the particle dynamics in a metaheuristic multi-objective optimization method.

As before, (3.8) and (3.9) can be interpreted as an explicit discretization of continuous-in-time dynamics. In particular, to study the analytical properties of the adaptive strategy, we will consider the following Vlasov-type dynamics corresponding to (3.8)

$$\begin{aligned} dW^{i}_{t} = \tau P_{T(W^i_t,\varOmega )}\left( \frac{1}{N} \sum _{j=1}^N\nabla U\left( g(X_t^i) - g(X_t^j)\right) \right) dt , \end{aligned}$$
(3.10)

for all \(i = 1, \ldots , N\), where \(P_{T(w,\varOmega )}\) corresponds to the projection to the tangent cone \(T(w,\varOmega )\) to \(\varOmega \) at \(w \in \varOmega \), see also [14] for more details. Similarly, the time-continuous approximation of (3.9) is given by

$$\begin{aligned} dW^{i}_{t} = \tau P_{T(W^i_t,\varOmega )}\left( -\frac{1}{N} \sum _{j=1}^N\frac{W_t^i - W_t^j}{|W_t^i - W_t^j|}\; \left| \nabla U \left( g(X_t^i) - g(X_t^j)\right) \right| \right) dt . \end{aligned}$$
(3.11)
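
Putting the two mechanisms together, the following sketch (ours) performs one full iteration of the adaptive dynamics of this section. It reuses the hypothetical helpers consensus_point, cbo_position_step, weight_step and grad_riesz from the previous sketches, uses the Chebyshev scalarization (2.2), and all parameter values are illustrative.

```python
import numpy as np

def adaptive_mcbo_step(X, W, objectives, alpha, dt, lam=1.0, sigma=0.8, tau=0.1, rng=None):
    """One iteration of the adaptive M-CBO dynamics:
    positions via (3.2), weight vectors via (3.8) or (3.9)."""
    rng = rng or np.random.default_rng()
    GX = np.array([[gk(x) for gk in objectives] for x in X])      # g(X^i), shape (N, m)
    # Each particle is attracted by the consensus point of its own sub-problem (2.2).
    Y = np.stack([
        consensus_point(X, np.max(w * np.abs(GX), axis=1), alpha)
        for w in W
    ])
    X_new = cbo_position_step(X, Y, dt, lam=lam, sigma=sigma, rng=rng)
    W_new = weight_step(W, GX, grad_riesz, tau, dt, bi_objective=(W.shape[1] == 2))
    return X_new, W_new
```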

4 Convergence Analysis in Mean-Field Law

In this section, we give a statistical description of the optimization dynamics by presenting the corresponding mean-field model, which allows us to analyze the convergence of the method towards a solution to the multi-objective optimization problem. We will consider any scalarization strategy where the scalar sub-problems take the form of (2.2), with the parameters space given by the unit simplex \(\varOmega \). We restrict ourselves to the case where particle positions are updated with anisotropic diffusion (3.4) for simplicity. The algorithm update rule for the positions \(X_k^i\) is given by (3.2), while weights vectors \(W_k^i\) are given by either (3.8) or (3.9).

As anticipated, we first approximate the discrete-in-time algorithm with the continuous-in-time dynamics given by (3.5) for particle locations, (3.10) or (3.11) for particle weights vectors. At the cost of introducing an approximation error which depends on \(\varDelta t\) (see Remark 4.3 below for more details), we are able in this way to derive the corresponding mean-field dynamics. Mean-field models are obtained by taking the limit \(N \rightarrow \infty \), that is, by considering an infinite number of particles. Such a description allows one to work with a single probability distribution of particles, rather than the (usually large) number N of particle trajectories given by the microscopic description. Similar to [56], we make the so-called propagation of chaos assumption on the marginals [60]. In particular, let \(F^N(t)\) be the particles probability distribution over \(({\mathbb {R}}^d \times \varOmega )^N\) at a time \(t \ge 0\). We assume \(F^N(t) \approx f(t)^{\otimes N}\), that is, we assume the particles \((X_t^i, W_t^i), i=1, \ldots ,N\) to be independently distributed according to \(f(t) \in {\mathcal {P}}({\mathbb {R}}^d \times \varOmega )\) for some large \(N\gg 1\).

In the following, we indicate with \(\rho (t) \in {\mathcal {P}}({\mathbb {R}}^d)\) the first marginal of f(t) and with \(\mu (t) \in {\mathcal {P}}(\varOmega )\) the second marginal on the parameters space \(\varOmega \).

As a consequence of the propagation of chaos assumption, we formally obtain that

$$\begin{aligned} Y_t^\alpha (W_t^i) =\frac{\frac{1}{N}\sum _{j=1}^N X_t^j e^{-\alpha G(X_t^j,W_t^i)}}{\frac{1}{N}\sum _{j=1}^N e^{-\alpha G(X_t^j, W_t^i)}} \; \approx \; \frac{\int x e^{-\alpha G(x,W_t^i)} d\rho ({t,x})}{\int e^{-\alpha G(x, W_t^i)}d\rho ({t, x})} =: y^\alpha (\rho (t), W_t^i) \end{aligned}$$

and that

$$\begin{aligned} \frac{1}{N}\sum _{j=1}^N \nabla U \left( g(X_t^i) - g(X_t^j) \right) \approx \int \nabla U \left( g(X_t^i) - g(x) \right) d\rho ({t,x})\,{,} \end{aligned}$$

where the summation with respect to the j-indices has been substituted with integration with respect to the marginal \(\rho (t)\). We note that factorization of the marginals, \(f(t) = \rho (t) \otimes \mu (t)\), is not required in the above derivations and it is not expected to hold true due to the coupling between the dynamics in \({\mathbb {R}}^d\) and \(\varOmega \).

The SDEs (3.5) and (3.10) are now independent of the index i and we obtain the process \((X_t, W_t), \,t>0\), as

$$\begin{aligned} {\left\{ \begin{array}{ll} dX_t &{}=\lambda (y^\alpha (\rho (t), W_t) - X_t)dt + \sigma \text {diag}(y^\alpha (\rho (t), W_t) - X_t) dB_t \\ dW_t&{}= \tau P_{W_t} \left( \int \nabla U \left( g(X_t) - g(x)\right) d\rho ({t,x})\right) dt \end{array}\right. } \end{aligned}$$
(4.1)

where we introduced \(P_{w}:= P_{T(w,\varOmega )}\) and anisotropic diffusion is used. In the same way, we can derive the mean-field mono-particle process associated with the CBO evolution (3.5) and the alternative adaptive dynamics (3.11). We obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} dX_t &{}=\lambda (y^\alpha (\rho (t), W_t) - X_t)dt + \sigma \text {diag}(y^\alpha (\rho (t), W_t) - X_t) dB_t \\ dW_t&{}= \tau P_{W_t} \left( - \int \frac{W_t - w}{|W_t - w|} |\nabla U \left( g(X_t) - g(x)\right) | df({t,x,w})\right) dt. \end{array}\right. } \end{aligned}$$
(4.2)

Processes (4.1) and (4.2) are mean-field descriptions of the microscopic dynamics generated by the optimization dynamics described in Sect. 3. We note that the rigorous mean-field limit for single-objective CBO dynamics, which are similar to (3.5), was proven in [42]. Following previous works, see e.g. [12, 33], we consider such an approximation and mathematically analyze the proposed optimization method by studying solutions to (4.1) and (4.2).

Remark 4.1

We note that the law f(t) of process (4.1) weakly satisfies

$$\begin{aligned} \frac{\partial }{\partial t} f(t,x,w)= & {} - \lambda \nabla _x\cdot \left( ( y^\alpha (\rho (t),w) - x) f(t,x,w) \right) \nonumber \\{} & {} + \frac{\sigma ^2}{2} \sum _{\ell =1}^d \partial _{x_\ell x_\ell }^2 \left( \left( y^\alpha (\rho (t),w) - x\right) _\ell ^2 f(t,x,w) \right) \nonumber \\{} & {} - \tau \nabla _w \cdot \left( P_w\left( \int \nabla U\left( g(x) - g(y)\right) d\rho (t,y) \right) f(t,x,w) \right) , \nonumber \\ \end{aligned}$$
(4.3)

with initial conditions \(f(0,x,w) = \rho _0 \otimes \mu _0\), \(\rho _0 = \text {Law}(X_0)\) and \(\mu _0 = \text {Law}(W_0)\), if we take as test function any \(\phi \in C^\infty _c({\mathbb {R}}^d\times \varOmega )\), as a consequence of Itô’s formula.

4.1 Convergence to the Pareto Front

In the following, we study solutions \((X_t, W_t)_{t \in [0,T]}\) to the mean-field approximations (4.1), (4.2) to gain insight into the convergence properties of the method. We assess the performance of a multi-objective algorithm by the average distance from the Pareto front F and we use the Generational Distance (GD) [62] given by

$$\begin{aligned} GD[\rho (t)] = \left( \int \text {dist}(g(x),F)^2 d\rho (t,x) \right) ^{\frac{1}{2}}, \end{aligned}$$
(4.4)

where \(\rho (t) = \text {Law}(X_t)\). In the following, we state conditions such that \(GD[\rho (t)]\) decays up to a given accuracy \(\varepsilon >0\).
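
In practice F is only available through a finite reference sample, so the following sketch (ours) evaluates the empirical counterpart of (4.4); the reference front and the test points below are purely illustrative.

```python
import numpy as np

def generational_distance(GX, F_ref):
    """Empirical generational distance (4.4).

    GX    : (N, m) image points g(X^i) of the computed particles
    F_ref : (M, m) finite reference sample of the Pareto front F
    """
    d = np.linalg.norm(GX[:, None, :] - F_ref[None, :, :], axis=-1)  # (N, M) pairwise distances
    return np.sqrt(np.mean(d.min(axis=1) ** 2))

# Example: reference front {(t, 1 - t)} and slightly perturbed image points.
t = np.linspace(0.0, 1.0, 200)
F_ref = np.stack([t, 1 - t], axis=1)
GX = F_ref[::10] + 0.05
print(generational_distance(GX, F_ref))
```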

Assumption 4.1

(Uniqueness) Every sub-problem (2.2) with \(w\in \varOmega \) admits a unique solution \({\bar{x}} (w) \in F_x\). Moreover \({\bar{x}}: \varOmega \rightarrow F_x\) is a \({\mathcal {C}}^1\) map satisfying \(\sup _{w \in \varOmega }\Vert D{\bar{x}}(w) \Vert < \infty \), \(D{\bar{x}}(w)\) being the differential at w and \(\Vert \cdot \Vert \) the operator norm.

The uniqueness requirement is common in the analysis of CBO methods [33]. This is due to the difficulty of controlling the attractive term \(y^\alpha \), whenever there are two or more minimizers. For example, assume \(\nu \) is a measure concentrated in two different global minimizers of a sub-problem w, \(\nu = \left( \delta _{x_1} + \delta _{x_2}\right) /2\): the attractive term could be located in the middle between them

$$\begin{aligned} y^\alpha (\nu ,w) = \frac{1}{2} \left( x_1 + x_2 \right) \, \end{aligned}$$

which is obviously not a minimizer. The regularity assumption on \({\bar{x}}\) will be required to provide bounds on the adaptive strategies (4.1), (4.2), and it may be dropped if the interaction in the weights space is not present.

The next assumption is concerned with lower and upper bounds in a neighborhood of minimizers of \(G(\cdot ,w), w \in \varOmega \). See also [35] and the references therein for more details on the following conditions.

Assumption 4.2

(Stability at the minimizer) The functions \(G(\cdot ,w)\), \(w \in \varOmega \), are continuous and, in a neighborhood of their minimizer, satisfy the following growth condition: there exists a radius \(R>0\), exponents \(p_1>0, p_2\le p_1\) and constants \(c_1, c_2>0\) such that for all \(w \in \varOmega \)

$$\begin{aligned} c_1 \Vert x - {\bar{x}} (w) \Vert _\infty ^{p_1} \le G(x,w) - \min _{y\in {\mathbb {R}}^d} G(y,w) \le c_2\Vert x - {\bar{x}} (w) \Vert _\infty ^{p_2}, \end{aligned}$$
(4.5)

for all \(x\in {\mathbb {R}}^d\) such that \(\Vert x - {\bar{x}} (w) \Vert _\infty \le R\).

Moreover, outside such a neighborhood, the function cannot be arbitrarily close to the minimum: there exists \(c_3>0\) such that for all \(w \in \varOmega \)

$$\begin{aligned} c_3 \le G(x,w) - \min _{y\in {\mathbb {R}}^d} G(y,w) \quad \forall \, x:\; \Vert x - {\bar{x}} (w) \Vert _\infty \ge R . \end{aligned}$$

We note that the lower bound in (4.5) is known in the literature as a conditioning property [35] or also as an inverse continuity property [33]. For a better understanding, we provide a simple example of a scalarized sub-problem satisfying the above condition.

Example 1

Consider the bi-objective problem given by \(g_1(x) = (x-0.5)^2\), \(g_2(x) = (x+0.5)^2\), \(x\in {\mathbb {R}}\), and let the scalarized sub-problems be given by \(G(x,w) = w_1g_1(x) + w_2g_2(x)\), for any \(w = (w_1, w_2)^\top \in \varOmega \). By setting \(\nabla _x G(x,w) = 0\), one obtains that the EP optimal points are given by \({\bar{x}} (w) = w_1 - 0.5\) and \(\min _{y \in {\mathbb {R}}} G(y,w) = -w_1(w_1-1)\). For all \(w = (w_1, w_2)^\top \in \varOmega , x \in {\mathbb {R}}\), it holds

$$\begin{aligned} G(x,w) - \min _{y \in {\mathbb {R}}} G(y,w) = \left( x - w_1 +0.5 \right) ^2, \quad |x - {\bar{x}}(w)| = |x - w_1 +0.5| \end{aligned}$$

and therefore Assumption 4.2 is satisfied for \(c_1 = c_2 = 1, p_1 = p_2 = 2\) and any \(R>0\).
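
A quick numerical check (ours) of the identities derived in Example 1: for randomly drawn w and x, the value \(G(x,w) - \min _y G(y,w)\) coincides with \((x - {\bar{x}}(w))^2\) up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    w1 = rng.uniform()
    w = np.array([w1, 1.0 - w1])
    x = rng.normal()
    G = lambda x, w: w[0] * (x - 0.5) ** 2 + w[1] * (x + 0.5) ** 2
    x_bar = w[0] - 0.5                      # minimizer of G(., w)
    G_min = -w[0] * (w[0] - 1.0)            # minimal value of G(., w)
    lhs = G(x, w) - G_min
    rhs = (x - x_bar) ** 2
    assert abs(lhs - rhs) < 1e-12
print("growth condition verified with c1 = c2 = 1, p1 = p2 = 2")
```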

Finally, we assume the EP optimal points to be bounded. As for the single-objective case [33, 34], we also prescribe a condition on the initial data \(\rho _0\) and \(\mu _0\).

Assumption 4.3

(Boundedness and initial datum) The set \(F_x\) of optimal points is contained in a bounded, open set H, that is, \(F_x \subset H\). The initial distribution \(f_0\) is given by \(f_0 = \rho _0 \otimes \mu _0\) with any \(\mu _0\in {\mathcal {P}}(\varOmega )\) and \(\rho _0 = \text {Unif}(H)\), that is, the uniform probability distribution over H.

Boundedness of the EP optimal points will be necessary to provide estimates on the probability mass around the solutions to the sub-problems. We note that, in many applications, the prescribed search space is bounded.

Assumptions 4.1–4.3 ensure that the quantitative results on the Laplace principle [33, 34] are applicable to all the different sub-problems (2.2) with a uniform choice of \(\alpha \). Therefore, under such assumptions, it is possible to prove convergence of the particle system towards EP optimal points by studying the large time behavior of the average \(\ell _2\)-error

$$\begin{aligned} Err [f(t)]:= {\mathbb {E}} \left[ | X_t - {\bar{x}}(W_t)|^2\right] \end{aligned}$$
(4.6)

where \(f(t) = \text {Law}(X_t, W_t)\). We will do so by following the analysis of the single-objective CBO methods [33, Theorem 12], [34, Theorem 2].

Theorem 4.1

Assume the scalar sub-problems and the initial data \(X_0 \sim \rho _0\), \(W_0\sim \mu _0\), \(f_0 = \rho _0\otimes \mu _0\) satisfy Assumptions 4.1–4.3, and that \(\nabla U \in L^\infty ({\mathbb {R}}^m)\).

For any accuracy \(\varepsilon \in (0,Err [f(0)])\), let \(\lambda , \sigma >0\), \(\tau \ge 0\) satisfy

$$\begin{aligned} \lambda > 2\sigma ^2 + C\frac{\tau }{\sqrt{\varepsilon }}, \quad \text {where} \quad C:=4 \sup _{w \in \varOmega } \Vert D {\bar{x}}(w)\Vert \Vert \nabla U\Vert _{L^{\infty }({\mathbb {R}}^m)} \end{aligned}$$
(4.7)

and \(T^*>0\) be the time horizon given by

$$\begin{aligned} T^* = \frac{2}{\lambda - 2\sigma ^2 - C\tau /\sqrt{\varepsilon }} \log \left( \frac{Err [f(0)]}{\varepsilon }\right) . \end{aligned}$$
(4.8)

Consider now a solution \((X_t, W_t)_{t \in [0,T^*]}\) to either (4.1) or (4.2) with initial data given by \((X_0, W_0)\) and such that \(\sup _{t \in [0,T^*]} {\mathbb {E}}[|X_t|^4] < \infty \).

Provided \(\alpha \) is sufficiently large, it holds

$$\begin{aligned} \min _{t \in [0,T^*]} Err [f(t)] \le \varepsilon , \end{aligned}$$
(4.9)

where \(f(t) = \text {Law}(X_t, W_t)\). Moreover, until the desired accuracy is reached, the error decays exponentially

$$\begin{aligned} Err [f(t)] \le Err [f(0)] \exp \left( -\frac{1}{2}\left( \lambda - 2\sigma ^2 - C\frac{\tau }{\sqrt{\varepsilon }}\right) t \right) . \end{aligned}$$
(4.10)

We remark that the choice of \(\alpha \) depends on the estimates given in Assumption 4.2 and, in particular, on the accuracy \(\varepsilon \). We will provide a proof in the next section. Before doing so, we show how to link the above convergence result to the GD performance metric.

Corollary 4.1

Under the assumptions of Theorem 4.1, consider \(\rho (t) = \text {Law}(X_t)\), \(f(t) = \text {Law}(X_t, W_t)\). If g is Lipschitz continuous and component-wise positive, it holds

$$\begin{aligned} \min _{t \in [0,T^*]} GD[\rho (t)] \le \text {Lip}(g) \sqrt{\varepsilon } \end{aligned}$$

and, until the desired accuracy is reached,

$$\begin{aligned} GD[\rho (t)] \le \text {Lip}(g) \sqrt{Err [f(0)]} \exp \left( - \frac{\lambda - 2 \sigma ^2 - C \tau /\sqrt{\varepsilon } }{4} t \right) . \end{aligned}$$

Proof

Since every sub-problem admits a unique solution (Assumption 4.1), by Theorem 2.1 every solution \({\bar{x}}(w)\) is EP optimal and therefore its image \(g({\bar{x}}(w))\) belongs to the Pareto front F. Therefore

$$\begin{aligned} \text {dist}(g(x),F) \le | g(x) - g({\bar{x}} (w))| \le \text {Lip}(g) | x- {\bar{x}}(w)| \end{aligned}$$

from which it follows that \(GD[\rho (t)] \le \text {Lip}(g)\sqrt{Err [f(t)]}\), and the claims follow from Theorem 4.1. \(\square \)

Remark 4.2

In Theorem 4.1, \(\tau \) must be chosen of order \(\sqrt{\varepsilon }\), suggesting that the parameters should adapt at a much slower time scale with respect to the positions, so as not to interfere with the CBO dynamics. With no interaction between the weights vectors, \(\tau =0\), the decay estimate (4.10) is independent of \(\varepsilon \) and the rate is larger than in the case \(\tau >0\).
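
For concreteness, a small numerical reading (ours) of the conditions (4.7)-(4.8) and of the time-scale separation discussed above; all numerical values are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative constants (not from the paper): lambda, sigma, accuracy, initial error, C.
lam, sigma, eps, err0, C = 1.0, 0.3, 1e-3, 1.0, 2.0

tau_max = (lam - 2 * sigma ** 2) * np.sqrt(eps) / C      # largest tau compatible with (4.7)
tau = 0.5 * tau_max                                      # slow weight adaptation, order sqrt(eps)
rate = lam - 2 * sigma ** 2 - C * tau / np.sqrt(eps)     # exponential decay rate in (4.10)
T_star = 2.0 / rate * np.log(err0 / eps)                 # time horizon (4.8)
print(f"tau < {tau_max:.4f}, chosen tau = {tau:.4f}, T* = {T_star:.2f}")
```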

Remark 4.3

When quantitative estimates of the mean-field limit are given, the expected \(\ell _2\)-error can be decomposed as

$$\begin{aligned} {\mathbb {E}} \left[ \frac{1}{N} \sum _{i=1}^N |X^i_k - {\bar{x}} (W_k^i) |^2 \right] \le {\mathcal {O}}(\varDelta t) + {\mathcal {O}}(N^{-1}) + {\mathcal {O}}(\varepsilon ) \end{aligned}$$
(4.11)

see, for instance, [31, 33]. The first term quantifies the distance between the algorithm iteration (3.2), (3.8) (or (3.9)) and the system of SDEs (3.5), (3.10) (or (3.11)) and it is given by classical results in numerical approximation of SDEs by means of the Euler-Maruyama scheme [57]. The second term, of order \(N^{-1}\), is the (expected) error introduced by the mean-field approximation, see e.g. [33]. The last term comes from the convergence of the mean-field model to the sub-problems solutions and it is given by Theorem 4.1.

We note that the additional dynamics of the weight vectors in (4.1) and (4.2) adds non-linearity to the particle system. Thus, standard arguments used to prove well-posedness and mean-field convergence of the CBO dynamics, see for instance [12, 33], cannot be adapted to the present case in a straightforward manner. For this reason, we leave such investigations for future work.

4.2 Proof of Theorem 4.1

To prove Theorem 4.1, we follow the strategy introduced in [33, 34] for the analysis of single-objective CBO particle methods.

We start by studying the time evolution of \(Err [f(t)]\).

Proposition 4.1

Under the assumptions of Theorem 4.1, for all \(t \in [0,T^*]\) it holds

$$\begin{aligned} \frac{d}{dt} Err [f(t)]\le & {} \left( -\lambda + 2\sigma ^2\right) Err [f(t)] + \left( \lambda + 2\sigma ^2 \right) {\mathbb {E}} \left[ | y^\alpha (\rho (t), W_t) - {\bar{x}}(W_t) |^2 \right] \\{} & {} + 2 \tau \sup _{w \in \varOmega } \Vert D {\bar{x}}(w)\Vert \Vert \nabla U\Vert _{L^{\infty }({\mathbb {R}}^m)} \sqrt{Err [f(t)]}. \end{aligned}$$

Proof

By applying Itô’s formula with \(\phi (x,w) = |x - {\bar{x}}(w)|^2\), we obtain

$$\begin{aligned} d\phi (X_t, W_t)&= \lambda \nabla _x \phi (X_t, W_t) \cdot (y^\alpha (\rho (t), W_t) - X_t) dt \\&\quad + \sigma \sum _{\ell =1}^d \partial _{x_\ell } \phi (X_t, W_t) (y^\alpha (\rho (t), W_t) - X_t)_\ell dB_{t,\ell } \\&\quad + \frac{\sigma ^2}{2} \sum _{\ell =1}^d \partial ^2_{x_\ell x_\ell } \phi (X_t,W_t) (y^\alpha (\rho (t), W_t) - X_t)_\ell ^2 dt \\&\quad + \tau D_w\phi (X_t, W_t) dW_t\,. \end{aligned}$$

Thanks to the solution regularity \({\mathbb {E}}[|X_t|^4] < \infty \) for all \(t \in [0,T^*]\), we have \({\mathbb {E}} [\int _0^t|\partial _{x_\ell } \phi (X_s, W_s) (y^\alpha (\rho (s), W_s) - X_s)_\ell |^2 ds] < \infty \) from which follows

$$\begin{aligned} {\mathbb {E}} \left[ \int _0^t\partial _{x_\ell } \phi (X_s, W_s) (y^\alpha (\rho (s), W_s) - X_s)_\ell dB_{s,\ell }\right] = 0 \quad \forall \ell = 1, \ldots , d \end{aligned}$$

due to [52, Theorem 3.2.1 (iii)]. Therefore, by taking the expectation and applying Fubini’s theorem we obtain

$$\begin{aligned} \frac{d}{dt}Err [f(t)]&= \frac{d}{dt}{\mathbb {E}} [\phi (X_t, W_t)] \\&= {\mathbb {E}} \left[ \lambda \nabla _x \phi (X_t, W_t) \cdot (y^\alpha (\rho (t), W_t) - X_t) \right] \\&\quad + {\mathbb {E}}\left[ \frac{\sigma ^2}{2} \sum _{\ell =1}^d \partial ^2_{x_\ell x_\ell } \phi (X_t,W_t) \left( y^\alpha (\rho (t), W_t) - X_t \right) _\ell ^2 \right] \\&\quad + {\mathbb {E}} \left[ \tau D_w\phi (X_t, W_t) \frac{d}{dt} W_t \right] \\&=: T_1 + T_2 + T_3\,. \end{aligned}$$

The first term can be bounded as follows:

$$\begin{aligned} T_1&= 2\lambda {\mathbb {E}} \left[ (X_t - {\bar{x}}(W_t) )\cdot (y^\alpha (\rho (t),W_t) - X_t) \right] \\&=2\lambda {\mathbb {E}} \left[ (X_t - {\bar{x}}(W_t) )\cdot (y^\alpha (\rho (t),W_t) - {\bar{x}}(W_t) ) \right] -2 \lambda {\mathbb {E}} \left[ |X_t - {\bar{x}}(W_t)|^2 \right] \\&\le \lambda {\mathbb {E}} \left[ |X_t - {\bar{x}}(W_t)|^2 \right] + \lambda {\mathbb {E}} \left[ |y^\alpha (\rho (t),W_t) - {\bar{x}}(W_t)|^2 \right] \\&\quad - 2\lambda {\mathbb {E}} \left[ |X_t - {\bar{x}}(W_t)|^2 \right] , \end{aligned}$$

where in the last inequality we used that \(a \cdot b \le (|a|^2 + |b|^2)/2\) for any \(a,b \in {\mathbb {R}}^d\). By employing the definition of \(Err [f(t)]\), we obtain

$$\begin{aligned} T_1 \le -\lambda Err [f(t)] + \lambda {\mathbb {E}} \left[ |y^\alpha (\rho (t),W_t) - {\bar{x}}(W_t)|^2 \right] . \end{aligned}$$
(4.12)

Given that \(\partial ^2_{x_\ell x_\ell } \phi (x,w) = 2\), it holds

$$\begin{aligned} T_2&= {\sigma ^2} {\mathbb {E}}\left[ \sum _{\ell =1}^d (y^\alpha (\rho (t), W_t) - X_t)_\ell ^2\right] \nonumber \\&= {\sigma ^2} {\mathbb {E}}\left[ |y^\alpha (\rho (t), W_t) - X_t|^2 \right] \nonumber \\&\le 2\sigma ^2 {\mathbb {E}} [|y^\alpha (\rho (t), W_t) - {\bar{x}}(W_t)|^2 ] +2 \sigma ^2 {\mathbb {E}} [| {\bar{x}}(W_t) - X_t|^2 ] \nonumber \\&= 2\sigma ^2 {\mathbb {E}} [|y^\alpha (\rho (t), W_t) - {\bar{x}}(W_t)|^2 ] + 2\sigma ^2 Err [f(t)]\,, \end{aligned}$$
(4.13)

thanks to the inequality \(|a-b|^2 \le 2(|a-c|^2 + |c-b|^2)\) for any \(a,b,c \in {\mathbb {R}}^d\). To bound the third term, we first note that \(D_w \phi (x,w)(v) = 2(x - {\bar{x}}(w))\cdot D {\bar{x}}(w) (v)\), for any \(v \in {\mathbb {R}}^m\). Note that this requires differentiating on the manifold, and the corresponding differential is denoted by \(D_w\).

Now, thanks to the boundedness of \(\nabla U\), it holds

$$\begin{aligned} \left| \frac{d}{dt}W_t \right| \le \Vert \nabla U\Vert _{L^\infty ({\mathbb {R}}^m)}, \end{aligned}$$

regardless of whether the weights are adapted according to (4.1) or (4.2). Thanks to the regularity of \({\bar{x}}\), we obtain

$$\begin{aligned} T_3&= 2\tau {\mathbb {E}} \left[ (X_t - {\bar{x}}(W_t)) \cdot D {\bar{x}}(W_t) \frac{d}{dt}W_t \right] \nonumber \\&\le 2 \tau \sup _{w \in \varOmega } \Vert D {\bar{x}}(w)\Vert \Vert \nabla U \Vert _{L^\infty ({\mathbb {R}}^m)} {\mathbb {E}} \left[ |X_t - {\bar{x}}(W_t) | \right] \nonumber \\&\le 2 \tau \sup _{w \in \varOmega } \Vert D{\bar{x}}(w)\Vert \Vert \nabla U \Vert _{L^\infty ({\mathbb {R}}^m)} \sqrt{Err [f(t)]} \end{aligned}$$
(4.14)

where we applied Jensen’s inequality in the last step. By putting together (4.12), (4.13) and (4.14), we obtain the desired estimate. \(\square \)

Lemma 4.1

(Quantitative Laplace principle) Under the assumptions of Theorem 4.1, let \(\rho \in {\mathcal {P}}({\mathbb {R}}^d)\) be any measure such that the EP optimal points belong to its support, \(F_x \subset \text {supp}(\rho )\). Consider any \(r \in [0,R]\) and \(q>0\) such that \(q+c_2r^{1/{p_2}} \le c_3\). Then, for all \(w \in \varOmega \), it holds

$$\begin{aligned} | y^\alpha (\rho ,w) - {\bar{x}}(w) | \le \sqrt{d}c_1 \left( q + c_2 r^{1/p_2}\right) ^{1/p_1} +\frac{\sqrt{d}\exp (-\alpha q)}{\rho \left( B_r^\infty ({\bar{x}}(w))\right) } \int |x - {\bar{x}}(w)| d \rho {(x)} \end{aligned}$$

where \(B_r^\infty (z) \) is the open \(\ell _{\infty }\)-ball centered in \(z\in {\mathbb {R}}^d\) with radius r.

Proof

For a fixed \(w \in \varOmega \), we recall the analogous result from the single-objective case [34, Proposition 1]. Assume without loss of generality that \(\min _{y \in {\mathbb {R}}^d} G(y, w) = 0\), and define \(G_r(w):= \sup _{y\in B_r^\infty ({\bar{x}}(w))} G(y,w)\). For any \(r \in [0,R]\) and \(q>0\) such that \(q + G_r(w) \le c_3\), it holds

$$\begin{aligned} | y^\alpha (\rho ,w) - {\bar{x}}(w) | \le \sqrt{d}c_1 \left( q + G_r(w)\right) ^{1/p_1} +\frac{\sqrt{d}\exp (-\alpha q)}{\rho \left( B_r^\infty ({\bar{x}}(w))\right) } \int |x - {\bar{x}}(w)| d \rho {(x)}. \end{aligned}$$

Assumption 4.2 ensures that all sub-objectives \(G(\cdot , w)\) satisfy a common growth condition around their minimizers. In particular, we have that \(G_r(w) \le c_2 r^{1/p_2}\) for all \(w \in \varOmega \), which leads to the desired estimate. \(\square \)

Next, to apply the above result for all \(t\in [0,T^*]\), we provide a lower bound on \(\rho \left( B_r^\infty ({\bar{x}}(w))\right) \) for all \(w\in \varOmega \). As the lower bound will be independent of the definition of \({\bar{x}} (w)\), we consider in the following any point \(x^* \in {\mathbb {R}}^d\) belonging to the initial search domain, \(x^* \in \text {supp}(\rho _0) = H\) (see Assumption 4.3). To do so, we introduce the mollifier \(\phi _r\), for \(r>0\):

$$\begin{aligned} \phi _r(x):= {\left\{ \begin{array}{ll} \prod _{k=1}^d \exp \left( 1 - \frac{r^2}{r^2 - (x - x^*)^2_k} \right) , &{} \text {if}\;\; \Vert x -x^*\Vert _{\infty } <r\,, \\ 0, &{}\text {else}. \end{array}\right. } \end{aligned}$$

Lemma 4.2

Under the settings of Theorem 4.1, let \(T\in [0, T^*], r>0,\) and \(x^* \in \text {supp}(\rho _0)\) be fixed. Assume

$$\begin{aligned} \sup _{t \in [0,T]} \sup _{w\in \varOmega } \Vert y^\alpha (\rho (t),w) - x^*\Vert _\infty < B \end{aligned}$$
(4.15)

for some \(B>0\). It holds

$$\begin{aligned} {\mathbb {P}}(X_t \in B_r^\infty (x^*))\ge & {} {\mathbb {E}}[\phi _r(X_0)] \exp (-qt) \\ \text {for} \quad q:= & {} 2d \max \left\{ \frac{\lambda (cr +B \sqrt{c})}{(1-c)^2 r} +\frac{\sigma ^2(cr^2 + B^2)(2c+1)}{(1-c)^4r^2},\,\frac{2\lambda ^2}{(2c-1)\sigma ^2} \right\} \end{aligned}$$

where \(c\in (0.5,1)\) can be any constant satisfying \((1-c^2)^2 \le (2c-1)c\).

We refer to Appendix C for a proof. Now, we are ready to provide a proof of the main result.

Proof

(Theorem 4.1) Thanks to Proposition 4.1, we have for all \(t \in [0,T^*]\),

$$\begin{aligned} \frac{d}{dt} Err [f(t)]\le & {} \left( -\lambda +2\sigma ^2\right) Err [f(t)] + \left( \lambda + 2\sigma ^2 \right) {\mathbb {E}} \left[ | y^\alpha (\rho (t), W_t) - {\bar{x}}(W_t) |^2 \right] \\{} & {} + 2\tau \sup _{w \in \varOmega } \Vert {D {\bar{x}}(w)}\Vert \Vert \nabla U\Vert _{L^{\infty }({\mathbb {R}}^m)} \sqrt{Err [f(t)]}. \end{aligned}$$

We consider now

$$\begin{aligned} T:= \sup \left\{ t \ge 0 \,\Big |\, Err [f(s)] > \varepsilon \; \text {and}\; \sup _{w \in \varOmega }|y_\alpha (\rho (s), w) - {\bar{x}}(w)| \le {{\tilde{C}}}(s) \;\forall s \in [0,t] \right\} \end{aligned}$$

where

$$\begin{aligned} {{\tilde{C}}}(t):= \frac{1}{2}\frac{\lambda - 2\sigma ^2}{\lambda + 2\sigma ^2}Err [f(t)]. \end{aligned}$$

For all \(t \in [0,T]\) it holds \(Err [f(t)] \ge \varepsilon \) and therefore we have

$$\begin{aligned}&2\tau \sup _{w \in \varOmega } \Vert {D {\bar{x}}(w)}\Vert \Vert \nabla U\Vert _{L^{\infty }({\mathbb {R}}^m)} \sqrt{Err [f(t)]}\\&\quad \le 2\tau \sup _{w \in \varOmega } \Vert {D {\bar{x}}(w)}\Vert \Vert \nabla U\Vert _{L^{\infty }({\mathbb {R}}^m)} \sqrt{Err [f(t)]}\frac{\sqrt{Err [f(t)]}}{\sqrt{\varepsilon }}\\&\quad =: C \frac{\tau }{2 \sqrt{\varepsilon }} \, Err [f(t)]\,. \end{aligned}$$

By using the above estimate and definition of \({{\tilde{C}}}(t)\), we then obtain

$$\begin{aligned} \frac{d}{dt} Err [f(t)]&\le \left( -\lambda +2\sigma ^2\right) Err [f(t)] + \left( \lambda +2 \sigma ^2 \right) \frac{1}{2}\frac{\lambda - 2\sigma ^2}{\lambda + 2\sigma ^2} Err [f(t)] \\&\quad + C \frac{\tau }{2\sqrt{\varepsilon }} \, Err [f(t)]\\&= - \frac{1}{2}\left( \lambda - 2\sigma ^2 - C \frac{\tau }{\sqrt{\varepsilon }}\right) Err [f(t)]\,. \end{aligned}$$

By applying Grönwall’s inequality we obtain the error decay

$$\begin{aligned} Err [f(t)] \le Err [f(0)] \exp \left( -\frac{1}{2} \left( \lambda - 2\sigma ^2 - C\tau / \sqrt{\varepsilon } \right) \, t \right) \end{aligned}$$
(4.16)

for all \(t \in [0,T]\). In particular, by definition of \({{\tilde{C}}}(t)\), it holds

$$\begin{aligned} \sup _{w \in \varOmega }|y_\alpha (\rho (s), w) - {\bar{x}}(w)| \le {{\tilde{C}}}(s) \le {{\tilde{C}}}(0) \end{aligned}$$

from which it follows, for any \(w^*\in \varOmega \),

$$\begin{aligned} \sup _{w \in \varOmega }|y_\alpha (\rho (s), w) - {\bar{x}}(w^*)|&\le \sup _{w \in \varOmega }|y_\alpha (\rho (s), w) - {\bar{x}}(w)| + | {\bar{x}}(w) - {\bar{x}} (w^*)| \nonumber \\&\le {{\tilde{C}}}(0) + \text {diam}(F_x) \end{aligned}$$
(4.17)

due to the boundedness of the EP optimal points \(F_x = \{{\bar{x}} (w)\,|w \in \varOmega \}\). We now consider three different cases.

Case \(T\ge T^*\). By definition of \(T^*\) and thanks to the error decay (4.16) over [0, T], it holds \(Err [f(T^*)] \le \varepsilon \).

Case \(T< T^*\) and \(Err [f(T)] = \varepsilon \). There is nothing to prove here.

Case \(T< T^*\), \(Err [f(T)] > \varepsilon \) and \( \sup _{w \in \varOmega }|y_\alpha (\rho (T), w) - {\bar{x}}(w)| > {{\tilde{C}}}(T)\). We will show that if \(\alpha \) is sufficiently large this case cannot occur. In particular, we prove that, for \(\alpha \) large enough, \( \sup _{w \in \varOmega }|y_\alpha (\rho (T), w) - {\bar{x}}(w)| < {{\tilde{C}}}(T)\), and therefore we have a contradiction.

Let us fix any \(w \in \varOmega \) and let \({{\tilde{C}}}_\varepsilon \) be given by

$$\begin{aligned} {{\tilde{C}}}_\varepsilon := \frac{1}{2}\frac{\lambda - 2\sigma ^2}{\lambda + 2\sigma ^2}\,\varepsilon . \end{aligned}$$

We take q, r as follows

$$\begin{aligned} q&:= \frac{1}{2} \min \left\{ \left( \frac{{{\tilde{C}}}_\varepsilon }{c_1 2 \sqrt{d}} \right) ^{p_1}, c_3 \right\} \,, \\ r&:= \min \left\{ \left( \frac{q}{c_2}\right) ^{p_2}, R \right\} \,. \end{aligned}$$

This choice implies \(r \le R\) and

$$\begin{aligned} \sqrt{d}c_1 \left( q + c_2 r^{1/p_2}\right) ^{1/p_1}&\le \sqrt{d}c_1 \left( q + c_2 \frac{q}{c_2}\right) ^{1/p_1} =\sqrt{d}c_1 \left( 2q\right) ^{1/p_1}\nonumber \\&\le \sqrt{d}c_1 \frac{{{\tilde{C}}}_\varepsilon }{c_1 2 \sqrt{d}} = \frac{{{\tilde{C}}}_\varepsilon }{2}\,. \end{aligned}$$
(4.18)

We can now apply Lemma 4.1 with \(x^* = {\bar{x}}(w)\) and \(\rho = \rho (T) = \text {Law}(X_T)\). We obtain

$$\begin{aligned} | y^\alpha (\rho (T),w)&- {\bar{x}}(w) | \nonumber \\&\le \sqrt{d}c_1 \left( q + c_2 r^{1/p_2}\right) ^{1/p_1} +\frac{\sqrt{d}\exp (-\alpha q)}{{\mathbb {P}}(X_T \in B_r^\infty ({\bar{x}}(w)))} {\mathbb {E}}[|X_T - {\bar{x}}(w)|]\nonumber \\&\le \frac{{{\tilde{C}}}_\varepsilon }{2} +\frac{\sqrt{d}\exp (-\alpha q)\,{\mathbb {E}}[|X_T - {\bar{x}}(w)|]}{{\mathbb {P}}(X_T \in B_r^\infty ({\bar{x}}(w)))} \,, \end{aligned}$$
(4.19)

where the last inequality follows from (4.18). At the same time, thanks to the bound provided by (4.17), we can apply Lemma 4.2 to obtain a lower bound on the mass around \({\bar{x}}(w)\) at time T:

$$\begin{aligned} {\mathbb {P}}(X_T \in B_r^{\infty }({\bar{x}}(w))) \ge {\mathbb {E}}[\phi _{r,w}(X_0)] \exp (-{p}T) \end{aligned}$$
(4.20)

for some \(p>0\) independent of w and \(\alpha \), and where \(\phi _{r,w}(x)\) is now the mollifier centered around the EP optimal point \({\bar{x}}(w)\). We note that \(\inf _{w\in \varOmega }{\mathbb {E}}[\phi _{r,w}(X_0)]=:m_0 >0\), due to our choice of initial data. Moreover, we have by Jensen's inequality

$$\begin{aligned} {\mathbb {E}}[|X_T - {\bar{x}}(w)|]&\le {\mathbb {E}}[|X_T - {\bar{x}}(W_T)|] + {\mathbb {E}}[| {\bar{x}}(W_T) - {\bar{x}}(w)|] \\&\le \sqrt{2Err [f(0)]} + \text {diam}(F_x) \end{aligned}$$

where we used again the error decay estimate and the boundedness of \(F_x\). By combining the above estimates, we obtain

$$\begin{aligned} \frac{\sqrt{d}\exp (-\alpha q)\,{\mathbb {E}}[|X_T - {\bar{x}}(w)|]}{{\mathbb {P}}(X_T \in B_r^\infty ({\bar{x}}(w)))}&\le \frac{\sqrt{d}\exp (-\alpha q + T {p})}{m_0}\left( \sqrt{2Err [f(0)]} + \text {diam}(F_x) \right) \\&\le \frac{\sqrt{d}\exp (-\alpha q + T^* {p})}{m_0}\left( \sqrt{2Err [f(0)]} + \text {diam}(F_x) \right) \end{aligned}$$

where the right-hand side is independent of the choice of \(w\in \varOmega \). We also note that the rate \(q\) does not depend on \(\alpha \). As a consequence, there exists \(\alpha _0>0\) such that it holds

$$\begin{aligned} \frac{\sqrt{d}\exp (-\alpha q)\,{\mathbb {E}}[|X_T - {\bar{x}}(w)|]}{{\mathbb {P}}(X_T \in B_r^\infty ({\bar{x}}(w)))} < \frac{{{\tilde{C}}}_\varepsilon }{2} \quad \text {for all}\, w \in \varOmega , \end{aligned}$$

for any \(\alpha > \alpha _0\).

Together with (4.19), this leads to \( \sup _{w \in \varOmega }|y_\alpha (\rho (T), w) - {\bar{x}}(w)| < {{{\tilde{C}}}_\varepsilon }\). By definition of \({{\tilde{C}}}(T)\), it holds \( {{\tilde{C}}}_\varepsilon \le {{\tilde{C}}}(T)\), and so we reach the desired contradiction. \(\square \)

Theorem 4.1 and Corollary 4.1 show that the particle distribution will converge towards a neighborhood of the Pareto front. Due to the adaptive strategy, the distribution might move along the front. In the next section, we will therefore analyze the dynamics in the parameter space \(\varOmega \) to investigate the diversity of the computed solution.

4.3 Decay of Diversity Measure

The aim of interactions (3.8), (3.9) is to improve the distribution of the parameters \(\{W_t^i\}_{i=1}^N\) so that, in view of Theorem 2.1, the corresponding (weak) EP optimal points are well-distributed in the image space. In the following, we will investigate the special case where \(m = 2\) and the Pareto front F corresponds to the unit simplex \(\varOmega \). Dynamics (3.8) can then be shown to be a discretization of a gradient flow on the unit simplex \(\varOmega \) when Chebyshev scalarization (2.2) is used. While considering \(F = \varOmega \) is clearly a strong assumption, the analysis sheds some light on why (3.8) may be a suitable adaptive strategy for the parameters.

We restrict again to the case of Assumption 4.1, which ensures uniqueness of global minima and continuity of \({\bar{x}}\), where \({\bar{x}}(w)\) is the global solution to (2.2) for a given \( w\in \varOmega \). As we are interested in the relation between w and its corresponding point on the Pareto front \(g\left( {\bar{x}}(w)\right) \in F\), let us assume

$$\begin{aligned} X_t = {\bar{x}} (W_t)\, \end{aligned}$$
(4.21)

that is, every particle is assumed to be located at its corresponding EP optimal point. This is justified by the convergence result (Theorem 4.1) and by the fact that the particle interactions in the search space \({\mathbb {R}}^d\) take place at a faster time scale than the adaptation of the parameters, see Remark 4.2.

The parameter evolution in the mean-field model (4.1) then becomes independent of \(X_t\) and can be written as

$$\begin{aligned} dW_t = \tau P_{W_t} \left( \int \nabla U \left( {\bar{g}}(W_t) - {\bar{g}}(w)\right) d\mu ({t,w})\right) dt \end{aligned}$$
(4.22)

where \(\mu (t) = \text {Law}(W_t)\) and, for simplicity, we introduced \({\bar{g}}:= g \circ {\bar{x}}\).

Assumption 4.4

Let (1.1) be a bi-objective problem, \(m=2\), and the scalar sub-problems be given by the Chebyshev scalarization strategy (2.2). We assume the Pareto front F is given by

$$\begin{aligned} F = \varOmega = \{ (s, 1-s)^\top :\, s \in [0,1]\} \end{aligned}$$

and that the binary potential energy U is radially symmetric.

Lemma 4.3

Under Assumption 4.4, for all \(w,v \in \varOmega \) it holds

$$\begin{aligned} P_w\Big ( \nabla U\big ({\bar{g}}(w) - {\bar{g}}(v)\big ) \Big ) = - P_w\big (\nabla U (w - v) \big ). \end{aligned}$$
(4.23)

Proof

We note that when \(F = \varOmega \), all weakly EP optimal points are also EP optimal and hence, by Theorem 2.1, \({\bar{g}}(w) = g({\bar{x}}(w)) \in F\) for all \(w \in \varOmega \). Then, for any \(w \in \varOmega \), there exists \(s \in [0,1]\) such that \({\bar{g}} (w) = (s, 1-s)^\top \). By definition of the sub-problem (2.2) with \(w = (w_1,w_2) = (w_1, 1-w_1)\) and of \(G(x,w)\), the point \({\bar{g}}(w)\) attains

$$\begin{aligned} \min _{y \in F} \max \left\{ y_1w_1,\, y_2 w_2 \right\} = \min _{s \in [0,1]} \max \{ s w_1, (1-s) (1-w_1) \}. \end{aligned}$$

At the minimizer, it must hold \(s w_1 = (1-s)(1-w_1)\) and hence \(s = 1-w_1\). It follows that

$$\begin{aligned} {\bar{g}} (w) = A w, \quad \text {where} \quad A = \begin{pmatrix} 0 &{}\quad 1\\ 1 &{}\quad 0 \end{pmatrix}. \end{aligned}$$
(4.24)

Since U is radially symmetric and the matrix A is orthogonal, it holds \(\nabla U ( Aw - Av) = A \nabla U(w-v)\).

Finally, let us consider the basis \(n_1 = (1,1)^\top , n_2 = (1,-1)^\top \) and a vector \(u \in {\mathbb {R}}^2, u = u_1 n_1 + u_2 n_2\) for some \(u_1,u_2 \in {\mathbb {R}}\). We note that the tangent cone at \(w\in \varOmega \) satisfies \(T(w, \varOmega ) \subseteq \text {span}(n_2)\) and therefore \(P_w(n_1) = 0\). Together with the fact that \(Au = u_1n_1 - u_2n_2\), this leads to

$$\begin{aligned} P_w(Au) = P_w(-u_2 n_2) = P_w(-u)\, \end{aligned}$$

and identity (4.23) follows. \(\square \)

Thanks to Lemma 4.3, under Assumption 4.4 equation (4.22) can be simplified to

$$\begin{aligned} dW_t = \tau P_{W_t} \left( - \int \nabla U \left( W_t - w \right) d\mu ({t,w})\right) dt. \end{aligned}$$
(4.25)

We note that \(\mu (t)\) weakly satisfies

$$\begin{aligned} \frac{\partial }{\partial t} \mu (t,w) = - \tau \nabla _w \cdot \left( P_w\left( - \int \nabla U\left( w -v \right) d\mu (t,v) \right) \mu (t,w) \right) \end{aligned}$$
(4.26)

with initial condition \(\mu (0) = \mu _0\), over the test set \(C_c^\infty (\varOmega )\). Equation (4.26) describes the continuum dynamics of particles which interact pairwise and are confined to the set \(\varOmega \). Such aggregation models on bounded domains have been the subject of several works, see for instance [14, 28, 29, 55]. Particularly relevant to the present work is [14], where general prox-regular sets, like \(\varOmega \), are considered. We recall the main result from [14] for completeness, after introducing the necessary notation.

In the following, \(\text {Conv}(\varOmega - \varOmega )\) denotes the convex envelope of \((\varOmega - \varOmega ) =\{w-v\,:\, w, v \in \varOmega \} \), and \(U \in {\mathcal {C}}^1({\mathbb {R}}^m)\) is said to be \({\tilde{\lambda }}\)-geodesically convex on a set \(A \subset {\mathbb {R}}^m\) if for all \(x,y\in A\)

$$\begin{aligned} U(y) \ge U(x) + \nabla U(x) \cdot (y-x) + \frac{{\tilde{\lambda }}}{2} |y - x|^2. \end{aligned}$$

Theorem 4.2

([14, Theorem 1.5]) Assume \(U \in {\mathcal {C}}^1({\mathbb {R}}^m)\) to be \({\tilde{\lambda }}\)-geodesically convex on \(\text {Conv}(\varOmega - \varOmega )\) for some \({\tilde{\lambda }} \in {\mathbb {R}}\). For any initial data \(\mu _0 \in {\mathcal {P}}_2(\varOmega )\) there exists a locally absolutely continuous curve \(\mu (t) \in {\mathcal {P}}_2(\varOmega ), t>0,\) such that \(\mu \) is a gradient flow with respect to \({\mathcal {U}}\). Also, \(\mu \) is a weak measure solution to (4.26).

Furthermore,

$$\begin{aligned} \frac{d}{dt}\,{\mathcal {U}}[\mu (t)] \le - \int \left| P_w\left( \nabla U *\mu (t) (w) \right) \right| ^2 d \mu (t,w) , \end{aligned}$$
(4.27)

where \(*\) denotes the convolution operator.

In this specific setting, thanks to (4.24) and the continuity of \({\bar{x}}\), we have

$$\begin{aligned} {\mathcal {U}}[\mu (t)] = {\mathcal {U}}[{\bar{g}}_\#\mu (t)] = {\mathcal {U}}[ g_\#\rho (t)] \end{aligned}$$

since \({\bar{g}} = g \circ {\bar{x}}\) and \(\rho (t) = {\bar{x}}_\# \mu (t)\) due to (4.21). Therefore, Theorem 4.2 states that the energy over the front is decreasing. We note that the flow may converge to stationary points of (4.22) that are not minimal configurations, as observed in [28] even for simple domains.

Clearly, without ansatz (4.21), there is no guarantee that the potential decreases along the evolution of the algorithm. Quite the opposite, by Theorem 4.1 particles are expected to concentrate on the Pareto front leading to an increased potential \({\mathcal {U}}\). Nevertheless, if Theorem 4.1 applies, there exists a time \(T>0\) where

$$\begin{aligned} {\mathbb {E}} [|X_t - {\bar{x}} (W_t)|^2] < \varepsilon \quad \text {and hence} \quad X_t \approx {\bar{x}}(W_t) \end{aligned}$$

making ansatz (4.21) approximately valid. Therefore, we claim that the reduced model (4.22) describes the dynamics for \(t>T\). We will numerically investigate the two phases of the algorithm: a first one in which concentration over the Pareto front takes place, and a second one in which the potential \({\mathcal {U}}\) decays, leading to an improved diversity of the solution.
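To illustrate this second phase in isolation, the following minimal Python sketch simulates an explicit Euler particle discretization of the reduced dynamics (4.25) under Assumption 4.4, using the Morse potential (5.2). The time step, the clipping-based realization of the projection \(P_w\) at the boundary, and all function names are illustrative choices and are not taken from the released implementation.

import numpy as np

def morse_grad(z, C=20.0):
    # gradient of the Morse potential U(z) = exp(-C|z|), for z != 0, cf. (5.2)
    r = np.linalg.norm(z)
    return -C * np.exp(-C * r) * z / r

def simplex_flow(s0, tau=0.1, dt=1e-2, steps=2000, C=20.0):
    # explicit Euler discretization of the reduced dynamics (4.25) on
    # Omega = {(s, 1-s) : s in [0,1]}, parametrized by s; the projection P_w is
    # realized by keeping the tangential component and clipping s to [0,1]
    s = np.array(s0, dtype=float)                   # assumed pairwise distinct
    N = len(s)
    n2 = np.array([1.0, -1.0]) / np.sqrt(2.0)       # unit tangent direction of Omega
    for _ in range(steps):
        w = np.stack([s, 1.0 - s], axis=1)          # embed the parameters in R^2
        ds = np.zeros(N)
        for i in range(N):
            force = np.zeros(2)
            for j in range(N):
                if j != i:
                    force -= morse_grad(w[i] - w[j], C) / N   # -(1/N) sum_j grad U(w_i - w_j)
            ds[i] = tau * (force @ n2) / np.sqrt(2.0)         # dW = sqrt(2) n2 ds = tau P_w(force) dt
        s = np.clip(s + dt * ds, 0.0, 1.0)
    return s

# e.g.: s_final = simplex_flow(np.random.default_rng(0).random(50))

With repulsive potentials of this type, the parameters spread out along the simplex, consistently with the energy decay (4.27).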

5 Numerical Experiments

In this section, we numerically investigate the performance of the proposed method by testing it against several benchmark bi-objective problems. We also examine the role of the parameters, in particular of \(\tau \), which controls the step size of the adaptation of the weights vectors \(W^i_k\).

For the sake of reproducible research, a MATLAB implementation of the proposed algorithm is made available in the GitHub repository https://github.com/borghig/AM-CBO .

We recall that, in the following experiments, the particle positions \(\{X_k^i\}_{i=1}^N\) are updated according to (3.2), while the sub-problem parameters \(\{W_k^i\}_{i=1}^N\) are updated according to (3.8). For experiments with the second adaptive strategy (3.9) and validation against tri-objective problems, we refer to [5]. The complete optimization method is described by Algorithm 1. A remark on the computational complexity of the algorithm follows.

Algorithm 1: AM-CBO

Remark 5.1

Even though in every iteration the objective function g is evaluated only N times, the overall computational complexity per iteration is \({\mathcal {O}}(N^2)\): the computation of each consensus point \(Y^\alpha _k(w)\) requires \({\mathcal {O}}(N)\) operations, and the same holds for the parameter update (3.8), which is particularly costly.

One can reduce the computational complexity by considering only a random subset \(I_k^M \subset \{1, \ldots , N \}\) of \(M \ll N\) particles when computing (3.1) and (3.8), substituting

$$\begin{aligned} \frac{1}{N} \sum _{j=1}^N (\cdot )^j \quad \text {with} \quad \frac{1}{M} \sum _{j \in I^M_k} (\cdot )^j, \end{aligned}$$

whenever a sum over the different particles is performed. Inspired by Monte Carlo particle simulations [1, 46], this random mini-batch technique allows one to lower the complexity to \({\mathcal {O}}(NM)\). We also note that Fast Multipole Methods (FMM) [38] may additionally be used to speed up the computation of the potential field, further reducing the computational complexity of (3.8) and (3.9).
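As an illustration of the random mini-batch idea, the following Python sketch approximates the consensus point of a single sub-problem using only M randomly selected particles. The Gibbs-type weighting \(\exp (-\alpha G(x,w))\) with the Chebyshev scalarization \(G(x,w) = \max _i w_i g_i(x)\) is assumed here as a stand-in for the precise definition (3.1); all names are illustrative and do not refer to the released code.

import numpy as np

def chebyshev(gx, w):
    # Chebyshev scalarization G(x, w) = max_i w_i * g_i(x), cf. (2.2)
    return np.max(w * gx)

def consensus_point_minibatch(X, g, w, alpha, M, rng):
    # mini-batch approximation of the consensus point for the sub-problem with
    # parameter w: only M << N randomly chosen particles enter the Gibbs-type
    # average, reducing the per-particle cost from O(N) to O(M), cf. Remark 5.1
    N = X.shape[0]
    batch = rng.choice(N, size=M, replace=False)    # random subset I_k^M
    vals = np.array([chebyshev(g(X[j]), w) for j in batch])
    logw = -alpha * (vals - vals.min())             # shift to avoid overflow for large alpha
    weights = np.exp(logw)
    return weights @ X[batch] / weights.sum()

# example with a toy bi-objective function (purely illustrative):
# g = lambda x: np.array([np.sum((x - 1.0) ** 2), np.sum((x + 1.0) ** 2)])
# rng = np.random.default_rng(0)
# X = rng.random((100, 10))                         # N = 100 particles, d = 10
# Y = consensus_point_minibatch(X, g, np.array([0.5, 0.5]), alpha=1e6, M=20, rng=rng)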

5.1 Performance Metrics

Denote by \(\{X_k^i\}_{i=1}^N\) the set of particle positions at the k-th algorithm iteration and by \(\rho _k^N \in {\mathcal {P}}({\mathbb {R}}^d)\) their empirical distribution. We employ three different energies, based on the Riesz s-energy (3.7) and on the Newtonian and Morse potentials, both to measure the diversity of the solutions and to determine the dynamics of the weights vectors. The Newtonian binary potential is given by

$$\begin{aligned} U_N (z) = {\left\{ \begin{array}{ll} -\log |z| &{} \text {if} \;\; m = 2 \\ |z|^{2-m} &{} \text {if}\;\; m >2 \end{array}\right. } , \end{aligned}$$
(5.1)

while the Morse potential is given by

$$\begin{aligned} U_{M}(z) = e^{-C|z|} \quad \text {with}\quad C>0. \end{aligned}$$
(5.2)

All considered potentials describe short-range repulsion between the particles. While the Morse potential is \({\tilde{\lambda }}\)-geodesically convex, the Newtonian and Riesz potentials are not. Since we will also employ the corresponding energies \({\mathcal {U}}_R\), \({\mathcal {U}}_N\), \({\mathcal {U}}_M\) to define the interaction between weights vectors, the constant C can be considered as an algorithm parameter when the Morse repulsion is used.
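For concreteness, the corresponding empirical diversity energies can be evaluated on an ensemble of image-space points with a few lines of Python. The Riesz exponent and the normalization of the discrete energy are illustrative assumptions, since (3.7) and the precise normalization are given earlier in the paper.

import numpy as np

def riesz(z, s=1.0):
    # Riesz kernel |z|^(-s); the exact form and exponent of (3.7) are assumed here
    return np.linalg.norm(z) ** (-s)

def newtonian(z, m=2):
    # Newtonian potential (5.1)
    r = np.linalg.norm(z)
    return -np.log(r) if m == 2 else r ** (2 - m)

def morse(z, C=20.0):
    # Morse potential (5.2)
    return np.exp(-C * np.linalg.norm(z))

def diversity_energy(Y, potential):
    # empirical pairwise interaction energy of the image-space points Y (shape N x m);
    # the 1/N^2 normalization is an illustrative choice
    N = len(Y)
    total = sum(potential(Y[i] - Y[j]) for i in range(N) for j in range(i + 1, N))
    return 2.0 * total / N ** 2

# e.g.: U_M = diversity_energy(np.array([g(x) for x in X]), morse)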

To show the validity of the energy-based diversity metrics, we additionally consider the hypervolume contribution metric \({\mathcal {S}}\) [66]. Let \(g^* \in {\mathbb {R}}^m\) be a maximal element with respect to the natural partial ordering, that is,

$$\begin{aligned} y \prec g^* \quad \text {for all}\quad y \in F. \end{aligned}$$

The hypervolume measure is then given by the Lebesgue measure of the set of points between the computed solution and the maximal point \(g^*\), that is

$$\begin{aligned} {\mathcal {S}} [\rho _k^N] = \left| \bigcup _{i=1}^N \left\{ y \in {\mathbb {R}}^m \; | \; g(X^i_k) \prec y \prec g^* \right\} \right| . \end{aligned}$$
(5.3)

Maximizing \({\mathcal {S}}\) has been shown to lead to a diverse approximation of the Pareto front [23].
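Since we restrict ourselves to bi-objective problems, the hypervolume (5.3) can be computed exactly by a sweep over the non-dominated image points, as in the following Python sketch; function and variable names are illustrative and not taken from the released code.

import numpy as np

def hypervolume_2d(Y, ref):
    # hypervolume (5.3) for m = 2: Lebesgue measure of the union of the boxes
    # between each image point y with y < ref (componentwise) and the point ref = g*
    pts = np.array([y for y in Y if np.all(y < ref)])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.lexsort((pts[:, 1], pts[:, 0]))]   # sort by f1, then by f2
    staircase, best_f2 = [], ref[1]
    for y in pts:                                   # keep the non-dominated points only
        if y[1] < best_f2:
            staircase.append(y)
            best_f2 = y[1]
    area = 0.0
    for i, y in enumerate(staircase):               # sweep the staircase from left to right
        next_f1 = staircase[i + 1][0] if i + 1 < len(staircase) else ref[0]
        area += (next_f1 - y[0]) * (ref[1] - y[1])
    return area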

In Sect. 4.1, the convergence of the mean-field dynamics towards the Pareto front is shown by studying the evolution of the Generational Distance GD (4.4). In the experiments, we approximate this quantity by considering, for every test problem, a reference approximation \(F_M := \{ y^j\}_{j=1}^M\) of the front with \(M=100\) points \(y^j \in F\), \(j = 1, \ldots , M\). More details on the reference solution are given in Appendix A. For simplicity, we indicate the numerical approximation of the Generational Distance again by GD, which is defined by

$$\begin{aligned} GD[\rho _k^N] = \left( \frac{1}{N} \sum _{i=1}^N \text {dist}(g(X_k^i), F_M)^2 \right) ^{\frac{1}{2}}. \end{aligned}$$
(5.4)

The Inverted Generational Distance (IGD) is also considered. It consists of the average distance between the points of the reference solution \(\{ y^j\}_{j=1}^M\) and the computed front

$$\begin{aligned} IGD[\rho _k^N] = \left( \frac{1}{M} \sum _{j=1}^M \text {dist}(y^j, G_k)^2 \right) ^{\frac{1}{2}} \quad \text {with}\quad G_k:= \{g(X_k^i)\}_{i=1}^N. \end{aligned}$$
(5.5)

Contrary to GD, which only measures the distance from the Pareto front, IGD takes into account the diversity of the computed solution, too. Hence, IGD is also a suitable indicator of the overall quality of the solution.
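Both indicators can be computed directly from the arrays of image points; the following minimal Python sketch (with illustrative names) mirrors definitions (5.4) and (5.5).

import numpy as np

def gd(G_k, F_M):
    # Generational Distance (5.4): root-mean-square distance from the computed
    # image points G_k = {g(X_k^i)} to the reference front approximation F_M
    d = np.array([np.min(np.linalg.norm(F_M - y, axis=1)) for y in G_k])
    return np.sqrt(np.mean(d ** 2))

def igd(G_k, F_M):
    # Inverted Generational Distance (5.5): root-mean-square distance from the
    # reference points y^j in F_M to the computed front G_k
    d = np.array([np.min(np.linalg.norm(G_k - y, axis=1)) for y in F_M])
    return np.sqrt(np.mean(d ** 2))

# G_k and F_M are arrays of shape (N, m) and (M, m), respectively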

5.2 Test Problems

Test problems with diverse Pareto front geometries are selected to show the performance of the proposed method. In the Lamé problems [24], the parameter \(\gamma >0\) controls the front curvature: we use \(\gamma = 0.25, 1, 3\) to obtain convex, linear, and concave fronts, respectively. We also consider the DO2DK problems [8] with \(k=2,s=1\) and \(k=4, s=2\). Here, the Pareto fronts have more complex geometries, as they are not symmetric and, in one case, the front is discontinuous. All the above problems are scalable to any dimension d of the search space and m of the image space. For presentation purposes, we restrict ourselves to bi-objective optimization problems by setting \(m=2\), but consider possibly large d. In the considered benchmark problems, the analytical description of the fronts is known, allowing us to obtain reference solutions (see Appendix A). The problem definitions are recalled in Appendix B for completeness.

Fig. 3: Plots display, in gray, the particle positions in the image space after a single run for different benchmark problems. The reference solution is displayed in red. Four different parameter interaction strategies are used: no interaction, Riesz, Newtonian, and Morse potential. Histograms show the final distribution over \(\varOmega \) (blue) and the optimal one (red) (Color figure online)

In this section, we use Algorithm 1 in four different scenarios:

1. no parameter interaction, \(\tau =0\);
2. Riesz potential (3.7), with \(\tau =10^{-5}\);
3. Newtonian potential (5.1), with \(\tau =10^{-3}\);
4. Morse potential (5.2), with \(\tau =10^{-1}\), \(C=20\).

The first scenario clearly corresponds to the standard M-CBO approximation, while the others correspond to different AM-CBO strategies. The parameter \(\tau \) for each setting has been tuned to obtain the best performance. We note that the stronger the singularity of the potential at the origin, the smaller the optimal value of \(\tau \). To validate model (4.26) and Theorem 4.2, we update the parameters of the scalarized sub-problems according to (3.8). The initial weights vectors \(\{ W_0^i\}_{i=1}^N\) are taken (deterministically) uniformly distributed over \(\varOmega \), while the particle positions are uniformly sampled over \(H = [0,1]^d\), \(d=10\). We employ \(N=100\) particles, which evolve for a maximum of \(k_{\text {max}} = 5000\) steps. The remaining parameters are set to \(\lambda = 1, \sigma = 4, \alpha = 10^6\). This parameter choice is a compromise between the optimal parameters of each problem. Anisotropic diffusion (3.4) is used.
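For reference, the setup just described can be collected in a small configuration sketch; the key names are illustrative and do not correspond to the interface of the released MATLAB code.

# illustrative summary of the experimental setup of this section
# (keys and structure are ad hoc, not taken from the released implementation)
setup = {
    "N": 100,                # number of particles
    "k_max": 5000,           # maximum number of iterations
    "d": 10,                 # search-space dimension; particles sampled in H = [0, 1]^d
    "lambda": 1.0,           # drift parameter
    "sigma": 4.0,            # diffusion parameter, anisotropic diffusion (3.4)
    "alpha": 1e6,            # weight parameter of the consensus point
    "scenarios": {           # parameter-interaction strategies with tuned tau
        "none": {"tau": 0.0},
        "Riesz": {"tau": 1e-5},
        "Newtonian": {"tau": 1e-3},
        "Morse": {"tau": 1e-1, "C": 20.0},
    },
}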

Figure 3 shows the computed solutions, in the image space, in the four different scenarios. Regardless of the interaction on \(\varOmega \), the particles always converge towards EP optimal points and hence to the Pareto front. By definition of the Chebyshev sub-problems (2.2), a uniform distribution in \(\varOmega \) leads to a uniform distribution of the particles over the front only when F is linear (as in the Lamé problem \(\gamma =1\)). Indeed, in this case Fig. 3 shows that the particles are well distributed even when there is no weights interaction (\(\tau =0\)). If the front geometry differs from this straight segment, the optimal parameter distribution on \(\varOmega \) differs from the uniform one. In particular, subsets of the Pareto front which are almost parallel to one of the axes are difficult to approximate without any interaction in the parameter space, see for instance Lamé \(\gamma = 0.25\) and the DO2DK problems in Fig. 3. When using \(\tau \ne 0\), the solutions improve, as the particles are better distributed over the entire front.

Table 1 Algorithm performance for the different settings and problems

Table 1 reports the performance metrics for all the problems. As a reference, we include the results obtained with the well-known NSGA-II algorithm [18], using the implementation provided in [49]. For better comparability, we use the same initial population size \(N = 100\) and number of iterations \(k_{\max } = 5000\), while the remaining parameters are set to their default values (crossover probability \(p_c = 0.9\), mutation probability \(p_m=0.5\), mutation strength \(m_s = 0.05\)). For most problems, the strategy \(\tau =0\) with no interaction in \(\varOmega \) leads to lower values of GD. This is consistent with Theorem 4.1 and Remark 4.2, which suggested that the additional dynamics may interfere with the CBO mechanism and, as a consequence, slow down the convergence towards EP optimal points. If the diversity metrics \({\mathcal {U}}_R\), \({\mathcal {U}}_N\), \({\mathcal {U}}_M\) and \({\mathcal {S}}\) are considered, dynamics including the interaction of parameters yield more diverse solutions. Interestingly, using the Morse binary potential in the interaction leads to a lower final Newtonian energy in some cases. We will investigate the role of the choice of the potential \({\mathcal {U}}\) and of \(\tau \) in the next section. We note that NSGA-II outperforms the proposed algorithm on the Lamé problems in terms of the IGD metric, whereas better results are obtained by AM-CBO on the DO2DK benchmarks.

In Fig. 3, the IGD performance shows that letting particles interact in parameter space improves the overall quality of the solution. While the improvement is more substantial in problems with complex Pareto fronts (see for instance Lamé \(\gamma = 0.25\), or DO2DK \(k=2\)), we remark that the additional mechanism allows us to obtain better solutions even if the parameter distribution is already optimal from the beginning (see Lamé \(\gamma = 1\)). We conjecture that this is due to the additional stochasticity introduced by the potential. We will also investigate this aspect in the next subsection.

Figure 4a and b show the time evolution of GD, \({\mathcal {U}}_R\), \({\mathcal {U}}_N\), \({\mathcal {U}}_M\) and IGD for two of the considered test problems. As suggested by the analysis of the mean-field model, in particular Theorem 4.1, GD decays exponentially up to a maximum accuracy within the first iterations of the algorithm. This is due to the CBO dynamics driving the particles around EP optimal points. At the same time, the potential energies increase as the particles concentrate towards the front in the image space. If the particles are concentrated at the corresponding minimizers, then the resulting dynamics on the mean-field level are described by equation (4.26), which is a gradient flow for the marginal \(\mu \). While this is rigorously true only under Assumption 4.4, we nevertheless observe similar phenomena numerically also in other cases, as seen in Fig. 4a and b, where the potentials start decreasing once relatively low GD values are attained.

Fig. 4: Performance metric evolution; results are averaged over 25 runs

5.3 Effect of the Parameter \(\tau \) and Scalability

By looking at the computational results, it becomes clear that the two phases of the algorithm, the one characterized by the CBO dynamics and the one characterized by the gradient-flow dynamics, evolve on different time scales: the parameter adaptation is typically much slower than the CBO dynamics. This is consistent with the assumptions of Theorem 4.1, where \(\tau \) needs to be taken of order \(o(\sqrt{\varepsilon })\). Interestingly, we note that the multi-swarm CBO approach suggested in [47] leads to the opposite behavior, where the weights adapt first and the particles create consensus afterwards.

Fig. 5: Performance metrics as a function of \(\tau \) for all the problems considered; results are averaged over 10 runs

To experimentally investigate the importance of the parameter \(\tau \), we test the algorithm for various values of \(\tau \), keeping the remaining parameters fixed. Figure 5a and b show the final GD and IGD metrics when different binary potentials are used during the computation. As expected, relatively large values of \(\tau \) lead to a strong interaction in parameter space that interferes with the CBO mechanism. As a result, the GD metric increases for large values of \(\tau \). Interestingly, the lowest GD values are not always attained for the smallest values of \(\tau \), suggesting that the additional weights-vector dynamics might help the CBO mechanism in optimizing the sub-problems.

The IGD metrics in Fig. 5b show that the optimal value of \(\tau \) is different for each test case. In particular, the DO2DK problems benefit from a strong interaction in parameter space. This might be explained by the front geometry (Fig. 3): the front is long and, as a consequence, the particles tend to be further apart in the image space, making the binary potential interaction weaker. Larger values of \(\tau \) mitigate this effect, leading to better algorithm performance. If the extrema of the Pareto front are known in advance, one could address this issue by estimating the front length and choosing the parameter \(\tau \) accordingly. We also note that the proposed algorithm seems to perform better when the Morse potential is used during the computation.

As already mentioned, the dynamics in \(\varOmega \) add stochasticity to the particle evolution. Hence, the additional diffusive term \(\sigma D_k^i(X_k^i,W_k^i)B_k^i\) in (3.2) might not be necessary. Yet, taking \(\sigma = 0\) yields poor approximations of the Pareto front, see Fig. 6b, suggesting that the diffusive term is still of paramount importance for the particles' exploratory behavior and their statistical independence. From Fig. 6a, it is apparent that the optimal diffusion parameter \(\sigma \) becomes larger as \(\tau \) decreases. In particular, if \(\tau =0\) the particles diverge from the EP optimal points only when \(\sigma >10\), which is consistent with other CBO methods for single-objective optimization with anisotropic diffusion, see for instance [2]. At the same time, for some problems, if \(\sigma \) is too small, larger values of \(\tau \) improve the convergence towards optimal points.

Fig. 6: Metrics as functions of \(\sigma \), for different values of \(\tau \); Morse interaction is used and results are averaged over 5 runs

Finally, we test the algorithm performance for different dimensions d of the search space, keeping the same parameter choices. If the same number \(N=100\) of particles is used, the IGD of the computed solutions increases as the space dimension d becomes larger, see Fig. 7. This effect can simply be reduced by increasing the number of particles linearly with the space dimension, see again Fig. 7.

Fig. 7: IGD metrics as functions of the search space dimension d; Morse interaction is used and results are averaged over 20 runs

6 Conclusions

In this work, we proposed an adaptive stochastic particle dynamics based on consensus to solve multi-objective optimization problems. The method makes use of a scalarization strategy to break down the original problem into N parametrized single-objective sub-problems and can deal with non-differentiable objective functions. The proposed algorithm, AM-CBO, extends prior work on multi-objective consensus-based optimization by additional adaptive dynamics in the parameter space, designed to ensure that the particles distribute uniformly over the Pareto front. This is achieved by exploiting energy-based diversity measures. A rigorous mathematical analysis validating this behavior is provided, although only in a particular case, where Assumptions 4.1–4.3 hold true. Under those conditions, we theoretically investigated the long-time behavior of the particle dynamics under the propagation of chaos assumption and established convergence towards EP optimal points. The particles are capable of solving several single-objective problems at the same time, also saving computational cost with respect to a naive approach. The additional dynamics on the parameter space are analyzed based on results on nonlinear aggregation equations, under Assumption 4.4. Numerical experiments show that the proposed method is capable of solving multi-objective problems with very different Pareto fronts, even those violating the previous assumptions. The algorithm scales well with the problem dimension, even when using a relatively small number of particles. Our scaling of the dynamics in the weights leads to a behavior where the particles first concentrate along the front and are then redistributed. In future work, the opposite ordering, which is typical for classical multi-objective optimization methods, may also be considered.