On the Global Convergence of Particle Swarm Optimization Methods

In this paper we provide a rigorous convergence analysis for the renowned particle swarm optimization method using tools from stochastic calculus and the analysis of partial differential equations. Based on a continuous-time formulation of the particle dynamics as a system of stochastic differential equations, we establish convergence to a global minimizer of a possibly nonconvex and nonsmooth objective function in two steps. In a first step, we prove consensus formation of an associated mean-field dynamics by analyzing the time-evolution of the variance of the particle distribution, which acts as a Lyapunov function of the dynamics, and we show that this consensus is close to a global minimizer by employing the asymptotic Laplace principle and a tractability condition on the energy landscape of the objective function. These results allow for the use of memory mechanisms and hold for a rich class of objectives provided certain conditions of well-preparation of the hyperparameters and the initial datum. In a second step, at least for the case without memory effects, we provide a quantitative result about the mean-field approximation of particle swarm optimization, which specifies the convergence of the interacting particle system to the associated mean-field limit. Combining these two results allows for global convergence guarantees of the numerical particle swarm optimization method with provable polynomial complexity. To demonstrate the applicability of the method we propose an efficient and parallelizable implementation, which is tested in particular on a competitive and well-understood high-dimensional benchmark problem in machine learning.


Introduction
In nature, collective behavior and self-organization allow complicated global patterns to emerge from simple interaction rules and random fluctuations. Inspired by the fascinating capabilities of swarm intelligence, large multi-agent systems are employed as a tool for solving challenging problems in applied mathematics. One classical task arising throughout science is concerned with the global optimization of a problem-dependent, possibly nonconvex and nonsmooth objective function E : R^d → R, i.e., the search for a global optimizer

x^* ∈ arg min_{x ∈ R^d} E(x). (1.1)

A popular class of methods with a long history of achieving state-of-the-art performance on such problems are metaheuristics [11]. They orchestrate an interplay between local and global improvement procedures, consider memory mechanisms and selection strategies, and combine random and deterministic decisions, to create a process capable of escaping local optima and performing a robust search of the solution space in order to find a global optimizer. Initiated by seminal works on stochastic approximation [42] and random search [40], a big variety of such mechanisms has been introduced, analyzed, and applied to numerous real-world problems. A non-exclusive list of representatives includes evolutionary programming [12], genetic algorithms [21], simulated annealing [1], and particle swarm optimization [28]. Despite their tremendous empirical success, it is very difficult to provide a theoretical convergence analysis to global minimizers, mostly due to their stochastic nature and the appearance of memory effects.
In this paper we study particle swarm optimization (PSO), which was initially introduced by Kennedy and Eberhart in the 90s [28, 27] and is now widely recognized as an efficient method for tackling complex optimization problems [39, 31]. Originally, PSO solves (1.1) by considering a group of finitely many particles, which explore the energy landscape of E. Each agent experiences a force towards its own personal (historical) best position as well as towards the global best position communicated in the swarm. Although these interaction rules are seemingly simple, a complete numerical analysis of PSO is still lacking; see, e.g., [48, 35, 49] and references therein. Recently, however, by introducing a continuous description of PSO based on a system of stochastic differential equations (SDEs), the authors of [19] have paved the way for a rigorous mathematical analysis using tools from stochastic calculus and the analysis of partial differential equations (PDEs).
To explore the domain and to form a global consensus about the minimizer x^* as time passes, the formulation of PSO proposed by the authors of [19] uses N particles, described by triplets ((X^i_t, Y^i_t, V^i_t)_{t≥0})_{i=1,...,N}, with X^i_t and V^i_t denoting the position and velocity, and Y^i_t being a regularized version of the local (historical) best position of the ith agent at time t. The particles, formally stochastic processes, are initialized independently according to some common distribution f_0 ∈ P(R^{3d}). In the most general form the PSO dynamics is given by the system of SDEs, expressed in Itô's form as

dX^i_t = V^i_t dt, (1.2a)
dY^i_t = κ (X^i_t − Y^i_t) S^{β,θ}(X^i_t, Y^i_t) dt, (1.2b)
m dV^i_t = −γ V^i_t dt + λ_1 (Y^i_t − X^i_t) dt + λ_2 (y_α(ρ̂^N_{Y,t}) − X^i_t) dt + σ_1 D(Y^i_t − X^i_t) dB^{1,i}_t + σ_2 D(y_α(ρ̂^N_{Y,t}) − X^i_t) dB^{2,i}_t, (1.2c)

where α, β, θ, κ, γ, m, λ_1, λ_2, σ_1, σ_2 ≥ 0 are user-specified parameters. The change of the velocity in (1.2c) is subject to five forces. The first term on the right-hand side models friction with a coefficient commonly chosen as γ = 1 − m ≥ 0, where m > 0 denotes the inertia weight. The subsequent term can be regarded as the drift towards the local best position of the ith particle. A time-continuous approximation of its evolution is given by Y^i_t and described in Equation (1.2b). It involves the operator S^{β,θ}, given by S^{β,θ}(x, y) = 1 + θ + tanh(β(E(y) − E(x))) for 0 ≤ θ ≪ 1 and β ≫ 1, which converges to the Heaviside function as θ → 0 and β → ∞. The concept behind Equation (1.2b) can then be seen when being discretized, see Remark 1.1. For an alternative implementation of the local best position we refer to [45].
Remark 1.1. A time-discretization of (1.2b) with κ = 1/(2∆t), θ = 0 and β = ∞ yields the update rule

Y^i_{(k+1)∆t} = X^i_{(k+1)∆t} if E(X^i_{(k+1)∆t}) < E(Y^i_{k∆t}), and Y^i_{(k+1)∆t} = Y^i_{k∆t} otherwise,

meaning that the ith particle stores in Y^i_{k∆t} the best position which it has seen up to the kth iteration. This explains the name local (historical) best position and restores the original definition from the work [28].
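To make the discrete rule of Remark 1.1 concrete, here is a minimal NumPy sketch of the local-best update; the function name and the vectorized layout are our own illustrative choices, the rule itself is the one stated above.

```python
import numpy as np

def update_local_best(E, X_new, Y_old):
    # Discrete local-best rule (theta = 0, beta = infinity): each particle
    # keeps the better of its previous best Y and its new position X.
    improved = E(X_new) < E(Y_old)
    return np.where(improved[:, None], X_new, Y_old)

# Toy check on the sphere function E(x) = |x|^2 with two particles in R^2.
E = lambda X: np.sum(X**2, axis=1)
Y = np.array([[1.0, 1.0], [0.1, 0.1]])   # previous local bests
X = np.array([[0.5, 0.5], [1.0, 1.0]])   # new positions
Y_new = update_local_best(E, X, Y)       # first particle improves, second does not
```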
The last deterministic term imposes a drift towards the momentaneous consensus point y_α(ρ̂^N_{Y,t}), given by

y_α(ρ̂^N_{Y,t}) := ∫ y ω^E_α(y) / ‖ω^E_α‖_{L¹(ρ̂^N_{Y,t})} dρ̂^N_{Y,t}(y), with ω^E_α(y) := exp(−αE(y)), (1.4)

where ρ̂^N_{Y,t} denotes the empirical measure ρ̂^N_{Y,t} := (1/N) Σ_{i=1}^N δ_{Y^i_t} of the particles' local best positions. The choice of the weight ω^E_α in (1.4) comes from the well-known Laplace principle [33, 9], a classical asymptotic argument for integrals stating that for any probability measure ρ ∈ P(R^d) it holds

lim_{α→∞} (−1/α) log ( ∫ ω^E_α(y) dρ(y) ) = inf_{y ∈ supp(ρ)} E(y). (1.5)

Based thereon, y_α(ρ̂^N_{Y,t}) is expected to be a rough estimate for a global minimizer x^*, which improves as α → ∞ and as larger regions of the domain are explored. To feature the latter, the two remaining terms in (1.2c), each associated with a drift term, are diffusion terms injecting randomness into the dynamics through independent standard Brownian motions ((B^{1,i}_t)_{t≥0})_{i=1,...,N} and ((B^{2,i}_t)_{t≥0})_{i=1,...,N}. The two commonly studied diffusion types for similar methods are isotropic [36, 5, 15] and anisotropic [6, 16] diffusion with

D(y − x) = |y − x|_2 Id for isotropic diffusion, and D(y − x) = diag(y − x) for anisotropic diffusion, (1.6)

where Id ∈ R^{d×d} is the identity matrix and diag : R^d → R^{d×d} the operator mapping a vector onto a diagonal matrix with the vector as its diagonal. Intuitively, the term's scaling encourages agents far from their own local best position or the globally computed consensus point to explore larger regions, whereas agents already close try to enhance their position only locally. As the coordinate-dependent scaling of anisotropic diffusion has been proven to be highly beneficial for high-dimensional problems [6, 14], in what follows we limit our analysis to this case. An illustration of the formerly described PSO dynamics (1.2) is given in Figure 1. A theoretical convergence analysis of PSO is possible either on the microscopic level (1.2) or by analyzing the macroscopic behavior of the particle density through a mean-field limit, which usually admits more powerful analysis tools. In the large particle limit an individual particle is not influenced any more by individual particles but only by the average behavior of all particles. As shown in [18, Section 3.2], the empirical particle measure ρ̂^N_t := (1/N) Σ_{i=1}^N δ_{(X^i_t, Y^i_t, V^i_t)} converges in this limit to the deterministic agent density f ∈ C([0, T], P(R^{3d})), which weakly satisfies the nonlinear Vlasov–Fokker–Planck equation (1.7) with initial datum f_0. The mean-field limit results [3, 22, 44, 24, 23] ensure that the particle system (1.2) is well approximated by the following self-consistent mean-field McKean process

dX_t = V_t dt, (1.8a)
dY_t = κ (X_t − Y_t) S^{β,θ}(X_t, Y_t) dt, (1.8b)
m dV_t = −γ V_t dt + λ_1 (Y_t − X_t) dt + λ_2 (y_α(ρ_{Y,t}) − X_t) dt + σ_1 D(Y_t − X_t) dB^1_t + σ_2 D(y_α(ρ_{Y,t}) − X_t) dB^2_t. (1.8c)

Here, f_t denotes the distribution of (X_t, Y_t, V_t), and ρ_{Y,t} its marginal law w.r.t. y. This makes (1.7) and (1.8) nonlinear.
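Numerically, the consensus point (1.4) is a softmax-weighted average. Subtracting the minimal sampled energy before exponentiating (a standard log-sum-exp stabilization, our implementation choice) cancels in the quotient while avoiding underflow for large α; the sketch below also illustrates the Laplace principle: as α grows, the consensus point approaches the best sampled position.

```python
import numpy as np

def consensus_point(E_vals, Y, alpha):
    # Weighted mean with weights omega_alpha(y) = exp(-alpha * E(y)); the
    # shift by min(E_vals) cancels in the quotient and stabilizes the weights.
    w = np.exp(-alpha * (E_vals - E_vals.min()))
    return (w[:, None] * Y).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 2))              # sampled local best positions
E_vals = np.sum((Y - 1.0)**2, axis=1)       # shifted sphere, minimizer (1, 1)
best = Y[np.argmin(E_vals)]                 # best sampled position
y_small = consensus_point(E_vals, Y, alpha=1.0)
y_large = consensus_point(E_vals, Y, alpha=100.0)
# y_large lies closer to `best` than y_small does
```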
1.1. Contribution. In view of the versatility, efficiency, and wide applicability of PSO combined with its long historical tradition, a mathematical analysis of the finite particle system (1.2) is of considerable interest.
In this work we advance the theoretical understanding of the method and contribute to the completion of a full numerical analysis of PSO by rigorously proving the convergence of PSO with memory effects to global minimizers using mean-field techniques. More precisely, under mild regularity assumptions on the objective E and a well-preparation condition about the initialization f_0, we analyze the behavior of the particle distribution f solving the mean-field dynamics (1.8). At first, it is shown that concentration is achieved at some x̃ in the sense that the marginal law w.r.t. the local best position, ρ_{Y,t}, converges narrowly to a Dirac delta δ_x̃ as t → ∞. Subsequently, we argue that, for an appropriate choice of the parameters, in particular α ≫ 1, E(x̃) can be made arbitrarily close to the minimal value E := inf_{x∈R^d} E(x). A suitable tractability condition on the objective E eventually ensures that x̃ is close to a global minimizer. Similar mean-field convergence results are obtained for the case without memory effects. In this setting we are moreover able to establish the convergence of the interacting N-particle dynamics to its mean-field limit with a dimension-independent rate, which allows us to obtain a holistic and quantitative convergence statement of PSO that has so far been unavailable. As the mean-field approximation result does not suffer from the curse of dimensionality, we in particular prove that the numerical PSO method has polynomial complexity. With these new results we solve the theoretical open problem about the convergence of PSO posed in [19].
Furthermore, we propose an efficient and parallelizable implementation, which is particularly suited for machine learning problems by integrating modern machine learning techniques such as random mini-batch ideas as well as traditional metaheuristic-inspired techniques from genetic programming and simulated annealing.
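One of the random mini-batch ideas alluded to above can be sketched as follows: the consensus point is evaluated on a random subset of the particles in each iteration, reducing the number of objective evaluations per step. The function below is an illustrative assumption about how such sub-sampling could look, not the implementation proposed in Section 5.

```python
import numpy as np

def minibatch_consensus(E, Y, alpha, batch_size, rng):
    # Draw a random mini-batch of particles and compute the consensus point
    # from the batch only; the cost per step drops from N to batch_size
    # objective evaluations.
    idx = rng.choice(len(Y), size=batch_size, replace=False)
    Yb = Y[idx]
    Eb = E(Yb)
    w = np.exp(-alpha * (Eb - Eb.min()))
    return (w[:, None] * Yb).sum(axis=0) / w.sum()

rng = np.random.default_rng(1)
E = lambda X: np.sum(X**2, axis=1)
Y = rng.normal(size=(500, 3))
y_batch = minibatch_consensus(E, Y, alpha=10.0, batch_size=50, rng=rng)
```

With batch_size equal to the number of particles the update coincides with the full consensus computation, so the batch size interpolates between cheap, noisy estimates and the exact consensus point.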
1.2. Prior Art. The convergence of PSO algorithms has been investigated by many scholars since their introduction, which has led to several variants allowing one to establish desirable properties such as consensus formation or convergence to optimal solutions. While the matter of consensus is well-studied, see, e.g., [8, 34], only few general theoretical statements regarding the properties of the found consensus are available. Both the existence of a large number of variations of the algorithm and the lack of a rigorous global convergence analysis are attributed, amongst other things such as the stochasticity and the usage of memory mechanisms, to the phenomenon of premature convergence of basic PSO [28], which was observed in [47, 46] and remedied by proposing a modified version, called guaranteed convergence PSO. Nevertheless, this adaptation only allows to prove convergence to local optima. In order to obtain therefrom a stochastic global search algorithm, the authors suggest adding purely stochastic particles to the swarm, which trivially makes the method capable of detecting a global optimizer, but entails a computational time which coincides with the time required to examine every location in the search space. Other works consider certain notions of weak convergence [4] or provide probabilistic guarantees of finding locally optimal solutions, meaning that eventually all particles are located almost surely at a local optimum of the objective function [43]. In [38], similarly to our work, the expected behavior of the particles is investigated.
All of the aforementioned results, however, are obtained through the analysis of the particles' trajectories generated by a time-discretized algorithm as in [18, Equation (6.3)]. The present paper takes a different point of view by studying the time-continuous description of the PSO model (1.2) through the lens of the mean-field approximation (1.7). Analyzing the macroscopic behavior of a system through a mean-field limit instead of investigating the microscopic particle dynamics has its origins in statistical mechanics [26], where interactions between particles are approximated by an averaged influence. By eliminating the correlation between the particles, a many-body problem can be reduced to a one-body problem, which is usually much easier to solve while still giving an understanding of the mechanisms at play by describing the average behavior of the particles. These ideas, for instance, are also used to study the collective behavior of animals when forming large-scale patterns through self-organization by analyzing an associated kinetic PDE [3]. In very recent works, this perspective of analysis has also been taken to demystify the training process of neural networks, see, e.g., [32, 10], where a mean-field approximation is utilized to formulate risk minimization by stochastic gradient descent (SGD) in terms of a gradient-flow PDE, which allows for a rigorous mathematical analysis.
The analysis technique we use follows the line of work on self-organization. It is inspired by [5, 6], where a variance-based analysis approach has been developed for consensus-based optimization (CBO), which follows the guiding principles of metaheuristics and in particular resembles PSO, but is of much simpler nature and therefore easier to analyze. In comparison to Equation (1.2), CBO methods are described by a system of first-order SDEs [5, Equation (1.1)] and do not contain memory mechanisms, which are responsible for both a significantly more challenging mathematical modeling and convergence analysis.

1.3. Organization. Sections 2 and 3 are dedicated to the analysis of PSO without and with memory mechanisms, respectively. After providing details about the well-posedness of the mean-field dynamics, we present and discuss the main result about the convergence of the mean-field dynamics to a global minimizer of the objective function. In Section 4 we then state a quantitative result about the mean-field approximation for PSO without memory effects, which enables us to obtain a holistic convergence statement of the numerical PSO method. Eventually, a computationally efficient implementation of PSO is proposed in Section 5, before Section 6 concludes the paper. For the sake of reproducible research, in the GitHub repository https://github.com/KonstantinRiedl/PSOAnalysis we provide the Matlab code implementing the PSO algorithm analyzed in this work.

Mean-Field Analysis of PSO without Memory Effects
Before providing a theoretical analysis of the mean-field PSO dynamics (1.7) and (1.8), in this section we investigate a reduced version, which does not involve memory mechanisms. Its multi-particle formulation was proposed in [19, Section 3.1] and reads

dX^i_t = V^i_t dt, (2.1a)
m dV^i_t = −γ V^i_t dt + λ (x_α(ρ̂^N_{X,t}) − X^i_t) dt + σ D(x_α(ρ̂^N_{X,t}) − X^i_t) dB^i_t. (2.1b)

Compared to the full model, each particle is characterized only by its position X^i and velocity V^i. The forces acting on a particle, i.e., influencing its velocity in Equation (2.1b), are friction, acceleration through the consensus drift, and diffusion as in (1.6) with independent standard Brownian motions ((B^i_t)_{t≥0})_{i=1,...,N}.
The consensus point x_α(ρ̂^N_{X,t}) is directly computed from the current positions of the particles according to

x_α(ρ̂^N_{X,t}) := ∫ x ω^E_α(x) / ‖ω^E_α‖_{L¹(ρ̂^N_{X,t})} dρ̂^N_{X,t}(x),

where ρ̂^N_{X,t} denotes the empirical measure ρ̂^N_{X,t} := (1/N) Σ_{i=1}^N δ_{X^i_t} of the particles' positions. Independent and identically distributed initial data ((X^i_0, V^i_0) ∼ f_0)_{i=1,...,N} with f_0 ∈ P(R^{2d}) complement (2.1). Similar to the particle system (1.2), as N → ∞, the mean-field dynamics of (2.1) is described by the nonlinear self-consistent McKean process

dX_t = V_t dt, (2.3a)
m dV_t = −γ V_t dt + λ (x_α(ρ_{X,t}) − X_t) dt + σ D(x_α(ρ_{X,t}) − X_t) dB_t, (2.3b)

with initial datum (X_0, V_0) ∼ f_0 and the marginal law ρ_{X,t} of X_t given by ρ_X(t, ·) := ∫_{R^d} df(t, ·, v). A direct application of the Itô–Doeblin formula shows that the law f ∈ C([0, T], P(R^{2d})) is a weak solution to the nonlinear Vlasov–Fokker–Planck equation (2.4) with initial datum f_0.
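For concreteness, one Euler–Maruyama step of a discretization of (2.1) with anisotropic diffusion D = diag can be sketched as follows; the weight stabilization and all parameter values in the illustrative run are our own choices, not tuned constants from the analysis.

```python
import numpy as np

def pso_step(X, V, E, dt, m, gamma, lam, sigma, alpha, rng):
    # One Euler-Maruyama step of the memoryless PSO system (2.1).
    Ev = E(X)
    w = np.exp(-alpha * (Ev - Ev.min()))               # stabilized weights
    x_alpha = (w[:, None] * X).sum(axis=0) / w.sum()   # consensus point
    dB = rng.normal(size=X.shape) * np.sqrt(dt)        # Brownian increments
    V = V + (-gamma * V + lam * (x_alpha - X)) / m * dt \
          + sigma / m * (x_alpha - X) * dB             # anisotropic diffusion
    X = X + V * dt
    return X, V, x_alpha

# Illustrative run on the sphere function E(x) = |x|^2.
rng = np.random.default_rng(0)
E = lambda X: np.sum(X**2, axis=1)
X = rng.normal(1.0, 1.0, size=(100, 2))
V = np.zeros((100, 2))
for _ in range(1000):
    X, V, x_alpha = pso_step(X, V, E, dt=0.01, m=0.2, gamma=1.0,
                             lam=1.0, sigma=0.4, alpha=50.0, rng=rng)
```

Since the diffusion coefficient is proportional to the distance to the consensus point, the injected noise fades as the swarm concentrates, which is the mechanism exploited in the analysis below.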
Remark 2.1. A separate theoretical analysis of the dynamics (2.1) is necessary, as it cannot be derived from (1.2) in a way that would allow the proof technique to carry over in a straightforward manner.
It is also worth noting that Equation (2.1) bears a certain resemblance to CBO [36, 5, 6, 15, 16]. Indeed, as made rigorous in [7], CBO methods can be derived from PSO in the small inertia limit m → 0.
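To give an impression of this connection, the following sketch formally sets the left-hand side of the velocity equation in (2.1b) to zero (m → 0), solves for V dt, and inserts the result into the position update, which yields a first-order CBO-type step. This is only a heuristic illustration of the limit, not the rigorous derivation of [7].

```python
import numpy as np

def cbo_step(X, x_alpha, dt, lam, gamma, sigma, rng):
    # Overdamped limit of (2.1b): 0 = -gamma*V dt + lam*(x_alpha - X) dt
    # + sigma*diag(x_alpha - X) dB, hence
    # dX = (lam/gamma)*(x_alpha - X) dt + (sigma/gamma)*diag(x_alpha - X) dB.
    dB = rng.normal(size=X.shape) * np.sqrt(dt)
    return X + (lam / gamma) * (x_alpha - X) * dt \
             + (sigma / gamma) * (x_alpha - X) * dB
```

With σ = 0 the step reduces to plain exponential relaxation of every particle towards the consensus point.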
Before turning towards the well-posedness of the mean-field dynamics (2.3) and presenting the main result of this section about the convergence to the global minimizer x^*, let us introduce the class of objective functions E considered in the theoretical part of this work. We remark that the assumptions made in what follows coincide with the ones of [5, 6] as well as several subsequent works in this direction.

Assumption 2.2. Throughout the paper we are interested in objective functions E : R^d → R, for which
A1 there exists x^* ∈ R^d such that E(x^*) = inf_{x∈R^d} E(x) =: E;
A2 there exists L_E > 0 such that |E(x) − E(y)| ≤ L_E (1 + |x| + |y|) |x − y| for all x, y ∈ R^d;
A3 either sup_{x∈R^d} E(x) < ∞, or there exist c_E > 0 and R > 0 such that E(x) − E ≥ c_E |x|² for all |x| ≥ R;
A4 E ∈ C²(R^d) with bounded Hessian;
A5 there exist η > 0 and ν ∈ (0, ∞) such that for any x ∈ R^d there exists a global minimizer x^* of E (which may depend on x) such that |x − x^*| ≤ (E(x) − E)^ν / η.

Assumption A1 just states that the objective function E attains its infimum E at some x^* ∈ R^d, which may not necessarily be unique. Assumption A2 describes the local Lipschitz-continuity of E, entailing in particular that the objective has at most quadratic growth at infinity. Assumption A3, on the other hand, requires E to be either bounded or of at least quadratic growth in the farfield. Together, A2 and A3 allow to obtain the well-posedness of the PSO model. Assumption A4 is a regularity assumption about E, which is required only for the theoretical analysis; PSO itself is a zero-order method, i.e., no gradient information about the objective function is needed in the numerical application. Assumption A5 should be interpreted as a tractability condition on the landscape of E, which ensures that achieving an objective value of approximately E guarantees closeness to a global minimizer x^* and thus eliminates cases of almost-optimal valleys in the energy landscape far away from any globally minimizing argument. Such an assumption is therefore also referred to as an inverse continuity property.
It shall be emphasized that objectives with multiple global minima of identical quality are not excluded.
2.1. Well-Posedness of PSO without Memory Effects. Let us recall a well-posedness result about the mean-field PSO dynamics (2.3) and the associated Vlasov–Fokker–Planck equation (2.4). Its proof is analogous to the one provided for Theorem 3.1 for the full dynamics (1.8), which is based on the Leray–Schauder fixed point theorem.
Theorem 2.3. Let E satisfy Assumptions A1–A3 and let f_0 ∈ P_4(R^{2d}). Then the nonlinear SDE (2.3) admits a unique strong solution (X_t, V_t)_{t∈[0,T]} up to time T with continuous paths, whose law f ∈ C([0, T], P_4(R^{2d})) is a weak solution to the Vlasov–Fokker–Planck equation (2.4). In particular, a fourth-order moment bound of the form

sup_{t∈[0,T]} E[|X_t|^4 + |V_t|^4] ≤ C (1 + E[|X_0|^4 + |V_0|^4])

holds for some constant C > 0 depending only on m, γ, λ, σ, α, c_E, R, and L_E.

2.2. Convergence of PSO without Memory Effects to a Global Minimizer.
A successful application of the PSO dynamics rests on the premise that the particles form consensus about a certain position x̃.
In particular, in the mean-field limit one expects that the distribution of a particle's position ρ_{X,t} converges to a Dirac delta δ_x̃. This entails that the variance in the position, E[|X_t − E[X_t]|²], and the second-order moment of the velocity, E[|V_t|²], of the averaged particle vanish. As we show in what follows, both functionals indeed decay exponentially fast in time. Motivated by these expectations we define in (2.6) the functional H(t), which comprises these two quantities together with a mixed term of the form ⟨X_t − E[X_t], V_t⟩ and which we analyze in the remainder of this section. Its last term is required from a technical perspective. However, by proving the decay of E[H(t)], one immediately obtains the same for E[|X_t − E[X_t]|²] + E[|V_t|²] as a consequence of the equivalence established in Lemma 2.1, which follows from Young's inequality.
Lemma 2.1. The functional H(t) defined in (2.6) is equivalent to |X_t − E[X_t]|² + |V_t|² in the sense that there exist constants c, C > 0 such that

c (|X_t − E[X_t]|² + |V_t|²) ≤ H(t) ≤ C (|X_t − E[X_t]|² + |V_t|²). (2.7)

We now derive an evolution inequality for the quantity E[H(t)].
Lemma 2.2. Let E satisfy Assumptions A1–A3 and let (X_t, V_t)_{t≥0} be a solution to the nonlinear SDE (2.3).
Then E[H(t)] with H as defined in (2.6) satisfies the differential inequality (2.8).

Proof. Let us write δX_t := X_t − E[X_t] for short and note that the integration by parts formula gives (2.9). Observe that the stochastic integrals have vanishing expectations as a consequence of the regularity obtained in Theorem 2.3. Applying the Itô–Doeblin formula and Young's inequality yields (2.10). Again by employing the Itô–Doeblin formula we obtain (2.11), where we used the identity (2.9) and the fact that E[⟨δX_t, x_α(ρ_{X,t}) − E[X_t]⟩] = 0 in the last two steps. We now rearrange inequality (2.11) to get an estimate which, in combination with (2.10), allows to show (2.12). In order to upper bound the remaining term, an application of Jensen's inequality yields (2.13). By choosing ε = 2λ/γ in (2.12) and utilizing the estimate (2.13), we obtain (2.8) as desired.
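The exponential decay asserted by Lemma 2.2 under well-prepared parameters can be observed numerically. The sketch below discretizes (2.1) and tracks the empirical proxy E|X − E X|² + E|V|²; all parameter values are illustrative assumptions in the spirit of Remark 2.4, not the constants of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, dt, steps = 400, 2, 0.005, 2000
m, gamma, lam, sigma, alpha = 0.1, 1.0, 1.0, 0.3, 100.0
E = lambda X: np.sum(X**2, axis=1)
X = rng.normal(0.5, 0.5, size=(N, d))
V = np.zeros((N, d))
H_proxy = []
for _ in range(steps):
    Ev = E(X)
    w = np.exp(-alpha * (Ev - Ev.min()))
    x_a = (w[:, None] * X).sum(axis=0) / w.sum()
    dB = rng.normal(size=(N, d)) * np.sqrt(dt)
    V += (-gamma * V + lam * (x_a - X)) / m * dt + sigma / m * (x_a - X) * dB
    X += V * dt
    # empirical version of E|X - EX|^2 + E|V|^2
    H_proxy.append(((X - X.mean(axis=0))**2).sum(axis=1).mean()
                   + (V**2).sum(axis=1).mean())
# with these parameters H_proxy[-1] is far below H_proxy[0]
```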
Remark 2.4. To obtain exponential decay of E[H(t)] it is necessary to ensure the negativity of the prefactors appearing in (2.8) by choosing the parameters of the PSO method in a suitable manner. This may be achieved, for any fixed time t, given α and arbitrary σ, γ > 0, by the parameter choices specified in (2.14), which involve the quantity D^X_t abbreviated there. In order to be able to choose the parameters in Remark 2.4 once at the beginning of the algorithm instead of at every time step t, we need to be able to control the time-evolution of E[exp(−αE(X_t))]. We therefore study its time-derivative in the following lemma.
Lemma 2.3. Let E satisfy Assumptions A1–A4 and let (X_t, V_t)_{t≥0} be the solution to the nonlinear SDE (2.3). Then the differential inequality (2.15) holds.

Proof. We first note that it suffices to lower bound the second time-derivative of E[exp(−αE(X_t))]. To do so, we start with its first derivative. Applying the Itô–Doeblin formula twice and noting that stochastic integrals have vanishing expectations as a consequence of the regularity obtained in Theorem 2.3, we have (2.17). Differentiating both sides of (2.17) with respect to the time t yields an expression in two terms T_1 and T_2, where we employed the first line of (2.17) in the last step. It remains to upper bound the terms T_1 and T_2. Making use of Assumptions A1 and A4, we immediately obtain a bound on T_1. For T_2, again under Assumptions A1 and A4, we first note the estimate (2.20); arguing as in (2.13) we can further bound (2.20), which yields the statement after employing the lower bound of (2.7) as in Lemma 2.1.
We are now ready to state and prove the main result about the convergence of the mean-field PSO dynamics (2.3) without memory mechanisms to the global minimizer x^*.

Theorem 2.5. Let E satisfy Assumptions A1–A4 and let (X_t, V_t)_{t≥0} be a solution to the nonlinear SDE (2.3). Moreover, let us assume the well-preparation of the initial datum X_0 and V_0 in the sense that the conditions P1 (requiring µ > 0) and P2 hold, with x_+ = max{x, 0} for x ∈ R denoting the positive part. Then E[H(t)] with H as defined in Equation (2.6) converges exponentially fast with rate χ to 0 as t → ∞. Moreover, there exists some x̃, which may depend on α and f_0, such that E[X_t] → x̃ and x_α(ρ_{X,t}) → x̃ exponentially fast with rate χ/2 as t → ∞. Eventually, for any given accuracy ε > 0, there exists α_0 > 0, such that for all α > α_0, x̃ satisfies E(x̃) − E ≤ ε.

Remark 2.6. As suggested in Remark 2.4, Theorem 2.5 traces back the evolution of E[exp(−αE(X_t))] to its initial state by employing Lemma 2.3. This allows to fixate all parameters of PSO at initialization time.
By replacing D^X_t with 2D^X_0 in (2.14), the well-preparation of the parameters as in Condition P1 can be ensured.
Condition P2 requires the well-preparation of the initialization in the sense that the initial datum f_0 is both well-concentrated and, to a certain extent, not too far from an optimal value. While this might have a locality flavor, the condition is generally fulfilled in practical applications. Moreover, for CBO methods there is recent work where such an assumption about the initial datum is reduced to the absolute minimum [15, 16].
Proof of Theorem 2.5. Let us define the time horizon

T := sup{t ≥ 0 : E[exp(−αE(X_s))] ≥ E[exp(−αE(X_0))]/2 for all s ∈ [0, t]}.

Obviously, by continuity, T > 0. We claim that T = ∞, which we prove by contradiction in the following. Therefore, assume T < ∞. Then, for t ∈ [0, T], the positivity of µ is ensured by the well-preparation condition P1 of the initialization, and Lemma 2.2 provides an upper bound for the time derivative of the functional E[H(t)], where we make use of the upper bound of (2.7) as in Lemma 2.1 in the last inequality. The rate χ is defined implicitly, and it is straightforward to check that χ < γ/m. Grönwall's inequality implies

E[H(t)] ≤ E[H(0)] exp(−χt). (2.23)
Let us now investigate the evolution of the functional X(t) := E[exp(−αE(X_t))]. An application of Grönwall's inequality to Equation (2.15) from Lemma 2.3, using the explicit bound of E[H(t)] from (2.23), yields a lower bound (2.24) on X(t), which in turn implies a simplified estimate after discarding the positive parts. Recalling the definition of X and employing the second well-preparation condition P2, we can deduce that for all t ∈ [0, T] it holds E[exp(−αE(X_t))] > E[exp(−αE(X_0))]/2, which entails that there exists δ > 0 such that E[exp(−αE(X_t))] ≥ E[exp(−αE(X_0))]/2 in [T, T + δ] as well, contradicting the definition of T and therefore showing the claim T = ∞.
As a consequence of (2.23) we have E[H(t)] ≤ E[H(0)] exp(−χt) for all t ≥ 0. In particular, by means of Lemma 2.1, for a suitable generic constant C > 0, we infer exponential decay of E[|X_t − E[X_t]|²] and E[|V_t|²], where the last inequality uses the bound (2.13). Moreover, with Jensen's inequality, |dE[X_t]/dt| ≤ C exp(−χt/2), showing that E[X_t] → x̃ for some x̃ ∈ R^d, which may depend on α and f_0. According to (2.25), X_t → x̃ in mean-square and x_α(ρ_{X,t}) → x̃. Eventually, by continuity of the objective function E and by the dominated convergence theorem, we conclude that E[exp(−αE(X_t))] → e^{−αE(x̃)} as t → ∞. Using this when taking the limit t → ∞ in the second bound of (2.24), after applying the logarithm and multiplying both sides with −1/α, we obtain the claim.

Mean-Field Analysis of PSO with Memory Effects
Let us now turn back to the PSO dynamics (1.2) described in the introduction. The fundamental difference to what was analyzed in the preceding section is the presence of a personal memory of each particle, encoded through the additional state variable Y^i_t. It can be thought of as an approximation to the in-time best position arg min_{τ≤t} E(X^i_τ) seen by the respective particle. Its dynamics is encoded in Equation (1.2b). In this section we analyze (1.2) in the large particle limit, i.e., through its mean-field limit (1.8).
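Before turning to the analysis, we sketch how one Euler–Maruyama step of the full dynamics (1.2) with anisotropic diffusion may look in code; the ordering of the updates and all parameter values in the illustrative run are our own choices, not the scheme of Section 5.

```python
import numpy as np

def pso_memory_step(X, Y, V, E, dt, m, gamma, lam1, lam2,
                    sig1, sig2, alpha, beta, theta, kappa, rng):
    # Smooth switch S^{beta,theta} from (1.2b): close to 2 + theta where the
    # current position improves on the local best, close to theta otherwise.
    S = 1.0 + theta + np.tanh(beta * (E(Y) - E(X)))
    Ey = E(Y)
    w = np.exp(-alpha * (Ey - Ey.min()))
    y_alpha = (w[:, None] * Y).sum(axis=0) / w.sum()   # consensus of local bests
    dB1 = rng.normal(size=X.shape) * np.sqrt(dt)
    dB2 = rng.normal(size=X.shape) * np.sqrt(dt)
    Y_new = Y + kappa * (X - Y) * S[:, None] * dt      # regularized local best
    V_new = V + (-gamma * V + lam1 * (Y - X) + lam2 * (y_alpha - X)) / m * dt \
              + (sig1 * (Y - X) * dB1 + sig2 * (y_alpha - X) * dB2) / m
    X_new = X + V_new * dt                             # semi-implicit position update
    return X_new, Y_new, V_new

# Illustrative run on the sphere function E(x) = |x|^2.
rng = np.random.default_rng(0)
E = lambda Z: np.sum(Z**2, axis=1)
X = rng.normal(1.0, 1.0, size=(100, 2))
Y = X.copy()
V = np.zeros((100, 2))
for _ in range(1000):
    X, Y, V = pso_memory_step(X, Y, V, E, dt=0.01, m=0.2, gamma=0.8,
                              lam1=0.5, lam2=1.0, sig1=0.2, sig2=0.4,
                              alpha=50.0, beta=30.0, theta=0.01, kappa=2.0, rng=rng)
```

In the course of such a run the positions X and the regularized local bests Y align, which is exactly the behavior quantified by the decay of E[|X_t − Y_t|²] in the analysis below.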
3.1. Well-Posedness of PSO with Memory Effects. The well-posedness of the mean-field dynamics (1.8) is the content of Theorem 3.1, whose proof is based on the Leray–Schauder fixed point theorem and proceeds in the following steps.

Step 1: For a given function u ∈ C([0, T], R^d) and an initial measure f_0 ∈ P_4(R^{3d}), according to standard SDE theory [2, Chapter 7], we can uniquely solve the auxiliary SDE obtained from (1.8) by replacing y_α(ρ_{Y,t}) with u_t, with initial condition (X_0, Y_0, V_0) ∼ f_0, since, due to the smoothness of S^{β,θ} and Assumptions A2 and A3, the coefficients are locally Lipschitz and have at most linear growth. This induces f_t = Law(X_t, Y_t, V_t). Moreover, the regularity of f_0 ∈ P_4(R^{3d}) allows for a moment estimate of the form (3.1) and thus f ∈ C([0, T], P_4(R^{3d})).

Step 2: Let us now define, for some constant C > 0, the test function space C²_*(R^{3d}). For some φ ∈ C²_*(R^{3d}), by the Itô–Doeblin formula, we derive the dynamics of φ(X_t, Y_t, V_t), where we write φ for φ(X_t, Y_t, V_t). After taking the expectation, applying Fubini's theorem and observing that the stochastic integrals vanish due to the definition of the test function space C²_*(R^{3d}) and the regularity (3.1), we observe that f ∈ C([0, T], P_4(R^{3d})) satisfies the Vlasov–Fokker–Planck equation.

Step 3: The preceding estimates provide the self-mapping property of the map T : u ↦ y_α(ρ_{Y,(·)}), which is compact as a consequence of the stability estimate [5, Lemma 3.2] and the Hölder-1/2 continuity of t ↦ W_2(ρ_{Y,t}, ρ_{Y,s}) in the Wasserstein-2 distance.

Step 4: An application of the Leray–Schauder fixed point theorem then yields a fixed point of T and therewith a solution to (1.8).
Step 5: As for uniqueness, we assume the existence of two distinct fixed points u¹ and u² with associated processes (X¹, Y¹, V¹) and (X², Y², V²), respectively. Under the premise of the same initialization and identical Brownian motion paths, Grönwall's inequality ensures that they coincide, cf. [5, Theorem 3.1].

3.2. Convergence of PSO with Memory Effects to a Global Minimizer. Analogously to Section 2.2 we define a functional H(t), which is analyzed in this section to eventually prove its exponential decay and thereby consensus formation at some x̃ close to the global minimizer x^*. In addition to the requirements that the variance in the position and the second-order moment of the velocity E[|V_t|²] of the averaged particle vanish, we also expect that the particle's position X_t aligns with its personal best position Y_t over time, meaning that E[|X_t − Y_t|²] decays to zero. This motivates the definition of H(t) in (3.4), whose last two terms are required for technical reasons. Again, by the equivalence established in Lemma 3.1, proving the decay of E[H(t)] directly entails the decay of E[|X_t − E[X_t]|²] + E[|X_t − Y_t|²] + E[|V_t|²].

Lemma 3.2. Let E satisfy Assumptions A1–A3 and let (X_t, Y_t, V_t)_{t≥0} be a solution to the nonlinear SDE (1.8). Then E[H(t)] with H as defined in (3.4) satisfies the differential inequality (3.6).

Proof. Let us write δX_t := X_t − E[X_t] for short and note that the integration by parts formula gives (3.7). Observe that the stochastic integrals have vanishing expectations as a consequence of the regularity obtained in Theorem 3.1. An application of the Itô–Doeblin formula and Young's inequality yields (3.8). Again by employing the Itô–Doeblin formula we obtain (3.9), where, for the second line, we used the identity (3.7) and that E[⟨δX_t, C⟩] = 0 whenever C ∈ R^d is constant, allowing to expand the expression in the way done. We now rearrange the previous inequality to obtain an estimate for the mixed term. Next, using the Itô–Doeblin formula, we compute (3.10), where the last step follows from the fact that θ < S^{β,θ}(X_t, Y_t) < 2 + θ < 4. And lastly, the Itô–Doeblin formula and Young's inequality allow to bound (3.11). We now collect the bounds (3.8), (3.9), (3.10), and (3.11) to show (3.12). Recalling the computation (2.13) yields a further bound, where we inserted ±X_t and ±E[X_t] in the second step and used that (a + b + c)² ≤ 3(a² + b² + c²) as well as Jensen's inequality. Combining the last two bounds and choosing ε = 3λ_2/γ we obtain (3.6) as desired.
Remark 3.2. The exponential decay of E[H(t)] is obtained by choosing the parameters of PSO in a manner which ensures the negativity of the prefactors appearing in (3.6). This may be achieved, for any fixed time t, given α and arbitrary θ, σ_1, σ_2, γ > 0, by suitable choices of λ_1, λ_2 and κ together with a sufficiently small inertia weight m; the explicit conditions involve the quantity D^Y_t abbreviated there. In our main theorem on the convergence of the PSO dynamics with memory mechanisms to the global minimizer x^* we again ensure that the parameters can be chosen once at initialization time.
Theorem 3.3. Let E satisfy Assumptions A1–A4 and let (X_t, Y_t, V_t)_{t≥0} be a solution to the nonlinear SDE (1.8). Moreover, let us assume the well-preparation of the initial datum X_0 and V_0 in the sense that the conditions P1 (requiring µ_1 > 0), P2, and P3 hold. Then E[H(t)] with H as defined in Equation (3.4) converges exponentially fast with rate χ to 0 as t → ∞. Moreover, there exists some x̃, which may depend on α and f_0, such that E[X_t] → x̃ and y_α(ρ_{Y,t}) → x̃ exponentially fast with rate χ/2 as t → ∞. Eventually, for any given accuracy ε > 0, there exists α_0 > 0, such that for all α > α_0, x̃ satisfies E(x̃) − E ≤ ε. If E additionally satisfies Assumption A5, we additionally have |x̃ − x^*| ≤ ε^ν/η for a global minimizer x^*.

Remark 3.4. By replacing D^Y_t with 2D^Y_0 in the parameter choices of Remark 3.2, the well-preparation of the parameters as in Conditions P1 and P2 can be ensured.
In analogy to Remark 2.6, Condition P3 guarantees the well-preparation of the initialization.
Proof of Theorem 3.3. Let us define the time horizon T analogously to the proof of Theorem 2.5. Obviously, by continuity, T > 0. We claim that T = ∞, which we prove by contradiction in the following. Therefore, assume T < ∞. Then, for t ∈ [0, T], Lemma 3.2 provides an upper bound for the time derivative of the functional E[H(t)], where we made use of the upper bound of (3.5) as in Lemma 3.1 in the last inequality. The rate χ is defined implicitly, and it is straightforward to check that 0 < χ < γ/m, where the positivity of χ follows from the well-preparation conditions P1 and P2 of the initialization. Grönwall's inequality implies the bound (3.14); the subsequent estimate follows from the Cauchy–Schwarz inequality and uses Assumption A4 and S^{β,θ}(X_t, Y_t) < 4. Now, firstly, notice that the mixed term can be controlled by Young's inequality. Secondly, using again Assumption A4 in the first inequality, we obtain a bound whose next-to-last step uses the explicit decay (3.14). Using the two latter observations, by integrating (3.16) we obtain for all t ∈ [0, T] the estimate (3.17). Recalling the definition of the functional Y and employing Condition P3, we can deduce that for all t ∈ [0, T] the quantity defining T stays strictly above its threshold, which entails that there exists δ > 0 such that this remains true in [T, T + δ] as well, contradicting the definition of T and therefore showing the claim T = ∞.
As a consequence of (3.14), the decay estimate holds for all t ≥ 0. In particular, by means of Lemma 3.1, for a suitable generic constant C > 0, we infer the corresponding decay. Moreover, with Jensen's inequality, we deduce that E[X_t] → x for some x ∈ R^d, which may depend on α and f_0. According to (3.18), X_t → x as well as Y_t → x in mean-square. Moreover, by reusing the inequality (3.12), we obtain the analogous statement. The remainder of the proof follows the lines of the proof of Theorem 2.5, merely replacing X_t with Y_t.

4. A Holistic Convergence Statement of PSO without Memory Effects
In Sections 2 and 3 we analyzed the macroscopic behavior of PSO without and with memory effects in the mean-field regime. For this purpose we introduced the self-consistent mono-particle processes (1.8) and (2.3) associated with (1.2) and (2.1), respectively, for which we then established convergence guarantees under the assumptions specified in Theorems 2.5 and 3.3. However, in order to infer from these results the optimization capabilities of the numerically implemented PSO method, a quantitative estimate on how well the interacting particle system is approximated by the corresponding mean-field dynamics is necessary.
4.1. On the Mean-Field Approximation of PSO without Memory Effects. The following theorem provides a probabilistic quantitative estimate on the mean-field approximation for PSO without memory effects. Notably, the result does not suffer from the curse of dimensionality.
Theorem 4.1. Let T > 0, f_0 ∈ P_4(R^{2d}) and let N ∈ N be fixed. Moreover, let E obey Assumptions A1-A4. We denote by (X^i_t, V^i_t)_{t≥0}, i = 1, ..., N, the solution to system (2.1), and we let (X^i_t, V^i_t)_{t≥0}, i = 1, ..., N, be N independent copies of the solution to the mean-field dynamics (2.3). Then a uniform-in-time moment bound holds, where K = K(γ/m, λ/m, σ/m, T, E) is a constant which is in particular independent of N and d. Furthermore, if the processes share the initial data as well as the Brownian motion paths (B^i_t)_{t≥0} for all i = 1, ..., N, then we have a probabilistic mean-field approximation of the stated form, with a constant C_MFA = C_MFA(α, γ/m, λ/m, σ/m, T, E, K, M), which is in particular independent of N and d.
Proof. The proof is based on the arguments of [15, Section 3.3] about the mean-field approximation of CBO. First we compute a moment bound, which is then used to derive a mean-field approximation for PSO conditioned on the set Ω_M of uniformly bounded processes.
Step 1: Using standard inequalities and Jensen's inequality allows us to derive the bound (4.3) with C = C(T). For the velocities V^i_t we first note the decomposition (4.4). While the two middle terms on the right-hand side of (4.4) can be controlled as before by applying Jensen's inequality, the last term is treated as follows. Since the stochastic integral is a martingale, we can apply the Burkholder-Davis-Gundy inequality [41, Chapter IV, Theorem 4.1], which gives an estimate in which the latter step is again due to Jensen's inequality and which holds with a constant C = C(T). Utilizing these bounds allows us to continue the inequality in (4.4) and to obtain, with b_1 = 0 and b_2 = e^{α(sup E − inf E)} in the case that E is bounded, and with appropriately adjusted b_1, b_2 in the case that E satisfies the coercivity Assumption A3, the upper bound (4.7). Combining (4.3) and (4.7) yields an estimate which, averaged over i, allows us to derive a Grönwall-type bound. An application of Grönwall's inequality then ensures that E[sup_{t∈[0,T]} ...] is bounded by a constant K. Note that the constant K does in particular not depend on N or d. With identical arguments, an analogous bound holds for the mean-field processes. Step 2: We define the cutoff function I_M, which is a random variable adapted to the natural filtration and satisfies 1_{Ω_M} ≤ I_M(t) pointwise for all t ∈ [0, T] as well as I_M(t) = I_M(t)I_M(s) for all s ∈ [0, t]. Firstly, for the positions, using standard inequalities and Jensen's inequality, we obtain the bound (4.11) with C = C(T). Secondly, for the velocities we obtain the bound (4.12) with C = C(γ/m, λ/m, σ/m, T). In the first step of (4.12) we used that the interacting processes and their mean-field counterparts share the Brownian motion paths, and in the second step both the Itô isometry and Jensen's inequality. In order to conclude, it remains to control the term E[|x_α(ρ^N_{X,s}) − x_α(ρ_{X,s})|² I_M(s)]. To do so, in analogy to the definition of ρ^N_{X,s}, let us introduce the empirical measure associated with the mean-field processes.
Then, by following the proofs of [5, Lemma 3.2] and [13, Lemma 3.1], and exploiting the boundedness ensured by the multiplication with the random variable I_M(s), we obtain the desired estimate. Inserting the latter into (4.12) and adding up (4.11) and (4.12) yields a Grönwall-type inequality, where we used that the processes share the initial conditions. Lastly, by taking the maximum over i = 1, ..., N on both sides, we obtain the corresponding estimate with the same constant C as before. After recalling the definition of the conditional expectation, an application of Grönwall's inequality concludes the proof.
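Schematically, the martingale estimate used in Step 1 is the L²-version of the Burkholder-Davis-Gundy inequality, stated here in generic notation (not verbatim from the paper): for a suitable integrand σ_s,

```latex
\mathbb{E}\left[\sup_{t\in[0,T]}\left|\int_0^t \sigma_s\,\mathrm{d}B_s\right|^2\right]
\;\le\; C\,\mathbb{E}\left[\int_0^T \left\|\sigma_s\right\|^2\,\mathrm{d}s\right],
```

applied with σ_s proportional to the distance to the consensus point, after which Jensen's inequality yields a constant C = C(T) as in the proof.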
Remark 4.2. While the first part of Theorem 4.1 about the uniform-in-time boundedness of the empirical measures holds mutatis mutandis for the PSO dynamics with memory effects (1.2) and (1.8), it does not seem straightforward to obtain the second part in this setting, due to the way the memory effects are implemented in (1.2b). We leave the investigation of this extension to future research, in particular with regard to the question of whether other implementations of memory effects might resolve this issue.
for k = 0, ..., K, and where ((B^i_{k∆t})_{k=1,...,K−1})_{i=1,...,N} are independent, identically distributed standard Gaussian random vectors in R^d. Theorem 4.3. Let ε_total > 0 and δ ∈ (0, 1/2). Then, under the assumptions of Theorems 2.5 and 4.1, the discretized PSO dynamics (4.15) reaches the prescribed accuracy with a probability larger than the stated bound. Here, m denotes the order of accuracy of the used discretization scheme. Moreover, besides problem-dependent factors and the parameters of the method, the dependence of the constants is as follows: C_NA depends linearly on d and N, and exponentially on T*; C_MFA depends exponentially on α, T* and δ^{−1}; C_LLN depends on the moment bound from Theorem 2.3. Lastly, ε and ε̃ are chosen according to Theorem 2.5.
Proof. The overall error can be decomposed by the triangle inequality, where we used that P(Ω_M) ≥ (1 − δ) ≥ 1/2. By means of a classical result about the convergence of numerical schemes for SDEs [37], the first term in (4.17) can be bounded by C_NA(∆t)^m. For the second term, Theorem 4.1 gives the estimate C_MFA N^{−1}. The third term can be bounded by C_LLN N^{−1} as a consequence of the law of large numbers. Eventually, Theorem 2.5 allows us to choose T* = O(log(ε̃^{−1})/χ) sufficiently large to reach any prescribed accuracy ε̃ for the next-to-last term, as well as ε^{2ν}/η² for the last term by a suitable choice of α. With these individual bounds we obtain the claim for a sufficiently large choice of M in (4.1).
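Schematically, the proof combines four error contributions (notation simplified; the precise constants are as in Theorem 4.3):

```latex
\underbrace{\text{discretization}}_{\le\, C_{\mathrm{NA}}(\Delta t)^m}
\;+\;\underbrace{\text{mean-field approximation}}_{\le\, C_{\mathrm{MFA}}\,N^{-1}}
\;+\;\underbrace{\text{law of large numbers}}_{\le\, C_{\mathrm{LLN}}\,N^{-1}}
\;+\;\underbrace{\text{mean-field convergence}}_{\text{small for } T^{*}\,\sim\,\log(\tilde{\varepsilon}^{-1})/\chi \text{ and } \alpha \text{ large}}
\;\le\;\varepsilon_{\mathrm{total}}.
```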
A result in this spirit was first presented for CBO in [15, Theorem 14] and is hereby extended to PSO.

5. Implementation of PSO and Numerical Results
The purpose of this section is twofold. First, we provide an efficient implementation of PSO, which is particularly suited for high-dimensional optimization problems arising in machine learning. We then evaluate its performance on a standard benchmark problem, where we investigate the influence of the parameters, as well as on the training of a neural network classifier for handwritten digits. Furthermore, several relevant implementational aspects are discussed, including the computational complexity and scalability, modifications inspired by simulated annealing and evolutionary algorithms, and the numerical stability of the method.
5.1. An Efficient Implementation of PSO. Let us stress that PSO is an extremely versatile, flexible and customizable optimization method, which can be regarded as a black-box optimizer. As a zero-order method it does not rely on gradient information and can even be applied to discontinuous objectives, making it superior to first-order optimization methods in cases where derivatives are computationally infeasible. However, also in machine learning applications, where gradient-based optimizers are considered the state of the art, PSO may be of particular interest in view of vanishing or exploding gradient phenomena. Typical objective functions appearing in machine learning are of the form E(x) = (1/M) ∑_{j=1}^{M} E_j(x), (5.1) where E_j is usually the loss of the jth training sample. In order to run the scheme (1.2), frequent evaluations of E are necessary, which may be computationally intense or even prohibitive in some applications.
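To illustrate the finite-sum structure (5.1) and its mini-batch evaluation, here is a minimal numpy sketch; the function `per_sample_loss` is a hypothetical stand-in for E_j and not part of the paper:

```python
import numpy as np

def batched_objective(x, data, batch_idx, per_sample_loss):
    """Mini-batch objective E_batch(x) = (1/n_E) * sum over the batch of E_j(x)."""
    return np.mean([per_sample_loss(x, data[j]) for j in batch_idx])

# One epoch visits every sample exactly once, i.e., M / n_E batches.
rng = np.random.default_rng(0)
M, n_E = 12, 4
batches = rng.permutation(M).reshape(M // n_E, n_E)  # 3 disjoint batches of size 4
```

Evaluating E_batch instead of the full E reduces the cost per step from M to n_E per-sample evaluations.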
Computational complexity: Inspired by mini-batch gradient descent, the authors of [25] developed a random batch method for interacting particle systems, which was employed for CBO in [6]. In the same spirit, we present with Algorithm 1 a computationally efficient implementation of PSO. The mini-batch idea enters on two different levels. In line 7, the objective is defined with respect to a batch of the training data of size n_E, meaning that only a subsample of the data is considered. One epoch is completed after each data sample has been seen exactly once, i.e., after M/n_E steps. At each step the consensus point y_α has to be computed, for which E_batch needs to be evaluated for N particles. This still constitutes the most significant computational effort. However, the mini-batch idea can be leveraged a second time. In the for loop in line 9 we partition the particles into sets of size n_N and perform the updates of line 11 only for the n_N particles in the respective subset. Since this is embarrassingly parallel, a parallel machine can be deployed to reduce the runtime by up to a factor of p (the number of available processors). While this is referred to as a partial update, alternatively, on a sequential architecture, a full update can be performed at every iteration, requiring all N particles to be updated in line 11. Apart from lowering the required computing resources tremendously, these mini-batch ideas actually improve the stability of the method and its capability of finding good optima by introducing more stochasticity into the algorithm.

Algorithm 1 Particle swarm optimization (PSO)
Input: Objective E as in (5.1), time horizon T or number of epochs #epochs, discrete time step size ∆t, batch sizes n_N and n_E, parameters m, γ, λ_1, λ_2, σ_1, σ_2, α, β, θ and κ, number of particles N, initialization f_0
Output: Approximation y_α(ρ^N_{Y,T}) of the global minimizer x* of E
1: Generate the particles' initial positions and velocities (X^i_0, V^i_0)_{i=1,...,N} according to a common initial law f_0. Initialize the local best positions Y^i_0 = X^i_0.
2: Ensure that n_E divides M and n_N divides N. Define the objective function on this batch as E_batch.
11: Update either all particles (full update) or only the particles in the current batch P^n_k (partial update) according to a discretized version of the PSO dynamics (1.2).
12: if … is too small despite the stopping criterion not being fulfilled then
13: Perform an independent Brownian motion for the positions or velocities of all particles. Set k = k + 1.
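For concreteness, one step of PSO without memory effects, i.e., an explicit Euler-Maruyama discretization of system (2.1), can be sketched as follows. This is a simplified illustration with componentwise diffusion and illustrative parameter names, not the paper's exact scheme (4.15):

```python
import numpy as np

def pso_step(X, V, obj, dt, m, gamma, lam, sigma, alpha, rng):
    """One explicit Euler-Maruyama step for PSO without memory.

    X, V : (N, d) arrays of particle positions and velocities.
    Discretizes   dX = V dt,
                m dV = -gamma*V dt + lam*(x_alpha - X) dt
                       + sigma*(x_alpha - X) dB   (componentwise noise).
    """
    E = obj(X)                                        # (N,) objective values
    w = np.exp(-alpha * (E - E.min()))                # shifted, overflow-safe weights
    x_alpha = (w[:, None] * X).sum(axis=0) / w.sum()  # consensus point
    drift = x_alpha - X                               # (N, d)
    noise = rng.standard_normal(size=X.shape)
    V = V + (dt / m) * (-gamma * V + lam * drift) \
          + (sigma / m) * np.sqrt(dt) * drift * noise
    X = X + dt * V
    return X, V, x_alpha
```

Iterating `pso_step` on, e.g., the sphere function drives the swarm to consensus near the global minimizer for well-prepared parameters.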
Concerning the additional computational complexity due to the usage of memory effects, let us point out that, except for the required storage of the local (historical) best positions and their objective values, the update rule (1.3) in combination with the partial update allows the inclusion of such mechanisms at no additional cost, by keeping track of the objective values of the local best positions. In this case, only one function call of each E_batch per epoch and per particle is necessary, which is optimal and coincides with PSO without memory effects or CBO. A different realization of (1.2b) might result in a higher cost.

Implementational aspects: A discretization of the SDE (1.2) in line 11 can be obtained for instance from a simple Euler-Maruyama or semi-implicit scheme [37, 20], see, e.g., [18, Equation (6.3)]. In our numerical experiments below, Equation (1.3) is used for updating the local best position, which corresponds to κ = 1/(2∆t), θ = 0, and β = ∞. Furthermore, the friction parameter is set according to γ = 1 − m, which is a typical choice in the literature. Let us also remark that a numerically stable computation of the consensus point in lines 10 and 20 for α ≫ 1 can be obtained by replacing E_batch with E_batch − E_min, where E_min := min_{i∈P^n_k} E_batch(Y^i_{k∆t}).

Cooling and evolutionary strategies: The PSO algorithm can be divided into two phases, an exploration phase, where the energy landscape is searched coarsely, and a determination phase, where the final output is identified. While the former benefits from small α and large diffusion parameters, in the latter, α ≫ 1 guarantees the selection of the best solution. A cooling strategy inspired by simulated annealing allows one to start with moderate α and relatively large diffusion parameters σ_1, σ_2. After each epoch, α is multiplied by 2, while the diffusion parameters follow the schedule σ = σ/log(epoch + 2) for σ ∈ {σ_1, σ_2}. Such a strategy was proposed in [6, Section 4] for CBO. In order to further reduce the computational complexity, the provable decay of the variance suggests decreasing the number of agents by discarding particles in accordance with the empirical variance. A possible schedule for the number of agents is proposed in [17, Section 2.2].

For the numerical experiments on the Rastrigin function in Section 5.2, the local best position (if present) is updated according to Equation (1.3), i.e., κ = 1/(2∆t), θ = 0, and β = ∞. We moreover let λ_2 = 1 (or λ = 1 for PSO without memory) and ∆t = 0.01, which are such that the algorithm either finds consensus or explodes within the time horizon T = 100 in all instances. For simplicity we assume that σ_1 = λ_1σ_2. The algorithm is initialized with positions distributed according to N((2, ..., 2), 4 Id) and velocities according to N((0, ..., 0), Id). In Figure 2 we depict the phase diagram comparing the success probability of PSO for different choices of the inertia parameter m and the diffusion parameter σ or σ_2, respectively. We observe that for fixed m there is a noise threshold above which the dynamics explodes. While smaller m permit a larger flexibility in the used noise, they require an individual minimal noise level. Further numerical experiments suggest, however, that increasing the number of particles N allows for a lower minimal noise level. There are subtle differences between PSO without and with memory, but they are not decisive, as is also confirmed by the numerical experiments in Section 5.3, in [19, Section 5.3], as well as in the survey paper [18, Section 6.3].
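The cooling strategy just described can be sketched as follows; reading the update σ = σ/log(epoch + 2) as an in-place update after each epoch is our interpretation, and the reference [6, Section 4] may normalize differently:

```python
import math

def cool(alpha, sigma, epoch):
    """Cooling step applied after the given epoch: alpha is doubled,
    while each diffusion parameter sigma in {sigma_1, sigma_2} is
    updated as sigma <- sigma / log(epoch + 2)."""
    return 2.0 * alpha, sigma / math.log(epoch + 2)
```

Starting, e.g., from a moderate α, a few epochs of doubling already push the consensus computation towards selecting the currently best particle.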

5.3. A Machine Learning Application. We now showcase the practicability of PSO as implemented in Algorithm 1 using the example of a very competitive high-dimensional benchmark problem in machine learning: the classification of handwritten digits. In what follows we train a shallow NN and a convolutional NN (CNN) classifier on the MNIST dataset [30]. Let us point out that it is not our objective to challenge the state of the art by employing the most sophisticated model (deep CNNs achieve near-human performance of more than 99.5% accuracy). Instead, we wish to demonstrate that PSO reaches results comparable to SGD with backpropagation, while relying exclusively on evaluations of E.
In our experiment we use NNs with architectures as depicted in Figure 3. The input is a black-and-white image represented by a (28 × 28)-dimensional matrix with entries between 0 and 1. For the shallow NN (see Figure 3a), the flattened image is passed through a dense layer ReLU(W · + b) with trainable weights W ∈ R^{10×784} and bias b ∈ R^{10}. Our CNN (see Figure 3b) is similar to LeNet-1, cf. [29, Section III.C.7]. Each dense or convolutional layer has a ReLU activation and is followed by a batch normalization layer to speed up the training process. Eventually, the final layers of both NNs apply a softmax activation function, allowing one to interpret the 10-dimensional output vector as a probability distribution over the digits. We denote by θ the trainable parameters of the NNs, of which there are 7850 for the shallow NN and 2112 for the CNN. They are learned by minimizing E(θ) = (1/M) ∑_{j=1}^{M} ℓ(f_θ(x_j), y_j), where f_θ denotes the forward pass of the NN, (x_j, y_j) the jth image-label tuple, and ℓ the categorical crossentropy loss ℓ(ŷ, y) = −∑_{k=0}^{9} y_k log(ŷ_k). The performance is measured by counting the number of successful predictions on a test set. We use a train-test split of 60000 training and 10000 test images. For our experiments we choose λ_2 = 1, (σ_2)_initial = √0.4, α_initial = 50, ∆t = 0.1, and update the local best position according to Equation (1.3). We use N = 100 agents, which are initialized according to N((0, ..., 0)^T, Id) in position and velocity. The mini-batch sizes are n_E = 60 and n_N = 100 (consequently a full update is performed in line 11), and a cooling strategy is used in line 18.
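The softmax output layer and categorical crossentropy loss above are standard components; a minimal numpy sketch (a generic implementation, not code from the paper):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax mapping the 10-dimensional NN output
    to a probability distribution over the digits 0-9."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def crossentropy(y_hat, y):
    """Categorical crossentropy l(y_hat, y) = -sum_k y_k * log(y_hat_k)."""
    return -float(np.sum(y * np.log(y_hat)))
```

For a one-hot label and a uniform prediction over 10 classes, the loss equals log 10 ≈ 2.30.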
Figure 4a reports the performance for different memory settings and fixed m = 0.2, whereas Figure 4b depicts the results for different inertia parameters m in the case of PSO with memory but no memory drift. For the shallow NN we obtain a test accuracy of above 89%, whereas the CNN achieves almost 97%. To put these numbers into perspective: when trained with SGD, a similar performance for the shallow NN, see [6, Figure 7], and a benchmark accuracy of 98.3% for a comparable CNN, cf. [29, Figure 9], are reached. As can be seen from Figure 4a, the usage of the local best positions when computing the consensus point significantly improves the performance. The advantage of having an additional drift towards the local best position is less pronounced. Regarding the inertia parameter m in Figure 4b, our numerical results suggest that larger m yield faster convergence.

6. Conclusions
In this paper we prove the convergence of PSO without and with memory effects to a global minimizer of a possibly nonconvex and nonsmooth objective function in the mean-field sense. Our analysis holds under a suitable well-preparation condition on the initialization and comprises a rich class of objectives, which in particular includes functions with multiple global minimizers. For PSO without memory effects we furthermore quantify how well the mean-field dynamics approximates the interacting finite-particle dynamics, which is what is implemented in numerical experiments. Since in particular the latter result does not suffer from the curse of dimensionality, we thereby prove that the numerical PSO method has polynomial complexity. With this we contribute to the completion of a mathematically rigorous understanding of PSO. Furthermore, we propose a computationally efficient and parallelizable implementation and showcase its practicability by training a CNN to a performance comparable to stochastic gradient descent.
It remains an open problem to extend the mean-field approximation result to the variant of PSO with memory effects or, alternatively, to devise an implementation of such effects compatible with the used proof technique.Moreover, we also leave a more thorough understanding of the influence of the parameters as well as the influence of memory effects for future, more experimental research.
Finally, we believe that the analysis framework of this and prior works on CBO [36, 5, 15] motivates the investigation of other renowned metaheuristic algorithms through the lens of a mean-field limit.

Figure 1. An illustration of the PSO dynamics. Agents with positions X^1, ..., X^N (yellow dots with their trajectories) explore the energy landscape of E in search of the global minimizer x* (green star). The dynamics of each particle is governed by five terms. A local drift term (light blue arrow) imposes a force towards its local best position Y^i_t (indicated by a circle). A global drift term (dark blue arrow) drags the agent towards a momentaneous consensus point y_α(ρ^N_{Y,t}) (orange circle), computed as a weighted average (visualized through color opacity) of the particles' local best positions. Friction (purple arrow) counteracts inertia. The two remaining terms are diffusion terms (light and dark green arrows), each associated with a respective drift term.

3.1. Well-Posedness of PSO with Memory Effects. Ensured by a sufficiently regularized implementation of the local best position Y, we can show the well-posedness of the mean-field PSO dynamics (1.8), respectively of the associated Vlasov-Fokker-Planck equation (1.7). Theorem 3.1. Let E satisfy Assumptions A1-A3 and let m

An analogous bound can be obtained for E[sup_{t∈[0,T]} ...]. The first claim of the statement now follows from Markov's inequality.

…where Σ_epoch and Σ'_epoch denote the empirical variances of the N_epoch particles at the beginning and at the end of the current epoch.

5.2. Numerical Experiments for the Rastrigin Function. Before turning to high-dimensional optimization problems, let us discuss the parameter choices of PSO in moderate dimensions (d = 20) using the example of the well-known Rastrigin benchmark function E(v) = ∑_{k=1}^{d} (v_k² + (5/2)(1 − cos(2πv_k))), which meets the requirements of Assumption 2.2 despite being highly non-convex with many spurious local optima. To narrow down the number of tunable parameters, we let γ = 1 − m, choose α = 100 and N = 100, and update the local best position (if present) according to Equation (1.3).
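The Rastrigin-type objective above can be written compactly in numpy as follows (a straightforward transcription of the stated formula with the 5/2-coefficient, not code from the paper):

```python
import numpy as np

def rastrigin(v):
    """Rastrigin-type benchmark E(v) = sum_k v_k^2 + 5/2 * (1 - cos(2*pi*v_k)).
    Global minimum E(0) = 0; many spurious local minima near integer points."""
    v = np.asarray(v, dtype=float)
    return np.sum(v ** 2 + 2.5 * (1.0 - np.cos(2.0 * np.pi * v)), axis=-1)
```

At integer points the cosine term vanishes, so, e.g., the all-ones vector in d = 20 is a spurious local minimizer with value 20.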

Figure 2. Phase transition diagrams comparing PSO without and with memory effects for different inertia parameters m and noise coefficients σ (PSO without memory) and σ_2 (PSO with memory). The empirical success probability is computed from 25 runs and depicted by color, from zero (blue) to one (yellow).
Figure 3b: CNN with two convolutional and two pooling layers, and one dense layer.
(a) PSO for three different memory settings: without memory (lightest lines); with memory but without local best drift, i.e., λ_1 = 0, σ_1 = 0 (lines with intermediate opacity); with memory and local best drift λ_1 = 0.4, σ_1 = λ_1σ_2 (darkest lines). (b) PSO with memory but without local best drift for three different inertia parameters m ∈ {0.1, 0.2, 0.4} (increasing opacity for larger m). Note that, for reference, the lines with intermediate opacity coincide with the ones of Figure 4a.

Figure 4. Comparison of the performance of a shallow (dashed lines) and a convolutional (solid lines) NN with architectures as described in Figure 3, when trained with PSO as in Algorithm 1. Depicted are the accuracies on a test dataset (orange lines) and the values of the objective function E (blue lines), evaluated on a random sample of the training set of size 10000.
Lemma 3.2. Let E satisfy Assumptions A1-A3 and let (X_t, Y_t, V_t)_{t≥0} be a solution to the nonlinear SDE (1.8). Then E[H(t)], with H as defined in (3.4), satisfies the differential inequality (3.5). We now derive an evolution inequality for the quantity E[H(t)].
in Lemma 3.2 are upper bounded by −µ_1 and −µ_2, respectively. Lemma 3.2 then provides an upper bound for the time derivative of the functional E[H(t)], see (3.14). We now investigate the evolution of the functional Y(t) := E[exp(−αE(Y_t))]. The Itô-Doeblin formula yields the corresponding evolution equation, cf. [18].

4.2. Convergence of PSO without Memory Effects in Probability. Combining Theorem 4.1 with the convergence analysis of the mean-field dynamics (2.1) as described in Theorem 2.5, as well as a classical result about the numerical approximation of SDEs, allows us to obtain convergence guarantees with provable polynomial complexity for the numerical PSO method, as stated in Theorem 4.3 below. Let us, for the reader's convenience, recall from [18, Section 6] that a possible discretized version of the interacting particle system (2.1) is given by (4.15). It now remains to estimate the probability of the set K^N_{ε_total} ⊂ Ω on which Inequality (4.16) does not hold. By utilizing the conditional version of Markov's inequality together with the formerly established bound (4.18), we obtain the desired estimate.

8: Partition the particles, i.e., the set {1, ..., N}, into disjoint sets P^1_k, ..., P^{N/n_N}_k of size n_N.
9: for n = 1, ..., N/n_N do
10: Compute the consensus point y_α according to Equation (1.4) with objective E_batch from the particles in P^n_k, i.e., with the corresponding empirical measure.
18: Check the stopping criterion and break if fulfilled. If not, employ the optional strategies described at the end of Section 5.1, set epoch = epoch + 1 and continue.
19: end while
20: Compute the consensus point y_α(ρ^N_{Y,T}) according to Equation (1.4) with objective E from all particles, i.e., with the full empirical measure ρ^N_{Y,T}.