Parallel sequential Monte Carlo for stochastic gradient-free nonconvex optimization

We introduce and analyze a parallel sequential Monte Carlo methodology for the numerical solution of optimization problems that involve the minimization of a cost function that consists of the sum of many individual components. The proposed scheme is a stochastic zeroth order optimization algorithm which demands only the capability to evaluate small subsets of components of the cost function. It can be depicted as a bank of samplers that generate particle approximations of several sequences of probability measures. These measures are constructed in such a way that they have associated probability density functions whose global maxima coincide with the global minima of the original cost function. The algorithm selects the best performing sampler and uses it to approximate a global minimum of the cost function. We prove analytically that the resulting estimator converges to a global minimum of the cost function almost surely and provide explicit convergence rates in terms of the number of generated Monte Carlo samples and the dimension of the search space. We show, by way of numerical examples, that the algorithm can tackle cost functions with multiple minima or with broad "flat" regions which are hard to minimize using gradient-based techniques.


Introduction
In signal processing and machine learning, optimization problems of the form

min_{θ∈Θ} f(θ) = min_{θ∈Θ} Σ_{i=1}^n f_i(θ), (1.1)

where Θ ⊂ R^d is the d-dimensional compact search space, have attracted significant attention in recent years for problems where n is very large. Such problems often arise in big data settings, e.g., when one needs to estimate parameters given a large number of observations (Bottou et al 2018).
Because of their efficiency, the optimization community has focused mainly on stochastic gradient based methods (Robbins and Monro 1951; Duchi et al 2011; Kingma and Ba 2014) (see Bottou et al (2018) for a recent review of the field) where an estimate of the gradient is obtained using a randomly selected subsample of the gradients of the component functions (the f_i's in Eq. (1.1)) at each iteration. The resulting estimate is then used to perform a stochastic descent step. The majority of these stochastic gradient methods construct the subsamples using sampling with replacement to obtain unbiased estimates of the gradient. The latter can then be seen as a noisy gradient estimate with additive, zero-mean noise. In practice, however, there are schemes that subsample the data set without replacement (hence producing biased gradient estimators) and it has been argued that such methods can attain better numerical performance (Gürbüzbalaban et al 2015; Shamir 2016).
The gradient information may not always be available, however, for different reasons. For example, in an engineering application, the system to be optimized might be a black box, e.g., a piece of closed software code with free parameters, which can be evaluated but cannot be differentiated (Nesterov and Spokoiny 2011). In these cases, one needs to use a gradient-free optimization scheme, meaning that the scheme must rely only on function evaluations, rather than any sort of actual gradient information. Classical gradient-free optimization methods have attracted significant interest over the past decades (Appel et al 2004; Spall 2005; Mariño and Míguez 2007; Conn et al 2009). These methods proceed either by a random search (which is based on evaluating the cost function at random points and updating the parameter whenever a descent in the function evaluation is achieved (Spall 2005)), or by constructing a numerical (finite-difference type) approximation of the gradient that can be used to take a descent step (Nesterov and Spokoiny 2011).
Such methods are not applicable, however, if one can only obtain noisy function evaluations or one can only evaluate certain subsets of component functions in a problem like (1.1). In this case, since the function evaluations are not exact, direct random search methods cannot be used reliably, and it is only recently that some authors have described how to compute finite-difference approximations of the gradient (Wibisono et al 2012; Ghadimi and Lan 2013; Chen and Wild 2015; Bach and Perchet 2016). Also in recent years, evolutionary methods, based on the mutation, recombination and selection of samples, have been suggested for the approximation of gradients. The resulting optimization algorithms, termed evolutionary strategies (ES), have been applied within reinforcement learning schemes (Salimans et al 2017; Wierstra et al 2014; Hansen and Ostermeier 2001; Morse and Stanley 2016).
However, when the cost function has multiple minima or has some regions where the gradient vanishes, gradient-based methods may suffer from poor numerical performance. In particular, the optimizer can get stuck in a local minimum easily, due to its reliance on gradient approximations. Moreover, when the gradient contains little information about any minimum (e.g., in flat regions), gradient-free stochastic optimizers (as well as perfect gradient schemes) can suffer from slow convergence.
Model-based random-search methods (Hu et al 2012), which use probabilistic models of various types in order to speed up the search procedure, have been investigated in order to address problems where gradients cannot be approximated or simply turn out to be ineffective. The latter include classical algorithms such as simulated annealing (SA) (Kirkpatrick et al 1983), Monte Carlo expectation maximization (EM) (Robert and Casella 2004) and other Markov chain Monte Carlo (MCMC) based methods (Pereyra et al 2015). The class of model-based random search schemes also encompasses sequential Monte Carlo (SMC) techniques, e.g., SMC implementations of SA (Zhou and Chen 2013) and several optimization algorithms that mimic standard particle filters (Liu et al 2016). Let us note that most of the latter MCMC- and SMC-based procedures can be cast within the class of SMC samplers described in Del Moral et al (2006), albeit with a target distribution which is sometimes implicitly defined in order to satisfy certain properties related to the objective function. Nevertheless, these optimization techniques are generally designed to be used in problems where the objective function can be evaluated exactly, and their extension to stochastic optimization is not straightforward, neither from the point of view of practical performance nor in terms of theoretical convergence guarantees.
Some authors have also explored the duality between optimization and probability theory, in a way that potentially enables the use of general computational inference algorithms for solving optimization problems. While in model-based optimization the emphasis is put on the algorithms (e.g., how to use MCMC methods in Pereyra et al (2015), or particle filters in Liu et al (2016), for optimization), in this line of research the emphasis is on converting the optimization problem into an equivalent inference problem, which can then be tackled with any suitable inference algorithm. A rigorous mathematical treatment of the topic can be found in Del Moral and Doisy (1999), while Ikonen et al (2005) and Míguez et al (2013) address the problem from a methodological viewpoint. Again, though, these contributions deal with problems where the objective function can be computed deterministically and exactly.
The stochastic setting, where it is only possible to compute noisy evaluations of f(θ), is harder, and the bibliography is limited in comparison with the deterministic setup. The recent survey in Homem-de Mello and Bayraksan (2014) covers various gradient-based Monte Carlo procedures; however, it addresses a different class of stochastic optimization problems, where the cost function itself is defined as an expectation rather than a finite sum as in (1.1). Existing model-based random methods for stochastic optimization include MCMC-based samplers which target a probability density function (pdf) matched to the objective function in (1.1) (meaning that the maxima of the pdf coincide with the minima of f(θ)) (Welling and Teh 2011; Chen et al 2016). Such schemes, however, also rely on the computation of noisy gradients. Other MCMC-based methods (see, e.g., Alquier et al (2016), which employs noisy Metropolis steps) do not require gradients, yet these techniques have been primarily designed and investigated as sampling algorithms, rather than optimization methods. Similarly, an adaptive importance sampler for a target pdf matched to f(θ) is reported in Akyildiz et al (2017). This method uses subsampling to compute noisy weights, but the technique lacks theoretical guarantees and does not address the optimization problem directly either. A particle filtering algorithm for stochastic global optimization has been proposed by Stinis (2012). The method is intuitive, simple to implement and has been shown to work efficiently in some simple examples; however, the contribution of Stinis (2012) is strictly methodological: there is no analysis of performance and no theoretical guarantees.
In this paper, we propose a parallel sequential Monte Carlo optimizer (PSMCO) to minimize cost functions with the finite-sum structure of problem (1.1). The PSMCO is a zeroth-order stochastic optimization algorithm, in the sense that it only uses evaluations of small batches of individual components f i (θ) in (1.1). In particular, it does not require the computation or approximation of gradients. The proposed scheme proceeds by constructing parallel samplers, each of which aims at minimizing the same cost function f (θ). Each sampler performs subsampling without replacement to obtain its mini-batches of individual components and processes each component only once. Using these mini-batches, the PSMCO constructs potential functions, propagates samples via a jittering scheme (Crisan and Miguez 2018) and selects samples by applying a weighting-resampling procedure. The communication between parallel samplers is only necessary when a joint estimate of the minimum is required. In this case, the best performing sampler is selected and the minimum is estimated.
We analytically prove that the estimate provided by each sampler converges almost surely to a global minimum of the cost function and provide explicit convergence rates in terms of the number of Monte Carlo samples generated by the algorithm. This type of analysis goes beyond standard results for particle filters: it tackles the problem of stochastic optimization directly and it yields stronger theoretical guarantees compared to other stochastic optimization methods in the literature. In particular, we obtain error bounds for the solution of problem (1.1) that hold almost surely (a.s.) and vanish at a rate O(N^{−1/(2(d+1))}), where N is the number of Monte Carlo samples and d is the dimension of the search space Θ. This is in contrast to the usual results for random search methods in the literature, which are purely asymptotic and do not provide any rates (Appel et al 2004; Miguez 2010; Hu et al 2012; Zhou and Chen 2013). Let us also remark on the difference between the proposed scheme and the SMC-based schemes in Míguez et al (2013), where the authors partitioned the parameter vector and modeled it as a dynamical system, an approach that cannot be used in the more general setup of (1.1) because each individual function f_i depends on the complete vector θ. The PSMCO algorithm, in turn, is explicitly designed to provide an estimate of the full parameter θ at each iteration.
The main contribution of this paper includes the theoretical analysis of the proposed PSMCO scheme and its numerical demonstration on three problems where classical stochastic optimization methods (especially gradient-based algorithms) struggle to perform. The paper is organized as follows. After a brief survey of the relevant notation (below), we lay out the relationship between Bayesian inference and optimization in Section 2. Then, we develop a sequential Monte Carlo scheme in Section 3. In Section 4, we analyze this scheme and investigate its theoretical properties. We present some numerical results in Section 5 and make some concluding remarks in Section 6.

Stochastic optimization as inference
In this section, we describe how to construct a sequence of probability distributions that can be linked to the solution of problem (1.1). Let π_0 ∈ P(Θ) be the initial element of the sequence. We construct the rest of the sequence recursively as

π_t(dθ) = G_t(θ) π_{t−1}(dθ) / (G_t, π_{t−1}), t = 1, . . . , T, (2.1)

where the maps G_t : Θ → R_+ are termed potential functions (Del Moral 2004). The key idea is to associate these potentials (G_t)_{t≥1} with mini-batches of individual components of the cost function (subsets of the f_i's) in order to construct a sequence of measures π_0, π_1, . . . , π_T such that (for a prescribed value of T) the global maxima of the density of π_T match the global minima of f(θ). We remark that the measures π_1, . . . , π_T are all absolutely continuous w.r.t. π_0 if the potential functions G_t, t = 1, . . . , T, are bounded.
To construct the potentials, we use mini-batches consisting of K individual functions f_i for each iteration t. To be specific, we randomly select subsets of indices I_t, t = 1, . . . , T, by drawing uniformly from {1, . . . , n} without replacement. Each subset has |I_t| = K elements, in such a way that we obtain T subsets satisfying ∪_{t=1}^T I_t = [n] and I_i ∩ I_j = ∅ when i ≠ j. Finally, we define the potential functions (G_t)_{t≥1} as

G_t(θ) = exp( − Σ_{i∈I_t} f_i(θ) ), t = 1, . . . , T. (2.2)

In the sequel, we provide a result that establishes a precise connection between the optimization problem in (1.1) and the sequence of probability measures defined in (2.1), provided that Assumption 1 below is satisfied.
Next, we show the relationship between the minima of f(θ) and the maxima of dπ_T/dπ_0.

Proposition 1. Assume that the potentials are selected as in (2.2) for 1 ≤ t ≤ T, with I_i ∩ I_j = ∅ and ∪_t I_t = [n]. Let π_T be the T-th probability measure constructed by means of recursion (2.1). If Assumption 1 holds and π_0 ∈ P(Θ), then

argmin_{θ∈Θ} f(θ) = argmax_{θ∈Θ} dπ_T/dπ_0 (θ),

where dπ_T/dπ_0 (θ) : Θ → R_+ denotes the Radon–Nikodym derivative of π_T w.r.t. the prior measure π_0.
For conciseness, we abuse the notation and use π(θ), θ ∈ Θ, to indicate the pdf associated to a probability measure π(dθ). The two objects are distinguished clearly by the context (e.g., for an integral (ϕ, π), π necessarily is a measure) but also by their arguments. The probability measure π(·) takes arguments dθ or A ∈ B(Θ), while the pdf π(θ) is a function Θ → [0, ∞).
Remark 1. Notice that when π_0 is a uniform probability measure on Θ, we simply have π_T(θ) ∝ exp(−f(θ)), where π_T(θ) denotes the pdf (w.r.t. the Lebesgue measure) of the measure π_T(dθ).
Remark 2. Moreover, if we choose a prior of the form

π_0(θ) ∝ exp(−f_1(θ)) (2.3)

and select index subsets such that ∪_{t=1}^T I_t = {2, . . . , n}, then we also obtain π_T(θ) ∝ exp(−f(θ)). When a Monte Carlo scheme is used to realize recursion (2.1), the use of a prior of the form (2.3) requires the ability to sample from it.
In summary, if we can construct the sequence described by (2.1), then we can replace the minimization of f(θ) in (1.1) by the maximization of a pdf. This relationship was exploited in a Gaussian setting in Akyildiz et al (2018), i.e., the special case of a Gaussian prior π_0 and log-quadratic potentials (G_t)_{t≥1} (Gaussian likelihoods), which makes it possible to implement recursion (2.1) analytically. The solution of this special case can be shown to match a well-known stochastic optimization algorithm, called the incremental proximal method (Bertsekas 2011), with a variable metric. However, for general priors and potentials, it is not possible to construct (2.1) analytically and maximize π_T(θ). For this reason, we propose a simulation method to approximate the recursion (2.1) and solve argmax_{θ∈Θ} dπ_T/dπ_0 (θ).
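As a simple illustration of this construction, the following sketch runs recursion (2.1) with potentials of the form (2.2) and a uniform prior on a one-dimensional grid, and checks that the resulting density is proportional to exp(−f(θ)), in agreement with Remark 1. The toy quadratic components, the grid discretization and all numerical values are assumptions made only for this example.

```python
import numpy as np

# Toy setup (assumed for illustration): n quadratic components on a one-dimensional grid.
rng = np.random.default_rng(0)
n, T = 20, 10                             # n components, split into T mini-batches
theta = np.linspace(-3.0, 3.0, 1001)      # discretized search space Theta
centers = rng.normal(0.0, 1.0, size=n)
f_comp = [(theta - c) ** 2 for c in centers]   # individual components f_i(theta)
f = np.sum(f_comp, axis=0)                     # full cost f(theta) = sum_i f_i(theta)

# Disjoint index sets I_1, ..., I_T covering the n components.
batches = np.array_split(rng.permutation(n), T)

# Recursion (2.1): pi_t is proportional to G_t * pi_{t-1}, with
# G_t(theta) = exp(-sum_{i in I_t} f_i(theta)) and a uniform prior pi_0.
pi = np.ones_like(theta)
pi /= pi.sum()
for I_t in batches:
    s = np.sum([f_comp[i] for i in I_t], axis=0)
    G_t = np.exp(-(s - s.min()))          # shift for numerical stability (does not change pi_t)
    pi = G_t * pi
    pi /= pi.sum()

# The density of pi_T should be proportional to exp(-f(theta)) (Remark 1).
target = np.exp(-(f - f.min()))
target /= target.sum()
print("argmax of pi_T :", theta[np.argmax(pi)])
print("argmin of f    :", theta[np.argmin(f)])
print("max abs difference between pi_T and exp(-f)/Z:", np.abs(pi - target).max())
```

On this toy example the maximizer of π_T and the minimizer of f coincide on the grid, which is exactly the duality exploited by the algorithm of Section 3.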

The algorithm
In this section we first describe a sampler to simulate from the distributions defined by recursion (2.1). We then describe an algorithm which runs these samplers in parallel. The parallelization here is not primarily motivated by the computational gain (although it can be substantial). We have empirically found that noninteracting parallel samplers are able to keep track of multiple minima better than a single "big" sampler. For this reason, we will not focus on demonstrating computational gains in the experimental section. Rather, we will discuss what parallelization brings in terms of providing better estimates.
We consider M workers (corresponding to M samplers). Specifically, each worker sees a different configuration of the dataset, i.e., the m-th worker constructs a distinct sequence of index sets (I_t^{(m)})_{t≥1} which determine the mini-batches sampled from the full set of individual components. Having obtained different mini-batches, which are randomly constructed, each worker then constructs different potentials (G_t^{(m)})_{t≥1}. The m-th worker, therefore, aims at estimating a specific sequence of probability measures π_t^{(m)}, for m ∈ {1, . . . , M}. We denote the particle approximation of the posterior π_t^{(m)} at time t as

π_t^{(m),N}(dθ) = (1/N) Σ_{i=1}^N δ_{θ_t^{(i,m)}}(dθ),

where δ_{θ′}(dθ) is the unit delta measure located at θ′ ∈ Θ. Overall, the algorithm retains M probability distributions. Note that these distributions are different for each t < T, as they depend on different potentials, but π_T^{(m)} = π_T for every m, since by time t = T each worker has processed all n components of the cost function.

One iteration of the algorithm on a local worker m can be described as follows. Assume the worker has computed the probability measure π_{t−1}^{(m),N} using the particle set {θ_{t−1}^{(i,m)}}_{i=1}^N. First, we use a jittering kernel κ(dθ|θ_{t−1}) (a Markov kernel on Θ) to modify the particles (Crisan and Miguez 2018) (see Section 3.1 for the precise definition of κ(·|·)). The idea is to jitter a subset of the particles in order to modify and propagate them into better regions of Θ, with higher probability density and lower cost. The particles are jittered by sampling

θ̂_t^{(i,m)} ∼ κ(dθ|θ_{t−1}^{(i,m)}), i = 1, . . . , N.

Note that the jittering kernel may be designed so that it only modifies a subset of particles (again, see Section 3.1 for details). Next, we compute weights for the new set of particles {θ̂_t^{(i,m)}}_{i=1}^N,

w_t^{(i,m)} = G_t^{(m)}(θ̂_t^{(i,m)}) / Σ_{j=1}^N G_t^{(m)}(θ̂_t^{(j,m)}), i = 1, . . . , N.

Remark 3. The particle weights can be made proportional to the potentials alone, i.e., w_t^{(i,m)} ∝ G_t^{(m)}(θ̂_t^{(i,m)}), as long as the jittering kernels satisfy Assumption 2 in Section 3.1. Under mild assumptions, Algorithm 1 converges with standard error rates O(N^{−1/2}), as proved in Section 4.

After obtaining the weights, each worker performs a resampling step where, for i = 1, . . . , N, we set θ_t^{(i,m)} = θ̂_t^{(j,m)} with probability w_t^{(j,m)}, j ∈ {1, . . . , N}. The procedure just described corresponds to a simple multinomial resampling scheme, but other standard methods can be applied as well (Douc and Cappé 2005). We denote the resulting probability measure constructed at the t-th iteration of the m-th worker as π_t^{(m),N}(dθ) = (1/N) Σ_{i=1}^N δ_{θ_t^{(i,m)}}(dθ). The full procedure for the m-th worker is outlined in Algorithm 1.

Algorithm 1 Sampler on a local node m
1: Sample θ_0^{(i,m)} ∼ π_0 for i = 1, . . . , N.
2: for t ≥ 1 do
3:   Jitter by generating samples θ̂_t^{(i,m)} ∼ κ(·|θ_{t−1}^{(i,m)}), for i = 1, . . . , N.
4:   Compute normalized weights w_t^{(i,m)} ∝ G_t^{(m)}(θ̂_t^{(i,m)}), for i = 1, . . . , N.
5:   Resample to obtain the new particle set {θ_t^{(i,m)}}_{i=1}^N.
6: end for

In Section 3.1, we elaborate on the selection of the jittering kernels and in Section 3.2, we detail the scheme for estimating a global minimum of f(θ) from the set of random measures {π_t^{(m),N}}_{m=1}^M.
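A minimal Python sketch of a single worker (Algorithm 1) is given below. The uniform prior support, the Gaussian choice for τ inside the jittering kernel, the toy interface for the component functions and all numerical values are assumptions made only for this illustration; the marginal-likelihood update anticipates the selection step of Section 3.2.

```python
import numpy as np

def local_sampler(f_components, batches, N=100, d=2, sigma_j=0.5, rng=None):
    """Sketch of Algorithm 1: one PSMCO worker with jittering, weighting and resampling.

    f_components : list of callables, f_i(theta) -> float, with theta of shape (d,)
    batches      : list of disjoint index arrays I_1, ..., I_T (one mini-batch per iteration)
    """
    rng = np.random.default_rng() if rng is None else rng
    eps_N = 1.0 / np.sqrt(N)                           # jittering probability, eps_N <= 1/sqrt(N)
    particles = rng.uniform(-50.0, 50.0, size=(N, d))  # i.i.d. draws from a uniform prior pi_0 (assumed support)
    log_Z = 0.0                                        # running log of the marginal likelihood estimate

    for I_t in batches:
        # Jittering kernel (3.1): each particle is left untouched w.p. 1 - eps_N,
        # and otherwise moved with a Gaussian tau centered at its current position.
        move = rng.random(N) < eps_N
        jittered = particles.copy()
        jittered[move] += sigma_j * rng.standard_normal((int(move.sum()), d))

        # Potential of the t-th mini-batch: G_t(theta) = exp(-sum_{i in I_t} f_i(theta)).
        log_G = -np.array([sum(f_components[i](th) for i in I_t) for th in jittered])
        w = np.exp(log_G - log_G.max())                # stabilized (unnormalized) weights
        log_Z += np.log(w.mean()) + log_G.max()        # log of Z_t^N = (1/N) sum_i G_t(jittered_i)
        w /= w.sum()

        # Multinomial resampling.
        particles = jittered[rng.choice(N, size=N, p=w)]

    return particles, log_Z

# Tiny demo with assumed quadratic components f_i(theta) = ||theta - c_i||^2.
rng = np.random.default_rng(0)
cs = rng.normal(size=(1000, 2))
comps = [(lambda th, c=c: float(np.sum((th - c) ** 2))) for c in cs]
parts, logZ = local_sampler(comps, np.array_split(rng.permutation(1000), 100), N=50)
print(parts.mean(axis=0), logZ)
```

Running M such workers on different random partitions of {1, . . . , n} and comparing their accumulated log marginal-likelihood estimates implements the selection step described in Section 3.2.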

Jittering kernel
The jittering kernel constitutes one of the key design choices of the proposed algorithm. Following Crisan and Miguez (2018), we put the following assumption on the kernel κ.

Assumption 2. The jittering kernel κ satisfies

sup_{θ′∈Θ} ∫ |ϕ(θ) − ϕ(θ′)| κ(dθ|θ′) ≤ c_κ ‖ϕ‖_∞ / √N

for any ϕ ∈ B(Θ), where c_κ < ∞ is a constant independent of N.
In this paper, we use kernels of the form

κ(dθ|θ′) = (1 − ε_N) δ_{θ′}(dθ) + ε_N τ(dθ|θ′), (3.1)

where ε_N ≤ 1/√N, which satisfy Assumption 2 (Crisan and Miguez 2018). The kernel τ can be rather simple, such as a multivariate Gaussian or multivariate-t distribution centered around θ′ ∈ Θ. Other choices of τ are possible as well.
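As a concrete sketch, a draw from a kernel of the form (3.1) can be generated by first deciding, with probability ε_N, whether the particle moves at all, and then sampling from τ; both the Gaussian and the multivariate-t choices mentioned above are covered. The scale parameter and the example values are placeholders.

```python
import numpy as np

def jitter(theta_prev, eps_N, scale=0.5, df=None, rng=None):
    """Draw from kappa(.|theta_prev) = (1 - eps_N) * delta_{theta_prev} + eps_N * tau(.|theta_prev)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= eps_N:              # with probability 1 - eps_N the particle is not modified
        return theta_prev.copy()
    step = rng.standard_normal(theta_prev.shape)      # Gaussian tau centered at theta_prev ...
    if df is not None:                                # ... or a multivariate-t: rescale by a chi-square draw
        step /= np.sqrt(rng.chisquare(df) / df)
    return theta_prev + scale * step

# Example: jitter N particles with eps_N = 1/sqrt(N), so only about sqrt(N) of them move on average.
N = 100
eps_N = 1.0 / np.sqrt(N)
particles = np.zeros((N, 2))
jittered = np.array([jitter(p, eps_N) for p in particles])
print((jittered != particles).any(axis=1).sum(), "particles were moved")
```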
Remark 4. The design of the kernel as a centered Gaussian or a multivariate-t distribution around θ′ may not guarantee the propagation of samples into better (lower-cost) regions. In this case, the weighting-and-resampling procedure naturally tends to keep and replicate the particles that attain a lower cost. However, the jittering kernel can also be designed to accelerate the optimization process. In particular, our setup allows for the use of gradient estimators (such as finite-difference schemes (Nesterov and Spokoiny 2011) or nudging steps (Akyildiz and Míguez 2020)) in the jittering kernel to accelerate the propagation of samples into lower-cost regions.

Estimating the global minima of f (θ)
In order to estimate the global minima of f(θ), we first assess the performance of the samplers run by each worker. A typical performance measure is the marginal likelihood estimate resulting from π_t^{(m),N}. After choosing the worker that has attained the highest marginal likelihood (say the m_0-th worker), we estimate a minimum of f(θ) by selecting the particle θ_t^{(i,m_0)} that yields the highest density estimate.

To be precise, let us start by denoting the incremental marginal likelihood associated to π_t^{(m)} and its estimate as

Z_t^{(m)} = (G_t^{(m)}, π_{t−1}^{(m)}) and Z_t^{(m),N} = (G_t^{(m)}, π_{t−1}^{(m),N}),

and then updating the running products

Z_{1:t}^{(m)} = Z_{1:t−1}^{(m)} Z_t^{(m)} and Z_{1:t}^{(m),N} = Z_{1:t−1}^{(m),N} Z_t^{(m),N}.

The quantity Z_{1:t}^{(m),N} is a local performance index that keeps track of the "quality" of the m-th particle system {θ_t^{(i,m)}}_{i=1}^N (Elvira et al 2017) and, hence, we use {Z_{1:t}^{(m),N}}_{m=1}^M to select the best performing worker. Denoting this worker by m_0, we obtain a maximum-a-posteriori (MAP) estimator,

θ_t^{⋆,N} ∈ argmax_{θ ∈ {θ_t^{(1,m_0)}, . . . , θ_t^{(N,m_0)}}} p_t^{h,N}(θ), (3.2)

where p_t^{h,N} is the kernel density estimator (Silverman 1998; Wand and Jones 1994) described in Remark 5. Note that we do not construct the entire density estimator and maximize it over Θ; we only evaluate it at the particle locations. Since this operation is performed locally on the particles from the best performing sampler, it involves O(N²) operations, where N is the number of particles on a single worker, which is much smaller than the total number MN. The full procedure is outlined in Algorithm 2.

Algorithm 2 Parallel SMC optimizer (PSMCO)
1: Sample θ_0^{(i,m)} ∼ π_0 for i = 1, . . . , N and m = 1, . . . , M.
2: for t ≥ 1 do
3:   for m = 1, . . . , M do
4:     Jitter N local particles (step 3 of Algorithm 1).
5:     Update the marginal likelihood estimate Z_{1:t}^{(m),N}.
6:     Compute normalized weights (step 4 of Algorithm 1).
7:     Resample (step 5 of Algorithm 1).
8:   end for
9:   if an estimate of the solution of problem (1.1) is needed at time t then
10:    Choose the best performing worker m_0 and compute θ_t^{⋆,N} as in (3.2).
11:  end if
12: end for
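The selection and estimation step can be sketched as follows: given the particle sets and accumulated log marginal likelihoods returned by the M local samplers (e.g., by the local_sampler sketch above), we pick the best-performing worker and maximize a Gaussian kernel density estimator over its own particles. The bandwidth and the placeholder inputs are assumptions.

```python
import numpy as np

def psmco_estimate(particle_sets, log_Zs, h=0.5):
    """Select the best performing worker (largest marginal-likelihood estimate) and return the
    particle that maximizes a Gaussian KDE built from that worker's particles (O(N^2) operations)."""
    m0 = int(np.argmax(log_Zs))                     # best performing worker m_0
    P = np.asarray(particle_sets[m0])               # its particle set, shape (N, d)
    sq_dists = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    kde_at_particles = np.exp(-0.5 * sq_dists / h ** 2).sum(axis=1)   # unnormalized KDE values
    return P[int(np.argmax(kde_at_particles))]

# Placeholder example with M = 3 workers and arbitrary particle sets / log-evidences.
rng = np.random.default_rng(1)
sets = [rng.normal(loc=m, size=(50, 2)) for m in range(3)]
log_evidence = [0.3, 1.2, -0.7]
print(psmco_estimate(sets, log_evidence))
```

Dropping the Gaussian normalization constant does not affect the argmax, which is why the KDE values above are left unnormalized.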

Analysis
In this section, we provide some basic theoretical guarantees for Algorithm 2. In particular, we prove results regarding a sampler on a single worker m. To ease the notation, we skip the superscript (m) in the rest of this section and simply note that results presented below hold for every m ∈ {1, . . . , M }. All proofs are deferred to the Appendix.
Let us note that if the potentials G_t^{(m)} were genuine likelihoods (rather than general potentials), then Z_{1:t}^{(m)} would be the Bayesian evidence in favour of model m; even so, Z_{1:t}^{(m_1)} > Z_{1:t}^{(m_2)} does not necessarily imply that the estimate of θ⋆ computed from worker m_1 is quantifiably better than the estimate computed from worker m_2.

When constrained to a single worker m, the approximation π_t^N is provably convergent. In particular, we have the following results, which hold for every worker m = 1, . . . , M.
Theorem 1. If the sequence (G_t)_{t≥1} satisfies Assumption 1 and the jittering kernels satisfy Assumption 2, then, for any ϕ ∈ B(Θ), we have

‖(ϕ, π_t) − (ϕ, π_t^N)‖_p ≤ c_{t,p} ‖ϕ‖_∞ / √N

for every t = 1, . . . , T and for any p ≥ 1, where c_{t,p} > 0 is a constant independent of N.
Theorem 1 states that the samplers on local workers converge to their correct probability measures (for each m) with rate O(1/√N), which is standard for Monte Carlo methods. Next we provide an upper bound for the random error |(ϕ, π_t) − (ϕ, π_t^N)|.

Corollary 1. Under the assumptions of Theorem 1, for every ϕ ∈ B(Θ), we have

|(ϕ, π_t) − (ϕ, π_t^N)| ≤ U_{t,δ} / N^{(1/2)−δ} a.s.,

where U_{t,δ} is an a.s. finite random variable and 0 < δ < 1/2 is an arbitrary constant independent of N. In particular,

lim_{N→∞} (ϕ, π_t^N) = (ϕ, π_t) a.s.

for any ϕ ∈ B(Θ).

Proof. See Appendix A.3.
This result ensures that the random error made by the estimators vanishes as N → ∞. Moreover, it provides us with a rate O(1/√N), since the constant δ > 0 can be chosen arbitrarily small.
These results are important because they enable us to analyze the properties of the kernel density estimators constructed using the samples at each worker. In order to be able to do so, we need to impose regularity conditions on the sequence of densities π t (θ) and the kernels we use to approximate them.
Note that for α = (0, . . . , 0) it is not hard to relate Assumption 3 directly to the cost function as we do in the following proposition.
Proposition 2. Assume that we define the incremental cost functions

F_t(θ) = Σ_{s=1}^t Σ_{i∈I_s} f_i(θ), t = 1, . . . , T,

and that there exists some ℓ_t < ∞ such that

|F_t(θ) − F_t(θ′)| ≤ ℓ_t ‖θ − θ′‖ for all θ, θ′ ∈ Θ,

i.e., F_t is Lipschitz. Assume there exists F⋆_t = min_{θ∈Θ} F_t(θ) such that |F⋆_t| < ∞ and recall that π_t(θ) ∝ exp(−F_t(θ)). Then we have the following inequality,

|π_t(θ) − π_t(θ′)| ≤ L_t ‖θ − θ′‖ for all θ, θ′ ∈ Θ,

where L_t < ∞ is a (finite) Lipschitz constant for π_t.

Proof. See Appendix A.4.
Next, we state our assumptions on the kernel k. We first note that, in practice, the kernels are defined with a bandwidth parameter h ∈ R_+. In particular, given a kernel k, we can define scaled kernels k_h as

k_h(θ) = (1/h^d) k(θ/h),

where, we recall, d is the dimension of the parameter vector θ. Hence, given k we define a family of kernels {k_h, h ∈ R_+}.
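As a small illustration, the scaled kernel k_h(θ) = h^{−d} k(θ/h) and the particle kernel density estimator built from it can be coded as follows, assuming a Gaussian base kernel k and placeholder particles and bandwidth.

```python
import numpy as np

def k_gauss(u):
    """Base kernel k: standard Gaussian pdf on R^d."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def k_h(u, h):
    """Scaled kernel k_h(u) = h^{-d} k(u / h)."""
    return k_gauss(u / h) / h ** u.shape[-1]

def kde(theta, particles, h):
    """Particle KDE: p^{h,N}(theta) = (1/N) * sum_i k_h(theta - theta_i)."""
    return float(k_h(theta - particles, h).mean())

# Placeholder example in dimension d = 2, with a bandwidth of order N^{-1/(2(d+1))}.
rng = np.random.default_rng(2)
parts = rng.normal(size=(200, 2))
h = 200.0 ** (-1.0 / (2 * (2 + 1)))
print(kde(np.zeros(2), parts, h))
```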
Remark 6. We note that Assumption 4 implies that ‖k_h‖_∞ = h^{−d} ‖k‖_∞ < ∞ for every h ∈ R_+.

We denote the kernel density estimator (KDE) defined using a scaled kernel k_h and the empirical measure π_t^N as p_t^{h,N}(θ). In particular, given a normalized kernel (a pdf) k : Θ → (0, ∞) satisfying the assumptions in Assumption 4, we can construct the KDE

p_t^{h,N}(θ) = (k_h^θ, π_t^N) = (1/N) Σ_{i=1}^N k_h(θ − θ_t^{(i)}),

where k_h^θ(θ′) = k_h(θ − θ′) (see Remark 5). Now, we are ready to state the main results regarding the kernel density estimators, adapted from Crisan and Míguez (2014).
Theorem 2 (adapted from Crisan and Míguez (2014)). If Assumptions 1, 2, 3 and 4 hold, Θ is compact, and the bandwidth is chosen as h ∝ N^{−(1−ε)/(2(d+1))}, then

sup_{θ∈Θ} |p_t^{h,N}(θ) − π_t(θ)| ≤ V_ε / N^{(1−ε)/(2(d+1))},

where V_ε ≥ 0 is an a.s. finite random variable and 0 < ε < 1 is a constant, both of which are independent of N and θ. In particular,

lim_{N→∞} sup_{θ∈Θ} |p_t^{h,N}(θ) − π_t(θ)| = 0 a.s.

Proof. It follows from the proof of Theorem 4.2 and Corollary 4.1 in Crisan and Míguez (2014). See Appendix A.5 for an outline.
This theorem is a uniform convergence result, i.e., it holds uniformly in a compact parameter space Θ. We note that Theorem 2 specifies the dependence of the bandwidth h on the number of Monte Carlo samples N for convergence to be attained at that rate. Based on this result, we can relate the empirical maxima to the true maxima.
Theorem 3. Let θ_t^{⋆,N} ∈ argmax_{i∈{1,...,N}} p_t^{h,N}(θ_t^{(i)}) be an estimate of a global maximum of π_t and let θ⋆_t ∈ argmax_{θ∈Θ} π_t(θ) be an actual global maximum. If Θ is compact, π_t is continuous at θ⋆_t and Assumptions 1, 2, 3 and 4 hold, then, for N sufficiently large,

|π_t(θ_t^{⋆,N}) − π_t(θ⋆_t)| ≤ W_{t,d,ε} / N^{(1−ε)/(2(d+1))},

where ε ∈ (0, 1) is an arbitrarily small constant and W_{t,d,ε} is an a.s. finite random variable, both independent of N.
Remark 7. By choosing t = T, Theorem 3 provides a convergence rate for the MAP estimator θ_T^{⋆,N}, which is also the approximate solution of problem (1.1).

Corollary 2. Choose any θ_T^{⋆,N} ∈ argmax_{i∈{1,...,N}} p_T^{h,N}(θ_T^{(i)}). Under the same assumptions as in Theorem 3, if ‖f‖_∞ < ∞ then we have

f(θ_T^{⋆,N}) − min_{θ∈Θ} f(θ) ≤ W̃_{T,d,ε} / N^{(1−ε)/(2(d+1))},

where W̃_{T,d,ε} is an a.s. finite random variable.
Finally, we obtain a convergence rate for the expected error.
Corollary 3. Choose any θ_T^{⋆,N} ∈ argmax_{i∈{1,...,N}} p_T^{h,N}(θ_T^{(i)}). Under the same assumptions as in Theorem 3, if ‖f‖_∞ < ∞ then

E[ f(θ_T^{⋆,N}) − min_{θ∈Θ} f(θ) ] ≤ c̃_{T,d,ε} / N^{(1−ε)/(2(d+1))},

where c̃_{T,d,ε} < ∞ is a constant independent of N.

Proof. The proof follows from Corollary 2, since W̃_{T,d,ε} is an a.s. finite random variable.

Discussion
Theorem 3 and Corollaries 2 and 3 go beyond standard results on the convergence of SMC methods. While the latter refer to the approximation of integrals (in the vein of Theorem 1 and Corollary 1), Corollaries 2 and 3 directly address the convergence of the sequence of optimizers θ ⋆,N T and state that the proposed algorithm yields, with probability 1, an asymptotically optimal solution to problem (1.1) even if f (θ) is non-convex and presents multiple local and/or global minima. These results also provide explicit convergence rates that depend on the computational cost (the number of particles N ) and the dimension d of the search space.
Note that the analyses available in the literature for most Monte Carlo optimization algorithms are purely asymptotic (see Appel et al (2004), Miguez (2010), Hu et al (2012) or Zhou and Chen (2013)), i.e., they do not provide explicit convergence rates. Moreover, they often rely on restrictive assumptions. For example, Hu et al (2012), among others, require that the objective function present a unique global minimum. More detailed analyses are carried out by Zhou and Chen (2013) and Míguez et al (2013). However, the former falls short of providing explicit error rates for the sequence of optimizers (bounds are given for the total variation distance between the Boltzmann distributions and their SMC approximations in a SA scheme) and the latter relies on a sequential decomposition of the cost function that is not satisfied by f(θ) in problem (1.1). Moreover, all the analytical results in these papers (Appel et al 2004; Miguez 2010; Hu et al 2012; Zhou and Chen 2013; Míguez et al 2013) are obtained for the case where the objective function can be evaluated exactly, and not for the stochastic setting addressed in this work.

Numerical Results
In this section, we show numerical results for three optimization problems which are hard to solve with conventional methods. In the first example, we focus on minimizing a function with multiple global minima. The aim of this experiment is to show that, when the cost function has several global minima, the PSMCO algorithm can successfully populate with Monte Carlo samples the regions of Θ that contain these minima. In the second example, we tackle the minimization of a challenging cost function, with broad flat regions, for which standard stochastic gradient optimizers struggle. The third example involves a nonconvex, non-smooth cost function and we use it to compare the proposed PSMCO scheme with a similar SMC-based optimization method proposed in Stinis (2012).

Minimization of a function with multiple global minima
In this experiment, we tackle a problem of the form (1.1) in which each component f_i(θ), θ ∈ R², depends on a set of means m_{i,k}, k = 1, . . . , 4, and a covariance matrix R; we set λ = 10 and R = rI_2, with I_2 denoting the 2 × 2 identity matrix and r = 0.2. We choose the means m_{i,k} randomly, namely m_{i,k} ∼ N(m_{i,k}; m_k, σ²), where the four centers m_k are fixed and σ² = 0.5. This selection results in a cost function with four global minima. Such functions arise in many machine learning problems, see, e.g., Mei et al (2018).
In this experiment, we have chosen n = 1,000. Although this is a small number for stochastic optimization problems, we note that each f_i(θ) represents a mini-batch in this scenario, and we set K = 1 in the PSMCO algorithm.
In order to run the algorithm, we choose a uniform prior measure π_0(θ) = U([−a, a] × [−a, a]) with a = 50. It follows from Proposition 1 that the pdf that matches the cost function f(θ) can be written as π_T(θ) ∝ exp(−f(θ)), and it has four global maxima. This pdf is displayed in Fig. 5.1(a). We run M = 100 samplers, with N = 50 particles each, yielding a total number of MN = 5,000 particles. We choose a Gaussian jittering scheme; specifically, the jittering kernel is defined as

κ(dθ|θ′) = (1 − ε_N) δ_{θ′}(dθ) + ε_N N(θ; θ′, σ_j² I_2) dθ, (5.1)

where ε_N = 1/√N and σ_j² = 0.5. Some illustrative results can be seen in Fig. 5.1. To be specific, we have run independent samplers and plotted all samples for this experiment (instead of estimating a minimum with the best performing sampler). From Fig. 5.1(b), it can be seen that the algorithm populates the regions surrounding all maxima with samples. Finally, Fig. 5.1(c) shows the location of the samples relative to the actual cost function f(θ). These plots illustrate how the algorithm "locates" multiple, distinct global maxima with independent samplers. Note that different samplers can converge to different global maxima in practice, which is in agreement with the analysis provided in Section 4.

Minimization of the sigmoid function
In this experiment, we address the nonlinear regression problem in eq. (5.2), whose component functions are defined in terms of the sigmoid function g_i. Cost functions of the form in eq. (5.2) are widely used in nonlinear regression with neural networks in machine learning (Bishop 2006). In this experiment, we have n = 100,000. We choose M = 25 and MN = 1,000, leading to N = 40 particles for every sampler. The mini-batch size is K = 100. The jittering kernel κ is defined in the same way as in (5.1), where the Gaussian pdf has a variance chosen as the ratio of the dataset size n to the mini-batch size K, i.e., σ_j² = n/K, which yields a rather large variance σ_j² = 1,000. (This choice is made for efficient exploration of the global minima, which are hard to find in this example. A large jittering variance may not be adequate in practice when there are multiple minima close to each other, see, e.g., Section 5.1.) To compute the maximum as described in Eq. (3.2), we use a Gaussian kernel density with bandwidth h = ⌊N^{1/6}⌋^{−1}.

In Fig. 5.2 we compare the PSMCO algorithm with a parallel stochastic gradient descent (PSGD) scheme (Zinkevich et al 2010) using M optimizers. We note that, given a particular realization of (x_i)_{i=1}^n (for this experiment, we generate i.i.d. uniform realizations, x_k ∼ U([−2.5, 2.5]) for k = 1, . . . , n), searching for a minimum of f(θ) may be a hard task. Fig. 5.2(a) shows one such case, where the cost function has broad flat regions which make it difficult to find its minima using gradient-based methods unless their initialization is sufficiently good. Accordingly, we have run two instances of PSGD with "bad" and "good" initializations.
The bad initial point for PSGD can be seen in Fig. 5.2(a), at [−190, 0]^⊤ (the blue dot). We initialize M parallel SGD optimizers around [−190, 0]^⊤, each with a small zero-mean Gaussian perturbation with variance 10^{−8}. This is a poor initialization because gradients are nearly zero in this region (the yellow area in Fig. 5.2(a)). We refer to the PSGD algorithm starting from this point as PSGD with B/I, which refers to bad initialization. We also initialize the PSMCO from this region, with Gaussian perturbations around [−190, 0]^⊤ with the same small variance σ²_init = 10^{−8}.
The "good" initialization for the PSGD is selected from a better region, namely around the point [0, −100] ⊤ , where gradient values actually contain useful information about the minimum. We refer to the PSGD algorithm starting from this point as PSGD with G/I. The results can be seen in Fig. 5.2(b). We observe that the PSGD with good initialization (G/I) moves towards a better region, however, it gets stuck because the gradient becomes nearly zero. On the other hand, PSGD with B/I is unable to move at all, since it is initialized in a region where all gradients are negligible (which is true even for the mini-batch observations). The PSMCO method, on the other hand, searches the space effectively to find the global minimum, as depicted in Fig. 5.2(b).

Constrained nonsmooth nonconvex optimization
In this section, we compare the proposed PSMCO scheme to the method of Stinis (2012), labeled here as 'particle filtering for stochastic global optimization' (PFSGO), and to the stochastic evolution strategies (SES) algorithm in Salimans et al (2017), for a high-dimensional, non-smooth and non-convex optimization problem. In particular, we apply these algorithms to numerically solve the problem

min_{θ∈Θ} ‖y − X^⊤θ‖²₂ + ρ Σ_{j=1}^d P_{λ,γ}(θ_j), (5.3)

where y ∈ R^n, X ∈ R^{d×n}, Θ = [−5, 5]^d, the dimension d is set to different values (see below), ρ > 0 is a regularization parameter, and P_{λ,γ} : R → R is a non-smooth, non-convex penalty function with parameters λ > 0 and γ > 2. This problem formulation is useful for variable selection, see, e.g., Fan and Li (2001) or Lan and Yang (2019). It is easy to see that problem (5.3) can be written as an instance of (1.1), with

f_i(θ) = (y_i − x_i^⊤θ)² + (ρ/n) Σ_{j=1}^d P_{λ,γ}(θ_j),

where y_i ∈ R and x_i ∈ R^d denote the i-th entry of y and the i-th column of X, respectively. In this problem, we also test the single-worker version of the proposed optimization scheme. We refer to this algorithm simply as SMCO; it is obtained as the particular case of PSMCO with M = 1. We use the usual jittering kernel of the form (3.1), where τ is a Gaussian kernel with covariance C = σ²I_d, for both SMCO and PSMCO. We also use the same Gaussian transition kernel for the PFSGO. Let us remark, though, that (unlike SMCO and PSMCO) the PFSGO scheme modifies all particles at every iteration, i.e., it uses τ(·|θ′) instead of κ(·|θ′) for sampling. The SES scheme also uses τ in order to estimate the gradients.
We choose σ² = 10^{−2} and N = 100. The mini-batch size is taken as K = 1 and the number of components is n = 1,000. For the PSMCO, we choose M = 5, so it essentially runs 5 samplers with 20 particles each, while the SMCO scheme runs a single sampler with N = 100 particles. For the regularization parameters, we choose ρ = 1, λ = 10^{−3}, and γ = 2.01. For the SES, we choose a small step size of α = 10^{−7}, as larger values cause it to diverge. We simulate the data using a sparse parameter θ⋆ in which only three entries are nonzero. We simulate the entries of the matrix X as i.i.d. variates from N(0, 1) and compute y = X^⊤θ⋆. In order to quantify the error of an iterate θ_k produced by any method, we compute the normalized mean squared error (NMSE), ‖θ_k − θ⋆‖²₂ / ‖θ⋆‖²₂. The results can be seen in Fig. 5.3. We also plot the 0.5σ curves around the error curves, which are averaged over 1,000 Monte Carlo runs. It can be seen that, for this particular example, the SMCO performs best, while the PSMCO still outperforms the PFSGO. The SES is very slow due to the inefficiency of the gradient estimators for this problem.
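The data-generation procedure and the error measure described above can be sketched as follows; the standard NMSE definition and the particular values of the three nonzero entries of θ⋆ are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 10, 1000
theta_star = np.zeros(d)
theta_star[[0, 3, 7]] = [2.0, -1.5, 1.0]     # sparse ground truth with three nonzero entries (values assumed)
X = rng.standard_normal((d, n))              # entries of X drawn i.i.d. from N(0, 1)
y = X.T @ theta_star                         # y = X^T theta_star

def nmse(theta_k):
    """Normalized mean squared error of an iterate theta_k with respect to theta_star."""
    return float(np.sum((theta_k - theta_star) ** 2) / np.sum(theta_star ** 2))

print(nmse(np.zeros(d)))                     # the all-zeros iterate has NMSE equal to 1
```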
To gain further insight, we also compare the PSMCO (with M = 25) and the PFSGO on a problem that is higher-dimensional, namely d = 30, and with more data points, n = 10,000. We set σ² = 10^{−3} and leave the other parameters the same as in the example with d = 10. Figure 5.4 displays the results for this example. It can be seen that, again, the PSMCO algorithm converges to a point with a lower NMSE than the PFSGO. We believe that this is mainly due to the difference in the transition kernels. The PFSGO uses a full transition kernel where every particle is modified, whereas jittering enables us to induce slower and more careful changes, and also gives us a chance to keep a particle unmodified if it is in a good location.

Conclusions
We have proposed a parallel sequential Monte Carlo optimization algorithm which does not require the computation (either exact or approximate) of gradients and, therefore, can be applied to the minimization of challenging cost functions, e.g., with multiple global minima or with broad "flat" regions. The proposed method uses jittering kernels to propagate samples (Crisan and Miguez 2018) and particle kernel density estimators to find the minima (Crisan and Míguez 2014), within a stochastic optimization setup. We have provided a detailed analysis of the proposed scheme. In particular, we have proved that it yields asymptotically optimal solutions to the stochastic optimization problem (1.1) (as the number of samples N is increased) and we have computed explicit convergence rates for the resulting optimizers that depend on N and the dimension of the search space, d. These results are new and improve on classical asymptotic analyses for Monte Carlo optimization methods, which typically lack convergence rates. From a practical perspective, we argue that the parallel setting, where each sampler uses a different configuration of the same dataset, can be useful to improve the practical behaviour of the algorithm. To illustrate this point, we have studied the numerical performance of the PSMCO algorithm in scenarios where gradient-based methods struggle to converge. In this work, we have focused on challenging but relatively low-dimensional cost functions. We leave the potential application of our scheme to high-dimensional optimization problems as future work. Also, the design of an interacting extension of our method, similar to particle islands (Vergé et al 2015), can potentially be useful in more challenging settings.

A.1 Proof of Proposition 1
We prove this result by induction. For t = 1, let

π_1(dθ) = G_1(θ) π_0(dθ) / (G_1, π_0),

where 0 < (G_1, π_0) < ∞ because of Assumption 1. Hence π_1 ≪ π_0 is a proper measure. Assume next, as an induction hypothesis, that π_{T−1} ≪ π_0. Then

π_T(dθ) = G_T(θ) π_{T−1}(dθ) / (G_T, π_{T−1}),

and Assumption 1 implies (again) that 0 < (G_T, π_{T−1}) < ∞, hence π_T is proper and π_T ≪ π_0. Therefore, the Radon–Nikodym derivative of the final measure π_T w.r.t. the prior π_0 is

dπ_T/dπ_0 (θ) ∝ Π_{t=1}^T G_t(θ) = exp( − Σ_{t=1}^T Σ_{i∈I_t} f_i(θ) ) = exp(−f(θ)).

From here, it readily follows that maximizing this Radon–Nikodym derivative is equivalent to solving problem (1.1).

A.2 Proof of Theorem 1
We proceed by an induction argument. At time t = 0, the bound

‖(ϕ, π_0) − (ϕ, π_0^N)‖_p ≤ c_{0,p} ‖ϕ‖_∞ / √N

is a straightforward consequence of the Marcinkiewicz–Zygmund inequality (Shiryaev 1996), because the particles {θ_0^{(i)}}_{i=1}^N are drawn i.i.d. from π_0. Assume now that, after iteration t − 1, we have a particle set {θ_{t−1}^{(i)}}_{i=1}^N such that

‖(ϕ, π_{t−1}) − (ϕ, π_{t−1}^N)‖_p ≤ c_{t−1,p} ‖ϕ‖_∞ / √N. (A.1)

We first analyze the error in the jittering step. To this end, we construct the jittered random measure

π̂_t^N(dθ) = (1/N) Σ_{i=1}^N δ_{θ̂_t^{(i)}}(dθ)

and iterate the triangle inequality to obtain

‖(ϕ, π_{t−1}) − (ϕ, π̂_t^N)‖_p ≤ ‖(ϕ, π_{t−1}) − (ϕ, π_{t−1}^N)‖_p + ‖(ϕ, π_{t−1}^N) − (ϕ, κπ_{t−1}^N)‖_p + ‖(ϕ, κπ_{t−1}^N) − (ϕ, π̂_t^N)‖_p. (A.2)

The first term on the right hand side (rhs) of (A.2) is bounded by the induction hypothesis (A.1). For the second term, we note that

|(ϕ, π_{t−1}^N) − (ϕ, κπ_{t−1}^N)| ≤ (1/N) Σ_{i=1}^N ∫ |ϕ(θ_{t−1}^{(i)}) − ϕ(θ)| κ(dθ|θ_{t−1}^{(i)}) ≤ c_κ ‖ϕ‖_∞ / √N, (A.3)

where the last inequality follows from Assumption 2. The upper bound in (A.3) is deterministic, so the inequality readily implies that

‖(ϕ, π_{t−1}^N) − (ϕ, κπ_{t−1}^N)‖_p ≤ c_κ ‖ϕ‖_∞ / √N. (A.4)

For the last term on the right-hand side of (A.2), we let F_{t−1} denote the σ-algebra generated by the random sequence of particles up to time t − 1. The difference (ϕ, π̂_t^N) − (ϕ, κπ_{t−1}^N) takes the form

(ϕ, π̂_t^N) − (ϕ, κπ_{t−1}^N) = (1/N) Σ_{i=1}^N S^{(i)},

where the summands S^{(i)} = ϕ(θ̂_t^{(i)}) − ∫ ϕ(θ) κ(dθ|θ_{t−1}^{(i)}), i = 1, . . . , N, are zero-mean and conditionally independent random variables (given F_{t−1}), with |S^{(i)}| ≤ 2‖ϕ‖_∞. Then we readily obtain the bound

E[ |(ϕ, π̂_t^N) − (ϕ, κπ_{t−1}^N)|^p | F_{t−1} ] ≤ B_{t,p} ‖ϕ‖_∞^p / N^{p/2}, (A.5)

where the relation (A.5) follows from the Marcinkiewicz–Zygmund inequality (Shiryaev 1996) and B_{t,p} < ∞ is some constant independent of N. Taking unconditional expectations on both sides of (A.5) and then computing the p-th root yields

‖(ϕ, π̂_t^N) − (ϕ, κπ_{t−1}^N)‖_p ≤ ĉ_{t,p} ‖ϕ‖_∞ / √N, (A.6)

where ĉ_{t,p} = B_{t,p}^{1/p} is a finite constant independent of N. Therefore, taking together (A.1), (A.4) and (A.6), we have established that

‖(ϕ, π_{t−1}) − (ϕ, π̂_t^N)‖_p ≤ c_{1,t,p} ‖ϕ‖_∞ / √N, (A.7)

where c_{1,t,p} = c_{t−1,p} + c_κ + ĉ_{t,p} < ∞ is a finite constant independent of N.

Next, we have to bound the error after the weighting step. We recall that π_t(dθ) = π_{t−1}(dθ) G_t(θ) / (G_t, π_{t−1}) and define the weighted random measure

π̃_t^N(dθ) = Σ_{i=1}^N w_t^{(i)} δ_{θ̂_t^{(i)}}(dθ), with w_t^{(i)} = G_t(θ̂_t^{(i)}) / Σ_{j=1}^N G_t(θ̂_t^{(j)}),

where π̃_t^N denotes the weighted measure. We first note that

(ϕ, π̃_t^N) = (ϕ G_t, π̂_t^N) / (G_t, π̂_t^N) and (ϕ, π_t) = (ϕ G_t, π_{t−1}) / (G_t, π_{t−1}). (A.8)

Using Minkowski's inequality together with (A.7) and (A.8) yields

‖(ϕ, π_t) − (ϕ, π̃_t^N)‖_p ≤ ( c_{1,t,p} ‖ϕ G_t‖_∞ + c_{1,t,p} ‖ϕ‖_∞ ‖G_t‖_∞ ) / ( (G_t, π_{t−1}) √N ) ≤ 2 c_{1,t,p} ‖G_t‖_∞ ‖ϕ‖_∞ / ( (G_t, π_{t−1}) √N ),

where the second inequality follows from ‖ϕ G_t‖_∞ ≤ ‖ϕ‖_∞ ‖G_t‖_∞. More concisely, we have

‖(ϕ, π_t) − (ϕ, π̃_t^N)‖_p ≤ c_{2,t,p} ‖ϕ‖_∞ / √N, (A.9)

where the constant c_{2,t,p} = 2 c_{1,t,p} ‖G_t‖_∞ / (G_t, π_{t−1}) < ∞ is independent of N. Note that the assumptions on (G_t)_{t≥1} imply that (G_t, π_{t−1}) > 0.

Finally, we bound the resampling step. Note that the resampling step consists of drawing N i.i.d. samples from π̃_t^N, i.e., θ_t^{(i)} ∼ π̃_t^N i.i.d. for i = 1, . . . , N, and then constructing

π_t^N(dθ) = (1/N) Σ_{i=1}^N δ_{θ_t^{(i)}}(dθ).

Since the samples are i.i.d., as in the base case we have

‖(ϕ, π̃_t^N) − (ϕ, π_t^N)‖_p ≤ c̃_p ‖ϕ‖_∞ / √N, (A.10)

for some constant c̃_p < ∞ independent of N. Now, combining (A.9) and (A.10), we have the desired result,

‖(ϕ, π_t) − (ϕ, π_t^N)‖_p ≤ c_{t,p} ‖ϕ‖_∞ / √N,

where c_{t,p} = c_{2,t,p} + c̃_p is a finite constant independent of N.

A.3 Proof of Corollary 1
From Theorem 1, we obtain

E[ |(ϕ, π_t) − (ϕ, π_t^N)|^p ] ≤ (c_t ‖ϕ‖_∞)^p / N^{p/2},

where c_t < ∞ is a constant independent of N. Let us choose p ≥ 4 and 0 < ε < 1. We construct the non-negative random variable

U_{t,ε}^p = Σ_{N=1}^∞ N^{(p/2) − 1 − ε} |(ϕ, π_t) − (ϕ, π_t^N)|^p

and use Fatou's lemma to obtain

E[ U_{t,ε}^p ] ≤ Σ_{N=1}^∞ N^{(p/2) − 1 − ε} E[ |(ϕ, π_t) − (ϕ, π_t^N)|^p ] ≤ (c_t ‖ϕ‖_∞)^p Σ_{N=1}^∞ N^{−1−ε} < ∞, (A.11)

where the second inequality follows from Theorem 1. The relationship (A.11) implies that the r.v. U_{t,ε}^p is a.s. finite. Since every term of the series defining U_{t,ε}^p is bounded by U_{t,ε}^p itself, we obtain, for every N,

|(ϕ, π_t) − (ϕ, π_t^N)| ≤ U_{t,ε} / N^{(1/2) − (1+ε)/p} a.s.,

which yields the statement of Corollary 1 with δ = (1+ε)/p ∈ (0, 1/2).

A.4 Proof of Proposition 2
Recall the assumption that F_t is Lipschitz, i.e., |F_t(θ) − F_t(θ′)| ≤ ℓ_t ‖θ − θ′‖ for all θ, θ′ ∈ Θ. We write F⋆_t = min_{θ∈Θ} F_t(θ), which is assumed to be finite, but not necessarily nonnegative. We first prove that exp(−F_t(θ)) is also Lipschitz continuous. Note that we trivially have exp(−F_t(θ)) ≤ exp(−F⋆_t) for all θ, since F_t(θ) ≥ F⋆_t. Hence, for any θ, θ′ ∈ Θ,

|exp(−F_t(θ)) − exp(−F_t(θ′))| ≤ exp(−F⋆_t) |F_t(θ) − F_t(θ′)| ≤ ℓ_t exp(−F⋆_t) ‖θ − θ′‖,

and the Lipschitz continuity of the normalized pdf π_t(θ) ∝ exp(−F_t(θ)) follows.

A.6 Proof of Theorem 3

The proof proceeds in two steps.
1. We first show that, for N sufficiently large, there is a.s. at least one particle inside the ball B⋆_{t,n} of radius n^{−1} centered at a global maximizer θ⋆_t of π_t. Moreover, we deduce an inequality that relates the radius n^{−1} of the ball B⋆_{t,n} with the number of necessary particles N.
2. From the existence of at least one particle θ_t^{(i)} inside B⋆_{t,n} and the assumption that π_t(θ) is Lipschitz, we deduce bounds for the difference |π_t(θ⋆_t) − π_t(θ_t^{(i)})| and for the corresponding differences involving the kernel density estimator p_t^{h,N}, and, as a consequence, for the approximation error |π_t(θ_t^{⋆,N}) − π_t(θ⋆_t)|.
A.6.1 The ball B⋆_{t,n} is a.s. non-empty

Since π_t(θ) is assumed continuous at every θ⋆_t ∈ argmax_{θ∈Θ} π_t(θ), we have π_t(B⋆_{t,n}) > 0. Therefore, for every n < ∞, Theorem 2 ensures that there exists N_n (a.s. finite) such that, for all N ≥ N_n,

π_t^N(B⋆_{t,n}) ≥ π_t(B⋆_{t,n}) − U_{t,δ} / N^{(1/2)−δ} > 0, (A.19)

where U_{t,δ} is an a.s. finite random variable and δ ∈ (0, 1/2) is an arbitrarily small constant (both independent of N). Moreover, the second inequality in (A.19) implies that π_t^N(B⋆_{t,n}) > 0. Therefore, for all N > N_n there exists at least one integer i_b ∈ {1, . . . , N} such that θ_t^{(i_b)} ∈ B⋆_{t,n}. To be specific, since π_t(θ) is a density w.r.t. the Lebesgue measure, we can readily find a lower bound for the integral π_t(B⋆_{t,n}) = ∫_{B⋆_{t,n}} π_t(θ) dθ. Let L_t < ∞ be the Lipschitz constant of the pdf π_t(θ). Since θ_t^{(i_b)} ∈ B⋆_{t,n}, we readily obtain the upper bound |π_t(θ⋆_t) − π_t(θ_t^{(i_b)})| ≤ L_t / n,

A.6.2 Error bounds
and, therefore,

π_t(θ_t^{(i_b)}) ≥ π_t(θ⋆_t) − L_t / n. (A.22)

However, using Theorem 2 we obtain

p_t^{h,N}(θ_t^{(i_b)}) ≥ π_t(θ_t^{(i_b)}) − V_{t,ε} / N^{(1−ε)/(2(d+1))}, (A.23)

where ε ∈ (0, 1) is an arbitrarily small constant and V_{t,ε} is an a.s. finite random variable, both independent of N. Combining (A.22) and (A.23) yields

p_t^{h,N}(θ_t^{(i_b)}) ≥ π_t(θ⋆_t) − L_t / n − V_{t,ε} / N^{(1−ε)/(2(d+1))}

and, as a consequence, since θ_t^{⋆,N} maximizes p_t^{h,N} over the particle set,

p_t^{h,N}(θ_t^{⋆,N}) ≥ π_t(θ⋆_t) − L_t / n − V_{t,ε} / N^{(1−ε)/(2(d+1))}. (A.24)

Moreover, using Theorem 2 again, we find that

π_t(θ_t^{⋆,N}) ≥ p_t^{h,N}(θ_t^{⋆,N}) − V_{t,ε} / N^{(1−ε)/(2(d+1))}, (A.25)

with the same constant ε ∈ (0, 1) and a.s. finite random variable V_{t,ε} as in (A.24). Since π_t(θ_t^{⋆,N}) ≤ π_t(θ⋆_t), the inequality (A.25) implies that

π_t(θ⋆_t) ≥ π_t(θ_t^{⋆,N}) ≥ π_t(θ⋆_t) − L_t / n − 2 V_{t,ε} / N^{(1−ε)/(2(d+1))}, (A.27)

and a simple triangle inequality then yields

|π_t(θ_t^{⋆,N}) − π_t(θ⋆_t)| ≤ |π_t(θ_t^{⋆,N}) − p_t^{h,N}(θ_t^{⋆,N})| + |p_t^{h,N}(θ_t^{⋆,N}) − π_t(θ⋆_t)| ≤ L_t / n + 2 V_{t,ε} / N^{(1−ε)/(2(d+1))}, (A.28)

where the second inequality follows from (A.27) and yet another application of Theorem 2. The inequality (A.28) holds for any pair of integers (N, n) that satisfies the relationship (A.21). For any given N, sufficiently large for n_N to be well defined, the pair consisting of N and n = n_N satisfies (A.21), while

L_t / n_N ≤ c_{t,d} U_{t,δ} / N^{(1−2δ)/(2d)}. (A.29)

Hence, if we substitute n = n_N in the inequality (A.28) and then apply the inequality (A.29), we arrive at

|π_t(θ_t^{⋆,N}) − π_t(θ⋆_t)| ≤ 2 V_{t,ε} / N^{(1−ε)/(2(d+1))} + c_{t,d} U_{t,δ} / N^{(1−2δ)/(2d)}, (A.30)

where V_{t,ε} and U_{t,δ} are a.s. finite, and L_t and c_{t,d} are finite. The constants ε ∈ (0, 1) and δ ∈ (0, 1/2) can be chosen arbitrarily small. Hence, if we let 0 < δ = ε/2 < 1/2, the r.h.s. of (A.30) can be upper bounded, which results in the bound

|π_t(θ_t^{⋆,N}) − π_t(θ⋆_t)| ≤ W_{t,d,ε} / N^{(1−ε)/(2(d+1))},

where W_{t,d,ε} = 2 V_{t,ε} + U_{t,δ(ε)} c_{t,d}. This completes the proof of Theorem 3. To obtain Corollary 2, we take t = T and recall that π_T(θ) = exp(−f(θ))/Z_{π_T}, so that the bound above reads

( exp(−f(θ⋆_T)) − exp(−f(θ_T^{⋆,N})) ) / Z_{π_T} ≤ W_{T,d,ε} / N^{(1−ε)/(2(d+1))}, (A.31)

where Z_{π_T} is the normalizing constant of π_T. Next, we lower bound the left-hand side of (A.31) as

( exp(−f(θ⋆_T)) − exp(−f(θ_T^{⋆,N})) ) / Z_{π_T} = exp(−f(θ_T^{⋆,N})) ( exp( f(θ_T^{⋆,N}) − f(θ⋆_T) ) − 1 ) / Z_{π_T} ≥ exp(−‖f‖_∞) ( f(θ_T^{⋆,N}) − f(θ⋆_T) ) / Z_{π_T}, (A.32)

where the last inequality follows from the relationships f(θ_T^{⋆,N}) ≥ f(θ⋆_T), exp(−f(θ_T^{⋆,N})) ≥ exp(−‖f‖_∞), and e^a ≥ a + 1 for a ∈ R. Combining (A.31) and (A.32), we obtain

f(θ_T^{⋆,N}) − f(θ⋆_T) ≤ W̃_{T,d,ε} / N^{(1−ε)/(2(d+1))},

where W̃_{T,d,ε} = Z_{π_T} W_{T,d,ε} e^{‖f‖_∞} is a.s. finite.