Cumulative weighting optimization

Global optimization problems with limited structure (e.g., convexity or differentiability of the objective function) can arise in many fields. One approach to solving these problems is by modeling the evolution of a probability density function over the solution space, similar to the Fokker–Planck equation for diffusions, such that at each time instant, additional weight is given to better solutions. We propose an addition to the class of model-based methods, cumulative weighting optimization (CWO), whose general version can be proven convergent to an optimal solution and stable under disturbances (e.g., floating point inaccuracy). These properties encourage us to design a class of CWO algorithms for solving global optimization problems. Beyond the general convergence and stability analysis, we prove that with some additional assumptions the Monte Carlo version of the CWO algorithm is also convergent and stable. Interestingly, the well known cross-entropy method is a CWO algorithm.


Introduction
Many questions in engineering and science can be formulated as optimization over an objective function. When the objective function is differentiable, its derivative has an explicit form, and it has finitely many local extrema, the problem is highly tractable: the first-order necessary condition generates a set of candidate solutions, the greatest of which is an optimal solution.
On the other hand, optimization problems whose objective functions lack any structural information can be challenging to solve analytically. Approaches developed to solve these problems numerically can be divided into two categories: deterministic and random search. Random search is further divided into instance-based algorithms (e.g., simulated annealing, genetic algorithms, tabu search, nested partitions, generalized hill climbing, and evolutionary programming) and model-based algorithms [e.g., annealing adaptive search (AAS), cross-entropy (CE), model reference adaptive search (MRAS), and estimation of distribution algorithms (EDAs)]. For the interested reader, Hu et al. [1] provide a recent survey of model-based methods, which also contains references to the instance-based methods mentioned in this paragraph.
The cumulative weighting optimization (CWO) method extends the class of model-based methods by introducing an alternative weight-update equation. The weight-update equation is important for model-based methods, as it decides the search direction for the next time step. Our equation is inspired by cumulative prospect theory (CPT) and has an intuitive connection with the risk-sensitive nature of human decision making. The new update can be proven to converge to solutions of optimization problems when it can be solved analytically. Interestingly, the well known cross-entropy method is a special case of the CWO method. We also provide a convergence result for the case when an analytical solution cannot be obtained and the problem requires an approximate solution. The approximate version, later referred to as the Monte Carlo version, first projects the underlying distribution onto a family of easy-to-sample-from distributions and then samples from the projected distribution. The techniques used in the convergence analysis of the Monte Carlo version follow the work of Hu et al. [2], with two major differences: the class of functions considered and the mean vector equation.
Moving from theory to implementation, this paper proceeds as follows: Section 2 presents the problem statement. Section 3 introduces the concept of probability weighting functions. In Sect. 4, rigorous convergence and stability analyses of both the general and Monte Carlo versions of the CWO method are presented. In that section, we provide additional analysis for the case when the optimizers are isolated and the family of distributions used has continuous densities, owing to the additional assumptions required. Finally, Sect. 5 describes a few CWO numerical algorithms and tabulates their simulation results.

Problem
In many engineering applications, we seek a best solution based on an objective function. For example, in the well known traveling salesman problem (TSP), we are looking for the cheapest route that visits all cities and terminates at the starting point. Problems of this nature can be formulated as the following optimization problem:

x* ∈ arg max_{x∈X} H(x),   (1)

where x* is an optimal solution to the problem and X is a compact subset of R^n (the n-dimensional real space). H : X → R, the objective function, is a bounded deterministic measurable function, possibly with multiple local extrema. The set of optimizers for Eq. (1) is denoted by X* := {x* ∈ X | H(x) ≤ H(x*), ∀x ∈ X}. The following assumption holds throughout this paper.

Assumption 1 The set of optimizers X* in Eq. (1) is non-empty.
In practice, this assumption is true for many optimization problems. For example, it holds trivially when H is continuous, since a continuous function on a compact set attains its maximum. In general, the objective function lacks properties such as convexity and differentiability. Let the set of non-negative reals be denoted by R+. As is common in many situations, a measurable, strictly increasing fitness function φ : R → R+ is introduced to reformulate Eq. (1) as:

x* ∈ arg max_{x∈X} φ(H(x)).

A similar fitness-function-modified problem statement can be found in Hu et al. [1].

Remark 1
Since the reformulated problem guarantees that the range of the new fitness-objective function [i.e., φ(H(·))] is non-negative, and it has the same optimizers as the original problem, we only need to consider the case where H is non-negative in Eq. (1), i.e., H : X → R+.
In its historical application, an optimal-seeking weighting function places more weight on highly rewarding outcomes. In particular, it is used to overweight the probabilities of unlikely events and underweight the probabilities of highly likely events. In our context, the optimal-seeking property of the weighting function is used to place more weight on higher ranked or more desirable outcomes. In the example below, we apply an optimal-seeking weighting function to a complementary CDF.
Example 1 A die is rolled and the player receives a payoff that is equivalent to the outcome of the roll. For example, if the player rolled a 1, then he/she is given a $1 reward. The expected payoffs for both the risk-neutral and optimal-seeking cases are calculated below assuming the die is fair. The outcome of the roll is a random variable denoted by R.
The risk-neutral expected payoff, written using the complementary CDF, is:

E[R] = Σ_{n=1}^{6} P(R ≥ n) = Σ_{n=1}^{6} (7 − n)/6 = 3.5,

where P(R ≥ n) is the complementary CDF evaluated at outcome n. Using the weighting w(p) = p^{1/2}, the corresponding optimal-seeking re-weighted expected payoff is:

Σ_{n=1}^{6} w(P(R ≥ n)) = Σ_{n=1}^{6} ((7 − n)/6)^{1/2} ≈ 4.42.
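As a concrete check of Example 1, the sketch below computes both payoffs, assuming the optimal-seeking weighting w(p) = √p; any weighting with w(p) > p on (0, 1), w(0) = 0, and w(1) = 1 would exhibit the same inequality.

```python
# Risk-neutral vs. optimal-seeking re-weighted expected payoff for a fair die.
# The weighting w(p) = sqrt(p) is an assumed optimal-seeking choice.
import math

def tail_prob(n, sides=6):
    """Complementary CDF P(R >= n) for a fair die."""
    return (sides - n + 1) / sides

def w(p):
    return math.sqrt(p)  # assumed optimal-seeking weighting

# E[R] via the complementary CDF: E[R] = sum_{n=1}^{6} P(R >= n)
risk_neutral = sum(tail_prob(n) for n in range(1, 7))
reweighted = sum(w(tail_prob(n)) for n in range(1, 7))

print(f"risk-neutral E[R]  = {risk_neutral:.4f}")   # 3.5000
print(f"re-weighted payoff = {reweighted:.4f}")     # 4.4221, larger than 3.5
assert reweighted > risk_neutral
```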

Remark 2
The reader should observe that the optimal-seeking re-weighted expected payoff is greater than the risk-neutral payoff; this fact will be key in proving the convergence of the CWO method.

Convergence and stability analysis
Convergence and stability are two desirable properties for any global optimization method. In particular, we would like to provide a theoretical guarantee that the CWO method will converge to an optimal solution and remain there under "reasonable" disturbances. These properties will be proven true for both the general and Monte Carlo versions of the CWO method in this section, with each version explained in detail in a subsection. We highlight the case of isolated optimizers in this section because it requires additional assumptions and a slight modification of the standard approach.

General theory
The best way to gain some intuition for the CWO method is to understand the finite solution space case. To make the idea concrete, let X in Eq. (1) be the set {1, . . . , N}. In this case, Assumption 1 is trivially satisfied and Assumption 2 is always true in our analysis. Let P_X denote the set of probability mass functions (PMFs) over X, and let the set of PMFs exclusively supported on optimal solutions be denoted by P_{X*} := {P ∈ P_X | Σ_{x∈X*} P(x) = 1}. Since any element of P_{X*} assigns positive weight only to optimal solutions and zero weight to non-optimal solutions, finding an element of P_{X*} solves Eq. (1).
As alluded to earlier, finite solution space model-based methods iterate on a PMF so that it will eventually concentrate its mass on optimal solutions. Let P_t denote a PMF over X at time t, i.e., P_t ∈ P_X; then solving Eq. (1) is equivalent to finding an algorithm that updates P_t iteratively such that P_t ∈ P_{X*}, ∀t > τ, for some τ ∈ (0, ∞). If P_0 has positive mass on at least one of the optimal solutions, i.e., P_0(X*) > 0, then one way of ensuring that P_t eventually reaches P_{X*} is by using an optimal-seeking weighting function. The idea can be better demonstrated by introducing a step size variable Δ. Let the set-valued map M : X → 2^X return all elements in the solution space with the same outcome, i.e., M(x) := {ξ ∈ X | H(ξ) = H(x)}; then P_t can be updated according to the following equation:

P_{t+Δ}(x) = P_t(x) + Δ [ β_t(x) ( w(P_t({ξ ∈ X | H(ξ) ≥ H(x)})) − w(P_t({ξ ∈ X | H(ξ) > H(x)})) ) − P_t(x) ],   (2)

where Σ_{ξ∈M(x)} β_t(ξ) = 1, ∀x ∈ X, and w is an optimal-seeking weighting function. In Eq. (2), the difference between the set distorted by the first w term and the set distorted by the second w term is {ξ ∈ X | H(ξ) = H(x)} = M(x). In the finite solution space case, we can verify that P_{t+Δ} is indeed a probability measure by summing over X, i.e., Σ_{x∈X} P_{t+Δ}(x), and checking that the sum is 1. Since for each x ∈ X, the negative w-distorted term in Eq. (2) cancels with a positive w-distorted term in the summation, except for w(1) = 1 and w(0) = 0 at the ends of the telescoping sum, it is verified that Σ_{x∈X} P_{t+Δ}(x) = 1 and P_{t+Δ} is a probability measure. When Δ = 1, we obtain a simpler equation evolving on t ∈ {0, 1, 2, . . .}:

P_{t+1}(x) = β_t(x) [ w(P_t({ξ ∈ X | H(ξ) ≥ H(x)})) − w(P_t({ξ ∈ X | H(ξ) > H(x)})) ],   (3)

where time t can be treated as the iteration count. Similar to the observation in Remark 2, it will be shown later in the paper that

E_{t+1}[H] ≥ E_t[H],   (4)

where H is treated as a random variable and the equality is achieved only when P_t ∈ P_{X*}. The following example illustrates in continuous time the point that E_t[H] is strictly increasing in t unless P_t reaches P_{X*}.
Example 2 Let X be the finite solution space {1, 2, 3, 4} and pick any optimal-seeking probability weighting function w. We assume that H(4) = H(3) > H(2) > H(1) ≥ 0, so that X* = {3, 4}. The continuous-time analogue of Eq. (2) for this example, applied to the optimal solutions, is:

d/dt [P_t(3) + P_t(4)] = w(P_t(3) + P_t(4)) − [P_t(3) + P_t(4)].

From Proposition 1, we know that w(p) > p, ∀p ∈ (0, 1), which implies dP_t(3)/dt + dP_t(4)/dt > 0 whenever P_t(3) + P_t(4) ∈ (0, 1), and dP_t(3)/dt + dP_t(4)/dt = 0 when P_t(3) + P_t(4) ∈ {0, 1}. The weight on the optimal solutions for this example (i.e., the mass on X* = {3, 4}) monotonically increases, and P_t asymptotically approaches the set P_{X*}. Since the weight on the optimal solutions is increasing and the total weight has to sum to one, the weight on the non-optimal solutions approaches zero, i.e., lim_{t→∞} P_t(1) = lim_{t→∞} P_t(2) = 0.

Remark 3 Since the weight on optimal solutions monotonically increases and approaches 1, it follows that the corresponding re-weighted expected payoff is monotonically increasing in t until optimality.
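The dynamics of Example 2 can be simulated with the discrete-time update (the Δ = 1 case of Eq. (2)). The sketch below assumes the objective values H = (1, 2, 3, 3), the weighting w(p) = √p, and a β_t that splits each level set's new mass proportionally to the current PMF; these are illustrative choices, not the paper's prescriptions.

```python
# Discrete-time CWO update on the finite space X = {1, 2, 3, 4} of Example 2,
# with assumed values H = [1, 2, 3, 3] (so X* = {3, 4}) and the assumed
# optimal-seeking weighting w(p) = sqrt(p).
import math

H = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.0}  # assumed objective values
X = list(H)

def w(p):
    return math.sqrt(p)  # assumed optimal-seeking weighting

def step(P):
    """One update P_{t+1}(x) = beta_t(x) [w(P_t(H >= H(x))) - w(P_t(H > H(x)))]."""
    P_next = {}
    for x in X:
        ge = sum(P[xi] for xi in X if H[xi] >= H[x])   # P_t(H >= H(x))
        gt = sum(P[xi] for xi in X if H[xi] > H[x])    # P_t(H >  H(x))
        level_mass = ge - gt                           # mass of the level set M(x)
        beta = P[x] / level_mass if level_mass > 0 else 0.0
        P_next[x] = beta * (w(ge) - w(gt))
    return P_next

P = {x: 0.25 for x in X}   # uniform start, so P_0(X*) > 0
for t in range(50):
    P = step(P)

mass_on_optima = P[3] + P[4]
print(f"mass on X* after 50 steps: {mass_on_optima:.6f}")  # approaches 1
assert abs(sum(P.values()) - 1.0) < 1e-9   # P_t stays a probability measure
```

Here the mass on X* evolves as s_{t+1} = √(s_t), which increases monotonically toward 1, matching the continuous-time argument in the example.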
Thus, the CWO method updates P_t iteratively so that P_t approaches P_{X*} asymptotically. The limit of P_t as t → ∞ only has weight on optimal solutions; hence, optimal solutions can be inferred from it. The finite solution space case offers the most intuition; however, to apply the CWO method to a wide variety of problems, we will need to work with more general solution spaces. In the rest of this section, Eq. (1) is solved given that X is a compact subset of a finite-dimensional vector space, i.e., R^n. Similar to the finite solution space case, the set of probability measures defined on the Borel measurable space (X, B(X)) is denoted by P_X, which is equipped with the Prohorov topology. We refer to [6] and [7] for technical details on the Prohorov topology. While the following assumption is not strictly required for the CWO method, it is used in the analysis for ease of notation.

Assumption 3 w is differentiable and has a bounded first derivative, which is denoted by w'.

In general, boundedness of the subgradient should be sufficient for the application of the CWO method.
At each time t, the push-forward measure of P_t through H in Eq. (1) is denoted by

P^H_t(B) := P_t(H^{-1}(B)), ∀B ∈ B(R+),   (5)

where H^{-1}(B) is the preimage of B under H, and B(R+) denotes the Borel σ-algebra for R+. For the justification of using R+ in Eq. (5), see Remark 1. Furthermore, H can be treated as a random variable from (X, B(X)) to (R+, B(R+)). For any y ∈ R+ and A ∈ B(X), the generalization of Eq. (2) to continuous time is:

Ṗ^H_t([y, ∞)) = w(P^H_t([y, ∞))) − P^H_t([y, ∞)),   (6)

P_t(A) = ∫_{R+} β_t(y)(A) P^H_t(dy),   (7)

where β_t(y)(·) is a probability measure supported on all ξ ∈ X with H(ξ) = y, i.e., β_t can be viewed as a probability measure supported on all ξ ∈ X with the same H(x) value. Here, Ṗ^H_t(B) is interpreted as the time derivative of P^H_t(B). Note that Eq. (7) says that, given β_t, P_t can be determined from P^H_t, which is a solution of Eq. (6). Special attention is paid to the equation governing the optimal solutions:

Ṗ^H_t([y*, ∞)) = w(P^H_t([y*, ∞))) − P^H_t([y*, ∞)),

where y* = max_{x∈X} H(x).
Wang [8] proposes an alternative set of evolution equations, which are also nonlinear Fokker-Planck equations [9,10], motivated by evolutionary game theory. As the reader will see later, we reach the same convergence results as Wang et al. [11] with a modified approach.
Similar to the finite solution space case, the set of probability measures exclusively supported on optimal solutions is denoted by P_{X*} := {P ∈ P_X | P(X*) = 1}, where X* denotes the set of optimal solutions, i.e., X* := {x* ∈ X | H(x) ≤ H(x*), ∀x ∈ X}. The reader is reminded that obtaining an element of P_{X*} is equivalent to solving the optimization problem stated in Eq. (1). The goal is to prove that Eqs. (6-7) update P_t such that P_t approaches P_{X*} asymptotically. The first step is to prove the existence and uniqueness of a solution for Eq. (6).
Theorem 1 For each P 0 ∈ P x and its corresponding push-forward measure P H 0 , the ordinary differential equation (6) has a unique solution for t ∈ R + .
Proof The outline of our proof follows [12] and [13]. The total variation norm of a σ-finite signed measure P over (R+, B(R+)) is denoted by:

||P|| := sup_g ∫ g dP,

where the sup is taken over all measurable functions g : R+ → R with |g| ≤ 1. We simplify our notation by introducing the shorthand

C_{P^H_t}([y, ∞)) := w(P^H_t([y, ∞))), y ∈ R+.

Since P^H_t is a probability measure on (R+, B(R+)), we have

||C_{P^H_t}|| ≤ K,

where K is the Lipschitz constant for w. The inequality above proves the boundedness of C_{P^H_t}. Next, we need to prove that the right-hand side of Eq. (6) is Lipschitz continuous. Letting P^H_t and Q^H_t denote two probability measures defined on (R+, B(R+)), we know from the definition of the norm and the Lipschitz continuity of w that

||C_{P^H_t} − C_{Q^H_t}|| ≤ K ||P^H_t − Q^H_t||.

Combining this bound with the linear term of Eq. (6), the right-hand side of Eq. (6) is Lipschitz continuous in P^H_t with the constant K + 1. Using [14, Corollary 3.9], we conclude that Eq. (6) with an initial measure P^H_0 has a unique solution.
Next, P^H_t is proved to be a probability measure over (R+, B(R+)) for any t.

Lemma 1 Given that P^H_0 is a probability measure, a solution P^H_t of Eq. (6) at each time t > 0 is a probability measure, i.e., P^H_t(R+) = 1.

Proof If we can show that (d/dt) P^H_t(R+) = 0, then we have obtained our desired result. Using Eq. (6), the fact that w' is bounded, and the dominated convergence theorem, it is straightforward to prove this assertion.
The next lemma is needed in Theorem 5, which shows that E_t[H] is monotonically increasing in t [cf. Remark 2 and Eq. (4)].

Lemma 2
Given an optimal-seeking weighting function, w, there exists a ỹ ∈ R+ such that the integrand y[w'(P^H_t([y, ∞))) − 1] appearing in the derivative of the expected outcome can be decomposed into the sum of its non-negative part (on [ỹ, ∞)) and its negative part (on [0, ỹ)).

Proof We constructively find ỹ. Since w is a monotonically non-decreasing function, it satisfies w'(p) ≥ 0, ∀p ∈ [0, 1]. Furthermore, since w is also optimal-seeking, it overweights small probabilities. Because w(0) = 0 and w(1) = 1 by definition, we have w'(P^H_t([y, ∞))) > 1 for some y ∈ R+. From Eq. (12), we know that if y_2 satisfies the above inequality, then so does any y_1 ≥ y_2 ∈ R+, since P^H_t([y, ∞)) is non-increasing in y. Hence, we can take ỹ to be the smallest such y.
The theorems below present a blueprint for obtaining an element of P_{X*} using the solution P_t of Eqs. (6-7). To accomplish this goal, the set of initial points in Theorem 1 is restricted to measures P_0 that allow P_t to approach P_{X*}, i.e., lim_{t→∞} P_t ∈ P_{X*}. The following definition helps us to present this idea succinctly.

Definition 3
The set of all optimal initial solution probability measures is denoted by:

I_{X*} := {P ∈ P_X | P(X*) > 0}.

The corresponding set of H push-forward optimal probability measures over (R+, B(R+)) is denoted by I^H_* := {P ∘ H^{-1} | P ∈ I_{X*}}. Definition 3 is essential, as a requirement on the initial condition of the system (i.e., P_0 ∈ I_{X*}), for the convergence and stability of the system. This condition can be too stringent when the optimizers are isolated and I_{X*} is restricted to measures with continuous densities, since no positive measure can be placed on isolated points. In the next section, we will address this issue specifically by modifying the definition of I_{X*} and adding an extra assumption on the objective function H. The next theorem proves that P^H_t({y*}), the probability mass on optimal solutions, converges to 1 as t → ∞. Correspondingly, the probability mass on non-optimal solutions approaches zero as t → ∞.

Theorem 2 If P^H_t is a solution of Eq. (6) with initial point in I^H_* (i.e., P^H_0 ∈ I^H_*) and y* = max_{x∈X} H(x), then the following claims hold:

1. P^H_t([y*, ∞)) is a monotonically non-decreasing function of t that converges to 1 as t → ∞;
2. P^H_t(R+ \ {y*}) → 0 as t → ∞.

Proof We recall that the following equation is true:

Ṗ^H_t([y*, ∞)) = w(P^H_t([y*, ∞))) − P^H_t([y*, ∞)).

From Proposition 1, we obtain the fact that w(P^H_t([y*, ∞))) > P^H_t([y*, ∞)) whenever P^H_t([y*, ∞)) ∈ (0, 1). From the two relations above, we conclude that Ṗ^H_t([y*, ∞)) > 0 whenever P^H_t([y*, ∞)) ∈ (0, 1). Since P^H_0 ∈ I^H_*, we have P^H_0([y*, ∞)) > 0, and the first claim follows. The second claim follows from the first claim.
The next theorem connects the properties of P H t with those of P t as t → ∞. This is an important step for understanding the evolution of Eqs. (6-7).
Theorem 3 If P_0 ∈ I_{X*} and P_t is a solution of Eqs. (6-7), then the following claims hold:

1. lim_{t→∞} P_t(X*) = 1;
2. lim_{t→∞} P_t(X \ X*) = 0.

Proof The first claim is proved by writing down the following equation and using Theorem 2:

P_t(X*) = P^H_t({y*}).

The second claim follows from the first claim.
We are interested in finding the limit points of Eqs. (6-7). Ideally, these limit points should be elements in P X * . In order to guarantee this, we restrict the initial points to be elements of I X * . To facilitate our discussion, we introduce the following definition.

Definition 4
A limit set of Eqs. (6-7) from an initial set I is denoted by

L[I] := {P ∈ P_X | ∃ P_0 ∈ I and a sequence {t_k} with lim_{k→∞} t_k = +∞ such that P_{t_k} → P},

where P_t is a solution of Eqs. (6-7) with initial point P_0. The useful limit set L[I_{X*}] is characterized by the theorem below.

Theorem 4
The limit set of Eqs. (6-7) starting from any initial point P_0 ∈ I_{X*} is P_{X*}, i.e., L[I_{X*}] = P_{X*}.

Proof The proof will be done in two parts: first by proving L[I_{X*}] ⊃ P_{X*}, then by proving L[I_{X*}] ⊂ P_{X*}. The first part is proved by taking an element P ∈ P_{X*} and noting that the constant solution starting at P remains at P; by the definition of L[I_{X*}] (i.e., the limit set of Eqs. (6-7) starting from I_{X*}), we conclude that P ∈ L[I_{X*}]. The second part is proved by contradiction. We assume there is an element Q_∞ ∈ L[I_{X*}] that is not in P_{X*}, which implies

lim_{k→∞} P_{t_k}(X \ X*) = Q_∞(X \ X*) > 0.

The equality is due to the fact that Q_∞ ∈ L[I_{X*}], together with the dominated convergence theorem. This, along with Eqs. (6-7), implies lim_{t→∞} P^H_t(R+ \ {y*}) > 0, which contradicts the second claim of Theorem 2 stating lim_{t→∞} P^H_t(R+ \ {y*}) = 0.
The following theorem shows the monotonically increasing nature of E t [H ], which will be useful later in proving the stability of Eqs. (6-7).

Theorem 5 Let P^H_t be a solution of Eq. (6) with its initial point in I^H_*; then the following claims are true:

1. E_t[H] is monotonically non-decreasing in t;
2. if P_τ ∉ L[I_{X*}], then (d/dτ) E_τ[H] > 0.

Proof We start our proof by differentiating the expected value:

d/dt E_t[H] = d/dt ∫_{R+} y P^H_t(dy) = ∫_{R+} y [w'(P^H_t([y, ∞))) − 1] P^H_t(dy) ≥ 0.

The swapping of integration and differentiation in the last equation is allowed due to the dominated convergence theorem. The ỹ variable is used to decompose the expected outcome function into non-negative and negative parts (cf. Lemma 2). The last inequality is true because P^H_t is a probability measure (cf. Lemma 1). The first claim is proved. The second claim is proved by contradiction. Assume P_τ ∉ L[I_{X*}], i.e., P_τ is not a limit point, and (d/dτ) E_τ[H] = 0, i.e., E_τ[H] is not increasing at τ. Along with Theorem 2, the equality above implies that H is P_τ-almost surely equal to the constant C = max_{x∈X} H(x). This implies that P^H_τ is a Dirac measure concentrated at C, which means P_τ ∈ L[I_{X*}] (cf. Theorem 4), a contradiction.
At this point, we have demonstrated the convergence to optimality property of our method. We now explore the stability property of our method. The metric function d in the following definitions is the Prohorov metric found in the appendix of [6, p. 170].
Definition 5 Let L be a subset of P_X. For a point P ∈ P_X, we define the distance between P and L as

d(P, L) := inf_{Q∈L} d(P, Q).

L is called Lyapunov stable, with respect to a sequence of measures {P_t}, if for all ε > 0, there exists a δ > 0 such that

d(P_0, L) < δ implies d(P_t, L) < ε, ∀t > 0.

L is called asymptotically stable, with respect to a sequence of measures {P_t}, if L is Lyapunov stable and there exists a δ > 0 such that

d(P_0, L) < δ implies d(P_t, L) → 0 as t → ∞.

The next theorem is the main result of this section.

Theorem 6 L[I_{X*}] is a compact set and it is asymptotically stable.
Proof We first need to prove that L[I_{X*}] is a compact set. Since from Theorem 4 we have L[I_{X*}] = P_{X*}, we can instead prove that P_{X*} := {P ∈ P_X | P(X*) = 1} is compact. It is easy to see that P_{X*} is tight, since X* ⊂ X and X is compact. Furthermore, we can prove it is a closed set by contradiction. Assume there exists a sequence {P_n} ⊂ P_{X*} such that P_n → P̄ ∉ P_{X*}. This implies P̄(X*) < 1 and P̄(X \ X*) > 0, which implies lim_{n→∞} P^H_n(R+ \ {y*}) > 0.

This contradicts the second claim of Theorem 2; hence, P_{X*} = L[I_{X*}] is tight and closed, and therefore a compact set. We pick the Lyapunov function

V(P_t) := y* − E_t[H],

where V(P_t) > 0 for all P_t ∈ P_X \ P_{X*}, and V(P_t) = 0 for P_t ∈ P_{X*} = L[I_{X*}]. From Theorem 5 we have V̇(P_t) < 0 for all t > 0 if P_t ∉ P_{X*}. Furthermore, we know that V̇(P_t) = 0, ∀t > 0, if P_t ∈ P_{X*}. Using V(P_t) as the Lyapunov function, and the fact that P_{X*} = L[I_{X*}] is a compact set, we can appeal to a generalized version of Lyapunov's theorem (see [15, Chapter 5]). The desired conclusion is reached.
The use of a Lyapunov function for proving the asymptotic stability of the limit set can be found previously in Wang's dissertation [8].

Isolated optimizers
In Sect. 4.1, we required that our initial distribution P_0 belong to I_{X*}, i.e., that P_0(X*) > 0. This is reasonable in many cases: (1) if the solution space is large but finite; (2) if the optimizers are not isolated. However, when the optimizers are isolated and I_{X*} is restricted to measures with continuous densities, the condition P_0(X*) > 0 cannot be satisfied, because the probability of any single point is zero. For example, when minimizing H(x) = x^2, it is convenient to start with a Gaussian distribution due to its simple form. Since Gaussian measures have continuous densities, it is impossible to satisfy the condition P_0(X*) > 0. Thus, we need to make a slight modification to the definition of I_{X*} and to the statements of the theorems in the previous section. The modifications will lead to the conclusion that the weight-update system converges to measures that place all their weight in ball neighborhoods of the optimizers. We modify the definition of I_{X*} so that the positive measure requirement is satisfied as long as the measure is positive on every neighborhood of at least one element of X*. In other words, as long as the initial measure has not excluded all neighborhoods of all optimizers, it should be included in the admissible initial probability measure set Ĩ_{X*}, which is defined below.

Definition 6
The set of all admissible initial solution probability measures is denoted by:

Ĩ_{X*} := {P ∈ P_X | ∃ x* ∈ X* such that P(B(x*, δ)) > 0, ∀δ > 0},

where B(x*, δ) denotes the open δ-neighborhood of x*. The corresponding set of H push-forward optimal probability measures over (R+, B(R+)) is denoted by Ĩ^H_* := {P ∘ H^{-1} | P ∈ Ĩ_{X*}}.
An example can illustrate the reasonableness of Ĩ_{X*} for the isolated optimizers case. If we would like to minimize the function H(x) = x^2, which has a single isolated optimizer at x = 0, then any Gaussian distribution satisfies the requirement imposed by Ĩ_{X*}, since for any δ-neighborhood around x = 0, the measure under P_0 is non-zero. After redefining the admissible initial distributions, we need an additional continuity assumption on the objective function H.

Assumption 4 H, the objective function, is continuous at all x* ∈ X*.

Assumption 4 is needed so that the neighborhoods around elements of X* have values close to the optimal objective value y*. In the rest of this section, we list the modified versions of the theorems analogous to those in the previous section, eliminating any redundancy.

Theorem 7 If P^H_t is a solution of Eq. (6) with initial point P^H_0 ∈ Ĩ^H_*, then for any ε > 0, P^H_t((y* − ε, ∞)) converges to 1 as t → ∞, and P^H_t([0, y* − ε]) → 0 as t → ∞.

Proof Using Assumption 4, we know that for any ε > 0, there exists a δ̃ > 0 such that H(x) > y* − ε for all x in the δ̃-neighborhood of some x* ∈ X*. As a consequence, P^H_0 ∈ Ĩ^H_*, which is induced by a P_0 ∈ Ĩ_{X*} that has non-zero probability over the δ̃-neighborhood of at least one optimizer x* ∈ X*, has non-zero probability over the ε-neighborhood of y*. The rest follows from the same proof technique as in Theorem 2.
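A minimal numerical sanity check of the Gaussian example above: using the closed-form normal CDF, a standard Gaussian P_0 assigns strictly positive mass to every δ-neighborhood of the isolated optimizer x* = 0, even though P_0({0}) = 0.

```python
# A standard Gaussian initial distribution satisfies the modified admissibility
# condition for the isolated optimizer of H(x) = x^2 at x* = 0: every
# delta-neighborhood of 0 carries positive mass.  Sketch using math.erf.
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

for delta in (1.0, 1e-2, 1e-6):
    mass = normal_cdf(delta) - normal_cdf(-delta)  # P_0((-delta, delta))
    print(f"delta = {delta:g}: P_0(neighborhood) = {mass:.3e}")  # ~0.6827 at delta = 1
    assert mass > 0.0  # so P_0 belongs to the admissible set
```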
For the following theorems, we introduce the notation

X_{δ,*} := {x ∈ X | d(x, X*) < δ},

the δ-neighborhood of the set of optimizers. The next theorem is needed to characterize the limiting behavior of Eqs. (6-7).

Theorem 8 If P_0 ∈ Ĩ_{X*} and P_t is a solution of Eqs. (6-7), then the following claims hold for any δ > 0:

1. lim_{t→∞} P_t(X_{δ,*}) = 1;
2. lim_{t→∞} P_t(X \ X_{δ,*}) = 0.
Proof Using the fact that the objective function is continuous around the optimizers (i.e., Assumption 4), we know that there exists a corresponding δ for each ε. Furthermore, since ε is arbitrary in Theorem 7 and we can always shrink δ by shrinking ε around the isolated optimizers, δ can be made arbitrarily small as well.
The next theorem states that if we start fromĨ X * , the limit set will place all the weight in the neighborhoods of the optimizers.
Theorem 9 The limit set of Eqs. (6-7) starting from any initial point P_0 ∈ Ĩ_{X*} places all of its weight on arbitrarily small neighborhoods of the optimizers, i.e., every P ∈ L[Ĩ_{X*}] satisfies P(X_{δ,*}) = 1 for all δ > 0.

Proof See the proof for Theorem 4 and replace Theorem 2 with Theorem 7.
The next theorem is needed in our final conclusion stated in Theorem 11.

Theorem 10 Let P^H_t be a solution of Eq. (6) with its initial point in Ĩ^H_*; then the following claims are true:

1. E_t[H] is monotonically non-decreasing in t ∈ R+;
2. if P_τ ∉ L[Ĩ_{X*}], then (d/dτ) E_τ[H] > 0.
Proof See the proof for Theorem 5.
Finally, by using Theorems 7 and 10 and applying a Lyapunov function, we have our desired conclusion.

Theorem 11 L[Ĩ_{X*}] is a compact set and it is asymptotically stable.
Proof See the proof for Theorem 6 and replace Theorems 2 and 5 with Theorems 7 and 10.
In other words, the modified theorems state that if Eqs. (6-7) start with a probability measure that places some weight on an arbitrarily small neighborhood of at least one optimizer, then the system will converge to an element of L[Ĩ_{X*}], the set of probability measures that have weight only on arbitrarily small neighborhoods of the optimizers. Furthermore, L[Ĩ_{X*}] is asymptotically stable.

Monte Carlo version
In Sect. 4.1, we demonstrated that when the solution space is a subset of R^n, the CWO method exhibits convergence and stability properties that are desirable for any optimization method. The theorems are proven for the case when the probability measure can be modeled perfectly; however, there is no convergence result yet for the Monte Carlo version of the CWO method (cf. Algorithm 1), which is important in practice. The analysis techniques applied and the convergence proved in this section are significantly different from those of the general version. The difference is caused by the two layers of approximation used for efficient simulation: projection and sampling. In addition, we are able to apply the analysis techniques in [2] to Eq. (16), which is a more general version of the equations considered previously for model-based methods. In practice, we often assume the existence of a PDF p_t for P_t such that

P_t(A) = ∫_A p_t(x) μ(dx), ∀A ∈ B(X),

where μ is the Lebesgue or discrete measure on X. In this section, we restrict the β_t function to depend only on x and p_t, i.e., β_t(x) = β(x, p_t).
Then, in discrete time (i.e., t ∈ {0, 1, 2, . . .}), Eqs. (6-7) can be rewritten more simply as:

p_{t+1}(x) = σ(x, p_t) w'(P^H_t([H(x), ∞))) p_t(x),   (14)

where σ(x, p_t) is introduced to highlight the dependency on H(x) and p_t. It is common to let β(x, p_t) be the conditional density of p_t on the level set M(x) = {ξ ∈ X | H(ξ) = H(x)}, in which case σ(x, p_t) = 1 and Eq. (14) becomes

p_{t+1}(x) = w'(P^H_t([H(x), ∞))) p_t(x).

Note that in the previous section we allowed β to vary with time t; however, in this section we require a more explicit structure on β, namely that it depends on x and p_t. To recap, the relevant assumptions for this section are Assumptions 1, 2, and 3. Furthermore, we will use t to denote time and k to denote the iteration count going forward. While Eq. (14) brings us a step closer to an implementable algorithm, we are still faced with a major challenge: given an initial point p_0, it is not clear that we can solve Eq. (14), a nonlinear Fokker-Planck-type equation, analytically. Hence, solving Eq. (14) numerically is a natural alternative. In solving Eq. (14), most numerical methods need to sample from the PDF p_t, which can be difficult at times. One common approach to circumvent this difficulty is to project p_t onto a family of easy-to-sample-from parameterized density functions, denoted by F := {f_θ | θ ∈ Θ}, via the Kullback-Leibler (KL) divergence

D(p_k, f_θ) := E_{p_k}[ln(p_k(X)/f_θ(X))],

where θ ∈ Θ and f_θ ∈ F. The projection is done by minimizing the KL divergence, i.e., θ_k ∈ arg min_{θ∈Θ} D(p_k, f_θ), so that the reference PDF p_k is updated through its surrogate PDF f_θ. We adopt the notations P_θ(·) and E_θ[·] for the probability measure and expectation with respect to the surrogate PDF f_θ. On the other hand, P_k(·) and E_k[·] denote the probability measure and expectation of the reference PDF at the k-th iteration. With a slight abuse of notation, we write X for both X_k and X_θ, the random variables with the PDFs p_k and f_θ, respectively. The natural exponential families (NEFs) can, in many applications, be used as the surrogate parameterized family of PDFs. They are convenient in implementations and lead to closed-form expressions in our analysis.
Their definition from [2, Definition 2.1] is presented below.

Definition 7 A parameterized family F = {f_θ | θ ∈ Θ ⊆ R^d} of PDFs on X is called a natural exponential family (NEF) if there exist continuous mappings Γ : R^n → R^d and K : R^d → R such that f_θ(x) = exp(θᵀΓ(x) − K(θ)), where Γ is the vector of sufficient statistics and K(θ) = ln ∫_X exp(θᵀΓ(x)) μ(dx) is the log-partition function.
Remark 4 Let the interior of Θ be denoted by Θ̊. In this section, we use the following properties of the family from [16]: (1) K(θ) is strictly convex on Θ̊; (2) the Jacobian of ∇K(θ) is the covariance of Γ(X) with respect to f_θ. Therefore, the Jacobian of the parameterized mean vector function, m(θ) := E_θ[Γ(X)], is strictly positive definite and invertible. This fact, along with the inverse function theorem, implies that m(θ) is also invertible. Since m(θ) is invertible, we can iterate our algorithm on m(θ) instead of θ, and recover θ as needed via the inverse function m^{-1}(·).
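Since m(θ) is invertible, projecting a weighted sample onto an NEF amounts to matching the mean vector E[Γ(X)]. The sketch below illustrates this for the univariate Gaussian family, whose sufficient statistic is Γ(x) = (x, x²); the objective H and the sample weights standing in for S(x, H(x), p_k) are illustrative assumptions.

```python
# Minimizing the KL divergence from a weighted sample onto a univariate
# Gaussian NEF (sufficient statistic Gamma(x) = (x, x^2)) reduces to moment
# matching: set m(theta) equal to the weighted mean of Gamma(X).
import random
import math

random.seed(0)
xs = [random.gauss(0.0, 2.0) for _ in range(10000)]  # samples from f_theta_k
H = lambda x: -(x - 1.0) ** 2                         # assumed objective
scores = [math.exp(H(x)) for x in xs]                 # assumed positive weights
total = sum(scores)
wts = [s / total for s in scores]

# Matching E~[Gamma(X)] = (E~[X], E~[X^2]) pins down the projected Gaussian.
m1 = sum(wt * x for wt, x in zip(wts, xs))
m2 = sum(wt * x * x for wt, x in zip(wts, xs))
mu, var = m1, m2 - m1 ** 2        # recover (mu, sigma^2) from the mean vector

print(f"projected Gaussian: mu = {mu:.3f}, var = {var:.3f}")
assert var > 0.0
```

The closed-form recovery of (μ, σ²) from the matched moments is exactly what makes NEF surrogates convenient in implementations.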
In the existing work using NEFs for global optimization, the model follows the form:

p_{k+1}(x) ∝ S_k(H(x)) p_k(x),   (15)

where S_k : R+ → R+ is an increasing, possibly iteration-varying function. Examples include: (1) proportional selection [17-19], where S_k(y) = y; (2) the Boltzmann distribution [20], where S_k(y) = exp(y/T_k) and {T_k} is a sequence of predetermined parameters; (3) importance sampling [21]. In the CWO method, looking at Eq. (14), our model has the form:

p_{k+1}(x) = S(x, H(x), p_k) p_k(x).   (16)

There are two major differences between our model [cf. Eq. (16)] and the previous models [cf. Eq. (15)]. First, the S function in our model, a mapping from X × R+ × P_X to R+, takes two additional arguments, x and p_k. Second, our model ensures that E_{p_k}[S(X, H(X), p_k)] = 1; hence, there is no need for normalization. It is also interesting to note that, for a fixed p_t and x, S(x, ·, p_t) is an increasing function. Acknowledging these differences, we will prove that the Monte Carlo version of Eq. (14) converges to the internally chain recurrent set of an ordinary differential equation. The definition of internally chain recurrent sets will be introduced later, along with the corresponding ordinary differential equation, in our analysis. The development of the theorems below runs in parallel with the work of Hu et al. [2], with the major difference being the structure of the S function as discussed in this paragraph. The main idea is to apply techniques from the stochastic approximation literature to model-based methods. With p_{k+1} defined by Eq. (16) and f_{θ_k} the surrogate PDF at the k-th iteration, the smoothed reference PDF is denoted by:

p̃_{k+1}(x) := α_k p_{k+1}(x) + (1 − α_k) f_{θ_k}(x),   (17)

where the smoothing factor α_k ∈ (0, 1] is a design parameter that can be used to limit the effect of p_{k+1} on p̃_{k+1}.
Lemma 3 Assume that f_{θ_{k+1}} is the member of a natural exponential family with parameter space Θ that minimizes the KL divergence between p̃_{k+1} and the family, i.e., θ_{k+1} ∈ arg min_{θ∈Θ} D(p̃_{k+1}, f_θ), and that θ_{k+1} ∈ Θ˚, the interior of Θ, for all k. Then

m(θ_{k+1}) = E_{p̃_{k+1}}[Γ(X)],

where m : R^d → R^d is the mean vector function (cf. Remark 4).
Proof Using the argument made for proving Lemma 2 of [19] and the assumption that θ_{k+1} ∈ Θ˚, we can prove that E_{θ_{k+1}}[Γ(X)] = E_{p̃_{k+1}}[Γ(X)] (cf. Remark 4). The rest of the proof uses the same argument as in Lemma 2.1 of [2]. Using the fact that E_{θ_{k+1}}[Γ(X)] = E_{p̃_{k+1}}[Γ(X)], along with Eq. (17), it follows that

m(θ_{k+1}) = m(θ_k) + α_k (E_{p_{k+1}}[Γ(X)] − m(θ_k)),

which can be rewritten as

m(θ_{k+1}) = m(θ_k) + α_k ∇_θ E_θ[S(X, H(X), f_{θ_k})] |_{θ=θ_k},   (18)

where the last equality is obtained from the definitions of KL divergence and NEFs, and the interchange of the derivative and the integral is justified by the dominated convergence theorem. The intuition from Eq. (18) is that the algorithm will move in the direction that increases E_θ[S(X, H(X), f_{θ_k})]. So far, Eq. (18) provides us a way to update θ_k via p_{k+1}. Of course, in reality we do not know p_{k+1}; hence, we denote the estimator for p_{k+1}, constructed from the sampled solutions, by p̂_{k+1}.
In the CWO method, the S mapping has the specific form given in Eq. (14), and its estimated form is obtained from the sampled solutions. The mean vector function below, which differs from that of Proposition 3.1 in [2], describes the dynamics of the mean vector in the CWO method [Eq. (19)]. By replacing the mean vector function in Algorithm 2 from [2] with Eq. (19) (cf. Algorithm 1), we can prove the convergence of Algorithm 1.
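For a concrete instance of the stochastic-approximation recursion m(θ_{k+1}) = m(θ_k) + α_k (E_{p̂_{k+1}}[Γ(X)] − m(θ_k)), consider a Gaussian family with known variance, where Γ(X) = X and m(θ) is simply the mean. The uniform top-decile weighting below is an illustrative assumption standing in for the estimated CWO S, not the exact form of Eq. (14):

```python
import numpy as np

rng = np.random.default_rng(0)

def H(x):
    return -(x - 3.0) ** 2   # maximization problem; optimum at x = 3

m = 0.0          # current mean vector m(theta_k); scalar for simplicity
sigma = 2.0      # fixed standard deviation (known-covariance NEF)
for k in range(200):
    alpha = 2.0 / (k + 10)            # gains: alpha_k -> 0, sum alpha_k = inf
    xs = rng.normal(m, sigma, 100)    # sample from the surrogate f_{theta_k}
    # Uniform weighting of the top decile plays the role of E_{p_hat}[Gamma(X)].
    elite = xs[H(xs) >= np.quantile(H(xs), 0.9)]
    m = m + alpha * (elite.mean() - m)   # mean vector recursion
```

Under these assumptions the iterate m drifts toward the optimizer x = 3, illustrating why the recursion behaves like a noisy discretization of an ODE.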
Finally, we can prove that η_k in Eq. (20) will converge to the internally chain recurrent sets (cf. Definition 8) of the ordinary differential equation

dη/dt = L(η),   (21)

where L(η) is an empirical estimate of Eq. (19) based on the sampled solutions in Λ_k. (The final step of Algorithm 1 reads: if a stopping rule is satisfied, return θ̂_{k+1} and terminate; otherwise set k = k + 1 and go to Step 2.)

Definition 8 A point x is chain recurrent if, for every δ > 0 and T > 0, there exist points y_0, y_1, …, y_k = x and times t_0, …, t_{k−1} ≥ T such that ‖x − y_0‖ ≤ δ and ‖η_{y_i}(t_i) − y_{i+1}‖ ≤ δ for i = 0, …, k − 1. A compact invariant set A (i.e., for any y ∈ A, the trajectory η_y(t) satisfies η_y(t) ⊂ A, ∀t ∈ R_+) is said to be internally chain recurrent if every point x ∈ A is chain recurrent.
The following theorem proves the convergence and stability of the Monte Carlo Algorithm 1.

Theorem 12 Assume that L (η) is continuous with a unique integral curve and the following assumptions hold:
1. The parameter θ̂_{k+1} computed at step 4 satisfies θ̂_{k+1} ∈ Θ˚, ∀k;
2. The gain sequence {α_k} satisfies α_k > 0, ∀k, α_k → 0 as k → ∞, and Σ_{k=0}^∞ α_k = ∞;
3. For a given ρ ∈ (0, 1) and a distribution family …

We have thus far shown that the Monte Carlo version of the CWO method converges to the internally chain recurrent set of Eq. (21), an invariant set. Since Eq. (20) will remain in the invariant set upon entrance, we have proved that Algorithm 1 is asymptotically stable. The major difference between the general and Monte Carlo convergence results is that in the general case we can characterize the limiting behavior more precisely, i.e., all the weight concentrates on optimal solutions. The Monte Carlo version, being an approximation of the general version, can converge to a set that also contains distributions whose support includes non-optimal solutions. The precise nature of the internally chain recurrent set of Eq. (21) depends on the projection used, and hence requires additional analysis for each problem. The chain recurrent set is closed and invariant; it contains all equilibrium points, as well as any point that can reach itself by repeatedly following the system dynamics and then jumping to a nearby state (e.g., periodic orbits). The existence of these non-optimal recurrent points can only be confirmed by plotting the vector field of Eq. (21). Knowing that, in the worst case, the system converges to the chain recurrent set is nevertheless helpful.
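A numerical check of the kind described above can be sketched as follows: evaluate the vector field L(η) on a grid and flag near-equilibrium points as candidate members of the chain recurrent set. The quadratic field used here is purely illustrative, not the L(η) of Eq. (21):

```python
import numpy as np

def L(eta):
    """Illustrative vector field with equilibria where each coordinate is 0 or 1."""
    x, y = eta
    return np.array([x * (1.0 - x), y * (1.0 - y)])

# Scan a grid and report grid points where ||L(eta)|| is nearly zero; a full
# analysis would also inspect trajectories for periodic or recurrent behavior.
grid = np.linspace(-0.5, 1.5, 41)
candidates = [(x, y) for x in grid for y in grid
              if np.linalg.norm(L(np.array([x, y]))) < 1e-8]
```

For this field the scan recovers the four equilibria; plotting the field (e.g., with a quiver plot) over the same grid would reveal any non-equilibrium recurrent structure.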

Numerical algorithms
In this section, we present a few numerical algorithms based on the CWO method. These algorithms attempt to find an optimal solution iteratively. Each iteration consists of five stages: generation, quantile-update, parameter-update, weight-update, and projection. The generation, quantile-update, and projection stages remain the same for all variations of the generic algorithm (i.e., Algorithm 2, where arrows are used as indentation markers). The weight-update and projection steps, along with the equation in step 2 of Algorithm 2, correspond to step 4 in Algorithm 1. The additional uniform random variable is included in step 2 of Algorithm 2 to ensure all solutions are considered. We propose several approaches for constructing the weight-update stage. These algorithms build on the theoretical results using the same types of modifications as are found in CE and MRAS (see [11,19,22]). There are seven parameters in Algorithm 2: ρ_0 is the initial percentile threshold; ρ_min is the minimum allowed percentile threshold; N_0 is the initial sample size; ε is the minimal required improvement of the threshold value γ_k, where γ_k is the objective value at threshold ρ_k; ζ is the growth factor for the sample size; ς is introduced to ensure all solutions are considered; and α controls how much information the algorithm retains from the last iteration. In general, we want N_0 to be at least in the hundreds; increasing this parameter will improve the final solution. From condition 2 of Theorem 12, we know that the number of samples, N_k, grows with the iteration count. In practice, we found that the decision to increase the sample size can be made when the value found at the current threshold ρ_k does not improve γ_{k−1} by at least ε. For CWO algorithms, ρ_0 controls the tradeoff between speed and optimality: the lower ρ_0, the faster the algorithm converges to a solution; however, that solution might be far from optimal.
Similar considerations go into picking ρ_min. We introduce the parameter ε to ensure that the threshold γ calculated at the (1 − ρ) percentile improves at each iteration. If an improvement does not happen, then the algorithm is forced to either (1) reduce the threshold ρ_k until a sample better than γ is found, or (2) increase the sample size. In general, small ρ values should be avoided in early iterations of the algorithm; hence, ρ_0 should not be so low that the algorithm too early eliminates suboptimal solutions that lie near optimal solutions. ζ decides how fast we grow the sample size when a satisfactory ρ_k update cannot happen; the growth rate also depends on ρ_k and ε. Finally, ς is introduced so that there is a non-zero probability over the neighborhood of any element of the solution space satisfying the requirement imposed by I_{X*}; ς is generally on the order of 0.01. In addition to the parameters listed above, we also need to choose the family of distributions, g_θ, employed for each of the optimization problems in this section. These choices are discussed in detail in their respective sections.
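The interplay between ρ_k, ε, and ζ described above can be sketched as a single bookkeeping step. The parameter names follow the text, but the control flow, the halving of ρ, and the growth rule N_{k+1} = (1 + ζ)N_k are our paraphrase for illustration, not a verbatim transcription of Algorithm 2:

```python
def update_threshold(H_sorted, rho_k, N_k, gamma_prev, eps, rho_min, zeta):
    """One quantile-update step.

    H_sorted   : sampled objective values, sorted ascending
    rho_k      : current percentile threshold
    N_k        : current sample size
    gamma_prev : threshold value gamma_{k-1} from the last iteration
    Returns (gamma_k, rho_next, N_next).
    """
    idx = int((1.0 - rho_k) * len(H_sorted))
    gamma = H_sorted[min(idx, len(H_sorted) - 1)]
    if gamma >= gamma_prev + eps:
        return gamma, rho_k, N_k          # satisfactory improvement
    # Otherwise reduce the threshold (down to rho_min) in search of a
    # sample better than gamma_prev by at least eps...
    rho = rho_k
    while rho > rho_min:
        rho /= 2.0                        # hypothetical reduction schedule
        idx = int((1.0 - rho) * len(H_sorted))
        gamma = H_sorted[min(idx, len(H_sorted) - 1)]
        if gamma >= gamma_prev + eps:
            return gamma, rho, N_k
    # ...and, failing that, grow the sample size for the next iteration.
    return gamma_prev, rho_k, int((1 + zeta) * N_k)

g1, r1, n1 = update_threshold(list(range(100)), 0.1, 100, 80, 0, 0.001, 1)
g2, r2, n2 = update_threshold(list(range(100)), 0.1, 100, 99, 0, 0.001, 1)
g3, r3, n3 = update_threshold(list(range(100)), 0.1, 100, 200, 0, 0.001, 1)
```

The three calls exercise the three branches: immediate improvement, improvement after reducing ρ, and sample-size growth when no reduction helps.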

Combinatorial optimization: ATSP
We apply variations of Algorithm 2 to several asymmetric traveling salesman problems (ATSPs), taken from the website http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95. We follow a similar approach as in Hu et al. [19], which is outlined below. The reader is reminded that Algorithm 2 is designed for maximization problems, whereas we are searching for the minimum distances of ATSPs. The goal in each ATSP is to find the minimum length of a tour that visits N_cities cities and has the same starting and ending city. For an ATSP, we are given an N_cities-by-N_cities distance matrix D, whose (i, j)th element D_{i,j} represents the distance from city i to city j. The problem can be mathematically stated as

min_{x ∈ X} Σ_{l=1}^{N_cities} D_{x_l, x_{l+1}},

where x := (x_1, x_2, …, x_{N_cities}, x_1) is an admissible tour and X is the set of all admissible tours. We use the approach suggested by Rubinstein [22] and de Boer et al. [23] for solving these problems. Each distance matrix D is given an initial state probability transition matrix, whose (i, j)th element specifies the probability of transitioning from city i to city j. Each iteration of the algorithm has two important steps: (1) generate random admissible tours according to the probability transition matrix and evaluate the performance of each sampled tour; (2) update the probability transition matrix based on the tours generated in step 1. We denote the set of tours generated at the kth iteration by {x_k^i}, where i ∈ {1, …, N_k}. Without loss of generality, we assume the samples are sorted according to their values (i.e., H(x_k^i) < H(x_k^j) if and only if i < j). A detailed discussion of the admissible tour generation process can be found in de Boer et al. [23]. The CWO algorithm differs from other algorithms in how it updates its transition matrix.
At the kth iteration of CWO, the probability density function p_k(·, θ_k), parameterized by the transition matrix θ_k, is given in terms of the sets X_{i,j}(l) of all tours in X whose lth transition is from city i to city j. We can show that the new transition matrix is updated (i.e., stage 6 of Algorithm 2) from the updated density p^w_{k+1}(·), where {x_{k+1}^i} is generated from p_k(·, θ_k) (i.e., a density function parameterized by θ_k). The superscript w emphasizes the dependence of the updated probability mass function on the probability weighting function w; the construction of p^w_{k+1}(·) depends on the specific weight-update method (cf. Fig. 1a). Table 2 contains the results from running 20 trials of the CWO_U and CE algorithms with the parameters Δ = 0.01, ρ_0 = 0.1, ρ_min = 0.001, N_0 = 1000, ε = 0, ζ = 1, ς = 0.01, and α = 0.7. Note that the additional parameter Δ updates the weight function in each iteration by increasing the σ parameter in Eq. (23) by Δ. We plot the sorted minimum tour distances obtained from the 20 trials of the CE and CWO_U algorithms in Fig. 1b, from which we observe that, compared with the standard cross-entropy method, our approach does better at every percentile. For example, the 19/20th percentile contains the lowest optimal solution obtained among the 20 trials, and the 18/20th percentile contains the second lowest.
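The two steps of each iteration can be sketched as follows. The city-by-city tour sampler and the uniform-elite frequency update are simplified stand-ins for the generation procedure of de Boer et al. [23] and the CWO weight-update, respectively:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_tour(P):
    """Sample an admissible tour city-by-city from transition matrix P."""
    n = P.shape[0]
    tour, visited = [0], {0}
    for _ in range(n - 1):
        probs = P[tour[-1]].copy()
        probs[list(visited)] = 0.0           # forbid revisiting cities
        probs /= probs.sum()
        nxt = int(rng.choice(n, p=probs))
        tour.append(nxt)
        visited.add(nxt)
    return tour + [0]                         # return to the start city

def tour_length(D, tour):
    return sum(D[tour[l], tour[l + 1]] for l in range(len(tour) - 1))

def update_matrix(P, elite_tours, alpha=0.7):
    """Re-estimate transition frequencies from elite tours, then smooth."""
    counts = np.zeros_like(P)
    for t in elite_tours:
        for l in range(len(t) - 1):
            counts[t[l], t[l + 1]] += 1.0
    counts += 1e-6                            # keep every transition reachable
    P_new = counts / counts.sum(axis=1, keepdims=True)
    return alpha * P_new + (1 - alpha) * P    # smoothed update

P0 = np.full((4, 4), 0.25)
tours = [sample_tour(P0) for _ in range(5)]
P1 = update_matrix(P0, tours)
```

A full run would rank the sampled tours by `tour_length`, keep only an elite fraction, and iterate the matrix update to convergence.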

Continuous problems
We further tested the CWO uniform weight-update scheme in the continuous case, comparing CWO_U against the CE method by minimizing two continuous test functions with many local minima and isolated optimizers, including the Forrester function. Table 3 contains the results of our 20 trial runs for each scenario, using the parameters Δ = 0.1, ρ_0 = ρ_min = 0.1, N_0 = 100, ε = 0, ζ = 1, ς = 0, and α = 0.7. We employed independent Gaussian distributions with zero mean and standard deviation 10 as the initial distributions in all dimensions for all runs.
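A minimal sketch of the continuous experiment on the one-dimensional Forrester function, f(x) = (6x − 2)² sin(12x − 4), whose global minimum on [0, 1] lies near x ≈ 0.757. The loop below is a plain uniform-weight elite update (CWO_U-style) with samples clipped to [0, 1], a simplified stand-in for the full Algorithm 2:

```python
import numpy as np

rng = np.random.default_rng(2)

def forrester(x):
    """Forrester et al. test function; global minimum near x = 0.757 on [0, 1]."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

mu, sd = 0.0, 10.0            # initial Gaussian, as in the experiments
for k in range(80):
    # Sample from the current Gaussian, restricted to the search domain [0, 1].
    xs = np.clip(rng.normal(mu, sd, 200), 0.0, 1.0)
    elite = xs[np.argsort(forrester(xs))[:20]]    # 20 lowest objective values
    # Uniform weighting of the elites, smoothed with alpha = 0.7.
    mu = 0.7 * float(elite.mean()) + 0.3 * mu
    sd = 0.7 * float(elite.std()) + 0.3 * sd
best = mu
```

Despite the oscillatory landscape, the uniform elite update concentrates the Gaussian on the global minimizer rather than the shallower local minimum near x ≈ 0.15.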

Conclusion
In the first part of this paper, we proved the convergence and stability of both the theoretical and Monte Carlo versions of the CWO-based method. The proofs provide a rigorous mathematical foundation for the two practical algorithms we proposed in the numerical examples section. These two algorithms are variations of the generic CWO algorithm described in Algorithm 2. The two variations, CWO_T and CWO_U, differ in how they update their probability density functions over the solution space at each iteration. The first approach, CWO_T, weights the samples according to their outcome values; CWO_U, on the other hand, weights the samples uniformly. We benchmarked the performance of the CWO_T algorithm and summarized the results in Table 1. Although the numerical values are quite satisfactory, we wanted to see whether we could improve them. This effort led to the development of the second approach, CWO_U, which we consider the preferred implementation of the CWO-based algorithm. Perhaps the most surprising fact is that by not taking into account the outcome values of the samples, we achieve better performance. Even more interesting is the fact that the standard cross-entropy approach is just a limiting case of the CWO_U approach. Comparing the numerical results of CWO_U with those of CE, we believe our algorithm is better at obtaining an optimal solution (see Fig. 1b). Of course, the improvement in performance comes at the expense of increased computational cost.