# Cumulative weighting optimization


## Abstract

Global optimization problems with limited structure (e.g., convexity or differentiability of the objective function) can arise in many fields. One approach to solving these problems is by modeling the evolution of a probability density function over the solution space, similar to the Fokker–Planck equation for diffusions, such that at each time instant, additional weight is given to better solutions. We propose an addition to the class of model-based methods, cumulative weighting optimization (CWO), whose general version can be proven convergent to an optimal solution and stable under disturbances (e.g., floating point inaccuracy). These properties encourage us to design a class of CWO algorithms for solving global optimization problems. Beyond the general convergence and stability analysis, we prove that with some additional assumptions the Monte Carlo version of the CWO algorithm is also convergent and stable. Interestingly, the well-known cross-entropy method is a CWO algorithm.

## Keywords

Stochastic optimization · Cumulative weighting · Convergence

## 1 Introduction

Many questions in engineering and science can be formulated as optimizing an objective function. When the objective function is differentiable, its derivative has an explicit form, and it has finitely many local extrema, the problem is highly tractable: the first-order necessary condition generates a set of candidate solutions, the best of which is an optimal solution. On the other hand, objective functions without any such structural information can be challenging to solve analytically. Approaches developed to solve these problems numerically can be divided into two categories: deterministic and random search. Random search is further divided into instance-based algorithms (e.g., simulated annealing, genetic algorithms, tabu search, nested partitions, generalized hill climbing, and evolutionary programming) and model-based algorithms [e.g., annealing-adaptive search (AAS), cross-entropy (CE), model reference adaptive search (MRAS), and estimation of distribution algorithms (EDAs)]. For the interested reader, Hu et al. [1] provide a recent survey of model-based methods, which also contains references to the instance-based methods mentioned in this paragraph.

The cumulative weighting optimization (CWO) method extends the class of model-based methods by introducing an alternative weight-update equation. The weight-update equation is important for model-based methods, as it decides the search direction for the next time step. Our equation is inspired by Cumulative Prospect Theory (CPT) and has an intuitive connection with the risk-sensitive nature of human decision making. The new equation can be proven to converge to solutions of optimization problems when it can be solved analytically. Interestingly, the well-known cross-entropy method is a special case of the CWO method. We also provide a convergence result for the case when an analytical solution cannot be obtained and the problem requires approximation. The approximate version, later referred to as the Monte Carlo version, first projects the underlying distribution onto a family of distributions that are easy to sample from and then samples from the projected distribution. The techniques used in the convergence analysis of the Monte Carlo version follow the work of Hu et al. [2] with two major differences: the class of functions considered and the mean vector equation.

Moving from theory to implementation, this paper proceeds as follows. Section 2 presents the problem statement. Section 3 introduces the concept of probability weighting functions. In Sect. 4, rigorous convergence and stability analyses of both the general and Monte Carlo versions of the CWO method are presented; this section also provides additional analysis for the case when the optimizers are isolated and the family of distributions used has continuous densities, due to the additional assumptions required. Finally, Sect. 5 describes a few CWO numerical algorithms and tabulates their simulation results.

## 2 Problem

Consider the global optimization problem \(x^{*}\in \arg \max _{x\in \mathbf {X}}H\left( x\right) \) [Eq. (1)], where the solution space \(\mathbf {X}\) is a subset of \(\mathbf {R}^{n}\) (the *n*-dimension real space). \(H:\mathbf {X}\mapsto \mathbf {R}\), the objective function, is a bounded deterministic measurable function possibly with multiple local extrema. The set of optimizers for Eq. (1) is denoted by \(\mathbf {X}^{*}:=\left\{ x^{*}\in \mathbf {X}|H\left( x\right) \le H\left( x^{*}\right) ,\;\forall x\in \mathbf {X}\right\} \). The following assumption holds throughout this paper.

### **Assumption 1**

There exists a global optimal solution to Eq. (1), i.e., \(\mathbf {X}^{*}\) is nonempty.

In practice, this assumption is true for many optimization problems; for example, it holds trivially when *H* is continuous. In general, the objective function lacks properties such as convexity and differentiability. Let the set of non-negative reals be denoted by \(\mathbf {R}^{+}.\) As is common in many situations, a measurable, strictly increasing *fitness function*, \(\phi :\mathbf {R}\mapsto \mathbf {R}^{+},\) is introduced to reformulate Eq. (1) as: \(x^{*}\in \arg \max _{x\in \mathbf {X}}\phi \left( H\left( x\right) \right) .\) A similar fitness-function-modified problem statement can be found in Hu et al. [1].

### *Remark 1*

Since the reformulated problem guarantees that the range of the new fitness-objective function [i.e., \(\phi \left( H(\cdot )\right) \)] is non-negative and has the same optimizers as the original problem, we only need to consider the case where *H* is non-negative in Eq. (1), i.e., \(H:\mathbf {X}\mapsto \mathbf {R}^{+}.\)

## 3 Probability weighting functions

Probability weighting functions have many applications in science and engineering. Kahneman and Tversky [3] proposed the original Prospect Theory (PT) in the 1970s, which has probabilistic weighting as one of its main features. They were unsatisfied with PT due to its violation of stochastic dominance, and thus suggested CPT in the 1990s [4]. CPT improves PT by re-weighting outcome cumulative distribution functions (CDFs) instead of outcome probability density functions (PDFs). An example of the weighting functions used by CPT is \(w\left( p\right) :=\frac{p^{\gamma }}{\left( p^{\gamma }+\left( 1-p\right) ^{\gamma }\right) ^{1/\gamma }},\;\gamma \in \left( 0,1\right) ,\, p\in \left[ 0,1\right] ,\) which can be applied to a CDF or a complementary CDF. The formal definition of a weighting function is presented below.

### **Definition 1**

A *weighting function*, \(w:[0,1]\mapsto [0,1]\), is a monotonically non-decreasing and Lipschitz continuous function with \(w(0)=0\) and \(w(1)=1\).

There are a few well-known weighting functions: 1) a simple *polynomial* weighting function has the form: \(w\left( p\right) =1-\left( 1-p\right) ^{b},\ b>1;\) 2) a more complicated weighting function involving *exponentials* has the form: \(w(p)=\frac{\text {e}^{cp}-1}{\text {e}^{c}-1},\) where \(c<0\). Other parametric weighting functions can be found in [5].
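As a quick sanity check (an illustration, not part of the paper's numerical experiments), the three parametric forms above can be verified against Definition 1; the parameter values below are illustrative choices. The polynomial and exponential forms additionally satisfy the inequality \(w(p)>p\) on \((0,1)\) (cf. Proposition 1).

```python
# Sketch of the weighting functions mentioned in the text: the CPT form,
# the polynomial form, and the exponential form (parameter values are
# illustrative, not from the paper).
import math

def w_cpt(p, gamma=0.61):
    # w(p) = p^gamma / (p^gamma + (1-p)^gamma)^(1/gamma), gamma in (0,1)
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def w_poly(p, b=2.0):
    # w(p) = 1 - (1-p)^b, b > 1
    return 1 - (1 - p) ** b

def w_exp(p, c=-2.0):
    # w(p) = (e^{cp} - 1) / (e^c - 1), c < 0
    return (math.exp(c * p) - 1) / (math.exp(c) - 1)

for w in (w_cpt, w_poly, w_exp):
    # Definition 1: w(0) = 0, w(1) = 1, monotonically non-decreasing
    assert abs(w(0.0)) < 1e-12 and abs(w(1.0) - 1.0) < 1e-12
    grid = [i / 100 for i in range(101)]
    vals = [w(p) for p in grid]
    assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))

# The polynomial and exponential forms also satisfy w(p) > p on (0,1),
# the optimal-seeking inequality of Proposition 1.
assert all(w_poly(p) > p for p in (0.1, 0.5, 0.9))
assert all(w_exp(p) > p for p in (0.1, 0.5, 0.9))
```

Note that the CPT form is not asserted to satisfy \(w(p)>p\) everywhere: for typical \(\gamma \) it overweights small probabilities but underweights large ones.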

In later sections, weighting functions are used in models to update a PDF over the solution space; hence, we are interested in weighting functions with the additional property of optimal-seeking; this is important for guaranteeing CWO’s convergence to optimality.

### **Definition 2**

A weighting function *w* is *optimal-seeking* if

### **Proposition 1**

An optimal-seeking weighting function satisfies the inequality \(w(p)>p,\;\forall p\in \left( 0,1\right) .\)

### *Proof*

Optimal-seeking is called risk-seeking in fields that model risk-sensitivity. In this paper, we only consider optimal-seeking weighting functions. \(\square \)

### **Assumption 2**

*w* is an optimal-seeking weighting function.

In its historical application, an optimal-seeking weighting function places more weight on highly rewarding outcomes. In particular, it is used to overweight the probabilities of unlikely events and underweight the probabilities of highly likely events. In our context, the optimal-seeking property of the weighting function is used to place more weight on higher ranked or more desirable outcomes. In the example below, we apply an optimal-seeking weighting function to a complementary CDF.

### *Example 1*

A die is rolled and the player receives a payoff equal to the outcome of the roll. For example, if the player rolls a 1, then he/she is given a $1 reward. The expected payoffs for both the risk-neutral and optimal-seeking cases are calculated below assuming the die is fair. The outcome of the roll is a random variable denoted by *R*.

The risk-neutral expected payoff is calculated by summing tail probabilities: \(E\left[ R\right] =\sum _{n=1}^{6}\left( 1-F\left( n-1\right) \right) =\sum _{n=1}^{6}\frac{n}{6}=\frac{21}{6}=3.5,\) where \(F\) is the CDF of *R*, \(1-F\left( n-1\right) =P\left( R\ge n\right) ,\) and the second equality follows by reindexing. Using the weighting function \(w\left( p\right) =1-\left( 1-p\right) ^{2},\) the corresponding optimal-seeking re-weighted expected payoff is: \(E^{w}\left[ R\right] =\sum _{n=1}^{6}w\left( 1-F\left( n-1\right) \right) =\sum _{n=1}^{6}w\big (\frac{n}{6}\big )=\sum _{n=1}^{6}\left( 1-\big (1-\frac{n}{6}\big )^{2}\right) =\frac{161}{36}\approx 4.4722.\)
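The arithmetic of Example 1 can be verified with exact rational arithmetic (an illustration, not part of the paper):

```python
# Numerical check of Example 1 (fair six-sided die): the risk-neutral
# expected payoff versus the re-weighted payoff under w(p) = 1 - (1-p)^2.
from fractions import Fraction

# P(R >= n) = (7 - n)/6 for n = 1, ..., 6; summing these tail
# probabilities gives the expectation of a positive integer-valued
# random variable.
tails = [Fraction(7 - n, 6) for n in range(1, 7)]

risk_neutral = sum(tails)               # E[R] = 21/6 = 3.5
w = lambda p: 1 - (1 - p) ** 2          # optimal-seeking weighting
reweighted = sum(w(p) for p in tails)   # E^w[R] = 161/36

assert risk_neutral == Fraction(21, 6)
assert reweighted == Fraction(161, 36)
assert reweighted > risk_neutral        # cf. Remark 2
```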

### *Remark 2*

Observe that the optimal-seeking re-weighted expected payoff is greater than the risk-neutral one; this inequality is key in proving the convergence of the CWO method.

## 4 Convergence and stability analysis

Convergence and stability are two desirable properties for any global optimization method. In particular, we would like to provide a theoretical guarantee that the CWO method will converge to an optimal solution and remain there under “reasonable” disturbances. These properties are proven for both the general and Monte Carlo versions of the CWO method in this section, with each version explained in detail in its own subsection. We treat the case of isolated optimizers separately due to the additional assumptions it requires and the slight modification of the standard approach.

### 4.1 General theory

The best way to gain some intuition for the CWO method is to understand the finite solution space case. To make the idea concrete, let \(\mathbf {X}\) in Eq. (1) be the set \(\left\{ 1,\dots ,N\right\} .\) In this case, Assumption 1 is trivially satisfied and Assumption 2 is always true in our analysis. Let \({\mathbb {P}}_{x}\) denote the set of probability mass functions (PMFs) over \(\mathbf {X},\) and the set of PMFs exclusively supported on optimal solutions is denoted by \({\mathbb {P}}_{\mathbf {X}^{*}}:=\left\{ P\in {\mathbb {P}}_{x}|\sum _{x\in \mathbf {X}^{*}}P\left( x\right) =1\right\} .\) Since any element of \({\mathbb {P}}_{\mathbf {X}^{*}}\) has positive weight assigned to optimal solutions and zero weight assigned to non-optimal solutions, it follows that finding an element of \({\mathbb {P}}_{\mathbf {X}^{*}}\) solves Eq. (1).

Let \(P_{t}\) denote the PMF over the solution space at time *t*, i.e., \(P_{t}\in {\mathbb {P}}_{x};\) then solving Eq. (1) is equivalent to finding an algorithm that updates \(P_{t}\) iteratively such that \(P_{t}\in {\mathbb {P}}_{\mathbf {X}^{*}},\;\forall t>\tau ,\) for some \(\tau \in \left( 0,\infty \right) .\) If \(P_{0}\) places positive mass on at least one of the optimal solutions, i.e., \(P_{0}\left( \mathbf {X}^{*}\right) >0,\) then one way of ensuring \(P_{t}\) eventually reaches \({\mathbb {P}}_{\mathbf {X}^{*}}\) is by using an optimal-seeking weighting function. The idea is best demonstrated by introducing a step size variable \(\varDelta \). Let the set-valued map \(M:\mathbf {X}\mapsto 2^{\mathbf {X}}\) return all elements in the solution space with the same objective value, i.e., \(M\left( x\right) :=\left\{ \xi \in \mathbf {X}|H\left( \xi \right) =H\left( x\right) \right\} ;\) then \(P_{t}\) can be updated according to the following equation:

where *w* is an optimal-seeking weighting function. In Eq. (2), the difference between the arguments of the first *w*-distorted term and the second *w*-distorted term is the set \(\left\{ \xi \in \mathbf {X}|H\left( \xi \right) =H\left( x\right) \right\} .\) In the finite solution space case, we can verify that \(P_{t+\varDelta }\) is indeed a probability measure by summing over \(\mathbf {X}\), i.e., \(\sum _{x\in \mathbf {X}}P_{t+\varDelta }\left( x\right) \), and checking that the sum is 1. Since for each \(x\in \mathbf {X}\) the negative term in Eq. (2) cancels with a positive term in the summation, except for \(w\left( 1\right) =1\), it is verified that \(\sum _{x\in \mathbf {X}}P_{t+\varDelta }\left( x\right) =1\) and \(P_{t+\varDelta }\) is a probability measure.

In the discrete-time update, *t* can be treated as the iteration count. Below, \(E_{t}\left[ H\right] \) denotes the expectation of *H* under \(P_{t}\), where *H* is treated as a random variable. The following example illustrates, in continuous time, that \(E_{t}\left[ H\right] \) is strictly increasing in *t* unless \(P_{t}\) reaches \({\mathbb {P}}_{\mathbf {X}^{*}}.\)

### *Example 2*

Consider a four-element solution space and an optimal-seeking weighting function *w*. We assume that \(H(4)=H(3)>H(2)>H(1)\ge 0\). The continuous-time analogue of Eq. (2) for this example is written as

### *Remark 3*

Since the weight on optimal solutions monotonically increases and approaches 1, it follows that the corresponding re-weighted expected payoff is monotonically increasing in *t* until optimality.

Thus, the CWO method updates \(P_{t}\) iteratively, so that \(P_{t}\) approaches \({\mathbb {P}}_{\mathbf {X}^{*}}\) asymptotically. The limit of \(P_{t}\) as \(t\rightarrow \infty \) only has weight on optimal solutions; hence, optimal solutions can be inferred from it.
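Eq. (2) is not reproduced in this excerpt, so the sketch below (a Python illustration under stated assumptions, not the paper's exact equation) implements one plausible discrete-time update consistent with the telescoping argument above: each level set \(M(x)\) receives the increment \(w(P_{t}(H\ge H(x)))-w(P_{t}(H>H(x)))\), split among its members proportionally to their current mass, and the step size \(\varDelta \) mixes this with the current PMF.

```python
# A hedged sketch of a finite-solution-space CWO iteration. The update
# form below is an assumption for illustration, consistent with the
# telescoping argument in the text, not the paper's exact Eq. (2).

def cwo_step(P, H, w, delta):
    n = len(P)
    newP = [0.0] * n
    for x in range(n):
        ge = sum(P[i] for i in range(n) if H[i] >= H[x])  # P_t(H >= H(x))
        gt = sum(P[i] for i in range(n) if H[i] > H[x])   # P_t(H >  H(x))
        level = sum(P[i] for i in range(n) if H[i] == H[x])  # P_t(M(x))
        share = P[x] / level if level > 0 else 0.0
        # mix the current mass with the w-distorted level increment
        newP[x] = (1 - delta) * P[x] + delta * share * (w(ge) - w(gt))
    return newP

w = lambda p: 1 - (1 - p) ** 2       # polynomial optimal-seeking weighting
H = [1.0, 2.0, 3.0, 3.0]             # X = {1,2,3,4}, optimizers {3,4}
P = [0.25] * 4                       # uniform initial PMF, P_0(X*) > 0

for _ in range(60):
    P = cwo_step(P, H, w, delta=0.5)
    assert abs(sum(P) - 1.0) < 1e-9  # each iterate remains a PMF

assert P[2] + P[3] > 0.99            # mass concentrates on the optimizers
```

With this form, the per-level increments telescope to \(w(1)=1\), so each iterate remains a PMF, and since \(w(p)>p\) the total mass on the optimal level is monotonically driven toward 1.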

The finite solution space case offers the most intuition; however, to apply the CWO method to a wide variety of problems, we need to work with more general solution spaces. In the rest of this section, Eq. (1) is solved given that \(\mathbf {X}\) is a compact subset of a finite-dimensional vector space, i.e., \(\mathbf {R}^{n}\). Similar to the finite solution space case, the set of probability measures defined on the Borel measurable space \(\left( \mathbf {X},{\mathscr {B}}\left( \mathbf {X}\right) \right) \) is denoted by \({\mathbb {P}}_{x},\) equipped with the Prohorov topology. We refer the reader to [6] and [7] for technical details on the Prohorov topology. While the following assumption is not strictly required for the CWO method, it is used in the analysis for ease of notation.

### **Assumption 3**

*w* is differentiable and has a bounded first derivative, which is denoted by \(w'\).

In general, boundedness of the sub-gradient should be sufficient for the application of the CWO method.

At time *t*, the push-forward measure of \(P_{t}\) through *H* in Eq. (1) is denoted by \(P_{t}^{H}\), i.e., \(P_{t}^{H}\left( B\right) :=P_{t}\left( H^{-1}\left( B\right) \right) ,\;\forall B\in {\mathscr {B}}\left( \mathbf {R}^{+}\right) \) [Eq. (5)], where \(H^{-1}\left( B\right) \) denotes the preimage of *B* under *H*, and \({\mathscr {B}}\left( \mathbf {R}^{+}\right) \) denotes the Borel \(\sigma \)-algebra for \(\mathbf {R}^{+}.\) For the justification of using \(\mathbf {R}^{+}\) in Eq. (5), see Remark 1. Furthermore, *H* can be treated as a random variable from \(\left( \mathbf {X},{\mathscr {B}}\left( \mathbf {X}\right) \right) \) to \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \).

Wang [8] proposes an alternative set of evolution equations, also known as nonlinear Fokker–Planck equations [9, 10], motivated by evolutionary game theory. As the reader will see, we reach the same convergence results as Wang et al. [11] with a modified approach.

Similar to the finite solution space case, the set of probability measures exclusively supported on optimal solutions is denoted by \({\mathbb {P}}_{\mathbf {X}^{*}}:=\left\{ P\in {\mathbb {P}}_{x}|P\left( \mathbf {X}^{*}\right) =1\right\} ,\) where \(\mathbf {X}^{*}\) denotes the set of optimal solutions, i.e., \(\mathbf {X}^{*}:=\left\{ x^{*}\in \mathbf {X}|H(x)\le H(x^{*}),\ \forall x\in \mathbf {X}\right\} .\) The reader is reminded that obtaining an element of \({\mathbb {P}}_{\mathbf {X}^{*}}\) is equivalent to solving the optimization problem stated in Eq. (1). The goal is to prove that Eqs. (6–7) update \(P_{t}\) such that \(P_{t}\) approaches \({\mathbb {P}}_{\mathbf {X}^{*}}\) asymptotically. The first step is to prove the existence and uniqueness of a solution for Eq. (6).

### **Theorem 1**

For each \(P_{0}\in {\mathbb {P}}_{x}\) and its corresponding push-forward measure \(P_{0}^{H}\), the ordinary differential equation (6) has a unique solution for \(t\in \mathbf {R}^{+}\).

### *Proof*

The right-hand side of Eq. (6), viewed as a mapping \({\mathscr {C}}\) applied to a probability measure *P* over \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \) at time *t*, is denoted by:

where *K* is the Lipschitz constant for *w*. The inequality above proves the boundedness of \({\mathscr {C}}\left( P_{t}^{H}\right) \).

Next, \(P_{t}^{H}\) is proved to be a probability measure over \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \) for any *t*.

### **Lemma 1**

\(P_{t}^{H}\) is a probability measure over \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \) for all \(t\in \mathbf {R}^{+}.\)

### *Proof*

If we can prove that \(\dot{P}_{t}^{H}\left( \mathbf {R}^{+}\right) =0\) and \(\dot{P}_{t}^{H}\left( \cup _{i}B_{i}\right) =\sum _{i}\dot{P}_{t}^{H}\left( B_{i}\right) \), then we have obtained our desired result. Using Eq. (6), the fact that \(\int _{0}^{1}w'\left( s\right) ds=w\left( 1\right) -w\left( 0\right) =1,\) \(w'\) is bounded, and the dominated convergence theorem, it is straightforward to prove this assertion. \(\square \)

The next Lemma is needed in Theorem 5, which shows \(E_{t}\left[ H\right] \) is monotonically increasing in *t* [cf. Remark 2 and Eq. (4)].

### **Lemma 2**

For any optimal-seeking weighting function *w*, there exists a \(\tilde{y}\in \mathbf {R}^{+}\) such that

### *Proof*

Since *w* is a monotonically non-decreasing function, it satisfies

Since *w* is also optimal-seeking, we have

for every such *y*. \(\square \)

The theorems below present a blueprint for obtaining an element of \({\mathbb {P}}_{\mathbf {X}^{*}}\) utilizing the solution \(P_{t}\) of Eqs. (6–7). To accomplish this goal, the initial points in Theorem 1 are restricted to measures \(P_{0}\) that allow \(P_{t}\) to approach \({\mathbb {P}}_{\mathbf {X}^{*}},\) i.e., \(\lim _{t\rightarrow \infty }P_{t}\in {\mathbb {P}}_{\mathbf {X}^{*}}.\) The following definition helps us present this idea succinctly.

### **Definition 3**

The set of *optimal initial solution probability measures* is denoted by \({\mathbb {I}}_{\mathbf {X}^{*}}:=\left\{ P\in {\mathbb {P}}_{x}|P\left( \mathbf {X}^{*}\right) >0\right\} ,\) and the set of *H* push-forward optimal probability measures over \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \) is denoted by \({\mathbb {I}}_{H^{*}}:=\left\{ P\circ H^{-1}|P\in {\mathbb {I}}_{\mathbf {X}^{*}}\right\} .\)

Definition 3 is essential: the requirement on the initial condition of the system (i.e., \(P_{0}\in {\mathbb {I}}_{\mathbf {X}^{*}}\)) is what guarantees its convergence and stability. This condition can be too stringent when the optimizers are isolated and \({\mathbb {I}}_{\mathbf {X}^{*}}\) is restricted to measures with continuous densities, since no positive measure can be placed on isolated points. In the next section, we address this issue by modifying the definition of \({\mathbb {I}}_{\mathbf {X}^{*}}\) and adding an extra assumption on the objective function *H*. The next theorem proves that \(P_{t}^{H}\left( y^{*}\right) \), the probability mass at the optimal objective value \(y^{*}:=\max _{x\in \mathbf {X}}H\left( x\right) \), converges to 1 as \(t\rightarrow \infty ;\) on the other hand, the probability mass on non-optimal values approaches zero as \(t\rightarrow \infty .\)

### **Theorem 2**

- 1. \(P_{t}^{H}\left( y^{*}\right) \) is a monotonically non-decreasing function of *t* that converges to 1 as \(t\rightarrow \infty \);
- 2. \(P_{t}^{H}\left( \mathbf {R}^{+}\backslash y^{*}\right) \rightarrow 0\) as \(t\rightarrow \infty \).

### *Proof*

\(\square \)

The next theorem connects the properties of \(P_{t}^{H}\) with those of \(P_{t}\) as \(t\rightarrow \infty .\) This is an important step for understanding the evolution of Eqs. (6–7).

### **Theorem 3**

If \(P_{0}\in {\mathbb {I}}_{\mathbf {X}^{*}}\) and \(P_{t}\) is a solution of Eqs. (6–7), then the following claims hold: 1) \(\lim _{t\rightarrow \infty }P_{t}\left( \mathbf {X}^{*}\right) =1\); 2) \(\lim _{t\rightarrow \infty }P_{t}\left( \mathbf {X}\backslash \mathbf {X}^{*}\right) =0\).

### *Proof*

We are interested in finding the limit points of Eqs. (6–7). Ideally, these limit points should be elements in \({\mathbb {P}}_{\mathbf {X}^{*}}.\) In order to guarantee this, we restrict the initial points to be elements of \({\mathbb {I}}_{\mathbf {X}^{*}}.\) To facilitate our discussion, we introduce the following definition.

### **Definition 4**

The *limit set* of Eqs. (6–7) from an initial set \(\mathcal {I}\) is denoted by

### **Theorem 4**

\(\mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] ={\mathbb {P}}_{\mathbf {X}^{*}}.\)

### *Proof*

The proof proceeds in two parts: first proving \(\mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \supset {\mathbb {P}}_{\mathbf {X}^{*}},\) then proving \(\mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \subset {\mathbb {P}}_{\mathbf {X}^{*}}.\) For the first part, take an element \(P\in {\mathbb {P}}_{\mathbf {X}^{*}}.\) Since \({\mathbb {P}}_{\mathbf {X}^{*}}\subset {\mathbb {I}}_{\mathbf {X}^{*}},\) the definition of \(\mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \) (i.e., the limit set of Eqs. (6–7) starting from \({\mathbb {I}}_{\mathbf {X}^{*}}\)) gives \(P\in \mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \).

The following theorem shows the monotonically increasing nature of \(E_{t}\left[ H\right] \), which will be useful later in proving the stability of Eqs. (6–7).

### **Theorem 5**

- 1. \(E_{t}\left[ H\right] \) is monotonically non-decreasing in \(t\in \mathbf {R}^{+}\);
- 2. if \(P_{\tau }\notin \mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] ,\) then \(E_{\tau }\left[ H\right] \) is strictly increasing in \(\tau \in \mathbf {R}^{+}\).

### *Proof*

The second claim is proved by contradiction. Assume \(P_{\tau }\notin \mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] ,\) i.e., \(P_{\tau }\) is not a limit point, and \(\frac{d}{d\tau }E_{\tau }\left[ H\right] =0,\) i.e., \(E_{\tau }\left[ H\right] \) is not strictly increasing at \(\tau .\) Along with Theorem 2, this equality implies that *H* is \(P_{\tau }\)-almost surely equal to the constant \(C=\max _{x\in \mathbf {X}}H\left( x\right) \). This implies that \(P_{\tau }^{H}\) is a Dirac measure concentrated at *C*, which means \(P_{\tau }\in \mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \) (cf. Theorem 4), a contradiction. \(\square \)

At this point, we have demonstrated the convergence to optimality property of our method. We now explore the stability property of our method. The metric function *d* in the following definitions is the Prohorov metric found in the appendix of [6, p. 170].

### **Definition 5**

Define the distance between a probability measure *P* and a set of probability measures \({\mathcal {L}}\) as \(d\left( P,{\mathcal {L}}\right) :=\inf _{Q\in {\mathcal {L}}}d\left( P,Q\right) .\) A set \({\mathcal {L}}\) is *Lyapunov stable*, with respect to a sequence of measures \(\left\{ P_{t}\right\} \), if for all \(\epsilon >0\), there exists a \(\delta >0\) such that \(d\left( P_{0},{\mathcal {L}}\right) <\delta \) implies \(d\left( P_{t},{\mathcal {L}}\right) <\epsilon ,\;\forall t\in \mathbf {R}^{+}.\) A set \({\mathcal {L}}\) is *asymptotically stable*, with respect to a sequence of measures \(\left\{ P_{t}\right\} \), if \({\mathcal {L}}\) is Lyapunov stable, and there exists a \(\delta >0\) such that \(d\left( P_{0},{\mathcal {L}}\right) <\delta \) implies \(\lim _{t\rightarrow \infty }d\left( P_{t},{\mathcal {L}}\right) =0.\)

The next theorem is the main result of this section.

### **Theorem 6**

\(\mathbb {L}\left[ {\mathbb {I}}_{\mathbf {X}^{*}}\right] \) is a compact set and it is asymptotically stable.

### *Proof*

The use of a Lyapunov function for proving the asymptotic stability of the limit set can be found previously in Wang’s dissertation [8].

### 4.2 Isolated optimizers

In Sect. 4.1, we required that the initial distribution \(P_{0}\) belong to \({\mathbb {I}}_{\mathbf {X}^{*}},\) i.e., \(P_{0}\left( \mathbf {X}^{*}\right) >0\). This is reasonable in many cases: (1) if the solution space is large but finite; (2) if the optimizers are not isolated. However, when the optimizers are isolated and \({\mathbb {I}}_{\mathbf {X}^{*}}\) is restricted to measures with continuous densities, the condition \(P_{0}\left( \mathbf {X}^{*}\right) >0\) cannot be satisfied, because the probability measure of a single point is always zero. For example, when minimizing \(H\left( x\right) =x^{2}\), it is convenient to start with a Gaussian distribution due to its simple form; since Gaussian measures have continuous densities, it is impossible to satisfy \(P_{0}\left( \mathbf {X}^{*}\right) >0\). Thus, we slightly modify the definition of \({\mathbb {I}}_{\mathbf {X}^{*}}\) and the statements of the theorems in the previous section. The modifications lead to the conclusion that the weight-update system converges to measures that place all their weight in ball neighborhoods of the optimizers. In the modified definition, the positive-measure requirement is satisfied as long as the measure is positive on every neighborhood of at least one element of \(\mathbf {X}^{*}.\) In other words, as long as the initial measure has not excluded all neighborhoods of all optimizers, it is included in the admissible initial probability measure set \(\tilde{{\mathbb {I}}}_{\mathbf {X}^{*}},\) defined below.

### **Definition 6**

The set of *optimal initial solution probability measures* for the isolated-optimizers case is denoted by \(\tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}:=\left\{ P\in {\mathbb {P}}_{x}|\exists x^{*}\in \mathbf {X}^{*}\ \text {such that}\ P\left( B\left( x^{*},\delta \right) \right) >0,\;\forall \delta >0\right\} ,\) where \(B\left( x^{*},\delta \right) \) denotes the open ball of radius \(\delta \) centered at \(x^{*}\). The set of *H* push-forward optimal probability measures over \(\left( \mathbf {R}^{+},{\mathscr {B}}\left( \mathbf {R}^{+}\right) \right) \) is denoted by \(\tilde{{\mathbb {I}}}{}_{H^{*}}:=\left\{ P\circ H^{-1}|P\in \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\right\} .\)

An example illustrates the reasonableness of \(\tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\) for the isolated optimizers case. If we would like to minimize the function \(H\left( x\right) =x^{2},\) which has a single isolated optimizer at \(x=0,\) then any Gaussian distribution satisfies the requirement imposed by \(\tilde{{\mathbb {I}}}_{\mathbf {X}^{*}},\) since for any \(\delta \)-neighborhood around \(x=0\), the measure under \(P_{0}\) is non-zero. Having redefined the admissible initial distributions, we need an additional continuity assumption on the objective function *H*.
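A small numerical illustration of this point (not from the paper): for a standard Gaussian \(P_{0}\), the mass of the singleton \(\{0\}\) is zero, yet every \(\delta \)-neighborhood of the optimizer \(x^{*}=0\) carries strictly positive mass.

```python
# Why a Gaussian P_0 fails P_0(X*) > 0 but is admissible under the
# modified neighborhood condition, for H(x) = x^2 with x* = 0.
import math

def gaussian_mass(a, b, mu=0.0, sigma=1.0):
    # P(a < X < b) for X ~ N(mu, sigma^2), via the error function
    z = lambda x: (x - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (math.erf(z(b)) - math.erf(z(a)))

# Mass at the single point {0} is zero, so P_0(X*) > 0 fails ...
assert gaussian_mass(0.0, 0.0) == 0.0

# ... yet every delta-neighborhood of x* = 0 has strictly positive mass,
# so P_0 satisfies the modified admissibility requirement.
for delta in (1.0, 1e-2, 1e-6):
    assert gaussian_mass(-delta, delta) > 0.0
```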

### **Assumption 4**

*H*, the objective function, is continuous at all \(x^{*}\in \mathbf {X}^{*}\).

Assumption 4 is needed so that neighborhoods around elements of \(\mathbf {X}^{*}\) have objective values close to the optimal value \(y^{*}\). In the rest of this section, we state modified versions of the theorems of the previous section, omitting redundant details.

### **Theorem 7**

- 1. \(P_{t}^{H}\left( \left[ y^{*}-\epsilon ,\infty \right) \right) \) is a monotonically non-decreasing function of *t* that converges to 1 as \(t\rightarrow \infty \);
- 2. \(P_{t}^{H}\left( \mathbf {R}^{+}\backslash \left[ y^{*}-\epsilon ,\infty \right) \right) \rightarrow 0\) as \(t\rightarrow \infty \).

### *Proof*

### **Theorem 8**

If \(P_{0}\in \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\) and \(P_{t}\) is a solution of Eqs. (6–7), then the following claims hold for any \(\delta >0\), where \(\mathbf {X}^{\delta ,*}\) denotes the \(\delta \)-neighborhood of \(\mathbf {X}^{*}\): 1) \(\lim _{t\rightarrow \infty }P_{t}\left( \mathbf {X}^{\delta ,*}\right) =1\); 2) \(\lim _{t\rightarrow \infty }P_{t}\left( \mathbf {X}\backslash \mathbf {X}^{\delta ,*}\right) =0\).

### *Proof*

Using the fact that the objective function is continuous around the optimizers (i.e., Assumption 4), we know that there exists a corresponding \(\delta \) for each \(\epsilon .\) Furthermore, since \(\epsilon \) is arbitrary in Theorem 7 and we can always shrink \(\delta \) by shrinking \(\epsilon \) around the isolated optimizers, \(\delta \) can be made arbitrarily small as well. \(\square \)

The next theorem states that if we start from \(\tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\), the limit set will place all the weight in the neighborhoods of the optimizers.

### **Theorem 9**

The next theorem is needed in our final conclusion stated in Theorem 11.

### **Theorem 10**

- 1. \(E_{t}\left[ H\right] \) is monotonically non-decreasing in \(t\in \mathbf {R}^{+}\);
- 2. if \(P_{\tau }\notin \mathbb {L}\left[ \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\right] ,\) then \(E_{\tau }\left[ H\right] \) is strictly increasing in \(\tau \in \mathbf {R}^{+}\).

### *Proof*

See the proof for Theorem 5.

Finally, by using Theorems 7 and 10 and applying a Lyapunov function, we have our desired conclusion.

### **Theorem 11**

\(\mathbb {L}\left[ \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\right] \) is a compact set and it is asymptotically stable.

In other words, the modified theorems state that if Eqs. (6–7) start with a probability measure that places some weight on an arbitrarily small neighborhood of at least one optimizer, then the system will converge to an element of \(\mathbb {L}\left[ \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\right] \), the set of probability measures that place weight only on arbitrarily small neighborhoods of the optimizers. Furthermore, \(\mathbb {L}\left[ \tilde{{\mathbb {I}}}_{\mathbf {X}^{*}}\right] \) is asymptotically stable.

### 4.3 Monte Carlo version

In Sect. 4.1, we demonstrated that when the solution space is a subset of \(\mathbf {R}^{n}\), the CWO method exhibits convergence and stability properties that are desirable for any optimization method. Those theorems assume the probability measure can be modeled perfectly; however, there was no result on the convergence of the CWO method's Monte Carlo version (cf. Algorithm 1), which is important in practice. The analysis techniques applied and the convergence proved in this section differ significantly from those of the general version. The difference is caused by the two layers of approximation used for efficient simulation: projection and sampling. In addition, we apply the analysis techniques in [2] to Eq. (16), which is a more general version of the equations considered previously for model-based methods.

Previously, \(\beta \) was allowed to depend only on time *t*; however, in this section, we require a more explicit structure on \(\beta \), namely, it depends on *x* and \(p_{t}\). To recap, the relevant assumptions for this section are Assumptions 1, 2, and 3. Furthermore, we will use *t* to denote time and *k* to denote the iteration count going forward.

At each iteration, the *reference* PDF \(p_{k}\) is updated through its *surrogate* PDF \(f_{\theta }\). We adopt the notations \(P_{\theta }\left( \cdot \right) \) and \(E_{\theta }\left[ \cdot \right] \) for the probability measure and expectation with respect to the surrogate PDF \(f_{\theta }.\) On the other hand, \(P_{k}\left( \cdot \right) \) and \(E_{k}\left[ \cdot \right] \) denote the probability measure and expectation of the reference PDF at the *k*-th iteration. With a slight abuse of notation, we write *X* for both \(X_{k}\) and \(X_{\theta },\) the random variables with the PDFs \(p_{k}\) and \(f_{\theta }\), respectively.
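Since the abstract notes that the well-known cross-entropy (CE) method is a CWO algorithm, a minimal CE sketch illustrates the reference/surrogate interplay described above. The Gaussian surrogate, elite fraction, and objective below are illustrative choices, not the paper's Eq. (14).

```python
# A minimal cross-entropy sketch (standard CE, not the paper's exact
# update): sample from a Gaussian surrogate f_theta, keep the elite
# samples, and refit theta = (mu, sigma) to them.
import random
import statistics

def ce_maximize(H, mu=0.0, sigma=5.0, n=200, elite_frac=0.2,
                iters=40, seed=0):
    rng = random.Random(seed)
    k = max(2, int(n * elite_frac))
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]     # sample surrogate
        elite = sorted(xs, key=H, reverse=True)[:k]       # best samples
        mu = statistics.fmean(elite)                      # refit mean
        sigma = statistics.stdev(elite) + 1e-12           # refit spread
        if sigma < 1e-6:                                  # surrogate collapsed
            break
    return mu

H = lambda x: -(x - 2.0) ** 2          # unique maximizer at x* = 2
x_star = ce_maximize(H)
assert abs(x_star - 2.0) < 0.05
```

The refit step is the projection onto the surrogate family; the sampling step replaces exact expectations with Monte Carlo estimates, which is exactly the two-layer approximation discussed above.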

In many applications, natural exponential families (NEFs) can be used as the surrogate parameterized family of PDFs. They are convenient in implementations and lead to closed-form expressions in our analysis. Their definition, from [2, Definition 2.1], is presented below.

### **Definition 7**

A parameterized family \(\mathbb {F}:=\left\{ f_{\theta }|\theta \in \varvec{\Theta }\subseteq \mathbf {R}^{d}\right\} \) on \(\mathbf {X}\) is called a natural exponential family (NEF) if there exist continuous mappings \(\varGamma :\mathbf {R}^{n}\mapsto \mathbf {R}^{d}\) and \(K:\mathbf {R}^{d}\mapsto \mathbf {R}\) such that \(f_{\theta }\left( x\right) =\exp \left( \theta ^{T}\varGamma \left( x\right) -K\left( \theta \right) \right) ,\) where \(\varvec{\Theta }:=\left\{ \theta \in \mathbf {R}^{d}|\left| K\left( \theta \right) \right| <\infty \right\} \) is the natural parameter space and \(K\left( \theta \right) =\ln \int _{\mathbf {X}}\exp \left( \theta ^{T}\varGamma \left( x\right) \right) \mu \left( \text {d}x\right) .\)

### *Remark 4*

Let the interior of \(\varvec{\Theta }\) be denoted by \(\mathring{\varvec{\Theta }}.\) In this section, we use the following properties of the family from [16]: (1) \(K\left( \theta \right) \) is strictly convex on \(\mathring{\varvec{\Theta }};\) (2) the Jacobian of \(K\left( \theta \right) \) is \(E_{\theta }\left[ \varGamma \left( X\right) \right] \), i.e., \(\nabla K\left( \theta \right) =E_{\theta }\left[ \varGamma \left( X\right) \right] ;\) (3) the Hessian matrix of \(K\left( \theta \right) \) is \(\mathrm {Cov_{\theta }\left[ \varGamma \left( X\right) \right] },\) where \(\mathrm {Cov_{\theta }\left[ \cdot \right] }\) is the covariance with respect to \(f_{\theta }.\) Therefore, the Jacobian of the parameterized *mean vector function*, \(m\left( \theta \right) :=E_{\theta }\left[ \varGamma \left( X\right) \right] ,\) is strictly positive definite; along with the inverse function theorem, this implies that \(m\left( \theta \right) \) is invertible. We can therefore iterate our algorithm on \(m\left( \theta \right) \) instead of \(\theta ,\) and recover \(\theta \) as needed via the inverse function \(m^{-1}\left( \cdot \right) .\)
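Property (2) of Remark 4 can be checked numerically under an illustrative choice not taken from the paper: \(\mathbf {X}=[0,1]\), \(\varGamma (x)=x\), and \(\mu \) the Lebesgue measure, giving \(K(\theta )=\ln \left( (\mathrm {e}^{\theta }-1)/\theta \right) \).

```python
# Numerical check of grad K(theta) = E_theta[Gamma(X)] for the NEF
# f_theta(x) = exp(theta*x - K(theta)) on X = [0,1] (illustrative
# family; any NEF would do).
import math

def K(theta):
    # K(theta) = ln \int_0^1 exp(theta*x) dx = ln((e^theta - 1)/theta)
    return math.log((math.exp(theta) - 1.0) / theta)

def mean_via_quadrature(theta, n=100000):
    # E_theta[Gamma(X)] = \int_0^1 x exp(theta*x - K(theta)) dx,
    # approximated with the midpoint rule
    c = K(theta)
    h = 1.0 / n
    num = den = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        fx = math.exp(theta * x - c)
        num += x * fx * h
        den += fx * h
    return num / den

theta = 1.0
grad_K = math.exp(theta) / (math.exp(theta) - 1.0) - 1.0 / theta  # K'(theta)
assert abs(grad_K - mean_via_quadrature(theta)) < 1e-6
```

The same identity is what lets the algorithm iterate on the mean vector \(m(\theta )\) and recover \(\theta \) via \(m^{-1}\).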

Firstly, the *S* function in our model, a mapping from \(\left( \mathbf {X},\mathbf {R}^{+},{\mathbb {P}}_{x}\right) \) to \(\mathbf {R}^{+}\), takes two additional parameters, *x* and \(p_{k}\). Secondly, our model ensures that \(E_{p_{k}}\left[ S\left( X,H\left( X\right) ,p_{k}\right) \right] =1;\) hence, there is no need for normalization. It is also interesting to note that, for a fixed \(p_{t}\) and *x*, \(S\left( x,\cdot ,p_{t}\right) \) is an increasing function. Acknowledging these differences, we will prove that the Monte Carlo version of Eq. (14) converges to the *internally chain recurrent set* of an ordinary differential equation. The definition of internally chain recurrent sets will be introduced later in our analysis, along with the corresponding ordinary differential equation. The development of the theorems below runs in parallel with the work of Hu et al. [2], with the major difference being the structure of the *S* function as discussed in this paragraph. The main idea is to apply techniques from the stochastic approximation literature to model-based methods.

At the *k*th iteration, the *smoothed reference PDF* is denoted by:

### **Lemma 3**

*k*. Then

### *Proof*

The *S* mapping has the specific form [cf. Eq. (14)]:

*H* under \(f_{m^{-1}\left( \eta \right) }\).

### **Definition 8**

Given an initial condition \(\eta \left( 0\right) =y,\) let \(\eta _{y}\left( t\right) \) be the solution to Eq. (21). A point *x* is said to be chain recurrent if for any \(\delta >0\) and \(T>0,\) there exist an integer \(k\ge 1,\) points \(y_{0},\dots ,y_{k}\) with \(y_{k}=x,\) and time instants \(t_{0},\dots ,t_{k-1}\) such that \(t_{i}\ge T,\) \(\left\| x-y_{0}\right\| \le \delta ,\) and \(\left\| \eta _{y_{i}}\left( t_{i}\right) -y_{i+1}\right\| \le \delta \) for \(i=0,\dots ,k-1.\) A compact invariant set \(\mathcal {A}\) (i.e., for any \(y\in \mathcal {A},\) the trajectory \(\eta _{y}\left( t\right) \) satisfies \(\eta _{y}\left( t\right) \in \mathcal {A},\;\forall t\in \mathbf {R}^{+}\)) is said to be internally chain recurrent if every point \(x\in \mathcal {A}\) is chain recurrent.
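Definition 8 can be illustrated on a one-dimensional surrogate dynamic (a hypothetical stand-in, not the actual Eq. (21)): for \(\dot{\eta }=\eta -\eta ^{3}\), the chain recurrent points are exactly the equilibria \(\{-1,0,1\}\), and an Euler-discretized trajectory with small "\(\delta \)-jumps" still settles near a stable equilibrium:

```python
import random

def vector_field(eta):
    # Hypothetical stand-in dynamic with equilibria at -1, 0, and +1;
    # Eq. (21) itself is problem-dependent.
    return eta - eta ** 3

def perturbed_trajectory(y0, dt=1e-3, steps=20_000, noise=1e-4, seed=0):
    # Euler discretization with small random jumps, mimicking the
    # delta-chains in the definition of chain recurrence.
    rng = random.Random(seed)
    eta = y0
    for _ in range(steps):
        eta += dt * vector_field(eta) + noise * rng.uniform(-1.0, 1.0)
    return eta

# Starting away from the unstable equilibrium at 0, the perturbed chain
# converges to a neighborhood of a stable equilibrium (+1 here).
final = perturbed_trajectory(0.3)
assert abs(final - 1.0) < 0.05
```

In this toy case the internally chain recurrent sets are isolated equilibrium points, which is precisely the situation in which Theorem 12 guarantees convergence to a unique equilibrium.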

The following theorem proves the convergence and stability of the Monte Carlo Algorithm 1.

### **Theorem 12**

*Assume the following conditions hold:*

- 1.
The parameter \(\hat{\theta }_{k+1}\) computed at step 4 satisfies \(\hat{\theta }_{k+1}\in \mathring{\varvec{\Theta }},\;\forall k\);

- 2.
The gain sequence \(\left\{ \alpha _{k}\right\} \) satisfies \(\alpha _{k}>0,\;\forall k,\) \(\alpha _{k}\rightarrow 0\) as \(k\rightarrow \infty ,\) \(\sum _{k=0}^{\infty }\alpha _{k}=\infty ,\) and \(\limsup _{k\rightarrow \infty }\left( \frac{\alpha _{k}}{k^{-\lambda }}\right) <\infty \) for some constant \(\lambda \ge 0.\) Furthermore, there exists a \(\beta >\max \left\{ 0,1-2\lambda \right\} \) such that$$\begin{aligned} \limsup _{k\rightarrow \infty }\left( \frac{k^{\beta }}{N_{k}}\right) <\infty ; \end{aligned}$$
- 3.
For a given \(\rho \in \left( 0,1\right) \) and a distribution family \(\mathbb {F}\), the \(\left( 1-\rho \right) \)-quantile of \(\left\{ H\left( X\right) ,\, X\sim f_{\theta }\left( x\right) \right\} \) is unique for each \(\theta \in \varvec{\Theta }.\)

Then, the sequence \(\left\{ \eta _{k}\right\} \) generated by Eq. (20) converges to a compact connected internally chain recurrent set of Eq. (21) w.p.1. Furthermore, if the internally chain recurrent sets of Eq. (21) are isolated equilibrium points, then w.p.1 \(\left\{ \eta _{k}\right\} \) converges to a unique equilibrium point.

### *Proof*

See Theorem 3.1 in [2].

We have thus far shown that the Monte Carlo version of the CWO method converges to the internally chain recurrent set of Eq. (21), an invariant set. Since Eq. (20) will remain in the invariant set upon entrance, we have proved that Algorithm 1 is asymptotically stable. The major difference between the general and Monte Carlo convergence results is that in the general case we can characterize the limiting behavior more precisely, i.e., all the weight ends up on optimal solutions. Since the Monte Carlo version is an approximation of the general version, it can converge to a set containing distributions whose support includes more than the optimal solutions. The precise nature of the internally chain recurrent set of Eq. (21) depends on the projection used, and hence requires additional analysis for each problem. The chain recurrent set is closed and invariant; it contains all equilibrium points and any point that can return to itself by following the system dynamic for a while and then jumping to a nearby state (e.g., periodic orbits). The existence of these non-optimal recurrent points can only be confirmed by plotting the vector field of Eq. (21). Knowing that in the worst-case scenario the system will converge to the chain recurrent set is nevertheless helpful.
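The gain and sample-size conditions of Theorem 12 are easy to satisfy in practice. As one hypothetical choice (not prescribed by the paper), \(\alpha _{k}=(k+1)^{-0.6}\) and \(N_{k}=\lceil (k+1)^{0.9}\rceil \) give \(\lambda =0.6\) and \(\beta =0.9>\max \{0,1-2\lambda \}\):

```python
import math

lam, beta = 0.6, 0.9
assert beta > max(0.0, 1.0 - 2.0 * lam)  # growth condition on beta

def alpha(k):
    # Gain sequence: positive, decays to 0, non-summable (exponent <= 1).
    return (k + 1) ** (-lam)

def N(k):
    # Per-iteration sample size growing at least like k^beta.
    return math.ceil((k + 1) ** beta)

# alpha_k / k^{-lambda} stays bounded, and k^beta / N_k stays bounded.
ratios_alpha = [alpha(k) / (k + 1) ** (-lam) for k in range(1, 1000)]
ratios_N = [(k + 1) ** beta / N(k) for k in range(1, 1000)]
assert max(ratios_alpha) <= 1.0 + 1e-12
assert max(ratios_N) <= 1.0 + 1e-12
```

Intuitively, a slower-decaying gain (smaller \(\lambda \)) must be compensated by a faster-growing sample size (larger \(\beta \)) to control the accumulated Monte Carlo noise.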

## 5 Numerical algorithms

In this section, we present a few numerical algorithms based on the CWO method. These algorithms attempt to find an optimal solution iteratively. Each iteration consists of 5 stages: generation, quantile-update, parameter-update, weight-update, and projection. The generation, quantile-update and projection stages remain the same for all variations of the generic algorithm (i.e., Algorithm 2, where arrows are used as indentation markers). The weight-update and projection steps, along with the equation in step 2 of Algorithm 2, correspond to step 4 in Algorithm 1. The additional uniform random variable is included in step 2 of Algorithm 2 to ensure all solutions are considered. We propose several approaches for constructing the weight-update stage. These algorithms build on the theoretical results using the same types of modifications as are found in CE and MRAS (see [11, 19, 22]).
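The five-stage loop can be sketched on a one-dimensional, Gaussian-parameterized toy problem. This is a minimal illustration of the iteration structure only: the elite-selection rule, the exponential smoothing, and all names are our own simplifications, not the paper's exact Algorithm 2.

```python
import math
import random

def H(x):
    # Toy objective to minimize; the global minimizer is x* = 2.
    return (x - 2.0) ** 2

def cwo_sketch(iters=60, N=200, rho=0.1, smooth=0.7, seed=0):
    rng = random.Random(seed)
    mu, sigma = 0.0, 5.0  # initial sampling distribution
    for _ in range(iters):
        # Stage 1: generation.
        xs = [rng.gauss(mu, sigma) for _ in range(N)]
        xs.sort(key=H)
        # Stage 2: quantile-update -- keep the best rho-fraction.
        elite = xs[: max(1, int(rho * N))]
        # Stages 3-4: weight/parameter-update (uniform weights on elites).
        new_mu = sum(elite) / len(elite)
        new_sigma = math.sqrt(
            sum((x - new_mu) ** 2 for x in elite) / len(elite))
        # Stage 5: projection/smoothing keeps parameters admissible.
        mu = smooth * new_mu + (1 - smooth) * mu
        sigma = max(smooth * new_sigma + (1 - smooth) * sigma, 1e-6)
    return mu

assert abs(cwo_sketch() - 2.0) < 0.1
```

The smoothing step plays the role of the projection stage here: it prevents the sampling variance from collapsing to zero in a single iteration.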

### 5.1 Combinatorial optimization: ATSP

The asymmetric traveling salesman problem (ATSP) is specified by a distance matrix *D*, whose (*i*, *j*)th element \(D_{i,j}\) represents the distance from city *i* to city *j*. The problem can be mathematically stated as:

We use the approach suggested by Rubinstein [22] and de Boer et al. [23] for solving these problems. Each distance matrix *D* is given an initial state probability transition matrix, whose (*i*, *j*)th element specifies the probability of transitioning from city *i* to city *j*. At each iteration of the algorithm, there are two important steps: (1) generate random admissible tours according to the probability transition matrix and evaluate the performance of each sampled tour; (2) update the probability transition matrix based on the tours generated in step 1. We denote the set of tours generated at the *k*th iteration by \(\left\{ x_{k}^{i}\right\} ,\) where \(i\in \left\{ 1,\dots ,N_{k}\right\} \). Without loss of generality, we assume the samples are sorted according to their values (i.e., \(H\left( x_{k}^{i}\right) <H\left( x_{k}^{j}\right) \) if and only if \(i<j\)).
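Step (1) can be sketched as follows. Admissibility is handled by restricting each transition to unvisited cities and renormalizing the corresponding row of the transition matrix (a standard device in CE implementations; the helper names are ours):

```python
import random

def sample_tour(P, rng):
    # Sample an admissible tour: start at city 0, then repeatedly draw the
    # next city among the unvisited ones, renormalizing row probabilities.
    n = len(P)
    tour, visited = [0], {0}
    while len(tour) < n:
        i = tour[-1]
        choices = [j for j in range(n) if j not in visited]
        weights = [P[i][j] for j in choices]
        total = sum(weights)
        if total == 0:  # degenerate row: fall back to uniform
            weights = [1.0] * len(choices)
            total = float(len(choices))
        r, acc = rng.random() * total, 0.0
        for j, w in zip(choices, weights):
            acc += w
            if r <= acc:
                tour.append(j)
                visited.add(j)
                break
    return tour

def tour_length(D, tour):
    # Total distance of the closed tour, returning to the start city.
    n = len(tour)
    return sum(D[tour[l]][tour[(l + 1) % n]] for l in range(n))

rng = random.Random(0)
D = [[0, 2, 9], [1, 0, 6], [8, 3, 0]]
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
tours = [sample_tour(P, rng) for _ in range(5)]
assert all(sorted(t) == [0, 1, 2] for t in tours)
```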

At the *k*th iteration of CWO, the probability density function \(p_{k}\left( \cdot ,\theta _{k}\right) \), parametrized by the transition matrix \(\theta _{k}\), is given by the equation below:

where the *l*th transition is from city *i* to city *j*. We can show that the new transition matrix is updated (i.e., stage 6 of Algorithm 2) as:

The superscript *w* is used to emphasize the dependence of the updated probability mass function on the probability weighting function *w*. The construction of \(p_{k+1}^{w}\left( \cdot \right) \) depends on the specific weight-update method.
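A sketch of stage 6, assuming the update reduces to a weighted empirical frequency of transitions followed by row normalization (the exact update equation is not reproduced here, so this is an illustrative stand-in):

```python
def update_transition_matrix(tours, weights, n):
    # Weighted empirical frequency of each (i -> j) transition across the
    # sampled tours; rows are normalized to remain a stochastic matrix.
    counts = [[0.0] * n for _ in range(n)]
    for tour, w in zip(tours, weights):
        for l in range(len(tour)):
            i, j = tour[l], tour[(l + 1) % len(tour)]
            counts[i][j] += w
    for i in range(n):
        row_sum = sum(counts[i])
        if row_sum > 0:
            counts[i] = [c / row_sum for c in counts[i]]
    return counts

# Two tours with weights 0.75 and 0.25: from city 0 the transition 0 -> 1
# inherits weight 0.75 and 0 -> 2 inherits weight 0.25.
P = update_transition_matrix([[0, 1, 2], [0, 2, 1]], [0.75, 0.25], 3)
assert abs(P[0][1] - 0.75) < 1e-12 and abs(P[0][2] - 0.25) < 1e-12
```

The weights fed into this update are exactly what the weight-update methods of Sect. 5.1.1 produce.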

#### 5.1.1 Weight-update methods

In this section, we present several different methods of obtaining \(p_{k+1}^{w}\left( \cdot \right) \) from a collection of samples \(\{x_{k}^{i}\}\) at the *k*th step. The first method we introduce is called tilted weight update.

*Tilted weight update (CWO_T)*

Performance of CWO_T on various ATSP problems, based on 30 independent replications

| ATSP | \(N_\mathrm{cities}\) | \(N_\mathrm{Total}\) (Std. err.) | \(H_\mathrm{best}\) | \(H_{*}\) | \(H^{*}\) | \(\delta _{*}\) | \(\delta ^{*}\) | \(\delta \) (Std. err.) |
|---|---|---|---|---|---|---|---|---|
| ftv33 | 34 | 6.59e4 (1.81e4) | 1286 | 1379 | 1286 | 0.0723 | 0.0000 | 0.0396 (0.0279) |
| ftv35 | 36 | 6.79e4 (1.63e4) | 1473 | 1581 | 1473 | 0.0733 | 0.0000 | 0.0195 (0.0172) |
| ftv38 | 39 | 8.81e4 (3.26e4) | 1530 | 1651 | 1536 | 0.0791 | 0.0039 | 0.0243 (0.0190) |
| p43 | 43 | 2.80e5 (1.04e5) | 5620 | 5636 | 5622 | 0.0028 | 0.0004 | 0.0011 (0.0007) |
| ry48p | 48 | 4.65e5 (2.30e5) | 14,422 | 18,725 | 14,618 | 0.2984 | 0.0136 | 0.0744 (0.0676) |
| ft53 | 53 | 3.24e5 (1.23e5) | 6905 | 7844 | 7059 | 0.1360 | 0.0223 | 0.0590 (0.0247) |
| ft70 | 70 | 7.02e5 (3.32e5) | 38,673 | 39,738 | 38,760 | 0.0275 | 0.00225 | 0.0130 (0.0050) |

*Uniform weight update (CWO_U)* Tilting assigns the initial weights of the samples \(\{x_{k}^{i}\}\) using their values. Uniform weight updating differs from tilting by assuming a uniform distribution over the samples. Another major difference from the above approach is that we no longer consider only elite samples. Instead, we use a carefully chosen probability weighting function that smoothly re-weights the samples. More specifically, in stage 5 of Algorithm 2, we assume a uniform initial density and use the weighting function of Eq. (23), where \(\sigma \) is the optimal-seeking factor and \(\rho \) is the quantile threshold. \(\sigma \) and \(\rho \) are treated as variables that parameterize the weighting function, whereas *p* is the argument of the parameterized weighting function. Using Eq. (23), we modify the generic CWO algorithm by altering the way the sample weights are updated. The algorithm has a strong connection with the traditional cross-entropy method, which is explained below.
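Since Eq. (23) is not reproduced here, the following is only a hypothetical illustration of a smooth quantile re-weighting with an optimal-seeking factor \(\sigma \) and quantile threshold \(\rho \): a logistic ramp in the sample's quantile position *p*.

```python
import math

def weight(p, sigma, rho):
    # Hypothetical smooth weighting in the quantile argument p in [0, 1]:
    # a logistic ramp centered at the (1 - rho)-quantile. Larger sigma
    # concentrates weight on better-ranked samples. This is an illustrative
    # stand-in for Eq. (23), not the paper's actual weighting function.
    return 1.0 / (1.0 + math.exp(-sigma * (p - (1.0 - rho))))

# Monotone in p: better quantile positions receive more weight.
ws = [weight(p / 10.0, sigma=20.0, rho=0.1) for p in range(11)]
assert all(a <= b for a, b in zip(ws, ws[1:]))
```

As \(\sigma \rightarrow \infty \), such a ramp approaches the hard \((1-\rho )\)-quantile indicator used by standard CE, which is consistent with viewing cross-entropy as a limiting case of CWO_U.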

CWO_U and CE performance results

| ATSP | \(N_\mathrm{cities}\) | \(N_\mathrm{Total}\) (Std.) | \(H_\mathrm{best}\) | \(H_{*}\) | \(H^{*}\) | \(\delta _{*}\) | \(\delta ^{*}\) | \(\delta \) (Std.) |
|---|---|---|---|---|---|---|---|---|
| ft53 | 53 | 90,450 (6.0e3) | 6905 | 7679 | 7037 | 0.112 | 0.0191 | 0.060 (0.0244) |
| ce_ft53 | 53 | 65,100 (5.7e3) | 6905 | 7676 | 7088 | 0.111 | 0.0265 | 0.075 (0.0276) |

CWO_U and CE performance results—continuous case

| | \(N_\mathrm{Total}\) (Std.) | \(H_\mathrm{best}\) | \(H_{*}\) | \(H^{*}\) | \(\delta _{*}\) | \(\delta ^{*}\) | \(\delta \) (Std.) |
|---|---|---|---|---|---|---|---|
| \(H_{1}\) | 1250 (51) | \(-\)6.02074 | \(-\)6.02066 | \(-\)6.02074 | 0.00008 | 0.0 | 1.8e\(-\)5 (2.6e\(-\)5) |
| ce \(H_{1}\) | 800 (0) | \(-\)6.02074 | \(-\)6.02047 | \(-\)6.02074 | 0.00027 | 0.0 | 2.4e\(-\)5 (6.1e\(-\)5) |
| \(H_{2}\) | 2380 (77) | \(-\)10.1532 | \(-\)10.152576 | \(-\)10.153163 | \(6.24\times 10^{-4}\) | \(3.7\times 10^{-5}\) | 3.0e\(-\)4 (2.0e\(-\)4) |
| ce \(H_{2}\) | 1800 (108) | \(-\)10.1532 | \(-\)2.682841 | \(-\)10.153113 | 7.47036 | \(8.7\times 10^{-5}\) | 3.7e\(-\)1 (1.7) |

We plot the sorted minimum tour distances obtained from the 20 trials of the CE and CWO_U algorithms in Fig. 1b. We observe from Fig. 1b that, compared with the standard cross-entropy method, our approach does better in every percentile. For example, the \(\frac{19}{20}\)th percentile contains the lowest minimum tour distance obtained among the 20 trials, and the \(\frac{18}{20}\)th percentile contains the second lowest.

### 5.2 Continuous problems

We further tested the CWO uniform weight update scheme on the continuous case, comparing CWO_U against the CE method by minimizing two continuous test functions with many local minima and isolated optimizers: 1) Forrester: \(H_{1}(x)=\left( 6x-2\right) ^{2}\sin \left( 12x-4\right) ,\) \(0\le x\le 1;\) 2) Shekel: \(H_{2}(x)=-\sum _{j=1}^{5}\left( \sum _{i=1}^{4}\left( x_{i}-A_{ij}\right) ^{2}+B_{j}\right) ^{-1},\) \(0\le x_{i}\le 10,\) where \(A_{1}=A_{3}=\left[ 4,1,8,6,3\right] ,\ A_{2}=A_{4}=\left[ 4,1,8,6,7\right] ,\) and \(A_{i}\) represents the *i*th row of the matrix *A*. Furthermore, \(B=\left[ 0.1,0.2,0.2,0.4,0.4\right] .\) Table 3 contains the results of our 20 trial runs for each scenario, using the parameters \(\varDelta =0.1,\) \(\rho _{0}=\rho _\mathrm{min}=0.1,\) \(N_{0}=100,\) \(\epsilon =0\), \(\zeta =1\), \(\varsigma =0\), and \(\alpha =0.7.\) We employed independent Gaussian distributions with zero mean and standard deviation of 10 as the initial distributions in all dimensions for all runs.
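The two test functions can be implemented directly from the formulas above. As a sanity check (our own, not part of the paper's experiments), a fine grid search on \([0,1]\) recovers the Forrester minimum value of approximately \(-6.02074\) reported in Table 3:

```python
import math

def forrester(x):
    # H1(x) = (6x - 2)^2 * sin(12x - 4) on [0, 1].
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

# Rows A_i of the Shekel matrix, as given in the text (A_1 = A_3, A_2 = A_4).
A = [[4, 1, 8, 6, 3],
     [4, 1, 8, 6, 7],
     [4, 1, 8, 6, 3],
     [4, 1, 8, 6, 7]]
B = [0.1, 0.2, 0.2, 0.4, 0.4]

def shekel(x):
    # H2(x) = -sum_j 1 / (sum_i (x_i - A_ij)^2 + B_j) on [0, 10]^4.
    return -sum(
        1.0 / (sum((x[i] - A[i][j]) ** 2 for i in range(4)) + B[j])
        for j in range(5)
    )

# Grid search recovers the Forrester minimum value from Table 3.
best = min(forrester(k / 100000.0) for k in range(100001))
assert abs(best - (-6.02074)) < 1e-4
```

Evaluating the Shekel function at the first column of *A*, i.e., near \(x=(4,4,4,4)\), gives a value close to the tabulated \(H_\mathrm{best}=-10.1532\), confirming the implementation.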

## 6 Conclusion

In the first part of this paper, we proved the convergence and stability of both the theoretical and Monte Carlo versions of the CWO-based method. The proofs provide a rigorous mathematical foundation for the two practical algorithms we proposed in the numerical examples section. These two algorithms are variations of the generic CWO algorithm described in Algorithm 2. The two variations, CWO_T and CWO_U, differ in how they update their probability density functions over the solution space at each iteration. The first approach, CWO_T, weights the samples according to their outcome values. The second, CWO_U, weights the samples uniformly. We benchmarked the performance of the CWO_T algorithm and summarized the results in Table 1. Although the numeric values are quite satisfactory, we wanted to see if we could improve these results. This effort led us to the development of the second approach, CWO_U, which we consider the preferred implementation of the CWO-based algorithm. Perhaps the most surprising fact is that by not taking into account the outcome values of the samples, we are able to achieve better performance. Even more interesting is the fact that the standard cross-entropy approach is a limiting case of the CWO_U approach. Comparing the numerical results of CWO_U with those of CE, we believe our algorithm is better at obtaining an optimal solution (see Fig. 1b). Of course, the improvement in performance comes at the expense of increased computational cost.

## Notes

### Acknowledgments

This work was supported in part by the National Science Foundation (NSF) under Grants CNS-0926194, CMMI-0856256, CMMI-1362303, CNS-1446665 and CCF-0926194, and by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-15-10050. A preliminary and shorter version of this paper was presented at the 2013 Winter Simulation Conference.

## References

- 1. Hu, J., Wang, Y., Zhou, E., Fu, M.C., Marcus, S.I.: A survey of some model-based methods for global optimization. In: Hernández-Hernández, D., Minjárez-Sosa, J.A. (eds.) Optimization, Control, and Applications of Stochastic Systems, pp. 157–179. Birkhäuser, Boston (2012)
- 2. Hu, J., Hu, P., Chang, H.S.: A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Trans. Autom. Control **57**, 165–178 (2012)
- 3. Kahneman, D., Tversky, A.: Prospect theory: an analysis of decision under risk. Econometrica **47**(2), 263–291 (1979)
- 4. Tversky, A., Kahneman, D.: Advances in prospect theory: cumulative representation of uncertainty. J. Risk Uncertain. **5**(4), 297–323 (1992)
- 5. Diecidue, E., Schmidt, U., Zank, H.: Parametric weighting functions. J. Econ. Theory **144**, 1102–1118 (2009)
- 6. Borkar, V.S.: Topics in Controlled Markov Chains. CRC Press, Boca Raton (1991)
- 7. Billingsley, P.: Convergence of Probability Measures, 2nd edn. Wiley-Interscience, New York (1999)
- 8. Wang, Y.: Simulation-Based Methods for Stochastic Control and Global Optimization. PhD thesis, University of Maryland, College Park (2011)
- 9. Frank, T.D.: Nonlinear Fokker–Planck Equations—Fundamentals and Applications. Springer, Berlin (2005)
- 10. Kolokoltsov, V.N.: Nonlinear Markov Processes and Kinetic Equations. Cambridge University Press, Cambridge (2010)
- 11. Wang, Y., Fu, M.C., Marcus, S.I.: Model-based evolutionary optimization. In: Proceedings of the Winter Simulation Conference, WSC'10, pp. 1199–1210 (2010)
- 12. Oechssler, J., Riedel, F.: Evolutionary dynamics on infinite strategy spaces. Econ. Theory **17**(1), 141–162 (2001)
- 13. Hofbauer, J., Oechssler, J., Riedel, F.: Brown–von Neumann–Nash dynamics: the continuous strategy case. Games Econ. Behav. **65**, 406–429 (2009)
- 14. Zeidler, E.: Nonlinear Functional Analysis and Its Applications: Part 2 B: Nonlinear Monotone Operators. Springer, Berlin (1989)
- 15. Bhatia, N.P., Szegö, G.P.: Stability Theory of Dynamical Systems. Springer, Berlin (1970)
- 16. Morris, C.N.: Natural exponential families with quadratic variance functions. Ann. Stat. **10**, 65–80 (1982)
- 17. Mühlenbein, H., Paaß, G.: From recombination of genes to the estimation of distributions I. Binary parameters. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (eds.) Parallel Problem Solving from Nature IV. Lecture Notes in Computer Science, pp. 178–187. Springer, Berlin (1996)
- 18. Wolpert, D.H.: Finding bounded rational equilibria part I: iterative focusing. In: Proceedings of the International Society of Dynamic Games Conference (2004)
- 19. Hu, J., Fu, M.C., Marcus, S.I.: A model reference adaptive search method for global optimization. Oper. Res. **55**, 549–568 (2007)
- 20. Zabinsky, Z.B.: Stochastic Adaptive Search for Global Optimization, vol. 72. Springer, Berlin (2003)
- 21. Rubinstein, R.Y., Kroese, D.P.: The Cross-Entropy Method—A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation. Springer, New York (2004)
- 22. Rubinstein, R.Y.: Combinatorial optimization, cross-entropy, ants and rare events. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 304–358. Kluwer, Dordrecht (2001)
- 23. de Boer, P.-T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Ann. Oper. Res. **134**, 19–67 (2005)

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.