8.1 Introduction: Top-Down Versus Bottom-Up

Let \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\), \(\boldsymbol{x}\mapsto f(\boldsymbol{x})\) be an objective or cost (or fitness) function to be minimized, where, in practice, the typical search space dimension n obeys 3 < n < 300. When properties of f are unknown a priori, an iterative search algorithm can proceed by evaluating solutions on f and so gather information for finding better solutions over time (black-box search or optimization). Good solutions have, by definition, a small f-value, and evaluations of f are considered the cost of search (note the double entendre of the word cost for f). The objective is, in practice, to find a good solution with the least number of function evaluations and, more rigorously, to generate a sequence \(\boldsymbol{x}_{k}\), \(k = 1,2,3,\ldots\), such that \(f(\boldsymbol{x}_{k})\) converges fast to the essential infimum of f, denoted \({f}^{{\ast}}\). The essential infimum \({f}^{{\ast}}\) is the largest real number such that the set of better search points \(\{\boldsymbol{x} \in {\mathbb{R}}^{n}: f(\boldsymbol{x}) < {f}^{{\ast}}\}\) has zero volume.

In order to search in continuous spaces with even moderate dimension, some structure in the cost function needs to be exploited. For evolution strategies, the principal structure is believed to be neighborhood. Strong causality [33]—the principle that small actuator changes generally have only small effects—and fitness-distance correlation [31]—a statistical perspective on the same concept—are two ways to describe the structure that evolution strategies are based upon. In contrast to Chaps. 4, 6, and 7 of this volume, in this chapter we do not introduce an a priori assumption on the problem class we want to address, that is, we do not assume any structure in the cost function a priori. However, we use two ideas that might imply the exploitation of neighborhood: We assume that the variances of the sample distribution exist, and we encourage consecutive iteration steps to become, under a variable metric, orthogonal (via step-size control). Empirically, the latter rather reduces the locality of the algorithm: The step-sizes that achieve orthogonality are usually large in their stationary condition. We conjecture therefore that the mere existence of variances and/or the “any-time” approach that aims to improve in each iteration, rather than only in a final step, already implies the exploitation of a neighborhood structure in our context.

In order to solve the above-introduced search problem on f, we take a principled stochastic (or randomized) approach. We first sample points from a distribution over the search space with density \(p(\cdot \,\vert \,\theta )\), we then evaluate the points on f, and finally we update the parameters θ of the distribution. This is done iteratively and defines a search procedure on θ as depicted in Fig. 8.1. Indeed, the update of θ remains the one and only crucial element—besides the choice of p (and λ) in the first place. Consequently, this chapter is entirely devoted to the question of how to update θ.

Fig. 8.1 Stochastic search template
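To make the template of Fig. 8.1 concrete, the following minimal, runnable Python sketch implements the sample-evaluate-update loop; the placeholder distribution, the update rule, and the test function `sphere` are our own illustrative assumptions, not the update derived in this chapter.

```python
import numpy as np

def sphere(x):
    return float(np.sum(x**2))

def stochastic_search(f, theta, sample, update, budget=1000, lam=10):
    """Generic template: sample from p(.|theta), evaluate on f, update theta."""
    evals = 0
    while evals < budget:
        X = [sample(theta) for _ in range(lam)]   # lambda candidate solutions
        F = [f(x) for x in X]                     # the costly evaluations
        evals += lam
        theta = update(theta, X, F)               # the one crucial element
    return theta

# Placeholder instantiation: theta = (mean, sigma), isotropic Gaussian;
# the update moves the mean to the best sample (illustration only).
rng = np.random.default_rng(1)
sample = lambda th: th[0] + th[1] * rng.standard_normal(th[0].size)
update = lambda th, X, F: (X[int(np.argmin(F))], th[1] * 0.95)
theta = (np.ones(5), 1.0)
print(stochastic_search(sphere, theta, sample, update)[0])
```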

Before we proceed, we note that under some mild assumptions on p, and for any increasing transformation \(g: \mathbb{R} \rightarrow \mathbb{R}\) (in particular also for the identity), the minimum of the function

$$\displaystyle{ \theta \mapsto E(g(f(\boldsymbol{x}))\vert \theta ) }$$
(8.1)

coincides with the minimum of f (the expectation E is taken under the sample distribution p, given parameters θ). The optimal distribution is entirely concentrated in the arg min of f. In black-box search, we do not want (and are not able) to impose strong regularity conditions on the unknown function f. However, we have entire control over p. This seems an excellent justification for a randomized approach to the original black-box search problem. We sketch two approaches to solve (8.1).

8.1.1 The Top-Down Way

We might choose p to be “sufficiently smooth” and conduct a gradient descent,

$$\displaystyle{ \theta _{k+1} =\theta _{k} -\eta \nabla _{\theta }E(f(\boldsymbol{x})\vert \theta )\qquad \text{with } \eta > 0. }$$
(8.2)

We are facing two problems with Eq. (8.2). On the one hand, we need to compute \(\nabla _{\theta }E(f(\boldsymbol{x})\vert \theta )\). On the other hand, the gradient \(\nabla _{\theta }\) strongly depends on the specifically chosen parameterization in θ. The unique solution to the second problem is the natural gradient. The idea to use the natural gradient in evolution strategies was coined in [40] and elegantly pursued in [11]. The natural gradient is unique, invariant under reparametrization and in accordance with the Kullback-Leibler (KL) divergence or relative entropy, the informational difference measure between distributions. We can reformulate Eq. (8.2) using the natural gradient, denoted \(\tilde{\nabla }\), in a unique way as

$$\displaystyle{ \theta _{k+1} =\theta _{k} -\eta \tilde{\nabla }E(f(\boldsymbol{x})\vert \theta ). }$$
(8.3)

We can express the natural gradient in terms of the vanilla gradient \(\nabla _{\theta }\), using the Fisher information matrix, as \(\tilde{\nabla } = F_{\theta }^{-1}\nabla _{\theta }\). Using the log-likelihood trick, \(\nabla _{\theta }p = (p/p)\nabla _{\theta }p = p\nabla _{\theta }\log p\), we can finally, under mild assumptions on p, rearrange Eq. (8.3) into

$$\displaystyle{ \theta _{k+1} =\theta _{k} -\eta E(\mathop{\underbrace{f(\boldsymbol{x})}}\limits _{\text{expensive}}\overbrace{F_{\theta }^{-1}\nabla _{\theta }\log p{(\boldsymbol{x}\vert \theta )}}^{\text{\textquotedblleft controlled\textquotedblright }}). }$$
(8.4)

In practice, the expectation in Eq. (8.4) can be approximated/replaced by taking the average over a (potentially small) number of samples, \(\boldsymbol{x}_{i}\), where computing \(f(\boldsymbol{x}_{i})\) is assumed to be the costly part. We will also choose p such that we can conveniently sample from the distribution and that the computation (or approximation) of \(F_{\theta }^{-1}\nabla _{\theta }\log p\) is feasible. The top-down way of Eqs. (8.3) and (8.4) is an amazingly clean and principled approach to stochastic black-box optimization.
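As an illustration of this sample-average approximation, the following sketch assumes \(p = \mathcal{N}(\boldsymbol{m},\boldsymbol{I})\), for which the Fisher matrix is the identity and \(\nabla _{\boldsymbol{m}}\log p(\boldsymbol{x}\,\vert \,\boldsymbol{m}) = \boldsymbol{x} -\boldsymbol{m}\); the baseline subtraction is a common variance-reduction device and not part of Eq. (8.4) itself.

```python
import numpy as np

def natural_gradient_step(f, m, eta=0.1, lam=50, rng=np.random.default_rng(2)):
    """One Monte Carlo step of Eq. (8.4), assuming p = N(m, I).

    For this p the Fisher matrix is the identity and
    grad_m log p(x|m) = x - m, so the sample average of
    f(x_i) * (x_i - m) estimates the expectation in Eq. (8.4)."""
    X = m + rng.standard_normal((lam, m.size))    # sample lambda points
    F = np.array([f(x) for x in X])               # the expensive part
    F = F - F.mean()                              # baseline, reduces variance
    grad = (F[:, None] * (X - m)).mean(axis=0)    # MC estimate of E(f * grad log p)
    return m - eta * grad

m = np.ones(5)
for _ in range(200):
    m = natural_gradient_step(lambda x: np.sum(x**2), m)
print(np.linalg.norm(m))   # typically close to zero after these steps
```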

8.1.2 The Bottom-Up Way

In this chapter, we choose a rather orthogonal approach to derive a principled stochastic search algorithm in the \({\mathbb{R}}^{n}\). We take a scrutinizing step-by-step road to construct the algorithm based on a few fundamental principles—namely maximal entropy, unbiasedness, maintaining invariance, and, under these constraints, exploiting all available information and solving simple functions reasonably fast.

Surprisingly, the resulting algorithm arrives at (8.3) and (8.4): Eqs. (8.12) and (8.51) implement Eq. (8.3) in the manifold of multivariate normal distributions under some monotonic transformation of f [1, 5] (let η = 1, \(c_{1} = c_{\epsilon } = 0\), \(c_{\mu } = \sigma _{k} = 1\)). The monotonic transformation is driven by an invariance principle. In both ways, top-down and bottom-up, the same, well-recognized stochastic search algorithm emerges: the covariance matrix adaptation evolution strategy (CMA-ES). Our scrutinizing approach, however, reveals additional aspects that are consistently useful in practice: Cumulation via an evolution path, step-size control, and different learning rates η for different parts of θ. These aspects are either well hidden by Eq. (8.4) or can hardly be derived at all (cumulation). On the downside, the bottom-up way is clearly less appealing.

The following sections will introduce and motivate the CMA-ES step-by-step. The CMA-ES samples new solutions from a multivariate normal distribution and updates the parameters of the distribution, namely the mean (incumbent solution), the covariance matrix and additionally a step-size in each iteration, utilizing the f-ranking of the sampled solutions. We formalize the different notions of invariance as well as the maximum likelihood and stationarity properties of the algorithm. A condensed final transcription of the algorithm is provided in the appendix. For a discussion under different perspectives, the reader is referred to [12, 15, 22].

8.2 Sampling with Maximum Entropy

We start by sampling λ (new) candidate solutions \(\boldsymbol{x}_{i} \in {\mathbb{R}}^{n}\), obeying a multivariate normal (search) distribution

$$\displaystyle{ \boldsymbol{x}_{i} \sim \boldsymbol{m}_{k} + \sigma _{k} \times \mathcal{N}_{i}\left ({\boldsymbol 0},\boldsymbol{C}_{k}\right )\qquad \text{for}\ i = 1,\ldots,\lambda, }$$
(8.5)

where \(k = 0,1,2,\ldots\) is the time or iteration index, \(\boldsymbol{m}_{k} \in {\mathbb{R}}^{n}\), \(\sigma _{k} > 0\), and \(\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}\right )\) denotes a multivariate normal distribution with zero mean and covariance matrix \(\boldsymbol{C}\); the symbol ∼ denotes equality in distribution. For convenience, we will sometimes omit the iteration index k.

New solutions obey a multivariate normal distribution with expectation \(\boldsymbol{m}\) and covariance matrix \({\sigma }^{2} \times \boldsymbol{C}\). Sets of equal density—that is, lines or surfaces in 2-D or 3-D, respectively—are ellipsoids centered about the mean and modal value \(\boldsymbol{m}\). Figure 8.2 shows 150 sampled points from a standard (2-variate) normal distribution, \(\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\).

Fig. 8.2 One hundred and fifty samples from a multivariate (standard) normal distribution in 2-D. Both coordinates are i.i.d. according to a standard normal distribution. The circle depicts the one-σ equal density line, the center of the circle is the mean and modal value at zero. In general, lines of equal density (level sets) are ellipsoids. The probability to sample a point outside the dashed box is close to \(1 - {(1 - 2 \times 0.0015)}^{2} \approx 1/170\)

Given mean, variances and covariances of a distribution, the chosen multivariate normal distribution has maximum entropy and—without any further knowledge—suggests itself for randomized search. We explain Eq. (8.5) in more detail.

  • The distribution mean value, \(\boldsymbol{m}\), is the incumbent solution of the algorithm: It is the current estimate for the global optimum provided by the search procedure. The distribution is point symmetrical about the incumbent. The incumbent \(\boldsymbol{m}\) is (usually) not evaluated on f. However, it should be evaluated as the final solution in the last iteration.

  • New solutions are obtained by disturbing \(\boldsymbol{m}\) with the mutation distribution

    $$\displaystyle\begin{array}{rcl} \mathcal{N}\left ({\boldsymbol 0},{\sigma }^{2}\boldsymbol{C}\right ) \equiv \sigma \times \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}\right )\;,& & {}\end{array}$$
    (8.6)

    where the equivalence holds by definition of \(\mathcal{N}\left (.,.\right )\). The parameter σ > 0 is a step-size or scale parameter and exists for notational convenience only. The covariance matrix \(\boldsymbol{C}\) has \(\frac{{n}^{2}+n} {2}\) degrees of freedom and represents a full quadratic model.

    The covariance matrix determines the shape of the distribution, where level-sets of the density are hyper-ellipsoids (refer to [12, 15] for more details). On convex quadratic cost functions, \(\boldsymbol{C}\) will closely align with the inverse Hessian of the cost function f (up to a scalar factor). The matrix \(\boldsymbol{C}\) defines a variable neighborhood metric. The above suggests that using the maximum entropy distribution with finite variances already implies the notion of neighborhood and underlines its importance.
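For concreteness, here is a minimal sketch of the sampling step Eq. (8.5): new candidates are obtained by transforming standard normal vectors with a matrix square root of \(\boldsymbol{C}\) (here a Cholesky factor); the specific matrix below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 2, 150
m, sigma = np.zeros(n), 1.0
C = np.array([[1.0, 0.6], [0.6, 2.0]])     # any positive definite matrix

# x_i ~ m + sigma * N(0, C): transform standard normal samples by a
# matrix square root of C (here the Cholesky factor, C = A A^T).
A = np.linalg.cholesky(C)
Z = rng.standard_normal((lam, n))           # i.i.d. standard normal
X = m + sigma * Z @ A.T                     # lambda candidate solutions

print(np.cov(X, rowvar=False))              # approximately sigma^2 * C
```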

The initial incumbent \(\boldsymbol{m}_{0}\) needs to be provided by the user. The algorithm has no preference for any specific value and its operations are invariant to the value of \(\boldsymbol{m}_{0}\) (see translation invariance in Sect. 8.4).

Equation (8.5) implements the principle of stationarity or unbiasedness, because the expected value of Eq. (8.6) is zero. Improvements are not a priori made by construction, but only after sampling by selection. In this way, the least additional assumptions are built into the search procedure.

The number of candidate solutions sampled in Eq. (8.5) cannot be entirely derived from first principles. For small \(\lambda \not\gg n\) the search process will be comparatively local and the algorithm can converge quickly. Only if previously sampled search points are considered could λ be chosen at its minimal value of one—in particular if the best so-far evaluated candidate solution is always retained. We tend to disregard previous samples entirely (see below). In this case, a selection must take place between λ ≥ 2 new candidate solutions. Because the mutation distribution is unbiased, newly sampled solutions tend to be worse than the previous best solution, and in practice λ ≥ 5 is advisable.

On the other hand, for large λ ≫ n, the search becomes more global and the probability to approach the desired, global optimum on multimodal functions is usually larger. On the downside, more function evaluations are necessary to closely approach an optimum even on simple functions.

Consequently, a comparatively successful overall strategy runs the algorithm first with a small population size, e.g., the default \(\lambda = 4 + \lfloor 3\ln n\rfloor \), and afterwards conducts independent restarts with increasing population sizes (IPOP) [6].
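A sketch of such a restart scheme is given below; `run_es` is a stand-in for a complete (CMA-)ES run, the doubling of λ per restart follows common practice for IPOP, and the budgets and the placeholder random search are illustrative assumptions.

```python
import numpy as np

def run_es(f, m0, sigma0, lam, budget):
    """Stand-in for one complete ES run; returns (best f-value, evaluations).
    Only a placeholder random search, so that the restart logic is runnable."""
    rng = np.random.default_rng(lam)
    best, evals = np.inf, 0
    while evals < budget:
        X = m0 + sigma0 * rng.standard_normal((lam, m0.size))
        best = min(best, min(f(x) for x in X))
        evals += lam
    return best, evals

# IPOP: independent restarts with the population size increased (doubled) each time.
n, total_budget = 10, 20000
lam = 4 + int(3 * np.log(n))          # the default lambda
used, best = 0, np.inf
while used < total_budget:
    b, e = run_es(lambda x: np.sum(x**2), np.ones(n), 1.0, lam, total_budget // 4)
    best, used = min(best, b), used + e
    lam *= 2                          # increase population size per restart
print(best, used)
```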

After we have established the sampling procedure using a parameterized distribution, we need to determine the distribution parameters which are essential to conduct efficient search. All parameters depend explicitly or implicitly on the past and therefore are described in their update equations.

8.3 Exploiting the Objective Function

The pairs \((\boldsymbol{x}_{i},f(\boldsymbol{x}_{i}))_{i=1,\ldots,\lambda }\) provide the information for choosing a new and better incumbent solution \(\boldsymbol{m}_{k+1}\) as well as the new distribution covariance matrix \({\sigma }^{2}\boldsymbol{C}\). Two principles are applied.

8.3.1 Old Information Is Disregarded

There are a few reasons to believe that old information can or should be disregarded.

  (a) The given \(({n}^{2} + 3n)/2\) distribution parameters, \(\boldsymbol{m}\) and \({\sigma }^{2} \times \boldsymbol{C}\), should already capture all necessary previous information. Two additional state variables, the search paths \({\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}} \in {\mathbb{R}}^{n}\), will provide another 2n parameters. Theoretical results suggest that only slight improvements can be made by storing and using (all) previously sampled candidate solutions [38, 39], given rank-based selection.

  (b) Convergence renders previously sampled solutions rather meaningless, because they are too far away from the currently focused region of interest.

  (c) Disregarding old solutions helps to avoid getting trapped in local optima.

  (d) An elitist approach can be destructive in the presence of noise, because a supersolution can stall any further updates. Under uncertainties, any information must be used with great caution.

8.3.2 Ranking of the Better Half Is Exploited

Only the ranking of the better half of the new candidate solutions is exploited. Function values are discarded, as is the ranking of the worse half of the newly sampled points. Specifically, the function f enters the algorithm only via the indices i:λ, for \(i = 1,\ldots,\mu\), where the index i:λ is defined such that

$$\displaystyle{ f(\boldsymbol{x}_{1:\lambda }) \leq f(\boldsymbol{x}_{2:\lambda }) \leq \ldots \leq f(\boldsymbol{x}_{\lambda:\lambda }) }$$
(8.7)

is satisfied. We choose \(\mu = \lfloor \lambda /2\rfloor \), because

  (a) On a linear function, in expectation, the better half of the new solutions improves over \(\boldsymbol{m}_{k}\), and for the same reason

  (b) On the quadratic sphere function only the better half of the new solutions can improve the performance, using positive recombination weights (see Eq. (8.12) below). For the remaining solutions, \(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}\) needs to enter with a negative prefactor [3].

We feel that using worse points to make predictions for the location of better points might rest on too strong an assumption about the regularity of f in general. Indeed, optimization would be a much easier task if outstandingly bad points generally allowed valid implications about the location of good points, because bad points are generally easy to obtain.

On the highly symmetrical, isotropic sphere model, using the worse half of the points with the same importance as the better half for calculating the new incumbent can render the convergence two times faster [2, 3]. In experiments with CMA-ES, we find the factor to be somewhat smaller and obtain very similar results also on the isotropic, highly multimodal Rastrigin function. On most anisotropic functions we observe performance degradations, and also failures in rare cases and in cases with noise. The picture, though, is more encouraging for a covariance matrix update with negative samples, as discussed below.

Because only the f-ranked solution points (rather than the f-values) are used, we denote the f-ranking also as (rank-based) selection. The exploitation of available information is quite conservative, reducing the possible ways of deception. As an additional advantage, function values do not need to be available (for example, when optimizing a game-playing algorithm, a passably accurate selection and ranking of the μ best current players suffices to proceed to the next iteration). This leads to a strong robustness property of the algorithm: invariance to order-preserving transformations, see the next section. The downside of using only the f-ranking is that the possible convergence speed cannot be faster than linear [7, 28, 38].
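In code, rank-based selection reduces to an argsort of the f-values; a minimal sketch (the function names and the test function are illustrative assumptions):

```python
import numpy as np

def select(X, F, mu):
    """Return the mu best points x_{1:lambda}, ..., x_{mu:lambda}.

    Only the ordering of F is used (rank-based selection); the values
    themselves, and the ranking of the worse half, are discarded."""
    idx = np.argsort(F)        # indices i:lambda such that Eq. (8.7) holds
    return X[idx[:mu]]

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 3))               # lambda = 10 samples
F = np.array([np.sum(x**2) for x in X])
print(select(X, F, mu=5))                      # the ranked better half
```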

8.4 Invariance

We begin with a general definition of invariance of a search algorithm \(\mathcal{A}\). In short, invariance means that \(\mathcal{A}\) does not change its behavior under exchange of f with an equivalent function \(h \in \mathcal{H}(f)\), in general up to a change of the initial conditions.

Definition 8.1 (Invariance).

Let \(\mathcal{H}\) be a mapping from the set of all functions into its power set, \(\mathcal{H}:\{ {\mathbb{R}}^{n} \rightarrow \mathbb{R}\} \rightarrow {2}^{\{{\mathbb{R}}^{n}\rightarrow \mathbb{R}\} }\), \(f\mapsto \mathcal{H}(f)\). Let S be the state space of the search algorithm, s ∈ S and \(\mathcal{A}_{f}: S \rightarrow S\) an iteration step of the algorithm under objective function f. The algorithm \(\mathcal{A}\) is invariant under \(\mathcal{H}\) (in other words: Invariant under the exchange of f with elements of \(\mathcal{H}(f)\)) if for all \(f \in \{ {\mathbb{R}}^{n} \rightarrow \mathbb{R}\}\), there exists for all \(h \in \mathcal{H}(f)\) a bijective state space transformation \(T_{f\rightarrow h}: S \rightarrow S\) such that for all states s ∈ S

$$\displaystyle{ \mathcal{A}_{h} \circ T_{f\rightarrow h}(s) = T_{f\rightarrow h} \circ \mathcal{A}_{f}(s), }$$
(8.8)

or equivalently

$$\displaystyle{ \mathcal{A}_{h}(s) = T_{f\rightarrow h} \circ \mathcal{A}_{f} \circ T_{f\rightarrow h}^{-1}(s). }$$
(8.9)

If \(T_{f\rightarrow h}\) is the identity for all \(h \in \mathcal{H}(f)\), the algorithm is unconditionally invariant under \(\mathcal{H}\). For randomized algorithms, the equalities hold almost surely, given appropriately coupled random number realizations, otherwise in distribution. The set of functions \(\mathcal{H}(f)\) is an invariance set of f for algorithm \(\mathcal{A}\).

The simplest example where unconditional invariance trivially holds is \(\mathcal{H}: f\mapsto \{f\}\). Any algorithm is unconditionally invariant under the “exchange” of f with f. The idea of invariance is depicted in the commutative diagram in Fig. 8.3. The two possible paths from the upper left to the lower right are reflected in Eq. (8.8).

Fig. 8.3 Commutative diagram for invariance. Vertical arrows depict an invertible transformation (encoding) T of the state variables. Horizontal arrows depict one time step of algorithm \(\mathcal{A}\), using the respective function and state variables. The two possible paths between a state at time k and a state at time k + 1 are equivalent in all (four) cases. The two paths from upper left to lower right are reflected in Eq. (8.8). For f = h the diagram becomes trivial with \(T_{f\rightarrow h}\) as the identity. One interpretation of the diagram is that given \(T_{f\rightarrow h}^{-1}\), any function h can be optimized like f

Equation (8.9) implies (trivially) for all \(k \in \mathbb{N}\) that

$$\displaystyle{ \mathcal{A}_{h}^{k}(s) = T_{ f\rightarrow h} \circ \mathcal{A}_{f}^{k} \circ T_{ f\rightarrow h}^{-1}(s), }$$
(8.10)

where \({\mathcal{A}}^{k}(s)\) denotes k iteration steps of the algorithm starting from s. Equation (8.10) reveals that for all \(h \in \mathcal{H}(f)\), the algorithm \(\mathcal{A}\) optimizes the function h with initial state s just like the function f with initial state \(T_{f\rightarrow h}^{-1}(s)\). In the lucky scenario, \(T_{f\rightarrow h}\) is the identity and \(\mathcal{A}\) behaves identically on f and h. Otherwise, first s must be moved to \(T_{f\rightarrow h}^{-1}(s)\), such that after an adaptation phase any function h is optimized just like the function f. This is particularly attractive if f is the easiest function in the invariance class. The adaptation time naturally depends on the distance between s and \(T_{f\rightarrow h}^{-1}(s)\).

We give the first example of unconditional invariance to order-preserving transformations of f.

Proposition 8.1 (Invariance to order-preserving transformations).

For all strictly increasing functions \(g: \mathbb{R} \rightarrow \mathbb{R}\) and for all \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) , the CMA-ES behaves identically on the objective function \(\boldsymbol{x}\mapsto f(\boldsymbol{x})\) and the objective function \(\boldsymbol{x}\mapsto g(f(\boldsymbol{x}))\) . In other words, CMA-ES is unconditionally invariant under

$$\displaystyle{ \mathcal{H}_{\mathrm{monotonic}}: f\mapsto \{g \circ f\;\vert \;g\ \text{is strictly increasing}\}. }$$
(8.11)

Additionally, for each \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) , the set of functions \(\mathcal{H}_{\mathrm{monotonic}}(f)\) —the orbit of f—is an equivalence class of functions with indistinguishable search trace.

Proof idea.

Only the f-ranking of solutions is used in CMA-ES, and g does not change this ranking. We define the equivalence relation as f ∼ h iff ∃g strictly increasing such that f = g ∘ h. Then, reflexivity, symmetry and transitivity for the equivalence relation ∼ can be shown elementarily, recognizing that the identity, \(g^{-1}\), and compositions of strictly increasing functions are strictly increasing. □ 

The CMA-ES depends only on the sub-level sets \(\{\boldsymbol{x}\;\vert \;f(\boldsymbol{x}) \leq \alpha \}\) for \(\alpha \in \mathbb{R}\). The monotonic transformation g does not change the sub-level sets, that is, \(\{\boldsymbol{x}\;\vert \;g(f(\boldsymbol{x})) \leq g(\alpha )\} =\{ \boldsymbol{x}\;\vert \;f(\boldsymbol{x}) \leq \alpha \}\).
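A two-line numerical check of Proposition 8.1: a strictly increasing g leaves the ranking, and hence every quantity the algorithm computes, unchanged. The specific g below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(5)
F = rng.standard_normal(8)               # f-values of 8 candidate solutions
g = lambda y: np.exp(3 * y) + 5          # a strictly increasing transformation

# The ranking, and hence every step of CMA-ES, is unchanged under g:
assert (np.argsort(F) == np.argsort(g(F))).all()
print("ranking invariant under g")
```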

8.5 Update of the Incumbent

Given the restricted usage of information from the evaluations of f, the incumbent is generally updated with a weighted mean of mutation steps:

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k+1} = \boldsymbol{m}_{k} +\, c_{\mathrm{m}}\sum _{i=1}^{\mu }w_{ i}\,(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k})& &{}\end{array}$$
(8.12)

with

$$\displaystyle\begin{array}{rcl} \sum _{i=1}^{\mu }\vert w_{i}\vert = 1,\quad w_{1} \geq w_{2} \geq \ldots \geq w_{\mu },\quad 0 < c_{\mathrm{m}} \leq 1\;.& &{}\end{array}$$
(8.13)

The question of how to choose optimal weight values w i is pursued in [3], and the default values in Table 8.2 of the Appendix approximate the optimal positive values on the infinite dimensional sphere model. As discussed above, we add the constraints

$$\displaystyle\begin{array}{rcl} w_{\mu } > 0\quad \text{and}\quad \mu \leq \lambda /2\;,& &{}\end{array}$$
(8.14)

while the formulation with Eq. (8.12) also covers more general settings. Usually, we set the learning rate \(c_{\mathrm{m}} = 1\) and the computation of the new incumbent simplifies to

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k+1} =\sum _{ i=1}^{\mu }w_{ i}\,\boldsymbol{x}_{i:\lambda }\;.& &{}\end{array}$$
(8.15)

A learning rate of one seems to be the largest sensible setting. A value larger than one should only be advantageous if \(\sigma _{k}\) is too small, and implies that the step-size heuristic should be improved. A very small \(\sigma _{k}\) together with \(c_{\mathrm{m}} \gg 1\) resembles a classical gradient descent scenario.

The amount of utilized information can be quantified via the variance effective selection mass, or effective μ

$$\displaystyle{ \mu _{\mathrm{eff}} ={ \left (\sum _{i=1}^{\mu }w_{ i}^{2}\right )}^{-1}\;, }$$
(8.16)

where we can easily derive the tight bounds \(1 <\mu _{\mathrm{eff}} \leq \mu\). Usually, a weight setting with \(\mu _{\mathrm{eff}} \approx \lambda /4\) is appropriate. Given \(\mu _{\mathrm{eff}}\), the specific choice of the weights is comparatively uncritical. The presented way to update the incumbent using a weighted mean of all μ selected points gives rise to the name \((\mu /\mu _{\mathrm{w}},\lambda )\)-CMA-ES.
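A sketch of Eqs. (8.15) and (8.16) with log-decreasing positive weights of the kind commonly used as defaults (cf. Table 8.2 in the appendix); the exact constants and the stand-in samples below are illustrative assumptions.

```python
import numpy as np

lam = 12
mu = lam // 2
# Log-decreasing positive weights (a common default), normalized to sum to one.
w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
w /= w.sum()

mu_eff = 1.0 / np.sum(w**2)      # Eq. (8.16); here roughly lambda / 4
print(mu_eff)                    # satisfies 1 < mu_eff <= mu

# Incumbent update Eq. (8.15) with c_m = 1, for ranked points X_sel:
rng = np.random.default_rng(6)
X_sel = rng.standard_normal((mu, 5))   # stand-in for x_{1:lam}, ..., x_{mu:lam}
m_next = w @ X_sel                     # weighted recombination
print(m_next)
```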

Proposition 8.2 (Random ranking and stationarity of the incumbent).

Under (pure) random ranking, \(\boldsymbol{m}_{k}\)  follows an unbiased random walk

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k+1} \sim \boldsymbol{m}_{k} + \frac{\sigma _{k}} {\sqrt{\mu _{\mathrm{eff }}}}\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}_{k}\right )& &{}\end{array}$$
(8.17)

and consequently

$$\displaystyle\begin{array}{rcl} E(\boldsymbol{m}_{k+1}\vert \boldsymbol{m}_{k}) = \boldsymbol{m}_{k}\;.& &{}\end{array}$$
(8.18)

Pure random ranking means that the index values \(i:\lambda \in \{ 1,\ldots,\lambda \}\) do not depend on \(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{\lambda }\) , for all \(i = 1,\ldots,\lambda\) , for example, when \(f(\boldsymbol{x})\) is a random variable with a density and does not depend on \(\boldsymbol{x}\) , or when i: λ is set to i.

Proof idea.

Equation (8.17) follows from Eqs. (8.5), (8.12), (8.16), and (8.18) follows because \(E\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}\right ) = {\boldsymbol 0}\) by definition. □ 

The proposition affirms that only selection (f-ranking) can induce a biased movement of the incumbent m.

Proposition 8.3 (Maximum likelihood estimate of the mean).

Given \(\boldsymbol{x}_{1:\lambda },\ldots,\boldsymbol{x}_{\mu:\lambda }\) , the incumbent \(\boldsymbol{m}_{k+1}\) maximizes, independent of the positive definite matrix \(\boldsymbol{C}\) , the weighted likelihood

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k+1} = \mathop\mathrm{arg\,max}\limits_{\boldsymbol{m}\in {\mathbb{R}}^{n}}\prod _{i=1}^{\mu }p_{_{\!\!\! \mathcal{N}}}^{w_{i} }\left (\boldsymbol{x}_{i:\lambda }\,\vert \,\boldsymbol{m}\right ),& &{}\end{array}$$
(8.19)

where \(p_{_{\!\!\!\mathcal{N}}}^{w_{i}}(\boldsymbol{x}\,\vert \,\boldsymbol{m}) = {(p_{_{\!\!\! \mathcal{N}}}(\boldsymbol{x}\,\vert \,\boldsymbol{m}))}^{w_{i}}\) and \(p_{_{\!\!\!\mathcal{N}}}(\boldsymbol{x}\,\vert \,\boldsymbol{m})\) denotes the density of \(\mathcal{N}\left (\boldsymbol{m},\boldsymbol{C}\right )\) at point \(\boldsymbol{x}\) , or equivalently the weighted log-likelihood

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k+1} = \mathop\mathrm{arg\,max}\limits_{\boldsymbol{m}\in {\mathbb{R}}^{n}}\sum _{i=1}^{\mu }w_{ i} \times \log p_{_{\!\!\!\mathcal{N}}}\left (\boldsymbol{x}_{i:\lambda }\,\vert \,\boldsymbol{m}\right ).& &{}\end{array}$$
(8.20)

Proof idea.

We exploit the one-dimensional normal density and the fact that the multivariate normal distribution, after a coordinate system rotation, can be decomposed into n independent marginal distributions. □ 

Finally, we find translation invariance, a property that every continuous search algorithm should enjoy.

Proposition 8.4 (Translation invariance).

The CMA-ES is translation invariant, that is, invariant under

$$\displaystyle{ \mathcal{H}_{\mathrm{trans}}: f\mapsto \{h_{\boldsymbol{a}}: \boldsymbol{x}\mapsto f(\boldsymbol{x} -\boldsymbol{a})\;\vert \;\boldsymbol{a} \in {\mathbb{R}}^{n}\}\;, }$$
(8.21)

with the bijective state transformation, \(T_{f\rightarrow h_{\boldsymbol{a}}}\) , that maps \(\boldsymbol{m}\) to \(\boldsymbol{m} + \boldsymbol{a}\) (cf. Fig. 8.3). In other words, the trace of \(\boldsymbol{m}_{k} + \boldsymbol{a}\) is the same for all functions \(h_{\boldsymbol{a}} \in \mathcal{H}_{\mathrm{trans}}\) .

Proof idea.

Consider Fig. 8.3: An iteration step with state \((\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\ldots )\) using cost function \(\boldsymbol{x}\mapsto f(\boldsymbol{x})\) in the upper path is equivalent with an iteration step with state \((\boldsymbol{m}_{k} + \boldsymbol{a},\sigma _{k},\boldsymbol{C}_{k},\ldots )\) using cost function \(h_{\boldsymbol{a}}: \boldsymbol{x}\mapsto f(\boldsymbol{x} -\boldsymbol{a})\) in the lower path. □ 

Translation invariance, meaning also that \(\boldsymbol{m}_{k} -\boldsymbol{m}_{0}\) does not depend on \(\boldsymbol{m}_{0}\), is a rather indispensable property for a search algorithm. Nevertheless, because \(\boldsymbol{m}_{k}\) depends on \(\boldsymbol{m}_{0}\), a reasonable choice for \(\boldsymbol{m}_{0}\), depending on f, is advisable.

8.6 Step-Size Control

Step-size control aims to make a search algorithm adaptive to the overall scale of search. Step-size control allows for fast convergence to an optimum and serves to satisfy the following basic demands on a search algorithm:

  1. Solving linear functions, like \(f(\boldsymbol{x}) = x_{1}\). On linear functions we desire a geometrical increase of the f-gain \(f(\boldsymbol{m}_{k}) - f(\boldsymbol{m}_{k+1})\) with increasing k.

  2. Solving the simplest convex-quadratic function, the sphere function

    $$\displaystyle{ f(\boldsymbol{x}) =\sum _{ i=1}^{n}{(x_{ i} - x_{i}^{{\ast}})}^{2} =\| \boldsymbol{x} -{\boldsymbol{x}{}^{{\ast}}\|}^{2}, }$$
    (8.22)

    quickly. We desire

    $$\displaystyle{ \frac{\left \|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\right \|} {\left \|\boldsymbol{m}_{0} -{\boldsymbol{x}}^{{\ast}}\right \|} \approx \exp \left (-c\frac{k} {n}\right ), }$$
    (8.23)

    such that \(c \not\ll 0.02\min (n,\lambda )\), because \(c \approx 0.25\lambda\) is the optimal value which can be achieved with optimal step-size and optimal positive weights for \(\lambda \not\gg n\) (\(c \approx 0.5\lambda\) can be achieved using also negative weights for \(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}\) in Eq. (8.12), see [3]). The optimal step-size changes when approaching the optimum.

Additionally, step-size control will provide scale invariance, as explicated below.

Unfortunately, step-size control can hardly be derived from first principles and therefore relies on some internal model or some heuristics. Line-search is one such heuristic that decides on the realized step length after the direction of the step is given. Surprisingly, a line-search can gain very little over a fixed (optimal) step length given in each iteration [27]. Recent theoretical results even seem to indicate that in the limit \(n \rightarrow \infty\) the optimal progress rate cannot be improved at all by a cost-free ray search on a half-line (given positive weights) or by a line search otherwise (Jebalia M, personal communication). A few further heuristics for step-size control are well-recognized:

  1. Controlling the success rate of new candidate solutions, compared to the best solution seen so far (one-fifth success rule) [33, 35].

  2. Sampling different candidate solutions with different step-sizes (self-adaptation) [33, 36]. Selected solutions also retain their step-size.

  3. Testing different step-sizes by conducting additional test steps in direction \(\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k}\), resembling a rudimentary line-search (two-point adaptation) [18, 34].

  4. Controlling the length of the search path, taken over a number of iterations (cumulative step-size adaptation, CSA, or path-length control) [32].

In our context, the last two approaches find reasonable values for σ in simple test cases (like ridge topologies).

We use cumulative step-size adaptation here. The underlying design principle is to achieve perpendicularity of successive steps. Perpendicularity is measured using an evolution path and a variable metric.

Conceptually, an evolution path, or search path, of length j is the vector

$$\displaystyle{ \boldsymbol{m}_{k} -\boldsymbol{m}_{k-j}, }$$
(8.24)

that is, the total displacement of the mean during j iterations. For technical convenience, and in order to satisfy the stationarity condition Eq. (8.26), we compute the search path, \({\boldsymbol{p}}^{\sigma }\), in an iterative momentum equation with the initial path \(\boldsymbol{p}_{0}^{\sigma } = {\boldsymbol 0}\) as

$$\displaystyle{ \boldsymbol{p}_{k+1}^{\sigma } = (1 - c_{\sigma })\,\boldsymbol{p}_{ k}^{\sigma } + \sqrt{c_{\sigma }(2 - c_{\sigma })\mu _{\mathrm{ eff}}}\;{\boldsymbol{C}_{k}}^{-\frac{1} {2} }\, \frac{\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k}} {\sigma _{k}} \;. }$$
(8.25)

The factor \(1 - c_{\sigma } > 0\) is the decay weight, and \(1/c_{\sigma } \approx n/3\) is the backward time horizon. After \(1/c_{\sigma }\) iterations about \(1 -\exp (-1) \approx 63\,\%\) of the information has been replaced; \({\boldsymbol{C}_{k}}^{-\frac{1} {2} }\) is the positive symmetric square root of \({\boldsymbol{C}_{k}}^{-1}\). The remaining factors are, without further degree of freedom, chosen to guarantee the stationarity,

$$\displaystyle{ \boldsymbol{p}_{k}^{\sigma } \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\quad \text{for } k = 1,2,3,\ldots \;, }$$
(8.26)

given \(\boldsymbol{p}_{0}^{\sigma } \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\) and pure random ranking of \(\boldsymbol{x}_{i:\lambda }\) in all preceding time steps.

The length of the evolution path is used to update the step-size σ either following [29]

$$\displaystyle{ \sigma _{k+1} = \sigma _{k} \times \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left (\frac{{\|\boldsymbol{p}_{k+1}^{\sigma }\|}^{2} - n} {2n} \right )\right ) }$$
(8.27)

or via

$$\displaystyle{ \sigma _{k+1} = \sigma _{k} \times \exp \left (\frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\boldsymbol{p}_{k+1}^{\sigma }\|} {\mathsf{E}\|\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\|} - 1\right )\right )\;, }$$
(8.28)

where \(d_{\sigma } \approx 1\). The step-size increases if and only if \({\|\boldsymbol{p}_{k+1}^{\sigma }\|}^{2}\) (respectively \(\|\boldsymbol{p}_{k+1}^{\sigma }\|\)) is larger than its expected value, and decreases otherwise. Equation (8.27) is more appealing and easier to analyze, but Eq. (8.28) might have an advantage in practice. In practice, an upper bound on the argument of exp is sometimes useful as well.
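A compact sketch of cumulative step-size adaptation, combining the path update Eq. (8.25) with the step-size update Eq. (8.28); the parameter values (\(1/c_{\sigma } \approx n/3\), \(d_{\sigma } = 1\)) and the standard approximation for \(\mathsf{E}\|\mathcal{N}({\boldsymbol 0},\boldsymbol{I})\|\) are stated here as assumptions.

```python
import numpy as np

def csa_update(p_sigma, sigma, m_new, m_old, C_inv_sqrt, mu_eff, n,
               c_sigma=None, d_sigma=1.0):
    """One CSA step, Eqs. (8.25) and (8.28), with illustrative defaults."""
    if c_sigma is None:
        c_sigma = 3.0 / n                    # backward time horizon 1/c ~ n/3
    step = (m_new - m_old) / sigma
    p_sigma = ((1 - c_sigma) * p_sigma
               + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * C_inv_sqrt @ step)
    # E||N(0,I)|| ~ sqrt(n) * (1 - 1/(4n) + 1/(21 n^2)), a standard approximation
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))
    sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return p_sigma, sigma

n = 20
p, s = np.zeros(n), 1e-9
rng = np.random.default_rng(7)
for _ in range(5):   # with random steps, log(sigma) performs an unbiased walk
    m_step = s * rng.standard_normal(n) / np.sqrt(3.7)   # cf. Eq. (8.17)
    p, s = csa_update(p, s, m_step, np.zeros(n), np.eye(n), 3.7, n)
print(s)
```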

Figure 8.4 depicts the idea of the step-size control schematically.

Fig. 8.4 Schematic depiction of three evolution paths in the search space (each with six successive steps of \(\boldsymbol{m}_{k}\)). Left: Single steps cancel each other out and the evolution path is short. Middle: Steps are “on average orthogonal”. Right: Steps are positively correlated and the evolution path is long. The length of the path is a good indicator for optimality of the step-size

  • If steps are positively correlated, the evolution path tends to be long (right picture). A similar trajectory could be covered by fewer but longer steps and the step-size is increased.

  • If steps are negatively correlated they tend to cancel each other out and the evolution path is short (left picture). Shorter steps seem more appropriate and the step-size is decreased.

  • If the f-ranking does not affect the length of the evolution path, the step-size is unbiased (middle picture).

We note two major postulates related to step-size control and two major design principles of the step-size update.

Postulate 1 (Conjugate steps).

Successive iteration steps should be approximately \({\boldsymbol{C}}^{-1}\) -conjugate, that is, orthogonal with respect to the inner product (and metric) defined by \({\boldsymbol{C}}^{-1}\) .

As a consequence of this postulate, we have used perpendicularity as optimality criterion for step-size control.

If steps are uncorrelated, like under random selection, they indeed become approximately \({\boldsymbol{C}}^{-1}\)-conjugate, that is, \({\left (\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k}\right )}^{\mathrm{T}}{\boldsymbol{C}}^{-1}\left (\boldsymbol{m}_{k} -\boldsymbol{m}_{k-1}\right ) \approx 0\), see [15]. This means the steps are orthogonal with respect to the inner product defined by \({\boldsymbol{C}}^{-1}\) and therefore orthogonal in the coordinate system defined by \(\boldsymbol{C}\). In this coordinate system, the coordinate axes, where the independent sampling takes place, are eigenvectors of \(\boldsymbol{C}\). Seemingly uncorrelated steps are the desired case and are achieved by using \({\boldsymbol{C}}^{-1/2}\) in Eq. (8.25).

In order to better understand the following assertions, we rewrite the step-size update in Eq. (8.28), only using an additive update term,

$$\displaystyle{ \log \sigma _{k+1} =\log \sigma _{k} + \frac{c_{\sigma }} {d_{\sigma }}\left ( \frac{\|\boldsymbol{p}_{k+1}^{\sigma }\|} {\mathsf{E}\|\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\|} - 1\right )\;. }$$
(8.29)

First, in accordance with our stationary design principle, we establish a stationarity condition on the step-size.

Proposition 8.5 (Stationarity of step-size).

Given pure random ranking and \(\boldsymbol{p}_{0}^{\sigma } \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\) , the quantity \(\log \sigma _{k}\) performs an unbiased random walk (see Eq. (8.29) ). Consequently, the step-size obeys the stationarity condition

$$\displaystyle{ E(\log \sigma _{k+1}\vert \sigma _{k}) =\log \sigma _{k}\;. }$$
(8.30)

Proof idea.

We analyze the update Eqs. (8.29) and (8.25). □ 

Postulate 2 (Behavior on linear functions [14]).

On a linear function, the dispersion of new candidate solutions should increase geometrically fast in the iteration sequence, that is, linearly on the log scale. Given \(\sigma _{k}^{\beta }\) as dispersion measure with β > 0, we can set w.l.o.g. β = 1 and demand for some α > 0

$$\displaystyle{ E(\log \sigma _{k+1}\vert \sigma _{k}) \geq \log \sigma _{k} +\alpha \;. }$$
(8.31)

The CMA-ES satisfies the postulate for some \(k_{0}\) and all \(k \geq k_{0}\), because on a linear function the expected length of the evolution path increases monotonically. We reckon that \(k_{0} \propto 1/c_{\sigma }\). Finally, we investigate the more abstract concept of scale invariance as depicted in Fig. 8.5.

Proposition 8.6 (Scale invariance).

The CMA-ES is invariant under

$$\displaystyle{ \mathcal{H}_{\mathrm{scale}}: f\mapsto \{h_{\alpha }: \boldsymbol{x}\mapsto f(\boldsymbol{x}/\alpha )\;\vert \;\alpha > 0\} }$$
(8.32)

with the associated bijective state space transformation

$$\displaystyle{T: (\boldsymbol{m},\sigma,\boldsymbol{C},{\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}})\mapsto (\alpha \boldsymbol{m},\alpha \sigma,\boldsymbol{C},{\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}})\;.}$$

That means for all states \((\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\boldsymbol{p}_{k}^{\sigma },\boldsymbol{p}_{k}^{\mathrm{c}})\)

$$\displaystyle\begin{array}{rcl} \text{CMA-ES}_{h}(T(\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\boldsymbol{p}_{k}^{\sigma },\boldsymbol{p}_{k}^{\mathrm{c}})) = T(\text{CMA-ES}_{f}(\mathop{\underbrace{\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\boldsymbol{p}_{k}^{\sigma },\boldsymbol{p}_{k}^{\mathrm{c}}}}\limits _{={T}^{-1}(T(\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\,\boldsymbol{p}_{k}^{\sigma },\,\boldsymbol{p}_{k}^{\mathrm{c}}))})),& &{}\end{array}$$
(8.33)

see Fig. 8.5 . Furthermore, for any given \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) , the set of functions \(\mathcal{H}_{\mathrm{scale}}(f)\) —the orbit of f—is an equivalence class.

Proof idea.

We investigate the update equations of the state variables comparing the two possible paths from the lower left to the lower right in Fig. 8.5. The equivalence relation property can be shown elementarily (cf. Proposition 8.1) or using the property that the set {α > 0} is a transformation group over the set \(\{h: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\}\) and therefore induces the equivalence classes \(\mathcal{H}_{\mathrm{scale}}(f)\) (see also Proposition 8.9). □ 

Fig. 8.5 Commutative diagram for scale invariance. Vertical arrows depict an invertible transformation (encoding) T of all state variables of CMA-ES with \(T(\alpha ): (\boldsymbol{m},\sigma,\boldsymbol{C},{\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}})\mapsto (\alpha \boldsymbol{m},\alpha \sigma,\boldsymbol{C},{\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}})\). Horizontal arrows depict one time step of CMA-ES, applied to the respective tuple of state variables. The two possible paths between a state at time k and a state at time k + 1 are equivalent in all (four) cases. For α = 1 the diagram becomes trivial. The diagram suggests that CMA-ES is invariant under the choice of α > 0 in the sense that, given T and \({T}^{-1}\) were available, any function \(\boldsymbol{x}\mapsto f(\alpha \boldsymbol{x})\) is (at least) as easy to optimize as f

Invariance allows us to draw the commutative diagram of Fig. 8.5. Scale invariance can be interpreted in several ways:

  • The choice of scale α is irrelevant for the algorithm, that is, the algorithm has no intrinsic (built-in) notion of scale.

  • The transformation T in Fig. 8.5 is a change of coordinate system (here: change of scale) and the update equations are independent of the actually chosen coordinate system; that is, they could be formulated in an algebraic way.

  • For functions in the equivalence class \(\mathcal{H}_{\mathrm{scale}}(f)\), the trace of the algorithm \((\alpha \boldsymbol{m}_{k},\alpha \sigma _{k},\boldsymbol{C}_{k},\boldsymbol{p}_{k}^{\sigma },\boldsymbol{p}_{k}^{\mathrm{c}})\) will be identical for all \(k = 0,1,2,\ldots\), given that \(\boldsymbol{m}_{0}\) and \(\sigma _{0}\) are chosen appropriately, for example, \(\sigma _{0} = 1/\alpha\) and \(\boldsymbol{m}_{0} = \sigma _{0} \times \boldsymbol{a}\). Then the trace for k = 0 equals \((\alpha \boldsymbol{m}_{0},\alpha \sigma _{0},\boldsymbol{C}_{0},\ldots ) = (\boldsymbol{a},1,\boldsymbol{C}_{0},\ldots )\), and the trace does not depend on α for any \(k \geq 0\).

  • From the last point follows that the step-size control has a distinct role in scale invariance. In practice, when α is unknown, adaptation of the step-size that achieves \(\sigma _{k} \propto 1/\alpha\) can render the algorithm virtually independent of α.

Scale invariance and step-size control also facilitate the possibility of linear convergence in k to the optimum \({\boldsymbol{x}}^{{\ast}}\), in that

$$\displaystyle\begin{array}{rcl} \lim _{k\rightarrow \,\infty }\root{k}\of{\frac{\|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\|} {\|\boldsymbol{m}_{0} -{\boldsymbol{x}}^{{\ast}}\|}} =\exp \left (-\frac{c} {n}\right )\;& &{}\end{array}$$
(8.34)

exists with c > 0 or equivalently,

$$\displaystyle\begin{array}{rcl} \lim _{k\rightarrow \infty }\frac{1} {k}\log \|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\|& =& \lim _{k\rightarrow \infty }\frac{1} {k}\log \frac{\|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\|} {\|\boldsymbol{m}_{0} -{\boldsymbol{x}}^{{\ast}}\|} \\ & =& \lim _{k\rightarrow \infty }\frac{1} {k}\sum _{j=1}^{k}\log \frac{\|\boldsymbol{m}_{j} -{\boldsymbol{x}}^{{\ast}}\|} {\|\boldsymbol{m}_{j-1} -{\boldsymbol{x}}^{{\ast}}\|} \\ & =& -\frac{c} {n}\; {}\end{array}$$
(8.35)

and similarly

$$\displaystyle\begin{array}{rcl} E\left (\log \frac{\|\boldsymbol{m}_{k+1} -{\boldsymbol{x}}^{{\ast}}\|} {\|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\|} \right ) \rightarrow -\frac{c} {n}\quad \text{for } k \rightarrow \infty \;.& &{}\end{array}$$
(8.36)

Hence, c denotes a convergence rate and for c > 0 the algorithm converges “log-linearly” (in other words, geometrically fast) to the optimum.

In the beginning of this section we stated two basic demands on a search algorithm that step-size control is meant to address, namely solving linear functions and the sphere function appropriately fast. We now examine, with a single experiment, whether the demands are satisfied.

Figure 8.6 shows a run on the objective function \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R},\boldsymbol{x}\mapsto \|\boldsymbol{x}\|\), with n = 20, λ = 12 (the default value, see Table 8.2) and with \(\sigma _{0} = 1{0}^{-9}\) chosen far too small given that \(\boldsymbol{m}_{0} = {\boldsymbol 1}\). The outcome when repeating this experiment always looks very similar. We discuss the demands in turn.

Fig. 8.6 A run of CSA-ES (Eqs. (8.5), (8.15), (8.25) and (8.28)) on the objective function \(f: {\mathbb{R}}^{20} \rightarrow \mathbb{R},\boldsymbol{x}\mapsto \|\boldsymbol{x}\|\), as a member of the equivalence class of functions \(\boldsymbol{x}\mapsto g(\|\alpha \,\boldsymbol{x} -{\boldsymbol{x}}^{{\ast}}\|)\) with identical behavior, given \(\sigma _{0} \propto 1/\alpha\) and \(\boldsymbol{m}_{0} = \sigma _{0} \times (\mathrm{const} +{ \boldsymbol{x}}^{{\ast}})\). Here, \(\boldsymbol{m}_{0} = {\boldsymbol 1}\) and the initial step-size \(\sigma _{0} = 1{0}^{-9}\) is chosen far too small. Left: \(f(\boldsymbol{m}_{k})\) (thick blue graph) and \(\sigma _{k}\) versus iteration number k in a semi-log plot. Right: All components of \(\boldsymbol{m}_{k}\) versus k

  1. During the first 170 iterations the algorithm virtually “observes” the linear function \(\boldsymbol{x}\mapsto \sum _{i=1}^{20}x_{i}\) at point \({\boldsymbol 1} \in {\mathbb{R}}^{20}\). We see during this phase that σ increases geometrically fast (linearly on the log scale). From this observation, and the invariance properties of the algorithm (also rotation invariance, see below), we can safely infer that the demand for linear functions is satisfied.

  2. After the adaptation of σ, at about 180 iterations, linear convergence to the optimum can be observed. We compute the convergence rate between iteration 180 and 600 from the graph. Starting with \(\frac{\|\boldsymbol{m}_{k}\|} {\|\boldsymbol{m}_{0}\|} \approx \exp \left (-c\frac{k} {n}\right )\) from Eq. (8.23) we replace \(\boldsymbol{m}_{0}\) with \(\boldsymbol{m}_{180}\) and compute

    $$\displaystyle{ \frac{\|\boldsymbol{m}_{k=600}\|} {\|\boldsymbol{m}_{k=180}\|} \approx \frac{1{0}^{-9.5}} {1{0}^{0}} \approx \exp \left (-c\frac{600 - 180} {20} \right )\;. }$$
    (8.37)

    Solving for c yields c ≈ 1.0, and with \(\min (n,\lambda ) =\lambda = 12\) we get \(c \approx 1.0 \not\ll 0.24 = 0.02\min (n,\lambda )\). Our demand on the convergence rate c is more than satisfied. The same can be observed when covariance matrix adaptation is applied additionally (not shown).

The demand on the convergence (8.23) can be rewritten as

$$\displaystyle{ \log \|\boldsymbol{m}_{k} -{\boldsymbol{x}}^{{\ast}}\|\approx -c\frac{k} {n} +\mathrm{ const}\;. }$$
(8.38)

The k in the RHS numerator implies linear convergence in the number of iterations. The n in the denominator implies linear scale-up: The number of iterations to reduce the distance to the optimum by a given factor increases linearly with the dimension n. Linear convergence can also be achieved with covariance matrix adaptation. Given \(\lambda \not\gg n\), linear scale-up cannot be achieved with covariance matrix adaptation alone, because a reliable setting for the learning rate for the covariance matrix is o(1∕n). However, step-size control is reliable and achieves linear scale-up given the step-size damping parameter \(d_{\sigma } = O(1)\) in Eq. (8.28). Scale-up experiments are indispensable to support this claim and have been done, for example, in [22].

8.7 Covariance Matrix Adaptation

In the remainder we exploit the f-ranked (i.e., selected and ordered) set \((\boldsymbol{x}_{1:\lambda },\ldots,\boldsymbol{x}_{\mu:\lambda })\) to update the covariance matrix \(\boldsymbol{C}\). First, we note that the covariance matrix represents variation parameters. Consequently, an apparent principle is to encourage, or reinforce, variations that have been successful—just like successful candidate solutions are reinforced in the update of \(\boldsymbol{m}\) in Eq. (8.15). Based on the current set of f-ranked points, the successful variations are (by definition)

$$\displaystyle\begin{array}{rcl} \boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}\quad \text{for } i = 1,\ldots,\mu \;.& &{}\end{array}$$
(8.39)

Note that “successful variation” does not imply \(f(\boldsymbol{x}_{i:\lambda }) < f(\boldsymbol{m}_{k})\), which is neither necessary nor important nor even desirable in general. Even the demand \(f(\boldsymbol{x}_{1:\lambda }) < f(\boldsymbol{m}_{k})\) would often result in far too small a step-size.

8.7.1 The Rank-μ Matrix

From the successful variations in (8.39) we form a covariance matrix

$$\displaystyle\begin{array}{rcl} \boldsymbol{C}_{k+1}^{\mu } =\sum _{ i=1}^{\mu }w_{ i}\,\frac{\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}} {\sigma _{k}} \times \frac{{\left(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}\right)}^{\mathrm{T}}} {\sigma _{k}} \;.& &{}\end{array}$$
(8.40)

Equation (8.40) is analogous to Eq. (8.15) where successful solution points are used to form the new incumbent. We can easily derive the condition

$$\displaystyle\begin{array}{rcl} E(\boldsymbol{C}_{k+1}^{\mu }\vert \boldsymbol{C}_{ k}) = \boldsymbol{C}_{k}& &{}\end{array}$$
(8.41)

under pure random ranking, thus explaining the factors \(1/\sigma _{k}\) in (8.40).
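The rank-μ matrix of Eq. (8.40) in code; the weights and the stand-in samples below are illustrative assumptions.

```python
import numpy as np

def rank_mu_matrix(X_sel, m, sigma, w):
    """C^mu of Eq. (8.40): weighted outer products of the selected steps."""
    Y = (X_sel - m) / sigma           # successful variations, rescaled by sigma
    return (w[:, None] * Y).T @ Y     # sum_i w_i y_i y_i^T

rng = np.random.default_rng(8)
n, mu = 5, 6
w = np.full(mu, 1.0 / mu)
X_sel = rng.standard_normal((mu, n))          # stand-in for ranked samples
C_mu = rank_mu_matrix(X_sel, np.zeros(n), 1.0, w)
print(np.allclose(C_mu, C_mu.T))              # symmetric by construction
```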

Assuming the weights w i as given, the matrix \(\boldsymbol{C}_{k+1}^{\mu }\) maximizes the (weighted) likelihood of the f-ranked steps.

Proposition 8.7 (Maximum likelihood estimate of \(\boldsymbol{C}\)).

Given μ ≥ n, the matrix \(\boldsymbol{C}_{k+1}^{\mu }\) maximizes the weighted log-likelihood

$$\displaystyle\begin{array}{rcl} \boldsymbol{C}_{k+1}^{\mu } = \mathop\mathrm{arg\,max}\limits_{\boldsymbol{C}\ \text{pos. def.}}\,\sum _{i=1}^{\mu }w_{ i} \times \log p_{_{\!\!\!\mathcal{N}}}\left (\left.\frac{\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}} {\sigma _{k}} \,\right \vert \,\boldsymbol{C}\right )& &{}\end{array}$$
(8.42)

where \(p_{_{\!\!\!\mathcal{N}}}(\boldsymbol{x}\,\vert \,\boldsymbol{C})\) denotes the density of \(\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}\right )\) at point \(\boldsymbol{x}\), and therefore the RHS of Eq. (8.42) reads more explicitly

$$\displaystyle\begin{array}{rcl} \mathop\mathrm{arg\,max}\limits_{\boldsymbol{C}\ \text{pos. def.}}\left (-\frac{1} {2}\log \det (\alpha \boldsymbol{C}) - \frac{1} {2{\sigma _{k}}^{2}}\sum _{i=1}^{\mu }w_{ i}{(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k})}^{\mathrm{T}}{\boldsymbol{C}}^{-1}(\boldsymbol{x}_{ i:\lambda } -\boldsymbol{m}_{k})\right )& &{}\end{array}$$
(8.43)

where \(\alpha = 2\pi {\sigma _{k}}^{2}\) is irrelevant for the result.

Proof idea.

The proof is nontrivial but works similarly to the classical non-weighted case. □ 

In contrast to the computation of \(\boldsymbol{m}\) in Eq. (8.12), we are not aware of a derivation for optimality of certain weight values in Eq. (8.40). Future results might reveal that different weights and/or even a different value for μ are desirable for Eqs. (8.12) and (8.40). Before we turn finally to the covariance matrix update, we scrutinize the computation of \(\boldsymbol{C}_{k+1}^{\mu }\).

8.7.1.1 What Is Missing?

In Sect. 8.3 we argued for using only the μ best solutions from the last iteration to update distribution parameters. For a covariance matrix update, disregarding the worst solutions might be too conservative, and a negative update of the covariance matrix with the μ worst solutions is proposed in [29]. This idea is not accommodated in this chapter, but has been recently exploited with consistently good results [4, 23]. An inherent inconsistency with negative updates, though, is that long steps tend to be worse merely because they are long (and not because they represent a bad direction); meanwhile, unfortunately, long steps also lead to stronger updates.

At first sight we might believe that we have covered all variation information given by \(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}_{k}\) in the covariance matrix \(\boldsymbol{C}_{k+1}^{\mu }\). On closer inspection we find that the outer product in Eq. (8.40) removes the sign: Using \(-(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m})\) instead of \(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}\) in Eq. (8.40) yields the same \(\boldsymbol{C}_{k+1}^{\mu }\). One possibility to recover the sign information is to favor the direction \(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}\) over \(-(\boldsymbol{x}_{i:\lambda } -\boldsymbol{m}) = \boldsymbol{m}_{k} -\boldsymbol{x}_{i:\lambda }\) in some way. This seems difficult to accomplish without affecting either the distribution mean (interfering with Proposition 8.3) or the maximum entropy property. Therefore, we choose a different way to recover the sign information.

8.7.2 Another Evolution Path

We recover the sign information in a classical and rather heuristic way, which turns out to be nevertheless quite effective. We consider an evolution path \(\boldsymbol{x} -\boldsymbol{m}_{k-j}\) for j > 0, where \(\boldsymbol{x}\) might be \(\boldsymbol{m}_{k+1}\) or any \(\boldsymbol{x}_{i:\lambda }\). We decompose the path into the recent step and the old path

$$\displaystyle{ \boldsymbol{x} -\boldsymbol{m}_{k-j} = \boldsymbol{x} -\boldsymbol{m}_{k} + \boldsymbol{m}_{k} -\boldsymbol{m}_{k-j}. }$$
(8.44)

Switching the sign of the last step means using the vector \(\boldsymbol{m}_{k} -\boldsymbol{x}\) instead of \(\boldsymbol{x} -\boldsymbol{m}_{k}\), and we get in this case

$$\displaystyle\begin{array}{rcl} \boldsymbol{m}_{k} -\boldsymbol{x} + \boldsymbol{m}_{k} -\boldsymbol{m}_{k-j}& =& 2(\boldsymbol{m}_{k} -\boldsymbol{x}) + \boldsymbol{x} -\boldsymbol{m}_{k-j} \\ & =& \boldsymbol{x} -\boldsymbol{m}_{k-j} - 2(\boldsymbol{x} -\boldsymbol{m}_{k})\;.{}\end{array}$$
(8.45)

Comparing the last line with the LHS of Eq. (8.44), we see that now the sign of the recent step matters. Only in the trivial cases, where either \(\boldsymbol{x} = \boldsymbol{m}_{k}\) (zero step) or \(\boldsymbol{m}_{k} = \boldsymbol{m}_{k-j}\) (zero previous path), are the outer products of Eqs. (8.44) and (8.45) identical. Because we will compute the evolution path over a considerable number of iterations j, the specific choice for \(\boldsymbol{x}\) should become rather irrelevant and we will use \(\boldsymbol{m}_{k+1}\) in the following.

In practice, we compute the evolution path, analogous to Eq. (8.25). We set \(\boldsymbol{p}_{0}^{\mathrm{c}} = {\boldsymbol 0}\) and use the momentum equation

$$\displaystyle\begin{array}{rcl} \boldsymbol{p}_{k+1}^{\mathrm{c}} = (1 - c_{\mathrm{ c}})\,\boldsymbol{p}_{k}^{\mathrm{c}} + h_{\sigma }\sqrt{c_{\mathrm{ c}}(2 - c_{\mathrm{c}})\mu _{\mathrm{eff}}}\,\frac{\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k}} {\sigma _{k}} \;,& &{}\end{array}$$
(8.46)

where \(h_{\sigma } = 1\) if \({\|\boldsymbol{p}_{k+1}^{\sigma }\|}^{2} <\big (1 - {(1 - c_{\sigma })}^{2(k+1)}\big)\left (2 + \frac{2} {n+1}\right )n\) and zero otherwise; \(h_{\sigma }\) stalls the update whenever \(\|\boldsymbol{p}_{k+1}^{\sigma }\|\) is large. The implementation of \(h_{\sigma }\) reflects the judgment that we are pursuing a heuristic rather than a first principle here, and is driven by two considerations.

  1. Given a fast increase of the step-size (induced by the fact that \(\|\boldsymbol{p}_{k+1}^{\sigma }\|\) is large), the change to the “visible” landscape will be fast, and the adaptation of the covariance matrix to the current landscape seems inappropriate, in particular, because

  2. The covariance matrix update using \({\boldsymbol{p}}^{\mathrm{c}}\) is asymmetric: A large variance in a single direction can be introduced fast (while \(\|\boldsymbol{p}_{k+1}^{\mathrm{c}}\|\) is large), but the large variance can only be removed on a significantly longer time scale. For this reason in particular, an unjustified update should be avoided.

While in Eq. (8.46), again, \(1 - c_{\mathrm{c}}\) is the decay factor and \(1/c_{\mathrm{c}} \approx (n + 4)/4\) the backward time horizon, the remaining constants are determined by the stationarity condition

$$\displaystyle\begin{array}{rcl} \boldsymbol{p}_{k+1}^{\mathrm{c}} \sim \boldsymbol{p}_{ k}^{\mathrm{c}}\;,& &{}\end{array}$$
(8.47)

given \(\boldsymbol{p}_{k}^{\mathrm{c}} \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}_{k}\right )\) and pure random ranking and \(h_{\sigma } \equiv 1\).
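For illustration, a minimal Matlab sketch of the update in Eq. (8.46); all variable names and placeholder values, including the step-size path learning rate cs, are our assumptions and would normally be maintained by the surrounding algorithm:

% illustrative placeholder state (normally maintained by CMA-ES)
n = 10; k = 0; mueff = 3; sigma = 1;
cs = (mueff + 2)/(n + mueff + 5);    % assumed step-size path learning rate
cc = 4/(n + 4);                      % cumulation rate, 1/cc ~ (n+4)/4
ps = randn(n,1); pc = zeros(n,1);    % paths p^sigma and p^c
mold = zeros(n,1); mnew = 0.1*randn(n,1);
% stall indicator h_sigma, cf. the threshold given above
hsig = norm(ps)^2 / (1 - (1-cs)^(2*(k+1))) < (2 + 4/(n+1))*n;
% cumulation of the mean shift, Eq. (8.46)
pc = (1-cc)*pc + hsig * sqrt(cc*(2-cc)*mueff) * (mnew - mold)/sigma;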

The evolution path \({\boldsymbol{p}}^{\mathrm{c}}\) heavily exploits the sign information. Let us consider, for a given \(\boldsymbol{y} \in {\mathbb{R}}^{n}\), two hypothetical situations with \(\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k} = {\alpha }^{k}\,\boldsymbol{y}\) for \(k = 0,1,2,\ldots\). We find that for \(k \rightarrow \infty \)

$$\displaystyle\begin{array}{rcl} & & \boldsymbol{p}_{k}^{\mathrm{c}} \rightarrow \sqrt{\frac{2 - c_{\mathrm{c}}} {c_{\mathrm{c}}}} \,\boldsymbol{y} \approx \sqrt{\frac{n + 2} {2}} \;\boldsymbol{y}\quad \text{if }{\alpha }^{k} = 1\;,{}\end{array}$$
(8.48)
$$\displaystyle\begin{array}{rcl} & & \boldsymbol{p}_{k}^{\mathrm{c}} \rightarrow {(-1)}^{k-1}\sqrt{ \frac{c_{\mathrm{c}}} {2 - c_{\mathrm{c}}}} \,\boldsymbol{y} \approx {(-1)}^{k-1}\sqrt{ \frac{2} {n + 2}} \;\boldsymbol{y}\quad \text{if }{\alpha }^{k} = {(-1)}^{k}\;.{}\end{array}$$
(8.49)

Both equations follow from solving the stationarity condition \(x = \pm (1 - c_{\mathrm{c}})\,x + \sqrt{c_{\mathrm{c} } (2 - c_{\mathrm{c} } )}\) for x, the coefficient of \(\boldsymbol{y}\). Combining both equations, we obtain the ratio between the maximal and minimal possible length of \({\boldsymbol{p}}^{\mathrm{c}}\), given input vectors of constant length, as

$$\displaystyle{ \frac{2 - c_{\mathrm{c}}} {c_{\mathrm{c}}} \approx \frac{n + 2} {2} \;. }$$
(8.50)
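The two limits are easy to verify numerically by iterating the recursion with normalized input vectors \({\alpha }^{k}\boldsymbol{y}\); a small sketch:

n = 10; cc = 4/(n+4); y = randn(n,1);
p1 = zeros(n,1); p2 = zeros(n,1);
for k = 0:999
  p1 = (1-cc)*p1 + sqrt(cc*(2-cc)) * y;           % alpha^k = 1
  p2 = (1-cc)*p2 + sqrt(cc*(2-cc)) * (-1)^k * y;  % alpha^k = (-1)^k
end
norm(p1)/norm(y)   % -> sqrt((2-cc)/cc), cf. Eq. (8.48)
norm(p2)/norm(y)   % -> sqrt(cc/(2-cc)), cf. Eq. (8.49)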

In addition to the matrix \(\boldsymbol{C}_{k+1}^{\mu }\), we use the rank-one matrix \(\boldsymbol{p}_{k+1}^{\mathrm{c}}{\boldsymbol{p}_{k+1}^{\mathrm{c}}}^{\mathrm{T}}\) to introduce the missing sign information into the covariance matrix. The update is specified below in Eq. (8.51). It implements the principal heuristic of reinforcing successful variations, here applied to variations observed over several iterations.

8.7.3 Evaluation of the Cumulation Heuristic

We evaluate the effect of the evolution path for covariance matrix adaptation. Figure 8.7 shows running length measurements of the (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES on the cigar function, depending on the choice of \(c_{\mathrm{c}}\) (see legend). The graphs in the left plot are typical example data for identifying a good parameter setting. Ten values for \(c_{\mathrm{c}}^{-1}\) between 1 and 10n are shown for each dimension; larger values are not regarded as sensible. The setting \(c_{\mathrm{c}} = 1\) means that the heuristic is switched off. Improvements over the setting \(c_{\mathrm{c}} = 1\) can be observed in particular for larger dimensions, where, up to n = 100, the function can be solved up to ten times faster. For \(c_{\mathrm{c}}^{-1} = n\), the performance is close to optimal for all dimensions.

Fig. 8.7
figure 7

Number of function evaluations to reach \(f(\boldsymbol{x}) < 10^{-6}\) on \(f(\boldsymbol{x}) = x_{1}^{2} + 10^{6}\sum _{i=2}^{n}x_{i}^{2}\) with \(\boldsymbol{m}_{0} = \boldsymbol{1}\) and \(\sigma _{0} = 1\). For a (backward) time horizon of \(c_{\mathrm{c}}^{-1} = 1\), the cumulation heuristic is, by definition, switched off. Left figure: number of function evaluations, where each point represents a single run, plotted versus the backward time horizon \(c_{\mathrm{c}}^{-1}\) of the evolution path, for n = 3, 10, 30, 100 (from bottom to top). Triangles show averages for \(c_{\mathrm{c}}^{-1} = \sqrt{n}\) and n, also shown on the right. Right figure: average number of function evaluations divided by n, from \(\lfloor 10/\lfloor \sqrt{n}\rfloor \rfloor = 10, 3, 2, 1\) runs (for n = 3, 10, 30, 100), plotted versus n for (from top to bottom) \(c_{\mathrm{c}}^{-1} = 1\), \(\sqrt{n}\), \(\frac{n+3}{3}\), n. Compared to \(c_{\mathrm{c}} = 1\), the speed-up exceeds a factor of \(\sqrt{n}/2\) (dashed line) in all cases

The right plot shows the running lengths for four different parameter settings versus dimension. For n = 3 the smallest speed-up, of about 25 %, is observed for all variants with \(c_{\mathrm{c}}^{-1} > 1\). The speed-up grows to a factor of roughly 2, 4, and 10 for dimensions 10, 30, and 100, respectively, and always exceeds a factor of \(\sqrt{n}/2\). For \(c_{\mathrm{c}} = 1\) (heuristic off), the scaling with the dimension is \(\approx {n}^{1.7}\). For \(c_{\mathrm{c}}^{-1} = \sqrt{n}\), the scaling becomes \(\approx {n}^{1.1}\), and about linear for \(c_{\mathrm{c}}^{-1} \geq n/3\). These findings hold for any function where the predominant task is to acquire the orientation of a constant number of “long axes”, in other words, to find a few insensitive directions along which a large distance needs to be traversed. The assertion in [37] that \(c_{\mathrm{c}}^{-1} \propto n\) is needed to obtain a significant scaling improvement turns out to be wrong. For larger population sizes λ, where the rank-μ update becomes more effective, the positive effect diminishes and almost vanishes for λ = 10n.

The same experiment has been conducted on other (unimodal) functions. While on many functions the cumulation heuristic is less effective and yields only a small, rather n-independent speed-up (e.g., somewhat below a factor of two on the Rosenbrock function), we have not yet seen an example where it notably compromises performance. Hence the default choice has become \(c_{\mathrm{c}}^{-1} \approx n/4\) (see Table 8.2 in the Appendix), because (a) the update of the covariance matrix has a time constant of \(c_{1}^{-1} \approx {n}^{2}/2\) and we feel that \(c_{1}^{-1}/c_{\mathrm{c}}^{-1}\) should not be smaller than n, and (b) in our additional experiments the value \(c_{\mathrm{c}}^{-1} = n\) is indeed sometimes worse than smaller values.

8.7.4 The Covariance Matrix Update

The final covariance matrix update combines a rank-one update using \({\boldsymbol{p}}^{\mathrm{c}}{{\boldsymbol{p}}^{\mathrm{c}}}^{\mathrm{T}}\) and a rank-μ update using \(\boldsymbol{C}_{k+1}^{\mu }\),

$$\displaystyle{ \boldsymbol{C}_{k+1} = (1 - c_{1} - c_{\mu } + c_{\epsilon })\,\boldsymbol{C}_{k} +\, c_{1}\,\boldsymbol{p}_{k+1}^{\mathrm{c}}{\boldsymbol{p}_{ k+1}^{\mathrm{c}}}^{\mathrm{T}} +\, c_{\mu }\boldsymbol{C}_{ k+1}^{\mu }\;, }$$
(8.51)

where \({\boldsymbol{p}}^{\mathrm{c}}\) and \(\boldsymbol{C}_{k+1}^{\mu }\) are defined in Eqs. (8.46) and (8.40), respectively, and \(c_{\epsilon } = (1 - h_{\sigma }^{2})\,c_{1}c_{\mathrm{c}}(2 - c_{\mathrm{c}})\) is of minor relevance and makes up for the loss of variance in case of h σ  = 0. The constants \(c_{1} \approx 2/{n}^{2}\) and \(c_{\mu } \approx \mu _{\mathrm{eff}}/{n}^{2}\) for \(\mu _{\mathrm{eff}} < {n}^{2}\) are learning rates satisfying \(c_{1} + c_{\mu } \leq 1\). The approximate values reflect the rank of the input matrix or the number of input samples, divided by the degrees of freedom of the covariance matrix. The remaining degrees of freedom are covered by the old covariance matrix \(\boldsymbol{C}_{k}\). Again, the equation is governed by a stationarity condition.
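For illustration, a minimal Matlab sketch of Eq. (8.51); the placeholder values are our assumptions, and Cmu stands in for \(\boldsymbol{C}_{k+1}^{\mu }\) from Eq. (8.40):

n = 10; mueff = 3; cc = 4/(n+4); hsig = 1;  % illustrative values
c1 = 2/n^2; cmu = min(mueff/n^2, 1 - c1);   % indicative learning rates
C = eye(n); pc = randn(n,1);                % current state
Cmu = eye(n);                               % stands in for Eq. (8.40)
ceps = (1 - hsig^2) * c1 * cc * (2 - cc);   % compensates variance loss if hsig = 0
C = (1 - c1 - cmu + ceps)*C + c1*(pc*pc') + cmu*Cmu;  % Eq. (8.51)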

Proposition 8.8 (Stationarity of covariance matrix \(\boldsymbol{C}\)).

Given pure random ranking, \(\boldsymbol{p}_{k}^{\mathrm{c}} \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}_{k}\right )\), and \(h_{\sigma } = 1\), we have

$$\displaystyle{ E(\boldsymbol{C}_{k+1}\vert \boldsymbol{C}_{k}) = \boldsymbol{C}_{k}\;. }$$
(8.52)

Proof idea.

Compute the expected value of Eq. (8.51). □ 
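To sketch the main step under the stated assumptions: with pure random ranking, \((\boldsymbol{m}_{k+1} -\boldsymbol{m}_{k})/\sigma _{k} \sim \mathcal{N}\left ({\boldsymbol 0},\boldsymbol{C}_{k}/\mu _{\mathrm{eff}}\right )\) independently of \(\boldsymbol{p}_{k}^{\mathrm{c}}\), so the cross terms vanish in expectation and

$$\displaystyle{ E\big(\boldsymbol{p}_{k+1}^{\mathrm{c}}{\boldsymbol{p}_{k+1}^{\mathrm{c}}}^{\mathrm{T}}\,\big\vert \,\boldsymbol{C}_{k}\big) = {(1 - c_{\mathrm{c}})}^{2}\,\boldsymbol{C}_{k} + c_{\mathrm{c}}(2 - c_{\mathrm{c}})\,\mu _{\mathrm{eff}}\,\frac{\boldsymbol{C}_{k}} {\mu _{\mathrm{eff}}} = \boldsymbol{C}_{k}\;. }$$

Analogously, \(E(\boldsymbol{C}_{k+1}^{\mu }\vert \boldsymbol{C}_{k}) = \boldsymbol{C}_{k}\). With \(h_{\sigma } = 1\) we have \(c_{\epsilon } = 0\), and taking the expectation of Eq. (8.51) yields \((1 - c_{1} - c_{\mu })\,\boldsymbol{C}_{k} + c_{1}\,\boldsymbol{C}_{k} + c_{\mu }\,\boldsymbol{C}_{k} = \boldsymbol{C}_{k}\).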

Finally, we can state general linear invariance for CMA-ES, analogous to scale invariance in Proposition 8.6 and Fig. 8.5.

Proposition 8.9 (Invariance under general linear transformations).

The CMA-ES is invariant under full rank linear transformations of the search space, that is, for each \(f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\) invariant under

$$\displaystyle\begin{array}{rcl} \mathcal{H}_{\mathrm{GL}}: f\mapsto \{f \circ {\boldsymbol{B}}^{-1}: \boldsymbol{x}\mapsto f({\boldsymbol{B}}^{-1}\boldsymbol{x})\;\vert \;\mbox{ $\boldsymbol{B}$ is a full rank $n \times n$ matrix}\}\;.& &{}\end{array}$$
(8.53)

The respective bijective state space transformation reads

$$\displaystyle\begin{array}{rcl} T_{\boldsymbol{B}}: (\boldsymbol{m},\sigma,\boldsymbol{C},{\boldsymbol{p}}^{\sigma },{\boldsymbol{p}}^{\mathrm{c}})\mapsto (\boldsymbol{B}\boldsymbol{m},\sigma,\boldsymbol{B}\boldsymbol{C}{\boldsymbol{B}}^{\mathrm{T}},{\boldsymbol{p}}^{\sigma },\boldsymbol{B}{\boldsymbol{p}}^{\mathrm{c}})\;.& &{}\end{array}$$
(8.54)

Furthermore, for each f, the set \(\mathcal{H}_{\mathrm{GL}}(f)\) is an equivalence class with identical algorithm trace \(T_{\boldsymbol{B}}(\boldsymbol{m}_{k},\sigma _{k},\boldsymbol{C}_{k},\boldsymbol{p}_{k}^{\sigma },\boldsymbol{p}_{k}^{\mathrm{c}})\), provided the initial state \((\boldsymbol{m}_{0},\sigma _{0},\boldsymbol{C}_{0},\boldsymbol{p}_{0}^{\sigma },\boldsymbol{p}_{0}^{\mathrm{c}})\) is chosen as \(T_{\boldsymbol{B}}^{-1}(s)\) for a given state s.

Proof idea.

Straightforward computation of the updated tuple: The equivalence relation property can be shown elementarily (cf. Proposition 8.1) or by recognizing that the set of full rank matrices is a transformation group over the set \(\{f: {\mathbb{R}}^{n} \rightarrow \mathbb{R}\}\) with group action \((\boldsymbol{B},f)\mapsto f \circ {\boldsymbol{B}}^{-1}\) and therefore induces the equivalence classes \(\mathcal{H}_{\mathrm{GL}}(f)\) as orbits of f under the group action. □ 

A commutative diagram, analogous to Fig. 8.5, applies with \(T_{\boldsymbol{B}}\) in place of \(T(\alpha )\) and using \(f({\boldsymbol{B}}^{-1}\boldsymbol{x})\) in the lower path. The transformation \(\boldsymbol{B}\) can be interpreted as a change of basis and therefore CMA-ES is invariant under linear coordinate system transformations. All further considerations made for scale invariance likewise hold for invariance under general linear transformations.
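The mechanics behind this invariance can be illustrated in a few lines: sampling with any factor \(\boldsymbol{A}\) satisfying \(\boldsymbol{A}{\boldsymbol{A}}^{\mathrm{T}} = \boldsymbol{C}\), the transformed state \(T_{\boldsymbol{B}}\) reproduces exactly the transformed sample. A minimal Matlab check, where f is an arbitrary stand-in cost function:

n = 5; f = @(x) sum(x.^4) + x(1);  % arbitrary stand-in cost function
B = randn(n);                      % full rank with probability one
m = randn(n,1); sigma = 0.5;
A = randn(n); C = A*A';            % any factor A with A*A' = C
z = randn(n,1);                    % one fixed sample
x  = m + sigma*A*z;                % sample under state (m, sigma, C)
xB = B*m + sigma*(B*A)*z;          % same sample under T_B: (B*m, sigma, B*C*B')
norm(xB - B*x)                     % 0 (up to round-off): identical trace
f(B\xB) - f(x)                     % 0 (up to round-off): same value on f(B^{-1}x)

Note that \((\boldsymbol{B}\boldsymbol{A}){(\boldsymbol{B}\boldsymbol{A})}^{\mathrm{T}} = \boldsymbol{B}\boldsymbol{C}{\boldsymbol{B}}^{\mathrm{T}}\), so \(\boldsymbol{B}\boldsymbol{A}\) is a valid sampling factor for the transformed covariance matrix in Eq. (8.54).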

Because an appropriate (initial) choice of \(\boldsymbol{B}\) is usually not available, general linear invariance must be complemented with adaptivity of \(\boldsymbol{C}\) to become useful in practice, eventually adapting a linear encoding [17].

Corollary 8.1 (Adaptive linear encoding and variable metric [17]).

The covariance matrix adaptation implements an adaptive linear problem encoding, that is, in other words, an adaptive change of basis, or a change of coordinate system, or a variable metric for an evolution strategy.

Proof idea (The proof can be found in [16]).

General linear invariance achieves identical performance on \(f({\boldsymbol{B}}^{-1}\boldsymbol{x})\) under the respective initial conditions. Here, \(\boldsymbol{B}\) is the linear problem encoding used within the algorithm. Changing (or adapting) \(\boldsymbol{C}\) without changing \(\boldsymbol{m}\) turns out to be equivalent to changing the encoding (or representation) \(\boldsymbol{B}\) in a particular way without changing \({\boldsymbol{B}}^{-1}\boldsymbol{m}\) (see also [13, 16]). Conversely, for each possible encoding \(\boldsymbol{B}\) we find the respective covariance matrix \(\boldsymbol{B}{\boldsymbol{B}}^{\mathrm{T}}\). □ 

While adaptation of \(\boldsymbol{C}\) is essential to implement general linear invariance, rotation invariance does not necessarily depend on an adaptation of \(\boldsymbol{C}\): rotation invariance is already achieved for \(\boldsymbol{C} \equiv \boldsymbol{ I}\), because \(\boldsymbol{B}\boldsymbol{I}{\boldsymbol{B}}^{\mathrm{T}} =\boldsymbol{ I}\) when \(\boldsymbol{B}\) is a rotation matrix, cf. Eq. (8.54). Nevertheless, it is important to note that covariance matrix adaptation preserves rotation invariance.

Corollary 8.2 (Rotation invariance).

The CMA-ES is invariant under search space rotations.

Proof idea.

Rotation invariance follows from Proposition 8.9 when restricted to the orthogonal group with \(\boldsymbol{B}{\boldsymbol{B}}^{\mathrm{T}} =\boldsymbol{ I}\) (for any initial state). □ 

8.8 An Experiment on Two Noisy Functions

We advocate always testing new search algorithms on pure random, on linear, and on various (nonseparable) quadratic functions with various initializations. For the (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES this has been done elsewhere with the expected results: parameters remain unbiased on pure random functions, the step-size σ grows geometrically fast on linear functions, and on convex quadratic functions the level sets of the search distribution align with the level sets of the cost function, in that \({\boldsymbol{C}}^{-1}\) aligns with the Hessian up to a scalar factor and small stochastic fluctuations [15, 22].

Here, we show results on the well-known Rosenbrock function

$$\displaystyle{f(\boldsymbol{x}) =\sum _{ i=1}^{n-1}100\,{(x_{ i}^{2} - x_{ i+1})}^{2} + {(x_{ i} - 1)}^{2}\;,}$$

where the achievable outcome is less obvious. In order to “unsmoothen” the landscape, a noise term is added: each function value is multiplied by

$$\displaystyle\begin{array}{rcl} \exp \left ( \frac{\alpha _{N}} {2\,n} \times (G + C/10)\right ) + \frac{\alpha _{N}} {2\,n} \times (G + C/10)\;,& &{}\end{array}$$
(8.55)

where G and C are standard Gauss (normal) and standard Cauchy distributed random numbers, respectively. All four random numbers in (8.55) are sampled independently each time f is evaluated. The term is a mixture of the common normal noise 1 + G, which we believe suffers from a principal “design flaw” [30], and the log-normal noise exp(G), which alone is comparatively easy to cope with, each mixed with a heavy-tailed distribution whose effect cannot be alleviated through averaging. We believe that this stacks several difficulties on top of each other.
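A possible Matlab implementation of the noisy Rosenbrock function is sketched below; the file name frosennoisy matches the calling sequence further down, but the implementation itself is our assumption. A standard Cauchy variate can be drawn as \(\tan (\pi (u - 1/2))\) for uniform u:

function f = frosennoisy(x, alphaN)  % noisy Rosenbrock, cf. Eq. (8.55)
if nargin < 2, alphaN = 0.01; end    % noise level alpha_N, e.g., as in Fig. 8.9
n = length(x);
f = sum(100*(x(1:n-1).^2 - x(2:n)).^2 + (x(1:n-1) - 1).^2);
f = f * noisemult(alphaN, n);        % apply the multiplier of Eq. (8.55)

function m = noisemult(alphaN, n)    % one realization of the multiplier (8.55)
GC1 = randn + tan(pi*(rand - 0.5))/10;  % G + C/10, first independent pair
GC2 = randn + tan(pi*(rand - 0.5))/10;  % G + C/10, second independent pair
m = exp(alphaN/(2*n) * GC1) + alphaN/(2*n) * GC2;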

We show results for two noise levels, \(\alpha _{N} = 0.01\) and \(\alpha _{N} = 1\). A section through the 5-D and the 20-D landscape for \(\alpha _{N} = 1\) is shown in Fig. 8.8. The lower-dimensional landscape appears more disturbed but is not more difficult to optimize.

Fig. 8.8
figure 8

Both figures show three sections of the Rosenbrock function for \(\alpha _{N} = 1\) and argument \(\boldsymbol{x} =\beta \times {\boldsymbol 1} + \frac{1} {20}\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\). All graphs show 201 points for \(\beta \in [-0.5, 1.5]\) and a single realization of \(\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\) in each subfigure. The left basin about zero is initially highly attractive (cf., for example, Fig. 8.9, upper right) but is not near a local or global optimum. The basin around β = 1 is close to the global optimum at \({\boldsymbol 1}\) and is monotonically (nonvisibly) connected to the left basin

Fig. 8.9
figure 9

A typical run of the (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES on the Rosenbrock function (n = 20) with a small disturbance of the function value (\(\alpha _{N} = 0.01\)). All values are plotted against the number of objective function evaluations. Upper left: iteration-wise best function value (thick blue graph), median and worst function value (black graphs, mainly hidden), square root of the condition number of \(\boldsymbol{C}_{k}\) (increasing red graph), smallest and largest coordinate-wise standard deviation of the distribution \(\mathcal{N}\left ({\boldsymbol 0},\sigma _{k}^{2}\boldsymbol{C}_{k}\right )\) with final values annotated (magenta), and \(\sigma _{k}\) following closely the largest standard deviation (light green). Lower left: square roots of the eigenvalues of \(\boldsymbol{C}_{k}\), sorted. Upper right: incumbent solution \(\boldsymbol{m}_{k}\). Lower right: square roots of the diagonal elements of \(\boldsymbol{C}_{k}\)

Figure 8.9 shows the output of a typical run of the (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES for \(\alpha _{N} = 0.01\), with \(\boldsymbol{m}_{0} = -{\boldsymbol 1}\) and \(\sigma _{0} = 1\) (correctly presuming that in all variables \(m_{i} \pm 3\sigma _{0}\) embraces the optimum at \({\boldsymbol 1}\)). The calling sequence in Matlab was:

opts.evalparallel = 'on';                    % only one feval() call per iteration
cmaes('frosennoisy', -ones(20,1), 1, opts);  % run CMA-ES
plotcmaesdat;                                % plot figures using output files

The default population size for n = 20 is λ = 12. An error of \(10^{-9}\), very close to the global optimum, is reached after about 20,000 function evaluations (without covariance matrix adaptation it takes about 250,000 function evaluations to reach \(10^{-2}\)). The effect of the noise is hardly visible in the performance. In some cases, the optimization only finds the local optimum of the function close to \({(-1,1,\ldots,1)}^{\mathrm{T}}\); in some cases the noise leads to a failure to approach any optimum (see also below).

The main challenge on the Rosenbrock function is to follow a winding ridge, in the figure between evaluations 1,000 and 15,000. The ridge is not particularly narrow: the observed axis ratio is about twenty, corresponding to a condition number of 400. But the ridge constantly changes its orientation (witnessed by the lower-right subfigure). Many stochastic search algorithms are unable to follow this ridge and get stuck with a function value larger than one.

Fig. 8.10
figure 10

A typical run of the IPOP-UH-CMA-ES on the noisy Rosenbrock function (n = 20, \(\alpha _{N} = 1\)), a (\(\mu /\mu _{\mathrm{w}}\))-CMA-ES with uncertainty-handling, restarted with increasing population size. The highly rugged lines in the upper left, partly beyond \(10^{5}\), depict the worst measured function value (out of λ). One restart was necessary to converge close to the global optimum. See also Fig. 8.9 for more explanations

In Fig. 8.10, the noise level is set to \(\alpha _{N} = 1\), generating a highly rugged landscape (Fig. 8.8) and making it even harder to follow the winding ridge. Most search algorithms will fail to solve this function. Now, two additional heuristics are examined.

First, restarting the algorithm with increasing population size (IPOP, [6]): the population size is doubled for each restart. A larger population size λ is more robust on rugged landscapes, mainly because the sample variance can be larger (for \(\mu _{\mathrm{eff}} < n\), the optimal step-size on the sphere function is proportional to \(\mu _{\mathrm{eff}}\) [2]). Restarting with increasing population size is a very effective heuristic when a good termination criterion is available.

Second, applying uncertainty-handling (UH, [25]): the uncertainty-handling reevaluates a few solutions and measures their resulting rank changes [25]. If the rank changes exceed a threshold, an action is taken; here, σ is increased. This prevents the algorithm from getting stuck when the noise disturbs the selection too severely, but it can also lead to divergence. The latter is of lesser relevance, because in this case the original algorithm would most likely have been stuck anyway. Again, a good termination criterion is essential.

Note that in both cases, for restarts and with uncertainty-handling, another possible action is to increase the number of function evaluations used for each individual, replacing a single value with a median.

Fig. 8.11
figure 11

Two typical runs of the IPOP-CMA-ES (left) and UH-CMA-ES (right, with uncertainty-handling) on the noisy ellipsoid function (n = 20, \(\alpha _{N} = 1\)). With \(\alpha _{N} = 0\) the ellipsoid is solved in about 22,000 function evaluations. In the lower left we can clearly observe that the algorithm gets stuck “in the middle of nowhere” during the first two launches. See also Fig. 8.9 for more explanations

For running IPOP-UH-CMA-ES, the following sequence is added before calling cmaes.

opts.restarts = 1;              % maximum number of restarts
opts.StopOnStagnation = 'yes';  % terminate long runs
opts.noise.on = 'yes';          % activate uncertainty-handling

Each restart uses the same initial conditions, here \(\boldsymbol{m}_{0} = -{\boldsymbol 1}\) and \(\sigma _{0} = 1\) from above. For \(\alpha _{N} = 0.01\) (Fig. 8.9) the uncertainty-handling increases the running length by about 15 %, simply due to the reevaluations (not shown). For \(\alpha _{N} = 1\) in Fig. 8.10, it shortens the running length by a factor of about ten by reducing the number of necessary restarts. As is typical for noisy functions, the restart was invoked due to stagnation of the run [20]. When repeating this experiment, in about 75 % of the cases one restart is needed to finally converge to the global optimum with λ = 24. Without uncertainty-handling it usually takes five to six restarts and a final population size of λ ≥ 384. Without covariance matrix adaptation it takes about 70 times longer to reach a precision similar to that in Fig. 8.10.

Experiments with the well-known ellipsoid function,

$$\displaystyle{f(\boldsymbol{x}) =\sum _{ i=1}^{n}1{0}^{6 \frac{i-1} {n-1} }x_{i}^{2}}$$

with the same noise multiplier and \(\alpha _{N} = 1\) are shown in Fig. 8.11 for IPOP-CMA-ES (left) and UH-CMA-ES (right). The function is less difficult and can be solved with a population size of λ = 48 using the IPOP approach, and with the default population size of 12 using UH-CMA-ES.
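The noisy ellipsoid can be sketched analogously, reusing noisemult from the Rosenbrock sketch above (fellinoisy is a hypothetical name):

function f = fellinoisy(x, alphaN)  % noisy ellipsoid
if nargin < 2, alphaN = 1; end      % noise level alpha_N
n = length(x);
f = sum(10.^(6*(0:n-1)'/(n-1)) .* x(:).^2);
f = f * noisemult(alphaN, n);       % multiplier of Eq. (8.55), see above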

8.9 Summary

Designing a search algorithm is intricate. We recapitulate the principled design ideas for deriving the CMA-ES algorithm.

  • Using a minimal amount of prior assumptions on the cost function f in order to achieve maximal robustness and minimal susceptibility to deceptiveness.

    • Generating candidate solutions by sampling a maximum entropy distribution adds the least amount of unwarranted information. This implies the stochastic nature of the algorithm and that no construction of potentially better points is undertaken. This also implies an internal quadratic model—at least when the distribution has finite variances—and stresses the importance of neighborhood. Consequently, a variable neighborhood suggests itself.

    • Unbiasedness of all algorithm components, given the objective function is random and independent of its argument. This principle suggests that only the current state and the selection information should bias the behavior of the algorithm. Adding another bias would add additional prior assumptions. We have deliberately violated this principle for uncertainty-handling as used in one experiment, where the step-size is increased under highly perturbed selection.

    • Only the ranking of the most recently sampled candidate solutions is used as feed-back from the objective function. This implies an attractive invariance property of the algorithm.

    Exploiting more specific information on f effectively, for example, smoothness, convexity or (partial) separability, will lead to different and more specific design decisions, with a potential advantage on smooth, convex or separable functions, respectively.

  • Introducing and maintaining invariance properties. Even invariance is related to avoiding prior assumptions as it implies not exploiting the specific structure of the objective function f (for example, separability). We can differentiate two main cases.

    • Unconditional invariance properties do not depend on the initial conditions of the algorithm and strengthen any empirical performance observation. They allow us to unconditionally generalize empirical observations to the equivalence class of functions induced by the invariance property.

    • Invariance properties that depend on state variables of the algorithm (like σ k  for scale invariance in Fig. 8.5) must be complemented with adaptivity. They are particularly attractive, if adaptivity can drive the algorithm quickly into the most desirable state. This behavior can be empirically observed for CMA-ES on the equivalence class of convex-quadratic functions. Step-size control drives step-size σ k  close to its optimal value, and adaptation of the covariance matrix reduces these functions to the sphere model.

  • Exploiting all available information effectively. The available information and its exploitation are highly restricted by the first two demands. Using a deterministic ranking and different weights for updating \(\boldsymbol{m}\) and \(\boldsymbol{C}\) are due to this design principle. Also the evolution path in Eq. (8.46) and its use in Eq. (8.51) are governed by exploiting otherwise unused sign information. Using the evolution paths does not violate any of the above demands, but allows us to additionally exploit dependency information between successive time steps of the algorithm.

  • Solving the two most basic continuous domain functions reasonably fast. Solving the linear function and the sphere function reasonably fast implies the introduction of step-size control. These two functions are quite opposed: The latter requires convergence, the former requires divergence of the algorithm.

Table 8.1 Summary of the update equations for the state variables in the (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES with iteration index \(k = 0,1,2,\ldots\). The chosen ordering of equations allows us to remove the iteration index in all variables but \(\boldsymbol{m}_{k}\). Unexplained parameters and constants are given in Table 8.2

Finally, two heuristic concepts are applied in CMA-ES.

  • Reinforcement of the better solutions and the better steps (variations) when updating the mean and the variances, respectively. This seems a rather unavoidable heuristic given a conservative use of information from f. This heuristic connects to the maximum likelihood principle.

  • Orthogonality of successive steps. This heuristic is a rather common conception in continuous domain search.

Pure random search, where the sample distribution remains constant over the iteration sequence, follows most of the above design principles and has some attractive robustness features. However, pure random search neither accumulates information from the past in order to modify the search distribution, nor changes and adapts internal state variables. Adaptivity of state variables, however, detaches the algorithm from its initial conditions and lets (additional) invariance properties come to life. Only invariance to increasing f-value transformations (Proposition 8.1) is independent of state variables of the search algorithm. We draw the somewhat surprising conclusion that the abstract notion of invariance, because it advises the introduction of adaptivity, leads, when carefully implemented, to vastly improved practical performance.

Despite its generic, principled design, the practical performance of CMA-ES turns out to be surprisingly competitive, or even superior, also on comparatively specific problem classes. This holds in particular when more than 100n function evaluations are necessary to find a satisfactory solution [26], for example, on smooth unimodal nonquadratic functions [8], on highly multimodal functions [21], and on noisy or highly rugged functions [20]. In contrast, much better search heuristics are available for (nearly) convex-quadratic problems or for (partially) separable multimodal problems.

Table 8.2 Default parameter values of the (\(\mu /\mu _{\mathrm{w}}\))-CMA-ES, where by definition \(\sum _{i=1}^{\mu }\vert w_{i}\vert = 1\) and \(\mu _{\mathrm{eff}}^{-1} = \sum _{i=1}^{\mu }w_{i}^{2}\)

8.10 Appendix

The (\(\mu /\mu _{\mathrm{w}},\lambda\))-CMA-ES, as described in this chapter, is summarized in Table 8.1. We have \(\boldsymbol{p}_{k=0}^{\sigma } = \boldsymbol{p}_{k=0}^{\mathrm{c}} = {\boldsymbol 0}\) and \(\boldsymbol{C}_{k=0} =\boldsymbol{ I}\), while \(\boldsymbol{m}_{k=0} \in {\mathbb{R}}^{n}\) and \(\sigma _{k=0} > 0\) are user defined. Additionally, \(\boldsymbol{x}_{i:\lambda }\) is the i-th best of the solutions \(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{\lambda }\),

$$\displaystyle\begin{array}{rcl} h_{\sigma } = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\text{if} \frac{{\|\boldsymbol{p}_{k+1}^{\sigma }\|}^{2}} {1-{(1-c_{\sigma })}^{2(k+1)}} < \left (2 + \frac{4} {n+1}\right )n \\ 0\quad &\text{otherwise} \end{array} \right.\;,& & {}\\ \end{array}$$

for \(\mathsf{E}\|\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\| = \sqrt{2}\,\Gamma (\frac{n+1} {2} )/\Gamma (\frac{n} {2} ) \approx \sqrt{n - 1/2}\) we use the better approximation \(\sqrt{n}\left (1 - \frac{1} {4n} + \frac{1} {21{n}^{2}} \right )\), and \({\boldsymbol{C}_{k}}^{-\frac{1} {2} }\) is symmetric with positive eigenvalues and satisfies \({\boldsymbol{C}_{k}}^{-\frac{1} {2} }{\boldsymbol{C}_{ k}}^{-\frac{1} {2} } ={ \left (\boldsymbol{C}_{k}\right )}^{-1}\). The binary ∧ operator denotes the minimum of two values and has low operator precedence. The default parameter values are shown in Table 8.2.
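The quality of the two approximations for \(\mathsf{E}\|\mathcal{N}\left ({\boldsymbol 0},\boldsymbol{I}\right )\|\) is easy to check; a small Matlab sketch, using gammaln to avoid overflow for larger n:

for n = [3 10 30 100 300]
  exact  = sqrt(2) * exp(gammaln((n+1)/2) - gammaln(n/2));  % expected chi length
  approx = sqrt(n) * (1 - 1/(4*n) + 1/(21*n^2));            % approximation used
  fprintf('n=%3d  exact=%.5f  sqrt(n-1/2)=%.5f  used=%.5f\n', ...
          n, exact, sqrt(n - 1/2), approx);
end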