Progress Rate Analysis of Evolution Strategies on the Rastrigin Function: First Results

A first order progress rate is derived for the intermediate multi-recombinative Evolution Strategy (μ/μ_I, λ)-ES on the highly multimodal Rastrigin test function. The progress is derived within a linearized model applying the method of so-called noisy order statistics. To this end, the mutation-induced variance of the Rastrigin function is determined. The obtained progress approximation is compared to simulations, revealing strengths and limitations depending on mutation strength and distance to the optimizer. Furthermore, the progress is iterated using the dynamical systems approach and compared to averaged optimization runs. The property of global convergence within the given approximation is discussed. As an outlook, the need for an improved first order progress rate as well as the extension to higher order progress including positional fluctuations is explained.


Introduction
Evolution Strategies (ES) [12,13] are well-recognized Evolutionary Algorithms suited for real-valued non-linear optimization. State-of-the-art ES such as the CMA-ES [8] or its simplification [5] are also well-suited for locating global optimizers in highly multimodal fitness landscapes. While the CMA-ES was originally intended mainly for non-differentiable optimization problems and was regarded as a locally acting strategy, it was already observed in [7] that using a large population size can make the ES a strategy that is able to locate the global optimizer among a huge number of local optima. This is a surprising observation when considering the ES as a strategy that acts mainly locally in the search space, following some kind of gradient or natural gradient [3,6,11]. As one can easily check using standard (highly) multimodal test functions such as Rastrigin, Ackley, and Griewank, to name a few, this ES property is not intimately related to the covariance matrix adaptation (CMA) ES, which generates non-isotropic correlated mutations, but can also be found in the (μ/μ_I, λ)-ES with isotropic mutations. Therefore, if one wants to understand the underlying working principles by which the ES locates the global optimizer, the analysis of the (μ/μ_I, λ)-ES should be the starting point.
The question why and when optimization algorithms originally designed for local search are able to locate global optima has gained attention in the last few years. A recurring idea comes from relaxation procedures that transform the original multimodal optimization problem into a convex optimization problem, called Gaussian continuation [9]. Gaussian continuation is nothing else but a convolution of the original optimization problem with a Gaussian kernel. As has been shown in [10], using the right Gaussian, Rastrigin-like functions can be transformed into a convex optimization problem, thus making them accessible to gradient following strategies. However, this raises the question how to perform the convolution efficiently. One road, followed in [14], uses high-order Gauss-Hermite integration in conjunction with a gradient descent strategy, yielding surprisingly good results. The other road coming to mind is approximating the convolution by Gaussian sampling. This resembles the procedure ES follow: starting from a parental state, offspring are generated by Gaussian mutations. The problem is, however, that a huge number of samples, i.e. offspring in ES, must be generated in order to obtain reliable convolution results and thus a reliable gradient. The number of offspring needed for reliable estimates seems much larger than the offspring population size needed in the ES experiments conducted in [7], which showed an approximately linear relation between problem dimension N and population size for the Rastrigin function. Therefore, understanding the ES performance from the viewpoint of Gaussian relaxation does not seem to help much.
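To make the Gaussian continuation idea concrete: for Rastrigin-like functions the convolution with a Gaussian kernel even has a closed form, since E[(y+σz)²] = y² + σ² and E[cos(α(y+σz))] = cos(αy)·exp(−α²σ²/2) for z ∼ N(0,1). The following sketch (not part of the paper, assuming the standard one-dimensional Rastrigin component y² + A(1 − cos(αy))) compares this closed form with the sampling-based estimate and illustrates how quickly the oscillation is damped:

```python
import math
import random

A, ALPHA = 10.0, 2 * math.pi

def rastrigin_1d(y):
    # one component of the Rastrigin sum: y^2 + A*(1 - cos(alpha*y))
    return y * y + A * (1.0 - math.cos(ALPHA * y))

def smoothed_closed_form(y, sigma):
    # E_z[f(y + sigma*z)] for z ~ N(0,1): the quadratic term gains sigma^2,
    # the oscillation is attenuated by exp(-alpha^2 * sigma^2 / 2)
    damping = math.exp(-(ALPHA * sigma) ** 2 / 2)
    return y * y + sigma ** 2 + A * (1.0 - math.cos(ALPHA * y) * damping)

def smoothed_monte_carlo(y, sigma, samples=200_000, seed=1):
    # Gaussian-sampling approximation of the same convolution
    rng = random.Random(seed)
    return sum(rastrigin_1d(y + sigma * rng.gauss(0.0, 1.0))
               for _ in range(samples)) / samples

y, sigma = 1.3, 0.5
print(smoothed_closed_form(y, sigma), smoothed_monte_carlo(y, sigma))
```

Note that already for σ = 0.5 and α = 2π the damping factor exp(−α²σ²/2) ≈ 0.007, i.e. the smoothed landscape is essentially convex, yet the sampling estimate needs very many samples to resolve this reliably, in line with the argument above.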
The approach followed in this paper incorporates two main concepts, namely a progress rate analysis as well as its application within the so-called evolution equations modeling the transition dynamics of the ES [2]. The progress rate measure yields the expected positional change in search space between two generations depending on location, strategy, and test function parameters. Aiming to investigate and understand the dynamics of globally converging ES runs, the progress rate is an essential quantity to model the expected evolution dynamics over many generations.
This paper provides first results of a scientific program that aims at an analysis of the performance of the (μ/μ_I, λ)-ES on Rastrigin's test function based on a first order progress rate. After a short introduction of the (μ/μ_I, λ)-ES, the N-dimensional first order progress rate will be defined and an approximation will be derived, resulting in a closed form expression. Its predictive power and limitations will be checked by one-generation experiments. The progress rate will then be used to simulate the ES dynamics on Rastrigin using difference equations. This simulation will be compared with real runs of the (μ/μ_I, λ)-ES. In a concluding section a summary of the results and an outlook on future research will be given.

Rastrigin Function and Local Quality Change
The real-valued minimization problem defined for an N-dimensional search vector y = (y_1, ..., y_N) is performed on the Rastrigin test function f given by

    f(y) = Σ_{i=1}^{N} f_i(y_i) = Σ_{i=1}^{N} [y_i² + A (1 − cos (αy_i))],    (1)

with A denoting the oscillation amplitude and α = 2π the corresponding frequency. The quadratic term with superimposed oscillations yields a finite number of local minima M for each dimension i, such that the overall number of minima scales exponentially as M^N, posing a highly multimodal minimization problem. The global optimizer is at ŷ = 0.
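For reference, the test function can be stated compactly in code. The sketch below assumes the standard Rastrigin form f(y) = Σ [y_i² + A(1 − cos(αy_i))] with amplitude A and frequency α as described above:

```python
import math

def rastrigin(y, A=10.0, alpha=2 * math.pi):
    # f(y) = sum_i [ y_i^2 + A * (1 - cos(alpha * y_i)) ]; global minimum f(0) = 0
    return sum(yi * yi + A * (1.0 - math.cos(alpha * yi)) for yi in y)

print(rastrigin([0.0] * 5))    # global optimizer: 0.0
print(rastrigin([1.0, -1.0]))  # 2.0: the cosine term vanishes at integer coordinates
```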
For the progress rate analysis in Sect. 4 the local quality function Q_y(x) at y due to mutation vector x = (x_1, ..., x_N) is needed. In order to reuse results from noisy progress rate theory it will be formulated for the maximization case of F(y) = −f(y) with F_i(y_i) = −f_i(y_i), such that the local quality change yields

    Q_y(x) = F(y + x) − F(y) = Σ_{i=1}^{N} [F_i(y_i + x_i) − F_i(y_i)] = Σ_{i=1}^{N} Q_i(x_i).    (3)

Q_y(x) can be evaluated for each component i independently, giving

    Q_i(x_i) = −2y_i x_i − x_i² − A [cos (αy_i) − cos (α(y_i + x_i))].    (4)

A closed form solution of the progress rate appears to be obtainable only for a linearized expression of Q_i(x_i). A first approach taken in this paper is based on a Taylor expansion for the mutation x_i, discarding higher order terms,

    Q_i(x_i) ≈ f'_i x_i  with  f'_i = −2y_i − αA sin (αy_i),    (6)

using the following derivative terms

    f'_i = k_i + d_i,  k_i = −2y_i,  d_i = −αA sin (αy_i).    (7)

A second approach is to consider only the linear term of Eq. (4) and neglect all non-linear terms, denoted by δ(x_i), according to Q_i(x_i) = k_i x_i + δ(x_i). The linearization using f'_i is a local approximation of the function incorporating the oscillation parameters A and α. Using only k_i (setting d_i = 0) discards the oscillations by approximating the quadratic term via ∂(y_i²)/∂y_i = 2y_i, with negative sign due to maximization. Both approximations will be evaluated later.
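The quality of the two linearizations can be checked numerically. The sketch below (an illustration, assuming the standard Rastrigin component f_i(y_i) = y_i² + A(1 − cos(αy_i))) compares the exact component-wise quality change with the full derivative f'_i = k_i + d_i and with the quadratic part k_i alone:

```python
import math

A, ALPHA = 10.0, 2 * math.pi

def Q_i(y, x):
    # exact component-wise quality change for maximization:
    # Q_i(x) = F_i(y + x) - F_i(y) with F_i = -f_i
    f = lambda t: t * t + A * (1.0 - math.cos(ALPHA * t))
    return -(f(y + x) - f(y))

def fprime_i(y):
    # first approach: full local derivative f'_i = k_i + d_i
    return -2.0 * y - ALPHA * A * math.sin(ALPHA * y)

def k_i(y):
    # second approach: quadratic part only, oscillation discarded (d_i = 0)
    return -2.0 * y

y, x = 0.7, 1e-4
# for small x the Taylor linearization f'_i * x tracks Q_i closely,
# while k_i * x misses the (here dominating) oscillation term
print(Q_i(y, x), fprime_i(y) * x, k_i(y) * x)
```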

The (μ/μ I , λ)-ES with Normalized Mutations
The Evolution Strategy under investigation consists of a population of μ parents and λ offspring (μ < λ) per generation g. Algorithm 1 is presented below; offspring variables are denoted with an overset "∼".
Population variation is achieved by applying an isotropic normally distributed mutation x ∼ σN(0, 1) with strength σ to the parental recombinant in Lines 6 and 7. The recombinant is obtained using intermediate recombination of all μ parents, equally weighted, in Line 11. Selection of the m = 1, ..., μ best search vectors y_{m;λ} (out of λ) according to their fitness is performed in Line 10.
Note that the ES in Algorithm 1 operates under constant normalized mutation strength σ* in Lines 3 and 12 using the spherical normalization

    σ* = σN/R,  i.e.,  σ^(g) = σ* R^(g)/N.    (10)

This property ensures global convergence of the algorithm, as the mutation strength σ^(g) decreases if and only if the residual distance ‖y^(g)‖ = R^(g) decreases. While σ* is not known during black-box optimizations, it is used here to investigate the dynamical behavior of the ES using the first order progress rate approach to be developed in this paper. Incorporating self-adaptation of σ or cumulative step-size adaptation remains for future research.

First Order Progress Rate

Definition
Having introduced the Evolution Strategy, we are interested in the expected one-generation progress of the optimization on the Rastrigin function (1) before investigating the dynamics over multiple generations.
A first order progress rate ϕ_i for the i-th component between two generations g → g + 1 can be defined as the expectation value over the positional difference of the parental components, given mutation strength σ^(g) and position y^(g):

    ϕ_i = E[y_i^(g) − y_i^(g+1) | σ^(g), y^(g)].    (11)

First, an expression for y^(g+1) is needed, see Algorithm 1:

    y^(g+1) = (1/μ) Σ_{m=1}^{μ} y_{m;λ} = y^(g) + (1/μ) Σ_{m=1}^{μ} x_{m;λ}.    (12)

Taking the expectation E[y_i^(g+1)], setting x = σz with z ∼ N(0, 1), and inserting the expression back into (11) yields

    ϕ_i = −(σ^(g)/μ) Σ_{m=1}^{μ} E[z_{i;m;λ}].    (13)

Therefore progress can be evaluated by averaging over the expectations of the μ selected mutation contributions. In principle this task can be solved by deriving the induced order statistic density p_{i;m;λ} for the m-th best individual and subsequently solving the integration over the i-th component

    ϕ_i = −(σ^(g)/μ) Σ_{m=1}^{μ} ∫ z p_{i;m;λ}(z) dz.    (14)

However, the task of computing expectations of sums of order statistics under noise disturbance has already been discussed and solved by Arnold in [1]. Therefore the problem of Eq. (13) will be reformulated in order to apply the solutions provided by Arnold.
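The expectation above can also be estimated directly by a one-generation Monte Carlo experiment, which is how the simulation points in the later comparison figures are typically obtained. A sketch (illustrative, with the sign convention that positive ϕ_i means the i-th component moves toward the optimizer at 0):

```python
import math
import random

def rastrigin(y, A=10.0, alpha=2 * math.pi):
    return sum(t * t + A * (1.0 - math.cos(alpha * t)) for t in y)

def progress_rate_mc(f, y, sigma, mu=25, lam=100, trials=2000, i=0, seed=3):
    """Monte Carlo estimate of the first order progress rate of component i:
    phi_i = E[y_i - y_i^(g+1)] over one generation of a (mu/mu_I, lambda)-ES."""
    rng = random.Random(seed)
    N = len(y)
    acc = 0.0
    for _ in range(trials):
        # one generation: lambda mutations, select mu best, recombine component i
        offspring = [[yi + sigma * rng.gauss(0.0, 1.0) for yi in y] for _ in range(lam)]
        best = sorted(offspring, key=f)[:mu]
        y_new_i = sum(ind[i] for ind in best) / mu
        acc += y[i] - y_new_i
    return acc / trials

print(progress_rate_mc(rastrigin, [1.0] * 10, sigma=1.0))
```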

Expectations of Sums of Noisy Order Statistics
Let z be a random variate with density p_z(z) and zero mean. The density is expanded into a Gram-Charlier series by means of its cumulants κ_i (i ≥ 1) according to [1, p. 138, D.15], with expectation κ_1 = 0, variance κ_2, skewness γ_1, and excess γ_2 (higher order terms not shown), and with He_k denoting the k-th order probabilist's Hermite polynomials. For the problem at hand, see Eq. (13), the mutation variate is z ∼ N(0, 1) with κ_2 = 1 and κ_i = 0 for i ≠ 2, yielding a standard normal density. Furthermore, let ε ∼ N(0, σ_ε²) model an additive noise disturbance, such that the resulting observed values are

    v = z + ε.    (15)

Selection of the m-th largest out of λ observed values yields

    v_{m;λ} = z_{m;λ} + ε_{m;λ},    (16)

and the distribution of the selected source terms z_{m;λ} follows a noisy order statistic with density p_{m;λ}. Given this definition and a linear relation between z_{m;λ} and v_{m;λ}, the method of Arnold is applicable.
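The central effect captured by this machinery, namely that Gaussian noise on the selection criterion attenuates the expected selected source value by the factor 1/√(κ_2 + σ_ε²) relative to √κ_2, can be verified by direct simulation. The sketch below (an illustration, not from the paper; for Gaussian z and ε the attenuation factor is a standard exact result) compares noisy and noiseless truncation selection:

```python
import math
import random

def selected_mean(mu, lam, sigma_eps, trials=20000, seed=5):
    """MC estimate of (1/mu) * sum_m E[z_{m;lam}], where the mu largest observed
    values v = z + eps are selected, z ~ N(0,1), eps ~ N(0, sigma_eps^2)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, sigma_eps)) for _ in range(lam)]
        # select by the noisy observation v = z + eps, but average the source z
        best = sorted(pairs, key=lambda p: p[0] + p[1], reverse=True)[:mu]
        acc += sum(z for z, _ in best) / mu
    return acc / trials

mu, lam, sigma_eps = 10, 40, 1.5
ratio = selected_mean(mu, lam, sigma_eps) / selected_mean(mu, lam, 0.0)
# theory: noise attenuates the selected mean by 1/sqrt(1 + sigma_eps^2)
print(ratio, 1.0 / math.sqrt(1.0 + sigma_eps ** 2))
```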
In our case the i-th mutation component x_{i;m;λ} of Eq. (13) is related to selection via the quality change defined in Eq. (3). Maximizing the fitness F_i(y_i + x_i) conforms to maximizing the quality Q_i(x_i), with F_i(y_i) being a constant offset.
Aiming at an expression of the form (16) and starting with (3), we first isolate component Q_i from the remaining N − 1 components denoted by Σ_{j≠i} Q_j. Then approximations are applied to both terms, yielding

    Q_y(x) ≈ f'_i σ z_i + Σ_{j≠i} Q_j(x_j),

with linearization (6) applied to Q_i and x_i = σz_i. The remaining sum is approximated by a normal variate, as the sum of independent random variables asymptotically approaches a normal distribution in the limit N → ∞ due to the Central Limit Theorem. This is ensured by Lyapunov's condition, provided that there are no dominating components within the sum due to largely different values of y_j. The corresponding Rastrigin quality variance D_i² = Var[Σ_{j≠i} Q_j(x_j)] is calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). As the expectation E_i = E[Σ_{j≠i} Q_j(x_j)] is only an offset to Q_y(x), it has no influence on the selection and its calculation can be dropped. Using f'_i = sgn(f'_i)|f'_i|, one obtains

    Q_y(x) ≈ |f'_i| σ sgn(f'_i) z_i + Σ_{j≠i} Q_j(x_j).    (20)

The decomposition using sign function and absolute value is needed for the correct ordering of the selected values w.r.t. z_i in (20). Given result (20), one can define the linearly transformed quality measure v_i := (Q_y(x) − E_i)/(|f'_i|σ) and the noise variance σ_ε² := (D_i/(f'_i σ))², such that the selection of the mutation component sgn(f'_i) z_i is disturbed by a noise term due to the remaining N − 1 components. A relation of the form (16) is obtained up to the sign function.
In [1] Arnold calculated the expected value of arbitrary sums S_P of products of noisy ordered variates,

    S_P = Σ z^{p_1}_{n_1;λ} ⋯ z^{p_ν}_{n_ν;λ},    (21)

containing ν factors per summand, with the random variate z introduced in Eqs. (15) and (16). The vector P = (p_1, ..., p_ν) denotes the positive exponents and the distinct summation indices are denoted by the set {n_1, ..., n_ν}. The generic result for the expectation of (21) is provided in [1, p. 142, D.28] and was adapted to account for the sign difference between (16) and (20), resulting in a possibly exchanged ordering. Performing simple substitutions in Arnold's calculations in [1], recalling that in our case γ_1 = γ_2 = 0, and identifying P = (1), p_1 = 1, and ν = 1, the expected value yields

    (1/μ) Σ_{m=1}^{μ} E[z_{m;λ}] = sgn(f'_i) c_{μ/μ,λ} κ_2 / √(κ_2 + σ_ε²).    (22)

Now we are in the position to calculate expectation (13) using (22). Since z ∼ N(0, 1), it holds that κ_2 = 1. Inserting σ_ε² = (D_i/(f'_i σ))², and noting that f'_i = sgn(f'_i)|f'_i|, one finally obtains for the i-th component the first order progress rate

    ϕ_i(y, σ) = −c_{μ/μ,λ} σ² f'_i / √(σ² f'_i² + D_i²).    (26)

The population dependency is given by the progress coefficient c_{μ/μ,λ}. The fitness dependent parameters are contained in f'_i, see (7), and in D_i², calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). For better readability the derivative f'_i and the variance D_i² are not inserted into (26). An exemplary evaluation of D_i² as a function of the residual distance R, using normalization (10), is also shown in the supplementary material.

Comparison of Simulation and Approximation
Figure 1 shows an experimentally obtained progress rate compared to the result of (26). Due to the large N, one exemplary ϕ_i-graph is shown on the left, and the corresponding errors for i = 1, ..., N are shown on the right.
The left plot shows the progress rate over the σ-range [0, 1]. This magnitude was chosen in order to study the oscillation, as the frequency is α = 2π. The initial position was chosen randomly on the sphere surface of radius R = 10.
The red dashed curve uses f'_i as linearization, while the blue dash-dotted curve assumes f'_i = k_i (with d_i = 0), see also (7). As f'_i approximates the quality change only locally, agreement for the progress is given only for very small mutation strengths σ. For larger σ very large deviations may occur, depending on the local derivative.
The blue curve ϕ_i(k_i) neglects the oscillation (d_i = 0) and therefore follows the progress on the quadratic function f(y) = Σ_i y_i² for large σ with very good agreement. Due to the linearized form of Q_i(x_i) in (6), neither approximation can reproduce the oscillation for moderately large σ.
To verify the approximation quality, the error between (26) and the simulation is displayed on the right side of Fig. 1 for all i = 1, ..., N, both for small σ = 0.1 and large σ = 1. The deviations are very similar in magnitude for all i, given randomly chosen y_i. Note that for σ = 1 the red points show very large errors compared to the blue ones, which was expected.
Figure 2 shows the progress rate ϕ_i over σ*, for i = 2 as in Fig. 1, with y chosen randomly on the sphere surfaces of radii R = {100, 10, 1, 0.1}. Using σ*, the mutation strength σ is normalized by the residual distance R with the spherical normalization (10). Far from the origin, at R = {100, 10}, the quadratic terms dominate, giving better results for ϕ_i(k_i). Reaching R = 1, local minima become more relevant and mixed results are obtained, with ϕ_i(f'_i) better for smaller σ* and ϕ_i(k_i) better for larger σ*. Within the global attractor at R = 0.1, the local structure dominates and ϕ_i(f'_i) yields better results. These observations will be relevant when analyzing the dynamics in Fig. 3, where both approximations show strengths and weaknesses.

Evolution Dynamics
As we are interested in the dynamical behavior of the ES, averaged real optimization runs of Algorithm 1 will be compared to the iterated dynamics using progress result (26) by applying the dynamical systems approach [2]. Neglecting fluctuations, i.e., setting y_i^(g+1) = E[y_i^(g+1) | σ^(g), y^(g)], the mean value dynamics for the mapping y^(g) → y^(g+1) reads

    y_i^(g+1) = y_i^(g) − ϕ_i(y^(g), σ^(g)),  i = 1, ..., N.    (27)

The control scheme of σ^(g) was introduced in Eq. (10) and yields simply

    σ^(g) = σ* R^(g)/N.    (28)

Equations (27) and (28) describe a deterministic iteration in search space with rescaling of the mutation strength according to the residual distance. For a convergence analysis, we are interested in the dynamics of R^(g) = ‖y^(g)‖ rather than the actual position values y^(g). Hence in Fig. 3 the R^(g)-dynamics of the conducted experiments is shown. In Fig. 3, all runs of Algorithm 1 exhibit global convergence, with the black line showing the average. The left and right plots differ by population size. Iteration ϕ_i(k_i), blue dash-dotted curve, also converges globally, though very slowly, and is therefore not shown entirely. The convergence behavior of iteration ϕ_i(f'_i), red and orange dashed curves, strongly depends on the initialization and is discussed below.
Three phases can be observed in the simulation. It shows linear convergence at first, followed by a slow-down due to local attractors. Reaching the global attractor, the convergence speed increases again. Iteration ϕ_i(k_i) is able to model the first two phases to some degree. Within the global attractor the slope information d_i is missing, such that the progress is largely underestimated.
Iteration ϕ_i(f'_i) converges at first, but reaches a stationary state at R_st ≈ 20 when the progress ϕ_i becomes dominated by the derivative term d_i. Starting from R^(0) = 10², the stationary y_i^st are either fixed or alternating between coordinates, depending on σ, D_i, k_i, and d_i. This effect is due to the attraction of local minima and due to the deterministic iteration disregarding fluctuations. It occurs also for varying initial positions. Initialized at R^(0) = 10^−1, the orange iteration ϕ_i(f'_i) is globally converging.
It turns out that the splitting point of the two approximations in Fig. 3 occurs at a distance R to the global optimizer where the ES approaches the attractor region of the "first" local minima. For the model parameters considered in the experiment this is at about R ≈ 28.2, the distance of the farthest local minimizer to the global optimizer (obtained by numerical analysis).
The plots in Fig. 3 differ by population size. The convergence speeds, i.e. the slopes, show better agreement for large populations, which can be attributed to the fluctuations neglected in (27). Investigations on the unimodal functions Sphere [2] and Ellipsoid [4] have shown that progress is decreased by fluctuations due to a loss term scaling with 1/μ, which agrees with Fig. 3. On the left the iterated progress is faster due to neglected but present fluctuations, while on the right better agreement is observed due to insignificant fluctuations. These observations will be investigated in future research.
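The structure of such a mean-value iteration is easy to see in its simplest setting. The sketch below (an illustration, not the paper's Rastrigin iteration) iterates the residual distance on the sphere using the classical asymptotic first order progress rate ϕ* ≈ c_{μ/μ,λ}σ* − σ*²/(2μ) known from the sphere analysis in [2]; the numeric value of c_{μ/μ,λ} is an assumed stand-in:

```python
def iterate_R(R0, sigma_star, c_mu_lam, mu, N, generations):
    # deterministic mean-value iteration on the sphere:
    # R^(g+1) = R^(g) * (1 - phi*/N), with normalized progress rate
    # phi* = c * sigma* - sigma*^2 / (2*mu)  (first order, fluctuations neglected)
    phi_star = c_mu_lam * sigma_star - sigma_star ** 2 / (2 * mu)
    history = [R0]
    for _ in range(generations):
        history.append(history[-1] * (1.0 - phi_star / N))
    return history

# c_{mu/mu,lambda} ~ 1.27 for (25/25, 100) is an assumed illustrative value
hist = iterate_R(R0=100.0, sigma_star=2.0, c_mu_lam=1.27, mu=25, N=100, generations=200)
print(hist[-1])  # linear (log-scale) convergence of the residual distance
```

Constant σ* makes the contraction factor per generation constant, which is exactly the linear convergence seen in the first phase of Fig. 3; the Rastrigin iteration (27) deviates from this picture once the local attractors dominate ϕ_i.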

Summary and Outlook
A first order progress rate ϕ_i was derived in (26) for the (μ/μ_I, λ)-ES on the Rastrigin function (1) by means of noisy order statistics. To this end, the mutation-induced variance of the quality change, D_i², is needed. Starting from (4), a derivation yielding D_i² has been presented in the supplementary material. Furthermore, the approximation quality of ϕ_i was investigated using the Rastrigin and quadratic derivatives f'_i and k_i, respectively, by comparing with one-generation experiments.
Linearization f'_i shows good agreement for small-scale mutations, but very large deviations for large mutations. Conversely, linearization k_i yields significantly better results for large mutations, as the quadratic fitness term dominates there. A progress rate modeling the transition between the two regimes is yet to be determined. First numerical investigations of (14) including all terms of (4) indicate that the non-linear terms are needed for a better progress rate model, which is an open challenge and part of future research.
The obtained progress rate was used to investigate the dynamics by iterating (27) together with (28) and comparing with real ES runs. The iteration via f'_i only converges globally if initialized close to the optimizer, since local attraction strongly dominates otherwise. The dynamics via k_i converges globally independently of the initialization, but the observed rate matches only in the initial phase and for very large populations. This confirms the need for a higher order progress rate modeling the effect of fluctuations, especially when function evaluations are expensive and small populations must be used. Additionally, an advanced progress rate formula is needed that combines the effects of global and local attraction to model all three phases of the dynamics correctly.
The investigations done so far are a first step towards a full dynamical analysis of the ES on the multimodal Rastrigin function. Future investigations must also include the complete dynamical modeling of the mutation strength control. One aim is the tuning of the mutation control parameters such that the global convergence probability is increased while still maintaining search efficiency. Our final goal is the theoretical analysis of the full evolutionary process, yielding also recommendations regarding the choice of the minimal population size needed to converge to the global optimizer with high probability.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.