# Improving Sampling in Evolution Strategies Through Mixture-Based Distributions Built from Past Problem Instances


## Abstract

The notion of learning from different problem instances, although an old and well-known one, has in recent years regained popularity within the optimisation community. Notable endeavours have drawn inspiration from machine learning methods as a means for algorithm selection and solution transfer. Surprisingly, however, approaches centred around internal sampling models have not been revisited, even though notable algorithms have been established over the last decades. In this work, we progress along this direction by investigating a method that allows us to learn an evolutionary search strategy reflecting rough characteristics of a fitness landscape. This model of a search strategy is represented through a flexible mixture-based distribution, which can subsequently be transferred and adapted for similar problems of interest. We validate the approach in two series of experiments: we first demonstrate the efficacy of the recovered distributions, and subsequently investigate the transfer, using a systematic set of transformations from the literature to generate benchmarking scenarios.

## Keywords

Evolution strategies · Model-based optimisation · Continuous optimisation · Algorithm configuration · Transfer learning

## 1 Introduction

Within recent decades, the field of evolutionary computation has seen a surge of novel algorithms, frequently proposed with the intent to operate on very specific problem domains. While this reflects on one hand the efficacy of population-based and evolutionary approaches for a wide range of applications, it also reflects deep-rooted issues within the current state of the art, particularly: 1) the lack of a prescriptive theory on how to construct efficient algorithms for a given problem, and 2) a lack of understanding of what constitutes and characterizes optimisation problems, and their similarity, in a more generalized way. While theorists cannot give definitive answers to these questions at the moment, one may still legitimately ask whether it is possible to approach some of these problems from a pragmatic line of attack. For this reason, two popular trends have emerged within the optimisation community: 1) research on meta-learning frameworks [13, 18, 23] and 2) research on transfer learning approaches [4, 9, 12, 14, 17]. Both try to boost the efficiency of optimisation algorithms by using prior knowledge from previously solved problem instances.

In our work, we progress at the intersection of both lines of research by building a model of a search strategy from individual runs, which may subsequently be transferred to similar problem instances. To this end, we first give in Sect. 2 a brief overview of these two existing lines of research and the state of the art. Section 3 explains the extensions we introduce to consolidate a search strategy, and we demonstrate its functionality on an illustrative benchmark function. In Sect. 4.1, we widen the range of considered benchmark problems to a selected variety of multimodal and valley-shaped problems. Subsequently, in Sect. 4.2 we consider the scenario of transferring search strategies across problem instances generated by applying translations, rotations and various non-linear transformations to the benchmark functions. We conclude our study with a summary in Sect. 5 and give an outlook on future work.

## 2 Knowledge from Problem Solving Exercises

*Meta-learning frameworks* attempt to harness high-level knowledge that can subsequently be used to solve related tasks more efficiently. In the classical algorithm selection and algorithm configuration problem, this equates to predicting the best-performing algorithm or configuration for a given problem [13, 18]. However, a key problem in optimisation lies in the extraction and computation of said task-specific features in the first place. This is an especially outstanding problem within the domain of continuous optimisation, where, unlike in the combinatorial domain, problem features cannot simply be derived from the problem state or definition. Features thus have to be computed explicitly, in a cheap and at best informative manner.

*Transfer learning approaches*, on the other hand, may be seen as operating under more relaxed conditions. Essentially, transfer learning assumes that beneficial knowledge which helped solve one problem instance can be transferred, either directly or by means of a transformation, to a new problem instance. Notably, however, this introduces further uncertainties. The bulk of the transfer learning literature in optimisation draws inspiration from instance-based transfer [19] by transferring high-performing candidate solutions between tasks (e.g. [5, 12, 14, 22]). Retrieval of the candidate solutions occurs either directly from the previously solved tasks [5, 14] or through probabilistic sampling from a continuously built repository (e.g. [4, 17]). As a way of determining the probabilistic weights, task similarity measures may be used [17]; in many scenarios, however, solution similarity is simply used as a proxy for task similarity [4, 17]. In general, the lack of satisfying task similarity measures, together with proneness to uncontrolled ‘negative’ knowledge transfer which degrades algorithm performance [5], are known problems of these approaches.

Interestingly, aside from the works mentioned, barely any of the recent literature tries to learn across problem instances explicitly by means of internal sampling models, although, quite notably, many popular algorithms rely upon operators drawing random variables from symmetrical distributions and thus have isotropy assumptions built in by default. This assumption breaks down for any optimisation problem which does not resemble a flat plane. Quite intuitively, the interplay between algorithm and optimisation problem should enforce characteristic search strategies and behaviors. Modern model-based algorithms [10, 15] acknowledge this by adapting a distribution online during the optimisation run. However, they do not attempt to memorize these distributions in a rougher and more abstract form, such that they can be transferred across problem instances. In many ways, this perspective might also be the only meaningful notion by which to realize transfer learning in continuous single-objective optimisation. In the following, we build upon our previous work [7, 8] and tackle the issue in a study using a variant of the popular \((\mu , \,\lambda )\)-Evolution Strategy for continuous optimisation. We explicitly incorporate strategy parameters through a windowing approach and harness systematics from the literature to build benchmarking scenarios.

## 3 Extending the Evolution Strategy

In the following, we consider continuous single-objective optimisation problems of the form \(f: \chi \subseteq \mathbb {R}^d \rightarrow \mathbb {R}\), where \(\chi \) denotes the search space and *d* its associated dimensionality. As a base we use a variant of the Evolution Strategy with \((\mu , \lambda )\) selection mechanism [1]. We explicitly leave out any recombination operators to reduce the framework to its essentials: sampling mutations from a multivariate distribution and performing selection in an elitist manner. Note that, from an evolutionary perspective, mutation is the principal source of variation [16]. In many ways, this basic outline may resemble continuous variants of Evolutionary Programming. However, the elitist selection mechanism in Evolution Strategies has been implicated to contribute to performance improvements [2].

Population members \(\mathbf{s} (j)\) are represented by tuples \(\mathbf{s} (j) = [\mathbf{x} (j) , \varvec{\sigma }(j) ]\), where \(\mathbf{x} (j) = (x_1(j),\cdots ,x_d(j))\) is the population member’s representation in the solution space and \(\varvec{\sigma }(j)= (\sigma _1(j),\cdots ,\sigma _d(j))\) are its strategy parameters. The latter can be considered a key feature of Evolution Strategy implementations. Strategy parameters essentially control the shape of the normal distribution from which the mutations \(\varDelta \mathbf{x} (j)\) are drawn which shift the individuals \(\mathbf{x} '(j) = \mathbf{x} (j) + \varDelta \mathbf{x} (j)\) in the solution space. Likewise, variation operators can be defined such that they also vary and recombine the strategy parameters of population members. However, we neglect this extension within our study.
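As a point of reference, the base algorithm described above can be sketched in a few lines. The following is a minimal illustration, not the exact implementation used in this study; the sphere objective and all parameter values are assumptions chosen for demonstration.

```python
import numpy as np

def sphere(x):
    """Benchmark objective with its minimum of 0 at the origin."""
    return np.sum(x ** 2)

def mu_lambda_es(f, d=2, mu=5, lam=20, sigma=0.5, generations=200, seed=0):
    """Minimal (mu, lambda)-ES: Gaussian mutation, truncation selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(mu, d))    # parent representations x(j)
    sigmas = np.full((mu, d), sigma)          # strategy parameters sigma(j)
    for _ in range(generations):
        # each of the lam offspring mutates a randomly chosen parent
        idx = rng.integers(0, mu, size=lam)
        offspring = pop[idx] + sigmas[idx] * rng.standard_normal((lam, d))
        fitness = np.array([f(x) for x in offspring])
        best = np.argsort(fitness)[:mu]       # keep the mu best offspring
        pop, sigmas = offspring[best], sigmas[idx][best]
    return pop[np.argmin([f(x) for x in pop])]

best = mu_lambda_es(sphere)
```

Note that strategy parameters are kept fixed here; self-adaptive variation of \(\varvec{\sigma }(j)\) is precisely the extension neglected in the study.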

### 3.1 Quality-Based Filtering of Mutations
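A minimal sketch of the filtering idea: during a run, only those mutations \(\varDelta \mathbf{x} (j)\) which led to a fitness improvement are retained for building the empirical distribution. The (1+1)-style loop, the sphere objective and all parameter values below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def sphere(x):
    return np.sum(x ** 2)

def collect_beneficial_mutations(f, d=2, sigma=0.5, steps=2000, seed=1):
    """Run a simple (1+1)-style search and log only quality-filtered
    mutations, i.e. steps dx that improved the parent's fitness."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=d)
    kept = []
    for _ in range(steps):
        dx = sigma * rng.standard_normal(d)
        if f(x + dx) < f(x):      # quality filter: keep improving mutations
            kept.append(dx)
            x = x + dx
    return np.array(kept)

mutations = collect_beneficial_mutations(sphere)
```

The retained sample of mutations is what the density estimator of Sect. 3.2 is fitted on.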

### 3.2 Constructing Operators from Empirical Distributions

**Choosing a Density Estimator.** While by default, mutations are sampled in the Evolution Strategy from a multivariate normal distribution as given in Eq. (1), for empirical distributions one explicitly has to use a modeling technique. In principle, many techniques are available for this purpose. However, in the following we will use the Gaussian mixture model, as it is a well-studied model which can act as a universal density approximator. Mixture models reduce the input data to a small set of descriptive clusters which are parametrized by multivariate normal distributions, such that the full data distribution can be expressed as \(p(\mathbf {x}) = \sum ^{K}_{k=1} \pi _k \cdot \mathcal {N}(\mathbf {x}|\varvec{\mu }_k,\mathbf {\varSigma }_k)\), with mixture coefficients \(\pi _k\), which are normalized such that \(\sum _{k=1}^K \pi _k =1\), and determined together with the means \(\varvec{\mu }_k\) and covariances \(\mathbf {\varSigma }_k\) by maximizing the log-likelihood through the expectation-maximization algorithm [3, 20].
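As a concrete illustration, a Gaussian mixture can be fitted with scikit-learn [20], whose `GaussianMixture` estimator implements the expectation-maximization procedure described above. The synthetic two-cluster data below stands in for a collected set of mutations and is purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed stand-in for the collected quality-filtered mutations:
# two clusters of 2-D steps, mimicking an anisotropic empirical distribution.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[1.0, 0.0], scale=[0.3, 0.1], size=(200, 2)),
    rng.normal(loc=[-1.0, 0.0], scale=[0.3, 0.1], size=(200, 2)),
])

# Fit p(x) = sum_k pi_k N(x | mu_k, Sigma_k) via expectation-maximization.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

# The fitted model can now replace the default normal distribution
# as the sampling distribution for mutations.
new_mutations, _ = gmm.sample(50)
```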

**Incorporating Strategy Parameters.** However, an outstanding problem still lies in the fact that the Evolution Strategy possesses strategy parameters \(\varvec{\sigma }\) which control the shape of the distribution from which mutations are sampled. Changing the shape of an empirical distribution used as a basis for improved sampling should not break the contained spatial information. Therefore, we simply window the empirical distribution with the multivariate normal distribution spanned by the strategy parameters as defined by Eq. (1). Effectively, this results in a reweighing of the mixture model where we replace each original mixture coefficient \(\pi _k\) with a coefficient proportional to the overlap between the Gaussian window and the *k*-th mixture component. This overlap can be calculated analytically.
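One way to realize this reweighing uses the standard identity that the overlap integral of two Gaussian densities is itself a Gaussian density, \(\int \mathcal {N}(\mathbf {x}|\varvec{\mu }_k,\mathbf {\varSigma }_k)\,\mathcal {N}(\mathbf {x}|\mathbf {m},\mathbf {S})\,d\mathbf {x} = \mathcal {N}(\mathbf {m}|\varvec{\mu }_k,\mathbf {\varSigma }_k+\mathbf {S})\). The sketch below is our illustration of this calculation, not necessarily the paper's exact formula; all values in the usage example are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def reweigh(pis, mus, covs, window_mean, window_cov):
    """Window a Gaussian mixture with the normal distribution spanned by
    the strategy parameters: each coefficient pi_k is scaled by the overlap
    integral of the k-th component with the window, which for two Gaussians
    equals N(window_mean | mu_k, Sigma_k + window_cov)."""
    overlaps = np.array([
        multivariate_normal.pdf(window_mean, mean=m, cov=c + window_cov)
        for m, c in zip(mus, covs)
    ])
    new = pis * overlaps
    return new / new.sum()   # renormalize so coefficients sum to one

# Hypothetical two-component mixture; window centred on the first component.
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
new_pis = reweigh(pis, mus, covs, window_mean=np.zeros(2), window_cov=np.eye(2))
```

As expected, the component far from the window centre receives almost no weight after reweighing.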

*Benchmark functions used in this study, grouped from top to bottom according to landscape structure. 1st–3rd row: unimodal and valley-shaped problems. 4th–6th row: multimodal problems with a single global optimum and strong regularity. 7th–9th row: difficult multimodal problems with a single global optimum and high irregularity.*

## 4 Experimental Study

The following study is based upon the DEAP library for Evolutionary Computation [6] with the extensions as elaborated in Sect. 3. We first investigate in Sect. 4.1 whether distributions of beneficial mutations can be harnessed at all to realize performance improvements on a selected range of different continuous optimisation problems. Subsequently in Sect. 4.2 we investigate different transfer scenarios between problem instances. Particularly, we build these scenarios by harnessing existing systematics from the literature.

### 4.1 On the Efficacy of Distributions of Beneficial Mutations

Medians \(\tilde{f}_{\text {min}}\), means \(\overline{f}_{\text {min}}\) and standard deviations \(\sigma _{\text {min}}\) of the minimum fitness after 1000 generations, aggregated from 100 runs, for default sampling using a normal distribution \((\mathcal {N})\) and improved sampling using a mixture model of quality-filtered mutations \((\mathcal {M})\). Further, normalized ranks *z* and p-values for a two-tailed Wilcoxon rank-sum test have been calculated. For a significance level of \(\alpha =0.05\) the null hypothesis can be considered rejected in all experiments.

Benchmark | \(\tilde{f}_{\text {min}} \,(\mathcal {N})\) | \(\overline{f}_{\text {min}} \,(\mathcal {N})\) | \(\sigma _{\text {min}} \,(\mathcal {N})\) | \(\tilde{f}_{\text {min}} \,(\mathcal {M})\) | \(\overline{f}_{\text {min}} \,(\mathcal {M})\) | \(\sigma _{\text {min}} \,(\mathcal {M})\) | \(|z|\) | p-value
---|---|---|---|---|---|---|---|---
Sphere | 5.528e−4 | 1.303e−3 | 1.975e−3 | 2.360e−5 | 3.600e−5 | 3.630e−5 | 9.713e+0 | 2.670e−22
Bohachevsky | 1.668e−2 | 4.888e−2 | 8.349e−2 | 2.754e−3 | 4.580e−3 | 5.222e−3 | 8.107e+0 | 5.180e−16
Rosenbrock | 9.098e−3 | 1.013e−1 | 5.641e−1 | 6.096e−4 | 9.370e−4 | 1.081e−3 | 1.035e+1 | 4.390e−25
Rastrigin | 1.924e−1 | 5.067e−1 | 7.974e−1 | 6.609e−3 | 9.327e−3 | 9.066e−3 | 1.133e+1 | 9.060e−30
Ackley | 1.223e−1 | 9.722e−1 | 3.221e+0 | 2.957e−2 | 3.297e−2 | 1.575e−2 | 9.287e+0 | 1.580e−20
Griewank | 3.686e−1 | 3.653e+0 | 5.780e+0 | 1.011e−4 | 1.747e−3 | 3.744e−3 | 1.172e+1 | 1.050e−31
Schwefel | 3.523e+0 | 5.413e+1 | 7.509e+1 | 2.595e−3 | 1.189e+0 | 1.178e+1 | 1.201e+1 | 3.270e−33
Eggholder | 4.959e+0 | 6.135e+1 | 7.786e+1 | 3.730e−5 | 1.131e+1 | 3.494e+1 | 9.935e+0 | 2.940e−23
Rana | 1.652e+0 | 1.832e+1 | 2.939e+1 | 1.650e+0 | 4.359e+0 | 6.102e+0 | 2.622e+0 | 8.748e−3
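The significance tests reported above can, in principle, be reproduced with `scipy.stats.ranksums`, which implements the two-tailed Wilcoxon rank-sum test. The fitness samples below are synthetic stand-ins, not the actual experimental data.

```python
import numpy as np
from scipy.stats import ranksums

# Hypothetical minimum-fitness samples from 100 runs of each sampler.
rng = np.random.default_rng(0)
f_normal = rng.lognormal(mean=-7.0, sigma=1.0, size=100)    # default N sampling
f_mixture = rng.lognormal(mean=-10.0, sigma=1.0, size=100)  # mixture M sampling

# Two-tailed Wilcoxon rank-sum test; z is the normalized rank statistic.
z, p = ranksums(f_normal, f_mixture)
reject_null = p < 0.05   # significance level alpha = 0.05
```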

### 4.2 Cross-Instance Transfer Scenarios

In the following section we consider cross-instance transfer learning scenarios. That is, we try to transfer a mutation operator learned on a source optimisation problem to a target problem (cf. Fig. 1) in the hope of realizing performance improvements. To generate variations of the source problem instances, we apply a systematic set of transformations proposed by Hansen et al. [11].

**Transformations of the Fitness Landscape.** The following base transformations are designed to explicitly break the well-behavedness of our optimisation problems by acting upon the decision variables \(\mathbf {x}\). *Ill-conditioning* introduces fast-running components by means of a linear rescaling. The *asymmetrical* transformation breaks the symmetry of components \(x_i\) under sign transformations. The *oscillatory* transformation introduces sinusoidal variability of the components.
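For illustration, the three base transformations can be sketched following our reading of the BBOB definitions [11]; the exact constants and parameter values below (\(\alpha \), \(\beta \), the oscillation coefficients) are taken from that source to the best of our knowledge and should be treated as assumptions.

```python
import numpy as np

def lambda_alpha(alpha, d):
    """Ill-conditioning: diagonal rescaling with entries alpha^(0.5*i/(d-1))."""
    i = np.arange(d)
    return np.diag(alpha ** (0.5 * i / (d - 1)))

def t_asy(x, beta=0.2):
    """Asymmetry: raises positive components to a coordinate-dependent power,
    breaking symmetry under sign changes."""
    d = len(x)
    i = np.arange(d)
    exponent = 1 + beta * (i / (d - 1)) * np.sqrt(np.maximum(x, 0))
    return np.where(x > 0, np.abs(x) ** exponent, x)

def t_osz(x):
    """Oscillations: sinusoidal perturbation applied in log-space."""
    xhat = np.where(x != 0, np.log(np.abs(np.where(x != 0, x, 1.0))), 0.0)
    c1 = np.where(x > 0, 10.0, 5.5)
    c2 = np.where(x > 0, 7.9, 3.1)
    return np.sign(x) * np.exp(xhat + 0.049 * (np.sin(c1 * xhat) + np.sin(c2 * xhat)))

y = t_osz(np.array([1.0, -2.0, 0.0]))
```

Composing these maps with rotations and translations of \(\mathbf {x}\) yields the problem variants used in the transfer scenarios below.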

Medians \(\tilde{f}_{\text {min}}\), means \(\overline{f}_{\text {min}}\) and standard deviations \(\sigma _{\text {min}}\) of the minimum fitness after 1000 generations, aggregated from 100 runs, for default sampling (upper rows) and transfer scenarios (bottom rows). Further, normalized ranks *z* and p-values for a two-tailed Wilcoxon rank-sum test are given. For a significance level of \(\alpha =0.05\) the null hypothesis can be considered rejected in all experiments.

Scenarios | \(\tilde{f}_{\text {min}}\) | \(\overline{f}_{\text {min}}\) | \(\sigma _{\text {min}}\) | \(|z|\) | p-value
---|---|---|---|---|---
\(\text {S}_{0}^{*}\) | 8.319e−4 | 1.443e−3 | 1.854e−3 | – | –
\(\text {R}_{0}^{*}\) | 2.173e−1 | 9.599e+0 | 4.118e+1 | – | –
\(\text {S}_{1}\) | 2.623e−3 | 1.859e−1 | 1.781e+0 | – | –
\(\text {A}_{1}\) | 1.639e−1 | 3.101e+0 | 6.812e+0 | – | –
\(\text {R}_{1}\) | 1.813e−1 | 6.901e−1 | 1.438e+0 | – | –
\(\text {G}_{1}\) | 3.298e+0 | 7.139e+0 | 8.870e+0 | – | –
\(\text {S}_{0}^{*} \rightarrow \text {S}_{1}\) | 5.786e−4 | 7.114e−4 | 7.064e−4 | 7.281e+0 | 3.306e−13
\(\text {S}_{1} \rightarrow \text {S}_{0}^{*}\) | 1.899e−4 | 2.299e−4 | 1.751e−4 | 6.744e+0 | 1.544e−11
\(\text {A}_{0} \rightarrow \text {A}_{1}\) | 2.839e−2 | 1.827e+0 | 5.716e+0 | 8.721e+0 | 2.771e−18
\(\text {A}_{1} \rightarrow \text {A}_{0}\) | 3.301e−2 | 3.774e−2 | 2.379e−2 | 9.928e+0 | 3.161e−23
\(\text {R}_{0} \rightarrow \text {R}_{1}\) | 5.689e−2 | 8.816e−2 | 8.894e−2 | 6.304e+0 | 2.902e−10
\(\text {R}_{1} \rightarrow \text {R}_{0}\) | 6.984e−2 | 1.051e−1 | 1.111e−1 | 5.603e+0 | 2.111e−8
\(\text {G}_{0} \rightarrow \text {G}_{1}\) | 3.034e−1 | 1.305e+0 | 4.749e+0 | 6.842e+0 | 7.837e−12
\(\text {G}_{1} \rightarrow \text {G}_{0}\) | 1.261e−1 | 1.773e+0 | 1.793e−1 | 3.531e+0 | 4.145e−4
\(\text {S}_{0}^{*}\rightarrow \text {G}_{0}\) | 3.531e+0 | 5.110e+0 | 4.490e+0 | 3.592e+0 | 3.284e−4
\(\text {G}_{0} \rightarrow \text {S}_{0}^{*}\) | 6.850e−5 | 1.386e−4 | 1.488e−4 | 3.722e+0 | 1.974e−4

**Experimental Validation.** We investigate the utility of the transformations in a set of 9 experiments with 4 transformed standard problems. We consider the transfer from source problem to target problem \(P_0 \rightarrow P_1\), and likewise the transfer in the reverse direction \(P_1 \rightarrow P_0\). In the following, we use the sphere function \(S_1\) with ill-conditioning, \(45^\circ \) rotation and search space extended to \([-100,100]^2\); Ackley's function \(A_1\) with a translation of \(\mathbf {t}=(-15,20)\) and subsequently added oscillations and asymmetries; Rastrigin's function \(R_1\) with \(22.5^\circ \) rotation, a small shift \(\mathbf {t}=(3,2)\), search space extended to \([-100,100]^2\) and added asymmetry; as well as Griewank's function \(G_1\) with \(20^\circ \) rotation and added oscillations. Further, we denote the sphere and Rastrigin's function with search spaces extended to \([-100,100]^2\) as \(S_0^*\) and \(R_0^*\). Heightmaps of the most altered benchmark problems are plotted in Fig. 5. We find that in most considered transfer scenarios, performance improvements can be realized (Table 3). However, finding difficult and interesting scenarios without making them obvious is a hurdle. For example, in our experiments the scenario \(S_0^*\rightarrow G_0\) features negative transfer, as the transferred distribution is simply adapted for a unimodal fitness landscape with a small search space.

## 5 Conclusions

In this paper, we have investigated an approach which allows us to learn an evolutionary search strategy reflecting rough and globally averaged characteristics of a fitness landscape. We represented this search strategy through flexible mixture-based distributions of beneficial mutations, which serve as a basis for improved operators. These distributions can be considered improved because they enable us to lift the isotropy assumption usually built into mutation operators, thus ingraining the problem structure and redistributing probability weight radially to more appropriately balance exploration and exploitation on a given problem instance. The distribution can be further adapted through a Gaussian reweighing approach, thus emulating the role strategy parameters play for sampling with a default normal distribution; however, this only seems useful in a limited range of scenarios. We showed that unweighted distributions can indeed lead to performance improvements on a large variety of problems, although good prior convergence properties of the default sampling approach seem to be an essential prerequisite. Further, we investigated systematically built transfer scenarios and could realize performance improvements in these as well. However, we openly acknowledge the difficulty of finding meaningful and difficult transfer scenarios. Part of the problem stems from the fact that it is unclear to what degree one can alter a problem such that it may still be considered an instance of the former. Introducing and investigating systematic transformations should be one of the first key steps towards resolving this issue. In the future, we plan to investigate the proposed framework in higher dimensions for improved transfer scenarios, as well as to look into measures of problem similarity, potentially by means of fitness landscape analysis.

## Acknowledgements

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 766186. It was also supported by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386), the Shenzhen Science and Technology Program (Grant No. KQTD2016112514355531), and the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008).

## References

- 1. Bäck, T.: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, Oxford (1996)
- 2. Bäck, T., Rudolph, G., Schwefel, H.P.: Evolutionary programming and evolution strategies: similarities and differences. In: Proceedings of the Second Annual Conference on Evolutionary Programming. Citeseer (1993)
- 3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
- 4. Da, B., Gupta, A., Ong, Y.: Curbing negative influences online for seamless transfer evolutionary optimization. IEEE Trans. Cybern. **99**, 1–14 (2018). https://doi.org/10.1109/TCYB.2018.2864345
- 5. Feng, L., Ong, Y.S., Jiang, S., Gupta, A.: Autoencoding evolutionary search with learning across heterogeneous problems. IEEE Trans. Evol. Comput. **21**(5), 760–772 (2017)
- 6. Fortin, F.A., De Rainville, F.M., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. **13**, 2171–2175 (2012)
- 7. Friess, S., Tiňo, P., Menzel, S., Sendhoff, B., Yao, X.: Representing experience through problem-tailored search operators. In: 2020 IEEE World Congress on Computational Intelligence (2020)
- 8. Friess, S., Tiňo, P., Menzel, S., Sendhoff, B., Yao, X.: Learning transferable variation operators in a continuous genetic algorithm. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 2027–2033, December 2019. https://doi.org/10.1109/SSCI44817.2019.9002976
- 9. Gupta, A., Ong, Y.S., Feng, L.: Insights on transfer optimization: because experience is the best teacher. IEEE Trans. Emerg. Top. Comput. Intell. **2**(1), 51–64 (2018)
- 10. Hansen, N.: The CMA evolution strategy: a comparing review. In: Lozano, J.A., Larrañaga, P., Inza, I., Bengoetxea, E. (eds.) Towards a New Evolutionary Computation, pp. 75–102. Springer (2006). https://doi.org/10.1007/3-540-32494-1_4
- 11. Hansen, N., Finck, S., Ros, R., Auger, A.: Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions (2009)
- 12. Jiang, M., Huang, Z., Qiu, L., Huang, W., Yen, G.G.: Transfer learning-based dynamic multiobjective optimization algorithms. IEEE Trans. Evol. Comput. **22**(4), 501–514 (2017)
- 13. Kerschke, P., Hoos, H.H., Neumann, F., Trautmann, H.: Automated algorithm selection: survey and perspectives. Evol. Comput. **27**(1), 3–45 (2019)
- 14. Koçer, B., Arslan, A.: Genetic transfer learning. Expert Syst. Appl. **37**(10), 6997–7002 (2010)
- 15. Bengoetxea, E., Larrañaga, P., Bloch, I., Perchant, A.: Estimation of distribution algorithms: a new evolutionary computation approach for graph matching problems. In: Figueiredo, M., Zerubia, J., Jain, A.K. (eds.) EMMCVPR 2001. LNCS, vol. 2134, pp. 454–469. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44745-8_30
- 16. Losos, J.B.: The Princeton Guide to Evolution. Princeton University Press, Princeton (2017)
- 17. Louis, S.J., McDonnell, J.: Learning with case-injected genetic algorithms. IEEE Trans. Evol. Comput. **8**(4), 316–328 (2004)
- 18. Muñoz, M.A., Sun, Y., Kirley, M., Halgamuge, S.K.: Algorithm selection for black-box continuous optimization problems: a survey on methods and challenges. Inf. Sci. **317**, 224–245 (2015)
- 19. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. **22**(10), 1345–1359 (2010)
- 20. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. **12**, 2825–2830 (2011)
- 21. Rice, J.R.: The algorithm selection problem. Adv. Comput. **15**, 65–118 (1976)
- 22. Ruan, G., Minku, L.L., Menzel, S., Sendhoff, B., Yao, X.: When and how to transfer knowledge in dynamic multi-objective optimization. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 2034–2041. IEEE (2019)
- 23. Smith-Miles, K.A.: Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv. **41**(1), 6 (2009)

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.