A new Taxonomy of Continuous Global Optimization Algorithms

Surrogate-based optimization and nature-inspired metaheuristics have become state of the art in solving real-world optimization problems. Still, it is difficult for beginners, and even experts, to get an overview of the large number of available methods in continuous optimization and of their comparative advantages. Available taxonomies lack the integration of surrogate-based approaches and thus their embedding in the larger context of this broad field. This article presents a taxonomy of the field that matches the idea of nature-inspired algorithms, as it is based on human behavior in path finding. Intuitive analogies make it easy to grasp the most basic principles of the search algorithms, even for beginners and non-experts in this area of research. At the same time, this scheme does not oversimplify the high complexity of the different algorithms, as the class identifier only defines a descriptive meta-level of the algorithm search strategies. The taxonomy was established by exploring and matching algorithm schemes, extracting similarities and differences, and creating a set of classification indicators to distinguish between five distinct classes. In practice, this taxonomy supports recommendations on the applicability of the corresponding algorithms and helps developers trying to create or improve their own algorithms.


Introduction
Continuous global optimization (CGO) tackles various difficult problems emerging from the context of complex physical or chemical processes. Solving optimization problems of this kind necessarily relies on performing real-world experiments or on using computer simulations, which are frequently employed in a black-box fashion. A fundamental challenge in such systems is the high cost of function evaluations. Whether we are probing the real physical system or querying the simulator, the time needed to receive an objective function value is typically very high and can range from hours to months. CGO methods for such problems thus need to fulfill a certain set of requirements. They need to work with black-box style probes only, i.e., without any further information on the structure of the problem. Further, they must approach the vicinity of the global optimum with a limited number of function evaluations. The improvement of computational power in the last decades has influenced the development of algorithms. A massive amount of computational power became available to researchers worldwide through multi-core desktop machines, parallel computing, and high-performance computing clusters. This has advanced the following fields of research: firstly, the development of more complex, nature-inspired, and generally applicable heuristics, so-called metaheuristics; secondly, major advances in the field of accurate, data-driven approximation models, so-called surrogate models, and their embodiment in an optimization process. Nowadays, CGO differs largely from early approaches. For example, multi-stage methods do not evaluate the objective function directly on the problem. Instead, they utilize a combination of surrogate modeling with classical or metaheuristic optimization methods to maximize the use of available problem information. These frameworks, such as sequential parameter optimization (Bartz-Beielstein, Lasarczyk, and Preuß, 2005) or the surrogate management framework (Booker, Dennis Jr,
Frank, Serafini, Torczon, and Trosset, 1999; Serafini, 1999), define a new class of algorithms that are not well integrated in previous taxonomies. In this article we propose a new taxonomy on the basis of algorithm features and provide plausible descriptions founded on natural human path-finding behavior. To establish a comprehensive taxonomy, we focus on identifying key elements of algorithm design and utilize these to define a clear separation between a small number of algorithm classes. Although abstraction is necessary for developing our results, we will present results that are useful for practitioners. The utilized abstraction allows us to present easily comprehensible ideas on how the individual classes differ and, moreover, how the respective algorithms perform their search. For this purpose, we divide CGO algorithms into five intuitive classes: Exact, Wanderer, Guide, Cartographer, and Hybrid. This article addresses different kinds of readers: beginners will find an intuitive taxonomy of CGO algorithms, especially with regard to common metaheuristics and newer developments in the field of surrogate-based optimization. For advanced readers, we also discuss the suitability of certain classes for specific problem properties to provide basic knowledge for reasonable algorithm selection. An extensive list of references is provided for experienced users. The taxonomy can be used to create realistic comparisons and benchmarks for the different classes of algorithms. It further provides insights for users who aim to develop new search strategies, operators, and algorithms. The goal of global optimization is to find the overall best solution, i.e., for the common task of minimization, to discover decision variable values which minimize the objective function value. We denote the global search space as the compact set S = {x | x_l ≤ x ≤ x_u} with x_l, x_u ∈ R^n being the explicit, finite lower and upper bounds on x. Given a real-valued objective function
f: R^n → R with real-valued input vectors x, we attempt to find the location x ∈ R^n which minimizes the function: arg min_{x ∈ S} f(x).
Finding the global optimum is always the ultimate goal and as such desirable, but for many practical problems a solution improving the current best solution within a given budget of evaluations or time will still be a success. Particularly in CGO, the global optimum commonly cannot be identified exactly; thus, modern heuristics are designed to spend their resources as efficiently as possible to approximate near-best solutions, while finding the global optimum is never guaranteed. The remainder of this article is organized as follows: Section 2 presents the development of optimization algorithms and their core concepts. Section 3 motivates a new taxonomy by reviewing the history of available CGO taxonomies, illustrates algorithm design aspects, and presents extracted classification features. Sections 4 to 8 introduce the five different classes of the new taxonomy with examples and suggestions regarding their applicability. Section 9 summarizes and concludes the article with recent trends and challenges in CGO and currently important research fields.
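The black-box setting described above can be made concrete with a small sketch (our own illustration, not from the article): a box-bounded objective is wrapped so that an optimizer may only probe it point-wise, under a finite evaluation budget. The helper name `make_black_box` and the sphere test function are hypothetical.

```python
import math

def make_black_box(f, budget):
    """Wrap an objective so it behaves like an expensive black box:
    only point-wise probes are possible and a finite evaluation
    budget is enforced; the wrapper tracks the best solution found."""
    state = {"evals": 0, "best_x": None, "best_y": math.inf}

    def probe(x):
        if state["evals"] >= budget:
            raise RuntimeError("evaluation budget exhausted")
        state["evals"] += 1
        y = f(x)
        if y < state["best_y"]:
            state["best_y"], state["best_x"] = y, list(x)
        return y

    return probe, state

# Toy stand-in for an expensive simulator: the sphere function on S = [-5, 5]^2.
sphere = lambda x: sum(v * v for v in x)
probe, state = make_black_box(sphere, budget=100)
probe([1.0, 2.0])  # one "expensive" evaluation, returns 5.0
```

Any algorithm discussed in this article only ever sees `probe`, never the structure of `f` itself; this is the sense in which CGO methods must work with black-box probes only.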

Evolution of Optimization Algorithms
In order to develop a taxonomy, it is necessary to understand the methodology and development history of the corresponding algorithms. Before presenting the requirements for the new taxonomy in Section 3, we will describe the fundamental principles of modern search algorithms, in particular the elements and backgrounds of surrogate-based optimization.

Heuristics and Metaheuristics
In modern computer-aided optimization, heuristics and metaheuristics are well-established solution techniques. Although they present solutions which are not guaranteed to be optimal or perfect, their general applicability and their ability to present fast, sufficiently good solutions make them very attractive for applied optimization, particularly for industrial problems. They are built upon the principle of trial and error, where solution candidates are evaluated and rewarded with a fitness. The term fitness has its origins in evolutionary computation (Eiben and Smith, 2015), where the fitness describes the competitive ability of an individual in the reproduction process. The fitness is, in its simplest form, the objective function value y = f(x) in relation to the optimization goal; e.g., in a minimization problem, smaller values have a higher fitness. Moreover, it can be part of the search strategy, e.g., scaled or adjusted by additional functions, particularly for multi-objective or constrained optimization. Heuristics can be defined as problem-dependent algorithms, which are developed or adapted to the particularities of a specific optimization problem or problem instance (Pearl, 1985). Typically, heuristics perform evaluations in a systematic manner, although utilizing stochastic elements. Heuristics use this principle to provide fast, not necessarily exact (i.e., not optimal) numerical solutions to optimization problems. Moreover, heuristics are often greedy to provide fast solutions, but get trapped in local optima and fail to find the global optimum.
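The fitness notion can be sketched as follows (our own toy example; the `fitness` helper and the static penalty factor are hypothetical): for minimization, the raw objective value can serve as fitness directly, and adjustment for constrained optimization can be done with an additional penalty term.

```python
def fitness(x, f, constraints=(), penalty=1e3):
    """Simplest fitness for a minimization problem: the objective value
    itself (smaller value = higher fitness), optionally adjusted by a
    penalty for violated constraints; each constraint g is treated as
    feasible when g(x) <= 0."""
    violation = sum(max(0.0, g(x)) for g in constraints)
    return f(x) + penalty * violation

f = lambda x: (x[0] - 2.0) ** 2
g = lambda x: x[0] - 1.0              # feasible region: x[0] <= 1
fitness([0.5], f)                     # unconstrained: 2.25
fitness([0.5], f, constraints=(g,))   # feasible, no penalty: 2.25
fitness([3.0], f, constraints=(g,))   # infeasible: 1.0 + 1000 * 2.0
```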
In their early days in the 1960s, heuristics were not considered reliable problem solvers, because most researchers in academia preferred classical mathematical approaches and only some practitioners used heuristics to get fast, possibly inaccurate solutions (Zanakis and Evans, 1981). This situation changed in the 1970s, when heuristic optimization became a major part of academic research. Possible reasons for this change might be:
• The need to solve more sophisticated nondeterministic polynomial-time (NP) problems, which could not be solved efficiently with exact algorithms (Fomin and Kaski, 2013).
• The availability and easy access for academics to more computational power.
Further advantages of heuristics were summarized by Zanakis and Evans (1981). The most important are:
• Simplicity of the algorithm.
• Accuracy, i.e., small error of final solution.
• Robustness, i.e., good solutions within reasonable time for different problems.
• Speed, i.e., short duration of the computation.
While heuristics are developed and optimized to efficiently solve a certain problem, the improved availability of computational resources gave rise to higher-level heuristics, the metaheuristics. Metaheuristics can be defined as problem-independent, general-purpose optimization algorithms. They are applicable to a wide range of problems and problem instances. The term meta describes the higher-level general methodology, which is utilized to guide the underlying heuristic strategy (Talbi, 2009).
They share the following characteristics (Boussaïd, Lepagnot, and Siarry, 2013):
• The algorithms are nature-inspired; they follow certain principles from natural phenomena or behaviors (e.g., biological evolution, physics, social behavior).
• The search process involves stochastic parts; it is based on probability distributions and random processes.
• They do not use the gradient or Hessian of the objective function, nor do they rely on information about the process which is available before the start of the optimization run, so-called a priori information.
• As they are meant to be generally applicable solvers, they include a set of control parameters to adjust the search strategy.
In the remainder of this article we will focus on heuristic, respectively metaheuristic, algorithms.
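These shared characteristics can be illustrated with a minimal simulated annealing sketch, a physics-inspired metaheuristic: the search is stochastic, it uses only point-wise objective values (no gradients or a priori information), and it exposes control parameters such as the initial temperature, cooling rate, and step size. All names and parameter defaults below are our own illustrative assumptions, not taken from the article.

```python
import math
import random

def simulated_annealing(f, x0, lower, upper, t0=1.0, cooling=0.95,
                        step=0.5, iters=200, seed=1):
    """Minimal simulated annealing for a box-bounded continuous problem:
    stochastic neighbor generation plus a temperature-controlled
    acceptance rule; no gradients or a priori information are used."""
    rng = random.Random(seed)
    x, y, t = list(x0), f(x0), t0
    best_x, best_y = list(x), y
    for _ in range(iters):
        # nature-inspired move: random Gaussian perturbation, clipped to bounds
        cand = [min(max(xi + rng.gauss(0.0, step), lo), hi)
                for xi, lo, hi in zip(x, lower, upper)]
        y_cand = f(cand)
        # accept improvements always, deteriorations with Boltzmann probability
        if y_cand < y or rng.random() < math.exp(-(y_cand - y) / max(t, 1e-12)):
            x, y = cand, y_cand
            if y < best_y:
                best_x, best_y = list(x), y
        t *= cooling  # control parameter: geometric cooling schedule
    return best_x, best_y

xb, yb = simulated_annealing(lambda x: sum(v * v for v in x),
                             x0=[4.0, -4.0], lower=[-5, -5], upper=[5, 5])
```

The temperature schedule controls the shift from exploration (accepting worse candidates early) to exploitation (near-greedy acceptance late), a balance discussed further in Section 3.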

Modern Optimization Algorithms
Based on the fundamentals of heuristics and metaheuristics, we are able to identify similarities in the design of modern optimization algorithms, which target a large class of problems. Importantly, we have to consider the No Free Lunch Theorem (Wolpert and Macready, 1997), which states that there is no optimization algorithm that is superior to all others if their performance is averaged over all possible problems. Consequently, any algorithm needs to be adapted to the structure of the problem at hand to achieve optimal performance. This can be considered during the construction of an algorithm, before the optimization by parameter tuning, or during the run by parameter control (Bartz-Beielstein et al., 2005; Eiben, Hinterding, and Michalewicz, 1999). Törn and Zilinskas (1989) mention three principles for the construction of an optimization algorithm:
1. An algorithm utilizing all available a priori information will outperform a method using less information.
2. If no a priori information is available, the information is completely based on evaluated candidate points and their objective values.
3. Given a fixed number of evaluated points, optimization algorithms will only differ from each other in the distribution of candidate points.
As most modern algorithms focus on handling problems where little or no a priori information is given, the principles displayed above lead to the conclusion that the most crucial design aspect of any algorithm is to find a strategy to distribute the initial candidates in the search space and to generate new candidates based on variation of solutions. These two procedures define the search strategy, which needs to balance the two competing goals of exploration and exploitation. In general, the main goal of any method is to reach its target with high efficiency, i.e., to discover optima quickly and accurately with as few resources as possible. Moreover, the goal is not necessarily to find the global optimum, which is a demanding and expensive task for many problems, but to identify a valuable local optimum or to improve the currently available solution. We will explicitly discuss the design of modern optimization algorithms in Section 3.3.
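The two design procedures named above — distributing initial candidates and generating new ones by variation — can be sketched in a deliberately simple search loop (our own toy strategy, not an algorithm from the article): with some probability the loop samples uniformly over the whole box (exploration), otherwise it perturbs the best-known point (exploitation).

```python
import random

def explore_exploit_search(f, lower, upper, budget=100,
                           p_explore=0.3, step=0.2, seed=3):
    """Toy search strategy balancing the two competing goals:
    with probability p_explore, sample uniformly in the whole box
    (exploration); otherwise perturb the best-known point (exploitation)."""
    rng = random.Random(seed)
    best_x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    best_y = f(best_x)
    for _ in range(budget - 1):
        if rng.random() < p_explore:
            # exploration: global uniform sample
            cand = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        else:
            # exploitation: local Gaussian variation of the best candidate
            cand = [min(max(xi + rng.gauss(0.0, step * (hi - lo)), lo), hi)
                    for xi, lo, hi in zip(best_x, lower, upper)]
        y = f(cand)
        if y < best_y:
            best_x, best_y = cand, y
    return best_x, best_y

xb, yb = explore_exploit_search(lambda x: sum(v * v for v in x),
                                lower=[-5, -5], upper=[5, 5])
```

Under the third principle above, this strategy differs from others only in how it distributes its fixed budget of candidate points between the two sampling modes.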

Exact Algorithms
Exact algorithms, also referred to as complete algorithms (Neumaier, 2004), are a special class of deterministic, systematic, and non-heuristic optimization algorithms. If sufficient a priori information about the objective function is available, they guarantee finding the global optimum using a predictable amount of resources, such as function evaluations or computation time (Fomin and Kaski, 2013). If they are applicable to the problem, these algorithms are more reliable than heuristics, as they allow convergence proofs of finding the global optimum. Without available a priori information, the stopping criterion needs to be defined by a heuristic approach, which softens the guarantee of finding the optimum. Moreover, it is theoretically possible to apply these algorithms to the class of black-box problems, retaining the ability to find the global optimum with certainty after finite time. However, they will need exponential computation time due to an expensive, dense search. This renders them inapplicable to many resource-limited applications. The exact class, presented in Section 4, contains the related algorithms.
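The exact flavor can be illustrated with a space-covering grid search (our own sketch, not from the article): under the a priori assumption that f is Lipschitz with known constant L, evaluating a regular grid with spacing h guarantees the best sampled value is within L·h·√n/2 of the global minimum — but the evaluation count grows exponentially with the dimension n.

```python
import itertools
import math

def grid_search(f, lower, upper, points_per_dim):
    """Space-covering search: evaluate a dense regular grid over the box.
    If f is Lipschitz with constant L and grid spacing h, the best grid
    value is within L * h * sqrt(n) / 2 of the global minimum -- the
    'guaranteed accuracy' characteristic of exact methods."""
    axes = [[lo + i * (hi - lo) / (points_per_dim - 1)
             for i in range(points_per_dim)]
            for lo, hi in zip(lower, upper)]
    best_x, best_y = None, math.inf
    # cost grows exponentially with dimension: points_per_dim ** n evaluations
    for x in itertools.product(*axes):
        y = f(list(x))
        if y < best_y:
            best_x, best_y = list(x), y
    return best_x, best_y

x_star, y_star = grid_search(lambda x: sum(v * v for v in x),
                             lower=[-5, -5], upper=[5, 5], points_per_dim=11)
```

Here 11 points per dimension already cost 121 evaluations in 2-D; the same resolution in 10-D would require 11^10 evaluations, which is the exponential blow-up that renders exact methods inapplicable to expensive black-box problems.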

Surrogate-based Optimization Algorithms
Surrogate-based optimization algorithms are designed to process expensive and complex problems which arise from real-world applications and sophisticated computational models. Real-world problems are commonly black-box, which means that they only provide very sparse domain knowledge. Consequently, problem information needs to be exploited by experiments or function evaluations. Surrogate-based optimization was developed to make optimal use of the available information by utilizing a surrogate model. A surrogate model is an approximation which substitutes the original expensive objective function, real-world process, physical simulation, or computational process during the optimization. In general, surrogates are either simplified physical or numerical models based on knowledge about the physical system, or empirical functional models based on knowledge acquired from evaluations and sparse sampling of the parameter space (Søndergaard, Madsen, and Nielsen, 2003). In this work, we focus on the latter. The terms surrogate model, meta-model, response surface model, and posterior distribution are used synonymously in the common literature (Mockus, 1974; Jones, 2001; Bartz-Beielstein and Zaefferer, 2017). We will briefly refer to a surrogate model as surrogate. Furthermore, we assume that it is crucial to distinguish between the use of an explicit surrogate of the objective function and general model-based optimization (Zlochin, Birattari, Meuleau, and Dorigo, 2004), which additionally refers to methods where a statistical model is used to generate new candidate solutions (cf. Section 3.3). As these two definitions of model-based optimization are frequently used in an inconsistent manner, we will clearly distinguish between the two different terms surrogate-based and model-based to avoid confusion. Another term present in the literature is surrogate-assisted optimization, which mostly refers to the application of surrogates in population-based evolutionary computation (Jin, 2011), where
evolutionary optimization and surrogate-based optimization are applied in a hybrid approach (see Section 8). Important publications featuring overviews or surveys on surrogates and surrogate-based optimization were presented by Sacks, Welch, Mitchell, and Wynn (1989), Jones (2001), Queipo, Haftka, Shyy, Goel, Vaidyanathan, and Tucker (2005), and Forrester and Keane (2009). Surrogate-based optimization is commonly applied, but not limited, to the case of complex real-world optimization applications, where two typical problem layers and a surrogate layer can be defined. The defined layers can be transferred to different computational problems with expensive function evaluations, such as complex algorithms or machine learning tasks. Each layer can be the target of an optimization or used to retrieve information to guide the optimization process. Figure 1 illustrates the
different layers of objective functions and the surrogate-based optimization process for real-world problems. In this case, the objective function layers, from the bottom up, are:
L1: The real-world application f1(x), given by the physical process itself or a physical model. Direct optimization is often expensive or even impossible, due to evaluations involving resource-demanding prototype building or even hazardous experiments.
L2: The computational model f2(x), given by a simulation of the physical process or a complex computational model, e.g., a computational fluid dynamics model or a structural dynamics model. A single computation may take minutes, hours, or even weeks.

L3: The surrogate s(x), given by a data-driven regression model. The accuracy heavily depends on the underlying surrogate type and the amount of available information (i.e., the number of function evaluations). The optimization on this layer is, compared to the other layers, typically cheap. Surrogates are constructed either for the real-world application f1(x) or for the computational model f2(x).
Furthermore, the surrogate-based optimization cycle includes the optimization process itself, which is given by an adequate optimization algorithm for the selected objective function layer.
No surrogate-based optimization is performed if the optimization is directly applied to f1(x) or f2(x). The surrogate-based optimization uses f1(x) or f2(x) for verification of promising solution candidates. Moreover, the control parameters of the optimization algorithm, or even the complete optimization cycle including the surrogate modeling process, can be tuned (Bartz-Beielstein et al., 2005). Each layer imposes different evaluation costs and solution accuracies: the real-world problem is the most expensive to evaluate, while the surrogate is the cheapest. The main benefit of using surrogates is thus the reduction of the number of expensive function evaluations on the objective function f1(x) or f2(x) during the optimization. The studies by Loshchilov, Schoenauer, and Sebag (2012), Marsden, Wang, Dennis Jr, and Moin (2004), Ong, Nair, Keane, and Wong (2005), and Won and Ray (2004) feature benchmark comparisons of surrogate-based optimization. Nevertheless, the model construction and updating of the surrogates also requires computational resources, as well as evaluations for verification on the more expensive function layers. An advantage of surrogate-based optimization is the availability of the surrogate model itself, which can be utilized to gain further global insight into the problem; this is particularly valuable for black-box problems. For instance, the surrogate can be utilized to identify important decision variables or to visualize the nature of the problem, i.e., the fitness landscape.
A common optimization process using surrogates is outlined by the following steps:
1. Sampling the objective function at k initial positions. The sampling design plan is commonly selected according to the surrogate.
2. Selecting a suitable surrogate. The selection of the correct surrogate type can be a computationally demanding step in the optimization process, as often no prior information indicating the best type is available. Common types of surrogates will be presented in Section 7.
3. Constructing the surrogate s(x) using the observations.
4. Utilizing the surrogate s(x) to predict n new promising candidates x*_1, ..., x*_n, e.g., by optimization of the infill function with a suitable algorithm. For example, it is reasonable to use an exact algorithm, as the surrogate often provides the required global information or is very cheap to evaluate.

5. Evaluating the predicted candidates on the objective function to obtain new observations.
6. If the stopping criterion is not met: updating the surrogate with the new observations and repeating the optimization cycle (steps 4 to 6).
The accuracy of a surrogate strongly relies on the selection of the correct model type to approximate the objective function. Furthermore, it relies on the method of initialization and the initially available information, i.e., on sampling candidate solutions by means of appropriate designs. These initial sampling design plans have a major impact on the performance and should be carefully selected. Another important aspect is the selection of the next candidate. A number of the available surrogates provide information about the fitness distribution, with mean and variance, at the candidate location. This information can be used in the optimization process to apply sophisticated infill functions for predicting promising candidates. The most elementary infill criterion is the best predicted improvement on the surrogate. Sophisticated infill criteria include the well-known expected improvement or confidence bound criteria (Mockus, 1974; Jones, Schonlau, and Welch, 1998; Schonlau, 1997). Particularly expected improvement focuses on maintaining a balance between exploration and exploitation. By selecting a certain surrogate, the user makes certain assumptions regarding the characteristics of the objective function, i.e., modality, continuity, and smoothness (Forrester and Keane, 2009). Most surrogates are selected to provide continuous, low-modal, and smooth landscapes, which renders the optimization process computationally inexpensive and straightforward in comparison to the objective function, which often possesses unknown properties.
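The optimization cycle above can be sketched in one dimension (our own minimal illustration, under strong simplifying assumptions): a least-squares quadratic stands in for more capable surrogates such as Kriging or radial basis functions, and the infill step uses the most elementary criterion, the best predicted value on the surrogate, found by a dense grid search. The helper names `fit_quadratic` and `surrogate_cycle` are hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares quadratic surrogate s(x) = a*x^2 + b*x + c,
    a deliberately simple stand-in for Kriging or RBF models."""
    n = len(xs)
    S = [sum(x ** k for x in xs) for k in range(5)]                 # power sums
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[4], S[3], S[2]], [S[3], S[2], S[1]], [S[2], S[1], n]]
    rhs = [T[2], T[1], T[0]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    coef = []
    for i in range(3):                                              # Cramer's rule
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = rhs[r]
        coef.append(det3(Ai) / d)
    a, b, c = coef
    return lambda x: a * x * x + b * x + c

def surrogate_cycle(f, lower, upper, k=5, cycles=3):
    """Steps 1-6 from the text in one dimension: sample, fit the
    surrogate, optimize it (best-predicted infill via a dense grid),
    evaluate the candidate on f, update, and repeat."""
    xs = [lower + i * (upper - lower) / (k - 1) for i in range(k)]  # step 1
    ys = [f(x) for x in xs]                                         # initial design
    for _ in range(cycles):
        s = fit_quadratic(xs, ys)                                   # steps 2-3
        grid = [lower + i * (upper - lower) / 400 for i in range(401)]
        x_new = min(grid, key=s)                                    # step 4: infill
        xs.append(x_new)
        ys.append(f(x_new))                                         # step 5: verify on f
        # step 6: loop continues; surrogate is refit with the new observation
    best = min(range(len(ys)), key=ys.__getitem__)
    return xs[best], ys[best]

# Toy "expensive" objective; the quadratic surrogate captures it exactly.
x_best, y_best = surrogate_cycle(lambda x: (x - 1.3) ** 2 + 2.0, -5.0, 5.0)
```

Note how few evaluations of f the cycle consumes (here 5 initial samples plus 3 infill candidates); the many evaluations needed to optimize the infill criterion all land on the cheap surrogate, which is the main benefit discussed above.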

A new Intuitive Taxonomy
A taxonomy is defined as a consistent procedure or classification scheme for separating objects into classes or categories on the basis of certain features. The term taxonomy is particularly present in natural science for establishing hierarchical classifications. A taxonomy fulfills the task of distinction and ordering; it provides explanations and a greater understanding of the research area through the identification of coherences and differences between the classes. As a starting point, we will take a look at existing taxonomies, because many different surveys and handbooks for specialized optimization algorithms and associated techniques are available, such as metaheuristics (Boussaïd et al., 2013), direct search methods (Audet, 2014; Kolda, Lewis, and Torczon, 2003), or nature-inspired methods (Rozenberg, Bäck, and Kok, 2011). Moreover, we will investigate aspects of algorithm design and derive classification features as a basis for the different classes of algorithms. Further, we suggest common or manageable problem properties for each of the identified classes, e.g., the type of the underlying problem search space or the resource cost of function evaluations.

History of Taxonomies
This section illustrates a time line of the different CGO algorithm taxonomies found in the literature. A first overview on global optimization was presented by Leon (1966). Leon classified algorithms into three categories: 1. blind search, 2. local search, and 3. non-local search. In this context, blind search refers to strategies where the candidates are selected at random in the complete search space, but following a built-in sequential search strategy. During the local search, new candidates are selected only in the immediate neighborhood of the previous candidates. This leads to a trajectory of small steps. Finally, non-local search allows escaping from local optima and thus enables a global search. Archetti and Schoen (1984) proposed a taxonomy of global optimization which for the first time includes surrogate-based approaches. Their taxonomy reads as follows:
1. Deterministic:
1.1. Space covering methods
1.2. Trajectory and tunneling methods
2. Probabilistic:
2.1. Random sampling methods
2.2. Random search methods
2.3. Methods based on a stochastic model (i.e., a surrogate) of the objective function
They used the term deterministic for the class of algorithms which we classify as exact or complete, i.e., those that are guaranteed to find the global optimum within a defined budget. The deterministic class encompasses space covering and more complex trajectory methods, where the candidate follows a trajectory which passes through all local minima. Additionally, following the scheme, the term probabilistic defines those algorithms which allow non-exact solutions, i.e., heuristic or metaheuristic approaches. Random sampling methods are by these means described as algorithms which perform a set of local searches starting from different, uniformly sampled initial points. Random search methods iteratively alter a candidate solution utilizing random distributions. The paper stands out in establishing a taxonomy which for the first time includes the concept to construct a surrogate
model of the objective function. This is the groundbreaking idea of modern surrogate-based optimization algorithms. Surrogates are described as stochastic models based on evaluated points, with successive updates in order to select proper candidates. Moreover, Gaussian field models are already mentioned as possible models for multivariate surrogates. However, no concrete methods or frameworks applying these ideas are described. Törn and Zilinskas (1989) reviewed existing classification schemes and presented their own classification. They stated that the most important distinction has to be made between two non-overlapping main classes, namely those methods with guaranteed accuracy and those without. The main new feature of their taxonomy is the clear separation of the heuristic methods into those with direct and indirect objective function evaluation, resulting in the following scheme of three classes:
1. Methods with guaranteed accuracy
1.1. Covering methods
2. Direct methods
2.1. Random search methods
2.2. Clustering methods
2.3. Generalized descent methods
3. Indirect methods
3.1. Methods approximating the level sets
3.2. Methods approximating the objective function
Methods with guaranteed accuracy are similar to exact algorithms, as aforementioned in Neumaier (2004). Direct methods utilize local function evaluations and encompass algorithms with pure random search, single-start, and multi-start strategies. Furthermore, they include clustering methods, which try to identify regions of attraction of local minima to guide the search process and prevent reevaluation of known regions. Indirect methods use evaluations to build global models; similar to the taxonomy of Archetti and Schoen (1984), these methods cover the ideas of modern surrogate-based optimization. The use of Bayesian optimization (Mockus, 1974) was also discussed. Today's high availability of computational power did not exist at that time, so Törn and Zilinskas (1989) concluded the following regarding Bayesian models and their applicability for (surrogate-based) optimization: "Even if it is very attractive theoretically it is too complicated for algorithmic realization. Because of the fairly cumbersome computations involving operations with the inverse of the covariance matrix and complicated auxiliary optimization problems the resort has been to use simplified models." Moreover, Žilinskas (1992) added more insight in a specific review on statistical models for global optimization. In this work, he explicitly described common challenges in using stochastic models during an optimization process. Special remark was given to the problems arising from using multivariate and high-dimensional models, particularly the need for complex computations. Thus, he suggested using simplified, reduced models. Arora, Elwakeil, Chahande, and Hsieh (1995) presented a review of optimization algorithms with a focus on engineering applications. The paper's significant feature is the emphasis on the specific characteristics of real-world problem optimization, including different constraint handling
techniques. The features of different underlying problems are discussed, such as the availability of function information. Moreover, they outline certain algorithm features which are needed for algorithm selection on the basis of the desired optimization goal. Although the main classification is similar to the one by Archetti and Schoen (1984) and splits the algorithms into deterministic and stochastic classes, it adds further knowledge by discussing different heuristic and metaheuristic approaches, including modern metaheuristics such as simulated annealing (Kirkpatrick, Gelatt, Vecchi, et al., 1983), tabu search (Glover, 1989) and genetic algorithms (Eiben and Smith, 2015). The taxonomy characterized algorithms based on whether they

• are deterministic or stochastic,
• are able to solve continuous, discontinuous or combinatorial problems,
• can find all optima of a problem,
• have a local and/or global search phase,
• utilize local search, or
• need gradients.

Jones et al. (1998) were not the first to apply surrogate-based optimization, but they significantly influenced its popularity. The efficient global optimization algorithm is still used as an example of a surrogate-based strategy as it is common in modern frameworks. It already included infill functions and a sophisticated method for optimizing them instead of simple multi-start or grid approaches (see Section 7.2.1). In addition, the framework distinguishes itself from previous work by introducing an explicit phase of model validation utilizing statistical measures. Jones (2001) presents a taxonomy of surrogate-based optimization in which he segmented the different approaches. On the surrogate model level, the distinction was drawn between

• non-interpolating models, which minimize the sum of squared errors for a selected functional form, and
• interpolating models, where the function passes through all evaluated points.

Moreover, a difference is shown between

• two-stage approaches, where first a surrogate is fit to a set of preselected and evaluated candidates and then utilized for optimization, and
• one-stage approaches, where the candidate selection is made on the basis of a hypothesis and is part of the modeling process.

An extensive taxonomy focusing on exact (see Section 2.3) CGO methods was given by Neumaier (2004). The particularly interesting feature of this taxonomy is its distinction between two classes:

1. Incomplete and asymptotically complete methods, which can get stuck in local optima. They might be able to reach the global optimum in finite time, but are not able to ensure that they found the global optimum.
2. Complete and rigorous methods, which reach the global minimum with certainty after a finite runtime and are able to recognize that they found it, or are able to find the global minimum within given tolerances.

He also pointed out that the latter class is often referred to as deterministic, which can be misleading, as incomplete methods are also commonly deterministic. Zlochin et al.
(2004) presented a survey of model-based optimization. Although the survey focused on algorithms for solving combinatorial problems, the underlying classification is attractive, as it explicitly features modern model-based optimization as a separate class. It draws a clear distinction between:

• Instance-based algorithms, which generate new candidates only based on the current candidate solution or candidate population, and
• Model-based algorithms. This class encompasses algorithms which generate candidates by utilizing a parameterized probabilistic model that is updated using evaluated points, as done in distribution-based algorithms (see Section 6.3).

The authors pointed out that model-based algorithms describe an adaptive stochastic mechanism which generates candidates and does not approximate the objective function. Thus, they explicitly excluded surrogate-based optimization as previously introduced in this article. In contrast to the former classifications, Boussaïd et al. (2013) focused on a concrete design aspect of optimization algorithms and divided them into two classes:

1. Single-solution based metaheuristics, which start with a single initial solution and then make step-wise movements away from this solution, thus forming a trajectory. They also refer to this class as trajectory methods. According to the authors, these algorithms can be seen as intelligent extensions of local search algorithms.
2. Population-based methods, i.e., methods which use a set of concurrent candidate solutions instead of a single solution.

A summary of the algorithm classifications discussed so far is given in Figure 2. For an easier comparison, we divided the algorithms roughly into three distinct classes:

1. Exact
2. Heuristic and metaheuristic
3. Surrogate optimization

For the sake of completeness, we added a description of our new taxonomy, which we present later in this section. The presented new taxonomy is intended to be a successor of the taxonomy presented by Törn and Zilinskas (1989), which is the last in our overview to include all classes and provide a general classification.

Motivation for a new Taxonomy
Considering the history of CGO taxonomies (Section 3.1, cf. Figure 2), we can conclude that several new taxonomies for these optimization algorithms were developed during the last decades. However, new algorithms are proposed nearly every day, and in particular model-based and surrogate-based algorithms dominate the field of real-world applications and have become the state-of-the-art in modern algorithm design for this task. Existing taxonomies of CGO algorithms do not reflect this situation. Although there are surveys and books which handle the broad field of optimization and give general taxonomies, they are outdated and lack the integration of the new designs. Available up-to-date taxonomies often specialize on a subfield of algorithms and do not attempt to present a general overview that allows one to connect and compare the different optimization algorithms. This motivated the development of a new taxonomy that explicitly includes surrogate-based optimization and puts it in the larger context of CGO algorithms. From our point of view, it is important that our new taxonomy not only divides the algorithms into classes, but also provides an understanding of the working mechanisms of each algorithm. This grants insight into which kinds of problems they are suited for. It is thus crucial that the basic concept of the taxonomy is comprehensible and appears intuitive to a broad audience.
In his work about evolution strategies, Rechenberg (1994) illustrated a visual approach to an optimization process: a mountaineer in an alpine landscape, attempting to find and climb the largest mountain. We will further investigate the idea of optimization processes being human-like individuals trying to find their way to the most attractive location and thus define our main class names: the Wanderer, the Guide, the Cartographer, and two additional classes, the Hybrid and the Exact. This naming convention shall accomplish two goals: 1. Giving a simple idea of how the algorithms in the associated class perform their basic search.
2. Being obvious and straightforward and consequently, being simple to memorize.
To reach our first goal, we will identify the crucial elements of each optimization algorithm. These elements allow us to extract classification features, which can be used to establish a new, distinct taxonomy and further accomplish our second goal. To further support the comprehensibility of our taxonomy, we will draw an analogy between the algorithm class names and the behavior of a human-like individual in each of the descriptive class sections. The usage of analogies to the natural world is not a new idea. Instead, it is common in the area of metaheuristics, where the behavior of animals inspires the search procedure of the algorithms: evolutionary algorithms are based on the theory of evolution (Rechenberg, 1994; Eiben and Smith, 2015); particle swarm optimization (Kennedy and Eberhart, 1995; Shi and Eberhart, 1998) uses the movement of bird flocks as a role model; ant colony optimization (Dorigo, Birattari, and Stutzle, 2006) mimics, as the name suggests, the ingenious path finding and food search principles of ant populations. These examples indicate that analogies are useful for inspiring developers to create new search strategies. They are also helpful for explaining the behavior of these search algorithms, which makes them valuable for a comprehensible taxonomy.

The Five Elements of Algorithm Design
Any modern optimization algorithm, as defined in Section 2, can be described by a set of five key design elements. The initialization of the search defines starting locations or a scheme for the initial candidate solutions. Two common strategies exist:

1. If there is no a priori knowledge about the problem and its search space, the best option is to use strategically randomized starting points. Particularly interesting for surrogate-based optimization are systematic initialization schemes by methods from the field of design of experiments.
2. If domain knowledge or other a priori information is available, such as information from the process or data from previous optimization runs, the algorithm should be initialized utilizing this information to its full extent, e.g., by using a selection of these solutions, such as those with the best fitness. In surrogate-based optimization, available data can be used for the initial modeling.
The initial candidates have a large impact on the balance between exploration and exploitation. Space-filling designs with large numbers of random candidates or sophisticated design of experiments methods will lead to an initial exploration of the search space. Starting with a single candidate will presumably lead to an exploitation of the neighborhood of the selected candidate location. Hence, algorithms using the first scheme are in general more robust, while the latter are sensitive to the selection of the starting candidate, particularly in multi-modal landscapes. The robustness can be further increased by multi-start strategies, which are particularly common for single-candidate algorithms and also frequently recommended for population-based algorithms (Hansen, Auger, Ros, Finck, and Pošík, 2010a).
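The two initialization schemes above can be contrasted in a short sketch. The helper names are hypothetical, and the Latin-hypercube routine is a deliberately minimal variant of the design of experiments methods mentioned above:

```python
import random

def random_start(bounds):
    """A single, uniformly random starting candidate."""
    return [random.uniform(lo, hi) for lo, hi in bounds]

def latin_hypercube(bounds, n):
    """Minimal Latin-hypercube-style space-filling design: each
    dimension is split into n equal strata, and every stratum is
    sampled exactly once."""
    cols = []
    for lo, hi in bounds:
        step = (hi - lo) / n
        col = [lo + (i + random.random()) * step for i in range(n)]
        random.shuffle(col)  # decouple the strata across dimensions
        cols.append(col)
    return [[col[i] for col in cols] for i in range(n)]
```

A design of this kind covers the search space more evenly than n independent random points, which supports the initial exploration discussed above.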
The variation during the search process defines the methods for generating new candidates, with special regard to how available or obtained information about the objective function is used. A standard approach is the variation of existing observations, as it utilizes, and to a certain extent preserves, the information of previous iterations. Even the simplest wanderer class algorithms (Sec 5.1), which do not require any global information or stored knowledge of former iterations, utilize the last obtained solution to generate new candidate(s). Sophisticated algorithms generate new candidates on the basis of exploited and stored global knowledge about the objective function and fitness landscape. This can be conducted explicitly, by keeping an archive of all available or selected observations, or implicitly, by using distribution or data models of available observations. Another option for generating new candidates is combining the information of multiple candidates by dedicated functions or operators, which is particularly present in the guide class (Sec 6.1). The exact operators for the generation and variation of candidate solutions are manifold and a key aspect of keeping the balance between exploration and exploitation in a search strategy.
The evaluation defines how the fitness of the candidates is computed and which objective function is utilized. The evaluation is the key aspect of any algorithm, as it defines the basis for any information gain and has a huge influence on the search strategy. For black-box problems, the evaluation of candidates is the only option to exploit any problem information. We differentiate between a direct evaluation of the objective and an indirect evaluation using the predicted fitness provided by a surrogate. How the evaluation is performed depends mainly on the underlying problem and is largely influenced by the design of the objective function. Important aspects in real-world problems are noise, constraints and multiple objectives.
While most computer experiments can be seen as deterministic, i.e., iterations using the same value set for the associated decision variables should deliver the same results, real-world problems are often non-deterministic. They include non-observable disturbance variables and stochastic noise. Typical noise handling techniques include multiple evaluations of solutions to reduce the standard deviation, as well as special sampling techniques. The interested reader can find a survey on noise handling by Arnold and Beyer (2003). Moreover, many real-world problems include different constraints, which need to be considered during the optimization process. Constraint handling techniques can be a direct part of the optimization algorithm, but most algorithms are designed to minimize the objective function and constraint handling is added on top. Thus, it is often integrated by adjusting the fitness, e.g., by penalty terms. Different techniques for constraint handling are discussed by Coello (2002) and Arnold and Hansen (2012).
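The repeated-evaluation technique mentioned above can be sketched as follows; the noisy objective is a hypothetical stand-in for a stochastic real-world problem. Averaging n probes shrinks the standard error of the fitness estimate by a factor of sqrt(n), at n times the evaluation cost:

```python
import random

def averaged_fitness(f, x, n=10):
    """Reduce evaluation noise by averaging n repeated probes of the
    stochastic objective f at the same candidate x."""
    return sum(f(x) for _ in range(n)) / n

def noisy_sphere(x, sigma=0.5):
    """Hypothetical noisy objective: sphere function plus Gaussian noise."""
    return sum(v * v for v in x) + random.gauss(0.0, sigma)
```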
The evaluation of multiple objectives can include several correlated objective functions and usually delivers a set of non-dominated solutions, a so-called Pareto set (Naujoks, Beume, and Emmerich, 2005). In this case, a so-called decision maker is utilized to compute the fitness of a solution and to select solutions from the Pareto set (Fonseca, Fleming, et al., 1993).
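Extracting the non-dominated set from a list of evaluated solutions can be sketched as below, assuming all objectives are minimized; the function names are illustrative only:

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective,
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_indices(objectives):
    """Indices of the non-dominated (Pareto-optimal) solutions."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)]
```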
The selection defines the principle of choosing the solutions which will be used in the next iteration. We use the term selection, which has its origins in evolutionary computation. Besides the simplest strategy of choosing the solution(s) with the best fitness, advanced selection strategies have emerged, which are particularly present in metaheuristics (Boussaïd et al., 2013). These selection strategies are particularly common in algorithms with several candidates per variation step; thus, the most sophisticated selection methods were introduced in the scope of evolutionary computation (Eiben and Smith, 2015). A common strategy is based on relative fitness comparisons, so-called ranked selection. Detailed examples of selection strategies are given in Section 6.2.

Control parameters determine how the search can be adapted and improved by controlling the above mentioned key elements. We distinguish between internal and external parameters: External parameters, also known as offline parameters, can be adjusted by the user and need to be set a priori to the optimization run. Typical external parameters include the number of candidates and settings influencing the above mentioned key elements. Besides common theory-based defaults (Schwefel, 1993), they are usually set by utilizing available domain knowledge, extensive a priori benchmark experiments (Gämperle, Müller, and Koumoutsakos, 2002), or educated guessing. Sophisticated tuning methods were developed to identify good parameter settings in an automated fashion. Well-known examples are sequential parameter tuning (Bartz-Beielstein et al., 2005), iterated racing for automatic algorithm tuning (López-Ibáñez, Dubois-Lacoste, Cáceres, Birattari, and Stützle, 2016), Bonesa (Smit and Eiben, 2011) and SMAC (Hutter, Hoos, and Leyton-Brown, 2011).
In contrast, internal parameters are not meant to be changed by the user. They are either fixed to a certain value, which is usually based on physical constants or extensive testing by the authors of the algorithm, or they are self-adaptive. Self-adaptation, or online control, changes the parameters during the search process on the basis of the gathered knowledge or exploited problem information (Eiben et al., 1999), without user influence. Algorithms using self-adaptive schemes thus tend to gain outstanding generalization abilities and are especially interesting for black-box problems, where no information about the objective function properties is available (Hansen, Müller, and Koumoutsakos, 2003). In general, the settings of the algorithm control parameters directly affect the balance between exploration and exploitation during the search and are crucial for the performance.
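A classic instance of such online control is Rechenberg's 1/5th success rule for step-size adaptation in evolution strategies: if more than one fifth of recent mutations were successful, the step size is increased to explore faster, otherwise it is decreased. A minimal sketch, with 0.85 as a commonly used adaptation factor:

```python
def fifth_success_rule(sigma, success_rate, factor=0.85):
    """Rechenberg's 1/5th success rule: adapt the mutation step size
    sigma based on the fraction of recent successful mutations."""
    if success_rate > 0.2:
        return sigma / factor  # many successes: widen the search
    if success_rate < 0.2:
        return sigma * factor  # few successes: narrow the search
    return sigma
```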

Features of an Intuitive Taxonomy
Taxonomies are often based on the author's subjective experience and former definitions, as well as on more impartial features and similarities. Prior to establishing our new taxonomy, we defined a set of five essential classification features (CF), which are based on formerly available taxonomies as well as on significant features and similarities of CGO algorithms. They are intended to give a good understanding of how we separated our classes to create a distinct taxonomy, which still remains comprehensible and intuitive. Prior to each class description, we will outline its features in a text box as exemplified in Figure 3. We will refer to our classification features as CF-I to CF-V: CF I) Use of Information: The information feature has four possible categories: The first category is memoryless. The term describes algorithms which only use the available information of the prior iteration (or initialization).
The second is explicit memory. It defines those algorithms which use information of prior iterations in a direct fashion, e.g., by maintaining an archive of all observations. Third is implicit memory; these algorithms combine information of several iterations and solutions by operators, functions or models.
The fourth category comprises algorithms which require a priori information about the objective function, such as the value of the optimum.

CF II) Candidate Evaluation:
The candidate evaluation feature defines whether the objective function value is only directly calculated or also indirectly approximated. The approximation of the fitness during the candidate variation phase can greatly lower the necessary number of objective function evaluations, but no optimization process can be reliable and successful without the verification of these candidates with the objective function.

CF III) Type of Candidate:
This feature refers to the number and type of candidate solutions used in the variation and maintained in each iteration. It has three categories: The first is single and implies that the variation is based on a single candidate solution.
Moreover, these algorithms maintain only a single solution for their next iteration.
The second type is population, where in each iteration the variation is based on several candidate solutions and moreover, several solutions are stored for the next iteration.
The most sophisticated type is model-based: these algorithms utilize a candidate distribution model for the variation, which is stored and adapted in each iteration. The candidate evaluation is not affected and remains direct.

CF IV) Region of Search:
This feature describes the effective search region of an algorithm.
Local algorithms have no operators or functions for exploration and are thus not capable of escaping the so-called region of attraction of an optimum.
Global algorithms have the ability to find optima in multi-modal landscapes by introducing operators or functions to balance exploration and exploitation.

CF V) Problem Properties:
This feature is a collection of objective function properties. Algorithms which are efficient in solving a problem with the given property are assigned the related feature.
• Domain Knowledge: This feature describes problems with known function properties, such as a mathematical problem formulation or information about the number and objective function value of the optima. This knowledge can be exploited and used for an efficient or exact search process.
• Unimodal: This term describes objective functions with a single optimum in a linear or convex search space.
• Multimodal: Problems are called multimodal when they have several local and/or global optima.
• Black-Box: Problems are called black-box if they do not provide any domain knowledge and all information needs to be gathered by objective function evaluations. Many real-world problems can be associated with this characteristic.
• Discontinuities: Functions that have jumps in their objective function values are discontinuous and not differentiable.
• Noisy: Noisy objective functions do not return deterministic function values. Multiple evaluations of the same candidate on a noisy function can lead to different results.
• Expensive: Expensive problems have a high cost for each function evaluation in terms of either physical resources or computation time.

Classification Features:
I Use of Information: a priori, memoryless, explicit/implicit memory
II Candidate Evaluation: direct, indirect
III Type of Candidate: single, population, model
IV Search Space: local, global
V Problem Properties: unimodal, multimodal, black-box, ...

For an overview of all classes and the connected classification features CF I-IV, we utilize a decision tree, illustrated in Figure 4. In this figure, we utilized the features I to IV as classification nodes, which conclude in our new taxonomy with the introduced classes exact, wanderer, guide and cartographer. Moreover, the associated problem properties are displayed. The figure is intended to provide a fast overview of how the new taxonomy works. New algorithms can easily be included by assigning the above listed features and then using the illustrated decision tree. We know that not all algorithms will fit into our taxonomy, as they may have property combinations which are not displayed in our scheme. They may also belong to the hybrid class, which represents combinations of methods from the displayed classes. The hybrid class is not shown in the figure, as it has no distinct properties.

The general description of exact algorithms was presented in Section 2.3. For an efficient search, exact algorithms need a priori information about the objective function. Therefore, they are only suitable for a limited class of problems where this information is available. Moreover, the application of exact algorithms is always a tradeoff decision between computation time and precision. For example: Given a nondeterministic polynomial (NP)-hard problem, usually no exact algorithm exists that is able to find the best solution in polynomial time. The traveling salesman problem is a common combinatorial NP-hard problem, where the goal is to find the minimal-length tour through a fixed number of cities. An exact algorithm could solve this problem by calculating every possible tour and selecting the best. Despite providing the best solution, this
strategy would use a lot of computation time. Particularly if the number of cities gets large, this problem cannot be solved exactly in reasonable time. It should be noted that even if exact algorithms are not efficient for NP-hard and black-box problems, they are commonly part of surrogate-based optimization frameworks, because the models provide the required information for an exact search or are computationally very cheap to evaluate (see Section 2.4 and Figure 1). For example, efficient global optimization (Section 7.2.1) and Bayesian optimization (Section 7.2.2) can use exact algorithms. Two common methods from the family of exact methods are grid search and branch-and-bound. Grid search combines a multi-start local optimization with an increasingly finer sampling grid of starting points. Branch-and-bound optimization is conducted by splitting the original problem recursively into subproblems with the goal of excluding or solving them, until it is guaranteed that no subproblem can lead to a better solution. It is known as branch-and-bound (Lawler and Wood, 1966) because lower bounds on the objective function are computed. More complex exact search methods combine branching with local optimization, Lipschitzian optimization, convexity or interval analysis (Neumaier, 2004; Floudas, 2013; Hansen and Walster, 2003; Horst and Tuy, 2013).
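The grid-search idea can be illustrated with a minimal sketch that only performs the refining evaluations; a full method would additionally start a local optimizer from every grid node. All parameter choices here are illustrative:

```python
import itertools

def grid_search(f, bounds, levels=3, base=3):
    """Evaluate f on increasingly fine grids and keep the best point."""
    best_x, best_y = None, float("inf")
    for level in range(1, levels + 1):
        n = base ** level  # refine the grid at each level
        axes = [[lo + i * (hi - lo) / (n - 1) for i in range(n)]
                for lo, hi in bounds]
        for x in itertools.product(*axes):
            y = f(list(x))
            if y < best_y:
                best_x, best_y = list(x), y
    return best_x, best_y
```

Note that the number of evaluations grows exponentially with the problem dimension, which reflects the tradeoff between computation time and precision mentioned above.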
Example 4.1 (Dividing Rectangles). An example of an exact algorithm is dividing rectangles (DIRECT), initially proposed by Jones, Perttunen, and Stuckman (1993) as a modification of Lipschitzian optimization. While assuming that the objective function is Lipschitz continuous, the algorithm does not need a specification of the Lipschitz constant, as it is estimated during the optimization run. The algorithm uses hypercubes to divide the search space. The center c_i of each hypercube is sampled and given a fitness, based on the objective function value and the size of the associated hypercube. Based on this fitness, the hypercube which is most likely to contain the optimum is selected and further divided into smaller hypercubes. Then their fitness is sampled and the process repeats, until a stopping criterion is met or the algorithm has converged.

The Wanderer Class
The wanderer class encompasses algorithms with non-complex search strategies which generate and maintain a single candidate per iteration. New candidates are generated in the vicinity of the current solution by a stochastic process which is independent of previous search steps. The consecutive candidates describe a trajectory in the search space that, in the ideal case, forms a direct line to the optimum. In their search, these algorithms only use the local objective function information about the prior solution. Thus, they do not use global information about the problem in the variation or selection steps.
Analogy 1 (The Wanderer). The intuitive description of a wanderer is a single individual who wanders through the landscape to find the most attractive place in a given area. During its search, it only utilizes local information about its current position to find the best direction. If the goal of this individual is to find the highest mountain, it will likely follow the ascending way, because it directly satisfies the current objective. It does not memorize gathered information, so there is a chance that it will circle around a position, revisit a place or get completely lost.
We distinguish between the local and the global subclass of the wanderer: While the local wanderer keeps a greedy selection, the global wanderer also allows the acceptance of non-improving candidates. The local wanderer subclass consists of basic local optimization algorithms, which include classical gradient-based algorithms as well as deterministic or stochastic hill-climbing algorithms. These algorithms are designed for fast convergence to a local optimum situated in a region of attraction A ⊆ S and have no explicit strategy for exploration. Gradient-based methods, such as quasi-Newton methods (Shanno, 1970), directly compute or approximate the gradients of the objective function to find the steepest direction towards the optimum. Direct-search methods perform an iterative and gradient-free search by using a minimal amount of information about the objective function. Overviews of direct search methods were presented by Lewis, Torczon, and Trosset (2000) and Kolda et al. (2003). Moreover, the (1+1) evolution strategy with a basic selection operator (further explained in Section 6.2) can be associated with this class.

Local Wanderer
Example 5.1 (Iterated Stochastic Hill-Climber). The iterated stochastic hill-climber (Michalewicz and Fogel, 2013) is a typical example of the local wanderer subclass. It has an elementary algorithm design, where the search variation is stochastic and the selection typically greedy. In each iteration of the algorithm, a new candidate x_t is created by sampling from a probability distribution D around the prior solution x_{t−1}. The variance of the distribution, which is often uniform or Gaussian, defines the so-called step size of the variation and is the most important control parameter of this algorithm. The greedy selection works as follows: If the new candidate x_t has a better fitness value than the prior solution x_{t−1}, it is accepted as the new solution and the iteration is repeated. If the fitness is not improved, the prior solution is kept for the next iteration. This greedy selection scheme is based on a comparative, ranking-based selection of the candidates with no influence of the absolute difference in the objective function values. This implies invariance to linear transformations and scaling of the objective function. As the basic scheme includes no operator for escaping from a local optimum or for exploration, this algorithm will most likely converge to a local optimum. The termination condition is usually set to a number of non-improving iterations, after which it is assumed that an optimum has been found. The generalized design of the iterated hill-climber is outlined in Algorithm 5.1. As the name implies, the local wanderer is primarily suitable for unimodal functions or for exploiting local optima. It can be applied to global optimization in multimodal landscapes if an adequate multi-start strategy is used. These multi-start strategies typically demand a high number of function evaluations and are only reasonable for problems with relatively cheap objective functions, such as in surrogate-based optimization. Theoretically, a local wanderer could establish a
global search by sampling the candidates from a distribution with a dispersion over the complete search space, which is equivalent to a very large step size for the variation. This strategy would be inefficient, as it leads to a random sampling of candidates without considering any local information and direction of search. Thus, it is common to limit the dispersion of the probability distribution to the vicinity of the current solution, forming a neighborhood which is very small compared to the complete search space. This leads to the outlined hill-climbing search strategy, which performs a trajectory of small, fitness-improving steps. The maximal step size is consequently an important control parameter, which is designed to be adaptive in many versions of the algorithm (cf. mutation operator in Section 6.2). In general, local wanderers are often part of sophisticated algorithms as a fast-converging local search strategy.
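The hill-climbing scheme described above can be sketched in a few lines; the Gaussian variation, the fixed step size, and the patience-based stopping rule are illustrative choices, not part of any specific published variant:

```python
import random

def hill_climb(f, x0, step=0.1, patience=200):
    """Iterated stochastic hill-climber: Gaussian variation around the
    current solution, greedy selection, and termination after
    `patience` consecutive non-improving iterations."""
    x, fx = list(x0), f(x0)
    stale = 0
    while stale < patience:
        cand = [v + random.gauss(0.0, step) for v in x]  # variation
        fc = f(cand)
        if fc < fx:  # greedy, rank-based selection
            x, fx, stale = cand, fc, 0
        else:
            stale += 1
    return x, fx
```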

Global Wanderer
The global wanderer subclass encompasses algorithms which implement operators to balance exploration and exploitation in the selection process of candidates. They differ from local wanderers by their explorative search strategies, which further enable global optimization. Similar to the local version, the global wanderer utilizes a stochastic variation of candidates in the neighborhood of the current solution, without considering stored or modeled global information. Exploration is achieved by introducing operators or functions which allow expanding the search and escaping the region of attraction of a local optimum. The well-known simulated annealing (SANN) will be used to exemplify this approach.
Example 5.2 (Simulated Annealing). Kirkpatrick et al. (1983) introduced SANN as a search procedure for global combinatorial optimization. It is known to be a significant contribution to the field of metaheuristic search algorithms. The basic search procedure of every SANN method is inspired by annealing in metallurgy, where a material is first heated and then cooled. Following this analogy, the most important element to control the search is an adaptive parameter: the temperature T. Analogous to the thermodynamic energy in a heated material, it increases the possible movement of the candidates. The continuous version (Goffe, Ferrier, and Rogers, 1994; Siarry, Berthiau, Durdin, and Haussy, 1997; Van Groenigen and Stein, 1998) of the SANN algorithm basically extends the iterated stochastic hill-climber. It includes a new element to allow a global search, the so-called acceptance function P(x_{t−1}, x_t, T). The acceptance function is used during the selection and determines the probability of accepting an inferior candidate as the new solution, utilizing T as a parameter. This dynamic selection allows escaping local optima by accepting movement in the opposite direction of improvement, which is the fundamental difference to a hill-climber. A common example of an acceptance function is the so-called Metropolis function: The Metropolis function always accepts fitness-improving steps towards the minimum (f(x_t) − f(x_{t−1}) ≤ 0) and, moreover, has a probability of accepting ascending (f(x_t) − f(x_{t−1}) > 0) steps based on T. Higher T values thus increase the probability of accepting an inferior candidate. In contrast to rank-based selection schemes, this function is scale-based and utilizes absolute differences in the fitness, which renders it sensitive to linear transformations of the objective function, e.g., a multiplication by a scalar. At the end of each iteration, a so-called cooling operator C adapts T.
This operator can be used to balance exploration and exploitation (Henderson, Jacobson, and Johnson, 2003). A common approach is to start with a high temperature and steadily reduce T according to the number of iterations, or to utilize an exponential decay of T. This steady reduction of T leads to a phase of global exploration in the early iterations, while with decreasing T the probability of accepting inferior candidates is reduced. As T approaches zero, the behavior of the algorithm becomes more and more similar to that of an iterative hill climber. A general SANN version is displayed in Algorithm 5.2. Modern SANN implementations integrate self-adaptive cooling schemes which use alternating phases of cooling and reheating (Locatelli, 2002). These allow a more sophisticated control of exploration and exploitation. The basic version of the algorithm does not use any archive, so that evaluated solutions (including exploited optima) are not stored. As the algorithm is able to escape the region of attraction of a local optimum by accepting inferior solutions, it can converge to a local optimum with an inferior fitness compared to the best solution found during the search. Thus, later versions commonly store at least the best found solution.
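The interplay of Metropolis acceptance and exponential cooling described above can be illustrated with a minimal sketch. All function names, step sizes, and parameter defaults here are illustrative choices, not a reference implementation:

```python
import math
import random

def sann(f, x0, step=0.5, t0=10.0, alpha=0.95, iters=2000, rng=None):
    """Minimal continuous simulated annealing (illustrative parameters)."""
    rng = rng or random.Random(0)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx            # archive only the best found solution
    t = t0
    for _ in range(iters):
        # stochastic variation in the neighborhood of the current solution
        cand = [xi + rng.gauss(0.0, step) for xi in x]
        fc = f(cand)
        # Metropolis acceptance: always accept improvements, accept
        # ascending steps with probability exp(-(f(cand) - f(x)) / T)
        if fc - fx <= 0 or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
        if fx < best_f:
            best_x, best_f = x, fx
        t *= alpha                    # cooling operator: exponential decay of T
    return best_x, best_f

xbest, fbest = sann(lambda v: sum(vi ** 2 for vi in v), [3.0, -4.0])
```

Because the basic algorithm keeps no archive, the sketch explicitly tracks the best found solution, as the later SANN versions mentioned above do.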
Algorithm 5.2: Simulated Annealing

Global wanderers are suitable for searches in unimodal and multimodal problems, particularly if multi-start strategies are used. As they do not rely on stored information from former iterations during their search, they are also a good choice to handle dynamic objective functions (Carson and Maria, 1997; Corana, Marchesi, Martini, and Ridella, 1987; Faber, Jockenhövel, and Tsatsaronis, 2005). However, the rather simplistic utilization of exploited global information renders them inefficient for challenging and expensive optimization problems. Moreover, the control parameters have a significant effect on the performance of these algorithms and should be tuned in an offline or online fashion.

The Guide Class
The guide class encompasses algorithms which utilize information from consecutive function evaluations. Guiding is here the process of finding a direction of improvement. In comparison to the wanderer, the guide stores information and utilizes it during the search. Moreover, it can share information and group with other guides to organize a structured crowd search.
The intuitive idea of this class is a single traveller or a group of travellers looking for an interesting place.
They try to memorize their own route, follow travel signs about interesting or forbidden paths, and ask their fellow travellers to share their knowledge and give directions. Furthermore, they may be able to consolidate all gathered information and utilize it to their benefit.
We divide this class into three sub-classes, which follow different schemes to process information and use it during the search: (i) The single-solution guide (Section 6.1) utilizes an intelligent landscape structuring and partitioning process using the information of prior search iterations.
(ii) The population-based guide (Section 6.2) utilizes multiple candidates in each iteration and has special operators to combine information and to select the most successful candidates.
(iii) The model-based guide (Section 6.3) utilizes mathematical or statistical models to store and process the exploited information of several iterations and/or candidate evaluations.

Single-Solution Guide
Single-solution Guide:
I Use of Information: implicit memory
II Candidate Evaluation: direct
III Type of Candidate: single
IV Search Space: global
V Problem Properties: domain knowledge, multimodal

Single-solution guides are the connecting link between the wanderer class and population-based guides. While these algorithms also allow the sampling of several solutions in one iteration, they are still based on the principle of maintaining a single solution, which performs a trajectory in the search space over consecutive iterations. These methods are also known as trajectory methods (Boussaïd et al., 2013) (cf. wanderer class 5.1). In contrast to the wanderer class, they explicitly use operators to store and utilize information of former iterations and to guide the search process in certain directions. They are a step towards population-based algorithms, but miss the idea of explicitly using population-based operators in the variation and selection steps. This class encompasses search space partitioning algorithms which use the exposed knowledge of former iterations. They create sub-spaces which are forbidden and not considered in the current search iteration, or attractive sub-spaces on which the search is focused. These search space partitions are used to find a balance between exploration and exploitation, as new candidates are placed in promising or previously unexplored parts of the search space. For example, variable neighborhood search (VNS) (Hansen and Mladenovic, 2003; Hansen, Mladenović, and Pérez, 2010b; Mladenović, Dražić, Kovačevic-Vujčić, and Čangalović, 2008) utilizes so-called neighborhood structures, which are pre-defined local sub-spaces. The search strategy of VNS is to perform sequential local searches in these sub-spaces to exploit their local optima. The idea behind this search strategy is that by using an adequate set of sub-spaces, the chance of exploiting a local optimum which is near the global optimum increases. Another well-known algorithm and outstanding
paradigm for this class is Tabu Search:

Example 6.1 (Tabu Search). Tabu search (TS) (Glover, 1989) was introduced as an optimization method for combinatorial search spaces, where the number of possible solutions is limited. The last successful candidates are put into a so-called tabu list T, which defines a sub-space of all evaluated solutions. All members of the tabu list are forbidden as candidates in the current search. This process ensures moving away from available solutions and prevents cycling over identical candidates, including local optima. The size of the tabu list is limited by a control parameter. If the limit is reached, the oldest solution on the list is deleted and again accepted as a potential candidate. A continuous version of the TS algorithm was presented by Siarry and Berthiau (1997). In continuous search spaces, the number of candidates in the neighborhood of a solution is limited only by the underlying numerical accuracy, which leads to a nearly infinite number of candidates. The continuous TS algorithm thus needs to create limited sub-spaces. One approach for implementing these sub-spaces are ball-shaped regions, first presented by Hu (1992). A ball-shaped region (or hypersphere) B(x_t, h) is defined in the neighborhood of a solution x_t. It is limited by a radius h, resulting in a new sub-space X_B = {x ∈ S | ||x_t − x|| ≤ h}. If a set of k + 1 of these concentric regions with different radii {h_0, …, h_k} is generated, they form so-called crowns C_0, …, C_k, defined by the innermost ball C_0 = B(x_t, h_0) and

C_i = {x ∈ S | h_i−1 < ||x_t − x|| ≤ h_i}, i = 1, …, k.

Similar to the last solution in the combinatorial TS, the region C_0 around the prior solution x_t−1 is added to the tabu list T. New candidates are sampled uniformly at random in each of the remaining crowns C_1, …, C_k. Siarry and Berthiau (1997) also suggest a more directed search variant by selecting candidates in the direction of the gradient. From the set of new candidates, the one with the best fitness is selected for the next iteration based on a rank-based comparison. Exploration and exploitation
can be controlled by the definition of the different radii and the corresponding partitions. For example, linear partitioning gives the outer crowns a larger volume and can shift the focus to a more localized search, as the density of sampled candidates decreases towards the outside. The continuous TS is displayed in Algorithm 6.1. An enhanced version of the continuous TS was presented by Chelouah and Siarry (2000). Instead of hyperspheres, they use easier to handle hyperrectangles for the partition of the search space. Moreover, they define distinct alternating search phases of exploration and exploitation.
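The crown-based sampling described above can be sketched as follows. The radii, the single candidate per crown, and the uniform radius sampling are simplifications chosen for illustration, not the published procedure:

```python
import math
import random

def sample_in_crown(center, h_lo, h_hi, rng):
    """Draw one point with h_lo < ||p - center|| <= h_hi.
    The radius is drawn uniformly, a simplification of volume-uniform
    sampling that suffices for this sketch."""
    d = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(di * di for di in d))
    r = rng.uniform(h_lo, h_hi)
    return [ci + r * di / norm for ci, di in zip(center, d)]

def continuous_tabu_step(f, x, radii, rng):
    """One iteration: the innermost ball C0 around x is tabu; one
    candidate is sampled in each remaining crown and the best one is
    always accepted (rank-based comparison)."""
    cands = [sample_in_crown(x, radii[i - 1], radii[i], rng)
             for i in range(1, len(radii))]
    return min(cands, key=f)

rng = random.Random(1)
f = lambda v: sum(vi ** 2 for vi in v)
x = best = [2.0, 2.0]
for _ in range(100):
    x = continuous_tabu_step(f, x, [0.05, 0.2, 0.5, 1.0], rng)
    if f(x) < f(best):
        best = x
```

Note that the step is always accepted even when it worsens the fitness, which is exactly the mechanism that moves the search away from already exploited regions.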
The main concept of the single-solution guide is to use the information of evaluated solutions and to direct the search to formerly unknown regions to avoid early convergence to a non-global optimum. The strategic use of sub-spaces allows a systematic control of exploration and exploitation and particularly ensures a high level of exploration. They include a large number of parameters, such as the number or size of sub-spaces, which makes them very vulnerable to unsuitable parameter settings.

Population-based Guide

Population-based guides utilize multiple candidates in each iteration and can align the search direction of all candidates while keeping up an individual search for an optimum. This strategy results in a balanced search with explorative elements and local convergence. We regard EAs as the state of the art in population-based optimization, as their search concepts are dominating for this field. Nearly all other population-based algorithms use similar concepts and are frequently associated with EAs. Fleming and Purshouse (2002) go as far as to state: In general, any iterative, population-based approach that uses selection and random variation to generate new solutions can be regarded as an EA.
We will thus focus on outlining the general concepts of EAs with specific instances of the utilized search strategies, methods and operators.
Example 6.2 (Evolutionary Algorithms). Evolutionary algorithms are based on the idea of evolution, reproduction, and the natural selection concept of survival of the fittest. In general, the field of EAs goes back to four distinct developments: evolution strategies (ES) (Rechenberg, 1973; Schwefel, 1977), evolutionary programming (Fogel, Owens, and Walsh, 1966), genetic algorithms (Holland, 1992), and genetic programming (Koza, 1992). The naming of the methods and operators matches their counterparts from biology: candidates are individuals which can be selected to take the role of parents, mate, and recombine to give birth to offspring. The genetic information of the parents is passed on to the offspring and can be further changed by random mutation. The population of individuals is evolved (varied, evaluated, and selected) over several iterations, so-called generations, to improve the solutions.
In general, a large number of different implementations and configurations of the mentioned operators to manipulate or alter the genetic information exist (Eiben and Smith, 2015). A basic EA is outlined in Algorithm 6.2, illustrating the four common base operators of an EA. Several examples for EAs can be found in the literature, such as the well-known µ + λ evolution strategies (Schwefel, 1993), which we will later use to exemplify some selection operators. Parent selection is the first of two selection operators in an EA. It is the process of choosing solutions from the population with size µ ∈ N, which will be utilized in the recombination and mutation steps to create new candidates, the so-called offspring population with size λ ∈ N.
The most common and simplest strategy is linear rank-based selection: Based on fitness comparisons, only the highest ranked candidates are selected to reproduce, leading to a high selection pressure. The selection pressure defines how likely an inferior candidate can be successful in the selection. The usage of probabilistic selection strategies, which have a more dynamic chance of selection with less selection pressure, is more common. For instance, in roulette wheel selection, the chance of being selected is proportional to the ranking, while all chances sum up to one. Each parent is then chosen by a spin of the roulette wheel, where the highest ranked individual has the highest chance of being selected. Tournament selection randomly draws small subsets of the population, commonly of size two and up to µ − 1, for a number of tournaments. Within these tournaments, the best solution(s) are selected on the basis of direct fitness comparisons. Another, rather unusual, scheme is fitness proportional selection: The probability to select a solution depends on its objective function value compared to the mean objective value of the population. The disadvantage of this scheme is the usage of absolute fitness, which can result in a high selection pressure and the best solution dominating the population, ultimately leading to premature convergence. This is also true for other scale-based selection strategies. A complete random selection can also be used, e.g., by uniform selection, where each solution has the same chance of being selected, regardless of its fitness or objective function value. Recombination is an outstanding concept of EAs and a variation operator. New candidates are generated by combining the information of two (or more) evaluated solutions, based on the idea of the selected, commonly fittest, parents passing on their genetic information to their children. A typical recombination operator is the crossover, where parts of each parent are swapped, i.e., the decision variable settings,
also known as discrete recombination. For example, in a one-point crossover between two discrete, string-based parents aa|aaa and bb|bbb, two children aa|bbb and bb|aaa would be created, where "|" marks the crossover point. The crossover point is typically chosen at random. For real-valued variables in CGO, a basic recombination is the arithmetic or intermediate recombination, where the values are not swapped, but a value between the selected decision variables of the parents x_a and x_b is chosen by x_offspring = α × x_a + (1 − α) × x_b. The α value in the range [0, 1] is either chosen at random or is a constant control parameter, the crossover strength. With α = 0.5, uniform arithmetic recombination is applied. This basic recombination therefore creates candidates which are situated between both parents. Mutation is applied to the λ offspring candidates as another variation operator to add new genetic information and allow a higher diversification of the offspring population. Mutation is the process of changing a single candidate, commonly performed by adding a random-valued vector sampled from a parametrized uniform or Gaussian distribution. Hereby, the state of the art is to use an adaptive mutation strength (variation step size). It is then often set on the basis of previous successful steps, for example as defined in the famous 1/5 rule (Rechenberg, 1973). Moreover, strategies exist which implement the mutation strength as part of the genetic information of an individual, which also undergoes variation and selection and is thus self-adaptive. Different concepts for self-adaptive mutation strategies can be found in the publication by Back (1996). Survivor selection is the last step of each iteration and selects the candidates for the next iteration. While the survivor selection can generally be performed by the same selection operators as the parent selection, some special concepts were introduced and are commonly used. Two well-known concepts arise from the Evolution Strategies
(Schwefel, 1993), known as (µ + λ) and (µ, λ) selection. The (µ + λ)-strategy selects survivors from the merged set of µ parents and λ offspring candidates according to their ranked fitness, whereby the top µ solutions are kept for the next generation. In the (µ, λ)-strategy, all µ parents are discarded and the survivors are selected only from the λ offspring according to their ranked fitness. This requires creating λ ≥ µ candidates in the recombination step to prevent extinction of the population. Further, in age-based selection strategies, each solution only survives a defined number of iterations before it gets discarded. This is often combined with fitness-based selection, as in the (µ, λ)-strategy, where each solution survives only one iteration. Aging adds a handicap to old dominating solutions which survived several generations. This constitutes an explorative strategy to escape local optima and increase the diversity of the population.
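The operators discussed above (parent selection, intermediate recombination, Gaussian mutation, and (µ + λ) survivor selection) can be combined into a minimal sketch. Population sizes, the fixed mutation strength, and the initial sampling bounds are illustrative assumptions, not a tuned ES:

```python
import random

def es_plus(f, dim, mu=5, lam=20, sigma=0.3, gens=100, rng=None):
    """Minimal (mu + lambda) evolution strategy with uniform parent
    selection, intermediate recombination, and a fixed Gaussian
    mutation strength (a sketch, not a tuned ES)."""
    rng = rng or random.Random(0)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(mu)]
    for _ in range(gens):
        offspring = []
        for _ in range(lam):
            pa, pb = rng.sample(pop, 2)                          # parent selection
            child = [0.5 * a + 0.5 * b for a, b in zip(pa, pb)]  # recombination
            child = [c + rng.gauss(0.0, sigma) for c in child]   # mutation
            offspring.append(child)
        # (mu + lambda) survivor selection: rank parents and offspring together
        pop = sorted(pop + offspring, key=f)[:mu]
    return pop[0]

best = es_plus(lambda v: sum(vi ** 2 for vi in v), dim=3)
```

A (µ, λ) variant would replace the last line of the loop with `pop = sorted(offspring, key=f)[:mu]`, discarding all parents as described above.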
EAs are very flexible in their implementation and adaptable by tuning. They are robust and suitable to solve a large class of problems, including multimodal, multi-objective, dynamic, and black-box problems, even with noise or discontinuities in the fitness landscape (Jin and Branke, 2005; Marler and Arora, 2004). Further, they have successfully been applied to a large number of different industrial problems (Fleming and Purshouse, 2002), but typically require a relatively large number of function evaluations to converge. This makes them not the first choice for expensive problems where the number of evaluations is strongly limited. The flexibility and robustness of EAs is caused by different mechanisms and strategies for controlling the balance between exploration and exploitation. A good overview is presented in the survey by Črepinšek, Liu, and Mernik (2013). The detailed survey classifies the different available evolutionary approaches and presents an intensive discussion of which mechanisms influence exploration and exploitation. Theoretical aspects of evolutionary operators are discussed by Beyer (2013). The performance of EAs is influenced by parameter tuning and control, e.g., the setting of population size, mutation strength, and selection probability. An extensive overview of the different on- and offline tuning approaches for parameter control in EAs was published by Eiben et al.
(1999). Further common strategies for controlling exploration and exploitation and for multimodal optimization are so-called niching strategies, which utilize sub-populations to maintain the diversity of the population, investigate several regions of the search space in parallel, or conduct defined tasks of exploring and exploiting (Shir and Bäck, 2005; Filipiak and Lipinski, 2014).

Model-based Guide

The model-based guide class encompasses algorithms which explicitly utilize mathematical or statistical models. The distinction from surrogate-based optimization rests on the assumption that these models are not direct approximations of the underlying objective function and do not aim to model the complete fitness landscape. Instead, they are specialized local models which are used in the variation step. The models are not explicitly searched for optima, although the mean of the distribution is often utilized as the predicted optimal solution.
A common class of model-based algorithms are estimation of distribution algorithms (EDA) (Larrañaga and Lozano, 2001). They generally belong to the large field of EAs (Section 6.2). The main difference to EAs is that the variation operators, such as recombination or mutation, are not directly applied to the candidates, but to distribution models. The distribution models are built using information of prior evaluated populations. Different distribution models can be utilized, e.g., Bayesian networks or multivariate Gaussian distributions. Further common examples of model-based algorithms are the covariance matrix adaptation evolution strategy (CMA-ES) (Hansen et al., 2003) and ant colony optimization for continuous domains ACO R (Dorigo et al., 2006; Socha and Dorigo, 2008). ACO R is based on the social behavior of ant colonies and their communication via pheromones. Ants search for food around their nest in a random manner and leave trails of pheromones on their way to mark paths and enable other ants to follow these trails. The idea of these pheromone trails is transferred to special distribution models and variation operators. As the general idea of model-based algorithms is similar, we will outline their mechanisms on the example of a generalized EDA and give further insight into the particularities of ACO R and the CMA-ES.
Example 6.3 (Estimation of Distribution Algorithms). The general idea behind the distribution model is that it is beneficial to learn the structural information of the underlying population (Larrañaga and Lozano, 2001; Hauschild and Pelikan, 2011). The structural information allows acquiring knowledge about the dependencies among the variables. Moreover, this information is used to generate new candidates and thus to guide the search for the optimum. A general EDA is outlined in Algorithm 6.3. Instead of a direct recombination or mutation, the parents are used to construct a distribution model. For example, typical parameters of a multivariate Gaussian distribution, such as mean, variance, and covariance of the selected parent population, are computed. In ACO R, Gaussian kernel functions for each dimension of the decision variable space are constructed. The distribution parameters are then the target of the variation operators. This variation is the crucial aspect of the search strategy, similar to the recombination and mutation in the basic EA. For example, the CMA-ES adapts the parameters during each iteration following the history of prior successful iterations, the so-called evolution paths. These evolution paths are basically exponentially smoothed sums for each distribution parameter over the consecutive prior iterative steps (cf. adaptive mutation strength in Section 6.2). They thus utilize the information of several successful search steps, which is intended to quickly approach an optimum. In the next step, λ offspring candidates are generated by sampling the adapted distribution. Afterwards, the candidates are evaluated and survivors for the next generation are selected, again based on survivor selection operators, as explained in Section 6.2. The CMA-ES uses only rank-based selection schemes, which makes it insensitive to scaling of the objective function.
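A minimal univariate Gaussian EDA illustrates the model-building and sampling cycle described above. The truncation selection, the per-dimension (independence-assuming) model, and all parameter values are simplifications for this sketch:

```python
import random
import statistics

def gaussian_eda(f, dim, pop_size=50, parents=15, gens=60, rng=None):
    """Minimal univariate Gaussian EDA: instead of recombining or
    mutating candidates, a per-dimension normal distribution is
    estimated from the selected parents and sampled to create the
    next population (independence between dimensions is assumed)."""
    rng = rng or random.Random(0)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=f)
        sel = pop[:parents]                     # truncation selection
        # variation acts on the distribution model, not on candidates
        mu = [statistics.mean(x[d] for x in sel) for d in range(dim)]
        sd = [statistics.stdev([x[d] for x in sel]) + 1e-12
              for d in range(dim)]
        pop = [[rng.gauss(mu[d], sd[d]) for d in range(dim)]
               for _ in range(pop_size)]
    return min(pop, key=f)

best = gaussian_eda(lambda v: sum((vi - 1.0) ** 2 for vi in v), dim=2)
```

A full EDA with a multivariate model would additionally estimate the covariance between dimensions, which is exactly the structural information the CMA-ES adapts through its evolution paths.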
In general, model-based guides try to combine the benefits of statistical models and their capability of storing and processing information with population-based search operators. They are high-level metaheuristics and advanced EAs which are intended to be flexible, robust, and applicable to a large class of problems, particularly those with unknown function properties. This makes them very successful in popular black-box benchmarks (Hansen et al., 2010a). For example, the design of the CMA-ES seeks to make the algorithm performance robust and not dependent on the objective function or tuning. The various control parameters of the algorithm were pre-defined on the basis of theoretical aspects and practical benchmarks.

The Cartographer Class

Cartographer algorithms differ from all other defined classes in their focus on acquiring, gathering, and utilizing global information about the fitness landscape. They utilize prior evaluated candidates and model the acquired information to predict the fitness of new candidates. These models are then used for an efficient indirect search in which typically a single new candidate is proposed in each iteration, instead of performing multiple, direct, and localized search steps.
Analogy 3 (The Cartographer).The intuitive idea of the cartographer is a specialist who systematically measures a landscape by taking samples of the height to create a topological map.This map resembles the real landscape with a given approximation accuracy and is typically exact at the sampled locations (if the measurements are without variance) and models the remaining landscape by regression.It can then be examined and utilized by any other individual, such as a wanderer or guide, to find a desired location.One could think of a guide using a paper map or navigation system to find the place of interest.
As illustrated in Section 2.4, the surrogates s(x) depict the maps of the fitness landscape of an objective function (f1(x) or f2(x)) in an algorithmic framework. In this section, we will first give a brief introduction to common surrogate models and then outline typical cartographer frameworks and their search strategies.

Surrogate Models
The surrogate is the core element of any surrogate-based optimization and essential for its performance. A perfect surrogate provides an excellent fit to prior observations, whilst ideally possessing superior interpolation and extrapolation abilities. However, the large number of available surrogate models have significantly differing characteristics, advantages, and disadvantages. Model selection is thus a complicated and difficult task. If no domain knowledge is available, such as in real black-box optimization, it is often inevitable to test different surrogates for their applicability. Surrogates are built on the basis of prior observations, which provide information about the fitness landscape of the problem. Thus, the initial candidates are commonly selected following different information criteria and a suitable experimental design. For example, linear regression models can be built with factorial designs, while Gaussian process models are best coupled with space-filling designs, such as Latin hypercube sampling (Montgomery, Montgomery, and Montgomery, 1984; Sacks et al., 1989). Common models are: linear, quadratic, or polynomial regression, Gaussian processes (also known as Kriging) (Sacks et al., 1989; Forrester, Sobester, and Keane, 2008), regression trees (Breiman, Friedman, Stone, and Olshen, 1984), artificial neural networks and radial basis function networks (Haykin, 2004; Hornik, Stinchcombe, and White, 1989), including deep learning networks (Collobert and Weston, 2008; Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, et al., 2012; Hinton, Osindero, and Teh, 2006), and symbolic regression models (Augusto and Barbosa, 2000; Flasch, Mersmann, and Bartz-Beielstein, 2010; McKay, Willis, and Barton, 1995), which are usually optimized by genetic programming (Koza, 1992). Further, much effort in current studies goes into researching the benefits of model ensembles, which combine several distinct models (Goel, Haftka, Shyy, and Queipo, 2007; Müller and
Shoemaker, 2014; Friese, Bartz-Beielstein, and Emmerich, 2016). The goal is to create a sophisticated predictor that surpasses the performance of a single model. A well-known example are random forests (Breiman, 2001), which use bagging to fit a large number of decision trees. We regard ensemble modeling as the state of the art of current research, as ensembles are able to combine the advantages of different models to generate outstanding results in both classification and regression. The drawback of these ensemble methodologies is that they are computationally expensive and pose a demanding problem in regard to efficient model selection, evaluation, and combination.
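The bagging idea behind such ensembles can be sketched with simple regression stumps as base models. The stump construction and all names are illustrative, not the random forest procedure itself:

```python
import random
import statistics

def fit_stump(xs, ys, rng):
    """Fit a one-split regression stump on 1-D data (illustrative base model)."""
    split = rng.choice(xs)
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    lmean = statistics.mean(left) if left else statistics.mean(ys)
    rmean = statistics.mean(right) if right else statistics.mean(ys)
    return lambda x: lmean if x <= split else rmean

def bagged_ensemble(xs, ys, n_models=50, rng=None):
    """Bagging: every base model is fit on a bootstrap resample of the
    data; the ensemble prediction averages all base predictions."""
    rng = rng or random.Random(0)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]   # bootstrap sample
        models.append(fit_stump([xs[i] for i in idx],
                                [ys[i] for i in idx], rng))
    return lambda x: sum(m(x) for m in models) / len(models)

xs = [i / 10.0 for i in range(21)]   # toy data on [0, 2]
ys = [x * x for x in xs]             # noiseless y = x^2
predict = bagged_ensemble(xs, ys)
```

Averaging many weak, high-variance base models is what smooths the ensemble prediction; random forests additionally randomize the features considered at each split.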

Cartographer Algorithms
Cartographer algorithms are surrogate-based optimization methodologies which explicitly use a surrogate in their optimization cycle, following the general principle outlined in Section 2.4. They are either fixed algorithms designed around a certain model, such as Kriging in the well-known efficient global optimization (EGO) (Jones et al., 1998), or frameworks with a choice of possible surrogates and optimization methods. We present two common frameworks and discuss their particularities: general Bayesian optimization (Mockus, 1974) and sequential parameter optimization (Bartz-Beielstein et al., 2005; Bartz-Beielstein, 2010). Forrester and Keane (2009) and Bartz-Beielstein and Zaefferer (2017) give overviews of surrogate-based optimization, different surrogate models, and infill criteria. Moreover, they match surrogates to problem classes and give hints about their individual applicability. In general, the selection of an adequate model, experimental design, and optimizer requires both domain knowledge and expertise. We will focus on the above-mentioned frameworks as they deliver a good, yet not complete, view of the surrogate-based search strategy.

Efficient Global Optimization
EGO (Jones et al., 1998) was motivated by the urge to develop a methodology to optimize expensive black-box functions. It utilizes Kriging surrogates and motivates the use of expected improvement as infill criterion. In general, the algorithm consists of two phases: first, the initialization by Latin hypercube sampling and the construction of a Kriging surrogate; second, the iterative improvement of the best solution utilizing the surrogate. Before starting the second phase, Jones et al. (1998) suggest validating the fit of the Kriging surrogate, e.g., by means of cross-validation diagnostics. The second phase starts the iterative optimization process, as described in Section 2.4. During the variation, a new candidate is searched by optimizing the expected improvement infill criterion on the Kriging surrogate. For this optimization, the exact branch and bound method is used (cf. exact class, Section 4). Expected improvement is motivated as infill criterion because it provides a balance of exploration and exploitation by utilizing both the predicted best mean value of the model as well as the model uncertainty. An example for the complete methodology is outlined in Algorithm 7.1. The search strategy of EGO is a fundamental example for most surrogate-based optimization methods which are applicable to expensive optimization problems. However, we suggest using advanced frameworks based on this base version of EGO. These frameworks are more flexible and applicable to a larger class of problems.
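The expected improvement criterion has a closed form under a Gaussian predictive distribution. A sketch for minimization, assuming the surrogate supplies a predictive mean `mu` and standard deviation `sd` at the candidate:

```python
import math

def expected_improvement(mu, sd, f_best):
    """Closed-form expected improvement for minimization, given the
    surrogate's predictive mean `mu` and standard deviation `sd` at a
    candidate, and the best observed value `f_best`."""
    if sd <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # first term rewards a good predicted mean (exploitation),
    # second term rewards high model uncertainty (exploration)
    return (f_best - mu) * cdf + sd * pdf
```

The two additive terms make the exploration/exploitation balance explicit: a candidate can score well either through a low predicted mean or through high model uncertainty.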

Bayesian Optimization
The term Bayesian optimization (BO) was introduced by Mockus (1974, 1994, 2012) and describes not a single algorithm, but a scheme for algorithms, which we regard as surrogate-based, particularly those based on Gaussian processes. While the general BO scheme thus remains similar to the algorithm outlined in Section 2.4, BO differs in the underlying terminology: In BO, the user selects an initial, so-called prior distribution, which should reflect the a priori beliefs about the underlying unknown objective function. Gaussian distributions are suggested and a common choice. Algorithm 7.2 displays a general BO algorithm.
Algorithm 7.2: Bayesian Optimization

This prior distribution is updated by sampled observations to acquire the posterior distribution. The optimization cycle includes the optimization of the acquisition (or infill) function to maximize utility or minimize risk (fitness). Typical choices include the probability of improvement (Kushner, 1964), expected improvement (Jones et al., 1998), and confidence bounds (Cox and John, 1997). Algorithms such as EGO can be seen as applied variants of BO. BO is widely applicable to different applications, including expensive optimization problems (Lizotte, 2008; Khan, Goldberg, and Pelikan, 2002) and machine learning (Snoek, Larochelle, and Adams, 2012; Swersky, Snoek, and Adams, 2013). Brochu, Cora, and De Freitas (2010) give a tutorial on BO with different application examples.
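The other acquisition functions listed above can be sketched in the same style; the trade-off parameter `beta` in the confidence-bound variant is an assumed, user-chosen value:

```python
import math

def probability_of_improvement(mu, sd, f_best):
    """P(f < f_best) under a Gaussian predictive distribution
    (the criterion of Kushner, 1964), for minimization."""
    if sd <= 0.0:
        return 1.0 if mu < f_best else 0.0
    z = (f_best - mu) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lower_confidence_bound(mu, sd, beta=2.0):
    """Confidence-bound acquisition for minimization: smaller values are
    more attractive; `beta` (an assumed trade-off parameter) weighs the
    model uncertainty against the predicted mean."""
    return mu - beta * sd
```

Probability of improvement ignores the magnitude of the improvement, which tends to make it more exploitative than expected improvement; the confidence bound makes the exploration weight an explicit parameter.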

Sequential Parameter Optimization Toolbox
The sequential parameter optimization toolbox (SPOT), developed by Bartz-Beielstein (2010), is a dynamic surrogate-based optimization framework which was initially intended for offline tuning of algorithm control parameters. Various methods for initial sampling designs, different models, and optimization techniques are included. SPOT is strongly influenced by statistical methods from design of experiments, where it is attempted to prove a certain statistical hypothesis on the basis of testing. Hereby, the available budget for experiments (i.e., function evaluations) is used sequentially to improve a solution and update the surrogate. This is done until sufficient knowledge about the search space is available to accept or reject the initially stated hypothesis.
The overall design was dedicated to algorithm tuning and follows two goals: one is improving the efficiency of an algorithm, i.e., discovering the algorithm parameters to solve a defined problem instance as fast as possible. The other is improving the robustness of an identified parameter setup, i.e., for solving different problem instances which, for example, differ in their region of interest or search space dimensionality. To tune stochastic algorithms, SPOT integrates noise handling techniques by dynamic re-sampling of solutions. This design can be transferred to general surrogate-based optimization, as the methods tackle the present balancing problem of exploration (robustness) and exploitation (efficiency).
The general framework of SPOT is similar to the general surrogate-based optimization algorithms EGO and BO, which are divided into initialization and iterative optimization phases. SPOT also explicitly includes a prior parametrization phase, where the user has to choose the surrogate and the initial sampling design. SPOT defines a flexible framework and is thus applicable to a large range of problems, such as the mentioned algorithm tuning and industrial optimization.

The Hybrid Class
Hybrid Class: The hybrid class comprises combinations of algorithms from the previously introduced classes. Hybrid algorithms were developed as a strategy to overcome individual algorithm weaknesses. The combined algorithms are often given distinctive roles of exploration and exploitation, typically pairing an explorative global search method with a local search algorithm. For example, population-based algorithms with remarkable exploration abilities can be paired with fast-converging local algorithms. This approach has several benefits, as the combined algorithms can be adapted or tuned to fulfill their distinct tasks. Moreover, the concepts are easily adapted to parallel frameworks. One of the most successful types of hybrids are the surrogate-assisted evolutionary algorithms (Emmerich, Giannakoglou, and Naujoks, 2006; Lim, Jin, Ong, and Sendhoff, 2010). An overview of surrogate-assisted optimization is given by Jin (2011). Additional examples of hybrids can be found in the literature, covering all possible class combinations: Memetic algorithms, as defined by Moscato et al.
(1989), are a class of search methods which combine population-based guides with a local wanderer. An extensive overview of memetic algorithms is given by Molina, Lozano, García-Martínez, and Herrera (2010). They describe how different hybrid algorithms can be constructed by looking at suitable local search algorithms, with special regard to their convergence abilities. Bartz-Beielstein, Preuss, and Rudolph (2006) describe a hybrid approach combining an evolution strategy (ES) for exploration with a quasi-Newton method for exploitation. The algorithm runs the ES and the local search consecutively, and the budget of evaluations for each method is set by a control parameter. The authors performed experiments in which they varied this budget parameter to test whether the hybrid approach can be superior to running both methods individually. They came to the conclusion that hybridization can be beneficial for difficult objective functions, as the ES provides information about interesting regions to which the local search is then applied. The surrogate management framework (Booker et al., 1999; Serafini, 1999) utilizes a combination of a global surrogate-based algorithm with an exact local pattern search. Moreover, it transforms the fitness space from continuous to combinatorial by introducing a finite mesh of possible solutions. The key concept is to lower the optimization costs by reducing the number of real function evaluations using a surrogate, while retaining the benefits of a combinatorial search space and pattern search, i.e., the robust convergence behavior. The algorithm alternates between distinct global and local search phases during the sequential optimization. In the global search step, the selected infill criterion (e.g., expected improvement) is optimized in continuous space and the nearest mesh point is selected. In the local poll step, a set of candidates situated on the mesh around the current best solution is evaluated directly on the objective. Taddy, Lee, Gray, and Griffin (2009) combine surrogate-based optimization based on treed Gaussian processes (TGP) with exact asynchronous parallel pattern search (APPS) in a parallel search framework. The algorithm starts with a space-filling initial sampling using a Latin hypercube design, then runs TGP and APPS in parallel. In this case, TGP is used to predict a ranked list of a fixed number of new candidates, while APPS performs local optimization runs.
The budget for evaluation and computation is split between these two components, and all observations are stored in a shared archive. Hybrid algorithms are applicable to a large class of problems, defined by the classes their component algorithms originate from. Their downside is their large complexity and the risk that this higher complexity does not lead to improved performance, due to the difficult balancing and required tuning of the distinct algorithms. Their complex search strategies with a large number of control parameters can make them difficult to tune. The algorithm itself becomes a black box, as the underlying search strategy and the convergence behavior are influenced by numerous operators and thus difficult to comprehend.
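The budget-split idea behind such explore/exploit hybrids can be sketched in a few lines; the random-sampling `explore` phase stands in for a population-based guide and `refine` for a (1+1)-style local wanderer, with all names and parameter values being illustrative rather than taken from the cited works.

```python
import math
import random

def f(x):
    # toy multimodal objective (minimization), global optimum at x = 0
    return x * x + 2.0 * abs(math.sin(3.0 * x))

def explore(budget, lo=-10.0, hi=10.0):
    # crude global exploration: best of a random sample (stand-in for a guide/EA)
    return min((random.uniform(lo, hi) for _ in range(budget)), key=f)

def refine(x, budget, step=0.5):
    # (1+1)-style local wanderer: accept improvements, decay step size on failure
    for _ in range(budget):
        cand = x + random.gauss(0.0, step)
        if f(cand) < f(x):
            x = cand
        else:
            step *= 0.9
    return x

random.seed(3)
total_budget = 200
split = 0.5                       # control parameter dividing the evaluation budget
x0 = explore(int(total_budget * split))
x_best = refine(x0, int(total_budget * (1.0 - split)))
```

Tuning `split` reproduces, in miniature, the budget experiments of Bartz-Beielstein, Preuss, and Rudolph (2006): more exploration helps on rugged landscapes, more refinement on smooth ones.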

Concluding Remarks
In this work, we presented an overview of continuous global optimization algorithms with a focus on explaining their search strategies using a new, intuitive taxonomy. We defined a set of five classes: exact, wanderer, guide, cartographer, and hybrid in Sections 4 to 8, outlined their individual properties, and presented example algorithms for each of the proposed classes.
To briefly recapitulate: the exact class utilizes (a priori) problem information to solve a problem with a guarantee of finding the optimum. The heuristic search strategies of the wanderer class are suitable for fast convergence in a unimodal search space and are often part of other algorithms. The well-known metaheuristics of the guide class were developed for general applicability, particularly for multimodal problems with unknown properties. The cartographer class focuses on surrogate-based algorithms and related frameworks for problems with expensive function evaluations. Last but not least, we took a look at the hybrid class, whose algorithms try to combine the strengths of different algorithms to overcome their individual weaknesses.
In general, it is beneficial for users to assess whether an optimization algorithm is suitable for their problem before applying it. To support users in selecting a suitable algorithm, we pointed out the pros and cons of the different search strategies, the individual algorithm features, and the typical characteristics of CGO problems they are able to handle. At this point, we also want to highlight a promising new research area, namely automated algorithm selection and, in particular, automated algorithm configuration. Both ideas tackle the problem of selecting the correct search strategy for a given problem. Automated algorithm selection tries to find the most suitable algorithm for a certain problem based on machine learning and exploited problem information, such as exploratory landscape analysis. This method of algorithm selection has been shown to outperform any single algorithm on a set of benchmark functions (Kerschke and Trautmann, 2017). An even more promising result was presented by van Rijn, Wang, van Stein, and Bäck (2017), where algorithm configuration was used to select algorithmic components for creating a search strategy that outperforms available algorithms. The particularly interesting idea here is that the search operators of algorithms are identified, extracted, and then recombined into a new search strategy. This procedure also shows the strong connections between different named algorithms, particularly in the area of bio-inspired metaheuristics.
Interesting challenges for future algorithm design arise from problems in engineering applications, where the available data is restricted to certain conditions, such as streaming and online data and dynamic problems. The need for new optimization approaches emerges from the rapid development of communicating sensors and machines in the field of engineering, also known as the internet of things (Atzori, Iera, and Morabito, 2010). Suitable optimization algorithms need to be directly included in the production cycle, adapting to generate robust solutions in challenging dynamic environments with moving optima. Dynamic, surrogate-based online learning, where a complex static surrogate is constructed and combined with time-varying modeling, is still an open issue (Jin and Branke, 2005). Nowadays, cloud computing and high-performance computing clusters are available to a wide range of users, yet many optimization algorithms are not fitted to the needs of parallel computation and need to be adapted (Rehbach, Zaefferer, Stork, and Bartz-Beielstein, 2018). The large and successful field of deep learning networks (LeCun, Bengio, and Hinton, 2015; Schmidhuber, 2015) opens a completely new field from which very complex and difficult optimization problems arise. The extension of surrogate-based optimization to these fields, e.g., parallel frameworks and deep learning, is an interesting research topic. Further, we identify a lack of realistic benchmarks based on real-world data sets, which would allow a realistic comparison of different algorithmic approaches.

Figure 1: A surrogate-based optimization process of a real-world process with the different objective function layers and outlined inputs and outputs; the complexity of the objective functions is indicated by the decreasing size of their boxes. The full grey arrows illustrate the approximation and verification paths; the yellow dashed and red dotted arrows indicate the surrogate-based optimization and the optional direct optimization. The blue dashed arrows show the optional parameter tuning of the optimization algorithm or the surrogate modeling process.

Figure 3: Fundamental features for algorithm classification

Figure 4: Algorithmic classification including the classes exact, wanderer, guide, and cartographer. The defined classification features I to IV are used as nodes of a decision tree, with the subclasses and related example algorithms as the final leaves. Below each subclass, the corresponding main classes and the associated objective function characteristics are illustrated. The objective function characteristics are displayed bottom-up, i.e., they add up and become more sophisticated from left to right.

I Use of Information: implicit memory by the model/distribution
II Candidate Evaluation: direct
III Type of Candidate: model/distribution
IV Search Space: global
V Problem Properties: multimodal, black-box, noisy, discontinuities, multi-objective

(Bartz-Beielstein and Zaefferer, 2017) name the search strategy elements Initialization, Variation, Evaluation, and Selection. All these key elements are controlled by a fifth element: the control parameters for the different functions and operators in each element. Algorithm 3.1 displays the key elements and the abstracted fundamental structure of optimization algorithms (Bartz-Beielstein and Zaefferer, 2017). This structure and its elements can be mapped to any modern optimization algorithm, even if the search strategy is inherently different or the elements do not follow the illustrated order or appear multiple times per iteration.

Algorithm 3.1: Fundamental structure of optimization algorithms

    initialize candidate(s)
    evaluate initial candidate(s)
    while not termination-condition do
        t = t + 1
        variate solutions to get new candidate(s)
        evaluate new candidate(s)
        select solution(s) for next iteration
        optional: update control parameters
    end
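The abstract structure above can be written down directly as a generic loop with pluggable operators; the following sketch is a free interpretation (the `optimize` signature and the (1+1)-style instantiation are illustrative, not part of the cited framework).

```python
import random

def optimize(init, variate, evaluate, select, steps, update=None):
    # generic loop mirroring the key elements: Initialization, Variation,
    # Evaluation, Selection, plus optional control-parameter updates
    pop = [(c, evaluate(c)) for c in init()]
    for t in range(steps):
        offspring = [(c, evaluate(c)) for c in variate(pop)]
        pop = select(pop, offspring)
        if update is not None:
            update(t)
    return min(pop, key=lambda pair: pair[1])

# instantiated as a (1+1)-style random local search on a 1-D sphere function
random.seed(0)
best_x, best_y = optimize(
    init=lambda: [random.uniform(-5.0, 5.0)],
    variate=lambda pop: [pop[0][0] + random.gauss(0.0, 0.5)],
    evaluate=lambda x: x * x,
    select=lambda pop, off: [min(pop + off, key=lambda pair: pair[1])],
    steps=100,
)
```

Swapping the four operators turns the same skeleton into an evolution strategy, simulated annealing, or another search strategy, which is precisely the mapping claim made for Algorithm 3.1.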

Algorithm (excerpt): stochastic hill climber / local search

    while not termination-condition do
        t = t + 1
        sample new candidate x_t from probability distribution D around the current solution x_{t-1}
        evaluate candidate x_t
        if new solution improves the fitness, y_t < y_{t-1}, then
            accept new solution
        else
            keep old solution x_t = x_{t-1}
            adjust variation step size/variance of probability distribution (optional)
        end
    end

Algorithm (excerpt): simulated annealing

    initialize x_t with (random) candidate x ∈ S
    evaluate x_t
    while not termination-condition do
        t = t + 1
        sample new candidate x_t from probability distribution D
        evaluate candidate x_t
        if new solution satisfies acceptance function P(x_{t-1}, x_t, T_t) then
            accept new solution
        else
            keep old solution x_t = x_{t-1}
        end
        vary T_t using cooling scheme C to get T_{t+1}
    end

Algorithm (excerpt): evolution strategy

    initialize random population P_t = {x_{t,1:µ}} ⊆ S of µ candidates
    evaluate population P_t
    while not termination-condition do
        select parents P*_t ⊆ P_t from population
        recombine selected parents P*_t to create λ offspring candidates O_t = {x_{t,i}, µ < i ≤ µ + λ}
        ...
    end

The algorithm starts with a population of µ solutions, either initialized at random or based on exposed problem knowledge. Similar to an EA, parents are selected, typically based on ranking selection. It is common to use the fittest solutions as parents (cf. selection in EAs, Sec. 6.2).
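The simulated-annealing scheme can be realized in a few lines; the Metropolis criterion and geometric cooling below are common concrete choices for the acceptance function P and the cooling scheme C, and the toy objective and parameter values are illustrative.

```python
import math
import random

def simulated_annealing(f, x0, steps=2000, T0=1.0, cooling=0.995, sigma=0.3):
    x, y = x0, f(x0)
    best_x, best_y = x, y
    T = T0
    for _ in range(steps):
        cand = x + random.gauss(0.0, sigma)       # sample candidate from D
        y_cand = f(cand)
        # Metropolis acceptance function P(x_{t-1}, x_t, T_t)
        if y_cand < y or random.random() < math.exp(-(y_cand - y) / T):
            x, y = cand, y_cand                   # accept new solution
        if y < best_y:
            best_x, best_y = x, y                 # track best-so-far
        T *= cooling                              # geometric cooling scheme C
    return best_x, best_y

random.seed(42)
x_best, y_best = simulated_annealing(
    lambda x: x * x + 2.0 * abs(math.sin(3.0 * x)), x0=4.0)
```

At high temperature T the acceptance function frequently admits worsening moves (exploration); as T decays, the behavior approaches the pure hill climber above (exploitation).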

7 The Cartographer Class
It is suggested to analyze the surrogate model fit. If the fit is not satisfactory, one can try to improve it by tuning the model parameters or transforming the data.
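The following is a small illustration of such a fit check, assuming a leave-one-out error estimate and a hypothetical one-dimensional linear surrogate; here, a log transformation of a strongly curved response makes the relationship exactly linear and thus drastically improves the fit.

```python
import math
import random

def loo_rmse(xs, ys, fit, predict):
    # leave-one-out root-mean-square error of a surrogate fitting procedure
    errs = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((predict(model, xs[i]) - ys[i]) ** 2)
    return math.sqrt(sum(errs) / len(errs))

def fit_linear(xs, ys):
    # least-squares line fit, used as a stand-in surrogate model
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return (my - b * mx, b)

def predict_linear(model, x):
    a, b = model
    return a + b * x

random.seed(1)
xs = [random.uniform(1.0, 5.0) for _ in range(12)]
ys = [math.exp(x) for x in xs]                 # strongly curved response
raw = loo_rmse(xs, ys, fit_linear, predict_linear)
logged = loo_rmse(xs, [math.log(y) for y in ys], fit_linear, predict_linear)
```

Comparing `raw` and `logged` makes the decision operational: if the transformed data yields a much lower cross-validation error, the surrogate should be built on the transformed response.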
Algorithm (excerpt): Efficient Global Optimization

    // phase 1: initial design and surrogate
    initialize k candidate solutions {x_{1:k}} by Latin hypercube sampling
    evaluate them on the objective function, y_i = f_1(x_i) or y_i = f_2(x_i), 1 ≤ i ≤ k
    build initial Kriging surrogate s_t with initial observations D_t = {(x_i, y_i), 1 ≤ i ≤ k}
    analyze and improve model fit (optional)
    // phase 2: use and update surrogate
    while not termination-condition do
        if t > 1 then
            update the Kriging surrogate s_t with the set of all observations D_t
        end
        calculate expected improvement (EI) infill criterion on surrogate s_t
        optimize EI for its maximum by branch and bound; use the optimum as candidate x_t
        evaluate x_t on the objective function, y_t = f_1(x_t) or y_t = f_2(x_t)
        add the new solution to the set of all observations D_{t+1} = {D_t, (x_t, y_t)}
    end

Classification of the cartographer class:
I Use of Information: implicit/explicit memory by surrogate or candidate archive
II Candidate Evaluation: direct/indirect
III Type of Candidate: single, population, distribution model
IV Search Space: global
V Problem Properties: multimodal, black-box, noisy, discontinuities, multi-objective, expensive

..., including several examples for real-world applications. An example is outlined in Algorithm 8.1 (Surrogate-Assisted Evolutionary Algorithm). In this hybrid search strategy, a local surrogate is built upon the current parent population and utilized to predict the fitness of a number of λ offspring candidates. The selection is then based on the predicted fitness of the surrogate. Optionally, a local optimizer can be used to further refine the computed solutions. Extensive use of local search leads to fast convergence to local optima. This hybrid strategy can be altered by using the surrogate only for a part of the generated offspring, while the other part is evaluated with the real fitness function.

Algorithm 8.1 (excerpt): Surrogate-Assisted Evolutionary Algorithm

    run evolutionary algorithm to find λ offspring O_t = {x_i, µ < i ≤ µ + λ}
    build surrogate s_t(x) with current observations D_t = {(x_i, y_i), 1 ≤ i ≤ k}
    predict fitness of offspring O_t using surrogate s_t(x)
    ...
    run local optimizer from x ∈ O*_t as starting solution to get refined solutions O**_t (optional)
    select survivors from P_t ∪ O*_t ∪ O**_t for next generation P_{t+1}
    t = t + 1
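A surrogate-assisted EA of this pre-selection type can be sketched as follows; a cheap k-nearest-neighbour surrogate replaces the Kriging or regression models of the cited works, and all names, parameters, and the toy `true_f` are illustrative assumptions.

```python
import random

def true_f(x):
    # expensive toy objective (sphere, minimized)
    return sum(v * v for v in x)

def surrogate(archive, x, k=3):
    # cheap k-nearest-neighbour surrogate built on all past real evaluations
    dist = lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x))
    near = sorted(archive, key=dist)[:k]
    return sum(y for _, y in near) / len(near)

random.seed(7)
dim, mu, lam = 2, 5, 20
pop = [(x, true_f(x)) for x in
       ([random.uniform(-3.0, 3.0) for _ in range(dim)] for _ in range(mu))]
archive = list(pop)                               # shared archive of observations
for gen in range(25):
    parents = sorted(pop, key=lambda p: p[1])[:2]  # ranking-based parent selection
    offspring = [[xi + random.gauss(0.0, 0.4) for xi in random.choice(parents)[0]]
                 for _ in range(lam)]
    # pre-selection: rank offspring on the surrogate, evaluate only the best two exactly
    offspring.sort(key=lambda x: surrogate(archive, x))
    evaluated = [(x, true_f(x)) for x in offspring[:2]]
    archive += evaluated
    pop = sorted(pop + evaluated, key=lambda p: p[1])[:mu]  # elitist survivor selection
best_x, best_y = pop[0]
```

Only two of the twenty offspring per generation are evaluated on the expensive function; the surrogate filters the rest, which is the core cost-saving idea of the hybrid.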