This section provides the underlying methodological framework of the study. The global optimization problem we consider is defined as:
$$ \mathop {\max }\limits_{{x \in \chi \subset {\mathbb{R}}^{d} }} f\left( x \right) $$
where the search space \(\chi\) is generally box-bounded and \(f\left( x \right)\) is black-box meaning that no gradient information is available and that we have only access to noisy observations of \(f\left( x \right)\) which are computationally expensive.
Gaussian processes
GPs are a powerful nonparametric model for implementing both regression and classification. One way to interpret a GP is as a distribution over functions, with inference taking place directly in the space of functions (Williams and Rasmussen 2006). A GP, therefore, is a collection of correlated random variables, any finite number of which have a joint Gaussian distribution. A GP is completely specified by its mean function \(\mu \left( x \right)\) and covariance function \({\text{cov}}\left( {f\left( x \right),f\left( {x^{\prime}} \right)} \right) = k\left( {x,x^{\prime}} \right)\):
$$ \begin{aligned} & \mu \left( x \right) = {\mathbb{E}}\left[ {f\left( x \right)} \right] \\ & {\text{cov}}\left( {f\left( x \right),f\left( {x^{\prime}} \right)} \right) = k\left( {x,x^{\prime}} \right) = {\mathbb{E}}\left[ {\left( {f\left( x \right) - \mu \left( x \right){ }} \right)\left( {f\left( {x^{\prime}} \right) - \mu \left( {x^{\prime}} \right){ }} \right)} \right] \\ \end{aligned} $$
and will be denoted by: \(f\left( x \right)\sim GP\left( {\mu \left( x \right),k\left( {x,x^{\prime}} \right)} \right)\). This means that the behaviour of the model can be controlled entirely through the mean and covariance.
Usually, for notational simplicity we will take the prior of the mean function to be zero, although this is not necessary. The covariance function assumes a critical role in the GP modelling, as it specifies the distribution over functions, depending on a sample \(X_{1:n} = \left\{ {x_{1} , \ldots ,x_{n} } \right\}\). More precisely, \(f\left( {X_{1:n} } \right)\sim {\mathcal{N}}\left( {0,K\left( {X_{1:n} ,X_{1:n} } \right)} \right)\) with \(K\left( {X_{1:n} ,X_{1:n} } \right)\) an \(n \times n\) matrix whose entry \(K_{ij} = k\left( {x_{i} ,x_{j} } \right)\) with \(k\) a kernel function specifying a prior about the smoothness of the function to be approximated and optimized.
We usually have access only to noisy function values, denoted by \(y = f\left( x \right) + \varepsilon\), with \(\varepsilon\) an additive independent identically distributed Gaussian noise with variance \(\lambda^{2}\).
Let \(y = \left( {y_{1} , \ldots ,y_{n} } \right)\) denote a set of \(n\) noisy function evaluations, then the resulting covariance matrix becomes \(K\left( {X_{1:n} ,X_{1:n} } \right) + \lambda^{2} I\).
Let \(D_{1:n} = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}_{i = 1,..,n}\) denote the set containing both the locations and the associated function evaluations, then the predictive equations for GP regression are updated by conditioning the joint Gaussian prior distribution on \(D_{1:n}\):
$$ \begin{aligned} & \mu \left( x \right) = {\mathbb{E}}\left[ {f\left( x \right)|D_{1:n} ,x} \right] = k\left( {x,X_{1:n} } \right)\left[ {K\left( {X_{1:n} ,X_{1:n} } \right) + \lambda^{2} I} \right]^{ - 1} y \\ & \sigma^{2} \left( x \right) = k\left( {x,x} \right) - k\left( {x,X_{1:n} } \right)\left[ {K\left( {X_{1:n} ,X_{1:n} } \right) + \lambda^{2} I} \right]^{ - 1} k\left( {X_{1:n} ,x} \right) \\ \end{aligned} $$
where \(k\left( {x,X_{1:n} } \right) = \left[ {k\left( {x,x_{1} } \right), \ldots , k\left( {x,x_{n} } \right)} \right]\).
The covariance function is the crucial ingredient in a GP predictor, as it encodes assumptions about the function to approximate: function evaluations that are near to a given point should be informative about the prediction at that point. Under the GP view, it is the covariance function that defines nearness or similarity. Once the prior mean and the kernel are chosen, they are updated with the observation of \(f\left( x \right)\) to find a-posterior distribution \(f{(}x {|} D_{1:n} )\) and this allows us to find the expected value of the function at any point and to calculate its uncertainty through its predicted variance.
Examples of covariance (aka kernel) functions:
$$ \begin{gathered} {\text{Squared Exponential (SE) kernel:}}\quad k_{{{\text{SE}}}} \left( {x,x^{\prime}} \right) = e^{{ - \frac{{x - x^{{\prime}{2}} }}{{2\ell^{2} }}}} . \hfill \\ {\text{Exponential kernel:}}\quad k_{{{\text{Exp}}}} \left( {x,x^{\prime}} \right) = e^{{ - \frac{{\left| {x - x^{\prime}} \right|}}{\ell }}} . \hfill \\ {\text{Power exponential kernel:}}\quad k_{{{\text{PowExp}}}} \left( {x,x^{\prime}} \right) = e^{{ - \left( {\frac{{\left| {x - x^{\prime}} \right|}}{\ell }} \right)^{p} }} . \hfill \\ {\text{Mat}}{\acute{\text{e}}}{\text{rn kernels:}}\quad k_{{{\text{Mat}}}} \left( {x,x^{\prime}} \right) = \frac{{2^{1 - \nu } }}{\Gamma \left( \nu \right)}\left( {\frac{{\left| {x - x^{\prime}} \right|\sqrt {2\nu } }}{\ell }} \right)^{\nu } K_{\nu } \left( {\frac{{\left| {x - x^{\prime}} \right|\sqrt {2\nu } }}{\ell }} \right). \hfill \\ \end{gathered} $$
The hyperparameter \(\ell\), in all the reported kernels, is the characteristic length-scale (updated by maximum likelihood estimation). The hyperparameter \(v > 0\) in the Matérn kernel governates the smoothness of the GP samples, which are \(v - 1\) differentiable, and where \({\Gamma }\left( \nu \right) \) is the gamma function and \(K_{\nu }\) is the modified Bessel function of the second kind. The most widely adopted versions, specifically in the machine learning community and considered in this paper, are \(\nu = 3/2\) and \(\nu = 5/2\). The Matérn kernel encodes the expected smoothness of the target function explicitly.
Random forest
Random forest (RF) is an ensemble learning method, based on decision trees, for both classification and regression problems (Ho 1995). According to the originally proposed implementation, RF aims at generating a multitude of decision trees, at training time, and providing as output the mode of the classes (classification) or the mean/median of the predictions (regressions) of the individual trees.
Although originally designed and presented as a machine learning algorithm, RF is also an effective and efficient alternative to GP for implementing BO. RF consists of an ensemble of different regressors (i.e. decision trees); it is possible to compute—as for GP—both \(\mu \left( x \right)\) and \(\sigma \left( x \right)\), simply as mean and variance of the samples of the individual outputs provided by the regressors. Due to the different nature of RF and GP, they result in significantly different surrogate models. While GP is well suited to model smooth functions in search space spanned by continuous variables, RF can also deal with discrete and conditional variables.
Moreover, fitting an RF is computationally more efficient than fitting a GP; indeed, the kernel matrix inversion required to fit a GP is a well-known computational issue.
The acquisition functions
The acquisition function is the mechanism to implement the trade-off between exploration and exploitation in BO. More precisely, any acquisition function aims to guide the search of the optimum towards points with potentially high values of objective function either because the prediction of \(f\left( x \right)\), based on the probabilistic surrogate model, is high or the uncertainty, also based on the same model, is high (or both). Indeed, exploitation means to consider the area providing more chance to improve over current solution, while exploration means to move towards less explored regions of the search space where predictions based on the surrogate model are more uncertain, with higher variance. There are many acquisition functions, we quote only those used in this study. Probability of improvement (PI) (Kushner 1964) and expected improvement (EI) (Mockus 1975) measure, respectively, the probability and the expectation of the improvement over the best observed value of \(f\left( x \right)\) given the predictive distribution of the probabilistic surrogate model. More precisely, they are defined as follows:
$$ \begin{aligned} & {\text{EI}}\left( x \right) = \mathop {\max }\limits_{x \in X} \left\{ {0,\mu \left( x \right) - y^{ + } } \right\} \\ & {\text{PI}}\left( x \right) = {\Phi }\left( {\frac{{\mu \left( x \right) - y^{ + } }}{\sigma \left( x \right)}} \right) \\ \end{aligned} $$
with \(y^{ + }\) the best value observed so far and \({\Phi }\) is the cumulative density function of a normal distribution. The next evaluation point, \(x_{n + 1}\), is obtained by maximizing \({\text{EI}}\left( x \right)\) or \({\text{PI}}\left( x \right)\). More recently, Upper/Lower Confidence Bound, (Srinivas et al. 2010), denoted with UCB/LCB, is widely used. It is an acquisition function that manages exploration–exploitation by being optimistic in the face of uncertainty where UCB and LCB are used, respectively, for maximization and minimization problems:
$$ {\text{UCB}}\left( x \right) = \mu \left( x \right) + \beta \sigma \left( x \right)\;{\text{and}}\;{\text{LCB}}\left( x \right) = \mu \left( x \right) - \beta \sigma \left( x \right) $$
where \(\beta \ge 0\) is the parameter to manage the trade-off between exploration and exploitation (\(\beta = 0\) is for pure exploitation; on the contrary, higher values of \(\beta\) emphasizes exploration by inflating the weight of model uncertainty). In Srinivas et al. (2010), a policy is provided for updating the value of \(\beta\) along function evaluations, with also a proof of convergence of such a policy. In the case of a maximization problem, the next point is chosen as \(x_{n + 1} = \mathop {{\text{argmax}}}\limits_{x \in X} UCB\left( x \right)\) while in the case of a minimization problem the next point is selected as \(x_{n + 1} = \mathop {{\text{argmin}}}\limits_{x \in X} LCB\left( x \right)\).
Bayesian optimization
Algorithm 1 summarizes a general BO schema where the acquisition function, whichever it is, is denoted by \(\alpha \left( {x,D_{1:n} } \right)\). In order to maintain the schema as general as possible we do not specify the probabilistic surrogate model, as well as the kernel in the case of a GP.
In this study we have used both GP (considering the kernel types presented in the previous section) and RF as surrogate probabilistic models. The three different acquisition functions previously described have been used for both the two surrogates.