Stable Bayesian optimization

  • Thanh Dai Nguyen
  • Sunil Gupta
  • Santu Rana
  • Svetha Venkatesh
Regular Paper

Abstract

Tuning the hyperparameters of machine learning models is important for their performance. Bayesian optimization has recently emerged as the de-facto method for this task. Hyperparameter tuning is usually performed by evaluating model performance on a validation set, and Bayesian optimization is used to find the hyperparameter set corresponding to the best model performance. However, in many cases, the function representing the model performance on the validation set contains several spurious sharp peaks due to limited datapoints. Bayesian optimization, in such cases, has a tendency to converge to these sharp peaks instead of other, more stable peaks. When a model trained using such hyperparameters is deployed in the real world, its performance suffers dramatically. We address this problem through a novel stable Bayesian optimization framework. We construct two new acquisition functions that help Bayesian optimization avoid converging to sharp peaks. We conduct a theoretical analysis and guarantee that Bayesian optimization using the proposed acquisition functions prefers stable peaks over unstable ones. Experiments with synthetic function optimization and hyperparameter tuning for support vector machines show the effectiveness of our proposed framework.

Keywords

Bayesian optimization · Gaussian process · Stable Bayesian optimization · Acquisition function

1 Introduction

Bayesian optimization is a technique to sequentially optimize expensive blackbox functions in a sample-efficient manner. Recently, it has emerged as a de-facto method to tune complex machine learning algorithms [21]. In tuning, the goal is to train a classifier at the right complexity so that it neither overfits nor underfits. Performance on a validation set is used as an indicator of the fit, and it is expected to peak at the hyperparameters corresponding to the right complexity and exhibit lower values at other hyperparameters. Thus, to tune a machine learning algorithm, Bayesian optimization is employed in pursuit of the peak validation set performance. However, in some situations, especially when the training or the validation dataset is small, spurious peaks appear along the performance surface (e.g., Fig. 1). These peaks tend to be distributed randomly over the low-performance region. They are characteristically different from the peak corresponding to the right complexity in two ways: (a) they tend to be narrow and (b) they vanish when tested on a large test dataset, whereas the right peak remains stable. Due to the latter difference, a Bayesian optimization method that does not explicitly avoid these spurious peaks may converge to one of them and can result in a badly tuned system with inexplicably low performance during real-world deployment. To the best of our knowledge, we are the first to identify and analyze this issue of spurious peaks and its serious downsides.
Fig. 1

Accuracy versus hyperparameters for support vector machine training, shown as color-coded images: a on a small validation set, and b on a large test set. The spurious peaks of region 1, seen for the validation set, vanish for the test set, while the stable peak of region 2 remains

The existence of multiple peaks with different widths along an optimization surface is prevalent in many real-world systems. For some of them, the end result of optimization can be dramatically affected depending on whether the optimization has converged to a wide peak or a narrow peak. For example, in alloy design [23], one of the main goals is to find the mixing proportion of a set of elements with the highest value of a physical property (e.g., strength, ductility). However, alloy making is an imprecise process. Due to impurities in the raw material, the elements can never be mixed at exactly the desired proportion. Therefore, if the desired proportion lies at a narrow peak, the performance of the alloy will not be stable when made repeatedly, as even a small difference in impurities could result in a dramatic loss in performance. Hence, being able to avoid narrow peaks in favor of more stable peaks is a critical factor of success for several different applications of Bayesian optimization. Unfortunately, until now, the various downsides of reaching a narrow peak in the optimization of physical systems and processes have never been identified and attended to.

Bayesian optimization, in its simplest form, consists of a Gaussian process (GP) [17] to maintain a distribution over the objective function based on the observations so far, and a mechanism to select the next query point based on an optimistic exploration strategy. This strategy is implemented through an acquisition function, which can be of different types, such as expected improvement (EI) [14] or GP-UCB [20]. Based on the predictive distribution of the Gaussian process, EI computes the expected improvement over the current best observation, and the location offering the highest improvement is chosen as the next query point. GP-UCB finds the location of the highest peak of a function by judiciously combining the mean and the variance of the GP prediction. Apart from hyperparameter tuning, Bayesian optimization has also been used for optimal sensor placement [5], gait design [12], optimal path planning [13], etc. While this simple strategy is powerful for many applications, there have been recent attempts to make it more widely applicable by making it feasible in high dimensions [4], and adding the ability to perform transfer learning [10], multi-objective optimization [11], batch optimization [1, 16], etc. Convergence analyses of Bayesian optimization for EI [3], EI with noisy measurements [22], and GP-UCB [20] provide guarantees of convergence to the optimum of the objective function. However, none of these methods differentiate peaks based on their stability and can, in principle, converge to any of them if there are multiple peaks of the same height but different stability. Thus, finding an optimum where the function value is stable, while avoiding regions where function values exhibit undesired fluctuations, is an open problem.

To address the issue of spurious peaks, we propose two new acquisition functions for Bayesian optimization that actively seek stable peaks of the objective function. Based on our definition of stability, we show that it is possible to measure the stability of a peak by subjecting the underlying Gaussian process model to input perturbation. When faced with input perturbation, the predictive distribution of the Gaussian process changes. At any peak, the mean of the distribution goes lower, and the variance goes higher. But more importantly, for two peaks of the same height, the narrower peak will have a lower mean and a higher variance than the other peak. Furthermore, we show that the variance can be effectively decomposed as a sum of two parts: (a) epistemic variance due to the limited number of samples, and (b) aleatoric variance arising from the interaction between the curvature of the function and the input perturbation. The narrower a peak is, the higher the aleatoric variance will be around that peak. Therefore, the aleatoric variance can be used as an effective measure of the instability of a peak. Two acquisition functions are proposed in line with GP-UCB and EI that, while exploiting the usual combination of mean and variance, also penalize instability. Theoretically, we prove that under mild assumptions, when two peaks are of the same height, the proposed acquisition functions always favor the more stable peak. We compare our method with a standard Bayesian optimization implementation on both synthetic function optimization and real-world hyperparameter tuning. For synthetic function optimization, we create a function that has both stable regions and spurious regions. Experiments with this function show that our proposed method converges to stable regions more often than the baseline. For real-world applications, we demonstrate tuning the hyperparameters of support vector machines on two real datasets.
Experimental results clearly demonstrate that our proposed method converges to a stable peak, whereas the standard Bayesian optimization converges to an unstable peak, and hence the SVMs tuned by our method perform better on test sets.

Given these concerns about the stability of optimization, our proposed method can be applied widely to real-world problems, especially in industrial settings. In practice, it is almost impossible to set the parameters of machines precisely. Depending on the process/plant settings, the outcome of industrial processes can be noticeably different even with a small modification in parameters. By choosing stable peaks rather than unstable ones, our proposed method allows us to find a favorable set of parameters and minimize the effects of instability, which ultimately leads to robust designs and the desired products.

2 Background

Bayesian optimization [18] is a well-known methodology to find the extremum of an unknown blackbox function f. The optimization problem can be formally defined as:
$$\begin{aligned} \mathbf {x}^{*}=\text {argmax}_{\mathbf {x}\in \mathcal {X}}f(\mathbf {x}) \end{aligned}$$
where \(\mathcal {X}\) is the domain of \(\mathbf {x}\). It is assumed that although f is a blackbox function without a closed-form expression, it can be evaluated at any point \(\mathbf {x}\) in the domain. Given a few input and output values of the function f, Bayesian optimization iteratively suggests samples for evaluation to find the optimum of the function. Unlike common convex optimizers, it does not require knowledge of the function's gradient. Bayesian optimization uses all the information available from the observations \(\mathbf {x}\) and \(f(\mathbf {x})\) for reasoning rather than relying only on local gradients. Thus, it is useful for optimizing expensive blackbox functions.

Bayesian optimization consists of two main components. The first is a meta model that can be evaluated at any point along with an uncertainty estimate. This meta model uses prior knowledge about the cost function, such as smoothness, together with the known datapoints to update our beliefs about the function. There are many choices for this model, such as Gaussian processes, Bayesian neural networks [19], and random forests [2]. The second component of a Bayesian optimization algorithm is an acquisition function that suggests where to evaluate the function next. Using this function, the original problem is reduced to optimizing a less expensive non-convex function. The acquisition function maintains a trade-off between exploration (where the posterior distribution has high uncertainty) and exploitation (where the objective function has a high predictive value). This technique minimizes the number of cost function evaluations. In this work, we use a Gaussian process as the meta model, and the upper confidence bound (UCB) and expected improvement (EI) as the acquisition functions.
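The interplay of the two components can be sketched as a generic loop, not tied to the authors' implementation: fit the meta model, maximize the acquisition function over a candidate set, and evaluate the cost function at the suggested point. Here `fit_and_predict`, `acquisition`, and `candidates` are hypothetical placeholders for the concrete choices discussed in the following sections.

```python
import numpy as np

def bo_loop(f, X, y, fit_and_predict, acquisition, candidates, n_iter=10):
    """Skeleton of Bayesian optimization: refit the meta model, score a
    candidate set with the acquisition function, and evaluate f at the
    best-scoring candidate."""
    X, y = list(X), list(y)
    for _ in range(n_iter):
        predict = fit_and_predict(X, y)        # posterior of the meta model
        scores = [acquisition(*predict(c)) for c in candidates]
        x_next = candidates[int(np.argmax(scores))]
        X.append(x_next)
        y.append(f(x_next))                    # expensive function evaluation
    best = int(np.argmax(y))
    return X[best], y[best]
```

In a full implementation, `candidates` would be replaced by a continuous maximization of the acquisition function, as described in Sect. 2.2.3.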

2.1 Gaussian process

A Gaussian process is a stochastic process such that every finite collection of its random variables is jointly Gaussian. Intuitively, one can think of a Gaussian process as a multivariate Gaussian distribution over an infinite-dimensional vector. A Gaussian process is specified by a mean function and a covariance function. A function \(f(\mathbf {x})\) drawn from a Gaussian process with mean \(m(\mathbf {x})\) and covariance function \(k(\mathbf {x},\mathbf {x}')\) is denoted as follows:
$$\begin{aligned} f(\mathbf {x})\thicksim \mathcal {GP}(m(\mathbf {x}),k(\mathbf {x},\mathbf {x}')). \end{aligned}$$
Assume that we have a dataset \(\mathcal {D}_{t}=\{(\mathbf {x}_{i},y_{i})\},i=1,2,\ldots ,t\), where \(y_{i}=f(\mathbf {x}_{i})\) and \(f(\mathbf {x})\) is drawn from a Gaussian process \(\mathcal {GP}(m(\mathbf {x}),k(\mathbf {x},\mathbf {x}'))\). Without loss of generality, we can make the Gaussian process depend only on the covariance function \(k(\mathbf {x},\mathbf {x}')\) by assuming the mean function \(m(\mathbf {x})\) to be zero. The function \(k(\mathbf {x},\mathbf {x}')\) must be a valid covariance function so that the covariance matrix \(\mathbf {K}\) is positive definite [17], where \(\mathbf {K}_{i,j}=k(\mathbf {x}_{i},\mathbf {x}_{j})\). The kernel \(k(\mathbf {x},\mathbf {x}')\) is the most important part of a Gaussian process. It represents our prior belief about the properties of the function being modeled and determines the correlation between any two points. The squared exponential kernel is a popular choice for the covariance function and is given by:
$$\begin{aligned} k(\mathbf {x},\mathbf {x}')=\exp \left( -\frac{1}{2\theta ^{2}}\left\| \mathbf {x}-\mathbf {x}'\right\| ^{2}\right) \end{aligned}$$
(1)
where \(\theta \) is the length scale parameter. To make predictions using Gaussian process, we consider the joint distribution of the old observations \(\mathcal {D}_{t}\) and a new observation \((\mathbf {x}_{t+1},y_{t+1})\) as:
$$\begin{aligned} \left[ \begin{array}{c} \mathbf {y}_{1:t}\\ y_{t+1} \end{array}\right] \sim \mathcal {N}\left( 0,\left[ \begin{array}{cc} \mathbf {K} &{} \mathbf {k}\\ \mathbf {k}^{T} &{} k(\mathbf {x}_{t+1},\mathbf {x}_{t+1}) \end{array}\right] \right) \end{aligned}$$
where \(\mathbf {k}=[k(\mathbf {x}_{1},\mathbf {x}_{t+1}),k(\mathbf {x}_{2},\mathbf {x}_{t+1}),\ldots ,k(\mathbf {x}_{t},\mathbf {x}_{t+1})]^{T}\). Using the Sherman–Morrison–Woodbury formula [17], the predictive distribution of the function value at \(\mathbf {x}_{t+1}\) can be written as:
$$\begin{aligned} p(y_{t+1}|\mathbf {y}_{1:t},\mathbf {x}_{1:t+1})=\mathcal {N}(\mu _{t}(\mathbf {x}_{t+1}),\sigma _{t}^{2}(\mathbf {x}_{t+1})) \end{aligned}$$
where
$$\begin{aligned} \mu _{t}(\mathbf {x}_{t+1})= & {} \mathbf {k}^{T}\mathbf {K}^{-1}\mathbf {y}_{1:t}\end{aligned}$$
(2)
$$\begin{aligned} \sigma _{t}^{2}(\mathbf {x}_{t+1})= & {} k(\mathbf {x}_{t+1},\mathbf {x}_{t+1})-\mathbf {k}^{T}\mathbf {K}^{-1}\mathbf {k}. \end{aligned}$$
(3)
From Eqs. (2) and (3), we can see some interesting properties of the Gaussian process. First, the predictive variance depends only on the covariance function and the input values \(\mathbf {x}_{1:t}\), not on the observed outputs. Second, the marginal distribution at any single point is a univariate Gaussian, due to the marginalization property of the multivariate Gaussian distribution. This makes the computation of acquisition functions easier.
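As an illustration (not the authors' implementation), Eqs. (1)–(3) can be sketched in a few lines of NumPy; a small `jitter` term is added to the Gram matrix for numerical stability, and the function and parameter names are our own:

```python
import numpy as np

def se_kernel(A, B, theta=1.0):
    """Squared exponential kernel of Eq. (1) between rows of A and B."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / theta**2)

def gp_posterior(X, y, x_new, theta=1.0, jitter=1e-10):
    """Predictive mean (Eq. 2) and variance (Eq. 3) of a zero-mean GP."""
    K = se_kernel(X, X, theta) + jitter * np.eye(len(X))  # Gram matrix K
    k = se_kernel(X, x_new, theta)                        # cross-covariances k
    K_inv_y = np.linalg.solve(K, y)                       # K^{-1} y
    mu = k.T @ K_inv_y                                    # k^T K^{-1} y
    var = se_kernel(x_new, x_new, theta) - k.T @ np.linalg.solve(K, k)
    return mu, np.diag(var)
```

Note how both properties appear in the code: the variance never touches `y`, and each output component is a univariate Gaussian with mean `mu[i]` and variance `var[i]`.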

2.2 Acquisition functions

Having discussed how to use a Gaussian process as the meta model for Bayesian optimization, in this section we focus on the remaining component: the acquisition function. Its role is to recommend the next sample for function evaluation. After defining an acquisition function, the original optimization problem is approached by maximizing the acquisition function \(\alpha \) as follows:
$$\begin{aligned} \mathbf {x}_{t+1}^{*}=\arg \max _{\mathbf {x}\in \mathcal {X}}\alpha \left( \mathbf {x};\mathcal {I}_{t}\right) \end{aligned}$$
where \(\mathcal {I}_{t}\) denotes the Gaussian process estimated using t observations. Typically, the acquisition function is defined such that its high values correspond to potentially high values of the objective function f. The trade-off between exploring highly uncertain regions and exploiting promising areas is also encoded in the acquisition function. In this section, we discuss two popular choices: the upper confidence bound and expected improvement.

2.2.1 Upper confidence bound

Given the posterior Gaussian process, the upper confidence bound acquisition function (GP-UCB) is defined as follows [20]:
$$\begin{aligned} \text {GP-UCB}(\mathbf {x})=\mu _{t}(\mathbf {x})+\kappa _{t}\sigma _{t}(\mathbf {x}) \end{aligned}$$
(4)
where \(\kappa _{t}\) is a positive parameter that balances exploitation and exploration. Maximizing the GP-UCB acquisition function suggests the point at which to evaluate the target function f next. Srinivas et al. [20] proved that if \(\kappa _{t}=2\log \left( t^{2}2\pi ^{2}/3\delta \right) +2d\log \left( t^{2}dbr\sqrt{\log \left( 4da/\delta \right) }\right) \), GP-UCB achieves an upper bound on the cumulative regret \(\mathcal {R}_{T}=\sum _{t=1}^{T}\left( f\left( \mathbf {x}^{*}\right) -f\left( \mathbf {x}_{t}\right) \right) \) of order \(\mathcal {O}\left( \sqrt{T\gamma _{T}\kappa _{T}}\right) \,\forall T\ge 1\), with probability at least \(1-\delta \), where \(\gamma _{T}\) is the maximum information gain after T iterations, the search space is \([0,r]^{d}\) for some \(r>0\), and \(a,b>0\) are constants related to the smoothness of the function.
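Computationally, Eq. (4) is a one-liner given the GP's predictive mean and standard deviation. The sketch below uses a simpler, commonly used exploration schedule for \(\kappa _{t}\) on finite domains rather than the full theoretical expression above; both function names are our own:

```python
import numpy as np

def gp_ucb(mu, sigma, kappa):
    """GP-UCB score of Eq. (4): mu_t(x) + kappa_t * sigma_t(x)."""
    return mu + kappa * sigma

def kappa_schedule(t, delta=0.1):
    """A simple exploration schedule for finite domains, a simplification
    of the theoretical choice in Srinivas et al. [20]."""
    return np.sqrt(2.0 * np.log(t**2 * np.pi**2 / (6.0 * delta)))
```

The schedule grows slowly with t, so the optimizer keeps exploring uncertain regions even late in the run.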

2.2.2 Expected improvement

Another approach to define the acquisition function is based on improvement. Mockus et al. [14] defined the improvement function as follows:
$$\begin{aligned} \text {I}(\mathbf {x})=\text{ max }\left\{ 0,f_{t+1}(\mathbf {x})-f(\mathbf {x}^{+})\right\} \end{aligned}$$
(5)
where \(\mathbf {x}^{+}=\arg \max f(\mathbf {x}_{i})\) is the current best observed point. The likelihood of the improvement \(\text {I}(\mathbf {x})\) can be computed from the density function of the normal distribution \(\mathcal {N}(\mu (\mathbf {x}),\sigma ^{2}(\mathbf {x}))\) as:
$$\begin{aligned} \text {P}\left[ \text {I}(\mathbf {x})\right] =\frac{1}{\sqrt{2\pi }\sigma (\mathbf {x})}\exp \left[ -\frac{\left( \mu (\mathbf {x})-f(\mathbf {x}^{+})-\text {I}\right) {}^{2}}{2\sigma ^{2}(\mathbf {x})}\right] \end{aligned}$$
Then the expected improvement acquisition function can be computed as follows:
$$\begin{aligned} \text {EI}\left( \mathbf {x}\right)= & {} \mathbb {E}\left[ \text {I}(\mathbf {x})\right] \\= & {} \intop _{\text {I=0}}^{\text {I}=\infty }\frac{\text {I}}{\sqrt{2\pi }\sigma (\mathbf {x})}\exp \left[ -\frac{\left( \mu (\mathbf {x})-f(\mathbf {x}^{+})-\text {I}\right) {}^{2}}{2\sigma ^{2}(\mathbf {x})}\right] \text {dI} \end{aligned}$$
This function can be evaluated analytically [9, 14], yielding the following results:
$$\begin{aligned} \text {EI}\left( \mathbf {x}\right)= & {} {\left\{ \begin{array}{ll} z\sigma (\mathbf {x})\varPhi (z)+\sigma (\mathbf {x})\phi (z) &{}\quad \text {if }\sigma (\mathbf {x})>0\\ 0 &{}\quad \text {if }\sigma (\mathbf {x})=0 \end{array}\right. } \end{aligned}$$
(6)
where \(z=\frac{\mu \left( \mathbf {x}\right) -f(\mathbf {x}^{+})}{\sigma \left( \mathbf {x}\right) }\), \(\varPhi (z)\) is the standard normal cumulative distribution function and \(\phi (z)\) is the standard normal probability density function. The EI function has been well studied theoretically [3] and proven to be effective in practice [18].
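The closed form of Eq. (6) is cheap to compute with only the standard library: \(\varPhi \) can be obtained from the error function. A minimal sketch (our own naming, not the authors' code):

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI of Eq. (6): sigma * (z*Phi(z) + phi(z))."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - f_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return sigma * (z * Phi + phi)
```

Note that EI is always nonnegative and vanishes wherever the model is certain (\(\sigma =0\)), which is why it never re-queries an exactly known point.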

2.2.3 Maximizing the acquisition function

To find the point where \(f(\mathbf {x})\) is evaluated next, we have to maximize the acquisition function. Unlike the original objective function, the two acquisition functions of Eqs. (4) and (6) are cheap to evaluate. They can be maximized using standard optimization techniques such as local optimizers, sequential quadratic programming, or global optimizers such as DIRECT [8].
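Because the acquisition function is cheap, even a dense random search over the domain is often adequate. The sketch below is a simple stand-in for DIRECT or multi-start local optimizers (the function name and interface are our own):

```python
import numpy as np

def maximize_acquisition(acq, bounds, n_samples=2000, seed=0):
    """Maximize a cheap acquisition function by dense random search over a
    box domain given as [(lo_1, hi_1), ..., (lo_d, hi_d)]."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    cand = rng.uniform(lo, hi, size=(n_samples, len(lo)))  # random candidates
    scores = np.array([acq(c) for c in cand])              # cheap to evaluate
    return cand[np.argmax(scores)]
```

In practice, the best random candidate is often refined further with a local optimizer started from it.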

3 The proposed framework

We present two new acquisition functions for Bayesian optimization designed to maximize a blackbox function such that maxima in stable regions are preferred over maxima in relatively unstable regions. We first discuss the notion of stability and then describe how a Gaussian process model is modified in the presence of perturbations in the input variables. Next, we use the predictive distribution of the modified Gaussian process to formulate two novel acquisition functions: STABLE-UCB and STABLE-EI. We theoretically analyze the proposed acquisition functions and prove that they are guaranteed to take higher values in more stable regions; thus, Bayesian optimization using these acquisition functions has a higher tendency to sample from more stable regions. Finally, we present an algorithm summarizing the proposed stable Bayesian optimization.

3.1 Stability of Gaussian process prediction

Given a set of observed data \(\mathcal {D}_{t}=\{\mathbf {x}_{i},y_{i}\}_{i=1}^{t}\) where \(\mathbf {x}_{i}\in \mathbb {R}^{D}\) and \(y_{i}=f(\mathbf {x}_{i})+\epsilon \), we use a Gaussian process to model the function f. Using \(\mathcal {D}_{t}\), for a new input \(\mathbf {x}\), the predictive distribution of the corresponding output \(y=f(\mathbf {x})\) is given as
$$\begin{aligned} P\left( y|\mathcal {D}_{t},\mathbf {x}\right) =\mathcal {N}\left( \mu _{t}\left( \mathbf {x}\right) ,\sigma _{t}^{2}\left( \mathbf {x}\right) \right) , \end{aligned}$$
where the predictive mean \(\mu _{t}\left( \mathbf {x}\right) =\mathbf {k}^{T}\mathbf {K}^{-1}\mathbf {y}\) and the predictive variance \(\sigma _{t}^{2}\left( \mathbf {x}\right) =k(\mathbf {x},\mathbf {x})-\mathbf {k}^{T}\mathbf {K}^{-1}\mathbf {k}\) with a notation \(\mathbf {y}=y_{1:t}\). We define \({{\varvec{\beta }}}=\mathbf {K}^{-1}\mathbf {y}\) to be used later.
The above predictive mean and variance are instrumental to Bayesian optimization as they provide a way to estimate the function value at any point in the function support, along with the model uncertainty. The model uncertainty, also called 'epistemic uncertainty', expresses our belief in the estimate and guides efficient exploration of the function while maintaining a balance with exploitation. This is an instance of the exploration–exploitation trade-off familiar from reinforcement learning.
Fig. 2

Predictive distributions of the Gaussian approximation and the Monte Carlo approximation for different noise levels: a \({\varvec{\Sigma }}_{\mathbf {x}}=0.01\), b \({\varvec{\Sigma }}_{\mathbf {x}}=0.02\), c \({\varvec{\Sigma }}_{\mathbf {x}}=0.05\). In practice, parameter settings are limited to 1% perturbation, and the Gaussian approximation handles this regime well

Since our goal is to develop a stable Bayesian optimization framework that prefers solutions insensitive to small perturbations in the input, we begin by asking how the predictive mean and variance of the function value change if an input is slightly perturbed. A large shift in the mean and/or a large increase in the variance indicates a fast-varying function and can be used to detect unstable regions. In an early work, Girard and Murray-Smith [7] showed that if a test input is corrupted by Gaussian noise, \({{\varvec{\epsilon }}}_{\mathbf {x}}\sim \mathcal {N}(\mathbf {0},{\varvec{\Sigma }}_{\mathbf {x}})\) such that \(\mathbf {u}=\mathbf {x}+{{\varvec{\epsilon }}}_{\mathbf {x}}\), the predictive distribution is given as
$$\begin{aligned} p(y|\mathcal {D}_{t},\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})=\int p(y|\mathcal {D}_{t},\mathbf {u})p(\mathbf {u}|\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\text {d}\mathbf {u}. \end{aligned}$$
(7)
This distribution is, in general, non-Gaussian. However, it is shown in [7] that a Gaussian approximation is fairly close while remaining tractable. Let \(\mu _{t}(\mathbf {x})\) and \(\sigma _{t}^{2}(\mathbf {x})\) denote the mean and variance of the Gaussian predictive distribution \(p(y|\mathcal {D}_{t},\mathbf {x})\) in the perturbation-free case. Similarly, let \(m_{t}(\mathbf {x},\varvec{\Sigma }_{\mathbf {x}})\) and \(v_{t}(\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\) denote the mean and variance of the predictive distribution \(p(y|\mathcal {D}_{t},\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\). With the Gaussian approximation, we can write
$$\begin{aligned} p(y|\mathcal {D}_{t},\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\approx \mathcal {N}(m_{t}(\mathbf {x}, \varvec{\Sigma }_{\mathbf {x}}),v_{t}(\mathbf {x},\varvec{\Sigma }_{\mathbf {x}})). \end{aligned}$$
(8)
We compare the Gaussian approximation to a Monte Carlo approximation of the predictive distribution for noisy inputs. For the Monte Carlo approach, we draw 1000 samples to approximate the true predictive distribution. (We have verified that adding further samples changes the distribution negligibly.) Figure 2 shows the predictive distributions of the Gaussian approximation and the Monte Carlo approximation for three noisy inputs at different levels of perturbation \({\varvec{\Sigma }}_{\mathbf {x}}\). As seen from the figure, the predictive distributions of the Gaussian approximation are comparable to those of the Monte Carlo approximation. For a small amount of perturbation, e.g., \({\varvec{\Sigma }}_{\mathbf {x}}=0.01\), the two approximations are similar for all three noisy inputs. As the noise level increases, the Gaussian approximation starts to deviate from the Monte Carlo approximation but still remains close. For a high level of noise (e.g., \({\varvec{\Sigma }}_{\mathbf {x}}=0.05\)), the two approximations are comparable for inputs in the stable region (e.g., \(\mathbf {x}=-0.25,0.25\)) and slightly different for the input in the unstable region (e.g., \(\mathbf {x}=0.65\)).
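A Monte Carlo baseline of this kind can be sketched as follows, via the laws of total expectation and total variance applied to Eq. (7). Here `predict` is a hypothetical callable returning the noise-free predictive mean and variance at a point (e.g., built from Eqs. 2 and 3); the function name and interface are our own:

```python
import numpy as np

def mc_noisy_prediction(predict, x, Sigma_x, n_samples=1000, seed=0):
    """Monte Carlo approximation of Eq. (7): average the noise-free GP
    predictive distribution over perturbed inputs u = x + eps,
    eps ~ N(0, Sigma_x)."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    eps = rng.multivariate_normal(np.zeros(len(x)), np.atleast_2d(Sigma_x),
                                  size=n_samples)
    stats = [predict(x + e) for e in eps]       # (mean, variance) pairs
    mus = np.array([m for m, _ in stats])
    vs = np.array([v for _, v in stats])
    # Law of total expectation / total variance over the input noise.
    return mus.mean(), vs.mean() + mus.var()
```

The second return value already exhibits the decomposition exploited later: the average noise-free variance (epistemic part) plus the variance of the means induced by the input perturbation.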
The computation of the predictive mean and variance in (8) may become intractable for an arbitrary covariance function. Fortunately, they are tractable for popular covariance functions such as the linear and squared exponential kernels. We demonstrate our framework using the squared exponential covariance function. Nonetheless, the framework remains amenable to any valid covariance function, and appropriate approximations arising for an arbitrary covariance function can be easily incorporated. For the squared exponential covariance function, the predictive mean and variance are given as
$$\begin{aligned} m_{t}(\mathbf {x},{{\varvec{\Sigma }}}_{\mathbf {x}})&= \sum _{i=1}^{t}\beta _{i}k(\mathbf {x},\mathbf {x}_{i})k_{1}(\mathbf {x},\mathbf {x}_{i}) \end{aligned}$$
(9)
$$\begin{aligned} v_{t}(\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})&= \sigma _{t}^{2}(\mathbf {x})+\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}}) \end{aligned}$$
(10)
where \(\sigma _{t}^{2}(\mathbf {x})\) is the variance as in the unperturbed case and the extra variance due to perturbation is given as
$$\begin{aligned} \sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})= & {} \sum _{i,j=1}^{t}\mathbf {K}_{ij}^{-1}\mathbf {H}_{ij}(\mathbf {x})(1-k_{2}(\mathbf {x},\bar{\mathbf {x}}_{ij}))\nonumber \\&+\,\sum _{i,j=1}^{t}\beta _{i}\beta _{j}\mathbf {H}_{ij}(\mathbf {x})(k_{2}(\mathbf {x},\bar{\mathbf {x}}_{ij})\nonumber \\&-\,k_{1}(\mathbf {x},\mathbf {x}_{i})k_{1}(\mathbf {x},\mathbf {x}_{j})). \end{aligned}$$
(11)
In the above expression, we have used the definitions:
$$\begin{aligned} k_{1}(\mathbf {x},\mathbf {x}_{i})= & {} \mathbf {Q}_{\varSigma _{\mathbf {x}}}({\mathbf {W}})\text {e}^{\frac{1}{2}(\mathbf {x}-\mathbf {x}_{i})^{T}\mathbf {S}_{\varSigma _{\mathbf {x}}}(\mathbf {W})(\mathbf {x}-\mathbf {x}_{i})}\\ k_{2}(\mathbf {x},\bar{\mathbf {x}}_{ij})= & {} \mathbf {Q}_{\varSigma _{\mathbf {x}}}({{\frac{\mathbf {W}}{2}}})\text {e}^{\frac{1}{2}(\mathbf {x}-\bar{\mathbf {x}}_{ij})^{T}\mathbf {S}_{\varSigma _{\mathbf {x}}}(\frac{\mathbf {W}}{2})(\mathbf {x}-\bar{\mathbf {x}}_{ij})}\\ \mathbf {H}_{ij}(\mathbf {x})= & {} k(\mathbf {x},\mathbf {x}_{i})k(\mathbf {x},\mathbf {x}_{j}) \end{aligned}$$
where \(\bar{\mathbf {x}}_{ij}=\frac{\mathbf {x}_{i}+\mathbf {x}_{j}}{2}\), \(\mathbf {Q}_{\varSigma _{\mathbf {x}}}( {{\mathbf {W}}})=\left| \mathbf {I}+\mathbf {W}^{-1}{\varvec{\Sigma }}_{\mathbf {x}}\right| ^{-1/2}\) and \(\mathbf {S}_{\varSigma _{\mathbf {x}}}(\mathbf {W})=\mathbf {W}^{-1}(\mathbf {W}^{-1}+\varvec{\Sigma }_{\mathbf {x}}^{-1})^{-1}\mathbf {W}^{-1}\).

In the following, we utilize the above analysis to define two novel acquisition functions to propose a stable Bayesian optimization framework.

3.2 Stable Bayesian optimization

Having closed-form expressions for the predictive mean and variance as in (9) and (10) provides the tractability required to formulate an acquisition function for 'stable' Bayesian optimization. In the expression for the predictive variance (10), the variance \(v_{t}(\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\) has two components: (1) the epistemic variance (uncertainty) \(\sigma _{t}^{2}(\mathbf {x})\), arising from our lack of knowledge about the function value due to the finite set of observations, and (2) the aleatoric variance \(\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})\) [detailed in (11)], arising from the inherent variation of the function around \(\mathbf {x}\). We associate the notion of stability with this aleatoric variance, which takes higher values in regions where the function varies faster. In the remainder of this section, we use this property to define two new acquisition functions that yield stable Bayesian optimization, resulting in a solution where the function value is robust to small perturbations.

3.2.1 The STABLE-UCB acquisition function:

Denoting the epistemic and the aleatoric variance at time t by \(\sigma _{t}^{2}(\mathbf {x})\) and \(\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})\), respectively, we define the STABLE-UCB acquisition function as:
$$\begin{aligned} a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})=m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})+\kappa _{t}\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})-\lambda \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}}) \end{aligned}$$
(12)
where \(\kappa _{t}\) is a t-dependent weight that balances exploitation and exploration, and \(\lambda >0\) is a fixed weight that sets the penalty on instability. The intuition behind this formulation is to penalize points where the function varies fast under small changes in \(\mathbf {x}\). In our implementation, to balance the epistemic and aleatoric variances, we set \(\lambda \) equal to \(\kappa _{t}\).
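Given the perturbed-input quantities from (9)–(11), Eq. (12) is straightforward to evaluate. A minimal sketch (our own naming; `sigma` is assumed to be the predictive standard deviation under perturbation and `sigma_a` the aleatoric standard deviation):

```python
def stable_ucb(m, sigma, sigma_a, kappa, lam=None):
    """STABLE-UCB score of Eq. (12): the usual UCB combination of the
    perturbed-input mean and standard deviation, minus a penalty on the
    aleatoric standard deviation."""
    lam = kappa if lam is None else lam  # paper's implementation: lambda = kappa_t
    return m + kappa * sigma - lam * sigma_a
```

At two peaks with equal mean and equal epistemic uncertainty, the peak with the larger aleatoric term (the less stable one) receives the lower score, which is exactly the preference Lemma 1 formalizes.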

3.2.2 The STABLE-EI acquisition function:

We also propose a new acquisition function based on improvement, since improvement-based acquisition functions are extremely popular among practitioners. Similar to the STABLE-UCB acquisition function, we incorporate the aleatoric variance \(\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})\) at time t into the improvement function:
$$\begin{aligned} \text {I}_{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\max \left\{ 0,f_{t+1}(\mathbf {x})-\omega \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-f(\mathbf {x}^{+})\right\} \end{aligned}$$
where \(\mathbf {x}^{+}=\arg \max f(\mathbf {x}_{i})\) and \(\omega \) is a weight that penalizes points lying in unstable regions. The idea of this formulation is to give stable regions a higher improvement value than unstable regions with the same predictive mean. Using this new definition of improvement, the stable expected improvement acquisition function (STABLE-EI) can be computed as:
$$\begin{aligned}&b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\mathbb {E}[\text {I}(\mathbf {x},\varSigma _{\mathbf {x}})]=\intop _{0}^{\infty }\frac{\text {I}}{\sqrt{2\pi }\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})}\\&\quad \times \,\exp \left[ -\frac{\left( \text {I}-m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})+\omega \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})+f(\mathbf {x}^{+})\right) ^{2}}{2\sigma _{t}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})}\right] \text {d}\text {I} \end{aligned}$$
This function can be analytically evaluated as:
$$\begin{aligned} b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})= & {} {\left\{ \begin{array}{ll} v_{t}(z_{t}\varPhi (z_{t})+\phi (z_{t})) &{} \text { if }\,v_{t}>0\\ 0 &{} \text { if }\,v_{t}=0 \end{array}\right. } \end{aligned}$$
(13)
where \(z_{t}=\left[ m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-\omega \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-f(\mathbf {x}^{+})\right] /v_{t}\) and \(v_{t}=\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})\). In practice, because the aleatoric variance estimate increases with iterations, we set the parameter \(\omega =\sqrt{t}\) for the tth iteration.
Stable Bayesian optimization maximizes the acquisition function \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) or \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) to suggest the next function evaluation at each iteration. A step-by-step procedure of stable Bayesian optimization is provided in Algorithm 1.
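A minimal sketch of one iteration of this suggest-evaluate loop, assuming a standard GP surrogate and a crude finite-difference proxy for the aleatoric variance: the paper derives \(\sigma _{t,a}\) from its noisy-input GP model (Eq. 11), so the proxy, the library choice, and all constants here are illustrative only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def aleatoric_proxy(gp, X, h=0.02):
    # Crude stand-in for sigma_{t,a}: local variability of the posterior
    # mean within an h-neighbourhood of each candidate point.
    m0 = gp.predict(X)
    mp = gp.predict(X + h)
    mm = gp.predict(X - h)
    return np.std(np.stack([mm, m0, mp]), axis=0)

def stable_ucb_step(f, X_obs, y_obs, grid, kappa=2.0, lam=1.0):
    # One iteration: fit the surrogate, maximize the STABLE-UCB
    # acquisition (Eq. 12) over a candidate grid, evaluate f there.
    gp = GaussianProcessRegressor(kernel=RBF(0.1), alpha=1e-6).fit(X_obs, y_obs)
    m, s = gp.predict(grid, return_std=True)
    s_a = aleatoric_proxy(gp, grid)
    acq = m + kappa * s - lam * s_a
    x_next = grid[np.argmax(acq)]
    return x_next, f(x_next)
```

In a full run, the pair \((\mathbf {x}_{\text {next}}, y_{\text {next}})\) would be appended to the observations and the step repeated until the evaluation budget is exhausted.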

3.2.3 Theoretical analysis:

In this section, we analyze the proposed acquisition functions to provide theoretical guarantees that the acquisition functions \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) and \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) indeed prefer less sharp peaks of the function \(f(\mathbf {x})\).

Definition 1

(Identical data topology): Any two points \(\mathbf {x}\), \(\mathbf {x}'\) are said to have identical data topology if for each observation \(\mathbf {x}_{i}\), there exists another observation \(\mathbf {x}_{i'}\) such that \(||\mathbf {x}-\mathbf {x}_{i}||=||\mathbf {x}'-\mathbf {x}_{i'}||\).

A consequence of identical data topology is that, for the points \(\mathbf {x}\), \(\mathbf {x}'\), any distance-based kernel induces Gram matrices that are equal up to a permutation. With an increasing set of observations, identical data topology is not difficult to achieve approximately.
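This permutation property is easy to verify numerically. In the toy construction below, the observation sets around \(\mathbf {x}\) and \(\mathbf {x}'\) share the same multiset of offsets by design; all names and values are illustrative.

```python
import numpy as np

def kernel_vector(X, x, ls=0.2):
    # Squared-exponential kernel values k(x, x_i) for a candidate point x.
    d = np.linalg.norm(X - x, axis=1)
    return np.exp(-d**2 / (2 * ls**2))

# Observations around x and x' built from the same multiset of offsets,
# so the two points have identical data topology by construction.
x, xp = np.array([0.0]), np.array([1.0])
offsets = np.array([[-0.2], [0.1], [0.3]])
X_near_x  = x  + offsets
X_near_xp = xp + offsets[[2, 0, 1]]   # same offsets, permuted

k  = kernel_vector(X_near_x, x)
kp = kernel_vector(X_near_xp, xp)
```

The two kernel vectors differ element-wise but coincide as multisets, which is exactly "equal up to a permutation".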

Lemma 1

If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of function f such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and f locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), under certain mild assumptions, the following relations hold true:
$$\begin{aligned}&m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\\&\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\\&\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\le \sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}}). \end{aligned}$$

Proof

To avoid favoring either peak, let us assume that there are sufficiently many observations around both \(\mathbf {x}\) and \(\mathbf {x}'\) so that the two points have identical data topology. Under this mild assumption, for each observation \(\mathbf {x}_{i}\) there is a paired observation \(\mathbf {x}_{i'}\) such that \(||\mathbf {x}-\mathbf {x}_{i}||=||\mathbf {x}'-\mathbf {x}_{i'}||\). This implies that the covariance values satisfy \(k(\mathbf {x},\mathbf {x}_{i})=k(\mathbf {x}',\mathbf {x}_{i'})\) and \(k_{1}(\mathbf {x},\mathbf {x}_{i})=k_{1}(\mathbf {x}',\mathbf {x}_{i'})\). By definition, \({{\varvec{\beta }}}=\mathbf {K}^{-1}\mathbf {y}\). Since the peak at \(\mathbf {x}'\) is sharper than the peak at \(\mathbf {x}\), we have \(y_{i'}\le y_{i}\) and therefore \(\beta _{i'}\le \beta _{i}\). Hence, \(\sum _{i=1}^{t}\beta _{i}k(\mathbf {x},\mathbf {x}_{i})k_{1}(\mathbf {x},\mathbf {x}_{i})\ge \sum _{i'=1}^{t}\beta _{i'}k(\mathbf {x}',\mathbf {x}_{i'})k_{1}(\mathbf {x}',\mathbf {x}_{i'})\), i.e., \(m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\).

Next, we also note that due to the identical data topology assumption around both peaks, we have equal epistemic uncertainties, i.e., \(\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) by definition of \(\sigma _{t}(\mathbf {x})\) in (3).

Finally, we show that \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\le \sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\). For this, consider the aleatoric variance term in (11). As above, we have the relations \(k(\mathbf {x},\mathbf {x}_{i})=k(\mathbf {x}',\mathbf {x}_{i'})\), \(k_{1}(\mathbf {x},\mathbf {x}_{i})=k_{1}(\mathbf {x}',\mathbf {x}_{i'})\), \(\beta _{i'}\le \beta _{i}\) and, additionally, \(k_{2}(\mathbf {x},\bar{\mathbf {x}}_{ij})=k_{2}(\mathbf {x}',\bar{\mathbf {x}}_{i'j'})\). Using these relations, it is straightforward to show that \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\le \sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\). \(\square \)

Next, we state and prove our key results for the newly proposed acquisition functions.

Theorem 1

(STABLE-UCB case) If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of a function f such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and f locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), the acquisition function in (12) satisfies the relation: \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge a_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) under certain mild assumptions.

Proof

Let us assume that \(\mathbf {x}\) and \(\mathbf {x'}\) have identical data topology. Consider the difference between the acquisition function values at \(\mathbf {x}\), \(\mathbf {x'}\):
$$\begin{aligned} \varDelta a_{t}= & {} \left[ m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\right] \\&+\, \left[ \kappa _{t}\left( \sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})-\sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\right) \right] \\&-\, \left[ \lambda \left( \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-\sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\right) \right] . \end{aligned}$$
Using the three separate inequalities from Lemma 1, we can prove that \(\varDelta a_{t}\ge 0\), i.e., \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge a_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\). \(\square \)

Theorem 2

(STABLE-EI case) If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of a function f such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and f locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), the acquisition function in (13) satisfies the relation: \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge b_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) under certain mild assumptions.

Proof

Let us assume that \(\mathbf {x}\) and \(\mathbf {x'}\) have identical data topology. We define the difference between the acquisition function values at \(\mathbf {x}\), \(\mathbf {x'}\) as:
$$\begin{aligned} \varDelta b_{t}= & {} \sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})(z_{t}\varPhi (z_{t})+\phi (z_{t}))\\&-\, \sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})(z'_{t}\varPhi (z'_{t})+\phi (z'_{t})). \end{aligned}$$
Our aim is to show that \(\varDelta b_{t}\ge 0\), i.e., \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge b_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\). From the definition of \(z_{t}\) in (13) and Lemma 1, we have
$$\begin{aligned} z_{t}-z'_{t}= & {} \left[ m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\right. \\&-\left. \omega \left( \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-\sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\right) \right] /\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge 0. \end{aligned}$$
Let \(\tau (z)=z\varPhi (z)+\phi (z)\). Since \(\tau (z)\) is non-decreasing, we have \(\tau (z_{t})\ge \tau (z'_{t})\). On the other hand, from the Lemma 1 we have equal epistemic uncertainties at \(\mathbf {x}\) and \(\mathbf {x}'\), i.e., \(\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\). Therefore, we have \(\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})\tau (z_{t})\ge \sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\tau (z'_{t})\), or in other words \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge b_{t}(\mathbf {x}',\varSigma _{\mathbf {x}}).\) \(\square \)
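The monotonicity of \(\tau \) used in the last step follows from \(\tau '(z)=\varPhi (z)\ge 0\) (the \(z\phi (z)\) terms cancel since \(\phi '(z)=-z\phi (z)\)). A quick numerical check of both monotonicity and positivity, assuming SciPy for the normal CDF/PDF:

```python
import numpy as np
from scipy.stats import norm

def tau(z):
    # tau(z) = z*Phi(z) + phi(z); tau'(z) = Phi(z) >= 0, so tau is
    # non-decreasing; it is also strictly positive for all z.
    return z * norm.cdf(z) + norm.pdf(z)

z = np.linspace(-5.0, 5.0, 1001)
vals = tau(z)
```

Positivity of \(\tau \) is what makes \(b_{t}\) non-negative, and monotonicity transfers the inequality \(z_{t}\ge z'_{t}\) to the acquisition values.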

Remark

The above Theorems 1 and 2 cover an important case: when the peaks in the stable and unstable regions are approximately equal in height, a Bayesian optimization algorithm using the acquisition functions in (12) and (13) prefers the peak from the stable region. When a peak of the unstable region is higher than a peak of the stable region, the two terms \(m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) and \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-\sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\) act against each other, and their net difference decides whether the algorithm suggests a point from the stable region or the unstable region. Since the parameters \(\lambda \) and \(\omega \) are user specified, sufficiently large values of them always guarantee the suggestion of the stable peak. When a peak of the unstable region is lower than a peak of the stable region, both standard and stable Bayesian optimization select the stable peak.

In the two proposed acquisition functions, the aleatoric variance acts as a form of regularization and we penalize the instability by subtracting the term \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\). Another possible approach is to have the following regularization,
$$\begin{aligned} a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\frac{m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})+\kappa _{t}\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})}{\lambda \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})} \end{aligned}$$
in the case of the GP-UCB acquisition function, and
$$\begin{aligned} b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})={\left\{ \begin{array}{ll} v_{t}(z_{t}\varPhi (z_{t})+\phi (z_{t})) &{}\quad \text { if }\,v_{t}>0\\ 0 &{}\quad \text { if }\,v_{t}=0 \end{array}\right. } \end{aligned}$$
where \(z_{t}=\frac{m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-f(\mathbf {x}^{+})}{\omega \sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})v_{t}}\) and \(v_{t}=\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) for the expected improvement acquisition function. This approach follows an intuition similar to the proposed approach. It can be shown that, when the peaks in both stable and unstable regions are equal in height, these acquisition functions prefer the peaks in the stable regions rather than the unstable ones. Therefore, the algorithm using this approach should lead to similar performance to the proposed algorithm.
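For concreteness, the ratio-form variant of the UCB-style acquisition can be sketched as follows; the small \(\varepsilon \) guard against a vanishing denominator is our addition, not part of the formula above.

```python
import numpy as np

def stable_ucb_ratio(m, sigma, sigma_a, kappa=2.0, lam=1.0, eps=1e-9):
    # Alternative regularization: divide the GP-UCB score by the
    # aleatoric term instead of subtracting it. Points in unstable
    # regions (large sigma_a) are penalized multiplicatively.
    return (np.asarray(m) + kappa * np.asarray(sigma)) / (lam * np.asarray(sigma_a) + eps)
```

As with the subtractive form, two points with equal predictive mean and epistemic uncertainty are ranked by their aleatoric variance, with the stabler point winning.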

3.2.4 Computational complexity and convergence analysis:

In this section, we discuss the computational complexity and the convergence of the proposed stable Bayesian optimization algorithm.

Computational complexity Since the difference between our stable Bayesian optimization algorithm and the standard one is the acquisition function, we will focus our attention on the complexity analysis of acquisition function computation. In the standard Bayesian optimization algorithm, the complexity of UCB and EI for T observed datapoints is \(\mathcal {O}(T^{3})\). In our proposed acquisition functions (12) and (13), the complexity of computing the mean \(m_{T}(\mathbf {x},\varSigma _{\mathbf {x}})\), the epistemic variance \(\sigma _{T}(\mathbf {x},\varSigma _{\mathbf {x}})\) and the aleatoric variance \(\sigma _{T,a}(\mathbf {x},\varSigma _{\mathbf {x}})\) for T observations are all \(\mathcal {O}(T^{3})\). Therefore, our proposed algorithm has the same computational complexity as the standard Bayesian optimization algorithm.

Convergence From the definitions of the acquisition functions in (12) and (13), we can see that the main difference between the proposed acquisition functions and UCB/EI is the appearance of the aleatoric variance term \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\). After a sufficiently large number of iterations \(T_{0}\), the Gaussian process models the function \(f(\mathbf {x})\) fairly accurately and the aleatoric variance becomes nearly a fixed function. The addition of the aleatoric variance term to the acquisition function can then be interpreted as constrained optimization of the blackbox function \(f(\mathbf {x})\) under the constraint that the aleatoric variance is smaller than a specified value. This problem has been well studied theoretically and shown to be effective in practice in the work of Gelbart et al. [6].
Fig. 3

a The synthetic function with one stable peak and multiple spurious peaks. b The STABLE-UCB acquisition function and the aleatoric variance after 30 iterations

4 Experiments

In this section, we experiment on a set of synthetic and real datasets to demonstrate the efficacy of our stable Bayesian optimization. Experiments with a synthetic dataset show the behavior of our proposed method on a known, complex function with multiple sharp peaks and one stable peak. We also conduct experiments with several hyperparameter tuning problems to show the utility of our method for real-world applications.

4.1 Baseline method and evaluation measures

We compare stable Bayesian optimization using the proposed acquisition functions (STABLE-UCB and STABLE-EI) with standard Bayesian optimization using the UCB acquisition function (BO-UCB) and the EI acquisition function (BO-EI), respectively. On synthetic data, we compare STABLE-UCB and STABLE-EI with the corresponding baselines in two aspects: ‘the maximum value found’ and ‘the number of times an algorithm visits around the highest stable peak’1, both with respect to the number of iterations. On real data, we show the performance of stable Bayesian optimization and standard Bayesian optimization on both validation and test sets. For a fair comparison, we compare STABLE-UCB with standard UCB and STABLE-EI with standard EI.
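The second measure reduces to a one-line helper. The sketch below hard-codes the highest-stable-peak region \(0\le \mathbf {x}\le 0.125\) used for the synthetic function (see footnote 1); the helper name is ours.

```python
import numpy as np

def stable_visit_rate(suggestions, lo=0.0, hi=0.125):
    # Fraction of BO suggestions that land in the highest-stable-peak
    # region [lo, hi] of the 1-D synthetic function.
    s = np.asarray(suggestions, dtype=float)
    return float(np.mean((s >= lo) & (s <= hi)))
```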

4.2 Experiments with the synthetic function

4.2.1 Data generation:

The synthetic function \(f(\mathbf {x})\) is generated using a squared exponential kernel with two different parameters (see Fig. 3a). The stable region is created using a squared exponential kernel with length scale 0.2. The unstable region is generated using a squared exponential kernel with length scale 0.01 to simulate spurious peaks. The unstable region of \(f(\mathbf {x})\) is \(0.7\le \mathbf {x}\le 1.1\), and the rest is the stable region.
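A sketch of this construction, drawing GP sample paths with the two length scales and splicing them over the stated regions; the seed, grid, and amplitudes are illustrative, as the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_sample(x, ls):
    # Draw one sample path from a zero-mean GP with a squared
    # exponential kernel of length scale ls, with jitter for stability.
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ls**2))
    return rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))

x = np.linspace(0.0, 1.5, 301)
f_vals = gp_sample(x, ls=0.2)                 # smooth, stable landscape
mask = (x >= 0.7) & (x <= 1.1)                # unstable region
f_vals[mask] = gp_sample(x, ls=0.01)[mask]    # splice in spiky sample
```

The spliced region varies far more rapidly between neighboring grid points than the stable region, reproducing the spurious-peak structure of Fig. 3a.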
Fig. 4

Performance of stable Bayesian optimization and standard Bayesian optimization with respect to the number of iterations on synthetic function. a, c show that STABLE-UCB and STABLE-EI converge to 2.3 and 2.5 (stable peaks), whereas BO-UCB and BO-EI converge to 3.7 (a spurious peak). b, d show that stable Bayesian optimization reaches the stable peak more often than the baseline

Fig. 5

Sampling behavior of both STABLE-UCB and BO-UCB for hyperparameter tuning of SVM for letter classification, a on the validation dataset and b on the test dataset. The background portrays the performance function with respect to the hyperparameters. The spurious region (region 1) is evident for the validation dataset but vanishes for the test set, while the stable region (region 2) remains

4.2.2 Experimental results:

We initialize Bayesian optimization with 2 random observations. Figure 3b illustrates the value of the STABLE-UCB acquisition function and the aleatoric variance after 30 iterations. In the unstable region, the STABLE-UCB acquisition function has a smaller value than in the stable region, because the high aleatoric variance captures the instability. We observe similar results for the STABLE-EI acquisition function.

Figure 4 depicts the result of the optimization, comparing standard BO with the proposed stable variants. Figure 4a shows the ‘maximum value found’ using STABLE-UCB, averaged over 50 different initializations. After 50 iterations, the proposed STABLE-UCB converges to an averaged maximum value of around 2.3, while BO-UCB converges to 3.7. This is because STABLE-UCB tends to converge to the stable region, unlike BO-UCB, which converges to the unstable region. The number of times STABLE-UCB and BO-UCB visit around the highest stable peak is compared in Fig. 4b. In 20 iterations, STABLE-UCB and BO-UCB visit around the highest stable peak 96% and 34% of the time, respectively. In 60 iterations, the proposed STABLE-UCB visits around the highest stable peak more than 84% of the time, whereas this number for BO-UCB is only 6%, illustrating the better stability behavior of STABLE-UCB. The pattern is similar for STABLE-EI. Figure 4c illustrates the ‘maximum value found’ using STABLE-EI. After 50 iterations, STABLE-EI converges to 2.5, while BO-EI converges to 3.7. After 60 iterations, BO-EI cannot find the stable peak, while STABLE-EI reaches the stable peak around 60% of the time (Fig. 4d).
Fig. 6

Performance of stable Bayesian optimization and standard Bayesian optimization using SVM with respect to number of iterations on letter dataset. The performance of standard Bayesian optimization, due to convergence to spurious peaks on the validation set (Fig 6a, c), degrades for the test set (Fig 6b, d)

Fig. 7

Performance of stable Bayesian optimization and standard Bayesian optimization using SVM with respect to number of iterations on glass dataset. The performance of standard Bayesian optimization, due to convergence to spurious peaks on the validation set (Fig 7a, c), degrades for the test set (Fig 7b, d)

4.3 Experiments with hyperparameter tuning problems

4.3.1 Dataset:

We use the letter and glass classification datasets from the UCI machine learning repository.2 The letter dataset contains 20,000 datapoints describing image characteristics of the 26 capital letters of the English alphabet. Since spurious peaks occur mostly when the training set or the validation set has a limited number of datapoints, we sample only 200 datapoints from the letter dataset. The glass dataset consists of 214 datapoints represented using 10 features related to glass properties. Both datasets are divided into training, validation and test sets. The validation set accuracy is used as the objective for optimization, and the test set accuracy is used as the performance measure.

4.3.2 Experimental results with support vector machine:

Support vector machine (SVM) is a popular machine learning algorithm for classification problems. The two main hyperparameters of an SVM with an RBF kernel are C and \(\gamma \), which represent the misclassification trade-off and the RBF kernel parameter, respectively. We apply both stable Bayesian optimization and standard Bayesian optimization to tune C and \(\gamma \). In the experiments, the objective function \(f(\mathbf {x})\) is the validation set accuracy and the vector \(\mathbf {x}\) represents the hyperparameters C and \(\gamma \). Performance on the test set is used to compare the proposed method and the baseline.
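The tuning objective can be sketched as below. Since the UCI letter/glass data are not bundled here, a synthetic multi-class stand-in is used, and searching \(C\) and \(\gamma \) in log space is our choice, not stated by the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small synthetic stand-in for the UCI data (the paper uses letter/glass).
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def objective(log_C, log_gamma):
    # f(x): validation accuracy of an RBF-kernel SVM at hyperparameters
    # x = (C, gamma), parameterized in log10 space.
    clf = SVC(C=10.0**log_C, gamma=10.0**log_gamma, kernel="rbf")
    clf.fit(X_tr, y_tr)
    return clf.score(X_val, y_val)
```

This `objective` is the blackbox \(f(\mathbf {x})\) handed to stable Bayesian optimization; test set accuracy would be computed once, at the end, with the selected hyperparameters.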

Figure 5 shows the peaks converged to by the proposed STABLE-UCB and by BO-UCB over 30 different initializations. As seen from the figure, BO-UCB converges to spurious peaks considerably more often than STABLE-UCB. This behavior leads to the accuracy results shown in Fig. 6. Figure 6a shows the performance of STABLE-UCB and BO-UCB on the validation set. We note that this is a multi-class classification task, hence a random classifier would have a mean accuracy of only \(1/26=0.0385\). After 120 iterations, STABLE-UCB’s best accuracy on the validation set is 0.35, whereas BO-UCB’s best is 0.375. However, when we move to the test set and compare the two methods using the hyperparameters optimized on the validation set, we find that STABLE-UCB’s performance is higher than BO-UCB’s (see Fig. 6b). After 120 iterations, STABLE-UCB’s performance remains high at 0.44, whereas BO-UCB reaches only up to 0.41. Figure 6c shows the performance of STABLE-EI and BO-EI on the validation set of the letter dataset. After 140 iterations, STABLE-EI reaches 0.35, whereas BO-EI reaches 0.36. However, on the test set, the performance of STABLE-EI remains at 0.43, whereas BO-EI’s best accuracy is 0.41 after 140 iterations (see Fig. 6d).

We observed similar behavior of the two algorithms on the glass dataset (Fig. 7). On the validation set, although BO-UCB and BO-EI converge to a higher accuracy (both 0.58) than STABLE-UCB and STABLE-EI (both 0.56), on the test set the accuracy of stable Bayesian optimization stays above 0.56, compared to under 0.52 for standard Bayesian optimization.

Our experiments with SVM hyperparameter tuning demonstrate that spurious peaks indeed abound in the case of small training and validation sets. The proposed stable Bayesian optimization successfully reduces convergence to such peaks.

5 Conclusion

We proposed a stable Bayesian optimization framework to find stable solutions for hyperparameter tuning. We discussed the notion of stability and presented a modified Gaussian process model in the presence of noisy inputs. We constructed two novel acquisition functions based on the epistemic and aleatoric variances of the modified Gaussian process estimates. The aleatoric variance becomes high in unstable regions around spurious narrow peaks and thus offers a way to guide the function optimization toward stable regions. We theoretically showed that our proposed acquisition functions favor stable regions over unstable ones. We discussed the computational complexity and convergence of our proposed algorithm. Through experiments with both synthetic function optimization and hyperparameter tuning for an SVM classifier, we demonstrated the utility of our proposed framework.

The proposed stable Bayesian optimization has advantages over standard Bayesian optimization when the number of datapoints is limited. Thus, it can be applied in real-world domains such as health care, bioinformatics, and material design. It also opens promising directions for future work, such as privacy-preserving stable Bayesian optimization and multi-objective stable Bayesian optimization.

Footnotes

  1. The highest stable peak region is \(0\le \mathbf {x}\le 0.125\).
  2.

Acknowledgements

This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

Compliance with ethical standards

Conflicts of interest

All the authors declare that they have no conflict of interest.

References

  1. Azimi, J., Fern, A., Fern, X.Z.: Batch Bayesian optimization via simulation matching. Adv. Neural Inf. Process. Syst. 1, 109–117 (2010)
  2. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
  3. Bull, A.D.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)
  4. Chen, B., Castro, R., Krause, A.: Joint optimization and variable selection of high-dimensional Gaussian processes. arXiv preprint arXiv:1206.6396 (2012)
  5. Garnett, R., Osborne, M.A., Roberts, S.J.: Bayesian optimization for sensor set selection. In: IPSN (2010)
  6. Gelbart, M.A., Snoek, J., Adams, R.P.: Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607 (2014)
  7. Girard, A., Murray-Smith, R.: Gaussian processes: prediction at a noisy input and application to iterative multiple-step ahead forecasting of time-series. In: Murray-Smith, R., Shorten, R. (eds.) Switching and Learning in Feedback Systems, pp. 158–184. Springer, Berlin (2005)
  8. Jones, D.R., Perttunen, C.D., Stuckman, B.E.: Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl. 79(1), 157–181 (1993)
  9. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)
  10. Joy, T.T., Rana, S., Gupta, S.K., Venkatesh, S.: Flexible transfer learning framework for Bayesian optimisation. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 102–114. Springer, Berlin (2016)
  11. Laumanns, M., Ocenasek, J.: Bayesian optimization algorithms for multi-objective optimization. In: PPSN (2002)
  12. Lizotte, D.J., Wang, T., Bowling, M.H., Schuurmans, D.: Automatic gait optimization with Gaussian process regression. IJCAI 7, 944–949 (2007)
  13. Martinez-Cantin, et al.: A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton. Robots 27(2), 93–103 (2009)
  14. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. Towards Glob. Optim. 2(117–129), 2 (1978)
  15. Nguyen, T.D., Gupta, S., Rana, S., Venkatesh, S.: Stable Bayesian optimization. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 578–591. Springer, Berlin (2017)
  16. Nguyen, V., Rana, S., Gupta, S.K., Li, C., Venkatesh, S.: Budgeted batch Bayesian optimization. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1107–1112. IEEE (2016)
  17. Rasmussen, C.E.: Gaussian processes for machine learning. Citeseer (2006)
  18. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: NIPS, pp. 2951–2959 (2012)
  19. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., Adams, R.: Scalable Bayesian optimization using deep neural networks. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2171–2180 (2015)
  20. Srinivas, N., Krause, A., Seeger, M., Kakade, S.M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: ICML (2010)
  21. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: ACM SIGKDD (2013)
  22. Wang, Z., de Freitas, N.: Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758 (2014)
  23. Xue, D., et al.: Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 7, 11241 (2016)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Thanh Dai Nguyen1
  • Sunil Gupta1
  • Santu Rana1
  • Svetha Venkatesh1

  1. Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Geelong, Australia
