High-dimensional Bayesian optimization with projections using quantile Gaussian processes

Key challenges of Bayesian optimization in high dimensions are both learning the response surface and optimizing an acquisition function. The acquisition function selects a new point to evaluate the black-box function. Both challenges can be addressed by making simplifying assumptions, such as additivity or intrinsic lower dimensionality of the expensive objective. In this article, we exploit the effective lower dimensionality with axis-aligned projections and optimize on a partitioning of the input space. Axis-aligned projections introduce a multiplicity of outputs for a single input that we refer to as inconsistency. We model inconsistencies with a Gaussian process (GP) derived from quantile regression. We show that the quantile GP and the partitioning of the input space increases data-efficiency. In particular, by modeling only a quantile function, we overcome issues of GP hyper-parameter learning in the presence of inconsistencies.


Introduction
Studies in robotics, machine learning, software development, recommendation systems and medicine are governed by design and parameter choices. For instance, changing gait properties of legged robots leads to different robustness and walking speed performances [5]. Adjusting controller configurations in drones results in reduced feedback-error of closed loop systems [2]. Clever tuning of hyper-parameters settings in multi-layer convolutional neural networks speeds up learning and lowers generalization error [19]. Since the process of evaluating the performance of each parameter configuration can be time consuming, automatically finding configurations that yield optimal performance is notoriously hard and demands data-efficient approaches. 1 In an ideal setting, we are interested in a globally optimal solution.
A promising algorithm for data-efficient optimization is Bayesian optimization (BayesOpt) [7,12,14,16] where a surrogate function is built for predicting the performance of a set of parameters. The exploration/exploitation trade-off to obtain a globally optimal solution is taken care of by an acquisition function. This approach has proven successful in various fields such as movie recommendation systems [21], parameter estimation of biological models [23] and automatic algorithm configuration [10].
The key to the success of BayesOpt relies on two steps: (i) using previous experiments to train statistical surrogate models and then (ii) using this model within an acquisition function to find input locations that yield the best added value for the optimization. These steps, however, come at a computational cost that remains often small in low dimensions. However, in high dimensions, we typically need a significantly larger number of experiments to find a good statistical model of the true objective function. Optimizing the acquisition function requires many evaluations of the surrogate model and constitutes a computational bottleneck in BayesOpt.
To reduce the negative effects of the high dimensionality on BayesOpt, Wang et al. [24] assume that the objective lives only on a linear subspace of the input domain that is d-dimensional with d D, where D is the original dimensionality. Strong theoretical results show that performing BayesOpt under these assumptions is equivalent to learning and optimizing the true objective on a random embedding. However, robust implementation requires further deftness to account for box-constraints and non-injectivity of the mapping from the embedding to the original domain. Kandasamy et al. [13] decompose the objective f as a sum of z-independent lower-dimensional function components f 1 , . . . , f z defined on orthogonal domains (1) , . . . , (z) of dimensionality at most d. Each subproblem is optimized independently using the Upper Confidence Bound (UCB) acquisition function [20]. This approach scales the optimization of the acquisition linearly in the number of components by including specific structural assumptions about f . Ulmasov et al. [23] propose a different decomposition of the input by randomly selecting subsets of the input parameters and assigning a different model for each subset. Each model is then trained with a separate dataset to address the problem of inconsistencies introduced by axis-aligned projections.
In this article, we propose a scalable method based on projections that overcomes issues inherent in axis-aligned projections. We formulate an optimization approach (a) (b) Fig. 1 a Vanilla GP hyper-parameter learning can result in unrealistic estimates and largely useless generalization capabilities. This is a clear sign of underfitting where the data with inconsistency is modeled as noise. The resulting acquisition function from this model will favor pure exploitation leading to pre-mature convergence of the algorithm. b The Quantile GP overcomes the issue of inconsistencies by explicitly modeling only a quantile function based on independent sub-problems for each subset of d D parameters. We address the issue of inconsistencies with a sensible choice of the model by using a quantile GP (QGP). The key idea behind the QGP is that we only model a lower τ -quantile of function values, which also allows us to retain a simpler explanation of the data. The QGP maintains well-calibrated confidence bounds in posterior predictions in the presence of noisy multiplicity of outputs see Fig. 1.

Problem setting
We consider the problem of finding a minimizer θ * = argmin θ∈ f (θ ) (1) of an unknown function f : Θ ⊂ R D → R defined on a D-dimensional parameter space Θ that we will assume to be a bounded hypercube Θ = [0, 1] D . We further assume that observations y of f are corrupted by i.i.d. Gaussian noise, i.e., with unknown variance σ 2 n . Gradient information of the black-box function f with respect to the inputs θ ∈ Θ are not available, and we cannot make any convexity assumptions.

Gaussian processes
In this article, we consider a Gaussian process (GP) as the probabilistic surrogate for the black-box objective function we seek to optimize. GPs allow for deriving a posterior of f in a fully Bayesian framework and expressing the uncertainty in its estimation through well calibrated error bars. GPs are commonly used in machine learning to express prior assumptions in nonlinear regression problems and have proven successful in Bayesian optimization [18], learning control [6], probabilistic numerics [9], and time series-predictions [8]. When a function f : → R is distributed as a GP we write f ∼ GP(m(·), k(·, ·)). A GP is fully determined by a mean function m : → R and a covariance function or kernel k : × → R, which encodes high-level properties and characteristics of the function as mean and covariance structure between function values. Common choices of covariance functions for BayesOpt include the squared exponential and Matern 52 kernels [7,17] given by respectively, where l > 0 is a lengthscale parameter. These are stationary kernels, i.e., they only depend on the distance r = θ − θ between two inputs θ and θ . With a Gaussian likelihood we obtain a closed-form posterior distribution

Bayesian optimization
Bayesian optimization is performed in two steps: (i) we train a statistical model p( f |D t ), usually a GP, of the latent function f based on observations collected up to iteration t; (ii) we select inputs θ * t+1 by maximizing an acquisition function, α : Θ → R that trades off exploration and exploitation, such that The acquisition function formalizes the notion of how useful each input parameter θ is for optimization of f . Given a probabilistic model for f , the acquisition function automatically trades off low expected function values (exploitation) with regions where the latent f has high estimated uncertainty (exploration). Different acquisition functions have been defined that characterize different exploration-exploitation trade-offs for global optimization. Examples are Expected improvement (EI): where f (θ − ) = min f (θ 1:t ) is the current minimum observed in the exploration history, and Z = ( f (θ − ) − μ(θ ))/σ (θ). Moreover, Φ and φ are the standard cumulative and normal density, respectively. β t in UCB accounts for the trade off between exploration and exploitation. We refer to [18] for a more detailed overview on acquisition functions. Increasing the dimensionality D of the input space raises a series of challenges. The number of evaluations required to cover the search space increases exponentially in D. This makes GP learning a computationally demanding task. GPs, as common choices for probabilistic modeling of the response surface, scale cubically in the number of data points for training and quadratically for predicting [17]. Computational demands of acquisition function optimizers also become significant. Heuristic strategies based on multi-start methods [4], and dividing rectangle searches (DiRect) [11] require an exponential number of evaluations of the response surface with respect to D.

High-dimensional Bayesian optimization with projections
We propose a novel Bayesian optimization algorithm based on axis-aligned projections that uses quantile regression models for learning a low-dimensional projection of the response surface. Under the assumption that the black-box function is effectively lower-dimensional, projections onto d-dimensional features tackle the curse of dimensionality for both the learning of the response surface and the maximization of the acquisition function.
We select z possible projections that partition the D-dimensional input space, such that ∪ i (i) = , and ∩ i (i) = ∅. This convention allows us to partition the dimensions into a maximum of z projections, or components. We then define the projection as a set of d coordinates, pr oj(i) = {p i,1 , . . . , p i,d }, to select from the original input space for i = 1, . . . , z. Given an input θ = [θ 1 , . . . , θ D ] ∈ R D , we identify its i-th projection as θ (i) = [θ p i,1 , . . . , θ p i,d ] ∈ R d , that is from the high-dimensional vector of parameters we select the components with indices p i,1 , .., p i,d . The projected vector is defined in the low-dimensional space d and we say that this d-dimensional space is defined by the projection pr oj(i). When performing optimization from projected data we consider the data set { (i) , Y} with lower-dimensional inputs θ (i) .
A standard GP would model this multiplicity of outputs as additive Gaussian noise. This modeling may result in a mis-interpretation of data as unstructured noise. Figure 1 shows an example of this shortcoming.
For our purpose of obtaining reliable posterior predictions after training, we are interested in removing such inconsistencies, e.g., by automatically selecting only the best (lowest) observations for each parameter sub-configuration. Selecting extreme (small) observations is intuitive in our minimization context, and we validate this choice with empirical results. Quantile regression provides a method for function Algorithm 1: Main steps of a Bayesian optimization algorithm with projections. The inner loop iterates over the z components and trains a different QGP model for each projection. The training of each QGP model in line 5 is performed by maximizing the marginal likelihood in equation (10) and selects the kernel lengthscale parameters. The update in line (8) in the outer loop concatenates all the selected updates θ (i) t+1 .

Algorithm 1 Quantile-GP BayesOpt
: end for estimation that effectively embodies this notion of automatic selection. We detail the steps of our method with quantile GP models for each projection in Algorithm 1.

Quantile GP regression
We are interested in modeling a proportion of the data with a GP. This proportion is referred to as quantile, τ , and defines the probability P(y ≤ μ τ ) = τ of observations y to be below the functional estimate μ τ [22]. The basic intuition behind quantile regression is that minimizing the l1-loss function, N i=1 |y i − μ τ (θ i )| yields a functional estimator of the median, which corresponds to quantile τ = 0.5. For an arbitrary τ , direct estimation of the quantile functions is obtained by minimizing a tilted loss function (pinball loss) where ξ = y i − μ τ (θ i ). The regression problem that optimizes the cumulative loss N i=1 l τ (y i − μ τ (θ i )) consistently produces τ -th quantile function estimates [22]. In our predictions, we model the uncertainty of our estimate μ τ as a posterior probability over function values derived in a Bayesian framework. The quantile-GP model (QGP) [3] allows for such a formulation and computation of posterior predictions through approximate inference via Expectation Propagation (EP) [15]. We introduce a standard GP prior over quantile functions, i.e., μ τ ∼ GP(m, k), and reformulate the tilted loss (9) in terms of a renormalized reward R(y i , μ τ ) = Z i exp (l τ (μ τ , y i )), where Z i is a normalizing constant, and R(y i , μ τ ) evaluates the Asymmetric Laplace Distribution (ALD). The basic intuition behind this definition is also displayed in Fig. 2 for a quantile τ = 0.1. The reward represents the likelihood p(y i |μ τ , θ i , θ G P ) of each input θ i and model hyper-parameters θ G P , which includes the lengthscales of the kernel. We perform training of QGP hyper-parameters via a type-II maximum likelihood approach [17], that is we select the lengthscale parameters of the kernel by maximizing the marginal likelihood at each BayesOpt iteration, i.e.
The integral (10) is intractable and we approximate it via EP. Expectation Propagation [15] is an approximate inference method that expresses the likelihood, p(y|μ τ , N , θ G P ), with a product of unnormalized Gaussian distributions in the latent variable, μ τ , called local likelihood approximationsπ i =Z i N (μ τ ;μ i ,σ 2 i ). In our setting, the rewards are independent for each y i so that the model likelihood p(y|μ τ , N , θ G P ) factorizes as N i=1 R(μ τ , y i ). The EP approximates each of these N -likelihood factors with a local Gaussian approximation, we therefore apply an approximation with N local likelihoods. Each local approximation is characterized by the site parameters:μ i ,σ 2 i , for i = 1, . . . , N , where the effect of the normailzation constants,Z i , on the marginal likelihood can be expressed as a function of the site parameters [17]. These are are the targets of the EP algorithm and are updated in an iterative process until convergence. The convergence guarantee for the EP procedure has not been proven but rather conjectured [17] for log-concave likelihoods such as ALD and has been reported that EP works relatively well for GP models. After convergence, each local approximationπ i will contribute to the posterior as the original likelihood in p(y|μ τ , N , θ G P ), still retaining nice properties of analytical integration against Gaussian distributions. Algorithm 2 shows the main steps involved in the approximate inference procedure. 2 For a further explanation of the EP steps the reader could refer to [15,17]. Algorithm 2: In each update of site i, the EP procedure substitutes the intractable likelihood with the cavity distribution π −i . Line 5 computes the posterior, q −i , analytically. EP then applies the contribution of the i-th original likelihood to q −i and projects the i-th hybrid distribution, h i , to an un-normalized Gaussian, q i , via moment matching (by minimizing KL(h i q i )). The local approximation is then obtained removing the cavity term.

Experiments
In this section, we assess the quantile GP model for Bayesian optimization on axisaligned projections. In our analysis, we dedicate a set of experiments to validate our choice of the quantile τ with empirical evidence. In high-dimensional settings, we test our approach on the commonly assumed additivity property by imposing and violating this assumption. In addition, we include an empirical analysis of performances when the axis-aligned assumption is violated, that is under an arbitrary rotation of the original domain.
Performing BayesOpt on subsets of dimensions, we define fixed groups and update each partition component in parallel during one optimization step. We avoid over-fitting to a single acquisition function comparing performances across a set of acquisitions: expected improvement (EI), [16], upper confidence bound (UCB) [20], and probability of improvement (PI) [14]. For the Gaussian process model in each baseline, we select the Matern 52 kernel.

Sensitivity analysis
The use of a QGP introduces τ as an additional hyper-parameter. This value models the proportion of observations that are modeled by the QGP and, consequently, the shape Here, we evaluate the sensitivity of BayesOpt to different choices of the quantile parameter τ . We restrict our selection of quantiles to a maximum of τ = 0.5 since we require our model to be sensible with respect to low observations. Intuitively, in a minimization setting, proportions of the data below the median represent good indicators of the location of a minimum and we therefore expect performances to deteriorate for τ > 0.5. 3 We use the Hartmann benchmark function, which has total of six effective dimensions, and lift it into a high-dimensional input space of dimensionality D = 60. Relevant dimensions are distributed uniformly at random over the 60 dimensions, and care is taken to ensure that all relevant dimensions are not contained in the same group. Figure 3 shows the progression of the best (lowest) observations collected during the independent optimization runs. Error bars represent twice the standard error over runs from different initializations. We evaluate performances in terms of best observed value at termination of the algorithm and data efficiency of each baseline which denotes the steepness of the descent in the succession towards the optimum. The collected results show good performances for moderate values of the quantile such as 0.1, 0.2, 0.3 while extreme values such as 0.5 and 0.01 retain a much slower descent. We observe that the extremely small quantile tends to overfit to lowest observations and reduces generalization capabilities. This renders exploration of the BayesOpt algorithm expensive in the number of function evaluations and increases the number of local optima of the optimization landscape. Large quantiles also correspond to poor performances. Selecting τ = 0.5, the QGP models the median which is sensible with respect to mid-range outputs. The lowest observations are treated as outliers and the resulting response surface landscape fails to capture downhill slopes relevant for global minimization. In our set of experiments, modeling the 0.1 proportion of observations proves effective, and we identify the choice of τ = 0.1 as our best configuration for the remaining experiments on synthetic data. In the subsequent we also introduce a Gamma hyper-prior for the lengthscales Ga(shape = 1, scale = 1) in each baseline to enforce exploration during optimization.

Additive high-dimensional objectives
In our second experiment, we assess the scalability of BayesOpt with Quantile GPs (QGP) by comparing to a set of baselines for high-dimensional optimization. We include random embeddings (REMBO) [24], random search (RS) [1], additive models (Add-GP) [13] and Lipschitz continuous optimization with GP (GP-Lip).
The algorithm REMBO: Random EMbeddings Bayesian Optimization [24] performs standard Bayesian optimization in a low-dimensional box constraint (embedding), i.e. z ∈ [z min , z max ] d and then projects the selected location to the original input space via a linear mapping θ = Az. The matrix A has entries sampled from standard Gaussian, i.e.
[A] i, j ∼ N (0, 1) and the values z min , z max are provided such that the optimum is contained in the embedded space with arbitrarily high probability.
Random Search [1] simply selects a set of T locations {θ } 0:T −1 uniformly at random in the high-dimensional input space and evaluates the objective functions on these locations without any adaptive search strategy.
Add-GP [13] learns a d-dimensional Gaussian process for each addend of the sum f 1 , . . . , f z . Each component then independently optimizes an acquisition function and updates the corresponding set of d-coordinates.
The last baseline GP-Lip manually applies axis-aligned projections in a partition of the input space and resolves the inconsistencies by selecting lowest observations in pairs of points that violate Lipschitz continuity assumption. It then learns plain GP response surface (instead of QGP) in each axis-aligned projection. This baseline compares with the automatic selection applied implicitly by the QGP in the presence of inconsistencies. The Lipschitz constant is the maximum element of a set of 5 · 10 6 gradients evaluated on a random selection of input locations for each benchmark function. We choose the Michalewicz function as a benchmark, which has effective dimensionality d e f = 10 and satisfies the additivity assumption. It is a sum of one-dimensional components f i , each of which is defined as f i (θ i ) = − sin(θ i ) sin 2m iθ 2 i π with parameter m = 0.5 for i = 1, . . . , 10. To assess scalability to high dimensions, we test the optimization in a D = 100-dimensional input space and optimize components of dimensionality d = 10, where the relevant dimensions are distributed uniformly randomly (with replacement) across the 10 components of the partition by enforcing that all the relevant dimensions are not contained in a single component. In this experiment we emphasize that Add-GP conforms to the properties of the objective function and is therefore a reference baseline for good performances. Figure 4 shows the progression of all optimization algorithms. Overall, we see that optimization with axis-aligned projections with the QGP model is an effective and competitive method for Bayesian optimization when f is decomposable as sum of low-dimensional components. We also note that both QGP and Lip-GP show consistent gap in data efficiency and optimization, which motivates the QGP as a model in the presence of inconsistencies for effective optimization along projections.

Non-additive high-dimensional objective
This experiment analyzes the performance of the QGP BayesOpt in a highdimensional search space, when we no longer make any assumptions on additive decomposability of the objective. More specifically, we define the objective, f (θ ) = 10 sin θ 1 10 i=1 sin(θ i ), with effective dimensionality of 10. We optimize 10dimensional components in a fixed partition of a 100-dimensional input space avoiding condensing all relevant dimensions in a single group. Figure 5 shows that QGP model attains the best observation at termination with respect to other baselines in both EI and UCB acquisition functions. For PI the QGP model shows a slower progression than Add-GP and GP-Lip during the early stages of optimization. Other baselines, such as REMBO, flatten out quite early. We explain poor performances of the REMBO baseline by noting that it performs exploration only on a d-dimensional space. Using a linear mapping it can only span at most d directions in the D-dimensional space, and this heavily restricts exploration. We observe that even relaxing assumptions on additivity, the additive model still maintains similar performances both in terms of progress and value at termination for most acquisition functions considered remaining however suboptimal on exploration with expected improvement and upper confidence bound. Overall the QGP results are competitive for different properties of the black-box function and prove robust with respect to model hyper-parameter τ . These results highlight the QGP as a good model for optimization with projected data.

Rotated high-dimensional objective
Our last experiment analyses the performances of the QGP-BayesOpt approach under arbitrary rotation of the high-dimensional domain. In particular, we consider the rotated additive Michalewicz benchmark function, g(θ) = f (Uθ ), where U ∈ R D×D is an  N (0, 1), and f is defined as in Sect. 5.2. We maintain the same model selection procedure for the GP models in each baseline i.e. with Gamma hyper-prior Ga(shape=1, scale=1) on the kernel lengthscales and the marginal likelihood defined as in Eq. (10). Figure 6 shows the results obtained with each baseline. Performances of both axis-aligned projections-based baselines, namely QGP model and Add-GP, clearly deteriorate w.r.t. the original experiment in Sect. 5.2. The random projection-based baseline REMBO, instead, shows steeper descent and better optimum at termination of optimization. Moreover, lengthscale kernel-parameters for the QGP model become shorter. In fact, by averaging over the 20 random initializations and the 250 iterations, we observe that at least 4 95% of lengthscales have decreased from the axis-aligned experiment. The variance (over random initializations and iterations) of the lengthscale parameters also becomes smaller. In particular, the average variance (averaging over all lengthscales) for the rotated objective experiment decreases by a factor 5 of 0.014. This is a sign of a more highly nonlinear response surface, characterized by many local minima and therefore harder to optimize. The QGP model, however, still retains best performances also w.r.t. random projection-based methods on all acquisitions.

Conclusion
We proposed a framework for scaling Bayesian optimization to high dimensions by using axis-aligned projections. We considered a quantile regression approach that allows for generalizations from projected data and we empirically showed low sensitivity of QGP-BayesOpt w.r.t. to the choice of the quantile parameter τ . Based on experimental results, we argue that modeling extreme functions from projected data maintains good indicators of the optimum location. One observation is that QGP BayesOpt features sensible modeling of the response surface from unstructured data and has an effective update strategy on all. We acknowledge that careful modeling and corresponding complexity of the learning is also an important trade-off to consider. The QGP approximates the GP posterior with EP, which becomes computationally involved for a large number of data points.
To address this downside, future work will tackle computational efficiency with sparse GP methods and extend applicability to a large number of data points in short time. Future work will also investigate whether to concentrate BayesOpt updates on projections that matter and neglect those that leave f unchanged. Analysis of the GP hyper-parameters could allow for introducing an on-line selection strategy based on the optimization history.