High-dimensional Bayesian optimization using low-dimensional feature spaces

Bayesian optimization (BO) is a powerful approach for seeking the global optimum of expensive black-box functions and has proven successful for ﬁne tuning hyper-parameters of machine learning models. However, in practice, BO is typically limited to optimizing 10–20 parameters. To scale BO to high dimensions, we normally make structural assumptions on the decomposition of the objective and/or exploit the intrinsic lower dimensionality of the problem, e.g. by using linear projections. The limitation of aforementioned approaches is the assumption of a linear subspace. We could achieve a higher compression rate with nonlinear projections, but learning these nonlinear embeddings typically requires much data. This contradicts the BO objective of a relatively small evaluation budget. To address this challenge, we propose to learn a low-dimensional feature space jointly with (a) the response surface and (b) a reconstruction mapping. Our approach allows for optimization of the acquisition function in the lower-dimensional subspace. We reconstruct the original parameter space from the lower-dimensional subspace for evaluating the black-box function. For meaningful exploration, we solve a constrained optimization problem.


Introduction
Bayesian optimization (BO) is a useful model-based approach to global optimization of black-box functions that are expensive to evaluate [25,28,38].This sample-efficient technique for optimization has proven effective in experimental design of machine learning algorithms [5], robotics applications [9] and medical therapies [50] for optimization of spinal-cord electro-stimulation.Despite its great success, BO is practically limited to optimizing 10-20 parameters, and a large body of literature has been devoted to address scalability issues to elevate BO to high-dimensional optimization problems, such as discovery of chemical compounds [16] or automatic software configuration [23].
The standard BO routine consists of two key steps: (i) estimating the black-box function from data through a probabilistic surrogate model, usually a Gaussian process (GP), referred to as the response surface; (ii) maximizing an acquisition function that computes a score that trades off exploration and exploitation according to uncertainty and optimality of the response surface.As the dimensionality of the input space increases, these two steps become challenging.The sample complexity to ensure good coverage of inputs for learning the response surface is exponential in the number of dimensions [48].With only a small evaluation budget the learned response surface and the resulting acquisition function are characterized by vast flat regions interspersed with highly non convex landscapes [41].This renders the maximization of the acquisition inherently hard [14].
However, high-dimensional data often possesses a lower intrinsic dimensionality, which can also be exploited for optimization.A feature mapping can then used to map the original D-dimensional data onto a d D-dimensional manifold.For example, in [54], the authors used random linear mappings in the context of BO to reduce dimensionality.Similar approaches use linear dimensionality reduction Figure 1: Model for Bayesian optimization on data manifolds, jointly solving two distinct tasks: (i) a regression from feature space to observations (in blue) and (ii) a reconstruction mapping from feature space to highdimensional space (in red).
drive exploration in BO to actively learn this linear embedding [14].While these methods perform well in practice they are restricted to linear subspaces of the original domain.With nonlinear embeddings higher compression rates would be possible.
A generalization to nonlinear subspaces was proposed in [16,18,29] and [20].In [16], a low-dimensional data representation is learned with variational autoencoders (VAEs).A characteristic of this approach is that the required amount of data necessary for learning a meaningful representation substantially exceeds small evaluation budgets that often constrain BO.However, in the specific application of automatic discovery of molecules, where libraries of existing compounds are available prior to optimization, this approach makes much sense.VAE models [35] were used to propagate uncertainty of latent space representations through the response surface model with Gaussian process latent variable models [31,30,51,32].However, in [35], the latent space representation is not learned specifically for the regression task.Gradient-based methods [1] have been used to learn a lower-dimensional Riemannian manifold for optimization and sampling.
Nonlinear embeddings also allow for modeling non-stationary objective functions.In this context, a composition of GPs, referred to as deep GPs [12,46,10,11,22] is especially useful when the response surface is characterized by abrupt changes or has constraints.An extensive investigation on the employment of deep GP models in BO is presented in [10,21].In our work, we also exploit the idea of learning highly nonlinear functions through the composition of simpler ones [33], but we rather elaborate on deterministic dimensionality reduction and optimization in feature space.
In this paper, we propose a BO algorithm for high-dimensional optimization, which learns a nonlinear feature mapping h : R D → R d to reduce the dimensionality of the inputs, and a reconstruction mapping g : R d → R D based on GPs to evaluate the true objective function, jointly, see Figure 1.This allows us to optimize the acquisition function in a lower-dimensional feature space, so that the overall BO routine scales to high-dimensional problems that possess an intrinsic lower dimensionality.Finally, we formulate a constrained maximization of the acquisition function in feature space to prevent meaningless reconstructions.
2 Bayesian optimization in low-dimensional feature spaces Algorithm 1 Feature space BO Optimal input selection x t+1 X t ∪ {x t+1 }, y t ∪ {y t+1 } 13: end for 14: Return x * = arg min y t Algorithm 1: Key steps of BO in feature space.
We consider global minimization problems of the form with a high-dimensional input space X = [0, 1] D , but where the objective f X : X → R possesses an intrinsic lower dimensionality.In our setting, we consider functions f X that are expensive to evaluate and for which we are allowed a small budget of evaluation queries to express our best guess of the optimum's location x * .We further assume we have access only to noisy evaluations of the objective y = f X + ε, where the noise ε ∼ N (0, σ 2 n ) is i.i.d.Gaussian.We restrict ourselves to the typical setting, where neither gradients nor convexity properties of f X are available.
The main steps of a BO routine involve (i) response surface learning, (ii) optimal input selec-tion x * and (iii) evaluation of the objective function f X at x * .The first step trains a probabilistic surrogate model p(f X ), the response surface, which describes the black-box relationship between inputs x and observations y.In the t + 1st iteration of BO, the optimal input selection step finds an input x t+1 that maximizes an acquisition function α(•), which describes the added value of an input x.The evaluation step observes the noise-corrupted outcome of the true objective function f X (x t+1 ) + ε at the selected location.These steps are summarized in lines 4, 7 and 10 of Algorithm 1, respectively.In relatively high-dimensional settings (D > 20) both the response surface learning and optimal input selection become computationally challenging.
In our work, we exploit the effective low dimensionality of the objective function for BO in a lower-dimensional feature space Z ⊂ R d , where d D. In particular, we express the true objective function f X : R D → R as a composition of a feature mapping h : R D → R d and a function f Z : Z → R so that f X = f Z • h.The lower-dimensional feature space allows for both learning the response surface f X and maximizing an acquisition function α with domain Z, which yields optimizer z * .
However, we cannot evaluate the true objective f X directly at the low-dimensional features z * , but need to project z * back into the D-dimensional data space X .We therefore define a reconstruction mapping g : Z → X .We can think of this mapping as a decoder within an auto-encoder framework.We model both the composition f X = f Z • h and the reconstruction with Gaussian process models.The algorithm in Algorithm 1 summarizes the main steps of the feature space BO.
In the following, we detail the model (see Figure 1) for jointly learning the feature map h(•), the low-dimensional response surface in feature space f Z , and the reconstruction mapping g(•).

Manifold Gaussian processes for response surface learning in feature space
In our optimization problem, we expect the response surface to predict the value of the black-box objective function f X with calibrated uncertainty associated with each prediction.Gaussian processes (GPs) [42] are probabilistic models that allow for an analytic computation of posterior predictive function values within a Bayesian framework, and they are the standard model in BO for modeling the response surface.
A GP is a distribution over functions f Z ∼ GP(m(•), k(•, •)) and is fully specified by a mean function m : Z → R, and a covariance function/kernel k : Z × Z → R. The kernel computes the covariance between pairs of function values as a function of the corresponding inputs, i.e.Cov(f Z (z), f Z (z )) = k(z, z ), and thereby encodes regularity assumptions about f Z , such as smoothness or periodicity.Common kernel choices in the BO literature include the squared exponential and Matérn kernels [13].
In our feature space optimization, we address lines 5-6 of Algorithm 1 as a single learning problem.Therefore, we need a GP that learns useful representations z of inputs x for the regression task together with f Z .A manifold Gaussian process (mGP) [36,8,56] addresses this issue by composing two mappings: The deterministic feature map h with parameters θ h and a GP f Z ∼ GP(m, k) with kernel hyper-parameters θ k .The GP models the relationship between features z and function values y in observation space.The resulting composite model with mean and covariance functions given by respectively.Given high-dimensional training inputs X and corresponding observations y of the objective function, we train the model, which is parameterized by {θ h , θ k }, by maximizing the log-marginal likelihood (evidence) [42] {θ This objective allows us to learn a low-dimensional feature embedding as a by-product of the supervised GP regression framework.Unsupervised dimensionality reduction usually solves an orthogonal task to that of learning a response surface.Algorithms, such as PCA [40,24] or variational auto-encoders [43,27,37], achieve compact data representations by optimizing objectives that are not necessarily useful in a supervised setting [53].The mGP, instead, leads to low-dimensional representations that are optimal (locally) for the regression task at hand.
We use a multi-layer feedforward neural network with sigmoid activation functions as a feature map (encoder) h, resulting in a feature space Z = [0, 1] d .Neural networks as an explicit feature map within an mGP have already been applied successfully for modeling non-smooth responses in bipedal robot locomotion [8].Deep network architectures have also proven useful for orientation extraction from high-dimensional images [56].
With a Gaussian likelihood, the mGP posterior predictive distribution at a test point x ∈ X is Gaussian distributed with mean and variance given by respectively.Here, computes the prior mean function evaluated at the embedded training inputs Z = h(X).Note that posterior predictions can be computed using both the feature and data space.
The mGP defines a GP on X , but allows us to learn a response surface in the lower-dimensional feature space Z.This is key for optimizing the acquisition function in a low-dimensional space Z instead of the original data/parameter space X .Thus far we have detailed the feature BO procedure up to line 8 in Algorithm 1. Once we have found an optimizer z * of the acquisition function, we need to project it back into the original data space X in order to evaluate the true objective f X .This can be done by means of a reconstruction mapping (decoder), which we detail in the following.

Input reconstruction with manifold multi-output Gaussian processes
Here, we present the reconstruction part (decoder) of our feature space optimization model described in Figure 1.We are interested in modeling the functional relationship between the feature space Z and the data space X for step 9 in Algorithm 1, which requires us to evaluate f X .We therefore consider a vector-valued function g = {g i } D i=1 , where each component function g i : Z → X i maps vectors in feature space to the i-th coordinate of high-dimensional data, i.e. g i (z) = x(i) ∈ X .Multioutput Gaussian processes (MOGPs) [4,3,58,57,2,39,47,6] define a prior over vector-valued functions and explicitly allow for output correlations.An MOGP GP(m, K) is fully specified by a mean vector function m : Z → R D and a positive, semi-definite matrix-valued covariance function K : Z → R D×D , which expresses the correlation between observations in the same output coordinate and cross-correlation between the D different outputs.Various formulations of the matrix-valued kernel correspond to specific generative model assumptions for the multiple outputs g i .
In our work, we consider the intrinsic coregionalization model (ICM) [19,52], which structures the covariance matrix as a Kronecker product and allows for efficient training and predictions.This model is particularly suitable for trading off number of model parameters and expressiveness of the vector valued function.In particular, the ICM facilitates information sharing across different tasks by adopting the same covariance function and has successfully been adopted in robotics for learning inverse dynamics [55].Hence, this model requires fewer parameters than the linear model of coregionalization [4], and allows for exploiting properties of the Kronecker product for efficient training and posterior computation.

Intrinsic coregionalization model
The ICM [19,52] applies a linear mapping to a set of latent functions.In particular, we consider a set of P latent functions u i : Z → R, that are assumed to be sample paths, i.e. sample functions independently drawn from the same GP prior GP(m c , k c ).The ICM model expresses the vector-valued function as a linear combination of these sample functions g(z) = Au(z), (7) where u(z) ∈ R P is the collection of the P sample paths' evaluations at z, and A ∈ R D×P is the linear mapping that couples the independent vector and parameterizes the ICM model.As a result, g is a MOGP GP(m, K) with mean function m = Am c , where m c = [m c ] P i=1 is obtained by repeating the single-valued mean function m c in a P -vector.The covariance function is expressed as where k c is the covariance function for the GP prior and ⊗ denotes the Kronecker product.
Reconstruction Model For the reconstruction task in line 9 of Algorithm 1, we introduce the manifold MOGP with intrinsic coregionalization model (mMOGP), which shares the feature map h with the manifold GP used for learning the response surface in Section 2.1.In our work, we assume without loss of generality a zero-mean vector function for the mMOGP GP(0, B ⊗ k M ), where k M (x, x ) = k c (h(x), h(x )) and the matrix B = AA T .We model the composition g • h, which describes the relationship between the data space X and feature space Z, jointly with the MOGP mapping from feature space back to the data space X .Albeit sharing the same set of parameters θ h for the feature mapping, the mMOGP uses a kernel k c = k that differs from the one used for modeling the response surface (see Section 2.1).

Joint training
The joint training of the mGP, which models the response surface, and the mMOGP, which is used for the reconstruction (see also Figure 1) is performed via log marginal likelihood maximization.
Here, L comprises terms from both the mGP and mMOGP models, where K y is defined in (5), and the covariance matrix of the mMOGP K V = K + σ 2 n I is obtained by evaluating the Kronecker product K = B ⊗ k c (Z, Z) with the mMOGP kernel k c .The vector x V is a concatenation of the columns of the data X.The maximizers [θ * h , θ * k , θ * c ] of the log-marginal likelihood are the parameters of the feature map h (which is shared between the mGP and the mMOGP) and the hyper-parameters of the two kernels k and k c for the mGP and mMOGP, respectively.Optimization of ( 8) is performed via gradient-based methods [34,59].
Modeling the black-box objective function f X is orthogonal to the reconstruction problem.However, when training these tasks jointly, they have a regularization effect on the optimization of the parameters θ h of the feature embedding in the sense that the mapping h will not overfit to a single regression task: the parameters θ h will give rise to a feature space embedding that is useful for both the modeling of the objective and the reconstruction of the original inputs.
The major computational bottleneck for evaluating the marginal likelihood comes from the term x T V K −1 V x V , which requires inverting an N D × N D covariance.We reduce the computational complexity of this operation to O(N 3 ) + O(D 3 ) by exploiting the properties of the Kronecker product, tensor algebra [44] and structured GPs [15,45].Details can be found in the Appendix.

Constrained acquisition
We defined a joint probabilistic model for the response surface learning and the input reconstruction tasks, summarized in lines 4-6 and 9 of Algorithm 1, respectively.We are now concerned with the maximization of the acquisition function in feature space as described in line 8 of Algorithm 1.We aim at maximizing the acquisition function in a low-dimensional feature space of the original data/ parameter space X .However, one problem that arises with the mMOGP decoding is that locations in feature space that are too far away from data will be mapped back to the mMOGP prior.Since the acquisition function is a key driver of exploration in BO, this is a problem.We address this limitation by introducing a constraint based on the Lipschitz continuity of the mMOGP posterior.This will ensure that candidates z * ∈ Z selected in feature space will not collapse to the origin 0 ∈ R D if the reconstruction is defined as x * = µ(z * ), where µ is the posterior mean of the mMOGP.
We want to leverage information from observed data for the multi-output mapping and exploit it when optimizing the acquisition function in feature space.This can be achieved by introducing an upper bound to the Euclidean distance dist(z, Z t ) = min 1≤i≤Nt z i − z 2 (9) in feature space between the optimization variable z and the embedded training data Here, N t is the number of data points available at BO iteration t. allows us to specify how far from the data we can move in feature space without falling back to the prior on all coordinates of the reconstruction.Here z * minimizes the Euclidean distance in ( 9), while the numerator on the right-hand side is the component-wise maximum of µ(z * ).We estimate the Lipschitz constant as the maximum norm of the Jacobian of the posterior mean of the mMOGP [17] This maximization returns a valid Lipschitz constant [17] for the multi-output mapping for any choice of norm in (11).The Jacobian of the posterior mean is represented by a D × d matrix and we adopt the max norm ∇ z µ(z) ∞ = max |µ i,j | for i = 1, ..., D and j = 1, ..., d.Lower values of valid Lipschitz constants L allow for exploration in larger regions of the feature space that still satisfy the nonlinear constraint in (10).

Results
We report results on a set of high-dimensional benchmark functions that possess an intrinsic low dimensionality.In particular, we (i) assess the benefits of adopting a model structure as presented in Figure 1; (ii) analyze the benefits of the constrained optimization of the acquisition function.
Our purpose is to compare empirical performances across (a) different characterizations of the feature spaces, e.g.linear/nonlinear subspaces; (b) different properties of the objective function, e.g.additivity/non additivity.
We compare our approach (MGPC-BO) with the random embeddings optimization (REMBO) [54], which performs BO on a random linear subspace of the inputs.Additional baselines include additive models (ADD-BO) [26], which assumes an additive structure (across dimensions) of the objective f X , and one recently proposed VAE-based model (VAE-BO) [16] that learns a feature space with deep networks offline.We also include a version of our model presented in Figure 1 (NNC-BO) that uses a hierarchical ICM for the input reconstruction mapping g.The hierarchical ICM partitions the data space into low-dimensional disjoint subsets, i.e.X = X 1 × ... × X Q , X i ⊂ R 3 , and assumes independence between reconstructions of different subsets, i.e. x(i) ⊥ x(j) , where x(i) ∈ X i , x(j) ∈ X j for i = j.Moreover, the baselines MGP-BO and NN-BO correspond to same modeling as in MGPC-BO and NNC-BO, respectively, but without applying the nonlinear constraint in (10).
We evaluate the performances of all baselines across a set of common choices of acquisition functions: expected improvement (EI) [38], upper confidence bound (UCB) [49] and probability of improvement (PI) [28].The maximization of the acquisition function is identical for all baselines: We first perform a random search step with 5, 000 samples drawn uniformly at random, then we select the best 100 locations and apply gradient-based optimization from these starting locations.For box-constrained acquisition optimization we use L-BFGS-B [34,59].For constrained acquisition optimization with nonlinear constraints we use a trust-region interior point method [7].
Each BO progression curve shows the mean and standard error of the immediate logarithmic regret log 10 |f (x best (t)) − f min |, where f min is the true minimum of f X and x best (t) ∈ arg min i=1:t f X (x i ).Mean and standard error are computed over 20 experiments with different random initializations.All optimization experiments start with a budget of 10 data points and perform a total of 300 iterations.

Linear feature space
We consider benchmark functions that are defined in a d = 10-dimensional space.We map their input space to a D = 60-dimensional space using an orthogonal matrix R d×D so that the overall objective is f X (x) = f (z) = f (Rx).
Additive objective We minimize the Rosenbrock benchmark function ] in a 10-dimensional feature space.Figure 3 shows that MGPC-BO and NNC-BO baselines descend quickly to relatively low regret in the early stages of optimization and recover the same regret at termination as the unconstrained baselines MGP-BO and NN-BO.This highlights the fast learning of feature-space representations that are effective for optimization when only few data samples are available.The VAE-BO baseline also improves quickly but lacks exploration due to an insufficiently expressive reconstruction mapping from feature space to data space.We highlight that the VAE-BO model was trained on a budget of 500 inputs-observations pairs prior to starting the BO experiments.This additional budget, however, still does not allow the VAE-BO to compare well with baselines that learn a feature mapping during optimization.REMBO shows a slower descent due to a limited exploration in at most d-directions of the data space, while the ADD-BO baseline suffers from the coupling effects of the linear dimensionality reduction R.
Non-additive objective Here, we optimize the Product of Sines function f (z) = 10 sin(z 1 ) 10 i=1 sin(z i ) and compare results when the additivity assumption is not satisfied.Figure 4 shows the regret curves obtained optimizing the objective on a 10-dimensional feature space.Solid lines describe the Lipschitz-regularized baselines MGPC-BO and NNC-BO (with nonlinear constraint), while dashed lines are baselines that apply box-constrained maximization of the acquisition in feature space.The NN-BO and MGP-BO regrets flatten early.The reason for this is that the acquisition function highlights locations in feature space that are too far away from the training data.In this setting, the decoder g returns the same high-dimensional reconstruction, which prevents BO from exploring.The constrained maximization of the acquisition is beneficial for both models.We also note that the REMBO baseline conforms to the intrinsic linear low-dimensionality assumption described Section 4.1.However, the linear reconstruction mapping applied by REMBO also suffers from non-injectivity, and this slows exploration in the high-dimensional space.The linear projection deteriorates performances of the additive model.ADD-BO assumes independence between axis- aligned projections of the high-dimensional space, while the linear mapping R couples all subsets of dimensions.This renders the optimization of independent additive components not effective.The VAE-BO approach requires much larger amounts of data to learn a meaningful reconstruction mapping than what is allowed in our BO experiment.As a result most locations in feature space are mapped to similar reconstructions.This explains the flat curve observed on all VAE-BO progressions with different acquisitions.

Nonlinear feature space with non-additive objective
We consider the product of sines functions and apply a nonlinear dimensionality reduction.We define a single-layer neural network mapping to elevate the dimensionality of the objective to D = 60, i.e. f X (x) = f (σ(Rx)).Here σ is the sigmoid activation function.We also compare with a different parametrization of the covariance function of the decoder g.The baseline DMGP-BO and DMGPC-BO define a single kernel k c for the reconstruction task while NN-BO and NNC-BO define different kernels {k i c } Q i=1 , one for each subset of the partitioning.Figure 5 shows the progression of the regret over 300 BO iterations.We can observe consistent improvements of MGPC-BO and DMGPC-BO with respect to VAE-BO which also assumes a nonlinear embedding for the objective.The Performance of MGPC-BO and DMGPC-BO also retain better regret at termination with box-constrained acquisition maximization, namely MGP-BO and DMGP-BO.
Overall, we observe that the constrained maximization of the acquisition function is beneficial for the proposed model and variants of its reconstruction mapping, i.e.NN-BO, DMGP-BO.The advantages are more more evident with the product of sines objective while with the Rosenbrock we retain no worse regret.We also highlight that our performances improve as we move to problems which are characterized by intrinsic low-dimensionality characterized by a nonlinear dimensionality reduction.

Conclusion
We proposed a framework for efficient Bayesian optimization of intrinsically low-dimensional black-box functions based on nonlinear embeddings.In our model, the manifold GP learns useful low-dimensional feature representations of high-dimensional data by jointly learning the response surface and the reconstruction mapping.As a reconstruction mapping we use a manifold MOGP.Our approach allows for optimizing acquisition functions in a low-dimensional feature space.However, since exploration in feature space (driven by the acquisition function) does not necessarily mean exploration in the high-dimensional parameter space, we introduce a nonlinear constraint based on Lipschitz continuity of predicitons of the mMOGP, which encourages exploration in the vicinity of the training data and eliminates un-identifiability issues in data space, which would otherwise hinder optimization.Reshape results r = vec(Z) 9: end for 10: Return W l=1 K l y = r Algorithm 2: Fast matrix-vector multiplication for matrices that can be expressed as a Kronecker product.
between the tensors T K i1,j1,...,i W ,j W and T Y j W ,...,j1 applies a contraction along the indices of the second tensor in the multiplication, i.e. j1 ...
This tensor contraction can be expressed in terms of a sequence of tensor-transposed matrix-tensor products with K l T Y = G l k=1 [K l ] i1,k T Y k,j2,...,j W .The function vec(•) returns the vectorized form of a matrix by stacking its columns vertically.The tensor-transposition applies a cyclic permutation to the order of the indices in a tensor.As a result, the right hand side in equation (18) allows us to evaluate the expensive matrix-vector product without computing and storing the Kronecker product.Algorithm 2 shows the main steps of the efficient matrix multiplication as a sequence of tensor-transpose matrix-tensor products.

Figure 3 :
Figure 3: Results of BO in feature space with linear embedding.Baselines MGPC-BO and NNC-BO (solid) apply nonlinearly constrained acquisition maximization and recover same regret as the unconstrained versions MGP-BO and NN-BO.

Figure 4 :
Figure 4: Optimization progression on product of sines with EI 4(a), UCB 4(b) and PI 4(c).Both MGPC-BO and NNC-BO learn low-dimensional representations of the objective that are useful for optimization.

Figure 5 :
Figure 5: BO performances expressed as log regret of the product of sines function in a nonlinear embedding.Results are shown for EI 5(a), UCB 5(b) and PI 5(c).This limits the exploration in the high-dimensional space.