Deterministic global optimization with Gaussian processes embedded

Gaussian processes (Kriging) are interpolating data-driven models that are frequently applied in various disciplines. Often, Gaussian processes are trained on datasets and are subsequently embedded as surrogate models in optimization problems. These optimization problems are nonconvex and global optimization is desired. However, previous literature observed computational burdens limiting deterministic global optimization to Gaussian processes trained on few data points. We propose a reduced-space formulation for deterministic global optimization with trained Gaussian processes embedded. For optimization, the branch-and-bound solver branches only on the free variables and McCormick relaxations are propagated through explicit Gaussian process models. The approach also leads to significantly smaller and computationally cheaper subproblems for lower and upper bounding. To further accelerate convergence, we derive envelopes of common covariance functions for GPs and tight relaxations of acquisition functions used in Bayesian optimization including expected improvement, probability of improvement, and lower confidence bound. In total, we reduce computational time by orders of magnitude compared to state-of-the-art methods, thus overcoming previous computational burdens. We demonstrate the performance and scaling of the proposed method and apply it to Bayesian optimization with global optimization of the acquisition function and chance-constrained programming. The Gaussian process models, acquisition functions, and training scripts are available open-source within the “MeLOn—MachineLearning Models for Optimization” toolbox (https://git.rwth-aachen.de/avt.svt/public/MeLOn).


Introduction
A Gaussian process (GP) is a stochastic process where any finite collection of random variables has a multivariate Gaussian distribution; they can be understood as an infinite-dimensional generalization of multivariate Gaussian distributions [66]. The predictions of GPs are Gaussian distributions that provide not only an estimate but also a variance. GPs originate from geostatistics [46] and gained popularity for the design and analysis of computer experiments (DACE) since 1989 [68]. Furthermore, GPs are commonly applied as interpolating surrogate models across various disciplines including biotechnology [13,25,29,54,82], chemical engineering [14,[22][23][24]27,34,52], chemistry [1,69], and deep-learning [74]. Note that GP regression is also often referred to as Kriging. In many applications, GPs are trained on a data set and are subsequently embedded in an optimization problem, e.g., to identify an optimal operating point of a process. Moreover, many derivative-free solvers for expensiveto-evaluate black-box functions actually train GPs and optimize their predictions (e.g., Bayesian optimization algorithms [12,40,72,82] and other adaptive sampling approaches [9,10,20,[22][23][24]27]). In Bayesian optimization, the optimum of an acquisition function determines the next sampling point [72]. The vast majority of these optimizations have been performed by local solution approaches [44,88] and a few by stochastic global optimization methods [12]. Our contribution focuses on the deterministic global solution of optimization problems with trained GPs embedded and on applications in process systems engineering.
GPs are commonly used to learn the input-output behavior of unit operations [14,15,42,43,48,63,64], complete flowsheets [33], or thermodynamic property relations from data [51]. Subsequently, the trained GPs are often combined with nonlinear mechanistic process models leading to hybrid mechanistic and data-driven models [31,41,60,83] which are optimized. Many of the previous works on optimization with GPs embedded rely on local optimization techniques. Caballero and Grossmann, for instance, train GPs on data obtained from a rigorous divided wall column simulation. Then, they iteratively optimize the operation of the column (modeled by GPs) using SNOPT [30], sample new data at the solution point, and update the GPs [14,15]. Later, Caballero and co-workers extend this work to distillation sequence superstructure problems [63,64]. In [63], the authors solve the resulting mixed-integer nonlinear programs (MINLPs) using a local solver in GAMS. Therein, the GP estimate is computed via an external function in Matlab which leads to a reduced optimization problem size visible to the local solver in GAMS. However, all these local methods have the drawback that they can lead to suboptimal solutions, because the resulting optimization problems are nonconvex. This nonconvexity is induced by the covariance functions of the GPs as well as often the mechanistic part of the hybrid models.
Deterministic global optimization can guarantee to identify globally optimal solutions within finite time to a given nonzero tolerance [36]. In a few previous studies, deterministic global optimization with GPs embedded was done using general-purpose global solvers. For instance, in the black-box optimization algorithms ALAMO [20] and ARGONAUT [9], GPs are included as surrogate models and are optimized using BARON [79] and ANTIGONE [57], respectively. However, computational burdens were observed that limit applicability, e.g., in terms of the number of training points. Cozad et al. [20] state that GPs are accurate but "difficult to solve using provable derivative-based optimization software". Similarly, Boukouvala and Floudas [9] state that the computational cost becomes a limiting factor because the number of nonlinear terms of GPs equals the product of the number of interpolated points (N ) and the dimensionality (D) of the input domain. More recently, Keßler et al. [42,43] optimized the design of nonideal distillation columns by a trust-region approach with GPs embedded. Therein, optimization problems with GPs embedded are solved globally using BARON [79] within relatively long CPU times (10 2 -10 5 CPU seconds on a personal computer).
As mentioned earlier, Quirante et al. [63] call an external Matlab function to compute GP estimates with a local solver in GAMS. As an alternative approach, they also solve the problem globally using BARON in GAMS by providing the full set of GP equations as equality constraints. This leads to additional intermediate optimization variables besides the degrees of freedom of the problem. Similar to other studies, they observe that their formulation is only practical for a small number of GP surrogates and training data points to avoid large numbers of variables and constraints [63]. We refer to the problem formulation where the GP is described by equality constraints and additional optimization variables as a full-space (FS) formulation. It is commonly used in modeling environments, e.g., GAMS, that interface with state-of-the-art global solvers such as ANTIGONE [57], BARON [79], and SCIP [50].
An alternative to the FS is a reduced-space (RS) formulation where some optimization variables are eliminated using explicit constraints. This reduced problem size leads to a lower number of variables for branching as well as potentially smaller subproblems. The former has some similarity to selective branching [28] (c.f. discussion in [4]). The exact size of the subproblems for lower bounding and bound tightening depends on the method for constructing relaxations. In particular, when constructing relaxations in the RS using McCormick [53], alphaBB [2] or natural interval extensions, the resulting lower bounding problems are much smaller compared to the auxiliary variable method (AVM) [73,80]. Therefore, any global solver can in principle handle RS but some methods for constructing relaxations appear more promising to benefit from the RS [4]. We have recently released the open-source global solver MAiNGO [7] which uses the MC++ library [16] for automatic propagation of McCormick relaxations through computer code [58]. We have shown that the RS formulation can be advantageous for flowsheet optimization problems [5,6] and problems with artificial neural networks embedded [37,65,70]. In the context of Bayesian optimization, Jones et at. [40] develop valid overestimators of the expected improvement (EI) acquisition function in the RS. However, their relaxations rely on interval extensions and optimization-based relaxations limited to a specific covariance function; they do not derive envelopes. Furthermore, they do not provide convex underestimators which are in general necessary to embed GPs in optimization problems.
The main contribution of this work is the efficient deterministic global optimization of optimization problems with GPs embedded. We develop a RS formulation for optimization problems with GPs embedded. The performance of the proposed method is analyzed in an extensive computational study by solving about 90,000 optimization problems. The proposed RS outperforms a FS formulation for problems with GPs embedded by speedup factors of several magnitudes. Moreover, this speedup increases with the number of training points. To further accelerate convergence, we derive and implement envelopes of covariance functions for GPs and tight relaxations of acquisition functions, which are commonly used in Bayesian optimization. Finally, we solve a chance-constrained optimization problem with GPs embedded and we perform global optimization of an acquisition function. The GP training methods and models are provided as an open-source toolbox called "MeLOn-Machine Learning Models for Optimization" under the Eclipse public license [71]. The resulting optimization problems are solved using our global solver MAiNGO [7]. Note that the MeLOn toolbox is also automatically included as a submodule in our new MAiNGO release.

Optimization problem formulations
In the simplest case, which is common in the literature, the (scaled) inputs of a GP are the free variables of the optimization problem with x ∈X = [x L , x U ]. For given x, the dependent (or intermediate) variables z can be computed by the solution of h(x, z) = 0, h :X × R n z → R n z . In the case of GP models, we aim to include the estimate (m D ) and variance (k D ) in the optimization. As will be shown in Sect. 3, we can solve explicitly for m D and k D [c.f. Eqs. (1) and (2)].
The realization of the objective function f depends on the application. In many applications, it depends on the estimate of the GP, i.e., f (m D ) (c.f. Sect. 6.1). In Bayesian optimization, the objective function is called the acquisition function and usually depends on the estimate and variance of the GP, i.e., f (m D , k D ) (c.f. Sect. 6.3). Finally, additional constraints might depend on the inputs of the GP, its estimate, and variance, i.e., g(x, m D , k D ) ≤ 0. In more complex cases, multiple GPs can be combined in one optimization problem (c.f. Sect. 6.2).
In the following, we describe two optimization problem formulations for problems with trained GPs embedded: the commonly used FS formulation in Sect. 2.1 and the RS formulation in Sect. 2.2. Both problem formulations are exact reformulations in the sense of Liberti et al. [47], meaning that they have the same local and global optima. The equivalence is shown in "Appendix A" of [4]. However, the formulation significantly affects problem size and performance of global optimization solvers.

Full-space formulation
In the FS formulation, the nonlinear equations h(x, z) = 0 are provided as equality constraints and the intermediate dependent variables z ∈ Z are optimization variables. A general FS problem formulation is: In general, there exist multiple valid FS formulations for optimization problems. In Sect. 3 of the electronic supplementary information (ESI), we provide a representative FS formulation for the case where the estimate of a GP is minimized. This is also the FS formulation that we use in our numerical examples (c.f., Sect. 6.1).

Reduced-space formulation
In the RS formulation, the equality constraints are solved for the intermediate variables and substituted in the optimization problem (c.f. [5]). A general RS problem formulation in the context of optimization with a GP embedded is: Herein, the Branch-and-Bound (B&B) solver operates only on the free variables x and no equality constraints are visible to the solver. In GPs, the estimate and variance are explicit functions of the input [Eqs. (1) and (2)]. Thus, we can directly formulate a RS formulation. The RS formulation effectively combines those equations and hides them from the B&B algorithm. This results in a total number of D optimization variables, zero equality constraints, and no additional optimization variables z. Thus, the RS formulation requires only bounds on x.
Note that the direct substitution of all equality constraints is not always possible when multiple GPs are combined with mechanistic models, e.g., in the presence of recycle streams. Here, a small number of additional optimization variables and corresponding equality constraints can remain in the RS formulation [5]. As an alternative, relaxations for implicit functions can also be derived [76,86]. Moreover, we have previously observed that a hybrid between RS and FS formulation can be more efficiently solvable for some optimization problems [6]. In this work, we compare the RS and the FS formulation and do not consider any hybrid problem formulations.

Gaussian processes
In this section, GPs are briefly introduced (c.f. [66]). We first describe the GP prior distribution, i.e., the probability distribution before any data is taken into account. Then, we describe the posterior distribution, which results from conditioning the prior on training data. Finally, we describe how hyperparameters of the GP can be adapted to data by a maximum a posteriori (MAP) estimate.

Prior
A GP prior is fully described by its mean function m(x) and positive semi-definite covariance function k(x, x ) (also known as kernel function). We consider a noisy observation y from a functionf (x) with y(x):=f (x)+ε noise , whereby the output noise ε noise is independent and identically distributed (i.i.d.) with ε noise ∼ N (0, σ 2 noise ). We say y is distributed as a GP, i.e., y ∼ GP(m(x), k(x, x )) with Without loss of generality, we assume that the prior mean function is m(x) = 0. This implies that we train the GP on scaled data such that the mean of the training outputs is zero. A common class of covariance functions is the Matérn class.
where σ 2 f is the output variance, r : is the gamma function, and K ν (·) is the modified Bessel function of the second kind. The smoothness of Matérn covariance functions can be adjusted by the positive parameter ν. When ν is a half-integer value, the Matérn covariance function becomes a product of a polynomial and an exponential [66]. Common values for ν are 1/2, 3/2, 5/2, and ∞, i.e., the most widely-used squared exponential covariance function, k SE (r ):= exp − 1 2 r 2 . We derive envelopes of these covariance functions in Sect. 4.1 and implement them within MeLOn [71]. Also, a noise term, σ 2 n · δ(x, x ), can be added to any covariance function where σ 2 n is the noise variance and δ(x, x ) is the Kronecker delta function. The hyperparameters of the covariance function are adjusted during training and are jointly noted as θ = [λ 1 , ..., λ d , σ f , σ n ]. Herein, a log-transformation is common to prevent negative values during training.

Posterior
The GP posterior is obtained by conditioning the prior on observations. We consider a set of N training inputs X = {x where the covariance matrix of the training data is given by K X ,X := k(x i , x j ) ∈ R N ×N , the covariance vector between the candidate point x and the training data (1) and (2) describe essentially the predictions of a GP and are implemented within MeLOn.

Maximum a posteriori
In order to find appropriate hyperparameters θ for a given problem, we use a MAP estimate which is known to be advantageous compared to the maximum likelihood estimation (MLE) on small data sets [77]. Using the MAP estimate, the hyperparameters are identified by maximizing the probability that the GP fits the training data, i.e., θ opt :=argmax θ P (θ |X , Y). Analytical expressions for P (θ|X , Y) and its derivatives w.r.t. the hyperparameters can be found in the literature [66]. We provide a Matlab training script in MeLOn that is based on our previous work [12]. Therein, we assume an independent Gaussian distribution as a prior distribution on the log-transformed hyperparameters, i.e., θ i ∼ N μ i , σ 2 i . The implementation of the training is efficient through the pre-computation of squared distances, the Cholesky decomposition for computing the inverse of the covariance matrix, and a two-step training approach that searches first globally and then locally [12].

Convex and concave relaxations
The construction of relaxations, i.e., convex function underestimators (F cv ) and concave function overestimators (F cc ), is essential for B&B algorithms. In our opensource solver MAiNGO, we use the (multivariate) McCormick method [53,81] to propagate relaxations and their subgradients [58] through explicit functions using the MC++ library [16]. However, the McCormick method often does not provide the tightest possible relaxations, i.e., the envelopes. In this section, we derive tight relaxations or envelopes of functions that are relevant for GPs and Bayesian optimization. The functions and their relaxations are implemented in MC++. When using these intrinsic functions and their relaxations in MAiNGO, the (multivariate) McCormick method is only used for the remaining parts of the model. Note that the derived relaxations are used within MAiNGO while BARON does not allow for implementation of custom relaxations or piecewise defined functions.

Covariance functions
The covariance function is a key element of GPs. When embedding trained GPs into optimization problems, the covariance function occurs N times because it is used in the covariance vector between the candidate point x and the training data, i.e., Note that the covariance matrix K X ,X depends only on training data and is thus a parameter during the optimization. Thus, tight relaxations of the covariance functions are highly desirable. In this subsection, we derive envelopes for common Matérn covariance functions. We consider univariate covariance functions, i.e., k ν : This is possible because we consider stationary covariance functions that are invariant to translations in the input space. Common Matérn covariance functions use ν = 1/2, 3/2, 5/2 and ∞ and are given by: where k SE is the squared exponential covariance function with ν → ∞. We find that these four covariance functions are convex because their Hessian is positive semidefinite. Thus, the convex envelope is given by As the McCormick composition and product theorems provide weak relaxations of k ν=3/2 and k ν=5/2 (c.f. ESI Sect. 1), we implement these functions and their envelopes in our library of intrinsic functions in MC++. Furthermore, natural interval extensions are not exact for k ν=3/2 and k ν=5/2 . Thus, we also provide exact interval bounds based on the monotonicity.
It should be noted that covariance functions are commonly given as a function of the weighted Euclidean distance r = √ d. However, we chose to use d instead for three main reasons: (1) x is usually a free variable of the optimization problem. Thus, the computation of r would lead to potentially weaker relaxations for k ν=5/2 and k SE . (2) The derivative of k ν=3/2 (·), k ν=5/2 (·), and k SE (·) is defined at d = 0 while the derivative of the square root function is not.
are nonconvex in r , so deriving the envelopes would be nontrivial.
Finally, it can be noted that we did not derive envelopes of k Matérn (x, x ), because the variable input dimensions pose difficulties in implementation and the multidimensionality is a challenge for the derivation of envelopes. Nevertheless, the McCormick composition theorem applied to k ν (d(x, x )) yields relaxations that are exact at the minimum of k Matérn because the natural interval extensions of the weighted squared distance d are exact (c.f. [62]). This means that the relaxations are exact in Hausdorff metric.

Gaussian probability density function
The PDF is used to compute the EI acquisition function and is given by φ : The Gaussian probability density function (PDF) is a nonconvex function for which the McCormick composition rule does not provide its envelopes. For one-dimensional functions, McCormick [53] also provides a method to construct envelopes. We construct the envelopes of PDF using this method and implement them in our library of intrinsic functions. The envelope of the PDF is illustrated in Fig. 1 and derived in "Appendix A.1".

Gaussian cumulative distribution function
The Gaussian cumulative distribution function (CDF) is given by Φ : The envelopes of the error function are already available in MC++ as an intrinsic function and consequently the McCormick technique provides envelopes of the CDF (see   [56] and Najman et al. [61]. Note that the ranges of μ and σ are different in the two subfigures to highlight the individual relaxations that are derived on different intervals. The ranges in a are such that they overlap with all four sets I 1 -I 4 defined in Sect. A.2, while the ranges in b lie within the set I 4

Lower confidence bound acquisition function
The lower confidence bound (LCB) (upper confidence bound when considering maximization) is an acquisition function with strong theoretical foundation. For instance, a bound on its cumulative regret, i.e., a convergence rate for Bayesian optimization, for relatively mild assumptions on the black-box function is known [75]. It is given by LCB : with a parameter κ ∈ R >0 . LCB has not been popular in engineering applications as it requires an additional tuning parameter κ and leads to heavy exploration when a rigorous value for κ is chosen [75]. Recently, LCB has gained more popularity through the application as a policy in deep reinforcement learning, e.g., by DeepMind [59]. LCB is a linear function and thus McCormick relaxations are exact.

Probability of improvement acquisition function
Probability of improvement (PI) computes the probability that a prediction at x is below a given target f min , i.e.,PI(x) = P ( f (x) ≤ f min ). When the underlying function is distributed as a GP with mean μ and variance σ , the PI is given by PI : The PI acquisition function is neither convex or concave over its entire domain. However, as analyzed in Sect. A.2, there are parts of its domain over which the function is componentwise convex or convex with respect to σ or μ. For componentwise convex or concave functions, there exist methods for constructing tight relaxations. [56] introduced a method for constructing concave relaxations for componentwise convex functions (or vice versa). In particular, the concave envelope of a componentwise convex function is polyhedral [78], and thus the method of [56] amounts to finding the correct combinations of corners of the considered interval box to construct the facets of the polyhedral concave envelope. [61], in contrast, introduce a method for constructing convex relaxations of componentwise convex functions that satisfy a certain monotonicity condition on their first order partial derivatives (or, in case of twice continuously differentiable functions, have mixed second-order partial derivatives with constant sign over the box). For functions that are componentwise convex with respect to some and concave with respect to other variables, [61] also show that by taking the secant with respect to the concave (or convex) variables, one can obtain a relaxation that is componentwise convex (or concave) with respect to all variables. Using the aforementioned methods, this function can then be further relaxed to obtain convex (or concave) relaxations.
We use these methods that exploit componentwise convexity along with techniques that exploit monotonicity properties to construct tight relaxations of the PI acquisition function. The procedure for constructing these relaxations is described in detail in "Appendix A.2". Examples for the resulting relaxations on two subsets of the domain of PI are shown in Fig. 2.

Expected improvement acquisition function
EI is the acquisition function that is most commonly used in Bayesian optimization [40]. It is defined asẼI(x) = IE max( f min − f (x), 0) . When the underlying function is distributed as a GP, EI : R × R ≥0 → R is given by As noted by Jones et al. [40], EI is componentwise monotonic and thus, exact interval bounds can easily be derived. In Sect. A.3, we show that EI is convex and we provide its envelopes. As EI is not available as an intrinsic function in BARON, an algebraic reformulation is necessary that uses Eq. (6) where Φ is substituted from Eq. (4) with Eq. (1) in ESI and φ from Eq. (3). In addition, some workaround would be necessary for σ = 0 (e.g., additional binary variable and big-M formulation).

Implementation
The described methods are implemented in our open-source solver MAiNGO [7] and the MeLOn toolbox [71]. The modeling interfaces of MAiNGO (currently either textbased input or a C++ API) allow a convenient implementation of RS models without having to eliminate variables symbolically. Instead, the sequential evaluation of model equations can be expressed as in procedural programming paradigms. MAiNGO implements a spatial B&B algorithm enhanced with some features for range reduction [32,49,67] and a multi-start heuristic. A directed acyclic graph representation of the model is constructed using the MC++ library [16] and evaluated in different arithmetics: We use automatic differentiation via FADBAD++ [3] to obtain first and second derivatives of functions and provide them to the desired local solver. Currently, MAiNGO supports local solvers found in the NLopt package [39], IPOPT and Knitro. These local solvers can be used for pre-processing and solving the upper bounding problems. In the presented manuscript, we use SLSQP [45] in the preprocessing and in the upper bounding. During pre-processing, a simple multistart heuristic initializes the first local search at the center point of the variable ranges. Subsequent local searches are initialized randomly within the variable ranges.
MAiNGO constructs (multivariate) McCormick relaxations of factorable functions [53,81]. The convex and concave relaxations together with their subgradients [58] are constructed through the MC++ library [16]. The necessary interval extensions are provided through FILIB++ [35]. MAiNGO currently supports CPLEX [38] and CLP [19] as linear programming solvers for lower bounding. In this work, the convex relaxations of the objective and the constraints are linearized at the center point of each node. Subsequently, CPLEX [38] solves the resulting linear problems for lower bounding. MAiNGO can also be run in parallel on multiple cores through MPI. For a fair comparison, we run all optimizations on a single core in this work.
The GP models, acquisition functions, and training scripts are available open-source within the MeLOn toolbox [71] and the relaxations of the corresponding functions are available through the MC++ library used by MAiNGO.
In order to install MAiNGO, please visit our public git repository at https://git.rwthaachen.de/avt.svt/public/maingo. Our machine learning toolbox MeLOn comes as a submodule of MAiNGO and will be installed with MAiNGO. Currently, MAiNGO can be run using our C++ interface or using our own modeling language called ALE [26]. At the time of writing this manuscript, the authors develop a new Python interface for MAiNGO which will be available soon via pypi.

Numerical results
We now investigate the numerical performance of the proposed method on one core of an Intel Xeon CPU with 2.60 GHz, 128 GB RAM and Windows Server 2016 operating system. We present three case studies. We use MAiNGO version v0.2.1 and BARON v19.12.7 through GAMS v30.2.0 to solve different optimization problems with GPs embedded on a single core. Note that we use CPLEX as a lower bounding solver in both BARON and MAiNGO. In MAiNGO we use SLSQP [45] in the pre-processing and in the upper bounding. By default, BARON automatically selects NLP solvers and may switch between different NLP solvers.
First, we illustrate the scaling of the method w.r.t. the number of training data points on a representative test function. Herein, the estimate of the GP is optimized. Second, we consider a chemical engineering case study with a chance constraint, which utilizes the variance prediction of a GP. Third, we optimize an acquisition function that is commonly used in Bayesian optimization on a chemical engineering dataset.

Illustrative example and scaling of the algorithm
In the first illustrative example, the peaks function is learned by GPs. Then, the GP predictions are optimized onX = {x 1 , x 2 ∈ R : − 3 ≤ x 1 , x 2 ≤ 3}. The peaks function is given by f peaks : The two-dimensional function has multiple suboptimal local optima and one unique global minimizer at x * ≈ [0.228, −1.626] T with f peaks (x * ) ≈ −6.551. We generate various training data onX using a Latin hypercube sampling of sizes 10, 20, 30,…, 500. Then, we train GPs with k ν=1/2 (d), k ν=3/2 (d), k ν=5/2 (d), and k SE (d) covariance functions on the data. The parameters of the trained GPs are saved in individual JSON files. After training, the JSON files are read by the solver and the predictions of the GPs are minimized using the RS and FS formulation to locate an approximation of the minimum of f peaks . We run optimizations in MAiNGO once using the developed envelopes and once using standard McCormick relaxations. Due to long CPU times, we run optimizations for the FS formulations only for up to 250 data points in MAiNGO. The whole data generation, training, and optimization procedure are repeated 50 times for each data set. Thus, we train a total of 10, 000 GPs and run 90, 000 optimization problems in MAiNGO. We also solve the FS and RS formulation in BARON by automatically parsing the problem from our C++ implementation to GAMS. This is particularly important in the RS as equations with several thousand characters are generated. We solve the RS problem for up to 360 and the FS for up to 210 data points in BARON due to the high computational effort. The optimality tolerances are set to abs. tol. = 10 −3 and rel. tol. = 10 −3 and the maximum CPU time is set to 1, 000 CPU seconds. The reported CPU times do not include any compilation time in MAiNGO and BARON. Note that the MAiNGO code is just compiled once for each problem class because the individual GPs are parameterized by JSON files. Thus, no repeated compilation is necessary. The feasibility tolerances are set to 10 −6 . The analysis in this section is based on results for the k ν=5/2 covariance function.  In the FS, this problem has D + 2 · N + 2 equality constraints and 2 · D + 2 · N + 2 optimization variables while the RS has D optimization variables and no equality constraints. Note that for practical applications the number of training data points is usually much larger than the dimension of the inputs, i.e., N D. The full problem formulation is also provided in ESI Sect. 3. Figure 3 shows a comparison of the CPU time for optimization of GPs. For the solver MAiNGO, Fig. 3a shows that RS formulation outperforms the FS formulation by more than one order of magnitude and shows a more favorable scaling with the number of training data points. For example, the speedup increases to a factor of 778 for 250 data points. Notably, the achieved speedup increases drastically with the number of training data points (c.f. ESI Sect. 4). This is mainly due to the fact that the CPU times for the FS formulations scale approximately cubically with the data points (CPU F S w/ env (N ) = 1.053 · 10 −4 N 2.958 sec with R 2 = 0.993) while the ones for the RS scale almost linearly (CPU RS w/ env (N ) = 0.0022 · N 1.156 sec with R 2 = 0.995).
In general, the number of optimization variables can lead to an exponential growth of the worst-case B&B iterations and thus runtime. In this particular case, the number of B&B iterations is very similar for the FS and RS formulation (see Fig. 4a). Instead, for the present problems the number of B&B iterations is more influenced by the use of tight relaxations. Figure 4b shows that the CPU time per iteration increases drastically with problem size in the FS while it increases only moderately in the RS. This indicates that the solution time of the lower bounding, upper bounding, and bound tightening subproblems scales favorably in the RS and that this is the main reason for speedup of the RS formulation in MAiNGO. This is probably due to the smaller subproblem sizes when using McCormick relaxations in the RS formulation (c.f. discussion in Sect. 1).
The use of envelopes of covariance functions also improves computational performance (see Fig. 3a). However, this effect is approximately constant over the problem size (c.f. Fig. 3 in ESI Sect. sec:globalspsgpspsoptimizationspsRelaxationsspsofsps RelevantspsFunctionsspsforspsGPsspsandspsBayesianspsOptimization). In other words, the CPU time shows a similar trend for the cases with and without envelopes in Fig. 3a. In the RS, the CPU time with envelopes takes on average 7.1% of the CPU time without envelopes (≈ 14 times less). In the FS, the impact of the envelopes is less pronounced, i.e., the CPU time w/ envelopes is on average 15.1% of the CPU time w/o envelopes (≈ 6.6 times less). Figure 4a shows that the envelopes considerably reduce the number of necessary B&B iterations. However, the relaxations do not show a significant influence on the CPU time per iteration (see Fig. 4a).
The results of this numerical example show clearly that the development of tight relaxations is more important for the RS formulation than for the FS. As shown in Sect. 3.4.2 of [4], this effect can be explained by the fact that in RS, it is more likely to have reoccurring nonlinear factors which can cause the McCormick relaxations to become weaker (c.f. also the relationship to the AVM in this case explored in [81]). However, in this study, the improvement in relaxations is outweighed by the increase of CPU time per iteration when additional variables are introduced in the FS.
The RS formulation also performs favorably compared to the FS formulation in the solver BARON (see Fig. 3). However, the differences between the CPU times are less pronounced. In contrast to MAiNGO, the number of B&B iterations in the FS and RS drastically increase with increasing number of training data points when using BARON (c.f. Fig. 4 in ESI). Also, the time per B&B iteration is similar between RS and FS. This is probably due to the AVM method for the construction of relaxations. The AVM method introduces auxiliary variables for some factorable terms. Thus, the size of the subproblems in BARON increases with the number of training data points regardless of which of the two formulations is used.
The results of the optimizations also provide information about the ability of GP surrogate models to approximate a function for optimization. The results show that the solution point of the GP optimization problem approximately converges to the optimum of the learned peaks function for all covariance functions. However, it is clear that some covariance functions lead to more accurate solution for the same number of training data points in this particular case. In the ESI, we provide figures that show the solution point and objective function value over the number of data points for this problem (Figs. 8-11). Interestingly, the objective function value is overestimated considerably for all problems.

Chance-constrained programming
Probabilistic constraints are relevant in engineering and science [18] and GPs have been used in the previous literature to formulate chance constraints, e.g., in model predictive control [11] or production planning [87].
As a second case study, we consider the N-benzylation reaction of α methylbenzylamine with benzylbromide to form desired secondary (2 • ) amine and undesired tertiary (3 • ) amine. We utilize an experimental data set consisting of 78 data points from a robotic chemical reaction platform [69]. We aim to maximize the expected space-time yield of 2 • amine (2 • -STY) and ensure that the probability of a product quality constraint satisfaction is above 95%. The 2 • -STY and yield of 3 • amine impurity (3 • -Y) are modeled by individual GPs. Thus, we solve optimization problems with two GPs embedded. The chance-constrained optimization problem is formulated as follows Here, the objective is to minimize the negative of the expected STY. This corresponds to minimizing the negative prediction of the GP, i.e., −m D,2 • −STY . The chance constraint ensures that the impurity is below a parameter c with a probability of 95%. This corresponds to the constraint m D,3 • −Y + 1.96 · √ k D,3 • −Y ≤ c with c = 5. The optimization is conducted with respect to four optimization variables: (1) the primary (1 • ) amine flow rate of the feed varying between 0.2 and 0.4 mL min −1 , (2) the ratio between benzyl bromide and 1 • amine varying between 1.0 and 5.0, (3) the ratio between solvent and 1 • amine varying between 0.5 and 1.0, and (4) the temperature varying between 110 and 150 • C.
As this problem is highly multimodal and difficult to solve, we increase the number of local searches in pre-processing in MAiNGO to 500 and increase the maximum CPU time to 24 hours. The computational performance of the different methods is given in Table 1. The results show that none of the considered methods converged to the desired tolerance within the time limit. The RS formulation in MAiNGO that uses the proposed envelopes outperforms the other formulations and BARON solver as it yields the smallest optimality gap. Note that the considered SLSQP solver does not find any valid solution point in the FS in MAiNGO while feasible points are found in the RS. This demonstrates that the RS formulation can also be advantageous for local solvers. Note that when using IPOPT [84] with 500 multistart points in the FS formulation in MAiNGO, it identifies a local optimum with f * = −226.5 in the pre- The FS formulation has 330 optimization variables, 326 equality constraints, and 1 inequality constraint.
The RS formulation has 4 optimization variables, 0 equality constraints, and 1 inequality constraint. The CPU time limit is 86,400 s processing. In the ESI, we provide a brief comparison of a few pre-processing settings for this case study. The best solution of the optimization problem that we found is x 1 = 0.40 min −1 , x 2 = 1.0, x 3 = 0.5, and x 4 = 123.5 • C. At the optimal point, the predicted 2 • -STY is 226.5 kg m −3 h − 1 with a variance of 17.1 while the predicted amine impurity is 4.2 % with a variance of 0.17. The result shows that the probability constraint ensures a safety margin between the predicted impurity and c = 5. Note that the chance constraint is active at the optimal solution point.

Bayesian optimization
In the third case study, we consider the synthesis of layer-by-layer membranes. Membrane development is a prerequisite for sustainable supply of safe drinking water. However, synthesis of membranes is often based on try-and-error leading to extensive experimental efforts, i.e., building and measuring a membrane in the development phase usually takes several weeks per synthesis protocol. In this case study, we plan to improve the retention of Na 2 SO 4 salt of a recently developed layer-by-layer nanofiltration membrane system. The optimization variables are the sodium chloride concentration in the polyelectrolyte solution c NaCl ∈ [0, 0.5] gL − 1, the deposited polyelectrolyte mass m P E ∈ [0, 5] gm −2 , and the number of layers N layer ∈ {1, 2, 3, ..., 10}. The detailed description of the setup is given in the literature [55,65]. Overall, we utilize 63 existing data points from previous literature [55]. We identify a promising synthesis protocol based on the EI acquisition function by solving: with x = [c NaCl , m P E , N layer ] T . Thus, this numerical example corresponds to one step of a Bayesian optimization setup for this experiment. Global optimization of the acquisition function is particularly relevant due to inherent multimodality of the The FS formulation has 136 optimization variables and 133 equality constraints. The RS formulation has 3 optimization variables and no equality constraints acquisition functions [44] and high cost of experiments. Note that the experimental validation of this data point is not within the scope of this work. The computational performance of the proposed method is summarized in Table 2. Using the solver MAiNGO, the RS formulation converges approximately 9 times faster to the desired tolerance compared to the FS formulation. Herein, we use the derived tailored relaxations of the EI acquisition function and envelopes of the covariance functions in both cases. Notably, the FS requires approximately 1.8 times the number of B&B iterations compared to the RS formulation, which is much less than the overall speedup. Thus, the results are in good agreement with the previous examples showing that the reduction of CPU time per iteration in the RS has a major contribution to the overall speedup. For this example, a comparison to BARON is omitted due to necessary workarounds including several integer variables and function approximations for CDF and EI (c.f., Sects. 4.3, 4.6).
The optimal solution point of the optimization problem is c NaCl = 0.362 gL − 1, m P E = 0 gm −2 , and N layer = 4. The expected retention is 85.32 with a standard deviation σ = 14.8. The expected retention is actually worse than the best retention in the training data of 96.1. However, Bayesian optimization takes also the high variance of the solution into account, i.e., it is also exploring the space. EI identifies an optimal trade-off between exploration and exploitation.

Conclusions
We propose a RS formulation for the deterministic global solution of problems with trained GPs embedded. Also, we derive envelopes of common covariance functions or tight relaxations of acquisition functions leading to tight overall problem relaxations.
The computational performance is demonstrated on illustrative and engineering case studies using our open-source global solver MAiNGO. The results show that the number of optimization variables and equality constraints are reduced significantly compared to the FS formulation. In particular, the RS formulation results in smaller subproblems whose size does not scale with the number of training data points when using McCormick relaxations. This leads to tractable solution times and overcomes previous computational limitations. For example, we archive a speedup factor of 778 for a GP trained on 250 data points. The GP training methods and models are provided as an open-source module called "MeLOn-Machine Learning Models for Optimization" toolbox [71].
We thus demonstrate a high potential for future research and industrial applications. For instance, global optimization of the acquisition function can improve the efficiency of Bayesian optimization in various applications. It also allows to easily include integer decisions and nonlinear constraints in Bayesian optimization. Furthermore, the proposed method could be extended to various related concepts such as multi-task GPs [8], deep GPs [21], global model-predictive control with dynamic GPs [13,85], and Thompson sampling [12,17]. Finally, the proposed work demonstrates that the RS formulation may be advantageous for a wide variety of problems that have a similar structure, including various machine-learning models, model ensembles, Monte-Carlo simulation, and two-stage stochastic programming problems.

A Derivations of convex and concave relaxations
For the sake of simplicity, we use the same symbols in each subsection for the corresponding convex (F cv ) and concave (F cc ) relaxations. To solve the one-dimensional nonlinear equations that arise multiple times in the following, we use Newton's method with 100 iterations and a tolerance of 10 −9 . If this is not successful, we run a golden section search as a backup.

A.1 Probability density function of Gaussian distribution
In this subsection, the envelopes of the PDF are derived on a compact interval D = [x L , x U ]. As the probability density function is one-dimensional, McCormick [53] gives a method to construct its envelopes. The PDF is convex on ] − ∞, −1] and [1, ∞[ and it is concave on [−1, 1]. Its convex envelope, F cv : R → R, and concave envelope, F cc : R → R, are given by The case x L + x U = 0 is symmetrical and handled separately to avoid numerical issues in Newton. F cc 5 : R → R is given by a combination of F cc 2 and F cc 4 : Further, x L c,5 = max(x L * c,5 , x L ) and x L * c,5 is the solution of dφ

A.2 Probability of improvement acquisition function
In this section, a tight relaxation of the PI acquisition function is derived. PI is con-

A.2.1 Monotonicity
From the gradient of PI on R × (0, ∞), where f min is a given target, we identify the following monotonicity properties: -PI is monotonically decreasing with respect to μ.
-If μ < f min then PI is monotonically decreasing with respect to σ .
These properties can be used to obtain exact interval bounds on the function values of PI. Furthermore, they can be exploited to construct relaxations as described in Sect. A.2.3.

A.2.2 Componentwise convexity
The Hessian of PI on R × (0, ∞[ is given by The Hessian is indefinite and PI is therefore neither convex nor concave on its whole domain. However, we find componentwise convexity properties on certain parts of the domain, i.e., convexity with respect to one variable when the other is fixed. To this end, we divide the domain into the following four sets: On these sets, PI has the componentwise convexity properties listed in Table 3.

A.2.3 Relaxations
We construct relaxations of PI over a given subset X = [μ L , μ U ] × [σ L , σ U ] of its domain depending on which of the four sets I 1 -I 4 contains the set X . If X ⊂ I 1 ∪ I 2 , we use the McCormick relaxations obtained by applying the multivariate composition theorem [81] to the composition of the rational function f min −μ σ with Φ (c.f. Eq. (5)), since these are already very tight. If X does not fully lie within I 1 ∪ I 2 , the McCormick relaxations get increasingly weaker and we thus resort to other methods as described in the following.
If X ⊂ I 4 , PI is componentwise convex with respect to both variables. Therefore, the concave envelope of PI over X consists of two planes anchored at the four corner points of X and can be calculated as described by [56]. A tight convex relaxation can be obtained using the method by [61]. Since the off-diagonal entries of the Hessian have a constant sign over I 4 , a sufficient condition for this method is fulfilled (c.f. Corollary 1 in [61]). An example for the resulting relaxation is shown in Fig. 2b. Similarly, if X ⊂ I 3 , PI is componentwise concave and we obtain its convex envelope using the method by [56] and a tight concave relaxation using the method by [61].
If X ⊂ I 2 ∪ I 4 , we construct relaxations exploiting the monotonicity properties of PI. Since for all (μ, σ ) ∈ I 2 ∪ I 4 we have μ ≥ f min , PI is thus monotonically decreasing in μ and increasing in σ over X . Therefore, we can construct a convex relaxation PI cv 2,4 : X → [0, 1] as PI cv 2,4 (μ, σ ):= max f cv where f cv σ L and f cv μ U are the convex envelopes of the univariate functions respectively, i.e., they correspond to the function PI restricted to one-dimensional facets of X at σ L and μ U . Both f cv σ L (μ) and f cv μ U (σ ) are valid relaxations of PI because of the monotonicity of PI over I 2 ∪ I 4 . By taking the pointwise maximum in (7), we obtain a tighter relaxation while preserving convexity. To compute f cv σ L and f cv μ U , we can use the method described in Sect. 4 of [53] because they are one-dimensional functions with a known inflection point. To apply this method, we typically need to solve a one-dimensional nonlinear equation, which we do via Newton's method. A concave relaxation can be obtained analogously using concave envelopes of PI over one-dimensional facets of X at σ U and μ L . If X ⊂ I 1 ∪ I 3 , an analogous method can be used since PI is monotonically increasing in both μ and σ .
In the most general case, X contains parts of all four sets I 1 -I 4 . In this case, we can still obtain relaxations by exploiting monotonicity properties. In particular, we compute a convex relaxation PI cv 1−4 : X → [0, 1] as PI cv 1−4 (μ, σ ):= max f cv σ L (μ), f cv μ U (σ ) , where f cv μ U is again the convex relaxation of the univariate function f μ U as in (8), which is still valid because PI is decreasing with respect to μ on its entire domain. In contrast, the convex relaxation of the univariate function at σ L as in (7) is not valid because PI is not monotonic with respect to σ . Instead, in (9) To see thatf cv σ L is a valid relaxation of PI, we first note that by definition it is a relaxation off σ L , so it suffices to show thatf σ L is in turn a relaxation of PI. The latter is established in the following Lemma.
Proof Consider first any fixedμ such thatμ ≥ f min . In this case, we havẽ f σ L (μ) = PI(μ, σ L ) ≤ PI(μ, σ ) ∀σ ∈ [σ L , σ U ] because of the monotonicity w.r.t σ (c.f. Sect. A.2.1). Next, consider anyμ such thatμ < f min . Note that this implies μ L < f min . In this case, we havẽ where the inequalities follow, in this order, from the monotonicity of PI with respect to σ for μ ≥ f min , its componentwise concavity with respect to μ for μ < f min , and its monotonicity with respect to σ for μ < f min .

A.3 Expected improvement acquisition function
We now show that the EI acquisition function is convex. From the Hessian matrix of EI on R × (0, ∞) we find the eigenvalues 0 and (μ− f min ) 2 +σ 2 σ 3 ·φ − μ− f min σ . As σ ≥ 0, EI(·, ·) is convex and the envelopes can be constructed directly.