1 Introduction

A Gaussian process (GP) is a stochastic process where any finite collection of random variables has a multivariate Gaussian distribution; GPs can be understood as an infinite-dimensional generalization of multivariate Gaussian distributions [66]. The predictions of GPs are Gaussian distributions that provide not only an estimate but also a variance. GPs originate from geostatistics [46] and have been popular for the design and analysis of computer experiments (DACE) since 1989 [68]. Furthermore, GPs are commonly applied as interpolating surrogate models across various disciplines including biotechnology [13, 25, 29, 54, 82], chemical engineering [14, 22, 23, 24, 27, 34, 52], chemistry [1, 69], and deep learning [74]. Note that GP regression is also often referred to as Kriging. In many applications, GPs are trained on a data set and are subsequently embedded in an optimization problem, e.g., to identify an optimal operating point of a process. Moreover, many derivative-free solvers for expensive-to-evaluate black-box functions actually train GPs and optimize their predictions (e.g., Bayesian optimization algorithms [12, 40, 72, 82] and other adaptive sampling approaches [9, 10, 20, 22, 23, 24, 27]). In Bayesian optimization, the optimum of an acquisition function determines the next sampling point [72]. The vast majority of these optimizations have been performed by local solution approaches [44, 88] and a few by stochastic global optimization methods [12]. Our contribution focuses on the deterministic global solution of optimization problems with trained GPs embedded and on applications in process systems engineering.

GPs are commonly used to learn the input-output behavior of unit operations [14, 15, 42, 43, 48, 63, 64], complete flowsheets [33], or thermodynamic property relations from data [51]. Subsequently, the trained GPs are often combined with nonlinear mechanistic process models leading to hybrid mechanistic and data-driven models [31, 41, 60, 83] which are optimized. Many of the previous works on optimization with GPs embedded rely on local optimization techniques. Caballero and Grossmann, for instance, train GPs on data obtained from a rigorous divided wall column simulation. Then, they iteratively optimize the operation of the column (modeled by GPs) using SNOPT [30], sample new data at the solution point, and update the GPs [14, 15]. Later, Caballero and co-workers extend this work to distillation sequence superstructure problems [63, 64]. In [63], the authors solve the resulting mixed-integer nonlinear programs (MINLPs) using a local solver in GAMS. Therein, the GP estimate is computed via an external function in Matlab which leads to a reduced optimization problem size visible to the local solver in GAMS. However, all these local methods have the drawback that they can lead to suboptimal solutions, because the resulting optimization problems are nonconvex. This nonconvexity is induced by the covariance functions of the GPs and often also by the mechanistic part of the hybrid models.

Deterministic global optimization can guarantee to identify globally optimal solutions within finite time to a given nonzero tolerance [36]. In a few previous studies, deterministic global optimization with GPs embedded was done using general-purpose global solvers. For instance, in the black-box optimization algorithms ALAMO [20] and ARGONAUT [9], GPs are included as surrogate models and are optimized using BARON [79] and ANTIGONE [57], respectively. However, computational burdens were observed that limit applicability, e.g., in terms of the number of training points. Cozad et al. [20] state that GPs are accurate but “difficult to solve using provable derivative-based optimization software”. Similarly, Boukouvala and Floudas [9] state that the computational cost becomes a limiting factor because the number of nonlinear terms of GPs equals the product of the number of interpolated points (N) and the dimensionality (D) of the input domain. More recently, Keßler et al. [42, 43] optimized the design of nonideal distillation columns by a trust-region approach with GPs embedded. Therein, optimization problems with GPs embedded are solved globally using BARON [79] within relatively long CPU times (\(10^2\) to \(10^5\) CPU seconds on a personal computer).

As mentioned earlier, Quirante et al. [63] call an external Matlab function to compute GP estimates with a local solver in GAMS. As an alternative approach, they also solve the problem globally using BARON in GAMS by providing the full set of GP equations as equality constraints. This leads to additional intermediate optimization variables besides the degrees of freedom of the problem. Similar to other studies, they observe that their formulation is only practical for a small number of GP surrogates and training data points to avoid large numbers of variables and constraints [63]. We refer to the problem formulation where the GP is described by equality constraints and additional optimization variables as a full-space (FS) formulation. It is commonly used in modeling environments, e.g., GAMS, that interface with state-of-the-art global solvers such as ANTIGONE [57], BARON [79], and SCIP [50].

An alternative to the FS is a reduced-space (RS) formulation where some optimization variables are eliminated using explicit constraints. This reduced problem size leads to a lower number of variables for branching as well as potentially smaller subproblems. The former has some similarity to selective branching [28] (c.f. discussion in [4]). The exact size of the subproblems for lower bounding and bound tightening depends on the method for constructing relaxations. In particular, when constructing relaxations in the RS using McCormick [53], alphaBB [2] or natural interval extensions, the resulting lower bounding problems are much smaller compared to the auxiliary variable method (AVM) [73, 80]. Therefore, any global solver can in principle handle RS but some methods for constructing relaxations appear more promising to benefit from the RS [4]. We have recently released the open-source global solver MAiNGO [7] which uses the MC++ library [16] for automatic propagation of McCormick relaxations through computer code [58]. We have shown that the RS formulation can be advantageous for flowsheet optimization problems [5, 6] and problems with artificial neural networks embedded [37, 65, 70]. In the context of Bayesian optimization, Jones et al. [40] develop valid overestimators of the expected improvement (EI) acquisition function in the RS. However, their relaxations rely on interval extensions and optimization-based relaxations limited to a specific covariance function; they do not derive envelopes. Furthermore, they do not provide convex underestimators which are in general necessary to embed GPs in optimization problems.

The main contribution of this work is the efficient deterministic global optimization of optimization problems with GPs embedded. We develop a RS formulation for optimization problems with GPs embedded. The performance of the proposed method is analyzed in an extensive computational study by solving about 90,000 optimization problems. The proposed RS outperforms a FS formulation for problems with GPs embedded by speedup factors of several orders of magnitude. Moreover, this speedup increases with the number of training points. To further accelerate convergence, we derive and implement envelopes of covariance functions for GPs and tight relaxations of acquisition functions, which are commonly used in Bayesian optimization. Finally, we solve a chance-constrained optimization problem with GPs embedded and we perform global optimization of an acquisition function. The GP training methods and models are provided as an open-source toolbox called “MeLOn—Machine Learning Models for Optimization” under the Eclipse public license [71]. The resulting optimization problems are solved using our global solver MAiNGO [7]. Note that the MeLOn toolbox is also automatically included as a submodule in our new MAiNGO release.

2 Optimization problem formulations

In the simplest case, which is common in the literature, the (scaled) inputs of a GP are the free variables of the optimization problem with \({\varvec{x}} \in {\tilde{X}} = [{\varvec{x}}^{L},{\varvec{x}}^{U}]\). For given \({\varvec{x}}\), the dependent (or intermediate) variables \({\varvec{z}}\) can be computed by the solution of \({\varvec{h}}({\varvec{x}},{\varvec{z}})={\varvec{0}}, \quad {\varvec{h}}:{\tilde{X}} \times {\mathbb {R}}^{n_z} \rightarrow {\mathbb {R}}^{n_z}\). In the case of GP models, we aim to include the estimate (\(m_{\varvec{\mathcal {D}}}\)) and variance (\(k_{\varvec{\mathcal {D}}}\)) in the optimization. As will be shown in Sect. 3, we can solve explicitly for \(m_{\varvec{\mathcal {D}}}\) and \(k_{\varvec{\mathcal {D}}}\) [c.f. Eqs. (1) and (2)].

The realization of the objective function f depends on the application. In many applications, it depends on the estimate of the GP, i.e., \(f(m_{\varvec{\mathcal {D}}})\) (c.f. Sect. 6.1). In Bayesian optimization, the objective function is called the acquisition function and usually depends on the estimate and variance of the GP, i.e., \(f(m_{\varvec{\mathcal {D}}},k_{\varvec{\mathcal {D}}})\) (c.f. Sect. 6.3). Finally, additional constraints might depend on the inputs of the GP, its estimate, and variance, i.e., \({\varvec{g}}({\varvec{x}},m_{\varvec{\mathcal {D}}},k_{\varvec{\mathcal {D}}}) \leqslant {\varvec{0}}\). In more complex cases, multiple GPs can be combined in one optimization problem (c.f. Sect. 6.2).

In the following, we describe two optimization problem formulations for problems with trained GPs embedded: the commonly used FS formulation in Sect. 2.1 and the RS formulation in Sect. 2.2. Both problem formulations are exact reformulations in the sense of Liberti et al. [47], meaning that they have the same local and global optima. The equivalence is shown in “Appendix A” of [4]. However, the formulation significantly affects problem size and performance of global optimization solvers.

2.1 Full-space formulation

In the FS formulation, the nonlinear equations \({\varvec{h}}({\varvec{x}},{\varvec{z}})={\varvec{0}}\) are provided as equality constraints and the intermediate dependent variables \({\varvec{z}} \in Z\) are optimization variables. A general FS problem formulation is:

$$\begin{aligned}&\underset{{\varvec{x}} \in {\tilde{X}}, {\varvec{z}} \in Z}{\min } f({\varvec{x}},{\varvec{z}}) \\&\text {s.t.} \qquad {\varvec{h}}({\varvec{x}},{\varvec{z}}) = {\varvec{0}}, \qquad {\varvec{g}}({\varvec{x}},{\varvec{z}}) \leqslant {\varvec{0}} \end{aligned}$$
(FS)

In general, there exist multiple valid FS formulations for optimization problems. In Sect. 3 of the electronic supplementary information (ESI), we provide a representative FS formulation for the case where the estimate of a GP is minimized. This is also the FS formulation that we use in our numerical examples (c.f. Sect. 6.1).

2.2 Reduced-space formulation

In the RS formulation, the equality constraints are solved for the intermediate variables and substituted in the optimization problem (c.f. [5]). A general RS problem formulation in the context of optimization with a GP embedded is:

$$\begin{aligned}&\underset{{\varvec{x}} \in {\tilde{X}}}{\min } \quad f(m_{\varvec{\mathcal {D}}}({\varvec{x}}),k_{\varvec{\mathcal {D}}}({\varvec{x}})) \\&\text {s.t.} \quad {\varvec{g}}({\varvec{x}},m_{\varvec{\mathcal {D}}}({\varvec{x}}),k_{\varvec{\mathcal {D}}}({\varvec{x}})) \leqslant {\varvec{0}} \end{aligned}$$
(RS)

Herein, the Branch-and-Bound (B&B) solver operates only on the free variables \({\varvec{x}}\) and no equality constraints are visible to the solver. In GPs, the estimate and variance are explicit functions of the input [Eqs. (1) and (2)]. Thus, the RS formulation can be stated directly. The RS formulation effectively combines those equations and hides them from the B&B algorithm. This results in a total number of D optimization variables, zero equality constraints, and no additional optimization variables \({\varvec{z}}\). Thus, the RS formulation requires only bounds on \({\varvec{x}}\).

Note that the direct substitution of all equality constraints is not always possible when multiple GPs are combined with mechanistic models, e.g., in the presence of recycle streams. Here, a small number of additional optimization variables and corresponding equality constraints can remain in the RS formulation [5]. As an alternative, relaxations for implicit functions can also be derived [76, 86]. Moreover, we have previously observed that a hybrid between RS and FS formulation can be more efficiently solvable for some optimization problems [6]. In this work, we compare the RS and the FS formulation and do not consider any hybrid problem formulations.

3 Gaussian processes

In this section, GPs are briefly introduced (c.f. [66]). We first describe the GP prior distribution, i.e., the probability distribution before any data is taken into account. Then, we describe the posterior distribution, which results from conditioning the prior on training data. Finally, we describe how hyperparameters of the GP can be adapted to data by a maximum a posteriori (MAP) estimate.

3.1 Prior

A GP prior is fully described by its mean function \(m(\mathbf{x })\) and positive semi-definite covariance function \(k(\mathbf{x },{\varvec{x}}')\) (also known as kernel function). We consider a noisy observation y from a function \({\tilde{f}}(\mathbf{x })\) with \(y(\mathbf{x }) {:}{=}{\tilde{f}}(\mathbf{x }) + \varepsilon _{\mathrm {noise}}\), whereby the output noise \(\varepsilon _{\mathrm {noise}}\) is independent and identically distributed (i.i.d.) with \(\varepsilon _{\mathrm {noise}} \sim \mathcal {N}(0,{\sigma }_{\mathrm {noise}}^2 )\). We say y is distributed as a GP, i.e., \(y \sim {\mathcal {G}}{\mathcal {P}}(m(\mathbf{x }),k(\mathbf{x }, \mathbf{x }'))\) with

$$\begin{aligned} m(\mathbf{x })&{:}{=}{\mathbb {E}}\big [{\tilde{f}}(\mathbf{x })\big ], \\ k(\mathbf{x }, {\varvec{x}}')&{:}{=}{\mathbb {E}}\big [~( y(\mathbf{x }) - m(\mathbf{x }) )~( y(\mathbf{x }') - m(\mathbf{x }') )^{\mathrm {T}} \big ]. \end{aligned}$$

Without loss of generality, we assume that the prior mean function is \(m(\mathbf{x })=0\). This implies that we train the GP on scaled data such that the mean of the training outputs is zero. A common class of covariance functions is the Matérn class.

$$\begin{aligned} k_{\text {Mat}{\acute{e}}\text {rn}}({\varvec{x}}, {\varvec{x}}') {:}{=}\sigma _f^2 \frac{2^{1-\nu }}{\varGamma (\nu )} \left( \sqrt{2 \nu } r \right) ^\nu K_\nu \left( \sqrt{2 \nu } r \right) , \end{aligned}$$

where \(\sigma _f^2\) is the output variance, \(r {:}{=}\sqrt{({\varvec{x}} - {\varvec{x}}')^{\mathrm {T}} \varvec{\varLambda }~({\varvec{x}} - {\varvec{x}}')}\) is a weighted Euclidean distance, \(\varvec{\varLambda } {:}{=}\mathrm {diag}(\lambda _1^2, \cdots , \lambda _i^2, \cdots , \lambda _{n_x}^2)\) is a length-scale matrix with \(\lambda _i \in {\mathbb {R}}\), \(\varGamma (\cdot )\) is the gamma function, and \(K_\nu (\cdot )\) is the modified Bessel function of the second kind. The smoothness of Matérn covariance functions can be adjusted by the positive parameter \(\nu \). When \(\nu \) is a half-integer value, the Matérn covariance function becomes a product of a polynomial and an exponential [66]. Common values for \(\nu \) are 1/2, 3/2, 5/2, and \(\infty \); the limit \(\nu \rightarrow \infty \) yields the most widely used squared exponential covariance function, \(k_{SE}'(r) {:}{=}\exp \left( -\frac{1}{2}~r^2 \right) \). We derive envelopes of these covariance functions in Sect. 4.1 and implement them within MeLOn [71]. Also, a noise term, \(\sigma _{\mathrm {n}}^2 \cdot \delta ({\varvec{x}},{\varvec{x}}')\), can be added to any covariance function where \(\sigma _{\mathrm {n}}^2\) is the noise variance and \(\delta ({\varvec{x}},{\varvec{x}}')\) is the Kronecker delta function. The hyperparameters of the covariance function are adjusted during training and are jointly denoted as \(\varvec{\theta } = [\lambda _1,...,\lambda _{n_x},\sigma _f, \sigma _{\mathrm {n}}]\). Herein, a log-transformation is common to prevent negative values during training.

3.2 Posterior

The GP posterior is obtained by conditioning the prior on observations. We consider a set of N training inputs \(\varvec{\mathcal {X}}=\{{\varvec{x}}_1^{(\varvec{\mathcal {D}})}, ...,{\varvec{x}}_N^{(\varvec{\mathcal {D}})} \}\) where \({\varvec{x}}_{i}^{(\varvec{\mathcal {D}})} = [x_{i,1}^{(\varvec{\mathcal {D}})}, ...,x_{i,D}^{(\varvec{\mathcal {D}})}]^T\) is a D-dimensional vector. Note that we use the superscript \({(\varvec{\mathcal {D}})}\) to denote the training data. The corresponding set of scalar observations is given by \(\varvec{\mathcal {Y}}=\{y_1^{(\varvec{\mathcal {D}})}, ...,y_N^{(\varvec{\mathcal {D}})} \}\). Furthermore, we define the vector of scalar observations \({\varvec{y}} =\) \([y_1^{(\varvec{\mathcal {D}})}, ...,y_N^{(\varvec{\mathcal {D}})}]^T\) \(\in {\mathbb {R}}^{N}\). The posterior GP is obtained by Bayes’ theorem:

$$\begin{aligned} {\tilde{f}}({\varvec{x}}) \sim {\mathcal {G}}{\mathcal {P}}(m({\varvec{x}}), k({\varvec{x}},{\varvec{x}}') | \varvec{\mathcal {X}},\varvec{\mathcal {Y}}) = \mathcal {N} \left( m_{\varvec{\mathcal {D}}}( {\varvec{x}}), k_{\varvec{\mathcal {D}}}( {\varvec{x}},{\varvec{x}}') \right) \end{aligned}$$

with

$$\begin{aligned} m_{\varvec{\mathcal {D}}}( {\varvec{x}})&= {\varvec{K}}_{{\varvec{x}},\mathcal {X}} \left( {\varvec{K}}_{\mathcal {X},\mathcal {X}} \right) ^{-1} {\varvec{y}}, \end{aligned}$$
(1)
$$\begin{aligned} k_{\varvec{\mathcal {D}}}({\varvec{x}})&= {K}_{{\varvec{x}},{\varvec{x}}} - {\varvec{K}}_{{\varvec{x}},\mathcal {X}} \left( {\varvec{K}}_{\mathcal {X},\mathcal {X}} \right) ^{-1} {\varvec{K}}_{ \mathcal {X},{\varvec{x}} }, \end{aligned}$$
(2)

where the covariance matrix of the training data is given by \({\varvec{K}}_{\mathcal {X},\mathcal {X}} {:}{=}\left[ k({\varvec{x}}_i,{\varvec{x}}_j) \right] \in {\mathbb {R}}^{N \times N}\), the covariance vector between the candidate point \({\varvec{x}}\) and the training data is given by \({\varvec{K}}_{{\varvec{x}},\mathcal {X}} {:}{=}\left[ k({\varvec{x}},{\varvec{x}}_1^{(\varvec{\mathcal {D}})}), ...,k({\varvec{x}},{\varvec{x}}_N^{(\varvec{\mathcal {D}})}) \right] \in {\mathbb {R}}^{1 \times N}\), \({\varvec{K}}_{ \mathcal {X},{\varvec{x}} } = {{\varvec{K}}_{ {\varvec{x}},\mathcal {X} }}^T\), and \({K}_{{\varvec{x}},{\varvec{x}}} {:}{=}k({\varvec{x}},{\varvec{x}})\). Equations (1) and (2) describe essentially the predictions of a GP and are implemented within MeLOn.
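As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch and not the MeLOn implementation. It assumes the squared exponential covariance function, a noise term added only to the diagonal of \({\varvec{K}}_{\mathcal {X},\mathcal {X}}\), and a noise-free \({K}_{{\varvec{x}},{\varvec{x}}}\); the function names `sq_dist` and `gp_posterior` and the default hyperparameter values are hypothetical.

```python
import numpy as np


def k_se(d):
    """Squared exponential covariance as a function of the weighted squared distance d."""
    return np.exp(-0.5 * d)


def sq_dist(A, B, lam):
    """Weighted squared distances d = (a - b)^T diag(lam^2) (a - b) for all row pairs."""
    diff = (A[:, None, :] - B[None, :, :]) * lam
    return np.sum(diff**2, axis=-1)


def gp_posterior(x, X, y, lam, sigma_f=1.0, sigma_n=1e-3):
    """Posterior estimate, Eq. (1), and variance, Eq. (2), at a single point x.

    Assumptions of this sketch: zero prior mean, squared exponential kernel,
    noise variance sigma_n^2 on the diagonal of K_{X,X} only.
    """
    K = sigma_f**2 * k_se(sq_dist(X, X, lam)) + sigma_n**2 * np.eye(len(X))
    k_x = sigma_f**2 * k_se(sq_dist(x[None, :], X, lam))[0]  # K_{x,X}, shape (N,)
    L = np.linalg.cholesky(K)                                 # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K^{-1} y
    v = np.linalg.solve(L, k_x)                               # L^{-1} K_{X,x}
    mean = k_x @ alpha                                        # Eq. (1)
    var = sigma_f**2 - v @ v                                  # Eq. (2)
    return mean, var
```

Since the training data enter only through \({\varvec{K}}_{\mathcal {X},\mathcal {X}}\) and \({\varvec{y}}\), the Cholesky factor and the vector \({\varvec{K}}_{\mathcal {X},\mathcal {X}}^{-1} {\varvec{y}}\) could be computed once after training and reused for every candidate point.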

3.3 Maximum a posteriori

In order to find appropriate hyperparameters \(\varvec{\theta }\) for a given problem, we use a MAP estimate which is known to be advantageous compared to the maximum likelihood estimation (MLE) on small data sets [77]. Using the MAP estimate, the hyperparameters are identified by maximizing the probability that the GP fits the training data, i.e., \(\varvec{\theta }_{\mathrm {opt}} {:}{=}{\mathop {\hbox {argmax}}\nolimits _{\varvec{\theta }}}\,\mathcal {P}\left( \varvec{\theta } |\varvec{\mathcal {X}}, \varvec{\mathcal {Y}} \right) \). Analytical expressions for \(\mathcal {P}\left( \varvec{\theta } |\varvec{\mathcal {X}}, \varvec{\mathcal {Y}} \right) \) and its derivatives w.r.t. the hyperparameters can be found in the literature [66]. We provide a Matlab training script in MeLOn that is based on our previous work [12]. Therein, we assume an independent Gaussian distribution as a prior distribution on the log-transformed hyperparameters, i.e., \(\theta _i \sim \mathcal {N} \left( \mu _i, \sigma _i^2 \right) \). The implementation of the training is efficient through the pre-computation of squared distances, the Cholesky decomposition for computing the inverse of the covariance matrix, and a two-step training approach that searches first globally and then locally [12].
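The MAP objective can be sketched as follows. This is a minimal Python illustration and not the Matlab training script shipped with MeLOn; it assumes a squared exponential covariance function with a noise term, a zero prior mean, and log-transformed hyperparameters ordered as \([\log \lambda _1,...,\log \lambda _D,\log \sigma _f,\log \sigma _{\mathrm {n}}]\). The function name `neg_log_posterior` and its arguments are hypothetical. The returned value is the negative logarithm of the unnormalized posterior, i.e., of the marginal likelihood times the Gaussian prior.

```python
import numpy as np


def neg_log_posterior(theta, X, y, prior_mean, prior_std):
    """Negative log of (marginal likelihood x Gaussian prior) for log-hyperparameters theta."""
    D = X.shape[1]
    lam = np.exp(theta[:D])          # length-scale weights lambda_i
    sigma_f = np.exp(theta[D])       # output standard deviation
    sigma_n = np.exp(theta[D + 1])   # noise standard deviation

    # Weighted squared distances; these could be pre-computed per dimension.
    diff = (X[:, None, :] - X[None, :, :]) * lam
    d = np.sum(diff**2, axis=-1)
    K = sigma_f**2 * np.exp(-0.5 * d) + sigma_n**2 * np.eye(len(X))

    # Cholesky factorization avoids forming the inverse explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_lik = (-0.5 * y @ alpha
               - np.sum(np.log(np.diag(L)))          # equals 0.5 * log det K
               - 0.5 * len(X) * np.log(2.0 * np.pi))

    # Independent Gaussian prior on each log-transformed hyperparameter.
    log_prior = np.sum(-0.5 * ((theta - prior_mean) / prior_std) ** 2
                       - np.log(prior_std * np.sqrt(2.0 * np.pi)))
    return -(log_lik + log_prior)
```

Minimizing this function over `theta`, e.g., first with a global search and then with a local solver, would correspond to the two-step training approach described above.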

4 Convex and concave relaxations

The construction of relaxations, i.e., convex function underestimators (\(F^{cv}\)) and concave function overestimators (\(F^{cc}\)), is essential for B&B algorithms. In our open-source solver MAiNGO, we use the (multivariate) McCormick method [53, 81] to propagate relaxations and their subgradients [58] through explicit functions using the MC++ library [16]. However, the McCormick method often does not provide the tightest possible relaxations, i.e., the envelopes. In this section, we derive tight relaxations or envelopes of functions that are relevant for GPs and Bayesian optimization. The functions and their relaxations are implemented in MC++. When using these intrinsic functions and their relaxations in MAiNGO, the (multivariate) McCormick method is only used for the remaining parts of the model. Note that the derived relaxations are used within MAiNGO while BARON does not allow for implementation of custom relaxations or piecewise defined functions.

4.1 Covariance functions

The covariance function is a key element of GPs. When embedding trained GPs into optimization problems, the covariance function occurs N times because it is used in the covariance vector between the candidate point \({\varvec{x}}\) and the training data, i.e., \({\varvec{K}}_{{\varvec{x}},\mathcal {X}} = \left[ k({\varvec{x}},{\varvec{x}}_1^{(\varvec{\mathcal {D}})}), ...,k({\varvec{x}},{\varvec{x}}_N^{(\varvec{\mathcal {D}})}) \right] \in {\mathbb {R}}^{1 \times N}\). Note that the covariance matrix \({\varvec{K}}_{\mathcal {X},\mathcal {X}}\) depends only on training data and is thus a parameter during the optimization. Thus, tight relaxations of the covariance functions are highly desirable. In this subsection, we derive envelopes for common Matérn covariance functions. We consider univariate covariance functions, i.e., \(k_{\nu }: {\mathbb {R}} \rightarrow {\mathbb {R}}\), with input \(d = ({\varvec{x}} - {\varvec{x}}')^{\mathrm {T}} \varvec{\varLambda }~({\varvec{x}} - {\varvec{x}}')\geqslant 0\). This is possible because we consider stationary covariance functions that are invariant to translations in the input space. Common Matérn covariance functions use \(\nu =1/2,~3/2,~5/2\) and \(\infty \) and are given by:

$$\begin{aligned}&k_{\nu =1/2}(d) {:}{=}\exp \left( -\sqrt{d} \right) , \qquad k_{\nu =3/2}(d) {:}{=}\left( 1 + \sqrt{3}~\sqrt{d} \right) \cdot \exp \left( - \sqrt{3}~ \sqrt{d} \right) \\&k_{\nu =5/2}(d) {:}{=}\left( 1 + \sqrt{5}~\sqrt{d} +\frac{5}{3}~d \right) \cdot \exp \left( -\sqrt{5}\sqrt{d} \right) , \qquad k_{SE}(d) {:}{=}\exp \left( -\frac{1}{2}~d \right) , \end{aligned}$$

where \(k_{SE}\) is the squared exponential covariance function with \(\nu \rightarrow \infty \). We find that these four covariance functions are convex because their Hessian is positive semidefinite. Thus, the convex envelope is given by \(F^{cv}(d) = k(d)\) and the concave envelope by the secant \(F^{cc}(d) = {{\,\mathrm{sct}\,}}(d)\) where \({{\,\mathrm{sct}\,}}(d) =\) \(\frac{k(d^U) - k(d^L)}{d^U - d^L} d + \frac{d^U k(d^L) - d^L k(d^U) }{d^U - d^L}\) on a given interval \([d^L, d^U]\). As the McCormick composition and product theorems provide weak relaxations of \(k_{\nu =3/2}\) and \(k_{\nu =5/2}\) (c.f. ESI Sect. 1), we implement these functions and their envelopes in our library of intrinsic functions in MC++. Furthermore, natural interval extensions are not exact for \(k_{\nu =3/2}\) and \(k_{\nu =5/2}\). Thus, we also provide exact interval bounds based on the monotonicity.
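The four covariance functions and the secant can be written down directly. The following Python sketch only evaluates the functions and the concave envelope \({{\,\mathrm{sct}\,}}(d)\) on a given interval; it is an illustration and not the MC++ implementation, and the names `k12`, `k32`, `k52`, `kse`, and `secant` are hypothetical.

```python
import numpy as np


def k12(d):
    """Matern covariance with nu = 1/2 as a function of d >= 0."""
    return np.exp(-np.sqrt(d))


def k32(d):
    """Matern covariance with nu = 3/2 as a function of d >= 0."""
    s = np.sqrt(3.0 * d)
    return (1.0 + s) * np.exp(-s)


def k52(d):
    """Matern covariance with nu = 5/2 as a function of d >= 0."""
    s = np.sqrt(5.0 * d)
    return (1.0 + s + 5.0 / 3.0 * d) * np.exp(-s)


def kse(d):
    """Squared exponential covariance as a function of d >= 0."""
    return np.exp(-0.5 * d)


def secant(k, d_lo, d_up):
    """Return the secant of k on [d_lo, d_up], i.e., the concave envelope of a convex k."""
    slope = (k(d_up) - k(d_lo)) / (d_up - d_lo)
    intercept = (d_up * k(d_lo) - d_lo * k(d_up)) / (d_up - d_lo)

    def sct(d):
        return slope * d + intercept

    return sct
```

For a convex function, the function itself is the convex envelope, so the pair `(k, secant(k, d_lo, d_up))` gives both envelopes on the interval. Because all four functions are monotonically decreasing in d, exact interval bounds are `[k(d_up), k(d_lo)]`.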

It should be noted that covariance functions are commonly given as a function of the weighted Euclidean distance \(r=\sqrt{d}\). However, we chose to use d instead for three main reasons: (1) \({\varvec{x}}\) is usually a free variable of the optimization problem. Thus, the computation of r would lead to potentially weaker relaxations for \(k_{\nu =5/2}\) and \(k_{SE}\). (2) The derivative of \(k_{\nu =3/2}(\cdot )\), \(k_{\nu =5/2}(\cdot )\), and \(k_{SE}(\cdot )\) is defined at \(d=0\) while the derivative of the square root function is not. (3) The covariance functions \({\hat{k}}_{\nu =3/2}: r \mapsto k_{\nu =3/2}(r^2)\), \({\hat{k}}_{\nu =5/2}: r \mapsto k_{\nu =5/2}(r^2)\), and \({\hat{k}}_{SE}: r \mapsto k_{SE}(r^2)\) are nonconvex in r, so deriving the envelopes would be nontrivial.

Finally, it can be noted that we did not derive envelopes of \(k_{\text {Mat}{\acute{e}}\text {rn}}({\varvec{x}}, {\varvec{x}}')\), because the variable input dimensions pose difficulties in implementation and the multidimensionality is a challenge for the derivation of envelopes. Nevertheless, the McCormick composition theorem applied to \(k_{\nu }(d({\varvec{x}},{\varvec{x}}'))\) yields relaxations that are exact at the minimum of \(k_{\text {Mat}{\acute{e}}\text {rn}}\) because the natural interval extensions of the weighted squared distance d are exact (c.f. [62]). This means that the relaxations are exact in Hausdorff metric.

4.2 Gaussian probability density function

The Gaussian probability density function (PDF) is used to compute the EI acquisition function and is given by \(\phi : {\mathbb {R}} \rightarrow {\mathbb {R}}\) with

$$\begin{aligned} \phi (x) {:}{=}\frac{1}{\sqrt{2 \pi }} \cdot \exp \left( \frac{-x^2}{2} \right) \end{aligned}$$
(3)

The PDF is a nonconvex function for which the McCormick composition rule does not provide the envelopes. For one-dimensional functions, McCormick [53] also provides a method to construct envelopes. We construct the envelopes of the PDF using this method and implement them in our library of intrinsic functions. The envelope of the PDF is illustrated in Fig. 1 and derived in “Appendix A.1”.

Fig. 1 Illustration of the envelope of the Gaussian PDF

4.3 Gaussian cumulative distribution function

The Gaussian cumulative distribution function (CDF) is given by \(\varPhi : {\mathbb {R}} \rightarrow {\mathbb {R}}\) with

$$\begin{aligned} \varPhi (x) {:}{=}\int _{-\infty }^{x} ~\phi (t)~dt = \frac{1+{{\,\mathrm{erf}\,}}\left( \frac{\sqrt{2} x}{2} \right) }{2}. \end{aligned}$$
(4)

The envelopes of the error function are already available in MC++ as an intrinsic function and consequently the McCormick technique provides envelopes of the CDF (see Fig. 2a in ESI). In contrast, the error function is not available as an intrinsic function in BARON and a closed-form expression does not exist. Thus, a numerical approximation is required for optimization in BARON. Common numerical approximations of the error function are only valid for \(x\geqslant 0\) and use point symmetry of the error function. To overcome this technical difficulty in BARON, a big-M formulation with additional binary and continuous variables is a possible workaround. However, this workaround leads to potentially weaker relaxations (see Sect. 2 in the ESI).
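For reference, Eqs. (3) and (4) and the point symmetry of the error function can be evaluated with the Python standard library. This sketch only evaluates the functions and does not construct relaxations; `erf_via_symmetry` is a hypothetical helper that shows how an approximation valid only for \(x\geqslant 0\) could be extended to negative arguments, with `math.erf` standing in for such an approximation.

```python
import math


def gauss_pdf(x):
    """Gaussian probability density function, Eq. (3)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gauss_cdf(x):
    """Gaussian cumulative distribution function, Eq. (4)."""
    return 0.5 * (1.0 + math.erf(math.sqrt(2.0) * x / 2.0))


def erf_via_symmetry(x, erf_nonneg=math.erf):
    """Extend an error function valid for x >= 0 to all x via erf(-x) = -erf(x)."""
    return math.copysign(1.0, x) * erf_nonneg(abs(x))
```

The sign switch in `erf_via_symmetry` is the piecewise definition that an algebraic model would have to express with binary variables in a big-M formulation.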

4.4 Lower confidence bound acquisition function

The lower confidence bound (LCB) (upper confidence bound when considering maximization) is an acquisition function with strong theoretical foundation. For instance, a bound on its cumulative regret, i.e., a convergence rate for Bayesian optimization, is known under relatively mild assumptions on the black-box function [75]. It is given by \(\text {LCB}: {\mathbb {R}} \times {\mathbb {R}}_{\geqslant 0} \rightarrow {\mathbb {R}}\) with

$$\begin{aligned} \text {LCB}(\mu , \sigma ) {:}{=}\mu - \kappa \cdot \sigma \end{aligned}$$

with a parameter \(\kappa \in {\mathbb {R}}_{>0} \). LCB has not been popular in engineering applications as it requires an additional tuning parameter \(\kappa \) and leads to heavy exploration when a rigorous value for \(\kappa \) is chosen [75]. Recently, LCB has gained more popularity through the application as a policy in deep reinforcement learning, e.g., by DeepMind [59]. LCB is a linear function and thus McCormick relaxations are exact.
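Because LCB is linear in \(\mu \) and \(\sigma \), its exact range over a box follows from evaluating two corners. The short Python sketch below illustrates this; the function names and the default value \(\kappa = 2\) are assumptions made for illustration only.

```python
def lcb(mu, sigma, kappa=2.0):
    """Lower confidence bound LCB(mu, sigma) = mu - kappa * sigma with kappa > 0."""
    return mu - kappa * sigma


def lcb_interval(mu_lo, mu_up, sigma_lo, sigma_up, kappa=2.0):
    """Exact range of LCB over [mu_lo, mu_up] x [sigma_lo, sigma_up].

    LCB increases in mu and decreases in sigma, so the minimum is attained at
    (mu_lo, sigma_up) and the maximum at (mu_up, sigma_lo).
    """
    return lcb(mu_lo, sigma_up, kappa), lcb(mu_up, sigma_lo, kappa)
```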

4.5 Probability of improvement acquisition function

Probability of improvement (PI) computes the probability that a prediction at \({\varvec{x}}\) is below a given target \(f_{\text {min}}\), i.e., \(\tilde{\text {PI}}({\varvec{x}}) = \mathcal {P}\left( f({\varvec{x}}) \leqslant f_{\text {min}} \right) \). When the underlying function is distributed as a GP with mean \(\mu \) and standard deviation \(\sigma \), the PI is given by \(\text {PI}: {\mathbb {R}} \times {\mathbb {R}}_{\geqslant 0} \rightarrow {\mathbb {R}}\) with

$$\begin{aligned} \text {PI}(\mu , \sigma ) {:}{=}{\left\{ \begin{array}{ll} \varPhi \left( \frac{f_{\text {min}} - \mu }{\sigma } \right) , &{}\quad \sigma> 0, \\ 0, &{} \quad \sigma = 0, ~ f_{\text {min}} \leqslant \mu , \\ 1, &{} \quad \sigma = 0, ~ f_{\text {min}} > \mu . \end{array}\right. } \end{aligned}$$
(5)
Fig. 2 Graph of the probability of improvement acquisition function (PI) as in Eq. (5) for \(f_{\text {min}} =0\) along with the developed convex and concave relaxations. (a) On the interval \([-2,2]\times [0,10]\), the relaxations are constructed on the basis of monotonicity properties of PI. (b) On the interval \([1,2]\times [0,1]\), the relaxations are constructed on the basis of componentwise convexity properties via the methods of Meyer and Floudas [56] and Najman et al. [61]. Note that the ranges of \(\mu \) and \(\sigma \) are different in the two subfigures to highlight the individual relaxations that are derived on different intervals. The ranges in (a) are such that they overlap with all four sets \(I_1\)-\(I_4\) defined in Sect. A.2, while the ranges in (b) lie within the set \(I_4\)

The PI acquisition function is neither convex nor concave over its entire domain. However, as analyzed in Sect. A.2, there are parts of its domain over which the function is componentwise convex or concave with respect to \(\sigma \) or \(\mu \). For componentwise convex or concave functions, there exist methods for constructing tight relaxations. Meyer and Floudas [56] introduced a method for constructing concave relaxations for componentwise convex functions (or vice versa). In particular, the concave envelope of a componentwise convex function is polyhedral [78], and thus the method of [56] amounts to finding the correct combinations of corners of the considered interval box to construct the facets of the polyhedral concave envelope. Najman et al. [61], in contrast, introduce a method for constructing convex relaxations of componentwise convex functions that satisfy a certain monotonicity condition on their first-order partial derivatives (or, in case of twice continuously differentiable functions, have mixed second-order partial derivatives with constant sign over the box). For functions that are componentwise convex with respect to some and concave with respect to other variables, [61] also show that by taking the secant with respect to the concave (or convex) variables, one can obtain a relaxation that is componentwise convex (or concave) with respect to all variables. Using the aforementioned methods, this function can then be further relaxed to obtain convex (or concave) relaxations.

We use these methods that exploit componentwise convexity along with techniques that exploit monotonicity properties to construct tight relaxations of the \(\text {PI}\) acquisition function. The procedure for constructing these relaxations is described in detail in “Appendix A.2”. Examples for the resulting relaxations on two subsets of the domain of \(\text {PI}\) are shown in Fig. 2.
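The piecewise definition in Eq. (5) can be evaluated as in the following Python sketch. It only evaluates PI and does not construct the relaxations of “Appendix A.2”; the function name `prob_improvement` and the argument order are hypothetical.

```python
import math


def gauss_cdf(x):
    """Gaussian cumulative distribution function, Eq. (4)."""
    return 0.5 * (1.0 + math.erf(math.sqrt(2.0) * x / 2.0))


def prob_improvement(mu, sigma, f_min):
    """Probability of improvement, Eq. (5), including the case sigma = 0."""
    if sigma < 0.0:
        raise ValueError("sigma must be nonnegative")
    if sigma == 0.0:
        # Degenerate case: the prediction is deterministic.
        return 1.0 if f_min > mu else 0.0
    return gauss_cdf((f_min - mu) / sigma)
```

For fixed \(\sigma > 0\), PI decreases monotonically in \(\mu \), which is one of the monotonicity properties that can be used to obtain interval bounds.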

4.6 Expected improvement acquisition function

EI is the acquisition function that is most commonly used in Bayesian optimization [40]. It is defined as \(\tilde{\text {EI}}({\varvec{x}}) = {\mathbb {E}}\big [ \max (f_{\text {min}} - f({\varvec{x}}) , 0)\big ]\). When the underlying function is distributed as a GP, \(\text {EI}: {\mathbb {R}} \times {\mathbb {R}}_{\geqslant 0} \rightarrow {\mathbb {R}}\) is given by

$$\begin{aligned} \text {EI}(\mu , \sigma ) {:}{=}{\left\{ \begin{array}{ll} \left( f_{\text {min}} - \mu \right) \cdot \varPhi \left( \frac{f_{\text {min}} - \mu }{\sigma } \right) + \sigma \cdot \phi \left( \frac{f_{\text {min}} - \mu }{\sigma } \right) , &{} \quad \sigma > 0 \\ f_{\text {min}} - \mu , &{} \quad \sigma = 0, \quad \mu < f_{\text {min}} \\ 0, &{}\quad \sigma = 0, \quad \mu \geqslant f_{\text {min}} \end{array}\right. } \end{aligned} \tag{6}$$

As noted by Jones et al. [40], EI is componentwise monotonic, and thus exact interval bounds can easily be derived. In Sect. A.3, we show that EI is convex and we provide its envelopes. As EI is not available as an intrinsic function in BARON, an algebraic reformulation is necessary that uses Eq. (6) where \(\varPhi \) is substituted from Eq. (4) with Eq. (1) in ESI and \(\phi \) from Eq. (3). In addition, a workaround would be necessary for \(\sigma = 0\) (e.g., an additional binary variable and a big-M formulation).
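A minimal Python sketch of Eq. (6), including the two limiting cases for \(\sigma = 0\), may help make the definition concrete. The function names are illustrative and are not part of MAiNGO or MeLOn; \(\varPhi \) is evaluated through the standard-library error function.

```python
import math


def norm_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_min):
    """EI for minimization, following the piecewise definition in Eq. (6)."""
    if sigma < 0.0:
        raise ValueError("sigma must be non-negative")
    if sigma == 0.0:
        # Deterministic prediction: improvement is f_min - mu if positive, else 0.
        return max(f_min - mu, 0.0)
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm_cdf(z) + sigma * norm_pdf(z)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(expected_improvement(mu=1.0, sigma=0.5, f_min=2.0))
    print(expected_improvement(mu=1.0, sigma=0.0, f_min=2.0))
```

For fixed \(f_{\text {min}}\), the value does not increase with \(\mu \) and does not decrease with \(\sigma \), which is the componentwise monotonicity that yields exact interval bounds.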

5 Implementation

The described methods are implemented in our open-source solver MAiNGO [7] and the MeLOn toolbox [71]. The modeling interfaces of MAiNGO (currently either text-based input or a C++ API) allow a convenient implementation of RS models without having to eliminate variables symbolically. Instead, the sequential evaluation of model equations can be expressed as in procedural programming paradigms.
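The idea of a sequential, procedural evaluation can be sketched in Python. This is not MAiNGO or MeLOn code: the function names, the zero prior mean, the squared-exponential parametrization, and the precomputed weight vector `alpha` (the inverse covariance matrix applied to the training outputs) are assumptions made for illustration. Only `x` plays the role of an optimization variable; all intermediate quantities are computed from it rather than introduced as additional variables with equality constraints.

```python
import math


def se_kernel(x, x_prime, length_scales, signal_var):
    """Squared-exponential covariance (assumed parametrization)."""
    d = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x_prime, length_scales))
    return signal_var * math.exp(-0.5 * d)


def gp_mean_reduced_space(x, x_train, alpha, length_scales, signal_var):
    """GP mean prediction evaluated step by step as a function of x only."""
    mean = 0.0
    for x_i, alpha_i in zip(x_train, alpha):
        # Each covariance entry is an intermediate value, not a variable.
        mean += alpha_i * se_kernel(x, x_i, length_scales, signal_var)
    return mean


if __name__ == "__main__":
    # Hypothetical trained parameters, for illustration only.
    x_train = [[0.0, 0.0], [1.0, 1.0]]
    alpha = [2.0, -1.0]
    print(gp_mean_reduced_space([0.5, 0.5], x_train, alpha, [1.0, 1.0], 1.0))
```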

MAiNGO implements a spatial B&B algorithm enhanced with some features for range reduction [32, 49, 67] and a multi-start heuristic. A directed acyclic graph representation of the model is constructed using the MC++ library [16] and evaluated in different arithmetics: We use automatic differentiation via FADBAD++ [3] to obtain first and second derivatives of functions and provide them to the desired local solver. Currently, MAiNGO supports local solvers found in the NLopt package [39], IPOPT and Knitro. These local solvers can be used for pre-processing and solving the upper bounding problems. In the presented manuscript, we use SLSQP [45] in the pre-processing and in the upper bounding. During pre-processing, a simple multistart heuristic initializes the first local search at the center point of the variable ranges. Subsequent local searches are initialized randomly within the variable ranges.
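The multistart logic described above (first start at the center of the variable ranges, later starts drawn at random) can be sketched as follows. The local search here is a crude coordinate search used only as a stand-in for SLSQP, and all names are hypothetical; it is not how MAiNGO implements the heuristic.

```python
import random


def local_search(f, x0, lower, upper, step=0.1, max_iter=500):
    """Crude bound-constrained coordinate search (stand-in for a local NLP solver)."""
    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] = min(upper[i], max(lower[i], x[i] + delta))
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5  # refine the step once no neighbor improves
            if step < 1e-8:
                break
    return x, fx


def multistart(f, lower, upper, n_starts, seed=0):
    """First local search from the box center, the remaining ones from random points."""
    rng = random.Random(seed)
    best = None
    for k in range(n_starts):
        if k == 0:
            x0 = [0.5 * (l + u) for l, u in zip(lower, upper)]
        else:
            x0 = [rng.uniform(l, u) for l, u in zip(lower, upper)]
        candidate = local_search(f, x0, lower, upper)
        if best is None or candidate[1] < best[1]:
            best = candidate
    return best


if __name__ == "__main__":
    # Tilted double well with two local minima on [-2, 2] (illustrative only).
    double_well = lambda x: (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]
    print(multistart(double_well, [-2.0], [2.0], n_starts=5))
```

The best point found this way only provides an upper bound; global optimality is certified by the B&B lower bounds.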

MAiNGO constructs (multivariate) McCormick relaxations of factorable functions [53, 81]. The convex and concave relaxations together with their subgradients [58] are constructed through the MC++ library [16]. The necessary interval extensions are provided through FILIB++ [35]. MAiNGO currently supports CPLEX [38] and CLP [19] as linear programming solvers for lower bounding. In this work, the convex relaxations of the objective and the constraints are linearized at the center point of each node. Subsequently, CPLEX [38] solves the resulting linear problems for lower bounding. MAiNGO can also be run in parallel on multiple cores through MPI. For a fair comparison, we run all optimizations on a single core in this work.
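The lower-bounding step can be illustrated with a simplified Python sketch. It assumes a box-constrained problem without further constraints, so the linear program reduces to minimizing one affine function over a box and can be solved in closed form; MAiNGO instead passes the linearizations to an LP solver. The function names are hypothetical.

```python
def linearization_lower_bound(g, subgradient, lower, upper):
    """Lower bound on a convex function g over a box from one linearization.

    The convex function is linearized at the center of the box. Because a
    convex function lies above each of its tangent planes, the minimum of the
    affine function over the box is a valid lower bound.
    """
    center = [0.5 * (l + u) for l, u in zip(lower, upper)]
    bound = g(center)
    for s_i, c_i, l_i, u_i in zip(subgradient(center), center, lower, upper):
        # An affine function attains its minimum over an interval at an endpoint.
        bound += min(s_i * (l_i - c_i), s_i * (u_i - c_i))
    return bound


if __name__ == "__main__":
    # Illustrative convex function, standing in for a convex relaxation.
    g = lambda x: x[0] ** 2 + x[1] ** 2
    dg = lambda x: [2.0 * x[0], 2.0 * x[1]]
    print(linearization_lower_bound(g, dg, [1.0, -1.0], [3.0, 1.0]))
```

In this example, the bound is 0 while the true minimum over the box is 1, which shows that a single linearization point can leave a gap that branching subsequently reduces.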

The GP models, acquisition functions, and training scripts are available open-source within the MeLOn toolbox [71] and the relaxations of the corresponding functions are available through the MC++ library used by MAiNGO.

To install MAiNGO, please visit our public git repository at https://git.rwth-aachen.de/avt.svt/public/maingo. Our machine learning toolbox MeLOn comes as a submodule of MAiNGO and is installed together with MAiNGO. Currently, MAiNGO can be run using our C++ interface or using our own modeling language called ALE [26]. At the time of writing this manuscript, the authors are developing a new Python interface for MAiNGO, which will be available soon via PyPI.

6 Numerical results

We now investigate the numerical performance of the proposed method in three case studies. All computations are run on one core of an Intel Xeon CPU with 2.60 GHz, 128 GB RAM, and the Windows Server 2016 operating system. We use MAiNGO version v0.2.1 and BARON v19.12.7 through GAMS v30.2.0 to solve different optimization problems with GPs embedded. Note that we use CPLEX as a lower bounding solver in both BARON and MAiNGO. In MAiNGO, we use SLSQP [45] in the pre-processing and in the upper bounding. By default, BARON automatically selects NLP solvers and may switch between different NLP solvers.

First, we illustrate the scaling of the method w.r.t. the number of training data points on a representative test function. Herein, the estimate of the GP is optimized. Second, we consider a chemical engineering case study with a chance constraint, which utilizes the variance prediction of a GP. Third, we optimize an acquisition function that is commonly used in Bayesian optimization on a chemical engineering dataset.

6.1 Illustrative example and scaling of the algorithm

In the first illustrative example, the peaks function is learned by GPs. Then, the GP predictions are optimized on \({\tilde{X}} = \{x_1, x_2 \in {\mathbb {R}} :-3 \le x_1, x_2 \le 3 \}\). The peaks function is given by \(f_{\text {peaks}}: {\mathbb {R}}^2 \rightarrow {\mathbb {R}}\) with

$$\begin{aligned}&f_{\text {peaks}}(x_1,x_2) \\&\quad {:}{=}3~(1-x_1)^2 \cdot e^{-x_1^2 -~(x_2+1)^2} - 10 \cdot ~\left( \frac{x_1}{5} - x_1^3 - x_2^5\right) \cdot e^{-x_1^2-x_2^2}- \frac{e^{-(x_1+1)^2 - x_2^2} }{3} \end{aligned}$$

The two-dimensional function has multiple suboptimal local optima and one unique global minimizer at \({\varvec{x}}^*\approx [0.228,-1.626]^T\) with \(f_{\text {peaks}}({\varvec{x}}^*) \approx -6.551\).
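For reference, the peaks function can be written in a few lines of Python and evaluated at the reported minimizer. The function name is illustrative.

```python
import math


def peaks(x1, x2):
    """Peaks test function as defined above."""
    term1 = 3.0 * (1.0 - x1) ** 2 * math.exp(-x1 ** 2 - (x2 + 1.0) ** 2)
    term2 = 10.0 * (x1 / 5.0 - x1 ** 3 - x2 ** 5) * math.exp(-x1 ** 2 - x2 ** 2)
    term3 = math.exp(-(x1 + 1.0) ** 2 - x2 ** 2) / 3.0
    return term1 - term2 - term3


if __name__ == "__main__":
    # Evaluate at the (rounded) global minimizer reported in the text.
    print(peaks(0.228, -1.626))
```

Evaluating at the rounded minimizer reproduces the reported objective value of about \(-6.551\) up to rounding.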

We generate various training data sets on \({\tilde{X}}\) using Latin hypercube sampling with sizes 10, 20, 30, ..., 500. Then, we train GPs with \(k_{\nu =1/2}(d)\), \(k_{\nu =3/2}(d)\), \(k_{\nu =5/2}(d)\), and \(k_{SE}(d)\) covariance functions on the data. The parameters of the trained GPs are saved in individual JSON files. After training, the JSON files are read by the solver and the predictions of the GPs are minimized using the RS and FS formulation to locate an approximation of the minimum of \(f_{\text {peaks}}\). We run optimizations in MAiNGO once using the developed envelopes and once using standard McCormick relaxations. Due to long CPU times, we run optimizations for the FS formulations only for up to 250 data points in MAiNGO. The whole data generation, training, and optimization procedure is repeated 50 times for each data set. Thus, we train a total of 10,000 GPs and run 90,000 optimization problems in MAiNGO. We also solve the FS and RS formulation in BARON by automatically parsing the problem from our C++ implementation to GAMS. This is particularly important in the RS as equations with several thousand characters are generated. We solve the RS problem for up to 360 and the FS for up to 210 data points in BARON due to the high computational effort. The optimality tolerances are set to \(\epsilon _{\text {abs. tol.}}= 10^{-3}\) and \(\epsilon _{\text {rel. tol.}}= 10^{-3}\) and the maximum CPU time is set to 1,000 CPU seconds. The feasibility tolerances are set to \(10^{-6}\). The reported CPU times do not include any compilation time in MAiNGO and BARON. Note that the MAiNGO code is compiled only once for each problem class because the individual GPs are parameterized by JSON files. Thus, no repeated compilation is necessary. The analysis in this section is based on results for the \(k_{\nu =5/2}\) covariance function. The detailed results for the other covariance functions are qualitatively similar (cf. ESI Sect. 4). Also, the results in this section are based on the median computational times of the 50 repetitions because the variations are comparatively small. Boxplots that illustrate the variance are provided in Sect. 4 of the ESI.
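The sampling step and the resulting number of trained GPs can be sketched in Python. This is a basic Latin hypercube construction using only the standard library, given as an assumption-laden illustration; the sampling routine actually used for the study is not specified here, and the function name is hypothetical.

```python
import random


def latin_hypercube(n_points, bounds, seed=0):
    """Basic Latin hypercube sample: one point per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for lo, hi in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        width = (hi - lo) / n_points
        columns.append([lo + (s + rng.random()) * width for s in strata])
    return [list(point) for point in zip(*columns)]


# Data set sizes 10, 20, ..., 500 as described in the text.
sizes = list(range(10, 501, 10))
n_covariance_functions = 4
n_repetitions = 50
n_trained_gps = len(sizes) * n_covariance_functions * n_repetitions

if __name__ == "__main__":
    sample = latin_hypercube(10, [(-3.0, 3.0), (-3.0, 3.0)])
    print(sample[:3])
    print(n_trained_gps)  # 50 sizes x 4 covariance functions x 50 repetitions
```

With 50 data set sizes, four covariance functions, and 50 repetitions, this count gives the 10,000 trained GPs stated above.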

In the FS, this problem has \(D + 2 \cdot N + 2\) equality constraints and \(2 \cdot D + 2 \cdot N + 2\) optimization variables while the RS has D optimization variables and no equality constraints. Note that for practical applications the number of training data points is usually much larger than the dimension of the inputs, i.e., \(N \gg D\). The full problem formulation is also provided in ESI Sect. 3.
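The problem sizes stated above can be tabulated with a short Python sketch; the function names are illustrative.

```python
def full_space_size(dim, n_points):
    """Variables and equality constraints of the FS formulation."""
    return {
        "variables": 2 * dim + 2 * n_points + 2,
        "equalities": dim + 2 * n_points + 2,
    }


def reduced_space_size(dim, n_points):
    """The RS formulation keeps only the D inputs and has no equalities."""
    return {"variables": dim, "equalities": 0}


if __name__ == "__main__":
    for n in (10, 250, 500):
        print(n, full_space_size(2, n), reduced_space_size(2, n))
```

For the peaks example with \(D = 2\) and \(N = 250\), the FS has 506 variables and 504 equality constraints, whereas the RS has 2 variables regardless of \(N\).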

Fig. 3 Comparison of the total CPU time for optimization, i.e., the sum of preprocessing time and B&B time, of GPs with \(k_{\nu =5/2}\) covariance function. The plots show the median of 50 repetitions of data generation, GP training, and optimization. Note that #points are incremented in steps of 10 and the lines are interpolations between them

Figure 3 shows a comparison of the CPU time for optimization of GPs. For the solver MAiNGO, Fig. 3a shows that the RS formulation outperforms the FS formulation by more than one order of magnitude and shows a more favorable scaling with the number of training data points. Notably, the achieved speedup increases drastically with the number of training data points (cf. ESI Sect. 4); for example, it reaches a factor of 778 for 250 data points. This is mainly due to the fact that the CPU times for the FS formulations scale approximately cubically with the number of data points (\(\text {CPU}_{FS~w/~env}(N)=1.053\cdot 10^{-4} N^{2.958}~\text {sec}\) with \(R^2=0.993\)) while the ones for the RS scale almost linearly (\(\text {CPU}_{RS~w/~env}(N)= 0.0022 \cdot N^{1.156}~\text {sec}\) with \(R^2=0.995\)).
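The two fitted power laws can be evaluated directly to see how the gap widens with \(N\). The following Python sketch only evaluates the regression curves given above; it is a smoothed estimate, so the resulting ratio need not coincide with the measured median speedup of 778 at 250 data points. The function names are illustrative.

```python
def cpu_fs(n_points):
    """Fitted CPU time (seconds) of the FS formulation with envelopes."""
    return 1.053e-4 * n_points ** 2.958


def cpu_rs(n_points):
    """Fitted CPU time (seconds) of the RS formulation with envelopes."""
    return 0.0022 * n_points ** 1.156


def fitted_speedup(n_points):
    """Ratio of the two fits; grows roughly like N^1.8."""
    return cpu_fs(n_points) / cpu_rs(n_points)


if __name__ == "__main__":
    for n in (50, 100, 250):
        print(n, round(cpu_fs(n), 2), round(cpu_rs(n), 3), round(fitted_speedup(n), 1))
```

Because the exponents differ by about 1.8, doubling the number of training points increases the fitted speedup by a factor of roughly \(2^{1.8} \approx 3.5\).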

In general, the number of optimization variables can lead to an exponential growth of the worst-case number of B&B iterations and thus runtime. In this particular case, the number of B&B iterations is very similar for the FS and RS formulation (see Fig. 4a). Instead, for the present problems the number of B&B iterations is more influenced by the use of tight relaxations. Figure 4b shows that the CPU time per iteration increases drastically with problem size in the FS while it increases only moderately in the RS. This indicates that the solution time of the lower bounding, upper bounding, and bound tightening subproblems scales favorably in the RS and that this is the main reason for the speedup of the RS formulation in MAiNGO. This is probably due to the smaller subproblem sizes when using McCormick relaxations in the RS formulation (cf. the discussion in Sect. 1).

Fig. 4 Comparison of number of B&B iterations of optimization problems with GPs embedded with \(k_{\nu =5/2}\) covariance function. The plots show the median of 50 repetitions of data generation, GP training, and optimization. Note that #points are incremented in steps of 10 and the lines are interpolations between them

The use of envelopes of covariance functions also improves computational performance (see Fig. 3a). However, this effect is approximately constant over the problem size (cf. Fig. 3 in the ESI). In other words, the CPU time shows a similar trend for the cases with and without envelopes in Fig. 3a. In the RS, the CPU time with envelopes is on average \(7.1\%\) of the CPU time without envelopes (\(\approx 14\) times less). In the FS, the impact of the envelopes is less pronounced, i.e., the CPU time with envelopes is on average \(15.1\%\) of the CPU time without envelopes (\(\approx 6.6\) times less). Figure 4a shows that the envelopes considerably reduce the number of necessary B&B iterations. However, the relaxations do not show a significant influence on the CPU time per iteration (see Fig. 4b).

The results of this numerical example clearly show that the development of tight relaxations is more important for the RS formulation than for the FS. As shown in Sect. 3.4.2 of [4], this effect can be explained by the fact that in the RS, it is more likely to have recurring nonlinear factors, which can cause the McCormick relaxations to become weaker (cf. also the relationship to the AVM in this case explored in [81]). However, in this study, the improvement in relaxations is outweighed by the increase of CPU time per iteration when additional variables are introduced in the FS.

The RS formulation also performs favorably compared to the FS formulation in the solver BARON (see Fig. 3). However, the differences between the CPU times are less pronounced. In contrast to MAiNGO, the number of B&B iterations in both the FS and the RS increases drastically with the number of training data points when using BARON (cf. Fig. 4 in ESI). Also, the time per B&B iteration is similar between RS and FS. This is probably due to the AVM used for the construction of relaxations. The AVM introduces auxiliary variables for some factorable terms. Thus, the size of the subproblems in BARON increases with the number of training data points regardless of which of the two formulations is used.

The results of the optimizations also provide information about the ability of GP surrogate models to approximate a function for optimization. The results show that the solution point of the GP optimization problem approximately converges to the optimum of the learned peaks function for all covariance functions. However, some covariance functions clearly lead to more accurate solutions for the same number of training data points in this particular case. In the ESI, we provide figures that show the solution point and objective function value over the number of data points for this problem (Figs. 8–11). Interestingly, the objective function value is overestimated considerably for all problems.

6.2 Chance-constrained programming

Probabilistic constraints are relevant in engineering and science [18] and GPs have been used in the previous literature to formulate chance constraints, e.g., in model predictive control [11] or production planning [87].

As a second case study, we consider the N-benzylation reaction of \(\alpha \)-methylbenzylamine with benzyl bromide to form the desired secondary (\(2\,^\circ \)) amine and the undesired tertiary (\(3\,^\circ \)) amine. We utilize an experimental data set consisting of 78 data points from a robotic chemical reaction platform [69]. We aim to maximize the expected space-time yield of \(2\,^\circ \) amine (\(2\,^\circ \)-STY) and ensure that the probability of satisfying a product quality constraint is above 95%. The \(2\,^\circ \)-STY and yield of \(3\,^\circ \) amine impurity (\(3\,^\circ \)-Y) are modeled by individual GPs. Thus, we solve optimization problems with two GPs embedded. The chance-constrained optimization problem is formulated as follows

$$\begin{aligned}&\underset{{\varvec{x}} \in E}{\min } \quad - {\mathbb {E}}\big [f_{\text {STY}}({\varvec{x}})\big ] \\&\text {s.t.} \quad \mathcal {P}\left( f_{\text {impurity}}({\varvec{x}}) \leqslant c \right) \geqslant 95 \% \end{aligned}$$

Here, the objective is to minimize the negative of the expected STY. This corresponds to minimizing the negative prediction of the GP, i.e., \(-m_{\varvec{\mathcal {D}},2\,^\circ -\text {STY}}\). The chance constraint ensures that the impurity is below a parameter c with a probability of 95%. This corresponds to the constraint \(m_{\varvec{\mathcal {D}},3^\circ -Y} + 1.96 \cdot \sqrt{k_{\varvec{\mathcal {D}},3^\circ -Y}} \leqslant c\) with \(c=5\).

The optimization is conducted with respect to four optimization variables: (1) the primary (\(1^\circ \)) amine flow rate of the feed varying between 0.2 and 0.4 mL min\(^{-1}\), (2) the ratio between benzyl bromide and \(1^\circ \) amine varying between 1.0 and 5.0, (3) the ratio between solvent and \(1^\circ \) amine varying between 0.5 and 1.0, and (4) the temperature varying between 110 and 150 \(^\circ \)C.

As this problem is highly multimodal and difficult to solve, we increase the number of local searches in pre-processing in MAiNGO to 500 and increase the maximum CPU time to 24 hours. The computational performance of the different methods is given in Table 1. The results show that none of the considered methods converged to the desired tolerance within the time limit. The RS formulation in MAiNGO that uses the proposed envelopes outperforms the other formulations and the BARON solver as it yields the smallest optimality gap. Note that the considered SLSQP solver does not find any valid solution point in the FS in MAiNGO while feasible points are found in the RS. This demonstrates that the RS formulation can also be advantageous for local solvers. When using IPOPT [84] with 500 multistart points in the FS formulation in MAiNGO, a local optimum with \(f^*=-226.5\) is identified in the pre-processing. In the ESI, we provide a brief comparison of a few pre-processing settings for this case study.

Table 1 Numerical results of the N-benzylation reaction optimization with chance constraint (Sect. 6.2)

The best solution of the optimization problem that we found is \(x_1 = 0.40\) mL min\(^{-1}\), \(x_2 = 1.0\), \(x_3 =0.5\), and \(x_4 =123.5\,^\circ \)C. At the optimal point, the predicted \(2\,^\circ \)-STY is 226.5 kg m\(^{-3}\) h\(^{-1}\) with a variance of 17.1 while the predicted amine impurity is 4.2 % with a variance of 0.17. The result shows that the probability constraint ensures a safety margin between the predicted impurity and \(c=5\). Note that the chance constraint is active at the optimal solution point.
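That the chance constraint is active can be checked with a short calculation using the rounded values reported above and the deterministic form of the constraint (predicted mean plus 1.96 standard deviations bounded by \(c\)). The Python names below are illustrative, and the small deviation from \(c = 5\) is attributed to rounding of the reported mean and variance.

```python
import math


def chance_constraint_lhs(mean, variance, factor=1.96):
    """Left-hand side of the deterministic constraint: mean + factor * std."""
    return mean + factor * math.sqrt(variance)


if __name__ == "__main__":
    c = 5.0
    # Reported (rounded) impurity prediction and variance at the best point.
    lhs = chance_constraint_lhs(mean=4.2, variance=0.17)
    print(lhs, c)  # lhs is close to c, i.e., the constraint is (nearly) active
```

The margin of about 0.8 percentage points between the predicted impurity and \(c\) corresponds to 1.96 predicted standard deviations.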

6.3 Bayesian optimization

In the third case study, we consider the synthesis of layer-by-layer membranes. Membrane development is a prerequisite for sustainable supply of safe drinking water. However, synthesis of membranes is often based on trial and error, leading to extensive experimental effort, i.e., building and measuring a membrane in the development phase usually takes several weeks per synthesis protocol. In this case study, we aim to improve the retention of \({\hbox {Na}_{2}\hbox {SO}_{4}}\) salt of a recently developed layer-by-layer nanofiltration membrane system. The optimization variables are the sodium chloride concentration in the polyelectrolyte solution \(c_{NaCl} \in [0,0.5]\) g L\(^{-1}\), the deposited polyelectrolyte mass \(m_{PE} \in [0,5]\) g m\(^{-2}\), and the number of layers \(N_{layer} \in \{1,2,3,...,10\}\). The detailed description of the setup is given in the literature [55, 65]. Overall, we utilize 63 existing data points from previous literature [55]. We identify a promising synthesis protocol based on the EI acquisition function by solving:

$$\begin{aligned}&\underset{{\varvec{x}} \in E}{\min } \quad - \text {EI}\big (m_{\varvec{\mathcal {D}}}({\varvec{x}}),k_{\varvec{\mathcal {D}}}({\varvec{x}})\big ) \end{aligned}$$

with \({\varvec{x}}=[c_{NaCl}, m_{PE},N_{layer}]^T\). Thus, this numerical example corresponds to one step of a Bayesian optimization setup for this experiment. Global optimization of the acquisition function is particularly relevant due to inherent multimodality of the acquisition functions [44] and high cost of experiments. Note that the experimental validation of this data point is not within the scope of this work.

The computational performance of the proposed method is summarized in Table 2. Using the solver MAiNGO, the RS formulation converges approximately 9 times faster to the desired tolerance compared to the FS formulation. Herein, we use the derived tailored relaxations of the EI acquisition function and envelopes of the covariance functions in both cases. Notably, the FS requires approximately 1.8 times the number of B&B iterations compared to the RS formulation, which is much less than the overall speedup. Thus, the results are in good agreement with the previous examples showing that the reduction of CPU time per iteration in the RS makes a major contribution to the overall speedup. For this example, a comparison to BARON is omitted because it would require workarounds including several integer variables and function approximations for CDF and EI (cf. Sects. 4.3, 4.6).

Table 2 Numerical results of the membrane synthesis optimization (Sect. 6.3)

The optimal solution point of the optimization problem is \(c_{NaCl} = 0.362\) g L\(^{-1}\), \(m_{PE} = 0\) g m\(^{-2}\), and \(N_{layer} = 4\). The expected retention is 85.32 with a standard deviation \(\sigma = 14.8\). The expected retention is actually worse than the best retention in the training data of 96.1. However, Bayesian optimization also takes the high variance of the solution into account, i.e., it also explores the space. EI identifies an optimal trade-off between exploration and exploitation.

7 Conclusions

We propose an RS formulation for the deterministic global solution of problems with trained GPs embedded. Also, we derive envelopes of common covariance functions and tight relaxations of acquisition functions, leading to tight overall problem relaxations.

The computational performance is demonstrated on illustrative and engineering case studies using our open-source global solver MAiNGO. The results show that the numbers of optimization variables and equality constraints are reduced significantly compared to the FS formulation. In particular, the RS formulation results in smaller subproblems whose size does not scale with the number of training data points when using McCormick relaxations. This leads to tractable solution times and overcomes previous computational limitations. For example, we achieve a speedup factor of 778 for a GP trained on 250 data points. The GP training methods and models are provided as an open-source module called the “MeLOn - Machine Learning Models for Optimization” toolbox [71].

We thus demonstrate a high potential for future research and industrial applications. For instance, global optimization of the acquisition function can improve the efficiency of Bayesian optimization in various applications. It also allows integer decisions and nonlinear constraints to be included easily in Bayesian optimization. Furthermore, the proposed method could be extended to various related concepts such as multi-task GPs [8], deep GPs [21], global model-predictive control with dynamic GPs [13, 85], and Thompson sampling [12, 17]. Finally, the proposed work demonstrates that the RS formulation may be advantageous for a wide variety of problems that have a similar structure, including various machine-learning models, model ensembles, Monte-Carlo simulation, and two-stage stochastic programming problems.