A robust approach to warped Gaussian process-constrained optimization

Optimization problems with uncertain black-box constraints, modeled by warped Gaussian processes, have recently been considered in the Bayesian optimization setting. This work considers optimization problems with aggregated black-box constraints. Each aggregated black-box constraint sums several draws from the same black-box function with different decision variables as arguments in each individual black-box term. Such constraints are important in applications where, e.g., safety-critical measures are aggregated over multiple time periods. Our approach, which uses robust optimization, reformulates these uncertain constraints into deterministic constraints guaranteed to be satisfied with a specified probability, i.e., deterministic approximations to a chance constraint. While robust optimization typically considers parametric uncertainty, our approach considers uncertain functions modeled by warped Gaussian processes. We analyze convexity conditions and propose a custom global optimization strategy for non-convex cases. A case study derived from production planning and an industrially relevant example from oil well drilling show that the approach effectively mitigates uncertainty in the learned curves. For the drill scheduling example, we develop a custom strategy for globally optimizing integer decisions.

Bayesian optimization optimizes expensive-to-evaluate black-box functions by (i) fitting a Gaussian process to a small number of collected data points and (ii) subsequently choosing new sampling points using an acquisition function [41,53,55]. The Bayesian optimization literature also considers problems with black-box constraints, e.g., by multiplying the acquisition function with the probability of constraint satisfaction [25,26,47]. The global optimization community often handles black-box constraints by (i) generating a small data set from the black-box function, (ii) fitting a surrogate model to this data, and (iii) replacing the black-box constraint by the surrogate model [14,18,19,30,39,42,50]. This approach, however, rarely considers uncertainty in the black-box function.
One way of including uncertain black-box functions in the optimization problem is to consider the surrogate model's parameters to be uncertain and use classical parametric uncertainty methods. Hüllen et al. [33] recently demonstrated this approach for polynomial surrogates using robust optimization. This paper proposes a more direct approach utilizing probabilistic surrogate models to model the uncertain curves. We study optimization problems with constraints which aggregate black-box functions:

$$\sum_{i \in S} \tilde{a}_i x_i \le b, \qquad \tilde{a}_i = g(y_i), \qquad (1)$$

where $x_i$ is a decision variable and $\tilde{a}_i$ depends on a vector of decision variables $y_i \in \mathbb{R}^k$ through a black-box function $g(\cdot)$. Constraint (1) occurs in many highly relevant applications. In production planning, one may limit the total allowed equipment degradation $\sum_i r(p_i)\,\Delta t_i \le b$, where $r(p_i)$ is the black-box degradation rate depending on production $p_i$ in time period $i$ and $\Delta t_i$ is the equipment operation time in period $i$ [60]. A second example is vehicle routing, where the total traveling time $\sum_i \Delta t(t_i, s_i, d_i)\,\gamma_i$ is the sum of traveling times $\Delta t(t_i, s_i, d_i)$ for individual legs $i$, dependent on starting time $t_i$, source $s_i$, and destination $d_i$, and $\gamma_i$ is a binary variable indicating whether leg $i$ is part of the route. A third example is project scheduling under uncertainty, in which duration uncertainty may be aggregated over multiple activities [58]. Lastly, the drill scheduling case study described in detail later in this paper is an industrially relevant example. When black-box constraints are risk- or safety-critical, hedging solutions against uncertainty is essential. Evaluating black-box functions may require expensive computer simulations or physical experiments, so available data is generally limited and may be subject to model errors and measurement noise.
We therefore consider the function $g(\cdot)$ to be uncertain and aim to find solutions for which Constraint (1) holds with confidence $1 - \alpha$:

$$\mathbb{P}\Big(\sum_{i \in S} x_i\, g(y_i) \le b\Big) \ge 1 - \alpha. \qquad (2)$$

To capture the uncertainty in $g(\cdot)$, we model it by stochastic surrogate models. A common stochastic surrogate is the Gaussian process (GP) model. Depending on the underlying data-generating distribution, however, a GP may be an inadequate model. Warped Gaussian processes, which map observations to a latent space using a warping function, are an alternative, more flexible model [54]. This paper considers both standard and warped GPs.
We note that other contributions have connected Bayesian optimization with robust optimization [7,11,12,17]. In this setting, an adversary can perturb the input $x$ by $\delta \in U$. Robust solutions optimize performance under the worst-case perturbation realization: $\min_{x \in \mathbb{R}^n} \max_{\delta \in U} f(x + \delta)$.

Contributions.
For the standard GP model, we show how chance constraint Eq. (2) can be exactly reformulated as a deterministic constraint using existing approaches. For the warped case, we develop a robust optimization approach which conservatively approximates the chance constraint. By constructing decision-dependent uncertainty sets from confidence ellipsoids based on the warped GP models, we obtain probabilistic constraint violation bounds. We utilize Wolfe duality to reformulate the resulting robust optimization problem and obtain explicit deterministic robust counterparts. This reformulation expresses uncertain constraints, modeled by GPs, as deterministic constraints with a guaranteed probability of constraint satisfaction, i.e., deterministic approximations to a chance constraint. We analyze convexity conditions of the warping function under which the Wolfe-duality-based reformulation is applicable. For non-convex cases, we develop a global optimization strategy which utilizes problem structure. To reduce solution conservatism, we furthermore propose an iterative a posteriori procedure for selecting the uncertainty set size which complements the obtained a priori guarantee. We show how the proposed approach hedges against uncertainty in learned curves for two case studies: (i) a production planning-inspired case study with an uncertain price-supply curve and (ii) an industrially relevant drill scheduling case study with uncertain motor degradation characteristics. For the drill scheduling case study, we develop a custom strategy for dealing with discrete decisions.
Notation. See Appendix A for a table of notation.

Method
Sections 2.1-2.3 review (warped) GPs, robust optimization, and chance constraint reformulations for Gaussian distributions. Sections 2.4 and 2.5 outline our proposed robust approximation approach.
Definition 1 (Gaussian process). A continuous stochastic process $G(x)$ for which $G_{x_1, \dots, x_l} = (G_{x_1}, \dots, G_{x_l})$ is a multivariate Gaussian random variable for every finite set of points $x_1, \dots, x_l$.
A GP defines a probability distribution over functions and is fully specified by its mean function $m(\cdot)$ and kernel function $k(\cdot, \cdot)$. Given a set of $N$ data points $X = [x_1, \dots, x_N]$, $y = [y_1, \dots, y_N]$ and using a zero mean function, we can predict the mean $\mu$ and covariance matrix $\Sigma$ of the multivariate Gaussian distribution defined by a set of new test points $X_* = [x_{*1}, \dots, x_{*n}]$:

$$\mu = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y,$$
$$\Sigma = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*),$$

where $\sigma_n$ is the standard deviation of noise in the data, $K(X_*, X) = K(X, X_*)^\top$ is the $n \times N$ covariance matrix between test and training points, $K(X, X)$ the $N \times N$ covariance matrix between training points, $K(X_*, X_*)$ the $n \times n$ covariance matrix between test points, and $I$ the identity matrix. We denote the $ij$-element of $\Sigma$ as $\Sigma_{ij}$. The standard GP approach assumes that the data follows a multivariate Gaussian distribution. While this assumption allows prediction using simple matrix algebra, it can be unreasonable for non-Gaussian data [54]. A slightly more flexible model, which still retains many of the benefits of GPs, is the warped GP model. The key idea is to warp the observations $y$ to a latent space $\xi$ using a monotonic warping function $\xi = h(y, \Psi)$. A standard GP then models the data in the latent space, $\xi \sim GP(x)$. The Jacobian $\partial h(y)/\partial y$ is included in the likelihood, and the GP and warping parameters are learned simultaneously. A common warping function is the neural-net-style function:

$$h(y) = y + \sum_j a_j \tanh\left(b_j (y + c_j)\right),$$

where $a_j \ge 0$, $b_j \ge 0$, $\forall j$ to guarantee monotonicity [34,38,54]. Note that we use $h(\cdot)$ to denote the vector version $h: \mathbb{R}^n \to \mathbb{R}^n$, $h(y) = [h(y_1), \dots, h(y_n)]^\top$, which warps each component individually.
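As a concrete check of the posterior prediction formulas above, the following numpy sketch computes $\mu$ and $\Sigma$ for a squared-exponential kernel. The kernel choice and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sq_exp(A, B, ell=1.0, sf=1.0):
    """Squared-exponential kernel k(a, b) = sf^2 exp(-||a - b||^2 / (2 ell^2))."""
    d = A[:, None, :] - B[None, :, :]
    return sf**2 * np.exp(-np.sum(d**2, axis=-1) / (2.0 * ell**2))

def gp_posterior(X, y, X_star, kernel, sigma_n):
    """Zero-mean GP posterior at test points X_star:
        mu    = K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} y
        Sigma = K(X*, X*) - K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} K(X, X*)
    """
    K_xx = kernel(X, X) + sigma_n**2 * np.eye(len(X))
    K_sx = kernel(X_star, X)
    mu = K_sx @ np.linalg.solve(K_xx, y)
    Sigma = kernel(X_star, X_star) - K_sx @ np.linalg.solve(K_xx, K_sx.T)
    return mu, Sigma
```

At the training inputs with small $\sigma_n$, the posterior mean reproduces the observations and the posterior variance collapses, as expected.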

Robust optimization
Robust optimization immunizes optimization problems against parametric uncertainty by requiring constraints with uncertain parametersã i to hold for all values inside some uncertainty set U [27]. Application areas range from finance and engineering to scheduling and compressed least squares [6,27]. The uncertainty set U can take many different geometries, e.g., box [56], ellipsoidal [9], and polyhedral sets [13]. When U is convex and the constraint is concave, the semi-infinite constraint can often be reformulated into a deterministic equivalent using duality [8]. The general case can be solved using bilevel optimization [3,40], but this requires solving the inner maximization problem to global optimality, even to obtain feasible solutions.

Standard GPs: chance constrained optimization
When $g(\cdot)$ is modeled well by a standard GP, chance constraint Eq. (2) can be exactly replaced by a deterministic equivalent [20]. Since $\{g(y_i), i \in S\} \sim \mathcal{N}(\mu, \Sigma)$ is normally distributed, the linear combination

$$\sum_{i \in S} x_i\, g(y_i)$$

is also normally distributed, with distribution:

$$\sum_{i \in S} x_i\, g(y_i) \sim \mathcal{N}\left(x^\top \mu,\; x^\top \Sigma\, x\right). \qquad (3)$$

Note that we have suppressed the dependence of $\mu$ and $\Sigma$ on $y_i$ for notational simplicity. For a given confidence level $\alpha$, we can therefore replace chance constraint Eq. (2) by:

$$x^\top \mu + F^{-1}(1 - \alpha)\sqrt{x^\top \Sigma\, x} \le b, \qquad (4)$$

where $F(\cdot)$ is the cumulative distribution function of the standard normal distribution. If the GP models $g(\cdot)$ well, Eq. (4) is an exact deterministic reformulation of chance constraint Eq. (2).
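The deterministic equivalent Eq. (4) is straightforward to evaluate numerically; the sketch below (function and variable names are our own) computes its left-hand side with scipy and can be validated against Monte Carlo sampling:

```python
import numpy as np
from scipy.stats import norm

def deterministic_lhs(x, mu, Sigma, alpha):
    """Left-hand side of the deterministic equivalent Eq. (4):
    x' mu + F^{-1}(1 - alpha) sqrt(x' Sigma x), where F is the standard
    normal CDF. The chance constraint holds iff this value is <= b."""
    return x @ mu + norm.ppf(1.0 - alpha) * np.sqrt(x @ Sigma @ x)
```

Setting $b$ equal to the returned value makes the chance constraint tight, so the empirical probability of $x^\top g \le b$ under $g \sim \mathcal{N}(\mu, \Sigma)$ should be close to $1 - \alpha$.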

Warped GPs: robust approximation
If $g(\cdot)$ is insufficiently modeled by a standard GP, a warped GP may be a more suitable model [54]. In this case, a direct reformulation of the chance constraint as outlined above for the standard GP case is not known. Such chance constraints are generally addressed by (i) sample approximation [37,45,46] or (ii) safe outer-approximation [1,35,43,44,48,62]. Instead, we develop a robust approximation. First consider an optimization problem containing a nominal version of Constraint (1):

$$\min_{x, y} \; f(x, y) \quad \text{s.t.} \quad \sum_{i \in S} x_i\, h^{-1}\left(\mu(y_i)\right) \le b, \qquad (5)$$

where $y$ is the vector containing all elements of $y_i$, $\forall i$. Here, the inversely warped mean prediction of the GP, $h^{-1}(\mu(\cdot))$, replaces the black-box function $g(\cdot)$. Clearly, a solution to Problem (5) is not guaranteed to be feasible in practice if the prediction $\mu(y_i)$ is uncertain. Using the full multivariate distribution generated by the sampling points $\{y_i\}$, we can construct an $\alpha$-confidence ellipsoid in the latent space:

$$E^\alpha(y) = \left\{\xi : (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) \le F_n^{-1}(1 - \alpha)\right\}. \qquad (6)$$

Here, $F_n$ is the cumulative distribution function of the $\chi^2$ distribution with $n$ degrees of freedom. Assuming that the warped GP models the black-box function well, $E^\alpha(y)$ contains the true values $h(g(y_i))$ with probability at least $1 - \alpha$. We therefore construct the following robust optimization problem:

$$\min_{x, y} \; f(x, y) \quad \text{s.t.} \quad \sum_{i \in S} x_i\, h^{-1}(\xi_i) \le b \quad \forall \xi \in E^\alpha(y). \qquad (7)$$

Any solution to Problem (7) is feasible with probability at least $1 - \alpha$, given that the warped GP models the underlying data-generating distribution well. Alternatively, we can take the warping into the uncertainty set:

$$\min_{x, y} \; f(x, y) \quad \text{s.t.} \quad \sum_{i \in S} x_i\, z_i \le b \quad \forall z \in U(y), \qquad (8)$$

where

$$U(y) = \left\{z : h(z) \in E^\alpha(y)\right\} = h^{-1}\left(E^\alpha(y)\right). \qquad (9)$$

Note that Problem (8) can also be interpreted as approximating a robust problem with an uncertainty set over functions $\tilde g \in U_g$ (see Theorem (1), Appendix B). Fig. (1) shows an example of the ellipsoidal and warped sets $E^\alpha$ and $U$ for $n = 2$. The warped set $U$ (Eq. 9) may or may not be convex, depending on the warping function $h(\cdot)$.
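A minimal sketch of the latent-space confidence ellipsoid construction, using the $\chi^2$ quantile from scipy (the function name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def in_confidence_ellipsoid(xi, mu, Sigma, alpha):
    """Membership test for the latent-space confidence ellipsoid:
    (xi - mu)' Sigma^{-1} (xi - mu) <= F_n^{-1}(1 - alpha),
    with F_n the chi-squared CDF with n = len(mu) degrees of freedom."""
    d = xi - mu
    return d @ np.linalg.solve(Sigma, d) <= chi2.ppf(1.0 - alpha, df=len(mu))
```

For exactly Gaussian latent values, the ellipsoid covers a fraction $1 - \alpha$ of samples, which can be checked by Monte Carlo.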

Reformulation for convex warped sets U
In the following, we suppress the dependence of $\mu$ and $\Sigma$ on $y$ for notational simplicity. We first assume that the warped set $U$ retains convexity. In this case the inner maximization

$$\max_z \; x^\top z \quad \text{s.t.} \quad w(z, y) = (h(z) - \mu)^\top \Sigma^{-1}\left(h(z) - \mu\right) - F_n^{-1}(1 - \alpha) \le 0 \qquad (10)$$

is convex, and Wolfe duality replaces the semi-infinite constraint by its dual:

$$x^\top z - u\, w(z, y) \le b, \qquad (11a)$$
$$u \ge 0, \qquad (11b)$$
$$x = 2 u\, \nabla h(z)\, \Sigma^{-1}\left(h(z) - \mu\right), \qquad (11c)$$

where $u$ is a dual variable, $\nabla h(z) = \mathrm{diag}(h'(z_i))$, and Constraint (11c) is the Karush-Kuhn-Tucker (KKT) stationarity condition. Note that, unless $x = 0$, the stationarity condition means that $u \neq 0$ and, due to complementary slackness, $w(z, y) = 0$, i.e.:

$$(h(z) - \mu)^\top \Sigma^{-1}\left(h(z) - \mu\right) = F_n^{-1}(1 - \alpha). \qquad (12)$$

Furthermore, we can reformulate Eq. (11c) to:

$$\Sigma^{-1}\left(h(z) - \mu\right) = \frac{1}{2u}\, \nabla h^{-1}(z)\, x.$$

Substituting this in Eq. (12) yields:

$$(h(z) - \mu)^\top \nabla h^{-1}(z)\, x = 2 u\, F_n^{-1}(1 - \alpha).$$

This leads to a slightly different formulation which has the advantage that it does not depend on the inverse of the covariance matrix $\Sigma^{-1}$:

$$x^\top z \le b, \quad h(z) - \mu = \frac{1}{2u}\, \Sigma\, \nabla h^{-1}(z)\, x, \quad (h(z) - \mu)^\top \nabla h^{-1}(z)\, x = 2 u\, F_n^{-1}(1 - \alpha), \quad u \ge 0, \qquad (13)$$

where $\nabla h^{-1}(z)$ is a diagonal matrix containing the inverse elements of $\nabla h(z)$.

Convexity conditions
Section 2.4.1 relies on the convexity of the inner maximization problem. If $U$ is non-convex, Problem (13) is not necessarily equivalent to Problem (8), as there may be more than one KKT point. Since $U$ is the confidence set of a multivariate distribution, however, it may often be convex, especially when the distribution is unimodal. The following section analyzes conditions under which the Wolfe duality approach is justified. First consider the inner maximization Problem (10) transformed into the latent space by substituting $z = h^{-1}(\xi)$:

$$\max_\xi \; x^\top h^{-1}(\xi) \quad \text{s.t.} \quad (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) \le F_n^{-1}(1 - \alpha), \qquad (14)$$

which depends on the generally not explicitly known inverse warping function $h^{-1}$. We further state the well-known result on the derivative of inverse functions [4]:

Lemma 1. $\left(h^{-1}\right)'(\xi) = \dfrac{1}{h'\left(h^{-1}(\xi)\right)}$.

Using this, we can show the following proposition.
Proof. Note that Problem (14) is convex when $h^{-1}$ is concave (convex) and $x_i \ge 0$ ($x_i \le 0$), $\forall i$. The KKT stationarity conditions for Problems (10) and (14) are:

$$x = 2 u\, \nabla h(z)\, \Sigma^{-1}\left(h(z) - \mu\right) \qquad (16)$$

and:

$$\nabla h^{-1}(\xi)\, x = 2 v\, \Sigma^{-1}(\xi - \mu), \qquad (17)$$

where:

$$\nabla h^{-1}(\xi) = \mathrm{diag}\left(\left(h^{-1}\right)'(\xi_i)\right). \qquad (18)$$

By Lemma (1):

$$\nabla h^{-1}(\xi) = \left[\nabla h\left(h^{-1}(\xi)\right)\right]^{-1}.$$

So Eq. (16) is equivalent to:

$$\nabla h^{-1}\left(h(z)\right)\, x = 2 u\, \Sigma^{-1}\left(h(z) - \mu\right). \qquad (19)$$

Let $z^*$ be a KKT point for Problem (10); then $\xi^* = h(z^*)$ clearly satisfies Eq. (17) and is therefore a KKT point for Problem (14). Since Problem (14) is convex, this KKT point is its global maximizer and $z^*$ is unique.
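The inverse-function derivative result can be verified numerically for a tanh-sum warping of the neural-net style mentioned in Section 2.1. The bisection-based inverse below is our own illustrative stand-in for the generally not explicitly known $h^{-1}$:

```python
import numpy as np

def h(y, a, b, c):
    """Neural-net-style warping h(y) = y + sum_j a_j tanh(b_j (y + c_j));
    monotonically increasing for a_j, b_j >= 0."""
    return y + np.sum(a * np.tanh(b * (y + c)))

def h_prime(y, a, b, c):
    """Derivative h'(y) = 1 + sum_j a_j b_j sech^2(b_j (y + c_j))."""
    return 1.0 + np.sum(a * b / np.cosh(b * (y + c))**2)

def h_inv(xi, a, b, c, tol=1e-12):
    """Invert the monotone warping by bisection; the tanh terms are bounded
    by sum |a_j|, which gives a valid initial bracket."""
    lo, hi = xi - np.sum(np.abs(a)) - 1.0, xi + np.sum(np.abs(a)) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid, a, b, c) < xi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A finite-difference derivative of `h_inv` at $\xi = h(y_0)$ matches $1 / h'(y_0)$, as the inverse-function rule predicts.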

Strategy for non-convex warped sets U
When $U$ is non-convex, we need to globally optimize the inner maximization problem efficiently. To this end we develop a custom divide-and-conquer strategy which makes use of the problem's special structure. We first note the following properties of the inner maximization problem.
Lemma 2. Let $z^*$ be the solution of Problem (10); then $z^*$ is on the boundary of $U$, i.e., $z^* \in \partial U$.
Proof. See Appendix C.

Lemma 3. The bounding box of the ellipsoid $E = \{\xi : (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) \le r^2\}$ is given by $\mu_i - r\sigma_{ii} \le \xi_i \le \mu_i + r\sigma_{ii}$, $\forall i$, where $\sigma_{ii} = \sqrt{\Sigma_{ii}}$.
Lemma 4. Consider a version of Problem (14) in which the ellipsoidal feasible region is replaced by its bounding box:

$$\max_\xi \; x^\top h^{-1}(\xi) \quad \text{s.t.} \quad \mu_i - r\sigma_{ii} \le \xi_i \le \mu_i + r\sigma_{ii}, \; \forall i. \qquad (20)$$

If $x_i \ge 0$, $\forall i$, the optimal solution $\xi^*$ to this problem is $\xi^*_i = \mu_i + r\sigma_{ii}$, $\forall i$.

Proof. Let $\xi^*$ be the optimal solution to Problem (20). Note that $\xi^*$ lies on the boundary of the feasible space (Lemma 6). Assume $\exists i$ s.t. $\xi^*_i < \mu_i + r\sigma_{ii}$. Then we can construct a new solution $\hat\xi$ with $\hat\xi_i = \mu_i + r\sigma_{ii}$ and $\hat\xi_j = \xi^*_j$, $\forall j \neq i$, for which $x^\top h^{-1}(\hat\xi) \ge x^\top h^{-1}(\xi^*)$, which is a contradiction.
Theorem 2. If $x_i \ge 0$, $\forall i$, then $x^\top h^{-1}(\xi^*) \le x^\top h^{-1}(\bar\xi)$, where $\bar\xi_i = \mu_i + r\sigma_{ii}$, $\forall i$, and $\xi^*$ is the optimal solution to Problem (14).
Proof. The result follows immediately from Lemma (4) because Problem (20) is a relaxation of Problem (14).
Using Lemma (6) and Theorem (2), we develop the spatial branching strategy shown in Algorithm (1). It starts by outer-approximating the ellipsoid by its bounding box and evaluating the objective $x^\top h^{-1}(\xi)$ at the two corner points $(\underline{\xi}, \bar{\xi})$, obtaining an upper and a lower bound (Theorem 2). The algorithm then branches on the dimension of largest width. Boxes can be pruned if they are fully inside or outside the ellipsoid (Lemma 6).
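A simplified sketch of the spatial branching idea follows: boxes are ranked by their bounding-box upper bound, pruned when they miss the ellipsoid, and split along their widest dimension. The pruning test (a box-constrained QP) and the feasible-point choice are our own implementation assumptions, not the paper's Algorithm (1) verbatim:

```python
import heapq
import numpy as np
from scipy.optimize import minimize

def max_over_ellipsoid(x, g, mu, Sigma, r2, tol=1e-4, max_nodes=5000):
    """Branch-and-bound sketch for max_{xi in E} x' g(xi) with x >= 0 and g
    componentwise increasing (g plays the role of h^{-1});
    E = {xi : (xi - mu)' Sigma^{-1} (xi - mu) <= r2}."""
    Sinv = np.linalg.inv(Sigma)
    quad = lambda xi: (xi - mu) @ Sinv @ (xi - mu)
    half = np.sqrt(r2 * np.diag(Sigma))              # bounding-box half-widths
    lo0, hi0 = mu - half, mu + half

    def closest_point(lo, hi):
        # point of the box closest to mu in the Sigma^{-1} metric (convex QP)
        res = minimize(quad, np.clip(mu, lo, hi),
                       jac=lambda xi: 2.0 * Sinv @ (xi - mu),
                       bounds=list(zip(lo, hi)), method="L-BFGS-B")
        return res.x

    best_val, best_xi = x @ g(mu), mu                # the center is always feasible
    count = 0
    heap = [(-(x @ g(hi0)), count, lo0, hi0)]        # max-heap on upper bound
    while heap and count < max_nodes:
        neg_ub, _, lo, hi = heapq.heappop(heap)
        if -neg_ub <= best_val + tol:
            break                                    # bound certifies optimality
        p = closest_point(lo, hi)
        if quad(p) > r2:
            continue                                 # box misses E: prune
        if x @ g(p) > best_val:                      # p is feasible: new incumbent
            best_val, best_xi = x @ g(p), p
        j = int(np.argmax(hi - lo))                  # branch on the widest dimension
        mid = 0.5 * (lo[j] + hi[j])
        for a, b in ((lo[j], mid), (mid, hi[j])):
            nlo, nhi = lo.copy(), hi.copy()
            nlo[j], nhi[j] = a, b
            count += 1
            heapq.heappush(heap, (-(x @ g(nhi)), count, nlo, nhi))
    return best_val, best_xi
```

With $g$ the identity, the maximum has the closed form $x^\top \mu + \sqrt{r^2\, x^\top \Sigma x}$, which makes a convenient sanity check.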

Iterative a posteriori approximation
The a priori probabilistic bound implied by $E^\alpha$ may be overly conservative. Alg. (2) is an alternative, less conservative strategy that determines the uncertainty set size iteratively. Starting with the confidence level $\alpha$ equal to the target feasibility, it repeatedly solves Problem (13) with the current $\alpha$ and adjusts $\alpha$ by bisection search. The search terminates when a solution has been found whose feasibility is sufficiently close (tolerance $\delta$) to the target.
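A sketch of the bisection idea behind Alg. (2); `solve` and `feasibility` are assumed user-supplied callbacks (e.g., solving Problem (13) for a given confidence and estimating achieved feasibility by sampling), not part of the paper's code:

```python
def a_posteriori_confidence(solve, feasibility, target, delta=0.01, max_iter=50):
    """Bisection on the confidence level. `solve(c)` returns a solution of the
    robust problem built with confidence c; `feasibility(sol)` estimates its
    achieved probability of constraint satisfaction. Stops when the achieved
    feasibility is within `delta` of the target."""
    lo, hi = 0.0, 1.0
    c = target                        # start at the target feasibility
    sol = solve(c)
    for _ in range(max_iter):
        f = feasibility(sol)
        if abs(f - target) <= delta:
            break
        if f < target:
            lo = c                    # uncertainty set too small: grow it
        else:
            hi = c                    # solution too conservative: shrink it
        c = 0.5 * (lo + hi)
        sol = solve(c)
    return sol, c
```

With any monotone relationship between confidence and achieved feasibility, the bisection converges to a confidence level whose solution meets the target within the tolerance.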

Case studies

Production planning
Our first case study is inspired by production planning. Assume that a company wants to decide how much product $x_t$ to produce in a number of subsequent time periods $x = [x_1, \dots, x_t, \dots, x_T]$. There is a known cost of production $c_t$ which may vary from period to period. The company seeks to maximize its profit $\psi$, which depends on the total production cost $\sum_t c_t x_t$ and revenue $\sum_t \tilde p_t x_t$. Here $\tilde p_t$ is the price at which the product can be sold in period $t$. The company has to sell all its product in the same time period, e.g., because the product is perishable. The sale price depends on the amount the company produces in that period, $\tilde p_t = \tilde p(x_t)$, e.g., because the company has a very large market share.
The company uses GP regression to predict $\tilde p(x_t)$ based on limited historical data. Additional features, e.g., season and general state of the economy, could be part of this regression but are irrelevant for our purpose as they are not decision variables. The prediction has to be considered uncertain, and the company wants a production plan guaranteeing a certain profit with some confidence. This decision problem can be formulated as a chance-constrained optimization problem:

$$\max_{x, \psi} \; \psi \quad \text{s.t.} \quad \mathbb{P}\Big(\sum_t \left(\tilde p(x_t) - c_t\right) x_t \ge \psi\Big) \ge 1 - \alpha.$$

Choosing $p(x_t) = \exp(-x_t)$ as ground truth for the price-supply curve, we generate noisy data $\tilde p(x_t) = p(x_t) + \epsilon$ and fit a GP surrogate as shown in Fig. (2); the plotted confidence region is two standard deviations wide. We consider uniform Gaussian noise ($\epsilon \sim \mathcal{N}(0, \sigma_{\text{noise}})$) and non-uniform Gaussian noise ($\epsilon \sim \mathcal{N}(0, 4 \cdot \sigma_{\text{noise}} \cdot \exp(-x/2))$), where $\sigma_{\text{noise}}$ is a parameter determining the amount of noise. We use a squared exponential kernel for this case study, but the proposed method does not generally rely on a specific choice of $k(\cdot, \cdot)$.
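The two noise scenarios can be generated as follows (the sampling range of $x$ is an assumption on our part):

```python
import numpy as np

def price_supply_data(n=50, sigma_noise=0.03, uniform=True, seed=0):
    """Noisy samples of the ground-truth price-supply curve p(x) = exp(-x),
    with uniform noise N(0, sigma_noise) or non-uniform noise
    N(0, 4 sigma_noise exp(-x/2)) as in the case study."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 3.0, n)
    scale = sigma_noise if uniform else 4.0 * sigma_noise * np.exp(-x / 2.0)
    return x, np.exp(-x) + rng.normal(0.0, scale, n)
```

In the non-uniform scenario, the noise standard deviation decays with $x$, so residuals are statistically larger at small production amounts.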

Drill scheduling
The objective in drilling oil wells is generally minimizing total well completion time. The aim of the drill scheduling problem, illustrated in Fig. (3), is to find a schedule of the two decision variables, rotational speed $\dot N \in \mathbb{R}$ and weight on bit $W \in \mathbb{R}$, as a function of depth $x \in \mathbb{R}$. Current practice often consists of minimizing the total drilling time, which depends on $\dot N$ and $W$ through a non-linear bit-rock interaction model [22] and the motor's power curves (see Appendix D). Total well completion time, however, also depends on maintenance time.

Figure 3: Illustration of the drill scheduling problem with two rock types. The rock type changes at $x_1$, maintenance is scheduled at $x_2$, and the target depth is $x_3$. The right side shows an example schedule of the decision variables $\dot N$ and $W$.

Current practice may increase maintenance time because drilling quickly can detrimentally affect motor degradation. Furthermore, the motor's degradation characteristics are subject to uncertainty and are often obtained through a mixture of experiments and expensive numerical simulations [2]. Other works have considered uncertain equipment degradation in scheduling applications [5,60], but not with predicted degradation rates.
To find the optimal trade-off between drilling and maintenance time, we propose a drill scheduling model which explicitly considers uncertainty in the motor degradation characteristics. First consider a model, Model (23), which discretizes the drill trajectory into $n$ equidistant intervals. The rate of penetration $V_i$ in each segment depends on the drill parameters ($\dot N_i$ and $W_i$) through the non-linear model in Appendix D. The rate of degradation $r(\cdot)$ is a black-box function of the differential pressure across the motor $\Delta p$. We model $r(\cdot)$ with a warped GP based on 10 data points from a curve obtained by Ba et al. [2] through a combination of experiments and numerical simulation. The maintenance indicator $R_i$ keeps track of the total cumulative degradation of the motor. We assume the motor fails when $R_i$ reaches 1. Binary variable $z_i$ indicates whether maintenance is scheduled in segment $i$. If maintenance is scheduled, the continuous variable $y_i$ resets the total degradation indicator $R_i$ to zero. Note that the bit-rock interaction model depends on rock parameters which can change from segment to segment. A major disadvantage of Model (23) is that it requires a large number of segments in order to get a good resolution on the optimal maintenance depth. To avoid this we propose, in analogy with continuous-time formulations [24,52], an alternative continuous-depth scheduling formulation, Model (24). Model (24) only considers geological segments (segments with constant rock parameters) and maintenance-induced segments. The vector $x$ is ordered and contains the fixed rock formation depths as well as the variable maintenance depths. Fig. (3) shows an example where $x_1$ is the fixed depth at which the geology changes and $x_2$ is the variable depth of a maintenance event. The indices $i \in M$ of the variable maintenance depths are determined a priori, i.e., we decide both the number of maintenance events and the geological segments in which they occur a priori.
$m^-$ is either the index of the previous maintenance event or 1 if $m$ is the first element of $M$. While Problem (24) cannot decide the optimal number of maintenance events, it is easier to solve than Problem (23) because it does not contain integer variables and generally has a much smaller number of segments, i.e., fewer variables and constraints. The following discusses strategies for deciding the optimal number and segment assignment of maintenance events.

Integer strategy
In drill scheduling, the number of maintenance events $n$ is generally small ($n \le 4$). The number of geological segments can be large in practice and will not be known a priori. We therefore consider groupings of segments into a small number ($m \le 10$) of longer segments with average rock parameters which are known a priori. Given $n$ and $m$, the combinatorial complexity of enumerating the maintenance-segment assignment problem is $N = \binom{n+m-1}{n}$. However, the optimal number of maintenance events $n$ is itself a decision variable. Therefore, finding the globally optimal maintenance-segment assignment also requires enumerating different values of $n$.
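The assignments can be enumerated directly with the standard library; by stars and bars, the number of multisets below is $\binom{n+m-1}{n}$:

```python
from itertools import combinations_with_replacement
from math import comb

def maintenance_assignments(n_events, m_segments):
    """All assignments of n indistinguishable maintenance events to m
    geological segments (each assignment is a multiset of segment indices)."""
    return list(combinations_with_replacement(range(m_segments), n_events))
```

For example, `maintenance_assignments(2, 3)` yields the six assignments `(0,0), (0,1), (0,2), (1,1), (1,2), (2,2)`, matching `comb(2 + 3 - 1, 2)`.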
Alg. (3) derives upper bounds for the number of maintenance events as well as their locations. It starts by solving Problem (24) without any maintenance events, ignoring the upper bound $R_n \le 1$ on the degradation indicator. The floor $\lfloor R_n \rfloor$ of the maintenance indicator at the target depth $x_n$ is an upper bound on the necessary number of maintenance events. Alg. (3) then starts at the target depth $x_n$ and inserts $\lfloor R_n \rfloor$ maintenance events at the earliest possible points that satisfy the degradation bound.

Proof. Let $j^*$ and $\hat\jmath$ be the $(i+1)$-th last elements of $M^*$ and $\hat M$, respectively, and assume $x^*_{j^*} > \hat x_{\hat\jmath}$. By moving $x^*_{j^*}$ to $\hat x_{\hat\jmath}$ and drilling at maximum speed $\bar V$ between $\hat x_{\hat\jmath}$ and $x^*_{i^*}$, we obtain a solution $(M', x', V')$ with drilling and maintenance cost lower than $(M^*, x^*, V^*)$, which is a contradiction. Therefore $x^*_{i^*} \le \hat x_{\hat\imath} \implies x^*_{j^*} \le \hat x_{\hat\jmath}$. Furthermore, $x^*_{i^*} \le \hat x_{\hat\imath}$ has to hold for the last maintenance event by the same logic. The proposition follows by induction.
Lemma (5) reduces the number of maintenance-segment assignments to enumerate.

Heuristics
Alg. (3) is equivalent to minimizing the drilling cost without considering degradation, a strategy often used in practice. It provides feasible but likely suboptimal solutions to Problem (24), i.e., it can be used as a heuristic. We call this the no-degradation heuristic and propose a second, improved heuristic: the boundary heuristic, outlined in Alg. (4). Alg. (4) starts with the solution of the no-degradation heuristic (Alg. 3). It improves the solution by iteratively solving Problem (24) and reassigning maintenance events occurring at geological boundaries to the adjacent segment. It terminates after finding a solution with all maintenance events occurring in the interior of their segment. While it does not guarantee global optimality of the maintenance-segment assignment, the boundary heuristic may be useful for very large instances when enumeration is prohibitive.

Results
The deterministic reformulations of both case studies were implemented in Pyomo (Version 5.6.8) [31,32], an algebraic modeling language for expressing optimization problems. As part of this work, we developed a Python (Version 3.6.8) module which takes a GP model trained using the Python library GPy (Version 1.9.6) [28] and predicts $\mu(x)$ and $\Sigma(x)$ as Pyomo expressions (to be made available open source on GitHub). This allows the easy incorporation of GP models into Pyomo optimization models. We use the interior-point non-linear optimization solver Ipopt [59] with a multistart strategy to solve the problem. Each instance was solved 30 times with random starting points; the multistart procedure ends prematurely if it finds the same optimal solution (with a relative tolerance of $10^{-4}$) 5 times.

Fig. (4) shows the warping functions for both case studies. Since the production planning warping function is concave and the production amounts $x_t$ are strictly positive, Theorem (1) applies and the warped set $U$ is convex. Theorem (1) cannot be applied to the drill scheduling case, because its warping function is neither convex nor concave. However, because the warping function is only slightly non-convex, the warped set $U$ may still be convex for many instances. To avoid solving the bilevel problem directly, we therefore use the following strategy: (i) solve the robust reformulation (Eq. 13), (ii) check feasibility of the obtained solution using Algorithm (1) (to a tolerance of $10^{-2}$), and (iii) only solve the bilevel problem (Eq. 8) directly if the obtained solution is infeasible. For the instances considered in this work, the obtained solution always turned out to be feasible.
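The multistart procedure described above can be sketched as follows, with scipy's local solver standing in for Ipopt; the stopping rule mirrors the 5-repeats/relative-tolerance criterion, and all names are our own:

```python
import numpy as np
from scipy.optimize import minimize

def multistart(objective, bounds, n_starts=30, n_repeat=5, rtol=1e-4, seed=0):
    """Multistart local search: solve from random starting points, stopping
    early once the incumbent optimum has been found `n_repeat` times within
    relative tolerance `rtol` (scipy's local solver stands in for Ipopt)."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    best, hits = None, 0
    for _ in range(n_starts):
        res = minimize(objective, rng.uniform(lo, hi), bounds=bounds)
        if best is not None and abs(res.fun - best.fun) <= rtol * max(1.0, abs(best.fun)):
            hits += 1                     # same optimum found again
        elif best is None or res.fun < best.fun:
            best, hits = res, 1           # new incumbent
        if hits >= n_repeat:
            break                         # early termination
    return best
```

On a multimodal objective such as $(x^2 - 1)^2$, the local solver lands in one of the symmetric minima at $x = \pm 1$ depending on the start, and the multistart loop recovers the global value.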

Production planning
For the production planning case study, we consider 4 model instances with $T = 1, 2, 3$ and 6 time periods. Table (1) shows the cost of production $c$. We solve each instance for 30 different confidence values $1 - \alpha$. The GP was trained based on 50 randomly generated data points using both uniform and non-uniform Gaussian noise with $\sigma_{\text{noise}} = 0.01, 0.03, 0.05$, and 0.08.

Standard GP: Fig. (5) shows results for the chance-constrained approach using a standard GP model. We plot the fraction of feasible scenarios out of 1 million random samples from the true underlying distribution. Fig. (5) shows results for four different noise scenarios. By varying the confidence $1 - \alpha$, we adjust the robustness of the obtained solution. Clearly, the resulting feasibility does not exactly match the expected feasibility (shown as a dotted line) determined by the confidence level $1 - \alpha$. This is due to a mismatch between the true underlying distribution and the normal distribution estimated by the GP. As the amount of noise increases, the GP estimate deteriorates and the mismatch between feasibility and confidence increases.

Warped GP: Fig. (6) shows solution feasibility as a function of confidence $1 - \alpha$ for non-uniform noise using a warped GP model and the proposed robust approach. We show results for four different numbers of time periods. In the nominal case ($1 - \alpha = 0$), the feasibility is always close to 50% because a solution which is valid for the mean price-supply curve will also be valid for many scenarios with higher prices.
In the robust case, as expected, feasibility increases as the size of the uncertainty set, i.e., $1 - \alpha$, increases. Notice that the robust approach is almost always a conservative approximation to the chance constraint, as the achieved feasibility is generally larger than the confidence $1 - \alpha$. Small violations of the a priori bound (dotted line) can still occur due to a mismatch between the GP model and the true underlying data-generating distribution. The solution conservatism also varies with the number of time periods considered. The a priori bound relaxes as $T$ increases.

Figure 7: Profit, normalized with respect to nominal profit with $\sigma_{\text{noise}} = 0.01$ (objective of Problem (5)), as a function of confidence $1 - \alpha$ for four different noise scenarios and time periods $T = 1, 2, 3$ and 6. As expected, the objective value decreases with increasing confidence $1 - \alpha$, because more extreme worst-case scenarios are considered.

Fig. (7) shows the worst-case profit, normalized with respect to the nominal profit for $\sigma_{\text{noise}} = 0.01$, as a function of the confidence level $1 - \alpha$. As expected, increasing the confidence $1 - \alpha$ leads to a lower worst-case profit, because a larger confidence hedges against more uncertain price outcomes. Note that results are shown for values of $1 - \alpha$ between 0.001 and 0.999. At $1 - \alpha = 1$, the profit is always zero, because the uncertainty set includes negative prices and the optimal solution is to not produce anything. For a fixed confidence level, noisier data will generally lead to a smaller objective value, as there is more uncertainty to hedge against.
Iterative procedure: Finally, Fig. (8) shows solution feasibility for the iterative a posteriori procedure (Alg. 2). We use confidence values between 0.55 and 0.999, since smaller confidences can often not be achieved using the iterative approach (the smallest achievable confidence is the feasibility of the nominal solution, i.e., ~50%). The a posteriori approach is clearly less conservative than the a priori approach; however, this comes at the cost of additional computational expense and potential bound violations when the warped GP does not model the underlying distribution perfectly. The a posteriori approach could therefore be a viable, less conservative alternative in relatively low-noise scenarios or when more training data is available.

Drill scheduling
For the drill scheduling case study, we consider two different geologies with 2 and 6 geological segments. We consider a range of target depths and confidence values. Fig. (9) shows the drilling, maintenance, and total cost for a target depth of 2200 m as a function of the confidence parameter $1 - \alpha$. In the deterministic case ($1 - \alpha = 0.5$), the optimal strategy is to not do maintenance at all and to drill as fast as possible. As we increase $1 - \alpha$ to obtain more robust solutions, we eventually reach a point where the average rate of penetration is slightly lower in order to reduce degradation and guarantee that the well can be completed without a motor failure. For the 2-segment geology, the increased cost of drilling outweighs the zero maintenance cost at around $1 - \alpha = 0.92$. After this point the optimal strategy is to do maintenance once.
Results are shown for both the no-degradation and boundary heuristics as well as total enumeration. For this instance, the boundary heuristic leads to the same solution as the globally optimal enumeration strategy. The no-degradation heuristic, on the other hand, leads to suboptimal solutions when the optimal number of maintenance events is lower than the upper bound $\lfloor R_n \rfloor$. A larger confidence always leads to a higher cost, as would be expected, but the difference between the deterministic solution and a 99%-confidence robust solution can be larger or smaller depending on the target depth; e.g., for a target depth of $x_n = 3000$ m, hedging against uncertainty does not lead to significant cost increases.
Finally, Fig. (11) shows the total solution time for the three integer strategies for the instance with 6 geological segments as a function of the confidence parameter $1 - \alpha$. While the no-degradation heuristic often leads to suboptimal solutions, as seen above, it is computationally very cheap. The boundary heuristic offers a good compromise: it frequently finds the global optimum while being much cheaper computationally. Especially for instances with many geological segments and maintenance events, where the combinatorial complexity of the enumeration strategy becomes prohibitive, it may therefore be a good alternative.

Conclusion
Our approach reformulates uncertain black-box constraints, modeled by warped Gaussian processes, into deterministic constraints guaranteed to hold with a given confidence. We achieve this deterministic reformulation of chance constraints by constructing confidence ellipsoids and utilizing Wolfe duality. We show that this approach allows the solution conservatism to be controlled through a sensible choice of the confidence probability. This could be especially useful in safety-critical settings where constraint violations must be avoided.

B Connection to uncertain functions
Consider the following robust optimization problem: instead of uncertain parameters, Problem (25) considers an uncertainty set $U_g$ over uncertain functions $\tilde g(\cdot)$. We are interested in defining $U_g$ such that it contains "likely" realizations of the GP.
Recall that for any finite set of points $y_1, \ldots, y_l$, $l \in \mathbb{N}$,

$G_{y_1,\ldots,y_l} = \mathcal{N}\big(\mu(y_1,\ldots,y_l),\, \Sigma(y_1,\ldots,y_l)\big)$

is a multivariate Gaussian with mean $\mu(y_1,\ldots,y_l)$ and covariance $\Sigma(y_1,\ldots,y_l)$. For any such $G_{y_1,\ldots,y_l}$, we can construct a confidence ellipsoid $E^\alpha(y_1,\ldots,y_l)$ containing the true values $[g(y_1),\ldots,g(y_l)]$ with probability $1-\alpha$:

$E^\alpha(y_1,\ldots,y_l) = \big\{ z : (z-\mu)^\top \Sigma^{-1} (z-\mu) \le F^{-1}_{\chi^2_l}(1-\alpha) \big\},$

where $F_{\chi^2_l}$ is the cumulative distribution function of the $\chi^2$ distribution with $l$ degrees of freedom. We then construct a set $U_E$ over functions $\tilde g(\cdot)$ for which $[\tilde g(y_1),\ldots,\tilde g(y_l)]$ lies in the corresponding $\alpha$-confidence ellipsoid $E^\alpha(y_1,\ldots,y_l)$ for any $l \in \mathbb{N}$ and $y_1,\ldots,y_l$ with $y_i \in \mathbb{R}^k$:

$U_E = \big\{ \tilde g(\cdot) : [\tilde g(y_1),\ldots,\tilde g(y_l)] \in E^\alpha(y_1,\ldots,y_l) \ \ \forall \{y_1,\ldots,y_l\},\, y_i \in \mathbb{R}^k,\, l \in \mathbb{N} \big\}$

Replacing $U_g$ with $U_E$ transforms Problem (25) into a robust optimization problem with an uncertainty set over functions defined by an infinite number of confidence ellipsoids of arbitrarily many dimensions. This set is not semialgebraic, and it is not clear how it could practically be used in optimization. In practice, however, we are only interested in evaluating the GP at a finite number of points.

Figure 12: Any $l_1$-dimensional $\alpha$-confidence ellipsoid $E^\alpha_{l_1}$ is a strict subset of the projection of higher-dimensional $\alpha$-confidence ellipsoids $E^\alpha_{l_2}$, $l_2 > l_1$, onto the $l_1$-dimensional space.
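The ellipsoid membership test is easy to sketch numerically. A minimal check, with a hypothetical mean and covariance standing in for a two-point GP posterior:

```python
import numpy as np
from scipy.stats import chi2

def in_confidence_ellipsoid(z, mu, Sigma, alpha):
    """Check whether z lies in the (1 - alpha)-confidence ellipsoid
    {z : (z - mu)^T Sigma^{-1} (z - mu) <= F^{-1}_{chi2_l}(1 - alpha)},
    where l = len(mu)."""
    d = z - mu
    radius = chi2.ppf(1.0 - alpha, df=len(mu))  # chi-squared quantile
    return float(d @ np.linalg.solve(Sigma, d)) <= radius

# Hypothetical 2-point posterior mean and covariance.
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
print(in_confidence_ellipsoid(mu, mu, Sigma, alpha=0.05))  # True: the mean is always inside
```

The quantile `chi2.ppf(1 - alpha, df=l)` is exactly the squared radius $F^{-1}_{\chi^2_l}(1-\alpha)$ appearing in the definition above.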
Here, the number of evaluation points is the number of times $|S|$ that the GP occurs in the optimization problem. Consider the following robust optimization problem:

Theorem 1. A vector $x^*$ which is a feasible solution to Problem (27) is also a feasible solution to Problem (25).
Proof. Assume $x^*$ is a solution to Problem (27) but not to Problem (25). Then $\exists \hat g \in U_g$ such that the corresponding constraint is violated, which is a contradiction.

Fig. (12) shows that the converse of Theorem (1) is not necessarily true. Because all confidence ellipsoids are symmetric and centered at the mean of the distribution, any lower-dimensional ellipsoid $E^\alpha_l = E^\alpha(y_1, \ldots, y_l)$, $l < n$, is a strict subset of the projection of $E^\alpha_n = E^\alpha(y)$ onto the $l$-dimensional space (otherwise it would have to contain a larger probability mass). Problem (27) therefore conservatively approximates Problem (25). Furthermore, the $\alpha$-confidence ellipsoid $E^\alpha(y)$ implies that a solution to Problem (27) is feasible for the black-box constrained problem with probability at least $1 - \alpha$ (see Theorem 1).
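The containment argument can be checked numerically. The projection of the $n$-dimensional ellipsoid $\{z : z^\top \Sigma^{-1} z \le c_n\}$ onto its first $l$ coordinates is $\{u : u^\top \Sigma_{ll}^{-1} u \le c_n\}$, with $\Sigma_{ll}$ the corresponding covariance block, so strict containment reduces to the $\chi^2$ quantile being strictly increasing in the degrees of freedom. A minimal sketch:

```python
from scipy.stats import chi2

alpha = 0.05
# Squared radii c_l of the alpha-confidence ellipsoids for dimensions l = 1..5.
radii = [chi2.ppf(1 - alpha, df=l) for l in range(1, 6)]

# c_l grows strictly with l: the l-dimensional ellipsoid (radius c_l) is a
# strict subset of the projection of the n-dimensional one (radius c_n,
# n > l, same marginal covariance block).
print(all(radii[i] < radii[i + 1] for i in range(len(radii) - 1)))  # True
```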
C Globally optimizing non-convex inner maximization problems

Lemma 6. Let $z^*$ be the solution of Problem ??; then $z^* \in \partial U$.
Lemma 7. The bounding box of an ellipsoid $\{x : x^\top \Sigma^{-1} x \le F\}$ is given by the extreme points $\bar x_i = \pm\sqrt{F\,\Sigma_{ii}}$.

Proof. Consider the optimization problem:

$\max_x \ e_i^\top x \quad \text{s.t.} \quad x^\top \Sigma^{-1} x = F.$

Its stationarity condition is:

$e_i = 2\lambda\, \Sigma^{-1} x.$

Pre-multiplying by $x^\top$ and substituting primal feasibility leads to the expression:

$x_i = e_i^\top x = 2\lambda F.$

Substituting this back into the stationarity condition and rearranging gives:

$x = \frac{F}{x_i}\, \Sigma e_i,$

which, substituted into the primal constraint, leads to the desired result:

$\frac{F^2}{x_i^2}\, \Sigma_{ii} = F \;\Rightarrow\; \bar x_i = \pm\sqrt{F\,\Sigma_{ii}}.$
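Lemma 7 is straightforward to verify numerically: sampling points on the boundary of the ellipsoid and taking the largest coordinate magnitudes should recover the analytic bound $\pm\sqrt{F\,\Sigma_{ii}}$. A small sketch with an arbitrary 2-dimensional covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
F = 3.0  # ellipsoid boundary: x^T Sigma^{-1} x = F

# Lemma 7: bounding box extreme points x_i = +/- sqrt(F * Sigma_ii).
box = np.sqrt(F * np.diag(Sigma))

# Sample points on the boundary: x = sqrt(F) * L u with ||u|| = 1 and
# Sigma = L L^T, so that x^T Sigma^{-1} x = F exactly.
L = np.linalg.cholesky(Sigma)
u = rng.normal(size=(100000, 2))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform directions
x = np.sqrt(F) * u @ L.T

# The empirical coordinate-wise maxima match the analytic bounding box.
empirical = np.abs(x).max(axis=0)
print(np.allclose(empirical, box, atol=1e-2))  # True
```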

D Drill scheduling model
In order to connect the penetration rate V and degradation rate r to the drilling parameters, weight-on-bit W and rotational speed Ṅ, we require two models:

• A bit-rock interaction model [22] connecting W and Ṅ with V and the differential pressure ∆p across the mud motor.

• A mud motor degradation model [2] connecting the degradation rate r with the differential pressure ∆p.

D.1 Detournay's bit-rock interaction model
To model the connection between W, Ṅ, V, and ∆p, we combine the bit-rock interaction model by Detournay et al. [22] with the PDM's power curve. There are three relevant rotational speeds in the drilling process: the drill-string speed Ṅ_top, the PDM speed (relative to the drill string) Ṅ_PDM, and the drill-bit speed Ṅ_bit. Based on Detournay et al. [22], the following drilling response model can be formulated relating Ṅ_bit with the weight-on-bit W and the rate of penetration V:

Figure 13: Example of a PDM power curve. [2]

Putting this together, we obtain the following model relating V to W and Ṅ:

Ṅ_PDM = f_{T,Q} (from Fig. 13) (36f)

Assuming that the flow rate Q(t) is treated as a parameter, the only decision variables are W(t) and Ṅ_top(t). For the purpose of this work we model the above power curves using quadratic equations. Notice that the variables w, t, d, and Ṅ_PDM could easily be eliminated, resulting in a more compact albeit less intuitive/physically meaningful formulation.

Figure 14: Maximum lifetime of a PDM as a function of differential pressure ∆p (for a given PDM geometry and elastomer, mud, flow, and temperature). [2]
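A quadratic power-curve fit of the kind described above can be sketched in a few lines. The (∆p, Ṅ_PDM) samples below are entirely hypothetical stand-ins for the manufacturer data of Fig. 13, and `f_TQ` mirrors the paper's $f_{T,Q}$ notation:

```python
import numpy as np

# Hypothetical (differential pressure, PDM speed) samples standing in for
# the power curve of Fig. 13; the real values come from the PDM datasheet.
dp = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])           # MPa (hypothetical)
n_pdm = np.array([150., 148., 142., 132., 118., 100.])  # rpm (hypothetical)

# Quadratic model N_PDM = a*dp^2 + b*dp + c, fit by least squares.
a, b, c = np.polyfit(dp, n_pdm, deg=2)

def f_TQ(delta_p):
    """Quadratic surrogate for the power curve N_PDM = f_{T,Q}."""
    return a * delta_p**2 + b * delta_p + c

print(round(float(f_TQ(1.0)), 1))  # 142.0
```

A smooth quadratic surrogate keeps the power curve algebraic, so it can be embedded directly in the optimization model instead of a lookup table.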

D.2 Mud motor degradation model
For the mud motor degradation characteristics, we use data obtained by Ba et al. [2], determined through a combination of simulation and experiments, shown in Fig. 14.
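One simple way to turn a lifetime curve like Fig. 14 into a degradation rate is to interpolate the data and take the reciprocal, r(∆p) = 1/T_max(∆p); both the (∆p, T_max) values below and this reciprocal relationship are illustrative assumptions, not the calibrated model of Ba et al. [2]:

```python
import numpy as np

# Hypothetical (delta_p, max lifetime) pairs standing in for the degradation
# data of Fig. 14 (real values depend on PDM geometry, elastomer, mud, etc.).
dp = np.array([0.5, 1.0, 1.5, 2.0])        # differential pressure (MPa)
t_max = np.array([400., 200., 100., 50.])  # maximum motor lifetime (h)

def degradation_rate(delta_p):
    """Assumed degradation rate r = 1 / T_max(delta_p), with T_max obtained
    by piecewise-linear interpolation of the lifetime curve."""
    return 1.0 / np.interp(delta_p, dp, t_max)

print(degradation_rate(1.0))  # 0.005 (per hour) at 1.0 MPa
```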