Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics

Selecting the right tuning parameters for algorithms is a pravelent problem in machine learning that can significantly affect the performance of algorithms. Data-efficient optimization algorithms, such as Bayesian optimization, have been used to automate this process. During experiments on real-world systems such as robotic platforms these methods can evaluate unsafe parameters that lead to safety-critical system failures and can destroy the system. Recently, a safe Bayesian optimization algorithm, called SafeOpt, has been developed, which guarantees that the performance of the system never falls below a critical value; that is, safety is defined based on the performance function. However, coupling performance and safety is often not desirable in practice, since they are often opposing objectives. In this paper, we present a generalized algorithm that allows for multiple safety constraints separate from the objective. Given an initial set of safe parameters, the algorithm maximizes performance but only evaluates parameters that satisfy safety for all constraints with high probability. To this end, it carefully explores the parameter space by exploiting regularity assumptions in terms of a Gaussian process prior. Moreover, we show how context variables can be used to safely transfer knowledge to new situations and tasks. We provide a theoretical analysis and demonstrate that the proposed algorithm enables fast, automatic, and safe optimization of tuning parameters in experiments on a quadrotor vehicle.


Introduction
Safety and the ability to operate within constraints imposed by an environment are critical prerequisites for any algorithm that is applied on a real robotic system. Especially in robotics, where systems often face large prior uncertainties, failures can cause serious damage to the robot and its environment (Schaal and Atkeson 2010). To avoid unsafe behavior, safety is typically guaranteed with respect to a model of the robot's dynamics and environment. When accurate models are not available or when the robotic system contains elements that are difficult to model, such as computer vision components, the parameters of the algorithms are either tuned manually in experiments on the real system or tuned based on massive amounts of experimental data (Lillicrap et al. 2015). Both methods are time-consuming and potentially safety-critical: the engineer must either carefully select parameters that are safe or collect enough representative data that leads to safe behavior.
In this paper, we present a method to automatically optimize parameters of robotics algorithms while respecting safety constraints during the optimization. The resulting algorithm can be used to optimize parameters on the real robot without failures, since no unsafe parameters are evaluated during the optimization. We expand the theoretical framework of SAFEOPT (Safe Optimization) by Sui et al. (2015) to the more general setting with multiple constraints. We show that our algorithm, SAFEOPT-MC (for multiple constraints), enjoys strong theoretical guarantees about safety and performance, and works well in practice.
Related work In control theory, guaranteeing safety in the presence of unmodeled dynamics is often interpreted as the problem of ensuring stability of the underlying control law with respect to an uncertain dynamic system model (Zhou and Doyle 1998). In this setting, controllers can be gradually improved by estimating the unmodeled dynamics and updating the control law based on this estimate. Safety can be guaranteed by ensuring that either the controller is robustly stable for all possible models within the uncertainty specification (Berkenkamp and Schoellig 2015;Berkenkamp et al. 2017) or the system never leaves a safe subset of the state space (Ostafew et al. 2016;Aswani et al. 2013;Akametalu et al. 2014;Moldovan and Abbeel 2012;Turchetta et al. 2016). Both methods require a system model and uncertainty specification to be known a priori, which must be accurate enough to guarantee stability. In contrast, in our setting we do not assume to have access to a model of the system, but aim to directly optimize the parameters of a control algorithm, without violating safety constraints.
In the robotics literature, optimization algorithms have previously been applied with the goal of maximizing a user-defined performance function through iterative experiments. This is especially powerful when no prior model of the robot is available. However, typical algorithms in the literature do not consider safety of the optimization process, and make other restrictive assumptions such as requiring gradients (Killingsworth and Krstić 2006;Åström et al. 1993), which are difficult to obtain from noisy data, or an impractical number of experiments (Davidor 1991).
The objective of learning optimal policies has been extensively studied in the reinforcement learning community (Sutton and Barto 1998). In particular, the area of policy search considers the objective of optimizing the parameters of control algorithms (Kober and Peters 2014). The state of the art methods are based on estimating the gradients of the performance function Schaal 2006, 2008). As a result, typically multiple evaluations of very similar parameters are conducted on the real system in order to estimate the gradients, which means the approaches are often not data-efficient and converge to local optima. Safety in gradient-based policy search has only been considered by disallowing large steps along the gradient into areas of the parameter space that have not been explored before (Achiam et al. 2017). Guarantees there hold only in expectation. In contrast, our method is gradient-free, so that it can explore the parameter space globally in a more data-efficient manner. At the same time, we provide highprobability worst-case guarantees for not violating safety constraints during the optimization process.
One class of optimization algorithms that has been successfully applied to robotics is Bayesian optimization (Mockus 2012). In Bayesian optimization, rather than considering the objective function as a black-box about which we can only obtain point-wise information, regularity assumptions are made. These are used to actively learn a model of the objective function. The resulting algorithms are practical and provably find the global optimum of the objective function while evaluating the function at only few parameters (Bull 2011;Srinivas et al. 2012). Bayesian optimization methods often model the unknown function as a Gaussian process (GP) (Rasmussen and Williams 2006). These models are highly flexible, allow to encode as much prior knowledge as desired, and explicitly model noise in the performance function evaluations. The GP models are used to guide function evaluations to locations that are informative about the optimum of the unknown function (Mockus 2012;Jones 2001). Example applications of Bayesian optimization in robotics include gait optimization of legged robots (Calandra et al. 2014a;Lizotte et al. 2007) and the optimization of the controller parameters of a snake-like robot (Tesch et al. 2011). Marco et al. (2017) optimize the weighting matrices of an LQR controller for an inverted pendulum by exploiting additional information from a simulator. Several different Bayesian optimization methods are compared by Calandra et al. (2014b) for the case of bipedal locomotion. While these examples illustrate the potential of Bayesian optimization methods in robotics, none of these examples explicitly considers safety as a requirement.
Recently, the concept of constraints has been incorporated into Bayesian optimization. Gelbart et al. (2014) introduce an algorithm to optimize an unknown function subject to an unknown constraint. However, this constraint was not considered to be safety-critical; that is, the algorithm is allowed to evaluate unsafe parameters. The case of finding a safe subset of the parameters without violating safety constraints was considered by Schreiter et al. (2015), while Sui et al. (2015) presented a similar algorithm to safely optimize an objective function. However, the algorithm of Sui et al. (2015) considers safety as a minimum performance requirement. In robotics, safety constraints are typically functions of the states or inputs that are independent of the performance.
Our contributions In this paper, we present an algorithm that considers multiple, arbitrary safety constraints decoupled from the performance objective. This generalization retains the desirable sample-efficient properties of normal Bayesian optimization, but carefully explores the parameter space in order to maximize performance while guaranteeing safety with high probability. We extend the theory of SAFEOPT (Sui et al. 2015) to account for these additional constraints and show that similar theoretical guarantees can be obtained for the more general setting. We then relax the assumptions used in the proofs to obtain a more practical version of the algorithm and additionally show that the safety guarantees carry over to this case. Next to the theoretical contributions, the second main contribution is an extensive experimental evaluation of the method, where we consider the problem of safely optimizing linear and nonlinear laws on a quadrotor vehicle. The experiments demonstrate that the proposed approach is able to safely optimize parameters of a control algorithms while respecting safety constraints with high probability. Moreover, we show how ideas from contextbased optimization (Krause and Ong 2011) can be used to safely transfer knowledge in order to obtain environmentdependent control laws. For example, in our experiments we optimize a control law for different flying speeds of a quadrotor vehicle. Early results with only a safety constraint on performance and without additional theoretical results were presented in .
The rest of the paper is structured as follows: in Section 2, we define the problem of safely optimizing the parameters of control algorithms and provide an overview on GPs and Bayesian optimization in Section 3. In Section 4, we introduce our algorithm, and analyze its theoretical properties in Section 4.2. We evaluate the performance of the algorithm in experiments on a quadrotor vehicle in Section 5 and draw conclusions in Section 6. The proofs of the theorems are provided in Section 7.

Problem Statement
We consider a given algorithm that is used to accomplish a certain task with a robot. In general, this algorithm is arbitrary and may contain several components including vision, state estimation, planning, and control laws. The algorithm depends on tuning parameters a ∈ A in some specified, domain A ⊆ R d .
The goal is to find the parameters within A that maximize a given, scalar performance measure, f . For example, this performance measure may represent the negative tracking error of a robot , the average walking speed of a bipedal robot (Calandra et al. 2014a), or any other quantity that can be computed over a finite time horizon. We can only evaluate the performance measure for any parameter set a on finitetime trajectories from experiments on the real robot. The functional dependence of f on a is not known a priori. In the following, we write the performance measure as a function of the parameters a, f : A → R, even though measuring performance requires an experiment on the physical robot and typically depends on a trajectory of states, control inputs, and external signals.
We assume that the underlying system is safetycritical; that is, there are constraints that the system must respect when evaluating parameters. Similarly to the performance measure, f (a), these constraints can represent any quantity and may depend on states, inputs, or even environment variables. There are q safety constraints of the form g i (a) ≥ 0, g i : A → R, i = 1 . . . q, which together define the safety conditions. This is without loss of generality, since any constraint function can be shifted by a constant in order to obtain this form. The functions g i are unknown a priori but can be estimated through (typically noisy) experiments for a given parameter set a. For example, in order to encode a state constraint on an obstacle for a robot, the safety function g i (a) can return the smallest distance to the obstacle along a trajectory of states when using algorithm parameters a. Note that if the functions were known in advance, we could simply exclude unsafe parameters from the set A.
The overall optimization problem can be written as The goal is to iteratively find the global maximum of this constrained optimization problem by, at each iteration n, selecting parameters a n and evaluating (up to noise) the corresponding function values f (a n ) and g i (a n ) until the optimal parameters are found. In particular, since the constraints define the safety of the underlying system, only parameters that are inside the feasible region of (1) are allowed to be evaluated; that is, only parameters that fulfill these safety requirements on the real system.
Since the functions f and g i in (1) are unknown a priori, it is not generally possible to solve the corresponding optimization problem without violating the constraints. The first problem is that we do not know how to select a first, safe parameter to evaluate. In the following, we assume that an initial safe set of parameters S 0 ⊆ A is known for which the constraints are fulfilled. These serve as a starting point for the exploration of the safe region in (1). In robotics, safe initial parameters with poor performance can often be obtained from a simulation or domain knowledge.
Secondly, in order to safely explore the parameter space beyond S 0 , we must be able to infer whether parameters a that we have not evaluated yet are safe to use on the real system. To this end, we make regularity assumptions about the functions f and g i in (1). We discuss these assumptions in more detail in Section 4.2. However, broadly speaking we make assumptions that allow us to model the functions f and g i as a GP, construct reliable confidence intervals over the domain A, and imply Lipschitz continuity properties. Using these properties, we are able to generalize safety beyond the initial, safe parameters S 0 . Given the model assumptions, we require that the safety constraints hold with high probability over the entire sequence of experiments.
As a consequence of the safety requirements, it is not generally possible to find the global optimum of (1). Instead we aim to find the optimum in the part of the feasible region that is safely reachable from S 0 . We formalize this precisely in Section 4.
Lastly, whenever we evaluate parameters on the real system, we only obtain noisy estimates of both the performance function and the constraints, since both depend on noisy sensor data along trajectories. That is, for each parameter a the we evaluate, we obtain measurementsf (a) = f (a) + ω 0 andĝ i (a) = g i (a) + ω i , where ω i , i = 0, . . . , q, is zero-mean, σ-sub-Gaussian noise. Note that whilef (a) is a random variable, we usef (a n ) to denote the measurement obtained at iteration n. In general, the noise variables may be correlated, but we do not consider this case in our theoretical analysis in Section 4.2. We only want to evaluate parameters where all safety constraints are fulfilled, so that g i (a n ) ≥ 0 for all i ∈ {1, . . . , q} and n ≥ 1.

Background
In this section, we review Gaussian processes (GPs) and Bayesian optimization, which form the foundation of our safe Bayesian optimization algorithm in Section 4. The introduction to GPs is standard and based on  and (Rasmussen and Williams 2006).

Gaussian Process (GP)
Both the function f (a) and the safety constraints g i (a) in Section 2 are unknown a priori. We use GPs as a nonparametric model to approximate these unknown functions over their common domain A. In the following, we focus on a single function, the performance function. We extend this model to multiple functions in order to represent both performance and constraints in Section 3.1.1.
GPs are a popular choice for nonparametric regression in machine learning, where the goal is to find an approximation of a nonlinear map, f (a) : A → R, from an input vector a ∈ A to the function value f (a). This is accomplished by assuming that the function values f (a), associated with different values of a, are random variables and that any finite number of these random variables have a joint Gaussian distribution (Rasmussen and Williams 2006). A GP is parameterized by a prior mean function and a covariance function k(a, a ), which defines the covariance of any two function values f (a) and f (a ), a, a ∈ A. The latter is also known as the kernel. In this work, the mean is assumed to be zero, without loss of generality. The choice of kernel function is problem-dependent and encodes assumptions about smoothness and rate of change of the unknown function. A review of potential kernels can be found in (Rasmussen and Williams 2006) and more information about the kernels used in this paper is given in Section 5.
The GP framework can be used to predict the function value f (a * ) for an arbitrary parameter a * ∈ A based on a set of n past observations, {f (a i )} n i=1 , at the chosen parameters D n = {a i } n i=1 . The GP model assumes that observations are noisy measurements of the true function value f (a); that is,f (a) = f (a) + ω with ω ∼ N (0, σ 2 ). Conditioned on these observations, the posterior distribution is a GP again with mean and variance µ n (a * ) = k n (a * )(K n + I n σ 2 ) −1f n , wheref n = f (a 1 ), . . . ,f (a n ) T is the vector of observed, noisy function values, the covariance matrix K n ∈ R n×n has entries [K n ] (i,j) = k(a i , a j ), i, j ∈ {1, . . . , n}, and the vector k n (a * ) = k(a * , a 1 ), . . . , k(a * , a n ) contains the covariances between the new input a * and the observed data points in D n . The matrix I n ∈ R n×n denotes the identity matrix.
3.1.1 GPs with multiple outputs So far, we have focused on GPs that model a single scalar function. In order to model not only the performance, f (a), but also the safety constraints, g i (a), we have to consider multiple, possibly correlated functions. In the GP literature, these are usually treated by considering a matrix of kernel functions, which models the correlation between different functions (Álvarez et al. 2012). Here instead, we use an equivalent representation by considering a surrogate function, which returns either the performance function or the individual safety constraints depending on the additional input i ∈ I with I = {0, . . . , q}, where I g = {1, . . . , q} ⊂ I are the indices belonging to the constraints. The function h(·, ·) is a single-output function and can be modeled as a GP with scalar output over the extended parameter space A × I. For example, the kernel for the performance function f (a) and one safety constraint g(a) may look like this: where δ ij is the Kronecker delta. This kernel models the functions f (a) and g(a) with independent kernels k f and k g respectively, but also introduces a covariance function k f g that models similarities between the two function outputs. By extending the training data by the extra parameter i, we can use the normal GP framework and predict function values and corresponding uncertainties using (2) and (3). When observing the function values, the index i is added to the parameter set a for each observation. Including noise parameters inside the kernel allows to model noise correlation between the individual functions.
Importantly, using this surrogate function rather than the framework ofÁlvarez et al. (2012) enables us to lift theoretical results of Sui et al. (2015) to the more general case with multiple constraints and provide theoretical guarantees for our algorithm in Section 4.2.
In the setting with multiple outputs, at every iteration n, we obtain |I| = q + 1 measurements; one for each function. For ease of notation, we continue to write µ n and σ n , even though we have obtained n · (q + 1) measurements at locations D n × I in the extended parameter space.

Bayesian Optimization
Bayesian optimization aims to find the global maximum of an unknown function (Mockus 2012). The framework assumes that evaluating the function is expensive, while computational resources are relatively cheap. This fits our problem in Section 2, where each evaluation of the performance function corresponds to an experiment on the real system, which takes time and causes wear in the robotic system.
In general, Bayesian optimization models the objective function as a random function and uses this model to determine informative sample locations. A popular approach is to model the underlying function as a GP, see Section 3.1. GP-based methods use the posterior mean and variance predictions in (2) and (3) to compute the next sample location. For example, according to the GP-UCB (GP-Upper Confidence Bound) algorithm by Srinivas et al. (2012), the next sample location is a n = argmax a∈A µ n−1 (a) + β 1/2 n σ n−1 (a), where β n is an iteration-dependent scalar that reflects the confidence interval of the GP. Intuitively, (5) selects new evaluation points at locations where the upper bound of the confidence interval of the GP estimate is maximal. Repeatedly evaluating the system at locations given by (5) improves the mean estimate of the underlying function and decreases the uncertainty at candidate locations for the maximum, such that the global maximum is provably found eventually (Srinivas et al. 2012). While (5) is also an optimization problem, its solution does not require any evaluations on the real system and only uses the GP model. This reflects the assumption of cheap computational resources. In practice, Bayesian optimization typically focuses on lowdimensional problems. However, this can be scaled up by discovering a low-dimensional subspace of A for Bayesian optimization (Djolonga et al. 2013;Wang et al. 2013) or encoding additional structure in the kernel (Duvenaud et al. 2011).

Contextual Bayesian Optimization
Contextual Bayesian optimization is a conceptually straightforward extension of Bayesian optimization (Krause and Ong 2011). It enables optimization of functions that depend on additional, external variables, which are called contexts. For example, the performance of a robot may depend on its battery level or the weather conditions, both of which cannot be influenced directly. Alternatively, contexts can also represent different tasks that the robot has to solve, which are specified externally by a user. The idea is to include the functional dependence on the context in the GP model, but to consider them fixed when selecting the next parameters to evaluate.
For example, given a context z ∈ Z that is fixed by the environment, we can model how the performance and constraint functions change with respect to different contexts by multiplying the kernel function k a over the parameters, with another kernel k z : Z × Z → R over the contexts, (6) This kernel structure implies that function values are correlated when both parameters and the contexts are similar. For example, we would expect selecting the same parameters a for a control algorithm to lead to similar performance values if the context (e.g., the battery level) is similar.
Since contexts are not part of the optimization criterion, a modified version of (5) has to be used. It was shown by Krause and Ong (2011) that an algorithm that evaluates the GP-UCB criterion given a fixed context z n , a n = argmax a∈A µ n−1 (a, z n ) + β 1/2 n σ n−1 (a, z n ), enjoys similar convergence guarantees as normal Bayesian optimization in Section 3.2. Specifically, after seeing a particular context often enough, the criterion (7) will query parameters that are close-to-optimal.

Safe Bayesian Optimization ( SAFEOPT)
In this paper, we extend the safe optimization algorithm SAFEOPT (Sui et al. 2015) to multiple constraints. SAFEOPT is a Bayesian optimization algorithm, see Section 3.2. However, instead of optimizing the underlying performance function f (a) globally, it restricts itself to a safe set of parameters that achieve a certain minimum performance with high probability. This safe set is not known initially, but is estimated after each function evaluation. In this setting, the challenge is to find an appropriate evaluation strategy similar to (5), which at each iteration n not only aims to find the global maximum within the currently known safe set (exploitation), but also aims to increase the set of controllers that are known to be safe (exploration). SAFEOPT trades off between these two sets by choosing for the next experiment the parameters inside the safe set about whose performance we are the most uncertain.

SAFEOPT-MC (Multiple Constraints)
In this section, we introduce the SAFEOPT-MC algorithm for multiple constraints and discuss its theoretical properties. The goal of the algorithm is to solve (1) by evaluating different parameters from the domain A without violating the safety constraints. To this end, any algorithm has to consider two important properties: (i) Expanding the region of the optimization problem that is known to be feasible or safe as much as possible without violating the constraints, (ii) Finding the optimal parameters within the current safe set.
For objective i), we need quantify the size of the safe set. To do this in a tractable manner, we focus on finite sets A in the following. However, heuristic extensions to continuous domains exist (Duivenvoorden et al. 2017). The theoretical guarantees of the algorithm rely on the continuity of the underlying function. Many commonly used kernels, such as the squared exponential (Gaussian) kernel, encode Lipschitz-continuous functions with high probability (Ghosal and Roy 2006). We make more specific assumptions that ensure deterministic Lipschitz constants in Section 4.2. For now, we assume that f (a) and g i (a) are Lipschitz continuous with Lipschitz constant L with respect to some norm * .
Since we only observe noisy estimates of both the performance function and the constraints, we cannot expect to find the entire safe region encoded by the constraints within a finite number of evaluations. Instead, we follow Sui et al. (2015) and consider learning the safety constraint up to some accuracy . This assumption is equivalent to a minimum slack of on the constraints in (1).
As mentioned in Section 2, we assume that we have access to initial, safe parameters S 0 ⊆ A, for which we know that the safety constraints are satisfied a priori. Starting from these initial parameters, we ask what the best that any safe optimization algorithm could hope to achieve is. In particular, if we knew the safety constraint functions g i (·) up to accuracy within some safe set of parameters S, we could exploit the continuity properties to expand the safe set to where R (S) represents the number of parameters that can be classified as safe given that we know g up to -error inside S and exploiting the Lipschitz continuity to generalize to new parameters outside of S. The baseline that we compare against is the limit of repeatedly applying this operator on S 0 ; that is, with R n (S) = R (R n−1 (S)) and R 1 (S) = R (S) the baseline isR (S 0 ) := lim n→∞ R n (S 0 ). This set contains all the parameters in A that could be classified as safe starting from S 0 if we knew the function up to error. This set does not include all the parameters that potentially fulfill the constraints in (1), but is the best we can do without violating the safety constraints. Hence the optimal value that we compare against is not the one in (1), but which is the maximum performance value over the set that we could hope to classify as safe starting from the initial safe set, S 0 .

The Algorithm
In this section, we present the SAFEOPT-MC algorithm that guarantees convergence to the previously set baseline. The most critical aspect of the algorithm is safety. However, once safety is ensured, the second challenge is to find an evaluation criterion that enables trading off between exploration, trying to further expand the current estimate of the safe set, and exploitation, trying to improving the estimate of the best parameters within the current set.
To ensure safety, we construct confidence intervals that contain the true functions f and g i with high probability. In particular, we use the posterior GP estimate given the data observed so far. The confidence intervals for the surrogate function in (4) are defined as where β n is a scalar that determines the desired confidence interval. This set contains all possible function values between the lower and upper confidence interval based on the GP posterior. The probability of the true function value lying within this interval depends on the choice of β n , as well as on the assumptions made about the functions. We provide more details about this choice in Section 4.2, Lemma 1, and Section 4.3. Rather than defining the lower and upper bounds based on (10), the following analysis requires that consecutive * The functions f and g i can have different Lipschitz constants L i , but we assume a global Lipschitz constant for ease of notation. Additionally, the theoretical results transfer equivalently to the case of Lipschitz-continuity with respect to some metric. Inputs a (f) After 10 evaluations: safe maximum found.

Figure 2.
Optimization with the SAFEOPT-MC algorithm after 1, 2 and 10 parameter evaluations. Based on the mean estimate (blue) and the 2σ confidence interval (light blue), the algorithm selects evaluation points for which g(a) ≥ 0 (black dashed) from the safe set Sn (red), which are either potential maximizers Mn (green) or expanders Gn (magenta). It then learns about the function by drawing noisy samples from the unknown, underlying function (light gray). This way, we expand the safe region (red) as much as possible and, simultaneously, find the global optimum of the unknown function (17) (cyan circle).
estimates of the lower and upper bounds are contained within each other. This assumption ensures that the safe set does not shrink from one iteration to the next, which we require to prove our results. We relax this assumption in Section 4.3. We define the contained set at iteration n as C n (a, i) for all a ∈ S 0 and R otherwise. This ensures that parameters in the initial safe set S 0 remain safe according to the GP model after additional observations. The lower and upper bounds on this set are defined as l i n (a) := min C n (a, i) and u i n (a) := max C n (a, i), respectively. For notational clarity, we write l f n (a) := l 0 n (a) and u f n (a) := u 0 n (a) for the performance bounds.
Based on these confidence intervals for the function values and a current safe set S n−1 , we can enlargen the safe set using the Lipschitz continuity properties, (11) The set S n contains all points in S n−1 , as well as all additional parameters that fulfill the safety constraints given the GP confidence intervals and the Lipschitz constant.
With the set of safe parameters defined, the last remaining challenge is to trade off between exploration and exploitation. One could, similar to Schreiter et al. (2015), simply select the most uncertain element over the entire set. However, this approach is not sample-efficient, since it involves learning about the entire function rather than restricting evaluations to the relevant parameters. To avoid this, we first define subsets of S n that correspond to parameters that could either improve the estimate of the maximum or could expand the safe set. The set of potential maximizers is defined as which contains all parameters for which the upper bound of the current performance estimate is above the best lower bound. The parameters in M n are candidates for the optimum, since they could obtain performance values above the current conservative estimate of the optimal performance. Similarly, an optimistic set of parameters that could potentially enlarge the safe set is e n (a) := a ∈ A \ S n | ∃i ∈ I g : u i n (a) − L a − a ≥ 0 . The function e n enumerates the number of parameters that could additionally be classified as safe if a safety function obtained a measurement equal to its upper confidence bound. Thus, the set G n is an optimistic set of parameters that could potentially expand the safe set.
We trade off between the two sets, M n and G n , by selecting the most uncertain element across all performance and safety functions; that is, at each iteration n we select a n = argmax a∈Gn∪Mn max i∈I w n (a, i), as the next parameter set to be evaluated on the real system. The implications of this selection criterion will become more apparent in the next section, but from a high-level view this criterion leads to a behavior that focuses almost exclusively on exploration initially, as the most uncertain points will typically lie on the boundary of the safe set for many commonly used kernels. This changes once the constraint evaluations return results closer to the safety constraints. At this point, the algorithm keeps switching between selecting parameters that are potential maximizers, and parameters that could expand the safe set and lead to new areas in the parameter space with even higher function values. Pseudocode for the algorithm is found in Algorithm 1.
We show an example run of the algorithm in Figure 2. It starts from an initial safe parameter a 0 ∈ S 0 at which we obtain a measurement in Figure 2d. Based on this, the algorithms uses the continuity properties of the safety function and the GP in order to determine nearby parameters as safe (red set). This corresponds to the region where the high-probability confidence intervals of the GP model (blue shaded) are above the safety threshold (grey dashed line). At the next iteration in Figure 2e, the algorithm evaluates parameters that are close to the boundary of the safe set, in order to expand the set of safe parameters. Eventually the algorithm converges to the optimal parameters in Figure 2c, which obtain the largest performance value that is possible without violating the safety constraints. A local optimization approach, e.g. based on estimated gradients † , would have gotten stuck in the local optimum at the initial parameter a 0 .
At any iteration, we can obtain an estimate for the current best parameters from a n = argmax which returns the best, safe lower-bound on the performance function f .

Theoretical Results
In this section, we show that the same theoretical framework from the SAFEOPT algorithm (Sui et al. 2015) can be extended to multiple constraints and the evaluation criterion (15). Here, we only provide the results and highlevel ideas of the proofs. The mathematical details are provided in Section 7. For simplicity, we assume that all G n ← {a ∈ S n | e n (a) ≥ 0} 5 a n ← argmax a∈Gn∪Mn max i∈I w n (a, i) 6 Measurements:f (a n ),ĝ i (a n ) ∀i = 0, . . . , q 7 Update GP with new data 8 end function evaluations are corrupted by the same σ-sub-Gaussian noise in this section. In order to provide guarantees for safety, we need the confidence intervals in (10) to hold for all iterations and functions. In the following, we assume that the surrogate function h(a, i) has bounded norm in a reproducing kernel Hilbert space (RKHS, c.f., Christmann and Steinwart (2008)). A RKHS corresponding to a kernel k(·, ·) includes functions of the form h(a, i) = j α j k((a, i), (a j , i j )) with α i ∈ R and representer points (a j , i j ) ∈ A × I. The bounded norm property implies that the coefficients α j decay sufficiently fast as j increases. Intuitively, these functions are well-behaved, in that they are regular with respect to the choice of kernel.
The following Lemma allows us to choose a scaling factor β n for (10), so that we achieve a specific probability of the true function being contained in the confidence intervals for all iterations.
Lemma 1. (based on Chowdhury and Gopalan (2017)). Assume that h(a, i) has RKHS norm bounded by B and that measurements are corrupted by σ-sub-Gaussian noise. If β 1/2 n = B + 4σ γ (n−1)|I| + 1 + ln(1/δ), then the following holds for all parameters a ∈ A, function indices i ∈ I, and iterations n ≥ 1 jointly with probability at least 1 − δ: Moreover, if the kernel is continuously differentiable, then the corresponding functions are Lipschitz continuous (Christmann and Steinwart 2008). Note that Lemma 1 † If gradient information is available, it can be incorporated in the GP model too (Solak et al. 2003) does not make probabilistic assumptions on h -in fact, h could be chosen adversarially, as long as it has bounded norm in the RKHS. Similar results can be be obtained for the Bayesian setting where the function h is assumed to be drawn from the GP prior (Srinivas et al. 2012).
The scaling factor β n in Lemma 1 depends on the information capacity γ n associated with the kernel k. It is the maximum amount of mutual information that we can obtain about the GP model of h(·) from n noisy measurementsĥ D at parameters D = {(a 1 , i 1 ), . . . }, Intuitively, it quantifies a best case scenario where we can select the measurements in the most informative manner possible. The information capacity γ n has a sublinear dependence on n for many commonly-used kernels and can numerically approximated up to a small constant factor for any given kernel Srinivas et al. (2012).
Since the confidence intervals hold with probability 1 − δ and the safe set is not empty starting from S 0 , it is possible to prove that parameters within the safe set S n are always safe with high probability. In order for the algorithm to compete with our baseline, we must additionally ensure that the algorithm learns the true function up to confidence in both the sets M n and G n . The number of measurements required to achieve this depends on the information capacity γ n , since it encodes how much information can be obtained about the true function from n measurements. We use the sublinearity of γ n in order to bound the number of samples required to estimate the function up to accuracy. We have the following result: Theorem 1. Assume that h(a, i) has bounded norm in an RKHS and that the measurement noise is σ-sub-Gaussian. Also, assume that S 0 = ∅ and g i (a) ≥ 0 for all a ∈ S 0 and i ∈ I g . Choose β n as in Lemma 1, defineâ n as in (17), and let n * ( , δ) be the smallest positive integer satisfying where C 1 = 8/ log(1 + σ −2 ). For any > 0 and δ ∈ (0, 1), when running Algorithm 1 the following inequalities jointly hold with probability at least 1 − δ: 1. Safety: ∀n ≥ 1, ∀i ∈ I g : g i (a n ) ≥ 0 2. Optimality: ∀n ≥ n * , f (â n ) ≥ f * − Proof. Main idea: safety follows from Lemma 1, since accurate confidence intervals imply that we do not evaluate unsafe parameters. For the optimality, the main idea is that, since we evaluate the most uncertain element in M n ∪ G n , the uncertainty about the maximum is bounded by w n (a n , i n ). The result follows from showing that, after a finite number of evaluations, either the safe set expands or the maximum uncertainty within M n ∪ G n shrinks to . See Section 7 for derivations and details.
Theorem 1 states that, given the assumptions we made about the underlying function, Algorithm 1 explores the state space without violating the safety constraints and, after at most n * samples, finds an estimate that isclose to the optimal value over the safely reachable region. The information capacity γ |I|n * , grows at a faster rate of |I|n compared to n in SAFEOPT, since we obtain |I| observations at the same parameters set a, while the SAFEOPT analysis assumes every measurement is optimized independently. However, γ |I|n remains sublinear in n, see Section 7.

Contexts
In this section, we show how the theoretical guarantees derived in the previous section transfer to contextual Bayesian optimization. In this setting, part of the variables that influence the performance, the contexts, are fixed by an external process that we do not necessarily control. In normal Bayesian optimization, it was shown by Krause and Ong (2011) that an algorithm that optimizes the GP-UCB criterion in (7) for a fixed context converges to the global optimum after repeatedly seeing the corresponding context.
Intuitively, the fact that part of the variables that influence the performance, the contexts, are now fixed by an external process should not impact the algorithm on a fundamental level. However, safety is a critical issue in our experiments and, in general, one could always select an adversarial context for which we do not have sufficient knowledge to determine safe controller parameters. As a consequence, we make the additional assumption that only 'safe' contexts are visited; that is, we assume the following: Assumption 1. For any n ≥ 1, the context z n ∈ Z is selected such that S n (z n ) = ∅.
Here, S n (z n ) denotes the safe set for the given context z n . Intuitively, Assumption 1 ensures that for every context there exists at least one parameter choice that is known to satisfy all safety constraints. This assumption includes the case where safe initial parameters for all contexts are known a priori and the case where the algorithm terminates and asks for help from a domainexpert whenever a context leads to an empty safe set.
A trivial extension of SAFEOPT-MC to contexts is to run |Z| independent instances of Algorithm 1, one for each context. This way, it is sufficient to repeatedly see a context several times to apply the previous results to the safe contextual optimization case. One can apply the previous analysis to this setting, but it would fail to yield guarantes that hold jointly for all contexts.
In order to obtain stronger results that hold jointly across all contexts in Z, we adapt the information capacity (worstcase mutual information) γ n to consider contexts, Unlike in (19), the mutual information is maximized across contexts in (21). As a result, we can use Lemma 1 to obtain confidence intervals that hold jointly across all contexts. A second challenge is that contexts are chosen in an arbitrary order. This is in stark contrast to the parameters a n , which are chosen according to (15) in order to be informative. This means that any tight finite sample bound on Algorithm 1 must necessarily depend on the order of contexts. The following theorem accounts for both of these challenges.
Theorem 2. Under the assumptions of Theorem 1 and Assumption 1. Choose β n as in Lemma 1, where γ n is now the worst-case mutual information over contexts as in (21). For a given context order (z 1 , z 2 , . . . ) and any context z ∈ Z, let be the number of times that we have observed the context z up to iteration n * and 1 is the indicator function. Let n * (z) be the smallest positive integers such that where C 1 = 8/ log(1 + σ −2 ). We note the information capacity for a fixed context z by γ n (z). That is, with h z (a, i) = h(a, i, z) it is defined as γ n (z) = max D⊆A×I,|D|≤n I(ĥ z D ; h z ). For any > 0 and δ ∈ (0, 1), let f * (z) = max a∈R (S0) f (a, z). Then, when running Algorithm 1 the following inequalities jointly hold with probability at least 1 − δ: 1. ∀n ≥ 1, i ∈ I g : g i (a n , z n ) ≥ 0 Proof. For a fixed context, z n = z ∀n, we have n * (z) = n(z) and the results follow directly as in Theorem 1. Otherwise, the only difference in the proofs is that β depends on the information capacity over contexts, making sure that the confidence intervals are valid across contexts. By visiting contexts in Z \ {z}, we obtain more measurements and β increases, which in turn increases the upper bound on the number of samples required at context z.
Theorem 2 states that the contextual variant of Algorithm 1 enjoys the same safety guarantees as the noncontextual version. Additionally, it shows that, after gaining enough information about a particular context, it can identify the optimal parameters. In practice, this upper bound is conservative, since it does not acount for knowledge transfer accross contexts. In practice, correlations between contexts significantly speed up the learning process. For example, in Figure 3 we show a contextual safe optimization problem with two contexts. Even though the algorithm has only been able to explore the parameter space at the first context (z = 0, left function), the correlation between the functions means that information can be transferred to the as-of-yet unobserved context (z = 1, right function). This knowledge transfer significantly improves data-efficiency and the number of evaluations required by the algorithm.

Practical Implementation
In this section, we discuss possible changes to Algorithm 1 that make the algorithm more practical, at the expense of loosing some of the theoretical guarantees. One challenge in applying Algorithm 1 in practice, is defining a suitable Lipschitz constant. In particular, specifying the wrong constant can lead to conservativeness or unsafe parameters being evaluated. Moreover, smoothness assumptions are already encoded by the kernel choice, which is more intuitive to specify than Lipschitz constants on their own. In practice, we use only the GP model to ensure safety ; that is, we define l i n (a) = min Q n (a, i) and u i n (a, i) = max Q n (a, i) in terms of the confidence intervals of the GP directly. Thus, we can define the safe set without a Lipschitz constant as S n = S 0 ∪ a ∈ A | ∀i ∈ I g : l i n (a) ≥ 0 .
While it is difficult to prove the full exploration of the safely reachable set as in Theorem 1, the resulting algorithm remains safe: Lemma 2. With the assumptions of Lemma 1, S 0 = ∅, and g i (a) ≥ 0 for all a ∈ S 0 and i ∈ I g , when running Algorithm 1 with the safe set defined as in (24), the following holds with probability at least 1 − δ: ∀n ≥ 1, ∀i ∈ I g : g i (a n ) ≥ 0.
Proof. The confidence intervals hold with probability 1 − δ following Lemma 1. Since S n in (24) is defined as the set of parameters that fulfill the safety constraint and the safe set is never empty since S 0 = ∅, the claim follows.
Similarly, the set of expanders can be defined in terms of the GP directly, by adding optimistic measurements and counting the number of new parameters that are classified as safe, see ) for details. However, this potentially adds a large computational burden.
The parameter β n , which determines the GP's confidence interval in Lemma 1, may be impractically conservative for experiments. The theoretical safety results also hold when we replace γ n in β n by the empirical mutual information gained so far, I(ĥ Dn×I , h). Empirically, depending on the application, one may also consider setting β n to a constant value. This roughly corresponds to bounding the failure probability per iteration, rather than over all iterations.
Learning all the different functions, f and g i , up to the same accuracy may be restrictive if they are scaled differently. A possible solution is to either scale the observed data, or to scale the uncertainties in (15) by the prior variances of the kernels for that specific output. This enables (15) to make more homogeneous decisions across different scales.

Quadrotor Experiments
In this section, we demonstrate Algorithm 1 (with the changes discussed in Section 4.3) in experiments on a quadrotor vehicle, a Parrot AR.Drone 2.0.

Experimental Setup
During the experiments, measurements of all vehicle states are estimated from position and pose data provided by an overhead motion capture camera system. The quadrotor's dynamics can be described by six states: positions x = (x, y, z), velocitiesẋ = (ẋ,ẏ,ż), ZYX Euler angles (φ, θ, ψ), and body angular velocities (ω x , ω y , ω z ). The control inputs u are the desired roll and pitch angles θ des and φ des , the desired z-velocityż des , and the desired yaw angular velocity ω z,des , which in turn are inputs to an unknown, proprietary, on-board controller.
The position dynamics in the global coordinate frame arë where R ZYX is the rotation matrix from the body frame to the inertial frame, f = (0, 0, c) is the mass-normalized thrust, and g = (0, 0, g) is the gravitational force. The goal of the controller is to track a reference signal. We assume that z-position and the yaw angle are controlled by fixed control laws and focus on the position control in x-and y direction. We use two different control laws in the following experiments.
The most simple control law that can be used for this setting is a PD-controller, defined as where a = (k 1 , k 2 ) are the two tuning parameters. Intuitively, a larger parameter k 1 encourages tracking reference changes more aggressively, while k 2 is a damping factor that encourages moderate velocities. A more sophisticated approach to control quadrotor vehicles is to use estimates of the angles and accelerations to solve for the thrust c. We use loop shaping on the horizontal position control loops so that they behave in the manner of a second-order systems with time constant τ and damping ratio ζ. Based on a given desired reference trajectory, commanded accelerations are computed as From the commanded accelerations, we then obtain the control inputs for the desired roll and pitch angles by solving (26) for the angles. Here, the tuning parameters are a = (τ, ζ). For details regarding the controllers see (Schoellig et al. 2012;Lupashin et al. 2014). The quadrotor was controlled using the ardrone autonomy and vicon bridge packages in ROS Hydro. Computations for the SAFEOPT-MC algorithm in Algorithm 1 were conducted on a regular laptop and took significantly less than one second per iteration. As a result, experiments could be conducted continuously without interruptions or human interventions.

Kernel Selection
Algorithm 1 critically depends on the GP model for the performance and constraint functions. In this section, we review the kernel used in our experiments and the kind of models that they encode. A more thorough review of kernel properties can be found in (Rasmussen and Williams 2006).
In our experiments, we use the Matèrn kernel with parameter ν = 3/2 (Rasmussen and Williams 2006), which encodes that mean functions are one-times differentiable. This is in contrast to the commonly used squared exponential kernels, which encode smooth (infinitely differentiable) functions. With the Matèrn kernel, the GP model is parameterized by three hyperparameters: measurement noise σ 2 in (2) and (3), the kernel's prior variance κ 2 , and positive lengthscales l ∈ R A + , which are the diagonal elements of the diagonal matrix M, M = diag(l). These hyperparameters have intuitive interpretations: the variance of the measurement noise κ 2 corresponds to the noise in the observations, which includes any randomness in the algorithm and initial conditions, and random disturbances. The prior variance κ 2 determines the expected magnitude of function values; that is, |f (a)| ≤ κ with probability 0.68 according to the GP prior. Lastly, the lengthscales l determine how quickly the covariance between neighboring values deteriorates with their distance. The smaller the lengthscales, the faster the function values can change from one parameter set to the next. In particular, the high-probability Lipschitz constant encoded by this kernel depends on the ratio between the prior variance and the lengthscales, κ/l.
When using GPs to model dynamic systems, typically a maximum likelihood estimate of the hyperparameters is used based on data; see (Ostafew et al. 2016) for an example. For Bayesian optimization, the GP model is used to actively acquire data, rather than only using it for regression based on existing data. This dependence between the kernel hyperparameters and the acquired data is known to lead to poor results in Bayesian optimization when using a maximum likelihood estimate of the hyperparameters during data acquisition (Bull 2011). In particular, the corresponding GP estimate is not guaranteed to contain the true function as in Lemma 1. In this work, we critically rely on these confidence bounds to guarantee safety. As a consequence, we do not adapt the hyperparameters as more data becomes available, but treat the kernel as a prior over functions in the true Bayesian sense; that is, the kernel hyperparameters encode our prior knowledge about the functions that we model and are fixed before experiments begin. While this requires intuition about the process, intuitively the less knowledge we encode in the prior, the more data and evaluations on the real system are required in order to determine the best parameters.

Linear Control
In this section, we use SAFEOPT-MC to optimize the parameters of the linear control law in (27). The entire control algorithm consists of this control law together with the on-board controller and state estimation.
The goal is to find controller parameters that maximize the performance during a 1-meter reference position change. For an experiment with parameters a n at iteration n, the performance function is defined as f (a n ) = c(a n ) − 0.95 c(a 0 ), where, to compute the cost c, the states x = (x − 1,ẋ, φ, ω) and the input u are weighted by positive semi-definite matrices Q and R. The subscript k indicates the state measurement at time step k in the trajectory and the time horizon is 5 s (N = 350). Performance is defined as the cost improvement relative to 95% of the initial controller cost. The safety constraint is defined only in terms of the performance; that is, the constraint is g(a) = f (a) ≥ 0, which encodes that we do not want to evaluate controller parameters that perform significantly worse than the initial parameters. While the optimal controller parameters could be easily computed given an accurate model of the system, we do not have a model of the dynamics of the proprietary, on-board controller and the time delays in the system. Moreover, we want to optimize the performance for the real, nonlinear quadrotor system, which is difficult to model accurately. An inaccurate model of the system could be used to improve the prior GP model of the performance function, with the goal of achieving faster convergence. In this case, the uncertainty in the GP model of the performance function would account for inaccuracies in the system model.
We define the domain of the controller parameters as [−0.6, 0.1] 2 , explicitly including positive controller parameters that certainly lead to crashes. In practice, one In the bottom-left corner, there is the magnified section of the first three samples, which are close together to determine the location of the initial, safe region. The maximum, magnified in the top-left corner, also has more samples to determine the precise location of the maximum. Other areas are more coarsely sampled to expand the safe region.
would exclude parameters that are known to be unsafe a priori. The initial controller parameters are (−0.4, −0.4), which result in a controller with poor performance. Decreasing the controller gains further leads to unstable controllers. The parameters for the experiments were set as follows: the length-scales were set to 0.05 for both parameters, which corresponds to the notion that a 0.05-0.1 change in the parameters leads to very different performance values. The prior standard deviation, κ, and the noise standard deviation, σ, are set to 5% and 10% of the performance of the inital controller, f (a 0 ), respectively. The noise standard deviation, σ, mostly models errors due to initial position offsets, since state measurements have low noise. The size of these errors depends on the choice of the matrices Q and R. By choosing σ as a function of the initial performance, we account for the Q and R dependency. Similarly, κ specifies the expected size of the performance function values. Initially, the best we can do is to set this quantity dependent on the initial performance and leave additional room for future, larger performance values. For the GP model, we choose β 1/2 n = 2 to define the confidence interval in (10).
The resulting, estimated performance function after running Algorithm 1 for 30 experiments is shown in Fig. 4. The unknown function has been reliably identified. Samples  Figure 4. The initial controller (blue) performs poorly but is stable. In contrast, the optimized controller (red) shows an optimized, smooth, and fast response. The trajectories of other controller parameters that were evaluated are shown in gray.
are spread out over the entire safe set, with more samples close to the maximum of the function and close to the initial controller parameters. No unsafe parameters below the safety threshold were evaluated on the real system. Typically, the optimization behavior of Algorithm 1 can be roughly separated into three stages. Initially, the algorithm evaluates controller parameters close to the initial parameters in order for the GP to acquire information about the safe set (see lower-left, zoomed-in section in Figure 4). Once a region of safe controller parameters is determined, the algorithm evaluates the performance function more coarsely in order to expand the safe set. Eventually, the controller parameters are refined by evaluating highperformance parameters that are potential maximizers in a finer grid (see upper-left, zoomed-in section in Figure 4). The trajectories of the initial, best and intermediate controllers can be seen in Figure 5.

Nonlinear Control
In the previous section, we showed how to optimize the performance of a linear control law subject to a simple constraint on performance. In this section, we optimize the nonlinear controller in (29) and (30) and show how more complex constraints can be used. We use the same task as in the previous section, but this time the goal is to minimize the root-mean-square error (RMSE) over a time horizon of 5 s (N = 350 samples) during a 1-meter reference position change in x-direction. We define the performance function, f (a n ) = c(a n ) − 0.75 c(a 0 ), c(a n ) = 1 √ N N k=1 x k − x des,k 2 2 Performance f (a) Figure 6. Mean estimate of the root-mean-square error when optimizing the parameters of the nonlinear control law for a step response, subject to safety constraints. The algorithm carefully evaluates only safe parameter combinations, until the safe region cannot be expanded further without violating constraints. Without the safety constraint, the algorithm explores a larger region of the parameter space (light blue) and eventually evaluates an unsafe parameter set.
as the performance relative to 75% of the performance of the initial parameters a 0 = (0.9, 0.8). We define the GP model of this performance function as follows: in this experiment, measurement noise is minimal, since the positions are measured accurately by the overhead camera system. However, to capture errors in the initial position, we define σ = 0.05c(a 0 ). We assume that we can improve the initial controller by roughly 20%, so we set κ = 0.2c(a 0 ). The lengthscales are set to 0.05 in order to encourage cautious exploration. These parameters turned out to be conservative for the real system. Notice that the cost is specified relative to c(a 0 ) instead of f (a 0 ) as in Section 5.3. Since c(a 0 ) > f (a 0 ), these hyperparameters are more conservative, so that we require more evaluations on the real system. The reason for this more conservative choice is that the nonlinear controller is expected to have a less smooth performance function, unlike the one for linear control, which is expected to be roughly quadratic. If, as in the previous section, one were to set the safety constraint to g 1 (a) = f (a), the algorithm would classify the blue shaded region in Figure 6 as safe. This region includes time constants as low as τ = 0.3, which encourage highly aggressive maneuvers, as would be expected from a performance function that encourages changing position as fast as possible. However, these high gains amplify noise in the measurements, which can lead to crashes; that is, the performance-based constraint cannot properly encode safety. Notice that the blue shaded area does not correspond to full exploration, since the experiment was aborted after the first, serious crash. The reason for the unsafe exploration is that the RMSE performance function in (36) does not encode safety the same way as as weighting of state and input errors in Figure 4 does. Thus, in order to encode safety, we need to specify additional safety constraints.
One indication of unsafe behavior in quadrotors are high angular velocities when the quadrotor oscillates around the reference point. We define an additional safety constraint on the maximum angular velocity max k |ω x,k | ≤ 0.5 rad/s by setting g 2 (a) = 0.5 − max k |ω x,k |. The corresponding hyperparameters are selected as σ 2 = 0.1, l = 0.2, and κ = 0.25. The measurement noise can be chosen relatively small here, since it corresponds to a single measurement of angular velocity. Note that it is difficult to perform the step maneuver with an angular velocity lower than 0.4 rad/s, so that typical values of g 2 are smaller than 0.1.
With this additional safety constraint, the algorithm explores the parameter space and stops before the safety constraints are violated, as can be seen in Figure 6. Rather than exploring smaller time constants τ (higher gains), the algorithm evaluates larger damping ratios, which allow slightly smaller values of τ and therefore higher performance without violating the safety constraints. The optimal parameters are to the top-left of the safe set, where small time constants encourage tracking the reference aggressively, while the increased damping ratio ensures a moderate angular velocity.

Circle Trajectory
In this section, we use the same nonlinear controller and cost function as in the previous section, but aim to optimize the RMSE with respect to a circle trajectory of radius 1 m at a speed of 1 m/s. The reference is defined as a point moving along the circle at the desired speed. Feasibility of such motions has been analyzed in Schoellig et al. (2011).
In order to ensure good tracking behavior, we define safety as a constraint on the maximum RMSE of 0.2 m. Additionally, we use the same constraint on the maximum angular velocity around the x and y axis of 0.5 rad/s as before. The yaw-angle is set so that the quadrotor always points to the center of the circle, which ideally should lead to zero angular velocity. Deviations from this are an indication of unsafe behavior. We use the same kernel hyperparameters as in Section 5.4.
The trajectories that result from running the optimization algorithm are shown in Figure 7. The initial controller parameters lead to very poor performance. In particular, the initial time constant is too large, so that the quadrotor lags behind the reference. As a result, the quadrotor flies a circle of smaller radius. In contrast, the resulting optimized trajectory (in red) is the best that can be obtained given the safety constraints and controller structure above. The mean estimate of the performance function after the experiments can be seen in Figure 8a. The optimal parameters have smaller time constants, so that the position is tracked aggressively. Since the reference point moves of 1 m/s, these aggressive controller parameters do not lead to unsafe behavior. During the entire optimization, only safe parameters that keep the vehicle within the constraints on RMSE and angular velocity are evaluated.

Context-Dependent Optimization
In this section, we show how the knowledge about good controller parameters at low speeds can be used to speed up the safe learning at higher speeds. In our circle experiment, the quadrotor tracked a moving reference. As this reference moves with high velocities, the quadrotor gets pushed to its physical actuator limits and starts to lag behind the reference. This causes the circle that is flown by the quadrotor to have a smaller radius than the reference trajectory. In this section, the goal is to maximize the speed of the reference trajectory subject to the safety constraints of the previous experiment in Section 5.5. One way to achieve this goal, is to add the speed of the reference point to the performance function. However, this would lead to more experiments, as the algorithm will explore the safe parameter space for every velocity. Instead, here we define the trajectory speed as a context, which is set externally. In particular, we set z n = argmax v∈R, a∈A v subject to: g i (a, v) ≥ 0, ∀i ∈ I g , (37) that is, we select the maximum velocity for which there are safe parameters known. While here we select the context manually, in practice contexts can be used to model any external, measurable variables, such as the battery level, see Section 3.3.
In order to transfer knowledge about good controller parameters from the slow speed in Section 5.5 to higher speeds, we model how performance and constraints vary with desired speed by defining a kernel k z (ẋ des ,ẋ des ) over contexts. We use the same kernel structure as in (6) and hyperparameters κ = 1 and l = 0.25. Based on the data from Section 5.5, the extended model allows us to determine speeds for which safe controller parameters are known.
Starting from the data of the previous experiments in Section 5.5, we run SAFEOPT-MC using the extended kernel with the additional speed context determined by (37). This allows us to find optimal parameters for increasingly higher speeds, which satisfy the constraints. We can safely increase the speed up to 1.8 m/s. We show the mean performance function estimates for two speeds in Figure 8. For lower speeds, the best controller parameters track the reference position more aggressively (low τ ). For higher speeds, this behavior becomes unsafe as the quadrotor lags behind the reference point. Instead, the optimal parameters shift to higher time constants (lower gains). Additionally, as expected, high speeds lead to higher reference tracking errors. Increasing the reference velocity any further causes the performance constraint to be violated.
The trajectories that result from applying the optimal parameters for a speed of 1 m/s and the maximum safe speed of 1.8 m/s can be seen in Figure 9. For the relatively slow speed of 1 m/s the quadrotor can track the circle well using aggressive parameters. For the higher speed, the reference trajectory moves too fast for the quadrotor to track perfectly within the actuator limits, so that the best parameters just barely satisfy the safety constraint on the average deviation from the reference. Overall, this approach allows us to find context-dependent parameters, while remaining within the safety constraints.

Conclusion and Future Work
We presented a generalization of the Safe Bayesian Optimization algorithm of Sui et al. (2015) that allows multiple, separate safety constraints to be specified and applied it to nonlinear control problems on a quadrotor vehicle. Overall, the algorithm enabled efficient and automatic optimization of parameters without violating the  Figure 7 for a speed of 1 m/s (left) and 1.8 m/s (right). Extending the kernel with a context for speed allows to transfer knowledge to different speeds and leads to speed-dependent optimal control parameters, speeding up the learning for higher speeds. . Trajectories with optimal parameters for speeds of 1 m/s (red) and 1.8[m/s] (green) when tracking the black reference. At slower speeds there exist aggressive controller parameters that allow the quadrotor to track the reference almost perfectly. At higher speeds, actuator saturation limits the achievable performance. Due to the safe optimization framework the maximum speed can be found that does not deviate more from the reference trajectory than is allowed by the safety constraint. The corresponding performance functions can be seen in Figure 8.
safety constraints, which would lead to system failures.
Currently, the algorithm is mostly applicable to lowdimensional problems due to the computational burdon of optimizing (15) and the statistical problem of defining suitable GP priors in high-dimensions. While interesting progress has been made in this direction in the standard Bayesian optimization case, future work could explore this in the safety-critical case.

Proofs
In this section, we provide the proofs for Theorem 1 and Lemma 2.
Since we consider the surrogate function h(a, i) in (4), we obtain q + 1 individual measurements at each iteration, each with individual noise. A measure of how difficult it is to learn an accurate GP model of a function is given by the information capacity. This corresponds to the maximum amount of mutual information between a scalar function h and measurementsĥ D at a set of parameters D of size n. The measurementsĥ D are corrupted by zero-mean, Gaussian noise. The information capacity is then defined as γ n := max D⊂A×I,|D|=n which is the maximum amount of information we can obtain about the function h from n measurements. The information gain is known to be sublinear in n for many commonly used kernels Srinivas et al. (2012). Intuitively, the first samples for the GP model provide a lot of information, since each sample improves the prior significantly. After some iterations the domain A is covered with samples in D, so that additional samples are more correlated with previous data points in D, rendering the samples less informative. The more prior information we encode in the GP prior, the less information can be gained from the same number of samples. In our setting, we obtain |I| = q + 1 measurements at every iteration step n, each with different, independent noise. The mutual information with regards to these multiple measurements at parametersĀ ⊂ A can be bounded with ≤ max D⊂A×I,|D|≤n|I| whereĀ × I is the cartesian product that means we obtain one measurement for every function indexed by i ∈ I at each parameter inĀ. The first inequality bounds the mutual information gained by Algorithm 1 by the worst-case mutual information, while the second inequality bounds this again by the worst-case mutual information when optimizing over the |I| measurements at each iteration step individually. Intuitively, obtaining multiple optimal samples does not fundamentally change the properties of the information gain, but accelerates the rate at which information can be obtained in the worst case by |I|.
In the following, we assume that h(a, i) has bounded RKHS norm. Lemma 1 provides requirements for β n , which will be used in the following to prove the results. (2017)). Assume that h(a, i) has RKHS norm bounded by B and that measurements are corrupted by σ-sub-Gaussian noise. If β 1/2 n = B + 4σ γ (n−1)|I| + 1 + ln(1/δ), then the following holds for all parameters a ∈ A, function indices i ∈ I, and iterations n ≥ 1 jointly with probability at least 1 − δ: h(a, i) − µ n−1 (a, i) ≤ β 1/2 n σ n−1 (a, i).

Lemma 1. (based on Chowdhury and Gopalan
Proof. Directly follows from Chowdhury and Gopalan (2017). The only difference is that we obtain |I| measurements at every iteration, which causes the information capacity γ to grow at a faster rate.
Note Where needed in the following lemmas, we implicitly assume that the assumptions of Lemma 1 hold, and that β n is defined as above.
Proof. From Lemma 1 we know that the true functions are contained in Q n (a, i) for all iterations n with probability at least 1 − δ. As a consequence, the true functions will be contained in the intersection of these sets with the same probability.
Corollary 1 gives a choice of β n , which ensures that all the function values of h are contained within their respective confidence intervals with high probability. In the remainder of the paper, we follow the outline of the proofs in Sui et al. (2015), but extended them to account for multiple constraints.
We start by showing the dynamics of important sets and functions. Most importantly, the upper confidence bounds are decreasing, lower confidence bounds increasing with the number of iterations, since the sets C n+1 ⊆ C n for all iterations n.
(iv) Proof by induction. Consider the initial safe set, S 0 .
By definition of C 0 we have for all a ∈ S 0 and i ∈ I that l i 1 (a) − L a − a = l i 1 (a) ≥ l i 0 (a) ≥ 0. It then follows from the definition of S n that a ∈ S 1 .
For the induction step, assume that for some n ≥ 2, S n−1 ⊆ S n and let a ∈ S n . This means that for all i ∈ I g , ∃z i ∈ S n−1 , l i n (z i ) − L z i − a ≥ 0 by the definition of the safe set. But, since S n−1 ⊆ S n , this implies that z i ∈ S n , ∀i ∈ I g . Furthermore, by part (ii), l i n+1 (z) ≥ l i n (z i ). Therefore, we conclude that for all i ∈ I g , l i n+1 (z i ) − L z i − a ≥ 0, which implies that a ∈ S n+1 .
(v) Let a ∈ R (S). Then, by definition, for all i ∈ I g , ∃z i ∈ S, g i (z i ) − L z i − a ≥ 0. But, since S ⊆ R, it means that z i ∈ R∀i ∈ I g , and, therefore, g i (z i ) − L z i − a ≥ 0 for all i ∈ I g also implies that a ∈ R (R).
(vi) This follows directly by repeatedly applying the result of part (v).
Using the previous results, we start by showing that, after a finite number of iterations, the safe set has to expand if possible. As a first step, note that the set of expanders and maximizers are contained in each other as well if the safe set does not increase: Lemma 4. For any n 1 ≥ n 0 ≥ 1, if S n1 = S n0 , then, for any n, such that n 0 ≤ n < n 1 , it holds that Proof. Given the assumption that S n does not change, both G n+1 ⊆ G n and M n+1 ⊆ M n follow directly from the definitions of G n and M n . In particular, for G n , note that for any a ∈ S n , e i n (a) is decreasing in n for all i ∈ I g , since u i n (a) are decreasing in n. For M n , note that max a ∈Sn l f n (a ) is increasing in n, while u f n (a) is decreasing in n (see Lemma 3 (i), (ii)).
When running the SAFEOPT-MC algorithm, we repeatedly choose the most uncertain element from G n and M n . Since these sets are contained in each other if the safe set does not expand, we gain more information about these sets with each sample. Since the information gain is bounded, this allows us to bound the uncertainty in terms of the information gain over the entire set: Lemma 5. For any n 1 ≥ n 0 ≥ 1, if S n1 = S n0 and C 1 := 8/ log(1 + σ −2 ), then, for any n, such that n 0 ≤ t ≤ n 1 , it holds for all i ∈ I that w n (a n , i) ≤ C 1 β n γ |I|n n − n 0 .
Proof. Given Lemma 4, the definition of a n := argmax a∈Gn∪Mn (w n (a)), and the fact that, w i n (a n ) ≤ 2β 1/2 n max ∈∈I σ n−1 (a n , i) = 2β 1/2 n (a n , i n ), the proof is completely analogous to that of Lemma 5.3 by Srinivas et al. (2012). We only highlight the main differences here, which results from having several functions.
Corollary 2. For any n ≥ 1, if C 1 is defined as above, N n is the smallest positive integer satisfying N n β n+Nn γ |I|(n+Nn) ≥ C 1 2 , and S n+Nn = S n , then, for any a ∈ G n+Nn ∪ M n+Nn , and for all i ∈ I it holds that w n+Nn (a, i) ≤ .
Note Where needed in the following lemmas, we assume that C 1 and N n are defined as above.
That is, after a finite number of evaluations N n the most uncertain element within these sets is at most . Given that the reachability operator in (8) is defined in terms of the same accuracy, it allows us to show that after at most N n evaluations, the safe set has to increase unless it is impossible to do so: Lemma 6. For any n ≥ 1, ifR (S 0 ) \ S n = ∅, then R (S n ) \ S n = ∅.
Proof. Assume, to the contrary, that R (S n ) \ S n = ∅. By definition, R (S n ) ⊇ S n , therefore R (S n ) = S n . Iteratively applying R to both sides, we get in the limitR (S n ) = S n . But then, by Lemma 3 (iv) and (vi), we getR which contradicts the lemma's assumption thatR (S 0 ) \ S n = ∅.
Lemma 7. For any n ≥ 1, ifR (S 0 ) \ S n = ∅, then the following holds with probability at least 1 − δ: S n+Nn S n .
Proof. By Lemma 6, we get that, R (S n ) \ S n = ∅, Equivalently, by definition, for all i ∈ I g ∃a ∈ R (S n ) \ S n , ∃z i ∈ S n : g i (z i ) − − L z i − a ≥ 0.
The previous results is enough to show that we eventually explore the full safe set by repeatedly applying Lemma 7: Lemma 10. Let n * be the smallest integer, such that n * ≥ |R 0 (S 0 )|T n * . Then, there exists n 0 ≤ n * , such that S n0+Tn 0 = S n0 .
Proof. Assume, to the contrary, that for any n ≤ n * , S n S n+Tn . (By Lemma 3 (iv), we know that S n ⊆ S n+Tn .) Since N n is increasing in n, we have S 0 S n0 ⊆ S T n * S T n * +T T n * ⊆ S 2T n * · · · , which implies that, for any 0 ≤ k ≤ |R 0 (S 0 )|, it holds that |S kT n * | > k. In particular, for k * := |R 0 (S 0 )|, we get |S k * T | > |R 0 (S 0 )| which contradicts S k * T ⊆R 0 (S 0 ) by Lemma 9.
Proof. This is a direct consequence of combining Lemma 10 and Corollary 2.
Since we showed that we completely explore the safe set and that we remain safe throughout the exploration procedure, we are ready to state the main results: Lemma 11. If h is L-Lipschitz continuous, then, for any n ≥ 0, the following holds with probability at least 1 − δ for all i ∈ I g : ∀a ∈ S n , g i (a) ≥ 0.
Proof. We will prove this by induction. For the base case n = 0, by definition, for any a ∈ S 0 and i ∈ I g , g i (a) ≥ 0.
For the induction step, assume that for some n ≥ 1, for any a ∈ S n−1 and for all i ∈ I g , g i (a) ≥ 0. Then, for any a ∈ S n , by definition, for all i ∈ I g , ∃z i ∈ S n−1 , by L-Lipschitz-continuity Theorem 1. Assume that h(a, i) has bounded norm in an RKHS and that the measurement noise is σ-sub-Gaussian. Also, assume that S 0 = ∅ and g i (a) ≥ 0 for all a ∈ S 0 and i ∈ I g . Choose β n as in Lemma 1, defineâ n as in (17), and let n * ( , δ) be the smallest positive integer satisfying n * β n * γ |I|n * ≥ C 1 (|R 0 (S 0 )| + 1) where C 1 = 8/ log(1 + σ −2 ). For any > 0 and δ ∈ (0, 1), when running Algorithm 1 the following inequalities jointly hold with probability at least 1 − δ: 1. Safety: ∀n ≥ 1, ∀i ∈ I g : g i (a n ) ≥ 0 2. Optimality: ∀n ≥ n * , f (â n ) ≥ f * − Proof. The first part of the theorem is a direct consequence of Lemma 11. The second part follows from combining Corollary 3 and Corollary 4.