Abstract
Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation while neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost; on the other hand, the quantile regression leads to a more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.
1 Introduction
In traditional reinforcement learning (RL) problems (Sutton & Barto, 2018), agents can explore environments to learn optimal policies without safety concerns. However, unsafe interactions with environments are unacceptable in many safety-critical problems; for instance, an autonomous robot should never break equipment or harm humans. Even though RL agents can be trained in simulators, there are many real-world problems without simulators of sufficient fidelity. Constructing safe RL algorithms for dangerous environments is challenging because of the trial-and-error nature of RL (Pecka & Svoboda, 2014). In general, safety is still an open problem that hinders the wider application of RL (García & Fernández, 2015).
Ray et al. (2019) propose to make constrained optimization the main formalism of safe RL, where the reward function and cost function (related to safety) are distinct. This framework tries to mitigate the problem of designing a single reward function that needs to carefully select a trade-off between safety and performance, which is problematic in most instances (Roy et al., 2021). Then, the objective is changed to select a policy with maximum expected return among the policies that have a bounded expected cost-return. The issue with this approach is that the cost of individual episodes might exceed the given bound with a high probability; in particular, we might observe some episodes with arbitrarily high costs. For safety-critical problems, it can be hazardous to use the expected cost-return as a safety evaluation. Instead, better alternatives for safety-constrained RL are algorithms that compute policies based on varying risk requirements, specialized to risk-neutral or risk-averse behavior (Duan et al., 2020; Ma et al., 2020).
We propose a new criterion for risk-averse constrained RL, where we focus on the upper tail of the cost distribution, represented by the conditional Value-at-Risk (CVaR; Rockafellar & Uryasev, 2000). With the new formalism, we design the Worst-Case Soft Actor Critic (WCSAC) algorithm that uses a separate safety critic to estimate the distribution of accumulated cost to achieve risk control. In this way, policies can be optimized given different levels of CVaR, which determine the degree of risk aversion from a safety perspective. In addition, we use a Lagrangian formulation to constrain the risk during training.
We focus on off-policy algorithms since they require fewer samples to optimize a policy by using experiences from past policies, which can reduce the probability of performing unsafe interactions with the environment (Achiam et al., 2017; Schulman et al., 2015, 2017). Soft Actor Critic (SAC; Haarnoja et al., 2018a, b) is an off-policy method built on the actor-critic framework, which encourages agents to explore by including a policy’s entropy as a part of the reward. SAC-Lagrangian (Ha et al., 2020) combines SAC with Lagrangian methods to address safety-constrained RL with local constraints, i.e., constraints are set for each time step instead of each episode. The SAC-Lagrangian can be easily generalized to constrain the expected cost-return, but it is not apt to handle the risk-averse setting.
To find a risk-averse policy, we investigate two safety critics that estimate the full cost-return distribution, which we can use to infer the CVaR cost-return as mentioned before. Namely, we investigate a Gaussian approximation and quantile regression (Dabney et al., 2018a). On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost; on the other hand, the quantile regression leads to a more conservative behavior. Figure 1 shows a comparison of the three safety critic architectures. With further developments in distributional RL, WCSAC can also be improved by directly deploying new distribution-approximation techniques.
Experimental analysis shows that by setting the level of risk control, our two WCSAC algorithms can both attain stronger adaptability (compared to expectation-based baselines) when facing RL problems with higher safety requirements. The two WCSAC algorithms can achieve safe behavior in environments with a Gaussian cost-return. Compared to the Gaussian approximation, the WCSAC with quantile regression has better performance in environments with a non-Gaussian cost-return distribution, and shows better risk control in complex safety-constrained environments.
The main contributions of this article can be summarized as follows:

1.
We propose a new criterion for risk-averse constrained RL to achieve risk control in safety-critical problems.

2.
We design an off-policy algorithm for risk-averse constrained RL, namely the WCSAC.

3.
We show the versatility of WCSAC by using two methods to approximate the cost-return distribution.

4.
We investigate the safety of these algorithms in environments with Gaussian and non-Gaussian cost-return distributions.
The above items 1 and 2 were partially covered in our previous work (Yang et al., 2021). We also improved the exposition of the core ideas of the paper. Taken as a whole, this paper demonstrates that the WCSAC algorithm is able to learn safe policies in a range of environments. In doing so, it highlights the value of using riskaverse metrics in RL algorithms.
2 Background
In this section, we formulate constrained RL problems, present algorithms used to solve them, and investigate how to estimate the distribution of long-term rewards/costs.
2.1 Constrained Markov decision processes
We formulate the safe RL problem as a Constrained Markov Decision Process (CMDP; Altman, 1999, Borkar, 2005), defined by a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {P}, r, c, d, T, {\iota })\), where \(\mathcal {S}\) is the state space and \(\mathcal {A}\) is the action space. In constrained RL, an agent interacts with a CMDP without knowledge about the transition, reward, and cost functions (\(\mathcal {P}: \mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow [0,1], r : \mathcal {S} \times \mathcal {A} \rightarrow [r_{min}, r_{max}], \text { and } c : \mathcal {S} \times \mathcal {A} \rightarrow [c_{min}, c_{max}]\)). Each episode begins in a random state \(s_0\sim \iota : \mathcal {S} \rightarrow [0,1]\). At each time step t of an episode, the agent observes the current state \(s_t \in \mathcal {S}\), and takes an action \(a_t \in \mathcal {A}\). Then, it observes a reward \(r(s_t, a_t)\), a cost \(c(s_t, a_t)\), and the next state \(s_{t+1} \sim \mathcal {P}(\cdot \mid s_t,a_t)\). This process is repeated until some terminal condition is met, such as reaching the time horizon T. The behavior of the agent is defined by a policy \(\pi : \mathcal {S} \times \mathcal {A} \rightarrow [0, 1]\). This way, a policy \(\pi\) induces a distribution over full trajectories \(\mathcal {T}_{\pi } = (s_0,a_0,s_1,\cdots )\) where \(s_0 \sim \iota\), \(a_t \sim \pi (\cdot \mid s_t)\), and \(s_{t + 1} \sim \mathcal {P}(\cdot \mid s_t,a_t)\).
In a CMDP there are two random variables of interest, the return \(Z_\pi ^r = \sum ^{T}_{t=0} r(s_t,a_t)\) and the cost-return \(Z_\pi ^c = \sum ^{T}_{t=0} c(s_t,a_t)\), which are, respectively, the sum of rewards and the sum of costs obtained in a trajectory following a fixed policy \(\pi\).
Definition 1
(Safety based on Expected Value) A policy \(\pi\) is safe if its expected cost-return remains below a safety threshold d:
Over the episodes, the agent must learn a safe policy \(\pi\) that maximizes the expected return for each episode:
For a complex and long-horizon problem (\(T \gg 1\)), it is common to introduce a discount factor \(\gamma \in (0,1)\) to make the problem tractable, since it allows the agent to compute a single stationary value function, instead of indexing it by the time step. Henceforth, we consider the discounted return and discounted cost-return, the accumulated discounted rewards and costs, respectively, from (s, a), as
We will refer to the cost-return \(Z_\pi ^c(s,a)\) as C whenever \(\pi\), s and a are clear from the context. So, we have \(Q_{\pi }^r(s,a) = \mathbb {E} [Z_{\pi }^{r}(s,a)]\), and \(Q_{\pi }^c(s,a) = \mathbb {E} [Z_{\pi }^{c}(s,a)] = \mathbb {E} [C]\).
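As a concrete illustration, the discounted (cost-)return of a single sampled trajectory can be accumulated in a few lines of Python; this is a minimal sketch and the function name is ours:

```python
def discounted_return(signal, gamma):
    """Accumulate a per-step signal (rewards or costs) into the
    discounted return: sum over t of gamma^t * signal_t."""
    return sum(gamma ** t * x for t, x in enumerate(signal))
```

The same function serves for both the return (pass the rewards) and the cost-return (pass the costs).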
2.2 SAC-Lagrangian
When the agent knows nothing about the environment, the safety constraint cannot be strictly fulfilled during exploration. During the early steps of learning, we still hope to encourage exploration to learn more about the environment. But the policy’s entropy must be carefully balanced with the safety constraint, and the policy must be allowed to converge to a relatively deterministic policy, which reduces risks in terms of (safety-related) cost. SAC-based methods with entropy constraints and adaptive entropy weights are candidates to meet these conditions. In this section, we describe the SAC-Lagrangian (SAC-Lag; Ha et al., 2020), a method designed for maximum entropy RL with safety constraints.
Notice that we have a global constraint on the cost-return (over trajectories) and a local constraint on the policy entropy (for each time step).
In general, SAC-Lag is a SAC-based method that has two critics, where we use the reward critic to estimate the expected return (possibly with entropy) to promote reward during learning, while the safety critic estimates the cost-return to encourage safety. In SAC-Lag, the constrained optimization problems are solved by Lagrangian methods (Bertsekas, 1982). To manage a trade-off between exploration, reward, and safety, adaptive entropy and safety weights (Lagrange multipliers) \(\beta\) and \(\omega\) are introduced to the constrained optimization (1):
where \(f(\pi ) = \mathop {\mathbb {E}}_{s_0 \sim \iota , a_0 \sim \pi (\cdot \mid s_0)} \left[ Z_{\pi }^r(s_0, a_0)\right]\), \(e(\pi ) = \mathop {\mathbb {E}}_{(s_t,a_t) \sim \mathcal {T}_{\pi }} \left[ \log (\pi (a_t \mid s_t)) \right] + h\), and \(g(\pi ) = \mathop {\mathbb {E}}_{s_0 \sim \iota , a_0 \sim \pi (\cdot \mid s_0)}\left[ Z_{\pi }^c(s_0, a_0)\right]  \overline{d}\). h is the minimum entropy, and \(\overline{d}\) is the discounted approximation of d, see Appendix A for details. The above max-min optimization problem is solved by gradient ascent on \(\pi\), and descent on \(\beta\) and \(\omega\).
Ha et al. (2020) developed SAC-Lag for local constraints, which means that the safety cost is constrained at each time step. However, it can be easily generalized to constrain the expected cost-return.^{Footnote 1} In this paper, we use J to denote loss functions, and \(\theta\) to denote neural network parameters. Similar to the formulation used by Haarnoja et al. (2018b), we can get the actor loss:
where the entropy weight \(\beta\) (Lagrange multiplier) manages the stochasticity of the policy \(\pi\) and also determines the relative importance of the entropy term compared to rewards and costs. \(\mathcal {D}\) is the replay buffer and \(\theta _\pi\) indicates the parameters of the policy \(\pi\). Finally, let \(\theta _\omega\) and \(\theta _\beta\) be the parameters learned for the safety and exploration weight such that \(\omega = \mathrm{softplus}(\theta _\omega )\) and \(\beta = \mathrm{softplus}(\theta _\beta )\), where
We can learn \(\omega\) and \(\beta\) by minimizing the loss functions:
So the corresponding weight will be adjusted if the constraints are violated, that is, if we estimate that the current policy is unsafe or if it does not have enough entropy.
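The adjustment described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: it assumes the multiplier loss has the form \(J(\theta ) = -\mathrm{softplus}(\theta ) \cdot (\text{constraint violation})\), so that gradient descent on \(\theta\) raises the weight exactly when the constraint is violated; the function names are ours:

```python
import math

def softplus(x):
    """softplus(x) = log(1 + exp(x)); keeps the weight positive."""
    return math.log1p(math.exp(x))

def weight_step(theta, violation, lr=0.1):
    """One gradient-descent step on J(theta) = -softplus(theta) * violation.
    Since d softplus(theta)/d theta = sigmoid(theta), the gradient is
    dJ/dtheta = -sigmoid(theta) * violation."""
    grad = -violation / (1.0 + math.exp(-theta))
    return theta - lr * grad
```

With `violation > 0` (constraint violated) the weight grows; with `violation < 0` (constraint satisfied with slack) it shrinks, mirroring the behavior described in the text.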
2.3 Distributional RL based on quantile regression
So far, we considered only the expected value of the return and the cost-return. In this section we describe how we can estimate the full distribution of these random variables. Later, we will discuss how to use the tails of the cost-return to compute safer policies.
Distributional RL provides a means to estimate the return distribution instead of only modeling expected values (Bellemare et al., 2017; Dabney et al., 2018a, b; Yang et al., 2019). So it is natural to apply distributional RL in risk-averse domains. Even in traditional RL problems, distributional RL algorithms show better sample efficiency and ultimate performance compared to the standard expectation-based approach, but the state-of-the-art techniques have not been applied to safety-constrained RL with separate reward and safety signals.
Quantile regression, one of the main techniques in distributional RL, is widely used to estimate the return distribution, which has been combined with DQN (Mnih et al., 2015) to generate distributional variants such as QR-DQN (Dabney et al., 2018b), IQN (Dabney et al., 2018a), and FQF (Yang et al., 2019). In these methods, the difference between the distributions is measured by the 1-Wasserstein distance:
where u and v are random variables (e.g., the return or cost-return), and F is the cumulative distribution function (CDF). In these methods, we learn the inverse CDF of the return distribution, i.e., mapping quantile fraction \(\tau \in [0,1]\) to the corresponding quantile function value \(Z^{\tau }\),^{Footnote 2} which can be expressed as \(Z^{\tau }=F^{-1}_Z(\tau )\). QR-DQN, IQN, and FQF differ in how to generate the quantile fractions during training. Compared to fixing the quantile fractions (QR-DQN) and random sampling (IQN), we can theoretically better approximate the real distribution by using a proposal network (FQF) that generates appropriate quantile fractions for each state-action pair. However, IQN has been found to perform better in experiments and has fewer parameters to tune in complex environments (Ma et al., 2020). The quantile values of IQN are learned based on the Huber quantile regression loss (Huber, 1964):
where \(\kappa\) is the threshold to make the loss within an interval \([\kappa , \kappa ]\) quadratic but a regular quantile loss if outside the interval. Based on the distributional Bellman operator (Morimura et al., 2010; Sobel, 1982; Tamar et al., 2016)
we can get the TD error \(\delta _{ij}\) between the quantile values at quantile fractions \(\tau _i\) and \(\tau '_j\), i.e.,
where \((s,a,r,s')\) is sampled from the replay buffer \(\mathcal {D}\), and \(\pi (s) = \mathop {\arg }\max _{a \in \mathcal {A}} Q^r(s,a)\). We can approximate \(Q^r(s,a)\) using K i.i.d. samples of \(\widetilde{\tau } \sim U([0,1])\):
It is important to note that \(\tau\), \(\tau '\), and \(\widetilde{\tau }\) are sampled from continuous and independent distributions in IQN. \(\tau '\) is for the TD target (average quantile values at several \(\tau '\)), and \(\tau\) is the given quantile we aim to estimate.
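The Huber quantile regression loss used above can be sketched in plain Python (scalar version; the function names are ours, and a practical implementation would vectorize this over batches of quantile pairs):

```python
def huber(delta, kappa):
    """Huber loss: quadratic within [-kappa, kappa], linear (slope kappa) outside."""
    a = abs(delta)
    return 0.5 * delta ** 2 if a <= kappa else kappa * (a - 0.5 * kappa)

def quantile_huber_loss(delta, tau, kappa=1.0):
    """Huber quantile regression loss: the Huber loss weighted
    asymmetrically by |tau - 1{delta < 0}| and scaled by 1/kappa."""
    weight = abs(tau - (1.0 if delta < 0.0 else 0.0))
    return weight * huber(delta, kappa) / kappa
```

The asymmetric weight is what drives the estimate toward the tau-quantile: for tau close to 1, underestimation errors (positive delta) are penalized far more heavily than overestimation errors.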
3 Risk-averse constrained RL
Traditional expectation-based safe RL methods maximize the return under the premise that the expected cost-return remains below the safety threshold d. In this way, RL agents are not aware of the potential risks because of the randomness in the cost-return, which is generated by the stochastic policy and the dynamics of the environment. In expectation-based cases, if a safe policy has higher returns and higher variance in safety costs, it will be preferred over another safe policy with lower returns and lower variance in safety costs. In safety-critical domains, the optimal policies are expected to be more robust, i.e., to have a lower risk of hazardous events even for stochastic or heavy-tailed cost-returns.
In Fig. 2, the x-axis depicts the cost-return C (Eq. 2). The y-axis depicts the density of its probability distribution. The expectation-based algorithm focuses on the average performance in safety when optimizing policies. Thus, \(\pi\), \(Q_{\pi }^c\), and the shape of the cost-return distribution \(p^{\pi }(C \mid s,a)\) will be changed during the training process until \(Q_{\pi }^c\) (blue line) is shifted to the left side of the boundary (red line). After that, there is still a strong likelihood that the constraint value d is exceeded. For a policy \(\pi\), \(Q_{\pi }^c\) can only be used as the evaluation of average performance in safety; however, in safety-critical domains, the worst-case performance in safety is preferred over the average performance. Therefore, we replace the expected value with the Conditional Value-at-Risk (CVaR; Rockafellar & Uryasev, 2000), using the upper \(\alpha\) of the distribution to assess the safety of a policy. In the right panel of Fig. 2, we set the constraint on CVaR. Thus we optimize policies that will move the tail end of \(p^{\pi }(C \mid s,a)\) (blue line) to the left side of the boundary d (red line).
Definition 2
(Risk level) A positive scalar \(\alpha \in (0,1]\) is used to define the risk level in WCSAC. A WCSAC with smaller \(\alpha\) (\(\alpha \rightarrow 0\)) is expected to be more pessimistic and risk-averse. Conversely, a larger value of \(\alpha\) leads to a less risk-averse behavior, with \(\alpha =1\) corresponding to the risk-neutral case.
Considering the probability distribution of cost-returns \(p^{\pi }(C)\) induced by the aleatoric uncertainty of the environment and the policy \(\pi\), we model the safety-constrained RL problem in a more risk-averse way than the traditional formulation (1). We focus on the \(\alpha\)-percentile \(F_C^{-1}(1-\alpha )\), where \(F_C\) is the CDF of \(p^{\pi }(C\mid s,a)\), so we can get the CVaR:
The following definition gives us a new constraint to learn risk-averse policies, which differs from the traditional constraint (1).
Definition 3
(Safety based on CVaR) Given the risk level \(\alpha\), a policy \(\pi\) is safe if it satisfies \(\Gamma _{\pi }(s_t, a_t, \alpha ) \le \overline{d} \quad \forall t\), where \((s_t,a_t)\sim \mathcal {T}_{\pi }\) and \(s_0 \sim \iota\).
Now we can generalize the framework from Sect. 2.2, using maximum entropy RL with the above risk-sensitive safety constraints. That is, the optimal policy in a constrained RL problem might be stochastic; therefore, it is reasonable to seek a policy with some entropy (Eq. 3). So, the policy is optimized to satisfy
With (11) it is possible to solve safe RL problems using the Soft Actor Critic (SAC; Haarnoja et al., 2018a) framework, maintaining a minimum expected entropy (Haarnoja et al., 2018b).
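To build intuition for the quantity being constrained, the upper-tail CVaR of a finite sample of cost-returns can be computed directly; this is a minimal sketch (the function name is ours) of the definition above, not part of the learning algorithm itself:

```python
import math

def empirical_cvar(costs, alpha):
    """Upper-tail CVaR: the mean of the worst (largest) alpha-fraction
    of sampled cost-returns; alpha = 1 recovers the plain mean."""
    ordered = sorted(costs)
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)
```

For instance, with cost-return samples [1, 2, 3, 4] and alpha = 0.5, the CVaR is the mean of the two worst episodes, 3.5, whereas the expected cost-return is only 2.5; constraining the former is strictly more conservative.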
4 Worst-case soft actor critic
To solve the risk-averse constrained RL problem (11), we design the Worst-Case Soft Actor Critic (WCSAC) algorithm. WCSAC generalizes SAC-Lag (Sect. 2.2), because SAC-Lag can be regarded as WCSAC with \(\alpha = 1\), such that \(\Gamma _{\pi }(s,a,1)=Q_{\pi }^c(s,a)\) (10). In this section, we start by describing a safety critic that assumes the cost-return distribution is Gaussian, then we show how to handle the case where the cost-return distribution is not Gaussian using a quantile regression approach. Finally, we show how to optimize the actor with the new safety critics and present an overview of the full algorithm.
4.1 Gaussian safety critic
In this section, we present how to obtain a Gaussian approximation of the safety critic. We will refer to the WCSAC with a Gaussian safety critic as WCSAC-GS in the following parts of the paper.
4.1.1 Gaussian approximation
WCSAC-GS uses a separate Gaussian safety critic (parallel to the reward critic for the return) to estimate the distribution of C instead of computing a point estimate of the expected cost-return, as in the SAC-Lag algorithm. To obtain the cost-return distribution, \(p^\pi (C\mid s,a)\) is approximated with a Gaussian, i.e.,
where \(V^c_{\pi }(s,a) = \mathbb {E}_{p^{\pi }}[C^2 \mid s,a] - (Q_{\pi }^c(s, a))^2\) is the variance of the cost-return.
Given the Gaussian approximation, the CVaR measure is easily computed (Khokhlov, 2016; Tang et al., 2020). At each iteration, \(Q_{\pi }^c(s, a)\) and \(V^c_{\pi }(s,a)\) can be estimated. Thus, the new safety measure for risk level \(\alpha\) is computed by
where \(\phi (\cdot )\) and \(\Phi (\cdot )\) denote the probability density function (PDF) and the cumulative distribution function (CDF) of the standard normal distribution (Khokhlov, 2016).
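A minimal sketch of this closed form, using the standard-normal PDF and inverse CDF from Python's `statistics` module (the function name is ours; the upper-tail CVaR of a Gaussian at risk level alpha is mean + std * phi(Phi^{-1}(1 - alpha)) / alpha):

```python
from statistics import NormalDist

def gaussian_cvar(mean, std, alpha):
    """Closed-form upper-tail CVaR of a Gaussian cost-return.
    alpha = 1 recovers the risk-neutral expected cost."""
    if alpha >= 1.0:
        return mean
    z = NormalDist().inv_cdf(1.0 - alpha)       # the (1 - alpha)-quantile
    return mean + std * NormalDist().pdf(z) / alpha
```

Note how the safety measure grows as alpha shrinks: the more risk-averse the setting, the larger the penalty added on top of the expected cost.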
WCSAC-GS learns the mean and variance of \(p^\pi (C)\), as shown in Fig. 1. To estimate \(Q_{\pi }^c\), we can use the standard Bellman equation:
The projection equation for estimating \(V^c_{\pi }(s,a)\) is:
We refer the reader to Tang et al. (2020) for the proof of (15).
4.1.2 Gaussian safety critic learning
WCSAC-GS uses two neural networks parameterized by \(\theta _C^\mu\) and \(\theta _C^\sigma\), respectively, to estimate the safety critic, i.e.,
In order to learn the safety critic, the distance between value distributions is measured by the 2-Wasserstein distance (Bellemare et al., 2017; Olkin & Pukelsheim, 1982): \(W_2(u,v)\doteq \left( \int _0^1 \left| F^{-1}_u(\chi ) - F^{-1}_v(\chi )\right| ^2 d\chi \right) ^{1/2}\), where \(u\sim \mathcal {N}(Q_1, V_1)\), \(v\sim \mathcal {N}(Q_2, V_2)\). WCSAC-GS uses the simplified 2-Wasserstein distance (Tang et al., 2020) to estimate the safety critic loss:
The 2-Wasserstein distance can be computed as the Temporal Difference (TD) error based on the projection Eqs. (14) and (15) to update the safety critic, i.e., WCSAC-GS minimizes the following values:
where \(J_C^\mu (\theta _C^\mu )\) is the loss function of \(Q_{\theta _C^\mu }^c\), and \(J_C^\sigma (\theta _C^\sigma )\) is the loss function of \(V_{\theta _C^\sigma }^c\). So,
where \(\overline{Q}_{\theta _C^\mu }^c(s_t,a_t)\) is the TD target from (14), and
where \(\overline{V}_{\theta _C^\sigma }^c(s_t,a_t)\) is the TD target from (15).
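For two Gaussians, the 2-Wasserstein distance admits the simple closed form \(W_2^2 = (Q_1 - Q_2)^2 + (\sqrt{V_1} - \sqrt{V_2})^2\) (Olkin & Pukelsheim, 1982), which is what makes the Gaussian critic loss cheap to evaluate. A minimal sketch (the function name is ours):

```python
import math

def w2_gaussian(q1, v1, q2, v2):
    """2-Wasserstein distance between N(q1, v1) and N(q2, v2):
    sqrt((q1 - q2)^2 + (sqrt(v1) - sqrt(v2))^2)."""
    return math.sqrt((q1 - q2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2)
```

Both terms are squared errors, which is why the loss decomposes into the separate mean and variance TD losses above.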
Unfortunately, this Gaussian approximation can be coarse in many domains. In the next section, we will investigate methods that provide precise estimates of the return distribution, which later we apply to estimate the costreturn distribution.
4.2 Safety critic with quantile regression
Although the Gaussian approximation leverages distributional information to attain more risk-averse policies, only an additional variance is estimated compared to regular constrained RL methods. This means the information in the collected experiences is only used to a limited extent. Thus, the Gaussian approximation does not possess the general advantages of distributional RL algorithms.
Besides, it is not always appropriate to approximate the cost-return by a Gaussian distribution, as shown in Fig. 3, since the contribution from the tail of the cost distribution might be underestimated. In this case, the agent might converge to an unsafe policy, according to (11). In this section, we present a distributional safety critic modeled by IQN, as illustrated in Fig. 1, which provides a more precise estimate of the upper tail of the distribution. Henceforth, we refer to WCSAC with a safety critic modeled by IQN as WCSAC-IQN.
4.2.1 Estimating safety critic with IQN
We propose the implicit quantile network to model the cost-return distribution (safety-IQN), regarded as the safety critic of the SAC method. Safety-IQN maps the samples from a base distribution (usually \(\tau \sim U([0,1])\)) to the corresponding quantile values of the cost-return distribution. In theory, by adjusting the capacity of the neural network, safety-IQN can fit the cost-return distribution with arbitrary precision, which is essential for safety-critical problems.
We denote \(F^{-1}_C(\tau )\) as the quantile function for the cost-return C and, for clarity of exposition, we define \(C^{\tau } = F^{-1}_C(\tau )\). We use \(\theta _C\) to parameterize the safety-IQN. The approximation is implemented as \(\hat{C}^{\tau }(s, a) \leftarrow f_{ IQN }(s,a,\tau \mid \theta _C),\) which also takes the quantile fraction \(\tau\) as the input of the model, so that it uses the neural network to fit the entire continuous distribution. When training \(f_{ IQN }\), two quantile fraction samples \(\tau , \tau ' \sim U([0,1])\) at time step t are used to get the sampled TD error:
Following (7), we can get the loss function for safety-IQN, i.e.,
where
In (22): (a) indicates that the total loss of all the target quantiles \(\tau _i, i=1,\cdots ,N\) is computed at once, and applies the distributional Bellman operator \(\mathcal {B}\) (Bellemare et al., 2017), (b) expands the Bellman operator, taking an action for the next state sampled from the current policy \(a_{t+1} \sim \pi (\cdot \mid s_{t+1})\), (c) introduces \(\tau _j\) to estimate the TD target, and (d) uses (20). We point out that for the estimation of quantiles, the quantile loss is replaced by the Huber loss to ease training, as in the regular IQN method (Dabney et al., 2018a). However, this may lead to a bias in the safety distribution (Rowland et al., 2019), especially for larger values of \(\kappa\). The imputation approach proposed in the work by Rowland et al. (2019) can be combined with the proposed method to reduce this bias. Investigation of the extent of the bias and the efficacy of the correction in risk-averse RL is a subject for future research.
4.2.2 CVaR safety measure based on sample mean
Since we base our estimate of the distribution of the cost-return on a quantile-parameterized approximation, we approximate the CVaR based on the expectation over the values of the quantile \(\tau\) as \(\Gamma _{\pi }(s,a,\alpha ) \doteq \mathop {{{{\mathbb {E}}}}}\limits _{\tau \sim U([1-\alpha ,1])} \left[ C^{\tau }_{\pi } (s,a) \right] .\) This allows us to estimate \(\Gamma _{\pi }(s,a,\alpha )\) at each update step using K i.i.d. samples of \(\widetilde{\tau } \sim U([1-\alpha ,1])\):
While the Gaussian approximation leverages a closedform approach to estimate the CVaR, which is inherently limited by the Gaussian assumption, our method efficiently estimates the CVaR using a sampling approach. This can attain higher accuracy due to the quantile regression framework. We also highlight that this method still estimates the full distribution, sampling \(\tau , \tau '\) from U([0, 1]) to compute the safety critic loss. We use (23) only when estimating the CVaR to compute the Lagrangian safety loss \(J_s\). In the next section, we describe how the actor uses the estimates of the CVaR described so far.
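The sample-mean CVaR estimate can be sketched as follows; `quantile_fn` is a hypothetical stand-in for the learned safety-IQN evaluated at a fixed (s, a), and the function names are ours:

```python
import random

def cvar_from_quantiles(quantile_fn, alpha, k=1000, seed=0):
    """Monte-Carlo CVaR estimate: average the quantile function at K
    fractions drawn uniformly from [1 - alpha, 1] (the upper tail)."""
    rng = random.Random(seed)
    taus = [rng.uniform(1.0 - alpha, 1.0) for _ in range(k)]
    return sum(quantile_fn(t) for t in taus) / k
```

As a sanity check, if the cost-return were uniform on [0, 1] (quantile function is the identity), the CVaR at alpha = 0.5 is the mean of U([0.5, 1]), i.e., 0.75.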
4.3 Worstcase actor
For a certain risk level \(\alpha\), we optimize the policy \(\pi\) until it satisfies the safety criterion \(\Gamma _{\pi }(s_t,a_t,\alpha )\le \overline{d} \quad \forall t\) according to Definition 3. In the policy improvement step, we update the policy towards the exponential of the new policy evaluation \(X_{\alpha ,\omega }^{\pi }(s,a) = Q_{\pi }^r(s,a) - \omega \Gamma _{\pi }(s,a,\alpha )\), based on the balance between safety and performance. This particular choice of update can be guaranteed to result in an improved policy in terms of this new evaluation (Haarnoja et al., 2018a). The role of safety changes over the training process. As the policy becomes safe, the influence of the safety term wanes, and the return optimization will play a greater role in our formulation.
Since in practice we prefer tractable policies, we will additionally restrict the policy to some set of policies \(\Pi\), which can correspond, for example, to a parameterized family of distributions such as Gaussians. To account for the constraint that \(\pi \in \Pi\), we project the improved policy into the desired set of policies. In principle, we could choose any projection, but it turns out to be convenient to use the information projection defined in terms of the KL divergence (Dabney et al., 2018b; Kullback & Leibler, 1951). Similar to the work by Haarnoja et al. (2018b), for each state \(s_t\), we try to minimize the following KL divergence to update the policy:
where \(\Xi ^{\pi }(s_t)\) is the partition function to normalize the distribution. \(\beta\) and \(\omega\) are the adaptive entropy and safety weights, respectively. A loss function can be constructed by averaging the KL divergence over all states in the sample buffer and approximating the KL divergence using a single sampled action, resulting in
\(\Xi ^{\pi }(s_t)\) has no influence on updating \(\theta\), thus it can be omitted. The resulting actor loss is
The main difference to Eq. 4 is that we replace the expected cost by the CVaR estimate.
We update the reward critic \(Q^r\) and entropy weight \(\beta\) in the same way as the SAC method. The reward critic (including a bonus for the policy entropy) is trained to minimize
where \(Q^r\) is parameterized by \(\theta _R\), and \(a_{t+1} \sim \pi (\cdot \mid s_{t+1})\). Based on the new safety measure, the safety weight \(\omega\) can be learned by minimizing the loss function:
so \(\omega\) will be decreased if \(\overline{d} \ge \Gamma _{\pi }(s,a,\alpha )\), otherwise \(\omega\) will be increased to emphasize safety more. The main difference to how SAC-Lag optimizes its safety weight is the use of the CVaR estimate, as opposed to the mean estimate (5). We note that in (28), we sample from the replay buffer \(\mathcal {D}\), whereas (2) suggests that the constraint applies to the initial state distribution. This replacement is certainly valid in the strongly discounted regime, or when episodes are very long. In this case, each visited state can be considered an initial state for the cost calculation. Although the replay buffer may initially be strongly off-policy, this deviation reduces over time. Moreover, this replacement also turns out to work well in practice when these conditions do not apply.
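The direction of the safety-weight update can be sketched as follows; this is an illustration only, and it assumes a plain clipped non-negative \(\omega\) with loss \(J(\omega ) = \omega (\overline{d} - \Gamma )\), rather than the softplus parameterization used in practice (the function name is ours):

```python
def safety_weight_step(omega, cvar_estimate, d_bar, lr=0.01):
    """Gradient-descent step on J(omega) = omega * (d_bar - cvar_estimate):
    omega grows when the CVaR constraint is violated (cvar > d_bar) and
    shrinks when the policy looks safe; clipped to stay non-negative."""
    grad = d_bar - cvar_estimate          # dJ/domega
    return max(0.0, omega - lr * grad)
```

The sign of the gradient is all that matters here: a violated constraint pushes the multiplier up, increasing the weight of the safety term in the actor loss.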
Figure 4 shows a general overview of the proposed algorithm, indicating the relations between safety, reward, and policy components. The arrows depict the relations between all terms in the method, i.e., the element at the beginning of an arrow influences the element at its end. We may notice that the safety and reward terms only influence each other through the policy.
4.4 Complete algorithm
The complete algorithm WCSAC is presented in Algorithm 1, where we list the input of the algorithm and all initialization objects in lines 0-2. Under a certain safety requirement \(\alpha\), we input \(\langle d,h \rangle\) for the constraints. With the WCSAC-IQN, we also need the hyperparameters \(\langle N,N' \rangle\) for updating the safety-IQN, and K for computing the new safety measure \(\Gamma _{\pi }\). For the environment steps (lines 4-7), we sample actions from the policy to attain experience for the replay buffer \(\mathcal {D}\), which allows us to get batches for updating all parameters at each gradient descent step (lines 8-21). After line 22, we list all the optimized parameters of the algorithm.
In standard maximum entropy RL, the entropy of the policy is expected to be as large as possible. However, relatively deterministic policies are preferred over stochastic policies in safe exploration, even though it is essential to encourage exploration during the early steps of learning. In SAC, the entropy of the policy is constrained to ensure that the final optimal policy is more robust (Haarnoja et al., 2018b). Therefore, for safetycritical domains, it is preferred to set a relatively low minimum requirement h for the entropy, or omit this constraint altogether.
With the Gaussian safety critic, we use two separate neural networks to estimate the mean function and the variance function, respectively. Each network can be smaller than a single network estimating mean and variance together, so this does not add more parameters to be trained. In addition, it makes it much easier to compare the distributional safety critic of WCSAC-GS to the regular safety critic of SAC-Lag, which can be seen as an ablation of WCSAC-GS. For the neural network structure of the safety-IQN, we use the same architecture as in IQN for the return (Dabney et al., 2018a), i.e., a DQN-like network with an additional embedding for the quantile fraction \(\tau\).
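Given the mean and standard deviation produced by the Gaussian safety critic, the CVaR-based safety measure has a closed form. The sketch below uses the standard Gaussian-CVaR formula; the function name and the scalar interface are our own simplification of the batched critic:

```python
from math import sqrt, pi, exp
from statistics import NormalDist

def gaussian_cvar(mean, std, alpha):
    """Upper-tail CVaR of a Gaussian cost distribution.

    Averages the worst alpha-fraction of outcomes; alpha=1 recovers
    the mean (risk-neutral), smaller alpha is more risk-averse.
    """
    if alpha >= 1.0:
        return mean
    z = NormalDist().inv_cdf(1.0 - alpha)        # VaR quantile
    pdf = exp(-z * z / 2.0) / sqrt(2.0 * pi)     # standard normal pdf at z
    return mean + std * pdf / alpha

# Risk-neutral gives the mean; alpha=0.1 adds roughly 1.75 std.
gaussian_cvar(10.0, 2.0, 1.0)   # -> 10.0
gaussian_cvar(10.0, 2.0, 0.1)   # -> ~13.51
```

This makes explicit why the Gaussian critic is cheap: the risk adjustment is just a deterministic function of the two network outputs.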
For the reward critic, to avoid overestimation and reduce positive bias during policy improvement, we learn two soft Q-functions independently, parameterized by \(\theta _{R1}\) and \(\theta _{R2}\); the minimum of the two is used in each gradient step. We leverage target networks (parameterized by \(\overline{\theta }_R\) and \(\overline{\theta }_C\)) to achieve stable updates, a common technique in DQN (Mnih et al., 2015) and DDPG (Lillicrap et al., 2015). Specifically, the parameters of the target networks (both safety critic and reward critic) are updated by moving averages (lines 19-20), where the hyperparameter \(\eta \in [0,1]\) is used to reduce fluctuations.
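The moving-average (Polyak) target update of lines 19-20 can be illustrated on plain scalar parameters (real implementations operate on network tensors; the function name is ours):

```python
def polyak_update(target_params, online_params, eta):
    """Moving-average update of target-network parameters.

    Each target parameter moves a fraction eta toward its online
    counterpart; a small eta keeps the targets slowly varying,
    which stabilizes the critics' bootstrap targets.
    """
    return [eta * p + (1.0 - eta) * tp
            for p, tp in zip(online_params, target_params)]

# With eta=0.005 the target barely moves per gradient step.
new_target = polyak_update([0.0], [1.0], eta=0.005)  # -> [0.005]
```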
When selecting the learning rates for the neural networks (\(\lambda _R\), \(\lambda _C\), \(\lambda _{\pi }\), \(\lambda _{\beta }\), and \(\lambda _{\omega }\)), which are used to minimize the corresponding loss functions (\(J_R\), \(J_C\), \(J_{\pi }\), \(J_{\beta }\), and \(J_{\omega }\)), we usually make \(\lambda _{\omega }\) larger than the others to enforce the safety constraint: with a relatively low learning rate, the safety weight does not converge fast enough to improve the safety of the actor's policy. The practical learning rate should, however, be set according to the environment. Typically, the disparity between \(\lambda _{\omega }\) and the remaining learning rates (\(\lambda _R\), \(\lambda _C\), \(\lambda _{\pi }\), and \(\lambda _{\beta }\)) will be more pronounced in more complex and safety-critical environments.
5 Empirical analysis
In this section, we evaluate our methods WCSAC-GS and WCSAC-IQN on tasks of varying difficulty, namely two Spy Game environments and the Safety Gym benchmark (Ray et al., 2019). This section has three goals: (i) test the hypothesis that WCSAC can achieve good risk control in an environment with a Gaussian cost-return distribution; (ii) test the hypothesis that WCSAC can find a safe policy in environments with a non-Gaussian cost-return distribution when equipped with an appropriate estimate of the distribution; and (iii) evaluate the performance of the proposed method in highly complex environments.
5.1 Spy Game environments
To test whether the two WCSAC algorithms can achieve safe behavior in environments with a Gaussian cost-return, and whether WCSAC-IQN indeed performs better in environments with a non-Gaussian cost-return distribution, we designed two Spy Game environments: SpyUnimodal and SpyBimodal. For any policy, SpyUnimodal leads to a unimodal cost-return distribution (approximately Gaussian), while SpyBimodal has a bimodal cost-return distribution (non-Gaussian). The Spy Game is a toy model, meant to give rise to unimodal (approximately Gaussian) and bimodal cost distributions. In this model, we consider an agency that trains spies to go on covert missions. On each mission, the spy gets a random amount of useful information (the reward) and leaves some traces (the cost). If too many clues to the spy's identity are left across missions, the spy is likely to get discovered. To control the risk of discovery, a safety constraint is imposed on the cumulative cost. For each mission, the spy can choose between low-risk, low-reward and high-risk, high-reward approaches, parameterized by the action \(a \in [0,1]\). For a choice of a, random rewards and costs are drawn from uniform distributions as follows (Fig. 5):
Two variants of the game are implemented in the SpyEnv environment, which are named SpyUnimodal and SpyBimodal according to the shape of their cost distributions.^{Footnote 3}
SpyUnimodal: Each spy executes 100 missions until retirement. The aim is to maximise expected reward subject to a cost constraint (in expectation or CVaR). The cumulative cost is a sum of a large number of independent random variables, so it is approximately normally distributed.
SpyBimodal: In this variant, spies face early retirement if they do not gain sufficiently useful information. After 5 missions, a stopping criterion is evaluated that terminates the game unless the average reward per mission exceeds 0.15. This results in a significant fraction of spies retiring early, which is reflected in a bimodal cost distribution.
We set the safety thresholds to \(d=25\) for SpyUnimodal and \(d=15\) for SpyBimodal. We use WCSAC-GS and WCSAC-IQN with risk-neutral and risk-averse constraints (cost-CVaR\(\alpha\)) to solve both variants of the Spy Game. Each algorithm uses small neural networks (2 layers with 16 units) and trains for 30,000 steps. After training, we run each of the final policies for 10,000 episodes to evaluate the cost-returns of our algorithms.
5.1.1 Cost-return distribution evaluation
In Fig. 6, we compare the two algorithms with risk-neutral (\(\alpha =1\)) and risk-averse (\(\alpha =0.1\)) constraints on both versions of the Spy Game, reporting the distribution of cost-returns. This gives a clear overview of the full cost-return distribution, allowing us to evaluate how frequently the safety constraint is violated. We also report the metric used as the safety constraint to verify whether each agent reaches the designated safety requirements.
At the top of Fig. 6 (risk-neutral case), we can see that the two WCSAC algorithms approximately attain a constraint-satisfying expected cost-return in both environments, and the realised values are very close. So, in the average case, WCSAC-GS and WCSAC-IQN have similar performance, independently of the underlying distribution.
At the bottom of Fig. 6 (risk-averse case), we first notice that in the SpyUnimodal environment (Gaussian) both WCSAC algorithms attain a cost-CVaR0.1 below the threshold. We can also see that WCSAC-IQN is closer to the bound, showing slightly better control over the cost-CVaR0.1. On SpyBimodal (non-Gaussian), WCSAC-GS is unable to satisfy the safety constraint, attaining a cost-CVaR0.1 larger than the bound. This indicates that the Gaussian approximation cannot control the risk level in this domain.
Overall, comparing the top and bottom plots in Fig. 6, we can see that both WCSAC algorithms attain more risk-averse behavior when the risk level \(\alpha\) is set to a small value, significantly reducing the probability that a trajectory violates the safety constraints.
5.1.2 Varying level of safety constraint
To get a better overview of when the safety constraints are violated, we consider the same environments under different risk-level constraints. In Fig. 7, the x-axis depicts the risk level \(\alpha\) under which the agents are trained. The y-axis depicts the corresponding cost-CVaR\(\alpha\) (Fig. 7 top) and expected return (Fig. 7 bottom) generated by the final policies, with the standard deviation over 5 repetitions.
At the bottom of Fig. 7, we can see that, in the more risk-averse settings (lower values of \(\alpha\)), WCSAC-GS and WCSAC-IQN both have lower expected returns. In general, the changes in the cost-CVaR\(\alpha\) and expected return under different risk levels show the same trend, i.e., a larger cost-CVaR\(\alpha\) corresponds to a larger expected return at the risk level \(\alpha\).
When the cost-return distribution is unimodal (left panel of Fig. 7), we can see that the WCSAC algorithms attain safe performance at all risk levels \(\alpha\). But under stricter safety requirements, both WCSAC algorithms exhibit greater variance, and the distance between the corresponding cost-CVaR\(\alpha\) and the safety bound d becomes larger. Compared to WCSAC-IQN, WCSAC-GS is more over-conservative at lower \(\alpha\).
In the right panel of Fig. 7, we show the results with a bimodal cost-return distribution, where the Gaussian approximation can underestimate the CVaR, as we saw in the previous section. In this case, WCSAC-IQN is safe for all values of \(\alpha\), and WCSAC-GS can also obtain safe performance for values close to the risk-neutral constraint (\(\alpha \in [0.7,1.0]\)). However, WCSAC-GS becomes increasingly unsafe for lower values of \(\alpha\) (more risk-averse constraints).
In both the unimodal and bimodal cases, the WCSAC algorithms approach the safety boundary more closely with higher \(\alpha\) (more risk-neutral), but deviate more with lower \(\alpha\), especially in the bimodal case. Even with WCSAC-IQN, the safety performance becomes more pessimistic as we decrease \(\alpha\). Based on the experimental results of Théate et al. (2021), it appears that quantile regression methods may incur larger approximation errors for higher-order moments than for the first-order moment.
5.2 Safety Gym environments
Next, we evaluate our method in three domains of different complexity levels from the Safety Gym benchmark suite (Ray et al., 2019), where a robot navigates a 2D map to reach target positions while trying to avoid dangerous areas (Fig. 8). The first one is StaticEnv with one fixed hazard and one fixed goal, but the initial position of the Point agent is randomly generated at the beginning of each episode. The second is PointGoal (Safexp-PointGoal1-v0 in Safety Gym) with one Point agent, several hazards, and one vase. The third and most complex environment is CarButton (Safexp-CarButton1-v0 in Safety Gym), where a Car robot (with a higher-dimensional action space than Point) navigates to press a goal button while trying to avoid hazards and moving gremlins, and not pressing any of the wrong buttons. These tasks are particularly complex due to the observation space: instead of observing its location, the agent has a lidar that indicates the distance to other objects. All experiments are performed with 10 random seeds. In all environments, \(c=1\) if an unsafe interaction happens, otherwise \(c=0\). We use the original reward signal in Safety Gym, i.e., the absolute distance towards the goal plus a constant for finishing the task, e.g., pressing the goal button.
We evaluate four versions of WCSAC: GS1.0 (WCSAC-GS with \(\alpha = 1.0\)), GS0.5 (WCSAC-GS with \(\alpha = 0.5\)), IQN1.0 (WCSAC-IQN with \(\alpha = 1.0\)), and IQN0.5 (WCSAC-IQN with \(\alpha = 0.5\)). For comparison, we use SAC (Haarnoja et al., 2018b), CPO (Achiam et al., 2017), and PPO-Lagrangian (PPO-Lag; Ray et al., 2019; Schulman et al., 2017) as baselines. In this experiment, we use the discount factor \(\gamma = 0.99\) and \(\kappa =1\) for the Huber loss in WCSAC-IQN. The safety thresholds are set to \(d=8\) for StaticEnv, and \(d=25\) for PointGoal and CarButton. We train each agent for 50 epochs in StaticEnv, and for 150 epochs in PointGoal and CarButton. The epoch length is 30,000 environment steps, and the maximal episode length is 1000 environment steps.^{Footnote 4}
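For reference, the quantile Huber loss with \(\kappa = 1\) used to train IQN-style critics takes the standard form from Dabney et al. (2018a); the scalar function below is our illustrative simplification of the batched version:

```python
def quantile_huber_loss(u, tau, kappa=1.0):
    """Quantile Huber loss for a single TD error.

    u is the TD error (target quantile minus predicted quantile) and
    tau is the quantile fraction in (0, 1); kappa=1 matches the
    setting used in the experiments.
    """
    abs_u = abs(u)
    if abs_u <= kappa:                     # quadratic zone
        huber = 0.5 * u * u
    else:                                  # linear zone
        huber = kappa * (abs_u - 0.5 * kappa)
    weight = abs(tau - (1.0 if u < 0.0 else 0.0))
    return weight * huber / kappa

# The loss is asymmetric: for tau=0.9, errors where the prediction
# already exceeds the target (u < 0) are penalized much less.
quantile_huber_loss(0.5, tau=0.9)   # -> 0.1125
quantile_huber_loss(-0.5, tau=0.9)  # -> 0.0125
```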
To evaluate the performance of the algorithms, we use the following metrics: CVaR0.5 of the cost-return (cost-CVaR0.5), expected cost (AverageEpCost), and expected reward (AverageEpRet). Table 1 shows the performance of the policies returned by the algorithms after training. We use 1000 episodes (100 runs for each random seed) to evaluate the final policy of each method; the expected cost and expected return are estimated by the average over all runs, while the cost-CVaR0.5 is estimated by the average over the worst 500 runs. In Fig. 9, we visualize the distribution by plotting PDF and CDF histograms of sampled episodic costs in PointGoal and CarButton. Finally, Fig. 10 shows the behavior of the algorithms during training. We provide a collection of videos of the execution of the final policies on the following webpage: https://sites.google.com/view/wcsac.
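The cost-CVaR0.5 metric above is the empirical CVaR of the sampled episodic costs, i.e., the mean of the worst half of the episodes. A minimal helper (illustrative only, not part of the released code) could look like:

```python
def empirical_cvar(costs, alpha):
    """Empirical CVaR: the mean of the worst alpha-fraction of episodes.

    With 1000 sampled episodic costs and alpha=0.5, this averages the
    500 highest costs, matching the evaluation protocol above.
    """
    k = max(1, int(round(alpha * len(costs))))
    worst = sorted(costs, reverse=True)[:k]  # highest costs first
    return sum(worst) / k

# Worst half of [1, 2, 3, 4] is [4, 3], whose mean is 3.5.
empirical_cvar([1.0, 2.0, 3.0, 4.0], alpha=0.5)  # -> 3.5
```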
5.2.1 Final behavior
We start our analysis by considering the behavior of the final policy. In Table 1, we can see that only IQN1.0 and IQN0.5 can be considered safe, because they satisfy the cost constraint with which they trained in all environments. In particular, only IQN0.5 satisfies the risk-averse threshold on cost-CVaR0.5, demonstrating its suitability for risk-averse agents. PPO-Lag has competitive safety performance in all environments, but fails to achieve a high return in StaticEnv and CarButton. Compared to the safe RL methods (CPO, PPO-Lag, and WCSAC), SAC shows excellent performance in expected return but, naturally, does not satisfy the safety constraint; this shows that a safe agent must find a trade-off between safety and performance. Although the final policies of the remaining algorithms, CPO, GS1.0 (SAC-Lag), and GS0.5, may show better expected returns, these methods are not safe in PointGoal and CarButton.
In Fig. 9, we can observe that the distributions in PointGoal and CarButton are not Gaussian, which justifies the use of a quantile regression algorithm and explains the safe behavior of the WCSAC-IQN algorithms in these more complex environments. Compared to GS1.0, GS0.5, and IQN1.0, the distribution of IQN0.5 displays a smaller range of costs, most of which are within the safety bound. Although the policies from GS0.5 still generate some unsafe trajectories, their likelihood is much lower.
5.2.2 Behavior during training
Figure 10 shows the behavior of the agents during training. The top row shows the expected return, while the bottom row shows the expected cost-return.
We can see that all safe RL methods manage to make some safety improvements, while SAC obviously achieves better and more stable average episodic returns across all environments, since it ignores the safety constraints.
In StaticEnv (Fig. 10a), we notice that all safe RL algorithms converge toward the optimal policy. However, compared to the off-policy WCSAC, the on-policy baselines CPO and PPO-Lag take more time to do so. Looking closely, we notice that CPO and GS1.0 slightly exceed the cost bound at the end of training, while PPO-Lag, GS0.5, and IQN1.0 end slightly below it. In particular, we highlight that IQN0.5 achieves a lower expected cost without sacrificing much performance in terms of return.
In PointGoal (Fig. 10b), we see a different behavior: only the PPO-Lag and WCSAC-IQN algorithms manage to find a satisfactory policy. Although WCSAC-GS and CPO manage to find policies with high returns, they fail to achieve safe behavior.
Finally, in the most complex environment, CarButton (Fig. 10c), we see that the cost constraints severely limit the ability to find high-reward policies: PPO-Lag, IQN1.0, and IQN0.5 manage to find safe policies but cannot improve significantly in terms of return; GS1.0 and GS0.5 also approach a safe policy and achieve some improvement in return; and CPO does not find a safe policy whilst simultaneously struggling to improve returns.
Regarding the unsafe performance of CPO in PointGoal and CarButton, the approximation errors in CPO may prevent it from fully satisfying constraints in these environments, which are harder than the ones where CPO has previously been tested. PPO-Lag has competitive safety performance, but it converges slowly compared to the off-policy baselines; this phenomenon is even more obvious in the relatively simple StaticEnv. For WCSAC-GS in the relatively more complex environments PointGoal and CarButton (Fig. 10b and c), we can see that the return and cost-return of WCSAC-GS start to stabilize near a certain value, instead of making continuous improvements until the constraint is satisfied. However, in Fig. 12 (Appendix B), the safety weights of GS1.0 and GS0.5 quickly converge to a small value. It appears that the algorithm mistakenly takes the policy to be safe: we get a convergent safety approximation (CVaR or expectation) that is below the safety threshold, but it does not truly reflect the safety of the policy, and the algorithm stops making progress in safety. Compared to the Gaussian approximation, the safety weights of IQN1.0 and IQN0.5 change drastically at the beginning of training (Fig. 12b and c, Appendix B), but they finally converge to a safe policy according to the training process in Fig. 10. We hypothesize that WCSAC-IQN benefits from the quantile regression to enhance exploration and avoid overfitting, which may also explain why distributional RL can converge to a better policy than traditional RL (Dabney et al., 2018a).
5.2.3 Trajectory analysis
Finally, we perform a trajectory analysis for each algorithm in StaticEnv, see Fig. 11. Specifically, we compare SAC, CPO, PPO-Lag, and WCSAC with different risk levels, i.e., \(\alpha = 0.1\) (highly risk-averse), \(\alpha = 0.5\), \(\alpha = 0.9\), and \(\alpha = 1.0\) (risk-neutral).
The behavior of the SAC agent is presented in Fig. 11a: the agent chooses the shortest path to reach the target directly, since the safety constraint is not considered.
We also consider the training process of GS1.0 at different stages.^{Footnote 5} At the beginning of learning (Fig. 11d), it is possible that the agent cannot get out of the hazard, and gets stuck before arriving at the goal area. We can observe the number of constraint violations being reduced over time (Fig. 11e and f).
The final policies from CPO, PPO-Lag, GS1.0, GS0.9, IQN1.0, and IQN0.9 perform better than before, but still prefer to take risks within the budget to obtain a larger return (Fig. 11f, b, g, j and k). Conversely, the agents GS0.5 and IQN0.5 are more risk-averse (Fig. 11h and l). Finally, in Fig. 11i and m, we can see that the agents GS0.1 and IQN0.1 avoid the hazardous area more strictly, given their risk-level setting.
Overall, we observe that with a higher risk level \(\alpha\) (around 1.0), WCSAC attains risk-neutral performance similar to expectation-based methods. Both WCSAC algorithms can be made more risk-averse by setting a lower risk level \(\alpha\).
6 Related work
Risk-averse methods have commonly been used in RL problems with a single signal (reward or cost). Although CVaR in the context of IQN was used by Dabney et al. (2018a) (with only a reward signal) to obtain risk-sensitive policies, the implementation is significantly different from ours: we deploy IQN in safety-constrained RL problems (with separate reward and cost signals) with continuous action spaces, instead of problems with a discrete action space.
Building on the work of Bellemare et al. (2017), Keramati et al. (2020) propose to perform strategic exploration to quickly obtain the optimal risk-averse policy. Following Dabney et al. (2018a), Urpí et al. (2021) propose a new actor-critic method to optimize a risk-averse criterion in terms of return, where only samples previously collected by safe policies are available for training. Although their paper provides an off-policy version of the algorithm, it is not clear how the exploration-exploitation trade-off is handled, while we explicitly define a SAC-based method with an entropy-related mechanism for exploration. Chow et al. (2017) propose efficient RL algorithms for risk-constrained MDPs, but their goal is to minimize an expected cumulative cost while keeping the cost CVaR below a given threshold, instead of maintaining reward and cost signals independently. To some extent, the way they update the Lagrange multiplier inspired our use of adaptive safety weights. In the real world, however, safe RL problems typically involve multiple objectives, some of which may be contradictory, like collision avoidance and speed in an autonomous driving task (Kamran et al., 2020). Therefore, the setting with an explicit safety signal can be more practical (Dulac-Arnold et al., 2021).
The safe RL setting with separate reward and cost signals has also been studied in several works (Achiam et al., 2017; Bharadhwaj et al., 2021; Liu et al., 2020; Yang et al., 2020). Specifically, Achiam et al. (2017), Liu et al. (2020), and Yang et al. (2020) propose a series of on-policy constrained policy optimization methods with a trust-region property, where the worst-case performance is bounded at each update. However, they do not present a clear risk-aversion mechanism for the intrinsic uncertainty captured by the distribution over the cost-return. In addition, on-policy methods (with worse sample efficiency than off-policy methods) are usually not favored in safe RL domains. Under a similar problem setting, Bharadhwaj et al. (2021) work with a conservative estimate of the expected cost-return (Kumar et al., 2020) for each candidate state-action tuple, which is used both in safe exploration and in policy updates. With the conservative safety estimate, their proposed method can learn effective policies while reducing the rate of catastrophic failures. However, they only focus on the parametric uncertainty over the value estimate instead of the intrinsic uncertainty. Moreover, their paper focuses on catastrophic events, which constitute a binary signal, while our paper considers safety according to the accumulated cost in a trajectory. Overall, our approach gives the designer of the system more freedom to indicate which behaviors are more or less desirable.
Finally, prior to our work, we did not find state-of-the-art distributional RL techniques being used in safety-constrained RL with separate reward and safety signals.
7 Conclusion
In this paper, we propose an actor-critic method, WCSAC, to achieve risk control for safety-constrained RL problems. We employ a Gaussian distribution or an implicit quantile network as the safety critic to account for the considerable risks caused by the randomness in the cost-return. The experiments show that both WCSAC-GS and WCSAC-IQN attain better risk control than expectation-based methods. In complex environments, WCSAC-GS does not show improvements in safety, since the safety weight is not updated fast enough to truly reflect the current policy. WCSAC-IQN, however, performs strongly thanks to IQN, which provides a stronger safety signal than the Gaussian approximation. The novel use of IQN for safety constraints can potentially be extended to other safe RL methods.
7.1 Limitations
Without any knowledge about the environment, it is hard to strictly fulfill the safety constraint during exploration. Thus, our algorithm still focuses more on the performance of the final policy. While our method achieves good risk control for safety-constrained RL problems, one limitation is that we cannot ensure a safe training process. Also, although our method shows good performance in practice, our work has not established theoretical proofs of convergence.
7.2 Future work
In safety-critical problems, sample efficiency and adaptation to new tasks are both particularly crucial, so off-policy RL and meta-RL are natural approaches to solve safe-RL problems. We will further explore meta-RL with safe exploration tasks in the future (Finn et al., 2017; Rakelly et al., 2019). Another direction is to leverage the epistemic uncertainty about the safety dynamics to ensure safety also during training (Simão et al., 2021; Yang et al., 2022; Zheng & Ratliff, 2020).
Availability of data and materials
Not applicable.
Code availability
Code to reproduce our experiments is available at https://github.com/AlgTUDelft/WCSAC.
Notes
A similar approach has been used in the code available at https://github.com/openai/safety-starter-agents.
In this section, Z stands for the return \(Z_{\pi }^{r}(s,a)\), but this method can easily be adapted to estimate the cost-return distribution.
The code of SpyEnv will be made available online.
The code of WCSAC will be made available online.
Other methods show similar behavior during training.
References
Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. Proceedings of the 34th international conference on machine learning (pp. 22-31). PMLR.
Altman, E. (1999). Constrained Markov decision processes (Vol. 7). CRC Press.
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. Proceedings of the 34th international conference on machine learning (pp. 449-458). PMLR.
Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods (Vol. 1). Academic press.
Bharadhwaj, H., Kumar, A., Rhinehart, N., Levine, S., Shkurti, F., & Garg, A. (2021). Conservative safety critics for exploration. 9th international conference on learning representations (pp. 1-9).
Borkar, V. S. (2005). An actorcritic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3), 207–213.
Chow, Y., Ghavamzadeh, M., Janson, L., & Pavone, M. (2017). Riskconstrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1), 6070–6120.
Dabney, W., Ostrovski, G., Silver, D., & Munos, R. (2018). Implicit quantile networks for distributional reinforcement learning. Proceedings of the 35th international conference on machine learning (pp. 1096-1105).
Dabney, W., Rowland, M., Bellemare, M. G., & Munos, R. (2018). Distributional reinforcement learning with quantile regression. Thirty-Second AAAI Conference on Artificial Intelligence (pp. 2892-2901). AAAI Press.
Duan, J., Guan, Y., Li, S. E., Ren, Y., & Cheng, B. (2020). Distributional soft actorcritic: Offpolicy reinforcement learning for addressing value estimation errors. arXiv preprint arxiv:2001.02811.
Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., & Hester, T. (2021). Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 2419-2468.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th international conference on machine learning (pp. 1126-1135). PMLR.
García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research, 16(1), 1437–1480.
Ha, S., Xu, P., Tan, Z., Levine, S., & Tan, J. (2020). Learning to walk in the real world with minimal human effort. arXiv preprint arxiv:2002.08550.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th international conference on machine learning (pp. 1861-1870). PMLR.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., & Levine, S. (2018). Soft actorcritic algorithms and applications. arXiv preprint arxiv:1812.05905.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 73-101.
Kamran, D., Lopez, C. F., Lauer, M., & Stiller, C. (2020). Risk-aware high-level decisions for automated driving at occluded intersections with reinforcement learning. IEEE intelligent vehicles symposium, IV (pp. 1205-1212). IEEE.
Keramati, R., Dann, C., Tamkin, A., & Brunskill, E. (2020). Being optimistic to be conservative: Quickly learning a CVaR policy. Proceedings of the AAAI conference on artificial intelligence (pp. 4436-4443).
Khokhlov, V. (2016). Conditional valueatrisk for elliptical distributions. Evropskỳ časopis ekonomiky a managementu, 2(6), 70–79.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Qlearning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 1179–1191.
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Wierstra, D. (2015). Continuous control with deep reinforcement learning. 4th international conference on learning representations (pp. 1-10). ICLR.
Liu, Y., Ding, J., & Liu, X. (2020). IPO: Interior-point policy optimization under constraints. Proceedings of the AAAI conference on artificial intelligence (pp. 4940-4947).
Ma, X., Zhang, Q., Xia, L., Zhou, Z., Yang, J., & Zhao, Q. (2020). Distributional soft actor critic for risk sensitive learning. arXiv preprint arxiv:2004.14547.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., & Hassabis, D. (2015). Humanlevel control through deep reinforcement learning. Nature, 518(7540), 529–533.
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010). Parametric return density estimation for reinforcement learning. Twenty-sixth conference on uncertainty in artificial intelligence (pp. 368-375). AUAI Press.
Olkin, I., & Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48, 257–263.
Pecka, M., & Svoboda, T. (2014). Safe exploration techniques for reinforcement learning - an overview. First international workshop on modelling and simulation for autonomous systems (pp. 357-375). Springer.
Rakelly, K., Zhou, A., Finn, C., Levine, S., & Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. Proceedings of the 36th international conference on machine learning (Vol. 97, pp. 5331-5340). PMLR.
Ray, A., Achiam, J., & Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. Retrieved from https://cdn.openai.com/safexp-short.pdf
Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional valueatrisk. Journal of Risk, 2(3), 21–41.
Rowland, M., Dadashi, R., Kumar, S., Munos, R., Bellemare, M. G., & Dabney, W. (2019). Statistics and samples in distributional reinforcement learning. Proceedings of the 36th international conference on machine learning (pp. 5528-5536).
Roy, J., Girgis, R., Romoff, J., Bacon, P.L., & Pal, C. (2021). Direct behavior specification via constrained reinforcement learning. arXiv preprint arxiv:2112.12228.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of the 32nd international conference on machine learning (pp. 1889-1897). JMLR.org.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy optimization algorithms. arXiv preprint arxiv:1707.06347.
Simão, T. D., Jansen, N., & Spaan, M. T. J. (2021). AlwaysSafe: Reinforcement learning without safety constraint violations during training. Proceedings of the 20th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1226-1235). IFAAMAS.
Sobel, M. J. (1982). The variance of discounted markov decision processes. Journal of Applied Probability, 19(4), 794–802.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Vol. 2). MIT press.
Tamar, A., Di Castro, D., & Mannor, S. (2016). Learning the variance of the reward-to-go. The Journal of Machine Learning Research, 17(1), 361–396.
Tang, Y. C., Zhang, J., & Salakhutdinov, R. (2020). Worst cases policy gradients. 3rd annual conference on robot learning (pp. 1078-1093). PMLR.
Théate, T., Wehenkel, A., Bolland, A., Louppe, G., & Ernst, D. (2021). Distributional reinforcement learning with unconstrained monotonic neural networks. arXiv preprint arxiv:2106.03228.
Urpí, N. A., Curi, S., & Krause, A. (2021). Risk-averse offline reinforcement learning. 9th international conference on learning representations.
Yang, T.-Y., Rosca, J., Narasimhan, K., & Ramadge, P. J. (2020). Projection-based constrained policy optimization. 8th international conference on learning representations.
Yang, Q., Simão, T. D., Jansen, N., Tindemans, S. H., & Spaan, M. T. J. (2022). Training and transferring safe policies in reinforcement learning. AAMAS 2022 Workshop on Adaptive Learning Agents.
Yang, Q., Simão, T. D., Tindemans, S. H., & Spaan, M. T. J. (2021). WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning. Thirty-Fifth AAAI conference on artificial intelligence (pp. 10639–10646). AAAI Press.
Yang, D., Zhao, L., Lin, Z., Qin, T., Bian, J., & Liu, T.-Y. (2019). Fully parameterized quantile function for distributional reinforcement learning. Advances in Neural Information Processing Systems 32 (pp. 6193–6202). Curran Associates, Inc.
Zheng, L., & Ratliff, L. (2020). Constrained upper confidence reinforcement learning. Proceedings of the 2nd conference on learning for dynamics and control (pp. 620–629). PMLR.
Funding
This research is funded by the Netherlands Organisation for Scientific Research (NWO), as part of the Energy System Integration: planning, operations, and societal embedding program and the grant NWA.1160.18.238: “PrimaVera”. Qisong Yang is supported by Xidian University.
Author information
Contributions
Q.Y. conceived of the presented idea, implemented the algorithms, carried out the experiments. Q.Y. and T.S. conceived and planned the experiments, contributed to the analysis of the results. S.T. proposed the spy game environment. S.T. and M.S. supervised the project. All authors discussed the results and contributed to the final manuscript, helping with writing, reviewing and editing.
Ethics declarations
Conflict of interest/Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Dana Drachsler Cohen, Javier Garcia, Mohammad Ghavamzadeh, Marek Petrik, Philip S. Thomas.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Approximate safety constraint
A discounted version of d is defined as \(\overline{d} = [(1-\gamma ^T)d]/ [(1-\gamma )T_{\text {max}}]\), where we assume an equal cost is accumulated at each step, and \(T_{\text {max}}\) is the maximum length of the episode. This assumption is not strictly correct, since the cost is not spread evenly over a real episode, and often no costs are incurred early on. However, since our algorithm optimizes over the discounted infinite horizon from each state-action pair in the replay buffer, the approximation is reasonable in practice.
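The conversion above can be sketched as a small helper. This is a minimal illustration, not code from the paper; the function name `discounted_threshold` and the example budget values are hypothetical, and the equal-per-step-cost assumption is exactly the one stated above:

```python
def discounted_threshold(d, gamma, T, T_max):
    """Approximate discounted safety budget d_bar from an undiscounted
    episodic budget d, assuming the cost d is spread evenly over the
    T_max steps of an episode (the simplifying assumption in the text)."""
    per_step = d / T_max  # assumed equal cost incurred at every step
    # Discounted sum of T equal per-step costs: c * (1 - gamma^T) / (1 - gamma)
    return per_step * (1 - gamma**T) / (1 - gamma)


# Hypothetical example: budget d = 25 over episodes of T = T_max = 1000
# steps with gamma = 0.99 yields a much smaller discounted budget.
d_bar = discounted_threshold(25.0, gamma=0.99, T=1000, T_max=1000)
```

As \(\gamma \rightarrow 1\) the discounted budget recovers the undiscounted one (with \(T = T_{\text{max}}\)), which is a quick sanity check on the formula.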
Appendix B Safety weights
We present the change of the adaptive safety weights \(\omega\) in WCSAC (GS-1.0, GS-0.5, IQN-1.0, and IQN-0.5) during training in Fig. 12, where a small stable weight means that the constraint is approximately satisfied. In StaticEnv (Fig. 12a), the safety weights of the four WCSAC variants evolve similarly, which accords with the convergence of the cost-return in Fig. 10a. In PointGoal and CarButton (Fig. 12b and c), the safety weights of GS-1.0 and GS-0.5 converge to a small value more quickly than those of IQN-1.0 and IQN-0.5, yet they fail to obtain a constraint-satisfying policy, as the results in Fig. 10b and c show. Although the safety weights of IQN-1.0 and IQN-0.5 change drastically at the beginning of training (Fig. 12b and c), they eventually converge to a safe policy, according to the training process in Fig. 10.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, Q., Simão, T.D., Tindemans, S.H. et al. Safety-constrained reinforcement learning with a distributional safety critic. Mach Learn 112, 859–887 (2023). https://doi.org/10.1007/s10994-022-06187-8