1 Introduction

Recently, autonomous driving has attracted considerable attention across the globe [1, 2]. Its expected societal benefits include safer transportation, less congestion, and lower emissions. However, safety remains a major obstacle to the advancement of autonomous driving [3, 4]. Unsafe driving behaviours of autonomous vehicles may jeopardize human life and cause significant financial loss [5]. Given these potential risks, it is evident that a substantial journey lies ahead before the stringent requirements and high expectations placed on autonomous driving are fulfilled [6].

Autonomous vehicles are products built on multi-disciplinary knowledge and theories [7]. The decision making module, often regarded as the intelligent brain of an autonomous vehicle, determines the driving mode or behaviour according to environmental information and vehicle status. To address such decision making problems, reinforcement learning (RL) has shown great potential and achieved impressive successes across a wide range of challenging tasks [8, 9]. As a result, researchers have explored various RL algorithms to cope with a sequence of autonomous driving tasks [10, 11].

In many studies, RL has been applied to generate lane change behaviours during autonomous driving [12, 13]. One popular approach builds on the deep Q-network (DQN). For instance, a lateral control policy is developed through DQN with safety checkers for autonomous vehicles in Ref. [14]. In Ref. [15], a lane-change algorithm is developed by combining a partially observable Markov decision process with DQN. A combined DQN and rule-based method is proposed for learning lane changing decisions in automated vehicles in Ref. [16]. A DQN-based harmonious lane-change strategy is presented for automated driving to enhance overall transportation efficiency in Ref. [17]. A lane-change algorithm is proposed to optimize automated vehicle decision making using DQN and risk-awareness prioritized replay in Ref. [18].

Apart from the above DQN-based paradigms, other RL algorithms have also been applied to automated vehicle decision making. For example, a multi-objective driving policy learning method is developed to optimize automated vehicle lateral decision making [19]. A decision making scheme for the lane change task is developed via attention-based hierarchical deep RL in Ref. [20]. An autonomous lateral decision making method is developed using proximal policy optimization (PPO) in Ref. [21].

The speed mode (e.g., keeping, acceleration and deceleration) or target speed of autonomous vehicles can be learned via DQN [22, 23], deep deterministic policy gradient (DDPG) [24, 25], or PPO [26] algorithms. For example, the longitudinal acceleration level of an autonomous vehicle at intersections can be determined by a learned belief updater and safe RL relying on a model-checker in Ref. [27]. In addition, the lane change behaviour and target speed of autonomous driving agents are determined simultaneously by RL algorithms in Refs. [28,29,30]. For example, a coordinated decision making scheme using the DDPG algorithm is developed to learn steering and throttle maneuvers for autonomous driving in Ref. [31].

The autonomous driving decision solutions mentioned above, which are based on RL algorithms, have yielded remarkable outcomes. However, it is crucial to acknowledge that real-world environments are prone to sensor noise and measurement errors. These factors can lead autonomous driving agents to make suboptimal decisions or, in extreme cases, cause catastrophic damage. The lack of robustness guarantees limits their application in safety-critical autonomous driving domains. In view of these hazards, autonomous driving must guarantee that its decision making behaviours can cope with natural sensing and perception uncertainties, and particularly with adversarial attacks on observations.

A handful of existing studies have endeavoured to tackle this challenge. In Refs. [11, 32], robust RL frameworks against white-box and black-box attacks on perception systems are developed, respectively, to ensure the robustness of decision making for autonomous vehicles. In Ref. [33], a robust decision making method is proposed to enhance the robustness and safety of autonomous driving. This scheme incorporates a switching mechanism over principle-based policies, aiming to adapt effectively to various environments and ensure reliable decision making in unseen scenarios. Nevertheless, the above studies may not be able to provide guarantees against worst-case perturbations. This limitation arises because the autonomous driving agents trained with these methods are not exposed to the optimal adversarial attacks generated by a learnable adversary, which can produce stronger attacks than existing white-box or black-box techniques [34].

Accordingly, this paper presents a novel observation-robust RL (ORRL) scheme for safe decision making in the lane change task, which aims to ensure autonomous driving performance and policy robustness against adversarial attacks on observations. The main contributions of this work are summarized as follows:

  • The proposed ORRL scheme enables autonomous driving agents to approximate robust driving policies against adversarial attacks on observations and guarantee travel safety and efficiency.

  • An adversarial agent is trained online to approximate the optimal adversarial attacks on observations, aiming to maximize the average variation distance of the perturbed policies, measured by the Jensen–Shannon (JS) divergence.

  • A novel observation-robust actor-critic (ORAC) method is developed to maximize expected return and keep the optimal adversary-induced performance variations within boundaries.

Multiple highway driving conditions with various traffic flow densities are simulated in Simulation of Urban MObility (SUMO) [35, 36] to assess the feasibility and effectiveness of the proposed ORRL method. The results indicate that the developed safe decision making method is advantageous over three existing state-of-the-art methods.

The remaining sections of this paper are organized as follows. In Sect. 2, the developed ORRL scheme of safe decision making for autonomous driving is presented. In Sect. 3, the detailed algorithms with their implementation of the proposed approach are introduced. In Sect. 4, the evaluation results are discussed and analysed. Section 5 concludes this study.

Fig. 1  The proposed safe decision making framework

2 Observation-Robust Reinforcement Learning for Safe Autonomous Driving

2.1 Technique Framework

In the context of lane change for autonomous driving, the high-level framework of safe decision making using ORRL to deal with adversarial attacks on observations is presented in Fig. 1. The ego agent is an autonomous vehicle, coloured gold. Surrounding vehicles in other colours are controlled by the SUMO intelligent driver model (IDM). In addition, the ego vehicle's action is discrete; the action set contains lane keeping, left lane-change, and right lane-change behaviours.

In Fig. 1, the ORAC block is adopted to optimize safe lane change policies and allows the agent to interact with the environment. The state s, the reward r, and the adversary \(\Delta ^{*}\) on observations are its inputs, while its outputs are the agent action a and the policy \(\pi (a\vert s)\).

The aim of the adversarial agent module is to approximate the optimal adversarial attacks that are able to maximize the average variation distance on perturbed policies. This block’s input contains the state s and action a, and its output is the optimal adversarial attacks \(\Delta ^{*}\) on observations.

In addition, the environment block produces the next-step state \(s_{t+1}\) and the reward \(r_t\). Its input is the action \(a_t\) together with the policy \(\pi (a_t\vert s_t)\), where t denotes the time step.

2.2 Observation-Robust MDP

In this section, the observation-robust Markov decision process (ORMDP) is developed to model the decision making of agents under observation perturbations and policy constraints.

Definition 1: An ORMDP can be represented by the 7-tuple \(\left( {\mathcal {S}}, {\mathcal {A}}, p, r, c, \Delta , \gamma \right)\), where \({\mathcal {S}}\) denotes the state space, \({\mathcal {A}}\) represents the action space, p is the state transition probability, \(r: {\mathcal {S}} \times {\mathcal {A}} \rightarrow \mathbb {R}\) denotes the reward function, c represents the constraint function, \(\Delta\) denotes the observation uncertainty, and \(\gamma \in (0, 1)\) indicates a discount factor.

ORMDP seeks to solve the constrained optimization problem formulated as follows:

$$\begin{aligned}&\max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) \right] \\&\text {s.t.} \ \ \mathbb {E} \left[ c(s,\pi ,\Delta ) \right] \le \epsilon \nonumber \end{aligned}$$
(1)

where T denotes the time horizon, and \(\epsilon\) indicates an expected threshold value.

2.3 Adversarial Agent

The aim of adversarial agent training is to approximate the optimal adversarial attacks on observations.

The JS divergence can be seen as a smoothed and symmetrized Kullback–Leibler (KL) divergence [37, 38]. Importantly, unlike the KL divergence, the JS divergence between two probability distributions is bounded (by 1 when the base-2 logarithm is used). Therefore, in this paper, the JS divergence is used to model the variations of the perturbed policy caused by adversarial attacks. The optimization objective based on the JS divergence is defined as:

$$\begin{aligned} c(s,\pi ,\Delta )&= D_{\rm JS} \big (\pi (a\vert s)\vert \vert \pi (\tilde{a}\vert \tilde{s}) \big ) \end{aligned}$$
(2)
$$\begin{aligned}&= D_{\rm JS} \big (\pi (a\vert s)\vert \vert \pi (\tilde{a}\vert s+\Delta ) \big ) \nonumber \\&= \frac{1}{2} D_{\rm KL}\big (\pi (a\vert s)\vert \vert m \big ) \nonumber \\&\quad + \frac{1}{2} D_{\rm KL} \big (\pi (\tilde{a}\vert s+\Delta )\vert \vert m \big ) \nonumber \\ m&=\frac{1}{2}\big (\pi (a\vert s)+\pi (\tilde{a}\vert s+\Delta ) \big ) \end{aligned}$$
(3)

where \(D_{\rm JS}\) denotes the JS-divergence-based distance, \(D_{\rm KL}\) represents the KL-divergence-based distance, and \(\tilde{a}\) and \(\tilde{s}\) are the action and state under observation perturbations, respectively.
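
To make the constraint in Eqs. (2)–(3) concrete, the following minimal Python sketch computes the JS-divergence-based variation distance between the clean and perturbed discrete policies. It assumes a policy network that outputs action logits and uses PyTorch; the names policy_net, js_divergence and policy_constraint are illustrative and not from the paper.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence of Eqs. (2)-(3) for discrete distributions.

    p, q: tensors of shape (batch, n_actions) whose rows sum to 1.
    With the natural logarithm the result is bounded by ln 2; dividing by
    ln 2 (or using the base-2 logarithm) bounds it by 1.
    """
    m = 0.5 * (p + q)                                        # mixture m = (p + q) / 2
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def policy_constraint(policy_net, s, delta):
    """c(s, pi, Delta): JS distance between pi(.|s) and pi(.|s + Delta)."""
    p_clean = F.softmax(policy_net(s), dim=-1)               # pi(a|s)
    p_perturbed = F.softmax(policy_net(s + delta), dim=-1)   # pi(a~|s + Delta)
    return js_divergence(p_clean, p_perturbed)
```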

The adversarial agent's optimization problem can be formulated as:

$$\begin{aligned}&\Delta ^{*} \in \arg \underset{{\Delta }}{\text {max}} \ \ \mathbb {E}[c(s, \pi , \Delta )] \\&\text {s.t.} \ \ \left| \Delta \right| \le \eta \nonumber \end{aligned}$$
(4)

where \(\eta\) denotes the perturbation limit.

To simplify the constrained optimization problem above, this work uses the hyperbolic tangent function tanh\((\cdot )\) to bound the magnitude of the observational perturbation:

$$\begin{aligned} \Delta (x) = \alpha \frac{e^x-e^{-x}}{e^x+e^{-x}} \end{aligned}$$
(5)

where x represents the optimization variable, \(\alpha\) denotes an upper bound on the perturbation magnitude, and e is the base of the natural logarithm. With this parameterization, the constrained optimization problem in Eq. (4) can be reformulated as follows:

$$\begin{aligned} x^{*} \in \arg \underset{{x}}{\text {max}} \ \ \mathbb {E} \big [c\big (s, \pi , \Delta (x) \big ) \big ] \end{aligned}$$
(6)

where \(x^{*}\) represents the optimal solution.

The optimal adversarial observational perturbation \(\Delta ^{*}\) can then be obtained as:

$$\begin{aligned} \Delta ^{*} = \alpha \frac{e^{x^{*}}-e^{-x^{*}}}{e^{x^{*}}+e^{-x^{*}}} \end{aligned}$$
(7)

Therefore, to approximate the optimal solution \(x^{*}\), the adversarial agent is optimized by maximizing the following objective function:

$$\begin{aligned} J_{\Delta }(\bar{\theta }) = \mathbb {E}\bigg [c \bigg (s, \pi , \Delta \big (x(s;\,\bar{\theta }) \big ) \bigg )\bigg ] \end{aligned}$$
(8)

where \(\bar{\theta }\) denotes the adversary network parameters. Note that the adversarial agent takes the state s as input and outputs the solution x.
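
Following Eqs. (5)–(8), a hedged sketch of the adversarial agent is given below. It reuses the policy_constraint helper above; the network width and the bound alpha are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class Adversary(nn.Module):
    """Maps a state s to a bounded perturbation Delta = alpha * tanh(x(s)), cf. Eqs. (5)-(7)."""

    def __init__(self, state_dim, alpha=0.05, hidden=128):
        super().__init__()
        self.alpha = alpha                     # upper bound on the perturbation magnitude
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s):
        x = self.net(s)                        # unconstrained optimization variable x(s; theta_bar)
        return self.alpha * torch.tanh(x)      # |Delta| <= alpha by construction

def adversary_loss(adversary, policy_net, s):
    """Negative of J_Delta in Eq. (8): minimizing it maximizes the expected JS constraint."""
    delta = adversary(s)
    return -policy_constraint(policy_net, s, delta).mean()
```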

2.4 Observation-Robust Actor-Critic

This section introduces the proposed ORAC algorithm that aims to solve the following constrained optimization problem concerning ORMDP under the optimal adversarial attacks \(\Delta ^{*}\):

$$\begin{aligned}&\max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) \right] \\&\text {s.t.} \ \ \mathbb {E} \left[ c(s, \pi ,\Delta ^{*}) \right] \le \epsilon \nonumber \end{aligned}$$
(9)

To solve the ORMDP, this work uses a policy iteration method, referred to as observation-robust policy iteration (ORPI). The ORPI algorithm consists of two key processes: policy evaluation (PE) and policy improvement (PI). These two steps are updated iteratively until convergence.

Based on duality theory [39], the Lagrangian of the constrained optimization problem in Eq. (9) can be expressed as follows:

$$\begin{aligned} \begin{aligned} L(\pi , \beta )=\mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \end{aligned} \end{aligned}$$
(10)

where \(\beta\) denotes the dual variable.

2.4.1 Observation-Robust PE

The action-value function \(Q^{\pi }(\cdot )\) under the optimal adversary \(\Delta ^{*}\) can be learned iteratively for a fixed policy. The iteration can start from any \(Q^{\pi }(\cdot ): {\mathcal {S}} \rightarrow \mathbb {R}^{\left| {\mathcal {A}} \right| }\) and proceeds by repeatedly applying the Bellman backup operator \(\mathcal {T}^{\pi }\), given by:

$$\begin{aligned} \mathcal {T}^{\pi }Q^{\pi }(s_t)&:= r(s_t, a_t) + \gamma \mathbb {E} [V^{\pi }(s_{t+1})] \end{aligned}$$
(11)

where

$$\begin{aligned} \mathbb {E} [V^{\pi }(s_{t+1})]&= \pi (s_{t+1})^{\intercal }Q^{\pi }(s_{t+1}) \end{aligned}$$
(12)

represents the expected value function under \(\Delta ^{*}\), which can be calculated directly from the agent's discrete action distribution.

To speed up the training of the policy model, the ORAC algorithm uses two parameterized action-value functions with parameters \(\phi ^z\), \(z \in \left\{ 1,2\right\}\), whose parameters are optimized by minimizing the critic loss function:

$$\begin{aligned} J_{c}(\phi ^{z})&= \underset{\underset{Ts \sim {\mathcal {D}}}{a_{t+1} \sim {\pi }}}{\mathbb {E}} \left[ \left\| y_{t} - Q^{\pi }(s_{t};\,\phi ^{z}) \right\| _{2}^{2} \right] \end{aligned}$$
(13)

where \(y_t\) represents the target value of the action-value function at time step t, and Ts denotes a transition sampled from the replay buffer \(\mathcal {D}\).

The smaller of the two \(Q^{\pi }(\cdot )\) estimates is adopted to mitigate value overestimation when training the critic networks. Consequently, \(y_t\) can be given by:

$$\begin{aligned} y_t&= r(s_t, a_{t}) +\gamma \pi (s_{t+1})^{\intercal } \Big [\underset{z\in \left\{ 1,2\right\} }{\text {min}}\hat{Q}^{\pi }(s_{t+1}; \,\bar{\phi }^z) \Big ] \end{aligned}$$
(14)

where \(\bar{\phi }^z\) denotes the parameters of the target action-value function \(\hat{Q}^{\pi }(\cdot )\), which are updated through Polyak averaging:

$$\begin{aligned} \bar{\phi }^z \leftarrow \delta \bar{\phi }^z + (1-\delta ) \phi ^z \end{aligned}$$
(15)

where \(\delta\) is a smoothing coefficient between 0 and 1.
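
The observation-robust policy evaluation step of Eqs. (12)–(15) can be sketched as follows. This is a minimal PyTorch sketch assuming discrete-action critics that output one Q value per action; the batch layout and the done flag for terminal states are implementation assumptions, and optimizer steps are left to the caller.

```python
import torch

def critic_targets_and_losses(critics, target_critics, policy_net, batch,
                              gamma=0.99, delta_polyak=0.995):
    """Clipped double-Q targets (Eq. (14)), critic losses (Eq. (13)) and Polyak updates (Eq. (15)).

    critics / target_critics: lists of two networks mapping states to per-action Q values.
    batch: dict of tensors with keys 's', 'a', 'r', 's_next', 'done'.
    """
    s, a, r = batch['s'], batch['a'], batch['r']
    s_next, done = batch['s_next'], batch['done']

    with torch.no_grad():
        pi_next = torch.softmax(policy_net(s_next), dim=-1)      # pi(s_{t+1})
        q_next = torch.min(target_critics[0](s_next),
                           target_critics[1](s_next))            # element-wise min of target critics
        v_next = (pi_next * q_next).sum(dim=-1)                  # pi^T Q, Eq. (12)
        y = r + gamma * (1.0 - done) * v_next                    # target y_t, Eq. (14)

    losses = []
    for q_net in critics:
        q_sa = q_net(s).gather(1, a.long().unsqueeze(-1)).squeeze(-1)   # Q^pi(s_t; phi^z) at a_t
        losses.append(((y - q_sa) ** 2).mean())                         # Eq. (13)

    # Polyak averaging of the target parameters, Eq. (15); usually applied after the gradient step.
    with torch.no_grad():
        for q_net, q_tgt in zip(critics, target_critics):
            for p, p_tgt in zip(q_net.parameters(), q_tgt.parameters()):
                p_tgt.mul_(delta_polyak).add_((1.0 - delta_polyak) * p)

    return losses
```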

2.4.2 Observation-Robust PI

To further improve the policy of the ORRL agent, the expected return should be maximized while satisfying the constraint \(c(\cdot )\).

The Lagrange dual function can be derived based on Eq. (10), as:

$$\begin{aligned}&\bar{L}(\beta ) = \max _{\pi } L(\pi , \beta ) \\&= \max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \nonumber \end{aligned}$$
(16)

Additionally, the Lagrange dual problem of Eq. (9) can be written as:

$$\begin{aligned}&\min _{\beta \ge 0} \bar{L}(\beta ) = \min _{\beta \ge 0} \max _{\pi } L(\pi , \beta ) \\&=\min _{\beta \ge 0} \max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \nonumber \end{aligned}$$
(17)

The optimal policy \(\pi ^{*}\) and the optimal dual variable \(\beta ^{*}\) can be found by the following alternating procedure. First, for a fixed dual variable \(\beta\), the optimal policy \(\pi ^{*}\) is learned by maximizing \(L(\pi , \beta )\). Then, \(\pi ^{*}\) is plugged in and the optimal dual variable \(\beta ^{*}\) is approximated by minimizing \(L(\pi ^{*}, \beta )\). According to Eq. (17), the following relations can be derived:

$$\begin{aligned} \pi ^{*}= & {} \arg \max _{\pi } L(\pi , \beta ) \end{aligned}$$
(18)
$$\begin{aligned} \beta ^{*}= & {} \arg \min _{\beta \ge 0} L(\pi ^{*}, \beta ) \end{aligned}$$
(19)

In order to minimize the estimation error of the expected return, the double \(Q^{\pi }(\cdot )\) trick is employed. As a result, the parameter \(\theta\) of the policy model is updated by maximizing the objective function of the actor network, given by:

$$\begin{aligned} J_a(\theta )=&\underset{\underset{Ts \sim {\mathcal {D}}}{a_{t} \sim {\pi }}}{\mathbb {E}} \Big [\pi (s_{t}; \theta )^{\intercal }[\underset{z\in \left\{ 1,2\right\} }{\text {min}}Q^{\pi }(s_{t}; \phi ^z) \nonumber \\&- \beta c(s, \pi ,\Delta )]\Big ] \end{aligned}$$
(20)

Moreover, the dual variable can be optimized by minimizing the following objective function:

$$\begin{aligned} J_{\text {d}} (\beta )&= \underset{\underset{Ts \sim \mathcal {D}}{a_{t} \sim {\pi }}}{\mathbb {E}} \big [\pi (s_{t}; \theta )^{\intercal }[\beta \epsilon - \beta c(s, \pi ,\Delta )]\big] \end{aligned}$$
(21)
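
Likewise, Eqs. (20)–(21) can be written as losses to minimize. The sketch below reuses the policy_constraint, critic, and adversary objects from the earlier sketches, and parameterizes the non-negative dual variable as \(\beta = \exp(\log \beta)\), which is an implementation convention rather than part of the paper.

```python
import torch

def actor_and_dual_losses(policy_net, critics, adversary, s, log_beta, epsilon=0.1):
    """Actor objective (Eq. (20)) and dual-variable objective (Eq. (21)) as minimization losses."""
    beta = log_beta.exp()                                          # beta >= 0 by construction
    pi = torch.softmax(policy_net(s), dim=-1)                      # pi(s_t; theta)
    q_min = torch.min(critics[0](s), critics[1](s))                # min over the two critics
    c = policy_constraint(policy_net, s, adversary(s).detach())    # c(s, pi, Delta*)

    # Eq. (20): maximize pi^T [min_z Q - beta * c]  ->  minimize its negative (beta held fixed).
    actor_loss = -((pi * (q_min - beta.detach() * c.unsqueeze(-1))).sum(dim=-1)).mean()

    # Eq. (21): since the action probabilities sum to one, the objective reduces to
    # E[beta * (epsilon - c)]; only beta receives gradients here.
    dual_loss = (beta * (epsilon - c.detach())).mean()
    return actor_loss, dual_loss
```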

3 Algorithm Implementation

Algorithm 1 overviews the ORRL approach in detail. The ORRL method updates the driving policies of the agent through the following procedure. The initial network parameters of the actor and critic are sampled from a stochastic distribution. In each iteration, the RL agent gathers data over M time steps and saves them to the buffer \({\mathcal {D}}\). The environment includes the reward function and the transition probability. The optimal adversarial attacks \(\Delta ^{*}\) on observations are approximated via the adversarial agent, and \(\rho\) represents a delayed update coefficient. The policies of the RL agent are then updated iteratively. \(d_t\) is a done signal indicating that the ego vehicle has encountered a collision at time step t.

Algorithm 1
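
As a condensed, hedged illustration of Algorithm 1, the sketch below wires the pieces together. The buffer and environment interfaces, the optimizer choices, and the hyperparameter values are illustrative assumptions rather than the paper's exact implementation; it reuses adversary_loss, critic_targets_and_losses and actor_and_dual_losses from Sect. 2.

```python
import torch

def train_orrl(env, policy_net, critics, target_critics, adversary, buffer,
               iterations=400, M=200, batch_size=256, rho=2, lr=3e-4, epsilon=0.1):
    """Condensed ORRL training loop (cf. Algorithm 1); interfaces are illustrative."""
    log_beta = torch.zeros(1, requires_grad=True)
    opt_actor = torch.optim.Adam(policy_net.parameters(), lr=lr)
    opt_critic = torch.optim.Adam([p for q in critics for p in q.parameters()], lr=lr)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=lr)
    opt_beta = torch.optim.Adam([log_beta], lr=lr)

    for _ in range(iterations):
        # 1) Gather M transitions with the current policy and store them in the buffer D.
        s = env.reset()
        for _ in range(M):
            with torch.no_grad():
                probs = torch.softmax(policy_net(torch.as_tensor(s, dtype=torch.float32)), dim=-1)
                a = torch.multinomial(probs, 1).item()
            s_next, r, d, _ = env.step(a)          # d: done signal d_t (e.g., collision)
            buffer.add(s, a, r, s_next, d)
            s = env.reset() if d else s_next

        for step in range(M):
            batch = buffer.sample(batch_size)
            # 2) Update the adversarial agent to approximate Delta* (Eq. (8)).
            opt_adv.zero_grad()
            adversary_loss(adversary, policy_net, batch['s']).backward()
            opt_adv.step()
            # 3) Observation-robust policy evaluation (Eqs. (13)-(15)).
            opt_critic.zero_grad()
            sum(critic_targets_and_losses(critics, target_critics, policy_net, batch)).backward()
            opt_critic.step()
            # 4) Delayed policy improvement and dual update (Eqs. (20)-(21)); rho is the delay.
            if step % rho == 0:
                actor_loss, dual_loss = actor_and_dual_losses(
                    policy_net, critics, adversary, batch['s'], log_beta, epsilon)
                opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
                opt_beta.zero_grad(); dual_loss.backward(); opt_beta.step()
```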

To learn the lane change policies of autonomous vehicles, the RL agent's state, action and reward need to be determined. The autonomous driving agent's state contains 16 dimensions, detailed in Fig. 2. The surrounding (social) vehicles perform lane change maneuvers via the SUMO LC2013 model [40].

Fig. 2  States of the autonomous driving agent

Furthermore, the autonomous driving agent's action is discrete and comprises left lane changing, right lane changing and lane keeping.

A tricky problem is to learn safe lane-change policies against adversarial perturbations on state observations from scratch, without prior knowledge. Consequently, the reward function plays a pivotal role in optimizing the agent's policies. Safety, efficiency and comfort factors are considered when designing the agent's reward function.

To promote travel efficiency of the ego vehicle, a reward function \(r(\cdot )\) is designed in which the reward is proportional to the ego vehicle's speed, i.e., \(v_0/35\). This implies that the autonomous driving agent receives higher rewards by operating at higher speeds. If the headway of the ego vehicle falls below 30 m, the reward is decreased by 0.1, which prevents the ego car from constantly following its preceding vehicle. Both vehicle dynamics stability and collisions are considered in terms of driving safety. If the upper bound of the desired yaw rate, \(k\cdot \bar{\mu } \cdot g/v_0\), given in Ref. [41] is exceeded, the agent's reward is decreased by 0.05, where \(\bar{\mu }\) denotes the adhesion coefficient, k represents the dynamic factor proposed in Ref. [42], and g is the gravitational acceleration constant. In addition, if the ego vehicle collides, the reward is decreased by 0.1. If a lateral decision is made by the ego vehicle at a speed above 20 m/s, the reward is diminished by \(v_0/350\), which aims to avoid frequent lane changing at high speed. Algorithm 2 details the designed reward function.

Algorithm 2
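
A hedged sketch of the reward described above (cf. Algorithm 2) is given below. The thresholds are taken from the preceding paragraph; the function signature and the numerical values of k and \(\bar{\mu }\) are illustrative placeholders, since the paper does not state them here.

```python
def compute_reward(v0, headway, yaw_rate, lane_change, collided,
                   k=1.0, mu_bar=0.85, g=9.81):
    """Reward sketch based on the description above.

    v0:          ego longitudinal speed (m/s)
    headway:     distance to the preceding vehicle (m)
    yaw_rate:    measured yaw rate (rad/s)
    lane_change: True if a lateral (lane-change) action was taken this step
    collided:    True if the ego vehicle collided this step
    k, mu_bar, g: dynamic factor, adhesion coefficient, gravity constant (placeholder values)
    """
    r = v0 / 35.0                             # efficiency: proportional to speed
    if headway < 30.0:
        r -= 0.1                              # discourage persistent close following
    if abs(yaw_rate) > k * mu_bar * g / max(v0, 1e-3):
        r -= 0.05                             # dynamics-stability penalty
    if collided:
        r -= 0.1                              # collision penalty
    if lane_change and v0 > 20.0:
        r -= v0 / 350.0                       # discourage frequent high-speed lane changes
    return r
```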

The actor, critic, and adversary networks each use two fully connected hidden layers with 128 units, and ReLU is adopted as the hidden-layer activation function. The actor and critic networks have 16-dimensional inputs and 3-dimensional outputs, while both the input and output of the adversary network are 16-dimensional. Table A1 of Appendix A provides the main hyperparameters of the ORRL algorithm.
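For reference, a minimal sketch of these networks under the stated dimensions is given below (PyTorch is assumed; the mlp helper is illustrative, and the adversary backbone would additionally be squashed by the tanh bound of Eq. (5), as in the Adversary sketch of Sect. 2.3).

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    """Two fully connected hidden layers of 128 units with ReLU activations."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, n_actions = 16, 3
actor = mlp(state_dim, n_actions)                         # 16-dim state in, 3 action logits out
critics = [mlp(state_dim, n_actions) for _ in range(2)]   # two Q networks, one value per action
adversary_backbone = mlp(state_dim, state_dim)            # 16-dim in and out, cf. the Adversary sketch
```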

4 Experimental Setup and Comparative Evaluation

4.1 Simulation Environment

To evaluate the performance of the developed safe decision making scheme for automated vehicles, training and testing are conducted with SUMO. Three highway scenarios are set up with random mixed traffic flows of various densities. The longitudinal speed of the surrounding vehicles is controlled by the IDM, while their lane change behaviours follow the LC2013 model [40]. The maximum speed limit for all lanes is 35 m/s.

Figure 3 shows the high-level framework of the performance evaluation. P denotes the probability of inserting a vehicle each second, and P is set to 0.07, 0.14 and 0.28 to represent random mixed traffic with low, medium, and high densities, respectively. The proposed approach and the baseline methods are assessed in both the training and testing phases. The normal (medium) traffic density scenario is used for both training and testing of the autonomous agents, whereas the low- and high-density scenarios are used only for evaluating the trained RL agents. Unlike in training, the autonomous driving agent in testing receives the state \(\tilde{s}_t\) attacked by the trained adversarial agent instead of the clean state \(s_t\).
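
The testing protocol just described can be sketched as follows; this is a hedged Python sketch in which the environment interface, the episode bookkeeping, and the info collision key are illustrative assumptions.

```python
import torch

def evaluate_under_attack(env, policy_net, adversary, episodes=100, max_steps=200):
    """Test-time rollouts in which the agent observes the attacked state s_t + Delta* (sketch)."""
    returns, collisions = [], 0
    for _ in range(episodes):
        s, ep_ret = env.reset(), 0.0
        for _ in range(max_steps):
            s_t = torch.as_tensor(s, dtype=torch.float32)
            with torch.no_grad():
                s_tilde = s_t + adversary(s_t)                        # attacked observation
                a = torch.softmax(policy_net(s_tilde), dim=-1).argmax().item()
            s, r, done, info = env.step(a)
            ep_ret += r
            if done:
                collisions += int(info.get('collision', False))      # illustrative info key
                break
        returns.append(ep_ret)
    return sum(returns) / len(returns), collisions
```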

Fig. 3  The architecture of the evaluation framework with mixed traffic flows

Fig. 4  Learning curves based on the DQN, PPO, SAC and ORRL methods

4.2 Baseline Algorithms

To benchmark the performance of the proposed ORRL, the DQN, PPO, and discrete-action soft actor-critic (SAC) [43] algorithms are leveraged as baselines. The DQN- and PPO-based agents are implemented as classical baselines. In addition, discrete-action SAC is used as a state-of-the-art comparison, as it is one of the most advanced RL schemes with discrete actions.

Table 1 Final performance comparison in training

4.3 Comparative Evaluation

Figure 4 shows the training performance of each agent under the normal-density traffic flow setting. Figure 4(a), (b) and (c) show the average return, speed and collision times of the different agents, respectively. The solid curve represents the mean, and the shaded region denotes the standard deviation. Each algorithm is trained for 400 episodes over five runs with different random seeds in the normal-density traffic flow. The maximum length of each episode is 200 time steps.

The final performance of the different agents is provided in Table 1, where the bold number denotes the best result in each column. The average return, speed and collision times reflect the comprehensive performance, travel efficiency and driving safety of the autonomous driving agents, respectively. The average values of the metrics are computed over the final 2000 time steps (i.e., 200 time steps \(\times\) 10 episodes). The training results indicate that the ORRL agent greatly surpasses the baseline agents, with respect to both final performance and learning efficiency. For instance, compared with the DQN, PPO and SAC agents, the ORRL agent gains \(74.30\%\), \(68.56\%\) and \(13.29\%\) improvements in final return, respectively. Regarding the final speed, the ORRL agent achieves performance comparable to the SAC agent and exceeds the DQN and PPO agents. Moreover, the driving safety of the ORRL agent is enhanced by \(28.57\%\), \(80.00\%\) and \(54.55\%\) compared with the DQN, PPO and SAC agents, respectively.

Fig. 5  Assessment results for agents under different methods with the optimal adversarial attacks on observations

The average return, robustness and collision times are used to evaluate the performance of the different agents under adversarial observation perturbations, and Eq. (2) is adopted to assess policy robustness. The five policies finally trained by each algorithm under different random seeds are evaluated, and the average values of the metrics are calculated over 20000 time steps (i.e., 200 time steps \(\times\) 100 episodes).

Figure 5 illustrates the performance of the agents trained with different algorithms under the three stochastic highway conditions with various traffic flow densities. According to the results, the ORRL agent outperforms the baseline agents in terms of average return, robustness and collision times. It is noteworthy that, compared with the other policies, the variations of the ORRL policies under the optimal adversarial attacks on observations are slight.

Table 2 Evaluation of different agents in three complex highway driving conditions with optimal adversarial attacks

Table 2 quantitatively reports the evaluation results of the different autonomous driving agents. The ORRL agent surpasses the three baseline agents in all testing cases. For instance, in the low-density traffic flow, compared with the DQN, PPO and SAC agents, the ORRL agent gains \(56.14\%\), \(60.05\%\), and \(4.08\%\) improvements in average return, respectively; its average robustness is enhanced by \(41.02\%\), \(32.21\%\), and \(32.38\%\), respectively; and its average collision times are reduced by \(88.43\%\), \(54.84\%\), and \(44.00\%\), respectively.

Fig. 6  Collision times of autonomous driving agents in the high-density traffic flows under different attack situations

Under the normal density traffic flow, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by \(82.96\%\), \(58.42\%\), and \(27.52\%\), respectively. Meanwhile, the average robustness of the ORRL agent is improved by about \(79.54\%\), \(94.36\%\), and \(88.20\%\), respectively. Additionally, the safety of the ORRL agent is also enhanced by about \(66.89\%\), \(9.26\%\) and \(64.49\%\), respectively.

Further, under the high traffic density condition, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by \(94.69\%\), \(29.78\%\), and \(51.63\%\), respectively. The average robustness of the ORRL agent is enhanced by \(70.21\%\), \(94.89\%\), and \(84.34\%\) respectively, and the average collision times are reduced by \(49.51\%\), \(39.77\%\), and \(45.79\%\), respectively. As a consequence, it can be seen that the ORRL autonomous driving agent performs consistently under the optimal adversary in the three complex traffic scenarios.

Table 3 The computational cost (second) of different schemes during model optimization and inference

Figure 6 visually illustrates the collision times of the DQN, PPO, SAC and ORRL autonomous driving agents in the high-density stochastic dynamic traffic flows under different attack situations. As can be seen from Fig. 6, the adversarial attacks generated by the trained adversarial agents have a distinct impact on the driving safety of the autonomous vehicles driven by the baseline agents. For example, compared with the case without adversarial attacks, the average collision times of the attacked DQN, PPO, SAC and ORRL agents increase by about \(43.66\%\), \(59.81\%\), \(40.74\%\) and \(5.10\%\), respectively. Hence, the proposed ORRL autonomous driving agent performs consistently across different attack situations, which means that the ORRL policy model is robust to adversarial attacks on observations and highlights our primary contribution to realizing safe decision making for autonomous vehicles.

Table 3 compares the computational cost of the different methods in terms of model optimization and inference. The average time consumption of each model per update and per inference is reported separately. Compared with the other schemes, the average computational costs of model optimization and inference based on the DQN scheme are the lowest, at about \(3.05\times 10^{-3}\) s and \(2.08\times 10^{-4}\) s, respectively. Because the ORRL approach utilizes the adversarial model and requires solving the constrained optimization problem, its model optimization cost is higher than that of the baseline methods. However, the average inference time is close across the different methods.

5 Conclusions

In this work, a novel ORRL approach is proposed for safe lane change decision making in autonomous driving. An adversarial agent is trained online to generate adversarial observations with optimal perturbations, aiming to maximize the average variation distance of the perturbed policies. Furthermore, an ORAC method is developed to optimize the automated vehicle's lateral decision making policy while keeping the policy variations under adversarial attacks within expected bounds.

Training and testing of the policies are conducted in complex highway driving situations with different traffic flow densities simulated in SUMO. The results show that the developed approach enables autonomous vehicles to make safe lane change decisions under perception uncertainties. Additionally, compared with the three baselines, the agent trained with the proposed method shows better generalization ability and robustness under adversarial observation perturbations.

In the future, a certified ORRL algorithm will be investigated to provide theoretical guarantees regarding safe decision making for autonomous driving.