1 Introduction

Recently, autonomous driving has attracted considerable attention across the globe [1, 2]. Its expected societal benefits include safer transportation, less congestion, and lower emissions. However, safety remains a major obstacle to the advancement of autonomous driving [3, 4]. Unsafe driving behaviours of autonomous vehicles may jeopardize human life and cause significant financial loss [5]. Given these potential risks, it is evident that a substantial journey lies ahead before the stringent requirements and high expectations placed on autonomous driving are fulfilled [6].

Autonomous vehicles are products built on multi-disciplinary knowledge and theories [7]. The decision making module, often regarded as the intelligent brain of an autonomous vehicle, determines the driving mode or behaviour according to environmental information and vehicle status. To address such decision making problems, reinforcement learning (RL) has shown great potential and achieved impressive successes across a wide range of challenging tasks [8, 9]. As a result, researchers have explored various RL algorithms to cope with a sequence of autonomous driving tasks [10, 11].

In many studies, RL has been applied to generate lane change behaviours during autonomous driving [12, 13]. One popular approach builds on the deep Q-network (DQN). For instance, a lateral control policy is developed through DQN with safety checkers for autonomous vehicles in Ref. [14]. In Ref. [15], a lane-change algorithm is developed by combining a partially observable Markov decision process with DQN. A combined DQN and rule-based method is proposed for learning lane changing decisions in automated vehicles in Ref. [16]. A DQN-based harmonious lane-change strategy is presented for automated driving to enhance overall transportation efficiency in Ref. [17]. A lane-change algorithm is proposed to optimize automated vehicle decision making using DQN and risk-awareness prioritized replay in Ref. [18].

Apart from the above DQN-based paradigms, other RL algorithms have also been applied to automated vehicle decision making. For example, a multi-objective driving policy learning method is developed to optimize automated vehicle lateral decision making [19]. A decision making scheme for the lane change task is developed via attention-based hierarchical deep RL in Ref. [20]. An autonomous lateral decision making method is developed using proximal policy optimization (PPO) in Ref. [21].

The speed mode (e.g., keeping, acceleration and deceleration) or target speed of autonomous vehicles can be learned via DQN [22, 23], deep deterministic policy gradient (DDPG) [24, 25], or PPO [26] algorithms. For example, the longitudinal acceleration level of an autonomous vehicle at intersections can be determined by a learned belief updater and safe RL relying on a model-checker in Ref. [27]. In addition, the lane change behaviour and target speed of autonomous driving agents are determined simultaneously by RL algorithms in Refs. [28,29,30]. For example, a coordinated decision making scheme using the DDPG algorithm is developed to learn steering and throttle maneuvers for autonomous driving in Ref. [31].

The autonomous driving decision solutions mentioned above, which are based on RL algorithms, have yielded remarkable outcomes. However, it is crucial to acknowledge that real-world environments are prone to sensor noise and measurement errors. These factors can lead autonomous driving agents to make suboptimal decisions or, in extreme cases, cause catastrophic damage. The lack of robustness guarantees limits their application in safety-critical autonomous driving domains. In view of these hazards, autonomous driving must guarantee that its decision making behaviours can cope with natural sensing and perception uncertainties, and particularly with adversarial attacks on observations.

A handful of existing studies have endeavoured to tackle this challenge. In Refs. [11, 32], robust RL frameworks against white-box and black-box attacks on perception systems are developed, respectively, to ensure the robustness of decision making for autonomous vehicles. In Ref. [33], a robust decision making method is proposed to enhance the robustness and safety of autonomous driving. This scheme incorporates a switching mechanism over principle-based policies, aiming to adapt effectively to various environments and ensure reliable decision making in unseen scenarios. Nevertheless, the above studies may not be able to provide guarantees against worst-case perturbations. This limitation arises because the autonomous driving agents trained with these methods are not exposed to the optimal adversarial attacks generated by a learnable adversary, which can produce stronger attacks than existing white-box or black-box techniques [34].

Accordingly, this paper presents a novel observation-robust RL (ORRL) scheme for safe decision making in the lane change task, which aims to ensure autonomous driving performance and policy robustness against adversarial attacks on observations. The main contributions of this work are summarized as follows:

  • The proposed ORRL scheme enables autonomous driving agents to approximate robust driving policies against adversarial attacks on observations and guarantee travel safety and efficiency.

  • An adversarial agent is trained online to approximate the optimal adversarial attacks on observations, aiming to maximize the average variation distance of the perturbed policies, measured by the Jensen–Shannon (JS) divergence.

  • A novel observation-robust actor-critic (ORAC) method is developed to maximize expected return and keep the optimal adversary-induced performance variations within boundaries.

Multiple highway driving conditions with various traffic flow densities are simulated in Simulation of Urban MObility (SUMO) [35, 36] to assess the feasibility and effectiveness of the proposed ORRL method. The results indicate that the developed safe decision making method is advantageous over three existing state-of-the-art methods.

The remaining sections of this paper are organized as follows. In Sect. 2, the developed ORRL scheme of safe decision making for autonomous driving is presented. In Sect. 3, the detailed algorithms with their implementation of the proposed approach are introduced. In Sect. 4, the evaluation results are discussed and analysed. Section 5 concludes this study.

Fig. 1  The proposed safe decision making framework

2 Observation-Robust Reinforcement Learning for Safe Autonomous Driving

2.1 Technique Framework

In the context of lane change for autonomous driving, the high-level framework of safe decision making using ORRL to deal with adversarial attacks on observations is presented in Fig. 1. The ego agent is an autonomous vehicle, coloured gold. Surrounding vehicles in other colours are controlled by the SUMO intelligent driver model (IDM). In addition, the ego vehicle's action is discrete; the action set contains lane keeping, left lane-change, and right lane-change behaviours.

In Fig. 1, the ORAC block is adopted to optimize safe lane change policies and allows the agent to interact with the environment. The state s, the reward r, and the adversary \(\Delta ^{*}\) on observations are its inputs, while its outputs are the agent action a and the policy \(\pi (a\vert s)\).

The aim of the adversarial agent module is to approximate the optimal adversarial attacks that are able to maximize the average variation distance on perturbed policies. This block’s input contains the state s and action a, and its output is the optimal adversarial attacks \(\Delta ^{*}\) on observations.

In addition, the environment block produces the next-step state \(s_{t+1}\) and the reward \(r_t\). Its input is the action \(a_t\) together with the policy \(\pi (a_t\vert s_t)\), where t denotes the time step.

2.2 Observation-Robust MDP

In this section, the observation-robust Markov decision process (ORMDP) is developed to model the decision making of agents under observation perturbations and policy constraints.

Definition 1: An ORMDP can be represented by the 7-tuple \(\left( {\mathcal {S}}, {\mathcal {A}}, p, r, c, \Delta , \gamma \right)\), where \({\mathcal {S}}\) denotes the state space, \({\mathcal {A}}\) represents the action space, p is the state transition probability, \(r: {\mathcal {S}} \times {\mathcal {A}} \rightarrow \mathbb {R}\) denotes the reward function, c represents the constraint function, \(\Delta\) denotes the observation uncertainty, and \(\gamma \in (0, 1)\) indicates a discount factor.

ORMDP seeks to solve the constrained optimization problem formulated as follows:

$$\begin{aligned}&\max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) \right] \\&\text {s.t.} \ \ \mathbb {E} \left[ c(s,\pi ,\Delta ) \right] \le \epsilon \nonumber \end{aligned}$$
(1)

where T denotes the time horizon, and \(\epsilon\) indicates an expected threshold value.

2.3 Adversarial Agent

The aim of adversarial agent training is to approximate the optimal adversarial attacks on observations.

The JS divergence can be seen as a smoothed and symmetrized Kullback–Leibler (KL) divergence [37, 38]. Importantly, unlike the KL divergence, the JS divergence between two probability distributions is bounded (by 1 when the base-2 logarithm is used). Therefore, in this paper, the JS divergence is used to model the variations of the perturbed policy caused by adversarial attacks. The optimization objective based on the JS divergence is defined as:

$$\begin{aligned} c(s,\pi ,\Delta )&= D_{\rm JS} \big (\pi (a\vert s)\vert \vert \pi (\tilde{a}\vert \tilde{s}) \big ) \end{aligned}$$
(2)
$$\begin{aligned}&= D_{\rm JS} \big (\pi (a\vert s)\vert \vert \pi (\tilde{a}\vert s+\Delta ) \big ) \nonumber \\&= \frac{1}{2} D_{\rm KL}\big (\pi (a\vert s)\vert \vert m \big ) \nonumber \\&\quad + \frac{1}{2} D_{\rm KL} \big (\pi (\tilde{a}\vert s+\Delta )\vert \vert m \big ) \nonumber \\ m&=\frac{1}{2}\big (\pi (a\vert s)+\pi (\tilde{a}\vert s+\Delta ) \big ) \end{aligned}$$
(3)

where \(D_{\rm JS}\) denotes the JS-divergence-based distance, \(D_{\rm KL}\) represents the KL-divergence-based distance, and \(\tilde{a}\) and \(\tilde{s}\) are the action and state under observation perturbations, respectively.
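
To make the constraint in Eqs. (2)–(3) concrete, the following minimal Python sketch computes the JS-divergence-based variation distance between the clean and perturbed discrete policies. It assumes a policy network that outputs action logits and uses PyTorch; the names policy_net, js_divergence and policy_constraint are illustrative and not from the paper.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence of Eqs. (2)-(3) for discrete distributions.

    p, q: tensors of shape (batch, n_actions) whose rows sum to 1.
    With the natural logarithm the result is bounded by ln 2; dividing by
    ln 2 (or using the base-2 logarithm) bounds it by 1.
    """
    m = 0.5 * (p + q)                                        # mixture m = (p + q) / 2
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def policy_constraint(policy_net, s, delta):
    """c(s, pi, Delta): JS distance between pi(.|s) and pi(.|s + Delta)."""
    p_clean = F.softmax(policy_net(s), dim=-1)               # pi(a|s)
    p_perturbed = F.softmax(policy_net(s + delta), dim=-1)   # pi(a~|s + Delta)
    return js_divergence(p_clean, p_perturbed)
```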

The adversarial agent's optimization problem can be formulated as:

$$\begin{aligned}&\Delta ^{*} \in \arg \underset{{\Delta }}{\text {max}} \ \ \mathbb {E}[c(s, \pi , \Delta )] \\&\text {s.t.} \ \ \left| \Delta \right| \le \eta \nonumber \end{aligned}$$
(4)

where \(\eta\) denotes the perturbation limit.

To simplify the constrained optimization problem above, this work uses the hyperbolic tangent function tanh\((\cdot )\) to bound the magnitude of the observational perturbation:

$$\begin{aligned} \Delta (x) = \alpha \frac{e^x-e^{-x}}{e^x+e^{-x}} \end{aligned}$$
(5)

where x represents the optimization variable, \(\alpha\) denotes an upper bound on the perturbation magnitude, and e is the base of the natural logarithm. With this parameterization, the constrained optimization problem in Eq. (4) can be reformulated as follows:

$$\begin{aligned} x^{*} \in \arg \underset{{x}}{\text {max}} \ \ \mathbb {E} \big [c\big (s, \pi , \Delta (x) \big ) \big ] \end{aligned}$$
(6)

where \(x^{*}\) represents the optimal solution.

The optimal adversarial observational perturbation \(\Delta ^{*}\) can then be obtained as:

$$\begin{aligned} \Delta ^{*} = \alpha \frac{e^{x^{*}}-e^{-x^{*}}}{e^{x^{*}}+e^{-x^{*}}} \end{aligned}$$
(7)

Therefore, to approximate the optimal solution \(x^{*}\), the adversarial agent is optimized by maximizing the following objective function:

$$\begin{aligned} J_{\Delta }(\bar{\theta }) = \mathbb {E}\bigg [c \bigg (s, \pi , \Delta \big (x(s;\,\bar{\theta }) \big ) \bigg )\bigg ] \end{aligned}$$
(8)

where \(\bar{\theta }\) denotes the adversary network parameters. Note that the adversarial agent takes the state s as input and outputs the solution x.
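
Following Eqs. (5)–(8), a hedged sketch of the adversarial agent is given below. It reuses the policy_constraint helper above; the network width and the bound alpha are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class Adversary(nn.Module):
    """Maps a state s to a bounded perturbation Delta = alpha * tanh(x(s)), cf. Eqs. (5)-(7)."""

    def __init__(self, state_dim, alpha=0.05, hidden=128):
        super().__init__()
        self.alpha = alpha                     # upper bound on the perturbation magnitude
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s):
        x = self.net(s)                        # unconstrained optimization variable x(s; theta_bar)
        return self.alpha * torch.tanh(x)      # |Delta| <= alpha by construction

def adversary_loss(adversary, policy_net, s):
    """Negative of J_Delta in Eq. (8): minimizing it maximizes the expected JS constraint."""
    delta = adversary(s)
    return -policy_constraint(policy_net, s, delta).mean()
```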

2.4 Observation-Robust Actor-Critic

This section introduces the proposed ORAC algorithm that aims to solve the following constrained optimization problem concerning ORMDP under the optimal adversarial attacks \(\Delta ^{*}\):

$$\begin{aligned}&\max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) \right] \\&\text {s.t.} \ \ \mathbb {E} \left[ c(s, \pi ,\Delta ^{*}) \right] \le \epsilon \nonumber \end{aligned}$$
(9)

To solve the ORMDP, this work uses a policy iteration method, referred to as observation-robust policy iteration (ORPI). The ORPI algorithm consists of two key processes: policy evaluation (PE) and policy improvement (PI). These two steps are updated iteratively until convergence.

Based on duality theory [39], the Lagrangian of the constrained optimization problem in Eq. (9) can be expressed as follows:

$$\begin{aligned} \begin{aligned} L(\pi , \beta )=\mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \end{aligned} \end{aligned}$$
(10)

where \(\beta\) denotes the dual variable.

2.4.1 Observation-Robust PE

The action-value function \(Q^{\pi }(\cdot )\) under the optimal adversary \(\Delta ^{*}\) can be learned iteratively for a fixed policy. The iteration can start from any \(Q^{\pi }(\cdot ): {\mathcal {S}} \rightarrow \mathbb {R}^{\left| {\mathcal {A}} \right| }\) and proceeds by repeatedly applying the Bellman backup operator \(\mathcal {T}^{\pi }\), given by:

$$\begin{aligned} \mathcal {T}^{\pi }Q^{\pi }(s_t)&:= r(s_t, a_t) + \gamma \mathbb {E} [V^{\pi }(s_{t+1})] \end{aligned}$$
(11)

where

$$\begin{aligned} \mathbb {E} [V^{\pi }(s_{t+1})]&= \pi (s_{t+1})^{\intercal }Q^{\pi }(s_{t+1}) \end{aligned}$$
(12)

represents the expected value function under \(\Delta ^{*}\), which can be calculated directly from the agent's discrete action distribution.

To speed up the training of the policy model, the ORAC algorithm uses two parameterized action-value functions with parameters \(\phi ^z\), \(z \in \left\{ 1,2\right\}\), whose parameters are optimized by minimizing the critic loss function:

$$\begin{aligned} J_{c}(\phi ^{z})&= \underset{\underset{Ts \sim {\mathcal {D}}}{a_{t+1} \sim {\pi }}}{\mathbb {E}} \left[ \left\| y_{t} - Q^{\pi }(s_{t};\,\phi ^{z}) \right\| _{2}^{2} \right] \end{aligned}$$
(13)

where \(y_t\) represents the target value of the action-value function at time step t, and Ts denotes a transition sampled from the replay buffer \(\mathcal {D}\).

The smaller of the two \(Q^{\pi }(\cdot )\) estimates is adopted to mitigate value overestimation when training the critic networks. Consequently, \(y_t\) can be given by:

$$\begin{aligned} y_t&= r(s_t, a_{t}) +\gamma \pi (s_{t+1})^{\intercal } \Big [\underset{z\in \left\{ 1,2\right\} }{\text {min}}\hat{Q}^{\pi }(s_{t+1}; \,\bar{\phi }^z) \Big ] \end{aligned}$$
(14)

where \(\bar{\phi }^z\) denotes the parameters of the target action-value function \(\hat{Q}^{\pi }(\cdot )\), which are updated through Polyak averaging:

$$\begin{aligned} \bar{\phi }^z \leftarrow \delta \bar{\phi }^z + (1-\delta ) \phi ^z \end{aligned}$$
(15)

where \(\delta\) is a smoothing coefficient between 0 and 1.
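
The observation-robust policy evaluation step of Eqs. (12)–(15) can be sketched as follows. This is a minimal PyTorch sketch assuming discrete-action critics that output one Q value per action; the batch layout and the done flag for terminal states are implementation assumptions, and optimizer steps are left to the caller.

```python
import torch

def critic_targets_and_losses(critics, target_critics, policy_net, batch,
                              gamma=0.99, delta_polyak=0.995):
    """Clipped double-Q targets (Eq. (14)), critic losses (Eq. (13)) and Polyak updates (Eq. (15)).

    critics / target_critics: lists of two networks mapping states to per-action Q values.
    batch: dict of tensors with keys 's', 'a', 'r', 's_next', 'done'.
    """
    s, a, r = batch['s'], batch['a'], batch['r']
    s_next, done = batch['s_next'], batch['done']

    with torch.no_grad():
        pi_next = torch.softmax(policy_net(s_next), dim=-1)      # pi(s_{t+1})
        q_next = torch.min(target_critics[0](s_next),
                           target_critics[1](s_next))            # element-wise min of target critics
        v_next = (pi_next * q_next).sum(dim=-1)                  # pi^T Q, Eq. (12)
        y = r + gamma * (1.0 - done) * v_next                    # target y_t, Eq. (14)

    losses = []
    for q_net in critics:
        q_sa = q_net(s).gather(1, a.long().unsqueeze(-1)).squeeze(-1)   # Q^pi(s_t; phi^z) at a_t
        losses.append(((y - q_sa) ** 2).mean())                         # Eq. (13)

    # Polyak averaging of the target parameters, Eq. (15); usually applied after the gradient step.
    with torch.no_grad():
        for q_net, q_tgt in zip(critics, target_critics):
            for p, p_tgt in zip(q_net.parameters(), q_tgt.parameters()):
                p_tgt.mul_(delta_polyak).add_((1.0 - delta_polyak) * p)

    return losses
```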

2.4.2 Observation-Robust PI

To further improve the policy of the ORRL agent, the expected return should be maximized while satisfying the constraint \(c(\cdot )\).

The Lagrange dual function can be derived based on Eq. (10), as:

$$\begin{aligned}&\bar{L}(\beta ) = \max _{\pi } L(\pi , \beta ) \\&= \max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \nonumber \end{aligned}$$
(16)

Additionally, the Lagrange dual problem of Eq. (9) can be written as:

$$\begin{aligned}&\min _{\beta \ge 0} \bar{L}(\beta ) = \min _{\beta \ge 0} \max _{\pi } L(\pi , \beta ) \\&=\min _{\beta \ge 0} \max _{\pi } \mathbb {E} \left[ \sum _{t=0}^T \gamma ^t r(s_t, a_t) + \beta \big (\epsilon - c(s,\pi ,\Delta ^{*}) \big ) \right] \nonumber \end{aligned}$$
(17)

The optimal policy \(\pi ^{*}\) and the optimal dual variable \(\beta ^{*}\) can be found by the following alternating procedure. First, for a fixed dual variable \(\beta\), the optimal policy \(\pi ^{*}\) is learned by maximizing \(L(\pi , \beta )\). Then, \(\pi ^{*}\) is plugged in and the optimal dual variable \(\beta ^{*}\) is approximated by minimizing \(L(\pi ^{*}, \beta )\). According to Eq. (17), the following relations can be derived:

$$\begin{aligned} \pi ^{*}= & {} \arg \max _{\pi } L(\pi , \beta ) \end{aligned}$$
(18)
$$\begin{aligned} \beta ^{*}= & {} \arg \min _{\beta \ge 0} L(\pi ^{*}, \beta ) \end{aligned}$$
(19)

In order to minimize the estimation error of the expected return, the double \(Q^{\pi }(\cdot )\) trick is employed. As a result, the parameter \(\theta\) of the policy model is updated by maximizing the objective function of the actor network, given by:

$$\begin{aligned} J_a(\theta )=&\underset{\underset{Ts \sim {\mathcal {D}}}{a_{t} \sim {\pi }}}{\mathbb {E}} \Big [\pi (s_{t}; \theta )^{\intercal }[\underset{z\in \left\{ 1,2\right\} }{\text {min}}Q^{\pi }(s_{t}; \phi ^z) \nonumber \\&- \beta c(s, \pi ,\Delta )]\Big ] \end{aligned}$$
(20)

Moreover, the dual variable can be optimized by minimizing the following objective function:

$$\begin{aligned} J_{\text {d}} (\beta )&= \underset{\underset{Ts \sim \mathcal {D}}{a_{t} \sim {\pi }}}{\mathbb {E}} \big [\pi (s_{t}; \theta )^{\intercal }[\beta \epsilon - \beta c(s, \pi ,\Delta )]\big] \end{aligned}$$
(21)
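
Likewise, Eqs. (20)–(21) can be written as losses to minimize. The sketch below reuses the policy_constraint, critic, and adversary objects from the earlier sketches, and parameterizes the non-negative dual variable as \(\beta = \exp(\log \beta)\), which is an implementation convention rather than part of the paper.

```python
import torch

def actor_and_dual_losses(policy_net, critics, adversary, s, log_beta, epsilon=0.1):
    """Actor objective (Eq. (20)) and dual-variable objective (Eq. (21)) as minimization losses."""
    beta = log_beta.exp()                                          # beta >= 0 by construction
    pi = torch.softmax(policy_net(s), dim=-1)                      # pi(s_t; theta)
    q_min = torch.min(critics[0](s), critics[1](s))                # min over the two critics
    c = policy_constraint(policy_net, s, adversary(s).detach())    # c(s, pi, Delta*)

    # Eq. (20): maximize pi^T [min_z Q - beta * c]  ->  minimize its negative (beta held fixed).
    actor_loss = -((pi * (q_min - beta.detach() * c.unsqueeze(-1))).sum(dim=-1)).mean()

    # Eq. (21): since the action probabilities sum to one, the objective reduces to
    # E[beta * (epsilon - c)]; only beta receives gradients here.
    dual_loss = (beta * (epsilon - c.detach())).mean()
    return actor_loss, dual_loss
```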

3 Algorithm Implementation

Algorithm 1 overviews the ORRL approach in detail. The ORRL method updates the driving policies of the agent through the following procedure. The initial network parameters of the actor and critic are sampled from a stochastic distribution. In each iteration, the RL agent gathers data over M time steps and saves them to the buffer \({\mathcal {D}}\). The environment includes the reward function and the transition probability. The optimal adversarial attacks \(\Delta ^{*}\) on observations are approximated via the adversarial agent, and \(\rho\) represents a delayed update coefficient. The policies of the RL agent are then updated iteratively. \(d_t\) is a done signal indicating that the ego vehicle has encountered a collision at time step t.

Algorithm 1
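
As a condensed, hedged illustration of Algorithm 1, the sketch below wires the pieces together. The buffer and environment interfaces, the optimizer choices, and the hyperparameter values are illustrative assumptions rather than the paper's exact implementation; it reuses adversary_loss, critic_targets_and_losses and actor_and_dual_losses from Sect. 2.

```python
import torch

def train_orrl(env, policy_net, critics, target_critics, adversary, buffer,
               iterations=400, M=200, batch_size=256, rho=2, lr=3e-4, epsilon=0.1):
    """Condensed ORRL training loop (cf. Algorithm 1); interfaces are illustrative."""
    log_beta = torch.zeros(1, requires_grad=True)
    opt_actor = torch.optim.Adam(policy_net.parameters(), lr=lr)
    opt_critic = torch.optim.Adam([p for q in critics for p in q.parameters()], lr=lr)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=lr)
    opt_beta = torch.optim.Adam([log_beta], lr=lr)

    for _ in range(iterations):
        # 1) Gather M transitions with the current policy and store them in the buffer D.
        s = env.reset()
        for _ in range(M):
            with torch.no_grad():
                probs = torch.softmax(policy_net(torch.as_tensor(s, dtype=torch.float32)), dim=-1)
                a = torch.multinomial(probs, 1).item()
            s_next, r, d, _ = env.step(a)          # d: done signal d_t (e.g., collision)
            buffer.add(s, a, r, s_next, d)
            s = env.reset() if d else s_next

        for step in range(M):
            batch = buffer.sample(batch_size)
            # 2) Update the adversarial agent to approximate Delta* (Eq. (8)).
            opt_adv.zero_grad()
            adversary_loss(adversary, policy_net, batch['s']).backward()
            opt_adv.step()
            # 3) Observation-robust policy evaluation (Eqs. (13)-(15)).
            opt_critic.zero_grad()
            sum(critic_targets_and_losses(critics, target_critics, policy_net, batch)).backward()
            opt_critic.step()
            # 4) Delayed policy improvement and dual update (Eqs. (20)-(21)); rho is the delay.
            if step % rho == 0:
                actor_loss, dual_loss = actor_and_dual_losses(
                    policy_net, critics, adversary, batch['s'], log_beta, epsilon)
                opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
                opt_beta.zero_grad(); dual_loss.backward(); opt_beta.step()
```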

To learn the lane change policies of autonomous vehicles, the RL agent's state, action and reward need to be determined. The autonomous driving agent's state contains 16 dimensions, detailed in Fig. 2. The surrounding (social) vehicles perform lane change maneuvers via the SUMO LC2013 model [40].

Fig. 2  States of the autonomous driving agent

Furthermore, the autonomous driving agent's action is discrete and comprises left lane changing, right lane changing and lane keeping.

A tricky problem is to learn safe lane-change policies against adversarial perturbations on state observations from scratch, without prior knowledge. Consequently, the reward function plays a pivotal role in optimizing the agent's policies. Safety, efficiency and comfort factors are considered when designing the agent's reward function.

To promote travel efficiency of the ego vehicle, a reward function \(r(\cdot )\) is designed in which the reward is proportional to the ego vehicle's speed, i.e., \(v_0/35\). This implies that the autonomous driving agent receives higher rewards by operating at higher speeds. If the headway of the ego vehicle falls below 30 m, the reward is decreased by 0.1, which prevents the ego car from constantly following its preceding vehicle. Both vehicle dynamics stability and collisions are considered in terms of driving safety. If the upper bound of the desired yaw rate, \(k\cdot \bar{\mu } \cdot g/v_0\), given in Ref. [41] is exceeded, the agent's reward is decreased by 0.05, where \(\bar{\mu }\) denotes the adhesion coefficient, k represents the dynamic factor proposed in Ref. [42], and g is the gravitational acceleration constant. In addition, if the ego vehicle collides, the reward is decreased by 0.1. If a lateral decision is made by the ego vehicle at a speed above 20 m/s, the reward is diminished by \(v_0/350\), which aims to avoid frequent lane changing at high speed. Algorithm 2 details the designed reward function.

Algorithm 2
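
A hedged sketch of the reward described above (cf. Algorithm 2) is given below. The thresholds are taken from the preceding paragraph; the function signature and the numerical values of k and \(\bar{\mu }\) are illustrative placeholders, since the paper does not state them here.

```python
def compute_reward(v0, headway, yaw_rate, lane_change, collided,
                   k=1.0, mu_bar=0.85, g=9.81):
    """Reward sketch based on the description above.

    v0:          ego longitudinal speed (m/s)
    headway:     distance to the preceding vehicle (m)
    yaw_rate:    measured yaw rate (rad/s)
    lane_change: True if a lateral (lane-change) action was taken this step
    collided:    True if the ego vehicle collided this step
    k, mu_bar, g: dynamic factor, adhesion coefficient, gravity constant (placeholder values)
    """
    r = v0 / 35.0                             # efficiency: proportional to speed
    if headway < 30.0:
        r -= 0.1                              # discourage persistent close following
    if abs(yaw_rate) > k * mu_bar * g / max(v0, 1e-3):
        r -= 0.05                             # dynamics-stability penalty
    if collided:
        r -= 0.1                              # collision penalty
    if lane_change and v0 > 20.0:
        r -= v0 / 350.0                       # discourage frequent high-speed lane changes
    return r
```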

The actor, critic, and adversary networks each use two fully connected hidden layers with 128 units, and ReLU is adopted as the hidden-layer activation function. The actor and critic networks have 16-dimensional inputs and 3-dimensional outputs, while both the input and output of the adversary network are 16-dimensional. Table A1 of Appendix A provides the main hyperparameters of the ORRL algorithm.
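For reference, a minimal sketch of these networks under the stated dimensions is given below (PyTorch is assumed; the mlp helper is illustrative, and the adversary backbone would additionally be squashed by the tanh bound of Eq. (5), as in the Adversary sketch of Sect. 2.3).

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    """Two fully connected hidden layers of 128 units with ReLU activations."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, n_actions = 16, 3
actor = mlp(state_dim, n_actions)                         # 16-dim state in, 3 action logits out
critics = [mlp(state_dim, n_actions) for _ in range(2)]   # two Q networks, one value per action
adversary_backbone = mlp(state_dim, state_dim)            # 16-dim in and out, cf. the Adversary sketch
```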

4 Experimental Setup and Comparative Evaluation

4.1 Simulation Environment

To evaluate the performance of the developed safe decision making scheme for automated vehicles, training and testing are conducted with SUMO. Three highway scenarios are set up with random mixed traffic flows of various densities. The longitudinal speed of the surrounding vehicles is controlled by the IDM, while their lane change behaviours follow the LC2013 model [40]. The maximum speed limit for all lanes is 35 m/s.

Figure 3 shows the high-level framework of the performance evaluation. P denotes the probability of inserting a vehicle each second, and P is set to 0.07, 0.14 and 0.28 to represent random mixed traffic with low, medium, and high densities, respectively. The proposed approach and the baseline methods are assessed in both the training and testing phases. The normal (medium) traffic density scenario is used for both training and testing of the autonomous agents, whereas the low- and high-density scenarios are used only for evaluating the trained RL agents. Unlike in training, the autonomous driving agent in testing receives the state \(\tilde{s}_t\) attacked by the trained adversarial agent instead of the clean state \(s_t\).
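
The testing protocol just described can be sketched as follows; this is a hedged Python sketch in which the environment interface, the episode bookkeeping, and the info collision key are illustrative assumptions.

```python
import torch

def evaluate_under_attack(env, policy_net, adversary, episodes=100, max_steps=200):
    """Test-time rollouts in which the agent observes the attacked state s_t + Delta* (sketch)."""
    returns, collisions = [], 0
    for _ in range(episodes):
        s, ep_ret = env.reset(), 0.0
        for _ in range(max_steps):
            s_t = torch.as_tensor(s, dtype=torch.float32)
            with torch.no_grad():
                s_tilde = s_t + adversary(s_t)                        # attacked observation
                a = torch.softmax(policy_net(s_tilde), dim=-1).argmax().item()
            s, r, done, info = env.step(a)
            ep_ret += r
            if done:
                collisions += int(info.get('collision', False))      # illustrative info key
                break
        returns.append(ep_ret)
    return sum(returns) / len(returns), collisions
```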

Fig. 3  The architecture of the evaluation framework with mixed traffic flows

Fig. 4  Learning curves based on the DQN, PPO, SAC and ORRL methods

4.2 Baseline Algorithms

To benchmark the performance of the proposed ORRL, the DQN, PPO, and discrete-action soft actor-critic (SAC) [43] algorithms are leveraged as baselines. The DQN- and PPO-based agents are implemented as classical baselines. In addition, discrete-action SAC is used as a state-of-the-art comparison, as it is one of the most advanced RL schemes with discrete actions.

Table 1 Final performance comparison in training

4.3 Comparative Evaluation

Figure 4 shows the training performance of each agent under the normal-density traffic flow setting. Figure 4(a), (b) and (c) show the average return, speed and collision times of the different agents, respectively. The solid curve represents the mean, and the shaded region denotes the standard deviation. Each algorithm is trained for 400 episodes over five runs with different random seeds in the normal-density traffic flow. The maximum length of each episode is 200 time steps.

The final performance of the different agents is provided in Table 1, where the bold number denotes the best result in each column. The average return, speed and collision times reflect the comprehensive performance, travel efficiency and driving safety of the autonomous driving agents, respectively. The average values of the metrics are computed over the final 2000 time steps (i.e., 200 time steps \(\times\) 10 episodes). The training results indicate that the ORRL agent greatly surpasses the baseline agents, with respect to both final performance and learning efficiency. For instance, compared with the DQN, PPO and SAC agents, the ORRL agent gains \(74.30\%\), \(68.56\%\) and \(13.29\%\) improvements in final return, respectively. Regarding the final speed, the ORRL agent achieves performance comparable to the SAC agent and exceeds the DQN and PPO agents. Moreover, the driving safety of the ORRL agent is enhanced by \(28.57\%\), \(80.00\%\) and \(54.55\%\) compared with the DQN, PPO and SAC agents, respectively.

Fig. 5  Assessment results for agents under different methods with the optimal adversarial attacks on observations

The average return, robustness and collision times are used to evaluate the performance of the different agents under adversarial observation perturbations, and Eq. (2) is adopted to assess policy robustness. The five policies finally trained by each algorithm under different random seeds are evaluated, and the average values of the metrics are calculated over 20000 time steps (i.e., 200 time steps \(\times\) 100 episodes).

Figure 5 illustrates the performance of the agents trained with different algorithms under the three stochastic highway conditions with various traffic flow densities. According to the results, the ORRL agent outperforms the baseline agents in terms of average return, robustness and collision times. It is noteworthy that, compared with the other policies, the variations of the ORRL policies under the optimal adversarial attacks on observations are slight.

Table 2 Evaluation of different agents in three complex highway driving conditions with optimal adversarial attacks

Table 2 quantitatively reports the evaluation results of the different autonomous driving agents. The ORRL agent surpasses the three baseline agents in all testing cases. For instance, in the low-density traffic flow, compared with the DQN, PPO and SAC agents, the ORRL agent gains \(56.14\%\), \(60.05\%\), and \(4.08\%\) improvements in average return, respectively; its average robustness is enhanced by \(41.02\%\), \(32.21\%\), and \(32.38\%\), respectively; and its average collision times are reduced by \(88.43\%\), \(54.84\%\), and \(44.00\%\), respectively.

Fig. 6  Collision times of autonomous driving agents in the high-density traffic flows under different attack situations

Under the normal density traffic flow, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by \(82.96\%\), \(58.42\%\), and \(27.52\%\), respectively. Meanwhile, the average robustness of the ORRL agent is improved by about \(79.54\%\), \(94.36\%\), and \(88.20\%\), respectively. Additionally, the safety of the ORRL agent is also enhanced by about \(66.89\%\), \(9.26\%\) and \(64.49\%\), respectively.

Further, under the high traffic density condition, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by \(94.69\%\), \(29.78\%\), and \(51.63\%\), respectively. The average robustness of the ORRL agent is enhanced by \(70.21\%\), \(94.89\%\), and \(84.34\%\) respectively, and the average collision times are reduced by \(49.51\%\), \(39.77\%\), and \(45.79\%\), respectively. As a consequence, it can be seen that the ORRL autonomous driving agent performs consistently under the optimal adversary in the three complex traffic scenarios.

Table 3 The computational cost (second) of different schemes during model optimization and inference

Figure 6 visually illustrates the collision times of the DQN, PPO, SAC and ORRL autonomous driving agents in the high-density stochastic dynamic traffic flows under different attack situations. As can be seen from Fig. 6, the adversarial attacks generated by the trained adversarial agents have a distinct impact on the driving safety of the autonomous vehicles driven by the baseline agents. For example, compared with the case without adversarial attacks, the average collision times of the attacked DQN, PPO, SAC and ORRL agents increase by about \(43.66\%\), \(59.81\%\), \(40.74\%\) and \(5.10\%\), respectively. Hence, the proposed ORRL autonomous driving agent performs consistently across different attack situations, which means that the ORRL policy model is robust to adversarial attacks on observations and highlights our primary contribution to realizing safe decision making for autonomous vehicles.

Table 3 compares the computational cost of the different methods in terms of model optimization and inference. The average time consumption of each model per update and per inference is reported separately. Compared with the other schemes, the average computational costs of model optimization and inference based on the DQN scheme are the lowest, at about \(3.05\times 10^{-3}\) s and \(2.08\times 10^{-4}\) s, respectively. Because the ORRL approach utilizes the adversarial model and requires solving the constrained optimization problem, its model optimization cost is higher than that of the baseline methods. However, the average inference time is close across the different methods.

5 Conclusions

In this work, a novel ORRL approach is proposed for safe lane change decision making in autonomous driving. An adversarial agent is trained online to generate adversarial observations with optimal perturbations, aiming to maximize the average variation distance of the perturbed policies. Furthermore, an ORAC method is developed to optimize the automated vehicle's lateral decision making policy while keeping the policy variations under adversarial attacks within expected bounds.

Training and testing of the policies are conducted in complex highway driving situations with different traffic flow densities simulated in SUMO. The results show that the developed approach enables autonomous vehicles to make safe lane change decisions under perception uncertainties. Additionally, compared with the three baselines, the agent trained with the proposed method shows better generalization ability and robustness under adversarial observation perturbations.

In the future, a certified ORRL algorithm will be investigated to provide theoretical guarantees regarding safe decision making for autonomous driving.