Towards Safe Autonomous Driving: Decision Making with Observation-Robust Reinforcement Learning

Most real-world situations involve unavoidable measurement noise or perception errors, which can result in unsafe decision making or even casualties in autonomous driving. To address these issues and further improve safety, automated driving must be capable of handling perception uncertainties. This paper presents an observation-robust reinforcement learning approach against observational uncertainties to realize safe decision making for autonomous vehicles. Specifically, an adversarial agent is trained online to generate optimal adversarial attacks on observations, attempting to amplify the average variation distance of the perturbed policies. In addition, an observation-robust actor-critic approach is developed to enable the agent to learn optimal policies while ensuring that the policy changes induced by the optimal adversarial attacks remain within a certain bound. Lastly, the safe decision making scheme is evaluated on a lane change task in complex highway traffic scenarios. The results show that the developed approach ensures autonomous driving performance as well as policy robustness against adversarial attacks on observations.


Introduction
Recently, autonomous driving has attracted considerable attention across the globe [1,2]. Its expected societal benefits include safer transportation, less congestion and lower emissions. However, safety continues to be a major obstacle to the advancement of autonomous driving [3,4]. Unsafe driving behaviours of autonomous vehicles may jeopardize human life and result in significant financial loss [5]. Considering these potential risks, it becomes evident that a substantial journey lies ahead to fulfill the stringent requirements and high expectations concerning autonomous driving [6].
Autonomous vehicles are products based on multi-disciplinary knowledge and theories [7]. The decision making module, usually seen as the intelligent brain of an autonomous vehicle, determines the driving mode or behavior according to environmental information and vehicle status. To deal with decision making problems, reinforcement learning (RL) has shown great potential and achieved impressive successes across a wide range of challenging tasks [8,9]. As a result, researchers have explored various RL algorithms to cope with a variety of autonomous driving tasks [10,11].
In many studies, RL has been applied to generate lane change behaviors during autonomous driving [12,13]. One popular approach is based on the deep Q-network (DQN). For instance, a lateral control policy is developed through a DQN with safety checkers for autonomous vehicles in Ref. [14]. In Ref. [15], a lane-change algorithm is developed by leveraging a partially observed Markov decision process with a DQN. A combined DQN and rule-based method for learning lane change decisions in automated vehicles is proposed in Ref. [16]. A DQN-based harmonious lane-change strategy for automated driving that enhances overall transportation efficiency is presented in Ref. [17]. A lane-change algorithm that optimizes automated vehicle decision making using a DQN with risk-awareness prioritized replay is proposed in Ref. [18].
Apart from the above DQN-based paradigms, other RL algorithms have also been applied to automated vehicle decision making. For example, a multi-objective driving policy learning method is developed to optimize automated vehicle lateral decision making [19]. A decision making scheme for the lane change task is developed via attention-based hierarchical deep RL in Ref. [20]. An autonomous lateral decision making method based on proximal policy optimization (PPO) is developed in Ref. [21].
The speed mode (e.g., keeping, acceleration and deceleration) or target speed of autonomous vehicles can be learned via DQN [22,23], deep deterministic policy gradient (DDPG) [24,25], or PPO [26] algorithms. For example, the longitudinal acceleration level of an autonomous vehicle at intersection scenarios can be determined by a learned belief updater and safe RL relying on a model checker in Ref. [27]. In addition, the lane change behavior and target speed of autonomous driving agents are simultaneously determined by RL algorithms in Refs. [28-30]. For example, a coordinated decision making scheme using the DDPG algorithm is developed to learn steering and throttle maneuvers for autonomous driving in Ref. [31].
The autonomous driving decision solutions mentioned above, which are based on RL algorithms, have yielded remarkable outcomes. However, it is crucial to acknowledge that the real-world environment is prone to sensor noise and measurement errors. These factors can lead autonomous driving agents to make suboptimal decisions or, in extreme cases, result in catastrophic damage. The lack of robustness guarantees limits their application in safety-critical autonomous driving domains. In view of these hazards, autonomous driving must guarantee that decision making behaviors can cope with natural sensing and perception uncertainties, particularly adversarial attacks on observations. A handful of existing studies have endeavored to tackle this challenge. In Refs. [11,32], robust RL frameworks against white-box and black-box attacks on perception systems, respectively, are developed to ensure the robustness of decision making for autonomous vehicles. In Ref. [33], a robust decision making method is proposed to enhance the robustness and safety of autonomous driving. This scheme incorporates a switching mechanism over principle-based policies, aiming to adapt effectively to various environments and ensure reliable decision making in unseen scenarios. Nevertheless, the above studies may not provide a guarantee for handling worst-case perturbations. This limitation arises because the autonomous driving agents trained with these methods are not exposed to the optimal adversarial attacks generated by a learnable adversary, which can produce stronger attacks than existing white-box or black-box techniques [34].
Accordingly, this paper presents a novel observation-robust RL (ORRL) scheme for safe decision making on a lane change task, which aims to ensure autonomous driving performance and policy robustness against adversarial attacks on observations. The main contributions of this work are summarized as follows:
• The proposed ORRL scheme enables autonomous driving agents to approximate robust driving policies against adversarial attacks on observations and to guarantee travel safety and efficiency.
• An adversarial agent is trained online to approximate the optimal adversarial attacks on observations, attempting to maximize the perturbed policies' average variation distance as measured by the Jensen-Shannon (JS) divergence.
• A novel observation-robust actor-critic (ORAC) method is developed to maximize the expected return while keeping the performance variations induced by the optimal adversary within bounds.
Multiple highway driving conditions with various traffic flow densities are simulated to assess the feasibility and effectiveness of the proposed ORRL method using simulation of urban mobility (SUMO) [35,36]. The results indicate that the developed safe decision making method is advantageous over three existing state-of-the-art methods.
The remaining sections of this paper are organized as follows. In Sect. 2, the developed ORRL scheme of safe decision making for autonomous driving is presented. In Sect. 3, the detailed algorithms and their implementation are introduced. In Sect. 4, the evaluation results are discussed and analysed. Section 5 concludes this study.

Observation-Robust Reinforcement Learning for Safe Autonomous Driving

Technique Framework
In the context of lane change for autonomous driving, the high-level framework of safe decision making using ORRL against adversarial attacks on observations is presented in Fig. 1. The ego agent is the gold-colored autonomous vehicle; surrounding vehicles in other colors are controlled by the SUMO intelligent driver model (IDM). The ego vehicle's action space is discrete and contains the behaviours of lane keeping, left lane change, and right lane change.
In Fig. 1, the ORAC block is adopted for optimizing safe lane change policies and allows the agent to interact with the environment. The state s, reward r, and the adversary Δ* on observations are the inputs, while the outputs include the agent action a and the policy π(a|s).
The aim of the adversarial agent module is to approximate the optimal adversarial attacks that are able to maximize the average variation distance on perturbed policies.This block's input contains the state s and action a, and its output is the optimal adversarial attacks Δ * on observations.
In addition, the environment block produces the next-step state s_{t+1} with the reward r_t. Its input is the action a_t with the policy π(a_t|s_t), where t denotes the time step.

Observation-Robust MDP
In this section, the observation-robust Markov decision process (ORMDP) is developed for modelling the decision making of agents under observation perturbations and policy constraints.
Definition 1: An ORMDP can be represented by the 7-tuple (S, A, p, r, c, Δ, γ), where S denotes the state space, A represents the action space, p is the state transition probability, r: S × A → ℝ denotes the reward function, c represents the constraint function, Δ denotes the observation uncertainty, and γ ∈ (0, 1) indicates a discount factor.
ORMDP seeks to solve the constrained optimization problem formulated as follows:

$$\max_{\pi}\;\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}\,r(s_t,a_t)\right]\quad \text{s.t.}\quad \mathbb{E}\left[c(s_t,\Delta)\right]\le \varepsilon \tag{1}$$

where T represents the time horizon and ε indicates an expected threshold value.

Adversarial Agent
The aim of adversarial agent training is to approximate the optimal adversarial attacks on observations.
The JS divergence can be seen as a smoothed and symmetrized Kullback-Leibler (KL) divergence [37,38]. Importantly, the JS divergence between two probability distributions is bounded (by 1 when the base-2 logarithm is used). Therefore, in this paper, the JS divergence is used to model the variation of the perturbed policy caused by adversarial attacks. The optimization objective based on the JS divergence is defined as:

$$D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s)\,\middle\|\,\pi(\cdot\,|\,\tilde{s})\right)=\frac{1}{2}D_{\mathrm{KL}}\left(\pi(\cdot\,|\,s)\,\middle\|\,\bar{\pi}\right)+\frac{1}{2}D_{\mathrm{KL}}\left(\pi(\cdot\,|\,\tilde{s})\,\middle\|\,\bar{\pi}\right),\qquad \bar{\pi}=\frac{1}{2}\left(\pi(\cdot\,|\,s)+\pi(\cdot\,|\,\tilde{s})\right) \tag{2}$$

where D_JS denotes the JS-divergence-based distance, D_KL represents the KL-divergence-based distance, and ã and s̃ are the action and state under observation perturbations, respectively.
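As an illustration, the JS divergence between two discrete action distributions can be computed as follows. This is a minimal NumPy sketch with illustrative function names, not the paper's implementation:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, and fully disjoint ones give ln 2 (equivalently 1 when base-2 logarithms are used), which is the boundedness the text relies on.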
The optimization problem with regard to the adversarial agent can be formulated as:

$$\max_{\Delta}\;D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s)\,\middle\|\,\pi(\cdot\,|\,s+\Delta)\right)\quad \text{s.t.}\quad \left\|\Delta\right\|_{\infty}\le \bar{\varepsilon} \tag{4}$$

where ε̄ denotes the perturbation limit.

Fig. 1 The proposed safe decision making framework
In order to simplify the constrained optimization problem above, this work utilizes the hyperbolic tangent function tanh(⋅) to restrict the margin of the observational perturbation:

$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}},\qquad \Delta=\bar{\varepsilon}\tanh(x) \tag{5}$$

where x represents the optimization variable, ε̄ denotes the upper bound, and e stands for the base of the natural logarithm. The constrained optimization problem, i.e., Eq. (4), can then be reformulated as the unconstrained problem:

$$\max_{x}\;D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s)\,\middle\|\,\pi\left(\cdot\,|\,s+\bar{\varepsilon}\tanh(x)\right)\right) \tag{6}$$

whose maximizer is the optimal solution x*.
Here, the optimal adversarial observational perturbation Δ* can be acquired as:

$$\Delta^{*}=\bar{\varepsilon}\tanh(x^{*}) \tag{7}$$

Therefore, to approximate the optimal solution x*, the adversarial agent is optimized by maximizing the following objective function:

$$J_{\Delta}(\theta)=\mathbb{E}\left[D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s)\,\middle\|\,\pi\left(\cdot\,|\,s+\bar{\varepsilon}\tanh\left(x_{\theta}(s)\right)\right)\right)\right] \tag{8}$$

where θ denotes the adversary network parameters. It is noteworthy that the input of the adversarial agent is the state s, and its output is the solution x.
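A minimal sketch of the tanh-bounded perturbation follows. The linear map `W` standing in for the adversary network is hypothetical, and `eps` plays the role of the perturbation limit ε̄:

```python
import numpy as np

def tanh_bounded(x, eps):
    """Squash an unconstrained variable so every component lies in [-eps, eps]."""
    return eps * np.tanh(x)

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 16))    # hypothetical stand-in for the adversary network

def adversary_perturbation(state, eps=0.05):
    x = W @ state                      # unconstrained network output x
    return tanh_bounded(x, eps)        # Delta = eps * tanh(x): the bound always holds
```

Because tanh maps the real line into (-1, 1), the constrained search over Δ becomes an unconstrained search over x, so the adversary can be trained by plain gradient ascent.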

Observation-Robust Actor-Critic
This section introduces the proposed ORAC algorithm, which aims to solve the following constrained optimization problem concerning the ORMDP under the optimal adversarial attacks Δ*:

$$\max_{\pi}\;\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}\,r(s_t,a_t)\right]\quad \text{s.t.}\quad \mathbb{E}\left[D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s_t)\,\middle\|\,\pi(\cdot\,|\,s_t+\Delta^{*})\right)\right]\le \varepsilon \tag{9}$$

To solve the ORMDP, a policy iteration (PI) method is used in this work, referred to as observation-robust PI (ORPI). The ORPI algorithm mainly contains two key processes: policy evaluation (PE) and policy improvement (PI). These two components are updated iteratively until convergence. Based on duality theory [39], the Lagrangian of the constrained optimization problem in Eq. (9) can be expressed as:

$$\mathcal{L}(\pi,\lambda)=\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}\,r(s_t,a_t)\right]-\lambda\left(\mathbb{E}\left[D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s_t)\,\middle\|\,\pi(\cdot\,|\,s_t+\Delta^{*})\right)\right]-\varepsilon\right) \tag{10}$$

where λ denotes the dual variable.

Observation-Robust PE
The action-value function Q(⋅) under the optimal adversary Δ* can be learned iteratively for a fixed policy. The iterations can start from any Q(⋅): S → ℝ^{|A|} and repeatedly apply a specified Bellman backup operator T^π, given by:

$$\mathcal{T}^{\pi}Q(s_t,a_t)=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1}\sim p}\left[V(s_{t+1})\right],\qquad V(s_t)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s_t+\Delta^{*})}\left[Q(s_t,a)\right] \tag{11}$$

where V(⋅) represents the expectation of the value function under Δ*, which can be calculated from the agent's output discrete action distribution.
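The backup can be sketched for a discrete action space as follows. This is illustrative NumPy code; `probs_next_perturbed` stands for the policy distribution evaluated at the adversarially perturbed next observation:

```python
import numpy as np

def softmax(logits):
    """Convert actor logits to a discrete action distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def bellman_target(r, gamma, done, q_next, probs_next_perturbed):
    """One-step backup: the next-state value is the expectation of Q over
    the discrete policy evaluated at the perturbed next observation."""
    v_next = float(np.dot(probs_next_perturbed, q_next))
    return r + gamma * (1.0 - done) * v_next
```

Because the action set is discrete, the expectation in V(⋅) is an exact dot product rather than a sampled estimate.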
To speed up the training of the policy model, the ORAC algorithm uses two parameterized action-value functions with parameters φ_z, z ∈ {1, 2}. The parameters are optimized by minimizing the loss function of the critic network:

$$J_{Q}(\phi_{z})=\mathbb{E}_{T_{s}\sim\mathcal{D}}\left[\left(Q_{\phi_{z}}(s_t,a_t)-y_t\right)^{2}\right] \tag{12}$$

where y_t represents the target value of the action-value function at time step t, and T_s denotes a transition sampled from the replay buffer D.
The smaller of the two Q(⋅) values is adopted to mitigate overestimation of the value function during critic training. Consequently, y_t can be given by:

$$y_t=r_t+\gamma\,(1-d_t)\,\bar{V}(s_{t+1}) \tag{13}$$

$$\bar{V}(s_t)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s_t+\Delta^{*})}\left[\min_{z\in\{1,2\}}Q_{\bar{\phi}_{z}}(s_t,a)\right] \tag{14}$$

where φ̄_z represents the network parameters of the target action-value function Q_{φ̄_z}(⋅), which are updated through Polyak averaging:

$$\bar{\phi}_{z}\leftarrow\tau\,\phi_{z}+(1-\tau)\,\bar{\phi}_{z} \tag{15}$$

where τ is a scale coefficient between 0 and 1.
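A sketch of the clipped double-Q target and the Polyak update, with illustrative names and toy parameter lists standing in for network weights:

```python
import numpy as np

def double_q_target(r, gamma, done, q1_next, q2_next, probs_next):
    """Clipped double-Q target: take the element-wise minimum of the two
    target critics before the expectation over the discrete policy."""
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * float(np.dot(probs_next, q_min))

def polyak_update(target_params, online_params, tau):
    """Soft update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

Taking the minimum over the two critics before the expectation is the standard way to damp value overestimation; the slow-moving target stabilizes the regression target y_t.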

Observation-Robust PI
To further improve the policy of the ORRL agent, the expected return is maximized while satisfying the constraint c(⋅).
The Lagrange dual function can be derived from Eq. (10) as:

$$g(\lambda)=\max_{\pi}\,\mathcal{L}(\pi,\lambda) \tag{16}$$

Additionally, the Lagrange dual problem with regard to Eq. (9) can be written as:

$$\min_{\lambda\ge 0}\,g(\lambda)=\min_{\lambda\ge 0}\,\max_{\pi}\,\mathcal{L}(\pi,\lambda) \tag{17}$$

The optimal policy π* and the optimal dual variable λ* can be found by the following alternating procedure. First, for a fixed dual variable λ, learn the optimal policy π* by maximizing L(π, λ). Then, plugging in π*, approximate the optimal dual variable λ* by minimizing L(π*, λ). According to Eq. (17), the following relational expressions can be derived:

$$\pi^{*}=\arg\max_{\pi}\,\mathcal{L}(\pi,\lambda),\qquad \lambda^{*}=\arg\min_{\lambda\ge 0}\,\mathcal{L}(\pi^{*},\lambda) \tag{18}$$

To minimize the estimation error of the expected return, the double-Q(⋅) trick is employed. As a result, the parameters of the policy model are updated by maximizing the objective function of the actor network, given by:

$$J_{\pi}(\theta_{\pi})=\mathbb{E}_{s_t\sim\mathcal{D}}\left[\mathbb{E}_{a\sim\pi_{\theta_{\pi}}}\left[\min_{z}Q_{\phi_{z}}(s_t,a)\right]-\lambda\,D_{\mathrm{JS}}\left(\pi_{\theta_{\pi}}(\cdot\,|\,s_t)\,\middle\|\,\pi_{\theta_{\pi}}(\cdot\,|\,s_t+\Delta^{*})\right)\right] \tag{19}$$

where θ_π denotes the actor network parameters. Moreover, the dual variable can be optimized by minimizing the following objective function:

$$J(\lambda)=\lambda\left(\varepsilon-\mathbb{E}\left[D_{\mathrm{JS}}\left(\pi(\cdot\,|\,s_t)\,\middle\|\,\pi(\cdot\,|\,s_t+\Delta^{*})\right)\right]\right) \tag{20}$$
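The dual-variable step can be sketched as a projected gradient update. The sign convention follows the Lagrangian above (λ grows while the measured JS variation exceeds its bound, and decays toward zero otherwise); the names are illustrative:

```python
def dual_update(lam, js_measured, js_bound, lr):
    """One projected gradient step on the dual variable lambda >= 0:
    increase lambda when the JS-divergence constraint is violated,
    let it shrink toward zero when the constraint is satisfied."""
    return max(0.0, lam + lr * (js_measured - js_bound))
```

In the alternating procedure, one or more actor gradient steps are interleaved with this single scalar update, so the penalty weight automatically tracks how far the policy is from satisfying the robustness constraint.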

Algorithm Implementation
Algorithm 1 presents the ORRL approach in detail. The ORRL method updates the driving policies of the agent through the following procedure. The initial network parameters of the actor and critic are sampled from a stochastic distribution. For each iteration, the RL agent gathers data over M time steps and saves them to the buffer D. The environment includes the reward function and the transition probability. The optimal adversarial attack Δ* on observations is approximated via the adversarial agent, whose update frequency is governed by a delayed update coefficient ρ. The policies of the RL agent are then updated iteratively. d_t is a done signal, indicating that the ego vehicle has encountered a collision at time step t.
Algorithm 1 Observation-Robust RL Algorithm
1: Initialize the actor, critic, target and adversary network parameters, and the replay buffer D
2: for each iteration do
3:   for time step t = 1, 2, ..., M do
4:     Determine the action with the policy: a_t ∼ π_θ(a_t|s_t)
5:     Receive the transition of the environment: s_{t+1}, r_t, d_t ∼ p(s_{t+1}|s_t, a_t)
6:     Save the transition (s_t, a_t, r_t, s_{t+1}, d_t) to the replay buffer D
7:   end for
8:   for gradient step g = 1, 2, ..., N_a do
9:     Sample a batch of training data from the replay buffer D
10:    Update the critic network parameters and the dual variable
11:    Update the network parameters of the target action-value function via Eq. (15)
12:    if g mod ρ = 0 then
13:      Optimize the adversary network parameters via Eq. (8): θ ← θ + ∇_θ J_Δ(θ)
14:    end if
15:  end for
16: end for

To learn the policies for lane change of autonomous vehicles, the RL agent's state, action and reward need to be determined. The autonomous driving agent's state contains 16 dimensions, with details given in Fig. 2. The social vehicles perform lane change maneuvers via the SUMO LC2013 model [40].
Furthermore, the autonomous driving agent's action is discrete, which contains left lane changing, right lane changing and lane keeping.
A tricky problem is to learn safe lane change policies against adversarial perturbations on state observations from scratch, without prior knowledge. Therefore, the reward function plays a pivotal role in optimizing the agent's policies. Safety, efficiency and comfort factors are considered when designing the agent's reward function.
To promote travel efficiency of the ego vehicle, a reward function r(⋅) is developed in which the reward is proportional to the ego vehicle's speed, i.e., v₀/35. This implies that the autonomous driving agent receives higher rewards by operating at higher speeds. If the headway of the ego vehicle is below 30 m, the reward is decreased by 0.1; this prevents the ego car from simply following its preceding vehicle. Both vehicle dynamics stability and collision are considered in terms of driving safety. If the desired yaw rate's upper bound k·μ·g/v₀ given in Ref. [41] is exceeded, the agent's reward is decreased by 0.05, where μ denotes the adhesion coefficient, k represents the dynamic factor proposed in Ref. [42], and g indicates the gravitational acceleration constant. In addition, if the ego vehicle collides, the reward of the agent is decreased by 0.1. If a lateral decision is made while the ego vehicle travels above 20 m/s, the reward is diminished by v₀/350, which discourages frequent lane changes at high speed. Algorithm 2 illustrates the details of the designed reward function.
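The reward shaping above can be sketched as follows. The values of `k` and `mu` are illustrative placeholders, since the text only names these terms without fixing their values:

```python
def lane_change_reward(v0, headway, yaw_rate, lateral_decision, collided,
                       k=0.9, mu=0.85, g=9.81):
    """Reward sketch following the factors described in the text; k and mu
    are assumed values for the dynamic factor and adhesion coefficient."""
    r = v0 / 35.0                                  # efficiency: proportional to speed
    if headway < 30.0:
        r -= 0.1                                   # discourage tailgating
    if abs(yaw_rate) > k * mu * g / max(v0, 1e-6):
        r -= 0.05                                  # dynamic-stability penalty
    if collided:
        r -= 0.1                                   # collision penalty
    if lateral_decision and v0 > 20.0:
        r -= v0 / 350.0                            # discourage high-speed lane changes
    return r
```

Note that all penalties are small relative to the speed term, so the agent is never pushed toward standing still just to avoid penalties.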
The actor, critic, and adversary networks are each built with two fully-connected hidden layers of 128 units, with ReLU as the activation function of the hidden layers. The actor and critic networks have 16-dimensional inputs and 3-dimensional outputs, while both the input and output of the adversary network are 16-dimensional. Table A1 of Appendix A provides the main hyperparameters of the ORRL algorithm.
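The three networks can be sketched in plain NumPy as follows (random untrained weights, for shape illustration only; a real implementation would use a deep learning framework):

```python
import numpy as np

def mlp(sizes, rng):
    """Initialize a fully-connected network as (weight, bias) pairs."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass with ReLU on the hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)    # ReLU only on hidden layers
    return x

rng = np.random.default_rng(0)
actor     = mlp([16, 128, 128, 3],  rng)   # state -> logits over 3 discrete actions
critic    = mlp([16, 128, 128, 3],  rng)   # state -> Q-value per discrete action
adversary = mlp([16, 128, 128, 16], rng)   # state -> unconstrained perturbation variable x
```

The critic outputs one Q-value per discrete action, which is what makes the exact expectations over the action distribution in the backup equations possible.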

Simulation Environment
To evaluate the performance of the developed safe decision making scheme for automated vehicles, training and testing are conducted in SUMO. Three highway scenarios are set up as random mixed traffic flows with various traffic densities. The longitudinal speed and lane change behaviours of surrounding vehicles are controlled using the IDM. The maximum speed limit for all lanes is defined as 35 m/s. Figure 3 shows the high-level framework of the performance evaluation. P denotes the probability of a vehicle starting each second; P is set to 0.07, 0.14 and 0.28 to represent random mixed traffic with low, medium, and high densities, respectively.

Baseline Algorithms
To benchmark the performance of the proposed ORRL, the DQN, PPO, and discrete-action soft actor-critic (SAC) [43] algorithms are leveraged as compelling baselines. The DQN- and PPO-based agents are implemented as classical baselines. In addition, discrete-action SAC is used as a state-of-the-art comparison, as it is one of the most advanced RL schemes with discrete actions. The final performance of the different agents is provided in Table 1, where the bold number denotes the best in each column. The average return, speed and collision times reflect the comprehensive performance, travel efficiency and driving safety of the autonomous driving agents, respectively. The average values of the metrics are computed over the final 2000 time steps (i.e., 200 time steps × 10 episodes). The training results indicate that the ORRL agent greatly surpasses the baseline agents with respect to both final performance and learning efficiency. For instance, compared to the DQN, PPO and SAC agents, the ORRL agent gains 74.30%, 68.56% and 13.29% improvements in final return, respectively. Based on the results, the ORRL agent achieves comparable performance with the SAC agent and exceeds the DQN and PPO agents in final speed. Moreover, the driving safety of the ORRL agent is enhanced by 28.57%, 80.00% and 54.55%, respectively, compared with the DQN, PPO and SAC agents.

Comparative Evaluation
The average return, robustness and collision times are used to evaluate the performance of the different agents under adversarial observation perturbations. Moreover, Eq. (2) is adopted to assess the policy robustness. The five policies finally trained by each algorithm under different random seeds are evaluated. The average values of the metrics are calculated over 20,000 time steps (i.e., 200 time steps × 100 episodes).
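The robustness metric can be approximated by averaging the JS distance between the policy's action distributions on clean and on adversarially perturbed observations over a trajectory, as in this self-contained sketch (lower values mean a more robust policy; names are illustrative):

```python
import numpy as np

def avg_policy_variation(clean_dists, perturbed_dists, eps=1e-12):
    """Mean JS divergence between per-step action distributions computed
    from clean observations and from adversarially perturbed ones."""
    def js(p, q):
        m = 0.5 * (p + q)  # mixture distribution
        kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.mean([js(np.asarray(p, float), np.asarray(q, float))
                          for p, q in zip(clean_dists, perturbed_dists)]))
```

A perfectly robust policy would score 0, since its action distribution would be unchanged by the perturbation at every step.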
Figure 5 illustrates the performance of the agents trained with the different algorithms under the three stochastic highway conditions with various traffic flow densities. According to the results, the ORRL agent outperforms the baseline agents in terms of average return, robustness and collision times. Notably, compared with the other policies, the variations of the ORRL policies under the optimal adversarial attacks on observations are slight.
Table 2 quantitatively reports the evaluation results of the different autonomous driving agents. The ORRL agent surpasses the three baseline agents in all testing cases. For instance, in the low-density traffic flow, in comparison with the DQN, PPO and SAC agents, the ORRL agent gains 56.14%, 60.05%, and 4.08% improvements, respectively, in average return. The average robustness of the ORRL agent is enhanced by 41.02%, 32.21%, and 32.38%, respectively, and the average collision times of the ORRL agent are reduced by 88.43%, 54.84%, and 44.00%, respectively.

Fig. 5 Assessment results for agents under different methods with the optimal adversarial attacks on observations
Under the normal-density traffic flow, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by 82.96%, 58.42%, and 27.52%, respectively. Meanwhile, the average robustness of the ORRL agent is improved by about 79.54%, 94.36%, and 88.20%, respectively. Additionally, the safety of the ORRL agent is also enhanced by about 66.89%, 9.26% and 64.49%, respectively.
Further, under the high traffic density condition, compared to the DQN, PPO and SAC agents, the ORRL agent improves the average return by 94.69%, 29.78%, and 51.63%, respectively. The average robustness of the ORRL agent is enhanced by 70.21%, 94.89%, and 84.34%, respectively, and the average collision times are reduced by 49.51%, 39.77%, and 45.79%, respectively. Consequently, the ORRL autonomous driving agent performs consistently under the optimal adversary in the three complex traffic scenarios.
Figure 6 visually illustrates the collision times of the DQN, PPO, SAC and ORRL autonomous driving agents in the high-density stochastic dynamic traffic flows under different attack situations. As can be seen from Fig. 6, the adversarial attacks generated by the trained adversarial agents have a distinct impact on the driving safety of the vehicles driven by the baseline agents. For example, compared with the case without adversarial attacks, the average collision times of the attacked DQN, PPO, SAC and ORRL agents increase by about 43.66%, 59.81%, 40.74% and 5.10%, respectively. Hence, the proposed ORRL autonomous driving agent performs consistently across different attack situations. This means the ORRL policy model is robust to adversarial attacks on observations, which highlights our primary contribution to realizing safe decision making for autonomous vehicles.
Table 3 compares the computational cost of the different methods in terms of model optimization and inference. The average time consumption of each model per update and per inference is reported separately. Compared with the others, the DQN scheme has the lowest average computational costs for model optimization and inference, about 3.05 × 10⁻³ s and 2.08 × 10⁻⁴ s, respectively. Because the ORRL approach utilizes the adversarial model and requires solving a constrained optimization problem, its model optimization cost is higher than that of the baseline methods.
However, the average time consumption of model inference is similar across the different methods.

Conclusions
In this work, a novel ORRL approach is proposed for safe lane change decision making in autonomous driving. An adversarial agent is trained online to obtain adversarial observations with optimal perturbations, aiming to maximize the average variation distance of the perturbed policies. Furthermore, an ORAC method is developed to optimize the lateral decision making policy of automated vehicles while ensuring that the policy variations under adversarial attacks remain within expected bounds.
Training and testing of the policies are conducted in complex highway driving situations with different traffic flow densities simulated in SUMO. The results illustrate that the developed approach enables autonomous vehicles to make safe lane change decisions under perception uncertainties. Additionally, compared to the three baselines, the agent under the proposed method shows better generalization ability and robustness under adversarial observation perturbations.
In the future, a certified ORRL algorithm will be investigated to provide theoretical guarantees regarding safe decision making for autonomous driving.

Fig. 2 States of the autonomous driving agent

Figure 4 shows the training performance of each agent under the normal-density traffic flow setting. Figure 4(a), (b) and (c) show the average return, speed and collision times of the different agents, respectively. The solid curve represents the mean, and the shaded region denotes the standard deviation. Each algorithm is trained for five runs with different random seeds over 400 episodes in the normal-density traffic flow. The maximum length of each episode is 200 time steps.

Fig. 3 The architecture of the evaluation framework with mixed traffic flows
Fig. 4 Training performance of each agent under the normal-density traffic flow

Fig. 6 Collision times of autonomous driving agents in the high-density traffic flows under different attack situations

Table 1 Final performance comparison in training

Table A1 Key hyperparameters of the proposed algorithm