1 Introduction

Compared with manual driving, autonomous driving is a promising technology for improving road traffic efficiency and relieving traffic pressure. It also provides convenience for people who are unable to drive. The goal of autonomous driving is to enable a vehicle to drive with partial or no human control. Since the road environment is dynamic and complex, an intelligent decision-making system for autonomous driving is essential. To achieve this goal, some studies employed non-learning approaches that rely on manually designed driving strategies [1, 2]. Although it is feasible to design driving strategies with the help of human experience and knowledge, such hand-crafted strategies are difficult to adapt to complex and dynamic environments. Other researchers used supervised learning methods [3,4,5], in which labeled data-action pairs are applied to train a neural network. However, these methods require a large amount of labeled data, which is difficult to obtain and process. In addition, most works in this area focus on single-objective tasks, such as maximizing vehicle speed or lane keeping [6], while the quality of driving, e.g., passenger comfort and the efficiency of transporting passengers to their destinations, is rarely considered.

Reinforcement learning (RL) is a branch of machine learning in which an agent obtains feedback through interaction with the environment and selects actions according to a learned policy. Because reinforcement learning is well suited to sequential decision-making and planning problems, it is a natural choice for learning autonomous driving strategies. However, most research in this area has also been limited to single objectives, such as safety or speed. The motivation of this paper is to achieve multi-objective optimization of autonomous driving strategies with reinforcement learning by considering two objectives: optimizing vehicle speed and optimizing passenger comfort. By proposing a multi-objective reward function covering both objectives, a multi-objective reinforcement learning approach is developed based on the deep Q-network [7, 8].

2 Related work

The development of reinforcement learning is closely related to optimal control theory, including dynamic programming [9] and adaptive dynamic programming [10]. Watkins et al. used value functions to evaluate policies and proposed the Q-learning method [11]. To solve more complex problems with larger state and action spaces, deep reinforcement learning (DRL) was proposed. DRL combines deep learning with reinforcement learning, using neural networks to reduce the dimensionality of high-dimensional data before processing it with reinforcement learning methods. Mnih et al. proposed the deep Q-network (DQN) [7, 8], which uses convolutional neural networks (CNN) for image downscaling and feature extraction, combined with Q-learning, to train RL agents to play Atari video games, where the resulting agents exceeded human-level performance. Subsequently, many improved algorithms have been reported, for example, dueling DQN [12], deep deterministic policy gradient (DDPG) [13], the asynchronous advantage actor-critic framework (A3C) [14], trust region policy optimization (TRPO) [15], and proximal policy optimization (PPO) [16]. Deep reinforcement learning methods have advantages in solving decision problems: they optimize long-term rewards and require neither mathematical models nor labeled data. Thus, they have been applied to many difficult problems, such as Go [17, 18], robot control [19], and recommender systems [20,21,22].

Previously, deep learning had more applications in autonomous driving systems, while reinforcement learning was used less often. In recent years, thanks to the development of deep reinforcement learning, many studies have applied DRL to autonomous driving. Bojarski et al. applied CNNs to autonomous driving [3]. Sallab et al. applied the DQN algorithm to a lane-keeping task in a simulator [6]. Chae et al. applied the DQN algorithm to an automatic braking system, where the RL agent was able to learn a braking strategy after 70,000 training steps in a simulator [23]. Wolf et al. used the DQN algorithm to train an RL agent to control an autonomous car in a simulator [24]. Chen et al. applied various DRL algorithms, such as DDQN and soft actor-critic (SAC), in the CARLA simulator to achieve autonomous driving in complex traffic environments [25]. It is noted that most of these works focused on the safety or reliability of the driving process. In 2018, Kendall et al. applied deep reinforcement learning algorithms to a real environment [26]. They first trained in a simulator using DDPG and then tested in a real environment, achieving results close to human level. Min et al. combined image data and distance sensor data as input to a DRL algorithm to train self-driving cars [27]. They designed rules based on experiential knowledge to guide the learning of RL agents; however, their approach only seeks to maximize vehicle speed. Li et al. trained a multi-objective DQN to learn to drive on multi-lane roads and intersections [28]. Their work considered passenger comfort, but only through manually designed rules.

3 Preliminaries

In this section, some preliminaries are presented. First, a model of the autonomous driving problem is established. Then, in order to apply DRL to the autonomous driving problem, the interaction between the vehicle and the environment is modeled as a Markov decision process (MDP) [29]. Finally, a Q-network for end-to-end autonomous driving decision-making is constructed.

3.1 Problem formulation

The construction of the autonomous driving model and Markov decision process is presented in this subsection.

3.1.1 Autonomous driving model

The aim of this paper is to learn autonomous driving strategies that strike a balance between efficiency and comfort. We divide the autonomous driving model into three parts: perception, decision, and control.

1. The perception module is responsible for acquiring information from the environment. For human drivers, this part usually includes vision, hearing, and touch. In our model, we use the image data captured by a front camera as the input to a deep neural network.

2. The decision module is responsible for making driving decisions and selecting driving actions. For human drivers, this part is thinking and decision-making. These are high-level decisions, and we learn the driving strategy based on DQN.

3. The control module is responsible for executing driving actions. In our model, an environment simulator performs the actual execution of the actions.

3.1.2 Markov decision process

An MDP consists of a five-element tuple \((S,A,P,R,\gamma )\), where S is the state space, A is the action space, P is the state transition function that describes the dynamics of the environment, R is the reward function, with R(s, a) denoting the immediate reward for taking action a in state s, and \(\gamma \in [0, 1]\) is the discount factor, which reflects the importance attached to long-term rewards. We model the interaction process between the self-driving car and the environment as a Markov decision process.
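As a brief reminder of standard material (following [7, 8], and not specific to this paper), the discounted return that the agent maximizes and the one-step target used by Q-learning-based methods such as DQN can be written as

$$\begin{aligned} G_t = \sum \limits _{k=0}^{\infty } \gamma ^{k} R(s_{t+k}, a_{t+k}), \qquad y_t = r_t + \gamma \max \limits _{a'} Q(s_{t+1}, a'), \end{aligned}$$

where \(Q(s, a)\) is the action-value function approximated by the Q-network described in Section 3.2.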

State space S: The state is the environmental information perceived by the RL agent, and we use the image taken by the camera in front of the vehicle as the state, which is similar to the human driver’s perspective. The image size is \(80 \times 80\) pixels.

Action space A: The action space is the set of all actions that can be selected by the RL agent; for DQN the action space is discrete. The most basic action space in autonomous driving contains five actions: accelerate, decelerate, turn left, turn right, and maintain. To achieve more accurate speed control, we define the action space as follows (a minimal code sketch of this action set is given after the list):

1. Change lane to the left;

2. Change lane to the right;

3. Maintain;

4. Slow acceleration;

5. Medium acceleration;

6. Fast acceleration;

7. Slow deceleration;

8. Medium deceleration;

9. Fast deceleration.
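The following is a minimal sketch of this discrete action set in Python; the integer indices are an illustrative assumption and are not specified in the paper.

```python
from enum import IntEnum


class DrivingAction(IntEnum):
    """The nine discrete actions available to the RL agent (indices are assumed)."""
    CHANGE_LANE_LEFT = 0
    CHANGE_LANE_RIGHT = 1
    MAINTAIN = 2
    SLOW_ACCELERATION = 3
    MEDIUM_ACCELERATION = 4
    FAST_ACCELERATION = 5
    SLOW_DECELERATION = 6
    MEDIUM_DECELERATION = 7
    FAST_DECELERATION = 8
```

The Q-network then outputs one Q value per member of this action set.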

Reward function R: At each step, the RL agent selects an action, and the environment gives the corresponding feedback. The details of the reward function are designed in Section 4.

3.2 Deep Q-Network

In this subsection, the architecture of the deep Q-network and several data preprocessing methods are presented.

3.2.1 Data preprocessing

To help the RL agent learn effectively, the received image data is first pre-processed as follows (a minimal code sketch of the full pipeline is given after the list).

1. Image size conversion: The original input image is resized to a uniform size of \(80 \times 80\) pixels.

2. Grayscale conversion: The original RGB image with red, green, and blue channels is converted to a grayscale image to reduce the data dimension.

3. Image normalization: The gray value of the original image ranges from 0 to 255. To facilitate neural network learning, it is normalized to \([-1,1]\), as shown in equation (1):

$$\begin{aligned} {x_{normalize}} = \frac{{{x_{image}} - \frac{{255}}{2}}}{{\frac{{255}}{2}}} \end{aligned}$$
(1)

where \(x_{image}\) is the original gray value of a pixel and \({x_{normalize}}\) is its gray value after normalization.

4. Multi-frame image stacking: Since a single frame does not contain any motion information, motion trends can only be captured by successive frames. Motion information is crucial for autonomous driving, so we stack the 4 most recent frames as the input to the DQN. That is, the dimension of the input data is \(80 \times 80 \times 4\).
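A minimal sketch of this preprocessing pipeline, assuming NumPy and OpenCV (cv2) are available and that the simulator delivers RGB frames as arrays; the helper names are illustrative and not from the original implementation.

```python
import collections

import cv2
import numpy as np


def preprocess_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """Resize to 80x80, convert to grayscale, and normalize to [-1, 1]."""
    resized = cv2.resize(frame_rgb, (80, 80), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(resized, cv2.COLOR_RGB2GRAY).astype(np.float32)
    return (gray - 255.0 / 2.0) / (255.0 / 2.0)  # equation (1)


class FrameStack:
    """Keep the 4 most recent preprocessed frames as an 80x80x4 state."""

    def __init__(self, num_frames: int = 4):
        self.frames = collections.deque(maxlen=num_frames)

    def reset(self, frame_rgb: np.ndarray) -> np.ndarray:
        # At the start of an episode, repeat the first frame to fill the stack.
        processed = preprocess_frame(frame_rgb)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(self.frames, axis=-1)

    def step(self, frame_rgb: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(frame_rgb))
        return np.stack(self.frames, axis=-1)
```

Resetting the stack at the start of each episode ensures the initial state does not mix frames from different episodes.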

3.2.2 Q-network architecture

The architecture of the Q-network is shown in Fig. 1. The input to the Q-network is a stack of images of size \(80 \times 80 \times 4\). The network consists of three convolutional layers and two fully connected layers, where the convolutional layers extract image features and the fully connected layers output the Q value of each of the nine actions in the action space.

Fig. 1 The architecture of the Q-network
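A minimal PyTorch sketch of such a Q-network; the filter counts, kernel sizes, strides, and the 512-unit hidden layer follow the original DQN architecture [7, 8] as an assumption, since the exact layer sizes used here are only given in Fig. 1.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Three convolutional layers followed by two fully connected layers."""

    def __init__(self, num_actions: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x80x80 -> 32x19x19
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x8x8
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x6x6
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 6 * 64, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of stacked frames with shape (N, 4, 80, 80)
        return self.head(self.features(x))
```

The output layer has nine units, one Q value per action defined in Section 3.1.2.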

4 Multi-objective reinforcement learning approach

In this section, a multi-objective reward function is proposed to implement multi-objective reinforcement learning. The reward function is crucial for reinforcement learning models: by designing the reward function, we are able to guide the learning of the RL agent. We have two optimization objectives: to improve the speed of the vehicle and to improve the comfort of the passengers. Therefore, a reward function for speed optimization and a reward function for comfort optimization are designed, respectively. In addition, an auxiliary reward function is used to improve the learning efficiency of the RL agent.

4.1 Reward function for speed optimization

In order to optimize the speed of the agent vehicle, higher speeds should be rewarded and lower speeds should be punished. Based on our experimental tests, we design the following reward functions.

Speed reward: The speed reward is given based on the speed of the vehicle and is calculated as follows:

$$\begin{aligned} {r_{speed}}(v) = \frac{{v - 35}}{{\max (v,50) - 35}} \end{aligned}$$
(2)

where v is the current speed of the vehicle (km/h). The speed reward function optimizes speed by considering three aspects: 1) penalizing speeds that are too low; 2) encouraging high speeds; 3) preventing the reward value from growing without bound. The maximum value of the speed reward is 1. When the vehicle speed is less than 35 km/h, \({r_{speed}}\) is negative, pushing the RL agent to increase its speed. When the vehicle speed is between 35 km/h and 50 km/h, \({r_{speed}}\) increases linearly with the vehicle speed; when the vehicle speed is greater than 50 km/h, \({r_{speed}}\) equals 1 and does not increase any further, to avoid excessively high speeds.

Overtaking reward: The purpose is to encourage the RL agent to overtake and to avoid situations where the RL agent follows a slower car ahead for a long time. For each overtaking, the RL agent receives a reward of 0.3. The overtaking reward \({r_{overtake}}\) is calculated as follows:

$$\begin{aligned} {r_{overtake}} = \left\{ \begin{array}{l} 0.3 , \mathrm{if}\, \mathrm{agent}\, \mathrm{overtakes}\,\mathrm{a}\,\mathrm{vehicle}\\ 0 , {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$
(3)

The purpose of the speed reward and the overtaking reward is to increase the speed of the vehicle, and the two are added together to form the reward function for speed optimization:

$$\begin{aligned} {r_{optimize\_speed}} = {r_{speed}} + {r_{overtake}} \end{aligned}$$
(4)

4.2 Reward function for comfort optimization

There are many factors that affect passenger comfort, such as temperature, noise, vehicle acceleration, and vehicle vibration. However, most of these factors depend on the environment, and only the vehicle acceleration is controllable. Therefore, we consider two main aspects: the longitudinal acceleration of the vehicle and the lane change frequency of the vehicle. Vehicle acceleration has a significant impact on passenger comfort: driving behaviors such as rapid acceleration, hard braking, and frequent lane changes lead to a decrease in comfort. Based on our experiments, the reward function for comfort optimization is designed as follows.

Lane change reward \({r_{lanechange}}\), which gives a negative reward for lane change actions:

$$\begin{aligned} {r_{lanechange}} = \left\{ \begin{array}{l} - 0.7 , \mathrm{if}\, \mathrm{agent}\, \mathrm{changes}\, \mathrm{lane}\\ 0 , {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$
(5)

The speed change reward \({r_{speedchange}}\), which gives a reward according to the acceleration or deceleration action chosen by the RL agent, aims to prevent the RL agent from changing speed too frequently or choosing excessively high accelerations. \({r_{speedchange}}\) is calculated as follows:

$$\begin{aligned} {r_{speedchange}} = \left\{ \begin{array}{l} - 0.05 , \mathrm{slow}\,\mathrm{speed}\, \mathrm{change}\\ - 0.3 , \mathrm{medium}\,\mathrm{speed}\, \mathrm{change}\\ - 0.5 , \mathrm{fast}\, \mathrm{speed}\, \mathrm{change}\\ 0 , \mathrm{otherwise} \end{array} \right. \end{aligned}$$
(6)

The collision reward \({r_{collision}}\) gives a reward based on whether the RL agent crashes. Collision events must be avoided, for both comfort and safety reasons. Due to the high cost of a collision, the magnitude of the collision penalty is large:

$$\begin{aligned} {r_{collision}} = \left\{ \begin{array}{l} - 10 , \mathrm{if}\,\mathrm{agent}\,\mathrm{collides}\\ 0 , {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$
(7)

The purpose of the above lane change reward, speed change reward, and collision reward is to improve passenger comfort, and we add them together to form the reward function for comfort optimization:

$$\begin{aligned} {r_{optimize\_comfort}} = {r_{speedchange}} + {r_{lanechange}} + {r_{collision}} \end{aligned}$$
(8)

4.3 Reward function for learning assistance

Reinforcement learning agents generally start learning from scratch, which requires a large amount of interaction with the environment. Therefore, reward functions \({r_{maintain}}\) and \({r_{avoidance}}\) are designed to improve learning efficiency.

\({r_{maintain}}\): Since the probability of exploring the “maintain” action in the early stage is small, the RL agent rarely chooses it. Therefore, we set a reward \({r_{maintain}}\): if the vehicle speed is higher than 50 km/h, the “maintain” action is rewarded with \(+0.1\).

\({r_{avoidance}}\): The simulator can provide the distance between the vehicle and surrounding obstacles. Using this information, we set the reward \({r_{avoidance}}\): if there is an obstacle ahead, taking a lane change or deceleration action receives a \(+0.2\) reward.

We add the two rewards together to form \({r_{assist}}\), which guides the exploration of the RL agent:

$$\begin{aligned} {r_{assist}} = {r_{maintain}} + {r_{avoidance}} \end{aligned}$$
(9)

Meanwhile, several rules are established to shrink the action space and help the RL agent learn faster (a minimal sketch of the first rule is given after the list).

1. If the vehicle is currently in the leftmost (or rightmost) lane, changing lanes to the left (or right) is not allowed.

2. Lane keeping: If the vehicle is not in the center of the lane, it is moved back to the center of the lane. This rule is implemented by the simulator [27], so the RL agent only needs to select the lane, not a specific steering angle.
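A minimal sketch of how rule 1 can be implemented as action masking during action selection; the lane indexing convention and action indices are illustrative assumptions, not details from the original implementation.

```python
import numpy as np

# Illustrative action indices (see Section 3.1.2): 0 = change lane left, 1 = change lane right.
CHANGE_LANE_LEFT, CHANGE_LANE_RIGHT = 0, 1
NUM_LANES = 5  # five-lane road, lane 0 assumed to be the leftmost lane


def masked_greedy_action(q_values: np.ndarray, lane_index: int) -> int:
    """Pick the highest-Q action while forbidding lane changes off the road."""
    q = q_values.copy()
    if lane_index == 0:                 # already in the leftmost lane
        q[CHANGE_LANE_LEFT] = -np.inf
    if lane_index == NUM_LANES - 1:     # already in the rightmost lane
        q[CHANGE_LANE_RIGHT] = -np.inf
    return int(np.argmax(q))
```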

4.4 Multi-objective reward function

The final multi-objective reward function is constructed as follows:

$$\begin{aligned} {r_{multi\_objective}} = {r_{optimize\_speed}} + {r_{optimize\_comfort}} + {r_{assist}} \end{aligned}$$
(10)

The above multi-objective reward function includes both the rewards that encourage higher speed and the penalties that discourage uncomfortable driving. Clearly, these two objectives are in conflict. On the one hand, to maximize speed, the vehicle must increase its speed as soon as the situation allows and overtake slower vehicles by changing lanes frequently, which may also require frequent hard braking. On the other hand, to maximize comfort, the vehicle is required to drive as smoothly as possible. Therefore, the purpose of designing a multi-objective reward function is to find a more reasonable driving strategy by seeking a balance between efficiency and comfort.
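A minimal Python sketch of the full reward in equations (2)-(10), assuming the environment exposes per-step event flags (overtaking, lane change, magnitude of the chosen speed change, collision, choosing “maintain” above 50 km/h, and obstacle avoidance); these flag names are illustrative assumptions rather than part of the original implementation.

```python
def speed_reward(v_kmh: float) -> float:
    """Equation (2): negative below 35 km/h, linear up to 50 km/h, capped at 1."""
    return (v_kmh - 35.0) / (max(v_kmh, 50.0) - 35.0)


# Equation (6): penalty per speed-change magnitude (slow / medium / fast).
SPEED_CHANGE_PENALTY = {"slow": -0.05, "medium": -0.3, "fast": -0.5}


def multi_objective_reward(v_kmh: float,
                           overtook: bool,
                           changed_lane: bool,
                           speed_change: str,       # "slow", "medium", "fast" or "" (none)
                           collided: bool,
                           maintained_above_50: bool,
                           avoided_obstacle: bool) -> float:
    """Equation (10): sum of the speed, comfort, and assistance terms."""
    # Equations (2)-(4): reward for speed optimization.
    r_speed_opt = speed_reward(v_kmh) + (0.3 if overtook else 0.0)

    # Equations (5)-(8): reward for comfort optimization.
    r_comfort = (-0.7 if changed_lane else 0.0) \
        + SPEED_CHANGE_PENALTY.get(speed_change, 0.0) \
        + (-10.0 if collided else 0.0)

    # Equation (9): learning-assistance reward.
    r_assist = (0.1 if maintained_above_50 else 0.0) \
        + (0.2 if avoided_obstacle else 0.0)

    return r_speed_opt + r_comfort + r_assist
```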

5 Experiments

In this section, the effectiveness of the proposed approach is verified by simulations. Some implementation details of the experiments are presented and the results of the experiments are discussed.

5.1 Environment simulator

For cost and feasibility reasons, we experiment in a simulator. We use an open-source simulator developed by Min et al. based on the Unity ML module [27], and modify it as needed. The simulator is able to simulate the driving environment and vehicle sensors. The driving environment is random and dynamic, including a five-lane urban expressway, other vehicles on the road, and background elements such as sky and buildings. Other vehicles on the road are randomly generated and take random actions with a certain probability. Meanwhile, roads and buildings are constantly changing. Fig. 2 shows the simulator interface.

Fig. 2 A screenshot of the simulator interface

5.2 Details

1. Exploration: We use the \(\varepsilon\)-greedy exploration policy. During learning, the RL agent does not always pick the action with the largest Q value, but chooses a random action with probability \(\varepsilon\). Over the course of learning, the probability of taking a random action is gradually reduced.

2. Experience replay [7]: There is a strong correlation between the data obtained from the RL agent’s interaction with the environment. To break this correlation, the DQN algorithm stores the samples collected by the RL agent and randomly draws mini-batches from them during training (a minimal training-loop sketch combining these details is given at the end of this subsection).

3. Episode termination: If the vehicle collides or travels more than a certain distance, the current episode is terminated and the next episode begins.

4. Pre-exploration: Before training of the neural network starts, the RL agent performs a certain number of exploration steps, interacting with the environment to collect data samples and storing them in the experience replay buffer. The purpose of pre-exploration is to ensure that there are enough samples in the replay buffer for sampling.

5. Hyperparameters: Table 1 shows the hyperparameter settings of the algorithm.

Table 1 Hyperparameter settings of the DQN algorithm
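A minimal sketch of a DQN training loop combining the details above, assuming PyTorch, a Q-network like the one sketched in Section 3.2.2, a Gym-style simulator interface (reset() returning a 4x80x80 state and step(action) returning next state, reward, done flag, and info), and a target network as in standard DQN [8]; the hyperparameter values below are placeholders, not the values in Table 1.

```python
import collections
import random

import numpy as np
import torch
import torch.nn.functional as F

# Illustrative hyperparameters (placeholders, not the values in Table 1).
GAMMA, BATCH_SIZE, REPLAY_CAPACITY = 0.99, 32, 100_000
PRE_EXPLORE_STEPS, EPS_START, EPS_END, EPS_DECAY_STEPS, TARGET_SYNC = 10_000, 1.0, 0.05, 500_000, 10_000


def train(env, q_net, target_net, total_steps=1_000_000, num_actions=9):
    """DQN training loop: epsilon-greedy exploration, experience replay, target network."""
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    replay = collections.deque(maxlen=REPLAY_CAPACITY)  # experience replay buffer

    state = env.reset()
    for step in range(total_steps):
        # Epsilon-greedy exploration with a linearly decaying epsilon.
        eps = max(EPS_END, EPS_START - step * (EPS_START - EPS_END) / EPS_DECAY_STEPS)
        if step < PRE_EXPLORE_STEPS or random.random() < eps:
            action = random.randrange(num_actions)
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            action = int(q.argmax(dim=1).item())

        next_state, reward, done, _ = env.step(action)
        replay.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state  # episode termination

        if step >= PRE_EXPLORE_STEPS:
            # Randomly sample a mini-batch from the replay buffer.
            batch = random.sample(replay, BATCH_SIZE)
            s, a, r, s2, d = map(np.array, zip(*batch))
            s = torch.as_tensor(s, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + GAMMA * (1.0 - d) * target_net(s2).max(dim=1).values
            loss = F.smooth_l1_loss(q_sa, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % TARGET_SYNC == 0:
            target_net.load_state_dict(q_net.state_dict())
```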

5.3 Results

We trained the RL agent for one million steps, running on an NVIDIA RTX 3070 GPU, which took about 8 to 10 hours.

Fig. 3 The variation of the reward value during training

Fig. 3 shows the variation of the reward value during training. In the early stage, the reward obtained by the RL agent is low because the RL agent performs a lot of random exploration at this time. The reward then tends to increase and eventually converges to the highest attainable value. This is because, in the later stages, the RL agent receives a high speed reward (the maximum speed reward is 1) and rarely receives negative rewards, indicating that the DQN algorithm converges successfully. In Section 4.3, several rules are used to assist reinforcement learning. Without these rules, the RL agent would take random actions in the early stage of training and the reward value would fluctuate around 0. Adding these rules prevents the RL agent from choosing some wrong actions and improves learning efficiency. As can be seen in Fig. 3, the reward values are mostly positive in the early stage of training, which shows that the rules are effective.

To demonstrate the advantages of the proposed multi-objective driving strategy, we also trained a single-objective driving strategy that optimizes speed only, referring to the reward function used by Min et al. in their study [27]. The reward function for the single-objective strategy is as follows:

$$\begin{aligned} r_{speed}^{\prime }(v)= & {} \frac{v-40}{v_{\max }-40} \end{aligned}$$
(11)
$$\begin{aligned} {r_{overtake}^{\prime }}= & {} \left\{ \begin{array}{l} 0.5 , \mathrm{if}\, \mathrm{agent}\, \mathrm{overtakes}\, \mathrm{a}\, \mathrm{vehicle}\\ 0 , {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$
(12)
$$\begin{aligned} {r_{collision}^{\prime }}= & {} \left\{ \begin{array}{l} - 10 , \mathrm{if}\, \mathrm{agent}\, \mathrm{collides}\\ 0 , {\mathrm{otherwise}} \end{array} \right. \end{aligned}$$
(13)
$$\begin{aligned} {r_{single\_objective}}= & {} {r_{speed}^{\prime }} + {r_{overtake}^{\prime }} + {r_{collision}^{\prime }} \end{aligned}$$
(14)

We used the same hyperparameters as in Table 1 and tested each of the two strategies for 50,000 steps. During testing, the RL agent picks actions greedily according to the maximum Q value. Fig. 4 shows the speed of the vehicle under both strategies, Fig. 5 shows the acceleration of the vehicle under both strategies, and Fig. 6 shows the total number of lane changes in each episode of the test.

Fig. 4 The speed of the vehicle under both strategies. The data are from one episode of the test; each data point represents the average speed over 50 steps

Fig. 5 The acceleration of the vehicle under both strategies. The data are from one episode of the test; the acceleration is computed from the speed

Fig. 6 The total number of lane changes in each episode. The statistics were computed by running a greedy policy for 50,000 steps

Table 2 and Table 3 show statistics on speed and acceleration, respectively. The statistics are calculated from the results of the 50,000-step test. We use the root mean square (RMS) value of acceleration to evaluate passenger comfort, following the method of the international standard ISO 2631-1 [30] (a minimal code sketch of the RMS computation is given after Table 4). ISO 2631-1 also provides a qualitative evaluation method for passenger comfort [30], as shown in Table 4. According to Table 4, the comfort level of the proposed multi-objective driving strategy is rated as “a little uncomfortable”, with an RMS value very close to “not uncomfortable”. Meanwhile, the comfort rating of the single-objective strategy is between “uncomfortable” and “very uncomfortable”.

Table 2 Speed statistics for both strategies. The statistics were computed by running a greedy policy for 50,000 steps
Table 3 Acceleration statistics for both strategies
Table 4 A qualitative evaluation method for passenger comfort, from ISO 2631-1 [30]
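A minimal sketch of how the RMS of the longitudinal acceleration can be computed from a logged speed trace; the sampling period is an illustrative assumption, and note that ISO 2631-1 [30] formally uses frequency-weighted accelerations, whereas this sketch computes an unweighted RMS from finite differences.

```python
import numpy as np


def rms_acceleration(speeds_kmh: np.ndarray, dt_s: float = 0.1) -> float:
    """Root mean square of longitudinal acceleration (m/s^2) from a speed trace.

    `speeds_kmh` is a uniformly sampled speed log in km/h; `dt_s` is the
    (assumed) sampling period in seconds.
    """
    speeds_ms = np.asarray(speeds_kmh, dtype=float) / 3.6   # km/h -> m/s
    accel = np.diff(speeds_ms) / dt_s                        # finite-difference acceleration
    return float(np.sqrt(np.mean(accel ** 2)))
```

For example, rms_acceleration(np.array([50.0, 52.0, 51.0, 49.0]), dt_s=1.0) returns about 0.48 m/s^2.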

Comparing these statistics, the maximum speed of the vehicle with the single-objective strategy reached 80 km/h, but the speed fluctuated dramatically; the maximum speed of the vehicle with the multi-objective strategy was around 60 km/h, but the speed changed smoothly. Although the maximum speed achieved by the single-objective driving strategy is much higher, the difference in average speed between the two strategies is small. The reason for this is the presence of many other vehicles on the road. If the vehicle’s speed is too high, it must brake and change lanes frequently, which not only reduces comfort but also lowers the average speed. Conversely, if the vehicle maintains an appropriate speed, it is possible to strike a balance between comfort and average speed. The single-objective strategy tends to drive as fast as possible, which leads to frequent lane changes and frequent emergency braking, resulting in dramatic speed fluctuations and no significant increase in average speed. This indicates that the single-objective strategy offers poor comfort and no great advantage in transport efficiency. The multi-objective driving strategy, in contrast, does not blindly increase speed even when conditions permit, but maintains a more reasonable speed level. Under this strategy, although the maximum speed of the vehicle is lower, emergency braking and lane changes are rarely needed because the speed difference between the RL agent’s vehicle and the surrounding vehicles is small. Also, the average speed can reach a high level because the speed is more stable. Our multi-objective strategy not only provides higher comfort but also ensures a sufficient average speed.

6 Conclusion

A multi-objective reinforcement learning approach has been proposed with the purpose of learning a multi-objective autonomous driving strategy. A multi-objective reward function has been constructed to optimize the two objectives of vehicle speed and passenger comfort. Experiments were performed in a simulator. Compared with the single-objective strategy, the multi-objective driving strategy achieves much gentler speed changes and far fewer lane changes while sacrificing average speed only slightly. The results show that the proposed approach strikes a good balance between transportation efficiency and passenger comfort, which is more in line with practical requirements.