1 Introduction

Autonomous driving has received significant research interest in the past two decades due to its many potential societal and economic benefits. Compared to traditional vehicles, autonomous vehicles (AVs) not only promise fewer emissions [1] but are also expected to improve safety and efficiency. However, high-level decision-making for AVs remains a major challenge due to the complex and dynamic traffic environment, especially in mixed traffic where AVs co-exist with other road users. Lane changing is one of the most challenging high-level decision-making tasks for AVs and has a significant influence on traffic safety and efficiency [2, 3].

The considered lane-changing scenario is illustrated in Fig. 1, where AVs and human-driven vehicles (HDVs) co-exist on a one-way highway with two lanes. The AVs aim to travel safely through the traffic while making necessary lane changes to overtake slow-moving vehicles for improved efficiency. Furthermore, in the presence of multiple AVs, the AVs are expected to collaboratively learn a policy that adapts to HDVs and enables safe and efficient lane changes. Because HDVs introduce unknown and uncertain behaviors, planning and control in such mixed traffic to realize safe and efficient maneuvers is a challenging task [4].

Figure 1

Illustration of the considered lane-changing scenario (green: AVs, blue: HDVs, arrow curve: a possible trajectory of the ego vehicle AV1 to make the lane change)

Recently, reinforcement learning (RL) has emerged as a promising framework for autonomous driving due to its online adaptation capabilities and its ability to solve complex problems [5, 6]. Several recent studies have explored the use of RL for AV lane changing [4, 7, 8]; these consider a single-AV setting in which the ego vehicle learns a lane-changing behavior by treating all other vehicles as part of the driving environment. While this single-agent approach is readily scalable, it leads to unsatisfactory performance in complex environments such as multi-AV lane changing in mixed traffic, which requires close collaboration and coordination among AVs [9].

On the other hand, multi-agent reinforcement learning (MARL) has advanced greatly and has been successfully applied to a variety of complex multi-agent systems such as games [10], traffic light control [11], and fleet management [12]. MARL algorithms have also been applied to autonomous driving [13–16], with the objective of accomplishing driving tasks cooperatively while reacting to HDVs in a timely manner. In particular, MARL methods [15, 17] have been applied to highway lane-change tasks and show promising, scalable performance, with AVs learning cooperatively by sharing the same objective (i.e., reward/cost function) that accounts for safety and efficiency. However, these reward designs often ignore passenger comfort, which may lead to sudden acceleration and deceleration and hence ride discomfort. In addition, they assume that HDVs follow fixed, universal human driving behaviors, which is clearly oversimplified and impractical in the real world, as different human drivers tend to behave quite differently. Learning algorithms should therefore cope with different human driving behaviors, e.g., aggressive or mild ones.

To address the above issues, we develop a multi-agent reinforcement learning algorithm that employs a multi-agent advantage actor-critic network (MA2C) for multi-AV lane-changing decision making, featuring a novel local reward design that incorporates safety, efficiency, and passenger comfort, as well as a parameter-sharing scheme to foster inter-agent collaboration. The main contributions and technical advancements of this paper are summarized as follows.

  1. We formulate the multi-AV highway lane-changing problem in mixed traffic as a decentralized cooperative MARL problem, in which agents cooperatively learn a safe and efficient driving policy.

  2. We develop a novel, efficient, and scalable multi-agent advantage actor-critic network model by introducing a parameter-sharing mechanism and an effective reward function design.

  3. We conduct a comprehensive empirical study on three different traffic densities and two levels of driver behavior modes, and compare against state-of-the-art models to demonstrate the driving safety, efficiency, and driver comfort of our approach.

The rest of the paper is organized as follows. Section 2 reviews the state-of-the-art dynamics-based and RL/MARL algorithms for autonomous driving tasks. The preliminaries of RL and the proposed MARL algorithm are introduced in Sect. 3. Experiments, results, and discussions are presented in Sect. 4. Finally, we summarize the paper and discuss future work in Sect. 5.

2 Related work

In this section, we survey the existing literature on decision-making tasks in autonomous driving, which can be mainly classified into two categories: non-data-driven and data-driven methods.

2.1 Non-data-driven methods

Conventional rule-based or model-based approaches [18–20] rely on hard-coded rules or dynamical models to construct predefined logic that determines the behavior of ego vehicles in different situations. For instance, lane-changing guidance can be realized by establishing virtual trajectory references for every vehicle, and a safe trajectory is then planned by considering the trajectories of other vehicles [18]. In [19], a low-complexity lane-changing algorithm was developed that follows heuristic rules such as keeping appropriate inter-vehicle gaps and choosing suitable time instants to perform the maneuver. Later, an optimization-based lane-change approach was proposed [20], which formulates trajectory planning as coupled longitudinal and lateral predictive control problems solved via quadratic programs under specific system constraints. However, the rules and optimization criteria for real-world driving may become too complex to be explicitly formulated for all scenarios. This problem is even more serious in mixed-traffic scenarios with unknown or stochastic driver behaviors.

2.2 Data-driven methods

Recently, data-driven methods such as reinforcement learning (RL) have received great attention and have been widely explored for autonomous driving tasks. In particular, a model-free RL approach based on the deep deterministic policy gradient (DDPG) was proposed in [5] to learn a continuous control policy for efficient lane changing. A safe RL framework [4] was then presented that integrates a lane-changing regret model into a safety supervisor based on an extended double deep Q-network (DDQN). In [21], a hierarchical RL algorithm was developed to learn lane-changing behaviors in dense traffic using designed temporal and spatial attention strategies, and promising performance was demonstrated in The Open Racing Car Simulator (TORCS) under various lane-change scenarios. However, the aforementioned methods are designed for single-agent (i.e., one ego vehicle) scenarios and treat all other vehicles as part of the environment, which makes them unsuitable for the considered multi-agent lane-changing setting, where collaboration and coordination among AVs are required.

On the other hand, multi-agent reinforcement learning (MARL) algorithms have also been explored for autonomous driving tasks [13, 14, 22, 23]. An MARL algorithm with hard-coded safety constraints [13] was proposed to solve the double-merge problem; in that framework, a hierarchical temporal abstraction method was applied to reduce the effective horizon and the variance of the gradient estimation error. In [23], an MARL algorithm was developed to solve the on-ramp merging problem, with safety enhanced by a novel priority-based safety supervisor. In addition, an MARL approach [22] combining a Graph Convolutional Network (GCN) [24] and a Deep Q-Network (DQN) [25] was proposed to better fuse the information acquired from collaborative sensing, showing promising results on a three-lane freeway environment with two off-ramps. However, these MARL algorithms consider only efficiency and safety in their reward functions; another important factor, passenger comfort, is not taken into account. Furthermore, these approaches assume that HDVs follow a constant, universal driving behavior, which limits their applicability in the real world, as different human drivers may behave quite differently.

In this paper, we formulate the decision making of multiple AVs for highway lane changing as an MARL problem, where a multi-objective reward function is proposed to simultaneously promote safety, efficiency, and passenger comfort. A parameter-sharing scheme is exploited to foster inter-agent collaboration. Experimental results on three different traffic densities with two levels of driver aggressiveness show that the proposed MARL method performs well across different lane-change scenarios.

3 Problem formulation

In this section, we review the preliminaries of RL and formulate the considered highway lane-changing problem as a partially observable Markov decision process (POMDP). Then we present the proposed multi-agent actor-critic algorithm, featuring a parameter-sharing mechanism and efficient reward function design, to solve the formulated POMDP.

3.1 Preliminary of RL

In the standard RL setting, an agent aims to learn an optimal policy \(\pi ^{*}\) that maximizes the accumulated future reward \(R_{t} = \sum_{k=0}^{T} \gamma ^{k} r_{t+k}\) from time step t with discount factor \(\gamma \in (0,1]\) by continuously interacting with the environment. Specifically, at time step t, the agent receives a state \(s_{t} \in \mathcal{S}\subseteq \mathbb{R}^{n}\) from the environment and selects an action \(a_{t} \in \mathcal{A}\) according to its policy \(\pi :\,\mathcal{S}\rightarrow \Pr (\mathcal{A})\). The agent then transitions to the next state \(s_{t+1}\) and receives a scalar reward \(r_{t}\). If the agent can only observe part of the state \(s_{t}\), the underlying process becomes a POMDP [26], and the goal is then to learn a policy that maps the partial observation to an appropriate action so as to maximize the reward.
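As a simple illustration, the discounted return \(R_{t}\) can be computed recursively backwards over a finite episode, as in the following Python sketch (an illustrative example, not the implementation used in this paper):

```python
# Illustrative sketch: compute R_t = sum_k gamma^k r_{t+k} for every step
# of one finite episode of rewards.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return R_t for every time step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Iterate backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approx. [2.62, 1.8, 2.0]
```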

The action-value function \(Q^{\pi }(s, a) = E[R_{t}\mid s_{t}=s, a_{t}=a]\) is defined as the expected return obtained by selecting action a in state s and following policy π afterwards. The optimal Q-function is given by \(Q^{*}(s,a) = \max_{\pi } Q^{\pi }(s,a)\) for state s and action a. Similarly, the state-value function \(V^{\pi }(s_{t}) = E_{\pi }{[R_{t}\mid s_{t}]}\) represents the expected return obtained by following policy π from state \(s_{t}\). In model-free RL methods, the policy is often represented by a neural network \(\pi _{\theta }(a_{t}\mid s_{t})\), where θ denotes the learnable parameters. In advantage actor-critic (A2C) algorithms [27], a critic network parameterized by ω learns the state-value function \(V_{\omega }^{\pi _{\theta }}(s_{t})\), and an actor network \(\pi _{\theta }(a_{t}\mid s_{t})\) parameterized by θ updates the policy distribution in the direction suggested by the critic as follows:

$$ \theta \leftarrow \theta + E_{\pi _{\theta }} \bigl[ \bigl(\nabla _{ \theta } \log \pi _{\theta }(a_{t}{ \mid }s_{t}) \bigr) A_{t} \bigr], $$
(1)

where the advantage function \(A_{t}= Q^{\pi _{\theta }}(s_{t},a_{t}) - V_{\omega }^{\pi _{\theta }}(s_{t})\) [27] is introduced to reduce the sample variance. The parameters of the state-value function are then updated by minimizing the following loss function:

$$ \min_{\omega } E_{\mathcal{B}} \bigl(R_{t} + \gamma V_{\omega '} (s_{t+1}) - V_{\omega }(s_{t}) \bigr)^{2}, $$
(2)

where \(\mathcal{B}\) is the experience replay buffer that stores previously encountered trajectories and \(\omega '\) denotes the parameters of the target network [25].
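A minimal PyTorch sketch of the resulting actor-critic update is given below; the network sizes, the one-step bootstrapped advantage estimate, and the omission of a separate target network are simplifying assumptions made for illustration only:

```python
# Illustrative sketch of the updates in Eqns. (1)-(2), not the exact implementation.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy logits pi_theta(a|s)
        self.critic = nn.Linear(hidden, 1)          # state value V_omega(s)

    def forward(self, obs):
        h = self.body(obs)
        return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

def a2c_update(net, optimizer, obs, actions, rewards, next_obs, gamma=0.99):
    """One batched update: policy gradient weighted by the advantage A_t (Eqn. (1)),
    plus a TD(0) value-regression loss as in Eqn. (2)."""
    dist, values = net(obs)
    with torch.no_grad():
        _, next_values = net(next_obs)
        targets = rewards + gamma * next_values          # bootstrapped target
    advantages = targets - values                        # one-step estimate of A_t
    policy_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random data (5 observations of dimension 8, 3 discrete actions)
net = ActorCritic(obs_dim=8, n_actions=3)
opt = torch.optim.Adam(net.parameters(), lr=5e-4)
obs, next_obs = torch.randn(5, 8), torch.randn(5, 8)
actions, rewards = torch.randint(0, 3, (5,)), torch.randn(5)
a2c_update(net, opt, obs, actions, rewards, next_obs)
```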

3.2 Lane changing as MARL

In this subsection, we develop a decentralized, MARL-based approach for highway lane changing of multiple AVs. Discontinuous evaluation is a common design choice in the autonomous driving field and is widely used in the literature [28–32]. In particular, we model the mixed-traffic lane-changing environment as a multi-agent network \(\mathcal{G} = (\nu , \varepsilon )\), where each agent (i.e., ego vehicle) \(i \in \nu \) communicates with its neighbors \(\mathcal{N}_{i}\) via the communication links \(\varepsilon _{ij}\in \varepsilon \). The corresponding POMDP is characterized as \((\{\mathcal{A}_{i}, \mathcal{O}_{i}, \mathcal{R}_{i}\}_{i\in \nu }, \mathcal{T})\), where \(\mathcal{O}_{i}\) is a partial description of the environment state, as stated in [33]. In a multi-agent POMDP, each agent i follows a decentralized policy \(\pi _{i}: \mathcal{O}_{i} \times \mathcal{A}_{i} \rightarrow [0, 1]\) to choose its action \(a_{i,t}\) at time step t. The POMDP components are defined as follows:

  1. State Space: The state space \(\mathcal{O}_{i}\) of agent i is defined as a matrix of size \(\mathcal{N}_{N_{i}}\times \mathcal{F}\), where \(\mathcal{N}_{N_{i}}\) is the number of detected vehicles and \(\mathcal{F}\) is the number of features used to represent the current state of each vehicle. The features include the longitudinal position x and lateral position y of the observed vehicle relative to the ego vehicle, as well as its longitudinal speed \(v_{x}\) and lateral speed \(v_{y}\) relative to the ego vehicle.

  2. Action Space: The action space \(\mathcal{A}_{i}\) of agent i is defined as a set of high-level control decisions, including speed up, slow down, cruising, turn left, and turn right. The joint action space of the AVs is defined as \(\mathcal{A}=\mathcal{A}_{1}\times \mathcal{A}_{2}\times \cdots \times \mathcal{A}_{N}\), where N is the total number of AVs in the scene.

  3. Reward Function: Multiple metrics, including safety, traffic efficiency, and passenger comfort, are considered in the reward function design:

    • safety evaluation \(r_{s}\): The vehicle should operate without collisions.

    • headway evaluation \(r_{d}\): The vehicle should maintain a safe distance from the preceding vehicles during driving to avoid collisions.

    • speed evaluation \(r_{v}\): Under the premise of ensuring safety, the vehicle is expected to drive at a high and stable speed.

    • driving comfort \(r_{c}\): Smooth acceleration and deceleration are expected to ensure safety and comfort. In addition, frequent lane changes should be avoided.

    As such, the multi-objective reward \(r_{i,t}\) at time step t is defined as:

    $$ r_{i,t}=\omega _{s}r_{s}+ \omega _{d}r_{d}+\omega _{v}r_{v}- \omega _{c}r_{c}, $$
    (3)

    where \(\omega _{s}\), \(\omega _{d}\), \(\omega _{v}\) and \(\omega _{c}\) are the weighting coefficients. We set the safety factor \(\omega _{s}\) to a large value, because safety is the most important criterion during driving. The details of the four performance measurements are discussed next (a code-level sketch of the full reward computation is given after this list):

    (1) If there is no collision, the collision evaluation \(r_{s}\) is set to 0; otherwise, \(r_{s}\) is set to −1.

    (2) The headway evaluation is defined as

      $$ r_{d}=\log \frac{d_{\text{headway}}}{v_{t}t_{d}}, $$
      (4)

      where \(d_{\text{headway}}\) is the distance to the preceding vehicle, and \(v_{t}\) and \(t_{d}\) are the current vehicle speed and time headway threshold, respectively.

    (3) The speed evaluation \(r_{v}\) is defined as

      $$ r_{v}=\min \biggl\{ \frac{v_{t}-v_{\min }}{v_{\max }-v_{\min }},1 \biggr\} , $$
      (5)

      where \(v_{t}\), \(v_{\min }\) and \(v_{\max }\) are the current, minimum, and maximum speeds of the ego vehicle, respectively. Within the specified speed range, higher speed is preferred to improve the driving efficiency.

    (4) The driving comfort \(r_{c}\) is defined as

      $$ r_{c}=r_{a}+r_{lc}, $$
      (6)

      where

      $$ r_{a}=\textstyle\begin{cases} -1,& \vert a_{t} \vert \geq a_{th}, \\ 0,& \vert a_{t} \vert < a_{th} \end{cases} $$

      is the penalty term for rapid acceleration or deceleration whose magnitude exceeds a given threshold \(a_{th}\). Here \(a_{t}\) denotes the acceleration at time t.

      $$ r_{lc}=\textstyle\begin{cases} -1,&\text{change lane},\\ 0,&\text{keep lane} \end{cases} $$

      is defined as the lane-change penalty. Excessive lane changes can cause discomfort and safety issues. Note that this penalty term serves to avoid frequent, unnecessary lane changes, while necessary lane changes (i.e., those that maintain safety and efficiency) are still encouraged through the safety and speed evaluation terms.

  4. Transition Probability: The transition probability \(T(s'\mid s,a)\) characterizes the transition from one state to another. Since our MARL algorithm is model-free, we do not assume any prior knowledge of the transition probability.
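As a concrete illustration of the reward design above, the following Python sketch computes the per-step reward of Eqns. (3)–(6) for one agent. The acceleration threshold \(a_{th}\) used here is a hypothetical value, and the weighting coefficients follow the defaults later adopted in Sect. 4.2:

```python
# Illustrative sketch of the multi-objective reward in Eqns. (3)-(6).
# The threshold a_th = 3.0 m/s^2 is a hypothetical example value.
import math

def step_reward(collided: bool, d_headway: float, v: float, v_min: float,
                v_max: float, t_d: float, accel: float, changed_lane: bool,
                a_th: float = 3.0, w_s: float = 200.0, w_d: float = 4.0,
                w_v: float = 1.0, w_c: float = 1.0) -> float:
    """Reward of Eqn. (3) for one agent at one time step."""
    r_s = -1.0 if collided else 0.0                       # safety evaluation
    r_d = math.log(d_headway / (v * t_d))                 # headway evaluation, Eqn. (4)
    r_v = min((v - v_min) / (v_max - v_min), 1.0)         # speed evaluation, Eqn. (5)
    r_a = -1.0 if abs(accel) >= a_th else 0.0             # harsh accel/decel penalty
    r_lc = -1.0 if changed_lane else 0.0                  # lane-change penalty
    r_c = r_a + r_lc                                      # driving comfort, Eqn. (6)
    return w_s * r_s + w_d * r_d + w_v * r_v - w_c * r_c  # Eqn. (3)

# Example: no collision, 40 m headway at 28 m/s, 1.2 s headway threshold
print(step_reward(False, 40.0, 28.0, 25.0, 30.0, 1.2, accel=0.5, changed_lane=False))
```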

3.3 MA2C for AVs

In this paper, we extend the advantage actor-critic network [27] to the multi-agent setting, yielding a multi-agent advantage actor-critic network (MA2C). MA2C improves the stability and scalability of the learning process by allowing certain communication among agents [33]. To take advantage of the homogeneous agents in the considered MARL setting, we assume all agents share the same network structure and parameters, while they are still able to take different maneuvers according to their different input states. The goal in the cooperative MARL setting is to maximize the global reward of all agents. To alleviate the communication overhead and the credit assignment problem [34], we adopt the local reward design [23] as follows:

$$ r_{i, t} = \frac{1}{\mid {\nu _{i}}\mid } \sum _{j\in \nu _{i}} r_{j,t}, $$
(7)

where \(\mid \nu _{i} \mid \) denotes the cardinality of the set containing the ego vehicle and its close neighbors. Compared to the global reward design previously used in [22, 35], this local reward design mitigates the influence of remote agents.

The backbone of the proposed MA2C network is shown in Fig. 2, in which state features grouped by physical unit are first processed by separate 64-neuron fully connected (FC) layers. All hidden units are then combined and fed into a 128-neuron FC layer, from which the shared actor-critic network updates the policy and value estimates. As noted in [23], the adopted parameter-sharing scheme [12] between the actor and critic networks can greatly improve learning efficiency. A code-level sketch of this architecture is given after Fig. 2.

Figure 2

The architecture of the proposed MA2C network with shared actor-critic network design, where x and y are the longitudinal and lateral position of the observed vehicle relative to the ego vehicle, and \(v_{x}\) and \(v_{y}\) are the longitudinal and lateral speed of the observed vehicle relative to the ego vehicle
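The following PyTorch sketch reflects one possible reading of this architecture; the grouping of features into position and speed branches and the assumed observation sizes are illustrative choices rather than the exact implementation:

```python
# Sketch of the shared MA2C backbone: per-unit 64-neuron FC branches,
# a 128-neuron merge layer, and shared actor/critic heads. Layer names,
# the position/speed split, and the observation sizes are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedMA2CNet(nn.Module):
    def __init__(self, pos_dim: int, vel_dim: int, n_actions: int):
        super().__init__()
        self.pos_fc = nn.Sequential(nn.Linear(pos_dim, 64), nn.ReLU())
        self.vel_fc = nn.Sequential(nn.Linear(vel_dim, 64), nn.ReLU())
        self.merge = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.actor_head = nn.Linear(128, n_actions)   # policy logits
        self.critic_head = nn.Linear(128, 1)          # state value

    def forward(self, pos_feats, vel_feats):
        h = self.merge(torch.cat([self.pos_fc(pos_feats),
                                  self.vel_fc(vel_feats)], dim=-1))
        return Categorical(logits=self.actor_head(h)), self.critic_head(h).squeeze(-1)

# All agents share one instance; different observations still yield different actions.
# Assumed: 5 observed vehicles, each with relative (x, y) and (v_x, v_y); 5 actions.
shared_net = SharedMA2CNet(pos_dim=2 * 5, vel_dim=2 * 5, n_actions=5)
```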

The pseudo-code of the proposed MA2C algorithm is shown in Algorithm 1, and a high-level sketch of the training loop is given after it. The hyperparameters include the discount factor γ, the learning rate η, the politeness coefficient p, and the epoch length T. Specifically, each agent receives an observation \(O_{i,t}\) from the environment and selects an action according to its policy (Lines 3-6). After each episode is completed, the network parameters are updated accordingly (Lines 9-11). If an episode is completed or a collision occurs, the “DONE” signal is raised and the environment is reset to its initial state to start a new epoch (Lines 13-14).

Algorithm 1

MARL for AVs
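The following high-level Python sketch outlines the training loop of Algorithm 1 under parameter sharing and the local reward of Eqn. (7). The environment interface (reset, step, neighbors) and the update routine are hypothetical placeholders rather than the actual implementation:

```python
# High-level sketch of the MA2C training loop under parameter sharing.
# env, shared_net.act, and a2c_update_from_rollout are hypothetical placeholders.
def train_ma2c(env, shared_net, optimizer, n_agents, epochs, T, gamma=0.99):
    for epoch in range(epochs):
        obs = env.reset()                      # list of per-agent observations
        rollout = []                           # on-policy rollout storage
        for t in range(T):
            # Shared policy, decentralized execution: each agent acts on its own obs.
            actions = [shared_net.act(obs[i]) for i in range(n_agents)]
            next_obs, rewards, done, info = env.step(actions)
            # Local reward of Eqn. (7): average over the ego vehicle and its neighbors.
            local_rewards = []
            for i in range(n_agents):
                group = [i] + env.neighbors(i)             # the set nu_i
                local_rewards.append(sum(rewards[j] for j in group) / len(group))
            rollout.append((obs, actions, local_rewards, next_obs))
            obs = next_obs
            if done:                           # collision or episode end: reset
                break
        # One shared actor-critic update from the collected rollout (Eqns. (1)-(2)).
        a2c_update_from_rollout(shared_net, optimizer, rollout, gamma)
```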

4 Experiments and discussion

In this section, we evaluate the performance of the proposed MARL algorithm in terms of training efficiency, safety, and driving comfort in the considered highway lane changing scenario shown in Fig. 1.

4.1 HDV models

In this experiment, we assume that the longitudinal control of HDVs follows the Intelligent Driver Model (IDM) [36], a deterministic continuous-time model describing the dynamics of the position and speed of each vehicle. It accounts for the desired speed, the distance between vehicles, and the acceleration/deceleration behavior arising from different driving habits. In addition, the Minimizing Overall Braking Induced by Lane changes (MOBIL) model [37] is adopted for lateral control. It takes vehicle accelerations as input and works well with most car-following models. The safety criterion is defined as follows:

$$ \tilde{a}_{n} \ge -b_{\text{safe}}, $$
(8)

where \(\tilde{a}_{n}\) is the acceleration of the new follower after the lane change, and \(b_{\text{safe}}\) is the maximum braking imposed on the new follower. If the inequality in Eqn. (8) is satisfied, the ego vehicle is allowed to change lanes. The incentive condition is defined as:

$$ {\underbrace{\tilde{a}_{c}-a_{c}}_{\mathrm{ego~vehicle}}}+p ({ \underbrace{\tilde{a}_{n}-a_{n}}_{\mathrm{new~follower}}}+{ \underbrace{\tilde{a}_{o}-a_{o}}_{\mathrm{old~follower}}} ) \ge \Delta a_{th}, $$
(9)

where a and ã denote the accelerations before and after the lane change, respectively, with subscripts c, n, and o referring to the ego vehicle, the new follower, and the old follower, and \(\Delta a_{th}\) is the threshold that determines whether to trigger the lane change. The politeness coefficient p controls how much weight is given to the followers: \(p=1\) represents the most considerate driver, whose decision to change lanes may give way to the blocked following vehicles, whereas \(p=0\) characterizes the most aggressive driver, who makes selfish lane-changing decisions by considering only its own speed gain and ignoring other vehicles. The performance under different p values is discussed in Sect. 4.3.3.
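The two MOBIL criteria in Eqns. (8)–(9) can be summarized in the following Python sketch, where the acceleration values in the example are illustrative and \(b_{\text{safe}}\) is passed as a positive braking magnitude:

```python
# Sketch of the MOBIL lane-change test of Eqns. (8)-(9); the acceleration
# arguments would come from the IDM car-following model.
def mobil_lane_change_ok(a_c, a_c_new, a_n, a_n_new, a_o, a_o_new,
                         p: float, b_safe: float, delta_a_th: float) -> bool:
    """a_*: accelerations before the change, a_*_new: after the change,
    for the ego vehicle (c), new follower (n), and old follower (o)."""
    # Safety criterion, Eqn. (8): the new follower must not brake harder than b_safe.
    if a_n_new < -b_safe:
        return False
    # Incentive criterion, Eqn. (9): own gain plus politeness-weighted effect
    # on the followers must exceed the switching threshold.
    incentive = (a_c_new - a_c) + p * ((a_n_new - a_n) + (a_o_new - a_o))
    return incentive >= delta_a_th

# Example with the aggressive setting p = 0 used in Sect. 4.2
print(mobil_lane_change_ok(a_c=0.2, a_c_new=0.5, a_n=0.1, a_n_new=-0.3,
                           a_o=0.0, a_o_new=0.2, p=0.0,
                           b_safe=9.0, delta_a_th=0.1))  # True
```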

4.2 Experimental settings

The simulation environment is modified from the gym-based highway-env simulator [38]. We set the highway road length to 520 m; vehicles beyond this range are ignored. Vehicles are randomly spawned on the highway with initial speeds between 25 and 30 m/s (56–67 mph). The vehicle-control sampling frequency is set to the default value of 5 Hz. The motions of HDVs follow the IDM and MOBIL models, where the maximum deceleration for safety purposes is limited by \(b_{\text{safe}}=-9~\text{m/s}^{2}\), the politeness factor p is 0, and the lane-changing threshold \(\Delta a_{th}\) is set to 0.1 m/s². To evaluate the effectiveness of the proposed method, three traffic density levels are employed, corresponding to low, middle, and high levels of traffic congestion. The number of vehicles in each traffic mode is shown in Table 1.

Table 1 Traffic density modes

We train the MARL algorithms for 1 million steps (10,000 epochs) using two different random seeds, with the same seed shared among agents in each run. We evaluate each model 3 times every 200 training episodes. The discount factor γ and learning rate η are set to 0.99 and \(5 \times 10^{-4}\), respectively. The weighting coefficients in the reward function are set to \(\omega _{s}=200\), \(\omega _{d}=4\), \(\omega _{v}=1\) and \(\omega _{c}=1\). These experiments are conducted on a macOS machine with a 2.7 GHz Intel Core i5 processor and 8 GB of memory.
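For reference, these settings can be gathered into a single configuration structure, as sketched below; the key names are illustrative, and the braking limit is given as a positive magnitude:

```python
# Experimental settings collected as a plain configuration dictionary.
# Key names are illustrative; values are taken from the text above.
CONFIG = {
    "road_length_m": 520,
    "initial_speed_range_mps": (25, 30),
    "control_frequency_hz": 5,
    "idm_mobil": {"b_safe_mps2": 9.0,        # maximum safe braking magnitude
                  "politeness_p": 0.0,
                  "delta_a_th_mps2": 0.1},
    "training": {"total_steps": 1_000_000, "epochs": 10_000, "seeds": 2,
                 "eval_every_episodes": 200, "eval_runs": 3,
                 "gamma": 0.99, "learning_rate": 5e-4},
    "reward_weights": {"w_s": 200, "w_d": 4, "w_v": 1, "w_c": 1},
}
```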

4.3 Results & analysis

4.3.1 Local vs. global reward designs

Figure 3 shows the performance comparison between the proposed local reward design and the global reward design [22, 35] (both with shared actor-critic parameters). In all three traffic modes, the local reward design consistently outperforms the global design, with larger evaluation rewards and smaller variance. Although the global reward design performs better during the first 2000 epochs, its variance is relatively large. In addition, the performance gap widens as the number of vehicles increases. This is because the global reward design is more prone to credit assignment issues, as discussed in [34].

Figure 3

Performance comparisons between local and global reward designs. The shaded region denotes the standard deviation over 2 random seeds

4.3.2 Shared vs. separate actor-critic networks

Figure 4 compares training with and without sharing the actor-critic network parameters. Sharing the actor-critic network clearly performs better: in all three modes it yields higher rewards and lower variance. The reason is that, with separate networks, the critic cannot reliably guide the actor in the correct training direction until the critic itself is well trained, which may take a long time. In contrast, with a shared actor-critic network, the actor benefits from the state representation learned by the critic [23, 39].

Figure 4

Performance comparisons with and without actor-critic network sharing

4.3.3 Verification of driving comfort

In this subsection, we evaluate the effectiveness of the proposed multi-objective reward function with the driving-comfort term in Eqn. (3). Figure 5 shows the acceleration and deceleration of the AV with and without the comfort measurement defined in Eqn. (6). The reward design with the comfort measurement yields a lower variance (average deviation: 0.455 m/s²) and a smoother acceleration profile than the design without the comfort term (average deviation: 0.582 m/s²), indicating that the proposed reward design provides good driving comfort.

Figure 5

Performance comparison of acceleration between the reward designs with and without the comfort measurement

4.3.4 Adaptability of the proposed method

In this subsection, we evaluate the proposed MA2C under different HDV behaviors, controlled by the politeness coefficient p in Eqn. (9), where \(p=0\) corresponds to the most aggressive behavior and \(p=1\) to the most polite behavior. Figure 6 shows the training performance for the two HDV behavior models (aggressive and polite) under different traffic densities. The proposed algorithm achieves scalable and stable performance whether the HDVs behave aggressively or courteously.

Figure 6

Performance comparisons on different politeness coefficients p under different traffic densities

4.3.5 Comparison with the state-of-the-art benchmarks

To demonstrate the performance of the proposed MARL approach, we compare it with several state-of-the-art MARL methods:

  1. Multi-agent Deep Q-Network (MADQN) [40]: the multi-agent version of Deep Q-Network (DQN) [25], an off-policy RL method that uses a deep neural network to approximate the value function and an experience replay buffer to break correlations between samples and stabilize training.

  2. Multi-agent actor-critic using Kronecker-Factored Trust Region (MAACKTR): the multi-agent version of ACKTR [41], an on-policy RL algorithm that optimizes both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with a trust region.

  3. Multi-agent Proximal Policy Optimization (MAPPO) [42]: a multi-agent version of Proximal Policy Optimization (PPO) [43], which improves on trust region policy optimization (TRPO) [44] by using a clipped surrogate objective and an adaptive KL penalty coefficient.

  4. The proposed MA2C: our method with the designed multi-objective reward function, parameter sharing, and local reward design.

Table 2 shows the average return of the MARL algorithms during evaluation. The proposed MA2C algorithm performs best among the compared MARL algorithms in the density-1 scenario. It also shows promising results in the density-2 and density-3 scenarios, outperforming the MAACKTR and MAPPO algorithms. Note that even though MADQN attains a better average reward than MA2C in those scenarios, it exhibits larger reward deviations, which may cause unstable training and safety issues. MADQN relies only on the current state when computing its value estimates and struggles to derive consistent control strategies, making it less suitable for complex control with large state gaps. In contrast, MA2C exhibits more robust performance, with a clear increasing and plateauing trend. Similarly, the evaluation curves during training are shown in Fig. 7. As expected, the proposed MA2C algorithm outperforms the other benchmarks in terms of evaluation reward and reward standard deviation.

Figure 7

Performance comparisons on accumulated rewards in MADQN, MA2C, MAACKTR, and MAPPO

Table 2 Mean episode reward in different traffic flow scenarios

4.3.6 Policy interpretation

In this subsection, we interpret the learned AV behaviors. Figure 8 shows snapshots during testing at time steps 20, 28, and 40. As shown in Fig. 8(a), ego vehicle ➅ attempts a lane change to achieve a higher speed. To make a safe lane change, ego vehicles ➅ and ➆ are expected to act cooperatively. Specifically, ego vehicle ➆ should slow down to make space for ego vehicle ➅ and avoid a collision, which is also reflected in Fig. 9, where ego vehicle ➆ starts to slow down at around time step 20. Ego vehicle ➅ then speeds up to make the lane change, as shown in Fig. 8(b) and Fig. 9. Meanwhile, ego vehicle ➆ continues to slow down to maintain a safe headway to ego vehicle ➅, as shown in Fig. 9. Figure 8(c) shows the completed lane change, at which point ego vehicle ➆ starts to speed up again. This demonstration shows that the proposed MARL framework learns a reasonable and cooperative policy for the ego vehicles.

Figure 8

Lane change in simulation environment (vehicles ➀-➂: HDVs, vehicles ➅-➇: AVs)

Figure 9

Speeds of AVs ➅ and ➆ and HDV ➀

5 Conclusion

In this paper, we formulated the highway lane-changing problem in mixed traffic as an on-policy MARL problem and extended A2C to the multi-agent setting, featuring a novel local reward design and a parameter-sharing scheme. Specifically, a multi-objective reward function was proposed to simultaneously promote the driving efficiency, comfort, and safety of autonomous driving. Comprehensive experiments, conducted on three different traffic densities under different levels of HDV aggressiveness, show that our proposed MARL framework consistently outperforms several state-of-the-art benchmarks in terms of efficiency, safety, and driver comfort.