1 Introduction

According to the World Health Organization (WHO), road accidents kill 1.3 million people and injure 50 million people each year. Several technologies have been proposed to make driving safer, such as advanced driver assistance systems (ADAS), adaptive cruise control (ACC), and intelligent transportation systems (ITS). The latter, combined with recent technological advances in communication systems, paved the way for the deployment of autonomous vehicles (AVs).

Trommer et al. [1] described five levels of vehicle automation in their technical report, ranging from superficial assistance (level 1) to full automation (level 5). With effective algorithms that prevent fatal accidents, the latter level could make traffic safer. Before reaching full automation, AVs and humans may cohabit in mixed traffic. However, the evidence suggests that accident-free mixed traffic may be impossible [2]. Human drivers follow informal and subjective norms, whereas autonomous vehicles comply with traffic rules [3, 4]. Because of these divergent concerns, AVs are unlikely to be effective in mixed traffic. By contrast, coordinating a fully-autonomous fleet is more straightforward because AVs act homogeneously and are therefore predictable. AVs should be capable of handling all traffic scenarios, whether they drive in mixed traffic or in fully-autonomous fleets. However, because these scenarios are nearly endless, designing rule-based models is practically certain to fail.

With advances in hardware, machine learning approaches provide new opportunities to generalize across driving scenarios. Reinforcement learning (RL) approaches, in particular, are successful at solving sequential decision-making problems, such as Go, Chess, arcade games, and real-time video games [5–9]. In RL, an agent learns and self-corrects by receiving feedback on the quality of its interactions within an environment. Multi-agent RL (MARL) is a more distributed framework in which several agents simultaneously learn cooperative or competitive behavior. Since several decision-makers learn simultaneously and possibly coordinate, more robust and convincing policies can emerge than with single-agent RL approaches.

Several surveys have investigated related aspects of RL for AVs from a broader perspective. Schmidt et al. [10] tackled autonomous mobility, including traffic management, unmanned aerial vehicles (UAVs), AVs, and resource optimization using MARL algorithms. Elallid et al. [11] surveyed AVs’ scene understanding, decision-making, planning, and social behavior using RL approaches. Kiran et al. [12] tackled scene understanding, decision-making, and planning using RL algorithms. Ye et al. [13] tackled motion planning and control using RL approaches. However, no review has investigated the decision-making of autonomous vehicles using MARL algorithms.

Our survey seeks to fill this gap by answering two research questions: (RQ1) what is the recent state of the art of AVs’ decision-making using MARL algorithms; and (RQ2) what are the topic’s primary current limitations. To answer these questions as concisely as possible while considering recent breakthroughs in MARL algorithms, we have restricted this review to sixteen papers published since 2019 (distribution in Fig. 1). We focus our survey on decision-making problems; nonetheless, interested readers can find in [14] a recent survey that focuses on autonomous driving policy learning using deep reinforcement learning (DRL) and deep imitation learning (DIL) techniques.

Figure 1: Distribution of the reviewed papers

We have organized the remainder of this review as follows. Firstly, we introduce the state of the art of RL and MARL algorithms (Sect. 2). Secondly, we highlight the learning schemes and strategies of MARL algorithms (Sect. 3). Thirdly, we review the driving simulation environments (Sect. 4). Fourthly, we investigate articles tackling AVs’ decision-making using MARL algorithms (Sect. 5). Lastly, we discuss open challenges and conclude this study (Sect. 6).

2 Reinforcement learning

This section provides the state of the art of single-agent (2.1) and multi-agent (2.2) reinforcement learning algorithms.

2.1 Single-agent reinforcement learning

Reinforcement learning (RL) is a trial-and-error learning method in which an agent interacts with an environment [5] (Fig. 2). The agent’s goal is to reach the most rewarding states of the environment. To discover these states, the agent explores the environment, grasping its dynamics and devising an appropriate policy (behavior). As a result, the agent gains knowledge from its actions and maximizes the long-term accumulated reward. Non-learning agents that obey stationary policies may also be present in the environment. For an autonomous car, the environment, the state, the actions, and the reward may correspond to the roadway, the positions of other vehicles, accelerating or braking, and collision avoidance, respectively.
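As a minimal illustration of this interaction loop, the following sketch runs a random agent against a toy car-following environment; the environment, actions, and reward values are illustrative assumptions rather than elements of any reviewed paper.

```python
import random

# Toy environment: the state is the gap to a leading vehicle (illustrative assumption).
class ToyRoadEnv:
    def reset(self):
        self.gap = 10                              # initial headway, in arbitrary units
        return self.gap

    def step(self, action):                        # action: 0 = brake, 1 = accelerate
        self.gap += 1 if action == 0 else -1
        crashed = self.gap <= 0
        reward = -10.0 if crashed else 1.0         # penalize collisions, reward progress
        done = crashed or self.gap >= 20           # terminal states
        return self.gap, reward, done

env = ToyRoadEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:                                    # the agent-environment loop
    action = random.choice([0, 1])                 # a learning agent would choose here
    state, reward, done = env.step(action)
    total_reward += reward                         # long-term accumulated reward
```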

Figure 2: Single-agent reinforcement learning

There are three types of RL algorithms: value-based, policy-based, and actor-critic. In value-based methods, the agent implicitly learns a deterministic policy by picking higher-valued actions via a value function that scores state-action pairs. Nevertheless, the value function becomes inefficient as the state-action space grows, notably in continuous spaces [15]. In policy-based methods, the agent explicitly learns a stochastic policy function. However, policy-based approaches suffer from high variance, which slows down the learning process. Actor-critic approaches appear to be a reasonable compromise combining the benefits of the preceding methods: a critic approximates the value function, while an actor learns a policy based on the critic’s estimates, which alleviates the variance. Because they work effectively in real-world contexts with continuous spaces, actor-critic approaches are widespread within the RL community.

We briefly describe the single-agent RL algorithms (Fig. 3) addressed in Sect. 5. Deep Q-network (DQN) [16] is a value-based agent that builds a deep learning model to estimate future rewards and execute the behaviors that lead to the best outcome. Advantage actor-critic (A2C) [17] is an actor-critic agent that builds a stochastic policy by estimating the advantage of taking an action over the others. Deep deterministic policy gradient (DDPG) [18] is an actor-critic agent with a deterministic off-policy update, meaning that the present policy does not guide the learning process. Proximal policy optimization (PPO) [19] extends A2C by replacing the logarithmic policy update with one based on the ratio between the new and old policies, weighted by the advantage and clipped to keep updates small. None of the reviewed papers rely on purely policy-based methods.
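To make these update rules concrete, the snippet below sketches the DQN bootstrap target and PPO’s clipped surrogate objective; the tensor shapes, the discount factor, and the clipping threshold are illustrative assumptions.

```python
import torch

def dqn_target(reward, next_q_values, done, gamma=0.99):
    # Value-based: y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states.
    return reward + gamma * (1.0 - done) * next_q_values.max(dim=-1).values

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    # Actor-critic: surrogate objective based on the new/old policy ratio, weighted by the
    # advantage and clipped to keep updates small (loss to minimize, hence the minus sign).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```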

Figure 3: Taxonomy of RL methods

2.2 Multi-agent reinforcement learning

Multi-agent reinforcement learning (MARL) algorithms involve several agents learning simultaneously in a shared environment. Agents are either cooperative, competitive, or mixed. Cooperative agents may communicate to coordinate their actions (Fig. 4) and often share a common reward function. Conversely, competitive agents play a zero-sum game, attempting to outperform their opponents. When agents behave neither fully cooperatively nor fully competitively, they follow the mixed setting, a general-sum game without any restrictions on the agents’ relations [20].
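The toy functions below sketch how the reward structure encodes these relations; the agent identifiers and payoffs are illustrative assumptions, not taken from the surveyed settings.

```python
def cooperative_rewards(payoffs):
    # Fully cooperative: every agent receives the shared team reward.
    team_reward = sum(payoffs.values())
    return {agent: team_reward for agent in payoffs}

def competitive_rewards(payoffs):
    # Fully competitive (zero-sum): rewards are re-centered so that they sum to zero.
    mean = sum(payoffs.values()) / len(payoffs)
    return {agent: r - mean for agent, r in payoffs.items()}

def mixed_rewards(payoffs):
    # Mixed / general-sum: no restriction on the agents' individual rewards.
    return dict(payoffs)

rewards = cooperative_rewards({"av_0": 1.0, "av_1": -0.5})
```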

Figure 4: MARL with two communicative agents

MARL algorithms follow the same taxonomy as the single-agent RL methods introduced in Fig. 3. Multi-agent extensions of single-agent algorithms are often prefixed with MA, e.g., MAA2C and MADDPG [21, 22]. MARL algorithms are more complicated than single-agent RL approaches because several agents learn simultaneously and constantly co-adapt their policies. This non-stationarity disrupts the dynamics of the environment and impedes the learning process [23]. Furthermore, as the number of agents increases, the joint state-action space expands exponentially, slowing the learning process. The latter phenomenon is called the curse of dimensionality.

In many environments, agents operate with only partial observations of the current state, which makes learning more challenging; for example, it is hard to observe the whole traffic flow while driving on a road. To resolve these occlusions, agents can communicate in cooperative tasks [24]. Connected autonomous vehicles, for example, could share and merge their local observations to better represent traffic, potentially revealing a vehicle in a blind spot. Communication thus mitigates both non-stationarity and partial observability.

Many learning schemes and strategies have been proposed in response to the additional challenges of MARL algorithms, which are exacerbated by the number of agents.

3 Learning schemes

The curse of dimensionality, partial observability, and non-stationarity represent three critical challenges for MARL development. This section introduces how MARL algorithms centralize or decentralize training and execution (3.1) and which learning strategies (3.2) are implemented in the reviewed papers to tackle these challenges.

3.1 Centralization and decentralization

In learning algorithms, an agent learns a policy during a training phase and follows it during the execution phase. These phases, in MARL algorithms, can be either centralized or decentralized. In the centralized one, agents share information to improve their policies, whereas, in the decentralized one, they learn independently with no additional information. Three major learning schemes have been proposed depending on whether the training and execution phases are centralized or decentralized.

3.1.1 Centralized training centralized execution (CTCE)

In the centralized training centralized execution (CTCE) scheme, a central learner gathers information from all agents to learn a joint policy, which mitigates the partial observability and non-stationarity issues. However, CTCE suffers from its centralization, which exacerbates the curse of dimensionality. Furthermore, agents with competing goals may disrupt each other’s policies, making learning harder. Single-agent RL algorithms may suffice here because CTCE does not expressly assume decentralization. In contrast to CTCE, a fully-decentralized scheme has been proposed.

3.1.2 Decentralized training decentralized execution (DTDE)

The decentralized training decentralized execution (DTDE) scheme lets each agent learn independently without exchanging additional information. As a result, agents are unaware of one another’s existence, and the environment appears non-stationary from their viewpoints. Furthermore, Gupta et al. [25] demonstrated that DTDE scales poorly with the number of agents.

One last scheme has been proposed as an intermediary solution, given the previous limitations of the fully-centralized and fully-decentralized approaches.

3.1.3 Centralized training decentralized execution (CTDE)

Lowe et al. [22] introduced the centralized training decentralized execution (CTDE) scheme, which overcomes the shortcomings of the fully-centralized and fully-decentralized approaches. During the training phase, agents share additional information to reduce non-stationarity and partial observability, then discard it during the execution phase. The CTDE scheme includes two popular strategies that can be used depending on the agents’ nature [25].

Parameter sharing

Parameter sharing (PS) is a well-known approach for dealing with large-scale environments where several homogeneous agents cooperate [25]. PS mitigates the curse of dimensionality by allowing all agents to learn simultaneously using a single neural network during the training phase (Fig. 5(a)).
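A minimal sketch of parameter sharing follows: all homogeneous agents query (and would be trained through) a single shared network. The dimensions, agent identifiers, and architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 3                          # assumed observation/action sizes
shared_policy = nn.Sequential(                     # one network for every agent
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
)

def act(observations):
    # observations: dict mapping agent id -> observation tensor of shape (obs_dim,)
    with torch.no_grad():
        return {aid: shared_policy(obs).argmax().item() for aid, obs in observations.items()}

actions = act({"av_0": torch.randn(obs_dim), "av_1": torch.randn(obs_dim)})
```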

Figure 5: CTDE learning schemes

Centralized critic decentralized actor

However, when agents are heterogeneous, the centralized critic decentralized actor strategy is more convenient [22]. It follows the actor-critic architecture: since the critic only serves to assess the actor during training, it is not needed during execution. Therefore, each agent keeps a copy of the actor after the training phase (Fig. 5(b)).
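The sketch below illustrates this strategy in the spirit of MADDPG [22]: a centralized critic consumes the joint observations and actions during training, while each decentralized actor only uses its own observation at execution. Network sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 2, 8, 2               # assumed problem sizes
actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, act_dim), nn.Tanh())
          for _ in range(n_agents)]
critic = nn.Sequential(nn.Linear(n_agents * (obs_dim + act_dim), 64),
                       nn.ReLU(), nn.Linear(64, 1))

def centralized_value(joint_obs, joint_act):
    # Training only: the critic sees every agent's observation and action.
    return critic(torch.cat([joint_obs.flatten(), joint_act.flatten()]))

def decentralized_execution(per_agent_obs):
    # Execution: each agent keeps only its own actor.
    return [actor(obs) for actor, obs in zip(actors, per_agent_obs)]

obs = [torch.randn(obs_dim) for _ in range(n_agents)]
acts = decentralized_execution(obs)
value = centralized_value(torch.stack(obs), torch.stack(acts))
```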

MARL research is still in its infancy, and we have barely skimmed its surface. Interested readers may find comprehensive reviews dedicated to MARL algorithms and challenges [20, 26–30]. In addition to these MARL learning schemes, various RL strategies may overcome multi-agent challenges.

3.2 Learning strategies

This subsection presents some RL strategies inspired by human cognitive mechanisms that were used in the papers discussed in Sect. 5.

3.2.1 Memory

Memory is a mechanism allowing humans to analyze dynamics. Because RL approaches deal with sequential problems, giving agents memory strengthens their ability to figure out the environment’s dynamics [31]. Recurrent neural networks (RNNs) are memory-based neural networks with information cycles that remember past inputs and reuse them in subsequent decisions. As a result, an RNN reduces non-stationarity by improving the analysis of the current dynamics based on these past experiences. In the case of driving, memory enables determining the heading of a vehicle located between two lanes (Fig. 6).
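A minimal sketch of such a memory module is given below: a GRU keeps a hidden state that summarizes past observations, which a Q-value head then exploits. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)   # the memory cell
        self.head = nn.Linear(hidden, n_actions)               # Q-values per action

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); the hidden state carries past information
        out, h = self.gru(obs_seq, h0)
        return self.head(out), h

q_net = RecurrentQNet(obs_dim=6, n_actions=3)
q_values, hidden = q_net(torch.randn(1, 10, 6))                # a sequence of 10 observations
```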

Figure 6: The benefit of memory. It is impossible to figure out the car’s heading without memory (a), while it becomes straightforward with memory (b)

3.2.2 Masking

Masking prevents humans from performing undesirable actions, making the environment safer and decision-making more straightforward [31]. When a designer knows a priori that an action is counterproductive, he or she can prevent the agents from undertaking it. For example, when a road is under construction, barriers prevent us from taking it (Fig. 7). Masking speeds up training and alleviates the curse of dimensionality by narrowing the action space. Another way to ease learning is to reduce exploration.
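In practice, masking is often implemented by forcing the probability of forbidden actions to zero, as in the hedged sketch below; the action set and mask are illustrative assumptions.

```python
import torch

def masked_policy(logits, valid_mask):
    # logits: raw network outputs (n_actions,); valid_mask: True where the action is allowed.
    masked = logits.masked_fill(~valid_mask, float("-inf"))    # forbidden actions get -inf
    return torch.softmax(masked, dim=-1)                       # their probability becomes 0

logits = torch.randn(4)                                        # e.g., [keep, left, right, accelerate]
mask = torch.tensor([True, True, False, True])                 # "right" blocked by road works
probs = masked_policy(logits, mask)
```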

Figure 7: Masking prevents agents from undertaking undesirable actions (red)

3.2.3 Curriculum learning

Curriculum learning [32] refers to a learning method that gradually increases the difficulty of the task. For example, when people learn to drive, they usually start in low-traffic areas and, once they master it, move on to denser areas (Fig. 8). In MARL, agents often fail to learn practical policies because of non-stationarity. With curriculum learning, agents start learning in stationary environments, and this stationarity is gradually removed, making the task more challenging. Another way to ease learning is to consider it hierarchically.
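A curriculum can be as simple as a schedule mapping training progress to scenario difficulty, as in the sketch below; the thresholds and density labels are illustrative assumptions.

```python
def traffic_density(episode, max_episodes):
    # Difficulty (and hence non-stationarity) grows with training progress.
    progress = episode / max_episodes
    if progress < 0.3:
        return "light"        # few, mostly stationary vehicles
    if progress < 0.7:
        return "moderate"     # more vehicles, mild interactions
    return "dense"            # full multi-agent, non-stationary traffic

stages = [traffic_density(e, 1000) for e in (0, 400, 900)]     # light, moderate, dense
```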

Figure 8: Curriculum learning for driving. From left to right, agents start in light traffic, increase the complexity, and become more robust in dense traffic

3.2.4 Hierarchical reinforcement learning (HRL)

Hierarchical reinforcement learning (HRL) refers to “divide and conquer” algorithms [33]. Dividing the main policy into lower-level sub-policies makes problems more manageable since these sub-policies can be reused in related tasks (Fig. 9). For example, a left lane change on a highway can reuse the knowledge acquired from a similar task on a country road. Sub-tasks are sometimes less resource-intensive than global tasks because they operate in a narrowed state-action space, thus alleviating the curse of dimensionality.
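A hedged sketch of such a two-level hierarchy is shown below: a high-level policy selects a sub-task, and a reusable low-level controller executes it. The option names and the trivial policies are illustrative assumptions.

```python
def high_level_policy(observation):
    # A learned policy would select the sub-task from the observation.
    return "lane_change_left" if observation["slow_leader"] else "keep_lane"

low_level_policies = {
    "keep_lane":         lambda obs: {"steer": 0.0,  "throttle": 0.3},
    "lane_change_left":  lambda obs: {"steer": -0.2, "throttle": 0.3},
    "lane_change_right": lambda obs: {"steer": 0.2,  "throttle": 0.3},
}

def drive_step(observation):
    option = high_level_policy(observation)          # high level: which sub-task to run
    return low_level_policies[option](observation)   # low level: actual steering/throttle

control = drive_step({"slow_leader": True})          # -> the left-lane-change sub-policy
```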

Figure 9: Hierarchical reinforcement learning (inspired from Chen et al. [34]). A higher-level policy (orange) selects a subtask (blue) to perform a sub-policy on a narrow action space (green)

We showed in this section that centralized and decentralized schemes suffer from many problems that learning strategies can alleviate. The following section will describe the MARL-based driving simulation environments.

4 MARL-based driving simulation environments

Coordinating a fully-autonomous fleet, i.e., one without human drivers, is more tractable than driving in mixed traffic because of the predictable nature of homogeneous agents. Furthermore, to keep traffic flowing, AVs share information and coordinate within short reaction times. Most MARL training relies on simulation environments (4.1) to learn these features on various scenarios (4.2) and with human driver models (4.3).

4.1 Simulation environments

Simulation environments provide tools to simulate traffic and develop learning algorithms for AVs. They allow benchmarking the effectiveness of the suggested algorithms before shifting to a real-world implementation. We briefly introduce, in alphabetical order, the four simulation environments used in the papers introduced in Sect. 5.

  • CARLA [35] is an open-source road environment based on Unreal Engine. It provides assets to model the road environment and implement perception, planning, and control modules.

  • Flow [36] is a framework combining the SUMO traffic simulator [37] and a deep RL library, Rllab [38]. It provides many traffic scenarios and supports training involving a fixed number of vehicles.

  • Highway-env is an open-source Gym-based platform. It provides road scenarios designed to train AVs’ decision-making in mixed traffic. According to Schmidt et al. [10], its performance decreases with the number of vehicles.

  • MACAD-Gym [39] is a Gym-based training environment based on CARLA. As its name implies, multi-agent connected autonomous driving (MACAD) allows the implementation of communicative agents.

All these simulation environments support the design of different scenario types.

4.2 Driving scenarios

Most papers focus on narrow scenarios instead of considering overall traffic. We present the traffic scenarios according to their complexity.

  1. Highway driving. This scenario is commonly accepted as the most straightforward and considers two maneuvers: car-following and lane changing. Mastering these maneuvers, which account for 98% of driver actions, is crucial for safe driving. Robust AVs mastering highway driving should avoid collisions and frequent lane changes, which would otherwise affect traffic flow.

  2. Merging and exiting. These maneuvers are similar to a lane change but are constrained in space and time. Robust AVs must anticipate gaps in traffic to merge smoothly within the traffic flow under these space-time constraints. Inference capabilities should also determine whether a driver is inclined to engage in altruistic behavior by leaving a gap, which is not straightforward since AVs are agnostic about informal rules.

  3. Intersections and roundabouts. Intersections come in heterogeneous configurations, and apprehending them can be challenging. For instance, designers have failed to generalize them via rule-based models and instead designed a decision graph for each one, which is tedious. Robust AVs should generalize across intersections and figure out the singularities of each.

Because designing a generic model of intersections is difficult, most research concentrates on the first two levels. Regardless of the scenario, robust AVs would gain an advantage, avoiding more collisions, if they could predict human driver behavior.

4.3 Human driver models

AVs have difficulty adapting to the heterogeneity of human behavior because it produces additional uncertainty and forces caution. To overcome uncertainty, humans make assumptions based on experience, informal rules, and behavioral cues, which are sometimes biased or stereotyped [40, 41]. It is impossible to replicate the entire human cognitive process, and therefore, AVs often learn with oversimplified human models.

Human-driven vehicle (HDV) models simulate car-following and lane-changing maneuvers [42–44]. The well-known intelligent driver model (IDM) describes speed and acceleration based on the driver’s preferences for speed and headway [45]. The IDM is often combined with the MOBIL or LC2013 lane-changing models, which consider the utility and risk associated with this maneuver [46, 47]. Although the literature refers to them as human-driven models, they lack human traits such as psychology or intrinsic motivation.
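For reference, a standard formulation of the IDM acceleration is sketched below; the parameter values (desired speed, time headway, etc.) are illustrative assumptions rather than those used in the reviewed papers.

```python
import math

def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4):
    # v: own speed [m/s]; gap: distance to the leader [m]; dv: approach rate v - v_leader [m/s]
    # v0: desired speed; T: desired time headway; a_max: max acceleration; b: comfortable braking
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))  # desired dynamic gap
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

a = idm_acceleration(v=25.0, gap=40.0, dv=2.0)                 # decelerates when closing in
```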

Designing AVs for mixed traffic is challenging because of the fundamental differences between humans and machines. Although inferring human social behavior helps AVs’ decision-making, the following section shows that this approach is not widespread in the literature.

5 MARL algorithms for AVs

We have identified four research paradigms throughout the literature on MARL decision-making for AVs. Some authors focused on mixed traffic where AVs drive in a self-concerned way (5.1), while others attempted to incorporate social abilities into their decision-making (5.2). In both cases, the authors realized that current HDV models do not fulfill their objectives since they are oversimplified and fail to provide a heterogeneity of behaviors. As a result, researchers designed more sophisticated HDV models endowed with social capabilities (5.3). The last paradigm tackles the fully-autonomous traffic case where no human driver can disturb AVs’ coordination (5.4). Finally, we synthesize the reviewed papers (5.5). For each paradigm introduced, we present the authors’ formulation of the MARL problem in terms of observations, actions, and reward functions.

5.1 Mixed traffic

Before reaching the full automation level, AVs will potentially cohabit with human drivers in mixed traffic, which is no easy feat. AVs follow homogeneous policies, while humans are sometimes erratic and irrational. Here, we focus on papers proposing self-concerned AVs driving in mixed traffic.

Wang et al. [48] trained AVs on three scenarios: a ring network, a figure-of-eight network, and a mini-city with intersections and roundabouts. The ego-agent’s state comprises its position, speed, and the distance and speed headways to the leading and following vehicles. AVs communicate their local observations with other AVs within range. The ego-agent’s actions are constrained to predefined discrete acceleration values, and its reward function promotes safety and efficiency.

Dong et al. [31] tackled a challenging environment where AVs have to exit via one of two off-ramps on a three-lane highway. The agent’s observations contain the relative speeds, longitudinal locations, lane positions, and intentions of surrounding AVs, as well as an adjacency matrix and a mask. AVs select high-level actions: lane changing or lane keeping. The reward function rewards each AV for reaching the desired off-ramp indicated by its intention and penalizes collisions and lane changes to discourage erratic maneuvering.

Han and Wang [49] trained AVs to drive on a three-lane freeway. Each AV observes its position, velocity, acceleration, and data captured by an onboard camera and LIDAR sensors. Additionally, AVs share their states, actions, and observations with each other. AVs select high-level actions such as lane keeping, lane changing, or emergency stopping and are rewarded according to their velocities and passengers’ comfort. The reward system deals with the credit assignment problem, i.e., how to fairly reallocate a shared global reward, by marginalizing rewards using the Shapley value. In cooperative game theory, the Shapley value is a solution concept that distributes fair payoffs to players in proportion to their contribution. Since computing the exact Shapley value scales poorly with the number of agents, the authors estimated it via a neural network and extended it to sequential problems.
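For reference, the Shapley value of agent i in a cooperative game (N, v) averages its marginal contribution over all coalitions S; the exact estimation scheme used in [49] may differ.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}\,
\Bigl[\, v\bigl(S \cup \{i\}\bigr) - v(S) \,\Bigr]
```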

AVs’ decision-making in mixed traffic is significantly impacted by the absence of other AVs in their vicinity. Since AVs communicate local observations only within range, the uncertainty about the environment grows as the number of surrounding AVs decreases. To overcome this challenge, some researchers envision AVs that are more aware of their surroundings and propose algorithms with social capabilities.

5.2 Socially desirable AVs

Socially desirable AVs will likely include the concept of altruism. In psychology, social value orientation (SVO) quantifies an individual’s level of altruism, i.e., how much importance they place on others. Lower SVO levels denote selfish behavior, while higher levels denote true altruism.
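A common angular formulation of this trade-off, given below as a sketch (the exact weighting used in the reviewed papers may differ), mixes the ego reward and the others’ reward according to the SVO angle, where an angle of 0 is purely egoistic and an angle of π/2 purely altruistic.

```latex
r_i^{\mathrm{SVO}} \;=\; \cos(\varphi_i)\, r_i^{\mathrm{ego}} \;+\; \sin(\varphi_i)\, r_i^{\mathrm{others}}
```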

In their first paper, Toghi et al. [50] tackled the merging and exiting scenarios with socially desirable AVs. AVs observe the kinematics of their neighboring vehicles as well as their last high-level actions, extracting the temporal information describing their current trajectories. They perform meta-actions, including lane changing, accelerating, and decelerating. The socially desirable behavior is induced through a reward function acting as a trade-off between egoistic and altruistic behavior, differentiating altruism towards AVs from altruism towards human-driven vehicles. The AV’s SVO weights this trade-off, together with the distance of the surrounding vehicles considered.

In their second paper, Toghi et al. [51] enhanced their approach using a 3D convolutional network with the relative vehicle speeds as channels. Further experiments identified an optimal level of SVO that improves overall traffic flow and showed that overly altruistic AVs reduce performance. Their third paper achieved better results using a multi-agent actor-critic algorithm [52].

Chen et al. [53] trained AVs to avoid collisions in a merging scenario using a supervisor that prioritizes merging vehicles because their situation is time-critical. AVs observe the lateral and longitudinal positions and velocities of surrounding vehicles and select a meta-action among lane changing, accelerating, or decelerating. A reward function promotes fast merging, high velocity, and a safe time headway and penalizes collisions. This reward is global and shared by all the AVs in the simulation to encourage coordination among them.

As the results of these articles show, socially desirable AVs improve the success rate of merging and exiting maneuvers. Nonetheless, this coordination is facilitated by the fact that the human-driven vehicles are all controlled by the IDM model and thus are easily predictable. Designing robust AVs that cope with heterogeneous driver behavior and traffic simulations will require more comprehensive driver models.

5.3 Heterogeneous HDVs

Robust AVs will inevitably have to be trained to drive in complex mixed traffic composed of heterogeneous human-driven vehicles (HDVs). Some researchers [54] attempted to learn an HDV model via inverse RL (IRL), a technique for recovering an agent’s reward function given its policy, but this approach is highly dependent on the quality of the extracted data and the studied scenario. As a consequence, there is a need for a “realistic” and heterogeneous HDV model.

Valiente et al. [55] extended the research of Toghi et al. by incorporating an SVO factor into the IDM model used for controlled HDVs. Similarly, Zhou et al. [56] endowed HDVs with a politeness factor, and Hu et al. [57] designed a social HDV model with different levels of cooperation.

All the aforementioned authors took advantage of their new HDV models by enabling AVs to infer this SVO and thus anticipate which drivers are prone to act altruistically.

5.4 Fully-autonomous fleet

When AVs reach the fifth level of automation, human drivers might be considered the main threat to road safety and therefore be banned from driving. In this context, all traffic will be composed of fully-autonomous fleets.

Yu et al. [58] addressed the problem of coordination on the highway. AVs observe their current lane position, speed, and the distances and velocities of four neighboring vehicles. Actions comprise driving in the driving lane at a suboptimal speed or driving in the overtaking lane at a higher velocity. The reward function exclusively promotes safety and is shared among a local group of AVs depicted by coordination graphs.

Bhalla et al. [59] trained AVs to better communicate and coordinate on a highway. They benchmarked their approach against DIAL, an algorithm focused on learning to communicate in cooperative tasks [60]. Unlike DIAL, their method does not require past experiences, which mitigates non-stationarity and stabilizes learning. The AVs’ actions include sending messages, accelerating, decelerating, and changing direction. The reward function does not explicitly reward cooperation between the agents but promotes safe distances and penalizes crashes.

Liu et al. [15] proposed a framework for fleet control where each vehicle learns to maintain a constant headway with the vehicles ahead and behind on a highway. Each AV observes its position and speed, as well as those of the front and rear vehicles. To maintain the homogeneity of the fleet, a reward function penalizes AVs that are not equidistant from the front and rear vehicles or whose velocity and acceleration differ from the group’s.

Palanisamy [39] designed MACAD, an environment for simulating AVs’ perception, decision-making, and control. In an intersection scenario, the AVs’ observations are images captured by an onboard camera, and they can pick one of eight discrete actions controlling the steering angle, throttle, and brake. The function rewards AVs for crossing the intersection while maintaining a high speed and avoiding collisions. Optionally, a factor encourages or discourages cooperativeness or competitiveness among the agents.

Nakka et al. [61] tackled the coordination problem in a merging scenario. The merging AV observes the distances and velocities of the surrounding vehicles and the distance from the end of the merging zone. Actions allow the AV to accelerate or decelerate, and the reward function encourages agents to maintain their speed within a predefined range and penalizes rear-end collisions.

5.5 Synthesis

We synthesize the previous papers according to the concepts introduced in this survey (Table 1). Most authors used single-agent RL methods, especially DQN-based ones, to address MARL problems (12 out of 16), and those adopting MARL approaches mainly used the CTDE scheme (3 out of 4). The nature of the action space seems to guide the choice between value-based and actor-critic methods, since the latter better handle continuous action spaces. In addition, few articles used learning strategies or explicitly mentioned them.

Table 1 Summary of papers according to the problem addressed and simulation settings. Scenarios include merging (M), exiting (E), highway (H) without merging nor exiting, urban navigation comprising intersections and roundabouts (U), and intersection (I). Learning strategies include Hierarchical Reinforcement Learning (HRL), Curriculum Learning (CL), Memory module (Mem), and Masking (Mask)

Interestingly, most papers (12 out of 16) focused their study on simulations involving few agents (≤10). This choice is presumably motivated by the MARL challenges, notably the curse of dimensionality [62].

Most studies investigated highway driving and merging scenarios (13 out of 16), as these critical maneuvers involve anticipation and often cause accidents involving AVs. For their simulations, Gym-based environments prevail due to their manageable API for RL. Similarly, IDM prevails because of its efficiency and computational simplicity.

Since 2019, few papers have addressed AVs’ decision-making using MARL compared to those using single-agent RL. Due to the limited number of articles dealing with MARL, our conclusions may be biased, and we invite readers to bear this in mind.

6 Open challenges and conclusion

Overall, most studies focus on simulations rather than addressing transferability to real traffic scenarios. The need for “realistic” driver models and for safe, interpretable models are two significant problems for AV simulation discussed in this section.

Safety is undoubtedly the critical point in the development of AV algorithms. In MARL, designing a safe policy is a real challenge that implies considering safety constraints at both the agent and group levels. The constrained Markov decision process (CMDP) framework provides tools for designing such safe RL algorithms [64].

Most studies agree that existing HDV models are unrealistic because they disregard human characteristics such as psychological and biological traits. Although some researchers tried to provide heterogeneity in HDV models, their models are still limited to a single SVO trait. Besides, despite their differences, HDV and AV models behave deterministically. Introducing AVs trained with these HDV models into real-world traffic would likely result in accidents.

Therefore, developing convincing driver models for safe driving is critical, as driving styles vary among countries and cultures [65]. Attempts have been made using inverse reinforcement learning (IRL), but these algorithms are overly dependent on the situations under study and frequently fail to generalize. Others have proposed utilizing MARL algorithms to learn social norms, which may be a new field of research [66].

Another way to prepare AVs for real-world traffic is to make them trustworthy by incorporating interpretability. Explainable artificial intelligence (EAI) is an important research topic gaining interest over the years, mainly because lawmakers require AI to be interpretable, as in Europe with the general data protection regulation (GDPR). Therefore, robust AVs should incorporate interpretable algorithms providing security and robustness guarantees. Interpreting MARL policies involves explaining the short- and long-term decision-making and interactions of multiple agents. This may be accomplished via causal MARL [67].

Since multi-agent simulations, and MARL algorithms more broadly, enable the emergence of organizational structures, it might be interesting to investigate how self-organization occurs in a fully-autonomous fleet with no predetermined rules. While researchers tend to incorporate existing standards into AVs’ decision-making, they do not rule them out for fully-autonomous fleets. These emergent organizations may be more appropriate for AVs than current regulations based on humans’ limitations.

We posed two research questions in the introduction (Sect. 1), which we now address.

  • RQ1. Recent AVs’ decision-making research focused on two paradigms. On the one hand, since autonomous vehicles may soon coexist with human drivers, mixed traffic received much attention. Some studies concentrated on improving traffic safety and throughput, while others proposed empowering AVs with social abilities. Some attempted to design HDV models that mimic driver altruism to robustify AVs’ policies. On the other hand, since human drivers might be banned from traffic, some researchers devised fully-autonomous fleets that should enhance the overall traffic flow and security.

  • RQ2. Designing traffic simulations with adequate HDV models is challenging, and despite the proposed models, none covers the heterogeneity of human behavior. Given these limitations, considering mixed traffic remains involved, and future research will likely pay more attention to this problem. In addition, since intersection and roundabout configurations are manifold, most studies concentrated on the most straightforward scenarios, such as highway driving, merging, and exiting. Finally, most experiments involved few agents due to the aforementioned MARL challenges [62].

In conclusion, RL and MARL algorithms have recently received interest due to their achievements and generalization capabilities. They provide a practical approach for learning complex policies involving real-time decision-making in stochastic environments. However, many challenges remain in mitigating scalability issues when numerous agents are involved. Furthermore, mixed traffic does not meet safety standards in the current simulations. Recent papers attempted to mimic human behavior, particularly social capabilities, to strengthen AVs’ policies. Given current AVs’ algorithms, future research will most likely continue to design less deterministic driver models.