Multi-agent reinforcement learning for autonomous vehicles: a survey

In the near future, autonomous vehicles (AVs) may cohabit with human drivers in mixed traffic. This cohabitation raises serious challenges, both in terms of traffic flow and individual mobility, as well as from the road safety point of view. Mixed traffic may fail to fulfill expected safety requirements due to the heterogeneity and unpredictability of human drivers, and autonomous cars could then monopolize the traffic. Using multi-agent reinforcement learning (MARL) algorithms, researchers have attempted to design autonomous vehicles for both scenarios, and this paper investigates their recent advances. We focus on articles tackling decision-making problems and identify four paradigms. While some authors address mixed traffic problems with or without socially desirable AVs, others tackle the case of fully-autonomous traffic. While the latter case is essentially a communication problem, most authors addressing mixed traffic admit some limitations. The current human driver models found in the literature are too simplistic since they do not cover the heterogeneity of the drivers’ behaviors. As a result, they fail to generalize over the wide range of possible behaviors. For each paper investigated, we analyze how the authors formulated the MARL problem in terms of observation, action, and rewards to match the paradigm they apply.


Introduction
According to the World Health Organization (WHO), road accidents kill 1.3 million people and injure 50 million people each year. Several technologies have been proposed to make driving safer, such as advanced driver assistance systems (ADAS), adaptive cruise control (ACC), and intelligent transportation systems (ITS). The latter, with the recent technological advances in communication systems, paved the way for the deployment of autonomous vehicles.
Trommer et al. [1] described five levels of vehicle automation in their technical report, ranging from superficial assistance (level 1) to full automation (level 5). With effective algorithms that prevent fatal accidents, the latter level could make traffic safer. AVs and humans may cohabit in mixed traffic before reaching full automation. However, the evidence suggests that accident-free mixed traffic may be impossible [2]. Human drivers follow informal and subjective norms, but autonomous vehicles comply with traffic rules [3,4]. Because of their divergent concerns, AVs are unlikely to be effective in mixed traffic. By contrast, coordinating a fully-autonomous fleet is straightforward because AVs act homogeneously and are therefore predictable. AVs should be capable of handling all traffic scenarios, whether they are driving in mixed traffic or fully autonomous fleets. However, because these scenarios are nearly endless, designing rule-based models is practically certain to fail.
With advances in hardware, machine learning approaches provide new opportunities to generalize driving scenarios. Reinforcement learning (RL) approaches, in particular, are successful at solving sequential decision-making problems, such as Go, Chess, arcade games, and real-time video games [5,6,7,8,9]. In RL, an agent learns and self-corrects by receiving feedback on the quality of its interactions within an environment. Multi-agent RL (MARL) is a more distributed framework in which several agents simultaneously learn cooperative or competitive behavior. Since several decision-makers learn simultaneously and possibly coordinate, more robust and convincing policies can emerge than with single-agent RL approaches.
Several surveys have investigated related aspects of RL for AVs in a more global way. Schmidt et al. [10] tackled autonomous mobility, including traffic management, unmanned aerial vehicles (UAVs), AVs, and resource optimization using MARL algorithms. Elallid et al. [11] surveyed AVs' scene understanding, decision-making, planning, and social behavior using RL approaches. Kiran et al. [12] tackled scene understanding, decision-making, and planning using RL algorithms. Ye et al. [13] tackled motion planning and control using RL approaches. Nevertheless, no review has investigated the decision-making of autonomous vehicles using MARL algorithms.
Our survey seeks to fill this gap by answering two research questions: (RQ1) what is the recent state of the art of AVs' decision-making using MARL algorithms; and (RQ2) what are the topic's primary current limitations. To answer these questions as concisely as possible while considering recent breakthroughs in MARL algorithms, we have restricted this review to sixteen papers published since 2019 (distribution in Fig. 1). We focus our survey on decision-making problems; nonetheless, interested readers can refer to [14], a recent survey that focuses on autonomous driving policy learning using deep reinforcement learning (DRL) and deep imitation learning (DIL) techniques.
We have organized the remainder of this review as follows. Firstly, we introduce the state of the art of RL and MARL algorithms (Sect. 2). Secondly, we highlight the learning schemes and strategies of MARL algorithms (Sect. 3). Thirdly, we review the driving simulation environments (Sect. 4). Fourthly, we investigate articles tackling AVs' decision-making using MARL algorithms (Sect. 5). Lastly, we discuss open challenges and conclude this study (Sect. 6).

Reinforcement learning
This section reviews the state of the art of single-agent (2.1) and multi-agent (2.2) reinforcement learning algorithms.

Single-agent reinforcement learning
Reinforcement learning (RL) is a trial-and-error learning method where an agent interacts within an environment [5] (Fig. 2). The agent's goal is to reach the most rewarding states of the environment. The agent explores the environment, grasping its dynamics and devising an appropriate policy (behavior) to discover these states. As a result, the agent gains knowledge from its actions and maximizes long-term accumulated rewards. Non-learning agents who obey stationary policies may be present in the environment. The environment, the state, the actions, and the rewards for an autonomous car may correspond to the roadway, the positions of other vehicles, accelerating or braking, and collision avoidance, respectively.
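The interaction loop described above can be illustrated with a minimal sketch. The environment, its reward values, and both policies below are hypothetical toy constructions chosen for illustration; they are not taken from any of the reviewed papers.

```python
class ToyCarFollowingEnv:
    """Hypothetical 1-D environment: the state is the gap (m) to a leading vehicle."""

    def __init__(self, initial_gap=20):
        self.initial_gap = initial_gap
        self.gap = initial_gap

    def reset(self):
        self.gap = self.initial_gap
        return self.gap

    def step(self, action):
        # action 0 = brake (gap grows by 1), action 1 = accelerate (gap shrinks by 2)
        self.gap += 1 if action == 0 else -2
        collided = self.gap <= 0
        if collided:
            reward = -10.0   # collision penalty (illustrative value)
        elif action == 1:
            reward = 1.0     # progress reward (illustrative value)
        else:
            reward = 0.0
        return self.gap, reward, collided


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the accumulated (undiscounted) reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent acts on the current state
        state, reward, done = env.step(action)  # the environment returns feedback
        total += reward
        if done:
            break
    return total


def cautious_policy(gap):
    """Accelerate only when the gap is comfortable."""
    return 1 if gap > 10 else 0


def reckless_policy(gap):
    """Always accelerate, regardless of the gap."""
    return 1
```

In this toy setting, `run_episode(ToyCarFollowingEnv(), cautious_policy)` accumulates a positive return, whereas the reckless policy eventually collides; a learning agent would have to discover the former behavior from the reward signal alone.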
There are three types of RL algorithms: value-based, policy-based, and actor-critic. In value-based methods, the agent implicitly learns a deterministic policy by picking higher-valued actions via a value function that maps state-action pairs to expected returns. Nevertheless, the value function becomes inefficient as the state-action space grows, as in large discrete spaces [15]. In policy-based methods, the agent explicitly learns a stochastic policy function. However, policy-based approaches suffer from high variance, which slows down the learning process. Actor-critic approaches appear to be a reasonable compromise that combines the benefits of the preceding methods. They are divided into a critic part, which approximates the value function, and an actor part, which learns a policy based on the critic's estimations to alleviate the variance. Because they work effectively in real-world contexts with continuous spaces, actor-critic approaches are widespread within the RL community.
We briefly describe the single-agent RL algorithms (Fig. 3) addressed in Sect. 5. Deep Q-network (DQN) [16] is a value-based agent that builds a deep learning model to estimate future rewards and execute the actions that lead to the best outcome. Advantage actor-critic (A2C) [17] is an actor-critic agent that learns a stochastic policy and estimates the advantage of taking one action over the others. Deep deterministic policy gradient (DDPG) [18] is an actor-critic agent with a deterministic policy that learns off-policy, which means that it can learn from experience collected by a policy other than the present one. Instead of employing a logarithmic update, proximal policy optimization (PPO) [19] extends the A2C agent by updating the policy based on the ratio between the new and old policies weighted by the advantage. None of these algorithms is purely policy-based.
Figure 3 Taxonomy of RL methods
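To make the ratio-based PPO update more concrete, the following sketch computes the clipped surrogate term for a single sample, following the clipped objective of the PPO paper [19]. The function name and the default clipping range are assumptions made for illustration.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken action under the
    new and old policies. clip_eps is an illustrative default.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the new policy doubles the probability of an action with a positive advantage, the term is capped at `1.2 * advantage`, which limits how far a single update can move the policy.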

Multi-agent reinforcement learning
Multi-agent reinforcement learning (MARL) algorithms involve several agents learning simultaneously in a shared environment. Agents are either cooperative, competitive, or mixed. Cooperative agents possibly communicate to coordinate their actions (Fig. 4) and often share a common reward function. Conversely, competitive agents play a zero-sum game, attempting to outperform their opponents. When agents behave neither fully cooperatively nor fully competitively, they follow the mixed setting, a general-sum game without any restrictions on agents' relations [20].
MARL algorithms follow the same taxonomy as the single-agent RL methods introduced in Fig. 3. Multi-agent extensions of single-agent algorithms are often prefixed with MA, e.g., MAA2C and MADDPG [21,22]. MARL algorithms are more complicated than single-agent RL approaches because several agents learn simultaneously and constantly co-adapt their policies. This non-stationarity disrupts the dynamics of the environment and impedes the learning process [23]. Furthermore, as the number of agents increases, the joint state-action space expands exponentially, slowing the learning process. The latter phenomenon is called the curse of dimensionality.
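The exponential growth behind the curse of dimensionality can be checked with a short calculation. The sketch below assumes that every agent has the same discrete action set; the helper names are hypothetical.

```python
from itertools import product


def joint_action_space_size(n_actions, n_agents):
    """Number of joint actions when each of n_agents picks among n_actions."""
    return n_actions ** n_agents


def enumerate_joint_actions(actions, n_agents):
    """Explicitly list every joint action (only feasible for tiny problems)."""
    return list(product(actions, repeat=n_agents))
```

With five discrete actions per vehicle, ten vehicles already yield `5 ** 10 = 9765625` joint actions, which suggests why a central learner over the joint space quickly becomes impractical.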
In some environments, agents operate with only partial observations of the present state, making learning more challenging; for example, it is hard to observe the whole traffic flow in road driving. To reduce these blind zones, agents can communicate in cooperative tasks [24]. Connected autonomous vehicles, for example, could share and merge their local observations to better represent traffic, potentially revealing a vehicle in a blind spot. Communication thus mitigates both non-stationarity and partial observability.
Many learning schemes and strategies have been proposed in response to the additional challenges of MARL algorithms, which are exacerbated by the number of agents.

Learning schemes
The curse of dimensionality, partial observability, and non-stationarity represent three critical challenges for MARL development. This section introduces how MARL algorithms centralize or decentralize learning and execution (3.1) and which learning strategies (3.2) the reviewed papers implement to tackle these challenges.

Centralization and decentralization
In learning algorithms, an agent learns a policy during a training phase and follows it during the execution phase. These phases, in MARL algorithms, can be either centralized or decentralized. In the centralized case, agents share information to improve their policies, whereas, in the decentralized case, they learn independently with no additional information. Three major learning schemes have been proposed depending on whether the training and execution phases are centralized or decentralized.

Centralized training centralized execution (CTCE)
In the centralized training centralized execution (CTCE) scheme, a central learner gathers information from agents to learn a joint policy, which mitigates the partial observability and non-stationarity issues. However, CTCE suffers from centralization, which exacerbates the curse of dimensionality. Furthermore, agents with competing goals may disrupt each other's policies, making learning harder. Single-agent RL algorithms may suffice because CTCE does not expressly assume decentralization. In contrast to CTCE, a fully-decentralized scheme has been proposed.

Decentralized training decentralized execution (DTDE)
The decentralized training decentralized execution (DTDE) scheme allows each agent to learn independently without exchanging additional information. As a result, agents are unaware of one another's existence, and the environment appears non-stationary from their viewpoints. Furthermore, Gupta et al. [25] demonstrated that DTDE scales poorly with the number of agents.
One last scheme has been proposed as an intermediary solution, given the previous limitations of the fully-centralized and fully-decentralized approaches.

Centralized training decentralized execution (CTDE)
Lowe et al. [22] introduced the centralized training decentralized execution (CTDE) method, which overcomes the shortcomings of the fully-centralized and fully-decentralized approaches. During the training phase, agents share additional information to reduce non-stationarity and partial observability, then discard it during the execution phase. The CTDE scheme includes two popular strategies that can be used depending on the agents' nature [25].
Parameter sharing
Parameter sharing (PS) is a well-known approach for dealing with large-scale environments where several homogeneous agents cooperate [25]. PS mitigates the curse of dimensionality by allowing all agents to learn simultaneously using a single neural network during the training phase (Fig. 5(a)).
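A minimal sketch of parameter sharing follows. A tiny linear scorer stands in for the single neural network, and the update rule is a toy nudge rather than a real RL update; the class, agent names, and values are all hypothetical.

```python
class SharedLinearPolicy:
    """One set of weights reused by every homogeneous agent (stand-in for a shared network)."""

    def __init__(self, n_features, n_actions):
        self.weights = [[0.0] * n_features for _ in range(n_actions)]

    def scores(self, obs):
        return [sum(w * x for w, x in zip(row, obs)) for row in self.weights]

    def act(self, obs):
        s = self.scores(obs)
        return max(range(len(s)), key=s.__getitem__)

    def update(self, obs, action, signal, lr=0.1):
        # Toy update: push the chosen action's score along the observation.
        row = self.weights[action]
        for i, x in enumerate(obs):
            row[i] += lr * signal * x


# Every agent holds a reference to the same policy object.
shared = SharedLinearPolicy(n_features=2, n_actions=3)
agents = {"av_0": shared, "av_1": shared, "av_2": shared}

# Experience gathered by one agent immediately benefits all the others.
agents["av_0"].update([1.0, 0.5], action=2, signal=1.0)
```

Because the number of parameters is independent of the number of agents, adding vehicles increases the amount of experience without increasing the model size, which is the property that makes PS attractive for homogeneous fleets.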
Centralized critic decentralized actor
When agents are heterogeneous, however, the centralized critic decentralized actor strategy is more convenient [22]. It follows the actor-critic architecture. Since the critic focuses on assessing the actor, it is no longer helpful for the execution phase. Therefore, each agent receives a duplicate of the actor after the training phase (Fig. 5(b)).
MARL research is still in its infancy, and we have barely skimmed its surface. Interested readers may find comprehensive reviews dedicated to MARL algorithms and challenges [20,26,27,28,29,30]. In addition to these MARL learning schemes, various RL strategies may overcome multi-agent challenges.

Learning strategies
This subsection presents some RL strategies inspired by human cognitive mechanisms that were used in the papers discussed in Sect. 5.

Memory
Memory is a mechanism allowing humans to analyze dynamics. Because RL approaches deal with sequential problems, giving agents memory strengthens their ability to figure out the environment's dynamics [31]. Researchers designed the recurrent neural network (RNN), a memory-based neural network with information cycles that remember past inputs and reuse them in subsequent decisions. As a result, an RNN reduces non-stationarity by improving the analysis of current dynamics based on these experiences. In the case of driving, memory makes it possible to determine the heading of a vehicle between two lanes (Fig. 6).
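The heading example can be sketched without a recurrent network by simply keeping the last two observed positions, a simpler stand-in for recurrent memory that is used here only for illustration. The function name and the coordinate convention (x forward, y lateral) are assumptions.

```python
import math


def heading_from_history(positions):
    """Heading in degrees from the last two (x, y) positions, or None if undetermined."""
    if len(positions) < 2:
        return None  # a single snapshot carries no direction information
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    if (x0, y0) == (x1, y1):
        return None  # the vehicle has not moved between the two snapshots
    return math.degrees(math.atan2(y1 - y0, x1 - x0))
```

With a single snapshot the function returns `None`, mirroring the memoryless case; with two snapshots such as `[(0.0, 0.0), (1.0, 1.0)]` the heading is recoverable.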

Masking
Masking prevents humans from performing undesirable actions, making the environment safer and decision-making more straightforward [31]. When a designer knows a priori that an action is counterproductive, they can prevent the agents from undertaking it. For example, when a road is under construction, barriers prevent us from taking it (Fig. 7). Masking speeds up the training and alleviates the curse of dimensionality by narrowing the action space. Another way to ease learning is to reduce exploration.
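In value-based agents, masking is commonly implemented by restricting the greedy choice to the allowed actions. The sketch below assumes a list of action values and a Boolean mask of the same length; the function name is hypothetical.

```python
def masked_greedy_action(q_values, mask):
    """Return the index of the highest-valued action among those the mask allows."""
    valid = [i for i, allowed in enumerate(mask) if allowed]
    if not valid:
        raise ValueError("the mask forbids every action")
    return max(valid, key=lambda i: q_values[i])
```

For instance, with values `[1.0, 5.0, 2.0]` and the second action masked out (say, a lane closed for construction), the agent selects the third action even though the masked one has the highest value.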

Curriculum learning
Curriculum learning [32] refers to a learning method that gradually increases the difficulty. For example, when people learn to drive, they usually start in low-traffic areas, and once they master these, they move on to denser areas (Fig. 8). In MARL, agents often fail to learn practical policies because of non-stationarity. With curriculum learning, agents start learning in stationary environments, and this stationarity is gradually removed, making the task more challenging. Another way to ease learning is to consider it hierarchically.
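Following the driving analogy, a curriculum can be sketched as a schedule that raises traffic density stage by stage and promotes the agent only once it performs well enough. The density range, the success threshold, and the function names are illustrative assumptions.

```python
def traffic_density_schedule(stage, n_stages, start=0.1, end=1.0):
    """Linearly interpolate the traffic density from start to end across stages."""
    if n_stages < 2:
        return end
    fraction = stage / (n_stages - 1)
    return start + fraction * (end - start)


def next_stage(stage, success_rate, n_stages, threshold=0.8):
    """Move to the next (harder) stage once the success rate reaches the threshold."""
    if success_rate >= threshold and stage < n_stages - 1:
        return stage + 1
    return stage
```

A training loop would call `next_stage` after each evaluation round and rebuild the environment with the density returned by `traffic_density_schedule`.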

Hierarchical reinforcement learning (HRL)
Hierarchical reinforcement learning (HRL) algorithms follow a "divide and conquer" approach [33]. Dividing the main policy into lower-level sub-policies makes problems more manageable since these sub-policies can be reused in related tasks (Fig. 9). For example, a left lane change on a highway can reuse the knowledge acquired from a similar task on a country road. Sub-tasks are sometimes less resource-intensive than global tasks because they operate in a narrowed state-action space, thus alleviating the curse of dimensionality.
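The two-level structure can be sketched with hand-written rules standing in for learned policies. The observation keys, thresholds, and action names below are hypothetical; in an actual HRL agent both levels would be learned.

```python
def keep_lane(obs):
    """Sub-policy restricted to longitudinal actions."""
    return "maintain_speed" if obs["gap_ahead"] > 30.0 else "decelerate"


def change_left(obs):
    """Sub-policy restricted to the lateral maneuver."""
    return "steer_left"


SUB_POLICIES = {"keep_lane": keep_lane, "change_left": change_left}


def high_level_policy(obs):
    """Select a subtask; a learned policy would replace this rule."""
    if obs["gap_ahead"] < 15.0 and obs["left_lane_free"]:
        return "change_left"
    return "keep_lane"


def hierarchical_act(obs):
    """Pick a subtask, then let its sub-policy choose the primitive action."""
    subtask = high_level_policy(obs)
    return subtask, SUB_POLICIES[subtask](obs)
```

Each sub-policy only ever chooses among its own few actions, which is how the hierarchy narrows the space that any single policy has to explore.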
We showed in this section that centralized and decentralized schemes suffer from many problems that learning strategies can alleviate. The following section describes the MARL-based driving simulation environments.

MARL-based driving simulation environments
Coordinating a fully-autonomous fleet, i.e., without human drivers, is more tractable than driving in mixed traffic because of the predictable nature of homogeneous agents. Furthermore, to keep traffic flowing, AVs share information and coordinate within short reaction times. Most MARL training relies on simulation environments (4.1) to learn these features on various scenarios (4.2) and with human driver models (4.3).

Simulation environments
Simulation environments provide tools to simulate traffic and develop learning algorithms for AVs. They allow benchmarking of the effectiveness of the suggested algorithms before shifting to a real-world implementation. We briefly introduce, in alphabetical order, four simulation environments used in the papers introduced in Sect. 5.
• CARLA [35] is an open-source road environment based on Unreal Engine. It provides assets to model the road environment and implement perception, planning, and control modules.
• Flow [36] is a framework combining the SUMO traffic simulator [37] and the deep RL library rllab [38]. It provides many traffic scenarios and supports training involving a fixed number of vehicles.
• Highway-env is an open-source Gym-based platform. It provides road scenarios designed to train AVs' decision-making in mixed traffic. According to Schmidt et al. [10], its performance decreases with the number of vehicles.

Driving scenarios
Most papers focus on narrow scenarios instead of considering overall traffic. We present the traffic scenarios according to their complexity.
1. Highway driving. This scenario is commonly accepted as the most straightforward scenario and considers two maneuvers: car-following and lane changing. Mastering these maneuvers, which account for 98% of driver actions, is crucial for safe driving. Robust AVs mastering highway driving should avoid collisions and frequent lane changes, which affect traffic flow.

Human driver models
AVs have difficulty adapting to the heterogeneity of human behavior because it produces additional uncertainty and forces caution. To overcome uncertainty, humans make assumptions based on experience, informal rules, and behavioral cues, which are sometimes biased or stereotyped [40,41]. It is impossible to replicate the entire human cognitive process, and therefore, AVs often learn with oversimplified human models.
Human-driven vehicle (HDV) models simulate car-following and lane-changing maneuvers [42,43,44]. The well-known intelligent driver model (IDM) describes speed and acceleration based on the driver's preferences for speed and headway [45]. The IDM is often combined with the MOBIL or LC2013 lane-changing model, which considers the utility and risk associated with this maneuver [46,47]. Although the literature refers to them as human-driven models, they lack human traits such as psychology or intrinsic motivation.
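For readers unfamiliar with the IDM, the sketch below implements its standard acceleration equation. The default parameter values are illustrative assumptions rather than values used in the reviewed papers, and clamping the dynamic term at zero is a common implementation guard that is not part of the original formula.

```python
import math


def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, s0=2.0, a_max=1.0, b=1.5, delta=4):
    """Intelligent driver model acceleration (m/s^2).

    v: own speed (m/s); gap: bumper-to-bumper distance to the leader (m, > 0);
    dv: approach rate, own speed minus leader speed (m/s).
    v0: desired speed; T: desired time headway; s0: minimum gap;
    a_max: maximum acceleration; b: comfortable deceleration.
    """
    # Desired dynamic gap s* = s0 + v*T + v*dv / (2*sqrt(a_max*b))
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term (v/v0)^delta and interaction term (s*/gap)^2
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

On an empty road the model accelerates at `a_max` from standstill and settles at the desired speed `v0`; when closing in on a slower leader at short range, the interaction term dominates and the acceleration becomes strongly negative. The model is deterministic given its parameters, which is one reason such HDV models are easy for AVs to predict.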
Designing AVs for mixed traffic is challenging because of the fundamental differences between humans and machines. Although inferring human social behavior helps AVs' decision-making, the following section shows that this approach is not widespread in the literature.

MARL algorithms for AVs
We have identified four research paradigms throughout the literature on MARL decision-making for AVs. Some authors focused on mixed traffic where AVs drive in a self-interested way (5.1), while others attempted to incorporate social abilities into their decision-making (5.2). In both cases, the authors realized that the current HDV models do not fulfill their objectives since they are oversimplified and fail to provide heterogeneous behaviors. As a result, researchers designed more sophisticated HDV models endowed with social capabilities (5.3). The last paradigm tackles the fully autonomous traffic case where no human driver can disturb AVs' coordination (5.4). Finally, we synthesize the reviewed papers (5.5). For each paradigm introduced, we present the authors' formulation of the MARL problem in terms of observation, action, and reward function.

Mixed traffic
Before reaching the full automation level, AVs will potentially cohabit with human drivers in mixed traffic, which is no easy feat. AVs follow homogeneous policies, while humans are sometimes erratic and irrational. Here, we focus on papers proposing self-interested AVs driving in mixed traffic.
Wang et al. [48] trained AVs on three scenarios: a ring network, a figure-of-eight network, and a mini-city with intersections and roundabouts. The ego-agent state comprises its position, speed, and the distance and speed headway of the leading and following vehicles. AVs communicate local observations with other AVs within range. The ego-agent's actions are constrained within predefined discrete acceleration values, and its reward function promotes safety and efficiency.
Dong et al. [31] tackled a challenging environment where AVs have to exit by one of the two off-ramps on a three-lane highway. The agent's observations contain the relative speeds, longitudinal locations, lane positions, and intentions of surrounding AVs, as well as an adjacency matrix and a mask. AVs pick high-level actions: lane change or lane keeping. The reward function rewards each AV for reaching the desired off-ramp indicated by its intention and penalizes collisions and lane changes to discourage unnecessary maneuvers.
Han and Wang [49] trained AVs to drive on a three-lane freeway. Each AV observes its position, velocity, acceleration, and data captured from an onboard camera and LiDAR sensors. Additionally, AVs share their states, actions, and observations with each other. AVs select high-level actions such as lane keeping, lane change, or emergency stop and are rewarded according to their velocities and passengers' comfort. The reward system deals with the credit assignment problem, i.e., how to fairly reallocate a shared global reward, by marginalizing rewards using the Shapley value. In cooperative game theory, the Shapley value is a solution concept that distributes fair payoffs to players according to their contributions. Since the cost of computing the Shapley value exactly grows exponentially with the number of agents, the authors estimated it via a neural network and extended it to sequential problems.
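To illustrate the Shapley value itself, the sketch below computes it exactly by averaging each player's marginal contribution over all orderings. This brute-force enumeration is not the neural estimator of [49]; it only shows the quantity being approximated, and the toy value function is a hypothetical example.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    value maps a frozenset of players to the payoff of that coalition.
    The number of orderings is n!, so this is only usable for a few players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical game: AVs 'a' and 'b' earn 4 only together; 'c' earns 1 alone."""
    payoff = 0.0
    if {"a", "b"} <= coalition:
        payoff += 4.0
    if "c" in coalition:
        payoff += 1.0
    return payoff
```

In this toy game, `shapley_values(["a", "b", "c"], toy_value)` splits the joint payoff of 5 as 2, 2, and 1: the two vehicles that must cooperate share their joint gain equally, and the independent vehicle keeps exactly what it contributes.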
AVs' decision-making in mixed traffic is significantly impacted by the absence of other AVs in their vicinity. Because AVs communicate local observations only within range, the uncertainty about the environment grows as the number of surrounding AVs decreases. To overcome this challenge, some researchers envision AVs that are more aware of their surroundings and propose algorithms with social capabilities.

Socially desirable AVs
Socially desirable AVs will likely include the concept of altruism. In psychology, social value orientation (SVO) quantifies an individual's level of altruism, i.e., how much importance they place on others. Lower SVO levels denote selfish behavior, while higher levels denote true altruism.
In their first paper, Toghi et al. [50] tackled the merging and exiting scenarios with socially desirable AVs. AVs observe the kinematics of their neighboring vehicles as well as their last high-level actions to extract the temporal information underlying their current trajectories. They perform meta-actions, including lane change, acceleration, and deceleration. The socially desirable behavior is induced through a reward function acting as a trade-off between egoistic and altruistic behavior, differentiating altruism towards AVs and human-driven vehicles. The SVO of the AV weights this trade-off, along with the distance of the surrounding vehicles considered.
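One common way to express such a trade-off, borrowed from the SVO literature, is to treat the SVO as an angle that blends the egoistic and altruistic reward terms. The sketch below shows this angular form as an assumption; the exact reward used in [50], which additionally distinguishes AVs from HDVs and accounts for distance, may differ.

```python
import math


def svo_reward(r_ego, r_others, svo_angle_deg):
    """Blend egoistic and altruistic rewards with an SVO angle (in degrees).

    0 degrees is purely egoistic, 90 degrees purely altruistic,
    and 45 degrees weighs both terms equally.
    """
    phi = math.radians(svo_angle_deg)
    return math.cos(phi) * r_ego + math.sin(phi) * r_others
```

Sweeping `svo_angle_deg` between 0 and 90 yields a family of agents ranging from selfish to fully altruistic, which is the kind of parameter that can be tuned to study how the level of altruism affects overall traffic.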
In their second paper, Toghi et al. [51] enhanced their approach using a 3D convolution network with the relative vehicle speeds as channels. Further experiments identified an optimal level of SVO that improves overall traffic flow and showed that overly altruistic AVs reduce performance. Their third paper achieves better results using a multi-agent actor-critic algorithm [52].
Chen et al. [53] trained AVs to avoid collisions in a merging scenario using a supervisor that prioritizes merging vehicles because their situation is time-critical. AVs observe the lateral and longitudinal positions and velocities of surrounding vehicles and pick a meta-action among lane change, acceleration, and deceleration. A reward function promotes fast merging, high velocity, and safe time headway and penalizes collisions. This function is a global reward shared by all the AVs in the simulation to encourage coordination among them.
As the results of these articles show, socially desirable AVs improve the success rate of merging and exiting maneuvers. Nonetheless, this coordination is facilitated because human-driven vehicles are all controlled by the IDM and thus are easily predictable. Designing robust AVs that cope with heterogeneous driver behavior and traffic simulations will require comprehensive driver models.

Heterogeneous HDVs
Robust AVs will inevitably have to be trained to drive in complex mixed traffic composed of heterogeneous human-driven vehicles (HDVs). Some researchers [54] attempted to learn an HDV model via inverse RL (IRL), a technique for figuring out an agent's reward function given its policy, but this approach is highly dependent on the quality of the extracted data and the studied scenario. As a consequence, there is a need for a "realistic" and heterogeneous HDV model.
Valiente et al. [55] extended the research of Toghi et al. by incorporating an SVO factor into the IDM used to control HDVs. Similarly, Zhou et al. [56] endowed HDVs with a politeness factor, and Hu et al. [57] designed a social HDV model with different levels of cooperation.
All the mentioned authors took advantage of their new HDV models by enabling AVs to infer this SVO and thus anticipate which drivers are prone to act altruistically.

Fully-autonomous fleet
When AVs reach the fifth level of automation, human drivers might be considered the main threat to road safety and therefore be banned from driving. In this context, all traffic will be composed of fully-autonomous fleets.
Yu et al. [58] addressed the problem of coordination on the highway. AVs observe their current lane position, speed, and the distances and velocities of four neighboring vehicles. Actions comprise driving in the driving lane at a suboptimal speed or driving in the overtaking lane with a higher velocity. The reward function exclusively promotes safety and is shared among a local group of AVs depicted by coordination graphs.
Bhalla et al. [59] trained AVs to better communicate and coordinate on a highway. They compared their method against DIAL, a benchmark algorithm that focuses on learning to communicate in cooperative tasks [60]. Unlike DIAL, their method does not require past experiences, which mitigates non-stationarity and stabilizes learning. AVs' actions include sending messages, accelerating, decelerating, and changing direction. The reward function does not provide explicit rewards for cooperation between the agents but promotes safety distance and penalizes crashes.
Liu et al. [15] proposed a framework for fleet control where each vehicle learns to maintain a constant headway with the vehicles ahead and behind on a highway. Each AV observes its position and speed, as well as those of the front and rear vehicles. To maintain the homogeneity of the fleet, a reward function penalizes the AVs that are not equidistant from the front and rear vehicles or whose velocity and acceleration differ from those of the group.
Palanisamy [39] designed MACAD, a simulation environment to simulate AVs' perception, decision-making, and control. In an intersection scenario, AVs' observations are images captured from an onboard camera, and they can pick one of eight discrete actions controlling steering angle, throttle, and brake. The reward function rewards AVs for crossing the intersection while maintaining a high speed and avoiding collisions. Optionally, a factor encourages or discourages cooperativeness or competitiveness among the agents.
Nakka et al. [61] tackled the coordination problem in a merging scenario. The merging AV observes the distances and velocities of the surrounding vehicles and the distance from the end of the merging zone. Actions allow the AV to accelerate or decelerate, and the reward function encourages agents to maintain their speed within a predefined range and penalizes rear-end collisions.

Synthesis
We synthesize the previous papers according to the concepts introduced in this survey (Table 1). Most authors (12 out of 16) used single-agent RL methods, especially those based on DQN, to address MARL problems, and those who used MARL approaches mainly adopted the CTDE scheme (3 out of 4). The nature of the action space seems to guide the choice between value-based and actor-critic methods since the latter deal better with continuous action spaces. In addition, few articles used learning strategies or explicitly mentioned them.
Interestingly, most papers (12 out of 16) focused their study on simulations involving few agents (≤ 10). This choice is presumably motivated by the MARL challenges, notably the curse of dimensionality [62].
Most studies investigated highway driving and merging scenarios (13 out of 16), as these critical maneuvers involve anticipation and often cause accidents involving AVs. For their simulations, Gym-based environments prevail due to their manageable API for RL. Similarly, the IDM prevails because of its efficiency and computational simplicity.
Since 2019, few papers have addressed AVs' decision-making using MARL compared to those using single-agent RL. Due to the limited number of articles dealing with MARL, our conclusions may be biased, so we invite readers to consider this.

Open challenges and conclusion
Overall, most studies focus on simulations rather than addressing transferability to real traffic scenarios. The need for "realistic" driver models and the need for safe and interpretable models are two significant problems for AV simulation discussed in this section.
Safety is undoubtedly the critical point of the development of AV algorithms. In MARL, designing a safe policy is a real challenge that implies considering safety constraints at the agent and group levels. The constrained Markov decision process (CMDP) framework provides tools for designing such safe RL algorithms [64].
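For reference, the standard CMDP objective can be written as follows. The notation is generic and assumed here rather than taken from [64]: r is the reward, c_i are safety cost functions, d_i are the corresponding cost budgets, and γ is the discount factor.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t) \right] \le d_i,
\qquad i = 1, \dots, m
```

In a driving context, a cost c_i could, for example, count time steps spent below a minimum headway, so that the constraint bounds how often the policy may enter unsafe situations instead of merely penalizing them in the reward.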
Most studies agree that existing HDV models are unrealistic because they disregard human characteristics such as psychological and biological traits. Although some researchers tried to provide heterogeneity in HDV models, their models are still limited to a single SVO trait. Besides, despite their differences, HDV and AV models behave deterministically. Introducing AVs trained with these HDV models into real-world traffic would likely result in accidents.
Therefore, developing convincing driver models for safe driving is critical, as driving styles vary among countries and cultures [65]. Attempts have been made using inverse reinforcement learning (IRL), but these algorithms are overly dependent on the situations under study and frequently fail to generalize. Others have proposed utilizing MARL algorithms to learn social norms, which may be a new field of research [66].
Another way to prepare AVs for real-world traffic is to make them trustworthy by incorporating interpretability. Explainable artificial intelligence (EAI) is an important research topic gaining interest over the years, mainly because lawmakers require AI to be interpretable, as in Europe with the general data protection regulation (GDPR, https://gdpr-info.eu/). Therefore, robust AVs should incorporate interpretable algorithms providing security and robustness guarantees. Interpreting MARL policies involves explaining short- and long-term decision-making and interactions of multiple agents. This may be accomplished via Causal MARL [67].
Since multi-agent simulations, and MARL algorithms more broadly, enable the emergence of organizational structures, it might be interesting to investigate how self-organization occurs in a fully autonomous fleet with no predetermined rules. While researchers tend to incorporate standards into AVs' decision-making, they do not rule them out for the fully-autonomous fleets. These emergent organizations may be more appropriate for AVs than current regulations based on humans' limitations.
We posed two research questions in the introduction (1), which we now address.
• RQ1. Recent AVs' decision-making research focused on two paradigms. On the one hand, since autonomous vehicles may soon coexist with human drivers, mixed traffic received much attention. Some studies concentrated on improving traffic safety and throughput, while others proposed empowering AVs with social abilities. Some attempted to design HDV models that mimic driver altruism to make AVs' policies more robust. On the other hand, since human drivers might be banned from traffic, some researchers devised fully-autonomous fleets that should enhance the overall traffic flow and safety.
• RQ2. Designing traffic simulations with adequate HDV models is challenging, and despite the proposed models, none covered the heterogeneity of human behavior. Given the current limitations, mixed traffic remains difficult to address, and future research will likely pay more attention to this problem. In addition, since intersections and roundabouts come in many forms, most studies concentrated on the most straightforward scenarios, such as highway driving, merging, and exiting. Finally, most experiments involved few agents due to the aforementioned MARL challenges [62].
In conclusion, RL and MARL algorithms have recently received interest due to their recent achievements and generalization capabilities. They provide a practical approach for learning complex policies involving real-time decision-making in stochastic environments. However, many challenges remain in achieving scalability when numerous agents are involved. Furthermore, mixed traffic does not meet the safety standards in the current simulations. Recent papers attempted to mimic human behavior, particularly social capabilities, to strengthen AVs' policies. Given current AVs' algorithms, future research will most likely continue to design less deterministic driver models.

Funding
This manuscript was funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 815001 (project DriveToTheFuture).

Figure 1 Distribution of the reviewed papers

Figure 2 Single-agent reinforcement learning

Figure 4 MARL with two communicative agents

Figure 6 The benefit of memory. It is impossible to figure out the car's heading without memory (6(a)), while it becomes straightforward with memory (6(b))

Figure 9 Hierarchical reinforcement learning (inspired from Chen et al. [34]). A higher-level policy (orange) selects a subtask (blue) to perform a sub-policy on a narrow action space (green)

All these simulation environments support the design of different scenario types.

2. Merging and exiting. These maneuvers are similar to lane change but are constrained in space and time.