1 Introduction

While the capabilities of Automated Vehicles (AVs) are evolving rapidly, they still lack an essential component that distinguishes them from their human counterparts: the ability to cooperate (implicitly, i.e., without explicit communication, e.g., through eye contact or indicators) with others. Many of the remaining challenges in Automated Driving (AD) are due to the complex interactions between AVs and humans. Thus, it is paramount to equip future AVs with the ability to implicitly demand and provide cooperation where necessary to integrate smoothly into today’s heterogeneous traffic.

Our work addresses this challenge with three key contributions:

  1. A cooperative multi-agent trajectory planner in continuous space, see Sect. 2.

  2. A method to infer reward functions from human expert demonstrations to ensure a human-like behavior, see Sect. 3.

  3. A method to integrate measurement uncertainties into the algorithm to robustify the planning, see Sect. 4.

The code of our work is publicly available on GitHub (Footnote 1) under the BSD 3-Clause License.

2 Implicit Cooperative Trajectory Planning

Cooperative planning considers all traffic participants’ possible actions, intentions, and interdependencies, and seeks to maximize the total utility by following the best combination of actions. It is important to note that cooperation does not need to be beneficial for an individual agent (rational cooperation) but that it is sufficient if the combined utility increases compared to a reference point (i.e., a possible action sequence that fulfills the goals of individuals less efficiently) [15, 39].

2.1 Related Work

While the reward models vary depending on the desired behavior of the respective method [37], all approaches model the problem of implicit cooperative trajectory planning as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), and all aim to find an optimal trajectory or policy through planning or learning.

The following methods were developed for cooperative planning. Most of them take the interdependence of the traffic participants' choices into account; however, some require communication, and only one can plan cooperative maneuvers in the continuous action spaces required by obstructed road layouts and congested urban areas.

2.1.1 Planning

Schwarting et al. [46] propose to determine the best individual plan using an egoistic driver model. If this strategy causes a conflict, a pairwise recursive conflict resolution procedure is started assuming that the traffic ahead influences the decision. The algorithm chooses the best non-conflicting solution. Similarly, Düring et al. [15] also conduct an exhaustive search over a communicated set of discrete actions between two vehicles and choose the joint action with minimum cost. Using this as a foundation, extensions have been created that incorporate fairness enhancements to prevent cooperation from becoming one-sided [39] and precomputed maneuver templates that match a given traffic scene with particular initial constraints to expedite the search [35]. Others propose a game-theoretic, interactive lane change method based on an exhaustive search [4].

Conducting an exhaustive search has the drawback of quickly becoming intractable for a larger number of traffic participants or longer time horizons. Thus, the problem of cooperative decision-making has been addressed by employing Monte Carlo Tree Search (MCTS) to estimate the best maneuver combination over multiple time steps. While some approaches rely on higher-level abstractions, such as the probability of participants yielding and driver models such as the Intelligent Driver Model (IDM) [22, 33], other approaches model the complete interaction and reasoning [29, 55].

Hubmann et al. [20, 21] propose interactive intersection-navigation and lane-change systems, modeled using POMDPs. They are solved via the Adaptive Belief Tree algorithm [23, 48].

Based on game-theoretical modeling, Schwarting et al. present an interactive lane-changing and intersection-navigation method [47]. Each of the traffic participants is considered with its entire action space. The cooperative component is included by assessing the agent’s social value orientation. A state-of-the-art nonlinear optimizer is employed to solve the nonlinear program.

Other approaches are not explicitly cooperative. However, they capture the interdependencies of actions as they evaluate the risk resulting from different maneuver combinations. They predict the motions of vehicles [32] and can generate proactive and cooperative driving actions [3].

2.1.2 Learning

Model-free Reinforcement Learning (RL) is used by Saxena et al. to propose an interactive lane change method in dense traffic [42]. Similar to the Social Generative Adversarial Network (SGAN) [2], the interactions of traffic participants do not need to be explicitly modeled, as they are learned implicitly. The data for training is generated with driver models based on IDM and Minimizing Overall Braking Induced by Lane Changes (MOBIL) in a simulator.

Bouton et al. propose an intersection navigation method capable of handling occlusions using model-free RL and a probabilistic model checker to ensure safety [8]. Two other traffic participants are modeled using rule-based models. The resulting interactions are learned implicitly from data during training. Further, the extension to scenarios with additional traffic participants is realized through scene decomposition, possibly leading to extremely conservative behavior.

Similarly, they propose an interactive single-lane lane change method in dense traffic using model-free RL [7]. The interactions of traffic participants do not need to be explicitly modeled, as they are learned implicitly from data generated during training in a simulator based on IDM driver models. The cooperativeness of other drivers is estimated using a recursive Bayesian filter. In their evaluation, an MCTS approach with full observability performs best.

2.1.3 Combined Learning and Planning

Bae et al. propose an interactive lane change method in dense traffic using the predictions of an SGAN [17] as a basis for planning controls using Model Predictive Control (MPC) [2]. The SGAN is trained on simulations of dense traffic using the IDM and MOBIL driver models. The controller uses Monte Carlo rollouts to create a set of trajectories during planning that is assessed alongside the SGAN predictions. The trajectory with the lowest cumulative cost, which does not violate the constraints, is chosen. This approach is limited to lane changes and uses simple driver models for other traffic participants. Further, the amount of interaction is reduced due to the planning horizon of 2 s.

Hoel et al. propose an interactive lane change method using a combination of MCTS and RL [18]. The driver models used for other traffic participants are also based on IDM and MOBIL. The driver state is only partially observable, and its belief state is inferred using a particle filter. The action space does not include actions that result in collisions. An artificial neural network (ANN) that guides the sampling within the MCTS also generates state-value targets for the MCTS.

2.2 Problem Formulation

We model the cooperative trajectory planning problem with a variable number of non-communicating agents interacting in a Markov Game (MG). In a multi-agent system, the state transition and reward depend on the behavior of all actors. At every time step, however, each agent picks an action individually, without being aware of the choices made by the others. All agents receive an immediate reward, and the system is transferred to the next state.

Therefore, it is the aim of each agent to maximize its expected cumulative reward,

$$\begin{aligned} {G}({\tau }) = \sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}{{\gamma }^{t}{{\mathcal {R}}_{{{s}_{t}}}^{{{a}_{t}}}}}, \end{aligned}$$
(1)

in the MG, by choosing a trajectory \({\tau }\) consisting of state-action tuples \(({{s}_{t}},{{a}_{t}})\) that maximize the sum of rewards \({{\mathcal {R}}_{{{s}_{t}}}^{{{a}_{t}}}}\) over the time horizon discounted by the discount factor \({\gamma }\).

While there are multiple ways to maximize the expected cumulative reward, one of them is value-based RL. Here, the agent attempts to estimate the state value function \(V^{{\pi}}({s})\) for each state \({s}\) of the MDP to later determine the optimal policy \({\pi}^{*}\) by maximizing the state value for each state,

$$\begin{aligned} {{{\pi}^*}({a}\mid {s})}= {\underset{{\pi}}{{{\,\mathrm{arg\,max}\,}}\,}}V^{{\pi}}({s}). \end{aligned}$$
(2)

Value-based RL can find the optimal policy; however, it is very time intensive. Therefore, in contrast to learning the optimal policy (global optimization for all states), we frame the problem as searching for an optimal action (local optimization for a single state, i.e., the current state).

Hence, the problem is formulated as the search for the optimal action, given a state in an MG.

2.3 Approach

We employ Monte Carlo Tree Search (MCTS) to search for the optimal action. MCTS, a reinforcement learning method [56], has shown great potential when facing problems with huge branching factors. The most prominent example, AlphaGo, reached super-human performance in the game Go [49]. The following introduces the basics of MCTS.

2.3.1 Monte Carlo Tree Search

An exhaustive tree search can find the optimal trajectory through any MDP with a finite set of states and actions [9]. However, as the action space grows, it quickly becomes intractable to search for the optimal trajectory through the entire tree.

Tree Search combined with Monte Carlo sampling addresses this issue by approximating the optimal solution asymptotically. Monte Carlo Tree Search (MCTS) is a computationally efficient, highly selective best-first search [13, 24], that explores different trajectories through the MDP to discover the trajectory that maximizes the return \({G}\) from the root state.

Given an initial root state of the MDP, MCTS approximates the state-action value in four sequential steps during each iteration until a terminal condition is met (e.g., until a time budget or computational budget is exceeded). Since MCTS is an anytime algorithm [24], it returns an estimate after the first iteration.

2.3.1.1 Selection

The most popular form of MCTS uses the Upper Confidence Bound for Trees (UCT) to control the selection of successor states. The UCT value [24] of all explored actions from the current state is calculated during the selection phase, see (3), and the state-action tuple with the maximum UCT value is selected. This process repeats until a state is selected that has not been fully explored (i.e., not all available actions in the state have been expanded), see Fig. 1.

Using UCT, MCTS addresses the exploration-exploitation dilemma [24] by maintaining an upper confidence bound on the estimate of the true state-action value. The first term in (3), the estimated state-action value function \({\widehat{Q}^{{\pi}}({s}, {a})}\), fosters exploitation of previously explored actions with high state-action values. The second term guarantees that all actions for a given state are explored at least once, with \({N}({s})\) being the visit count of state \({s}\) and \({N}({s},{a})\) the number of times action \({a}\) has been chosen in that state. The constant factor c balances the exploration-exploitation trade-off [24].

$$\begin{aligned} {\text {UCT}}({s},{a}) = {\widehat{Q}^{{\pi}}({s}, {a})}+c\sqrt{\frac{2\log {{N}({s})}}{{N}({s},{a})}} \end{aligned}$$
(3)
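To make the selection rule concrete, the following minimal Python sketch evaluates (3) over the explored actions of a node; the representation of the node statistics as dictionaries keyed by action is an assumption for illustration only.

```python
import math

def uct(q_value: float, n_state: int, n_state_action: int, c: float = 1.0) -> float:
    """Upper Confidence Bound for Trees, cf. (3)."""
    return q_value + c * math.sqrt(2.0 * math.log(n_state) / n_state_action)

def select_action(q: dict, n: dict, n_state: int, c: float = 1.0):
    """Return the explored action with the maximum UCT value."""
    return max(q, key=lambda a: uct(q[a], n_state, n[a], c))
```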
2.3.1.2 Expansion

Once the selection policy encounters a state with untried actions left, it expands that state by randomly sampling an action from a uniform distribution over the action space, see (4), and executing the action reaching a successor state, see Fig. 1.

$$\begin{aligned} {a}\sim U[\min ({\mathcal {A}}), \max ({\mathcal {A}})] \end{aligned}$$
(4)
Fig. 1

Selection and expansion in MCTS: circles denote states and edges denote actions

2.3.1.3 Simulation

After the expansion of an action completes, a simulation (i.e., a roll-out) of subsequent random actions is conducted until a terminal condition is met (i.e., the planning horizon is reached or an action resulting in a terminal state is sampled). This generates an estimate of the state-action value for the previous expansion, see Fig. 2.

2.3.1.4 Update

Lastly, the return \({G} \) of the trajectory \({\tau } \) generated by the iteration is backpropagated to all states along the trajectory, see Fig. 2, and the state-action values and visit counts for all actions of the trajectory are updated, see (5) and (6), respectively.

$$\begin{aligned} {N}({s},{a}) \leftarrow {N}({s},{a}) + 1 \end{aligned}$$
(5)
$$\begin{aligned} {Q^{{\pi}}({s}, {a})}\leftarrow {Q^{{\pi}}({s}, {a})}+ \frac{1}{{N}({s},{a})}({G}({\tau }) - {Q^{{\pi}}({s}, {a})}) \end{aligned}$$
(6)
Fig. 2

Simulation and update in MCTS: circles denote states and edges denote actions
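Putting the four phases together, a single-agent MCTS iteration loop could be sketched as follows. The environment interface (`step`, `sample_action`, `rollout`, `is_terminal`), the fixed number of actions per node, and the simplification that the roll-out return serves as the return of the whole trajectory are assumptions made for brevity, not part of the planner described in this chapter.

```python
import math

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> child Node
        self.q = {}          # action -> estimated state-action value Q(s, a)
        self.n = {}          # action -> visit count N(s, a)
        self.visits = 0      # visit count N(s)

def uct_value(node, action, c=1.0):
    # Upper Confidence Bound for Trees, cf. (3)
    return node.q[action] + c * math.sqrt(2.0 * math.log(node.visits) / node.n[action])

def mcts(root_state, env, iterations=1000, horizon=10, n_actions=5):
    root = Node(root_state)
    for _ in range(iterations):
        node, path = root, []
        # Selection: descend as long as the node is fully expanded
        while len(node.children) >= n_actions:
            action = max(node.q, key=lambda a: uct_value(node, a))
            path.append((node, action))
            node = node.children[action]
        # Expansion: sample an untried action uniformly, cf. (4)
        if not env.is_terminal(node.state):
            action = env.sample_action()
            child = Node(env.step(node.state, action))
            node.children[action] = child
            node.q[action] = 0.0
            node.n[action] = 0
            path.append((node, action))
            node = child
        # Simulation: random roll-out until the planning horizon
        g = env.rollout(node.state, horizon)
        # Update: backpropagate the return along the selected path, cf. (5) and (6)
        for parent, action in path:
            parent.visits += 1
            parent.n[action] += 1
            parent.q[action] += (g - parent.q[action]) / parent.n[action]
    # Final selection, e.g., by maximum visit count, cf. (22)
    return max(root.n, key=root.n.get)
```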

2.3.2 Multi-agent Driving Simulator

To simulate an agent’s experience, RL-based methods need access to an environment, such as a simulator or the real world. However, as RL is founded on the notion of trial and error, it is challenging to guarantee safety while learning in the actual world. Consequently, we use simulation; the multi-agent driving simulator created for and utilized throughout this work is described in the following.

2.3.2.1 State Space

The state \({s}\) of a traffic participant is defined with its

  • longitudinal position \({{x}_\textrm{lon}}\),

  • lateral position \({{x}_\textrm{lat}}\),

  • longitudinal velocity \({{\dot{x}}_\textrm{lon}}\),

  • lateral velocity \({{\dot{x}}_\textrm{lat}}\),

  • longitudinal acceleration \({{\ddot{x}}_\textrm{lon}}\),

  • lateral acceleration \({{\ddot{x}}_\textrm{lat}}\), and

  • heading \({\phi }\).

Further, each traffic participant is a vehicle denoted by its

  • width \({w}\),

  • length \({l}\),

  • wheelbase \({l_\textrm{wb}}\),

  • maximum acceleration \({a_\textrm{max}}\) and

  • maximum steering angle \({\delta _\textrm{max}}\).

An agent \({\upsilon }\) represents a traffic participant. Generally, this could be any participant, such as a car, truck, trailer, motorcycle, bicycle, or even a human. This work uses the kinematics of a single-track model. Thus, only car-like vehicles are considered.

2.3.2.2 Action Space

The action space of an agent is two-dimensional. The two dimensions are the longitudinal velocity change \({\Delta {{\dot{x}}_\textrm{lon}}}\) and the lateral change in position \({\Delta {{x}_\textrm{lat}}}\). The tuple describes the desired state change over the action duration \({\Delta T}= t_1 - t_0\), with \(t_0\) denoting the start and \(t_1\) the end of an action. Based on the current state and the chosen action, a jerk-optimal trajectory is calculated using quintic polynomials [54], one for the longitudinal and one for the lateral direction, see (7),

$$\begin{aligned} \begin{aligned} x(t) &= {a}_0 + {a}_1t + {a}_2t^2 + {a}_3t^3 + {a}_4t^4 + {a}_5t^5 \\ \dot{x}(t) &= {a}_1 + 2{a}_2t + 3{a}_3t^2 + 4{a}_4t^3 + 5{a}_5t^4 \\ \ddot{x}(t) &= 2{a}_2 + 6{a}_3t + 12{a}_4t^2 + 20{a}_5t^3 \end{aligned} \end{aligned}$$
(7)

as well as in matrix form (8).

$$\begin{aligned} \boldsymbol{M}{\boldsymbol{{a}}}= \begin{bmatrix} 1 & t_0 & t_0^2 & t_0^3 & t_0^4 & t_0^5\\ 0 & 1 & 2t_0 & 3t_0^2 & 4t_0^3 & 5t_0^4\\ 0 & 0 & 2 & 6t_0 & 12t_0^2 & 20t_0^3\\ 1 & t_1 & t_1^2 & t_1^3 & t_1^4 & t_1^5\\ 0 & 1 & 2t_1 & 3t_1^2 & 4t_1^3 & 5t_1^4\\ 0 & 0 & 2 & 6t_1 & 12t_1^2 & 20t_1^3\\ \end{bmatrix} \begin{bmatrix} {a}_0\\ {a}_1\\ {a}_2\\ {a}_3\\ {a}_4\\ {a}_5\\ \end{bmatrix} \end{aligned}$$
(8)

With each polynomial requiring six coefficients, six constraints need to be defined. The constraints are denoted by \({\boldsymbol{c}}\), see (9).

$$\begin{aligned} {\boldsymbol{c}}= \begin{bmatrix} x(t_0)\\ \dot{x}(t_0)\\ \ddot{x}(t_0)\\ x(t_1)\\ \dot{x}(t_1)\\ \ddot{x}(t_1)\\ \end{bmatrix} \end{aligned}$$
(9)

The start constraints are determined by the current state of the vehicle. The end constraints for \({{\dot{x}}_\textrm{lon}}\) and \({{x}_\textrm{lat}}\) are based on the selected action. The end position follows from the mean of the start and end velocities multiplied by the action duration \({\Delta T}\). Further, the acceleration in the longitudinal and lateral direction, as well as the velocity in the lateral direction, are set to zero (10).

$$\begin{aligned} \begin{aligned} &{{x}_\textrm{lon}}(t_1) = {{x}_\textrm{lon}}(t_0) + \frac{{{\dot{x}}_\textrm{lon}}(t_0)+{{\dot{x}}_\textrm{lon}}(t_1)}{2}{\Delta T}\\ &{{\dot{x}}_\textrm{lon}}(t_1) = {{\dot{x}}_\textrm{lon}}(t_0) + {\Delta {{\dot{x}}_\textrm{lon}}}\\ &{{\ddot{x}}_\textrm{lon}}(t_1) = 0\\ &{{x}_\textrm{lat}}(t_1) = {{x}_\textrm{lat}}(t_0) + {\Delta {{x}_\textrm{lat}}}\\ &{{\dot{x}}_\textrm{lat}}(t_1) = 0\\ &{{\ddot{x}}_\textrm{lat}}(t_1) = 0 \end{aligned} \end{aligned}$$
(10)

Using the constraints, the linear system (8) can be solved for the coefficients \({\boldsymbol{{a}}}\), see (11) (assuming \(\boldsymbol{M}\) is invertible).

$$\begin{aligned} \begin{aligned} \boldsymbol{M}{\boldsymbol{{a}}}&= {\boldsymbol{c}}\\ \boldsymbol{M}^{-1}\boldsymbol{M}{\boldsymbol{{a}}}&= \boldsymbol{M}^{-1}{\boldsymbol{c}}\\ {\boldsymbol{{a}}}&= \boldsymbol{M}^{-1} {\boldsymbol{c}}\end{aligned} \end{aligned}$$
(11)
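As a concrete illustration, the following NumPy sketch builds \(\boldsymbol{M}\) and \({\boldsymbol{c}}\) from (8) and (9) and solves (11) for one longitudinal polynomial; the example start state and action values are arbitrary.

```python
import numpy as np

def quintic_coefficients(t0, t1, x0, v0, a0, x1, v1, a1):
    """Solve M a = c, cf. (8), (9), (11), for the six polynomial coefficients."""
    def rows(t):
        return [
            [1, t, t**2,   t**3,    t**4,    t**5],   # position constraint
            [0, 1, 2*t,  3*t**2,  4*t**3,  5*t**4],   # velocity constraint
            [0, 0, 2,    6*t,    12*t**2, 20*t**3],   # acceleration constraint
        ]
    M = np.array(rows(t0) + rows(t1), dtype=float)
    c = np.array([x0, v0, a0, x1, v1, a1], dtype=float)
    return np.linalg.solve(M, c)                      # a = M^-1 c

# Example: longitudinal motion for an action with delta v = +2 m/s over dT = 2 s
t0, t1 = 0.0, 2.0
v0, v1 = 10.0, 12.0
x0 = 0.0
x1 = x0 + (v0 + v1) / 2.0 * (t1 - t0)                 # end position, cf. (10)
coeffs = quintic_coefficients(t0, t1, x0, v0, 0.0, x1, v1, 0.0)
```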

We generate trajectories utilizing the Frenet frame, a dynamic reference frame that aligns with the road’s centerline rather than Cartesian coordinates [59]. This transformation enables the separation of longitudinal and lateral trajectory planning.

2.3.2.3 Transition Function

Since this work does not focus on trajectory planning close to physical limits (e.g., as required by evasive maneuvers), trajectories are evaluated using a single-track model [45], which has been shown to perform sufficiently well for trajectory planning tasks [25]. The execution of trajectories derived from the selected action is deterministic.

2.3.2.4 Physical Constraints

To ensure that a chosen trajectory is drivable for a single-track model, the differential and kinematic constraints, i.e., the maximal acceleration and the minimum curve radius, must be accounted for [31].

Based on the polynomials that describe the trajectories in longitudinal and lateral directions, the heading

$$\begin{aligned} {\phi }= \arctan \left( \frac{{{\dot{x}}_\textrm{lat}}}{{{\dot{x}}_\textrm{lon}}}\right) , \end{aligned}$$
(12)

curvature

$$\begin{aligned} {\kappa }= \frac{{{\dot{x}}_\textrm{lon}}{{\ddot{x}}_\textrm{lat}}- {{\dot{x}}_\textrm{lat}}{{\ddot{x}}_\textrm{lon}}}{\left( {{\dot{x}}_\textrm{lon}}^2+{{\dot{x}}_\textrm{lat}}^2\right) ^\frac{3}{2}}, \end{aligned}$$
(13)

steering angle

$$\begin{aligned} {\delta }= \arctan \left( {l_\textrm{wb}}{\kappa }\right) , \end{aligned}$$
(14)

acceleration

$$\begin{aligned} {\ddot{x}}= \sqrt{{{\ddot{x}}_\textrm{lon}}^2 + {{\ddot{x}}_\textrm{lat}}^2}, \end{aligned}$$
(15)

and velocity

$$\begin{aligned} {\dot{x}}= \sqrt{{{\dot{x}}_\textrm{lon}}^2 + {{\dot{x}}_\textrm{lat}}^2} \end{aligned}$$
(16)

are calculated.

A new action is sampled if a selected action violates either the maximum steering angle or the maximum acceleration. Resampling also occurs if an action would lead a traffic participant off the road, see Fig. 3. However, resampling is limited to a maximum number of retries.
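A minimal sketch of such a drivability check is given below; it evaluates (13)-(15) on sampled derivatives of the planned polynomials and resamples until a valid action is found. The sampling and evaluation callbacks as well as the concrete limits are placeholders, not values from this work.

```python
import numpy as np

def is_drivable(dx_lon, dx_lat, ddx_lon, ddx_lat, wheelbase, a_max, delta_max):
    """Check sampled trajectory derivatives against the physical limits.

    The arguments dx_lon .. ddx_lat are arrays of the first and second derivatives
    of the longitudinal and lateral polynomials along the trajectory.
    """
    speed_sq = dx_lon**2 + dx_lat**2 + 1e-9  # epsilon avoids division by zero at standstill
    kappa = (dx_lon * ddx_lat - dx_lat * ddx_lon) / speed_sq**1.5   # curvature, cf. (13)
    delta = np.arctan(wheelbase * kappa)                            # steering angle, cf. (14)
    acc = np.sqrt(ddx_lon**2 + ddx_lat**2)                          # acceleration, cf. (15)
    return bool(np.all(np.abs(delta) <= delta_max) and np.all(acc <= a_max))

def sample_valid_action(sample_action, evaluate, max_retries=10):
    """Resample until a drivable action is found or the retry budget is exhausted."""
    for _ in range(max_retries):
        action = sample_action()
        # evaluate() is assumed to return the sampled derivatives of the planned polynomials
        if is_drivable(*evaluate(action), wheelbase=2.7, a_max=3.0, delta_max=0.6):
            return action
    return None
```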

The physical constraints are considered in the validation reward of the reward function, see Sect. 2.3.2.

Fig. 3
[Scatter plot over the sampled action space, \({\Delta {{\dot{x}}_\textrm{lon}}}\) in m/s versus \({\Delta {{x}_\textrm{lat}}}\), marking which actions lead to invalid successor states]

Invalid states: invalid successor states resulting from different actions

2.3.2.5 Reward Function

The reward function \({{\mathcal {R}}_{{s}}^{{a}}}\) is the basis for the agent’s behavior. It considers the state \({s}\) and the action \({a}\) of an agent.

$$\begin{aligned} r = r_{{s}} + r_{{a}} + r_\textrm{validation} \end{aligned}$$
(17)

The importance of each of the features mentioned below is adjusted with a corresponding weight.

2.3.2.6 State Reward

The state reward \(r_{s}\) is based on the divergence of the current state to the desired state. The desired state is defined by a longitudinal velocity \({v_\textrm{des}}\) and a lane index \({l_\textrm{des}}\). The desired values must be estimated for all agents by a separate module. We assume that the desired longitudinal velocity and the lane index are equal to the values before the interaction in each scenario.

Additionally, the agent is encouraged to drive close to the center of a lane by reducing the reward for deviations from the center.

2.3.2.7 Action Reward

Actions are selected to minimize the deviation from the desired state, ideally leading to an equilibrium state. Thus, all actions are penalized, and the action reward \(r_{a}\) is always negative, i.e., a cost. In this work, \(r_{a}\) considers longitudinal and lateral acceleration as well as lane changes, with w being the weights and \(\Delta {l}\) the number of lane changes an action results in. If desired, both the state and action reward can easily be extended to capture additional safety, efficiency, and comfort-related aspects of the generated trajectories.

2.3.2.8 Validation Reward

The last term is the action validation reward, see (18). It evaluates whether a state and action are valid, i.e., whether the agent stays inside the drivable environment and adheres to the physical constraints, and whether the state-action combination is collision-free.

(18)
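Since the piecewise definition in (18) did not carry over here, the following sketch illustrates one possible form of such a validation reward; the penalty values are assumptions for illustration only and are not the weights used in the planner.

```python
def validation_reward(collision: bool, invalid_state: bool, invalid_action: bool,
                      w_collision: float = 1000.0, w_invalid: float = 500.0) -> float:
    """Possible validation reward: large penalties for collisions and for invalid
    states or actions, zero otherwise. The weights are illustrative placeholders."""
    if collision:
        return -w_collision
    if invalid_state or invalid_action:
        return -w_invalid
    return 0.0
```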
2.3.2.9 Cooperative Reward

To achieve cooperative behavior, a cooperative reward \({r^i_\textrm{coop}}\) is defined. The cooperative reward of an agent i is its own reward, see (17), plus the sum of the rewards of all other agents multiplied by a cooperation factor \({\lambda }^i\), see (19) [29, 33]. The cooperation factor determines the agent's willingness to cooperate with other agents, with \({\lambda }^i = 0\) being purely interactive and \({\lambda }^i = 1\) being fully cooperative.

$$\begin{aligned} {r^i_\textrm{coop}}= r + {\lambda }^i \sum _{j = 0, j \ne i}^{n}r^{j} \end{aligned}$$
(19)

The cooperation factor can be used to represent different driver types. For example, an offensive driver weighs its own goals more heavily than the goals of other road users. The joint reward function (19) is agent-individual and does not represent a global cost function. Therefore, the cooperative rewards of the individual agents cannot be compared, since their values depend on the respective cooperation factor.
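As a direct transcription of (19), the cooperative reward of agent i can be computed from the individual rewards as follows; the representation of the rewards as a plain list is an assumption.

```python
def cooperative_reward(rewards, i, coop_factor):
    """Cooperative reward of agent i, cf. (19): its own reward plus the
    weighted sum of the rewards of all other agents."""
    others = sum(r for j, r in enumerate(rewards) if j != i)
    return rewards[i] + coop_factor * others
```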

2.3.3 Decentralized Continuous MCTS

We depict an exemplary application of MCTS to a traffic scenario requiring cooperation in Fig. 4. It is important to note that while it is possible to apply MCTS directly, specific extensions are required to model the interdependence of actions between traffic participants and allow for continuous action spaces.

The original MCTS algorithm uses UCT [24], designed for sequential decision-making games with a finite set of states and actions. However, if traffic participants interact without communication, the actions of other traffic participants are not known until they are observed. Thus, the basic MCTS used in turn-based games is not applicable, and we need to extend it to simultaneous move games. In addition, trajectory planning in a continuous state and action space requires alterations to the standard selection procedure of actions. These alterations are necessary because UCT would otherwise degenerate MCTS to Monte Carlo Search (MCS), as each action in each state needs to be selected at least once; see (3).

Fig. 4

Reference [29] © 2018 IEEE

Phases of MCTS for a scenario with a narrow passage; The gray vehicles are parked, and the red and the blue vehicle can pass the narrowing simultaneously if the red vehicle deviates from its optimal trajectory. During the selection, MCTS traverses the tree by selecting auspicious future states until a state is encountered that has untried actions left. After the expansion of the state, a simulation of subsequent actions is run until the planning horizon is reached. Next, the result is backpropagated to all states along the selected path.

2.3.3.1 Decoupled UCT

We employ Decoupled UCT (DUCT) to address the problem of decentralized, simultaneous decision-making [53]. While the decoupled version of UCT is not guaranteed to converge to an optimal policy [43], it has been shown to outperform other variants [53].

The complexity of simultaneous decision-making results from incomplete information regarding the decision of other agents. Thus, the state-action value for an action \({a}_i\) from agent i can only be approximated by averaging over all possible actions of the other agents. Based on the description of MCTS in Sect. 2.3.1, DUCT hence tracks the state-action value and visit count on a per-agent basis, and the dependency between different agents i is not considered when calculating the DUCT value,

$$\begin{aligned} {\text {DUCT}}({s},{a}_i) = \widehat{Q}^{{\pi}}({s}, {a}_i)+c\sqrt{\frac{2\log {{N}({s})}}{{N}({s},{a}_i)}}, \quad {a}_i \in {\mathcal {A}^i}. \end{aligned}$$
(20)
Fig. 5
[Search tree with root state \({s}_0\) and successor states \({s}_1\) to \({s}_9\), one for each joint action of the two agents; the edges leading to \({s}_2\), \({s}_6\), and \({s}_7\) are highlighted]

Action combinations in DUCT: the resulting successor states are based on an identical action space \(\left[ {a}_0, {a}_1, {a}_2\right] \in {\mathcal {A}}({s}_0)\) for two agents, with \(\cdot \vert \cdot \) denoting the joint action \({\boldsymbol{{a}}}\) that led to a state

Each agent selects the action that maximizes its DUCT value during the selection step. The resulting joint action \({\boldsymbol{{a}}}\) leads to the successor node if it exists or expands a new node.

For two agents with identical action spaces of three actions each, Fig. 5 depicts all possible successor states. Since DUCT tracks the state-action value and visit count separately for each agent, \({s}_0\) is considered fully expanded once each agent has executed each of its available actions (\({a}_0, {a}_1, {a}_2\)) at least once (\({s}_2, {s}_6, {s}_7\)), rather than all combinations of the joint action space. Thus, an action can only be executed again once all of an agent's actions have been explored at least once. The remaining combinations (\({s}_1, {s}_3, {s}_4, {s}_5, {s}_8, {s}_9\)) are only added to the search tree when a joint action \({\boldsymbol{{a}}}\) that has not been selected before results from the agents' random choices.

Similarly, the final selection is conducted independently of other agents.
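The per-agent bookkeeping of DUCT can be sketched as follows: each agent keeps its own value and visit-count tables and maximizes (20) independently, and the joint action is simply the tuple of the individual choices. The data layout is an assumption.

```python
import math

def duct_value(q_i, n_state, n_state_action, c=1.0):
    """Decoupled UCT value of one agent's action, cf. (20)."""
    return q_i + c * math.sqrt(2.0 * math.log(n_state) / n_state_action)

def select_joint_action(per_agent_q, per_agent_n, n_state, c=1.0):
    """per_agent_q / per_agent_n: one dict per agent mapping that agent's
    actions to its decoupled value estimate / visit count."""
    return tuple(
        max(q, key=lambda a: duct_value(q[a], n_state, n[a], c))
        for q, n in zip(per_agent_q, per_agent_n)
    )
```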

2.3.3.2 Progressive Widening

The use of UCT, see (3), in MCTS requires exploring all possible actions from a given state [24]. Actions of the successor states are only explored [9, 24] once all actions of the predecessor have been explored. Thus, if the action space is continuous, the application of UCT within MCTS degenerates to MCS.

Progressive widening, sometimes called progressive unpruning, aims to address the issue of large action spaces [10, 13], with additional research considering infinitely many actions within UCT [57].

Progressive widening gradually expands the existing action space of a node by adding additional actions. The number of actions for a state follows a sublinear function of the visit count \({N}\), see (21), with \(c \in {{\,\mathrm{\mathbb {R}}\,}}^+\) and \(\alpha \in (0,1)\) being determined empirically. The expansion of the action space can be random or follow a heuristic.

$$\begin{aligned} |{\mathcal {A}}({s})| = \lfloor c{N}(s)^{\alpha } \rfloor \end{aligned}$$
(21)

Picking a random action from the theoretical action space is the simplest technique to add new actions. Another more advanced method is to utilize a heuristic, such as blind value, as described below.

Since the visit count of individual actions decreases with increasing search depth due to branching, the application of progressive widening is restricted to a maximum depth within the tree that we determined empirically. This restriction ensures that the available actions at larger depths are still sufficiently explored.
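The widening criterion (21), including the empirically chosen depth limit mentioned above, can be expressed as a simple predicate; the parameter values are placeholders.

```python
import math

def should_widen(num_actions: int, visit_count: int, depth: int,
                 c: float = 1.0, alpha: float = 0.5, max_depth: int = 2) -> bool:
    """Expand the action space of a node if |A(s)| < floor(c * N(s)^alpha),
    cf. (21), but only up to a maximum tree depth."""
    if depth > max_depth:
        return False
    return num_actions < math.floor(c * visit_count ** alpha)
```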

2.3.3.3 Expansion Strategies

The expansion strategy is either random or guided. The random expansion strategy samples uniformly from the entire action space. In order to be able to use a continuous action space, we employ progressive widening to decide whether the action space should be expanded. The guided expansion strategy uses a heuristic such as blind value to find a promising node for expansion.

2.3.3.4 Final Selection Strategies

After the computational budget of the planning phase is exhausted, an action must be selected for execution. While UCT defines a clear selection criterion within the search tree of MCTS, different strategies for the final selection exist. Two of the most common are

$$\begin{aligned} {\text {Max Visit Count}}({s}) = {\underset{{a}\in {\mathcal {A}}({s})}{{{\,\mathrm{arg\,max}\,}}\,}} {N}({s}, {a}) \end{aligned}$$
(22)

and

$$\begin{aligned} {\text {Max Action Value}}({s}) = {\underset{{a}\in {\mathcal {A}}({s})}{{{\,\mathrm{arg\,max}\,}}\,}} {\widehat{Q}^{{\pi}}({s}, {a})}. \end{aligned}$$
(23)
2.3.3.5 Blind Value

Whenever a node is to be expanded, MCTS needs to add an action to the action space of the node. The standard strategy for discrete as well as small continuous action spaces is to employ uniform sampling over the entire action space [9, 11]. However, as the action space grows, it becomes less likely that promising regions are sampled.

A heuristic that aims to increase the likelihood of sampling promising regions is called blind value (BV) [11]. Blind value uses the previously explored actions of a node to guide the next expansion. It first focuses on regions away from previously explored actions and then shifts towards regions with many highly valued actions.

The blind value for an action \({a}_i\) of a set of randomly sampled actions \({{\mathcal {A}}_\textrm{rnd}}\) is calculated using the set of explored actions \({\mathcal {A}}_\textrm{exp}\) as well as an adaptation coefficient \(\rho \), see (24) and (26), respectively.

$$\begin{aligned} {\text {BV}}({a}_i, {\mathcal {A}}_\textrm{exp}, \rho )=\min _{{a}_j \in {\mathcal {A}}_\textrm{exp}} {\text {UCT}}({s}, {a}_j)+\rho {d}({a}_i, {a}_j) \end{aligned}$$
(24)
$$\begin{aligned} {d}({a}_i, {a}_j)=\sqrt{\left( {a}_i^{{\Delta {{\dot{x}}_\textrm{lon}}}}-{a}_j^{{\Delta {{\dot{x}}_\textrm{lon}}}}\right) ^{2}+\left( {a}_i^{{\Delta {{x}_\textrm{lat}}}}-{a}_j^{{\Delta {{x}_\textrm{lat}}}}\right) ^{2}} \end{aligned}$$
(25)
$$\begin{aligned} \rho ({\mathcal {A}}_\textrm{exp},{{\mathcal {A}}_\textrm{rnd}})=\frac{\sigma \left( \left\{ {\text {UCT}}\left( {s}, {a}_j\right) \mid \forall {a}_j \in {\mathcal {A}}_\textrm{exp}\right\} \right) }{\sigma \left( \left\{ {d}\left( 0,{a}_i\right) \mid \forall {a}_i \in {{\mathcal {A}}_\textrm{rnd}}\right\} \right) } \end{aligned}$$
(26)

The action with the highest blind value is finally selected, see (27) (Fig. 6).

$$\begin{aligned} {a}^*= {\underset{{a}_i \in {{\mathcal {A}}_\textrm{rnd}}}{{{\,\mathrm{arg\,max}\,}}\,}}{\text {BV}}({a}_i, {\mathcal {A}}_\textrm{exp}, \rho ) \end{aligned}$$
(27)
Fig. 6

Blind values of actions in the continuous space: assuming that the previously explored actions have identical UCT values, \({a}_i \in {{\mathcal {A}}_\textrm{rnd}}\) has the highest blind value, since its distance to other actions is largest, cf. (24) and (25)
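The following sketch implements the blind value expansion (24)-(27) for the two-dimensional action space, interpreting the minimum in (24) as ranging over the whole sum, as in the original blind value formulation; the UCT values of the explored actions are assumed to be provided by the search tree.

```python
import math

def distance(a, b):
    """Euclidean distance between two actions (dv_lon, dx_lat), cf. (25)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def blind_value_expansion(random_actions, explored_actions, explored_uct):
    """Pick the randomly sampled candidate with the highest blind value, cf. (24)-(27)."""
    # adaptation coefficient rho, cf. (26); epsilon guards against a zero denominator
    rho = std(explored_uct) / (std([distance((0.0, 0.0), a) for a in random_actions]) + 1e-9)

    def blind_value(a_i):
        return min(u + rho * distance(a_i, a_j)
                   for a_j, u in zip(explored_actions, explored_uct))

    return max(random_actions, key=blind_value)   # cf. (27)
```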

2.4 Experiments

We evaluate the algorithm's performance on 15 challenging multi-agent scenarios, which can be found online (Footnote 2). Each scenario consists of at least two agents with conflicting goals. Further, the start state of each agent is determined randomly within predefined nonoverlapping areas.

Each scenario has a defined terminal condition; we deem it solved once reached. The resulting average scenario length is 16.5 s.

Due to the inherent random nature of the algorithm, we evaluated each configuration using 250 random seeds to generate statistically relevant results (Footnote 3).

We assess the results with the success rate indicating the fraction of collision-free and valid trajectories. An agent’s trajectory is valid if the agent stays within the road boundaries and if the trajectory is physically drivable. We compare the results for different numbers of iterations.

First, we demonstrate that resampling invalid actions (i.e., randomly sampling a new action until a valid one is found), whether invalid due to violations of physical constraints, collisions with static obstacles, or driving off the road, improves the algorithm's performance substantially, see Fig. 7a. With this, scenarios one to six reach a mean success rate of 98.07 % at 1280 iterations, Fig. 7b. For scenarios seven to twelve, the mean success rate decreases to 95.13 % at 1280 iterations. This demonstrates the general applicability of our approach in cooperative driving scenarios. The more challenging scenarios, thirteen to fifteen, reach a mean success rate of 42.13 % at 1280 iterations and thus require a larger computational budget due to their complexity.

Second, we show that applying the blind value heuristic improves the baseline slightly, Fig. 7c. However, while most deviations are statistically significant, no choice of sample count improves the performance consistently and significantly; this is surprising, since the only insignificant change occurs at 160 iterations. Further, increasing the number of samples does not necessarily increase performance. Overall, 10 and 20 samples perform best. The results for ten samples are depicted in Fig. 7d. The changes are negligible for scenarios one to six, which already reach high overall performance. The scenarios that profit the most from guiding the exploration of the action space with the blind value heuristic are scenarios seven to nine and scenario fifteen. The overall improvement for scenarios seven to fifteen is 3.21 percentage points over all iteration counts.

Fig. 7

Comparison of the mean success rates of the baselines if invalid actions are included or resampled (a), with a detailed view of the absolute performance of resampling invalid actions (b). In (c) the performance is assessed in comparison with the blind value (BV) heuristic. The absolute difference of BV compared to the baseline is shown in (d)

3 Learning Reward Functions

Reinforcement Learning (RL) based approaches frequently use manually specified reward models [30, 60]. In environments where systems must interact with humans, their decisions must be comprehensible and predictable. However, as the complexity of the reward model rises, its manual parametrization to generate a desired behavior becomes infeasible.

In the case of driving, it is clear that various features influence the reward of any given trajectory [37]. While tuning the weighting of features to create the desired behavior in a diverse set of scenarios is tedious and error-prone, Inverse Reinforcement Learning (IRL) has proven to be able to recover the underlying reward model from recorded trajectories that demonstrate expert behavior in areas such as robotics and automated driving [1, 26, 38, 63, 65].

Our work builds upon the previously introduced cooperative trajectory planning algorithm to generate expert trajectories. Our contribution is twofold. The first is a system that conducts Guided Cost Learning (GCL) [16], a sampling-based Maximum Entropy Inverse Reinforcement Learning method with Monte Carlo Tree Search (MCTS) to efficiently solve the forward RL problem in a cooperative multi-agent setting. The second is an evaluation that compares a linear and nonlinear reward model to a manually designed one. We demonstrate that the performance of the learned models is similar to or better than the tediously tuned baseline. An overview of the system is depicted in Fig. 8.

Fig. 8

Reference [28] © 2022 IEEE

Overview of the system: at first, an initial set of expert trajectories \({\mathcal {T}}_{\textrm{E}}\) is generated. Then the cooperative trajectory planning algorithm computes a set of sample trajectories \({\mathcal {T}}_{\textrm{S}}\) using the randomly initialized reward model. Next, using the \({\mathcal {T}}_{\textrm{E}}\) and \({\mathcal {T}}_{\textrm{S}}\), the likelihood of the parameters \({\boldsymbol{\theta }}\) given the expert trajectories is increased using gradient ascent. Finally, the process repeats with the cooperative trajectory planning algorithm, sampling new trajectory samples until convergence.

3.1 Related Work

While RL’s task is deducing an optimal policy from an agent’s interactions with the environment based on a reward model [52], the opposite is the case for IRL. Here, the task is to infer the underlying reward model that the optimal policy aims to maximize [38]. Since the reward model is the most succinct and transferable description of an agent’s behavior [1], a close approximation of the underlying reward model will yield a behavior similar to the behavior that results from the optimal policy, i.e., the expert behavior.

Early work in IRL performed feature matching rather than estimating the true underlying reward function [1] to learn driving styles in a discrete driving simulator. More recently, driving styles have been learned using continuous trajectories and action spaces [26], including additional features that impact driver preference [37].

Wulfmeier et al. demonstrate the effectiveness of learning nonlinear reward models building on Maximum Entropy IRL [65] using Deep Neural Networks [62], which they extended to learning cost maps for path planning from raw sensor measurements [63].

Further improvements in the approximation of the partition function and the efficiency of IRL in combination with RL have been proposed by Guided Cost Learning [16]. By adapting the IRL procedure, the method yields both a cost function and policy given expert demonstrations using sampling-based methods. In addition, more efficient one-shot sampling methods have been proposed [61].

3.2 Problem Formulation

Using the definitions of a policy

$$\begin{aligned} {{{\pi}}({a}\mid {s})}\quad {a}\in {\mathcal {A}}({s}), \end{aligned}$$
(28)

and a trajectory

$$\begin{aligned} {\tau }= {({{{s}_{0}}, {{a}_{0}}, {{s}_{1}}, {{a}_{1}}, \dots , {{s}_{T}}})}, \end{aligned}$$
(29)

a policy over trajectories can be defined as

$$\begin{aligned} {\rho }({\tau }) = {\rho }({{{s}_{0}}, {{a}_{0}}, {{s}_{1}}, {{a}_{1}}, \dots , {{s}_{T}}}) = \prod _{t=0}^{T} {{{\pi}}({{a}_{t}}\mid {{s}_{t}})}, \end{aligned}$$
(30)

assuming a deterministic start state distribution and transition model. The return of a trajectory \({\tau }\) equals its accumulated discounted reward, obtained by taking action \({{a}_{t}}\) in state \({{s}_{t}}\) at each time step t [52], see (1). The value function of a policy \({\pi}\) for an MDP with a reward model parameterized by \({\boldsymbol{\theta }}\) is the expectation of the return of trajectories sampled from that policy,

$$\begin{aligned} V^{{\pi}}_{{\boldsymbol{\theta }}}({s}) = {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }}\left[ {G}_{{\boldsymbol{\theta }}}({\tau })\right] . \end{aligned}$$
(31)

While the forward RL problem is solved by finding the optimal policy, see (2), the inverse RL problem is solved by finding the parameters \({\boldsymbol{\theta }} \) so that,

$$\begin{aligned} V^{{{\pi _\mathrm {{\textrm{E}}}}}}_{{\boldsymbol{\theta }}}({s}) \ge V^{{\pi}}_{{\boldsymbol{\theta }}}({s}) \quad \forall {\pi}\in {\Pi }, \end{aligned}$$
(32)

with \({{\pi _\mathrm {{\textrm{E}}}}} \) being the expert policy as part of the policy space \({\Pi }\).

This work aims to learn the parameters \({\boldsymbol{\theta }} \) of a reward model for cooperative trajectory planning so that the optimal trajectories of the planning algorithm are similar to the demonstrated expert trajectories, i.e., that the expert policy yields the highest state value of all policies given the parametrization of the reward model (32).

3.3 Approach

IRL is used to learn a reward model from expert demonstrations so that the behavior sampled from the optimal policy based on this reward model resembles the expert demonstrations. Concisely, this work makes use of the previously introduced cooperative trajectory planning algorithm based on MCTS [29], and Maximum Entropy IRL [65], yielding a system that is similar to Guided Cost Learning [16].

The MCTS is used to solve the (forward) RL problem, i.e., finding the optimal policy/action given a reward model, and to generate near-optimal trajectory samples \({\mathcal {T}}_{\textrm{S}}\) for that policy. Using these trajectories in combination with the expert trajectories \({\mathcal {T}}_{\textrm{E}}\), Maximum Entropy IRL conducts a gradient ascent step that increases the likelihood of the parameters \({\boldsymbol{\theta }}\) given the expert trajectories, see (50), (51), and Fig. 8.

3.3.1 Reward Model

The reward model is a central part of an RL system, as the goal of RL is to maximize the cumulative discounted reward by finding the optimal policy [52].

Initially, IRL applied solely linear reward models that are represented as a linear combination of features \({\phi ({s}, {a})} \) and parameters \({\boldsymbol{\theta }} \) (42) [1]. However, especially for larger RL problems, linear reward models have been outperformed by nonlinear reward models such as ANNs [16, 62]. This work uses both a linear reward model and a nonlinear reward model in the form of an ANN.

3.3.1.1 Features

Similar to many other planning methods, the cooperative trajectory planner assumes the desired lane \({l_\textrm{des}} \) as well as the desired velocity \({v_\textrm{des}} \) for each agent [37]. State and action dependent features \({\phi ({s}, {a})} \) are scalar values that consider specific characteristics of a state and action. Each feature is evaluated for each time step t of the trajectory,

$$\begin{aligned} {\phi }({\tau })=\frac{1}{T}\sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}^{T}{\phi }({{s}_{t}},{{a}_{t}}). \end{aligned}$$
(33)

The parameters \({\boldsymbol{\theta }} \) are identical for all agents; the features, however, are not. All features are normalized to lie within \([-1, 1]\) over the length T of a trajectory \({\tau } \). The feature for the desired lane is defined as

$$\begin{aligned} {\phi }_\textrm{desLane}({\tau })=\frac{1}{T}\sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}^{T} \max \left( 1-\left| {l}_t-{l_\textrm{des}}\right| ,-1\right) , \end{aligned}$$
(34)

encouraging the agent to drive on the desired lane. A deviation from the desired velocity \({v_\textrm{des}} \) larger than 10 % results in a negative feature value.

$$\begin{aligned} {\phi }_\textrm{desVelocity}({\tau })=\frac{1}{T}\sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}^{T} \max \left( 1-10\left| \frac{v_t}{{v_\textrm{des}}}-1\right| ,-1\right) \end{aligned}$$
(35)

Similarly, deviating more than a quarter of the lane width \({{l}_\textrm{width}} \) from the lane center \({{l}_\textrm{center}} \) yields a negative feature value.

$$\begin{aligned} {\phi }_\textrm{laneCenter}({\tau })=\frac{1}{T}\sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}^{T} \max \left( 1-\frac{\left| {{l}_\textrm{center}}-y_t\right| }{{{l}_\textrm{width}}/4},-1\right) \end{aligned}$$
(36)

To avoid excessive accelerations, a proxy value for the acceleration a of an action is determined,

$$\begin{aligned} {c}_\textrm{acceleration}=\frac{1}{{g}}\sqrt{\frac{\int _{t}^{t+\Delta T}(a(t))^{2} {{\,\mathrm{d \!}\,}}t}{\Delta T}}. \end{aligned}$$
(37)

If this value is larger than an eighth of the gravity \({g} \), the feature turns negative,

$$\begin{aligned} {\phi }_\textrm{acceleration}({\tau })=\frac{1}{T}\sum _{({{s}_{t}},{{a}_{t}}) \in {\tau }}^{T} \max \left( 1-\frac{{c}_\textrm{acceleration}}{{g}/8},-1\right) . \end{aligned}$$
(38)

In addition, the following binary features are defined for trajectories that either result in collisions (39), invalid states (40) (i.e., an agent drives off the road), or invalid actions (41) (i.e., an agent executes a physically impossible action). Each of these binary features marks a terminal state.

$$\begin{aligned} {\phi }_\textrm{collision}({\tau }) = {\left\{ \begin{array}{ll} 1~& {\text { if } {\tau }\ \text {contains a collision}}~\\ 0~& {\text { if }{\tau }\ \text {does not contain a collision}} \end{array}\right. } \end{aligned}$$
(39)
$$\begin{aligned} {\phi }_\mathrm {invalid\ state}({\tau }) = {\left\{ \begin{array}{ll} 1~& {\text { if }{\tau }\ \text {contains an invalid state }}~\\ 0~& {\text { if }{\tau }\ \text {does not contain an invalid state}} \end{array}\right. } \end{aligned}$$
(40)
$$\begin{aligned} {\phi }_\mathrm {invalid\ action}({\tau }) = {\left\{ \begin{array}{ll} 1~& {\text { if }{\tau }\ \text {contains an invalid action}}~\\ 0~& {\text { if }{\tau }\ \text {does not contain an invalid action}} \end{array}\right. } \end{aligned}$$
(41)
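For illustration, the desired-velocity feature (35) and the binary collision feature (39) can be computed from per-step trajectory records as follows; the input representation is an assumption.

```python
def feature_des_velocity(velocities, v_des):
    """Desired-velocity feature, cf. (35): averaged over the trajectory and
    negative once the deviation exceeds 10 % of v_des."""
    terms = [max(1.0 - 10.0 * abs(v / v_des - 1.0), -1.0) for v in velocities]
    return sum(terms) / len(terms)

def feature_collision(collision_flags):
    """Binary collision feature, cf. (39)."""
    return 1.0 if any(collision_flags) else 0.0
```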
3.3.1.2 Linear Reward Model

The linear reward model \({\mathcal {R}}_{{\boldsymbol{\theta }}}\) is a linear combination of the parameters \({\boldsymbol{\theta }} \) and their respective features \({\phi ({s}, {a})} \),

$$\begin{aligned} {\mathcal {R}}_{{\boldsymbol{\theta }}}({{s}_{t}}, {{a}_{t}})= {{\boldsymbol{\theta }}^{\top }}{\phi ({{s}_{t}}, {{a}_{t}})}. \end{aligned}$$
(42)

The feature count is normalized using the length of the trajectory. Since each feature is bounded between \([-1, 1]\), the return of a trajectory is bounded between \([-||{\boldsymbol{\theta }}||, ||{\boldsymbol{\theta }}||]\).

3.3.1.3 Nonlinear Reward Model

To allow for a more complex reward structure, a nonlinear reward model in the form of an ANN is introduced,

$$\begin{aligned} {\mathcal {R}}_{{\boldsymbol{\theta }}}({{s}_{t}}, {{a}_{t}}, {s}_{t-1})=W_{2}\Gamma (W_{1} {\phi }({{s}_{t}}, {{a}_{t}}, {s}_{t-1})). \end{aligned}$$
(43)

It consists of two fully connected layers, with parameters \(W_{1}\) and \(W_{2}\) respectively. The first layer is followed by a ReLU activation function \(\Gamma \). The inputs to the ANN are the features for the linear model in addition to the values of \({\phi }_\textrm{desLane}\), \({\phi }_\textrm{desVelocity}\) and \({\phi }_\textrm{laneCenter}\) at the previous time step.
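Both reward models can be written compactly, e.g., in PyTorch; the bias-free layers follow (42) and (43), while the hidden width, input dimensions, and initialization are assumptions.

```python
import torch
import torch.nn as nn

class LinearReward(nn.Module):
    """Linear reward model, cf. (42): R_theta(s, a) = theta^T phi(s, a)."""
    def __init__(self, num_features: int):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(num_features))

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        return phi @ self.theta

class NonlinearReward(nn.Module):
    """Two-layer reward model, cf. (43): R_theta = W2 * ReLU(W1 * phi)."""
    def __init__(self, num_features: int, hidden: int = 16):
        super().__init__()
        self.w1 = nn.Linear(num_features, hidden, bias=False)
        self.w2 = nn.Linear(hidden, 1, bias=False)

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(phi))).squeeze(-1)
```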

3.3.2 Maximum Entropy Inverse Reinforcement Learning

IRL learns the parameters \({\boldsymbol{\theta }} \) of a parameterized reward model \({\mathcal {R}}_{{\boldsymbol{\theta }}}\) so that the expert policy \({{\pi _\mathrm {{\textrm{E}}}}} \) becomes the optimal policy given the reward model [38].

Instead of requiring access to the expert policy \({{\pi _\mathrm {{\textrm{E}}}}} \) itself, it is sufficient to observe trajectories \({\mathcal {T}}_{\textrm{E}}\) that originate from that policy [1].

$$\begin{aligned} {\tau _{\textrm{E}}}= ({{{s}_{0}}, {{a}_{0}}, {{s}_{1}}, {{a}_{1}}, \dots , {{s}_{T}}}) \quad {{a}_{t}}\sim {{\pi _\mathrm {{\textrm{E}}}}}({{a}_{t}}|{{s}_{t}}, {\boldsymbol{\theta }}) \end{aligned}$$
(44)

Similarly to the policy \({\pi} \) as a distribution over actions (28), a policy \({\rho } \) as a distribution over trajectories can be defined (30).

A prominent method for IRL is Maximum Entropy Inverse Reinforcement Learning [65], which assumes a probabilistic model for expert behavior. Using the definition of a policy over trajectories (30), Maximum Entropy IRL specifies the distribution over expert trajectories conditioned on the parameters of the reward model

$$\begin{aligned} {\rho }_{\textrm{E}}({\tau })=\frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{Z_{{\boldsymbol{\theta }}}}}. \end{aligned}$$
(45)

This model implies that the probability of an expert trajectory increases exponentially with its return. The numerator is the exponentiated return of a trajectory (1), and the denominator is the partition function (46), the integral of the exponentiated return over all trajectories.

$$\begin{aligned} {Z_{{\boldsymbol{\theta }}}}= \int e^{{G}_{{\boldsymbol{\theta }}}({\tau })} {{\,\mathrm{d \!}\,}}{\tau } \end{aligned}$$
(46)

The likelihood of the parameters \({\boldsymbol{\theta }} \) given the expert trajectories \( {\mathcal {T}}_{\textrm{E}}\) is defined with

$$\begin{aligned} {L}({\boldsymbol{\theta }}| {\mathcal {T}}_{\textrm{E}}) = \prod _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {\frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{Z_{{\boldsymbol{\theta }}}}}}. \end{aligned}$$
(47)

Applying the logarithm to (47) yields the log-likelihood

$$\begin{aligned} {l}({\boldsymbol{\theta }}| {\mathcal {T}}_{\textrm{E}}) = \sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} \left( {G}_{{\boldsymbol{\theta }}}({\tau }) - \log {{Z_{{\boldsymbol{\theta }}}}}\right) , \end{aligned}$$
(48)

which is proportional to

$$\begin{aligned} \frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {G}_{{\boldsymbol{\theta }}}({\tau }) - \log {{Z_{{\boldsymbol{\theta }}}}}. \end{aligned}$$
(49)

The maximization of the log-likelihood (49) (Footnote 4) with respect to the parameters \({\boldsymbol{\theta }} \) results in the parameters that best explain the expert trajectories.

$$\begin{aligned} \max _{{\boldsymbol{\theta }}\in {\boldsymbol{\Theta }}}\frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {G}_{{\boldsymbol{\theta }}}({\tau }) - \log {{Z_{{\boldsymbol{\theta }}}}} \end{aligned}$$
(50)

Using the gradient of the log-likelihood in a gradient ascent step, locally optimal parameters can be found (51).

$$\begin{aligned} {\boldsymbol{\theta }}\leftarrow {\boldsymbol{\theta }}+ {\alpha }\nabla _{{\boldsymbol{\theta }}} {l}({\boldsymbol{\theta }}) \end{aligned}$$
(51)

Using the logarithmic and exponential derivatives, the gradient of the log-likelihood can be formulated as an expectation (52) [28].

$$\begin{aligned} \nabla _{{\boldsymbol{\theta }}} {l}({\boldsymbol{\theta }}) = {\frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}}- {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }_{\textrm{E}}({\tau })}\left[ {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\right] \end{aligned}$$
(52)

3.3.3 Guided Cost Learning

GCL is an algorithm that combines sampling-based Maximum Entropy IRL with RL [16].

Since the partition function (46) can only be calculated for small and discrete MDPs, it cannot be computed for the cooperative trajectory planning problem. Instead, GCL circumvents this problem by sampling to approximate it.

It estimates the partition function (46) using the distribution over trajectories generated by a sampling-based method (in this work, the MCTS-based cooperative trajectory planner [29]) (55). The optimal proposal density for importance sampling

$$\begin{aligned} {\rho }_{\textrm{S}}^*({\tau }) \propto e^{{G}_{{\boldsymbol{\theta }}}({\tau })} \end{aligned}$$
(53)

is the distribution that yields the lowest variance [16]. The key concept of GCL is the adjustment of this sampling distribution to the distribution that follows from the current reward model (45). In order to achieve this within the MCTS, this work introduces a probabilistic final selection policy named Softmax Q-Proposal,

$$\begin{aligned} {\pi _\textrm{MCTS}}({a}| {{s}_{0}}) = \frac{e^{cQ^{{\pi}}({{s}_{0}}, {a})}}{\sum _{{a}\in {\mathcal {A}}({{s}_{0}})} e^{cQ^{{\pi}}({{s}_{0}}, {a})}}. \end{aligned}$$
(54)

The numerator is the exponentiated state-action value \(Q^{{\pi}}({{s}_{0}}, {a})\) (i.e., the expected return (1)) of taking action \({a} \) in the root state \({{s}_{0}}\); the denominator is the sum of the exponentiated state-action values of all explored actions \({\mathcal {A}}({{s}_{0}})\) in the root state. The coefficient c can be used to scale the variance of the distribution; its value is determined empirically. Based on (30), this results in the following distribution over trajectories

$$\begin{aligned} {\rho }({\tau }) = {\rho }_\textrm{MCTS}({{{s}_{0}}, {{a}_{0}}, {{s}_{1}}, {{a}_{1}}, \dots , {{s}_{T}}}) = \prod _{t=0}^{T-1} {\pi _\textrm{MCTS}}({{a}_{t}}|{{s}_{t}}). \end{aligned}$$
(55)
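A sketch of the Softmax Q-Proposal over the explored root actions and of the resulting trajectory probability (55) is given below; subtracting the maximum is a numerical-stability detail that does not change the distribution.

```python
import math

def softmax_q_proposal(q_values: dict, c: float = 1.0) -> dict:
    """Probabilistic final selection policy over the explored root actions,
    cf. (54): pi(a | s0) proportional to exp(c * Q(s0, a))."""
    m = max(q_values.values())                       # for numerical stability
    exp_q = {a: math.exp(c * (q - m)) for a, q in q_values.items()}
    z = sum(exp_q.values())
    return {a: v / z for a, v in exp_q.items()}

def trajectory_probability(step_probs) -> float:
    """Probability of a trajectory under the proposal, cf. (55):
    product of the per-step action probabilities pi_MCTS(a_t | s_t)."""
    p = 1.0
    for prob in step_probs:
        p *= prob
    return p
```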

Applying importance sampling, the expectation in (52) can be calculated using the policy \({\rho }_{\textrm{S}}({\tau })\) [28],

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }_{\textrm{E}}({\tau })}\left[ {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\right] = {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }_{\textrm{S}}({\tau })}\left[ \frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{\rho }_{\textrm{S}}({\tau }){Z_{{\boldsymbol{\theta }}}}}{\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\right] \end{aligned}$$
(56)

Further, the partition function can be approximated using importance sampling as well,

$$\begin{aligned} {\widehat{Z}_{{\boldsymbol{\theta }}}}{{\,\mathrm{:=}\,}}\frac{1}{|{\mathcal {T}}_{\textrm{S}}|} \sum _{{\tau }\in {\mathcal {T}}_{\textrm{S}}} \frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{\rho }_{\textrm{S}}({\tau })}. \end{aligned}$$
(57)

Substituting the expectation in (52) with (56) as well as the partition function (46) with (57), the final approximation of the gradient can be obtained (58).

$$\begin{aligned} \begin{aligned} \nabla _{{\boldsymbol{\theta }}} {l}({\boldsymbol{\theta }}) & = {\frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}}- {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }_{\textrm{E}}({\tau })}\left[ {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\right] \\ & = {\frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}}- {{\,\mathrm{\mathbb {E}}\,}}_{{\tau }\sim {\rho }_{\textrm{S}}({\tau })}\left[ \frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{\rho }_{\textrm{S}}({\tau }){Z_{{\boldsymbol{\theta }}}}}{\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\right] \\ & \approx {\frac{1}{|{\mathcal {T}}_{\textrm{E}}|}\sum _{{\tau }\in {\mathcal {T}}_{\textrm{E}}} {\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}}- \frac{1}{|{\mathcal {T}}_{\textrm{S}}|} \sum _{{\tau }\in {\mathcal {T}}_{\textrm{S}}}\frac{e^{{G}_{{\boldsymbol{\theta }}}({\tau })}}{{\rho }_{\textrm{S}}({\tau }){\widehat{Z}_{{\boldsymbol{\theta }}}}}{\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })}\end{aligned} \end{aligned}$$
(58)
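Numerically, the approximation in (57) and (58) can be computed as follows for the linear reward model (42), where the feature count of a trajectory is the gradient of its return; the array shapes and interfaces are assumptions.

```python
import numpy as np

def irl_gradient(theta, expert_features, sample_features, sample_probs):
    """Approximate gradient of the log-likelihood, cf. (57) and (58).

    expert_features: (N_E, d) feature counts of the expert trajectories
    sample_features: (N_S, d) feature counts of the sampled trajectories
    sample_probs:    (N_S,)   proposal probabilities rho_S(tau) of the samples
    For the linear model (42), grad_theta G_theta(tau) equals the feature count.
    """
    returns = sample_features @ theta                 # G_theta(tau) per sample
    # shifting by the maximum return cancels in the ratio below and avoids overflow
    weights = np.exp(returns - returns.max()) / sample_probs
    z_hat = weights.mean()                            # partition estimate, cf. (57), up to the shift
    expert_term = expert_features.mean(axis=0)
    sample_term = (weights[:, None] * sample_features).mean(axis=0) / z_hat
    return expert_term - sample_term

def gradient_ascent_step(theta, grad, lr=5e-4):
    """Parameter update, cf. (51)."""
    return theta + lr * grad
```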

Given this form of the gradient, the proposed Softmax Q-IRL algorithm (Algorithm 1) performs gradient ascent, converging towards the expert behavior.

Algorithm 1
Softmax Q-IRL [28] © 2022 IEEE (input: expert trajectories \({\mathcal {T}}_{\textrm{E}}\); output: reward parameters \({\boldsymbol{\theta }}\)).
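
The following sketch condenses one update of the Softmax Q-IRL loop. Here, `feature_grad` (returning \(\nabla _{{\boldsymbol{\theta }}}{G}_{{\boldsymbol{\theta }}}({\tau })\)), `return_fn` (returning \({G}_{{\boldsymbol{\theta }}}({\tau })\)), and `generate_samples` (a stand-in for Algorithm 2) are assumed interfaces, so this is an illustration of the gradient approximation (58) rather than the reference implementation.

```python
import numpy as np

def softmax_q_irl_step(theta, expert_trajs, generate_samples, return_fn, feature_grad, lr=5e-4):
    """One gradient-ascent step of Softmax Q-IRL using the
    importance-sampled gradient approximation (58)."""
    # Expert term: mean return gradient over the expert trajectories
    expert_grad = np.mean([feature_grad(theta, tau) for tau in expert_trajs], axis=0)

    # Sampler term: trajectories drawn with the Softmax Q-Proposal policy (55)
    sample_trajs, sample_probs = generate_samples(theta)        # rho_S(tau) for each sample
    weights = np.array([np.exp(return_fn(theta, tau)) / p
                        for tau, p in zip(sample_trajs, sample_probs)])
    z_hat = weights.mean()                                      # partition estimate (57)
    sample_grads = np.stack([feature_grad(theta, tau) for tau in sample_trajs])
    sampler_grad = (weights[:, None] * sample_grads).mean(axis=0) / z_hat

    grad = expert_grad - sampler_grad                           # approximate gradient (58)
    return theta + lr * grad
```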

The necessary data sampling routine (Algorithm 1 Line 5) is depicted in Algorithm 2. It generates the sample trajectories \({\mathcal {T}}_{\textrm{S}}\) as well as their policies \({\Pi }\). Here, \({\Upsilon } \) denotes the number of agents in the respective scenario.

Algorithm 2
A 15-line algorithm for the sampling of trajectories and policies. The function, generate samples of theta, has statements to sample from the start state distribution, to estimate for each action explored at the root state, for each agent in the scenario, and to return T and pi.

Sampling of Trajectories and Policies [28] © 2022 IEEE.

3.4 Experiments

Using our cooperative trajectory planner, we create a set \({\mathcal {T}}_{\textrm{E}}\) of 50 trajectories per scenario that resemble expert behavior. Each trajectory has a length of 10.4 s. The number of agents and obstacles in a scenario is fixed, but the start state is sampled from a distribution: the agents’ lateral and longitudinal positions are drawn from a normal distribution. Further, different random seeds are used to initialize the sampling-based trajectory planning algorithm.

We train the linear and nonlinear reward models for 2000 gradient steps with a learning rate of \(5 \times 10^{-4}\).

We compare the performance of the linear and nonlinear models with the manually tuned baseline in Table 1. The reward model of the baseline has been hand-tuned over several days through an iterative process of parameter modification and quantitative and qualitative analysis of the resulting trajectories; the results were shown in Sect. 2.4. Both learned models perform well in all scenarios. The nonlinear model outperforms the manually tuned baseline in SC02 and SC06, as it does not generate any collisions or invalid trajectories. Further, the learned models reach the desired velocity \({v_\textrm{des}} \) in an additional 46 % of the cases, while the desired lane is reached slightly less frequently (\(-\)5 %). Finally, both models yield a lower mean distance to the expert trajectories than the manually tuned baseline, although only the nonlinear model is consistently better. The Euclidean distance to the expert trajectories over the course of training is depicted in Fig. 9. As expected, the linear and the nonlinear model converge toward the expert trajectories. However, neither model converges completely; they stall at distances of 1.98 m and 1.30 m for the linear and nonlinear model, respectively. A possible remedy could be a reward model with higher capacity.

The visual resemblance of the samples generated by the nonlinear reward model to the expert trajectories can be assessed in Fig. 10. Some of the trajectories that deviate significantly from the majority of the optimal trajectories could be the result of the inherent stochasticity of the sampling-based trajectory planning algorithm.

Table 1 Absolute change in performance: the performances of the linear and nonlinear reward models are compared with the manually tuned baseline on all scenarios. Columns \(\Delta \)collision, \(\Delta \)invalid, \(\Delta {l_\textrm{des}} \), and \(\Delta {v_\textrm{des}} \) denote the change, relative to the baseline, in the fraction of trajectories that exhibit the respective feature. Similarly, \(\Delta \mu (d)\) and \(\Delta \sigma (d)\) represent the change, relative to the baseline, in the mean and standard deviation of the Euclidean distance d to the k-nearest neighboring (\(k = 3\)) expert trajectories.
Fig. 9
Distance mean and standard deviation between \({\mathcal {T}}_{\textrm{E}}\) and \({\mathcal {T}}_{\textrm{S}}\): the Euclidean distance to the k-nearest neighboring (\(k = 3\)) expert trajectories throughout the training for the linear (blue) and nonlinear reward model (orange). Reference [28] © 2022 IEEE

Fig. 10
Sample trajectories from the nonlinear reward model: expert trajectories (red) that are used to learn the parameters of the reward model and the optimal trajectories (blue, after 2000 training steps of the nonlinear reward model) for scenarios SC01–SC06 (delayed merging due to approaching vehicles, reacting to an approaching vehicle, merging into moving traffic, prior longitudinal adjustment, lane change, and delayed lane change). Reference [28] © 2022 IEEE

4 Planning Under Uncertainties

In reality, the true state of the surrounding environment is never exactly known. Autonomous vehicles use perception and localization sensors to observe their environment and try to deduce the current state from these observations. The real state often differs from the observations due to noisy measurements and imperfections in the perception algorithms. Hence, the true state of the environment is uncertain. If this uncertainty is not considered in the planning process, the chosen actions may lead to unwanted outcomes such as collisions or undesirable driving behavior. Therefore, we also present an approach that considers uncertainty by constructing search trees from different start states with MCTS and applying kernel regression and risk metrics afterward to find robust actions.

4.1 Related Work

Uncertainties have been addressed with various MCTS approaches in the literature. For example, Kernel Regression UCT [64] tackles the execution uncertainty in continuous action spaces by selecting actions according to a modified upper confidence bound value that incorporates all action assessments by kernel regression. Furthermore, the estimated value of a selected action is refined by applying progressive widening to add similar actions to the search tree. Another concept is determinization, which samples several deterministic problems with perfect information from a stochastic problem with imperfect information, solves these problems, and fuses their results to obtain a solution for the original problem [9]. For instance, Couëtoux et al. [12] employ UCT with Double Progressive Widening in a setting with stochastic state transitions and continuous action and state spaces. Their method expands the set of available actions and the set of sampled outcomes iteratively. Another example is HOP-UCT [6], which uses Hindsight Optimization (HOP): several deterministic UCT search trees are constructed, and their results are averaged to determine the action assessments. Ensemble-Sparse-UCT [6] creates multiple trees and restricts the number of outcomes for an action to a sampling width parameter. Afterward, the results of the search trees are combined. A different approach is Information Set MCTS [14], which uses nodes representing information sets. An information set comprises all indistinguishable states for an agent. This approach restricts the search tree to regions consistent with a determinization. The real-state uncertainty can be modeled by a POMDP approach [50] that uses a search tree with history nodes and an unweighted particle filter representing the belief state. A black-box simulator samples successor states and observations, and actions are selected according to the UCT criterion. Generally, POMDPs can be solved online with search trees of belief states using Branch-and-Bound Pruning, Monte Carlo Sampling, or Heuristic Search [41].

In the domain of automated driving, the Toolkit for Approximating and Adapting POMDP solutions In Realtime (TAPIR) [23], which is based on Adaptive Belief Trees [27], has been applied successfully if the action space is sufficiently small [19, 20, 44]. Another approach to account for uncertainty is to use Distributional Reinforcement Learning in combination with risk metrics [5]. Instead of learning a return for each state and action, Distributional Reinforcement Learning learns a return distribution. This can be done offline, exposing the agent to various uncertainties during training. Online, during inference, a risk metric is applied to the (risk-neutral) return distribution to select the best action.

4.2 Problem Formulation

Besides the execution uncertainty, which is already modeled by the transition function of an MDP/MG, a POMDP also models the uncertainty emerging from the fact that the real state of the environment can only be partially observed in reality.

A POMDP introduces an observation space \({\mathcal {D}}\) as a set of possible observations and an observation function \({\text {O}}_{{a}}^{{s}'}\) that defines the probability of an observation after reaching state \({s}'\) by executing action \({a}\). With the help of past observations, an agent can determine a belief \({\text {b}}\) about the real state in the form of a probability distribution over the state space \({\mathcal {S}}\). The objective is to find a policy \({\pi}({a}| {\text {b}})\) that specifies which action shall be executed given the belief \({\text {b}}\).

In our multi-agent setting [51], the ego-agent only receives an observation with stochastic features of the environment. The position, orientation, and velocity of other agents and the length and width of the agents’ vehicles are modeled as Gaussian distributions. Furthermore, obstacles’ position, orientation, length, and width are normally distributed. Lastly, the observation of the lane width of the road is also regarded as stochastic. All Gaussian distributions use mean vectors corresponding to the true state since the sensor system is assumed to be unbiased. The uncertainty is hence specified by the covariance matrices, which are derived from the accuracy statistics of standard perception systems using lidar or radar sensors. For simplicity, we assume that the stochastic observation features are mutually independent, and thus each feature follows an individual Gaussian distribution.
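
As an illustration of this observation model (the feature names and standard deviations below are invented for the example, not taken from the evaluation), a noisy observation can be drawn feature-wise from unbiased Gaussians with a diagonal covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (hidden) state features of another agent and illustrative sensor accuracies
true_state = {"x": 42.0, "y": 3.5, "yaw": 0.02, "v": 8.3, "length": 4.5, "width": 1.8}
sigma      = {"x": 0.3,  "y": 0.3, "yaw": 0.01, "v": 0.5, "length": 0.2, "width": 0.1}

# Unbiased, mutually independent Gaussian observation model:
# each feature is perturbed individually, i.e., the covariance matrix is diagonal.
observation = {k: rng.normal(loc=mu, scale=sigma[k]) for k, mu in true_state.items()}
```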

In the next section, we present an approach that considers the uncertainties mentioned above explicitly to find more robust actions.

4.3 Approach

In this section, we modify our previous approach so that the uncertainty about the real state due to noisy measurements is explicitly considered during planning. The objective is to find robust actions given an initial belief state \({\text {b}}_0\) that represents a probability distribution over the state space \({\mathcal {S}}\). Our approach [51] does not propose a procedure to update the belief state after each new observation in a new planning cycle. Instead, it uses the information of a belief state to select robust actions in the current planning cycle.

Similar to the modeling of the stochastic observations, we also use Gaussian distributions for the belief about the real state features. Since we assume, for simplicity, that the features are mutually independent, each belief feature is individually normally distributed with a mean corresponding to the observed value and a variance representing the accuracy of the sensor system.

Our approach samples multiple start states \(\mathcal {S}_0\) according to a modified procedure given the initial belief state \({\text {b}}_0\). It constructs a search tree with MCTS from each start state \( s _0\in \mathcal {S}_0\) to obtain action values \( Q ( s _0, \cdot )\). Action assessments from all search trees are combined with kernel regression to yield a return distribution over the start states \(\mathcal {S}_0\) (see Fig. 11). After the computational budget for the planning phase is exhausted, a risk metric is applied to the return distribution to select the best action vector \(\boldsymbol{a}^{*}\). This concept is sensible since robust actions should perform well from multiple likely start states according to the belief about the environment so that the risk of failure is reduced.

Fig. 11
Using the distribution of the belief state \({\text {b}}_0\), possible start states are sampled. Then, for each start state \( s _0\), an MCTS is run to determine the action values \( Q ( s _0, \cdot )\), yielding a return distribution over start states \(\mathcal {S}_0\). Reference [51] © 2022 IEEE

Algorithm 3 summarizes the uncertainty handling concept. The sampling of start states and the construction of search trees are conducted iteratively until the computational budget is exhausted. Afterward, the best action vector is selected, given the search trees. Before we present further details about the construction of the search trees and the final selection of actions, the next sections briefly summarize the concepts of kernel regression and risk metrics.

Algorithm 3
Uncertainty Handling Concept [51] © 2022 IEEE (input: belief state \({\text {b}}_0\); output: best action vector \(\boldsymbol{a}^{*}\))
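
A compact sketch of this concept under assumed interfaces: `sample_start_state` stands for the scaled-Gaussian sampling described in Sect. 4.3.3, `new_tree` and `run_mcts_iteration` wrap the MCTS, and `final_selection` is one of the policies of Sect. 4.3.4; all names are placeholders rather than the actual implementation.

```python
import random

def plan_under_uncertainty(belief, n_iterations, c_pw, alpha_pw,
                           sample_start_state, new_tree, run_mcts_iteration,
                           final_selection):
    """Uncertainty handling concept (cf. Algorithm 3): grow a forest of search
    trees from sampled start states and select a robust action afterwards."""
    trees = []                                   # one MCTS tree per sampled start state
    for n in range(1, n_iterations + 1):
        # Progressive widening over start states (70): expand an existing tree
        # if enough trees exist, otherwise sample a new start state.
        if trees and len(trees) >= c_pw * n ** alpha_pw:
            tree = random.choice(trees)          # uniform selection of an existing tree
        else:
            s0 = sample_start_state(belief)      # scaled-Gaussian sampling, Sect. 4.3.3
            tree = new_tree(s0)
            trees.append(tree)
        run_mcts_iteration(tree)                 # one MCTS iteration on that tree
    return final_selection(trees)                # KRLCB or CVaR variant, Sect. 4.3.4
```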

4.3.1 Kernel Regression

The general objective of regression is to find the regression function

$$\begin{aligned} m(x) &= {{\,\mathrm{\mathbb {E}}\,}}[ Y | X = x] = \frac{\int _{-\infty }^{\infty } y f(x,y) dy}{\int _{-\infty }^{\infty } f(x,y) dy} \end{aligned}$$
(59)

specifying the conditional expected value of a random variable \( Y \) given the realization of a random variable \( X \) [58]. Since the joint probability density function f is unknown, kernel regression [36, 58] estimates f by

$$\begin{aligned} \hat{f}(x,y) = \frac{1}{n} \sum _{i=1}^{n} \widetilde{{K}}(x - x_i, y - y_i) \end{aligned}$$
(60)

with a kernel \(\widetilde{{K}}\) (i.e., a non-negative smoothing function whose integral over both dimensions equals one) from a finite set of data samples \(\{(x_1, y_1), \dots , (x_n, y_n)\}\). Under mild requirements, the estimated regression function can then be formulated as

$$\begin{aligned} \hat{m}(x) = \frac{\sum _{i=1}^{n} y_i \widetilde{{K}}(x - x_i)}{\sum _{i=1}^{n} \widetilde{{K}}(x - x_i)} = \frac{\sum _{i=1}^{n} y_i {K}(x, x_i)}{\sum _{i=1}^{n} {K}(x, x_i)}, \end{aligned}$$
(61)

where \(\widetilde{{K}}(\cdot )\) is the “marginal” kernel and the kernel value \({K}(x, x_i)\) is just a simplified notation for \(\widetilde{{K}}(x - x_i)\) [58].

In the context of an MDP/MG, kernel regression can be used to assess actions. The kernel regression value

$$\begin{aligned} {\text {KR}}({s}, {a}) = \frac{\sum _{{{a}^\prime }\in {\mathcal {A}}({s})} {K}({a}, {{a}^\prime }) Q({s}, {{a}^\prime }) {N}({s}, {{a}^\prime })}{{\text {W}}({s},{a})} \end{aligned}$$
(62)

for a state \({s}\) and action \({a}\) combines the state-action value estimates \(Q({s}, {{a}^\prime })\) of all actions \({{a}^\prime }\) of a finite action set \({\mathcal {A}}({s})\) weighted by the visit counts \({N}({s}, {{a}^\prime })\) and a kernel \({K}\) that specifies the similarity between the actions [64]. The denominator in (62) is the kernel density

$$\begin{aligned} {\text {W}}({s}, {a}) = \sum _{{{a}^\prime }\in {\mathcal {A}}({s})} {K}({a}, {{a}^\prime }) {N}({s}, {{a}^\prime }) \end{aligned}$$
(63)

and denotes the exploration of action \({a}\) and similar actions. The kernel regression lower confidence bound

$$\begin{aligned} {\text {KRLCB}}({s}, {a}) = {\text {KR}}({s}, {a}) - c \sqrt{\frac{\log \sum _{{{a}^\prime }\in {\mathcal {A}}({s})} {\text {W}}({s}, {{a}^\prime })}{{\text {W}}({s}, {a})}} \end{aligned}$$
(64)

penalizes poorly explored actions by subtracting a normalized exploration term scaled by a constant \(c \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\) from the kernel regression value [64].
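
The following sketch evaluates the kernel regression value (62), the kernel density (63), and the KRLCB (64) for a finite set of explored actions, using a Gaussian kernel in the spirit of (74); all inputs are assumed to be gathered from the search tree.

```python
import numpy as np

def kernel(a, a_prime, gamma=1.0):
    """Gaussian similarity kernel, cf. (74)."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(a_prime)) ** 2))

def kr_and_krlcb(a, actions, q_values, visit_counts, c=1.0, gamma=1.0):
    """Kernel regression value (62), kernel density (63), and KRLCB (64)."""
    k = np.array([kernel(a, ap, gamma) for ap in actions])
    n = np.asarray(visit_counts, dtype=float)
    q = np.asarray(q_values, dtype=float)
    density = np.sum(k * n)                           # W(s, a), Eq. (63)
    kr = np.sum(k * q * n) / density                  # KR(s, a), Eq. (62)
    # Total exploration: sum of the kernel densities of all explored actions
    total_density = sum(np.sum(np.array([kernel(ap, aq, gamma) for aq in actions]) * n)
                        for ap in actions)
    krlcb = kr - c * np.sqrt(np.log(total_density) / density)   # Eq. (64)
    return kr, density, krlcb
```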

4.3.2 Risk Metrics

A risk metric is a measure of risk, but it is not a metric in a mathematical sense since it does not represent a distance function. Let \( Z : \Omega \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) be a random variable assigning costs (in monetary units) that are caused by an action to each outcome \(\omega \) of the sample space \(\Omega \). Furthermore, let \(\mathcal {Z}\) denote the set of all cost random variables, then a risk metric \({\rho }: \mathcal {Z} \rightarrow {{\,\mathrm{\mathbb {R}}\,}}\) maps each cost random variable \( Z \) to a real number that represents the amount of risk [34]. Sensible risk metrics should fulfill the following axioms [34]: Monotonicity, Translation Invariance, Positive Homogeneity, Subadditivity, Comonotone Additivity, and Law Invariance.

One example for a risk metric that fulfills all six axioms is the Conditional Value at Risk (CVaR) [34]. For the definition of CVaR, the Value at Risk (VaR)

$$\begin{aligned} {\text {VaR}}_{\alpha }( Z ) &:= \min \left\{ z \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( Z > z) \le \alpha \right\} \end{aligned}$$
(65)
$$\begin{aligned} &= \min \left\{ z \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( Z \le z) \ge 1 - \alpha \right\} \end{aligned}$$
(66)

of a cost random variable \( Z \) is required; it specifies the smallest \((1 - \alpha )\)-quantile of \( Z \) for a given probability \(\alpha \) [34, 40]. CVaR is then defined as the conditional expected value

$$\begin{aligned} {\text {CVaR}}_{\alpha }( Z ) := {{\,\mathrm{\mathbb {E}}\,}}\left[ Z \ | \ Z \ge {\text {VaR}}_{\alpha }( Z ) \right] \end{aligned}$$
(67)

under the condition that the costs are greater than or equal to \({\text {VaR}}_{\alpha }( Z )\). In addition, the Upper Value at Risk

$$\begin{aligned} {\text {VaR}}_{\alpha }^+ ( Z ) &:= \inf \left\{ z \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( Z > z) < \alpha \right\} \end{aligned}$$
(68)
$$\begin{aligned} &= \inf \left\{ z \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( Z \le z) > 1 - \alpha \right\} \end{aligned}$$
(69)

specifies the largest \((1 - \alpha )\)-quantile of \( Z \) for a given probability \(\alpha \) [40]. \({\text {VaR}}_{\alpha }^+\) and \({\text {VaR}}_{\alpha }\) only differ if the cumulative distribution function is constant around the probability level \((1 - \alpha )\).
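
As a simple numerical illustration (an empirical, sample-based approximation rather than the exact quantile definitions), VaR and CVaR of a cost random variable can be estimated from a finite set of cost samples as follows.

```python
import numpy as np

def var_cvar(cost_samples, alpha=0.05):
    """Empirical VaR (smallest z with P(Z > z) <= alpha, Eq. (65)) and
    CVaR (expected cost given that the cost is at least VaR, Eq. (67))."""
    z = np.sort(np.asarray(cost_samples, dtype=float))
    n = len(z)
    exceed_frac = (n - 1 - np.arange(n)) / n   # P(Z > z_k) under the empirical distribution
    var = z[np.argmax(exceed_frac <= alpha)]   # smallest sample satisfying the condition
    cvar = z[z >= var].mean()                  # E[Z | Z >= VaR]
    return var, cvar

# Example: VaR and CVaR of the costs caused by an action at alpha = 0.25
var, cvar = var_cvar([1.0, 2.5, 0.7, 4.2, 3.1, 0.9, 5.0, 2.2], alpha=0.25)
```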

4.3.3 Construction of Search Trees

Search trees are constructed iteratively from the start states sampled according to the belief state. We apply the idea of progressive widening (cf. Sect. 2.3.3) to balance the number and the depth of the trees. In each iteration, it is checked whether the condition

$$\begin{aligned} \left| \mathcal {S}_0\right| \ge c_{pw} {\text {N}}^{\alpha _{pw}} \end{aligned}$$
(70)

with start states \(\mathcal {S}_0\), constant parameters \(c_{pw} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\) and \(\alpha _{pw} \in [0,1)\), and the total visit count \({\text {N}}\) holds true. If this condition is fulfilled, an already existing start state is selected uniformly from \(\mathcal {S}_0\), and its subtree is expanded according to MCTS. Otherwise, a new start state associated with a new search tree is generated.

The generation of a new start state is a modified sampling process. Given the belief that is modeled as a random vector following a Gaussian distribution \(\mathcal {N}(\boldsymbol{\mu }, \mathbf {\Sigma })\) with mean vector \(\boldsymbol{\mu }\) and covariance matrix \(\mathbf {\Sigma }\), a start state is sampled from the scaled Gaussian distribution \(\mathcal {N}(\boldsymbol{\mu }, c \mathbf {\Sigma })\). If the sampled start state represents a collision or is invalid, the scaling factor c is increased, and a new sampling attempt is conducted. The scaling factor c is determined by

$$\begin{aligned} c = \min \left\{ \left( c_{step} \right) ^{l_{step}} , c_\textrm{max}\right\} , \end{aligned}$$
(71)

where \(c_{step} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\) is a constant, \(l_{step}\) is a step counter, and \(c_\textrm{max} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\) is the maximum allowed value. The step counter

$$\begin{aligned} l_{step} = \left\lfloor \frac{l_{attempt}}{l_{stepSize}} \right\rfloor \end{aligned}$$
(72)

depends on the total number of attempts conducted so far \(l_{attempt} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\), and the step size \(l_{stepSize} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\) specifies the number of attempts per step.

Since this process is executed independently for sampling each start state, the i-th start state is a realization of the random vector \(\textbf{X}_i \sim \mathcal {N}(\boldsymbol{\mu }, c_i \mathbf {\Sigma })\) with an individual factor \(c_i\).

Increasing the scaling factor \(c_i\) increases the variance of the sampled start states. Hence, the probability of obtaining a collision-free and valid start state, from which planning is sensible, rises. Furthermore, actions must perform well from a greater variety of start states, which increases robustness. On the other hand, we restrict the variance to a realistic level by limiting the scaling factor to the maximum value \(c_\textrm{max}\).
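
A sketch of this sampling process, with a hypothetical `is_valid` predicate standing in for the collision and validity check; the parameter values are placeholders.

```python
import numpy as np

def sample_start_state(mu, cov, is_valid, c_step=2.0, c_max=16.0,
                       l_step_size=5, max_attempts=100, rng=None):
    """Sample a start state from N(mu, c * cov); the scaling factor c grows
    stepwise with the number of failed attempts, cf. (71) and (72)."""
    rng = rng or np.random.default_rng()
    cov = np.asarray(cov, dtype=float)
    for l_attempt in range(max_attempts):
        l_step = l_attempt // l_step_size        # step counter, Eq. (72)
        c = min(c_step ** l_step, c_max)         # scaling factor, Eq. (71)
        candidate = rng.multivariate_normal(mu, c * cov)
        if is_valid(candidate):                  # collision-free and valid?
            return candidate, c
    raise RuntimeError("no valid start state found within the attempt budget")
```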

4.3.4 Final Selection

After the computational budget is exhausted for constructing the search trees, a robust action is finally selected. We present two variants using the Kernel Regression Lower Confidence Bound (KRLCB) and the Conditional Value at Risk (CVaR). Both variants follow the general steps of Algorithm 4. For each agent \(i \in \Upsilon \), action candidates \(\mathcal {C}^i\) are determined at first. This set comprises all actions from collision-free and valid start states that have been visited more often than a threshold. This condition ensures that unreliable action value estimates with large variances are not considered. If the set \(\mathcal {C}^i\) is empty, the default action “Maintain (0)” is selected for the agent i. Otherwise, kernel regression is applied to calculate the densities \(\mathcal {W}^i\) and kernel regression values \(\mathcal{K}\mathcal{R}^i\). Given these values, each final policy variant computes action candidate values \(\mathcal {Q}^i\) that assess the performance and robustness of the action candidates. Finally, the action with the largest value \(\mathcal {Q}^i\) is selected.

Algorithm 4
Basic Final Selection Policy [51] © 2022 IEEE (input: search trees of the planning phase; output: best action vector \(\boldsymbol{a}^{*}\))

Similar to (63), the density for an action candidate \(a \in \mathcal {C}^i\) is calculated as

$$\begin{aligned} {\text {W}}[{a}| \mathcal {C}^i] = \sum _{{a}'\in \mathcal {C}^i} {K}({a}, {a}') {\text {N}}({s}_{{a}'}, {a}') \end{aligned}$$
(73)

with the kernel

$$\begin{aligned} {K}({a}, {a}') = \exp \left( -\gamma \left\Vert {a}- {a}'\right\Vert ^2 \right) \end{aligned}$$
(74)

and the visit count \({\text {N}}({s}_{{a}'}, {a}')\) of action candidate \({a}'\) from the corresponding start state \({s}_{{a}'}\). Furthermore, the kernel regression value (cf. (62)) is then defined as

$$\begin{aligned} {\text {KR}}[{a}| \mathcal {C}^i] = \frac{\sum _{{a}'\in \mathcal {C}^i} {K}({a}, {a}') Q ({s}_{{a}'}, {a}') {\text {N}}({s}_{{a}'}, {a}')}{\sum _{{a}'\in \mathcal {C}^i} {K}({a}, {a}') {\text {N}}({s}_{{a}'}, {a}')} \end{aligned}$$
(75)

with \( Q ({s}_{{a}'}, {a}')\) being the estimated action value of action \({a}'\) from the start state \({s}_{{a}'}\).

The first final policy variant calculates the Kernel Regression Lower Confidence Bound (KRLCB) by

$$\begin{aligned} {\text {KRLCB}}[{a}| \mathcal {C}^i] = {\text {KR}}[{a}| \mathcal {C}^i] - c \sqrt{\frac{\log \sum _{{a}'\in \mathcal {C}^i} {\text {W}}[{a}'| \mathcal {C}^i]}{{\text {W}}[{a}| \mathcal {C}^i]}} \end{aligned}$$
(76)

with a constant \(c \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\). Actions with high KRLCB values are well explored from many start states due to the exploration term (subtrahend) and provide large returns from many start states due to the exploitation term (minuend). Hence, the KRLCB selects robust actions.

The second policy variant employs the Conditional Value at Risk (CVaR). Since CVaR has been defined for a cost random variable \( Z \) in Sect. 4.3.2, we derive a consistent definition for a reward random variable \( R := - Z \). The Value at Risk (VaR) of \( Z \) can be transformed with

$$\begin{aligned} \begin{aligned} {\text {VaR}}_{\alpha }( Z ) &= \min \left\{ z \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}(- R > z) \le \alpha \right\} \\ & = - \max \left\{ r \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( R < r) \le \alpha \right\} \\ & = - \inf \left\{ r \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( R \le r) > \alpha \right\} \\ & = - \inf \left\{ r \in {{\,\mathrm{\mathbb {R}}\,}}\ | \ {{\,\mathrm{\mathbb {P}}\,}}( R > r) < 1 - \alpha \right\} \\ & \overset{(68)}{=}\ - {\text {VaR}}_{ 1 - \alpha }^+ ( R ) \end{aligned} \end{aligned}$$
(77)

where \({\text {VaR}}_{ 1 - \alpha }^+\) is the Upper Value at Risk and \(\alpha \) is a probability.

The objective is to select the best action \({a}^{*}\) from a set of available actions \(~~{{\tilde{\!\!\!\mathcal {A}}}^i}\) that minimizes the risk of costs according to the risk metric CVaR (see (78)). The following derivations

$$\begin{aligned} {a}^{*}&= {{\,\mathrm{arg\,min}\,}}_{{a}\in ~~{{\tilde{\!\!\!\mathcal {A}}}^i}} {\text {CVaR}}_{\alpha }( Z ) \nonumber \\ &= {{\,\mathrm{arg\,min}\,}}_{{a}\in ~~{{\tilde{\!\!\!\mathcal {A}}}^i}} {{\,\mathrm{\mathbb {E}}\,}}\left[ - R \ | \ - R \ge {\text {VaR}}_{\alpha }( Z ) \right] \end{aligned}$$
(78)
$$\begin{aligned} & \overset{{(77)}}{=} {{\,\mathrm{arg\,min}\,}}_{{a}\in ~~{{\tilde{\!\!\!\mathcal {A}}}^i}} {{\,\mathrm{\mathbb {E}}\,}}\left[ - R \ | \ R \le {\text {VaR}}_{1 - \alpha }^+ ( R ) \right] \nonumber \\ &= {{\,\mathrm{arg\,max}\,}}_{{a}\in ~~{{\tilde{\!\!\!\mathcal {A}}}^i}} {{\,\mathrm{\mathbb {E}}\,}}\left[ R \ | \ R \le {\text {VaR}}_{1 - \alpha }^+ ( R ) \right] \end{aligned}$$
(79)

show that this idea is equivalent to choosing the action that maximizes a conditional expected value based on the reward random variable \( R \). We denote this expected value as Complementary Conditional Value at Risk (CCVaR) and define it as

$$\begin{aligned} {\overline{{\text {CVaR}}}}_{1-\alpha }( R ) := {{\,\mathrm{\mathbb {E}}\,}}\left[ R \ | \ R \le {\text {VaR}}_{1 - \alpha }^+ ( R ) \right] \text {.} \end{aligned}$$
(80)

In the context of MDPs and POMDPs, an agent selects the action with the largest \({\overline{{\text {CVaR}}}}_{1-\alpha }( G )\) value, where \( G \) denotes the return (1).

Before the CVaR idea can be applied, a return distribution must be determined. We use kernel regression to calculate the density \(\mathcal {W}^i_{{a}, s _0} := {\text {W}}[{a}| \mathcal {C}^i_{ s _0}]\) (see (73)) and the kernel regression value \(\mathcal{K}\mathcal{R}^i_{{a}, s _0} := {\text {KR}}[{a}| \mathcal {C}^i_{ s _0}]\) (see (75)) of each action \({a}\) with respect to the action candidates \(\mathcal {C}^i_{ s _0} \subseteq \mathcal {C}^i\) from each start state \( s _0\). Each kernel regression value \(\mathcal{K}\mathcal{R}^i_{{a}, s _0}\) for a specific start state \( s _0\) serves as an unweighted particle in the set

$$\begin{aligned} \mathcal{K}\mathcal{R}^i_{{a}} = \left\{ \mathcal{K}\mathcal{R}^i_{{a}, s _0} \ | \ s _0\in \mathcal {S}_0\wedge \mathcal{K}\mathcal{R}^i_{{a}, s _0} \in \mathcal{K}\mathcal{R}^i \right\} \subseteq \mathcal{K}\mathcal{R}^i \end{aligned}$$
(81)

representing the return distribution of an action \({a}\) from different start states. A particle is not added to the set if the density \(\mathcal {W}^i_{{a}, s _0}\) is smaller than a threshold \(w_\textrm{min} \in {{\,\mathrm{\mathbb {R}}\,}}_{\ge 0}\). This condition ensures that unreliable return estimates due to a low visit count of similar actions do not distort the distribution.

Algorithm 5
CVaR Final Selection Policy [51] © 2022 IEEE (kernel regression over the action candidates \(\mathcal {C}^i\) followed by CCVaR-based candidate values \(\mathcal {Q}^i\) for each agent)

After the particle sets have been constructed, the action candidate values \(\mathcal {Q}^i\) can be determined. Let

$$\begin{aligned} N_{ s _0} = | \{\mathcal {C}^i_{ s _0} \ | \ s _0\in \mathcal {S}_0\wedge \mathcal {C}^i_{ s _0} = \left\{ \text {action candidates from } s _0\right\} \subseteq \mathcal {C}^i \wedge \mathcal {C}^i_{ s _0} \ne \emptyset \} | \end{aligned}$$
(82)

be the number of start states from which actions have been appended to the action candidates \(\mathcal {C}^i\). If the final selection policy follows the process described at the beginning of Sect. 4.3.4 for the initialization of \(\mathcal {C}^i\), \(N_{ s _0}\) corresponds to the number of collision-free and valid start states visited more often than a threshold. Then, the condition

$$\begin{aligned} |\mathcal{K}\mathcal{R}^i_{{a}}| \ge c_{m} N_{ s _0} \end{aligned}$$
(83)

with a constant \( c_{m} \in [0,1]\) indicates whether \(\mathcal{K}\mathcal{R}^i_{{a}}\) has enough elements to be a meaningful representation for the return distribution of action \({a}\). If this is the case, the action candidate value \(\mathcal {Q}^i_{a}\) is set to \({\overline{{\text {CVaR}}}}_{1-\alpha }(\mathcal{K}\mathcal{R}^i_{{a}})\) (see (80)). Otherwise, \(\mathcal {Q}^i_{a}\) is set to negative infinity, which prevents the selection of this action.

The CVaR final selection policy, summarized in Algorithm 5, favors robust actions since it combines the performances from several start states and limits the influence of high-return outliers from a few specific start states.
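
A sketch of the CVaR-based candidate evaluation: the CCVaR (80) of a particle set (81) is approximated by the empirical \(\alpha \)-quantile of the kernel regression values (a simplification of \({\text {VaR}}_{1 - \alpha }^+\)), and candidates violating condition (83) are excluded; the helper names and numbers are illustrative.

```python
import numpy as np

def ccvar(particles, alpha=0.1):
    """Complementary CVaR (80): mean of the kernel regression particles at or
    below the empirical alpha-quantile, i.e., the average worst-case return."""
    r = np.asarray(particles, dtype=float)
    threshold = np.quantile(r, alpha)        # approximation of VaR^+_{1-alpha}(R)
    return r[r <= threshold].mean()

def candidate_values(kr_particles, n_start_states, c_m=0.5, alpha=0.1):
    """Assign a CCVaR-based value to each action candidate; candidates whose
    particle set is too small to represent the return distribution (83) are
    excluded via a value of -inf."""
    values = {}
    for action, particles in kr_particles.items():
        if len(particles) >= c_m * n_start_states:    # condition (83)
            values[action] = ccvar(particles, alpha)
        else:
            values[action] = -np.inf
    return values

# Toy example: KR particles of two candidates from four start states
values = candidate_values({"keep_lane": [1.0, 0.8, 1.2, 0.2],
                           "change_lane": [2.0, -3.0, 2.5, 2.2]},
                          n_start_states=4, alpha=0.25)
best_action = max(values, key=values.get)    # action with the largest CCVaR value
```

In the toy example, the candidate with high but unreliable returns is rejected in favor of the consistently good one, which is exactly the intended robustness effect of limiting the influence of high-return outliers.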

4.4 Experiments

We evaluate the presented approach in the same manner as described in Sect. 2.4. In addition, we compare our approach with a baseline corresponding to our basic approach of Sect. 2.3.3, which neglects uncertainties. The results are summarized in Fig. 12.

In an uncertain environment with noisy measurements, the performance of the baseline decreases significantly compared to a deterministic environment (see Fig. 12a and c). For instance, the mean success rate drops from 85 % to 24 % at 1280 iterations. As the number of iterations increases, the success rate of the baseline also rises due to better action assessment and exploration. Our approach with the KRLCB final selection policy outperforms the baseline significantly; at 1280 iterations, for instance, the success rate reaches 62 %. As indicated in Fig. 12b, scenarios 1–10 are solved nearly completely, with success rates approaching 100 %. Furthermore, a greater number of iterations improves the performance in general. However, more complex scenarios with multiple agents, such as scenarios 11–15, need more than 1280 iterations for satisfactory performance.

Fig. 12
Comparison of the mean success rates between the baseline, KRLCB, and CVaR final selection policies in an uncertain (a) and a deterministic (c) environment. The performance of KRLCB is depicted in more detail for uncertain (b) and deterministic (d) scenarios (SC)

Our approach that employs the CVaR final selection policy also outperforms the baseline with 320 iterations or more but consistently performs worse than KRLCB over all evaluated iteration counts (see Fig. 12a). The experiments show that CVaR needs more iterations to unfold its potential; for instance, more than 160 iterations are necessary to improve upon the baseline’s success rate. This is plausible since a sufficient number of particles is necessary for a reasonable representation of the return distribution.

If our approach is applied in a deterministic environment (see Fig. 12c), the success rates are lower on average compared to the baseline. This behavior is expected since the baseline can plan from the environment’s real state without needing robustification. On the other hand, the KRLCB and CVaR policies improve their performance with increasing iteration counts. Fig. 12d shows that KRLCB even solves scenarios 2–6 better than the baseline in a deterministic environment. However, the success rates are significantly lower for the other scenarios.

5 Conclusion

Our work proposes a method to plan cooperative trajectories in challenging interactive urban scenarios and tight spaces. At its core, MCTS powers the search for optimal actions. We employ Decoupled-UCT to model the simultaneous action selection process of multiple interacting agents. In addition, we augment the planning algorithm with progressive widening to enable a continuous action space that allows for arbitrary trajectories.

Further, we combine Maximum Entropy IRL with MCTS to learn reward models for the cooperative trajectory planning problem. The ability of the MCTS to quickly generate (approximately) optimal samples for arbitrary reward models, in combination with adjusting the sampling distribution after each gradient update, yields reward models that quickly converge towards the expert behavior. Furthermore, with this approach, it is no longer necessary to manually tune reward functions to mimic human drivers.

Last, we extend the resulting cooperative trajectory planning algorithm to explicitly model uncertainties and thus account for noisy perception and partially unknown environments. By constructing search trees from multiple start states, combining their action assessments by kernel regression to yield a return distribution, and applying risk metrics to this distribution, our approach significantly improves the performance in uncertain environments.