1 Traffic Signal Controller with Deep Reinforcement Learning

Traffic congestion is a tremendous cost factor in terms of fuel and time, and many cities all over the world suffer from it [19]. Moreover, road transport emissions are considered a main cause of air pollution [6, 39]. To alleviate traffic congestion and the associated problems, smarter and cleaner vehicles have been investigated [23, 28]. In addition, the efficiency of road traffic can be improved by optimizing the scheduling of traffic lights.

In this section, we focus on reducing congestion by improving automated traffic light controllers. More specifically, we focus on traffic signal controllers (TSCs) for isolated intersections [31], i.e., signalized intersections whose traffic is unaffected by any other controllers or supervisory devices.

The performance of conventional fixed-time or actuated TSCs is limited by their restricted setup and the relatively primitive sensor information available. Recently, adaptive TSCs [20] have attracted more attention due to their high degree of flexibility. Advances in perception and vehicle-to-everything (V2X) communication [18] could make such controllers even more capable by providing additional real-time information, such as the locations and velocities of the vehicles. With more detailed information available, adaptive TSCs have the potential to provide optimal control according to the current traffic situation. One approach is to treat traffic signal optimization as a scheduling problem [20, 47], in which a junction is considered a production line and the incoming vehicles different products to be processed. However, this type of method suffers from the curse of dimensionality, which limits its applicability to small numbers of vehicles [1]. As a result, these methods in general only satisfy real-time requirements for oversimplified intersections or small traffic flow rates.

A recent line of research proposes to design adaptive TSCs based on deep reinforcement learning (DRL). DRL has been shown to reach state-of-the-art performance in various domains [30, 37]. However, we believe that the performance of DRL approaches in the traffic domain can be pushed further, in particular with regard to the following limitations:

  • Most previous approaches have focused on improving efficiency, which is calculated from the throughput of the intersection. However, we argue that the equity of the travel times of individual vehicles is also of vital importance. Previous works have mostly been evaluated in scenarios with relatively low traffic flow, in which case the trade-off between efficiency and equity might not have a great influence on the performance of the controller. However, in dense traffic with nearly- or even over-saturated intersections and unbalanced traffic density on the incoming lanes, the efficiency-equity trade-off can be an important factor.

  • The flexibility of adaptive TSCs has not been sufficiently explored. Instead, most approaches employ fixed green traffic light duration or fixed traffic light cycles.

  • Previously proposed DRL agents are trained and evaluated in relatively simplified traffic scenarios: very few traffic demand episodes with limited variation or evenly distributed flow for each incoming lane [16]. Thus, their experimental results might not be sufficient indicators of their performance in real traffic scenarios.

  • Current DRL-based approaches have shown performance improvement mainly against fixed-time or actuated TSCs. They either have not compared with state-of-the-art adaptive TSCs, such as the Max-pressure controller [43], or do not surpass state-of-the-art performance [15, 16].

To overcome these limitations, we present a novel method that introduces the following innovations:

  • An equity factor to trade off efficiency (average travel time) against equity (variance of individual travel times) as well as a solution to calculate a rough bound for it.

  • An adaptive discounting method to account for the issues brought by transitional phases of traffic signals, which is shown to substantially stabilize learning.

  • A learning strategy that surpasses state-of-the-art baselines. It is generic with regard to different traffic flow rates, traffic distributions among incoming lanes and intersection topologies.

In line with the aforementioned DRL approaches, we conduct experimental studies in the traffic simulation environment SUMO [25]. We show that our method achieves state-of-the-art performance, which had been held by traditional non-learning methods, on a wide range of traffic flow rates with varying traffic distributions on the incoming lanes. The content of Sect. 1 is based on [49].

1.1 Related Works

In traditional fixed-time TSC designs [31], the traffic flow rates at intersections are treated as constants, and the green-red phases for each route are scheduled in a cyclic manner. The duration of each green phase is then optimized using historical flow rates. The Uniform TSC, which uses the same fixed duration for all green phases, and Webster’s method [44], which pre-times the durations according to recent traffic history, are commonly used as baselines in TSC works [16]. As real traffic flow rates generally vary across lanes and over time, the performance of such TSCs is restricted.

Actuated TSCs [31] make use of loop detectors, which are electromagnetic sensors mounted within the road pavement. Such sensors can detect incoming vehicles and estimate their velocity as they pass by, so that actuated TSCs can dynamically react to vehicles driving into the intersection. Yet, their performance is still restricted by the limited information provided by these sensors.

For decades, researchers have investigated adaptive TSCs, which can schedule traffic lights acyclically and with flexible green phase durations according to the real-time traffic situation. Some early works like [14, 26] have been widely applied in real traffic designs, yet it is still believed that the performance of TSCs can be further improved. In recent years, analytical [18, 50], heuristic [17, 43] and learning-based [11, 15, 16, 46] approaches have been proposed. Among these, the heuristic Max-pressure method [43] is reported to hold state-of-the-art performance [16]. DRL-based methods hold great promise, as they can learn generalized and flexible controller policies by interacting with traffic simulators and can provide scheduling decisions in real time, as opposed to some non-learning methods that need optimization iterations before producing each decision.

A few works have deployed DRL for isolated intersection TSCs [15, 16, 21, 33]. However, none of them were able to surpass the state-of-the-art performance achieved by the Max-pressure method. Each of these methods proposes its own reward function for training the agent, but the connections between them have not been made clear. In this work we attempt to give such an analysis of the different reward functions that have been proposed (Sect. 1.2.3).

While efficiency has been the main objective of most of these works, some previous algorithms have considered equity implicitly by designing the reward as a weighted sum of several different quantities of the intersection [33, 45, 46]. However, finding the optimal weighting is non-trivial. In this work we instead propose an equity factor along with a method to calculate its rough bound.

1.2 Methods

We consider the task of TSC in the standard reinforcement learning setting. At each step, from its state \(s\in \mathcal {S}\) the agent selects an action \(a\in \mathcal {A}\) according to the policy \(\pi (\cdot |s)\). It then transitions to the next state \(s'\in \mathcal {S}\) and receives a scalar reward \(r\in \mathbb {R}\). The state and action spaces and the reward function used in our work are discussed in the following subsections.

For learning the optimal policy that maximizes the discounted (by \(\gamma \)) cumulative expected rewards, we use proximal policy optimization (PPO) [37] as the backbone DRL algorithm. For a policy \(\pi _{\theta }\) parameterized by \(\theta \), PPO maximizes the following objective:

$$\begin{aligned} \mathcal {J}_{\theta }=\mathbb {E}_t \Big [ \min \Big ( \rho _t(\theta ) A_t, \text {clip}\left( \rho _t(\theta ), 1-\epsilon , 1+\epsilon \right) A_t \Big ) + \beta _\text {entropy} \cdot H \Big ( \pi _{\theta }(s_t) \Big ) \Big ], \end{aligned}$$
(1)

where the expectation is taken over samples collected by following \(\pi _{\theta _\text {old}}\), \(\rho _t({\theta })=\nicefrac {\pi _{\theta }(a_t|s_t)}{\pi _{\theta _\text {old}}(a_t|s_t)}\) is the importance sampling ratio, and \(\epsilon \) is a hyperparameter for clipping the probability ratio. \(H\) denotes the entropy of the current policy, and \(\beta _\text {entropy}\) adjusts the strength of the entropy regularization. \(A_t\) is a truncated version (on trajectory segments of length up to K) of the generalized advantage estimator [36], which is an exponentially-weighted average (controlled by \(\lambda \)):

$$\begin{aligned} A_t = \delta _t + (\gamma \lambda ) \delta _{t+1} + \dots + (\gamma \lambda )^{K-1-t}\delta _{K-1}, \end{aligned}$$
(2)

where \(\delta _t = r_t + \gamma V_{\phi _\text {old}}(s_{t+1}) - V_{\phi _\text {old}}(s_t)\). The value function \(V_{\phi }\), parameterized by \(\phi \), is learned by minimizing the following loss (with coefficient \(\beta _\text {value}\)):

$$\begin{aligned} \mathcal {L}_{\phi } = \beta _\text {value} \cdot \mathbb {E}_t \left[ \left\Vert V_{\phi }(s_t) - \Big (V_{\phi _\text {old}}(s_t) + A_t\Big )\right\Vert ^2_2 \right] . \end{aligned}$$
(3)
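
For concreteness, the following minimal PyTorch sketch shows how the clipped surrogate objective of Eq. (1) and the value loss of Eq. (3) can be computed for one mini-batch; the tensor names and default coefficients are illustrative assumptions rather than the exact implementation used here.

```python
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, old_values,
               entropy, clip_eps=0.2, beta_entropy=0.01, beta_value=0.5):
    """Clipped PPO surrogate (Eq. 1) and value loss (Eq. 3) for one mini-batch."""
    # Importance sampling ratio rho_t(theta).
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective; negated because optimizers minimize.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)
    policy_loss = -(surrogate + beta_entropy * entropy).mean()
    # Value regression towards V_old + A_t (Eq. 3).
    value_target = old_values + advantages
    value_loss = beta_value * (values - value_target).pow(2).mean()
    return policy_loss, value_loss
```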

1.2.1 Action Space

We apply our method to a four-road intersection where each road contains three incoming lanes (one forward-only, one forward+right-turning, one left-turning, Fig. 1a). We note that our approach easily generalizes to other intersections by adjusting the state and action representations accordingly.

Fig. 1

The intersection and its corresponding action space: (a) the 4-road intersection with 3 incoming lanes for each road; (b) the 4 green phases (actions) of the intersection

The agent has an action space of size 4: while one of the two sets of facing directions (north and south, east and west) has only red light, the other set can schedule either of the following two traffic light signal combinations (Fig. 1b):

  • Green for the forward-only and forward+right-turning lanes and red for the rest;

  • Green for the left-turning lanes and red for the rest.

In order to give the agent more flexibility, we set the duration of each of the 4 actions to 1 second.

We note that choosing one action means scheduling a distinct green phase. During the transition between different green phases, yellow or all-red phases must be scheduled. In our work, a \(3{\text {s}}\)-yellow and \(2{\text {s}}\)-all-red phase is scheduled before activating a new green phase. We denote by \(\textsf{T}_{yr}=5\,{\text {s}}\) the constant duration of this yellow-red phase.

Due to this setting, if two different actions (green phases) are scheduled consecutively, the effective duration of the second action is \(6\,{\text {s}}\) instead of \(1\,{\text {s}}\); while if the same action (green phase) is scheduled twice in a row, then the effective duration for the second action is still \(1\,{\text {s}}\). During the learning process, the aforementioned two scenarios should not be treated equally. To cope with this we propose the method of adaptive discounting which will be presented when discussing the reward function (Sect. 1.2.3).

1.2.2 State Space

At each process step, the state \(s_t\) the agent receives comprises the following components (a construction sketch is given after the list):

  • The distance along the lane to the traffic light and the velocity of each vehicle that has not passed the light and is within a \(150\,{\text {m}}\) range of the center of the intersection (each lane thus has a maximum capacity of 19 vehicles). A block of \(19\times 2\) scalars in the state vector is reserved for the vehicles of each incoming lane. The vehicles’ states in each block are sorted according to their distance values. The order of lanes in the state vector is kept unchanged. All values are normalized to lie within \([-1,1]\). If a lane does not reach its maximum capacity, the remaining position and velocity entries are set to 1 and \(-1\), respectively.

  • The action of the last step \(a_{t-1}\) (in one hot encoding so a 4-dimensional vector).

  • A counter that contains, for each action, the time in seconds since its last execution. This 4-dimensional vector is normalized by \(500\,{\text {s}}\). This component, along with the last action \(a_{t-1}\), helps to avoid state aliasing.
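
A minimal sketch of how such a state vector could be assembled is given below; the 150 m range, the per-lane capacity of 19 vehicles and the padding values follow the text, while the exact normalization constants and helper structures are assumptions for illustration.

```python
import numpy as np

MAX_VEHICLES_PER_LANE = 19   # capacity within the 150 m range
NUM_LANES = 12               # 4 roads x 3 incoming lanes
NUM_ACTIONS = 4

def build_state(lanes, last_action, seconds_since_action):
    """lanes: list of 12 lists of (distance_m, velocity_mps) tuples, one list per
    incoming lane, containing only vehicles that have not passed the light."""
    blocks = []
    for vehicles in lanes:  # the lane order must stay fixed across steps
        vehicles = sorted(vehicles)[:MAX_VEHICLES_PER_LANE]        # sort by distance
        block = np.full((MAX_VEHICLES_PER_LANE, 2), [1.0, -1.0])   # padding values
        for i, (dist, vel) in enumerate(vehicles):
            block[i, 0] = 2.0 * dist / 150.0 - 1.0   # distance -> [-1, 1]
            block[i, 1] = 2.0 * vel / 13.89 - 1.0    # velocity / 50 km/h -> [-1, 1]
        blocks.append(block.flatten())
    one_hot = np.eye(NUM_ACTIONS)[last_action]               # last action a_{t-1}
    counters = np.asarray(seconds_since_action) / 500.0      # normalized by 500 s
    return np.concatenate(blocks + [one_hot, counters])      # total size 464
```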

1.2.3 Reward Function

Several different reward functions have been proposed in previous works to train DRL agents for controlling traffic signals. However, the reasoning behind the different designs has not been clearly presented, and the connections between the different choices and the effects they cause have not been thoroughly analyzed. We attempt such an analysis below, which indicates that the vanilla versions of those rewards tend to result in policies that only consider time efficiency (average travel time in an intersection). We then propose solutions that also take equity (variance of individual travel times) into consideration.

Fig. 2

Illustration of the proposed adaptive discounting, as well as several important concepts in the traffic intersection domain. Each released car is shown with a distinct symbol on top. Cars in yellow have not yet passed their traffic light and are considered in the state representation; immediately after passing the light (judged by the head of the car) they are depicted in purple and counted towards the throughput. The 1\({\textrm{st}}\), 2\({\textrm{nd}}\) and 4\({\textrm{th}}\) sub-figures correspond to system elapsed times \(\{10,11,17\}{\text {s}}\) and to learning process steps \(\{t,t+1,t+2\}\), respectively. Since action \(a_{t+1}\) schedules a different green phase than \(a_t\), a \(3{\text {s}}\)-yellow and \(2{\text {s}}\)-red phase is scheduled before the new green phase. The 3\({\textrm{rd}}\) sub-figure (red dashed bounding box) shows the 1\({\textrm{st}}\) second of the yellow phase. In previous works, discounting has been conducted with respect to the process step; we instead propose to discount according to the system elapsed time, which our experiments show to be of vital importance for stable learning

Definitions

We first give the definitions of several important concepts in traffic intersection systems. We visualize the important ones in Fig. 2.

  • Total number of vehicles in the intersection (N): At t, the number of vehicles in the intersection system \(N_t\) is the total number of vehicles that are within a certain range to the intersection center (e.g., \(150\,{\text {m}}\)) but have not yet passed through the corresponding traffic lights.

  • Throughput (\(N^\text {TP}\)): The number of vehicles that pass through the traffic lights of their corresponding incoming lanes within \(\left( t-1,t\right] \) is denoted \(N^{\text {TP}}_t\).

  • Travel time (\(T_\text {travel}\)): For a single vehicle, its travel time is counted as the time period starting from when it enters the intersection and ending when it passes through the traffic light. The total travel time of the intersection is the summation of the individual \(T_\text {travel}\) of each vehicle in the intersection. We note an equivalent way of calculating the total travel time is to count \(N_t\) at every second and sum it over a given time period.

  • Delay time (\(T_\text {delay}\)): Similar to travel time, except that a constant is subtracted from each individual travel time: \(T_\text {delay}=T_\text {travel}-\textsf{T}_\text {free}\), where \(\textsf{T}_\text {free}\) is the constant time length for a vehicle to pass through the intersection system with no cars ahead and green lights always on.

  • Traffic flow rate (F): The number of vehicles that pass through an intersection in unit time. A commonly used unit is the number of vehicles per hour \(\nicefrac {\text {v}}{{\text {h}}}\).

  • Saturation flow rate (\(\textsf{F}_\text {s}\)): This is a constant representing the traffic flow rate for one lane under the condition that the traffic light stays green during unit time and that the flow of traffic is as dense as it could be [5].

Reward Function Categories

Given the above definitions, the majority of the reward functions proposed in the TSC domain can be categorized into the following two types:

  • Throughput-based reward functions \(\mathcal {R}^\text {TP}\) [46]. The vanilla form of this type uses the throughput \(N_{t}^{\text {TP}}\) as the reward for step t. Learning on this reward function means maximizing the cumulative throughput of the intersection. The change in throughput \(N^\text {TP}_t - N^\text {TP}_{t-1}\) has also been used as a reward function [35].

  • Travel-time-based reward functions \(\mathcal {R}^\text {TT}\) [11, 15, 16, 33, 46]. As mentioned before, the total travel time of an intersection for a given period of time \([\tau _\text {start}, \tau _\text {end}]\) can be calculated as the summation of \(N_t\) during that time: \(\sum _{\tau _\text {start}}^{\tau _\text {end}}N_t\). The vanilla reward function of this type thus uses \(-N_t\) as the reward for step t. We note that \(N_t=N_{t-1}-N^\text {TP}_t+N^\text {in}_t\), where \(N^\text {in}_t\) denotes the number of new vehicles entering the system from \(t-1\) to t, which is commonly assumed to be determined solely by the traffic flow distribution and is thus out of the control of the TSC. Learning on this reward function results in policies that minimize the cumulative travel time. Reward functions utilizing the change of cumulative delay time between actions and the total delay time of the intersection have also been investigated. A minimal sketch of both vanilla reward forms is given after this list.
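
As a minimal illustration (the function and variable names are ours), the two vanilla reward forms and the bookkeeping identity \(N_t=N_{t-1}-N^\text {TP}_t+N^\text {in}_t\) can be written as follows:

```python
def update_vehicle_count(n_prev, n_tp, n_in):
    """Bookkeeping identity N_t = N_{t-1} - N_t^TP + N_t^in."""
    return n_prev - n_tp + n_in

def throughput_reward(n_tp_t):
    """Vanilla throughput-based reward R^TP: vehicles released within (t-1, t]."""
    return n_tp_t

def travel_time_reward(n_t):
    """Vanilla travel-time-based reward R^TT: negative number of vehicles currently
    in the intersection, so the (undiscounted) return approximates the negative
    total travel time."""
    return -n_t
```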

The above description indicates that maximizing the cumulative throughput and minimizing the total travel time can both result in policies that put efficiency as the top priority. In our research we observed that the throughput-based reward generally leads to more stable learning with smaller variance across different runs. Therefore, we focus on the throughput-based reward in the following.

Adaptive Discounting

Treating the two scenarios discussed in Sect. 1.2.1 equally when discounting rewards of either of the two categories can result in suboptimal policy performance. We propose adaptive discounting, which discounts properly in those scenarios and which our experiments show to be critical for convergence.

We illustrate this method under the throughput-based reward \(R^{\text {TP}}_t=N^{\text {TP}}_t\) in Fig. 2: At system elapsed time \(10\,{\text {s}}\) the reinforcement learning process is at step t. The action \(a_t\) is chosen, which schedules green lights for the left-turning lanes of the north-south roads. Transitioning from t to \(t+1\), the throughput reward obtained is \(r_{t+1}=2\). This is a normal RL iteration and no special adjustment is needed. But at step \(t+1\), when the system elapsed time is \(11\,{\text {s}}\), the action \(a_{t+1}\) is chosen, which schedules green lights for the forward+right-turning directions of the north-south roads, a different green phase than that of \(a_{t}\). This means that a \(3{\text {s}}\)-yellow and a \(2{\text {s}}\)-all-red phase will be automatically scheduled before the new green phase. The \(5\,{\text {s}}\) intermediate phase and the chosen \(1\,{\text {s}}\) green phase both lie within step \(t+2\) of the learning process. During this step the throughput obtained at elapsed times \(\{12,13,14,15,16,17\}{\text {s}}\) is \(\{1,0,0,0,0,2\}\). With no special treatment, the reward for step \(t+2\) would be \(r_{t+2}=3\). This can lead to undesired properties, since the agent gets the intermediate phase “for free” for collecting extra rewards whenever it chooses to schedule a different green phase, and the subsequent states are not sufficiently discounted. Furthermore, if the throughput of two episodes matches at every system elapsed second, the agent should obtain exactly the same return, even with different traffic light schedules. However, with the transitional phases there is no longer a one-to-one mapping between the system time and the process step, so when discounting according to process steps, those two episodes could lead to different returns. This issue has been overlooked in the current literature on DRL-based TSC designs [11, 16]. We therefore propose adaptive discounting to account for the mismatch between the two timing paradigms, in which we discount the reward according to the system elapsed time instead of the learning process steps. As a result, the reward for \(t+2\) is calculated as:

$$\begin{aligned} r_{t+2}= 1+ \gamma \cdot 0+ \gamma ^2\cdot 0+ \gamma ^3\cdot 0+ \gamma ^4\cdot 0+ \gamma ^5\cdot 2, \end{aligned}$$
(4)

and a discount factor of \(\gamma ^6\) instead of \(\gamma \) will be used for the subsequent reward or value.
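
A minimal sketch of adaptive discounting, assuming the per-second throughput rewards covered by each learning step are available:

```python
def adaptive_step(per_second_rewards, gamma):
    """per_second_rewards: throughput collected at each elapsed second covered by
    this learning step (length 1 when the same green phase is repeated, and
    T_yr + 1 = 6 when a different green phase was scheduled). Returns the step
    reward (Eq. 4) and the discount to apply to the subsequent value/return."""
    r = sum(gamma ** i * x for i, x in enumerate(per_second_rewards))
    step_discount = gamma ** len(per_second_rewards)   # e.g. gamma^6 instead of gamma
    return r, step_discount

# Example from Fig. 2: throughput {1, 0, 0, 0, 0, 2} over the 6 s of step t+2.
r, d = adaptive_step([1, 0, 0, 0, 0, 2], gamma=0.98)
```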

The Equity Factor

Having presented the adaptive discounting technique, we now present the equity factor for reward functions for training TSCs. The aforementioned two types of reward functions (throughput-based and travel-time-based) both treat efficiency, i.e., the average travel time of the intersection, as the major concern. Equity, the variance of individual travel times, is not explicitly considered. Take the following scenario as an example: assuming that the north-south roads are saturated while the east-west roads have lighter traffic, the policy maximizing the cumulative throughput would always keep the north-south traffic lights green while keeping the east-west lights red. Consequently, the vehicles on the east-west roads might have to wait intolerably long to pass through the intersection. This is because in the vanilla reward definitions every vehicle contributes equally to the throughput or to the travel time, regardless of how long it has been waiting.

Following the above analysis, we propose to use the vehicle’s travel time together with an equity factor \(\eta \) in the reward function. The basic idea is to adapt the contribution of each vehicle to the throughput-based reward according to its travel time in the intersection when it passes the traffic light. Instead of simply counting the value 1 when a vehicle passes through, we consider three ways to incorporate \(\eta \) into the reward calculation: linear (\(\eta \cdot T_\text {travel}\)), power (\({T_\text {travel}}^\eta \)) and base (\(\eta ^{T_\text {travel}}\)). Since simply scaling the rewards does not change the value function landscape, we mainly considered the power and base forms. Our experimental results show that the power-form equity factor leads to convergence to better policies than the base form. Therefore, we focus on the analysis of \({T_\text {travel}}^\eta \) in the following.

To define the proper range of \(\eta \), two special scenarios are considered.

  • Scenario 1: Only one vehicle is before the traffic light, and its travel time at step t is \(\tau \). With the equity factor \(\eta \) and the discount factor \(\gamma \), the return contributed by this vehicle would be \(\tau ^\eta \) if it passes through the traffic light at t, and \(\gamma \cdot (\tau +1)^\eta \) if one second later. We require \(\tau ^\eta >\gamma \cdot (\tau +1)^\eta \) so that releasing this vehicle sooner is preferred. This yields \(\eta <\nicefrac {\ln (\gamma )}{\ln {\frac{\tau }{\tau +1}}}\).

  • Scenario 2: One lane with a green light is over-saturated, while a single car is waiting at a red light in another lane. If the over-saturated lane always keeps its green light and the single vehicle is never released, the highest return for any state is:

    $$\begin{aligned} G^\text {e} ={\textsf{T}_\text {free}}^\eta \left( 1 + \gamma ^{\frac{1}{\textsf{F}_\text {s}}} + (\gamma ^{\frac{1}{\textsf{F}_\text {s}}})^2 + \cdots \right) = \nicefrac {{\textsf{T}_\text {free}}^\eta } {1-\gamma ^{\frac{1}{\textsf{F}_\text {s}}}} \end{aligned}$$

    (denoted as \(G^\text {e}\) as in this case efficiency is the top priority). If the waiting vehicle is released at step t when its travel time is \(\tau \), the upper limit of the return the system can obtain at state \(s_t\) is:

    $$\begin{aligned} \sup (G^\text {e+e}) = \textsf{T}_\text {free}^\eta + \tau ^\eta \cdot \gamma ^{\textsf{T}_\text {yr}} + \nicefrac { \big (\textsf{T}_\text {free}+2\cdot \textsf{T}_\text {yr}+1\big ) ^\eta \cdot \gamma ^{2\cdot \textsf{T}_\text {yr}+1} } {1-\gamma ^{\frac{1}{\textsf{F}_s}}} \end{aligned}$$

    (we use \(G^\text {e+e}\) since this strategy cares about both efficiency and equity). The three terms in the summation are all calculated from the best-case scenario (the traffic light on the saturated lane turns yellow and then red for a total of \(\textsf{T}_\text {yr}\) elapsed time, then the light on the single-vehicle lane turns green for one second before turning yellow) to obtain the upper limit: the first term is the reward obtained from the vehicle on the saturated lane that manages to pass through at the beginning of the yellow phase; the second term is contributed by the single vehicle passing through the traffic light in its \(1\,{\text {s}}\) green phase; the last term is the sum of the rewards obtained by the vehicles on the saturated lane after the green phase switches back to this lane. We require \(G^\text {e}<\sup (G^\text {e+e})\) so that the single vehicle is released after a certain travel time \(\tau \).

This analysis yields a range for \(\eta \). We note that this is a rough calculation under our system settings, as, for example, the traffic flow in the saturated lane does not recover instantaneously to \(\textsf{F}_s\) after the green light switches back. Nevertheless, the analysis gives a general way to calculate a rough bound for \(\eta \). Our experimental results show that desirable TSC policies can be learned within this bound.
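
As an illustration of how such a bound can be evaluated, the following sketch checks the two conditions numerically; the constants \(\gamma \), \(\textsf{T}_\text {free}\), \(\textsf{T}_\text {yr}\), \(\textsf{F}_\text {s}\) and the waiting time \(\tau \) below are assumed example values, not the exact system constants.

```python
import math

def eta_upper_bound_scenario1(gamma, tau):
    """Scenario 1: releasing a lone vehicle sooner must be preferred,
    i.e. tau^eta > gamma * (tau + 1)^eta."""
    return math.log(gamma) / math.log(tau / (tau + 1.0))

def equity_condition_scenario2(eta, gamma, tau, t_free, t_yr, f_s):
    """Scenario 2: releasing the single waiting vehicle at travel time tau must
    allow a higher return than never releasing it, i.e. G^e < sup(G^{e+e})."""
    g_e = t_free ** eta / (1.0 - gamma ** (1.0 / f_s))
    g_ee = (t_free ** eta
            + tau ** eta * gamma ** t_yr
            + (t_free + 2 * t_yr + 1) ** eta * gamma ** (2 * t_yr + 1)
              / (1.0 - gamma ** (1.0 / f_s)))
    return g_e < g_ee

# Assumed illustrative values; f_s is given in vehicles per second.
gamma, t_free, t_yr, f_s = 0.98, 10.0, 5.0, 0.5
print(eta_upper_bound_scenario1(gamma, tau=120.0))
print(equity_condition_scenario2(eta=0.25, gamma=gamma, tau=120.0,
                                 t_free=t_free, t_yr=t_yr, f_s=f_s))
```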

1.3 Experiments

1.3.1 Experimental Setup

We conduct experiments using the urban traffic simulator SUMO [25] and evaluate the trained agents both on simulated one-hour traffic demand episodes (with the intersection type described above) and on a real-world whole-day traffic demand (with a different type of intersection in Freiburg, Germany). Both intersections have a speed limit of 50 km/h. We compare against the following common baselines in the TSC domain:

  • Uniform: This controller cycles through the green phases of the intersection in a fixed order. Each green phase is scheduled for the same fixed duration, which is a hyperparameter of this algorithm.

  • Webster’s [44]: Like the Uniform controller, it schedules traffic phases in a cyclic manner, but each phase duration is adjusted according to the latest traffic flow history. It has three hyperparameters: the length \(T_\text {history}\) of the traffic flow history taken into account for deciding the phase durations of the next \(T_\text {history}\) period, and the minimum and maximum duration of one complete cycle.

  • Max-pressure [43]: Regarding vehicles in lanes as substances in pipes, this algorithm favors control schedules that maximize the release of pressure between incoming and outgoing lanes. More specifically, with the incoming lanes being all lanes that receive a green light in a certain phase, and the outgoing lanes being those onto which the traffic from the incoming lanes exits the intersection, this controller tends to minimize the difference in the number of vehicles between the incoming and outgoing lanes (a simplified sketch of its selection rule is given after this list). The minimum green phase duration is a hyperparameter.
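
The following simplified sketch illustrates the core selection rule of the Max-pressure baseline; the data structures are illustrative and the minimum-green handling is omitted (see [43] for the full algorithm).

```python
def max_pressure_phase(phases, queue):
    """phases: mapping phase -> list of (incoming_lane, outgoing_lane) movements
    that receive green in that phase; queue: mapping lane -> number of vehicles.
    Returns the phase with maximal pressure, i.e. the largest total difference
    between incoming and outgoing queue lengths."""
    def pressure(phase):
        return sum(queue[i] - queue[o] for i, o in phases[phase])
    return max(phases, key=pressure)
```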

We note that previous learning methods were not able to surpass the state-of-the-art performance held by the non-learning method Max-pressure TSC [16].

Regarding our network architecture for the intersection in Fig. 1a, the input size for both the policy network \(\theta \) and the value network \(\phi \) is \(4+4+2\cdot 19\cdot 12=464\). The network \(\theta \) consists of fully connected layers of sizes \(2\,048\) (ReLU), \(1\,024\) (ReLU) and 4 (softmax), where 4 is the size of the action space. For \(\phi \), the fully connected layers are of sizes \(2\,048\) (ReLU), \(1\,024\) (ReLU) and 1. We perform a grid search to find the hyperparameters. We use a learning rate of \(2.5\textrm{e}{-5}\) for the Adam optimizer and \(1\textrm{e}{-3}\) as the weight decay coefficient. For PPO, we use 32 actors and a clipping threshold of \(\epsilon =0.2\). In each learning step, around 20 mini-batches of size \(1\,000\) are used to update the networks for 8 epochs.
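
Assuming a standard PyTorch implementation with two separate networks (sharing a trunk would also be possible), the architecture described above corresponds to the following sketch:

```python
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 464, 4

# Policy network (theta): 464 -> 2048 -> 1024 -> 4 with a softmax over actions.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 2048), nn.ReLU(),
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, NUM_ACTIONS), nn.Softmax(dim=-1),
)

# Value network (phi): same hidden sizes with a single scalar output.
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 2048), nn.ReLU(),
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, 1),
)
```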

1.3.2 Training

Previous methods focused on relatively limited traffic situations, for example a single one-hour demand episode [11] or traffic input below \(3\,000\nicefrac {\text {v}}{{\text {h}}}\) [15, 16]. In this work we expose our method to a much wider range of traffic demands. For the four-way junction we consider, the upper bound of the traffic flow can be calculated as \(4\cdot \textsf{F}_s\), where \(\textsf{F}_s\) is the saturation flow rate of one incoming lane. This maximum flow is reached when all 4 forward-going lanes of either the north-south or the east-west roads have green lights and are at full capacity. However, this extreme scenario rarely happens in real traffic. In our experiments we found that the intersection already starts to saturate at around \(3\,000\nicefrac {\text {v}}{{\text {h}}}\) of total traffic input. In our training we set the range of the traffic flow rate to \(\left[ F_{\min }, F_{\max }\right] =\left[ 0,6\,000\nicefrac {\text {v}}{{\text {h}}}\right] \), which is much wider than the ranges used in previous works.

With this flow rate range, we sample traffic demand episodes for training. Each episode is \(1\,200\,{\text {s}}\) long and defined by the following randomly sampled parameters: the total traffic flow at the beginning and end, \(F_\text {begin}\) and \(F_\text {end}\), and for each incoming lane its share of the total input at the beginning and end. \(F_\text {begin}\) is sampled uniformly from \(\left[ F_{\min }, F_{\max }\right] \). Then \(F_\text {end}\) is sampled uniformly from \([\max (F_{\min }, F_{\text {begin}}-1\,500), \min (F_{\max }, F_{\text {begin}}+1\,500)]\). The flow ratios are obtained by sampling 12 uniform random numbers and normalizing them by their sum. The traffic flow during the episode is then linearly interpolated. The sampled episodes, with potentially large changes in traffic flow and unbalanced distributions, should be sufficient to cover real traffic scenarios.
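
A sketch of this episode sampling procedure is given below; whether the per-lane ratios are themselves interpolated between their begin and end values is an assumption on our part, while the remaining constants follow the text.

```python
import numpy as np

F_MIN, F_MAX = 0.0, 6000.0      # total traffic flow range in v/h
EPISODE_LEN = 1200              # episode length in seconds
NUM_LANES = 12

def sample_demand(rng=np.random.default_rng()):
    """Sample one training demand episode: total flow and per-lane ratios at the
    beginning and end; the flow in between is linearly interpolated."""
    f_begin = rng.uniform(F_MIN, F_MAX)
    f_end = rng.uniform(max(F_MIN, f_begin - 1500.0), min(F_MAX, f_begin + 1500.0))
    ratio_begin = rng.uniform(size=NUM_LANES); ratio_begin /= ratio_begin.sum()
    ratio_end = rng.uniform(size=NUM_LANES); ratio_end /= ratio_end.sum()
    t = np.linspace(0.0, 1.0, EPISODE_LEN)
    total = (1.0 - t) * f_begin + t * f_end                        # total flow per second
    ratios = np.outer(1.0 - t, ratio_begin) + np.outer(t, ratio_end)
    return total[:, None] * ratios                                 # per-lane flow in v/h
```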

1.3.3 Evaluation During Training

During training, we conduct an evaluation to monitor the learning progress every 20 learning steps, which corresponds to 640 episodes experienced by the 32 actors. For each evaluation phase 5 evaluators are deployed, corresponding to the traffic flow ranges \(\left[ 500,1\,500\right] \), \(\left[ 1\,500,2\,500\right] \), \(\left[ 2\,500,3\,500\right] \), \(\left[ 3\,500,4\,500\right] \) and \(\left[ 4\,500,5\,500\right] \), respectively. Each evaluator samples traffic demand for evaluation in its range similarly to how training episodes are sampled, except that the flow rates at the beginning and end are independently sampled from the same corresponding range.

An ablation study is conducted to analyze the individual contributions of different components in our proposed algorithm. The plots are shown in Fig. 3, where the following agent configurations are compared: \([\times ]+[\eta =0]\), \([\times ]+[\eta =0.25]\), \([\text {ad}]+[\eta =0]\), \([\text {ad}]+[\eta =1]\), \([\text {ad}]+[\eta =0.25]\). \([\text {ad}]\) means the agent utilizes adaptive discounting while \([\times ]\) means not; \([\eta =\cdot ]\) denotes the value of the equity factor used by the agent, where the \([\eta =0]\) agents, which use exactly the vanilla throughput-based reward, care only about efficiency while the \([\eta =1]\) ones favor equity.

Interestingly, from Fig. 3 we observe that the two agents without adaptive discounting struggle to learn successful policies at both low and high flow rates. We can also observe the influence of the equity factor \(\eta \): the \([\text {ad}][\eta =0]\) agent, which does not care about equity, converges to a better policy than the \([\text {ad}][\eta =1]\) agent at lower traffic density, while the latter outperforms the former in denser traffic. This makes sense, since with little traffic input the equity problem is not critical, while with higher traffic flow the intersection can be saturated with continuously growing queues even under optimal policies. Efficiency-first policies favor releasing more vehicles in saturated traffic, so vehicles in other lanes can experience long waiting times.

Fig. 3

Waiting time obtained in evaluation during training for all agent configurations of the ablation study. Each plot shows the mean with \(\pm \nicefrac {1}{5}\) standard deviation over 3 non-tuned random seeds (we show \(\nicefrac {1}{5}\) of the standard deviation for clearer visualization). The left plot shows the logs of the evaluators for the traffic flow range \([500,1\,500]\), the right one for \([4\,500,5\,500]\). Vehicles that have passed the traffic light are not considered for the waiting time. The waiting time of a vehicle is calculated as \(T_\text {episode}-T_\text {in}\), where \(T_\text {episode}\) is the episode duration and \(T_\text {in}\) is the time when it enters the intersection

We observe that the \([\text {ad}]+[\eta =0.25]\) configuration obtains the best performance across different traffic flow rates, thus this is used for the agent Ours in the following experiments.

Comparing the plots of travel time (for released vehicles) and waiting time (for vehicles not yet released), we notice that the average waiting time always decreases during training as the policy improves, while the average travel time may vary in different ways. This is because the travel time only considers the released vehicles. Some initially poor policies may choose the same action all the time, which leads to fast throughput for vehicles on the lanes with green light but extremely long waiting times for other vehicles. The waiting time, in contrast, considers only the vehicles that have not passed the intersection during the episode. As the policy improves, the number of vehicles remaining in the intersection at the end decreases. In order to show the training progress clearly, we therefore plot the waiting time.

Fig. 4

Performance comparison of our work with the baselines on 150 one-hour simulated demand episodes (30 from each of the 5 traffic flow ranges), showing the mean and standard deviation of the travel time per range of total traffic input. We note that the baselines are optimized for each test episode before being tested on it

1.3.4 Evaluation on Simulated Traffic Demand

To test the performance of our agent, we first evaluate on simulated traffic demand episodes of one hour each. For each of the 5 traffic flow rate ranges used by the evaluators during training, we randomly sample 30 episodes; this exact set of \(5\cdot 30\) episodes is used to test all compared algorithms. These demand episodes are sampled following a procedure similar to that used for evaluation during training.

To ensure a fair comparison, in each demand episode we use exactly the same vehicle generation times for the different methods. Via the sampling process described above, our test set covers a very wide range of traffic scenarios and thus provides a thorough evaluation.

The evaluation results are shown in Fig. 4. We observe that our method reaches state-of-the-art performance on all traffic flow ranges. It is worth noting that for each baseline we compare with, we find its optimized hyperparameters for each of the 150 test episodes, while our agent is trained only once and a single agent is used to evaluate on all 150 test episodes. This means that the overall performance of our one trained model surpasses that of the 150 individually optimized models. The performance improvement at about \(1\,000\) and \(5\,000\,\nicefrac {\text {v}}{{\text {h}}}\) is less pronounced, because in light traffic many vehicles do not have to wait in a queue, and in over-saturated traffic, where there is a queue in every incoming lane, the best policy is similar to scheduling the green phases cyclically. The capability of our agent to react to the real-time traffic situation can be fully utilized for the traffic flow ranges in the middle, where the improvement over the Max-pressure controller and the fixed-time controllers can exceed 20 and \(40\%\), respectively. Webster’s method performs worse than the Uniform controller due to the quick changes and short duration of the test episodes, which is rarely the case in real traffic (Fig. 5).

Fig. 5

Performance comparison of our work with the baseline methods on a whole-day real-world traffic demand (mean, standard deviation and median of travel time for Uniform, Webster’s, Max-pressure and ours)

As mentioned, the travel time only indicates how fast the released vehicles drive through. In order to show that our agent also benefits more drivers than the baselines, we present the throughput statistics in Table 1. The percentage values are the ratio of released vehicles to the total number of generated vehicles. With a traffic flow lower than about \(3\,000\,\nicefrac {\text {v}}{{\text {h}}}\), all TSCs can properly release the traffic input. Not 100 percent of the generated vehicles can be released, because the test is stopped directly after one hour, so some vehicles generated near the end do not have enough time to travel through. From about \(3\,000\,\nicefrac {\text {v}}{{\text {h}}}\) on, the throughput of the baselines starts to drop, which means that these TSCs cannot fully release the input traffic flow and traffic jams start to form, while our agent can avoid traffic jams even in much denser traffic. With the increased efficiency, our agent can still guarantee equity, which is shown by the low standard deviation of the vehicle travel times and the high throughput.

1.3.5 Evaluation on Real-World Traffic Demand

To further measure the performance of our agent in more realistic traffic scenarios, we conduct additional tests with a whole-day traffic demand of a real-world intersection of Loerracherstrasse and Wiesentalstrasse located in Freiburg, Germany. This intersection has a different layout than the one in Fig. 1a: each road has one forward+right-turning lane with an additional short lane for a protected left turn. The size of the state therefore changes to 224. We regard the short left-turning lane, the forward+right-turning lane and the lane segment before the branching as separate lanes when constructing the state.

Table 1 Throughput (\(\%\)) of considered methods in Fig. 4

Since the size of the input differs from the experiments above, we train another agent. As we want to test the generalization capabilities of our method, the training traffic demand is sampled in the same way as before (only the maximum traffic flow is reduced to \(\nicefrac {1}{2}\) to reflect the change in the intersection layout). The trained agent is tested on the real-world traffic demand of February 4, 2020, with typical traffic flow peaks at rush hours. The input traffic flow is in the range \([0, 1\,740]\nicefrac {\text {v}}{{\text {h}}}\). We sincerely appreciate the support of the city of Freiburg (www.freiburg.de/verkehr), which provided us with the traffic flow data measured with inductive-loop detectors.

The results of this real-world experiment are shown in Fig. 5. All the TSCs can properly release all vehicles, because the traffic flow is nearly zero at night when the demand episode ends. We observe that our method again outperforms all baseline methods, even though the baselines are optimized on exactly this whole-day demand while our model is trained only on simulated episodes of \(1\,200\,{\text {s}}\) duration. The substantial improvement of nearly \(30\%\) in average travel time is even greater than the performance gain in the simulated evaluations. This validates that our proposed method has great generalization capabilities and can adapt to a wide range of traffic scenarios.

Fig. 6

Our intersection management agent optimizes traffic flow by assigning virtual red traffic lights to connected autonomous vehicles (vehicle 1). Once vehicle 2 is released, the vehicles following it can also proceed through the intersection

2 Courteous Behavior of Automated Vehicles at Unsignalized Intersections

Traffic control signals are not a panacea for intersection problems [42]. For example, they may reduce traffic efficiency for low or unbalanced traffic demand. Although recent works [43, 49] developed more intelligent adaptive traffic signal control methods, for the majority of intersections, which often have only one lane per road and mostly small traffic volume [13], the use of static road signs assigning priority has proven to be more efficient [42]. Ulbrich et al. [41] summarized how humans cooperate with other traffic participants to improve the overall traffic utility. Consider, as an example, the situation shown in Fig. 6. Even though vehicle 1 has higher priority and can proceed through the intersection before vehicle 2, its driver might prefer to yield to vehicle 2 so that the traffic behind vehicle 2 can be released sooner.

Based on the expectation that future traffic will consist of connected autonomous vehicles (CAVs), a large majority of current research excludes human-driven vehicles (HVs) from the development of traffic management approaches. However, it might take decades for the technology, the infrastructure and the users to be ready for traffic with only connected autonomous vehicles [24]. We therefore believe that, for the near future, applicable traffic management solutions must (i) consider various degrees of mixed traffic, (ii) pose no complications or major adjustment requests for human-driven vehicles, and (iii) not present a traffic disturbance or danger when the communication between the connected autonomous vehicles fails.

In this section, we propose a novel centralized method to improve intersection management in mixed traffic. Our approach learns a policy for CAVs that maximizes the overall utility while at the same time showing courteous behavior [27]. We make the following contributions:

  • We present a centralized intersection management method based on deep reinforcement learning that improves traffic performance at unsignalized intersections in mixed traffic scenarios.

  • We introduce return scaling for training in environments with a large imbalance of cumulative rewards at different states. In our case, this helps to balance policy updating of states with different traffic densities, in particular to counteract the large cumulative reward collected in heavy traffic, which would otherwise dominate the stochastic gradient descent process and make the policy unstable for states in sparse traffic.

  • We present a comprehensive performance comparison for various traffic densities and changing rates of CAVs to demonstrate the potential of our approach.

We conduct experimental studies in the traffic simulation environment SUMO [25] and show that our method outperforms the state-of-the-art intersection management method on a wide range of traffic densities with varying traffic distributions. The content of Sect. 2 is based on [48].

2.1 Related Work

Among the first ones to propose an intelligent intersection management system were Dresner and Stone whose reservation-based approach [7, 8] divides the junction with intersecting trajectories into a grid of tiles. Their autonomous intersection management approach, realized as a centralized controller, applies a first-come-first-served (FCFS) strategy to deal with the requests by CAVs for time slots of the tiles along their trajectories. To accommodate HVs they employ traffic lights and the so-called FCFS-light policy [9, 10]. Later, this framework was extended to allow for the centralized intersection management to set the speed profiles of vehicles with cruise control [3]. To improve the performance of FCFS-light, Sharon and Stone introduced hybrid autonomous intersection management [38]. With this extension, requests of CAVs can be approved regardless of the traffic lights if there are no HVs in the intersecting routes.

In general, methods based on autonomous intersection management [7] give CAVs a relative advantage over HVs, which, in our opinion, should be avoided as it might cause the public to reject automated vehicles. Furthermore, human drivers are more sensitive to stopping and waiting than the passengers of CAVs. We therefore suggest that the benefits brought by intersection management and CAVs in general should be shared evenly with human drivers.

Lin et al. developed a method similar to the FCFS-light policy [22]. It reserves the conflicting sections among different routes instead of a grid of tiles. Another first-come-first-served reservation-based method has been proposed by Bento et al. [4]. They suggest controlling both CAVs and HVs via speed profiles sent by the intersection management unit. This places an undesirable burden on human drivers to follow a given speed profile and additionally requires even all HVs to be connected.

The described approaches make the vehicles roughly follow a first-come-first-served strategy to traverse intersections. However, as shown by Meng et al. [29], the performance of an intersection management strategy mainly depends on the passing order of the vehicles and not so much on the individual trajectory planning algorithms. As the computation time grows exponentially with the number of considered vehicles [29], simplifying assumptions are often made, including linear constraints, no overtaking, no lane changing, constant speed, and constant traffic input. Coordinating the passing order can mitigate control uncertainties, which makes it more suitable for mixed traffic. Based on this idea, our work aims at finding better passing orders, while the vehicles drive according to their own trajectory planning models.

Qian et al. [34] assign priorities representing the passing order to vehicles. While CAVs receive the priority from a central control unit and plan trajectories accordingly, the passing order of HVs is regulated by traffic lights. With high rates of HVs, this potentially results in an inefficient, mostly first-come-first-served control. Fayazi et al. [12] propose to formulate the intersection management problem as a mixed-integer linear program. Their controller assigns times of arrivals to a virtual access area around the junction to CAVs, while HVs are regulated by traffic lights.

The approaches of these related works are already outperformed by Webster’s method or fixed-time traffic signal controllers when more than 10–\(20\%\) of the vehicles are driven by humans [7, 9, 12, 22]. The exception is our previous state-of-the-art learning-based adaptive traffic signal controller (Sect. 1), which outperforms these two controllers in any traffic flow range and reduces the average travel time by up to 30–\(60\%\) in the experiment with real-world traffic input. Therefore, we evaluate our proposed method mainly against the method of Sect. 1 over a wide range of dynamic traffic demands and show that a performance gain is obtained even with a small portion of CAVs in the traffic system.

2.2 Methods

As in Sect. 1.2, we model the intersection management task at unsignalized intersections as a Markov decision process and use proximal policy optimization [37] for training due to its stability, good performance and ease of implementation. Our goal is to train a centralized agent for an intersection that timely stops the CAVs on routes with higher priority to let the vehicles on conflicting routes with lower priority pass, so that the performance of the whole system is optimized. Since this is similar to red traffic lights for CAVs on the routes with higher priority, we call our method Courteous Virtual Traffic Signal Control (CVTSC). We evaluate our proposed approach on the most common type of three-way intersection, as illustrated in Fig. 7. By adjusting the state and action representations, our approach can easily be generalized to other intersection layouts, as we show for the real-world intersection in Sect. 2.3.5.

Fig. 7

Common regulation of a right-hand-traffic three-way intersection (a). The high-priority routes are W-E, W-S and E-W; the low-priority routes are S-W and S-E. Route E-S has intersecting routes with both higher and lower priority. The proposed set of actions (b) stops CAVs on routes along the indicated directions

As we focus on an isolated intersection, we assume that the vehicles can drive freely after they have passed the junction and entered the outgoing lanes. Thus the vehicles on the outgoing lanes do not influence the intersection management. However, unlike in Sect. 1, where we only considered vehicles in front of the stop lines, we here also take into account vehicles that have already passed the stop line but have not yet entered the outgoing lanes. This is necessary as, at unsignalized intersections, vehicles very often wait beyond the stop lines, and coordination may happen inside the junction.

2.2.1 Action Space

For the intersection in Fig. 7a we assume that vehicles drive according to the priorities predefined by the road signs, where the diamond indicates priority roads and the triangle indicates yield. Vehicles on the routes with lower priority have to wait until there is enough gap on the conflicting routes with higher priority before passing the junction. Note that in Fig. 7a the route E-S has intersecting routes with higher and lower priority.

To obtain courteous behavior for CAVs on routes with higher priority, without loss of generality, we define a discrete set of four actions {(), (W-E), (W-E, W-S), (W-E, E-W, E-S)} as the action space \(\mathcal {A}\) in relation to Fig. 7b. The indicated directions show the corresponding routes on which the intersection management unit commands CAVs to halt before the respective stop lines to give priority to vehicles waiting on intersecting routes with lower priority. The action restricting no routes uses the default priorities to manage the intersection. We set the duration of each action to 1 s. A categorical policy is learned: during training the actions are sampled according to the output distribution, while during testing the action with the highest probability is always chosen. When a new action \(a_t\) is chosen, CAVs on the routes indicated in \(a_t\) receive stopping commands, while the instruction for the routes restricted by \(a_{t-1}\) is canceled if they are not regulated by \(a_t\). If a CAV receives a stopping command while being too close to the stop line, it continues through the intersection, ignoring the command. Acceleration, collision avoidance and safe distances are managed by the low-level controllers of the individual vehicles (both CAVs and HVs).
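
The following sketch illustrates how an action index could be translated into stop and release commands for the affected routes; the route labels follow Fig. 7, while the message-sending callbacks are hypothetical placeholders for the V2X interface.

```python
# Action space: which high-priority routes are asked to halt before the stop line.
ACTIONS = [
    (),                        # no restriction: default priorities apply
    ("W-E",),
    ("W-E", "W-S"),
    ("W-E", "E-W", "E-S"),
]

def apply_action(prev_action, new_action, send_stop, send_release):
    """Issue stop commands for newly restricted routes and release routes that are
    no longer restricted; send_stop/send_release are hypothetical placeholders for
    the messages sent to CAVs on the given route."""
    prev_routes, new_routes = set(ACTIONS[prev_action]), set(ACTIONS[new_action])
    for route in new_routes - prev_routes:
        send_stop(route)
    for route in prev_routes - new_routes:
        send_release(route)
```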

2.2.2 State Space

Due to the restrictions of sensors and wireless communication, we assume that the intersection management unit can collect information about vehicles that are within a distance of \(150\,{\text {m}}\) along the road measured from the center of the intersection. We assume that every vehicle’s state, composed of continuous values (its position along the road, velocity and time since entering the intersection) and discrete values (a binary value for CAV or HV and optionally a route index indicating the driving direction if the lane contains more than one route), is available to the control unit. Similar to Sect. 1.2.2, the state \(s_t\) of the intersection at time t is given by a vector that contains the structured information of the vehicles in it.

As described in Sect. 2.2.1, only CAVs are controlled by the agent. Every 1 s a new action should be chosen according to the new state. However, at certain points in time there are no CAVs in the intersection and including these states in training hinders the learning process. We therefore remove states without CAVs from the training data. As a result, the influence of actions is not limited to a fixed interval and the duration of one step in the learning process can be any positive integer in seconds. To deal with this variable step length, we employ the method of adaptive discounting as proposed in Sect. 1.2.3.

2.2.3 Reward Function

The common objective of intersection management methods is to improve efficiency while keeping a certain level of fairness for all vehicles. Here, we extend the idea of a reward function with an equity factor. Instead of using \({T_\text {travel}}^\eta \), we propose to use \(\eta _\text {a}\cdot T_\text {travel} + \eta _\text {b}\) as the reward for each released vehicle, where \(\eta \), \(\eta _\text {a}\) and \(\eta _\text {b}\) are equity factors. Due to the flexible step lengths discussed above, the reward of each step, \(r_t\), is calculated by accumulating the discounted rewards generated during step t, which might contain up to k environment steps (of one second each). That is, we accumulate the contributions of the \(N^{\text {TP}}_t\) released vehicles by

$$\begin{aligned} r_t = \sum _{i=0}^{k-1} \gamma ^i \sum _{j=1}^{N^{\text {TP}}_{t\_i}}(\eta _\text {a}\cdot {\tau _j} + \eta _\text {b}), \end{aligned}$$
(5)

where \(N^{\text {TP}}_{t\_i}\) is the throughput of the ith second in step t and \(\tau _j\) is the travel time of the jth released vehicle in the ith second.

The values of \(\eta _\text {a}\) and \(\eta _\text {b}\) are selected as in Sect. 1.2.3 based on two heuristics. First, we favor releasing each vehicle as soon as possible for the purpose of efficiency. The second heuristic aims at equity by considering a traffic situation in which one vehicle waits for saturated traffic flow on an intersecting route. Since efficient traffic flow on the high-priority route should not be achieved at the expense of accumulating too large a waiting time for the single vehicle, we increase the reward contributed by each released vehicle according to its travel time. This linear relation between reward and travel time is more intuitive than the previous exponential formulation. Moreover, the additional free variable in this formulation can be used to scale the rewards of single released vehicles to keep them around unity, which is beneficial for hyperparameter tuning in common deep reinforcement learning setups.
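
A minimal sketch of the step reward of Eq. (5), assuming the travel times of the vehicles released in each second of the step are available (the default equity factor values are those reported in Sect. 2.3.1):

```python
def cvtsc_step_reward(released_per_second, gamma, eta_a=0.0027, eta_b=0.946):
    """Eq. (5): released_per_second is a list of length k; entry i holds the travel
    times tau_j of the vehicles released during the i-th second of the step."""
    return sum(gamma ** i * sum(eta_a * tau + eta_b for tau in travel_times)
               for i, travel_times in enumerate(released_per_second))
```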

2.2.4 Return Scaling

According to the reward definition, the return \(G(s_t)\) is mainly influenced by the throughput and the travel time of released vehicles. Since both of them increase with the traffic input, the scale of \(G(s_t)\) could vary from less than 5 to over 100 if the state of the intersection changes from nearly empty \(s_\text {low}\) to saturated \(s_\text {high}\). Consequently, \(s_\text {high}\) would have a much larger impact on \(\pi _\theta \) and \(V_{\phi }\) during the update phase, making the learning process of a policy for light traffic less stable.

We introduce return scaling to resolve the issues caused by the imbalanced returns of states, which proved critical for convergence with low traffic volumes in our experiments. In order to reduce the difference between \(G(s_\text {low})\) and \(G(s_\text {high})\), we scale the cumulative rewards before the update phase with

$$\begin{aligned} G(s_t) = \rho (s_t)\cdot \sum _{i \ge t} \gamma ^{\sum _{j=t}^{i-1} k_j}\, r_i, \end{aligned}$$
(6)

where \(k_j\) is the number of environment steps (of one second each) in the jth step of the learning process. The scaling factor \(\rho \) is defined as

$$\begin{aligned} \rho (s_t) = (\nicefrac {N^\text {V}_\text {c}}{n^\text {V}})^{\xi }, \end{aligned}$$
(7)

where \(n^\text {V}\) and \(N^\text {V}_\text {c}\) are the current number of vehicles in the intersection and its capacity, respectively, and \(\xi \) is a hyperparameter.
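
Under these definitions, the scaled return of Eq. (6) together with the scaling factor of Eq. (7) could be computed as follows; the guard against an empty intersection and the function names are our own assumptions.

```python
def scaling_factor(n_vehicles, capacity, xi=0.2):
    """Eq. (7): rho(s_t) = (N_c^V / n^V) ** xi, with xi = 0.2 as in Sect. 2.3.1."""
    return (capacity / max(n_vehicles, 1)) ** xi  # guard for an empty intersection

def scaled_return(rewards, step_lengths, n_vehicles, capacity, gamma=0.98, xi=0.2):
    """Eq. (6): discounted return over decision steps of variable length,
    scaled by rho of the state at the start of the trajectory.

    rewards: per-decision-step rewards r_i starting at step t.
    step_lengths: numbers of environment seconds k_i covered by each step.
    """
    g, elapsed = 0.0, 0
    for r, k in zip(rewards, step_lengths):
        g += (gamma ** elapsed) * r  # exponent is the sum of the preceding step lengths
        elapsed += k
    return scaling_factor(n_vehicles, capacity, xi) * g
```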

Fig. 8
Four panels (a–d) over training steps, showing throughput, \(T_\text {travel}\) of not released vehicles, and \(T_\text {travel}\) of released vehicles for the agents a1, a3, a5, a7, a9 and for a5 without return scaling.

Results obtained in evaluation during training for all agents with varying CAV rates in traffic (solid lines) and an ablation study for the usage of the return scaling (dashed black line). The plots show the mean with standard deviation, where the latter is scaled by \(\pm \nicefrac {1}{10}\) for the travel times (for clearer visualization), over three non-tuned random seeds. By the end of each episode there are still some vehicles that have not yet passed the junction. The travel time of such a vehicle is calculated as \(T_\text {episode}-T_\text {spawn}\), where \(T_\text {episode}\) is the episode duration and \(T_\text {spawn}\) is its scheduled spawning time in the simulator

2.3 Experiments

2.3.1 Experimental Setup

We use the open-source traffic simulator SUMO [25] to train and evaluate various intersection management agents. Besides simulated traffic episodes, we also evaluate our approach on real-world rush-hour traffic demand. For all roads we set a speed limit of 50 km/h. We compare our approach CVTSC to two baselines: managing the intersection with road signs (RS) that define static priorities for the routes, and with traffic lights (TL) controlled by a deep reinforcement learning agent as described in Sect. 1. A possible set of green phases for the three-way intersection is shown in Fig. 9.
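
As a rough orientation, a training or evaluation episode could interact with SUMO via its TraCI Python client as sketched below; the configuration file name is a placeholder and the loop only shows the observation side, not the control of CAVs.

```python
import traci  # Python client shipped with SUMO

def run_episode(sumocfg="intersection.sumocfg", duration_s=3600):
    """Step the simulation once per second and read per-vehicle observations."""
    traci.start(["sumo", "-c", sumocfg, "--step-length", "1"])
    try:
        for _ in range(duration_s):
            traci.simulationStep()
            for vid in traci.vehicle.getIDList():
                speed = traci.vehicle.getSpeed(vid)                 # m/s
                lane_position = traci.vehicle.getLanePosition(vid)  # m along the current lane
                # ... assemble the intersection state described in Sect. 2.2 from these values
    finally:
        traci.close()
```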

Fig. 9

Traffic light green phases for the intersection in Fig. 7a

Two fully connected networks \(\theta \) and \(\phi \) are used as the policy and value function estimators. They have an input layer of size 343 and the same structure as in Sect. 1.3.1. A grid search was used to select the hyperparameters. We use a learning rate of \(5\textrm{e}{-6}\) for the Adam optimizer and a weight decay coefficient of 0.001. For the proximal policy optimization (PPO) algorithm, we use 32 actors, a clipping threshold of \(\epsilon = 0.001\) and a discount factor of \(\gamma = 0.98\). For the return scaling factor, we use \(\xi = 0.2\), which we found to be the optimal value in the range (0, 1]. In each learning step, mini-batches of size 100 are used to update the agents over 8 epochs. The number of mini-batches per learning step is, however, variable due to the varying step lengths. The equity factors \(\eta _\text {a}\) and \(\eta _\text {b}\) for the reward calculation are set to 0.0027 and 0.946. The training process of \(150\,\textrm{k}\) steps takes about 40–60 h (depending on the CAV rate) running on four NVIDIA TITAN X GPUs, while CPU computation is not a limiting factor.
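
For reference, the reported hyperparameters can be gathered in a single configuration, as in the hypothetical dictionary below; the key names are ours.

```python
# Hyperparameters reported in Sect. 2.3.1, collected in one (hypothetical) config.
PPO_CONFIG = {
    "input_dim": 343,          # size of the state vector
    "learning_rate": 5e-6,     # Adam optimizer
    "weight_decay": 1e-3,
    "num_actors": 32,
    "clip_epsilon": 0.001,     # PPO clipping threshold
    "gamma": 0.98,             # discount factor
    "xi": 0.2,                 # return scaling exponent, Eq. (7)
    "minibatch_size": 100,
    "epochs_per_update": 8,
    "eta_a": 0.0027,           # equity factors of the reward, Eq. (5)
    "eta_b": 0.946,
    "training_steps": 150_000,
}
```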

2.3.2 Training Setup

Most current related work has been developed and tested with simplified traffic demand, such as a constant traffic input to the intersection. We challenge our approach and train it with more dynamic traffic input ranges to cover as many real traffic scenarios as possible. For the three-way junction in Fig. 7a, the saturation flow rate \(\textsf{F}_s\) of each incoming lane is \(1\,670\,\nicefrac {\text {v}}{{\text {h}}}\), and since it is very rare that two non-conflicting routes are simultaneously saturated, we set the traffic demand range to \(\left[ F_{\min }, F_{\max }\right] =\left[ 0, 3\,000\right] \,\nicefrac {\text {v}}{{\text {h}}}\). The simulated traffic episodes are sampled in the same way as in Sect. 1.3.2. We train five agents (a1, a3, a5, a7, a9) with fixed CAV rates of \(\left[ 10, 30, 50, 70, 90\right] \%\), respectively, reflecting the CAV rates expected in future traffic.

2.3.3 Evaluation During Training

To monitor the learning process, the performance is evaluated for traffic inputs of different ranges (in \(\nicefrac {\text {v}}{{\text {h}}}\)): \(\left[ 0, 1\,000\right] \), \(\left[ 500, 1\,500\right] \), \(\left[ 1\,000, 2\,000\right] \), \(\left[ 1\,500, 2\,500\right] \) and \(\left[ 2\,000, 3\,000\right] \). The generation of traffic demand is analogous to that of the training episodes, except that the total traffic inputs at the beginning \(F_{\text {begin}}\) and end \(F_{\text {end}}\) of an episode are sampled independently in the five given ranges.
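
Since the exact sampling procedure is defined in Sect. 1.3.2 and not repeated here, the sketch below only illustrates one plausible variant: the total input is interpolated linearly between the independently sampled \(F_{\text {begin}}\) and \(F_{\text {end}}\), and vehicles are spawned with per-second Bernoulli arrivals; both choices are assumptions.

```python
import numpy as np

EVAL_RANGES = [(0, 1000), (500, 1500), (1000, 2000), (1500, 2500), (2000, 3000)]  # v/h

def sample_eval_demand(flow_range, duration_s=3600, rng=None):
    """Illustrative demand generator for one evaluation episode (assumption)."""
    rng = rng or np.random.default_rng()
    f_begin, f_end = rng.uniform(flow_range[0], flow_range[1], size=2)
    spawn_times = []
    for t in range(duration_s):
        flow = f_begin + (f_end - f_begin) * t / duration_s   # total input [v/h] at time t
        if rng.random() < flow / 3600.0:                      # expected arrivals in this second
            spawn_times.append(t)
    return spawn_times

demand = sample_eval_demand(EVAL_RANGES[2])  # e.g. the 1 000-2 000 v/h range
```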

The plots in Fig. 8 show the performance of agents trained with different CAV rates and present an ablation study for the usage of the return scaling. The agent a5 w/o rs is trained with a CAV rate of \(50\%\) without using return scaling. We analyze the throughput as the percentage of released vehicles among all spawned vehicles, the travel time of released and not released vehicles at the highest traffic density level and the travel time of released vehicles at the lowest level. The reported travel time is the mean over all released or not released vehicles during three evaluation episodes. We analyze the throughput and travel times instead of the accumulated reward as they give us a better estimate of the overall performance. The variance of the travel times is of particular interest as it is a good indicator of equity: a large variance indicates that some vehicles experience long waiting times at the intersection.

As illustrated in Fig. 8a, b and c, CVTSC with a higher CAV rate achieves higher throughput, clears the intersection more efficiently (lower average \(T_\text {travel}\)) and treats all vehicles more fairly (shown by a lower standard deviation of \(T_\text {travel}\)). As expected, Fig. 8c and d show that the agent without return scaling fails to learn an efficient policy for light traffic, although its performance is similar to that of a5 in heavy traffic. We plan to investigate return scaling further, in particular whether it is applicable to a broader class of problems or can be replaced by other methods such as \(\gamma \)-tuning.

2.3.4 Evaluation on Simulated Traffic Demand

We first test our agents with simulated traffic episodes, each with a duration of one hour. For each of the five traffic demand levels described above, we first create 50 traffic episodes, with the spawning time of each vehicle following the same procedure as for the evaluation during training. Then we generate five sets of mixed traffic episodes with different CAV rates by randomly marking each vehicle as CAV or HV according to the penetration rate. Note that the baseline methods road sign (RS) and traffic light (TL) do not distinguish between CAVs and HVs. Following this setup, we test both baselines and our trained agents with identical numbers of vehicles and identical spawning times. In the following, the five agents are first tested with their corresponding CAV rates to evaluate their performance against the baseline methods. Then we cross-evaluate them on settings with different CAV penetration rates.
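
The construction of the mixed traffic sets could look like the following sketch, in which each spawned vehicle is marked as CAV with probability equal to the penetration rate while the spawn times themselves are shared across all controllers; names and values are illustrative.

```python
import random

def assign_vehicle_types(spawn_times, cav_rate, seed=0):
    """Mark each spawned vehicle as CAV or HV for a given penetration rate."""
    rng = random.Random(seed)
    return [{"spawn_time": t, "is_cav": rng.random() < cav_rate} for t in spawn_times]

# One simulated episode yields five mixed-traffic variants with identical spawn times.
episode_spawn_times = [3, 7, 12, 18, 25]  # placeholder spawn times in seconds
mixed_sets = {rate: assign_vehicle_types(episode_spawn_times, rate)
              for rate in (0.1, 0.3, 0.5, 0.7, 0.9)}
```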

Fig. 10

Performance comparison of our CVTSC with the baselines RS and TL in traffic with different CAV rates. For each controller and each traffic density, the mean (opaque bars) and positive standard deviation (translucent bars) of \(T_\text {travel}\) are calculated over all vehicles (released and not released) of 50 simulated traffic episodes. Each CVTSC agent is trained and evaluated in traffic with its corresponding CAV rate

Table 2 Throughput (\(\%\)) of the considered methods in Fig. 10

Performance of Intersection

The performance is shown in Fig. 10 and Table 2. For all tested traffic density levels, our CVTSC agents improve the performance of the unsignalized intersection: not only are more vehicles released during the same period, but the mean and standard deviation of their travel times are also reduced. The higher the CAV rate, the better our approach performs. The performance gain of CVTSC at the lowest traffic density is small, because almost no vehicles have to stop at the junction. When there is little traffic, employing TL can cause unnecessary stopping due to the transition phase (amber or red lights). In heavier traffic above \(1\,500\,\nicefrac {\text {v}}{{\text {h}}}\), TL outperforms a1 by a small margin. However, it is outperformed by CVTSC when \(30\%\) or more of the vehicles are CAVs.

Fig. 11
Line plot of \(T_\text {travel}\) of released vehicles and throughput for RS and a1–a9, shown separately for the groups Main road HV, Main road CAV and Side road all.

Performance comparison of different vehicle groups at traffic demand 1 000–2 000 \(\nicefrac {\text {v}}{{\text {h}}}\). The plotted travel times show the median, lower quartile and upper quartile over all released vehicles among all evaluated episodes. The plotted throughput is the percentage of released vehicles among all spawned vehicles throughout all episodes

Table 3 Performance comparison of different agents with different traffic input settings. For each agent and each traffic setting, the average \(T_\text {travel}\) is calculated over all vehicles (released and not released) of 50 simulated traffic episodes

Performance of Vehicle Groups

In contrast to the relative advantage of CAVs over HVs suggested by methods based on autonomous intersection management [9], our CVTSC tends to share the performance gain evenly between the two types of vehicles. Figure 11 shows how CVTSC increases the intersection management performance while keeping the balance between different vehicle categories. Since the actions are executed only for CAVs on the main road, we divide vehicles on the main road into Main road CAV and Main road HV and assign all vehicles on the side road to a third group, Side road all. As illustrated, the performance gain over RS is mainly caused by the improvement of the traffic on the side road. With only \(10\%\) CAVs, the throughput of the side road traffic is increased from \(74.3\%\) to \(95.6\%\) and the median travel time is decreased by \(61\%\). As a necessary side effect, the courteous behavior adds about \(13\,{\text {s}}\) to the median travel time of CAVs on the main road and consequently slows down some HVs following these CAVs. However, the median travel time of Main road HV and the throughput of both vehicle groups on the main road are barely affected. With a growing share of CAVs in traffic, the performance of the traffic on the side road continues to improve while the initial disadvantage for the main road is compensated.

Comparison of Agents

To cross-evaluate their performance on traffic settings other than their native ones, we further test each agent (a1 to a9) with each of the five CAV rates on 50 simulated episodes for each of the five traffic densities. Since CVTSC brings nearly no measurable difference at the lowest traffic density, only the results for the other four traffic densities are listed in Table 3.

We observe that all trained CVTSC agents outperform RS in every mixed traffic setting. Furthermore, two significant patterns can be observed in the results. First, for each CAV rate the agents trained with similar rate values are among the best, as expected. Second, as the CAV rate increases, the performance of all agents improves continuously. Interestingly, a5, the agent trained with a CAV rate of \(50\%\), outperforms or performs as well as a7 and a9 even in settings where CAVs are the majority. We suppose this is because a5 is exposed to more diverse traffic situations during training, especially ones with fewer CAVs in the intersection. As shown in Figs. 10 and 11, the margin of the performance gain decreases with increasing CAV rate. Even though a7 and a9 can handle highly automated traffic better than a5, the performance gain is so small that it cannot compensate for the performance loss when occasionally more HVs drive in the intersection.

Fig. 12

Intersection of Tullastrasse and Hans-Bunte-Strasse in Freiburg, Germany

Table 4 Traffic during the morning and afternoon rush hours of October 19, 2017 at the intersection in Fig. 12

2.3.5 Evaluation on Real-World Traffic Demand

To further evaluate CVTSC in more realistic traffic situations, we conduct additional tests with real-world traffic demand recorded at an intersection in Freiburg, Germany, which is sketched in Fig. 12. Unlike the intersection in Fig. 7a, one part of the main road (Tullastrasse) forks before the stop line. After adjusting the state representation and the intersection structure in the simulator, we train two new agents, a3 and a5, and employ them in the test. The traffic demand, listed in Table 4, was manually recorded on October 19, 2017 by the traffic department of Freiburg. The total traffic input was about \(1\,000 {-} 1\,500\,\nicefrac {\text {v}}{{\text {h}}}\), with roughly \(20\%\) on the side road.

Figure 13 shows box plots of the travel times of released vehicles controlled by RS and the CVTSC agents in traffic scenarios with different CAV rates. The agent a3 is employed for \(10\%\) and \(30\%\) automated traffic, while a5 is employed for the other three scenarios. In all scenarios, over \(99.7\%\) of all vehicles traverse the intersection. Our method continuously improves the traffic flow with an increasing rate of CAVs in traffic. We notice that the median of the travel times stays similar across all scenarios, which means that the performance gain comes mainly from the vehicles with long travel times on the side road. The CVTSC agents manage to release them faster without delaying the traffic on the main road.

Fig. 13

Box plot of travel times with different CAV rates over all released vehicles in the simulation based on the real-world intersection of Fig. 12. The whiskers extend \(1.5\cdot \text {IQR}\) (interquartile range) from the upper and lower quartiles

3 Conclusion and Future Work

3.1 Conclusion

In this chapter, we first presented an approach to learning traffic signal controllers using deep reinforcement learning. Our approach extends existing reward functions by a dedicated equity factor. We furthermore proposed a method that utilizes adaptive discounting to comply with the learning principles of deep reinforcement learning agents and to stabilize training. We validated the effectiveness of our approach using simulated and real-world data.

We then presented an approach to improving mixed traffic management at unsignalized intersections using deep reinforcement learning. Our proposed method CVTSC creates courteous behavior for automated vehicles in order to optimize the overall traffic flow at intersections. Furthermore, we introduced return scaling to counteract the imbalance of cumulative rewards across different states and to stabilize training. We validated the effectiveness of CVTSC using simulated and real-world traffic data and showed that CVTSC improves the traffic performance continuously with an increasing percentage of automated vehicles. With more than \(10\%\) automated vehicles, it also outperforms the state-of-the-art adaptive traffic signal controller. Besides the performance gain, our method does not require humans to change their current driving habits. Moreover, it is fault-tolerant: since the method is an add-on to the existing traffic rules, the intersection remains fully functional even if the intersection management unit fails. Besides outperforming state-of-the-art methods, both of our approaches can easily be adapted to different intersection topologies.

3.2 Future Work

There are still interesting open problems in learning cooperative driving behaviors at intersections in mixed traffic. In this section, we discuss two of them and outline possible approaches towards solutions.

Simulation and Real World

Due to the difficulty of real-world experiments in traffic, most of the research in this area is developed in simulators. However, discrepancies between simulators and the real-world traffic system can make it challenging to transfer the learned behaviors from simulation. Important discrepancies include human driver models, behavior models of other traffic participants and the physical models of vehicles. There are three possible solutions to this problem. First, we can utilize Sim2Real methods such as domain randomization to introduce enough variability into the simulator so that the real world appears to the model as just another variation [40]. The second approach is to train the agent via offline reinforcement learning on historical real-world traffic data instead of in simulators. Although online interaction with the environment is missing, there is reason to be optimistic that the agent can still learn high-quality policies [2]. The third solution is to use more naturalistic, data-driven behavior models of traffic participants, including human drivers [32].

Decentralized Control

Both of the presented approaches use a centralized controller, which requires road infrastructure for collecting traffic data and communicating with the CAVs. To enable cooperative trajectory planning without additional infrastructure, decentralized versions of the presented controllers are an interesting direction for future work. Decentralized control for this task poses two main challenges. First, the perception systems of CAVs are limited due to sensor properties or physical occlusion, leading to a partially observable environment for each agent. Second, the number of agents in the environment is constantly changing, which makes it impossible to assign a policy to each agent. Recent advances in multi-agent reinforcement learning [52] and neural networks operating on sets [51] offer promising approaches.