Real-Time Lane Configuration with Coordinated Reinforcement Learning
Abstract
Changing the lane configuration of roads based on traffic patterns is a proven solution for improving traffic throughput. Traditional lane-direction configuration solutions assume traffic patterns are known in advance, and hence are not suitable for real-world applications as they cannot adapt to changing traffic conditions. We propose a dynamic lane configuration solution for improving traffic flow using a two-layer, multi-agent architecture, named Coordinated Learning-based Lane Allocation (CLLA). At the bottom layer, a set of reinforcement learning agents find a suitable configuration of lane-directions around individual road intersections. The lane-direction changes proposed by the reinforcement learning agents are then coordinated by the upper-level agents to reduce the negative impact of the changes on other parts of the road network. CLLA is the first solution that enables city-wide lane configuration while adapting to changing traffic conditions. Our experimental results show that CLLA can reduce the average travel time in congested road networks by 20% compared to an uncoordinated reinforcement learning approach.
Keywords
Reinforcement learning · Spatial databases · Graphs
1 Introduction
The impact of dynamic lane-direction configurations can be shown in the following example (Fig. 1). In Fig. 1a, there are 4 northbound lanes and 4 southbound lanes. Traffic is congested in the northbound lanes. Figure 1b shows the dramatic change of traffic flow after lane-direction changes are applied, where the direction of lanes E, F and G is reversed. The northbound vehicles are distributed into the additional lanes, resulting in a higher average speed of the vehicles. At the same time, the number of southbound lanes is reduced to 1. Due to the low number of southbound vehicles, the average speed of southbound traffic is not affected. The lane-direction change helps improve the overall traffic efficiency in this case. There is no existing approach for applying such lane-direction changes at the network level in real time, which could help improve the traffic efficiency of a whole city. We aim to scale this approach to city-wide areas. The emergence of connected autonomous vehicles (CAVs) [14] can make such large-scale dynamic lane-direction changes a common practice in the future. Compared to human-driven vehicles, CAVs are more capable of responding to a given command in a timely manner [4]. CAVs can also provide detailed traffic telemetry data to a central traffic management system in real time, which is important for dynamic traffic optimization.
In order to optimize the flow of the whole network, one needs to consider the impact of possible lane-direction changes on all the other traffic lanes. In many circumstances, one cannot simply allocate more traffic lanes at a road segment for a specific direction when there is more traffic demand in that direction. This is because a lane-direction change at a road segment can affect not only the flow in both directions at that road segment but also the flow at other road segments. Existing solutions for computing lane-direction configurations [4, 9, 21] do not consider the impact of changes at the network level, due to the assumption that future traffic dynamics are fully known at the beginning of the calculation, which is unrealistic for practical applications. More importantly, the computation time of the existing approaches can be very high, as they aim to find the optimal configurations based on linear programming, and hence they are not suitable for frequent re-computation over large networks.

We formalize a lane-direction optimization problem.

We propose a first-of-its-kind solution, CLLA, for efficient dynamic optimization of lane-directions that uses reinforcement learning to capture dynamic changes in the traffic.

Our experiments with real-world data show that CLLA improves travel time by 20% compared to an uncoordinated RL Agent solution.
2 Related Work
2.1 LearningBased Traffic Optimization
Existing traffic optimization algorithms are commonly based on traffic flow optimization with linear programming [6, 7, 10]. They are suitable for computing optimization solutions when traffic demand and congestion levels are relatively static. When there is a significant change in the network, the solutions normally need to be recomputed from scratch. Due to the high computational complexity of finding an optimal solution, these algorithms are not suitable for highly dynamic traffic environments, nor for applications where real-time information is used as an input.
With the rise of reinforcement learning [16], a new generation of traffic optimization algorithms has emerged [13, 18, 22]. In reinforcement learning, an agent can find the rules to achieve an objective by repeatedly interacting with an environment. The interactive process can be modelled as a finite Markov Decision Process, which requires a set of states S and a set of actions A per state. Given a state s of the environment, the agent takes an action a. As the result of the action, the environment state may change to \(s^\prime \) with a reward r. The agent then decides on the next action in order to maximize the expected future reward. Reinforcement learning-based approaches can suggest the best actions for traffic optimization given a combination of network states, such as the queue sizes at intersections [1, 2]. They have an advantage over linear programming-based approaches: if trained well, they can optimize traffic in a highly dynamic network, so there is no need to retrain the agent when the network changes. For example, Arel et al. show that a multi-agent system can optimize the timing of adaptive traffic lights based on reinforcement learning [1]. In contrast to the existing approaches, our solution uses reinforcement learning for optimizing lane-directions, which has not been considered before.
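The agent-environment loop described above can be sketched with tabular Q-learning [19]. This is only an illustrative sketch: the state encoding, action names, reward and hyperparameters below are placeholder assumptions, not the paper's exact formulation.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters
ACTIONS = ["more_upstream", "more_downstream", "keep"]

q = defaultdict(float)  # maps (state, action) -> estimated return

def choose_action(state):
    # Epsilon-greedy: explore occasionally, otherwise act greedily.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
```

The agent repeatedly calls choose_action on the observed state, applies the action, observes the reward and next state, and calls update.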
A common problem with reinforcement learning is that the state space can grow exponentially when the dimensionality of the state space grows linearly. The fast growth of the state space can make reinforcement learning unsuitable for large-scale deployments. This problem is known as the curse of dimensionality [3]. A common way to mitigate the problem is to use a set of distributed agents that operate at the intersection level. This approach has been used for dynamic traffic signal control [5]. In contrast to the existing work, we use it for dynamic lane-direction configuration.
Coordination of multi-agent reinforcement learning can be achieved through a joint state space or through a coordination graph [8]. Such techniques, however, require the agents to be trained on the targeted network. Since our approach uses an implicit coordination mechanism (Sect. 4.3), once an agent is trained, it can be used in any road network.
2.2 LaneDirection Configurations
Research shows that dynamic lane-direction changes can be an effective way to improve traffic efficiency [20]. However, existing approaches for optimizing lane-directions are based on linear programming [4, 9, 21], which makes them unsuitable for dynamic traffic environments due to their high computational complexity. For example, Chu et al. use linear programming to make lane-allocation plans by considering the schedule of connected autonomous vehicles [4]. Their experiments show that the total travel time can be reduced. However, the computation time grows exponentially when the number of vehicles grows linearly, which can make the approach unsuitable for highly dynamic traffic environments. The high computational costs are also inherent to other approaches [9, 21]. Furthermore, all these approaches assume that the exact traffic demand over the time horizon is known beforehand; this assumption does not hold when traffic demand is stochastic [12]. In contrast, our proposed approach, CLLA, is lightweight and can adapt to highly dynamic situations based on reinforcement learning. The reinforcement learning agents can find effective lane-direction changes for individual road intersections even when traffic demand changes dramatically. To the best of our knowledge, this is the first work on lane-direction allocation that relies on observed real-time traffic information.
3 Problem Definition
Definition 1
Road network graph: A road network graph \(G_t(V,E)\) is a representation of a road network at time t. Each edge \(e \in E\) represents a road segment. Each vertex \(v \in V\) represents a start/end point of a road segment.
Definition 2
Lane configuration: The lane configuration of an edge e, \(lc_e\), is a tuple with two numbers, each of which is the number of lanes in a specific direction on the edge. The sum of the two numbers is always equal to the total number of lanes on the edge.
Definition 3
Dynamic lane configuration: The dynamic lane configuration of an edge e at time t, \(lc_e(t)\), is the lane configuration that is in effect at that time point.
Definition 4
Travel cost: The travel cost of a vehicle i that is present at time t, \(TC_i(t)\), is the length of the period between t and the time when the vehicle reaches its destination.
Definition 5
Total travel cost: The total travel cost of the vehicles present at time t, TTC(t), is the sum of their travel costs. That is, \(TTC(t)= \sum _{i=1}^n TC_i (t)\), where n is the number of vehicles.
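Definitions 4 and 5 can be computed directly once the arrival time of each vehicle is known. The helper below is a minimal sketch with hypothetical example data; the function name and the arrival-time mapping are not from the paper.

```python
# Total travel cost (Definition 5): sum of per-vehicle travel costs
# (Definition 4). arrival_times maps each vehicle present at time t
# to the time it reaches its destination (hypothetical example data).
def total_travel_cost(t, arrival_times):
    return sum(arrival - t for arrival in arrival_times.values())

# Three vehicles present at t=10, arriving at times 25, 30 and 40:
# TTC(10) = 15 + 20 + 30 = 65
print(total_travel_cost(10, {"v1": 25, "v2": 30, "v3": 40}))
```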
PROBLEM STATEMENT. Given a set of vehicles at time t and the road network graph \(G_{t-1}(V,E)\) from time \(t-1\), find the new graph \(G_{t}(V,E)\) by computing the dynamic lane configuration \(lc_e(t)\) for all the edges in E such that the total travel cost TTC(t) is minimized.
4 Coordinated LearningBased Lane Allocation (CLLA)
To solve the optimization problem defined in Sect. 3, we propose the Coordinated Learning-based Lane Allocation (CLLA) solution. CLLA uses a two-layer multi-agent architecture, as shown in Fig. 2. The bottom layer consists of a set of RL Agents that are responsible for optimizing the direction of the lanes connected to specific intersections. The lane-direction changes decided by the RL Agents are aggregated and evaluated by a set of Coordinating Agents at the upper layer, with the aim of resolving conflicts between the RL Agents' decisions.
CLLA operates in the following manner. An RL Agent in the bottom layer observes the local traffic conditions around a specific intersection. The RL Agents make decisions on lane-direction changes independently. Whenever an RL Agent needs to make a lane-direction change, it sends the proposed change to the Coordinating Agents in the upper layer. The RL Agents also send certain traffic information to the upper layer periodically. Based on the received information, the Coordinating Agents evaluate whether a proposed change would be beneficial at the global level. The Coordinating Agents may allow or deny a lane-direction change request. They may also decide to make further changes in addition to the proposed ones. After the evaluation, the Coordinating Agents inform the RL Agents of the changes to be made.
4.1 CLLA Algorithm
4.2 Reinforcement Learning Agent (RL Agent)
States: An RL Agent works with four types of states, as shown below.

The first state represents the current traffic signal phase at an intersection.

The second state represents the queue length of incoming vehicles that are going to pass the intersection without turning.

The third state represents the queue length of incoming vehicles that are going to turn at the intersection.

The fourth state represents the queue length of outgoing vehicles, i.e., the vehicles that have passed the intersection.
Although it is possible to add other types of states, we find that the combination of these four states works well because it provides: (i) information about both incoming and outgoing traffic, (ii) information about which road the waiting vehicles are moving from and to, and (iii) the current traffic signal information.
Actions: We denote the two directions of a road segment as upstream and downstream. There are three possible actions: increasing the number of upstream lanes by 1, increasing the number of downstream lanes by 1, or keeping the current configuration. When the number of lanes in one direction is increased, the number of lanes in the opposite direction is decreased at the same time. Since an RL Agent controls a specific road intersection, it determines the action for each individual road segment connected to that intersection.
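The three actions above preserve the invariant of Definition 2: the total number of lanes on an edge never changes. A minimal sketch of applying an action, assuming (as in the Fig. 1 example) that each direction keeps at least one lane; the function and action names are illustrative, not from the paper's implementation:

```python
# Apply a lane-direction action to one road segment while keeping the
# invariant of Definition 2: upstream + downstream lane count is constant.
def apply_action(config, action):
    up, down = config  # (upstream lanes, downstream lanes)
    if action == "more_upstream" and down > 1:
        return (up + 1, down - 1)
    if action == "more_downstream" and up > 1:
        return (up - 1, down + 1)
    return (up, down)  # "keep", or a change that would empty one direction
```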
We introduce an action restriction mechanism in the RL Agents. Changing the lane-direction of a road segment takes time, as the existing vehicles on that road segment should move out before the lane-direction is reversed. Therefore, it takes an even longer time to recover from an incorrect lane-direction decision taken by an RL Agent while learning. In order to stabilize the learning, an RL Agent is allowed to take a lane-changing action only when there is a considerable difference between the upstream and downstream traffic. This restriction also provides a way to resolve conflicting actions between neighboring RL Agents. When two RL Agents connected to the same road segment want to increase the number of lanes in different directions, priority is given to the action that allocates more lanes to the direction with the higher traffic volume.
4.3 Coordinating Agent
Given a locally optimized lane-direction change, the Coordinating Agents check whether the change can help improve traffic efficiency in the surrounding areas, based on the predicted traffic demand and the current traffic conditions. If a proposed change is beneficial, it can be actioned. Otherwise, it is not allowed by CLLA.
Due to the dynamic nature of traffic, the Coordinating Agents may not need to consider the full path of vehicles when evaluating the proposed changes based on the predicted traffic demand. This is because the routes of vehicles may change dynamically in real time, especially in the era of connected autonomous vehicles, when traffic optimization can be performed frequently. Instead of collecting the full path of vehicles, the Coordinating Agents can collect the path within a lookup distance. For example, assuming the lookup distance is 200 m, the Coordinating Agents only need to know the road segments that the vehicles will pass within the next 200 m from their current locations.
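Truncating a route to the lookup distance is straightforward; the sketch below assumes a route is given as (segment id, length) pairs, which is an illustrative representation rather than the paper's data model.

```python
# Keep only the road segments a vehicle will enter within the next
# `lookup` metres of its current position, as described above.
def path_within_lookup(segments, lookup=200):
    # segments: list of (segment_id, length_in_metres) along the route
    kept, travelled = [], 0.0
    for seg_id, length in segments:
        if travelled >= lookup:
            break  # this segment starts beyond the lookup distance
        kept.append(seg_id)
        travelled += length
    return kept
```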
When there is no conflict between a proposed lane-direction change and the predicted traffic demand, CLLA evaluates the benefit of the proposed change based on the current traffic conditions. Our implementation considers one specific type of traffic condition, the current queue length at road junctions. If a lane-direction change would lead to a lower traffic speed on a road segment that has a longer queue than the road segment in the opposite direction, the lane-direction change is not allowed. This is because a lower traffic speed can lead to an even longer queue, which can decrease traffic efficiency.
The coordination of lane-direction changes is performed at a certain interval. The time between two coordinating operations is the assignment interval, within which the proposed lane-direction changes are actioned, and the predicted traffic demand and the current traffic conditions are aggregated at the Coordinating Agents.
Global Impact Evaluation Algorithm: The Coordinating Agents use the Global Impact Evaluation algorithm (Algorithm 2) to quantify the conflicts between lane-direction changes. The algorithm takes the lane-direction changes proposed by the RL Agents as an input (LLC). The input consists of the road and the lane-direction change (lc) proposed by each RL Agent. First, the algorithm finds the neighboring road segments affected by all the changes proposed by the RL Agents (Line 3). For each neighboring road segment, the algorithm finds the predicted traffic flow caused by the proposed lane-direction changes (Line 5). Then the algorithm adds the affected neighboring road segments to a queue (Line 7).
In the next step, the algorithm visits each road segment in the queue and determines the appropriate lane-direction configuration (\(lc_{r_{new}}(t)\)) and the conflicts, i.e., the cases where a road segment cannot accommodate the predicted traffic flow (Lines 9–13). If a lane-direction change needs to be made for a road segment \(r_{new}\), the road segment is added to the coordinated lane changes (CLC) (Line 11). If there is a conflict at road segment \(r_{new}\), the corresponding lane-direction change proposed by the RL Agents is marked as a conflict (Line 13).
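Based on the textual description above, the overall shape of Algorithm 2 can be sketched as follows. This is a simplified reading, not the paper's Algorithm 2: the neighbor, flow-prediction and capacity-check callables are illustrative stand-ins for the paper's data structures.

```python
from collections import deque

def global_impact_evaluation(proposed_changes, neighbors, predicted_flow, fits):
    # proposed_changes: {road: lane_config}, as sent by the RL Agents (LLC)
    # neighbors(road): neighboring road segments affected by a change
    # predicted_flow(road, changes): flow implied by the proposals
    # fits(road, flow): whether the road can accommodate that flow
    clc, conflicts = {}, set()
    queue = deque()
    for road in proposed_changes:           # find affected neighbors
        for nb in neighbors(road):
            queue.append((road, nb))        # remember the originating proposal
    while queue:                            # visit each queued segment
        origin, road = queue.popleft()
        flow = predicted_flow(road, proposed_changes)
        if fits(road, flow):
            clc[road] = flow                # coordinated change for this segment
        else:
            conflicts.add(origin)           # mark the proposal causing a conflict
    return clc, conflicts
```

With one RL Agent request and a bounded number of neighbors per junction, each request triggers a constant amount of work, matching the O(m) complexity argument that follows.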
Complexity of the Coordinating Process. Let m denote the number of requests from the RL Agents. The complexity of visiting the relevant road segments is \(\mathcal {O}(m \times neb)\), where neb is the number of neighboring road segments that connect to a road segment at a road junction. Since the number of road segments connecting to the same junction is normally small, neb can be seen as a constant for a given lookup distance (\(l_{up}\)). Hence the algorithm complexity can be simplified to \(\mathcal {O}(m)\). In the worst case, there is a lane-change request for each road segment of G(V, E), leading to a complexity of \(\mathcal {O}(|E|)\).
Distributed Version. Since the execution of the Global Impact Evaluation algorithm is independent of the order of the requests coming from the RL Agents, the requests can be processed in a distributed manner using multiple Coordinating Agents. Every Coordinating Agent traverses its first-depth neighbors and informs the other Coordinating Agents of the changes. In such a setting, the complexity of the algorithm is \(\mathcal {O}(1)\) with \(|E|\) Coordinating Agents. In this work, we implemented the centralized version (with one Coordinating Agent); however, when applied to very large road networks, the distributed version can be used.
5 Experimental Methodology
We compare the proposed algorithm, CLLA, against three baseline algorithms using traffic simulations. We evaluate the performance of the algorithms using synthetic traffic data and real traffic data. We use SMARTS (Scalable Microscopic Adaptive Road Traffic Simulator) [15], a microscopic simulator capable of changing the travelling directions of lanes, for our experiments.
Datasets. The real traffic data contains the taxi trip records from New York City^{1}. The data includes the source, the destination and the start time of the taxi trips in the city. We pick an area of Manhattan for simulation (Fig. 4) because the area contains a larger number of taxi trip records than other areas. The road network of the simulation area is loaded from OpenStreetMap^{2}. For a specific taxi trip, the source and the destination are mapped to the nearest OpenStreetMap nodes, and the shortest path between them is calculated. The simulated vehicles follow the shortest paths generated from the taxi trip data.
We also use a synthetic 7 \(\times \) 7 grid network to evaluate how our algorithm performs in specific traffic conditions.
We simulate four traffic patterns with the synthetic road network. A traffic pattern refers to a scheme for generating vehicles that follow a specific path between a source node and a destination node in the road network.

Rush hour traffic (RH): In this setup, traffic is generated so that traffic demand is directionally imbalanced to represent rush hour traffic patterns.

Bottleneck traffic (BN): This setup generates a high volume of traffic at the centre of the grid network. This type of traffic pattern creates bottleneck links at the centre of the network.

Mixed traffic (MX): Mixed traffic contains both Rush hour traffic and Bottleneck traffic conditions in the same network.

Random traffic (RD): Traffic is generated randomly during regular time intervals. Demand changes over time intervals.

We compare CLLA against the following baseline solutions.

No Lane-direction Allocations (noLA): This solution does not make any lane-direction changes. The traffic is controlled by static traffic signals only.

Demand-based Lane Allocations (DLA): This solution assumes that full knowledge of the estimated traffic demand and the associated paths is available at a given time step. DLA computes the traffic flow for every edge in both directions by projecting the traffic demand onto each associated path. It then allocates more lanes to a specific direction when the average traffic demand per lane in that direction is higher than in the opposite direction. As with CLLA, DLA configures lane-directions at a certain interval, \(t_a\), called the assignment interval.

Local Lane-direction Allocations (LLA): This solution uses multiple learning agents to decide on lane-direction changes. The optimization is performed using the approach described in Sect. 4.2. LLA is similar to CLLA, but there is no coordination between the agents.
5.1 Evaluation Metrics
We measure the performance of the solutions based on the following metrics.
5.2 Parameter Settings
Parameter settings

Parameter                                    Range    Default value
Lookup distance in CLLA                      1–7      5
Assignment interval in CLLA/DLA (minutes)    0.5–3    1
6 Experimental Results
6.1 Comparative Tests
Table 2. Performance of the baselines evaluated using the four traffic patterns of the synthetic grid network. RH, BN, MX and RD refer to the four synthetic traffic patterns
Baseline  Travel time (s)  % of Vehicles with DFFT>6  

RH  BN  MX  RD  RH  BN  MX  RD  
noLA  681.08  427.16  506.28  539.89  49.0  4.8  27.7  4.85 
LLA  575.59  540.62  561.11  577.6  32.3  24.35  30.5  8.41 
DLA  568.02  504.70  493.13  636.51  30.2  16.5  15.5  20.0 
CLLA  568.01  428.28  449.26  523.42  32.4  5.7  14.3  3.67 
Table 3. Performance of the baselines evaluated using the New York taxi data
Baseline  Travel time (s)  % of Vehicles with DFFT > 6 

noLA  604.32  45.9 
LLA  585.83  48.6 
DLA  496.12  50.7 
CLLA  471.28  45.87 
Deviation from Free-Flow Travel Time (DFFT): Tables 2 and 3 show the percentage of vehicles whose travel time is 6 times or more their free-flow travel time. The results show that CLLA achieves a lower deviation from the free-flow travel time than DLA and LLA.
6.2 Sensitivity Analysis
Figure 6 shows the average execution time of the Global Impact Evaluation algorithm for one iteration as the network size grows. For this test, we built synthetic grid-based road networks. In networks with 9 to 25 nodes, the number of road links on vehicle paths is usually less than the default lookup distance (5). When the number of nodes in a road network is 49, 81 or 144, the number of road links on vehicle paths can be higher than the lookup distance. This is the reason for the increase in execution time when the number of nodes grows from 25 to 49. When the number of nodes is higher than 49, the execution time is nearly constant, showing that the computation cost does not increase with the network size when the lookup distance is fixed.
7 Conclusion
We have shown that effective traffic optimization can be achieved with dynamic lane-direction configurations. Our proposed hierarchical multi-agent solution, CLLA, can help reduce travel time by combining machine learning with the global coordination of lane-direction changes. The proposed solution adapts to significant changes of traffic demand in a timely manner, making it a viable choice for realizing the potential of connected autonomous vehicles in traffic optimization. Compared to state-of-the-art solutions based on lane-direction configuration, CLLA runs more efficiently and is scalable to large networks.
An interesting extension would be to incorporate dynamic traffic signals into the optimization process. It would also be interesting to develop solutions that dynamically change vehicle routes in addition to the lane-directions; jointly optimizing route allocation and lane-directions could improve traffic further. The dynamic change of road speed limits could also be included in an extension to CLLA.
References
1. Arel, I., Liu, C., Urbanik, T., Kohls, A.: Reinforcement learning-based multiagent system for network traffic signal control. IET Intel. Transport Syst. 4(2), 128–135 (2010)
2. Aslani, M., Seipel, S., Mesgari, M.S., Wiering, M.: Traffic signal optimization through discrete and continuous reinforcement learning with robustness analysis in downtown Tehran. Adv. Eng. Inform. 38, 639–655 (2018)
3. Boutilier, C.: Planning, learning and coordination in multiagent decision processes. In: Theoretical Aspects of Rationality and Knowledge, pp. 195–210 (1996)
4. Chu, K.F., Lam, A.Y.S., Li, V.O.K.: Dynamic lane reversal routing and scheduling for connected autonomous vehicles. In: International Smart Cities Conference, pp. 1–6 (2017)
5. El-Tantawy, S., Abdulhai, B.: Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers. ITSC 14(3), 1140–1150 (2012)
6. Fleischer, L., Skutella, M.: Quickest flows over time. SIAM J. Comput. 36(6), 1600–1630 (2007)
7. Ford, L.R., Fulkerson, D.R.: Constructing maximal dynamic flows from static flows. Oper. Res. 6(3), 419–433 (1958)
8. Guestrin, C., Lagoudakis, M.G., Parr, R.: Coordinated reinforcement learning. In: International Conference on Machine Learning, pp. 227–234 (2002)
9. Hausknecht, M., Au, T., Stone, P., Fajardo, D., Waller, T.: Dynamic lane reversal in traffic management. In: ITSC, pp. 1929–1934 (2011)
10. Köhler, E., Möhring, R.H., Skutella, M.: Traffic networks and flows over time, pp. 166–196 (2009)
11. Lambert, L., Wolshon, B.: Characterization and comparison of traffic flow on reversible roadways. J. Adv. Transp. 44(2), 113–122 (2010)
12. Levin, M.W., Boyles, S.D.: A cell transmission model for dynamic lane reversal with autonomous vehicles. Transp. Res. Part C: Emerg. Technol. 68, 126–143 (2016)
13. Mannion, P., Duggan, J., Howley, E.: An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: McCluskey, T.L., Kotsialos, A., Müller, J.P., Klügl, F., Rana, O., Schumann, R. (eds.) Autonomic Road Transport Support Systems. AS, pp. 47–66. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-25808-9_4
14. Narla, S.R.: The evolution of connected vehicle technology: from smart drivers to smart cars to... self-driving cars. ITE J. 83, 22–26 (2013)
15. Ramamohanarao, K., et al.: SMARTS: scalable microscopic adaptive road traffic simulator. ACM Trans. Intell. Syst. Technol. 8(2), 1–22 (2016)
16. Ravishankar, N.R., Vijayakumar, M.V.: Reinforcement learning algorithms: survey and classification. Indian J. Sci. Technol. 10(1), 1–8 (2017)
17. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, vol. 135, 1st edn. MIT Press, Cambridge (1998)
18. Walraven, E., Spaan, M.T., Bakker, B.: Traffic flow optimization: a reinforcement learning approach. Eng. Appl. Artif. Intell. 52, 203–212 (2016)
19. Watkins, C.J., Dayan, P.: Technical note: Q-learning. Mach. Learn. 8, 279–292 (1992)
20. Wolshon, B., Lambert, L.: Planning and operational practices for reversible roadways. ITE J. 76, 38–43 (2006)
21. Wu, J.J., Sun, H.J., Gao, Z.Y., Zhang, H.Z.: Reversible lane-based traffic network optimization with an advanced traveller information system. Eng. Optim. 41(1), 87–97 (2009)
22. Yau, K.L.A., Qadir, J., Khoo, H.L., Ling, M.H., Komisarczuk, P.: A survey on reinforcement learning models and algorithms for traffic signal control. ACM Comput. Surv. 50(3), 1–38 (2017)