1 Introduction

At many intersections, traffic lights are used to control traffic flow. Their phase pattern and timing can be static or dynamic. In the static model, the duration and order of the three phases (green, yellow, and red) are predetermined, whereas in the dynamic model the phases are allocated based on an assessment of the environment and the traffic intensity so that congestion is reduced. In classical approaches, even a slight change in environmental conditions requires a fundamental change in the structure of the control model. In other words, a dedicated fuzzy controller must be designed for every intersection where it is to be used, so the approach does not scale. Since traffic control is a real-time problem, the controller should base its decisions on real-time information about the number of vehicles in the environment. Vehicles at an intersection can be counted by road sensors, but their maintenance is costly. Among dynamic approaches, methods that use artificial intelligence to make decisions at intersections are simpler and more efficient.

Motivated by the above discussion, this research aims to design and simulate an efficient real-time method based on deep Q-learning, which combines deep neural networks with Q-learning. Q-learning is a type of model-free reinforcement learning that assigns a value, known as the Q-value, to each action. Based on the current state of the environment, the traffic light, acting as the agent, decides which lane should receive the green phase, i.e., which action to choose according to the Q-values. These values are estimated by a deep neural network, which enables better decision-making in complex environments such as traffic. A reward is then received to assess whether the agent's decision was useful and led to better control at the intersection. The traffic light should perform in such a way that the queue time (the time a car spends in the queue) is reduced compared with a fixed-time traffic light. Low, medium, and heavy traffic scenarios are simulated in Simulation of Urban Mobility (SUMO).

The main contributions of this paper are as follows:

  • Firstly, an individual intersection is studied. A four-way intersection is managed by a single agent, and the results of this method are compared with a static traffic light.

  • Secondly, two connected intersections are considered, each with its own agent. The cases where the agents are aware and unaware of each other's actions are compared.

The main structure of this work is shown in Fig. 1.

Fig. 1 General structure of the work

The remainder of this paper is organized as follows: the next section presents an overview of the literature. Section 3 reviews reinforcement learning. In Sect. 4, a single four-way intersection and its results are discussed. In Sect. 5, two connected intersections are studied and the results are provided. Finally, conclusions are drawn in Sect. 6.

2 Related work

Research has focused on controlling traffic congestion with different approaches. The first problem is gathering information about intersection traffic. Classical methods use road sensors to count the number of vehicles; their fundamental drawbacks are the cost of installing and maintaining the sensors. Acquiring information with cameras is another procedure, in which images are analyzed to classify whether traffic is heavy. The gradient magnitude and frame subtraction method [1] and the color-to-grayscale algorithm with a cascade classifier [2] are examples of image processing methods for car detection. As traffic control is a real-time problem, the solution must observe the environment and make decisions based on real-time information. The use of fuzzy logic for intersection traffic control has been discussed and led to a 10–21% decrease in delay time [3]. The main disadvantage of this method is its dependence on expert knowledge. In addition, it is specific to each intersection: the general structure cannot be transferred to other situations, so generalizability is limited. As a result, artificial intelligence methods are increasingly replacing these approaches.

In traffic forecasting, machine learning techniques can be used. In [4], deep autoencoders, deep belief networks, random forests, and Long Short-Term Memory (LSTM) networks were compared in terms of accuracy, precision, recall, and error in traffic prediction. Among the different techniques, LSTM achieved the best accuracy, at 95 percent.

Traffic control studies based on reinforcement learning have drawn attention in recent years. The main difference between these studies lies in how the state and the reward are defined. Among 160 articles reviewed by Nain et al. [5], the state was defined by queue size in 38% of the articles, by phase state in 11%, by the number of vehicles in 10%, by vehicle position in 6%, and by speed in 6%. As the reward, 30% of the articles used queue size, 13% delay, 9% waiting time, 6% the number of vehicles, and 4% throughput.

In order to optimize traffic signals, a methodology was proposed by Mao and Hai. Their paper aimed to reduce the effect of an accident on the total travel time (TTT) by combining a genetic algorithm with machine learning. The results showed a decrease in TTT of approximately 50% compared with the original genetic algorithm [6].

A study by Ge et al. [7] proposed a multi-agent deep Q-network to control a group of intersections. Each agent searches for the optimal strategy to control one intersection with a deep Q-network. Experience replay is used to increase the accuracy of the Q-values; this technique is associated with DQN and aims to reduce the time correlation in the training data [8–10].

The impacts of different reward definitions and state representations on the average waiting time per car were discussed by Razack et al. [8]. Using waiting time as the reward function with 80 states reduced the waiting time per car by 26 percent.

El-Tantawy and Abdullahi [9] used multi-agent reinforcement learning to control the traffic lights of five intersections. The agents were employed in two ways. In the first design, the agents were independent. In the second scheme, each agent coordinated its control measures with neighboring agents through the multi-agent reinforcement learning algorithm to minimize the total vehicle delay in the traffic network. The independent scheme performed 2–7% better than the integrated one.

Among the machine learning-based methods for traffic control, the linear regression model is common. This method is parametric with few parameters, yet it is difficult to apply here: its accuracy is low, it cannot handle the nonlinearity of traffic control, and it is not robust. Neural networks are considered non-parametric models. Compared with recurrent neural networks (RNN) and convolutional neural networks (CNN), the feed-forward neural network (FFNN) is simpler but less robust [10, 12].

Reviewing the state of the art reveals the need for an efficient reinforcement learning approach to traffic control; the challenge is how to tackle the reported setbacks and bring the simulation closer to real conditions. This article therefore aims to provide a suitable method for both a single intersection and a group of intersections.

3 Reinforcement learning

In reinforcement learning, an agent interacts with the environment to learn how to act in it, without prior knowledge, so as to maximize a defined reward function (or minimize a penalty). The best-known reinforcement learning technique is Q-learning (QL), which uses a trial-and-error approach to explore a complex, stochastic environment and selects the best action based on experience. Q-learning is model-free and assigns a value, called the Q-value, to each action taken in the environment; in effect, it determines the best action for each state. The definitions of the state representation, reward function, and action set play a crucial role in QL. If the agent's action contributes to the system goal, the agent receives a reward; otherwise, it incurs a penalty. The state represents the agent's current environment. An action is applied in state st, the environment moves to the next state st+1, and the agent seeks the maximum value obtainable in that next state. The Q-function receives a state and an action and returns the corresponding Q-value, which is updated using the observed reward as in (1).

$$Q\left(s_{t}, a_{t}\right)\leftarrow Q\left(s_{t}, a_{t}\right)+\eta \left( r_{t+1}+\gamma \max_{a_{t+1}} Q\left(s_{t+1}, a_{t+1}\right)-Q\left(s_{t}, a_{t}\right)\right)$$
(1)

where Q(st, at) is the value of action at in state st, and \(\eta\) is the learning rate, which lies between 0 and 1. As \(\eta\) approaches 0 the Q-value is barely updated, so almost nothing is learned; a high value means learning happens quickly. The max operator selects the most valuable action at+1 in state st+1. γ is the discount factor, which also lies between 0 and 1 and determines the importance of future rewards compared with the immediate reward [10–12].
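As an illustration, a minimal sketch of the tabular update in (1) is given below, assuming a small discrete state and action space stored as a NumPy table; the state and action indices and the parameter values are placeholders, not values from this work.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, eta=0.1, gamma=0.95):
    """One Q-learning step, Eq. (1): Q(s,a) += eta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
    return Q

# Toy example: 5 states, 4 actions, one update after observing reward -1
Q = np.zeros((5, 4))
Q = q_update(Q, s=0, a=2, r=-1.0, s_next=1)
```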

In reinforcement learning, the policy followed during training is important. In this research, the ε-greedy method is used:

$${\varepsilon }_{h}=1-\frac{h}{H}$$
(2)

where εh is the probability of an exploratory action in the current episode h, out of H episodes in total; consequently, the probability of an exploitative action is 1 − εh. As the number of episodes increases, εh gradually approaches zero and exploitation is chosen more often.
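A short sketch of this policy is shown below, assuming the Q-values of the current state are already available as a list; the function names are illustrative only.

```python
import random

def epsilon(h, H):
    """Exploration probability for episode h out of H episodes, Eq. (2)."""
    return 1.0 - h / H

def epsilon_greedy(q_values, h, H):
    """Explore with probability eps_h, otherwise pick the action with the largest Q-value."""
    if random.random() < epsilon(h, H):
        return random.randrange(len(q_values))                    # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation
```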

3.1 DQN

Since the state representation is large, it is impractical to compute the Q-values from a Q-table. Therefore, DQN is used, combining Q-learning with a deep neural network [11–13]. In this method, the purpose is to approximate the Q-function with a deep neural network: the inputs of the network are the state representation and the outputs are the Q-values (Fig. 2). To select the number of neurons, there are three common rules of thumb: 1. the number of neurons can be between the size of the input layer and the size of the output layer; 2. the number of neurons can be the sum of the input layer size and the output layer size; 3. the number of neurons can be less than twice the size of the input layer [11, 13]. Two neural network models with four hidden layers are considered for one intersection. In the first model each hidden layer consists of 320 neurons, and in the second model of 480 neurons. For the two connected intersections, the first model is used, as it performs better in heavy traffic and reduces the queue time more (Fig. 3).
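A possible realization of the first network model (320 state inputs, four hidden layers of 320 neurons, and one output per action) is sketched below. The framework (PyTorch) and the ReLU activations are assumptions, since the paper does not specify them.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Four hidden layers of n_hidden neurons; outputs one Q-value per action."""
    def __init__(self, n_states=320, n_hidden=320, n_actions=4):
        super().__init__()
        layers, size_in = [], n_states
        for _ in range(4):                                  # four hidden layers
            layers += [nn.Linear(size_in, n_hidden), nn.ReLU()]
            size_in = n_hidden
        layers.append(nn.Linear(size_in, n_actions))        # linear Q-value outputs
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)                              # shape: (batch, n_actions)

# Second model: QNetwork(n_hidden=480)
```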

Fig. 2 Neural network structure used to determine the Q-values

Fig. 3 Scheme of the single intersection

It is worth mentioning that experience replay is used during the training phase to improve the performance of the agent. In this method, the information that the agent gathers during the simulation is not used for training immediately. Instead, it is submitted in randomly drawn samples, called a batch, taken directly from a memory that stores every sample collected during the training phase.

A sample m is defined as follows.

$$m = \left\{ s_{t}, a_{t}, r_{t+1}, s_{t+1} \right\}$$
(3)

Figure 4 shows that the agent performs action at in state st and observes the new state st+1 while receiving the reward rt+1. The sample m = (st, at, rt+1, st+1) is stored in memory. A batch is then randomly selected from the memory buffer, and the Q-network is trained on this subset of samples. Subsequently, the neural network predicts the Q-values, which determine the next action at+1.
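A minimal replay-memory sketch consistent with this procedure is given below; the buffer capacity and batch size are assumptions, as the paper does not report them.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores samples m = (s_t, a_t, r_{t+1}, s_{t+1}) and returns random batches,
    which breaks the time correlation between consecutive samples."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=100):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```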

Fig. 4 Schematic of gathering the information

4 Single intersection

In this part of the study, the improvement of traffic flow at a single intersection is investigated. An agent is introduced to maximize a chosen measure of traffic efficiency: it receives the state of the environment and chooses the phase of the traffic light from a fixed set of actions so as to optimize the reward. After a number of simulation steps, the agent begins its next operation by collecting information about the current state of the environment, and it calculates the reward of the previously selected action from measures of the current traffic situation. A random batch of samples, containing information about the recent stages of the simulation, is drawn from memory and used for training. The agent is then ready to select a new action based on the current state of the environment, and the simulation continues until the next episode.
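This cycle can be summarized in the following sketch. The agent object (with select_action and train methods) and the helpers get_state(), apply_action(), simulate_steps(), and compute_reward() are hypothetical stand-ins for the SUMO/TraCI interactions and are not defined in the paper.

```python
def run_episode(agent, memory, n_steps=5400):
    """One training episode: observe, act, store the sample, and train on a random batch."""
    state = get_state()                       # hypothetical: read the 320-element state
    for _ in range(n_steps):
        action = agent.select_action(state)   # epsilon-greedy choice of green phase
        apply_action(action)                  # hypothetical: set the traffic-light phase in SUMO
        simulate_steps()                      # hypothetical: advance the simulation
        reward = compute_reward()             # Eq. (9): atqt_{t-1} - atqt_t
        next_state = get_state()
        memory.push(state, action, reward, next_state)
        agent.train(memory.sample())          # fit the Q-network on a random batch
        state = next_state
```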

The environment is a four-way intersection. Each approach is 750 m long and has four lanes. The three right-most lanes are used to go straight or turn right, while the left-most lane is used only for left turns and has its own traffic light. Eight traffic lights are considered at this intersection, each cycling through the green, yellow, and red phases.

4.1 Traffic generation

The Weibull distribution is used to generate three scenarios, with 4000, 2000, and 600 cars in the heavy, medium, and low scenarios, respectively. The simulation is run in Simulation of Urban Mobility (SUMO), which is not only open source but also easy to use for simulating and analyzing traffic congestion. In SUMO, 100 episodes are executed, each with 5400 steps; in total, the 100 episodes correspond to 150 h of real-world time.
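A sketch of this traffic generation is given below: departure times are drawn from a Weibull distribution and mapped onto the 5400 simulation steps. The shape parameter and the normalization are assumptions; the paper only fixes the distribution family and the number of cars per scenario.

```python
import numpy as np

def generate_departures(n_cars, n_steps=5400, shape=2.0, seed=None):
    """Draw n_cars departure steps from a Weibull distribution over one episode."""
    rng = np.random.default_rng(seed)
    samples = np.sort(rng.weibull(shape, n_cars))
    samples = (samples - samples.min()) / (samples.max() - samples.min())  # scale to [0, 1]
    return np.rint(samples * (n_steps - 1)).astype(int)

heavy = generate_departures(4000)    # heavy scenario
medium = generate_departures(2000)   # medium scenario
low = generate_departures(600)       # low scenario
```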

4.2 State representation

A frequently used approach for state representation is a feature-based value vector, in which the average or total value of specific quantities is used [14]. To obtain the states, each intersection approach is discretized into cells. The length of each cell depends on its distance from the intersection: cells close to the crossing are smaller, which means the nearby information is weighted more heavily than information farther away. The left lane is considered separately. In total there are 80 cells; Ck denotes the k-th cell (k being its position). Each Ck contains four data sets: the total number of vehicles, the average speed, the cumulative queue time of the vehicles in the cell, and the number of queued vehicles. With 80 cells and four quantities per cell, the state vector has 320 elements, which form the input of the neural network.

$$\vec{S} = \left( \vec{T}, \overrightarrow{AS}, \overrightarrow{Aqt}, \overrightarrow{NQ} \right)$$
(4)

where \(\overrightarrow{T}\) is the total number of cars in cell Ck, \(\overrightarrow{AS}\) is their average speed, \(\overrightarrow{Aqt}\) is their accumulated waiting time, and \(\overrightarrow{NQ}\) is the number of queued cars in Ck.
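The assembly of the 320-element state vector of (4) can be sketched as follows; cell_features(k) is a hypothetical function returning the four measurements of cell Ck, standing in for the SUMO queries.

```python
import numpy as np

N_CELLS = 80  # 80 discretized cells around the intersection

def build_state(cell_features):
    """Concatenate the four per-cell quantities into the 320-element state vector of Eq. (4)."""
    features = np.array([cell_features(k) for k in range(N_CELLS)])  # shape (80, 4)
    T, AS, Aqt, NQ = features.T
    return np.concatenate([T, AS, Aqt, NQ])                          # length 320
```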

4.3 Action set

Taking an action means activating the green phase for a set of lanes for a fixed duration. The set of actions is defined in (5); each action activates the green phase for a specific set of lanes. Opposing approaches receive the green phase simultaneously: when the northern approach gets the green phase, the southern approach gets it at the same time. Since the left lane has its own traffic light and is used only for turning, it requires a dedicated action. Table 1 describes the action set and indicates which traffic lights are given the green phase. NS is selected when the through lanes of the northern or southern arm are crowded, and NSL when the left lane of the north or south holds the most cars; EW and EWL are defined analogously for the eastern and western arms.

$$A = \left\{ {NS, \, NSL, \, EW, \, EWL} \right\}$$
(5)
Table 1 Action set description

4.4 Reward function

Different reward functions can be found in the literature. The total time that all cars have spent waiting in the queue can be expressed as follows [14, 15]:

$${tqt}_{t}=\sum_{veh=1}^{n}{qt}_{(veh,t)}$$
(6)

where tqtt is the total queue time, qt(veh,t) is the queue time of each car at step t, and n is the number of vehicles in the environment. The reward function can then be written as (7) [14, 15].

$${r}_{t}=t{qt}_{t-1}-t{qt}_{t}$$
(7)

The difference between the total queue time at steps t − 1 and t is taken as the reward at step t (rt). The main drawback of this reward definition appears in traffic jams, where long queues hold many cars: as soon as the most crowded lane receives the green phase, many cars start moving, and the resulting reward is collected not only at the next agent step but also over the following ones. The agent is therefore confused about which action produced the reward and may choose inappropriate actions. For this reason, the accumulated total queue time is proposed instead, as in (8).

$$atq{t}_{t}=\sum_{veh=1}^{n}aq{t}_{(veh, t)}$$
(8)

where atqtt is the accumulated total queue time of all vehicles at time step t, and aqt(veh,t) is the accumulated time during which a vehicle has had a velocity below 0.1 m/s up to step t, counted from the moment the car is generated in the environment. Again, n is the total number of vehicles in the environment. With this definition, when a vehicle moves but cannot pass the intersection, atqtt is not reset (unlike tqtt), which avoids the misleading reward of (7), based on the tqt of (6), when a long queue forms. The reward function used in this article is therefore as follows.

$${r}_{t}=at{qt}_{t-1}-at{qt}_{t}$$
(9)

where rt is the reward at step t, and atqtt and atqtt−1 are the accumulated total queue times of all vehicles at the intersection at steps t and t − 1, respectively.

It is worth noting that the reward defined in this work is negative, so its magnitude is to be minimized; with a positive reward definition, the objective would instead be maximization.
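A hedged sketch of the reward computation is given below, using the standard TraCI calls traci.vehicle.getIDList() and traci.vehicle.getSpeed(); the 0.1 m/s threshold follows the definition above, while the step length and the bookkeeping details are assumptions.

```python
import traci

class QueueTimeReward:
    """Tracks the accumulated queue time per vehicle and returns the reward of Eq. (9)."""
    def __init__(self, step_length=1.0):
        self.step_length = step_length
        self.acc_qt = {}        # aqt(veh, t): never reset while the vehicle is in the network
        self.prev_atqt = 0.0

    def __call__(self):
        vehicles = traci.vehicle.getIDList()
        for veh in vehicles:
            if traci.vehicle.getSpeed(veh) < 0.1:                    # queued: speed below 0.1 m/s
                self.acc_qt[veh] = self.acc_qt.get(veh, 0.0) + self.step_length
        atqt = sum(self.acc_qt.get(veh, 0.0) for veh in vehicles)    # Eq. (8)
        reward = self.prev_atqt - atqt                               # Eq. (9)
        self.prev_atqt = atqt
        return reward
```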

4.5 Result and discussion of a single intersection

The results of the comparison with a static traffic light (STL) are presented in Table 2 and Figs. 5, 6 and 7. The terms “first” and “second” in Figs. 5, 6 and 7 refer to the two neural network models introduced earlier. Fig. 5 shows the result for the heavy scenario: with the second model, the total waiting time of all cars was almost 3000 s lower than with the first model. However, for the medium and low scenarios, the first model controlled traffic more effectively; in the medium scenario its total queue time was roughly 6000 s lower than that of the second model (Fig. 6).

Table 2 The comparison between the smart traffic control and an STL
Fig. 5 Comparison of the total queue time in heavy traffic

Fig. 6 Comparison of the total queue time in medium traffic

Fig. 7 Comparison of the total queue time in low traffic

It is evident from Table 2 that, compared with the STL, the proposed method reduced both the total queue time and the queue time per vehicle. The medium scenario experienced the most significant change, with a decline of approximately 62 percent in total queue time. The time cars spent queuing in the heavy scenario was 22 percent lower than with the STL, while for the low scenario the decrease was 13 percent. The method thus performed better in the heavy scenario than in the low one, which is not a concern: by the nature of the low scenario, an STL can already be adequate when there are few cars, so a significant change was not expected there.

Moreover, when evaluating the total queue time of all cars in each interval, it is important to acknowledge that some vehicles may experience extended waiting periods that are not visible in the aggregate measurement. To address this, the negative reward function is plotted in Fig. 8, so the individual experience of each vehicle is taken into account. Figure 8 reveals that in the medium and heavy traffic scenarios the STL has the most negative reward, meaning that the agent of the smart traffic light learns and controls the intersection better. However, in the low scenario, as already noted for the queue time, the STL has the least negative reward, which suggests that an STL is efficient when there are few vehicles in the environment.

Fig. 8 The comparison between the negative rewards of the STL and the first and second models

5 Two connected intersections

In this section, two connected intersections are discussed, as shown in Fig. 9. Two agents are considered: a1 controls the left intersection and a2 the right one. The traffic scenarios and lane lengths are the same as for the single intersection, except that the medium scenario has an additional condition: 90 percent of the cars arrive from the eastern or western approaches.

Fig. 9 Schematic of two connected intersections

In the first case, two independent DQNs are used to see how two agents interact in the same environment. Each intersection corresponds to one agent, the reward is determined locally, and the state of each agent contains no information about the action of the other. In other words, each agent is unaware of what the other agent is doing and of how the other agent's actions may affect it; the interaction between the agents' decisions is ignored. In the second case, the agents share information: each agent uses the other agent's action to form its state.

5.1 New state representative

For the case in which the two agents interact, a state representation different from the one introduced for the single intersection is used. The idea is to connect the agents by sharing action knowledge. A function f is added that maps the four possible actions to the index set I = {0, 1, 2, 3}:

$$f: A \to I$$
(10)
$$f \, \left( {NS} \right) = 0, \, f \, \left( {NSL} \right) = 1, \, f \, \left( {EW} \right) = 2, \, f \, \left( {EWL} \right) = 3$$
(11)

Consequently, new state representation (NewS) becomes:

$$\overrightarrow{NewS}=\left(\vec{T}, \overrightarrow{AS}, \overrightarrow{Aqt}, \overrightarrow{NQ}, f\left(A\right)\right)$$
(12)

It is worth mentioning that this change applies only to the connected-agent tests; in the first case, the state representation is the same as for the single intersection.
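A sketch of the connected-agent state of (12) is shown below. The mapping follows (10) and (11); appending f(A) as a single scalar is an assumption, since the paper does not state the exact encoding.

```python
import numpy as np

ACTION_INDEX = {"NS": 0, "NSL": 1, "EW": 2, "EWL": 3}   # the mapping f of Eqs. (10)-(11)

def build_connected_state(local_state, neighbour_action):
    """Extend the 320-element local state with the index of the other agent's last action."""
    return np.concatenate([local_state, [ACTION_INDEX[neighbour_action]]])

# Usage: agent a1 appends a2's last action, and vice versa
# s1 = build_connected_state(state_of_a1, last_action_of_a2)
```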

5.2 Result and discussion of two connected intersections

Two different tests were studied. The first test used two agents to control the group of intersections by reinforcement learning; the second examined how the traffic at one intersection can influence the other agent's decisions. As presented in Table 3, the negative rewards of the two connected agents in the low and EW scenarios are more negative than in the first test, which does not share action information, whereas in heavy traffic the negative reward of the first test is more negative. This shows that the agents decide more efficiently in heavy traffic when the state representation carries information about the other agent's action. In Table 3, RSA denotes the Reward of Separate Agents and RCA the Reward of Connected Agents. Table 4 compares the two methods, showing how the average cumulative queue time and the average queue time per car change when the agents share actions. In the heavy traffic scenario, when each agent is aware of the other's action, the average queue time per car decreases by 53% compared with the separated agents. In the low and EW scenarios, assigning an independent agent to each intersection wastes less time than sharing knowledge. This can be justified because the main concern is the traffic jam, which is controlled well by the proposed approach; in the other scenarios the time loss is not significant, and in the low traffic scenario even an ordinary traffic light can manage the intersections well. Therefore, the idea of sharing action knowledge worked out. In El-Tantawy and Abdullahi's article [9, 10], the proposed model performs 2 to 7 percent better with independent agents, whereas the dependent-agent method presented here improves the heavy scenario by 53%.

Table 3 Comparison between the rewards of the two tests
Table 4 Percentage change with action knowledge sharing relative to the case of two separate agents

6 Conclusion

In this paper, an RL approach was presented to address the traffic congestion problem. A smart traffic light was designed for a single intersection and for two connected intersections in order to reduce the time vehicles wait to pass the intersection. In the first case, the proposed method reduced the average queue time per vehicle by 34 percent compared with a static traffic light. In the second test, the agents of the two intersections were considered both independent and connected. It was shown that the separated agents yield a lower total queue time in the low and EW scenarios, whereas sharing action knowledge performs better in heavy traffic. Using the information drawn from the discretized space helped the agents make appropriate decisions. It is worth mentioning that for complicated problems with a group of intersections, sharing action knowledge might not pay off, and a trade-off between model complexity and time is required. Future work could focus on a multi-agent approach in a heterogeneous environment, in which the number of lanes differs between approaches. To make the environment more realistic, emergency vehicles could also be considered, with the green phase allocated out of order for them.