1 Introduction

Multi-agent Systems (MAS) consist of multiple agents that interact in an environment and coordinate their behaviour to solve a complex task or achieve a goal. In dynamic environments, agents should be able to adapt to changes and autonomously generate goals based on the changing requirements. For example, in intelligent traffic signal control systems, traffic signals are controlled by autonomous and adaptive agents whose goal is to minimise the waiting time of all vehicles. When an incident happens, the system is expected to manage the traffic so that an ambulance can reach the incident location as soon as possible. In this situation, the traffic signal agents change their goal from “minimising the waiting time of all vehicles” to “minimising the ambulance’s waiting time”. To do so, they need to identify the change in the environment, create a new goal, and take suitable actions accordingly.

Agents use several techniques to behave appropriately in an ever-changing environment [36]. Learning techniques enrich agents’ knowledge using big data or long exploration periods [12]. However, these techniques are not useful when real-time decision-making is required and not enough data is available. In the example above, if the current goal is defined as “minimising the total waiting time”, learning techniques cannot determine how the goal should change when a new, unseen emergency occurs unless a predefined goal to “minimise the ambulance’s waiting time” already exists.

Practical reasoning is also used to enable agents to infer knowledge from their environment and from interaction with other agents [1]. Although reasoning allows agents to understand their environment, update their perceptions, and make decisions in real time, it cannot help with setting new goals. In the example discussed earlier, the reasoning engine must have a predefined rule (e.g., context: emergency situation, action-specification: minimise the ambulance’s waiting time).

Goal generation approaches allow agents to choose a new goal from a predefined goal-set when the current goal should be changed [26]. Using Goal Reasoning (GR) techniques, agents continuously reason about the goals they are pursuing and, if necessary, adjust them according to their preferences [1]. There are two models of GR: Goal-Driven Autonomy (GDA) and Goal Refinement. In GDA, agents nominate a set of potentially suitable goals for the current situation. In Goal Refinement, agents generate one or more plans to achieve a given goal; when an unexpected event occurs, they evaluate the goal and decide whether to pursue it, drop it, or resolve the detected event through one of the predefined strategies. Although GR agents can operate in complex systems with limited communication and adapt intelligently to changing conditions [21], they do not perform as well when they encounter an unseen situation.

The main research question this paper addresses is how agents can adapt their behaviour when they experience an unseen situation. We propose an Automatic Goal Generation Model (AGGM) that allows agents to adapt their behaviour to changes in the environment by autonomously generating new goals to handle unseen situations. In this model, agents continuously observe the state of their environment and evaluate the feedback they receive to decide whether there is a significant change in the environment and, consequently, whether the current goal should be changed. A significant change is identified either when there is a discrepancy with an expected reward, or in anticipation that a goal must be changed because of an unseen situation. To react to an identified significant change, the agent can replace the current goal with one of the predefined goals in its goal-set; if no suitable predefined goal exists, the agent takes actions that plausibly lead back to a previously experienced state. A traffic signal control system is chosen as the case study for this paper, in which AGGM’s performance is compared to Q-learning, SARSA (State–Action–Reward–State–Action), and Deep Q Network (DQN). The results show that AGGM outperforms the baselines when handling unseen situations such as emergency and congestion cases.

The remainder of the paper is organised as follows. Section 2 reviews the relevant literature. Section 3 briefly presents the background knowledge, and Section 4 presents AGGM. Section 5 describes the case study, and Section 6 defines the experimental scenarios. Section 7 analyses the results. Finally, our conclusions and future work are discussed in Section 8.

2 Related work

In an ever-changing environment, agents need to be adaptive and to autonomously revise their understanding of their environment and of other agents’ behaviour in order to take suitable actions. Goal formation, goal generation, and goal reasoning techniques are used by adaptive agents to handle their environments’ unpredictability. Goal formation is the process of moving from unachievable goals, through instrumental beliefs, to concrete ones [8]. In the goal generation process, goals are generated from conditional beliefs, obligations, intentions, and desires or motivations based on the agent’s preferences [5, 22, 26]. Additionally, goals can be generated when an agent detects discrepancies between its sensory inputs and its expectations [27]. Goal reasoning has also been implemented as a case-based system that uses active and interactive learning to automatically select goals from a set of predefined goals [35]. In these works, a new goal is generated when an agent’s beliefs change; however, the goal generation process when the environment’s state is inconsistent with the agent’s beliefs is not discussed.

In practical reasoning, agents’ desires and preferences are used during goal generation [20, 22]. Social reasoning is also used to enhance agents’ understanding of others’ goals and their dependencies [15]. Although reasoning is an effective approach for real-time decision-making, it requires predefined reasoning rules to handle unseen situations.

Learning techniques are also used to enhance agents’ performance when only limited knowledge is available. In [12], a reverse method is described for gradual learning from a set of start states near the goal. In [11], agents jointly learn the forward task and a reset policy that resets the environment for a subsequent attempt. In [16], agents learn a world model and a self model simultaneously: the world model predicts the dynamic consequences of the agent’s actions, and the self model estimates the world model’s errors to guide future exploration.

When the reward function is unknown, agents face a difficult challenge [33]. Imitation Learning (IL) enables agents to reach a goal without any reward signal. This approach extracts additional information from demonstrations, so it can leverage demonstrations that do not include expert actions [9]. Moreover, Inverse Reinforcement Learning (IRL) agents can infer a reward function from expert demonstrations, assuming that the expert policy is optimal with respect to the unknown reward function [17].

Although learning strategies can guide agents toward their goals using reward or fitness functions, they perform poorly when such functions are not available.

3 Background

This section briefly presents the required background information.

3.1 Reinforcement learning

In this paper, agents use RL as their learning method; however, AGGM does not depend on RL, and other learning methods can be used instead. RL is a trial-and-error method in which agents learn through interaction with the environment: an agent chooses an action according to its policy and sends it to the environment, the environment moves to a new state, and an immediate reward is sent back to the agent. Over time, the agent favours the actions that yield the highest rewards [10]. We use Q-learning, SARSA, and DQN as the RL approaches against which AGGM’s performance is compared.

Q-learning

Q-learning is a fundamental RL method in which an agent computes a Q-value that estimates the discounted cumulative reward of its actions, as shown in (1). \(R_{t}^{{\varPi}}\) (see the Appendix for a description of symbols) is the discounted cumulative reward at time step t under policy π, γ is a discount rate that determines how much future rewards contribute to the total reward, and \(r_{\tau}\) is the reward at time step τ. The agent becomes farsighted as γ approaches 1 and shortsighted as γ approaches 0 [40].

$$ R_{t}^{{\varPi}} = \sum\limits_{\tau = t}^{\infty} \gamma^{\tau-t} r_{\tau}, \gamma\in[0,1) $$
(1)

The Q-value for state s and action a is denoted Q(s, a), and it is updated as shown in (2):

$$ Q(s,a) = Q(s,a) + \rho[R(s,a,s^{\prime}) + \gamma\max_{a^{\prime}} Q^{\prime}(s^{\prime},a^{\prime}) - Q(s,a)] $$
(2)

Q(s, a) is the currently stored Q-value for applying action a in state s. \(R(s,a,s^{\prime})\) is the reward the agent receives from the environment after performing action a in state s, which takes the environment to the new state \(s^{\prime}\). \(\max_{a^{\prime}} Q^{\prime}(s^{\prime},a^{\prime})\) is the maximum expected future reward. The learning rate ρ determines to what degree the new information overrides the old [14]. The Q-learning algorithm stores Q-values in a Q-table and uses this data structure to compute the maximum expected future rewards.
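
To make the tabular update concrete, the following minimal Python sketch implements (2); the environment interface is omitted, and the hyperparameter values are illustrative assumptions rather than the settings used in this paper.

from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, rho=0.1, gamma=0.9):
    # Move Q(s, a) towards the target r + gamma * max_a' Q(s', a'), cf. (2).
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += rho * (r + gamma * best_next - Q[(s, a)])

# Q-table mapping (state, action) pairs to values, initialised to zero.
Q = defaultdict(float)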

SARSA

To compute the Q-value, Q-learning uses the difference between Q(s, a) and the maximum action value \(\max_{a^{\prime}} Q^{\prime}(s^{\prime},a^{\prime})\), while the on-policy SARSA algorithm learns action values relative to the policy it follows [40], as shown in (3):

$$ Q(s,a) = Q(s,a) + \rho[R(s,a,s^{\prime}) + \gamma Q^{\prime}(s^{\prime},a^{\prime}) - Q(s,a)] $$
(3)
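
Continuing the sketch above, the SARSA update in (3) differs only in its target: it bootstraps from the action actually selected by the current policy rather than from the greedy maximum. This is a minimal illustration, not the paper's implementation.

def sarsa_update(Q, s, a, r, s_next, a_next, rho=0.1, gamma=0.9):
    # On-policy target uses Q(s', a') for the action a' actually taken next, cf. (3).
    Q[(s, a)] += rho * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])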

Deep Q Network

The DQN algorithm uses a deep neural network whose input is the state the agent is in and whose outputs (targets) are the Q-values of each of the actions [37].

The RL algorithm applies an exploration (i.e., exploring the action space) and exploitation (i.e., performing the best-known action) scheme to compute a policy that maximises the payoff. The Epsilon-Greedy algorithm is used to balance exploration and exploitation when choosing actions, where ε is the fraction of choices dedicated to exploration [40].
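
For instance, epsilon-greedy action selection over a Q-table can be sketched as follows; the value of ε used here is an illustrative assumption.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore a random action; otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])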

3.2 Traffic signal control systems

Traffic signal control uses computer and information technology to adjust signal timing parameters and improve traffic flow [43]. In [24], the authors categorise artificial-intelligence-based traffic signal control methods into the following categories: fuzzy control technology, Artificial Neural Networks (ANN), evolutionary algorithms (e.g., genetic algorithms, ant colony algorithms, and particle swarm optimisation), and RL methods.

Automatic traffic signal control can be managed by agents with learning capabilities that control the traffic signals [4, 36]. Agents use interaction and communication protocols and ontologies to negotiate and make decisions [4, 36, 42]. This system is chosen as the case study for this paper because, in traffic systems, it is not possible to predict all events, and consequently facing unseen situations is inevitable.

3.3 Ontology

Ontologies provide user-contributed, machine-understandable semantics of data and support augmented intelligence [13]. Ontologies are used to enhance agents’ learning process by providing semantic models [31], an augmented feedback loop that optimises their overall accuracy [6], and a way to share knowledge between agents [41]. Additionally, agents can generate their own ontologies by observing others’ actions and automatically deriving logical rules that represent the observed behaviour [18, 29].

We propose that each agent represents its observation using a schema described by an ontology. The Ontology Development 101 strategy [34] and the ontology editing environment Protégé [32] can be used to develop an ontology, and ontology Verification & Validation (V&V) following the SABiO guidelines can be used to evaluate it (i.e., to identify missing or irrelevant concepts) [3]. In [18], the Semantic Sensor Network (SSN) ontology is proposed to describe sensor resources and the data they collect as observations. It bridges the gap between low-level data streams coming from sensors in real time and the high-level concepts agents use to interpret an observation (i.e., high-level semantic representations). The ontology-based schema is used when a semantic description is needed; it is composed of the surrounding concepts perceived by agents, modelled as observations, and the relations between those concepts. These relations enable inheritance between concepts and automated reasoning. We define \(L_{g_{i}}^{t}\) as the schema describing the data observed by agent gi at time step t, where C represents the set of concepts and M represents the set of relations over these concepts (see (4)). The domain and range of a relation determine what kind of instances it can be used for (i.e., domain) and what kind of values it can have (i.e., range).

$$ L_{g_{i}}^{t}=\{C_{g_{i}}^{t}, M_{g_{i}}^{t}\} $$
(4)
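
As a simplified illustration of (4), the schema can be represented as a set of concepts and a set of relations annotated with their domains and ranges; the concept and relation names below are examples chosen for the traffic domain, not the paper's full ontology.

# Schema L = {C, M}: concepts and relations (with domain and range) observed by agent g_i at time step t.
concepts = {"Vehicle", "Emergency", "Road", "Intersection"}
relations = {
    "hasType":       {"domain": "Vehicle", "range": "Emergency"},
    "isRegulatedBy": {"domain": "Road", "range": "Intersection"},
}
schema = {"C": concepts, "M": relations}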

A graph representation of our ontology is generated using the OntoGraf plug-in in Protégé, and the inference rules are expressed using the Semantic Web Rule Language (SWRL) [19]. There are two methods of reasoning when using inference rules [38] (a simplified sketch follows the list):

  • Forward reasoning: It starts from state observation and applies inference rules to extract more facts until reaches the goal. For example, we can conclude from “A” and “A implies B” to “B”.

  • Backward reasoning: It starts from the goal and chaining through inference rules to find the required facts that support the goal. For example, we can conclude from “not B” and “A implies B” to “not A”.
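
As a simplified, propositional illustration of forward reasoning (ignoring SWRL variables), rules of the form “premises imply conclusion” can be applied repeatedly until no new fact is derived; the facts and rules below are hypothetical.

def forward_chain(facts, rules):
    # rules: iterable of (premises, conclusion) pairs; derive new facts until a fixed point is reached.
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts

# Example: from "A" and the rule "A implies B", forward reasoning derives "B".
derived = forward_chain({"A"}, [({"A"}, "B")])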

To model the concepts in traffic signal control context, we use the ontology proposed in [30] (see Fig. 1).

Fig. 1 Ontology for traffic signal control, as represented by OntoGraf

4 Automatic goal generation model

In this paper, we assume our agents are adaptive and use an RL algorithm. The agents continuously observe their environment, identify changes, and have access to a mechanism that enables them to decide, when an unseen situation occurs, whether to continue with their RL process, change their current goal to a predefined goal, or generate a new goal. AGGM enables agents to make such decisions through the following stages. In the Observe stage, agents continuously observe the environment state and update their own state and the reward associated with their previous action. The agents then evaluate these states and the received rewards in the Evaluate stage, and in the Significant Change Identification stage they decide whether the current goal needs to be revised. If so, in the Reasoning stage the current goal is changed to a predefined goal, or a new goal is generated. The output of this stage is a new reward function that is passed to the Generate Action stage to generate suitable actions, which are then executed in the Execute Action stage. This process continues, with the next state and the reward of the current action sent back to the agents, until the environment sends a terminal state or a specified number of iterations is reached (see AGGM in Fig. 2 and the algorithmic procedure in Algorithm 1). The details of the model components are explained below.
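
The following Python-style sketch summarises this control loop; the agent and environment methods are placeholders named after the stages above, not an implementation taken from Algorithm 1.

def aggm_loop(agent, env, max_steps):
    for t in range(max_steps):
        state, reward = agent.observe(env)                              # Observe
        q_value, distance, importance = agent.evaluate(state, reward)   # Evaluate
        if agent.significant_change(q_value, distance, importance):     # Significant Change Identification
            agent.reason(state)   # Reasoning: adopt a predefined goal or generate a new one (updates the reward function)
        action = agent.generate_action(state)                           # Generate Action
        terminal = env.execute(action)                                  # Execute Action
        if terminal:
            break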

Algorithm 1

Fig. 2 Automatic Goal Generation Model

4.1 Observe

The agents constantly observe the environment to identify new changes and to update their perception of the environment, their states, and the reward received for their previous actions. \(S^{t}\) is the set of individual states of all n agents in the system at time step t; for example, \(s_{g_{i}}^{t}\) is the state of agent gi at time step t.

$$ S^{t}=\{s_{g_{1}}^{t}, s_{g_{2}}^{t}, \ldots, s_{g_{i}}^{t}, \ldots, s_{g_{n}}^{t}\} $$
(5)

Agent gi’s observation \(o_{g_{i}}^{t}\) is a tuple including its state \(s_{g_{i}}^{t}\) and the reward \(r_{g_{i}}^{t}\) received from the environment at time step t. When an observation is received, it will be sent to the Evaluate stage.

$$ o_{g_{i}}^{t}=\left( s_{g_{i}}^{t}, r_{g_{i}}^{t}\right) $$
(6)
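
For illustration, the observation tuple in (6) can be represented as a small named structure; the field names are assumptions made for this sketch.

from collections import namedtuple

# Observation o_{g_i}^t = (state, reward) received by agent g_i at time step t, cf. (6).
Observation = namedtuple("Observation", ["state", "reward"])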

4.2 Evaluate

Agent gi evaluates its observation using the Q-value, the state distance, and the importance of the observation.

Q-value

Using the reward \(r_{g_{i}}^{t}\) received from the environment, the agent calculates the Q-value, \(Q_{g_{i}}^{t}\) (line 3 of Algorithm 1).

The state distance

The agent computes \(D_{g_{i}}^{t}\), the absolute difference between the current state \(s_{g_{i}}^{t}\) and the previous state \(s_{g_{i}}^{t-1}\). To do so, we define \(V_{s_{g_{i}}^{t}}\) as a quantifying value that describes \(s_{g_{i}}^{t}\) (line 4 of Algorithm 1).

The importance of observation

Each agent gi observes the environment at time step t based on its ontology \(L_{g_{i}}^{t}\), so the importance of each observation \(s_{g_{i}}^{t}\) is determined based on the importance of the concepts \(C_{g_{i}}^{t}\) involved (line 5 of Algorithm 1). A concept weighting function is used to quantify the degree of importance of each concept \(x \in C_{g_{i}}^{t}\) in a domain using an iweighting indicator [28] (see (7)):

$$ iw_{c}(x) = \frac{1}{|M(x)|}\sum\limits_{m \in M(x)} iw_{M_{m}}^{(x,y)} $$
(7)

The iweighting indicator, denoted by iwc(x), is a numerical value derived by weighting the local context of concept x based on its outgoing edges (i.e., relations to other concepts). To compute concepts’ weights, the relations are initially weighted manually by ontology engineers during the ontology development process, and iwc(x) is then calculated as the average importance weight of the relations \(m \in M_{g_{i}}^{t}\) of domain concept x, constrained by their particular range y. For example, in the traffic context, a “Vehicle” (i.e., domain) “has type” (i.e., relation) “Emergency” (i.e., range), and a “Highest Importance” value (i.e., importance weight) can be assigned to a vehicle of emergency type (i.e., to this relation and its domain/range combination). There are five degrees of importance weights, which are converted to numerical values using predefined mappings (see Table 4): “Lowest Importance”, “Low Importance”, “Middle Importance”, “High Importance”, and “Highest Importance”.
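
A minimal sketch of this weighting, assuming the five importance labels have already been mapped to numerical values (the mapping below is illustrative and not necessarily the one in Table 4):

IMPORTANCE = {"Lowest Importance": 1, "Low Importance": 2, "Middle Importance": 3,
              "High Importance": 4, "Highest Importance": 5}

def concept_weight(outgoing_relation_labels):
    # iw_c(x): average importance weight of the outgoing relations of concept x, cf. (7).
    weights = [IMPORTANCE[label] for label in outgoing_relation_labels]
    return sum(weights) / len(weights)

# Example: a "Vehicle" whose "has type" relation to the range "Emergency" carries "Highest Importance".
iw_vehicle = concept_weight(["Highest Importance"])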

4.3 Significant change identification

Based on the output of the Evaluate stage, an agent decides whether its current goal needs to be changed or a new one created (see the algorithmic procedure in Algorithm 2 and the sketch after the list). Three cases are possible:

  • Case 1. When \(Q_{g_{i}}^{t}\) is out of the predefined range (\(Q_{g_{i}}^{t}<\) Discrepancy-Low-Threshold or \(Q_{g_{i}}^{t}>\) Discrepancy-High-Threshold) (line 2 of Algorithm 2). Discrepancy-Low-Threshold and Discrepancy-High-Threshold define the minimum and maximum values of Q-values that can be received.

  • Case 2. When \(D_{g_{i}}^{t}\) is bigger than the predefined State-Difference-Threshold which is the maximum difference between successive environment states (line 4 of Algorithm 2).

  • Case 3. When a concept x with a high importance weight iwc(x) appears in the environment, or the importance of the observation \(iw_{g_{i}}^{t}\) exceeds a predefined Importance-Weight-Threshold, which is the maximum importance weight that has been experienced for the current goal (line 6 of Algorithm 2).
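
A sketch of this decision procedure is shown below; the thresholds are passed in as parameters because their values are problem-specific and not fixed here.

def is_significant_change(q_value, state_distance, importance,
                          q_low, q_high, state_diff_threshold, importance_threshold):
    # Case 1: Q-value falls outside the expected range (discrepancy).
    if q_value < q_low or q_value > q_high:
        return True
    # Case 2: successive environment states differ more than expected.
    if state_distance > state_diff_threshold:
        return True
    # Case 3: the observation's importance exceeds what has been experienced for the current goal.
    return importance > importance_threshold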

Algorithm 2

4.4 Reasoning

Using the Reasoning Engine, agent gi continuously reasons about the goals it is pursuing. When it is required to change its goal or generate a new one, two cases are possible:

  • Choosing a predefined goal. An agent uses inference rules to deduce a predefined goal from a goal-set through forward reasoning (line 8 of Algorithm 1). The goal-set specifies tuples (S, G), where G is a goal that can be adopted when observation S is observed. When more than one goal in the goal-set is consistent with \(s_{g_{i}}^{t}\), the agent’s preferences and constraints \(P_{g_{i}}\) or the concepts’ weights \(\{iw_{g_{i}}^{t}\}\) are used as decision criteria (line 10 of Algorithm 1). Finally, the problem-specific reward function \(B_{g_{i}}^{t}\) is updated with the selected goal G (line 11 of Algorithm 1). For example, consider the following two alternative rules in the goal-set:

    • rule 1: c1 = k1, c2 = k2, ..., c5 = k5, ..., cn = kn → g1

    • rule 2: c1 = k1, c2 = k2, ..., c6 = k6, ..., cn = kn → g2

    ci is a concept and ki is its value in observation S. Suppose the agent’s observation \(s_{g_{i}}^{t}\) is consistent with both rules; if the iweighting indicator iwc(c6) is higher than iwc(c5), then g2 is selected.

  • Creating a new goal. When agent gi cannot find a suitable goal, a state similarity reward function \(J_{g_{i}}^{t}\), defined as the inverse of the difference between \(s_{g_{i}}^{t}\) and \(s_{g_{i}}^{t-1}\), is used (line 13 of Algorithm 1). Reducing the difference between \(s_{g_{i}}^{t}\) and \(s_{g_{i}}^{t-1}\) increases the state similarity reward (see (8)).

    $$ J_{g_{i}}^{t} = 1/|V_{s_{g_{i}}^{t-1}}-V_{s_{g_{i}}^{t}}| $$
    (8)

    Agent gi uses backward reasoning to maximise \(J_{g_{i}}^{t}\). Suppose an unseen situation occurs when ambulance a enters intersection s monitored by agent gi. According to the inference rules shown in Table 1, the agent maximises the position coordinates of ambulance a until it passes through the intersection, thereby reverting the environment to a previously known state. Table 2 shows another example of using backward reasoning to reduce congestion on road r1 by maximising the position coordinates of all instances of vehicle b on road r1 (i.e., hasPosition(?b, Moving)). A schematic overview of the backward reasoning process is shown in Fig. 3.

Table 1 An example of inference rules, inferring maximising the position coordinates of the ambulance a as a parameter in the state similarity reward function \(J_{g_{i}}^{t}\)
Table 2 An example of inference rules, inferring maximising the position coordinates of all instances of vehicle b as a parameter in the state similarity reward function \(J_{g_{i}}^{t}\)
Fig. 3 An example of the backward reasoning process: Z shows the fact atIntersection(?a, ?s) and \(\overline{\text{F}}\) shows the inferred fact hasPosition(?a, Moving); B shows the other facts in Table 1, such as consistOf(?r1, ?l1), atIntersection(?i, ?s), and isRegulatedBy(?r1, ?i)

Finally, agent gi selects an action based on the recommendation of a function that combines the two reward functions \(B_{g_{i}}^{t}\) and \(J_{g_{i}}^{t}\) (line 16 of Algorithm 1). Depending on the problem, various combinations can be defined for these two reward functions. In this paper, we define a priority function that prioritises \(J_{g_{i}}^{t}\) over \(B_{g_{i}}^{t}\). Therefore, agent gi prioritises maximising the state similarity reward over optimising the problem-specific reward, and takes actions that can contribute to re-experiencing its previously known state (see Fig. 4). To do so, agent gi selects actions that minimise the difference between \(s_{g_{i}}^{t}\) and \(s_{g_{i}}^{t-1}\).
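
One possible realisation of such a priority function is sketched below: whenever a significant change is active, the action is chosen by the value estimates associated with the state similarity reward, and otherwise by those associated with the problem-specific reward. Keeping two separate value tables is an assumption made for illustration.

def priority_action(state, actions, values_B, values_J, change_active):
    # Prioritise the state similarity reward J over the problem-specific reward B.
    values = values_J if change_active else values_B
    return max(actions, key=lambda a: values[(state, a)])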

Fig. 4 Handling an unseen situation in the Automatic Goal Generation Model: the functionality of the state similarity reward

4.5 Generate and execute action

Using the function F(\(B_{g_{i}}^{t}, J_{g_{i}}^{t}\)), based on the current state \(s_{g_{i}}^{t}\), agent gi selects the appropriate action a from action space A and executes it (lines 16 and 17 of Algorithm 1).

5 Traffic signal control case study

In multi-agent traffic signal control systems, the traffic signal at each intersection is controlled by an independent agent. Agents observe and analyse the collected traffic data and decide their actions accordingly. In the remainder of this section, we present how our model is tested using a traffic micro-simulator, Simulation of Urban MObility (SUMO), which provides microscopic real-time traffic simulation [25].

5.1 Simulation setting

SUMO is employed to evaluate the performance of AGGM in a traffic signal control case study. The whole simulated traffic network is a 750 m × 750 m area, and each intersection covers a 300 m × 300 m area; the total number of intersections is 16 (see Fig. 5). Each intersection has two incoming roads and two exit roads. Each road is marked with a name such as 0to1 and includes two lanes; for example, road 0to1 includes the two lanes 0to1_0 and 0to1_1. Hence, there are eight lanes at each intersection in which vehicles drive, and the lane length is 120 meters. Vehicles on the incoming west-to-east roads are allowed to turn right or pass straight through; vehicles on the incoming north-to-south roads are allowed to turn left or pass straight through. The minimal gap between two vehicles is 2.5 meters. There are four types of vehicles in the simulation: default, ambulance, fuel truck, and trailer truck. The length of the default and ambulance vehicles is 5 meters, and the length of the fuel truck and trailer truck is 10 meters. The default vehicles arrive in the environment following a random process, and the arrival rate on every lane is the same, one per second. The arrival rate of the other vehicle types follows the scenarios discussed in Section 6. Vehicles are discarded if they cannot be inserted. For all types of vehicles, the maximum speed is 55.55 m/s, which is equal to 200 km/h, the maximum acceleration is 2.6 m/s², and the maximum deceleration is 4.5 m/s². SUMO uses the Krauss car-following model [23], which guarantees safe driving on the road. The duration of the yellow phase is set to 2 seconds. The minimum duration of the green phase is set to 5 seconds and the maximum to 100 seconds (Max-Green-Time). The number of simulation seconds run before learning begins is set to 300, the number of simulated seconds in SUMO is set to 1,000, and the number of simulation seconds between actions is set to 5. We use the interface for instantiating RL environments with SUMO for traffic signal control provided by [2] to interact with the signal-controlled intersections.

Fig. 5 Simulated traffic network in SUMO

We perform 10 simulation runs for each scenario; one run is an episode of 1,000 seconds, and the reward is accumulated over an episode. The goal in our network is to maximise the reward in each 1,000-second episode by modifying the traffic signals’ phases. The simulation results show the average values obtained over the 10 runs and are compared to the baseline algorithms. The parameters of the network are shown in Table 3.

Table 3 Parameter settings of the baseline algorithms

5.2 Baseline algorithms

To build the traffic signal control system using RL, we need to define the states, actions, and rewards. The three elements are defined in the following (a sketch of the corresponding feature and reward computations follows the list):

  • States: We model information of each state \(s_{g_{i}}^{t}\) for traffic signal gi at time step t as follows:

    $$ s_{g_{i}}^{t}=\{{{\varPhi}}_{g_{i}}^{t},e_{g_{i}}^{t},q_{l_{i}}^{t},z_{l_{i}}^{t},y_{v_{i}}^{t},b_{v_{i}}^{t}, w_{v_{i}}^{t}\} $$
    (9)
    • Yellow, red and green phase indicators are shown as \({{\varPhi }}_{g_{i}}^{t}\) for the intersection monitored by gi at time step t.

    • The current phase elapsed time is shown as \(e_{g_{i}}^{t}\) for the intersection monitored by gi at time step t and is computed by (10), where u is the time elapsed since the start of the current phase.

      $$ e_{g_{i}}^{t}=u/ \text{Max-Green-Time} $$
      (10)
    • Current lane queue, the number of vehicles waiting in each lane divided by the lane capacity, is shown as \(q_{l_{i}}^{t}\) for lane l at the intersection monitored by gi at time step t and is computed using (11). hl is the total number of halting vehicles for the last time step on lane l (a speed of less than 0.1 m/s is considered a halt), el is the length of lane l in meters and f is the sum of the vehicle length and the minimum gap.

      $$ q_{l_{i}}^{t}=\min(1,(h_{l}/(e_{l}/f))) $$
      (11)
    • Current lane density, the number of vehicles in each lane divided by the lane capacity, is shown as \(z_{l_{i}}^{t}\) for lane l at the intersection monitored by gi at time step t and is computed using (12). nl is the number of vehicles on lane l within the last time step.

      $$ z_{l_{i}}^{t}=\min(1,(n_{l}/(e_{l}/f))) $$
      (12)
    • The type of vehicle is shown as \(y_{v_{i}}^{t}\) for vehicle v at the intersection monitored by gi at time step t.

    • The position coordinate of a vehicle along the lane (the distance from the front bumper to the start of the lane) is shown as \(b_{v_{i}}^{t}\) for vehicle v at the intersection monitored by gi at time step t.

    • The waiting time of a vehicle counts the number of seconds a vehicle has a speed of less than 0.1 m/s and is shown as \(w_{v_{i}}^{t}\) for vehicle v at the intersection monitored by gi at time step t.

  • Action Space: Traffic signal phases in the action space include green, yellow, and red phases (i.e., 1, 2, and 3 indicating green, yellow, and red respectively). The green phase is the period during which vehicles are permitted to cross. The yellow phase is required between two neighbouring phases to guarantee safety. The red phase is the period during which vehicles are not allowed to cross.

  • Rewards: The main goal here is to increase the efficiency of the intersection, by minimising the vehicles’ waiting time. Therefore, the reward will be the amount of change in the cumulative waiting time between two consecutive cycles (see (13)). During the training period, the RL algorithm tries different signal control schemes and eventually converges to an optimal scheme which yields a minimum average waiting time.

    $$ B_{g_{i}}^{t}= \sum w_{v_{i}}^{t-1} - \sum w_{v_{i}}^{t} $$
    (13)
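
To make the state features and reward concrete, the sketch below computes the lane queue (11), lane density (12), and waiting-time reward (13) from plain numerical inputs; it does not call the SUMO/TraCI API, and the variable names are illustrative.

def lane_queue(halting_vehicles, lane_length, vehicle_size):
    # (11): halting vehicles divided by the lane capacity (lane length / (vehicle length + minimum gap)), capped at 1.
    return min(1.0, halting_vehicles / (lane_length / vehicle_size))

def lane_density(vehicle_count, lane_length, vehicle_size):
    # (12): vehicles on the lane divided by the lane capacity, capped at 1.
    return min(1.0, vehicle_count / (lane_length / vehicle_size))

def waiting_time_reward(prev_waiting_times, curr_waiting_times):
    # (13): change in cumulative waiting time between two consecutive cycles.
    return sum(prev_waiting_times) - sum(curr_waiting_times)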

Unseen situations

We define the arrival of ambulances, fuel trucks, and trailer trucks, as well as a queue of more than 10 vehicles at an intersection, as unseen situations.

5.3 AGGM in SUMO

The stages of AGGM in the traffic signal control system are as follows:

  • Observe: We model each traffic signal’s observation as a tuple which includes the traffic signal’s state \(s_{g_{i}}^{t}\) (see (9)) and the traffic signal’s reward \(r_{g_{i}}^{t}\) received from the environment at time step t. The reward functions include the problem-specific reward \(B_{g_{i}}^{t}\) which is explained in (13) and the state similarity reward \(J_{g_{i}}^{t}\) as displayed in (14). Agent gi uses backward reasoning over its ontology’s inference rules (see Tables 1 and 2) and deduces the state similarity reward as maximising the position coordinates of vehicles.

    $$ J_{g_{i}}^{t}= \sum (b_{v_{i}}^{t} - b_{v_{i}}^{t-1}) $$
    (14)

    So, the traffic signal’s reward is a set that includes the rewards computed using \(B_{g_{i}}^{t}\) and \(J_{g_{i}}^{t}\) (see (15)).

    $$ r_{g_{i}}^{t}=\{B_{g_{i}}^{t},J_{g_{i}}^{t}\} $$
    (15)

    We defined a combination of the two reward functions \(B_{g_{i}}^{t}\) and \(J_{g_{i}}^{t}\) in the form of the priority function by giving more priority to the output of the state similarity reward function.

  • Evaluate and Significant Change Identification: The traffic ontology includes the importance weights of the relations and their domain/range combinations (see Table 4). Traffic signal agent gi computes the importance weight iwc(ri) of road ri as \(\sum_{v_{i} \in r_{i}} iw_{c}(v_{i})\) to identify whether there has been a significant change in the environment (see line 6 of Algorithm 2). When congestion happens or an important vehicle (e.g., an ambulance, a fuel truck, or a trailer truck) enters the intersection on road ri in the current state at time step t, the iwc(ri) value becomes larger than the Importance-Weight-Threshold, and the Reasoning Engine is triggered.

  • Reasoning and Execute Action: The state similarity reward \(J_{g_{i}}^{t}\) is defined as the sum of the relocations of the vehicles in two consecutive states (see (14)). The greater the state similarity reward, the closer the vehicles that led to the significant change get to the intersection. AGGM, together with the RL algorithm, is used to generate the traffic signal’s action (see Algorithm 3 and the sketch after this list):

    • AGGM: When there is a significant change, agent gi will select an action a which maximises the value of the state similarity reward, and discards the action suggested by the RL algorithm (line 5 of Algorithm 3).

    • RL: Action a will be suggested based on the problem-specific reward function (line 7 of Algorithm 3).
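
A sketch of this action-selection logic, a traffic-specific instance of the priority function from Section 4.4, is given below; how the gain in the state similarity reward (14) is estimated for a candidate signal phase is left abstract and is an assumption made for illustration.

def select_signal_action(rl_policy, state, actions, change_active, similarity_gain):
    # similarity_gain(state, action): estimated increase in J (14) if the action is applied.
    if change_active:
        # Significant change: discard the RL suggestion and maximise the state similarity reward.
        return max(actions, key=lambda a: similarity_gain(state, a))
    # Otherwise follow the RL policy trained on the problem-specific reward B (13).
    return rl_policy(state)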

Algorithm 3
Table 4 Relations between concepts in traffic signal control ontology

6 Experimental scenarios

The vehicles’ average waiting time is used to measure the efficiency of AGGM and the baseline algorithms (i.e., Q-learning, SARSA, and DQN); it is calculated as follows:

$$ w_{g_{i}}^{t}= 1/|v_{i}|\sum w_{v_{i}}^{t} $$
(16)

The evaluation scenarios used to test the performance of AGGM are listed in Table 5. Scenarios 1, 2, and 3 generate important vehicles (i.e., ambulances, fuel trucks, and trailer trucks) on specified or random roads at Low and High frequencies. In the Low-frequency setting, 3 important vehicles are generated per 2 minutes, and in the High-frequency setting, 5 important vehicles are generated per 2 minutes. Scenario 4 creates congestion on specific roads every 3 minutes.

Table 5 Evaluation scenarios

7 Results and discussion

The results report the average waiting time for all types of vehicles over 10 runs of each scenario. As shown in Fig. 6, AGGM performs almost as well as the baseline algorithms when the average waiting time of default vehicles is compared in scenarios 1, 2, and 3, and it outperforms the baseline algorithms in scenario 4. As shown in Figs. 7, 8, and 9, AGGM significantly decreases the average waiting time of ambulances, fuel trucks, and trailer trucks compared to the baseline algorithms.

Fig. 6 The average waiting time of default vehicles – AGGM and the three baselines

Fig. 7 The average waiting time of important vehicles – AGGM and Q-learning

Fig. 8 The average waiting time of important vehicles – AGGM and SARSA

Fig. 9 The average waiting time of important vehicles – AGGM and DQN

As shown in Table 6, the decrease in average waiting time when AGGM is applied is significant compared to the baseline algorithms; however, AGGM is more effective in the low-frequency settings of the more complex scenario (i.e., scenario 3).

Table 6 Decrease in average waiting time – AGGM and baseline algorithms

As reported in Table 6, the improvement of AGGM over the baseline algorithms in scenario 1, where ambulances, fuel trucks, and trailer trucks are generated on parallel roads, is larger than in scenario 2, where they are generated on intersecting roads. This is because important vehicles entering on intersecting roads create a more complex situation. Also, our approach performs better in more complex scenarios where unseen situations can happen simultaneously on multiple parallel or intersecting roads (i.e., scenario 3), and it does not perform as well in simple scenarios where congestion happens with a queue of more than 10 vehicles at an intersection (i.e., scenario 4).

From Table 6, we observe that AGGM decreases the waiting time of ambulances more than that of the other vehicle types in scenario 3. Also, in scenario 2, the improvement for fuel trucks is higher than for the other types of important vehicles. Since in scenarios 2 and 3 important vehicles can enter the environment on intersecting roads, higher priority is given to the road carrying the most important vehicles. Therefore, when an ambulance or a fuel truck is observed on one road and a trailer truck on another, the road with the ambulance or fuel truck has the higher priority, and AGGM reverts that road to a familiar previous state using the state similarity reward function.

Finally, from the waiting time results for all types of vehicles in the system, we can conclude that the performance for default vehicles is not compromised in order to improve the performance for ambulances, fuel trucks, and trailer trucks in the first three scenarios (see Fig. 6). This is a particularly important result, as AGGM can handle unseen situations while maintaining the system’s performance in seen situations.

8 Conclusion

In real-world environments, it is not possible to predict all possible events in advance; therefore, when an agent faces an unseen situation or event, it may not be able to behave suitably and efficiently. In this paper, we addressed environments in which agents might face unseen situations. We proposed an Automatic Goal Generation Model in which agents are able to detect unseen situations and handle them automatically. In such situations, agents either replace their current goal with another predefined goal or generate a new goal. AGGM was evaluated in a traffic signal control case study with varying frequencies of unseen situations. The results were compared to several baseline algorithms, including Q-learning, SARSA, and DQN, and show that our method performs significantly better in detecting and handling unseen situations (e.g., emergency and congestion situations).

This paper can be extended in several directions. The state similarity reward was defined specifically for the traffic signal control case study, using the position coordinates of vehicles as its primary parameter. However, defining the state similarity reward may be more complicated when multiple types of unseen events are possible in an environment. To address this issue, we used ontology models to formulate states and their distinguishing parameters, so AGGM’s performance depends on the accuracy and completeness of the ontology used. Additionally, operating in dynamic environments requires the ontology to evolve and be updated frequently; ontology evolution techniques [39] can be used to address this issue. The new goal generation model is only concerned with reverting the environment to a familiar state that has been experienced before; however, this may be more challenging when multiple states, both familiar and inexperienced, can contribute to addressing an unseen situation, and ontology evolution techniques can also help here. High-level policies help agents adapt their goals to different situations in the environment; Generative Policy-based Models (GPM) can be used to enable agents to observe, learn, and adapt high-level policy models [7].