A multiagent cooperative reinforcement learning model using a hierarchy of consultants, tutors and workers
Abstract
The hierarchical organisation of distributed systems can provide an efficient decomposition for machine learning. This paper proposes an algorithm for cooperative policy construction for independent learners, named Q-learning with aggregation (QA-learning). The algorithm is based on a distributed hierarchical learning model and utilises three specialisations of agents: workers, tutors and consultants. The consultant agent incorporates the entire system in its problem space, which it decomposes into subproblems that are assigned to the tutor and worker agents. The QA-learning algorithm aggregates the Q-tables of worker agents into a central repository managed by their tutor agent. Each tutor’s Q-table is then incorporated into the consultant’s Q-table, resulting in a Q-table for the entire problem. The algorithm was tested using a distributed hunter prey problem, and experimental results show that QA-learning converges to a solution faster than single agent Q-learning and several well-known cooperative Q-learning algorithms.
Keywords
Reinforcement learning, Q-learning, Multiagent system, Distributed system, Markov decision process, Factored Markov decision process
1 Introduction
Classical reinforcement learning (RL) algorithms attempt to learn a problem by trying actions to determine how to maximise some reward. One such algorithm is Q-learning, which represents the cumulative reward for each state-action pair in a structure called a Q-table [34, 35]. A major problem with these algorithms is that their performance typically degrades as the size of the state space increases [4, 31]. Fortunately, many large state space problems can be decomposed into loosely coupled subsystems that can be processed independently [11].
One of the most efficient known approaches for RL decomposition of large size problems is the hierarchical methodology [12, 13, 27, 32]. In this approach, the target learning problem is decomposed into a hierarchy of smaller problems. However, current hierarchical RL techniques do not allow migration of learners from one problem space to another in distributed systems. Instead, they focus on decomposing the state or action space into more manageable parts, and then statically assign each learner to one of these parts.
This paper proposes Q-learning with aggregation (QA-learning), an algorithm for cooperative policy construction for independent learners that is based on a distributed hierarchical learning model. QA-learning reduces the complexity of large state space problems by decomposing the problem into more manageable subproblems, and distributing agents between these subproblems, to improve efficiency and enhance parallelisation [2].
The QA-learning model includes three specialisations of agents: workers, tutors and consultants. The consultant agent is the highest specialisation in the learning hierarchy. Each consultant is responsible for assigning a subproblem and a number of worker agents to each tutor. The worker agents first learn the problem space of their tutor, then the tutor aggregates its workers’ Q-tables into its own Q-table. The tutors’ Q-tables are then merged to create the consultant’s Q-table. Finally, the consultant performs a small amount of further learning over the entire problem space to optimise its Q-table.
When a tutor finishes learning its subproblem, the worker agents assigned to that tutor are released to the consultant, who can then reassign the workers to any tutors that are still learning. Thus, rather than remaining idle, worker agents are migrated to subsystems where they can help accelerate learning. This decreases the overall time required to learn the entire system.
The remainder of the paper is structured as follows: Sect. 2 presents basic definitions, Sect. 3 discusses related work, Sect. 4 introduces a motivating example, Sect. 5 discusses the QAlearning algorithm, Sect. 6 discusses simulation results using a distributed version of the hunter prey problem, and Sect. 7 presents the conclusion of this paper and future work.
2 Background in machine learning
This section briefly summarises some of the underlying concepts of reinforcement learning, and Qlearning in particular.
2.1 Markov decision process
A Markov decision process (MDP) is a framework for representing sequential decision making problems that allows a decision maker, at each decision stage, to choose from several possible next states [26]. MDPs are widely used to represent dynamic control problems, where the parameters of the MDP need to be learned through interaction with the environment [1]. An MDP is a tuple [S, A, R, T], where:
1. \(S=\{s_0,s_1,\ldots ,s_{n-1}\}\) is a set of possible states.
2. \(A=\{a_0,a_1,\ldots ,a_{m-1}\}\) is a set of possible actions.
3. \(R: S\times A \rightarrow \mathbb {R}\) is a reward function.
4. \(T: S\times A \times S \rightarrow [0,1]\) is a transition function.
The main aim of MDPs is to find a policy \(\pi \) that can be followed to reach a specific goal (a terminal state). A policy is a mapping between the state set and the action set \(\pi : S \rightarrow A\). An optimal policy \(\pi ^*\) always chooses the action that maximises a specific utility function of the current state. RL optimisation problems are often modelled using MDP inspired algorithms [31].
2.2 Factored Markov decision process
A factored MDP (FMDP) is a concept that was first proposed by Boutilier et al. [6]. An FMDP is an MDP with a state space S that can be specified as a cross-product of sets of state variables \(S=S_0\times S_1 \times \cdots \times S_{n-1}\). The idea of a factored state space is related to the concepts of state abstraction and aggregation [10]. It is based on the fact that many large MDPs have many parts that are weakly connected (loosely coupled) and can be processed independently [23]. In FMDPs, \(T_a\) denotes the state transition model for an action a. This transition model is represented as a dynamic Bayesian network (DBN). It uses a two-layer directed acyclic graph \(G_a\) whose nodes are \((S_0, S_1, \ldots , S_{n-1}, {S'}_0, {S'}_1, \ldots , {S'}_{n-1})\). In \(G_a\), the parents of \({S'}_i\) are denoted \( parents_a({S'}_i)\), and these parents are assumed to be a subset of the state space, \( parents_a({S'}_i) \subset S\). This means that there are no synchronic arcs from \({S'}_i\) to \({S'}_j\). The reward function R can be decomposed additively as \(\gamma _1 R_0+ \gamma _2 R_1+ \cdots + \gamma _{n-1} R_{n-1}\), and the differences of the decomposition do not depend on the state variables [36].
2.3 Q-learning
Q-learning is one of the best known RL algorithms that provides solutions for MDPs. This algorithm uses temporal differences (a combination of Monte Carlo methods and dynamic programming methods) to find mappings from state-action pairs to values. These values are known as Q-values, and are calculated using a value function, called the Q-function, that returns the expected utility of taking a given action in a given state and following a fixed policy after that [34, 35]. The fact that Q-learning does not require a model of the environment is one of its strengths [31].
The main output of the Q-learning algorithm is a policy \(\pi : S \rightarrow A\) which suggests an action for each possible state in an attempt to maximise the expected reward for an agent. Some variations of Q-learning use function approximation methods such as artificial neural networks instead of Q-tables to represent large continuous state problems [29].
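The Q-values described in this section are commonly learned with a one-step tabular update. The following is a minimal sketch of that standard rule, not code from the paper: the function name `q_update`, the dictionary-based Q-table and the toy grid states are illustrative assumptions, and the defaults \(\alpha = 0.4\) and \(\gamma = 0.9\) are borrowed from the experimental setup in Sect. 6.1.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.4, gamma=0.9):
    """Apply one standard Q-learning update to the table q in place."""
    # Best value obtainable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference error between the bootstrapped target and the estimate.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]

# Hypothetical example: one move to the right on a grid, rewarded with 1.0.
ACTIONS = ("up", "down", "left", "right")
q_table = defaultdict(float)  # unseen state-action pairs default to 0.0
new_value = q_update(q_table, (0, 0), "right", 1.0, (0, 1), ACTIONS)
```

A greedy policy \(\pi (s)\) can then be read off the table by choosing, for each state, the action with the largest Q-value.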
2.4 Cooperative Q-learning
Cooperative Q-learning allows independent agents to work together to solve a single Q-learning problem. Cooperative Q-learning is typically broken into two stages: individual learning, and learning by interaction. In the individual learning stage, each learner independently uses its own Q-learning algorithm to improve its individual solution. Then, in the learning by interaction stage, a Q-value sharing strategy is used to combine the Q-values of each agent to produce new Q-tables.
2.5 Classical hunter prey problem
The classical hunter prey problem is considered one of the standard test problems in the field of multiagent learning [25]. Normally, there are two types of agents: hunters and prey. Each agent is randomly positioned in the cells of a grid at the beginning of the game. The agents can move in four directions (up, down, right, left) unless there is an obstacle to the movement direction, such as a wall or boundary. For example, Fig. 1 shows a classical version of the hunter prey problem of grid size \(14 \times 14\) that involves 28 agents: 20 hunter agents (H) and eight prey agents (P).
3 Related work
This section is divided into two subsections. Section 3.1 discusses famous approaches of hierarchical decomposition in the RL domain, while Sect. 3.2 discusses cooperative hunting strategies for the hunter prey problem.
3.1 Hierarchical decomposition in the RL domain
Decomposition of MDPs can be broadly classified into two categories. First, static decomposition which partially or totally requires the implementation designers to define the hierarchy [28, 30]. Second, dynamic decomposition, in which hierarchy components, their positions, and abstractions are determined during the simulation process [11, 16, 32]. Both techniques focus on decomposing the state or action space into more manageable parts, and statically assign each learner to one of these parts. None of these techniques allow the migration of agents between different parts of the decomposition.
Parr and Russell [28] proposed an RL approach called HAMQ-learning that combines Q-learning with hierarchical abstract machines (HAMs). This approach effectively reduces the size of the state space by limiting the learning policies to a set of HAMs. However, state decomposition in this form is hard to apply, since there is no guarantee that it will not affect the modularity of the design or produce HAMs that have a large state space.
Dietterich [11] has shown that an MDP can be decomposed into a hierarchy of smaller MDPs based on the nature of the problem and its flexibility to be decomposed into smaller subgoals. This research also proposed a MAXQ procedure that decomposes the value function of an MDP into an additive combination of smaller value functions of the smaller MDPs. An important advantage of MAXQ decomposition is that it is a dynamic decomposition, unlike the technique used in HAMQ-learning [5].
The MAXQ procedure attempts to reduce large problems into smaller problems, but does not take into account the probabilistic prior knowledge of the agent about the problem space. This issue can be addressed by incorporating Bayesian reinforcement learning priors on models, value functions or policies [9]. Cao and Ray [9] presented an approach that extends MAXQ by incorporating priors on the primitive environment model and on goal pseudo-rewards. Priors are statistical information about previous policies and problem models that can help a reinforcement learning agent to accelerate its learning process. In multigoal reinforcement learning, priors can be extracted from the models or policies of previously learned goals. This approach is a static decomposition approach. In addition, the probabilistic priors must be given in advance in order to incorporate them in the learning process.
Cai et al. [8] proposed a combined hierarchical reinforcement learning method for multirobot cooperation in completely unknown environments. This method is a result of the integration of options with the MAXQ hierarchical reinforcement learning method. The MAXQ method is used to identify the problem hierarchy. The proposed method obtains all the required learning parameters through learning without any need for an explicit environment model. The cooperation strategy is then built based on the learned parameters. In this method, multiple simulations are required to build the problem hierarchy which is a time consuming process.
Sutton et al. [30] proposed the concept of options which is a form of knowledge abstraction for MDPs. An option can be viewed as a primitive task that is composed of three elements: a learning policy \(\pi : S \rightarrow A \), where S is the state set and A is the action set, a termination condition \(\beta : S^{+} \rightarrow [0,1]\) and an initial set of states \(I \subseteq S\). An agent can perform an option if \(s_t \in I\), where \(s_t\) is the current state of the agent. An agent chooses an option then follows its policy until the policy termination condition becomes valid. In this case, the agent can select another option. A main disadvantage of this approach is that the options need to be determined in advance. In addition, it is difficult to decompose MDPs using this approach because many decomposition elements need to be determined for each option.
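The three elements of an option can be pictured as a small data structure. The sketch below is illustrative only: the class name `Option`, the helper `run_option`, the `step` callback that stands in for the environment, and the toy corridor are all assumptions made for this example rather than part of the options framework as published.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

@dataclass
class Option:
    initiation_set: Set[Hashable]             # I: states in which the option may start
    policy: Callable[[Hashable], str]         # pi: maps a state to an action
    termination: Callable[[Hashable], float]  # beta: probability of stopping in a state

def run_option(option, state, step, rng=random, max_steps=100):
    """Follow the option's policy from state until its termination condition fires."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be started in this state")
    for _ in range(max_steps):
        state = step(state, option.policy(state))
        # beta(state) is the probability that the option ends in the new state.
        if rng.random() < option.termination(state):
            break
    return state

# Toy corridor (hypothetical): walk right until cell 3, where the option always ends.
walk_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: "right",
    termination=lambda s: 1.0 if s == 3 else 0.0,
)
final_state = run_option(walk_right, 0, step=lambda s, a: s + 1)
```

Once `run_option` returns, the agent is free to select another option whose initiation set contains the resulting state.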
Jardim et al. [20] proposed a dynamic decomposition hierarchical RL method. This method is based on the idea that to reach the goal, the learner must pass through closely connected states (subgoals). The subgoals can be detected by intersecting several paths that lead to the goal while the agent is interacting with the environment. Temporal abstractions (options) can then be identified using the subgoals. A drawback of this method is that it requires multiple simulations to define the subgoals. In addition, this method is time consuming and cannot easily be applied to large learning problems.
Generally, multiagent cooperation problems can be modelled based on the assumption that the state space of n agents represents a joint state of all agents, where each agent i has access to a partial view \(s_i\) from the set of joint states \(s=\{s_1, \ldots ,s_{n-1},s_n\}\). In the same manner, the joint action is modelled as \(\{a_1, \ldots , a_{n-1},a_n\}\), where each agent i may only have access to a partial view \(a_i\). One simple approach to modelling multiagent coordination is discussed in the survey study of Barto and Mahadevan [5]. It shows that the concurrency model of joint state and action spaces can be extended to learn task-level coordination by replacing actions with options. However, this approach does not guarantee convergence to an optimal policy, since the low-level policies change at the same time as the high-level policies are being learned.
The study of Barto and Mahadevan [5] discussed another hierarchical reinforcement learning cooperation approach. This approach is a hybrid approach that combines options [5] and MAXQ decomposition [11]. An option \(o= \langle I,\pi ,\beta \rangle \) is extended to a multi-option \(\overrightarrow{o}= \langle o_1,\ldots ,o_n \rangle \), where \(o_i\) is the option that is executed by agent i. The joint action value of a main task p, a state s and a multi-option \(\overrightarrow{o}\) is denoted as \(Q(p,s,\overrightarrow{o})\). The MAXQ decomposition of the Q-function can then be extended to the joint action-values.
Hengst [16] proposed HEXQ, a hierarchical RL algorithm, that automatically decomposes and solves MDPs. It uses state variables to construct a hierarchy of subMDPs, where the maximum number of hierarchy levels is limited to the number of state variables. The results are interlinked small MDPs. As discussed in [16] the main limitation of HEXQ is that it must discover nested subMDPs and find policies for their exits (exits are nonpredictable state transitions and not counted as edges of the graph) with probability of 1. This requires that the problem space must have state variables that change over a long time scale.
Tosic and Vilalta [32] proposed a RL conceptual framework for agents’ collaboration in largescale problems. The main idea here is to reduce the complexity of RL in largescale problems through modelling RL as a process of three levels: single learner level, colearning among small groups of agents and learning at the system level. An important advantage is that it supports dynamic adaption of coalition among agents based on continuous exploration and adaption of the three layered architecture of the proposed model.
The proposed model of Tosic and Vilalta [32] does not specify any communication scheme among its three RL learning levels. Moreover, the model suffers from the absence of detailed algorithmic specifications on how RL can be implemented in this three layered learning architecture.
Guestrin and Gordon [14] proposed a distributed planning algorithm in hierarchical factored MDPs that solves large decision problems by decomposing them into subdecision problems (subsystems). The subsystems are organised in a tree structure. Any subsystem has two types of variables: internal variables and external variables. Internal variables are the variables that can be used by the value function of the subsystem, while the external variables cannot be used because their dynamics are unknown. Although the algorithm does not guarantee convergence to an optimal solution, its output plan is equivalent to the output of a centralised decision system. This proposal has some limitations. First, although coordination and communication between agents are not centralised, they are restricted by the subsystem tree structure. Second, the definition of a subsystem as an MDP composed of internal and external variables only fits decision problems.
Gunady et al. [15] proposed a RL solution for the problem of territory division on hideandseek games. The territory division problem is the problem of dividing the search environment between cooperative seekers to reach optimal seeking performance. The researchers combined a hierarchical RL approach with state aggregation in order to reduce the state space. In state aggregation, similar states are grouped together in two directions: topological aggregation and hiding aggregation. In topological aggregation, the states are divided into regions based on the distribution of obstacles. In hiding aggregation, hiding places are grouped together and treated as the target of aggregation action. A disadvantage of this algorithm is that it requires the model information of the environment to be known in advance.
The distributed hierarchical learning model described in this paper is based on the structure of modern software systems, where a system is decomposed into multiple subsystems. There are no restrictions on the structure of the system hierarchy. Additionally, there are two levels of learning and coordination between subsystems: at the subsystem level; and at the system level. A major goal of this model is to handle dynamic migration of learners between subsystems in a distributed system to increase the overall learning speed.
3.2 Cooperative hunting strategy
Yong and Miikkulainen [37] described two cooperative hunting strategies that can be implemented in the hunter prey problem. The first is a cooperative hunting strategy for noncommunicating teams that involves two different roles of hunters: chasers and blockers. The role of a chaser is to follow the prey movement, while the role of the blocker is to move in a horizontal direction to the prey, staying in the vertical axis of the prey. This allows the blocker to limit the movement of the prey. The second strategy is also a cooperative hunting strategy for noncommunicating teams, but only involves chasers. In order for two chasers to sandwich the prey, at least two chasers are required to follow the prey in opposite directions to eventually surround the prey agent. Both hunting strategies were experimentally proven to be successful. One main advantage of these strategies is that no communication is required between hunters. However, both strategies require the prey position to be part of the state definition to provide the chasers and/or blockers with knowledge of the prey’s position.
Lee [24] proposed a hunting strategy that also involves the hunter roles of chaser and blocker. The roles are semantically similar to those in the first strategy of Yong and Miikkulainen [37], but they are described differently. The chasers drive the prey to a corner of the grid, while the role of the blockers is to surround the prey so that it does not have enough space to escape. The hunt is considered successful if the prey is captured. A main difference between the blockers in [24] and those in [37] is that the blockers in [24] are required to communicate in order to surround the prey agent. This is a disadvantage of the hunting strategy of [24], because communication between agents requires extra computation.
4 Distributed hunter prey problem
This section introduces a distributed version of the classical hunter prey problem to demonstrate how independent agents can cooperatively learn a policy for a distributed large state space problem. The main argument for this design is that the reduction of a state space S into n state spaces, \(S \rightarrow \{S_0,S_1, \ldots ,S_{n-1},S_n\}\), accelerates convergence to the optimal solutions [4, 5, 10].
The cooperative hunting strategies in Sect. 3.2 are not directly applicable for distributed hunter prey problems. Those strategies require hunters to have knowledge of the entire system, while hunters in the distributed hunter prey problem only have knowledge of their local grid. However, the semantic ideas behind chaser and blocker hunters can still be implemented. Section 5.4 will describe one possible implementation.
5 QA-learning
5.1 Problem model
The problem model of Q-learning with aggregation (QA-learning) is based on a loosely coupled FMDP organised into two levels: system and subsystem. The loose coupling characteristic of an FMDP means that each one of its subsystems has or uses little knowledge of other subsystems [22].
A system is a tuple [S, A, W, T], where S, A, and T are defined as in an MDP (see Sect. 2.1) and W is a set of reward functions \(R:S \times A \rightarrow \mathbb {R}\) for different roles that may be used in the system. A role can then be defined as an MDP [S, A, R, T], where \(R \in W\).
A subsystem \(\textit{Sub}\) of the system is defined by the following two components:
1. \(M=[S_\mathrm{sub}, A_\mathrm{sub}, R_\mathrm{sub}, T_\mathrm{sub}]\) is an MDP where:
(a) \(S_\mathrm{sub} \subseteq S\) is the set of states in the subsystem.
(b) \(A_\mathrm{sub} \subseteq A\) is the set of actions in the subsystem.
(c) \(R_\mathrm{sub}:S_\mathrm{sub} \times A_\mathrm{sub} \rightarrow \mathbb {R}\) is a reward function such that, given \(s \in S_\mathrm{sub}, a \in A_\mathrm{sub}, r \in \mathbb {R}\), \(R_\mathrm{sub}(s,a)=r \iff R(s,a)=r\).
(d) \(T_\mathrm{sub}:S_\mathrm{sub} \times A_\mathrm{sub} \times S_\mathrm{sub} \rightarrow [0,1]\) is a transition function such that, given \(s_i, s_j \in S_\mathrm{sub}, a_k \in A_\mathrm{sub}, t \in [0,1]\), \(T_\mathrm{sub}(s_i, a_k, s_j)=t \iff T(s_i, a_k, s_j)=t\).
2. \(C:S_\mathrm{sub} \times A \times (S {\setminus } S_\mathrm{sub}) \rightarrow [0,1]\) is a connection set which specifies how \(\textit{Sub}\) connects to other parts of the system such that, given \(s_i \in S_\mathrm{sub}, a \in A, s_j \in S{\setminus } S_\mathrm{sub}, t \in [0,1]\), \(C(s_i, a, s_j)=t \iff T(s_i, a, s_j)=t\).
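The relationship between the system transition function T, the subsystem transition function \(T_\mathrm{sub}\) and the connection set C can be illustrated with a small sketch. It assumes, purely for illustration, that T is stored as a dictionary keyed by (state, action, next state); the function name `split_transitions` and the three-state toy system are hypothetical, and the sketch does not restrict the action set to \(A_\mathrm{sub}\).

```python
def split_transitions(transitions, sub_states):
    """Split a system transition table into T_sub and the connection set C.

    transitions maps (s, a, s_next) -> probability for the whole system.
    """
    t_sub, connections = {}, {}
    for (s, a, s_next), prob in transitions.items():
        if s not in sub_states:
            continue  # transitions that start outside the subsystem are ignored
        if s_next in sub_states:
            t_sub[(s, a, s_next)] = prob        # stays inside: part of T_sub
        else:
            connections[(s, a, s_next)] = prob  # leaves the subsystem: part of C
    return t_sub, connections

# Toy system with three states; the subsystem owns s0 and s1.
T = {
    ("s0", "right", "s1"): 1.0,
    ("s1", "right", "s2"): 0.8,
    ("s1", "right", "s1"): 0.2,
    ("s2", "left", "s1"): 1.0,
}
t_sub, c = split_transitions(T, {"s0", "s1"})
```

In this toy case the only entry of C is the transition from s1 to s2, which is the single point where the subsystem touches the rest of the system.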
5.2 Agent specialisations

Worker agents are learners at the subsystem level where each worker can be assigned different roles.

Tutor agents are coordinators at the subsystem level where each subsystem has one tutor for each role. Each tutor agent aggregates its workers’ Q-tables into its own Q-table.

Consultant agents are coordinators at the system level. A distributed system has a consultant agent for each role (or a single consultant may handle multiple roles). A consultant agent learns the solution at system level by incorporating its tutors’ Q-tables into its own Q-table calculations. Consultant agents are also responsible for redistributing worker agents among tutors to help accelerate the overall learning process.
5.3 Migration of agents
The consultant agent redistributes worker agents among tutors using the following steps:
1. Register each tutor that is active and working to achieve its goals in the service queue.
2. If any tutor finishes processing its goals, remove it from the service queue, register it in the inactive list, and flag the state of its workers as available.
3. Split the worker agents of the tutor agents registered in the inactive list between the tutor agents in the service queue.
4. Apply the migration procedure for all worker agents that follow the tutor that is registered first in the inactive list.
5. Delete the first entry in the inactive list.
6. Go to step 1.
The migration procedure for a worker agent is as follows:
1. The consultant agent declares the migrant worker to be in a migrating state at subsystem level and at system level.
2. If the migrant worker is running in client server mode:
(a) Duplicate the migrant worker.
(b) Maintain a communication channel between the migrant worker and its copy for the rest of the steps.
(c) Go to step 5.
3. Write the problem space of the migrated agent to the tutor of the source subsystem.
4. Terminate the migrant worker process or thread.
5. Relocate the migrant agent to a new subsystem.
6. Inform the destination subsystem’s tutor of the migrant’s new location.
7. Resume the agent thread.
8. Allocate problem space to the migrant agent.
9. Resume the execution of the migrant agent.
10. If a worker agent finishes execution and is running in mobile agent mode:
(a) Terminate the migrant agent process or thread.
(b) Deallocate memory and data of the migrant agent.
(c) Relocate the migrant agent to its original subsystem.
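The bookkeeping behind the redistribution of workers (the service queue and the inactive list) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `tutor_finished` and `redistribute` are hypothetical, the round-robin split is an assumption because the text does not fix how workers are divided, and the actual relocation of processes or threads is omitted.

```python
from collections import deque

def tutor_finished(tutor, service_queue, inactive):
    """Move a tutor that has finished its goals from the service queue to the inactive list."""
    service_queue.remove(tutor)
    inactive.append(tutor)

def redistribute(service_queue, inactive, workers):
    """Split the workers of the first inactive tutor among the active tutors.

    workers maps a tutor name to the list of worker names it currently holds.
    """
    if not inactive or not service_queue:
        return
    finished = inactive.popleft()         # delete the first entry in the inactive list
    released = workers.pop(finished, [])  # workers flagged as available
    active = list(service_queue)
    for i, worker in enumerate(released):
        # Round-robin split: an assumption, the source does not fix a split rule.
        workers[active[i % len(active)]].append(worker)

service_queue = deque(["tutor_a", "tutor_b", "tutor_c"])
inactive = deque()
workers = {"tutor_a": ["w1", "w2"], "tutor_b": ["w3"], "tutor_c": ["w4"]}

tutor_finished("tutor_a", service_queue, inactive)
redistribute(service_queue, inactive, workers)
```

After this run the two workers released by `tutor_a` are shared between the two tutors that are still learning, so no worker remains idle.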
5.4 Roles of worker agents
Worker agents can play different roles to perform tasks at the subsystem level. As coordinators, consultant agents and tutor agents are responsible for assigning roles to worker agents.

Chaser hunter: each chaser hunter learns to chase and catch the prey agents inside its subgrid. Each chaser hunter inherits the problem space and the Q-table of its tutor agent.

Blocker hunter: each blocker hunter learns to occupy some blocking cells in the corners of its subgrid to stop prey from moving in that direction. Each blocker hunter inherits the problem space and the Q-table of its tutor agent.
5.5 QA-learning algorithm

First learning stage: In this stage, each worker agent copies its tutor’s Q-table into its own Q-table and applies Q-learning to improve the tutor’s solution. After each period of individual learning, the tutor aggregates its workers’ Q-tables into its own Q-table using the Q-value sharing strategy of BEST-Q [17, 18, 19].

Second learning stage: This stage takes place at the end of the first stage. In this stage, the consultant agent incorporates the Q-tables of its tutors into its Q-table for the entire system.
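The two aggregation steps can be sketched as below. The sketch rests on two assumptions that are not spelled out in this section: BEST-Q is read as keeping, for every state-action pair, the highest Q-value found among the contributing agents, and the consultant's merge is treated as a union of the tutors' tables that reuses the same rule for any overlapping entry. The function names and the toy tables are hypothetical.

```python
def aggregate_best_q(worker_tables):
    """Tutor-level aggregation: keep the highest Q-value seen for each (state, action)."""
    tutor_table = {}
    for table in worker_tables:
        for key, value in table.items():
            if key not in tutor_table or value > tutor_table[key]:
                tutor_table[key] = value
    return tutor_table

def merge_tutor_tables(tutor_tables):
    """Consultant-level merge: combine the tutors' Q-tables into one table."""
    # Subgrids are assumed to cover disjoint states, so entries rarely collide;
    # reusing the max rule for any collision is an assumption of this sketch.
    return aggregate_best_q(tutor_tables)

# Two workers of the same tutor; states are tagged with a subgrid name.
worker_1 = {(("sub0", 0), "up"): 0.2, (("sub0", 0), "down"): 0.5}
worker_2 = {(("sub0", 0), "up"): 0.7}
tutor_0 = aggregate_best_q([worker_1, worker_2])

tutor_1 = {(("sub1", 3), "left"): 0.9}
consultant = merge_tutor_tables([tutor_0, tutor_1])
```

The merged table is only a starting point: as described above, the consultant still learns over the entire system to refine it.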
6 Experiments
Two versions of the QA-learning algorithm were implemented: QA-learning with support for migration and QA-learning without migration. These two versions were compared with Q-learning, MAXQ, HEXQ and HAMQ-learning for different instances of the distributed hunter prey problem.
6.1 Setup
Three different grid sizes were selected to test the algorithms on small, medium, and large problems. In the first experiment, a grid size of \(100 \times 100\) was used. The second experiment used a grid size of \(200 \times 200\), and the third \(500 \times 500\). Each experiment included 16 prey, with four prey distributed randomly in each quarter of the grid.
Two chaser hunters and three blocker hunters were assigned to each tutor. The Q-tables of each worker in the same role (chaser or blocker) were aggregated into the tutor’s Q-table for that role after every 25 learning episodes.

As suggested in [3, 21], the learning rate was set to \(\alpha = 0.4\) and the discount factor to \(\gamma =0.9\) for the Q-learning algorithm and the two learning stages of the QA-learning algorithm.

As suggested in [16], the learning rate was set to \(\alpha = 0.25\) and the discount factor to \(\gamma =1\) for HEXQ and MAXQ.

As suggested in [28], the learning rate was set to \(\alpha = 0.25\) and the discount factor to \(\gamma =0.999\) for HAMQ-learning.
The selection policy of actions for all algorithms was the Softmax selection policy [31]: Given state s, an agent tries out action a with probability
$$\begin{aligned} p_s(a)=\frac{e^{\frac{Q(s,a)}{T}}}{\sum \limits _{b=1}^n e^{\frac{Q(s,b)}{T}}} \end{aligned}$$
In the above equation, the temperature T controls the degree of exploration. Assuming that all Q-values are different, if T is high, the agent will choose a random action, but if T is low, the agent will tend to select the action with the highest weight. The value of T was chosen to be 0.7 to allow expected rewards to influence the probability while still allowing reasonable exploration.
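The Softmax selection rule above can be illustrated with a short sketch. The function names `softmax_probabilities` and `select_action` and the example Q-values are hypothetical; subtracting the maximum Q-value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged and is not something the text prescribes.

```python
import math
import random

def softmax_probabilities(q_values, temperature=0.7):
    """Return the Softmax (Boltzmann) action probabilities for a list of Q-values."""
    # Shifting by the maximum avoids overflow in exp() without changing the ratios.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def select_action(actions, q_values, temperature=0.7, rng=random):
    """Sample one action according to the Softmax probabilities."""
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(actions, weights=probs, k=1)[0]

# Hypothetical Q-values for the four moves available to a hunter in some state.
moves = ["up", "down", "left", "right"]
q_for_state = [1.0, 0.5, 0.0, 0.0]
probs = softmax_probabilities(q_for_state)
action = select_action(moves, q_for_state)
```

Raising `temperature` flattens `probs` towards a uniform distribution, while lowering it concentrates the probability on the action with the largest Q-value.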
HAMQ-learning, HEXQ and MAXQ used two value functions: blocker and chaser value functions. The state variables for the chasing subtask are the position of the prey and the position of the hunter, while the state variables for the blocking subtask are the position of the blocker and the position of the blocking cell.
The learner in HEXQ explored the environment every 25 episodes and kept statistics on the frequency of change of each of the state variables. Each hierarchical exit is a state-action pair of the form (position of the goal, capture).
In all experiments, the position of each hunter agent was chosen randomly at the beginning of each episode. A learning episode ended when the hunter agents captured all the prey agents, or after 5000 moves without capturing all the prey agents. An algorithm is said to have converged when the average number of moves in its policy improves by less than one move over d consecutive episodes where \(d=n/2\) for a grid size of \(n\times n\).
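One way to read the convergence criterion is sketched below. The interpretation is an assumption: improvement is taken to be the drop in the average number of moves between the episode d episodes ago and the latest episode, with \(d=n/2\). The function name `has_converged` and the synthetic histories are hypothetical and are not data from the experiments.

```python
def has_converged(avg_moves, grid_size):
    """Check the convergence rule on a history of average moves per episode."""
    d = grid_size // 2  # d = n/2 for an n x n grid
    if len(avg_moves) <= d:
        return False    # not enough history to look back d episodes
    # Improvement is read as the drop in average moves across the last d episodes.
    improvement = avg_moves[-d - 1] - avg_moves[-1]
    return improvement < 1

# Synthetic histories (illustrative only): steady improvement, then a plateau.
improving = [1000 - 10 * i for i in range(60)]
plateau = improving + [410.0] * 50
```

For a \(100 \times 100\) grid the window is 50 episodes, so the first history is still improving while the second has levelled off.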
6.2 Results and discussion
This section compares the performance of QA-learning with migration, QA-learning without migration, MAXQ, HEXQ, HAMQ-learning, and single agent Q-learning in the distributed hunter prey problem.
Figure 9 shows the average number of moves per 25 episodes required to capture all the prey agents in a distributed hunter prey problem of size \(100\times 100\). For QA-learning with migration, each tutor converges to a solution for its subgrid after 175 episodes of learning, which marks the end of the first learning stage. The consultant converges to a solution after a further 375 episodes, meaning that QA-learning converges after 550 episodes to a solution for the whole grid. On the other hand, QA-learning without migration converges after 750 episodes, MAXQ converges after 2450 episodes, HAMQ-learning converges after 1200 episodes, HEXQ converges after 1600 episodes, and basic Q-learning converges after 3900 episodes. These results suggest that QA-learning with migration performs better than the other algorithms. This is because QA-learning allows multiple tutors to learn in parallel, after which the consultant combines their solutions through a small number of learning episodes. Even if tutors do not learn in parallel, the total number of learning episodes required to converge to a solution (\(175 \times 4+375 = 1075\)) is only \(27.6~\%\) of the number of episodes required for single agent Q-learning (\(43.9~\%\) for MAXQ, \(67.2~\%\) for HEXQ, and \(89.6~\%\) for HAMQ-learning). The support for migration of learners provided by QA-learning accelerates the learning process further, as shown in the figure. Further, since the tutors in the QA-learning scenario are learning smaller subsets of the overall problem, their individual learning episodes are typically of shorter duration.
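The percentages quoted in the paragraph above follow directly from the reported episode counts, as the short check below shows. The variable names are illustrative, and the factor 4 is taken from the \(175 \times 4\) term in the text.

```python
# Episodes to convergence reported for the 100 x 100 grid.
episodes = {"Q-learning": 3900, "MAXQ": 2450, "HEXQ": 1600, "HAMQ-learning": 1200}

tutor_episodes = 175       # episodes each tutor needs for its subgrid
consultant_episodes = 375  # further episodes needed by the consultant
tutor_count = 4            # the factor used in the 175 x 4 term of the text

# Tutors learning one after another versus all at the same time.
sequential_qa = tutor_episodes * tutor_count + consultant_episodes
parallel_qa = tutor_episodes + consultant_episodes

# Sequential QA-learning episodes as a percentage of each baseline.
ratios = {name: round(100 * sequential_qa / n, 1) for name, n in episodes.items()}
```

Here `sequential_qa` is 1075 and `parallel_qa` is 550, and `ratios` reproduces the 27.6, 43.9, 67.2 and 89.6 per cent figures given in the text.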
The ratio of the number of episodes in the cooperative learning algorithms to the number of episodes in Q-learning
Algorithm                 Experiment 1 (%)   Experiment 2 (%)   Experiment 3 (%)
Parallel QA-learning      14.1               1.1                0.8
Sequential QA-learning    27.6               3.5                2.59
MAXQ                      58.3               22.85              23.4
HAMQ-learning             30.8               13.3               11.5
HEXQ                      41                 7.9                6.6
The running time of Qlearning vs the running time of the other algorithms in seconds

| Algorithm | Experiment 1 | Experiment 2 | Experiment 3 |
|---|---|---|---|
| Qlearning | 583 | 68,662 | 509,417 |
| QAlearning without migration | 167 | 1121 | 8409 |
| QAlearning with migration | 160 | 683 | 4198 |
| MAXQ | 383 | 13,977 | 75,600 |
| HAMQlearning | 232 | 8109 | 68,400 |
| HEXQ | 287 | 4808 | 21,047 |

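The running times in the table above are easier to compare as speed-up factors. The following sketch simply divides one row by another; the dictionary layout and the function name `speedup_over` are illustrative choices, not part of the original experiments.

```python
# Running times in seconds, copied from the table above
# (Experiment 1, Experiment 2, Experiment 3).
RUNNING_TIME = {
    "Qlearning": (583, 68662, 509417),
    "QAlearning without migration": (167, 1121, 8409),
    "QAlearning with migration": (160, 683, 4198),
    "MAXQ": (383, 13977, 75600),
    "HAMQlearning": (232, 8109, 68400),
    "HEXQ": (287, 4808, 21047),
}


def speedup_over(baseline, algorithm):
    """Return baseline time divided by algorithm time for each experiment."""
    return tuple(
        b / a for b, a in zip(RUNNING_TIME[baseline], RUNNING_TIME[algorithm])
    )


if __name__ == "__main__":
    for name in RUNNING_TIME:
        factors = speedup_over("Qlearning", name)
        print(name, [round(f, 1) for f in factors])
```

With these figures, the speed-up of QAlearning with migration over single agent Qlearning grows from roughly 3.6 in Experiment 1 to more than 100 in Experiments 2 and 3.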
7 Conclusion and future work
The hierarchical organisation of distributed systems provides an efficient decomposition of large problem spaces into more manageable components. This paper introduced the QAlearning algorithm for cooperative policy construction for independent learners that is based on three specialisations of agents: workers, tutors and consultants. Each consultant is responsible for assigning a subproblem and a number of worker agents to each tutor. The worker agents first learn the problem space of their tutor, then the tutor aggregates its workers’ Qtables into its own Qtable. The consultant then merges the tutors’ Qtables to create its Qtable. Finally, the consultant performs a few rounds of Qlearning to optimise its Qtable.
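The two aggregation steps summarised above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the aggregation rule (a plain average over the workers that hold an entry), the merge rule (tagging entries with a tutor identifier), and all names are assumptions made for the example. The consultant's final rounds of Qlearning are omitted.

```python
from collections import defaultdict


def aggregate_workers(worker_tables):
    """Tutor step (assumed rule): average each state-action value over the
    workers that have an entry for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for table in worker_tables:
        for key, value in table.items():
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def merge_tutors(tutor_tables):
    """Consultant step (assumed rule): tag every entry with its tutor's
    identifier so that the subproblems stay distinct in one Qtable."""
    merged = {}
    for tutor_id, table in tutor_tables.items():
        for (state, action), value in table.items():
            merged[(tutor_id, state, action)] = value
    return merged


# Two hypothetical workers under tutor "T1", one under tutor "T2".
worker_a = {((0, 0), "east"): 1.0, ((0, 1), "south"): 0.5}
worker_b = {((0, 0), "east"): 3.0}
worker_c = {((5, 5), "north"): 2.0}

tutor_tables = {
    "T1": aggregate_workers([worker_a, worker_b]),
    "T2": aggregate_workers([worker_c]),
}
consultant_table = merge_tutors(tutor_tables)
```

Keying the consultant's table by tutor identifier keeps the sketch valid even if two subproblems reuse the same local state labels.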
The QAlearning algorithm has many advantages. First, the problem model of the QAlearning algorithm is a loosely coupled FMDP. This model reduces the complexity of large state space problems by taking advantage of the decomposable nature of the system itself.
Second, worker agents that have finished learning can be reassigned by the consultant to another tutor that is still learning, which accelerates that tutor's learning process. This decreases the time required for the consultant agent to learn the entire system.
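One way to picture this reassignment is the sketch below. It is a hypothetical illustration: the function name, the data layout, and the policy of sending each freed worker to the learning tutor with the fewest workers are all assumptions, since the policy is not specified here.

```python
def migrate_idle_workers(assignment, still_learning):
    """Move the workers of converged tutors to tutors that are still learning.

    assignment     -- dict mapping tutor id to a list of worker ids
    still_learning -- set of tutor ids that have not converged yet
    Returns a new assignment; the input is left unchanged.
    """
    new_assignment = {t: list(workers) for t, workers in assignment.items()}
    if not still_learning:
        return new_assignment
    for tutor in sorted(assignment):
        if tutor in still_learning:
            continue
        while new_assignment[tutor]:
            worker = new_assignment[tutor].pop()
            # Assumed policy: help the learning tutor with the fewest workers.
            target = min(
                sorted(still_learning), key=lambda t: len(new_assignment[t])
            )
            new_assignment[target].append(worker)
    return new_assignment


# Hypothetical example: tutor "T1" has converged, "T2" and "T3" have not.
before = {"T1": ["w1", "w2"], "T2": ["w3", "w4"], "T3": ["w5"]}
after = migrate_idle_workers(before, still_learning={"T2", "T3"})
```

In this example the two workers of the converged tutor are split between the two tutors that are still learning, and the original assignment is left untouched.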
Finally, the results of the pilot experiments suggest that QAlearning converges faster than conventional Qlearning and other well-known cooperative Qlearning algorithms, even if the tutors do not learn in parallel. Further, the average length of an episode in QAlearning is shorter than the average length of an episode in the other algorithms.
Currently, the decomposition process of QAlearning is the responsibility of the implementation designers and goes hand in hand with the design of the distributed system. This means that all decompositions need to be predefined and have to be compatible with the distributed organisation of the system.
Future work includes the implementation of QAlearning in single goal hierarchical systems, the automatic identification of subsystems, the reusability of subsolutions in QAlearning, and the applicability of QAlearning in partially observable environments.
Footnotes
 1. Duration of the first stage of QAlearning \(\times \) the number of tutors \(+\) duration of the second learning stage, which lasts until the consultant converges to a solution.
References
 1. Abbeel, P., Ng, A.: Exploration and apprenticeship learning in reinforcement learning. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 1–8 (2005)
 2. AbedAlguni, B.H.K.: Cooperative reinforcement learning for independent learners. Ph.D. thesis, The University of Newcastle, Australia. Faculty of Engineering and Built Environment, School of Electrical Engineering and Computer Science (2014)
 3. Arai, S., Sycara, K.: Effective learning approach for planning and scheduling in multiagent domain. In: Proceedings of the 6th International Conference on Simulation of Adaptive Behavior, pp. 507–516 (2000)
 4. Asadi, M., Huber, M.: State space reduction for hierarchical reinforcement learning. In: Proceedings of the Seventeenth International FLAIRS Conference, pp. 509–514 (2004)
 5. Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete Event Dyn. Syst. 13(1–2), 41–77 (2003)
 6. Boutilier, C., Dearden, R., Goldszmidt, M.: Exploiting structure in policy construction. In: International Joint Conference on Artificial Intelligence, vol. 14, pp. 1104–1113. Lawrence Erlbaum Associates Ltd (1995)
 7. Boyd, T., Dasgupta, P.: Process migration: a generalized approach using a virtualizing operating system. In: Proceeding of the 22nd International Conference on Distributed Computing Systems ICDCS 2002, pp. 385–392 (2002)
 8. Cai, Y., Yang, S., Xu, X.: A combined hierarchical reinforcement learning based approach for multirobot cooperative target searching in complex unknown environments. In: 2013 IEEE Symposium on Adaptive Dynamic Programming And Reinforcement Learning (ADPRL). Singapore, pp. 52–59 (2013)
 9. Cao, F., Ray, S.: Bayesian hierarchical reinforcement learning. In: F. Pereira, C. Burges, L. Bottou, K. Weinberger (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 73–81. Curran Associates (2012)
 10. Daoui, C., Abbad, M., Tkiouat, M.: Exact decomposition approaches for Markov decision processes: a survey. In: Advances in Operations Research 2010 (2010)
 11. Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. 13(1), 227–303 (2000)
 12. Erus, G., Polat, F.: A layered approach to learning coordination knowledge in multiagent environments. Appl. Intell. 27(3), 249–267 (2007)
 13. Ghavamzadeh, M., Mahadevan, S., Makar, R.: Hierarchical multiagent reinforcement learning. Auton. Agents MultiAgent Syst. 13(2), 197–229 (2006)
 14. Guestrin, C., Gordon, G.: Distributed planning in hierarchical factored MDPs. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 197–206. Morgan Kaufmann (2002)
 15. Gunady, M.K., Gomaa, W., Takeuchi, I.: Aggregate reinforcement learning for multiagent territory division: the hideandseek game. Eng. Appl. Artif. Intell. 34, 122–136 (2014)
 16. Hengst, B.: Discovering hierarchy in reinforcement learning with HEXQ. In: Machine Learning: Proceedings of the Nineteenth International Conference on Machine Learning, pp. 243–250. Morgan Kaufmann (2002)
 17. Iima, H., Kuroe, Y.: Reinforcement learning through interaction among multiple agents. In: The 2006 International Joint Conference of the Japanese Society of Instrument and Control Engineers and the Korean Institute of Control, Automation and System Engineers, pp. 2457–2462 (2006)
 18. Iima, H., Kuroe, Y.: Swarm reinforcement learning algorithms—exchange of information among multiple agents. In: 2007 Annual Conference of the Japanese Society of Instrument and Control Engineers, pp. 2779–2784 (2007)
 19. Iima, H., Kuroe, Y.: Swarm reinforcement learning algorithms based on sarsa method. In: 2008 Annual Conference of the Japanese Society of Instrument and Control Engineers, pp. 2045–2049 (2008)
 20. Jardim, D., Nunes, L., Oliveira, S.: Hierarchical reinforcement learning: learning subgoals and stateabstraction. In: 2011 6th Iberian Conference on Information Systems and Technologies. Chaves, Portugal, pp. 1–4 (2011)
 21. Jiang, D.W., Wang, S.Y., Dong, Y.S.: Rolebased contextspecific multiagent Qlearning. Acta Autom. Sinica 33(6), 583–587 (2007)
 22. Kaye, D.: Loosely coupled: the missing pieces of Web services. In: Bing, A., Kaye, C. (eds.) 1st edn. Chap. 10, RDS Strategies LLC p. 132 (2003)
 23. Kim, K.E., Dean, T.: Solving factored MDPs via nonhomogeneous partitioning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence. IJCAI’01, vol. 1, pp. 683–689. Morgan Kaufmann, San Francisco (2001)
 24. Lee, M.R.: A multiagent cooperation model using reinforcement learning for planning multiple goals. J. Secur. Eng. 2(3), 228–233 (2005)
 25. Liu, F., Zeng, G.: Multiagent cooperative learning research based on reinforcement learning. In: G. Weiß (ed.) The 10th International Conference on Computer Supported Cooperative Work in Design, pp. 1–6 (2006)
 26. Mausam, Weld, D.S.: Solving concurrent Markov decision processes. In: Proceedings of the 19th National Conference on Artificial Intelligence, pp. 716–722. AAAI Press (2004)
 27. Ono, N., Fukumoto, K.: A modular approach to multiagent reinforcement learning. In: Weiß, G. (ed.) Distributed Artificial Intelligence Meets Machine Learning Learning in MultiAgent Environments. Lecture Notes in Computer Science, vol. 1221, pp. 25–39. Springer, Berlin (1997)
 28. Parr, R., Russell, S.: Reinforcement learning with hierarchies of machines. In: Advances in Neural Information Processing Systems, vol. 10, pp. 1043–1049. MIT Press (1997)
 29. Strösslin, T., Gerstner, W.: Reinforcement learning in continuous state and action space. Artif. Neural Netw. ICANN 2003, 4 (2003)
 30. Sutton, R., Precup, D., Singh, S.: Between MDPs and SemiMDPs: a framework for temporal abstraction in reinforcement learning. Artif. Intell. 112, 181–211 (1999)
 31. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
 32. Tosic, P.T., Vilalta, R.: A unified framework for reinforcement learning, colearning and metalearning how to coordinate in collaborative multiagent systems. Proc. Comput. Sci. 1(1), 2217–2226 (2010)
 33. Vasudevan, N., Venkatesh, P.: Design and implementation of a process migration system for the Linux environment. In: 3rd International Conference on Neural, Parallel and Scientific Computations. Atlanta, USA (2006)
 34. Watkins, C.: Learning from delayed rewards. Ph.D. thesis, Cambridge University, Cambridge, England (1989)
 35. Watkins, C., Dayan, P.: Technical Note: QLearning. Mach. Learn. 8(3), 279–292 (1992)
 36. Wu, B., Feng, Y., Zheng, H.: Modelbased Bayesian reinforcement learning in factored Markov decision process. J. Comput. 9(4), 845–850 (2014)
 37. Yong, C., Miikkulainen, R.: Coevolution of rolebased cooperation in multiagent systems. IEEE Trans. Auton. Ment. Dev. 1(3), 170–186 (2009)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.