Introduction

Recent progress in multi-agent reinforcement learning (RL) has achieved impressive performance on many complex tasks [1,2,3,4]. Despite this progress, most of these methods are designed for single-task RL in stationary environments and face problems when applied to the non-stationary real world [2]. Agents trained on specialized tasks cannot achieve satisfactory generalization performance across multiple tasks. When solving a variety of real-world problems with a set of specialized policies, agents have to store the identity of each individual task in order to select the appropriate task-specific policy according to the environment. Unfortunately, reliable task identities are rarely observable in practice.

Rather than focusing on a single task, we argue that agents should be able to solve multiple tasks with various sources of reward. Recent work in multi-task RL (MTRL) has attempted to address this by achieving generalized performance across all tasks [5]. MTRL methods often focus on tasks that are explicitly described with natural language instructions, programs, or graph structures [6, 7]. However, such explicit knowledge descriptions and reliable task identities may not be readily available.

Existing MTRL methods usually address multi-task single-agent problems [8,9,10,11]. They use multi-task learning to generalize across all original tasks by sharing representations between related tasks. A more flexible approach is meta-RL, in which agents implicitly infer the task by interacting with the environment and quickly adapt to it [12,13,14]. However, these methods have focused on relatively simple tasks with a single agent and/or fully observable settings. By contrast, we propose a gradient-based Self-Adaptive Meta-Learning method, SAML, which combines meta-learning with MARL to address continuous adaptation to multi-task multi-agent problems.

SAML introduces a new MT-MARL framework that decomposes the training process into a single-task phase and a multi-task phase (Fig. 1). The first phase uses the centralized training with decentralized execution (CTDE) paradigm to learn a distinct policy for each single-task MARL problem, with the same Deep Recurrent Q-Network (DRQN) architecture controlling all agents within a task. The inputs and outputs of the scenarios are reconstructed into a unified form with the same dimension so that the task-specific DRQNs of the first phase can be used in the meta-learning process of the second phase. Each task-specific DRQN uses its policy to interact with the environment and collect supervised data for meta-learning in phase II. The second phase distills each specialized single-task action-value DRQN into a generalized recurrent multi-task DRQN while maximizing transfer and minimizing interference between tasks. Existing benchmarks mainly consider a single agent in multiple relatively simple tasks (e.g., Meta-World [15]) or multi-agent learning in a single-task scenario (e.g., SMAC [16]), which cannot meet our experimental needs. To validate continuous adaptation performance on complex MT-MARL tasks, we develop a meta-learning experimental environment based on the widely adopted StarCraft II benchmark SMAC. To the best of our knowledge, this is the first work to validate the continuous adaptation performance of multiple agents in complex StarCraft II multi-task scenarios.

Fig. 1

Overview of the self-adaptive meta-learning framework for multi-task multi-agent problems. Continuous adaptation is formalized as a two-phase approach. The approach first conducts single-task MARL to obtain task-specific policies. The inputs and outputs of the scenarios are reconstructed into a unified form with the same dimension so that the task-specific DRQNs of the first phase can be used in the meta-learning process of the second phase. Each task-specific DRQN uses its policy to interact with the environment and collect supervised data for meta-learning in phase II. Then, in phase II, we distill the task-specific policies into a unified policy with our proposed gradient-based self-adaptive meta-learning method (SAML)

The main novelties of this paper are summarized as follows:

  • A novel two-phase approach for continuous adaptation in multi-task MARL is proposed, which first conducts single-task MARL to obtain task-specific policies and then distills them into a unified DRQN that performs well on all tasks without explicit provision of the task ID.

  • We analyze the transfer-interference trade-off problem and propose a self-adaptive meta-learning method, SAML, to enhance and facilitate MT-MARL in phase II by maximizing transfer and minimizing interference between tasks.

  • To validate the proposed method, we extend the widely adopted StarCraft benchmark SMAC and develop a new multi-task multi-agent StarCraft environment, Meta-SMAC, to test various aspects of continuous adaptation in complex MT-MARL. Our experiments with a population of agents show that our method enables significantly more efficient adaptation than baselines such as FOMAML and REPTILE across different scenarios.

The rest of this paper is organized as follows. We first introduce the basic MARL theory and typical MARL and meta-learning methods in the next section. In the third section, we present the two-phase continuous adaptation framework in detail together with the gradient-based Self-Adaptive Meta-Learning method. The main experimental results and ablation studies are presented in the fourth section. Related work is then reviewed in the fifth section, and we conclude the paper in the last section.

Background

In this section, the basic theory and important concepts of MT-MARL are introduced. We also present typical MARL methods that learn a centralized value function under the CTDE framework, as well as typical meta-learning methods.

Multi-agent RL

We consider a fully cooperative multi-agent setting involving a set of agents in a shared environment. MARL methods fall into two classes: independent learners and joint-action learners [17]. Independent learners learn each agent’s action-value function independently (e.g., independent Q-learning), whereas joint-action learners observe the joint actions taken by all agents. Recently, the Centralized Training with Decentralized Execution (CTDE) paradigm has been employed to train MARL agents, especially under partial observability [18]. CTDE allows agents to train decentralized policies with access to global information during training and to make decisions based on the individually learned policies during execution. Recent works such as VDN and QMIX employ CTDE to optimize the joint action-value function through the individual action-value functions. VDN [19] represents the central action-value function as a sum of individual action-value functions. QMIX [3] represents the joint action-value function as a monotonic non-linear combination of per-agent action-value functions rather than just their sum. VDN’s representation satisfies Eq. 1; QMIX extends it and satisfies Eq. 3 whenever Eq. 2 holds:

$$\begin{aligned}&{Q_{\mathrm{tot}}}(\pmb {\tau },\mathbf{u} ) = \sum \limits _{i = 1}^n {{Q_i}({\tau ^i},{u^i})}. \end{aligned}$$
(1)
$$\begin{aligned}&\frac{{\partial {Q_{\mathrm{tot}}}}}{{\partial {Q_i}}} \ge 0,\quad \forall i \in A. \end{aligned}$$
(2)
$$\begin{aligned}&\mathop {\max }\limits _{\mathbf {u}} {Q_{\mathrm{tot}}}(\pmb {\tau }, \mathbf {u}) = {Q_{\mathrm{tot}}}\left( \pmb {\tau }, \mathop {\arg \max }\limits _{u^1} {Q_1}({\tau ^1},{u^1}), \ldots , \mathop {\arg \max }\limits _{u^N} {Q_N}({\tau ^N},{u^N})\right) . \end{aligned}$$
(3)
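
To make the value-decomposition idea concrete, the following minimal PyTorch sketch shows a QMIX-style monotonic mixing network; the layer sizes, names, and interfaces are illustrative assumptions rather than the configuration used in this paper. Taking the absolute value of the hypernetwork outputs enforces non-negative mixing weights, which guarantees Eq. 2 and hence the argmax consistency of Eq. 3; VDN (Eq. 1) corresponds to replacing the mixer with a plain sum.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Sketch of a QMIX-style mixing network (illustrative, not the paper's exact model)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks produce mixing weights/biases conditioned on the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) per-agent Q_i(tau^i, u^i); state: (batch, state_dim)
        b, n = agent_qs.shape
        w1 = torch.abs(self.hyper_w1(state)).view(b, n, -1)   # non-negative => dQ_tot/dQ_i >= 0
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, n), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)        # Q_tot(tau, u)
```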

Multi-task multi-agent RL

Multi-task multi-agent RL (MT-MARL) aims to learn a policy that performs well on a set of related source MARL tasks and transfers to a target task. MT-MARL is beneficial when the source tasks share common features [20], and challenging when the task description is not explicitly observable to agents during execution [8]. The partially observable MT-MARL problem extends the single-agent, fully observable setting [21]. A partially observable MT-MARL task \({T_j}\) is a tuple \(\left<{P_j}, {S_j}, {R_j}, D\right>\), where \({P_j}, {S_j}, {R_j}\) are, respectively, the state transition function, observation function, and reward function. The MT-MARL domain D is partially observable and denoted as a tuple \(\left<A, S, U, O, \gamma \right>\), where A is the set of agents, S is the environment state space, U is the joint action space, O is the joint observation space, and \(\gamma \) is the discount factor. In MT-MARL, a task \( {T_j}\) is sampled from domain D and consists of episodes \(e \in \{ 1,...,E\}\). The MT-MARL agents can observe the task ID, \(j \in \{ 1,...,J\}\), during learning, but not during execution. The objective is to find an optimal joint policy that achieves the maximal average discounted return over all tasks and episodes, \({R^*} = \frac{1}{J}\frac{1}{E}\sum \nolimits _{j = 1}^J \sum \nolimits _{e = 1}^E \sum \nolimits _{t} {\gamma ^t} {R_e}({s_t},{a_t})\) [21].
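
To make the objective concrete, the sketch below computes the average discounted return over tasks and episodes with the per-step discounting written out explicitly; the nested-list layout of the reward data is an assumption made only for illustration.

```python
def average_discounted_return(task_episode_rewards, gamma):
    """Average discounted return R* over J tasks and E episodes per task.

    task_episode_rewards[j][e] is assumed to be the list of rewards r_t
    collected at each time step t of episode e of task j.
    """
    total = 0.0
    J = len(task_episode_rewards)
    for episodes in task_episode_rewards:          # J tasks
        E = len(episodes)
        for rewards in episodes:                   # E episodes per task
            ret = sum(gamma ** t * r for t, r in enumerate(rewards))
            total += ret / (J * E)
    return total
```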

Meta-reinforcement learning

Recent meta-RL approaches fall into two broad categories: gradient-based meta-learners [22,23,24] and RNN-based meta-learners [25]. Gradient-based meta-learners, such as MAML [13] and REPTILE [23], aim to learn a good initialization of the agent’s policy network through gradient steps so that agents can rapidly adapt to an unseen task. RNN-based meta-RL methods perform adaptation by updating the hidden state of an RNN [25]. Other variants of RNN-based meta-learners have also been explored, such as temporal convolutions [26]. Our approach belongs to the first category, but differs from these works in that we combine meta-RL with existing MARL methods and consider the transfer-interference trade-off.
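
As a point of reference for the gradient-based category, the following is a minimal sketch of one REPTILE outer-loop step under assumed interfaces (`sample_task` and `task.loss` are placeholders, not part of the paper or of any library): a copy of the network is adapted to one task with a few SGD steps, and the meta-initialization is then moved toward the adapted weights.

```python
import copy
import torch

def reptile_outer_step(policy_net, sample_task, inner_steps=5, inner_lr=0.01, meta_lr=0.1):
    """One REPTILE outer step (sketch): theta <- theta + meta_lr * (theta_task - theta)."""
    task = sample_task()                          # assumed: returns an object with .loss(net)
    adapted = copy.deepcopy(policy_net)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                  # adapt the copy on this task
        opt.zero_grad()
        task.loss(adapted).backward()
        opt.step()
    with torch.no_grad():                         # move the meta-initialization
        for p_meta, p_task in zip(policy_net.parameters(), adapted.parameters()):
            p_meta += meta_lr * (p_task - p_meta)
```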

Methods

Figure 1 gives an overview of our approach, a two-phase curriculum for addressing MT-MARL problems. Our main idea is to employ two kinds of policies: task-specific policies and an adaptation policy. In phase I, we first conduct single-task MARL to obtain the task-specific policies. In phase II, we distill the task-specific policies into an adaptation policy that performs well on all tasks, using trajectories collected by the task-specific policies.

Phase I: Dec-POMDP single-task MARL

As Fig. 1 shows, the Dec-POMDP single-task MARL algorithm of phase I consists of agent networks, a mixing network [27], and a meta-experience transition module. All agents share the same agent network and use it to generate the action policy. The mixing network is a feed-forward neural network used for centralized training in the CTDE paradigm. The meta-experience transition module transforms the inputs and outputs of all tasks so that they have the same dimension.

Fig. 2

Overview of Dec-POMDP single-task MARL in phase I. The inputs and outputs of the scenarios are reconstructed into a unified form with the same dimension by the meta-experience module, so that the action-value DRQN of each single task in the first phase can be used in the meta-learning process of the second phase. The framework of single-task MARL is similar to the QMIX algorithm; details of the learning process are given in the Methods section

Figure 2 illustrates the learning process of single-task MARL in phase I. Each agent i represents its individual action-value function \(Q_{i}(\tau _{i}, u_{i})\) as a DRQN named Q_network_local. After receiving state \( s_{t} \) at each time step, the agent follows an \(\epsilon \)-greedy policy: a random action \( u_{t}\) is executed with probability \(\epsilon \); otherwise the action is produced by Q_network_local. We execute action \( u_{t} \) in the StarCraft II environment and receive reward \( r_{t} \) and next state \( s_{t+1} \). These experiences are stored in the replay memory and are later sampled and fed to the mixing network to produce the joint action-value function \( Q_{\mathrm{tot}} \). With the joint Q-value produced by the mixing network and the target Q-value produced by the Q_network_target network, we compute the MSE loss and perform a gradient descent step to update the parameters of Q_network_local. Q_network_target is reset with the weights of Q_network_local every C iterations until a satisfactory policy is learned.

The structure of the mixing network is shown in Fig. 2. The mixing network is a feed-forward neural network with non-negative weights generated by separate hypernetworks that take the global state as additional input [28]. The agent network is a DRQN with parameters \(\theta \) that represents the individual agent’s action-value function Q. After sampling a batch of b transitions from the replay memory, the MSE loss is computed on the joint Q-value following the double-DQN paradigm. We learn the parameters \(\theta \) of Q_network_local by performing gradient descent on the loss in Eq. 4:

$$\begin{aligned} L(\theta ) = \sum \limits _{i = 1}^b {[{{(y_i^{\mathrm{tot}} - {Q_{\mathrm{tot}}}(\tau ,u,s;\theta ))}^2}]}. \end{aligned}$$
(4)
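
The following is a minimal sketch of the loss computation in Eq. 4 under assumed tensor shapes and names (the recurrent hidden state of the DRQN is omitted for brevity, and the batch layout is an illustrative assumption, not the paper's implementation). The per-agent Q-values of the executed actions are mixed into \(Q_{\mathrm{tot}}\), and the target uses the target networks in double-DQN style: the online network selects the next actions and the target network evaluates them.

```python
import torch

def phase1_td_loss(q_local, q_target, mixer_local, mixer_target, batch, gamma=0.99):
    """Sketch of the phase-I TD loss (Eq. 4); shapes: obs (B, n_agents, obs_dim),
    actions (B, n_agents, 1), rewards/done (B, 1), state (B, state_dim)."""
    obs, state, actions, rewards, next_obs, next_state, done = batch

    # Q_i(tau^i, u^i) for the executed actions, mixed into Q_tot
    agent_qs = q_local(obs).gather(-1, actions).squeeze(-1)          # (B, n_agents)
    q_tot = mixer_local(agent_qs, state)                             # (B, 1)

    with torch.no_grad():
        next_a = q_local(next_obs).argmax(-1, keepdim=True)          # online net selects actions
        next_qs = q_target(next_obs).gather(-1, next_a).squeeze(-1)  # target net evaluates them
        y_tot = rewards + gamma * (1.0 - done) * mixer_target(next_qs, next_state)

    return ((y_tot - q_tot) ** 2).sum()                              # summed over the batch as in Eq. 4
```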

The meta-experience transition module transforms the inputs and outputs of the tasks. Note that the state spaces differ between tasks; therefore, the transitions sampled from the replay memory cannot be used to compute the MSE loss directly. We use the meta-experience transition module to reconstruct the samples into a unified form with the same dimension, and define a formal semantic mapping function \( \phi (.) \) to denote the mapping between different state spaces.

Definition 1

(Semantic Mapping Function)  Given three tasks \( {\tau _1} \), \( {\tau _2} \), and \( {\tau _3} \) with different state-space dimensions, suppose states \( {s_{{\tau _1}}} \) and \( {s_{{\tau _2}}} \) contain similar semantic information while \( {s_{{\tau _3}}} \) does not. Then there exists a mapping function \( \phi (.) \) into a common state space such that the following inequalities hold: \( dis(\phi ({s_{{\tau _1}}}),\phi ({s_{{\tau _2}}})) \) < \( dis(\phi ({s_{{\tau _1}}}), \phi ({s_{{\tau _3}}}))\) and \( dis(\phi ({s_{{\tau _1}}}),\phi ({s_{{\tau _2}}})) \) < \( dis(\phi ({s_{{\tau _2}}}),\phi ({s_{{\tau _3}}})) \), where dis(.) is the distance between two vectors [29].

By Definition 1, we can map the different states of each task \( {\tau _i} \) into the same semantic state space. This concept is often used in the domain adaptation literature [30, 31]. If the states of different tasks have the same semantics but different dimensions and can be transformed into the same latent state space, then we can easily transfer knowledge between these tasks, e.g., by zero-padding samples from smaller-scale environments or by exchanging semantically similar positions in the same latent state space.
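
As a concrete illustration of the zero-padding case of \( \phi (.) \), the sketch below pads states from smaller-scale tasks to one unified dimension; it assumes that semantically corresponding features already occupy matching positions, and the function name is purely illustrative.

```python
import numpy as np

def unify_state(state, unified_dim):
    """Zero-padding instance of the semantic mapping phi(.) (illustrative sketch)."""
    state = np.asarray(state, dtype=np.float32)
    padded = np.zeros(unified_dim, dtype=np.float32)
    padded[: state.shape[0]] = state   # assumes matching features share positions across tasks
    return padded
```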

Phase II: Dec-POMDP multi-task MARL

The second phase distills each single-task DRQN into a unified DRQN that performs well over all tasks without explicit provision of the task ID. Once a task-specific DRQN has been trained for each task, multi-task MARL can be formulated as a regression problem over Q-values, which is solved by alternating data collection and regression.

For data collection, agents use each task-specific DRQN (from phase I) to receive states and execute actions in the corresponding environment. The episode data of regression experiences \( ({s_t},{u_t}) \) are stored in a shared replay memory and used to compute the cross-entropy loss of the unified DRQN (see Algorithm 1). To improve performance, we also consider transfer and interference across the multiple tasks. As shown in Fig. 3, consider two arbitrary distinct examples \( ({s_i},{a_i}) \) and \( ({s_j},{a_j}) \) trained with SGD.
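
The sketch below illustrates the data-collection and regression step (not the exact Algorithm 1): each frozen task-specific DRQN acts greedily in its own environment, the visited state-action pairs go into a shared buffer, and the unified DRQN is trained to imitate them with a cross-entropy loss. The environment interface (`reset`/`step` returning a 3-tuple) and the absence of recurrent hidden states are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def collect_and_distill(task_envs, task_drqns, unified_drqn, optimizer, episodes_per_task=1):
    """Phase-II sketch: collect (state, action) pairs from task-specific teachers,
    then take one cross-entropy regression step on the unified DRQN."""
    buffer = []
    for env, teacher in zip(task_envs, task_drqns):
        for _ in range(episodes_per_task):
            s, done = env.reset(), False
            while not done:
                with torch.no_grad():
                    a = teacher(torch.as_tensor(s, dtype=torch.float32)).argmax(-1)
                buffer.append((torch.as_tensor(s, dtype=torch.float32), a))
                s, _, done = env.step(a.item())            # assumed env interface
    states = torch.stack([s for s, _ in buffer])
    actions = torch.stack([a for _, a in buffer])
    loss = F.cross_entropy(unified_drqn(states), actions)   # student imitates teacher actions
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```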

Fig. 3

A depiction of transfer and interference in weight space

Transfer occurs when Eq. 5 is satisfied, which enhances weight sharing between the two examples, while interference occurs when Eq. 6 is satisfied, which attenuates weight sharing.

$$\begin{aligned}&\frac{{\partial {L_i}}}{{\partial \theta }} \cdot \frac{{\partial {L_j}}}{{\partial \theta }} > 0, \end{aligned}$$
(5)
$$\begin{aligned}&\frac{{\partial {L_i}}}{{\partial \theta }} \cdot \frac{{\partial {L_j}}}{{\partial \theta }} < 0, \end{aligned}$$
(6)

where (\(\cdot \)) denotes the dot product. Maximizing weight sharing maximizes the potential for transfer, while minimizing weight sharing minimizes the potential for interference. In typical offline supervised learning, we can optimize the following objective over the stationary distribution of (s, a) pairs within the dataset D:

$$\begin{aligned} \theta = \arg \mathop {\min }\limits _\theta {E_{(s,a) \sim D}}[L(s,a)]. \end{aligned}$$
(7)

To maximize the potential for transfer and minimize the potential for interference, we can express our optimization objective as follows:

$$\begin{aligned} \theta = \arg \mathop {\min }\limits _\theta {E_{[({s_i},{a_i}),({s_j},{a_j})] \sim D}}\left[ L({s_i},{a_i}) + L({s_j},{a_j}) - \alpha \frac{\partial {L_i}}{\partial \theta } \cdot \frac{\partial {L_j}}{\partial \theta }\right] . \end{aligned}$$
(8)

Viewed in terms of gradients, the REPTILE algorithm approximately optimizes the following objective over a set of s batches [23]:

$$\begin{aligned} \theta = \arg \mathop {\min }\limits _\theta {E_{[{B_1},\ldots ,{B_s}] \sim D}}\left[ 2\sum \limits _{i = 1}^{s} \left[ L({B_i}) - \sum \limits _{j = 1}^{i - 1} \alpha \frac{\partial L({B_i})}{\partial \theta } \cdot \frac{\partial L({B_j})}{\partial \theta }\right] \right] , \end{aligned}$$
(9)

where \( {{B_1}} \),..., \( {{B_s}} \) are batches within D. This objective is similar in motivation to Eq. 8; the difference is that Eq. 8 accounts for gradient products both within and across batches, whereas Eq. 9 only considers them across batches. Therefore, we express our optimization objective as Eq. 10:

$$\begin{aligned} \theta = \arg \mathop {\min }\limits _\theta {E_{[{B_1},\ldots ,{B_s}] \sim D}}\left[ 2\sum \limits _{i = 1}^{s} \sum \limits _{j = 1}^{k} \left[ L({s_{i,j}},{a_{i,j}}) - \sum \limits _{(q,r) \prec (i,j)} \alpha \frac{\partial L({s_{i,j}},{a_{i,j}})}{\partial \theta } \cdot \frac{\partial L({s_{q,r}},{a_{q,r}})}{\partial \theta }\right] \right] , \end{aligned}$$
(10)

where \( ({s_{i,j}},{a_{i,j}}) \) denotes the j-th example of batch \( {B_i} \) and the inner sum runs over all examples \((q,r)\) that precede \((i,j)\), both within the same batch and in earlier batches. We sample s batches from D, and each batch contains k examples. Further details of SAML are provided in Algorithm 1.
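
Since Algorithm 1 is only referenced here, the following is a minimal Reptile-style sketch of how Eq. 10 can be optimized in practice (all names and step sizes are illustrative, and this is not a verbatim transcription of Algorithm 1): performing sequential SGD over the k examples of a batch and then interpolating back toward the pre-batch weights, and likewise across the s batches, approximately rewards positive gradient dot products both within and across batches, i.e., it maximizes transfer and minimizes interference.

```python
import copy
import torch

def saml_meta_update(net, batches, loss_fn, inner_lr=0.01, beta=0.1, meta_rate=0.3):
    """One sketch meta-update over s batches of k (state, action) examples.

    Assumes `net` is an nn.Module with float parameters only and that `loss_fn`
    is, e.g., the cross-entropy to the task-specific policies' actions.
    """
    opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
    theta_start = copy.deepcopy(net.state_dict())           # weights before all batches
    for batch in batches:                                    # s batches
        theta_batch = copy.deepcopy(net.state_dict())        # weights before this batch
        for states, actions in batch:                        # k examples per batch
            opt.zero_grad()
            loss_fn(net(states), actions).backward()
            opt.step()
        after = net.state_dict()                             # within-batch interpolation
        net.load_state_dict({k: theta_batch[k] + beta * (after[k] - theta_batch[k]) for k in after})
    after_all = net.state_dict()                             # across-batch interpolation
    net.load_state_dict({k: theta_start[k] + meta_rate * (after_all[k] - theta_start[k]) for k in after_all})
```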

Table 1 The details of the three multi-task scenarios
Fig. 4

Maps included in the heterogeneous scenario. Figure 4d shows the map 2s4z_vs_2s5z, an asymmetric map of the heterogeneous scenario that requires 2 allied Stalkers and 4 allied Zealots (left) to defeat 2 enemy Stalkers and 5 enemy Zealots (right); that is, 6 allied game units must defeat 7 enemies. The agents require a range of coordination skills, such as focusing fire and avoiding overkill, to attack the enemy units and win the game. Figure 4e shows the map 2s5z_vs_2s6z, which requires 7 allies to defeat 8 enemies. The maps 2s4z_vs_2s5z and 2s5z_vs_2s6z are regarded as different tasks, and the task-specific DRQN trained on 2s4z_vs_2s5z achieves poor performance on 2s5z_vs_2s6z (validated in Table 4)

Experimental evaluation

The experiments follow the two-stage curriculum: single-task MARL is first conducted to obtain task-specific policies, which are then distilled into a unified DRQN that performs well on all tasks without explicit provision of the task ID.

Environment

We focus on continuous adaptation across multiple multi-agent tasks. However, existing benchmarks mainly consider a single agent in multiple relatively simple tasks (e.g., Meta-World [15]) or multi-agent learning in a single-task scenario (e.g., SMAC [16]), which cannot meet our experimental needs. Therefore, we extend the widely adopted StarCraft benchmark SMAC and develop a new multi-task multi-agent StarCraft environment, Meta-SMAC, for testing various aspects of continuous adaptation. We reconstruct the inputs and outputs of the scenarios into a unified form so that the single-task action-value DRQNs share the same dimensions.

We design three experimental scenarios, a homogeneous scenario, a heterogeneous scenario, and a mixed scenario, to test the performance of the algorithms. The homogeneous scenario contains only one kind of game unit, while the heterogeneous and mixed scenarios contain multiple unit types, so that the three scenarios pose different levels of difficulty for generalization. Each contains eight micromanagement maps; the heterogeneous scenario, for example, is listed in Table 1. Figure 4d shows the map 2s4z_vs_2s5z, an asymmetric map of the heterogeneous scenario that requires 2 allied Stalkers and 4 allied Zealots (left) to defeat 2 enemy Stalkers and 5 enemy Zealots (right); that is, 6 allied game units must defeat 7 enemies. During each episode, the partially observing agents require a range of coordination skills, such as focusing fire and avoiding overkill, to attack the enemy units and win the game. In Sect. 4.4.1, seven maps are used to test the adaptation performance. Maps 5m_vs_6m and 1s5z_vs_1s6z are reserved for testing continuous adaptation to a new task in Sect. 4.5.

Baselines and experimental settings

Baselines. We compare against three baselines: (i) Multi-Task Learning (MTL), which trains a unified model to generalize across all original tasks by sharing representations between related tasks [32] and is optimized with a cross-entropy loss; (ii) REPTILE, a meta-learning method that uses only first-order gradient information [23]; and (iii) FOMAML, a meta-learning algorithm that ignores the second-derivative terms at the expense of losing some gradient information [23].

Table 2 Hyper-parameters used in the experiments

Experimental settings. Phase I is a MARL process, while phase II is a meta-learning process. In phase I, training is paused every 10,000 time steps and 32 test episodes are run to evaluate the algorithm's test win rate. In phase II, training is paused every 128 episodes and evaluated with 32 test episodes in the same way. We measure performance in terms of the mean win rate averaged over all tasks in each scenario shown in Table 1. The test win rate is the percentage of test episodes in which the trained agents defeat the enemy units within the permitted time limit. All resulting plots show the median of the mean performance as well as the 25th–75th percentiles over 5 runs with different seeds. The mean win rate is used to evaluate the average performance over all maps of a scenario in phase II. All hyper-parameters are listed in Table 2. The state, action, and reward settings are the same as in SMAC and can be found in the SMAC appendix [16].
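
For clarity, the sketch below shows how these evaluation statistics can be aggregated; the array layout (one row of 0/1 win flags per seed) is an assumption for illustration rather than the paper's evaluation code.

```python
import numpy as np

def summarize_win_rates(win_flags_per_seed):
    """win_flags_per_seed: (n_seeds, n_test_episodes) array of 0/1 win flags
    (e.g., 5 seeds x 32 test episodes). Returns the statistics plotted in the figures."""
    rates = 100.0 * np.mean(win_flags_per_seed, axis=1)   # test win rate per seed, in percent
    return {"median": np.median(rates),
            "p25": np.percentile(rates, 25),
            "p75": np.percentile(rates, 75)}
```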

Table 3 Median win rate of models trained by single-task MARL in phase I

Single-task performance validation

In phase I, we first conduct single-task MARL to obtain task-specific policies. Each task-specific DRQN model is trained until its median win rate is above 94%, as shown in Table 3. To illustrate the necessity of a general model, we also evaluate the median win rate of each task-specific DRQN model on the other maps in the same scenario. The results for the heterogeneous scenario are shown in Table 4. As the first row of Table 4 shows, the median win rate of the task-specific model for map 10m_vs_11m is about 97%. However, when we test this model on the other maps of the scenario, the median win rate on map 9m_vs_10m is relatively high, but on the remaining maps the median win rate is 0. Therefore, the task-specific DRQN model has no generalization ability.

Table 4 Median win rate of the trained models when tested on each task of the heterogeneous scenario

Multi-task performance validation

In phase II, we validate the generalization performance of the SAML algorithm in multi-task scenarios and its continuous adaptation performance when a new map is added.

Fig. 5

Adaptation performance validation on three scenarios. a–c show the training results of SAML, FOMAML, REPTILE, and MTL over 10,000 epochs. In these three scenarios, the SAML algorithm achieves the highest mean win rate: about 90% in the homogeneous scenario, 79% in the heterogeneous scenario, and 88% in the mixed scenario. All meta-learning algorithms consistently outperform the MTL algorithm across all scenarios, showing that a meta-adaptation policy can effectively improve generalization performance. In addition, SAML performs better than the FOMAML and REPTILE baselines in all scenarios, showing that considering the transfer-interference trade-off while training on multiple tasks is more effective than standard meta-learning. In d, SAML achieves a mean win rate of 91%, and the median win rate on the new task (map 5m_vs_6m) is about 86%. As e and f show, SAML also achieves the best performance in the heterogeneous and mixed scenarios. Our experimental results on StarCraft II environments show that the proposed method effectively learns a unified generalization policy and adapts more efficiently to unseen tasks than existing baselines across different scenarios

Fig. 6

Effects of network capacity. The capacity of the generalized DRQN model is set to 64, 128, and 256, respectively, so that we can train the DRQN model with different capacities and measure how much capacity increases the mean win rate in each scenario. Under the same conditions, after 10,000 epochs of training, the DRQN model with a capacity of 256 achieves mean win rates of 94%, 92%, and 93% in the homogeneous, heterogeneous, and mixed scenarios, respectively, which is close to the maximum mean win rate shown in Table 5

Table 5 Median win rate of trained models with different capacities

Generalization performance validation

The generalization performance of the DRQN model is tested in the homogeneous, heterogeneous, and mixed scenarios. The homogeneous scenario contains only one kind of game unit, while the heterogeneous and mixed scenarios contain multiple unit types. Figure 5a–c shows the training results of SAML, FOMAML, REPTILE, and MTL over 10,000 epochs. Each of these scenarios contains 7 single tasks, as shown in Table 1. In all three scenarios, the SAML algorithm achieves the highest mean win rate: about 90% in the homogeneous scenario, 79% in the heterogeneous scenario, and 88% in the mixed scenario. The MTL method has the lowest median win rate because it only uses multi-task regression rather than a meta-learning paradigm. All meta-learning algorithms consistently outperform the MTL algorithm across all scenarios, showing that a meta-adaptation policy can effectively improve generalization performance. In addition, SAML performs better than the FOMAML and REPTILE baselines in all scenarios, showing that considering the transfer-interference trade-off while training on multiple tasks is more effective than standard meta-learning. The median win rates of SAML, FOMAML, REPTILE, and MTL in the mixed scenario are shown in Table 4. SAML attains relatively high performance on all tasks of the mixed scenario; therefore, its DRQN model is a generalized model that can handle MT-MARL tasks without provision of task identities.

Continuous adaptation performance validation

We test the continuous adaptation performance when a new map is added to the experimental scenarios. As shown in Table 1, we add map 5m_vs_6m to the homogeneous scenario and map 1s5z_vs_1s6z to the heterogeneous and mixed scenarios. Figure 5d–f shows the mean win rate for the three scenarios and the median win rate on the new maps (5m_vs_6m and 1s5z_vs_1s6z). As Fig. 5d–f shows, SAML achieves the best continuous adaptation performance in all three scenarios: about 91% in the homogeneous scenario, 81% in the heterogeneous scenario, and 89% in the mixed scenario. Our experimental results on StarCraft II environments show that the proposed method effectively learns a unified generalization policy and adapts more efficiently to unseen tasks than existing baselines across different scenarios.

Effects of network capacity

To investigate the effect of network capacity on the SAML method, the capacity of the unified DRQN in phase II was initially set to 64, the same as the single-task DRQNs in phase I. In Fig. 5, we found that SAML does not reach a fully satisfactory generalization performance with this setting. We attribute this to the limited capacity of the unified DRQN. In this section, we therefore discuss the effect of the network size on learning a unified generalization policy. The capacity of the generalized DRQN model is set to 64, 128, and 256, respectively, so that we can train the DRQN model with different capacities and measure how much capacity increases the mean win rate in each scenario. As shown in Fig. 6, as the capacity of the DRQN model increases, the generalization performance of the model is significantly enhanced. Under the same conditions, after 10,000 epochs of training, the DRQN model with a capacity of 256 achieves mean win rates of 94%, 92%, and 93% in the homogeneous, heterogeneous, and mixed scenarios, respectively, which is close to the maximum mean win rate shown in Table 5. The results in Fig. 6 indicate that SAML is a feasible solution for complex MT-MARL problems, and with a high-capacity unified DRQN it has the potential to scale to large MT-MARL problems.

Related work

Recent progress in MARL has achieved impressive performance since the representative value-based MARL method QMIX [3] was proposed. Many strong MARL methods, such as Qatten [33], WQMIX [34], QPLEX [35], and ROMA [36], build on and improve QMIX, and all claim to outperform QMIX in their experiments. However, a recent study rethinking the importance of implementation tricks in multi-agent reinforcement learning [37] demonstrates that the experimental results of these works appear to depend heavily on such "implementation tricks". After minimal tuning, QMIX can attain extraordinarily high win rates and achieve state-of-the-art results in the StarCraft Multi-Agent Challenge (SMAC). Although QMIX is a powerful method for single-task MARL, we found that it cannot achieve satisfactory generalization performance across MT-MARL tasks. When solving a variety of real-world problems with a set of task-specific policies produced by a single-task MARL method, the agent has to store the identities of all individual tasks in order to select the appropriate task-specific policy according to the current environment. Unfortunately, reliable task identities are rarely observable in practice. Therefore, it is necessary to handle multi-task problems without relying on task identities.

A number of methods can achieve adaptation from a learned task to others. Distral [9] allows effective sharing of a "distilled" policy that captures common behavior across tasks. Each worker is trained to solve its own task while being constrained to stay close to the shared policy; knowledge gained in one task is thus distilled into the shared policy and implicitly transferred to other tasks. AATEAM [10] uses attention-based neural networks to cope with new teammates' behavior in real time: the attention network measures the similarity between past and new teammates, which helps adjust action selection online. MvDAN [11] is a multi-view deep attention network that approximates multiple view-specific policies in parallel and integrates them with attention mechanisms to generate a comprehensive strategy. MvDAN exploits the specific statistical properties of each view to learn a more comprehensive representation than single-view representation learning, which reduces redundancy between views and improves generalization performance. In contrast to these methods, we utilize the meta-learning paradigm to enhance and facilitate the MT-MARL process by maximizing transfer and minimizing interference between multiple tasks.

Meta-RL is a flexible approach that can implicitly infer the task at hand and quickly adapt to it. The general meta-learning paradigm [38] is often applied to problems of continuous adaptation: it learns a unified model that achieves generalized performance across multiple tasks. There is a long line of work on meta-learning, including early methods for learning update rules of neural models [39,40,41]. More recent approaches focus on learning optimizers for deep networks [42, 43], learning to learn implicitly via RL [25, 44], and generating model parameters [27, 45]. Gradient-based meta-learners, such as MAML [13] and REPTILE [23], aim to learn a good initialization of the agent's policy network by taking policy gradient steps, enabling agents to rapidly adapt to an unseen task. RNN-based meta-RL methods perform adaptation by updating the hidden states of an RNN [25, 46, 47]. Despite the advantages of meta-learners for multi-task RL, existing meta-learning approaches often focus on relatively simple tasks with a single agent and fully observable settings. Therefore, we propose a gradient-based self-adaptive meta-learning method that combines meta-learning with a MARL approach to deal with multi-task multi-agent problems.

StarCraft II has already been used as an RL environment due to the many interesting challenges inherent to the game [48]. The StarCraft Multi-Agent Challenge (SMAC) is built on the popular real-time strategy game StarCraft II and makes use of the SC2LE environment. SMAC focuses on decentralized micromanagement, where each unit is controlled by an independent learning agent. It uses the StarCraft II game engine to build a rich set of cooperative multi-agent problems that bring unique challenges, such as the non-stationarity of learning, multi-agent credit assignment [49], and the difficulty of representing the value of joint actions [3]. However, SMAC mainly addresses multi-agent learning in single-task scenarios, which cannot meet our experimental needs. Other existing benchmarks (e.g., Meta-World [15]) mainly consider a single agent in multiple simple tasks, which is much easier than the StarCraft environment. Therefore, we extend the widely adopted StarCraft benchmark SMAC and design a new multi-task multi-agent StarCraft environment, Meta-SMAC, for testing various aspects of continuous adaptation.

Conclusion

In this work, we proposed a gradient-based meta-learning approach called SAML for continuous adaptation in non-stationary MT-MARL tasks. The key idea is to formalize the problem as a two-stage curriculum: the approach first conducts single-task MARL to obtain task-specific policies and then distills them into a unified DRQN with the SAML meta-learning method, maximizing transfer and minimizing interference between multiple tasks. To evaluate the latter, we designed the Meta-SMAC environment based on the widely adopted StarCraft benchmark SMAC and defined iterated adaptation games that allow us to test various aspects of adaptation strategies. Our experimental results on StarCraft II environments show that the proposed method learns a unified generalization policy more efficiently than the baselines across different scenarios and adapts more efficiently to unseen tasks.