Offline Pre-trained Multi-agent Decision Transformer

Offline reinforcement learning leverages previously collected datasets to learn optimal policies without access to the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorial growth of interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets and using them to examine the usage of the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.


Introduction
Multi-agent reinforcement learning (MARL) algorithms [1] play an essential role in solving complex decision-making tasks by learning from the interaction data between computerised agents and (simulated) physical environments. MARL has typically been applied to self-driving [2−4], order dispatching [5,6], modelling population dynamics [7], and gaming AIs [8,9]. However, learning policies from experience demands algorithms with manageable computational complexity [10] and high sample efficiency, given limited computing resources and the high cost of data collection [11−14]. Furthermore, even in domains where an online environment is feasible, we might still prefer to utilize previously collected data instead; for example, if the domain's complexity requires large datasets for effective generalization. In addition, a policy trained on one scenario usually cannot perform well on another, even within the same task. Therefore, a universal policy is critical for saving the training time of general reinforcement learning (RL) tasks.
Notably, recent advances in supervised learning have shown that the effectiveness of learning methods can be maximized when they are given very large modelling capacity and trained on very large and diverse datasets [15−17]. The surprising effectiveness of large, generic models supplied with large amounts of training data, such as GPT-3 [18], spurs the community to search for ways to scale up RL models and thereby boost their performance. Towards this end, the Decision Transformer [19] is among the first models to verify the possibility of solving conventional (offline) RL problems by generative trajectory modelling, i.e., modelling the joint distribution of the sequence of states, actions, and rewards without temporal difference learning.
The technique of transforming decision-making problems into sequence modelling problems has opened a new gate for solving RL tasks. Crucially, this activates a novel pathway toward training RL systems on diverse datasets [20−22] in much the same manner as in supervised learning, which is often instantiated by offline RL techniques [23]. Offline RL methods have recently attracted tremendous attention since they enable agents to apply self-supervised or unsupervised RL methods in settings where online collection is infeasible. We thus argue that this is particularly important for MARL problems, since online exploration in multi-agent settings may not be feasible in many settings [24], but learning with unsupervised or meta-learned [25] outcome-driven objectives from offline data is still possible. However, it is not yet clear whether the effectiveness of sequence modelling through transformer architectures also applies to MARL problems.
In this paper, we propose the multi-agent decision transformer (MADT), an architecture that casts the problem of MARL as conditional sequence modelling. Our mandate is to understand whether the proposed MADT can learn, through pre-training, a generalized policy on offline datasets that can then be effectively transferred to other downstream environments (known or unknown). As a study example, we specifically focus on the well-known challenge for MARL tasks, the StarCraft multi-agent challenge (SMAC) [26], and demonstrate the possibility of solving multiple SMAC tasks with one big sequence model. Our contributions are as follows: We propose a series of transformer variants for offline MARL by leveraging the sequential modelling of the attention mechanism. In particular, we validate our pre-trained sequential model in the challenging multi-agent environment for its sample efficiency and transferability. We build a dataset with different skill levels covering different variations of SMAC scenarios. Experimental results on SMAC tasks show that MADT enjoys fast adaptation and superior performance via learning one big sequence model.
The main challenges in our offline pre-training and online fine-tuning problems are the out-of-distribution problem and the mismatch between offline and online training paradigms. We tackle these two problems with the sequential model and by pre-training the global critic model offline.

Related work
Offline deep reinforcement learning. Recent works have successfully applied online RL to robotics control [27,28] and gaming AIs [29]. However, many works attempt to reduce the cost of online interaction by learning with neural networks from an offline dataset; these are named offline RL methods [23]. Offline RL methods can be divided into two classes: constraint-based and sequential model-based methods. A straightforward constraint-based method is to adopt an off-policy algorithm and treat the offline dataset as a replay buffer to learn a policy with promising performance. However, the experience in offline datasets and the interactions with online environments have different distributions, which causes overestimation in off-policy (value-based) methods [30]. Substantial work in offline RL aims at resolving this distribution shift between static offline datasets and online environment interactions [30−32]. In addition, relying on the dynamic planning ability of a transition model, Matsushima et al. [33,34] learn different models offline and regularize the policy efficiently. In particular, Yang et al. [35,36] constrain off-policy algorithms in the multi-agent field. Related to our work on improving sample efficiency, Nair et al. [37] derive the Karush-Kuhn-Tucker (KKT) conditions of the online objective, generating an advantage weight to avoid the out-of-distribution (OOD) problem. Among sequential model-based methods, the Decision Transformer outperforms many state-of-the-art offline RL algorithms by treating offline policy training as sequence modelling and evaluating the policy online [19,38]. In contrast, we present a transformer-based method in the multi-agent field, attempting to transfer across many scenarios without extra constraints. By sharing the sequential model across agents and learning a global critic network offline, we obtain a pre-trained multi-agent policy that can be continuously fine-tuned online.
Multi-agent reinforcement learning. As a natural extension of single-agent RL, MARL [1] has attracted much attention for solving more complex problems under Markov games. Classic algorithms often assume multiple agents interact with the environment online and collect experience to train the joint policy from scratch. Many empirical successes have been demonstrated in solving zero-sum games through MARL methods [8,39]. When solving decentralized partially observable Markov decision processes (Dec-POMDPs) or potential games [40], the framework of centralized training with decentralized execution (CTDE) is often employed [41−45], where a centralized critic gathers all agents' local observations and assigns credit. While CTDE methods rely on the individual-global-max assumption [46], another thread of work is built on the so-called advantage decomposition lemma [47], which holds in general for any cooperative game; this lemma leads to provably convergent multi-agent trust-region methods [48] and constrained policy optimization methods [49].
Transformer. The transformer [50] achieved a great breakthrough in modelling relations between input and output sequences of variable length in sequence-to-sequence problems [51], especially machine translation [52] and speech recognition [53]. Recent works even recast vision problems as sequential modelling and construct state-of-the-art (SOTA) pre-trained models, named vision transformers (ViT) [16,54,55].
Due to the Markovian property of trajectories in offline datasets, we can use the transformer much as in language modelling. The transformer's representation capability can therefore bridge the gap between supervised learning in the offline setting and reinforcement learning through online interaction. We observe that the components of a Markov game are sequential, and thus use a transformer for each agent to fit a transferable MARL policy. Furthermore, we fine-tune the learned policy via trial-and-error.

Methodology
In this section, we demonstrate how the transformer is applied in our offline pre-training MARL framework. First, we introduce the typical paradigm and computation process of multi-agent reinforcement learning and the attention-based model. We then introduce an offline MARL method in which the transformer sequentially maps between the local observations and actions of each agent in the offline dataset via parameter sharing, and the hidden representation is fed to MADT to minimize a cross-entropy loss. Furthermore, we describe how to integrate online MARL with MADT to construct the whole framework for training a universal MARL policy. To accelerate online learning, we load the pre-trained model as part of the MARL algorithm and learn the policy from the experience stored in the latest buffer collected from the online environment. To train a universal MARL policy that quickly adapts to other tasks, we bridge the gaps between scenarios in observations, actions, and available actions, respectively. Fig. 1 gives an overview of our method from the perspective of offline pre-training with supervised learning and online fine-tuning with MARL algorithms. The main contributions of this work are summarized as follows: 1) We construct an offline dataset for multi-agent offline pre-training on the well-known challenging task, SMAC. 2) To improve online sample efficiency, we propose fine-tuning the pre-trained multi-agent policy, instantiated with the sequence model, by sharing the policy among agents, and show the strong capacity of sequence modelling for multi-agent reinforcement learning in the few-shot and zero-shot settings. 3) We propose pre-training both an actor and a critic for fine-tuning with the policy-based network. In contrast to imitation learning, which only fits a policy network offline, MADT trains the actor and critic offline together and fine-tunes them online in an RL-style training scheme. We also give some empirical conclusions, such as the effect of reward-to-go in the online fine-tuning stage and of the multi-task padding method on SMAC.
Multi-agent reinforcement learning. A Markov game, the multi-agent extension of the Markov decision process (MDP), is described by a tuple ⟨S, A, R, P, n, γ⟩, where S denotes the state space of the n agents, A = A1 × A2 × ⋯ × An is the joint action space with Ai the action space of agent i, P : S × A → Δ(S) is the transition function emitting a distribution over the state space, Ri is the reward function of agent i, and γ ∈ [0, 1) is the discount factor. Each agent i takes actions following its policy πi from the policy space Πi, and aims to maximize its long-term reward E[Σt γ^t r_t^i], where r_t^i denotes the reward of agent i at time t. In the cooperative setting, we also use a reward R shared among agents for simplification.
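As a concrete illustration of the objective above, the discounted long-term reward can be computed by a backward recursion; this is a minimal sketch under the stated definitions, not code from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t, the long-term reward each agent maximizes.

    Computed with the backward recursion g_t = r_t + gamma * g_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For instance, `discounted_return([1, 1, 1], gamma=0.5)` evaluates to 1.75.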
Attention-based model. The attention-based model has shown stable and strong representation capability. Scaled dot-product attention uses the self-attention mechanism demonstrated in [50]. Let Q ∈ R^(Nq×dk) be the queries, K ∈ R^(Nk×dk) the keys, and V ∈ R^(Nk×dv) the values, where Nq and Nk are the element numbers of the different inputs and dk, dv are the corresponding element dimensions. Normally, Nq = Nk and dk = dv. The output of self-attention is computed as

Attention(Q, K, V) = softmax(QKᵀ/√dk)V

where the scalar 1/√dk prevents the softmax function from entering regions with very small gradients. Multi-head attention applies several such attention functions in parallel to linearly projected queries, keys, and values, and concatenates their outputs. The position-wise feed-forward network is another core module of the transformer. It consists of two linear transformations with a ReLU activation in between,

FFN(x) = max(0, xW1 + b1)W2 + b2

where the dimensionality of inputs and outputs is d_model and that of the inner feed-forward layer is d_ff.
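The attention and feed-forward computations above can be sketched in a few lines of NumPy; this is an illustrative implementation of the standard formulas, with function names and shapes chosen for clarity rather than taken from the paper's code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (N_q, N_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear maps with a ReLU in between,
    mapping d_model -> d_ff -> d_model."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

With zero queries the attention weights are uniform, so the output is simply the mean of the value rows, which makes the normalization easy to check.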

Multi-agent decision transformer
Algorithm 1 shows the offline training process of MADT for a single task: trajectories from the offline dataset are encoded autoregressively, and the transformer-based network is trained with supervised learning. We carefully reformulate the trajectories as inputs to the causal transformer, which differ from those in the Decision Transformer [19]: we drop the reward-to-go and the actions that are encoded together with the states in the single-agent DT. We interpret the reason for this in the next section. Similar to seq2seq models, MADT is based on an autoregressive architecture over the reformulated sequential inputs across time steps; the left part of Fig. 2 shows the architecture. The causal transformer encodes agent i's trajectory sequence up to time step t into a hidden representation with a dynamic mask. The output at time step t depends only on the previous data, and previously emitted actions are consumed as additional inputs when predicting a new action. Algorithm 1 (MADT-Offline) takes the available actions, the learning rate, the context length, and the maximum agent number as inputs; each trajectory is chunked into segments of the context length as ground-truth samples and masked where the episode-termination flag is true.

Trajectories reformulation as input. We take the lowest granularity at each time step of the static offline dataset as a modelling unit for a concise representation. MARL involves more elements than the single-agent case, such as the global state, local observations, and available actions, and it is reasonable for sequence modelling methods to model them within an MDP. We therefore formulate the trajectory as τ = (⋯, s_t, o_t^i, a_t^i, ⋯), where s_t denotes the globally shared state, o_t^i denotes the individual observation of agent i at time step t, and a_t^i denotes the action. We regard each (s_t, o_t^i, a_t^i) as a token and process the whole sequence similarly to the scheme in language modelling.
Output sequence construction. To bridge the gap between training with the whole context trajectory and testing with only previous data, we mask the context so that the output at time step t is produced autoregressively from the data before t. MADT thus predicts the action at each time step with the decoder as

a_t^i ∼ π_θ(· | τ_{<t}, s_t, o_t^i, u_t^i)

where θ denotes the parameters of MADT, τ_{<t} denotes the trajectory, including the global states and local observations, before time step t, and π_θ is a distribution over the legal action space defined by the available-action vector u_t^i.
Core module description. MADT differs from transformers in conventional sequence modelling tasks, which take position-encoded inputs and decode the hidden representation autoregressively. We use a masking mechanism with a lower-triangular matrix to compute the attention:

Attention(Q, K, V) = softmax((QKᵀ + M)/√dk)V

where M is the mask matrix ensuring that the input at time step t can only attend to the inputs from steps 1, …, t. We employ the cross-entropy (CE) loss as the total sequential prediction loss and use the available-action vector to force illegal actions to have zero probability. The CE loss can be represented as

L(θ) = −Σ_t Σ_i log π_θ(ā_t^i | τ_{<t}, s_t, o_t^i, u_t^i)

where ā_t^i is the ground-truth action and π_θ denotes the output distribution of MADT. This loss minimizes the distributional distance between the prediction and the ground truth.
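The causal mask and the available-action handling inside the CE loss can be sketched as follows; this is a minimal NumPy illustration of the mechanism, with hypothetical helper names, not the paper's implementation:

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask M: entry (t, t') is -inf for t' > t, so the
    softmax assigns zero weight to future time steps."""
    upper = np.triu(np.ones((T, T)), k=1)   # strictly-upper triangle
    return np.where(upper == 1, -np.inf, 0.0)

def masked_action_probs(logits, available):
    """Force illegal actions (available == False) to probability zero
    before normalizing, as the available-action vector does in MADT."""
    logits = np.where(available, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def ce_loss(logits, available, targets):
    """Cross-entropy between the predicted distribution and ground-truth actions."""
    p = masked_action_probs(logits, available)
    return -np.mean(np.log(p[np.arange(len(targets)), targets] + 1e-12))
```

Because masked logits become -inf, illegal actions receive exactly zero probability after the softmax, matching the constraint described above.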

Multi-agent decision transformer with PPO
The method above fits the data distribution well, owing to the sequence-modelling capacity of the transformer. However, a model pre-trained this way on offline datasets fails to improve continually when interacting with the online environment. The reason is the mismatch between the objectives of the offline and online phases. In the offline stage, the imitation-based objective conforms to a supervised learning style in MADT and does not measure each action with a value model. When the pre-trained model interacts with the online environment, the buffer only collects actions conforming to the distribution of the offline datasets rather than those yielding high reward at the current state. That is, the pre-trained policy is encouraged to choose actions matching the offline distribution even when this leads to low reward. Therefore, we design another paradigm, MADT-PPO, which integrates RL and supervised learning for fine-tuning in Algorithm 2; Fig. 2 shows the pre-training and fine-tuning framework. A direct method is to share the pre-trained model across agents and implement the REINFORCE algorithm [56]. However, an actor alone suffers from high variance, so a critic that assesses state values is necessary. In online MARL, we therefore leverage an extension of PPO, the state-of-the-art algorithm on StarCraft tasks, the multi-agent particle environment (MPE), and even the turn-based game Hanabi [57]. In the offline stage, we adopt the aforementioned strategy to pre-train a shared offline policy for each agent and additionally use the global state to pre-train a centralized critic. In the fine-tuning stage, we first load the offline pre-trained shared policy as each agent's initial online policy, and instantiate the centralized critic with the pre-trained critic model.
To fine-tune the pre-trained multi-agent policy and critic models, the agents clear the buffer and interact with the environment, learning the policy by maximizing the clipped PPO objective

L(θ) = E_t[min(r_t(θ)Â_t, clip(r_t(θ), 1−ϵ, 1+ϵ)Â_t)]

where r_t(θ) is the probability ratio between the new and old policies, Â_t is the estimated advantage, and ϵ is the clip ratio. In Algorithm 2 (MADT-Online), the actor and critic parameters are initialized, possibly inherited directly from the pre-trained models, together with the agent number n, the discount factor γ, and the clip ratio ϵ; trajectory segments of the context length are sampled from the buffer as ground truth.
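The clipped surrogate that MADT-PPO maximizes can be sketched as follows, written as a loss to minimize; this is a generic PPO-clip illustration under the usual definitions, not the paper's exact code:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """-E[min(r * A, clip(r, 1-eps, 1+eps) * A)], where the ratio is
    r = exp(logp_new - logp_old) and A is the estimated advantage."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies coincide, the ratio is 1 and the loss reduces to the negative mean advantage; clipping caps the incentive to move the ratio far from 1.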

Universal model across scenarios
To train a universal policy for all scenarios in SMAC, which may vary in agent number, feature space, action space, and reward range, we consider the modifications below.
Parameters sharing across agents. When offline examples are collected from multiple tasks, or the test phase has a different agent number from the offline datasets, the varying agent number across tasks makes it intractable to decide the number of actors. We therefore share parameters across all actors with one model and attach one-hot agent IDs to the observations for compatibility with a variable number of agents.
Feature encoding. When the policy needs to generalize to new scenarios that arise from different feature shapes, we propose encoding all features into a universal space by padding zero at the end and mapping them to a low-dimensional space with fully connected networks.
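The two compatibility tricks above, attaching a one-hot agent ID and zero-padding features to a shared dimensionality, can be sketched as follows; the helper names are hypothetical and 1-D observation vectors are assumed:

```python
import numpy as np

def attach_agent_id(obs, agent_idx, max_agents):
    """Append a one-hot agent ID so a single shared network can
    distinguish agents while handling a variable agent count."""
    one_hot = np.zeros(max_agents)
    one_hot[agent_idx] = 1.0
    return np.concatenate([obs, one_hot])

def pad_features(obs, target_dim):
    """Zero-pad an observation to the universal feature dimension shared
    across scenarios (a fully connected layer would then map it to a
    low-dimensional space)."""
    assert len(obs) <= target_dim
    return np.concatenate([obs, np.zeros(target_dim - len(obs))])
```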
Action masking. Another issue is the different action spaces across scenarios. For example, fewer enemies in a scenario means fewer potential attack options as well as fewer available actions. Therefore, an extra vector is utilized to mute the unavailable actions so that their probabilities are always zero during both the learning and evaluating processes.
Reward scaling. Different scenarios might vary in reward ranges and lead to unbalanced models during multi-task offline learning. To balance the influence of examples from different scenarios, we scale their rewards to the same range to ensure that the output models have comparable performance across different tasks.
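Reward scaling can be sketched as a linear rescaling into a common range; the target bound below is an arbitrary illustrative choice, not the paper's setting:

```python
import numpy as np

def scale_rewards(rewards, target_max=20.0):
    """Rescale a scenario's rewards so their maximum magnitude equals
    target_max, keeping examples from different scenarios comparable
    during multi-task offline learning."""
    r = np.asarray(rewards, dtype=float)
    return r * (target_max / max(np.abs(r).max(), 1e-8))
```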

Experiments
We present three experimental settings: offline MARL, online MARL with the pre-trained model loaded, and few-shot or zero-shot offline learning. For offline MARL, we verify the performance of our method by pre-training the policy and testing it directly on the corresponding maps. To assess the capacity of the pre-trained policy on the original or new scenarios, we fine-tune it in the online environment. Experimental results in offline MARL show that our MADT-offline in Section 3.1 outperforms the state-of-the-art methods. Furthermore, MADT-online in Section 3.2 improves sample efficiency across multiple scenarios. Besides, the universal MADT trained from multi-task data with MADT-online generalizes well to each scenario in a few-shot or even zero-shot setting.

Offline datasets
The offline datasets are collected by running the MAPPO policy [58] on the well-known SMAC task [26]. Each dataset contains a large number of trajectories. Different from D4RL [59], our datasets conform to the Dec-POMDP property, including local observations and available actions for each agent. In the appendix, we list the statistical properties of the offline datasets in Tables A1 and A2.

Offline multi-agent reinforcement learning
In this experiment, we aim to validate the effectiveness of the offline version of MADT in Section 3.1 as a framework for offline MARL on static offline datasets. We train a policy on offline datasets of various qualities and then apply it to the online environment, StarCraft [26]. Baselines under this setting include behavior cloning (BC), an imitation learning method with stable performance in single-agent offline RL. In addition, we employ the effective conventional single-agent offline RL algorithms BCQ [32], CQL [31], and ICQ [35], extended to the multi-agent setting by simply mixing each agent's value network as proposed by [35], denoted "xx-MA". We compare the performance of the offline MADT with the above-mentioned offline RL algorithms under online evaluation in the MARL environment. To verify the quality of our collected datasets, we chose data of different levels and trained the baselines as well as our MADT. Fig. 3 shows the overall performance on datasets of various qualities. The baseline methods improve stably, indicating the quality of our offline datasets. Furthermore, our MADT outperforms the offline MARL baselines and converges faster across easy, hard, and super hard maps (2s3z, 3s5z, 3s5z VS. 3s6z, and corridor). From the initial performance in the evaluation period, our pre-trained model yields a higher return than the baselines on each task. Moreover, our model can surpass the average performance in the offline dataset.

Offline pre-training and online finetuning
The experiments designed in this subsection intend to answer the question: Is the pre-training process necessary for online MARL? First, we compare the online version of MADT in Section 3.2 with and without loading the pre-trained model. When MADT is trained only from online experience, it can be viewed as a transformer-based MAPPO whose actor and critic backbone networks are replaced with the transformer. Furthermore, we validate that our framework, MADT with the pre-trained model, improves sample efficiency on most easy, hard, and super hard maps.
Necessity of the pre-trained model. We train our MADT on datasets collected from a map and fine-tune it on the same map online with the MAPPO algorithm. For fairness of comparison, we use the transformer as both the actor and critic networks with and without the pre-trained model. We choose three maps from the easy, hard, and super hard categories to validate the effectiveness of the pre-trained model in Fig. 4. Experimental results show that the pre-trained model converges faster than the algorithm trained from scratch, especially on challenging maps.
Improving sample efficiency. To validate the sample-efficiency improvement from loading our pre-trained MADT and fine-tuning it with MAPPO, we compare the overall framework with the state-of-the-art algorithm MAPPO [58] without the pre-training phase. We measure sample efficiency in terms of the time to threshold [60], i.e., the number of online interactions (timesteps) needed to achieve a predefined threshold; as Table 1 shows, our pre-trained model needs far fewer interactions than traditional MAPPO to achieve the same win rate.
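The time-to-threshold metric reported in Table 1 can be sketched as a straightforward reading of the definition in [60]; the function name and argument layout are hypothetical:

```python
def time_to_threshold(timesteps, win_rates, threshold):
    """Return the first number of online interactions at which the
    evaluation win rate reaches the threshold, or None if it never does."""
    for t, w in zip(timesteps, win_rates):
        if w >= threshold:
            return t
    return None
```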

Generalization with multi-task pretraining
Experiments in this section explore the transferability of the universal MADT described in Section 3.3, which is pre-trained with mixed data from multiple tasks. Depending on whether the downstream tasks have been seen or not, the few-shot experiments validate adaptability to seen tasks, while the zero-shot experiments target the held-out maps.
Few-shot learning. The results in Fig. 5(a) show that our method can utilize multi-task datasets to train a universal policy and generalize to all tasks well. Pretrained MADT can achieve higher returns than the model trained from scratch when we limit the interactions with the environment.
Zero-shot learning. Fig. 5(b) shows that our universal MADT can surprisingly improve performance on downstream tasks even if it has not been seen before (3 stalkers VS. 4 zealots).

Ablation study
The experiments in this subsection are designed to answer the following research questions. RQ1: Why should we choose MAPPO for the online phase? RQ2: Which kind of input makes the pre-trained model beneficial for online MARL? RQ3: Why does the offline version of MADT fail to improve during online fine-tuning after pre-training?
Suitable online algorithm. Although the choice of MARL algorithm for the online phase should be flexible according to the task, we design experiments here to answer RQ1. As discussed in Section 3, we can train a Decision Transformer for each agent and fine-tune it online with an MARL algorithm. An intuitive method is to load the pre-trained transformer as the policy network and fine-tune it with a policy gradient method, e.g., REINFORCE [56]; the comparison is shown in Fig. 6(a). Table 1 reports the number of interactions needed to achieve each target win rate (MAPPO / pre-trained MADT), where "−" means no more samples are needed to reach the target win rate and " " represents that the policy cannot reach the target win rate.

Dropping reward-to-go in MADT. To answer RQ2, we compare different inputs embedded into the transformer, including combinations of state, reward-to-go, and action. We find reward-to-go harmful to online fine-tuning performance, as shown in Fig. 6(b). We attribute this to a mismatch in the distribution of reward-to-go between offline data and online samples: the rewards of online samples are usually lower than those in the offline data due to stochastic exploration at the beginning of the online phase, which deteriorates the fine-tuning capability of the pre-trained model. Based on Fig. 6(b), we choose only states as inputs for pre-training and fine-tuning.
Integrating online MARL with MADT. To answer RQ3, we directly apply the offline version of MADT for pre-training and fine-tune it online. However, Fig. 6(c) shows that it does not improve during the online phase. We attribute this to the absence of any incentive to chase higher rewards: offline MADT is trained with supervised learning and tends to fit the collected experience even when the associated rewards are unsatisfactory.

Conclusions
In this work, we propose MADT, an offline pre-trained model for MARL that integrates the transformer to improve sample efficiency and generalizability in tackling SMAC tasks. MADT learns one big sequence model that outperforms state-of-the-art methods in offline settings, including BC, BCQ, CQL, and ICQ. When applied in online settings, the pre-trained MADT drastically improves sample efficiency. We apply MADT to train a generalizable policy over a series of SMAC tasks and evaluate its performance under both few-shot and zero-shot settings. The results demonstrate that the pre-trained MADT policy adapts quickly to new tasks and improves performance on different downstream tasks. To the best of our knowledge, this is the first work that demonstrates the effectiveness of offline pre-training and of sequence modelling through transformer architectures in the context of MARL.

Appendix A Properties of datasets
We list the properties of our offline datasets in Tables A1 and A2.

Appendix B Details of hyper-parameters
Details of the hyper-parameters used in the MADT experiments are listed in Tables B1−B5.

Fig. 6 Ablation results on a hard map for validating (a) the necessity of MAPPO in MADT-online, (b) the input formulation, and (c) the online version of MADT.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.