Introduction

Behavior trees (BTs) are a popular control tool in robotics and computer games, with applications in real-time strategy games [1, 2], unmanned aerial vehicles [3, 4], mobile robots [5,6,7], and more. They are widely appreciated for their modularity, scalability, and reactivity [8]. However, designing a good BT for a specific scenario requires substantial human knowledge and effort [9]. Moreover, the structure of a manually designed BT is fixed, so it cannot adapt dynamically to the decision-making environment.

Some works have focused on integrating reinforcement learning with the design of BTs to enhance their adaptability in dynamic environments. The work in Ref. [10] learns the fallback node to decide whether a child node should be executed. The work in Ref. [11] uses Q-learning to optimize the structure of the tree by selecting a proper node for each control node. Much work has been done to substitute an action node in the tree with an entire reinforcement learning model: Ref. [12] uses Q-learning and Ref. [13] uses Proximal Policy Optimization to learn an action node. Further, the work in Ref. [14] proposes a hierarchical reinforcement learning approach, MAXQ, which optimizes control nodes in the higher layer and learns the action node in the lower layer. However, these approaches need to construct a separate sub-scenario to train the reinforcement learning model and then embed the trained model into the tree. They are only feasible for simple problems and are hard to apply to complex games, which often involve multiple agents. Various multi-agent reinforcement learning (MARL) methods have been proposed, such as MADDPG [15], QMIX [16], and MAPPO [17]. However, they have only been examined on mini-games such as StarCraft II micromanagement tasks [18] and simple ball games [19]. As the number of agents increases, MARL methods fail to obtain a reasonable solution within a finite time.

In this paper, we combine MARL methods with BTs to solve complex problems. A BT is a good tool for decomposing a large task into multiple smaller sub-tasks [20], which can then be solved by MARL methods. In this way, we do not have to solve the complex problem directly with MARL methods, which would require considerable time and expensive resources. We propose a framework, named MARL-BT, to embed MARL methods into BTs. Different from previous works, we neither need to construct an independent sub-scenario nor run over the whole game. We design a procedure to train the MARL model during the run of the BT. Samples for the MARL method are collected whenever the corresponding sub-task is activated. An episode for the MARL method does not correspond to a full run of the game but only to a segment of it, which improves the efficiency of collecting samples.

Further, we notice a phenomenon that arises among sub-tasks when combining BTs with MARL. The sub-tasks decomposed by BTs have different priorities and may control some common agents. A conflict occurs when a learning-based sub-task with lower priority and a rule-based sub-task with higher priority control the same agent. When both tasks are activated, the action computed for the agent by the learning-based task is not executed. For example, in the StarCraft II game, some sappers on mining tasks may be urgently dispatched to perform an offensive task. We define such conflicts as unexpected interruptions. When an interruption happens, the action computed by the agent network is in fact not executed; however, the sample is still stored in the replay buffer. Training the network with such bad samples leads to higher value prediction errors. These errors also accumulate during training, which hinders finding a good policy. Therefore, we design an action masking technique that removes the impact of actions generated by other sub-tasks.

To clarify the problem, we provide an example in Fig. 1. The BT decomposes the soccer backfield task into penalty-area defense, left-side, middle, and right-side sub-tasks, as shown in the right sub-figure. In the BT, the middle sub-task is realized by the MARL method, while the others are realized by rules. The penalty-area defense sub-task has the highest priority, which means players belonging to other tasks may be reassigned to it. The conflicting use of a common player between two sub-tasks is called an unexpected interruption.

Fig. 1 An example illustrating the combination of BT with MARL

Finally, we conduct extensive experiments on the 11 versus 11 full game in Google Research Football to examine the proposed methods. Within the MARL-BT framework, we use a MARL method to replace a certain sub-task of the BT. We find that the MARL method converges quickly and performs better than the rules. With the trained model, the performance of the BT is also significantly improved: compared with the pure BT, MARL-BT improves the win rate by around 11.507% for certain scenarios. The action masking technique greatly improves the performance of the learning method, i.e., the final reward is improved by around 100% for a certain sub-task.

Background

Behavior trees

BTs originated in the realm of computer games [21], serving as integral planning and decision-making tools [12] for the effective modeling and control of autonomous agents. A BT consists of nodes that perform specific actions or check conditions. These nodes are categorized into three main types: control nodes, action nodes, and condition nodes. Control nodes govern the flow of execution in the BT and are divided into three categories: Sequence, Fallback, and Parallel. A Sequence node returns Success only when all of its children succeed, and a Fallback node returns Failure only when all of its children fail. A Parallel node with M children ticks all of them in every iteration; it returns Failure if any child returns Failure and Success if all children return Success. An action node is always a leaf node, employed to execute the specific action associated with it, and it returns the execution outcome to its parent based on the status of the action execution. A condition node corresponds to the if-else structure in programming languages, assessing whether the current environmental state satisfies the specified logical condition: if the condition is True, the node returns Success to its parent; if it is False, it returns Failure.

Periodically, the execution of a BT starts from its root node, driven by a tick signal that traverses the tree branches in accordance with the characteristics of each node type [22]. A node is eligible for execution only upon receipt of the tick signal. Upon execution, a child node promptly communicates its status to its parent: Running if its execution is ongoing, Success if its objective has been attained, or Failure otherwise. The tick signal allows a BT to respond and adapt to changes in the environment or the agent's state in real time. By regularly updating the tree's nodes, the agent can make informed decisions and perform appropriate actions based on the current state.
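
The following minimal Python sketch (our own illustration, not tied to the authors' implementation or any particular BT library) shows the node semantics and tick propagation described above; the Parallel node is omitted for brevity.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Sequence:
    """Returns Success only if all children succeed; the first non-Success status is propagated."""
    def __init__(self, children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Returns Failure only if all children fail; the first non-Failure status is propagated."""
    def __init__(self, children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Condition:
    """Leaf node wrapping a boolean predicate over the environment state."""
    def __init__(self, predicate):
        self.predicate = predicate

    def tick(self, state):
        return Status.SUCCESS if self.predicate(state) else Status.FAILURE

class Action:
    """Leaf node executing a task-specific function that itself returns a Status."""
    def __init__(self, run_fn):
        self.run_fn = run_fn

    def tick(self, state):
        return self.run_fn(state)

# Example tick: "if the ball is in our half, defend; otherwise hold position."
defend = Action(lambda s: Status.RUNNING)
hold = Action(lambda s: Status.SUCCESS)
root = Fallback([Sequence([Condition(lambda s: s["ball_in_our_half"]), defend]), hold])
status = root.tick({"ball_in_our_half": True})   # -> Status.RUNNING
```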

Multi-agent reinforcement learning

A multi-agent game can be modeled as a multi-agent extension of the Markov Decision Process [23]. It is represented by the tuple \(\left\langle {\mathcal {N}}, {\mathcal {S}}, {\mathcal {A}}, R, P\right\rangle \), where \({\mathcal {N}}=\{1,2, \ldots , N\}\) is the set of agents. \({\mathcal {S}}=\left\{ S^{t}\right\} _{t=0}^{T}\) is the set of environment states, where \(S^{t}\) is the state at time t, T is the maximum number of time steps, and the terminal state \(S^{T}\) is the final state reached when the stopping condition is satisfied. \({\mathcal {A}} = \{A_i\}_{i=1}^{N}\) represents the action spaces of all agents. \(R: {\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathcal {R}}\) is the reward function. \(P=\{P_{ss'}^a\mid s,s' \in {\mathcal {S}},a\in {\mathcal {A}}\}\) is the state transition function, where \(P_{ss'}^a\) gives the probability of moving from state s to state \(s'\) when action a is taken. To compute the optimal policy, the state value function \(V_{\pi }(s)\) under policy \(\pi \) is introduced; it is the expected accumulated reward with discount factor \(\gamma \in [0,1]\), see (1), where \({\mathbb {E}}\) denotes the expectation:

$$\begin{aligned} V_{\pi }(s)={\mathbb {E}}_{\pi }\left\{ \sum _{k=0}^{\infty } \gamma ^{k} r^{t+k+1} \,\bigg |\, S^t = s \right\} \end{aligned}$$
(1)

Based on \(V_{\pi }(s)\), the state-action value \(Q_{\pi }(s,a)\) is defined as (2)

$$\begin{aligned} Q_{\pi }\left( s, a\right)&= {\mathbb {E}}_{\pi }\left\{ \sum _{k=0}^{\infty } \gamma ^{k} r^{t+k+1} \,\bigg |\, S^{t}=s, A^{t}=a\right\} \nonumber \\&= r^{t+1}+ \gamma \sum _{s'} P_{ss'}^a V_{\pi }(s') \end{aligned}$$
(2)

The optimal policy can be obtained by iteratively updating the Q-value function; a popular update rule is that of Q-learning, see (3), where \(\alpha \) is the learning step size:

$$\begin{aligned} Q(s,a) = Q(s,a) + \alpha \left[ r + \gamma \max _{a'}Q(s',a')-Q(s,a)\right] \end{aligned}$$
(3)
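
As a concrete illustration of update (3), the following tabular sketch (our own toy example, with variable names chosen by us) performs a single Q-learning step:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a 5-state, 3-action table updated with one observed transition.
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
```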

For multi-agent games, a popular solution approach is value-decomposition based MARL [16, 23, 24]. Each agent i is associated with a neural network, which computes an individual value \(Q_{i}\) based on the agent's observation. During training, the global state-action value \(Q_{\textrm{tot}}\) is computed as a function of the \(Q_{i}\) of all agents. Different algorithms have been proposed to model the relation between \(Q_{\textrm{tot}}\) and \(Q_{i}\). For example, VDN [24] expresses \(Q_{\textrm{tot}}\) as the sum of the \(Q_{i}\), i.e., \(Q_{\textrm{tot}}=\sum _{i} Q_{i}\). QMIX [16] uses a continuous monotonic function in the form of a mixing network to express this relation, i.e., \(Q_{\textrm{tot}}=f\left( Q_{1}, Q_{2}, \ldots , Q_{n}\right) \). Value-decomposition based methods should satisfy the Individual-Global-Max (IGM) principle [25], shown in (4). It guarantees that a global argmax performed on \(Q_{\textrm{tot}}\) yields the same result as the set of individual argmax operations performed on each \(Q_{i}\):

$$\begin{aligned} \underset{{\varvec{a}} \in {\mathcal {A}}}{\arg \max }\, Q_{\textrm{tot}}({\varvec{o}}, {\varvec{a}}) =\left( \underset{a_{1} \in {\mathcal {A}}}{\arg \max }\, Q_{1}\left( o_{1}, a_{1}\right) , \ldots , \underset{a_{n} \in {\mathcal {A}}}{\arg \max }\, Q_{n}\left( o_{n}, a_{n}\right) \right) \end{aligned}$$
(4)
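
To make the two mixing schemes concrete, the sketch below contrasts VDN-style summation with a QMIX-style monotonic mixing network. It is a simplified PyTorch illustration; the layer sizes and names are our own choices, not necessarily the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vdn_mix(q_values):
    """VDN: Q_tot is the plain sum of the individual Q_i (q_values: [batch, n_agents])."""
    return q_values.sum(dim=-1, keepdim=True)

class QMixMixer(nn.Module):
    """QMIX: Q_tot = f(Q_1,...,Q_n), with mixing weights generated from the global state
    and forced non-negative so that Q_tot is monotonic in each Q_i (hence IGM holds)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, q_values, state):   # q_values: [batch, n_agents], state: [batch, state_dim]
        batch = q_values.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(batch, 1, -1)
        hidden = F.elu(torch.bmm(q_values.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, -1, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(batch, 1)   # Q_tot: [batch, 1]
```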

Integrating reinforcement learning with BTs allows the BTs to adopt adaptive strategies. Combining BTs with reinforcement learning is not an entirely new concept; several works integrate reinforcement learning into BTs to reduce the effort of programming them manually. References [11, 26] generate BTs through reinforcement learning, while the more common approach is to embed reinforcement learning as a learning node into a predefined BT to improve its adaptability [10, 27]. However, current research on combining BTs with reinforcement learning focuses on simple single-agent tasks, i.e., BTs combined with single-agent reinforcement learning. Furthermore, the learning process is independent of the BT: training is conducted in a separate sub-scenario, and once a learned model is obtained, it is embedded into the tree. For complex multi-agent game tasks, such a separate sub-scenario is normally hard to construct, and it is better to train the model during the run of the behavior tree. We present such a training procedure in this paper.

Methodology

In this section, we present a framework for combining BTs with MARL algorithms, named MARL-BT, and provide a detailed description of its training procedure. In addition, we introduce the unexpected interruptions that can occur during MARL-BT training and present a solution to this problem.

Fig. 2 The framework of MARL-BT

MARL-BT architecture

A BT is designed to decompose a complex task into multiple sub-tasks with different goals. The number of sub-tasks (action nodes) in the BT is denoted by J. The set of all agents is denoted by \({\mathcal {N}} = {\mathcal {N}}_1 \cup {\mathcal {N}}_2 \cup \ldots \cup {\mathcal {N}}_J\). Each sub-task j is described by the tuple \(\langle {\mathcal {N}}_j, {\mathcal {T}}_j,g_j,p_j,c_j \rangle \), where \({\mathcal {N}}_j \subseteq {\mathcal {N}}\) is the set of agents assigned to sub-task j, \( {\mathcal {T}}_j\) is the set of time steps during which it runs, \(g_j\) is its goal, and \(p_j\) is its priority value. \(c_j\) is a binary value, where 1 means sub-task j is activated and 0 means it is not. All sub-tasks are executed following the control flow of the BT, i.e., Sequence, Fallback, or Parallel. The whole MARL-BT framework is shown in Fig. 2. It consists of two main parts: the left part involves sample collection and the right part concerns network updating. The sample-collection part is responsible for interacting with the environment and generating samples, which are then saved into the replay buffer. Its core module is the BT with a MARL sub-task. The network structure of the MARL sub-task is shown in the right part; it is updated with samples from the replay buffer.
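
For illustration, the sub-task tuple \(\langle {\mathcal {N}}_j, {\mathcal {T}}_j, g_j, p_j, c_j\rangle \) could be represented by a simple data structure such as the following; this is a sketch with field names of our own choosing, as the paper does not prescribe a concrete implementation.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class SubTask:
    agents: Set[int]                                    # N_j: agents assigned to this sub-task
    time_steps: Set[int] = field(default_factory=set)   # T_j: time steps in which it was active
    goal: str = ""                                      # g_j: goal of the sub-task
    priority: int = 0                                   # p_j: higher value = higher priority
    active: bool = False                                # c_j: True (1) if activated this tick
    use_marl: bool = False                              # whether the node is realized by a MARL policy
```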

Algorithm 1 Training Procedure of MARL-BT

The entire sampling and learning procedure of the MARL-BT framework is delineated in Algorithm 1. The input is a BT with a MARL node, and the output is the trained MARL model parameterized by \(\theta \). Sample collection follows the operational mechanism of BTs, i.e., the execution process of the BT. When a MARL sub-task j receives the tick signal (Step 9), the action set \(A_{{\mathcal {N}}_j}^{t}\) of the MARL node's agents and the action masking vector \(\delta ^t\) are obtained by Algorithm 2. \(A_{{\mathcal {N}}_j}^{t}\), along with the actions \(A_{{\mathcal {N}}_k}^{t}\) generated by other active sub-tasks, is passed to the environment. At the next time step \(t+1\), a reward \(r^{t}\), the global state \(S^{t+1}\), and the observations \(O^{t+1}\) are obtained from the environment. A sample \((O^t,A_{{\mathcal {N}}_j}^{t},O^{t+1},r^t,\delta ^t)\) is then acquired and stored in the replay buffer.
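
The sketch below paraphrases this sample-collection loop in Python. It is only a schematic reading of the procedure described above; the environment, buffer, and node interfaces are our own assumptions, not the authors' code.

```python
def run_episode(env, bt, marl_node, buffer, update_fn, update_every=200):
    """Collect samples by running the BT; train the MARL node periodically ('UPDATE')."""
    obs, state = env.reset(), env.global_state()
    step = 0
    while not env.done():
        # Ticking the BT activates sub-tasks: rule-based nodes return their actions,
        # and the MARL node returns agent actions plus the masking vector (Algorithm 2).
        rule_actions = bt.tick(state)
        marl_actions, mask = marl_node.act(obs) if marl_node.active else ({}, None)
        next_obs, next_state, reward = env.step({**rule_actions, **marl_actions})
        if marl_node.active:
            buffer.add((obs, marl_actions, next_obs, reward, mask))
        obs, state, step = next_obs, next_state, step + 1
        if step % update_every == 0:      # regulate the network-update frequency
            update_fn(buffer)
```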

Finally, we use the keyword ‘UPDATE’ to regulate the frequency of updating the neural networks (Steps 22–26). Specifically, the network structure comprises \( \vert {\mathcal {N}}_{j} \vert \) agent networks and a mixing network, following the idea of QMIX [16]. Each time, we draw samples from the replay buffer and feed them to the agent networks. The Q-values are computed for all agents and then processed by the designed masking mechanism to handle the unexpected interruptions introduced in section “Learning with unexpected interruptions”. A mixing network then computes the total Q-value \(Q_{\textrm{tot}}\), and the loss for the whole network is computed as in (5) and (6). The loss computation follows the double-network DQN approach [28], where \(\theta '\) denotes the parameters of the target networks and \(\theta \) the parameters of the current networks:

$$\begin{aligned} Q_{\textrm{tot}}(O^t,A_{{\mathcal {N}}_j}^{t};\theta ) = f\left( Q_i(o^t_i,a^t_i;\theta ),\ i \in {\mathcal {N}}_j\right) \end{aligned}$$
(5)
$$\begin{aligned} \text {loss}&= \frac{1}{\vert {\mathcal {T}}_{j} \vert }\sum _{t=1}^{\vert {\mathcal {T}}_{j} \vert }\Big (r^{t} + \gamma \max _{A}Q_{\textrm{tot}}(O^{t+1},A;\theta ')\nonumber \\&\quad -Q_{\textrm{tot}}(O^t,A_{{\mathcal {N}}_j}^{t};\theta )\Big )^2 \end{aligned}$$
(6)
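
A compact sketch of this TD loss (Eqs. (5) and (6)) is given below, assuming a QMIX-style mixer such as the one sketched earlier; the tensor shapes and helper names are our own.

```python
import torch

def mixed_td_loss(agent_qs, target_agent_qs, mixer, target_mixer,
                  states, next_states, rewards, gamma=0.99):
    """agent_qs: chosen Q_i(o_i, a_i) per agent, shape [batch, n_agents].
    target_agent_qs: max_a Q_i(o'_i, a) from the target agent networks, same shape."""
    q_tot = mixer(agent_qs, states)                          # Eq. (5): Q_tot(O, A; theta)
    with torch.no_grad():
        target_q_tot = target_mixer(target_agent_qs, next_states)
        td_target = rewards + gamma * target_q_tot           # bootstrapped target with theta'
    return ((td_target - q_tot) ** 2).mean()                 # Eq. (6): mean squared TD error
```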

Learning with unexpected interruptions

In this section, we describe the unexpected interruptions that arise in the MARL-BT framework. Multiple sub-tasks are organized by the control flow nodes of the BT, and many of them run in overlapping time periods. When they control some common agents, their different priorities ensure that no execution conflict occurs among sub-tasks. For a sub-task realized by a MARL method, a higher-priority sub-task that controls common agents may also be activated. As a result, some agents controlled by the MARL method may be scheduled by higher-priority sub-tasks during its decision period, and they come back to the MARL sub-task once those sub-tasks release control of them. We define the overriding of MARL-computed agent actions by higher-priority sub-tasks as unexpected interruptions. Such interruptions affect the learning efficiency of MARL methods, because the interrupted actions are still taken into account in the computation of each agent's Q-value. This amplifies the bias of the learned value function, leading to high prediction errors that affect the updating of the agents' Q-values. These errors accumulate as the number of unexpected interruptions grows, which hinders finding a favorable strategy.

Algorithm 2 MARL Node Action and Masking Vector Calculation

Here we introduce an action masking mechanism to deal with unexpected interruptions. In the BT, the set of indexes of all sub-tasks is denoted by \({\mathcal {J}} = \hat{{\mathcal {J}}} \cup \bar{{\mathcal {J}}}\), where \(\hat{{\mathcal {J}}}\) is the set of sub-tasks realized by MARL methods and \(\bar{{\mathcal {J}}}\) is the set of sub-tasks realized by rules. For a MARL node \(j \in \hat{{\mathcal {J}}}\), considering that its agents may also be included in other nodes, we compute the agent action set \( A_{{\mathcal {N}}_j}^{t} \) and the action masking vector \( \delta ^t \) by Algorithm 2. In Step 2, each agent \(i \in {\mathcal {N}}_j\) computes its action \(a^t_{i}\) based on its observation \(o^t_{i}\); to facilitate exploration, we employ the \(\epsilon \)-greedy strategy. Meanwhile, the action masking vector is computed according to Eq. (7): \(\delta _{i}^t = 0\) if there exists an activated node k (\(c_k = 1\)) with higher priority \(p_k\) that generates actions for a common agent i, i.e., \(i \in {\mathcal {N}}_j \cap {\mathcal {N}}_k\). With the action masking vector, the transition is expressed as \((O^t,A_{{\mathcal {N}}_j}^{t},O^{t+1},r^t,\delta ^t)\). To improve learning performance, we also adjust the reward of such samples: if \(r^t=0\), we set \(r^t=r^t+\varepsilon \), where \(\varepsilon \) is a small positive value. Then \(Q_{\textrm{tot}}\) is computed as in (8), and the loss is still computed as in (6).

$$\begin{aligned} \delta _{i}^t=\left\{ \begin{array}{ll} 0 , &{}\quad i \in {\mathcal {N}}_j \cap {\mathcal {N}}_k,\ c_j=c_k=1, \;\text { and }\; p_k > p_j \\ 1, &{}\quad \text {otherwise} \end{array} \right. \end{aligned}$$
(7)
$$\begin{aligned} Q_{\textrm{tot}}(O^t,A_{{\mathcal {N}}_j}^{t};\theta ) = f\left( Q_i(o^t_i,a^t_i;\theta )\cdot \delta ^t_{i},\ i \in {\mathcal {N}}_j\right) \end{aligned}$$
(8)
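
The sketch below illustrates Eqs. (7) and (8): building the masking vector for a MARL sub-task and applying it to the individual Q-values before mixing. It reuses the illustrative SubTask structure sketched earlier, and all names remain our own assumptions rather than the authors' code.

```python
import torch

def masking_vector(marl_task, other_tasks):
    """Eq. (7): delta_i = 0 if an active higher-priority task also controls agent i, else 1."""
    delta = {}
    for i in marl_task.agents:
        overridden = any(
            t.active and t.priority > marl_task.priority and i in t.agents
            for t in other_tasks
        )
        delta[i] = 0.0 if (marl_task.active and overridden) else 1.0
    return delta

def masked_q_tot(agent_qs, delta, mixer, state):
    """Eq. (8): multiply each chosen Q_i by its mask delta_i before the mixing network f."""
    mask = torch.tensor([delta[i] for i in sorted(delta)]).unsqueeze(0)   # [1, n_agents]
    return mixer(agent_qs * mask, state)
```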

Experiments

In this section, we verify the effectiveness of the proposed MARL-BT framework through extensive experiments. We first describe the environment and experimental settings. With a given BT, we then show the performance improvement brought by the MARL method and the gain in learning efficiency brought by the action masking technique.

Fig. 3 a A snapshot of the 11_vs_11_competition football game. b Initial positions of agents. Yellow points represent our agents, blue points represent opponents, and the black point represents the ball

Experiment settings

We conduct experiments in a challenging task, Google Research Football (GRF) [29], shown in Fig. 3a. The game requires balancing short-term control tasks such as passing, dribbling, and shooting with long-term strategic planning. To evaluate the MARL-BT framework, we divide the football ground into three zones, Backfield, Midfield, and Front-field, with different tasks assigned to each area. For example, defense, organizing, and attack tasks frequently occur when the ball is in the Backfield, Midfield, and Front-field, respectively, as shown in Fig. 3b. Each agent in the game has a discrete action space of dimension 19, including moving in eight directions, sliding, shooting, and other actions. The observation contains information about the positions and movement directions of the players of both teams, the ball, and other elements.
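
For reference, an environment of this kind can be created with the public gfootball API roughly as follows; the scenario name and parameter values in this snippet are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import gfootball.env as football_env

env = football_env.create_environment(
    env_name="11_vs_11_stochastic",             # a full 11 vs 11 game scenario (assumed)
    representation="raw",                        # raw observations: positions, directions, ball, etc.
    rewards="scoring,checkpoints",               # CHECKPOINT shaping, as used for the attack sub-task
    number_of_left_players_agent_controls=3,     # players handed over to the learning sub-task (assumed)
)
obs = env.reset()
```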

We take the most-voted code, gfootball-with-memory-patterns (Footnote 1), from the Kaggle GRF competition [30] and reformulate it as a BT, which serves as the baseline; we name it Baseline-BT and use it in all subsequent experiments. Figure 4 shows the structure of Baseline-BT. It decomposes the full game task into three independent sub-tasks based on the ball's position on the field: Backfield, Midfield, and Front-field. The Backfield sub-task is further decomposed into four sub-tasks corresponding to the players' defensive positions: the penalty area, the left side, the middle, and the right side of the field. The penalty-area defense sub-task is given the highest priority, so players from other tasks may be reassigned to it. The Midfield and Front-field sub-tasks include emergency marking tasks to prevent ball interception and perform counterattacks. Moreover, based on ball ownership, the Midfield sub-task is decomposed into an organizing task and a defensive task, while the Front-field sub-task is decomposed into an attacking task and a defensive task. We then embed different MARL algorithms, i.e., VDN and QMIX, into the BT with the proposed framework, yielding VDN-BT and QMIX-BT, respectively, as indicated by the red dashed box. Finally, we apply the action masking technique to the two methods and name them MASK-VDN-BT and MASK-QMIX-BT, respectively.

All experiments are conducted on a computer with an i7-11700F CPU, an RTX 3060 GPU, and 32 GB RAM. We set the discount factor \( \gamma = 0.99 \). Optimization is conducted using Adam with a learning rate of \( 5 \times 10^{-4} \), and \(\varepsilon \) is set to 0.1.

Fig. 4 The structure of Baseline-BT in our experiments

Fig. 5 Comparisons of training performance for different sub-tasks

Comparison of the training performance

In this section, we compare the learning performance of the MARL methods with the baseline for different scenarios. We take two popular MARL methods, VDN and QMIX, to replace the respective parts of the BT. For the reward design of the attack sub-task in the Front-field, we adopt the CHECKPOINT reward setting provided by the GRF engine. For the organize task in the Midfield, we give +1 when the ball is near the front line and -1 when the ball is near the back line. For the middle sub-task in the Backfield, we give -1 when the ball is kicked into the goal and +1 when the ball is near the back line.
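
As a concrete reading of this reward shaping, the sketch below implements the Midfield and Backfield rewards described above; the coordinate convention and the thresholds for "near the front line" and "near the back line" are our own assumptions.

```python
def organize_reward(ball_x, front_line_x=0.7, back_line_x=-0.7):
    """Midfield organize task: +1 when the ball is near the front line, -1 near the back line."""
    if ball_x >= front_line_x:
        return 1.0
    if ball_x <= back_line_x:
        return -1.0
    return 0.0

def backfield_middle_reward(goal_conceded, ball_x, back_line_x=-0.7):
    """Backfield middle task: -1 when the ball is kicked into the goal, +1 when near the back line."""
    if goal_conceded:
        return -1.0
    return 1.0 if ball_x <= back_line_x else 0.0
```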

The training curves are shown in Fig. 5, where the vertical axis represents the sum of rewards over an episode and the horizontal axis represents the number of training episodes. For all curves, the reward increases gradually as the number of episodes grows. At the beginning, the reward of Baseline-BT is better than that of VDN-BT and QMIX-BT, because the learning algorithms need time to explore the unknown environment. However, the performance of VDN-BT and QMIX-BT gradually improves as training progresses, and they ultimately outperform Baseline-BT. Figure 5a, b show the performance of VDN and QMIX for the middle task in the Backfield. After about \(0.5 \times 10^{4} \) episodes, the reward of VDN-BT starts to exceed that of Baseline-BT, and after about \( 1.4 \times 10^{5} \) episodes, the reward tends to converge. Compared to Baseline-BT, VDN-BT improves the reward by 1.4 times, while QMIX-BT improves it by 2.6 times. Figure 5c, d show the performance of the two MARL methods for the organize sub-task in the Midfield; they improve the reward by 1.7 times and 2.4 times compared to Baseline-BT. For the attack sub-task in the Front-field, i.e., Fig. 5e, f, the reward is improved by 0.3 times and 1.3 times, respectively. This shows that MARL methods can effectively improve the performance of BTs in games.

With the masking mechanism, the performance of the MARL methods is improved significantly in all cases. In Fig. 5a, b, the reward of VDN-BT is improved by 2 times with MASK-VDN-BT, while that of QMIX-BT is improved by around 1.2 times with MASK-QMIX-BT. The corresponding improvements for Fig. 5c, d are 0.6 times and 0.8 times, respectively. For the attack sub-task in Fig. 5e, f, the masking mechanism improves the performance of VDN and QMIX by 0.3 times and 0.2 times, respectively. Another important point is that the reward growth of MASK-VDN-BT and MASK-QMIX-BT is more significant in the Backfield and Front-field than in the Midfield. This is because in the Backfield the opponent often intercepts the ball, activating the assisting task more frequently; similarly, for the attack sub-task in the Front-field, frequent interceptions by the opponent lead to a higher frequency of the marking task. Overall, these results highlight the importance of carefully evaluating different learning algorithms embedded in different scenarios, and of using the masking mechanism to improve performance.

Table 1 The possession, win and loss rate of MARL methods
Fig. 6 The training curves with different numbers of players

Comparison of the evaluating performance

To further investigate the impact of VDN-BT, QMIX-BT, MASK-VDN-BT, and MASK-QMIX-BT on the overall performance of the full game, we conduct a comprehensive evaluation using three key criteria: possession rate, win rate, and loss rate. The possession rate reflects how often our agents control the ball, while the win rate and loss rate indicate the fraction of matches that our agents win and lose, respectively. By considering these three criteria together, we obtain a holistic understanding of the effectiveness of MARL-BT.

The results for all methods under the three tasks are shown in Table 1, where bold values identify the most significant changes in each evaluation criterion. Compared to Baseline-BT, VDN-BT improves the possession rate and win rate by around 3.7% and 1.7% on average, respectively, while QMIX-BT improves them by around 5.9% and 3.6%. For the loss rate, VDN-BT and QMIX-BT achieve decreases of 4.2% and 6.5% on average. Further, the masking mechanism brings additional improvements: MASK-VDN-BT obtains 4.5% and 2.8% improvements in the possession rate and win rate compared to Baseline-BT, while MASK-QMIX-BT obtains improvements of around 6.7% and 4.6%. This further verifies that the masking mechanism successfully deals with unexpected interruptions in the training procedure of MARL-BT.

More specifically, for the middle sub-task in the Backfield, the loss rate is reduced by 12.7% and 13.6% with MASK-VDN-BT and MASK-QMIX-BT, respectively, while the win rate is increased by only 3% and 6%. The reason is that the main responsibility of players in the Backfield is to prevent the opponent from scoring. Similarly, for the organize sub-task in the Midfield, the most significant improvement is in the ball possession rate: MASK-VDN-BT and MASK-QMIX-BT obtain around 6.3% and 7.5% improvements compared to Baseline-BT, while the win rate is increased by only 1.4% and 3.3% and the loss rate is decreased by 1.8% and 3.5%. For the attack sub-task in the Front-field, the win rate is improved most significantly: MASK-VDN-BT and MASK-QMIX-BT improve the win rate by around 5.8% and 8.9%, respectively, which is larger than the corresponding changes in the loss rate and the ball possession rate. This illustrates that introducing MARL methods with the masking mechanism can significantly improve the performance of BTs.

Ablation study

In this section, we conduct experiments to examine the impact of the number of players controlled by the MARL method and of the choice of MARL algorithm on the learning performance of the MARL-BT framework. We consider the organize task in the Midfield, in which the MARL method controls 3 players and 5 players, respectively. The results of embedding different MARL methods into the BT, with and without the action masking technique, are shown in Fig. 6. First, the MARL method and the action masking technique improve the performance of the BT regardless of the number of players controlled by the MARL method. Second, we observe a positive correlation between the performance of the embedded MARL algorithm and the improvement in learning performance of the MARL-BT framework. In these experiments, the recent RiskQ [31] is also embedded into the BT, i.e., RiskQ-BT. RiskQ-BT exhibits superior performance compared to QMIX-BT and VDN-BT, and the masking mechanism (MASK-RiskQ-BT) further enhances the performance of RiskQ-BT.

Conclusion

In this paper, we propose the MARL-BT framework, which combines BTs with MARL methods. It inherits the ability of BTs to decompose complex tasks into sub-tasks and the good performance of learning-based methods on well-defined small problems. Different from previous works, which construct a separate sub-scenario or train over the whole game, we present a procedure in which the MARL method is trained following the running mechanism of BTs, using only a segment of the whole game as an episode. Meanwhile, we point out a special phenomenon, the unexpected interruption, that exists in MARL-BT. It happens between a learning-based sub-task with lower priority and a rule-based sub-task with higher priority. We propose an action masking technique to remove the effects of unexpected interruptions on the learning of the MARL method. We conduct experiments on the GRF game, and the results show that the performance of BTs is significantly improved by the proposed framework, i.e., an 11.507% improvement in win rate for certain scenarios. The action masking technique greatly improves the performance of the learning method, i.e., the final reward is improved by around 100% for a certain sub-task. We hope that our approach can provide valuable guidance for combining BTs with MARL to solve real-world large-scale problems. As future work, it would be interesting to study more sophisticated methods beyond the masking mechanism to deal with unexpected interruptions.