Introduction

The current control strategies for multiple agents can be classified into three main paradigms: centralized learning [1], independent learning [2], and centralized training with distributed execution (CTDE) [3,4,5]. Centralized learning treats the entire system as a whole and uses single-agent reinforcement learning algorithms for training. Although this approach avoids the non-stationarity of the environment, it requires global information, does not scale, and cannot handle settings with no communication, many agents, or large action spaces. Independent learning allows each agent to train its own policy independently and achieves good performance in some cooperative tasks, but it ignores the interactions between agents, which leads to unstable learning. CTDE is a compromise that uses global information during training to improve learning efficiency while allowing each agent to make independent decisions during execution. This approach has been shown to solve some multi-agent learning problems [6,7,8], but solving for the optimal joint value function becomes complex as the number of agents increases.

In human societies, when dealing with complex tasks or problems, people tend to decompose them into subtasks at different levels, and individuals take on specific roles to handle these subtasks [9]. This process is akin to each individual exploring only the constrained state-action space associated with their assigned role. Thus, people use roles associated with specific subtasks or strategies to deal with the overall task. Given a priori knowledge and domain expertise, we can create a set of predefined roles that decompose complex multi-agent tasks into simpler, low-level tasks [10]. However, in practice, predefined roles may not always meet the task requirements, and changes in task or team dynamics can reduce the effectiveness of these roles. In some cases, team members need to take on new roles at the right time to adapt to evolving task dynamics.

Therefore, a dynamic way of dividing and selecting roles is crucial. Existing approaches to dynamic roles require extensive exploration of a large action space [11] and learn roles from scratch, so they do not necessarily have an advantage over role-free approaches. In a multi-agent environment, we can decompose tasks into smaller ones and gradually achieve the overall task goal. However, this requires a model that can effectively decompose multi-agent team tasks, which poses a new challenge.

In this work, we propose a novel framework for learning dynamic role discovery and assignment (DRDA) in multi-agent team tasks. We use an action encoder to construct vector representations based on action attributes and combine them with action contributions to obtain action representations. These action representations are then clustered to discover different roles. At each time step, each agent's history is encoded by a trajectory encoding network, its similarity to the role representations is computed to obtain a categorical distribution over roles, and a role selection policy is learned from this distribution. For policy learning, agents playing the same role share the learning process, and the network estimates the effects and contributions of actions, weighting the action Q-values into the role Q-values. We also employ regularizers: one in role discovery to better separate roles and increase the representational differences between them, and one in learning the role selection policy to avoid the training instability caused by agents frequently changing roles.

In a cooperative task, we use action clustering to decompose the joint action space by learning action differences and action contributions. A role is defined by an action space, and each agent only needs to learn the subset of actions that corresponds to its assigned function. For example, in a soccer game, players can be divided into goalkeepers, defenders, strikers, and so on. Each player is constrained in their choice of actions according to the role they play; the goalkeeper, for instance, only needs to explore how to move when defending. We further differentiate roles by classifying them into long-term and short-term roles based on action differences and action contributions. We also investigate the losses incurred by role differences through the reward horizon, which guides our role selection method. The role selection model coordinates role selection over the restricted action spaces of roles, and role policies are explored within these spaces and associated with action representations. We reduce the complexity of learning by decomposing the multi-agent cooperation problem into several learning problems in restricted action spaces.

We conducted experiments to evaluate the performance of our proposed method in the SMAC [12] environment. The results demonstrate that our method consistently improves performance on all six maps tested. Our method performs well on both hard and super hard maps, achieving a win rate above the baseline, and also converges faster than the baseline method. Furthermore, we conducted ablation experiments to investigate the impact of three different components of our method on performance. These experiments revealed that the three components have varying degrees of impact on performance, providing insights into the effectiveness of our proposed approach.

Related work

CTDE

QMIX [5] is the most typical approach in the CTDE [13, 14] framework, producing joint action values from nonlinear combinations of individual agents' values. QTRAN [15] relaxes the additivity and monotonicity constraints to some extent, removing the structural restrictions of QMIX; however, QTRAN does not perform well in practice on subsequent SMAC tasks. Qatten [16] introduced a multi-head attention mechanism to decompose the joint Q-values, compensating for the lack of theoretical support for the decomposition of joint Q-values in algorithms such as QMIX and VDN. QPLEX [17] introduces a dueling structure into QMIX to generate action advantages. MADDPG [18] addresses MARL tasks with continuous action spaces and stabilizes training by inferring the policies of other agents. For large-scale multi-agent tasks, Mean-Field [19] treats the agents surrounding an agent as a whole with average properties, simplifying the interaction between agents; however, it loses the details of individual agents, limiting its ability to cooperate at large scale. Recently, QMIX-ME [20] was proposed to learn exploration strategies with maximum entropy while using the QMIX structure to solve the credit assignment problem. Additionally, optimizing the code level and enforcing monotonicity constraints for QMIX variants can improve sample efficiency in SMAC and DEPP, according to recent studies [21].

Task decomposition

Task decomposition breaks down a complex task into a set of subtasks with restricted action spaces. We classify task decomposition into two types, general [22] or domain-specific [23,24,25], depending on whether the application domain is restricted. Manual task decomposition was the basis of much previous work [26, 27], and much current research investigates how to automate task decomposition. Many approaches require a high level of domain knowledge to understand the relationships between subtasks, which limits their extensibility [28,29,30]. The M+ algorithm [22] decomposes tasks into more primitive subtasks through a task list based on each agent's specific skill set, making the subtasks executed by an agent better suited to its own characteristics. Task-tree and auction-based approaches [24, 31] produce more system-specific decompositions by providing feedback to improve agent-competency-based task decomposition, but they can be more time-consuming when subtasks need to be adapted. Recently, a framework for human–agent cooperation [32] was proposed that describes the key components of teamwork.

Roles

Roles have been widely used in the design of multi-agent systems to reduce overall complexity by decomposing tasks and assigning agents with the same roles to handle the same subtasks [10, 33,34,35,36,37,38,39]. However, such systems rely heavily on prior knowledge [40] for defining roles and their related subtasks [41]. While predefinition can be effective in specific scenarios [42], the generalizability of prior knowledge is greatly limited. Consequently, there is growing interest in investigating the generalizability of roles across different contexts. Wilson et al. [43] proposed a model-free role discovery algorithm using a Bayesian policy search approach, while Wang et al. [11] designed a specialized objective to encourage role emergence in a flexible, generalized manner. However, these approaches require extensive exploration of the complete action space, which can result in inefficient behavior that wastes resources and increases the complexity of the mechanism. Moreover, recent works have used roles to measure the characteristics of different agents [11, 44,45,46,47,48], but accurately and completely defining roles remains a challenge. For example, Wang et al. [46] defined roles as high-level options in a hierarchical reinforcement learning framework [49], while Christianos et al. [44] defined roles based on the similarity of the environmental impacts of stochastic policies. While these definitions capture some aspects of the differences in agents' behavior, they cannot measure them completely.

Our work builds upon action differences and further classifies actions by their contributions, establishing a more comprehensive perspective for defining roles and measuring the role distance between agents. We categorize roles into long-term and short-term roles using action differences and contributions to enhance their differentiation, and studying role-induced losses through the reward horizon makes role rewards in role selection more reasonable. We decompose the complete action space based on action differences and contributions, and the role set is not fixed, allowing roles to be discovered dynamically.

Role-based task decomposition model construction

In this paper, we propose a novel approach for defining roles based on action differences and action contributions. We decompose the action space and form different role action spaces based on the varying degrees of influence that different actions have on environmental changes and on other agents. Role distances are then computed for the different roles played by the agents. Agents are assigned roles by a role selector that accounts for the reward horizon, and actions are selected from the role action space using role policies. Our approach enhances the agents' exploration while providing a priori knowledge and narrowing each agent's action space, which is particularly effective in environments that require exploration.

We model the multi-agent cooperative task as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [50, 51], which comprises the tuple \(G= \langle N,S,A,P,R,\Omega ,O,n,\gamma \rangle \). Here, N is a finite set of n agents and \(s\in S\) is the true state of the environment. At each time step, each agent \(i\in N:=\{1,\ldots ,n\}\) receives an independent observation \(o_i\in {\Omega }\) determined by the observation function \(O(s,i):S \times N\mapsto \Omega \). Agent i selects an action \(a_i\in A\), forming a joint action \(a\in A^n\). The transition function \(P\left( s^\prime |s,a\right) : S\times A^n\times S\mapsto \left[ 0,1\right] \) gives the probability of the next state \(s^\prime \). All agents share the same reward function \(r=R(s,a)\), and \(\gamma \in [0,1]\) is the discount factor. Each agent has a local action–observation history \(\tau _i\in T\equiv {(\Omega \times A)}^*\).

Based on the CTDE scheme, we design a role-based multi-agent team task decomposition framework that decomposes multi-agent teamwork tasks into subtasks at different levels, each associated with a role. Each agent plays one role per time step, and agents playing the same role at the same time step can share information and jointly learn policies to solve the subtask associated with that role. Breaking down team goals into subgoals at a high level is important because it is a general way for team members to quickly understand how the team solves a problem. Decomposing the main goal into sub-goals through roles does not specify how these sub-goals are achieved or by whom. As an example, in a StarCraft II confrontation scenario, the objective of eliminating members of the opposing team would be broken down into subtasks such as move, surround, and attack.

In this paper, given a multi-agent cooperation task \(G=\langle N,S,A,P,R,{\Omega }, O,n,\gamma \rangle \), we define a set of roles \({\Psi }\). Each role \(\rho _j\in {\Psi }\) is defined by a tuple \(\langle j,r_{i,j},g_j,\pi _{\rho _j} \rangle \), where j is the identity of the role, \(r_{i,j}\) is the reward function of the role, and \(\pi _{\rho _j}: T\times A_j\rightarrow [0,1]\) is the role policy associated with the sub-task \(g_j= \langle N_j,S,A_j,P,R,{\Omega }_j,O,\gamma \rangle \), with \(N_j\subset N\) and \(\cup _{j}N_j=N\). Agents can play only one role in each time step: \(N_j\cap N_k=\emptyset \) for \(j\ne k\). \(A_j\) is the restricted action space of the role \(\rho _j\), and the action spaces of different roles can overlap: \(A_j\subset A\), \(\cup _{j}A_j=A\), \(|A_j\cap A_k|\ge 0\) for \(j\ne k\).
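To make these definitions concrete, the following sketch represents a role as a simple container and checks the disjointness and coverage conditions above; the class and function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

# Hypothetical container mirroring the role tuple <j, r_ij, g_j, pi_rho_j>.
@dataclass
class Role:
    role_id: int                                     # j: identity of the role
    action_space: Set[int]                           # A_j: restricted subset of the full action space A
    agents: Set[int] = field(default_factory=set)    # N_j: agents currently playing this role
    policy: Optional[Callable[[list, int], float]] = None   # pi_rho_j(tau, a) -> probability

def check_partition(roles: List[Role], all_agents: Set[int]) -> bool:
    """Each agent plays exactly one role per time step:
    the sets N_j are pairwise disjoint and their union covers N."""
    seen: Set[int] = set()
    for r in roles:
        if seen & r.agents:          # N_j and N_k must be disjoint for j != k
            return False
        seen |= r.agents
    return seen == all_agents        # union of N_j equals N
```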

We aim to dynamically discover a set of roles \({\psi }^*\) that maximizes the global expected payoff function \(Q^{\psi }(s_t,a_t)={\mathbb {E}}_{s_{t+1:\infty },a_{t+1:\infty }}\left[ \sum _{i=0}^{\infty }{\gamma ^ir_{t+i}}| s_t, a_t, {\psi }\right] \). The global expected return is influenced by both role division and role selection. Our mechanism is a dynamic grouping mechanism: neither role division nor role selection is fixed, and role policies are learned by exploring the role-restricted action spaces. When an agent chooses a role, the policy for solving the corresponding subtask is determined at the same time.

Methods

In team-based task completion, the roles of individuals are not fixed, and the number of individuals in each role may vary across states. For instance, during a search and rescue operation, all personnel start as searchers to locate the target. Once the target is located, some personnel become rescuers to carry out rescue operations while others continue searching for the next target. This flexibility enables agents to collaborate effectively in multi-agent cooperative tasks. Figure 1 shows the learning framework of Dynamic Role Discovery and Assignment (DRDA). We first describe how a set of roles is discovered to decompose a multi-agent cooperative task based on actions. We then discuss how agents select roles based on their fitness. Finally, after the agents have selected roles, we present role policies that are associated with action representations.

Fig. 1 The overview framework of DRDA. a Action representation learning structure. b Role selection structure of the agent. c Policy learning structure of the role

Role discovery based on action representations

The effectiveness of the mechanism is largely determined by the role classification scheme, so it is crucial to determine carefully how roles are classified. Different agents output different actions at the same time depending on the state, and these differences can be used to distinguish roles; however, defining roles solely by action differences cannot fully reflect the different sight ranges, action spaces, and reward functions of roles. Action differences therefore cannot be the only criterion for defining roles. To address this limitation, we introduce action values, which we call action contributions, to define role-type differences on top of action differences. We estimate the Q-value or state value of each action from the reward function; the Q-value of a single action is treated as that action's contribution to the team.

In many cases, although the actions taken by agents are not the same, their effects on the surrounding environment are highly similar. For example, in a soccer game, two players pass the ball to each other during an attack [52]. Although the actions taken by the players at each time step are not identical, the effects of their passing actions are highly similar until they switch to a different behavior, and the two players actually play very similar roles. As a result, the number of roles found purely on a per-time-step basis may be large, while the differences between the roles [46] may be small. To maintain the differentiation between roles, we propose a regularizer.

We first characterize and cluster the actions based on their attributes, determine the action spaces to which the actions belong, calculate the action contributions, and group actions with similar contributions into the same space. Actions with markedly different contributions are identified as outliers. We classify action spaces with similar action effects as the same role and those with different action effects as different roles. In this way, we determine the action space to which each action belongs and the final role classification.

We present the two-layer linear network model depicted in Fig. 1a to learn the action encoder \(f_e(\cdot ;\theta _e)\) with parameters \(\theta _e\). The action encoder maps one-hot actions \(a_i\) from \({\mathbb {R}}^{|A|}\) to \({\mathbb {R}}^{m}\), generating the vector representation \(x_{\phi _{i}} = f_{e}\left( a_{i};\theta _{e} \right) \in {\mathbb {R}}^{m}\) for each action \(a_i\). The action representation \(z_{a_i} = Q_{\pi }\left( s,a_{i} \right) x_{\phi _{i}}\) of action \(a_i\) is then produced in this space.

To compute the loss and update the network, we predict \({{\widetilde{o}}}_i^\prime \) and \({{\widetilde{r}}}_t\) and measure their differences from the true values \(o_i^\prime \) and \(r_t\):

$$\begin{aligned} {\mathcal {L}}_{e}\left( \theta _{e},\xi _{e} \right) = {\mathbb {E}}_{\left( o,a,r,o^{\prime } \right) \sim {\mathcal {D}}}\left[ \sum \limits _{i}\left\| {\widetilde{o}}_{i}^{\prime } - o_{i}^{\prime } \right\| _{2}^{2} + \lambda _{e}\sum \limits _{i}\left( {\widetilde{r}}_{t} - r \right) ^{2} \right] \end{aligned}$$
(1)

where the hyperparameter \(\lambda _{e}\) determines the training focus; adjusting its value changes the relative importance of predicting the next observation and the reward. The samples are drawn from the replay buffer \({\mathcal {D}}\) during training.
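As an illustration, the following PyTorch sketch shows one way to realize the action encoder and the predictive loss of Eq. (1); the layer sizes, the auxiliary predictor heads, and the default \(\lambda _e\) are our assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionEncoder(nn.Module):
    """Two-layer encoder mapping one-hot actions to m-dimensional representations,
    plus heads that predict the next observation and the reward (used in Eq. 1)."""
    def __init__(self, n_actions: int, obs_dim: int, m: int = 20):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_actions, 64), nn.ReLU(),
                                    nn.Linear(64, m))
        self.pred_obs = nn.Linear(m + obs_dim, obs_dim)   # predicts o_i'
        self.pred_rew = nn.Linear(m + obs_dim, 1)         # predicts r

    def forward(self, onehot_a, obs):
        x_phi = self.encode(onehot_a)                 # x_phi = f_e(a; theta_e)
        h = torch.cat([x_phi, obs], dim=-1)
        return x_phi, self.pred_obs(h), self.pred_rew(h)

def encoder_loss(model, onehot_a, obs, next_obs, reward, lambda_e=10.0):
    """L_e: observation-prediction error plus lambda_e times reward-prediction error."""
    _, o_hat, r_hat = model(onehot_a, obs)
    return F.mse_loss(o_hat, next_obs) + lambda_e * F.mse_loss(r_hat.squeeze(-1), reward)
```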

At the same time, if roles are not kept distinct, a large number of roles with highly similar action effects will appear, increasing the complexity of the framework. To prevent the emergence of duplicate roles resulting from highly similar action effects, we introduce a regularizer that maximizes the L2 distance between action representations, thereby enlarging the differences between the resulting roles and maintaining their differentiation:

$$\begin{aligned} {\mathcal {L}}_{a}\left( \theta _{e} \right) = {\mathbb {E}}_{{\mathcal {D}}}\left[ - {\sum \limits _{j \ne k}|| z_{a_{j}} - z_{a_{k}} ||^{2}} \right] \end{aligned}$$
(2)

We employ supervised learning on the samples \((o,a,r,o^\prime )\) in the replay buffer, given K roles, i.e., K categories covering the complete action space. After collecting samples and training the prediction model for a period, we cluster the action representations with k-means, updating the cluster centers from the samples. Outlier actions are identified, marked as such, and added to each role when the clustering results are updated, so that the resulting categories represent different roles. The new action representations carry action-effect information; each cluster represents a role, and the restricted action space of a role consists of the actions in one category with similar action effects. Accordingly, we determine the action space to which each action belongs and the final role classification. Action effects can be measured by changes in rewards and local observations. Through supervised learning, the latent action representation \(z_a\) carries the same action-effect information (state transition and reward) as the original action a. During training, the action representations adapt to dynamic changes in the environment through continuous learning; after training is completed, the roles and their corresponding action spaces are fixed.
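A rough sketch of this clustering step is given below, assuming the action representations have already been produced by the encoder; the outlier rule shown here is a simple distance threshold of our own choosing, not the authors' exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_roles(z_actions: np.ndarray, k: int, outlier_factor: float = 2.0):
    """Cluster action representations into k roles.

    z_actions: (|A|, m) array of action representations z_a.
    Returns a list of k action-index sets (the restricted action spaces A_j).
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(z_actions)
    dists = np.linalg.norm(z_actions - km.cluster_centers_[km.labels_], axis=1)
    # Illustrative outlier rule: an action far from its own cluster centre.
    outliers = set(np.where(dists > outlier_factor * dists.mean())[0])

    roles = [set(np.where(km.labels_ == j)[0]) for j in range(k)]
    for j in range(k):
        roles[j] |= outliers      # outlier actions are added to every role
    return roles
```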

Therefore, the overall optimization objective of role discovery is given by the equation:

$$\begin{aligned} {\mathcal {L}}_{e}^{\textrm{total}}\left( \theta _{e},\xi _{e} \right) = {\mathcal {L}}_{e}\left( \theta _{e},\xi _{e} \right) + \lambda _{a}{\mathcal {L}}_{a}\left( \theta _{e} \right) \end{aligned}$$
(3)

where \(\lambda _{a}\) is the positive coefficient used to balance the regularizer.
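Continuing the sketch above, the separation regularizer of Eq. (2) and the combined role-discovery objective of Eq. (3) could be written as follows; \(\lambda _a\) is left as a generic hyperparameter.

```python
import torch

def separation_regularizer(z_actions: torch.Tensor) -> torch.Tensor:
    """L_a: negative sum of pairwise squared L2 distances between action
    representations; minimizing it pushes the representations apart (Eq. 2)."""
    d = torch.cdist(z_actions, z_actions, p=2) ** 2   # (|A|, |A|); diagonal terms are zero
    return -d.sum()                                   # equals the sum over j != k

def role_discovery_loss(pred_loss: torch.Tensor, z_actions: torch.Tensor,
                        lambda_a: float = 0.1) -> torch.Tensor:
    """Total role-discovery objective (Eq. 3): predictive loss plus lambda_a * L_a."""
    return pred_loss + lambda_a * separation_regularizer(z_actions)
```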

Representation-based role selection

Taking into account the differences between roles and the behavioral tendencies of agents, we devise a role selection policy based on the similarity between the representations of roles and agents. Roles are classified into two categories, short-term roles and long-term roles, and we investigate the impact of the reward horizon, as reflected in the discount factor, on the loss of role reward. This is because certain roles in a team are oriented towards long-term rewards with no immediate gains, and it is not rational to use the same reward function for all roles. The reward horizon varies across role types: some favor immediate rewards (e.g., shooting at nearby prey), while others prefer long-term rewards (e.g., guarding a camp). Our role selection policy is based on a traditional Q-network: the local action–observation history is fed as input, and the Q-value of each role is output. Every c steps, the agent chooses a role through the role selector, and each selection determines the set of possible actions for the next c time steps, where the Q-value of a role is closely linked to its role representation. The agent then acts as that role within the given time frame and receives rewards according to the role's reward horizon and reward function.

By utilizing action representations, we construct the role representation \(z_{\rho _{j}}\) by averaging the representations of the action space contained in the role, which results in an average representation of the available actions of role \(\rho _j\):

$$\begin{aligned} z_{\rho _{j}} = \frac{1}{| A_{\rho _{j}} |}{\sum \limits _{a_{k} \in A_{\rho _{j}}}z_{a_{k}}} \end{aligned}$$
(4)

\(z_{\rho _{j}}\) is the representation of role \(\rho _j\), and \(A_{\rho _j}\) is the restricted action space of role \(\rho _j\). We employ the structure shown in Fig. 1b to learn role selection. The observation \(o_i\) of agent i serves as the input to a shared trajectory encoder consisting of a shared linear layer and a GRU [53]; this encoder, with parameters \(\theta _{\tau _\beta }\) shared by both networks, encodes the local action–observation history \(\tau \) into a vector \(h_\tau \). The role selector \(f_\beta (h_\tau ;\theta _\beta )\) is a fully connected neural network, parameterized by \(\theta _\beta \), which maps the vector \(h_\tau \) to \(z_\tau \in {\mathbb {R}}^d\) with the same length as the action representation. For the role selector, the expected payoff of agent i when choosing role \(\rho _j\) under observation history \(\tau _i\) is denoted \(Q_i^\beta (\tau _i,\rho _j)\):

$$\begin{aligned} Q_{i}^{\beta }\left( \tau _{i},~\rho _{j} \right) = z_{\tau _{i}}^{T}z_{\rho _{j}} \end{aligned}$$
(5)

Considering the similarity between the observation history and role representation of the agents, for each agent i, we compute the similarity between its action–observation history representation \(z_{\tau _{i}}\) and the role representations \(z_{\rho }: = \left[ z_{\rho _{j}} \right] _{j = 1}^{k}\), i.e., \(\textrm{sim}\left( z_{\tau _{i}},z_{\rho _{j}} \right) = \left( z_{\tau _{i}}^{T}z_{\rho _{j}} \right) / \left( \Vert z_{\tau _{i}} \Vert \, \Vert z_{\rho _{j}} \Vert \right) \). Specifically, we approximate this cosine similarity by \(Q_{i}^{\beta }\left( \tau _{i},\rho _{j} \right) \) and apply the softmax function to obtain the categorical distribution of role selection \(p\left( \rho | h_{\tau },z_{\rho } \right) : = \left[ p\left( \rho _{j} | h_{\tau },z_{\rho } \right) \right] _{j = 1}^{k}\). Here, \(p\left( \rho _{j} | h_{\tau },z_{\rho } \right) \) is the probability that agent i selects role \(\rho _{j}\), with:

$$\begin{aligned} p\left( \rho _{j} | h_{\tau },z_{\rho } \right) = \frac{\exp \left( Q_{i}^{\beta }\left( \tau _{i},\rho _{j} \right) \right) }{\sum _{j^{\prime } = 1}^{k}{\exp \left( Q_{i}^{\beta }\left( \tau _{i},\rho _{j^{\prime }} \right) \right) }} \end{aligned}$$
(6)

where \(\exp (\cdot )\) is the exponential function. To make the role selection process trainable, we use the Straight-Through Gumbel-Softmax estimator to sample the role \(\rho _{j}\), since drawing a role directly from a categorical distribution is not differentiable.
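A minimal sketch of this role selector is shown below, using PyTorch's straight-through Gumbel-Softmax; the network sizes and temperature are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleSelector(nn.Module):
    """Maps the GRU trajectory encoding h_tau to z_tau, scores roles by the dot
    product with role representations (Eq. 5), and samples a role with the
    straight-through Gumbel-Softmax estimator."""
    def __init__(self, h_dim: int, d: int):
        super().__init__()
        self.f_beta = nn.Sequential(nn.Linear(h_dim, 64), nn.ReLU(),
                                    nn.Linear(64, d))

    def forward(self, h_tau, z_roles, temperature: float = 1.0):
        z_tau = self.f_beta(h_tau)                 # (batch, d)
        q_beta = z_tau @ z_roles.t()               # (batch, k) role values Q_i^beta (Eq. 5)
        # Softmax over the k roles gives p(rho_j | h_tau, z_rho) as in Eq. (6);
        # gumbel_softmax(hard=True) draws a differentiable one-hot sample from it.
        onehot = F.gumbel_softmax(q_beta, tau=temperature, hard=True)
        return q_beta, onehot, onehot.argmax(dim=-1)
```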

To better coordinate role assignment and solve the role assignment problem in multi-agent learning, we use the QMIX method to estimate the joint role value function \(Q_\textrm{tot}^\beta \) from the individual \(Q_i^\beta \); the parameters \(\xi _\beta \) of the mixing network are generated by a hypernetwork [54] that takes the current state as input. The target is the return of the previous c steps, \(R_{LSTRR}^{(c)}\), plus the \(Q_\textrm{tot}\) value of the optimal role selection after step c; the loss is the expectation, over the historical experiences in the replay buffer, of the squared difference between this target and the currently predicted \(Q_\textrm{tot}\). The role selector \(f_\beta \) is updated by sampling multiple transitions from the replay buffer and minimizing this TD loss:

$$\begin{aligned} {\mathcal {L}}_{\beta }\left( \theta _{\tau _\beta },\theta _{\beta },\xi _{\beta } \right) = {\mathbb {E}}_{{\mathcal {D}}}\left[ \left( R_{LSTRR}^{(c)} + \gamma \max _{\rho ^{\prime }}{\bar{Q}}_\textrm{tot}^{\beta }\left( s_{t + c},\rho ^{\prime };\theta ^{-},\xi ^{-} \right) - Q_\textrm{tot}^{\beta }\left( s_{t},\rho _{t};\theta ,\xi \right) \right) ^{2} \right] \end{aligned}$$
(7)

\(\rho ^{\prime }\) in Eq. (7) ranges over all roles that can be chosen, while \(\theta ^-\) and \(\xi ^-\) are target-network parameters that are copied periodically from the current network and remain constant over multiple iterations. \({{\bar{Q}}}_\textrm{tot}^\beta \) is a target network, \(\rho = \langle \rho _1,\rho _2, \ldots ,\rho _n \rangle \) is the joint set of roles of all agents, and the expectation is estimated from a uniform sample of the replay buffer \({\mathcal {D}}\):

$$\begin{aligned} \begin{aligned}&Q_\textrm{tot}^{\beta }(s,\rho ;\theta ,\xi ) = {\sum \limits _{i = 1}^{n}{Q_{i}^{\beta }( \tau _{i},\rho _{i};\theta _{i} )}}\\&\frac{\partial Q_\textrm{tot}^{\beta }(s,\rho ;\theta ,\xi )}{\partial Q_{i}^{\beta }( \tau _{i},\rho _{i};\theta _{i} )} \ge 0,\forall i \in N \end{aligned} \end{aligned}$$
(8)

The monotonicity constraint in Eq. (8) ensures that maximizing the joint \(Q_\textrm{tot}^\beta \) and maximizing the individual \(Q_i^\beta \) are equivalent, meaning that the best individual role selections remain consistent with the best joint role selection.
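Since the text describes a QMIX-style mixing network constrained as in Eq. (8), a minimal sketch of such a monotonic mixer is given below; the single hidden layer and the dimensions are our simplifications, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: hypernetworks conditioned on the state produce
    non-negative weights, so dQ_tot/dQ_i >= 0 for every agent (Eq. 8)."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.w1 = nn.Linear(state_dim, n_agents * embed)   # hypernetwork: layer-1 weights
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)               # hypernetwork: layer-2 weights
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                nn.Linear(embed, 1))

    def forward(self, q_i, state):
        # q_i: (batch, n_agents) individual values; state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed)
        h = torch.relu(torch.bmm(q_i.unsqueeze(1), w1) + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(-1, self.embed, 1)
        q_tot = torch.bmm(h, w2) + self.b2(state).unsqueeze(1)
        return q_tot.view(-1, 1)                            # (batch, 1) joint value Q_tot
```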

Since the role categories lead to role-specific reward horizons, a regularizer is applied at each c-step training interval to control training stability:

$$\begin{aligned} R_{LSTRR}^{(c)} = \frac{1}{k}\sum \limits _{j = 1}^{k}\left( Q_{i}^{\beta }\left( \tau _{i},\rho _{j} \right) - \sum \limits _{t = 0}^{T - c}{\mu _{j}^{t}r^{(t + c)}} \right) ^{2}, \quad \text {for } k \le K. \end{aligned}$$
(9)

where LSTRR stands for long- and short-term reward roles. Here \(Q_i^\beta (\tau _i,\rho _j)\) is the estimated Q value associated with role \(\rho _j\), and \(R_j=\sum _{t=0}^{T-c}{\mu _j^tr^{(t+c)}}\) is the discounted reward of role \(\rho _j\). This regularizer is used during centralized training, where the rewards are known. Without loss of generality, we assume that \(\mu _1,\mu _2,\ldots ,\mu _{k}\) is a decreasing sequence (from long-term to short-term horizon).
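A small sketch of this regularizer under our reading of Eq. (9), in which each role j discounts the observed rewards with its own factor \(\mu _j\), is given below; the length of the reward window is left generic.

```python
import torch

def lstrr_regularizer(q_roles: torch.Tensor, rewards: torch.Tensor,
                      mus: torch.Tensor) -> torch.Tensor:
    """R_LSTRR^(c): mean squared gap between each role's value estimate
    Q_i^beta(tau_i, rho_j) and its mu_j-discounted return (Eq. 9).

    q_roles: (k,) role value estimates for agent i.
    rewards: (L,) rewards observed over the return window.
    mus:     (k,) per-role discount factors, assumed decreasing.
    """
    steps = torch.arange(rewards.shape[0], dtype=rewards.dtype)
    returns = torch.stack([(mu ** steps * rewards).sum() for mu in mus])  # R_j per role
    return ((q_roles - returns) ** 2).mean()
```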

At each time step, agent i chooses a role. When an agent frequently changes roles in adjacent time steps, training can become unstable. To smooth the role selection of agents and stabilize training, we introduce a regularizer that minimizes the KL divergence between the role selection distributions of adjacent time steps:

$$\begin{aligned} {\mathcal {L}}_{h}\left( {\theta _{e},\theta _{h}} \right) = {\mathbb {E}}_{{\mathcal {D}}}\left[ {\sum \limits _{i}{D_{KL}\left( p\left( \rho | h_{\tau },z_{\rho } \right) \,||\, p^{\prime }\left( \rho ^{\prime } | {h_{\tau }^{\prime }},{z_{\rho }^{\prime }} \right) \right) }} \right] \end{aligned}$$
(10)

where \(p^{\prime }\left( \rho ^{\prime } | {h_{\tau }^{\prime }},{z_{\rho }^{\prime }} \right) \) is the role selection distribution at the next time step, \(D_{KL}\left( \cdot || \cdot \right) \) is the KL divergence operator, and the sum is taken over all agents.
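For illustration, this smoothness term could be computed as follows (a sketch under our naming assumptions):

```python
import torch
import torch.nn.functional as F

def role_smoothness_regularizer(logits_t: torch.Tensor,
                                logits_next: torch.Tensor) -> torch.Tensor:
    """L_h: KL divergence between the role-selection distributions of adjacent
    time steps, summed over agents (Eq. 10).

    logits_t, logits_next: (n_agents, k) unnormalized role scores Q_i^beta.
    """
    log_p_next = F.log_softmax(logits_next, dim=-1)
    p_t = F.softmax(logits_t, dim=-1)
    # F.kl_div(input, target) computes KL(target || exp(input)); here KL(p_t || p_next).
    return F.kl_div(log_p_next, p_t, reduction="sum")
```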

The overall optimization objective for role selection is:

$$\begin{aligned} {\mathcal {L}}_{\beta }^{\textrm{total}}\left( \theta _{\tau _\beta },\theta _{\beta },\xi _{\beta } \right) = {\mathcal {L}}_{\beta }\left( \theta _{\tau _\beta },\theta _{\beta },\xi _{\beta } \right) + \lambda _{h}{\mathcal {L}}_{h}\left( {\theta _{e},\theta _{h}} \right) \end{aligned}$$
(11)

where \(\lambda _{h}\) is the positive coefficient of this regularizer.

Representation-related role strategy selection

After an agent chooses a role, it maintains the same role for the next c time steps, and its action space is restricted. Each role \(\rho _j\) is associated with a role policy \(\pi _{\rho _j}:T\times A_j\rightarrow [0,1]\), defined by the restricted action space. The policy parameters for each role are learned separately, as illustrated in Fig. 1c. Generally, agents playing the same role can learn policy parameters from each other, but the parameters are different for different roles.

Similar to the role selector, we learn the role policy with a deep Q-network that directly estimates the Q value of each action; Q values based on action effects can leverage the information about action differences and contributions. The role policy \(f_{\rho _j}(h_\tau ;\theta _{\rho _j})\) is a fully connected network that maps \(h_\tau \) to the observation representation \(z_\tau \in {\mathbb {R}}^d\). The inner product of the observation representation \(z_\tau \) and the action representation \(z_{a_k}\) gives the value of agent i choosing the original action \(a_k\). The value function \(Q_i(\tau _i,a_k)\) of action \(a_k\), executed by an agent that has selected role \(\rho _j\) when its observation history is \(\tau _i\), is:

$$\begin{aligned} Q_{i}\left( \tau _{i},a_{k} \right) = z_{\tau _{i}}^{T}z_{a_{k}} \end{aligned}$$
(12)

To learn \(Q_i\) from the global reward, we again feed the local Q values into a QMIX-style mixing network to estimate the global action value \(Q_\textrm{tot}(s,a)\); the parameters of this mixing network are denoted by \(\xi _\rho \). The target is the reward r plus the \(Q_\textrm{tot}\) value of the optimal action selection at the next step, and the loss is the expectation, over the historical experiences in the replay buffer, of the squared difference between this target and the currently predicted \(Q_\textrm{tot}\). We minimize this TD loss to update the role policy \(f_{\rho _j}\):

$$\begin{aligned} {\mathcal {L}}_{\rho }\left( {\theta _{\tau _{\rho }},\theta _{\rho },\xi _{\rho }} \right) = {\mathbb {E}}_{{\mathcal {D}}}\left[ \left( r(s_t,a_t) + \gamma \max _{a^{\prime }}{\bar{Q}}_\textrm{tot}\left( s^{\prime },a^{\prime };\theta ^{-},\xi ^{-}\right) - Q_\textrm{tot}(s,a;\theta ,\xi ) \right) ^{2} \right] \end{aligned}$$
(13)

\(a^{\prime }\) in Eq. (13) ranges over the optional actions of the agents, \({{\bar{Q}}}_\textrm{tot}\) is a target network, \(\theta _\rho \) denotes the parameters of all role policies, and the expectation is estimated from a uniform sample of the same replay buffer \({\mathcal {D}}\) as the role selector. Each agent therefore only trains the policy parameters of its selected role. In this way, agents with similar abilities tend to choose the same roles, can share their experience, and speed up training, ultimately improving performance.
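A compact sketch of this representation-based role policy is shown below; the mask-based restriction to the role's action space is our way of expressing the restricted action space, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RolePolicy(nn.Module):
    """Role policy f_rho_j: maps the trajectory encoding h_tau to z_tau and scores
    each action by the inner product z_tau^T z_a (Eq. 12); actions outside the
    role's restricted action space A_j are masked out."""
    def __init__(self, h_dim: int, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h_dim, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, h_tau, z_actions, action_mask):
        # h_tau: (batch, h_dim); z_actions: (|A|, d); action_mask: (|A|,) bool for A_j
        z_tau = self.net(h_tau)                            # (batch, d)
        q = z_tau @ z_actions.t()                          # (batch, |A|) action values (Eq. 12)
        return q.masked_fill(~action_mask, float("-inf"))  # restrict to the role's actions
```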

Experiment results

Details

We selected the SMAC benchmark [12] as our test environment; it is a widely used MARL benchmark built on the StarCraft II Learning Environment (SC2LE) that provides a challenging platform for both competitive and cooperative multi-agent problems. At each time step, an agent's action space in SMAC includes four basic movement directions (up, down, left, and right), selecting an enemy to attack, stopping, and a no-op, and agents can only attack enemies within their shooting range. Therefore, if there are \(n_e\) enemies on the map, the action space of each allied unit consists of \(n_e+6\) discrete actions. Our goal is to maximize the win rate, defined as the ratio of games won to games played. Specifically, in our experiments we measure the percentage of episodes in which our method defeats all enemy units within the time limit, which we refer to as the test win rate.

Table 1 Hyperparameter settings and structure details

We use RMSprop for optimization, and all hyperparameters and structural details are presented in Table 1. The episodes are sampled from the replay buffer, and the role policy and role-selected mixing network share the same architecture. The parameters of the mixing network are set to the same values as QMIX [55]. We employ the default reward and observation settings of the SMAC benchmark in our experiments. The baseline uses the code provided by the authors, and we fine-tuned the hyperparameters on the SMAC benchmark. All experiments were conducted on NVIDIA RTX 2060 GPUs, and training times ranged from 12 to 48 h.

To determine the role action space, we utilize k-means clustering. The number of clusters, denoted as k, is considered as a hyperparameter. We set k to 3 for maps with homogeneous enemies and 5 for maps with heterogeneous enemies.

Table 2 SMAC map details

Performance

For evaluation, all experiments in this section were conducted with different random seeds, and the median performance is reported. The SMAC benchmark maps are classified into three difficulty levels: easy, hard, and super hard. We chose two maps from each difficulty level, specifically 3s5z and 10m_vs_11m from the easy level, 2c_vs_64zg and 5m_vs_6m from the hard level, and corridor and 27m_vs_30m from the super hard level. The details of each map are given in Table 2. The hard and super hard maps are typical examples of challenging exploration tasks. We conducted experiments in these six scenarios to evaluate the performance of our method in the SMAC environment and compared it with the value-based MARL algorithms QMIX [55] and VDN [56]. The results are shown in Figs. 2, 3, 4. After 2M training steps, our method outperformed all baselines by at least \(5\%\) in four of the six scenarios.

Fig. 2 Performance comparison with baselines on easy maps

Fig. 3 Performance comparison with baselines on hard maps

The experimental results for the easy maps are depicted in Fig. 2. In Fig. 2a, our method improves the win rate by approximately \(10\%\) over one baseline and performs similarly to the other, but it does not converge as quickly as the baselines. In Fig. 2b, our method performs similarly to the baselines, with a win rate improvement of 2–5\(\%\). Overall, on easy maps, our method performs comparably to the baselines but requires more samples to achieve similar performance. We speculate that this is because easy maps do not require much exploration: allied agents simply engage in combat and destroy enemy units to win rewards, whereas our method still performs substantial exploration, so the benefits of the restricted action space are not pronounced enough to outperform the baselines.

Figure 3 shows the performance of our method on hard maps. In both maps of Fig. 3a, b, our method outperforms the baselines. In Fig. 3a, our method improves the win rate by about 5\(\%\) on average, and the convergence rate is even faster than the baseline method by about 28\(\%\). In Fig. 3b, our method outperforms the baseline method by approximately 25\(\%\) on average, although the convergence speed is slower. Overall, on hard maps, our method improves the win rate by an average of 20\(\%\) over the baseline.

Fig. 4 Performance comparison with baselines on super hard maps

The experimental results for super-hard maps are shown in Fig. 4. In both maps in Fig. 4a, b, our method performs significantly better than the baselines, especially in Fig. 4a. In Fig. 4a, our method improves the win rate by approximately 70\(\%\) on average, and in Fig. 4b, our method improves the win rate by about 40\(\%\). Overall, on super hard maps, our algorithm improves the win rate by an average of 55\(\%\) over the baseline algorithm.

In summary, our method achieves good results on the SMAC benchmark and performs well across scenarios, especially in the hard and super hard scenarios, where it surpasses all baselines. It outperforms the baselines by a wide margin on maps that require more exploration, such as corridor. These results demonstrate that our method can effectively explore and solve complex tasks, as we anticipated.

Ablation

Our approach consists of three components, namely (A) a restricted action space, (B) clustering using action differences and action contributions, and (C) integration of action representations into role selection and role policies. To test components A and B, we keep the rest of the framework unchanged and replace the role action space with either a space containing all actions (testing A) or a space of random actions (testing B). For component C, action representations are removed, and role selection and role policies are learned with traditional Q-networks.

Fig. 5 Ablation studies regarding each component of our method

We chose one map from each of the three difficulty levels for the ablation experiments, and the results are shown in Fig. 5. The variant using a traditional deep Q-network (No Action Repr.) is closest to the original model, with an average performance reduction of 55\(\%\), and outperforms the ablation that does not restrict the role action space (Full Action Spaces). This indicates that integrating action representations into role selection and role policies has a significant impact on performance, and that the restricted action space is also closely tied to the performance of our method. The variant with random restricted action spaces (Random Restricted Action Spaces) performs the worst of the three ablations, with an average 60\(\%\) reduction in performance compared to the original model, highlighting the importance of using action differences and action contributions to decompose the joint action space.

In conclusion, all three ablated variants perform worse than the full method. Limiting the role action space and integrating action representations into role selection and role policies both play a key role in the performance of our method.

Conclusion

Discovering a set of roles that effectively decompose tasks is a thorny problem that hinders the scalability of using roles to decompose team tasks. In this paper, we propose to divide roles to decompose team tasks, decomposing the action space based on action differences and action contributions, while role selection enables agents to play roles dynamically based on reward horizons. Experimental results on SMAC show that our method achieves significant performance improvements over other MARL methods, with an average 20\(\%\) improvement in win rate and especially strong performance on hard and super hard maps. In addition, ablation experiments on the three components of our method verify the impact of each component: the win rate decreased by 55–60\(\%\) on average after ablation, which shows that all three components play a key role in performance. However, our framework may not perform optimally under certain conditions. Specifically, it may not be effective in scenarios where task decomposition is unnecessary and victory can be achieved through simple strategies that do not require extensive exploration; in such cases, our method may not yield significant performance improvements. Our role selection and policy learning are based on the traditional QMIX method, which can in principle be replaced with other QMIX variants, and we hope to achieve better performance with different combinations in the future. Our approach provides new insights into building effective multi-agent team task decomposition in dynamic environments.