Introduction

Multi-agent systems (MAS) and multi-agent reinforcement learning (MARL) have drawn much attention [1] and have been applied to optimization problems in the physical world, such as resource allocation [2,3,4], cooperative navigation [5,6,7], air traffic flow management [8], and large-scale traffic light control [9,10,11]. This research field combines machine learning (ML) [12, 13], deep learning (DL) [14, 15], and swarm intelligence [16] approaches, and has proved able to obtain outstanding results in different areas. Previous deep MARL algorithms have achieved impressive results in cooperative MARL environments [17, 18], e.g., the StarCraft Multi-Agent Challenge (SMAC) environment [19].

The SMAC environment is a multi-agent micromanagement scenario in which two adversarial MASs battle against each other. The goal is to train an MARL algorithm that controls the ally agents to eliminate the enemy agents controlled by the internal script of the SMAC environment. The algorithm needs to learn tactics and skills for choosing the best actions and exploiting the different properties of agents. Because the environment has discrete action spaces, value-based algorithms have achieved better results than policy-based algorithms [20,21,22].

Asymmetric heterogeneous problems are very common in real-world scenarios [23,24,25,26,27], such as the wireless network accessibility problem [28] and multi-agent robotic systems [29, 30]. However, the original maps of the SMAC environment mainly consist of symmetric or homogeneous maps (see Table 1). A symmetric map means that allies and enemies consist of the same types of units and that the numbers on both sides are equal. A homogeneous map means that allies consist of a single type of unit, regardless of the composition of the enemies. Furthermore, in SMAC, even though allies and enemies are equal at the starting state of symmetric heterogeneous problems, they become asymmetric as the game runs, because the two sides are attacking and killing each other. In [31], the authors describe a situation (Proposition 5) in which the policy of an algorithm may be trapped in a sub-optimal state due to the complexity of heterogeneity. Therefore, it is necessary to study the heterogeneous MARL problem comprehensively and carefully.

Previous algorithms have achieved good performance in most symmetric homogeneous, symmetric heterogeneous, and asymmetric homogeneous maps from the original SMAC map set. However, experiments show that even state-of-the-art algorithms cannot achieve a high winning-rate (WR) in asymmetric heterogeneous maps, indicating that the combination of asymmetry and heterogeneity brings additional complexity. Therefore, to fully study the heterogeneity problem, it is beneficial to enrich the SMAC environment with more asymmetric heterogeneous maps. Recent mainstream approaches use policy-based actor-critic algorithms to solve the heterogeneous MARL problem with individual agent policies [32, 33]. Other papers discussing heterogeneity mainly concern multi-agent robotic systems, such as [29, 30], which differ slightly from MARL research.

For example, in the multi-agent area search problem proposed in [29], multi-agent robotic methods usually model the problem in detail with appropriate mathematical structures and then propose a solution. In contrast, MARL approaches usually model the problem as a POMDP (see the section “Preliminaries” for details) and design a proper reward function for the environment. The goal is to learn an optimal policy function that decides the best actions for all states.

In particular, we point out that previous approaches lack a formal definition of heterogeneity. A natural description of the heterogeneity problem is that the action spaces of agents are different and parameter-sharing among different agents is limited or prohibited. However, such a description is not detailed enough for further study. In [34], the authors describe and classify Physical and Behavioral heterogeneity in natural language instead of with mathematical definitions. It is easy for humans to realize that planes and cars are heterogeneous. However, it is still necessary to analyze heterogeneity with a formal definition, so that we can identify which property differs, why the MASs must be treated differently, and which type of heterogeneity a MAS possesses. Based on the definition and classification, we can further quantify and solve the heterogeneity problem.

Considering the generation process of a transition tuple, we conclude that heterogeneity in MARL mainly occurs in three components of the tuple: Local Reward, Local Observation, and Local Transition. In this paper, we focus on the cooperative Local Transition Heterogeneity (LTH) MARL problem, in which cooperation happens among different types of agents. When the number of ally agents changes, the ratio of different agent types may also change, and thus the optimal cooperating policy is affected. This change increases the diversity and complexity of the LTH problem.

A natural solution for LTH is grouping. An agent is assigned to a specific group according to a certain property, and it remains a permanent member of that group as long as the scenario is unchanged. The grouping process simplifies and stabilizes the determination of group members and the usage of different group policies, making it easier to add inter-group mechanisms between group policies. In addition, grouping helps to maintain a proper structure for parameter-sharing, which improves cooperation through homophily [35]. As a result, choosing an appropriate grouping property becomes an important problem in LTH problems.

In this paper, we propose the GIGM consistency and the GHQ algorithm to solve the LTH problem in the SMAC environment. First, to leverage the benefits of value-based methods and grouping methods, we need to generalize the Individual-Global-Max (IGM) consistency [36] to grouped situations. Therefore, we construct the Grouped Individual-Global-Max (GIGM) consistency and a condition to test whether a grouping method satisfies GIGM. Second, we propose Grouped Hybrid Q-Learning (GHQ). Agents are partitioned into groups following the Ideal Object Grouping (IOG) method. Each group has its own isolated network parameters, and parameters are only shared among group members. A novel hybrid structure for value factorization is proposed to optimize and reduce computation. Furthermore, a variational lower bound of the inter-group mutual information (IGMI) is introduced to increase the correlation between groups for better cooperation. Third, we test GHQ on our new asymmetric heterogeneous maps. Results show that GHQ outperforms the baseline algorithms with higher WR and better learning curves, and that the cooperative policy learned between GHQ groups differs significantly from those of the baselines. The main contributions of this paper are:

  • As far as we know, we are the first to propose the Local Transition Heterogeneity (LTH) problem with a formal definition.

  • We analyze the properties of the LTH problem and design new asymmetric heterogeneous SMAC maps to comprehensively study the LTH problem.

  • We propose the GIGM consistency and the GHQ algorithm to solve the LTH problem in SMAC.

  • We run comparison and ablation experiments to prove the effectiveness of the GHQ algorithm.

The rest of the paper is organized as follows: we summarize related works in the section “Related works”; we define LTH and theoretically analyze it in SMAC in the section “Local transition heterogeneity”; we provide details of the GHQ algorithm in the section “Methods”; we present the environmental and experimental design and discuss the results of our experiments in the section “Experiments and results”; and finally, we draw conclusions in the section “Conclusion”.

Table 1 SMAC original maps

Related works

Multi-agent reinforcement learning

Following the centralized training with decentralized execution (CTDE) paradigm [37,38,39], which requires agents not to use the state s during execution, recent approaches have achieved impressive results in the SMAC environment. The mainstream value-based method is value factorization. Its objective is to learn a centralized yet factorized joint action-value function \(Q_{tot}\) and the factorization structure \(Q_{tot} \rightarrow Q_i\), and to use them to calculate the TD-error and guide the optimization of agent policies

$$\begin{aligned} {\mathcal {L}}(\theta ) = {\mathbb {E}}_{\mathcal {D}} \Big [\big (r+\gamma \max _{{\varvec{a}}'} Q_{tot}^{tgt}(\varvec{\tau }', s';\theta ^{tgt}) - Q_{tot}(\varvec{\tau }, s;\theta )\big )^2\Big ] \end{aligned}$$
(1)

where \({\mathbb {E}}_{\mathcal {D}}\) means sampling a batch of tuples \((\varvec{\tau }, s, r)\) from the replay buffer \({\mathcal {D}}\) and calculating the expectation across the batch. \(\varvec{\pi }\) is the action policy, which in value-based algorithms is commonly the \(\epsilon \)-greedy or argmax policy of the Q function. \(Q_{tot}^{tgt}\) is the target function of \(Q_{tot}\). \(\theta \) and \(\theta ^{tgt}\) are the network parameters of \(Q_{tot}\) and \(Q_{tot}^{tgt}\), respectively. To factorize \(Q_{tot}\) and use the argmax policy of \(Q_i\) to select actions, the Individual-Global-Max (IGM) consistency [36] is required
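The following PyTorch sketch illustrates how a TD loss of the form in (1) is typically computed in value-factorization methods. The mixing callables `q_tot` and `q_tot_target`, the batch layout, and all tensor names are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch

def td_loss(q_tot, q_tot_target, batch, gamma=0.99):
    # batch tensors (illustrative): reward/done (B,), state/next_state (B, state_dim),
    # chosen_q (B, n_agents) with Q_i of the chosen actions,
    # next_q_all (B, n_agents, n_actions) with Q_i of all next actions.
    chosen_q_tot = q_tot(batch["chosen_q"], batch["state"])            # Q_tot(tau, s)
    next_max_q = batch["next_q_all"].max(dim=-1).values                # max_a' Q_i(tau', a')
    with torch.no_grad():
        target_q_tot = q_tot_target(next_max_q, batch["next_state"])   # Q_tot^tgt(tau', s')
        y = batch["reward"] + gamma * (1 - batch["done"]) * target_q_tot
    return ((y - chosen_q_tot) ** 2).mean()                            # expectation over the batch
```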

$$\begin{aligned} \arg \max _{{\varvec{a}}} Q_{tot}(\varvec{\tau }, s) = \begin{pmatrix} \arg \max _a Q_{1}(\tau _1)\\ ..., \\ \arg \max _a Q_k(\tau _k) \end{pmatrix}. \end{aligned}$$
(2)

VDN [40] represents \(Q_{tot}\) as the sum of local \(Q_i\) functions. QMIX [20] changes the factorization structure from additivity to monotonicity, and the fine-tuned version of QMIX has been proved to be one of the best algorithms on the original SMAC maps [21]. Based on these two fundamental algorithms, QTRAN [36], WQMIX [41], Qatten [42], and QPLEX [43] improve performance with modified value factorization mechanisms. Qauxi [44] introduces auxiliary tasks to generate meta-experience for transfer learning. CDCR [45] calculates Cognition Differences between agents with the attention mechanism and learns Consistent Representations of agents’ hidden states to enhance cooperation. Trans_mix [46] uses a transform network to solve the misalignment of partially observable values. BRGR [47] uses Bidirectional Real-Time Gain Representation to learn an overall information representation and a neighbor information representation, and combines them with other value-based algorithms for better cooperation.

Heterogeneous MARL has often been treated as a special case of homogeneous MARL and handled with individual policy networks. HAPPO [32], in which the H stands for heterogeneous, lacks a specific analysis and sufficient experiments on heterogeneity. In other fields of MAS, [48] uses Relative Needs Entropy (RNE) to build a trust model that improves cooperation in heterogeneous multi-robot grouping tasks, and [49] contributes a novel method for heterogeneous multi-robot assembly planning.

Grouping method

Grouping is a natural idea for complex or large-scale problems and is widely used in optimization and machine learning research. In the SMAC environment, THGC [50] divides agents into different groups based on their different “types” for knowledge sharing and group communication. However, it is necessary to formally define and describe the difference between agent types in a way that is universal across environments. In this paper, we introduce several auxiliary definitions for describing our grouping method.

Liu et al. [51] use a channel grouping algorithm to cluster different sub-regions of pictures for vehicle Re-ID. Li et al. [52] introduce a ranking-based grouping method to improve a multi-population-based differential evolution algorithm. Cheng et al. [53] propose a grouping attraction model, which significantly reduces the number of attractions and fitness comparisons in the firefly algorithm. Li et al. [54] modify the Transformer encoder by organizing encoder layers into multiple groups and connect these groups via a grouping skip connection mechanism. Rotman et al. [55] enhance Optimal Sequential Grouping (OSG) to solve the video scene detection problem. Hou et al. [56] propose FedEntropy for better dynamic device grouping in federated learning. Al Faiya et al. [57] introduce an enhanced decentralized autonomous aerial swarm system with group planning. [58] designs a self-organizing MAS for distributed voltage regulation in the smart power grid.

Mutual information

Computing the variational bound of mutual information (MI) has been proven to enhance cooperation in MARL. MAVEN [59] maximizes a variational lower bound of the MI between the latent variable z and the agent-specific Boltzmann policy \(\sigma (\tau )\) to encourage exploration of the algorithm. ROMA [60] computes two MI-related losses to learn identifiable and specialized role policies. PMIC [61] maintains positive and negative trajectory memories to compute the upper bound and lower bound of the MI between global state s and joint action \({\varvec{a}}\). MAIC [62] maximizes the MI between the trajectory of agent i and the ID of another agent j for teammate modeling and communication. CDS [63] maximizes the MI between the trajectory \(\tau _i\) of agent i and its own agent ID to maintain diverse individual local Q functions.

Local transition heterogeneity

In this section, our goal is to give a formal definition of the Local Transition Heterogeneity (LTH) problem and analyze its existence in SMAC. We first present fundamental concepts and definitions in the section “Preliminaries”. Next, we define auxiliary concepts and the Local Transition Function (LTF) needed for the formal definition of LTH in the section “Auxiliary definitions and the local transition function”. These definitions isolate one specific agent i in an ideal scenario, so that we can study the properties of agent i that affect the LTH problem. Then, in the section “Definition of local transition heterogeneity”, we define the LTH problem and show the advantage of our definition. Finally, we derive two properties for proving the existence of the LTH problem and analyze the existence of LTH in SMAC in the section “Existence of LTH in SMAC”.

Preliminaries

In this paper, we study cooperative MARL problems that can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [64]. The problem is described with a tuple \(G=\left\langle S, {\varvec{A}}, {\varvec{P}}, R, \varvec{\Omega }, O; \gamma , K, T \right\rangle \). \(s \in S\) denotes the true state of the environment with complete information, \(K= \{ 1,..., k \} \) denotes the finite set of k agents, and \(\gamma \in [0,1)\) is the discount factor. At each time-step \(t \le T\), agent \(i \in K\) receives an individual partial observation \(o_i^t\) and chooses an action \(a_i^t \in A_i\) from the local action set \(A_i\), with the local action-dim \(|A_i|\). The actions of all agents form a joint action \({\varvec{a}}^t = (a_1^t,..., a_k^t) \in {\varvec{A}} = (A_1,..., A_k)\). The environment receives the joint action \({\varvec{a}}^t\) and returns a next-state \(s^{t+1}\) according to the joint transition function \({\varvec{P}} (s^{t+1}|s^{t}, {\varvec{a}}^t)\), and a reward \(r^t = R(s, {\varvec{a}}^t)\) shared by all agents. The joint observation \({\varvec{o}}^t = (o_1^t,..., o_k^t) \in \varvec{\Omega }\) is generated according to the observation function \(O^t (s^t, i)\). The observation-action trajectory history \(\tau ^t = \cup _1^t \{ (o^{t-1}, a^{t-1})\}\) (\(t \ge 1; \tau ^0 = o^0\)) summarizes the partial transition tuples before t. Specifically, \(\varvec{\tau }_i = \varvec{\tau }_i^T\) denotes the overall trajectory of agent i through all time-steps \(t \le T\). The replay buffer \({\mathcal {D}} = \cup (\varvec{\tau }, s, r)\) stores all data for batch sampling. Network parameters are denoted by \(\theta \) and \(\psi \).
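To make the notation concrete, the following minimal Python sketch shows one way the transition tuples and the replay buffer \({\mathcal {D}}\) described above could be represented; the class and field names are illustrative assumptions rather than the data layout actually used in SMAC implementations.

```python
from dataclasses import dataclass, field
from typing import List
import random

@dataclass
class Transition:
    state: List[float]        # true state s^t (used only during centralized training)
    obs: List[List[float]]    # joint observation o^t = (o_1^t, ..., o_k^t)
    actions: List[int]        # joint action a^t = (a_1^t, ..., a_k^t)
    reward: float             # shared reward r^t = R(s^t, a^t)
    done: bool                # terminal flag

@dataclass
class ReplayBuffer:
    capacity: int = 5000                                         # episodes kept in the buffer
    episodes: List[List[Transition]] = field(default_factory=list)

    def add(self, episode: List[Transition]) -> None:
        self.episodes.append(episode)
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)

    def sample(self, batch_size: int = 32) -> List[List[Transition]]:
        return random.sample(self.episodes, batch_size)
```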

Auxiliary definitions and the local transition function

Apart from the joint transition function \({\varvec{P}} (s^{t+1}|s^{t}, {\varvec{a}}^t)\), we need to define the Local Transition Function (LTF) \(P_i (s^{t+1}|s^{t}, a_i^t)\) for the definition of LTH problem. Several auxiliary definitions are given for better demonstration and analysis of LTF and LTH.

First, we partition the actions \(A_i\) of an agent into three types: common actions \(A_{com}\), which only affect agent i itself, e.g., moving, scanning, and transforming; interactive actions \(A_{act}\), which interact with other agents, e.g., attacking, guiding, and delivering; and mixing actions \(A_{mix}\), which affect both the agent itself and others, e.g., a predator moving close to a prey for automatic predation. Usually, \(A_{mix}\) can be divided into a combination of \(A_{com}\) and \(A_{act}\); e.g., the mixing action of automatic predation can be divided into the common action of moving and the interactive action of predating. For terminological simplicity, we divide \(A_{mix}\) into the combination of \(A_{act}\) and \(A_{com}\) by default, and focus on \(A_{com}\) and \(A_{act}\).

Second, we introduce the joint available-action-mask matrix \(\varvec{AM} (s^t)\) and the local available-action-mask vector \(AM_i (s^t, i)\), which are common components in many MARL environments. \(\varvec{AM} (s^t)\) is a binary matrix of dimension \(|A_i| \times k\), indicating the available actions of all agents at state \(s^t\). \(AM_i (s^t, i)\) is the i-th column vector of \(\varvec{AM} (s^t)\), indicating the mask vector of a certain agent. Element 1 (true) at \((a_i^t, i)\) of \(\varvec{AM} (s^t)\) means that agent i can take action \(a_i^t\) at \(s^t\), and vice versa.
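The following short sketch shows the usual way such a mask is applied before greedy action selection; the tensor values and the helper name are illustrative.

```python
import torch

def masked_argmax(q_values: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """q_values: (n_agents, n_actions); action_mask: same shape, 1 = available."""
    masked_q = q_values.clone()
    masked_q[action_mask == 0] = -float("inf")   # unavailable actions can never be selected
    return masked_q.argmax(dim=-1)               # greedy action per agent

# Example: agent 0 may only take actions {0, 2}; agent 1 may take all three.
q = torch.tensor([[0.3, 0.9, 0.1], [0.2, 0.5, 0.7]])
mask = torch.tensor([[1, 0, 1], [1, 1, 1]])
print(masked_argmax(q, mask))  # tensor([0, 2]): agent 0 cannot choose its highest-valued action
```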

Finally, we define the Ideal Object (IO) and the Ideal Condition (IC), and then define the LTF, \(P_i (s^{t+1}|s^{t}, a_i^t)\).

Definition 1

Ideal Object (IO) and Ideal Condition (IC): The Ideal Object \((IO_i)\) of agent i is an action object on which any \(A_{act}\) of agent i can be applied. The Ideal Condition \((IC_i)\) of agent i is the environmental condition that keeps the local available-action-mask vector \(AM_i (s^t, i)\) all true for any state \(s^t\) and any action \(a_i^t\) applied on \(IO_i\).

Definition 2

Local Transition Function (LTF): For agent i with its \(IO_i\) and \(IC_i\), the Local Transition Function (LTF) \(P_i (s^{t+1}|s^t, a_i^t)\) is the probability distribution of next-state \(s^{t+1}\) conditioned by state \(s^t\) and action \(a_i^t\). The action \(a_i^t\) is applied on \(IO_i\) under \(IC_i\).

Definition of local transition heterogeneity

In general, the Local Transition Heterogeneity (LTH) means that agents cannot reach the same next-state \(s^{t+1}\) from the same state \(s^{t}\), no matter what policies they are using. A formal definition is given below.

Definition 3

Local Transition Heterogeneity (LTH): Let there be two agents \(i,j \in K\). Their policies are \(\pi _i(a_i|s)\) and \(\pi _j(a_j|s)\), and their LTFs are \(P_i (s^{t+1}|s^{t}, a_i^t)\) and \(P_j (s^{t+1}|s^{t}, a_j^t)\). A certain state \(s^t\), which simultaneously fulfills \(IC_i\) and \(IC_j\), is the starting state. The sets of next-states \(\{s^{t+1}_i | s^{t}, \pi _i, P_i\}\) and \(\{s^{t+1}_j | s^{t}, \pi _j, P_j\}\) are generated by \(\pi _i\) and \(\pi _j\) individually executed on \(s^{t}\) toward their corresponding \(IO_i\) and \(IO_j\). If the intersection of the two sets of next-states is empty for all available policies, then the MARL problem has LTH

$$\begin{aligned} \begin{aligned}&\exists \ i,j \in K,\ \forall \ \pi _i, \pi _j,\\&\{s^{t+1}_i | s^{t}, \pi _i, P_i\} \cap \{s^{t+1}_j | s^{t}, \pi _j, P_j\} = \emptyset . \end{aligned} \end{aligned}$$
(3)

For example, consider an MAS consisting of an Unmanned Aerial Vehicle (UAV) and an Unmanned Ground Vehicle (UGV). Suppose the UAV and the UGV carry different mission cargo, so their \(A_{act}\) are different. Their moving speeds and moving dimensions (2-D vs. 3-D) are also different, so their \(A_{com}\) are different. From the same starting state \(s^{t}\), the UAV and the UGV cannot reach the same next-state \(s^{t+1}\), because their action spaces are completely different. Therefore, the existence of the LTH problem in such an MAS is clear.

An advantage of our definition is the reliability of presenting heterogeneity. We define the LTH problem under the restriction of IO and IC. Our core motivation is to ensure that the local available-action-mask vector \(AM_i (s^t, i)\) remains all true, because \(AM_i (s^t, i)\) can influence the behavior of agents and thus affect the existence of LTH. For instance, if all enemies choose the policy “attack and eliminate agent i at the \(1^{st}\) time-step”, then \(AM_i\) would only have the “dead-action” available, since agent i is always dead from the \(1^{st}\) time-step. In that case, it is impossible for agent i to present LTH. Similarly, ally agents’ actions and policies can also affect \(AM_i\) and lead to the same result. In conclusion, our definition avoids unexpected influence from enemies or allies on \(AM_i\), and is capable of presenting LTH reliably.

Table 2 Unit information in SMAC

Existence of LTH in SMAC

The original definition in formula (3) is inconvenient for judging whether an environment has LTH. We further conclude that a difference in IO or LTF determines the existence of LTH. First, different IOs lead to qualitative LTH. For example, in a UAV–UGV system with different mission cargo, the IO of a UAV is defined to be another UAV, while the IO of a UGV is defined to be another UGV. Their objects and the functionalities of their \(A_{act}\) are different, leading to LTH. Generally, a different interactive action-dim \(|A_{act}|\) is sufficient to prove a difference in IO, and can therefore be used to prove the existence of LTH. Second, different LTFs lead to quantitative LTH. For example, in a UAV–UGV system with the same mission cargo, the moving speeds are still different: UAVs typically fly faster in the air than UGVs move on the ground. The difference in the dynamics of \(A_{com}\) or \(A_{act}\) leads to different LTFs, and thus LTH occurs.

In SMAC, there are two agent types: supporting units \(U_{spt}\) and attacking units \(U_{atk}\). \(U_{spt}\) can only affect allies, while \(U_{atk}\) can only affect enemies. For example, the Medivac is a \(U_{spt}\) that can only heal allies, while the Marine is a \(U_{atk}\) that can only attack enemies (see Table 2). In SMAC, \(A_{com}\) consists of moving and stopping, available to all living agents at any state s and any time-step t. The common action-dim \(|A_{com}|\) is also identical across all agent types. \(A_{act}\) consists of attacking or healing; a given agent type can only attack enemies or heal allies. Therefore, \(|A_{act}|\) differs between agent types.

First, the IOs of \(U_{atk}\) and \(U_{spt}\) are different, leading to qualitative LTH. For \(U_{atk}\), the IO is an enemy, and its \(|A_{act}|\) is the total number of enemies. For \(U_{spt}\), the IO is an ally, so its \(|A_{act}|\) is the total number of allies. Second, the moving speed, shot-range, and damage-per-gaming-second (DPS) differ between agent types (see Table 2), indicating the existence of quantitative LTH. In conclusion, the existence of LTH in SMAC is clear, and further analysis and study of LTH are therefore required.
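The two checks above can be summarized in a small sketch; the unit attributes below are illustrative placeholders, not exact SMAC values, and the helper name is hypothetical.

```python
def has_lth(unit_a: dict, unit_b: dict) -> bool:
    # Qualitative LTH: different ideal objects (e.g., targets enemies vs. targets allies).
    qualitative = unit_a["io_type"] != unit_b["io_type"]
    # Quantitative LTH: different local dynamics of A_com or A_act.
    quantitative = (unit_a["move_speed"] != unit_b["move_speed"]
                    or unit_a["shot_range"] != unit_b["shot_range"]
                    or unit_a["dps"] != unit_b["dps"])
    return qualitative or quantitative

marine  = {"io_type": "enemy", "move_speed": 3.15, "shot_range": 5, "dps": 9.8}   # illustrative
medivac = {"io_type": "ally",  "move_speed": 3.50, "shot_range": 4, "dps": 0.0}   # illustrative
print(has_lth(marine, medivac))  # True: different IO and different dynamics
```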

Methods

Grouped individual-global-max consistency

As shown in the section “Definition of local transition heterogeneity”, LTH does not change the reward function \(R(s, {\varvec{a}})\) or the available-action-mask. Therefore, any available joint action \({\varvec{a}}\) is rewarded the same as in homogeneous scenarios, and the optimal joint action \({\varvec{a}}^*\) is not affected. As a result, the IGM consistency still holds under LTH, and we can generalize it to a “grouped” situation for solving LTH problems with grouped value factorization.

Definition 4

Grouped IGM Consistency (GIGM): Let there be \(U= \left\{ 1,..., u \right\} , \ (u<k)\) agent groups in total. An agent group \({\mathcal {G}}_m\ (m \in U)\) consists of arbitrarily pre-defined agents. If the argmax operation performed on the joint function \(Q_{tot}\) yields the same result as the set of individual argmax operations performed on all group functions \(Q_{{\mathcal {G}}_m}\ (m \in U)\), and the argmax operation performed on each group function \(Q_{{\mathcal {G}}_m}\) yields the same result as the set of individual argmax operations performed on the agent functions \(Q_i\ (i \in {\mathcal {G}}_m)\), then GIGM holds true

$$\begin{aligned} \arg \max _{{\varvec{a}}} Q_{tot}(\varvec{\tau }, s) = \begin{pmatrix} \arg \max _{{\varvec{a}}} Q_{{\mathcal {G}}_1}(\varvec{\tau }_{{\mathcal {G}}_1}, s)\\ ..., \\ \arg \max _{{\varvec{a}}} Q_{{\mathcal {G}}_u}(\varvec{\tau }_{{\mathcal {G}}_u}, s) \end{pmatrix} = \begin{pmatrix} \arg \max _a Q_1(\tau _1)\\ ..., \\ \arg \max _a Q_k(\tau _k) \end{pmatrix}, \end{aligned}$$
(4)
$$\begin{aligned} \arg \max _{{\varvec{a}}} Q_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m}, s) = \begin{pmatrix} \arg \max _a Q_i(\tau _i)\\ i \in {\mathcal {G}}_m \end{pmatrix}, \end{aligned}$$
(5)

where \(\varvec{\tau }_{{\mathcal {G}}_m} = \cup _{i \in {\mathcal {G}}_m}\{\tau _i\}\) is the group trajectory of \({\mathcal {G}}_m\), and \(\varvec{\tau } = \cup _{i \in K}\{\tau _i\}\) is the global joint trajectory of all agents.

Furthermore, we state a theorem whose conditions are sufficient for GIGM:

Theorem 1

Joint Trajectory Condition (JTC): GIGM holds true if the following two conditions are simultaneously satisfied:

  1. (i)

    The global joint trajectory is equivalent to the union of all group trajectories

    $$\begin{aligned} \varvec{\tau } = \cup _{i \in K}\ \{\tau _i\} = \cup _{m \in U}\ \{\varvec{\tau }_{{\mathcal {G}}_m}\}. \end{aligned}$$
    (6)
  2. (ii)

    The intersection of all group trajectories is empty

    $$\begin{aligned} \cap _{m \in U}\ \{\varvec{\tau }_{{\mathcal {G}}_m}\} = \emptyset . \end{aligned}$$
    (7)

The first condition guarantees the transitivity of the argmax operations performed on the Q functions. The second condition guarantees the coexistence of the argmax operations on all \(Q_{{\mathcal {G}}_m}\). Together, the two conditions guarantee the equivalence of the argmax operations on all group Q and agent Q functions

$$\begin{aligned} \arg \max _{{\varvec{a}}} \ Q_{{\mathcal {G}}_m}\ (m \in U)\ =\ \arg \max _a \ Q_i\ (i \in K). \end{aligned}$$
(8)
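A minimal sketch of checking the two JTC conditions for a candidate grouping, treating group trajectories as sets keyed by agent ID; the helper name is illustrative.

```python
def satisfies_jtc(all_agents: set, groups: list) -> bool:
    union_of_groups = set().union(*groups)
    covers_all = union_of_groups == all_agents                       # condition (i): union covers all agents
    disjoint = sum(len(g) for g in groups) == len(union_of_groups)   # condition (ii): groups do not overlap
    return covers_all and disjoint

agents = {0, 1, 2, 3, 4}
print(satisfies_jtc(agents, [{0, 1, 2}, {3, 4}]))     # True: a valid partition
print(satisfies_jtc(agents, [{0, 1, 2}, {2, 3, 4}]))  # False: agent 2 belongs to two groups
```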

Ideal object grouping

To utilize GIGM to solve the LTH problem, we propose Ideal Object Grouping (IOG), which partitions agents into different groups by their different ideal objects IO. As mentioned in the section “Grouping method”, we need to formally define, in a universal way, what it means for agents to be of different types. We point out that a difference in IO is equivalent to a difference in agent type, because fundamentally both reduce to a difference in the agent action space \({\varvec{A}}\), which is exactly the property that distinguishes heterogeneous agents. In general, our goal is to acquire a grouping function \(g(i, {{\mathcal {G}}_m})\ (i\in K,\ m\in U)\) for agent i and group \({\mathcal {G}}_m\)

$$\begin{aligned} g(i, {{\mathcal {G}}_m}) = {\left\{ \begin{array}{ll} 1 & \text {if } i \in {{\mathcal {G}}_m} \\ 0 & \text {otherwise} \end{array}\right. }. \end{aligned}$$
(9)

Each agent group \({\mathcal {G}}_m\) consists of agents with the same \(IO_{{\mathcal {G}}_m}\) and the same interactive action-dim \(|A_{act-{\mathcal {G}}_m}|\). Only one universal agent network is kept per group, which reduces the number of agent networks from k to u. Parameter-sharing is only allowed between agents within the same group. Maintaining a proper parameter-sharing structure not only avoids spending redundant computing resources on individual agent networks, but also increases in-group cooperation via homophily [35].

Moreover, IOG is a mapping function from agents to groups, \(g(i, {{\mathcal {G}}_m}): (K \rightarrow U)\). The \(|A_{act}|\) of each agent is assigned during the initialization of SMAC. As a result, a specific agent i can only be assigned to the group with \(IO_i = IO_{{\mathcal {G}}_m}\) and \(|A_{act_i}| = |A_{act-{\mathcal {G}}_m}|\). Therefore, JTC is satisfied and GIGM holds, indicating that IOG is an appropriate grouping method for value factorization.
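A small sketch of IOG under these assumptions, grouping agents by their IO and interactive action-dim; the agent records below are illustrative.

```python
from collections import defaultdict

def ideal_object_grouping(agents: list) -> dict:
    """agents: list of dicts with 'id', 'io', and 'n_act' fields (illustrative schema)."""
    groups = defaultdict(list)
    for agent in agents:
        group_key = (agent["io"], agent["n_act"])   # IO and |A_act| jointly identify the group
        groups[group_key].append(agent["id"])
    return dict(groups)

agents = [
    {"id": 0, "io": "enemy", "n_act": 16},  # attacking units target enemies
    {"id": 1, "io": "enemy", "n_act": 16},
    {"id": 2, "io": "ally",  "n_act": 8},   # supporting units target allies
]
print(ideal_object_grouping(agents))
# {('enemy', 16): [0, 1], ('ally', 8): [2]}
```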

Inter-group mutual information loss

To enhance inter-group cooperation and correlation, we maximize the Inter-Group Mutual Information (IGMI) between the trajectories of different groups, \({\varvec{\tau }_{{\mathcal {G}}_m}}\) and \({\varvec{\tau }_{{\mathcal {G}}_n}}\), written as \(I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})\). For encoding trajectories, a common implementation is to use the hidden states of a gated recurrent unit (GRU) [65], \(h_{{\mathcal {G}}_m}\) and \(h_{{\mathcal {G}}_n}\). Since the GRU takes \({\varvec{o}}_{{\mathcal {G}}_m}^t\) and \({\varvec{a}}_{{\mathcal {G}}_m}^t\) recursively for all \(t \le T\), we assume that \(h_{{\mathcal {G}}_m}\) is capable of encoding and representing \(\varvec{\tau }_{{\mathcal {G}}_m}\). After encoding, because the mutual information can only be calculated between two distributions, we add a Gaussian distribution layer to the agent network of every group, denoted \(l_{{\mathcal {G}}_m}\) and \(l_{{\mathcal {G}}_n}\). Therefore, calculating \(I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})\) can be converted into calculating \(I(l_{{\mathcal {G}}_m}; l_{{\mathcal {G}}_n} | h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})\). The detailed agent network structure is illustrated in Fig. 1a

$$\begin{aligned} h_{{\mathcal {G}}_m}&= GRU(\varvec{\tau }_{{\mathcal {G}}_m}) = GRU(\cup _{i \in {\mathcal {G}}_m}\{\tau _i\}),\\ l_{{\mathcal {G}}_m}&= Gaussian(h_{{\mathcal {G}}_m}),\\ I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})&=I(l_{{\mathcal {G}}_m}; l_{{\mathcal {G}}_n} | h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}). \end{aligned}$$
(10)

We further derive a lower bound of \(I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})\) for easier calculation

$$\begin{aligned} I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})&=I(l_{{\mathcal {G}}_m}; l_{{\mathcal {G}}_n} | h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}) \\&={\mathbb {E}}_{l_{{\mathcal {G}}_m}, l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}}\left[ \log \frac{p(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})}{p(l_{{\mathcal {G}}_m})}\right] \\&={\mathbb {E}}_{l_{{\mathcal {G}}_m}, l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}}\big [\log p(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}) - \log p(l_{{\mathcal {G}}_m}) + \log q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}) - \log q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})\big ] \\&={\mathbb {E}}_{l_{{\mathcal {G}}_m}, l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}}\left[ \log \frac{q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})}{p(l_{{\mathcal {G}}_m})} + \alpha \cdot D_{KL}\big (p(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}) \,||\, q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})\big )\right] , \end{aligned}$$
(11)

where \(\alpha = \frac{p(l_{{\mathcal {G}}_m})}{p(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})}\) is always non-negative, and \(D_{KL}\) is the KL-divergence, which is also non-negative. \(q_{{\mathcal {G}}_m}\) is an inference distribution of group \({\mathcal {G}}_m\) with parameters \(\psi _{{\mathcal {G}}_m}\), and is independent of \(h_{{\mathcal {G}}_n}\). To keep this independence, mixed inputs from different groups are forbidden, so we keep an individual inference network for each group. Finally, we have

$$\begin{aligned} I(\varvec{\tau }_{{\mathcal {G}}_m}; \varvec{\tau }_{{\mathcal {G}}_n})&\ge {\mathbb {E}}_{l_{{\mathcal {G}}_m}, l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n}}\left[ \log \frac{q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}, h_{{\mathcal {G}}_n})}{p(l_{{\mathcal {G}}_m})}\right] \\&={\mathbb {E}}_{l_{{\mathcal {G}}_m}, l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}}\left[ \log q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m})- \log p(l_{{\mathcal {G}}_m})\right] \\&={\mathbb {E}}_{l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}}\left[ -CE(p(l_{{\mathcal {G}}_m}) \,||\, q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m})) + {\mathcal {H}}(p(l_{{\mathcal {G}}_m}))\right] \\&=-{\mathbb {E}}_{l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}}\left[ D_{KL}(p(l_{{\mathcal {G}}_m}) \,||\, q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m}|l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}))\right] . \end{aligned}$$
(12)

To maximize the IGMI, the loss is written as

$$\begin{aligned} \begin{aligned}&{\mathcal {L}}_{MI_m}(\tau _{{\mathcal {G}}_m}; \tau _{{\mathcal {G}}_n} | \psi _{{\mathcal {G}}_m}) \\&={\mathbb {E}}_{\mathcal {D}} [D_{KL}(p(l_{{\mathcal {G}}_m})||q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m} | l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m}))]. \end{aligned} \end{aligned}$$
(13)
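A hedged PyTorch sketch of computing the IGMI loss in (13) with an inference network: the KL divergence between group m's latent Gaussian \(p(l_{{\mathcal {G}}_m})\) and the inference distribution \(q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m} | l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m})\). The network sizes, tensor shapes, and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class InferenceNet(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim + latent_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.log_std = nn.Linear(64, latent_dim)

    def forward(self, h_m: torch.Tensor, l_n_sample: torch.Tensor) -> Normal:
        # q_m(l_m | l_n, h_m): conditioned on own hidden state and the other group's latent sample
        x = self.mlp(torch.cat([h_m, l_n_sample], dim=-1))
        return Normal(self.mu(x), self.log_std(x).exp())

def igmi_loss(p_l_m: Normal, h_m, l_n_sample, inference_net: InferenceNet) -> torch.Tensor:
    q_m = inference_net(h_m, l_n_sample)
    # E_D[ D_KL(p(l_m) || q_m(l_m | l_n, h_m)) ], summed over latent dims, averaged over the batch
    return kl_divergence(p_l_m, q_m).sum(dim=-1).mean()
```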
Fig. 1
figure 1

An overall framework of GHQ. \(\varvec{\theta }_{{\mathcal {G}}_m}\) of group m consists of three parts: the agent network \(\theta _i\), the mixing network \(\theta _{M_m}\), and the inference network \(\varvec{\psi }_{{\mathcal {G}}_m}\). The detailed data-stream for training and execution is shown in Fig. 2. In (a), \(\theta _i\) takes \(o_i^t\) and \(a_i^{t-1}\) as input. It generates \(Q_i\) for choosing actions, and \(l_{{\mathcal {G}}_m}\) and \(h_{{\mathcal {G}}_m}\) for calculating \(q_{{\mathcal {G}}_m}\). In (b), \(\theta _{M_m}\) takes Q and s to calculate the TD loss \({\mathcal {L}}_{TD_m}\) with hybrid factorization. In (c), \(\varvec{\psi }_{{\mathcal {G}}_m}\) takes \(l_{{\mathcal {G}}_m}\), \(h_{{\mathcal {G}}_m}\), and \(l_{{\mathcal {G}}_n}\) to calculate the IGMI loss \({\mathcal {L}}_{MI_m}\)

Algorithm 1
figure a

GHQ

Grouped hybrid Q-learning

A straightforward idea for calculating \(Q_{{\mathcal {G}}_m}\) and \(Q_i\) is to design factorization structures for \(Q_{tot} \rightarrow Q_{{\mathcal {G}}_m}\) and \(Q_{{\mathcal {G}}_m} \rightarrow Q_i\). Let \({\mathcal {C}}_{{\mathcal {G}}_m}\) and \({\mathcal {C}}_i\) be the two factorization functions. As with IGM [66], the monotonicity constraint is sufficient for GIGM. Therefore, \({\mathcal {C}}_{{\mathcal {G}}_m}\) and \({\mathcal {C}}_i\) can be written as

$$\begin{aligned} \frac{\partial Q_{tot}(\varvec{\tau }, s)}{\partial Q_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m}, s)} = {\mathcal {C}}_{{\mathcal {G}}_m} \ge 0,\quad m \in U, \end{aligned}$$
(14)
$$\begin{aligned} \frac{\partial Q_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m}, s)}{\partial Q_i(\tau _i)} = {\mathcal {C}}_i \ge 0,\quad i \in {\mathcal {G}}_m. \end{aligned}$$
(15)

Our key insight is that explicitly calculating \({\mathcal {C}}_{{\mathcal {G}}_m}\) and \(Q_{{\mathcal {G}}_m}\) in this way is unnecessary. Instead of hierarchical factorization, we apply independent Q-learning (IQL) [67] for \({\mathcal {C}}_{{\mathcal {G}}_m}\), which we call hybrid factorization. This makes \(Q_{{\mathcal {G}}_m}\) an action-value function instead of a utility function [66, 68], and \({\mathcal {C}}_{{\mathcal {G}}_m}\) a positive constant. As a result, the TD loss of group \({\mathcal {G}}_m\) is written as

$$\begin{aligned} {\mathcal {L}}_{TD_m}(\varvec{\theta }_{{\mathcal {G}}_m})&= {\mathbb {E}}_{\mathcal {D}} \left[ (y^{{\mathcal {G}}_m}-Q_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m},s;\varvec{\theta }_{{\mathcal {G}}_m}))^2\right] ,\\ y^{{\mathcal {G}}_m}&= r+\gamma \max _{{\varvec{a}}'} Q_{{\mathcal {G}}_m}^{tgt}(\varvec{\tau }_{{\mathcal {G}}_m}',s';\varvec{\theta }^{tgt}_{{\mathcal {G}}_m}), \end{aligned}$$
(16)

where \(y^{{\mathcal {G}}_m}\) is the TD-target of \(Q_{{\mathcal {G}}_m}\), \(Q_{{\mathcal {G}}_m}^{tgt}\) is the target Q function of \(Q_{{\mathcal {G}}_m}\), and \(\varvec{\theta }^{tgt}_{{\mathcal {G}}_m}\) and \(\varvec{\theta }_{{\mathcal {G}}_m}\) are the network parameters of \(Q_{{\mathcal {G}}_m}^{tgt}\) and \(Q_{{\mathcal {G}}_m}\), respectively. The group network \(\varvec{\theta }_{{\mathcal {G}}_m}\) consists of two parts, the agent network \(\theta _i\) and the mixing network \(\theta _{M_m}\). Their losses are calculated with backward propagation following (15):

$$\begin{aligned} \frac{\partial {\mathcal {L}}_{TD_m}(\varvec{\theta }_{{\mathcal {G}}_m})}{\partial \theta _i} = \frac{\partial {\mathcal {L}}_{TD_m}(\varvec{\theta }_{{\mathcal {G}}_m})}{\partial \theta _{M_m}} \cdot \frac{\partial \theta _{M_m}}{\partial \theta _i}. \end{aligned}$$
(17)

GIGM and the input of state information keep the different \(Q_{{\mathcal {G}}_m}\) correlated, and IGMI further enhances this correlation. Even though IQL methods suffer from non-stationarity [69], GHQ overcomes this disadvantage and achieves impressive results. The hybrid factorization avoids calculating the hierarchical factorization function. Although the IQL value of \(Q_{{\mathcal {G}}_m}\) following (16) does not equal the factorized value of \(Q_{{\mathcal {G}}_m}\) following (14), the monotonicity of the factorization and GIGM still hold. As a result, the optimal policy of GHQ converges to the same optimal policy provided by the fully factorized structure.

Fig. 2
figure 2

An overview of the data-stream of GHQ. During decentralized execution, the agent networks \(\theta _i\) and \(\theta _j\) generate \(Q_i^t\) and \(Q_j^t\) for choosing actions \(a_i^t\) and \(a_j^t\), respectively. The input of \(\theta _i\) is the local observation \(o_i^t\) and the last action \(a_i^{t-1}\) of agent i in group \({\mathcal {G}}_m\). All necessary transition tuples \((\varvec{\tau }, s, r)\) are stored in the replay buffer \({\mathcal {D}}\). During centralized training, a batch of trajectories \(\varvec{\tau }_{{\mathcal {G}}_m}\) is sampled from \({\mathcal {D}}\) as the input of \(\theta _i\) for calculating \({\varvec{Q}}_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m})\). Then, the mixing network \(\theta _{M_m}\) takes \({\varvec{Q}}_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m})\) and the state s for calculating \({\varvec{Q}}_{{\mathcal {G}}_m}(\varvec{\tau }_{{\mathcal {G}}_m}, {\varvec{s}})\) and the TD loss \({\mathcal {L}}_{TD_m}\). The GRU hidden states \(h_{{\mathcal {G}}_m}\), \(h_{{\mathcal {G}}_n}\) and the Gaussian distributions \(l_{{\mathcal {G}}_m}\), \(l_{{\mathcal {G}}_n}\) are generated from the agent networks \(\theta _i\) and \(\theta _j\), and are used to calculate the IGMI losses \({\mathcal {L}}_{MI_m}\) and \({\mathcal {L}}_{MI_n}\). Detailed formulas are shown in the section “Inter-group mutual information loss”

Implementing details and network architecture

The detailed network architecture is illustrated in Fig. 1, an overview of the data-stream of GHQ is illustrated in Fig. 2, and the pseudo-code of GHQ is given in Algorithm 1. As shown in Figs. 1 and 2, there are three kinds of networks, marked with different colors. The agent network \(\theta _{i}\) is marked in green and is shared by all agents \((i \in {\mathcal {G}}_m)\). \(\theta _{i}\) receives the current observation \(o_i^t\) and the last action \(a_i^{t-1}\), and generates \(Q_i^t\). The input is first sent to a multi-layer perceptron (MLP) and then a GRU layer. The GRU hidden state \(h_{{\mathcal {G}}_m}\) is sent to the following layers and to the next time-step. The following layer is a Gaussian distribution layer generating \(l_{{\mathcal {G}}_m}\) from \(h_{{\mathcal {G}}_m}\); \(l_{{\mathcal {G}}_m}\) is then sampled and sent to the next two MLP layers. Finally, a skip connection sends \(h_{{\mathcal {G}}_m}\) directly to the final MLP layer, where it is concatenated with the output of the former MLP layer to generate \(Q_i\).
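A hedged PyTorch sketch of this agent network; the layer sizes, activation placement, and class name are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GroupAgentNet(nn.Module):
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.fc_in = nn.Linear(input_dim, hidden_dim)                 # takes [o_i^t, a_i^{t-1}]
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)                   # Gaussian layer -> l_{G_m}
        self.log_std = nn.Linear(hidden_dim, latent_dim)
        self.fc_mid = nn.Linear(latent_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim + hidden_dim, n_actions)   # skip connection with h_{G_m}

    def forward(self, obs_last_act: torch.Tensor, h_prev: torch.Tensor):
        x = torch.relu(self.fc_in(obs_last_act))
        h = self.gru(x, h_prev)                    # h_{G_m}, also carried to the next time-step
        dist = Normal(self.mu(h), self.log_std(h).exp())
        latent = dist.rsample()                    # sampled latent l_{G_m}
        mid = torch.relu(self.fc_mid(latent))
        q = self.fc_out(torch.cat([mid, h], dim=-1))   # Q_i for action selection
        return q, h, dist
```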

Table 3 The hyper-parameters of GHQ

The mixing network \(\theta _{M_m}\) is marked in blue. It takes all \([Q_i]\) of the group \({\mathcal {G}}_m\) as input and mixes them with the state s to produce \(Q_{{\mathcal {G}}_m}\). Four hyper-networks generate the weights and biases \((w_1, b_1, w_2, b_2)\) from s, and only the absolute values of the weights are used. The weights and biases are applied to the joint \([Q_i]\) layer by layer, and the intermediate results are activated to be non-negative, fulfilling the GIGM requirements.
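A sketch of such a monotonic mixing network in the spirit of QMIX-style hyper-networks; the layer sizes, activation choice, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupMixer(nn.Module):
    def __init__(self, n_agents_in_group: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n = n_agents_in_group
        self.embed_dim = embed_dim
        # Four hyper-networks generate (w1, b1, w2, b2) from the state s.
        self.hyper_w1 = nn.Linear(state_dim, n_agents_in_group * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (B, n) per-agent Q-values of group G_m; state: (B, state_dim)
        B = agent_qs.size(0)
        w1 = self.hyper_w1(state).abs().view(B, self.n, self.embed_dim)   # absolute weights keep monotonicity
        b1 = self.hyper_b1(state).view(B, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(B, 1, self.n), w1) + b1)  # non-negative intermediate
        w2 = self.hyper_w2(state).abs().view(B, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(B)                       # Q_{G_m}(tau_{G_m}, s)
```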

The inference network \(\psi _{{\mathcal {G}}_m}\) is marked in yellow and is only used to calculate \({\mathcal {L}}_{MI}\). It takes the GRU hidden state \(h_{{\mathcal {G}}_m}\) of group \({\mathcal {G}}_m\) and the Gaussian latent \(l_{{\mathcal {G}}_n}\) of another group \({\mathcal {G}}_n\) as input. The input is first sent to an MLP layer and then to a new Gaussian distribution layer to generate the inference distribution \(q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m} | l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m})\). The MI loss \({\mathcal {L}}_{MI}\) is the KL-divergence between the original distribution \(p(l_{{\mathcal {G}}_m})\) and the inference distribution \(q_{{\mathcal {G}}_m}(l_{{\mathcal {G}}_m} | l_{{\mathcal {G}}_n}, h_{{\mathcal {G}}_m})\).

Finally, when calculating the total loss \({\mathcal {L}}_{GHQ}\), the adjusting weights \(\lambda _{TD}\) and \(\lambda _{MI}\) are introduced. In our implementation, we set \(\lambda _{TD} = \lambda _{MI} =1\). We choose Adam [70] as the optimizer, with a learning rate of 3e-4 for all networks. The total number of training steps is 5M and the maximum number of steps per episode is 200. The learning rate is scheduled to decay by a factor of 0.5 every 50,000 episodes (on average, roughly every 2M–3.5M steps). The reward discount factor \(\gamma \) is 0.99. The \(\epsilon \) of the \(\epsilon \)-greedy action-selection policy starts at 1.0, ends at 0.05, and declines linearly over 50,000 steps. The size of the replay buffer is 5000 episodes and the batch size is 32. A universal buffer saves all data for training, including trajectories of the state \(s^t\), observation \({\varvec{o}}^t\), action \({\varvec{a}}^t\), and reward \(r^t\). After each episode, the latest data are inserted into the buffer, and one batch of 32 episodes is sampled from the buffer for training. Table 3 summarizes the hyper-parameters mentioned above. In addition, we use version 4.10 of the StarCraft II game on Linux to perform experiments, instead of the older version 4.6.
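A compact configuration sketch mirroring the hyper-parameters listed above (see Table 3); the dictionary keys are illustrative.

```python
ghq_config = {
    "optimizer": "adam",
    "learning_rate": 3e-4,
    "lr_decay_factor": 0.5,          # multiplied in every 50,000 episodes
    "total_training_steps": 5_000_000,
    "max_episode_steps": 200,
    "gamma": 0.99,
    "epsilon_start": 1.0,
    "epsilon_end": 0.05,
    "epsilon_anneal_steps": 50_000,
    "buffer_size_episodes": 5000,
    "batch_size_episodes": 32,
    "lambda_td": 1.0,
    "lambda_mi": 1.0,
}
```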

In summary, the total loss of GHQ is written as

$$\begin{aligned} {\mathcal {L}}_{GHQ}(\varvec{\theta },\varvec{\psi }) = \lambda _{TD} {\mathcal {L}}_{TD}(\varvec{\theta }) + \lambda _{MI} {\mathcal {L}}_{MI}(\varvec{\theta }, \varvec{\psi }). \end{aligned}$$
(18)
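A minimal sketch of one training update for a single group under the total loss (18), assuming per-group loss helpers like those sketched earlier; the attribute and function names are illustrative.

```python
def ghq_update(group_m, batch, lambda_td: float = 1.0, lambda_mi: float = 1.0) -> float:
    # L_GHQ = lambda_TD * L_TD + lambda_MI * L_MI, optimized per group with its own Adam optimizer.
    loss = (lambda_td * group_m.td_loss(batch)
            + lambda_mi * group_m.igmi_loss(batch))
    group_m.optimizer.zero_grad()
    loss.backward()
    group_m.optimizer.step()
    return loss.item()
```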

Experiments and results

Designing new asymmetric heterogeneous maps in SMAC

In the section “Existence of LTH in SMAC”, we prove the existence of LTH in SMAC. However, the default setup of the SMAC environment and the default implementations of previous algorithms ignore the existence of the LTH problem and the importance of asymmetric heterogeneous scenarios.

First, the SMAC environment uses a padding vector to deal with the different interactive action-dims \(|A_{act}|\). It pads the \(|A_{act}|\) of \(U_{spt}\) up to the \(|A_{act}|\) of \(U_{atk}\) and masks unavailable actions during action selection. This solution covers up the existence of the LTH problem. In addition, because of the padding vector, previous algorithms can apply parameter-sharing among all unit types. This implementation further prevents the MAS from learning better coordination policies. In GHQ, all agents use their true \(|A_{act}|\), and parameter-sharing is restricted to agents within the same group.

Second, previous work ignores that the internal AI script of StarCraft II is incapable of coordinating and collaborating among multiple types of agents. As a result, the performance of enemies in symmetric heterogeneous maps is limited, and we consider asymmetric heterogeneous maps better suited to exhibit and study the LTH problem. There are only two asymmetric heterogeneous maps among the original SMAC maps: \(3s5z\_vs\_3s6z\) and MMM2 (see Table 1). However, each of these two maps has shortcomings.

All units in \(3s5z\_vs\_3s6z\) are \(U_{atk}\); the differences lie in their shot-ranges and health-points. The heterogeneity of this map is therefore restricted, because all \(U_{atk}\) have the same ideal object, and algorithms can achieve high performance without any information about the types or other properties of agents. The other map, MMM2, contains Marines, Marauders, and Medivacs (see Tables 1 and 2). \(U_{atk}\) and \(U_{spt}\), ground units and flying units, are all included in the map. However, since both sides contain all three types of units, the internal AI script is unable to perform well. Therefore, we need to design new asymmetric heterogeneous maps for experiments.

Our maps, by contrast, avoid the shortcomings of the original maps. For allies, we have Marines and Medivacs, a \(U_{atk}\) on the ground and a \(U_{spt}\) in the air, which is similar to the common heterogeneous UAV–UGV MAS in [29]. For the enemies controlled by the internal AI script, we use only Marines to avoid the incapability of the script. We increase the number of enemy Marines to balance the difficulty of the maps. Extensive pre-experiments were conducted to determine the specific number of units. Table 4 shows the information of all new maps, and Fig. 3 shows examples of original and new maps.

Fig. 3
figure 3

Examples of SMAC maps. The lower two are ours

Table 4 New asymmetric heterogeneous SMAC maps

Environmental and experimental details

In SMAC, all information provided by the environment is organized into tensors of pure data, which are either normalized into [0, 1] or transformed into one-hot vectors. We describe the necessary information below for a better understanding of the SMAC environment. More details can be found in the official repository and source code.

The state S is only accessible by the mixing network \(\theta _{M}\) during training. It consists of two major parts, ally-state and enemy-state:

  • ally-state includes the percentage of health-point, weapon cool-down timer, ally unit type, and absolute position of all allies;

  • enemy-state includes the percentage of health-point, enemy unit type, and absolute position of all enemies.

The observation O is the input to the agent network \(\theta _i\) for calculating \(Q_i\). For agent i, the observation \(o_i\) consists of four parts: the moving-feature, ally-feature, enemy-feature, and own-feature:

  • moving-feature includes the IDs of the available moving actions of agent i;

  • ally-feature includes the percentage of health-point, unit type, relative distance, and relative position of other allies to agent i within its sight-range. Information about the agents out of the sight-range of agent i is not accessible;

  • enemy-feature includes the percentage of health-point, unit type, relative distance, and relative position of all enemies to agent i within its sight-range. Information about the enemies out of the sight-range of agent i is not accessible;

  • own-feature includes the percentage of health-point and unit type of agent i.

As described in the section “Auxiliary definitions and the local transition function”, the agent action space consists of two parts: common actions \(A_{com}\) and interactive actions \(A_{act}\). The common action-dim \(|A_{com}|\) is 6 for all agents. Action ID 0 is the null action, only available for dead agents. Action ID 1 is the stop action, and IDs 2, 3, 4, and 5 are the moving actions available to all living agents. The four moving actions are pre-defined by the SMAC source code, indicating moving up, down, left, and right with a certain moving_amount step-length. The interactive action-dim \(|A_{act}|\) equals the number of interacting objects of a certain agent type: for \(U_{atk}\), \(|A_{act}|\) is the number of enemies; for \(U_{spt}\), \(|A_{act}|\) is the number of allies.
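A small sketch of this action-space layout; the example counts follow the 6m2m_16m map described later, and the function name is illustrative.

```python
def action_dims(unit_role: str, n_enemies: int, n_allies: int) -> int:
    n_common = 6                                                 # |A_com|: null, stop, 4 moves
    n_act = n_enemies if unit_role == "attack" else n_allies     # |A_act| depends on the agent type
    return n_common + n_act

print(action_dims("attack", n_enemies=16, n_allies=8))   # Marine in 6m2m_16m: 6 + 16 = 22
print(action_dims("support", n_enemies=16, n_allies=8))  # Medivac in 6m2m_16m: 6 + 8 = 14
```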

We use the default global dense reward function of SMAC. The MAS is rewarded for dealing damage to enemies, killing enemies, and winning the game. The damage reward equals the change in the enemies' health-points after one time-step, i.e., the absolute damage dealt to the enemies. The killing reward is 10 for every enemy killed, and the winning reward is 200, given at the terminal time-step.
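A hedged sketch of this dense reward; normalization and scaling details of the actual SMAC reward function are omitted, and the helper name is illustrative.

```python
def dense_reward(enemy_hp_before: list, enemy_hp_after: list, won: bool) -> float:
    # Damage dealt to enemies within one time-step.
    damage = sum(max(hp_b - hp_a, 0.0)
                 for hp_b, hp_a in zip(enemy_hp_before, enemy_hp_after))
    # 10 for every enemy killed during this time-step.
    kills = sum(1 for hp_b, hp_a in zip(enemy_hp_before, enemy_hp_after)
                if hp_b > 0.0 and hp_a <= 0.0)
    # 200 for winning, given at the terminal time-step.
    return damage + 10.0 * kills + (200.0 if won else 0.0)
```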

We use the official implementations of all algorithms with the minimal adaptations necessary for our new environmental settings. In general, we use the traditional winning-rate (WR) as the evaluation criterion. WR is the probability of the MARL agents eliminating all enemies and winning the game, approximated by the frequency of winning. We report the WR averaged over 32 testing episodes, taken every 10,000 training steps (about 1,000 training episodes). Five rounds of complete experiments with different random seeds are performed to plot the averaged WR curve with a p value of 0.05. As shown in Figs. 1 and 2, GHQ uses extra inference networks to calculate the IGMI loss. As a result, the computing time of GHQ is roughly 1.5 times that of QMIX. Other value-based methods also consume more time than QMIX, indicating that they are more complex than QMIX.

Table 5 The Enemy Strength (ES), Proportion of Supporting Units (POS), and Winning-Rate (WR) of QMIX-FT and GHQ on homogeneous and heterogeneous maps

Criteria for measuring map heterogeneity and difficulty

According to our analysis in the section “Existence of LTH in SMAC”, the existence of LTH in SMAC is clear. However, analyzing and quantifying the influence of LTH on agent policies is still required. Here, we propose objective criteria to measure the heterogeneity and difficulty of maps.

The Proportion of Supporting Units (POS) is the number of ally supporting units \(|U_{spt_i}|\) divided by the total number of ally units \(|U_{i}|\). The Enemy Strength (ES) is the ratio of the weighted attacking units \(U_{atk}\) of the two sides, measuring the relative strength of the different \(U_{atk}\); it is calculated as the enemy value divided by the ally value

$$\begin{aligned} POS&= \frac{|U_{spt_i}|}{|U_{i}|} = \frac{Ally\ Medivacs}{All\ Ally\ Units},\\ ES&= \frac{\sum _{all\ types} w_e \cdot |U_{A_e}|}{\sum _{all\ types} w_i \cdot |U_{A_i}|} = \frac{Enemy\ Marines}{Ally\ Marines}, \end{aligned}$$
(19)

where \(|U_{A_i}|\) and \(|U_{A_e}|\) are the numbers of the different types of attacking units \(U_{atk}\) for allies and enemies, and \(w_i\) and \(w_e\) are the correction weights.

In our maps, since the only \(U_{spt}\) is the Medivac and the only \(U_{atk}\) is the Marine, POS equals the proportion of Medivacs among all ally units, and ES equals the ratio of the numbers of Marines on the two sides. High POS represents high heterogeneity, because a high proportion of ally \(U_{spt}\) indicates a strong influence of the \(U_{spt}\) policy. High ES represents high difficulty, because the only way to win in SMAC is to control the ally \(U_{atk}\) to eliminate all enemies, and high ES indicates more enemy \(U_{atk}\) than ally \(U_{atk}\).
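A small sketch of computing POS and ES for our maps under these assumptions (unit weights \(w_i = w_e = 1\)); the function name is illustrative.

```python
def pos_and_es(ally_marines: int, ally_medivacs: int, enemy_marines: int):
    pos = ally_medivacs / (ally_marines + ally_medivacs)   # proportion of supporting units
    es = enemy_marines / ally_marines                      # enemy strength with w_i = w_e = 1
    return pos, es

print(pos_and_es(6, 2, 16))   # map 6m2m_16m: POS = 0.25, ES ≈ 2.67
print(pos_and_es(8, 3, 21))   # map 8m3m_21m: POS ≈ 0.273, ES ≈ 2.63
```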

We design several homogeneous maps consisting of only Marines on both sides. The enemy consists of 15, 20, or 30 Marines, which is almost the same as in our heterogeneous maps. The ally consists of slightly fewer Marines than the enemy (see Table 5). According to the converged WR, we conclude that in homogeneous maps with only Marines controlled by the QMIX-FT [21] algorithm, ES and WR are strongly related: when ES is about 1.25, WR is about 0.5, and when ES is less than 1.18, WR stays at 1.0. Even if the total number of units is doubled, this relation remains unchanged. In symmetric homogeneous maps, ES is at its minimum of 1.0, and thus the MARL policy wins more easily than in asymmetric maps.

We further design additional heterogeneous maps (see Table 5). On the one hand, the ES of heterogeneous maps can easily be increased to between 1.7 and 2.4 while the WR of QMIX-FT remains about 0.9. Introducing heterogeneity into SMAC maps can significantly increase their difficulty, so it is necessary to study and better utilize heterogeneity. On the other hand, POS and ES are highly related. To achieve a high WR in harder maps with high ES, we need to increase POS along with the number of attacking units. For example, in \(6m2m\_16m\), ES is 2.67 and POS is \(25.0\%\), and both GHQ and QMIX-FT only achieve a WR of about 0.5. By contrast, in \(8m3m\_21m\), ES is 2.63 and POS is \(27.3\%\), and the WR reaches about 0.9.

In conclusion, our results demonstrate the shortcomings of the original symmetric SMAC maps, and the ability of GHQ and QMIX-FT to handle the LTH problem with higher POS and ES. The following experiments show that better utilizing LTH helps GHQ to acquire a higher WR with smaller variance than QMIX-FT. Additionally, we conclude that the strength of one Medivac is roughly equivalent to that of 3.5 Marines.

Comparison algorithms

Experiments are conducted on our seven new maps (see Table 4) and on MMM2 as an original asymmetric heterogeneous map. We mainly choose value-based algorithms for comparison, including vanilla QMIX [20], fine-tuned QMIX (QMIX-FT) [21], QPLEX [43], ROMA [60], RODE [71], MAIC [62], and CDS [63]. We also run experiments for policy-based baseline algorithms, including COMA [72], MAPPO [22], and HAPPO [32].

RODE and ROMA are role-based algorithms that learn and apply role policies online, end to end. These two algorithms are more similar to our group-based method than the others. However, ROMA cannot learn an effective policy within 5M (5 million) training steps, because its default training budget is 20M steps. In RODE, several key hyper-parameters define the clustering and use of role policies, and the end-to-end clustering of role policies makes it difficult to focus on the LTH property; therefore, the performance of RODE is restricted. QPLEX, MAIC, and CDS modify the factorization structure of QMIX with distinct methods. COMA and MAPPO are actor-critic algorithms using the “centralized critic decentralized actor” (CCDA) architecture; they apply parameter-sharing in the actor networks and use one shared critic network. HAPPO uses independent network parameters for the actor networks and proposes a monotonic policy-improvement architecture with a theoretical guarantee.

Fig. 4
figure 4

Results of value-based algorithms comparison

Comparison results

The results of the value-based algorithms are shown in the section “Results of value-based algorithms comparison”, Fig. 4, and Table 6. We evaluate the performance of the value-based algorithms with 4 groups of experiments. All GHQ results are shown in red, and the colors of the other value-based algorithms are shown in the legend. The results of the policy-based algorithms are shown in the section “Results of policy-based algorithms comparison” and Table 8. Generally, all of the comparison algorithms suffer from the LTH problem and cannot acquire a high WR with small variance. Previous value-based algorithms are basically modified from QMIX and, to some extent, weaken the ability of QMIX to handle the LTH problem.

Results of value-based algorithms comparison

Table 6 Results of value-based algorithms comparison

Results for the value-based algorithms are shown in Fig. 4 and Table 6. In Table 6, the results are the final WR, averaged across 5 individual tests with different random seeds and followed by the standard deviations. In Fig. 4, the lines and shadows are fitted over all of the data, so the values may differ slightly from the results in Table 6.
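As a side note, the aggregation behind each Table 6 entry can be reproduced with a few lines of NumPy; the five WR values below are placeholders, not actual results.

import numpy as np

final_wr = np.array([0.94, 0.88, 0.91, 0.97, 0.84])  # hypothetical final WRs of 5 seeds
print(f"{final_wr.mean():.2f} ({final_wr.std():.2f})")  # mean WR followed by standard deviation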

(1) We test all algorithms on the original asymmetric heterogeneous map MMM2. The results are shown in Fig. 4a. Because the map is relatively easy and almost all algorithms converge by 3 M training steps, we only show results up to 3 M steps for better presentation. The graph shows that the WR of most algorithms converges to 1.0 at about 1.5 M steps with relatively small variance. QPLEX and GHQ are slightly better than QMIX. RODE and MAIC converge at about 2.5 M steps, slower than the other algorithms. ROMA and CDS fail to converge within 3 M steps.

(2) We decrease the heterogeneity of maps by decreasing POS. In Fig. 4b, d, g, and h, the number of Medivacs remains 2, while the number of Marines increases. Therefore, the POS decreases from \(25.0\%\) in (b) to \(11.1\%\) in (h) (see Table 5). As a result, algorithms using parameter-sharing among all agents learn better policies than in the setting of increasing POS. In general, algorithms perform well in the small-scale maps (b) and (d), but only GHQ and QMIX-FT perform well in both of the large-scale maps (g) and (h), with GHQ outperforming QMIX-FT with smaller variance. MAIC and QMIX perform well in (g) but fail in (h), indicating their limitation in handling large-scale problems. QPLEX and RODE cannot learn effective policies in (g) and (h), while ROMA and CDS completely fail in (g) and (h). RODE performs better in (d), (g), and (h) than in (c), (e), and (f), indicating that the training of its role-selector requires more homogeneous MARL settings.

Fig. 5 Heat-maps of \(U_{spt}\)'s percentage of health-points for GHQ and QMIX-FT in 6m2m_15m and 6m2m_16m

(3) We scale up all units on both sides simultaneously. In Fig. 4b and c, the POS remains \(25.0\%\), while the number of Medivacs increases from 2 to 4. Theoretically, the optimal policies of maps (b) and (c) are similar. However, this scaling method combines the complexity of scalability and heterogeneity, making it harder for the comparison algorithms to learn effective policies. In (b), most algorithms achieve a high WR within 5 M steps; GHQ converges fastest, while RODE suffers from high variance and a relatively low WR. ROMA and CDS fail to learn effective policies in (b). In (c), however, almost all comparison algorithms fail to learn effective policies. GHQ and QMIX-FT outperform the other algorithms but have not yet converged at 5 M steps. QPLEX also suffers from the added complexity, but generally performs better than QMIX, ROMA, RODE, MAIC, and CDS.

Table 7 Results of independent t-test of GHQ against other value-based algorithms

(4) We increase the heterogeneity of maps by increasing POS. In Fig. 4d–f, the POS are \(25.0\%\), \(27.2\%\), and \(33.3\%\), while the numbers of Medivacs are 2, 3, and 4. The results show that GHQ achieves the best results in all maps with relatively small variance, indicating the effectiveness of the IOG method and the IGMI loss. IOG allows different types of agents to possess different network parameters and thus reduces the influence of the increasing number of Medivacs. The IGMI loss helps to increase the correlation between group trajectories and thus strengthens cooperation between groups. QMIX-FT achieves a WR relatively close to GHQ compared with the other algorithms, but still suffers from high variance. QPLEX performs well in (d) and (e), but its WR decreases evidently in (f), indicating the influence of LTH. The WR of RODE is about 0.2 in (d), but remains zero in the other maps. ROMA fails to learn effective policies in all maps. MAIC achieves a WR of about 0.2 in (d) and (e), but fails in (f). CDS achieves a low WR in (d) but fails in the other maps.

Independent t-test and further analysis of GHQ against other value-based algorithms

In SMAC, we cannot conduct experiments in which two algorithms directly play against each other. Therefore, we cannot count the win-lose relationship between algorithms as in the statistical tests of [73]. As an alternative, we use the data in Table 6 to conduct independent t-tests between GHQ and the other value-based algorithms to establish the significance of the obtained results. We assume that the distributions of all results are normal, with the mean values and standard deviations given in the table. We use SciPy to sample 500 points from each distribution and then run the independent t-tests. The results are shown in Table 7, with the t-statistics followed by the p values.
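A minimal sketch of this procedure, assuming SciPy and NumPy are available; the mean and standard-deviation values below are placeholders standing in for entries of Table 6, not the actual numbers.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def independent_t_test(mean_a, std_a, mean_b, std_b, size=500):
    # Sample the two assumed-normal result distributions and compare them.
    sample_a = rng.normal(mean_a, std_a, size)
    sample_b = rng.normal(mean_b, std_b, size)
    return stats.ttest_ind(sample_a, sample_b)

# Hypothetical example: GHQ vs. one baseline on one map.
t_stat, p_value = independent_t_test(0.90, 0.05, 0.80, 0.10)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")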

Almost all p values are smaller than 0.05, indicating the significance of the results. Only the p values of GHQ against QMIX-FT, QMIX, QPLEX, and RODE in MMM2 are greater than 0.05, indicating that the result of GHQ has no significant difference from the results of these 4 algorithms, which is consistent with Table 6. The t-statistics are also almost all positive, indicating the superior performance of GHQ over the other algorithms.

In 6m2m_15m, the mean value of GHQ is smaller than those of QMIX-FT and QPLEX. First, we point out that, as shown in Fig. 4b, the WR curve of GHQ grows faster than those of the other two algorithms, indicating the faster learning speed of GHQ. Second, for further analysis, we draw a heat-map of \(U_{spt}\)'s percentage of health-points following the method in the section “Visualization analysis of trained policies”; the result is shown in Fig. 5. It can be concluded that GHQ learns similar policies in 6m2m_15m and 6m2m_16m, namely to “let \(U_{spt}\) take damage to preserve \(U_{atk}\)”. However, although QMIX-FT manages to learn a policy similar to GHQ in 6m2m_15m, it fails to learn the proper policy in 6m2m_16m. This phenomenon indicates that 6m2m_16m is more difficult than 6m2m_15m, as the optimal policy becomes harder to learn.

Results of policy-based algorithms comparison

Due to the discrete property of SMAC, value-based algorithms have generally achieved better results than policy-based algorithms [20,21,22]. To support this conclusion, we compare COMA, MAPPO, and HAPPO against GHQ and QMIX-FT. Results for these algorithms are shown in Table 8. The final WRs are averaged across 3 individual tests with different random seeds and followed by the standard deviations. The original papers of MAPPO and HAPPO run SMAC experiments for 10 M training steps, so we list the results at 5 M and 10 M training steps separately. COMA only acquires a non-zero WR in MMM2 and fails in all other maps. MAPPO performs best among the 3 policy-based algorithms, especially in 6m2m_15m, 8m4m_23m, and 12m4m_30m; these maps have relatively high ES and POS, indicating the potential of MAPPO for handling LTH problems. HAPPO performs worse than the other policy-based algorithms. One possible reason is that HAPPO implements Multi-Agent Advantage Decomposition (MAAD) via a random sequential update-and-execute scheme, whereas in the LTH problem the sequential partial order of agent actions can significantly affect the final joint policy. In conclusion, the results show that value-based algorithms generally perform better than policy-based algorithms, and GHQ outperforms all policy-based baseline algorithms.

Table 8 Results of policy-based algorithms comparison

Ablation study

Fig. 6 Results for ablation tests about IOG and IGMI

The ablation study consists of two experiments. The section “Ablation tests about IOG and IGMI” presents the ablation test on the two components of GHQ, IOG and IGMI. Because IGMI must be applied between two agent groups, “QMIX-FT+IGMI” cannot be tested individually. Therefore, 3 groups of ablation tests are conducted in 4 maps, as shown in Fig. 6. The other experiment, in the section “Visualization analysis of trained policies”, is the visualization analysis of the trained policies of GHQ and QMIX-FT in 6m2m_16m. We visualize the trained policies of the two algorithms as heat-maps to show the influence of IOG and IGMI on policy learning. The temperature of the heat-maps is the counting sum of the corresponding agents. The results are shown in Fig. 7.

Ablation tests about IOG and IGMI

To analyze the effectiveness of the IOG method and the IGMI loss in different maps, we conduct ablation tests in (a) MMM2, (b) 6m2m_16m, (c) 8m4m_23m, and (d) 16m2m_30m. QMIX-FT and QMIX-FT+IOG are the ablation groups. The results are shown in Fig. 6.

In general, as expected, the IOG method improves the performance of QMIX-FT, and the IGMI loss reduces variance. Figure 6a shows that all algorithms conquer the MMM2 map within 1.5 M steps, with QMIX-FT+IOG and GHQ converging slightly faster than QMIX-FT. In (b) 6m2m_16m, IOG and IGMI perform well: they not only improve the WR but also reduce the variance. In (c) 8m4m_23m, the WR of QMIX-FT increases faster than those of the other two algorithms before 3 M steps. However, the IOG method manages to find a good cooperative policy and converges to a higher WR at 5 M steps with smaller variance; the steeper slope of the IOG curve between 3 M and 4 M steps indicates the progress of learning a better policy. In (d) 16m2m_30m, GHQ outperforms QMIX-FT with a higher WR and smaller variance. QMIX-FT+IOG achieves a result similar to QMIX-FT, but with even larger variance, mainly because the difference between the two groups is too large. Introducing the IGMI loss helps to restrict this difference and improve the correlation between groups. Therefore, GHQ achieves the best result among the three tested algorithms.

Visualization analysis of trained policies

Fig. 7 Heat-maps for the trained policies of GHQ and QMIX-FT in 6m2m_16m

We choose the trained policies of GHQ and QMIX-FT in 6m2m_16m at 5 M training steps for analysis. The two algorithms are implemented with the same hyper-parameters and similar network capacities. The differences in action selection and health-points of the two unit types controlled by GHQ and QMIX-FT are visualized in Fig. 7. We test the two policies 50 times and record their trajectories. We calculate the counting sum of agents' chosen actions and of agents' percentage of health-points, and visualize them in the heat-maps of Fig. 7. The horizontal coordinate of every heat-map is the time-step T, and the vertical coordinate is every \(10^{th}\) percentile of health-points in (a), (b), (e), and (f), or the action ID in (c), (d), (g), and (h). The temperature is the counting sum of the corresponding agents at a given time-step with a given status. In the action heat-maps, action IDs 0 to 5 are common actions \(A_{com}\) for moving and stopping, while the rest are interactive actions \(A_{act}\) for attacking or healing. The chosen network parameters of GHQ and QMIX-FT reach the same WR of about 0.8 after 5 M training steps, although GHQ learns faster than QMIX-FT. Figure 7 clearly shows that the two algorithms achieve similar results through different agent policies. A sketch of how such heat-maps can be constructed is given below, followed by three key observations.
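The following sketch shows one way such heat-map matrices could be accumulated from the recorded trajectories; the trajectory format (a list of per-step dictionaries holding the chosen action IDs and health-point percentages of all agents), the bin sizes, and all names are hypothetical, not the actual logging code.

import numpy as np

N_ACTIONS, N_HP_BINS, MAX_T = 22, 10, 60  # assumed sizes for illustration

def build_heatmaps(trajectories):
    # Temperature = count of agents with a given status at a given time-step.
    action_heat = np.zeros((N_ACTIONS, MAX_T))
    hp_heat = np.zeros((N_HP_BINS, MAX_T))
    for traj in trajectories:  # one recorded episode per test
        for t, step in enumerate(traj[:MAX_T]):
            for action_id, hp_pct in zip(step["actions"], step["hp_pct"]):
                action_heat[action_id, t] += 1
                hp_bin = min(int(hp_pct // 10), N_HP_BINS - 1)  # every 10th percentile
                hp_heat[hp_bin, t] += 1
    return action_heat, hp_heat

# action_heat, hp_heat = build_heatmaps(recorded_trajectories)
# A heat-map can then be drawn with, e.g., matplotlib's imshow.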

(1) Parameter-sharing among different agent types does influence agent policy. As suggested in [32], parameter-sharing restricts network parameters from being diverse. The red boxes in Fig. 7c and g show a similar policy pattern of “first move, then stop to attack/heal” for the two types of agents in QMIX-FT. Specifically, both types of agents prefer to choose actions 2 and 5 in the first 14 time-steps. In GHQ, however, the diversity of the different groups is guaranteed, as generally shown in (d) and (h). In addition, comparing the Medivac policies of QMIX-FT and GHQ in (g) and (h), it is clear that the QMIX-FT Medivac policy in (g) is more similar to the QMIX-FT Marine policy in (c) than to the GHQ policies in (h) and (d).

(2) GHQ improves group policy learning. The green boxes in Fig. 7c and d indicate that Marines controlled by GHQ learn a better “focusing and firing” policy, as the temperatures of \(A_{act}\) are notably hotter than those of QMIX-FT. GHQ agents learn to focus fire on one specific enemy target within several time-steps, which quickly eliminates enemies and reduces the damage they take. By contrast, QMIX-FT agents learn to fire at several targets at the same time, which slows elimination and causes them to take more damage. The yellow box in Fig. 7d shows that the moving policy of GHQ Marines is also significantly different from that of QMIX-FT. GHQ Marines finish their movement in the first 4 time-steps with decisive actions and form a tight front. They tend to stay together and therefore take enemy damage simultaneously, which leads to a similar decreasing tendency of health-points and the two obvious temperature valleys at the 80th and 40th percentiles in the yellow box of (b).

(3) GHQ improves inter-group cooperation. The orange boxes in Fig. 7e and f show the decreasing curves of the Medivacs' health-points. GHQ Medivacs learn a better “distracting” policy than QMIX-FT Medivacs: one GHQ Medivac first moves toward the enemies and attracts fire to prevent them from attacking the ally Marines. This policy is evidenced by the “action 0 line” in the orange box of (h), indicating the death of one Medivac agent. The other Medivac then moves up to keep attracting enemy fire. As a result, the plot in (f) consists of two independent curves. The distraction policy performed by the GHQ Medivacs is a remarkable tactic and differs significantly from the policies of the GHQ Marines, indicating that GHQ is capable of utilizing LTH for better cooperation.

Conclusion

In this paper, we focus on the cooperative heterogeneous MARL problem, especially asymmetric heterogeneous MARL problems. To describe and study the heterogeneous MARL problem, we propose Local Transition Heterogeneity (LTH) with a formal definition. To support the definition of LTH, we first define the Local Transition Function (LTF) and several auxiliary concepts. Furthermore, we study the existence and influence of LTH in SMAC.

To address the LTH problem, we first propose the Grouped Individual-Global-Max (GIGM) consistency. Following the restriction of GIGM, we further propose the Ideal Object Grouping (IOG) method, the Inter-Group Mutual Information (IGMI) loss, and the hybrid factorization structure. The combination of these three methods constitutes our novel Grouped Hybrid Q-learning (GHQ) algorithm. Experiments conducted on asymmetric heterogeneous SMAC maps show that GHQ outperforms other state-of-the-art algorithms. The results demonstrate the necessity of studying and utilizing LTH for more complex scenarios in SMAC.

We believe that the study of heterogeneity is indispensable for future MARL research, and we hope that our mathematical definitions and analysis can help future studies on the cooperative heterogeneous MARL problem. Due to restrictions on computing resources and network structure, we are unable to study large-scale problems or transfer-learning problems in heterogeneous MARL. In the future, we will try to solve larger-scale and more complex heterogeneous MARL problems in other maps and environments.