GHQ: Grouped Hybrid Q-Learning for Cooperative Heterogeneous Multi-agent Reinforcement Learning

Previous deep multi-agent reinforcement learning (MARL) algorithms have achieved impressive results, typically in symmetric and homogeneous scenarios. However, asymmetric heterogeneous scenarios are prevalent and usually harder to solve. This paper focuses on the cooperative heterogeneous MARL problem in asymmetric heterogeneous maps of the StarCraft Multi-Agent Challenge (SMAC) environment. Recent mainstream approaches use policy-based actor-critic algorithms to solve the heterogeneous MARL problem with various individual agent policies, but they lack a formal definition and further analysis of the heterogeneity problem. Therefore, we first give a formal definition of the Local Transition Heterogeneity (LTH) problem, and then study it in the SMAC environment. To reveal and study the LTH problem comprehensively, we design several new asymmetric heterogeneous maps in SMAC and observe that baseline algorithms fail to perform well on them. We then propose the Grouped Individual-Global-Max (GIGM) consistency and a novel MARL algorithm, Grouped Hybrid Q-Learning (GHQ). GHQ separates agents into several groups and keeps individual parameters for each group. To enhance cooperation between groups, GHQ maximizes the mutual information between trajectories of different groups. A novel hybrid structure for value factorization in GHQ is also proposed. Finally, experiments on the original and the new maps show that GHQ outperforms other state-of-the-art algorithms.

The SMAC environment is a multi-agent micromanagement scenario in which two adversarial MASs battle against each other. The goal is to train a MARL algorithm controlling ally agents to eliminate enemy agents controlled by the internal script of the SMAC environment. The algorithm needs to learn tactics and skills for choosing the best actions and utilizing the different properties of agents. Due to the discrete property of the environment, value-based algorithms have achieved better results than policy-based algorithms [20][21][22].
Asymmetric heterogeneous problems are very common in real-world scenarios [23][24][25][26][27], such as the wireless network accessibility problem [28] and multi-agent robotic systems [29,30]. However, the original maps of the SMAC environment mainly consist of symmetric maps or homogeneous maps (see Table 1). A symmetric map means that allies and enemies consist of the same types of units, and the numbers on both sides are also equal. A homogeneous map means that allies consist of one specific type of unit, regardless of the composition of the enemies. Furthermore, in SMAC, even though allies and enemies are equal at the starting state of symmetric heterogeneous problems, they become asymmetric as the game runs, because the two sides are attacking and killing each other. In [31], the authors propose a situation (Proposition 5) where the policy of an algorithm may be trapped in a sub-optimal state due to the complexity of heterogeneity. Therefore, it is necessary to study the heterogeneous MARL problem comprehensively and carefully.
Previous algorithms have achieved good performance on most symmetric homogeneous maps, symmetric heterogeneous maps, and asymmetric homogeneous maps from the original SMAC map set. However, experiments show that even state-of-the-art algorithms cannot achieve a high winning-rate (WR) on asymmetric heterogeneous maps, indicating that the combination of asymmetry and heterogeneity brings more complexity. Therefore, in order to fully study the heterogeneity problem, it is beneficial to enrich the SMAC environment with more asymmetric heterogeneous maps. Recent mainstream approaches use policy-based actor-critic algorithms to solve the heterogeneous MARL problem with various individual agent policies [32,33]. Some other papers discussing heterogeneity are mainly about multi-agent robotic systems, such as [29,30], which are slightly different from MARL research.
For example, in the multi-agent area search problem proposed in [29], multi-agent robotic methods usually model the problem in detail with proper mathematical structures and then propose the solution. In contrast, MARL approaches usually model the problem as a POMDP (see section 3.1 for details) and design a proper reward function for the environment. The goal is to learn an optimal policy function that decides the best actions for all states.
In particular, it should be pointed out that previous approaches lack a formal definition of heterogeneity. A natural description of the heterogeneity problem is that the action spaces of agents are different, and parameter-sharing among different agents is limited or prohibited. However, such a description is not detailed enough for further study. In [34], the authors describe and classify Physical and Behavioral heterogeneity with natural language instead of mathematical definitions. It is easy for humans to realize that planes and cars are heterogeneous. However, it is still necessary to analyze heterogeneity deeply with a formal definition, so that we can figure out which property differs such that the MASs must be treated differently, and which type of heterogeneity the MASs possess. Based on the definition and classification, we can further quantify and solve the heterogeneity problem.
Considering the generation process of a transition tuple, we conclude that heterogeneity in MARL mainly occurs in three components of the tuple: the Local Reward, the Local Observation, and the Local Transition. In this paper, we focus on the cooperative Local Transition Heterogeneity (LTH) MARL problem, in which cooperation happens among different types of agents. When the number of ally agents changes, the ratio of different agent types may also change, and thus the optimal cooperative policy is affected. This change increases the diversity and complexity of the LTH problem.
A natural solution for LTH is grouping. An agent is assigned to a specific group depending on a certain property, and it remains a permanent member of that group as long as the scenario remains unchanged. The grouping process simplifies and stabilizes the determination of group members and the usage of different group policies, making it easier to add inter-group mechanisms between group policies. In addition, grouping helps to maintain a proper structure for parameter-sharing, which improves cooperation through homophily [35]. As a result, choosing an appropriate property for grouping becomes an important problem in LTH problems.
In this paper, we propose the GIGM consistency and the GHQ algorithm to solve the LTH problem in the SMAC environment. First, in order to leverage the benefits of value-based methods and grouping methods, we generalize the Individual-Global-Max (IGM) consistency [36] to grouped situations. Specifically, we formulate the Grouped Individual-Global-Max (GIGM) consistency and a condition to test whether a grouping method satisfies GIGM. Second, we propose Grouped Hybrid Q-Learning (GHQ). Agents are partitioned into groups following the ideal object grouping (IOG) method. Each group has its own isolated network parameters, and parameters are only shared among group members. A novel hybrid structure for value factorization is proposed to optimize and reduce computation. Furthermore, a variational lower bound of the inter-group mutual information (IGMI) is introduced to increase the correlation between groups for better cooperation. Third, we test GHQ on our new asymmetric heterogeneous maps. Results show that GHQ outperforms other baseline algorithms with a higher WR and better learning curves, and the cooperative policy between GHQ groups is significantly different from that of the baselines. The rest of the paper is organized as follows: we summarize related work in section 2; we define LTH and theoretically analyze it in SMAC in section 3; we provide details of the GHQ algorithm in section 4; we present the environmental and experimental design and discuss the results of our experiments in section 5; and finally we draw conclusions in section 6.

Multi-agent Reinforcement Learning
Following the centralized training with decentralized execution (CTDE) paradigm [37][38][39], which requires agents not to use the state s during execution, recent approaches have achieved impressive results by minimizing the TD loss

L_TD(θ) = E_D[(r + γ max_{a'} Q_tot^tgt(τ', s', a'; θ^tgt) − Q_tot(τ, s, a; θ))²],

where E_D means sampling a batch of tuples (τ, s, r) from the replay buffer D and calculating the expectation across the batch. π is the action policy, which in value-based algorithms is commonly the ε-greedy or argmax policy of the Q function. Q_tot^tgt is the target function of Q_tot, and θ and θ^tgt are the network parameters of Q_tot and Q_tot^tgt respectively. In order to factorize Q_tot and use the argmax policy of each Q_i to select actions, the Individual-Global-Max (IGM) consistency [36] is required:

argmax_a Q_tot(τ, s, a) = (argmax_{a_1} Q_1(τ_1, a_1), ..., argmax_{a_k} Q_k(τ_k, a_k)).

QMIX [20] changes the factorization structure from additivity to monotonicity, and the fine-tuned version of QMIX has been proven to be one of the best algorithms on the original SMAC maps [21]. Based on these fundamental algorithms, further variants such as QTRAN [36] relax the factorization constraints. Heterogeneous MARL has often been treated as a special case of homogeneous MARL that can be handled with individual policy networks. HAPPO [32], in which the "H" stands for heterogeneous, lacks specific analysis and sufficient experiments on heterogeneity. In other fields of MAS, [48] uses Relative Needs Entropy (RNE) to build a trust model that improves cooperation in heterogeneous multi-robot grouping tasks, and [49] contributes a novel method for heterogeneous multi-robot assembly planning.

Grouping Method
Grouping is a natural idea and solution for complex or large-scale problems and is widely used in research on optimization and machine learning. In the SMAC environment, THGC [50] divides agents into different groups based on their different "types" for knowledge sharing and group communication. However, it is necessary to formally define and describe the difference between agent types in a universal way across different environments. In this paper, we introduce some auxiliary definitions for describing our grouping method.
Grouping is also widely used outside MARL. [51] uses a channel grouping algorithm to cluster different sub-regions of pictures for vehicle Re-ID. [52] introduces a ranking-based grouping method to improve a multi-population-based differential evolution algorithm. [53] proposes a grouping attraction model, which significantly reduces the number of attractions and fitness comparisons in the firefly algorithm. [54] modifies the Transformer encoder by organizing encoder layers into multiple groups and connecting these groups via a grouping skip connection mechanism. [55] enhances Optimal Sequential Grouping (OSG) to solve the video scene detection problem. [56] proposes FedEntropy for better dynamic device grouping in federated learning. [57] introduces an enhanced decentralized autonomous aerial swarm system with group planning. [58] designs a self-organizing MAS for distributed voltage regulation in the smart power grid.

Mutual Information
Computing a variational bound of mutual information (MI) has been proven to enhance cooperation in MARL. MAVEN [59] maximizes a variational lower bound of the MI between the latent variable z and the agent-specific Boltzmann policy σ(τ) to encourage exploration. ROMA [60] computes two MI-related losses to learn identifiable and specialized role policies. PMIC [61] maintains positive and negative trajectory memories to compute the upper and lower bounds of the MI between the global state s and the joint action a. MAIC [62] maximizes the MI between the trajectory of agent i and the ID of another agent j for teammate modeling and communication. CDS [63] maximizes the MI between the trajectory τ_i of agent i and its own agent ID to maintain diverse individual local Q functions.

Local Transition Heterogeneity
In this section, our goal is to give a formal definition of the Local Transition Heterogeneity (LTH) problem and analyze its existence in SMAC. We first present fundamental concepts and definitions in 3.1. Next, we define auxiliary concepts and the Local Transition Function (LTF) for the formal definition of LTH in 3.2. These definitions isolate one specific agent i into an ideal scenario, so that we can study the properties of agent i affecting the LTH problem. Then, in 3.3, we define the LTH problem and show the advantage of our definition. Finally, we conclude two properties for proving the existence of the LTH problem and analyze the existence of LTH in SMAC in 3.4.

Preliminaries
In this paper, we study cooperative MARL problems that can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [64]. The problem is described by a tuple G = ⟨S, A, P, R, Ω, O; γ, K, T⟩, where s ∈ S denotes the true state of the environment with complete information, A denotes the joint action space, P(s_{t+1}|s_t, a_t) denotes the joint transition function, R denotes the shared reward function, Ω denotes the observation space, O denotes the observation function, K = {1, ..., k} denotes the finite set of k agents, γ ∈ [0, 1) is the discount factor, and T is the time horizon of an episode.
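As a minimal illustration, the Dec-POMDP tuple above can be sketched as a plain container; the field names and the toy instance below are illustrative only, not taken from SMAC or the GHQ implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DecPOMDP:
    """Illustrative container for the tuple G = <S, A, P, R, Omega, O; gamma, K, T>."""
    states: List[str]        # S: environment states
    actions: List[str]       # A: action set
    transition: Callable     # P(s' | s, a): transition function
    reward: Callable         # R(s, a): shared team reward
    observations: List[str]  # Omega: observation set
    obs_fn: Callable         # O(s, i): local observation of agent i
    gamma: float = 0.99      # discount factor, gamma in [0, 1)
    n_agents: int = 2        # |K|
    horizon: int = 200       # T: episode time limit

# A toy instance: two agents, two states, deterministic dynamics.
toy = DecPOMDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transition=lambda s, a: "s1" if a == "move" else s,
    reward=lambda s, a: 1.0 if s == "s1" else 0.0,
    observations=["o0", "o1"],
    obs_fn=lambda s, i: "o1" if s == "s1" else "o0",
)
```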

Auxiliary Definitions and the Local Transition Function
Apart from the joint transition function P(s_{t+1}|s_t, a_t), we need to define the Local Transition Function (LTF) P_i(s_{t+1}|s_t, a^t_i) for the definition of the LTH problem. Several auxiliary definitions are given for better demonstration and analysis of LTF and LTH.
First, we partition the actions A_i of an agent i into 3 different types: common actions A_com, which only affect agent i itself, e.g. moving, scanning and transforming; interactive actions A_act, which interact with other agents, e.g. attacking, guiding and delivering; and mixing actions A_mix, which affect both the agent itself and others, e.g. a predator moving close to a prey for automatic predating. Usually, A_mix can be divided into a combination of A_com and A_act; e.g., the A_mix action of automatic predating can be divided into the A_com action of moving and the A_act action of predating. For terminological simplicity, we divide A_mix into the combination of A_act and A_com by default, and focus on the latter two types of actions.
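The three-way action partition can be sketched as follows; the action names and the decomposition of the mixing action are hypothetical examples in the spirit of the predator illustration above.

```python
# Hypothetical action inventory for one agent; names are illustrative.
ACTIONS = {
    "move":      "common",       # affects only the agent itself
    "scan":      "common",
    "attack":    "interactive",  # interacts with another agent
    "deliver":   "interactive",
    "auto_hunt": "mix",          # moving toward a prey + predating
}

def split_mix(action: str) -> list:
    """Divide a mixing action into its common and interactive parts,
    as the text does for 'automatic predating' -> moving + predating."""
    if ACTIONS.get(action) != "mix":
        return [action]
    return ["move", "attack"]  # illustrative decomposition

# After splitting A_mix by default, only A_com and A_act remain.
A_com = sorted(a for a, t in ACTIONS.items() if t == "common")
A_act = sorted(a for a, t in ACTIONS.items() if t == "interactive")
```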
Second, we introduce the joint available-action-mask matrix AM(s_t) and the local available-action-mask vector AM_i(s_t, i), which are common components in many MARL environments.

Definition of Local Transition Heterogeneity
In general, Local Transition Heterogeneity (LTH) means that agents cannot reach the same next-state s_{t+1} from the same state s_t, no matter what policies they are using. A formal definition is given below.

Definition 3 (Local Transition Heterogeneity, LTH): Let there be two agents i, j ∈ K. Their policies are π_i(a_i|s) and π_j(a_j|s), and their LTFs are P_i(s_{t+1}|s_t, a^t_i) and P_j(s_{t+1}|s_t, a^t_j). A certain state s_t, which simultaneously fulfills IC_i and IC_j, is the starting state. The sets of next-states {s^{t+1}_i | s_t, π_i, P_i} and {s^{t+1}_j | s_t, π_j, P_j} are generated by π_i and π_j individually executed on s_t towards their corresponding IO_i and IO_j. If the intersection of the two sets of next-states is empty for all available policies, then the MARL problem has LTH:

{s^{t+1}_i | s_t, π_i, P_i} ∩ {s^{t+1}_j | s_t, π_j, P_j} = ∅, ∀ π_i, π_j. (3)

For example, in a MAS consisting of a UAV (Unmanned Aerial Vehicle) and a UGV (Unmanned Ground Vehicle), suppose that the UAV and UGV carry different mission cargo, so their A_act are different. Their moving speeds and moving dimensions (2-D and 3-D) are also different, so their A_com are different. From the same starting state s_t, the UAV and UGV cannot reach the same next-state s_{t+1} because both their A_com and A_act differ.

An advantage of our definition is the reliability of presenting heterogeneity. We define the LTH problem under the restriction of IO and IC. Our core motivation is to ensure that the local available-action-mask vector AM_i(s_t, i) remains all true, because AM_i(s_t, i) can influence the behavior of agents and thus affect the existence of LTH. For instance, if all enemies choose the policy "attack and eliminate agent i at the 1st time-step", then AM_i would only be available for the "dead-action", since agent i is always dead from the 1st time-step. Therefore, it is impossible for agent i to present LTH. Similarly, ally agents' actions and policies can also affect AM_i and lead to the same result. In conclusion, our definition avoids unexpected influence from enemies or allies towards AM_i, and is capable of presenting LTH reliably.
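The emptiness test at the core of Definition 3 can be sketched on toy next-state sets; the UAV/UGV coordinates below are invented for illustration.

```python
def has_lth(next_states_i: set, next_states_j: set) -> bool:
    """Definition 3, informally: from the same starting state s_t
    (fulfilling IC_i and IC_j), the problem has LTH if the sets of
    reachable next-states of agents i and j never intersect."""
    return next_states_i.isdisjoint(next_states_j)

# Toy example: a UAV moves in 3-D while a UGV moves on the ground plane z = 0,
# so their reachable next-states (positions) can never coincide.
uav_next = {(1, 0, 1), (0, 1, 2)}   # UAV always changes altitude here
ugv_next = {(1, 0, 0), (0, 1, 0)}   # UGV stays at z = 0
assert has_lth(uav_next, ugv_next)  # different A_com dynamics -> LTH
```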

Existence of LTH in SMAC
The original definition formula (3) is inconvenient for judging whether an environment has LTH. We further conclude that a difference in IO or in LTF can determine the existence of LTH. First, different IOs lead to qualitative LTH. For example, in a UAV-UGV system with different mission cargo, the IO of a UAV is defined to be another UAV while the IO of a UGV is defined to be another UGV. The objects and functionalities of their A_act are different, leading to LTH. Generally, a different interactive action-dim |A_act| is sufficient to prove the difference of IO, and can also be used to prove the existence of LTH. Second, different LTFs lead to quantitative LTH. For example, in a UAV-UGV system with the same mission cargo, the moving speeds are still different: typically, UAVs fly faster in the air than UGVs move on the ground. The difference in the dynamics of A_com or A_act leads to different LTFs, and thus LTH occurs.
In SMAC, there are two agent types: supporting units U_spt and attacking units U_atk. U_spt can only affect allies while U_atk can only affect enemies. For example, the Medivac is a U_spt that can only heal allies, while the Marine is a U_atk that can only attack enemies (see Table 2). In SMAC, the A_com actions are moving and stopping, available for all living agents at any state s and any time-step t. The common action-dim |A_com| also remains identical among all agent types. The A_act actions are attacking or healing; a certain agent type can only attack enemies or heal allies. Therefore, |A_act| differs between different agent types.
First, the IOs of U_atk and U_spt are different, leading to qualitative LTH. For U_atk, its IO is an enemy, and its |A_act| is the total number of enemies. However, for U_spt, its IO is an ally, so its |A_act| is the total number of allies. Second, the moving speed, shot-range, and damage-per-gaming-second (DPS) differ between different types of agents (see Table 2), indicating the existence of quantitative LTH. In conclusion, the existence of LTH in SMAC is clarified, and further analysis and study of LTH are therefore required.
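The two checks above can be sketched on a toy roster; the unit statistics below are invented for illustration (the real values live in Table 2 and the SMAC source).

```python
# Toy SMAC-like scenario: 3 ally units vs 5 enemies. Stats are illustrative.
N_ALLIES, N_ENEMIES = 3, 5

UNITS = {
    "Marine":  {"role": "atk", "speed": 3.15},  # U_atk: attacks enemies
    "Medivac": {"role": "spt", "speed": 4.13},  # U_spt: heals allies
}

def interactive_dim(unit: str) -> int:
    """|A_act|: number of interacting objects of the unit's type.
    U_atk targets enemies; U_spt targets allies."""
    return N_ENEMIES if UNITS[unit]["role"] == "atk" else N_ALLIES

def lth_type(u: str, v: str) -> str:
    """Qualitative LTH: different IO (different target set for A_act);
    quantitative LTH: same IO but different dynamics (e.g. speed)."""
    if UNITS[u]["role"] != UNITS[v]["role"]:
        return "qualitative"
    if UNITS[u]["speed"] != UNITS[v]["speed"]:
        return "quantitative"
    return "none"
```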

Grouped Individual-Global-Max Consistency
As is shown in section 3.3, LTH does not change the reward function R(s, a) or the available-action-mask. Therefore, any available joint action a is rewarded the same as in homogeneous scenarios, and the optimal joint action a* is not affected. As a result, the IGM consistency still holds under LTH, and we can further generalize the consistency to a "grouped" situation for solving LTH problems with grouping value factorization.

Definition 4 (Grouped IGM Consistency, GIGM): Let there be U = {1, ..., u} (u < k) agent groups in total. An agent group G_m (m ∈ U) consists of arbitrarily pre-defined agents. If the argmax operation performed on the joint function Q_tot yields the same result as the set of individual argmax operations performed on all group functions Q_Gm (m ∈ U), and the argmax operation performed on each group function Q_Gm yields the same result as the set of individual argmax operations performed on the agent functions Q_i (i ∈ G_m), then the GIGM consistency holds.

Furthermore, we conclude a theorem sufficient to prove GIGM:

Theorem 1 (Joint Trajectory Condition, JTC): GIGM holds true if the following two conditions are simultaneously satisfied: (i) The global joint trajectory is equivalent to the union of all group trajectories.
(ii) The intersection of all group trajectories is empty.
The first condition guarantees the transitivity of argmax operations performed on the Q functions. The second condition guarantees the coexistence of argmax operations on all Q_Gm. The two conditions jointly guarantee the equivalence of argmax operations on all group Q and agent Q functions.
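Both JTC conditions reduce to simple set checks. The sketch below represents trajectories as sets of agent indices, which is a simplification of the paper's joint trajectories.

```python
def satisfies_jtc(global_traj: set, group_trajs: list) -> bool:
    """Theorem 1 (JTC): GIGM holds if (i) the union of all group
    trajectories equals the global joint trajectory, and (ii) the
    group trajectories are pairwise disjoint."""
    union = set().union(*group_trajs)
    if union != global_traj:
        return False                    # condition (i) fails
    covered = sum(len(g) for g in group_trajs)
    return covered == len(global_traj) # pairwise disjoint iff sizes add up

# Agents 1-5 split into two non-overlapping groups satisfy JTC:
assert satisfies_jtc({1, 2, 3, 4, 5}, [{1, 2}, {3, 4, 5}])
# Overlapping groups violate condition (ii):
assert not satisfies_jtc({1, 2, 3}, [{1, 2}, {2, 3}])
```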

Ideal Object Grouping
In order to utilize GIGM to solve the LTH problem, we propose Ideal Object Grouping (IOG), which partitions agents into different groups by their different ideal objects IO. As mentioned in section 2.2, we need to formally define, in a universal way, what it means for agents to be of different types. We point out that a difference in IO is equivalent to a difference in agent type, because fundamentally these differences are all differences in the agent action space A, which is the exact functionality and property describing heterogeneous agents. In general, our goal is to acquire a grouping function g(i, G_m) (i ∈ K, m ∈ U) for agent i and group G_m. Each agent group G_m consists of agents with the same IO_Gm and the same interactive action-dim |A_act-Gm|. Only one universal agent network is kept for each group, which significantly reduces the number of agent networks from K to U. Parameter-sharing is only allowed between agents within the same group. Maintaining a proper parameter-sharing structure not only avoids spending redundant computing resources on individual agent networks, but also increases in-group cooperation via homophily [35].
Moreover, IOG is a mapping function from agents to groups, g(i, G_m) : (K → U). The |A_act| of each agent must be assigned during the initialization of SMAC. As a result, one specific agent i can only be assigned to the certain group with the matching IO and |A_act|. Therefore, JTC is satisfied and GIGM holds true, indicating that IOG is an appropriate grouping method for value factorization.
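The IOG mapping can be sketched as a dictionary grouping; the agent names and dimensions below are hypothetical.

```python
def iog(agents: dict) -> dict:
    """Ideal Object Grouping: map each agent to the group sharing its
    ideal object IO and interactive action-dim |A_act|. Keys of the
    returned dict are (IO, |A_act|) pairs; one agent network per group."""
    groups = {}
    for agent_id, (io, act_dim) in agents.items():
        groups.setdefault((io, act_dim), []).append(agent_id)
    return groups

# Toy roster: Marines target enemies (|A_act| = 5),
# Medivacs target allies (|A_act| = 3).
agents = {
    "marine_0":  ("enemy", 5),
    "marine_1":  ("enemy", 5),
    "medivac_0": ("ally", 3),
}
groups = iog(agents)
# Two groups -> two agent networks instead of three (K -> U).
```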

Inter-Group Mutual Information Loss
In order to enhance inter-group cooperation and correlation, we maximize the Inter-Group Mutual Information (IGMI) between the trajectories τ_Gm and τ_Gn of different groups, written as I(τ_Gm; τ_Gn). For encoding trajectories, a common implementation is to use the hidden states h_Gm and h_Gn of a gated recurrent unit (GRU) [65]. Since the GRU takes o^t_Gm and a^t_Gm recursively for all t ≤ T, we assume that h_Gm is capable of encoding and representing τ_Gm. After encoding, because the mutual information can only be calculated between two distributions, we add a Gaussian distribution layer to the agent network of every group, denoted l_Gm and l_Gn. Therefore, calculating I(τ_Gm; τ_Gn) can be converted into calculating I(l_Gm; l_Gn | h_Gm, h_Gn). The detailed agent network structure is illustrated in Fig. 1(a).
We further derive a lower bound of I(τ_Gm; τ_Gn) for easier calculation (11), where α = p(l_Gm) / p(l_Gm | l_Gn, h_Gm, h_Gn) is always non-negative, and D_KL is the KL-divergence, which is also non-negative. q_Gm is an inference distribution of group G_m with parameters ψ_Gm, and it is independent of h_Gn. To preserve this independence, a mixed input of different groups is forbidden; therefore, we keep an individual inference network for each group. Finally, we obtain a tractable lower bound, and in order to maximize the IGMI, the loss L_MI following (13) is defined as the negative of this bound.
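As a simplified, univariate sketch of the variational machinery (the paper's bound is over conditional, network-parameterized Gaussians), the KL term that drives the IGMI loss can be computed in closed form.

```python
import math

def gauss_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between two univariate Gaussians, in closed form."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5)

def igmi_loss(mu_prior, sigma_prior, mu_q, sigma_q):
    """Toy stand-in for the variational IGMI term: the intractable
    posterior p(l_Gm | l_Gn, h) is replaced by an inference
    distribution q_Gm; shrinking KL(p(l_Gm) || q_Gm) tightens the
    lower bound, so this non-negative KL serves as the loss."""
    return gauss_kl(mu_prior, sigma_prior, mu_q, sigma_q)

# The loss vanishes exactly when the inference matches the prior:
assert igmi_loss(0.0, 1.0, 0.0, 1.0) == 0.0
assert igmi_loss(0.0, 1.0, 1.0, 1.0) > 0.0
```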

Grouped Hybrid Q-Learning
An ordinary idea for calculating Q_Gm and Q_i is to design factorization structures for Q_tot → Q_Gm and Q_Gm → Q_i. Let C_Gm and C_i be the two factor functions. As with IGM [66], the monotonicity constraint is also sufficient for GIGM; therefore, the hierarchical factorization can be written as in (14).

Our key insight is that explicitly calculating C_Gm and Q_Gm in this way is unnecessary. Instead of hierarchical factorization, we apply independent Q-learning (IQL) [67] to C_Gm, which we call hybrid factorization. This method makes Q_Gm an action-value function instead of a utility function [66,68], and makes C_Gm a positive constant. As a result, the TD loss of group G_m is written as in (16). GIGM and the input of state information keep the different Q_Gm relevant to each other, and IGMI further enhances their correlation. Even though IQL methods suffer from non-stationarity [69], GHQ overcomes this disadvantage and achieves impressive results. The hybrid factorization avoids calculating the hierarchical factorization function. Although the IQL value of Q_Gm following (16) does not equal the factorized value of Q_Gm following (14), the monotonicity of factorization and GIGM still hold. As a result, the optimal policy of GHQ converges to the same optimal policy provided by the fully factorized structure.

Algorithm 1 (GHQ training loop):
1: Initialize network parameters θ and ψ, replay buffer D, and total step counter t_TOT = 0.
2: while t_TOT is below the training step limit do
3:   Set episode step t = 0 and receive the initial state and observation (s^0, o^0) from the environment.
4:   repeat
5:     Choose the joint action a^t = {a^t_i}^K_1 with the agent networks {θ_i}^K_1.
6:     Receive r^t, the termination flag, and (s^{t+1}, o^{t+1}) from the environment using a^t.
7:     Collect a transition tuple at t and update the replay buffer D = D ∪ {(τ^t, s^t, r^t)}.
8:   until the episode terminates
9:   Sample a batch from D; calculate L_MIm following (13) and L_TDm following (16).
10:  Calculate the total loss L_GHQ following (18) and update the network parameters θ and ψ.
11:  t_TOT = t_TOT + t.
12: end while
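The per-group TD loss under hybrid factorization is ordinary independent Q-learning on the shared team reward. Below is a tabular sketch of that loss, not the paper's neural implementation; the two-action toy tables are invented for illustration.

```python
def group_td_loss(batch, q_gm, q_gm_target, gamma=0.99):
    """Sketch of the per-group TD loss L_TDm: each group value Q_Gm is
    trained with independent Q-learning against the shared team reward,
    instead of being factorized from Q_tot.
    `batch` holds (state, action, reward, next_state, done) tuples;
    q_gm / q_gm_target map (state, action) -> value (tables here)."""
    loss, actions = 0.0, [0, 1]
    for s, a, r, s_next, done in batch:
        target = r
        if not done:
            target += gamma * max(q_gm_target[(s_next, a2)] for a2 in actions)
        loss += (target - q_gm[(s, a)]) ** 2
    return loss / len(batch)

# Toy tables for one group; the target network equals the online one.
q = {("s0", 0): 0.0, ("s0", 1): 0.0, ("s1", 0): 1.0, ("s1", 1): 1.0}
batch = [("s0", 1, 1.0, "s1", True)]   # terminal step with reward 1
loss = group_td_loss(batch, q, q)      # target = r = 1.0, Q = 0.0
```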

Implementation Details and Network Architecture
The detailed network architecture is illustrated in Fig. 1, an overview of the data-stream of GHQ is illustrated in Fig. 2, and the pseudo-code of GHQ is given in Algorithm 1. As shown in Fig. 1 and 2, there are three kinds of networks, marked with different colors. The agent network θ_i, marked in green, is shared by all agents of one group (i ∈ G_m). θ_i receives the current observation o^t_i and the last action a^{t-1}_i, and generates Q^t_i. The input is first sent to a Multi-Layer Perceptron (MLP) and then to a GRU layer. The hidden state h_Gm of the GRU is sent to the following layers and to the next time-step. The following layer is a Gaussian distribution layer generating l_Gm from h_Gm; l_Gm is then sampled and sent to the next two MLP layers. Finally, a skip connection directly sends h_Gm to the final MLP layer, where it is concatenated with the output of the former MLP layer for generating Q_i.
The mixing network θ_Mm is marked in blue. It takes all individual Q_i of the agents in group G_m to calculate the group value Q_Gm(τ_Gm). Then, the mixing network θ_Mm takes Q_Gm(τ_Gm) and the state s to calculate Q_Gm(τ_Gm, s) and the TD loss L_TDm. The GRU hidden states h_Gm, h_Gn and the Gaussian distributions l_Gm, l_Gn are generated from the agent networks θ_i and θ_j, and are used to calculate the IGMI losses L_MIm and L_MIn. Detailed formulas are shown in section 4.3.
The inference network ψ_Gm is marked in yellow and is only used to calculate L_MI. It takes the GRU hidden state h_Gm of group G_m and the Gaussian latent l_Gn of another group G_n as input. The input is first sent to an MLP layer and then to a new Gaussian distribution layer to generate the inference distribution q_Gm(l_Gm | l_Gn, h_Gm). The MI loss L_MI is calculated as the KL-divergence between the original distribution p(l_Gm) and the inference distribution q_Gm(l_Gm | l_Gn, h_Gm).
Finally, when calculating the total loss L_GHQ, the adjusting weights λ_TD and λ_MI are introduced. In our implementation, we set λ_TD = λ_MI = 1. We choose Adam [70] as the optimizer, with the learning rate of all networks being 3e-4. The total training step is 5M and the maximum step for one episode is 200. The learning rate is scheduled to decay by a factor of 0.5 every 50,000 episodes (on average, about every 2M-3.5M steps). The reward discount factor γ is 0.99. The ε of the ε-greedy action-selection policy starts at 1.0, ends at 0.05, and declines linearly over 50,000 steps. The size of the memory buffer is 5,000 and the batch size is 32. A universal buffer saves all data for training, including trajectories of states s_t, observations o_t, actions a_t and rewards r_t. After one episode, the latest data is inserted into the buffer, and one batch of 32 episodes is sampled from the buffer and used for training. Table 3 summarizes the hyperparameters mentioned above. In summary, the total loss of GHQ is written as:

L_GHQ = λ_TD Σ_m L_TDm + λ_MI Σ_m L_MIm. (18)

5 Experiments and Results

Designing New Asymmetric Heterogeneous Maps in SMAC
In section 3.4, we prove the existence of LTH in SMAC. However, the default setup of the SMAC environment and the default implementations of previous algorithms ignore the existence of the LTH problem and the importance of asymmetric heterogeneous scenarios.
First, the SMAC environment uses a padding vector to deal with the different interactive action-dims |A_act|. It increases the |A_act| of U_spt up to the |A_act| of U_atk with the padding vector, and masks unavailable actions during action selection. This solution covers up the existence of the LTH problem. In addition, because of the padding vector, previous algorithms can apply parameter-sharing among all unit types, which further prevents the MAS from learning a better coordination policy. In GHQ, all agents use their true |A_act|, and parameter-sharing is restricted to agents within the same group.
Second, it is ignored that the internal AI script of StarCraft II is incapable of coordinating and collaborating among multiple types of agents. As a result, the performance of enemies in symmetric heterogeneous maps is limited, and we consider asymmetric heterogeneous maps better suited for presenting and studying the LTH problem. There are only two asymmetric heterogeneous maps in the original SMAC map set: 3s5z vs 3s6z and MMM2 (see Table 1). However, each of these two maps has its own shortcoming. All units in 3s5z vs 3s6z are U_atk, differing only in their shot-range and health-points. The heterogeneity of this map is therefore restricted, because all U_atk have the same ideal object, and algorithms can acquire high performance without any information about the types or other properties of agents. The other map, MMM2, contains Marines, Marauders, and Medivacs (see Tables 1 and 2). U_atk and U_spt, ground units and flying units are all included in the map. However, since both sides contain all three types of units, the internal AI script is unable to perform well. Therefore, we need to design new asymmetric heterogeneous maps for experiments.
Our maps, by contrast, avoid the shortcomings of the original maps. For allies, we have Marines and Medivacs, a U_atk on the ground and a U_spt in the air, which is similar to the common heterogeneous UAV-UGV MAS in [29]. For enemies controlled by the internal AI script, we use only Marines, to avoid the script's inability to coordinate multiple unit types. We increase the number of enemy Marines to balance the difficulty of the maps. Extensive pre-experiments were conducted to determine the specific number of units. Table 4 shows the information of all new maps, and Fig. 3 shows some examples of original and new maps.

Environmental and Experimental Details
In SMAC, all information provided by the environment is organized into tensors of pure data, all of which are either normalized into [0, 1] or transferred into one-hot vectors. We describe the necessary details below for a better understanding of the SMAC environment; more details can be found in the official repository and source code. The observation o_i is the input to the agent network θ_i for calculating Q_i. For agent i, the observation o_i consists of four parts, the moving-feature, ally-feature, enemy-feature and own-feature:
• moving-feature includes the IDs of the available moving actions of agent i;
• ally-feature includes the percentage of health-points, unit type, relative distance, and relative position of other allies to agent i within its sight-range. Information about agents outside the sight-range of agent i is not accessible;
• enemy-feature includes the percentage of health-points, unit type, relative distance, and relative position of all enemies to agent i within its sight-range. Information about enemies outside the sight-range of agent i is not accessible;
• own-feature includes the percentage of health-points and the unit type of agent i.
As described in section 3.2, agent actions consist of two parts: common actions A_com and interactive actions A_act. The common action-dim |A_com| is 6 for all agents. Action ID 0 is the null action, only available for dead agents. Action ID 1 is the stop action, and IDs 2, 3, 4, and 5 are moving actions available for all living agents. The four moving actions are pre-defined by the SMAC source code, indicating moving up, down, left, and right with a certain step-length. The interactive action-dim |A_act| equals the number of interacting objects of a certain agent type: for U_atk, |A_act| is the number of enemies; for U_spt, |A_act| is the number of allies.
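The action layout described above can be sketched as an ID table; the ally and enemy counts below are illustrative, and the action-name strings are our own labels, not SMAC identifiers.

```python
def build_action_ids(unit_role, n_allies, n_enemies):
    """Action layout from the text: |A_com| = 6 for all agents
    (0 = null action for dead agents, 1 = stop, 2-5 = move
    up/down/left/right), followed by |A_act| interactive actions,
    one per target."""
    common = {0: "no-op", 1: "stop", 2: "move_up",
              3: "move_down", 4: "move_left", 5: "move_right"}
    n_targets = n_enemies if unit_role == "atk" else n_allies
    verb = "attack" if unit_role == "atk" else "heal"
    interactive = {6 + t: f"{verb}_target_{t}" for t in range(n_targets)}
    return {**common, **interactive}

marine_actions = build_action_ids("atk", n_allies=8, n_enemies=12)
medivac_actions = build_action_ids("spt", n_allies=8, n_enemies=12)
# Marine: 6 + 12 = 18 actions; Medivac: 6 + 8 = 14 actions.
```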
We use the default global dense reward function of SMAC. The MAS is rewarded for dealing damage to enemies, killing enemies, and winning the game. The damage reward equals the decrease of the enemies' health-points after one time-step, i.e., the absolute damage value dealt to the enemies. The killing reward is 10 for every enemy kill, and the winning reward is 200, given at the terminal time-step.
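The default reward described above can be sketched as follows; the function and argument names are ours, while the constants 10 and 200 are from the text.

```python
def step_reward(hp_before, hp_after, n_killed, won,
                kill_reward=10.0, win_reward=200.0):
    """Dense global SMAC-style reward for one time-step, shared by all
    agents: absolute damage dealt to enemies, plus 10 per enemy kill,
    plus 200 at the terminal time-step when the game is won."""
    damage = sum(hp_before) - sum(hp_after)  # enemy health-point decrease
    return damage + kill_reward * n_killed + (win_reward if won else 0.0)

# two enemies drop from (45, 30) to (30, 0) hp: 45 damage, 1 kill
r = step_reward([45.0, 30.0], [30.0, 0.0], n_killed=1, won=False)
```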
We use the official implementations of all algorithms with minimal necessary adaptation to our new environmental settings. In general, we use the traditional winning-rate (WR) as the measuring criterion. WR is the probability of the MARL agents eliminating all enemies and winning the game, and is approximated by the frequency of winning. We use the WR averaged over 32 testing episodes. Testing episodes are taken every 10,000 training steps (about 1,000 training episodes). 5 rounds of complete experiments with different random seeds are performed for plotting the curve of the averaged WR, with the p-value being 0.05. As shown in Fig. 1 and 2, GHQ uses extra inference networks to calculate the IGMI loss. As a result, the computing time of GHQ is roughly 1.5 times that of QMIX. Other value-based methods also consume more time than QMIX, indicating their greater complexity.

Criteria for Measuring Map Heterogeneity and Difficulty
According to our analysis in section 3.4, the existence of LTH in SMAC is clear. However, the influence of LTH on agent policy still needs to be analyzed and quantified. Here, we propose objective criteria to measure the heterogeneity and difficulty of maps.
The Proportion of Supporting Units (POS) is the number of ally supporting units |U_spt_i| divided by the number of overall ally units |U_i|. The Enemy Strength (ES) is the ratio of the weighted attacking units U_atk of the two sides, measuring the relative strength of the two sides' U_atk; it is calculated from the unit counts |U_Ai| and |U_Ae| and the correction weights w_i and w_e (see Table 5). We design several homogeneous maps consisting of only Marines for both sides. The enemy consists of 15, 20, and 30 Marines, which is almost the same as in our heterogeneous maps. The ally consists of slightly fewer Marines than the enemy (see Table 5). According to the converged WR, we conclude that in homogeneous maps with only Marines controlled by the QMIX-FT [21] algorithm, ES and WR are highly related and proportional. When ES is about 1.25, WR is about 0.5; and when ES is less than 1.18, WR remains 1.0. Even if the total number of units is doubled, this relation remains unchanged. In symmetric homogeneous maps, ES is at its minimum of 1.0, and thus the MARL policy wins more easily than in asymmetric maps.
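For concreteness, the two criteria can be computed as follows. This is a minimal sketch: the function names are ours, and the correction weights of 1 apply only to the Marine-only attacker setting described in Table 5.

```python
def pos(n_spt, n_total):
    """Proportion of Supporting Units: |U_spt| / |U|."""
    return n_spt / n_total

def es(n_enemy_atk, n_ally_atk, w_e=1.0, w_i=1.0):
    """Enemy Strength: ratio of weighted attacking units of the two
    sides.  With Marine-only attackers, both correction weights are 1
    and ES reduces to the ratio of Marine counts."""
    return (w_e * n_enemy_atk) / (w_i * n_ally_atk)

# 6m2m_16m: 6 ally Marines + 2 Medivacs vs. 16 enemy Marines
pos_6m2m_16m = pos(2, 8)    # 0.25, i.e. POS = 25.0%
es_6m2m_16m = es(16, 6)     # ~2.67
```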
We further design additional heterogeneous maps (see Table 5). On the one hand, the ES of heterogeneous maps can easily be increased to 1.7-2.4 while the WR of QMIX-FT remains about 0.9. Introducing heterogeneity into SMAC maps can thus significantly increase their difficulty, so it is necessary to study and better utilize heterogeneity. On the other hand, POS and ES are highly related. In order to achieve a high WR in harder maps with high ES, we need to increase POS simultaneously with increasing attacking units. For example, in 6m2m_16m, ES is 2.67 and POS is 25.0%, and both GHQ and QMIX-FT can only achieve a WR of about 0.5. By contrast, in 8m3m_21m, ES is 2.63 and POS is 27.3%, and the WR reaches about 0.9.
In conclusion, our results demonstrate the shortcomings of the original symmetric SMAC maps, and the ability of GHQ and QMIX-FT to handle the LTH problem under higher POS and ES. The following experiments show that better utilizing LTH helps GHQ to acquire a higher WR with smaller variance than QMIX-FT. Additionally, we conclude that the strength of 1 Medivac equals about 3.5 Marines.
RODE and ROMA are role-based algorithms, which learn and apply role policies online and end-to-end. These two algorithms are more similar to our group-based algorithm than the others. However, ROMA cannot learn an effective policy within 5M (5 million) training steps, because the default training length of ROMA is 20M. In RODE, several key hyperparameters define the clustering and use of role-policies. The end-to-end clustering of role-policies makes it difficult to focus on the LTH property, so the performance of RODE is restricted. QPLEX, MAIC, and CDS modify the factorization structure of QMIX with distinct methods. COMA and MAPPO are actor-critic algorithms using the "centralized critic decentralized actor" (CCDA) architecture. These two algorithms apply parameter-sharing in actor networks and use one shared critic network. HAPPO uses independent network parameters for actor networks and proposes a monotonic policy-improving architecture with a theoretical guarantee.

Comparison Results
The results of value-based algorithms are shown in section 5.5.1, Fig. 4, and Table 6. We evaluate the performance of value-based algorithms with 4 groups of experiments. All of our GHQ results are in red, and the colors of the other value-based algorithms are shown in the legend. The results of policy-based algorithms are shown in section 5.5.3 and Table 8. Generally, all comparison algorithms suffer from the LTH problem and cannot acquire a high WR with small variance. Previous value-based algorithms are basically modified from QMIX and, to some extent, weaken the ability of QMIX to handle the LTH problem.

Results of Value-based Algorithms Comparison
Results for value-based algorithms are shown in Fig. 4 and Table 6. In Table 6, the results are the final WR, averaged across 5 individual tests with different random seeds, with standard deviations in parentheses. In Fig. 4, the lines and shadows are fitted across the whole data, so the values may differ slightly from the results in Table 6.
(1) We test all algorithms on the original asymmetric heterogeneous map MMM2. The results are shown in Fig. 4 (a). Because the map is relatively easy and almost all algorithms converge within 3M training steps, we only show the results up to 3M steps for better presentation. The graph shows that the WR of most algorithms converges to 1.0 at about 1.5M steps with relatively small variance. QPLEX and GHQ are slightly better than QMIX. RODE and MAIC converge at about 2.5M steps, slower than the other algorithms. ROMA and CDS fail to converge within 3M steps.
(2) We decrease the heterogeneity of maps by decreasing POS. In Fig. 4 (b), (d), (g), and (h), the number of Medivacs remains 2, while the number of Marines is increased. Therefore, the POS decreases from 25.0% in (b) to 11.1% in (h) (see Table 5). As a result, algorithms using parameter-sharing among all agents learn better policies than under the setting of increasing POS. In general, algorithms perform well in the small-scale maps (b) and (d), but only GHQ and QMIX-FT perform well in both of the large-scale maps (g) and (h). GHQ outperforms QMIX-FT with smaller variance. MAIC and QMIX perform well in (g) but fail in (h), indicating their limitation in handling large-scale problems. QPLEX and RODE cannot learn effective policies in (g) and (h), while ROMA and CDS completely fail in (g) and (h). RODE performs better in (d), (g), and (h) than in (c), (e), and (f), indicating that the training of the role-selector requires homogeneous MARL settings.
(3) We scale up all units of both sides simultaneously. In Fig. 4 (b) and (c), the POS remains 25.0%, while the number of Medivacs increases from 2 to 4. Theoretically, the optimal policies of maps (b) and (c) are similar. However, this scaling method combines the complexity of scalability and heterogeneity, making it harder for comparison algorithms to learn effective policies. In (b), most algorithms achieve a high WR within 5M steps, while GHQ converges fastest and RODE suffers from high variance and relatively low WR. ROMA and CDS fail to learn effective policies in (b). However, in (c), almost all comparison algorithms fail to learn effective policies. GHQ and QMIX-FT outperform the other algorithms and have not yet converged at 5M steps. QPLEX also suffers from the complexity, but generally performs better than QMIX, ROMA, RODE, MAIC, and CDS.

Independent t-test and further analysis of GHQ against other Value-based Algorithms
In SMAC, we cannot conduct experiments in which two algorithms play against each other. Therefore, we cannot directly count the win-lose relationship between algorithms for the statistical tests in [73]. As an alternative, we use the data in Table 6 to conduct independent t-tests between GHQ and the other value-based algorithms to prove the significance of the obtained results. We assume that the distributions of all results are normal, following the mean values and standard deviations in the table. We use Scipy to generate the distributions with a sample size of 500, and then run the independent t-tests. The results are shown in Table 7, with the t-statistics followed by the p-values.
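The t-test procedure above can be sketched as follows. The function `compare` is our own wrapper, and the example means and standard deviations are hypothetical placeholders rather than values taken from Table 6.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare(mean_a, std_a, mean_b, std_b, n=500, seed=0):
    """Approximate the paper's test: assume each algorithm's WR is
    normally distributed with the reported mean/std, draw n = 500
    samples per algorithm, and run an independent t-test."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mean_a, std_a, n)  # e.g. GHQ results
    b = rng.normal(mean_b, std_b, n)  # e.g. a baseline's results
    return ttest_ind(a, b)

# hypothetical WRs: 0.90 (0.05) vs. 0.50 (0.20)
t_stat, p_value = compare(0.90, 0.05, 0.50, 0.20)
```

A positive t-statistic favors the first algorithm, and p < 0.05 marks the difference as significant, matching the reading of Table 7 in the text.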
Almost all p-values are smaller than 0.05, indicating the significance of the results. Only the p-values of GHQ against QMIX-FT, QMIX, QPLEX, and RODE on MMM2 are greater than 0.05, indicating that the result of GHQ has no significant difference from the results of these 4 algorithms, which is confirmed by Table 6. The t-statistics are also almost all positive, indicating the superior performance of GHQ over the other algorithms.
In 6m2m_15m, the mean value of GHQ is smaller than those of QMIX-FT and QPLEX. First, we point out that, as shown in Fig. 4 (b), the WR curve of GHQ grows faster than those of the other two algorithms, indicating the faster learning speed of GHQ. Second, for further analysis, we draw a heat-map of the U_spt health-point percentage following the method in section 5.6.2; the result is shown in Fig. 5. It can be concluded that GHQ learns similar policies in 6m2m_15m and 6m2m_16m, namely to "let U_spt take damage to preserve U_atk". However, even though QMIX-FT manages to learn a similar policy to GHQ in 6m2m_15m, it fails to learn the proper policy in 6m2m_16m. This phenomenon indicates the increased difficulty of 6m2m_16m over 6m2m_15m, as the optimal policy becomes harder to learn.

Results of Policy-based Algorithms Comparison
Due to the discrete property of SMAC, value-based algorithms have generally achieved better results than policy-based algorithms [20][21][22]. To support this conclusion, we conduct experiments with COMA, MAPPO, and HAPPO against GHQ and QMIX-FT. Results for these algorithms are shown in Table 8. Notably, in the LTH problem, the sequential partial order of agent actions can significantly affect the final joint policy. In conclusion, the results show that value-based algorithms generally perform better than policy-based algorithms, and GHQ outperforms all policy-based baseline algorithms.

Ablation Study
The ablation study consists of two experiments. Section 5.6.1 is the ablation test of the two component parts of GHQ, IOG and IGMI. Because IGMI must be applied between two agent groups, "QMIX-FT+IGMI" cannot be tested individually. Therefore, 3 groups of ablation tests are taken in 4 maps, as shown in Fig. 6. The other experiment, in section 5.6.2, is the visualization analysis of the policies of GHQ and QMIX-FT trained in 6m2m_16m. We visualize the trained policies of the two algorithms in heat-maps to show the influence of IOG and IGMI on policy learning. The temperature of the heat-maps is the counting sum of the corresponding agents. The results are shown in Fig. 7.

Ablation Tests about IOG and IGMI
In order to analyze the effectiveness of the IOG method and the IGMI loss in different maps, we take ablation tests in (a) MMM2, (b) 6m2m_16m, (c) 8m4m_23m, and (d) 16m2m_30m. QMIX-FT and QMIX-FT+IOG are the ablation groups. The results are shown in Fig. 6.
In general, as expected, the IOG method helps to improve the performance of QMIX-FT, and the IGMI loss helps to reduce variance. Fig. 6 (a) shows that all algorithms are able to conquer the MMM2 map within 1.5M steps, while QMIX-FT+IOG and GHQ converge slightly faster than QMIX-FT. In (b) 6m2m_16m, IOG and IGMI perform well: they not only improve the WR but also reduce the variance. In (c) 8m4m_23m, the WR of QMIX-FT increases faster than the other two algorithms before 3M steps, but the IOG method manages to find a good cooperating policy and converges to a better WR at 5M steps with smaller variance. The higher derivative of the IOG curve between 3M and 4M steps indicates the progress of learning a better policy. In (d) 16m2m_30m, GHQ outperforms QMIX-FT with higher WR and smaller variance. QMIX-FT+IOG achieves a result similar to QMIX-FT, but with even larger variance. The main reason is that the difference between the two groups is too large; introducing the IGMI loss helps to restrict this difference and improve the correlation between groups. Therefore, GHQ achieves the best result among the three tested algorithms.
(1) Parameter-sharing among different agent types does influence agent policy. As suggested in [32], parameter-sharing restricts network parameters from being diverse. Red boxes in Fig. 7 (c) and (g) show a similar policy pattern of "first move and then stop to attack/heal" for the two types of agents in QMIX-FT. Specifically, both types of agents prefer to choose actions 2 and 5 in the first 14 time-steps. In GHQ, however, the diversity of different groups is guaranteed, as generally shown in (d) and (h). In addition, comparing the Medivac policies of QMIX-FT and GHQ in (g) and (h), it is clear that the QMIX-FT Medivac policy in (g) is more similar to the QMIX-FT Marine policy in (c) than to the GHQ policies in (h) and (d).
(2) GHQ improves group policy learning. Green boxes in Fig. 7 (c) and (d) indicate that Marines controlled by GHQ learn a better "focusing fire" policy, as the temperatures of A_act are notably hotter than those of QMIX-FT. GHQ agents learn to focus fire on one specific enemy target within several time-steps, which lets them eliminate enemies quickly and reduces the damage they take. By contrast, QMIX-FT agents learn to fire at several targets at the same time, which slows elimination and causes more damage. The yellow box in Fig. 7 (d) shows that the moving policies of GHQ Marines are also significantly different from those of QMIX-FT. GHQ Marines finish their movement in the first 4 time-steps with decisive actions and form a tight front. They tend to stay together and therefore take enemy damage simultaneously, which leads to a similar decreasing tendency of health-points and the two obvious temperature valleys at the 80th and 40th percentiles in the yellow box of (b).
(3) GHQ improves inter-group cooperation. Orange boxes in Fig. 7 (e) and (f) represent the decreasing health-point curves of the Medivacs. GHQ Medivacs learn a better "distracting" policy than QMIX-FT Medivacs. One GHQ Medivac first moves toward the enemies and attracts fire to prevent them from attacking ally Marines. This is confirmed by the orange box in (h) with the "action 0 line", indicating the death of one Medivac agent. Then, the other Medivac moves on to keep attracting enemy fire. As a result, the figure in (f) consists of two independent curves. The distraction policy performed by the GHQ Medivacs is a remarkable tactic and differs significantly from the policies of the GHQ Marines, indicating that GHQ is capable of utilizing LTH for better cooperation.

Conclusion
In this paper, we focus on the cooperative heterogeneous MARL problem, especially the asymmetric heterogeneous MARL problem. In order to describe and study it, we propose the Local Transition Heterogeneity (LTH) with a formal definition. To support the definition of LTH, we first define the Local Transition Function (LTF) and several auxiliary concepts. Furthermore, we study the existence and influence of LTH in SMAC.
In order to address the LTH problem, we first propose the Grouped Individual-Global-Max (GIGM) consistency. Following the restriction of GIGM, we further propose the Ideal Object Grouping (IOG), the Inter-Group Mutual Information (IGMI) loss, and the hybrid factorization structure. The combination of these three methods forms our novel Grouped Hybrid Q-learning (GHQ) algorithm. Experiments conducted on asymmetric heterogeneous SMAC maps show that GHQ outperforms other state-of-the-art algorithms. The results prove the necessity of studying and utilizing LTH for more complex scenarios in SMAC.
We believe that the study of heterogeneity is indispensable for future MARL research, and we hope that our mathematical definitions and analysis can help future studies on the cooperative heterogeneous MARL problem. Due to the restriction of computing resources and network structure, we are unable to study large-scale or transfer-learning problems in heterogeneous MARL. In the future, we will try to solve larger-scale and more complex heterogeneous MARL problems in other maps and environments.

Statements and Declarations
• Funding: No funding was received to assist with the preparation of this manuscript.

[Algorithm 1 fragment: line 3 opens the per-episode loop "while t ≤ T_EP and not terminated do"; line 4 samples a random batch of B episodes from D.]

Fig. 3 .
Fig. 3. Examples of SMAC maps.The lower two are ours.
The final WRs are shown in the table, averaged across 3 individual tests with different random seeds, with standard deviations in parentheses. The original papers of MAPPO and HAPPO run experiments in SMAC for 10M training steps, so we list the results at 5M and 10M training steps separately. COMA can only acquire a WR in MMM2 and fails in all other maps. MAPPO performs best among the 3 policy-based algorithms, especially in 6m2m_15m, 8m4m_23m, and 12m4m_30m. These maps have relatively high ES and POS, indicating the potential of MAPPO in handling LTH problems. HAPPO performs worse than the other policy-based algorithms. One possible reason is that HAPPO implements Multi-Agent Advantage Decomposition (MAAD) via a random sequential update-and-execute scheme.
in the SMAC environment. The mainstream value-based method is the value factorization method. Its formal objective is to learn a centralized yet factorized joint action-value function Q_tot and the factorization structure Q_tot → Q_i, and use them to calculate the TD-error and guide the optimization of agent policies. At each time-step t ≤ T, agent i ∈ K receives an individual partial observation o_i^t and chooses an action a_i^t ∈ A_i from the local action set A_i, with local action-dim |A_i|. Actions of all agents form a joint action a^t = (a_1^t, ..., a_k^t) ∈ A = (A_1, ..., A_k). The environment receives the joint action a^t and returns a next-state s^{t+1} according to the joint transition function P(s^{t+1} | s^t, a^t), and a reward r^t = R(s^t, a^t) shared by all agents. The joint observation o^t = (o_1^t, ..., o_k^t) ∈ Ω is generated according to the observation function O(s^t, i). The observation-action trajectory history τ^t = ∪_{1}^{t} {(o^{t-1}, a^{t-1})} (t ≥ 1; τ^0 = o^0) is the summary of partial transition tuples before t. Specifically, τ_i = τ_i^T denotes the overall trajectory of agent i through all time-steps t ≤ T. The replay buffer D = ∪(τ, s, r) stores all data for batch sampling. Network parameters are denoted by θ and ψ.
AM(s^t) is a binary matrix with dimensions |A_i| × K, indicating the available actions of all agents at state s^t. AM_i(s^t, i) is the i-th column vector of AM(s^t), i.e., the mask vector of a certain agent. Element 1 (true) at (a_i^t, i) of AM(s^t) means that agent i can take action a_i^t at s^t, and vice versa. Finally, we define the Ideal Object (IO) and the Ideal Condition (IC), and then define the LTF P_i(s^{t+1} | s^t, a_i^t).

Definition 1. Ideal Object (IO) and Ideal Condition (IC): The Ideal Object IO_i of agent i is an action object that is available for any A_act of agent i to be applied on. The Ideal Condition IC_i of agent i is the environmental condition that keeps the local available-action-mask function AM_i(s^t, i) all true for any state s^t and any action a_i^t applied on IO_i.

Definition 2. Local Transition Function (LTF): For agent i with its IO_i and IC_i, the Local Transition Function (LTF) P_i(s^{t+1} | s^t, a_i^t) is the probability distribution of the next-state s^{t+1} conditioned on the state s^t and the action a_i^t, where a_i^t is applied on IO_i under IC_i.
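A toy sketch of the available-action matrix AM(s^t) may help. The availability rule encoded here follows the action semantics described earlier (action 0 is the null action, available only for dead agents); the function name and shapes are our own illustration.

```python
import numpy as np

def available_mask(alive, n_actions):
    """Toy available-action matrix AM(s^t) of shape |A_i| x K:
    entry (a, i) = 1 iff agent i can take action a at s^t.
    Dead agents can take only the null action 0; living agents can
    take every action except the null action (a simplified rule)."""
    K = len(alive)
    AM = np.zeros((n_actions, K), dtype=np.int8)
    for i, living in enumerate(alive):
        if living:
            AM[1:, i] = 1   # stop/move/interact available
        else:
            AM[0, i] = 1    # only the null action
    return AM

AM = available_mask([True, False, True], n_actions=6)
mask_agent_1 = AM[:, 1]     # AM_i(s^t, i): column mask of agent 1 (dead)
```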

Table 2 :
Unit information in SMAC. DPS stands for "damage-per-game-second". For Medivac, the number in the DPS column instead indicates healing-per-game-second (HPS).

Algorithm 1 GHQ. Input: learning rate α, loss weights λ_TD and λ_MI, number of groups U, number of agents K, number of units in each group |G_m|, max total steps T_TOT, max steps per episode T_EP, batch size B. Initialize: network parameters θ = {θ_Gm}_1^U and ψ = {ψ_Gm}_1^U, replay buffer D = {}, total step t_TOT = 0. 1: while t_TOT ≤ T_TOT do ...

Fig. 1. An overall framework of GHQ. θ_Gm of group m consists of three parts: agent network θ_i, mixing network θ_Mm, and inference network ψ_Gm. Detailed data-streams for training and executing are shown in Fig. 2. In (a), θ_i takes o_i^t and a_i^{t-1} as input; it generates Q_i for choosing actions, and l_Gm and h_Gm for calculating q_Gm. In (b), θ_Mm takes Q and s for calculating the TD loss L_TDm with hybrid factorization. In (c), ψ_Gm takes l_Gm, h_Gm, and l_Gn for calculating the IGMI loss L_MIm.

Here y_Gm is the TD-target of Q_Gm, Q^tgt_Gm is the target Q-function of Q_Gm, and θ^tgt_Gm and θ_Gm are the network parameters of Q^tgt_Gm and Q_Gm, respectively. Group network θ_Gm consists of two parts, the agent network θ_i and the mixing network θ_Mm. Their losses are calculated with backward propagation following (15). The mixing network takes the Q_i of the group G_m as input and mixes them with the state s to produce Q_Gm. Four hyper-networks generate weights and biases (w_1, b_1, w_2, b_2) from s, and only the absolute values of the weights are used. The weights and biases multiply with the joint [Q_i] procedurally, and the intermediate results are activated to be non-negative, fulfilling the GIGM requirements. All necessary transition tuples (τ, s, r) are stored in the replay buffer D. During centralized training, a batch of trajectories τ_Gm is sampled from D as the input of θ_i for calculating Q_Gm(τ_Gm).

Fig. 2. An overview of the data-stream of GHQ. During decentralized executing, agent networks θ_i and θ_j generate Q_i^t and Q_j^t for choosing actions a_i^t and a_j^t, respectively. The input of θ_i is the local observation o_i^t and the last action a_i^{t-1} of agent i in group G_m.
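The monotonic mixing step described above can be sketched in a few lines. This is a minimal numpy illustration, not the GHQ implementation: the fixed weight arrays stand in for the hyper-network outputs conditioned on s, and the ELU activation is an assumption (any monotone activation preserves the property).

```python
import numpy as np

def elu(x):
    # monotone increasing activation, as used in QMIX-style mixers
    return np.where(x > 0, x, np.exp(x) - 1.0)

def mix(q, w1, b1, w2, b2):
    """Monotonic mixing of per-agent Q-values [Q_i] into Q_Gm.
    Only |w| is used, so dQ_Gm / dQ_i >= 0 for every agent i,
    which is the monotonicity required by GIGM.  In GHQ the
    (w1, b1, w2, b2) come from four hyper-networks conditioned on
    the state s; fixed random arrays stand in for them here."""
    h = elu(q @ np.abs(w1) + b1)     # first mixing layer
    return float(h @ np.abs(w2) + b2)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.0
q_low  = mix(np.array([1.0, 2.0, 0.5]), w1, b1, w2, b2)
q_high = mix(np.array([1.0, 3.0, 0.5]), w1, b1, w2, b2)  # raise one Q_i
```

Because the weights enter only through their absolute values, raising any single Q_i can never lower the mixed value, so the group-wise argmax of Q_Gm coincides with the per-agent argmaxes of Q_i.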

Table 3 :
The hyper-parameters of GHQ.
We use the latest version 4.10 of the StarCraft II game on Linux to perform experiments, instead of the old version 4.6.

Table 5 :
The Enemy Strength (ES), Proportion of Supporting Units (POS), and Winning-Rate (WR) of QMIX-FT and GHQ on homogeneous and heterogeneous maps. The map name XmYm_Zm means that the allies consist of X Marines and Y Medivacs while the enemies consist of Z Marines. |U_Ai| and |U_Ae| are the numbers of attacking units U_atk for allies and enemies, and w_i and w_e are the correction weights. In our maps, since the only U_spt is the Medivac and the only U_atk is the Marine, POS equals the proportion of Medivacs among all ally units, and ES equals the ratio of the numbers of Marines of the two sides. High POS represents high heterogeneity, because a high proportion of ally U_spt indicates a serious influence of the U_spt policy. High ES represents high difficulty, because the only way to win in SMAC is for the ally U_atk to eliminate all enemies, and high ES means more enemy U_atk than ally U_atk.

Table 6 :
Results of Value-based Algorithms Comparison.

Table 7 :
Results of Independent t-test of GHQ against other Value-based Algorithms.

Table 8 :
Results of Policy-based Algorithms Comparison.

• Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.
• Ethics approval: This article does not involve any ethical problem which needs approval.
• Consent to participate: All authors have seen and approved the final version of the manuscript being submitted.
• Consent for publication: All authors warrant that the article is our original work, has not received prior publication, and is not under consideration for publication elsewhere. A preprint version of our manuscript has been submitted to arXiv, and the page is https://arxiv.org/abs/2303.01070. The journal version improves the overall structure of the article, and is enhanced with more definitions, demonstrations, and experiments.
• Availability of data and materials: The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
• Code availability: The codes for this article are available from the corresponding author on reasonable request.
• Authors' contributions: Conceptualization: [Xiaoyang Yu, Kai Lv, Xiangsen Wang]; Methodology: [Xiaoyang Yu, Kai Lv]; Formal analysis and investigation: [Xiaoyang Yu]; Writing - original draft preparation: [Xiaoyang Yu]; Writing - review and editing: [Xiaoyang Yu, Youfang Lin, Xiangsen Wang, Sheng Han, Kai Lv]; Funding acquisition: [Youfang Lin, Sheng Han]; Resources: [Youfang Lin, Sheng Han]; Supervision: [Youfang Lin, Sheng Han, Kai Lv].