Introduction

Team formation in multi-agent systems refers to the process of combining agents into teams to achieve a common goal, such as maximising team performance and interests [1]. This problem arises only in heterogeneous teams, in either cooperative or cooperative-competitive scenarios, usually against another heterogeneous team [2, 3].

In many previous studies, one important assumption for team formation is that the capabilities of the team members are known in advance [4,5,6]. In practice, the agents may be unaware of the types or capabilities of their potential partners [7], and they may compete against another team about which nothing is known. This is the case, for instance, in military engagements on the battlefield [8], criminal pursuit [9], collision avoidance [10], tracking and collection of space debris [11], etc. In many cases, the only source of information is the results of the games in which these agents have participated, with some variations in the teams between matches, as in many sports. However, the number of games played from which to infer the best lineup is limited, usually a very small fraction of the combinatorial explosion of all possible lineups against all possible lineups.

Not knowing the capabilities of other agents is especially common in Multi-Agent Reinforcement Learning (MARL), because the performance (expected reward) of the same algorithm depends on the scenario and the population of agents (in the same and opponent teams). Even with small variations in the composition of the teams, the convergence of the algorithms may vary significantly [12], and the assumption of a fixed set of capabilities or skills for an agent is elusive. It is also quite common that these teams have no predetermined positions or roles, as in sports (e.g., goalkeeper, midfielder, etc.): any agent can take any position. Given the increasing diversity of reinforcement learning algorithms and the computational effort required to train them, there is an urgent need for an effective solution to the lineup problem of MARL teams in these situations.

Pursuit–evasion games or variants such as intruders-defenders [3], reach-avoid games [13, 14], and predator–prey games [12, 15], are idealisations of a wide range of problems where two teams have to compete and the team members have to cooperate inside the team. The pursuers try to capture the evaders, while the evaders attempt to avoid capture. The pursuit–evasion game has a rich history in the literature and addresses many of the challenges of multi-agent systems in formalising important applications in different fields such as surveillance, navigation, aerospace, and robotics [8, 16, 17]. Pursuit–evasion games usually consider scenarios with (many) more than two players [18,19,20]. In these cases, exploring all combinations is simply infeasible. It is essential then to find effective solutions in these situations, exploiting and generalising from a limited number of game results.

Finally, in competitive team games, there are always adversarial strategies for team formation. If one team (A) knows the formation of the other team (B) in advance, or can make changes to its formation during the game, then an online strategy for team A will likely adapt its formation to team B's formation. This will be followed by changes in the other team, and so on. Even in the absence of information, and when changes are limited or not allowed, there is an offline strategy for team formation before the game. In both cases, quickly evaluating possibilities is key to finding the best option for each team, and possible equilibria.

Given all these considerations, the team formation problem we address in this paper has the following general characteristics: (1) It is set in a competitive scenario (one team playing against another) with possibly several agents in both teams, although it could also be applied to purely cooperative scenarios by assuming the opponent team is fixed. (2) We have to choose from several heterogeneous agents about which we know nothing, except for a small set of results of previous games in which these agents have participated. (3) There are no positions or predefined roles in the team, i.e., any agent can play in any position. (4) We want to consider parameters of the whole team, such as its sociality (selfish, egalitarian or altruistic), which may affect the performance of the whole team. Finally, (5) we are interested in situations where the coach can make changes to the team given information about the other team (online team formation), but also in situations where changes are not possible and the first lineup must stay for the whole game (offline team formation). These five characteristics make our problem novel, general and important for a wide range of realistic applications, especially in MARL.

We present a relatively straightforward yet powerful solution for this problem. We build a predictive model (an assessor model, using the terminology in [21]) for a team's reward, conveniently using team formation as features. Once the model is learnt from previous games, not only can it be used to predict the result of the game given two lineups, but it can also be 'interrogated' to determine the best lineup of team A given a lineup of team B and vice versa. With this we avoid a combinatorial problem, while at the same time obtaining an estimate for every possible combination. This allows for iterations on adversarial team formation, both when the other team's formation is known and changes can be made at any time (online), and when the other team's formation is not known exactly and no changes can be made (offline), using several strategies.

We present experimental results with a number of teams in a pursuit–evasion game [22, 23], in which we deal with heterogeneous teams built from four types of agents: Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [22], MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG) [24], Deep Deterministic Policy Gradient (DDPG) [25], and a random agent. We also modify the way rewards are distributed (the sociality regime) to accommodate different kinds of sociality (selfish, egalitarian or altruistic) in the team, following the data from [26]. The effectiveness of collective action in these experiments provides information about the capabilities of a team depending on its formation, from which the assessor model is trained using standard machine learning algorithms for regression.

Our main contributions are:

1. We demonstrate how to express the team formation of MARL algorithms in a pursuit–evasion game as features for developing an effective assessor.

2. We show that the outcome of a game given two lineups can be estimated accurately using an assessor, a regression model predicting reward.

3. We show how the assessor model can be 'interrogated' to derive the full matrix for any combination of both teams, and how this can be used to determine best, average and worst situations.

4. We explore team formation strategies in online and offline situations (knowing exactly the other team's formation and being able to make changes, or not) and study iterations and fixed points.

The rest of the paper is organised as follows. We review related work in the following section. We explain the pursuit–evasion game, the environment and the data we used in section “Scenario and assessor data representation”, and consider how to represent teams and other team properties in a convenient way to build an assessor that models the outcome of the game. Section “Model training and evaluation” builds and validates the model and studies feature relevance, in preparation for the use of the assessor in the following section. Section “Team formation strategies” uses this assessor to explore different strategies when the opponent's lineup is given, when lineups are chosen (offline) in anticipation of the expected formation of the other team, and when they are chosen iteratively (online) from observing the other team's formation. Finally, section “Conclusion” closes the paper, enumerating possible limitations and main take-aways.

Background

While team homogeneity can be optimal in situations with high levels of coordination, planning and shared information, in many other situations team diversity has a positive influence on robustness and performance, both in human [27] and artificial agent teams [28]. Team heterogeneity also allows for the combination of different systems, which is very convenient in robotics or in hybrid human–machine teams, since the assumption of fully-homogeneous teams is simply not possible. However, heterogeneity poses challenges for the identification of appropriate policies at the individual level, and creates the lineup, or formation, problem.

Team formation in a multi-agent system is the process of grouping multiple agents into teams to achieve a common goal or to outperform other teams. Team formation is generally problem-specific, since applications differ and many factors influence the outcome; in other words, there is rarely a single team that is best for all situations.

Given the relevance of the team formation problem inside and outside artificial intelligence, there is an extensive literature in a variety of contexts and application domains, e.g., robotics [29], social networks [30], unmanned aerial vehicles (UAV) [31] and football [32], to name a few. Research has focused on algorithms or strategies that help automate team formation.

Team formation techniques depend on the information available about the task and the pool of agents to choose from. They also depend on whether this information is available at the beginning, so that the team has to be built before the game, or whether new information can be used to adapt the team during the game. These two settings are usually referred to as offline and online team formation, respectively. For offline team formation, there are various algorithms optimising different criteria, using standard solvers [33], graphs [34, 35], and heuristic algorithms (genetic algorithms [36], ant colony optimisation [37], annealing algorithms [38]), etc. Although these optimisation algorithms do not require extensive agent modelling before forming a team, their effectiveness depends on the quality of the metrics and of the composite value function used to optimise the team. These functions are usually kept relatively simple to make the optimisation tractable.

In online team formation, we have the opportunity to repeatedly form a new team, even at every move of the game. If this team formation problem is modelled as a Markov decision process [39], reinforcement learning algorithms can even train agents to learn how to collaborate with each other to form new teams for different tasks [40]. Agents are repeatedly assigned to form a new team during different assignments and learn online by observing the results of the team's actions. Bayesian approaches can be used to deal with the uncertainty in player abilities [39] and actions [41]. In general, online team formation requires significant expertise and resources to constantly replace some members with others.

In both offline and online team formation, having a profile of capabilities for each agent can be very helpful for determining good team lineups. However, in artificial intelligence, the increasing use of machine learning makes this characterisation more difficult. In multi-agent systems, the use of reinforcement learning is widespread today, and extracting capabilities for each agent that can be combined analytically to derive the optimal team lineup seems unrealistic. MARL methods can also integrate different degrees of heterogeneity and sociality regimes (selfish, egalitarian or altruistic) [42]. With single-agent RL algorithms, each agent learns independently and perceives the other agents as part of the environment. Still, single-agent reinforcement learning algorithms, such as PPO [43], IA2C [44] and DDPG [25], have been applied in multi-agent settings without consideration of the multi-agent structure, and they sometimes achieve strong results in cooperative or competitive multi-agent challenges [45,46,47].

However, it is reasonable to expect that specifically-designed multi-agent reinforcement learning algorithms could obtain better results. Recent research has focused on centralised training with decentralised execution (CTDE) algorithms, where agents are trained offline using centralised information but execute in a decentralised manner online. One of the pioneering algorithms, MADDPG [22], has been widely applied to problems of cooperation, competition, and mixtures of both, and has been used to study pursuit–evasion games. M3DDPG [24] enhances MADDPG to learn policies that are robust against changing adversarial policies by optimising a minimax objective, demonstrating that an M3DDPG-trained team is far superior to the original MADDPG in a homogeneous team setting. But are these algorithms useful in heterogeneous teams that also contain other kinds of agents?

Team formation in heterogeneous MARL may involve varying degrees of abilities and sociality to successfully reach the goal. This fact naturally raises the question of how to form optimal teams with the greatest value when these ‘better’ agents are used in settings mixed with other kinds of agents. MARL systems may be quite unpredictable when put in such situations. Experiments in MARL are very expensive because many learning phases are required for the algorithm to learn the agent policy through trial and error, and observing the behaviour of the other agents, which may also be evolving. The problem of team formation in this complex setting has not been widely addressed.

Instead of further specialising existing techniques for this complex scenario, we can take a step back and look for more general techniques for team formation that could be applied to this situation and many other common scenarios. One key idea is simply to make the most of the information from a small set of previous games and build a model of the game outcome. This kind of model predicting AI performance has recently been reframed and generalised under the term 'assessor' [21]: a machine learning model that is trained on the test data results of an AI system. This is a straightforward and very flexible approach, since we can use off-the-shelf machine learning techniques, provided we represent the performance prediction problem in a convenient way. However, to date, this idea has only been applied to single agents [48], never extended to multi-agent systems. Exploring this possibility in a multi-agent scenario is one of the goals of this paper.

Scenario and assessor data representation

We chose pursuit–evasion games as a practical scenario for two main reasons. First, the behaviours of pursuers and evaders are different, and so are the team dynamics. It may be the case that more coordination is needed for pursuers than for evaders, and that stochasticity (even random agents) in the evader team may even have a positive effect, while this is possibly less likely in the pursuer team. Second, pursuit–evasion is an important field in MARL and, as we said in the introduction, many variants exist such as intruders-defenders [3], reach-avoid games [13, 14], and predator–prey games [12, 15].

Pursuit–evasion environment, agents and data

The Multi-agent Particle Environment (MPE) [22, 23] contains an instantiation of the pursuit–evasion game. Figure 1 shows a screenshot of the environment, where we can see two pursuers, two evaders and two solid landmarks. All the members of the pursuer team receive a reward of \(+10\) for colliding with any evader, regardless of which pursuer caused the collision. If there is a collision, the evader receives a reward of \(-10\). Evaders, unlike pursuers, are also punished individually for leaving the screen (to prevent them from simply running away rather than learning to evade).

Fig. 1 Pursuit–evasion game in MPE where we see two pursuers (in red), two evaders (in green), and two landmarks (in black)

For our purposes, we will investigate combinations of n pursuers chasing m evaders. More specifically, we balance the teams and explore \(n=m=2\), i.e., teams of two pursuers versus teams of two evaders (2v2 game), and \(n=m=3\), i.e., teams of three pursuers versus teams of three evaders (3v3 game). We deploy four agent types: MADDPG, M3DDPG, DDPG, and Random, denoted respectively by M, L, D, and R for short. An agent type may appear more than once in a team. The pursuer and evader teams can use three different sociality regimes: selfish, egalitarian and altruistic (S, E and A, respectively) [26]. In the selfish regime, each agent only receives its own rewards; in the egalitarian case, the rewards are averaged and shared by all members of the team; and in the altruistic case, each team member is given the average of the rewards of the other members of the team only. As we have two different teams (pursuers and evaders), this results in nine different combinations: EvE, EvS, EvA, SvE, SvS, SvA, AvE, AvS, and AvA. For example, AvS stands for altruistic (pursuer) vs selfish (evader). Note that these variations only affect the reward that each agent receives; for any of these combinations, the result of the game (score) is always the sum of the rewards of all the members of the team. This is actually the value the assessor will have to predict.
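To make the three regimes concrete, the following sketch shows one way the per-agent rewards could be redistributed. The function and its list-based interface are illustrative; the actual implementation in [26] may differ in its details.

```python
from typing import List

def redistribute(rewards: List[float], regime: str) -> List[float]:
    """Redistribute raw per-agent rewards according to the sociality regime.

    'selfish':     each agent keeps its own reward.
    'egalitarian': every agent receives the team average.
    'altruistic':  every agent receives the average of its teammates' rewards.
    (Illustrative sketch; not taken from the original environment code.)
    """
    n = len(rewards)
    if regime == "selfish":
        return list(rewards)
    if regime == "egalitarian":
        avg = sum(rewards) / n
        return [avg] * n
    if regime == "altruistic":
        total = sum(rewards)
        return [(total - r) / (n - 1) for r in rewards]
    raise ValueError(f"unknown regime: {regime}")

# Example: a team of three agents with raw rewards 10, 0 and 5
print(redistribute([10, 0, 5], "egalitarian"))  # [5.0, 5.0, 5.0]
print(redistribute([10, 0, 5], "altruistic"))   # [2.5, 7.5, 5.0]
```

Note that under all three regimes the team total is preserved, which is consistent with the game score being the sum of the rewards of all team members.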

Running each of the 9 sociality combinations for each of the 100 or 400 team combinations of size 2 or 3 would total \((100+400)\times 9=4500\) games (with four agent types and repetition allowed, there are \(\binom{5}{2}=10\) possible compositions per team of size 2 and \(\binom{6}{3}=20\) per team of size 3, giving \(10\times 10=100\) and \(20\times 20=400\) team pairings). Because of this large number of combinations, a random sample was chosen in [26]: 20 coalitions for two pursuers versus two evaders, and 40 coalitions for three pursuers versus three evaders. We reuse the same experiments here, which totals \((20 + 40)\times 9 = 540\) lineups, each one with the cumulative rewards for each of the two teams at the end of the game.

Assessor data representation

An assessor [21] is a model that is trained on the evaluation information for an existing system. An assessor model is intended to serve as a general mapping from the space of systems and the space of instances to the corresponding distribution of scores. An assessor model is a conditional probability estimator \(\hat{R}(r\mid \pi , \mu )\) that may be constructed using data from the entire system-problem space, where \(\pi \) is the particular profile of the system, \(\mu \) is a new problem situation, and r is the result. Assessors are anticipative; they do not need to run the system \(\pi \) to anticipate its result on \(\mu \).

In our case, the environment is fixed, but what makes each game unique is the composition of agents in the pursuer and evader teams, as well as the sociality regime of both teams. So, from the perspective of one team, e.g., the pursuer, \(\pi \) would be the system (including the properties of the pursuer team) and \(\mu \) would be the evader team as problem instance (including the properties of the evader team).

In our evaluation data, we have records of the shape \(\langle Purs\)vEvad, \(\alpha _d\), \(\alpha _r\), R(d), \(R(d_{1})\), \(R(d_{2})\), \(R(d_{3})\), R(r), \(R(r_1)\), \(R(r_2)\), \(R(r_3)\rangle \). PursvEvad indicates the algorithm combination of the pursuer and evader teams, \(\alpha _d\) is the sociality factor for the pursuer team and \(\alpha _r\) the one for the evader team. R(d) is the reward earned by the pursuer team, and \(R(d_{i})\) is the reward earned by pursuer \(d_i\) under sociality factor \(\alpha _d\). R(r) is the reward earned by the evader team, and \(R(r_i)\) is the reward earned by evader \(r_i\) under \(\alpha _r\).

We need to convert this into a structure that works for building an assessor model, such as \(\langle \pi , \mu , \gamma \rangle = \langle \langle system features\rangle , \langle instance features \rangle , score \rangle \). However, as we have different team sizes and agent repetitions, an enumeration of agent types \(d_1\), \(d_2\), etc., would be a poor representation for any tabular machine learning technique to generalise from. In addition, the tuples would become full of NAs for larger teams.

Because of this, we use a very simple and convenient representation \(\langle M_d, D_d, L_d, R_d,\alpha _d, M_r, D_r, L_r, R_r, \alpha _r, R\rangle \), where \(M_d, D_d, L_d, R_d\) are the numbers of agents in the pursuer team that use the MADDPG, DDPG, M3DDPG and random algorithm, respectively, and \(M_r, D_r, L_r, R_r\) are the corresponding counts for the evader team. The features also include \(\alpha _d\) and \(\alpha _r\), the sociality factors of the pursuer and evader teams, respectively. The score R is what the assessor must predict: the reward of the pursuer team. Even if the implementation of the pursuit–evasion game in [22] is not fully zero-sum, the experimental results in [26] show that the sum of the rewards of the pursuer and evader teams is constant, so for simplicity we only model the reward of the pursuer team.

When teams are of size 2, \(M_d, D_d, L_d, R_d, M_r, D_r, L_r, R_r\) can take values 0, 1 or 2, with the sum for each team being 2. For teams of size 3, this scales naturally without any change in the number or type of attributes. For \(\alpha _d\) and \(\alpha _r\) we have three values: 0, 0.5 and 1, representing altruistic, egalitarian and selfish respectively for teams of size 2, and 0, 0.33 and 1, representing altruistic, egalitarian and selfish respectively for teams of size 3. In total, this gives 900 and 3600 possible lineup combinations for the 2v2 and 3v3 games, respectively.

An example of a match of a 3v3 game is a tuple like this:

$$\begin{aligned} \langle 0, 2, 1, 0, 0.33, 0, 1, 1, 1, 0, 5485\rangle \end{aligned}$$

We will sometimes use a shorter notation. For the previous example, we will write that the pursuer lineup is 0210\(\mid \)0.33 against evader lineup 0111\(\mid \)0, yielding 5485 rewards for the pursuer team.
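As an illustration of this encoding, the following helper maps a list of agent types and a sociality factor to the count-based feature vector above; the function name and interface are ours, not part of the original pipeline.

```python
from collections import Counter

AGENTS = ["M", "D", "L", "R"]  # MADDPG, DDPG, M3DDPG, Random

def encode_match(pursuers, alpha_d, evaders, alpha_r):
    """Encode a match as <M_d, D_d, L_d, R_d, alpha_d, M_r, D_r, L_r, R_r, alpha_r>."""
    cp, ce = Counter(pursuers), Counter(evaders)
    return [cp[a] for a in AGENTS] + [alpha_d] + [ce[a] for a in AGENTS] + [alpha_r]

# The 3v3 example above: pursuers D, D, L (egalitarian) vs evaders D, L, R (altruistic)
x = encode_match(["D", "D", "L"], 0.33, ["D", "L", "R"], 0.0)
print(x)  # [0, 2, 1, 0, 0.33, 0, 1, 1, 1, 0.0]
```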

As said before, because of the computational cost, only a random subset of combinations was explored in [26]: 20 teams for the 2v2 game and 40 teams for the 3v3 game, chosen at random while ensuring that pursuer and evader teams have the same quantity of M, L, D and R agents across all matches. Tables 7 and 8 in the Appendix contain the \((20+40) \times 9 =540\) lineup results, which are used for training and testing the assessor.

Model training and evaluation

We explored different supervised machine learning methods to build the assessor. In the end, the best results were obtained with Extreme Gradient Boosting (XGBoost) [49], used here for regression. It is an ensemble method that combines multiple base learners, such as linear models and decision trees.

To train XGBoost and predict rewards for the pursuer team, we divide the dataset into \(90\%\) training data and \(10\%\) test data. The hyperparameters were set to their default values. Figure 2 shows the predicted vs. actual results for this model. The \(R^2\) score is 0.89. All this suggests that this technique and data representation, even with only 540 data records, can successfully predict the team rewards.
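A minimal sketch of this training and evaluation step, assuming the 540 records are available as a pandas DataFrame `df` with the feature columns described in the previous section and the target column `R` (the column names here are ours):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from xgboost import XGBRegressor

# df: DataFrame with the 540 lineup records (assumed to be already loaded)
features = ["M_d", "D_d", "L_d", "R_d", "alpha_d",
            "M_r", "D_r", "L_r", "R_r", "alpha_r"]
X, y = df[features], df["R"]

# 90/10 split and default hyperparameters, as in the experiments reported here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
model = XGBRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("R2  :", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("MAE :", mean_absolute_error(y_test, y_pred))
```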

Fig. 2 The rewards predicted by XGBoost regression

We did 1000 repetitions of this experiment to get a more robust assessment of the quality of XGBoost for this problem. Results are shown in Table 1, including the coefficient of determination (\(R^2\)), its standard deviation (SD), mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).

Table 1 Results of the assessor using XGBoost for 90–10% split evaluation
Fig. 3 The average feature importance for each attribute

Fig. 4 Correlation matrix for the features

Now that we see the assessor is a good predictor, we wonder which features of the lineups matter most. We use the 'gain' metric to obtain the importance scores, as shown in Fig. 3; this is calculated from each feature's contribution to each tree in the model. Clearly, \(\alpha _d\), \(\alpha _r\), \(R_d\) and \(R_r\) show importance scores that indicate they contribute the most to the result. A more refined analysis suggests that the sociality regime used by the pursuer team is critical (\(\alpha _d\)), and somewhat less so for the evader team (\(\alpha _r\)). In both cases, the presence of a random agent (\(R_d\) and \(R_r\)) also has an important effect, possibly by making any coordination impossible.

However, importance does not show whether a variable affects the outcome positively or negatively. To confirm the direction (sign) of the influence, we can look at the correlations with the response variable.

Figure 4 shows the Pearson correlation between all variables. The higher the value, the lighter the colour. Looking at the bottom row, we can see that \(\alpha _d\) has a highly positive impact on R, the result, which means this is good for the pursuers. Conversely, the \(\alpha _r\) has a negative impact on R, which means this is good for the evaders. Note that R represents the reward for the pursuer team, so the higher R is the better, but for the evader team the lower R is the better. The correlations for the random agents are negative when in the pursuer team and positive when in the evader team, which confirms that random agents are bad for both teams.

Since a correlation matrix only captures bivariate interactions, we use another method to determine the relevance and sign of the features. We create a TreeExplainer from our model and estimate the SHapley Additive exPlanations (SHAP) values [50]. Figure 5 represents every data point in our dataset. For each feature, the associated SHAP value is plotted on the \(x\)-axis. The more influential a feature is, the larger the magnitude (positive or negative) of its SHAP values. We see values on both sides because we have a mixture of games with a wide range of values for R. Features are ranked in decreasing order of importance from top to bottom. The colour represents whether the feature value is low (blue) or high (red); this has to be combined with the \(x\)-axis value to really understand the sign of the effect.
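The gain-based importance scores and the SHAP values discussed in this section can be obtained roughly as follows, reusing the fitted `model` and the feature matrix `X` from the training sketch above:

```python
import shap

# Gain-based feature importance (as in Fig. 3)
gain = model.get_booster().get_score(importance_type="gain")
print(sorted(gain.items(), key=lambda kv: -kv[1]))

# SHAP values for every example (as in Fig. 5)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # beeswarm plot: importance, sign and feature value
```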

Fig. 5 The feature importance of the assessor using SHAP. More important features are at the top and less important features are at the bottom. Each point represents an example. The colour shows the value of the feature from low to high, and the location shows the Shapley value and its sign

So, in Fig. 5, we see that the \(\alpha _d\) and \(\alpha _r\) features are highly influential, with strong negative and positive SHAP values for many predicted outcomes. We confirm that a high \(\alpha _d\) contributes positively to the pursuer team and a high \(\alpha _r\) contributes negatively to the pursuer team (which means it contributes positively to the evader team). In other words, being selfish is good for both teams. Again, combining the values and colours of what we see for \(R_r\) and \(R_d\), we confirm that random agents are bad for both teams, although in this multivariate analysis the effect of \(R_d\) is more intertwined with the other variables. Finally, the effect of the variables representing actual RL agents is limited, and has the opposite sign to what we would expect for \(D_r\), \(L_r\) and \(M_r\): having more of these agents does not seem to give much to the evader team.

Finally, a tree visualisation of the XGBoost model is shown in Fig. 6 in the Appendix, pruned at depth four. There we can also see the relevance of \(R_d\), \(R_r\) and the social factors \(\alpha _d\) and \(\alpha _r\).

Team formation strategies

Once an assessor has been built and validated, we can use it to explore different team formation strategies. There are a total of 4500 combinations of pursuer versus evader teams when considering all 2v2 and 3v3 games. Since we do not have access to these 4500 experiments, we use the assessor as a proxy for them.

Best team lineup given the opponent lineup

The simplest scenario is when one team is given information about the opponent's lineup. For instance, to select the best pursuer team when we know the evader team (let us call it E), we can use the assessor model to predict the outcome for all possible pursuer lineups. The pursuer lineup with the maximum predicted reward is the best choice against this specific E. A similar procedure can be used when the evader team knows the lineup of the pursuer team; in that case, we only need to calculate the minimum.
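A sketch of this 'interrogation' of the assessor is shown below: enumerate every candidate lineup for one side, predict the outcome against the given opponent, and take the maximum (for the pursuer) or the minimum (for the evader). The enumeration helper is ours; `model` and `features` are reused from the training sketch.

```python
import pandas as pd
from itertools import combinations_with_replacement

ALPHAS_3 = [0.0, 0.33, 1.0]   # altruistic, egalitarian, selfish (teams of three)

def all_lineups(team_size, alphas=ALPHAS_3):
    """All count vectors <M, D, L, R> summing to team_size, for each sociality value."""
    for team in combinations_with_replacement("MDLR", team_size):
        counts = [team.count(a) for a in "MDLR"]
        for alpha in alphas:
            yield counts + [alpha]

def best_pursuer(model, evader_lineup, team_size=3):
    """Pursuer lineup with the highest predicted reward against a fixed evader lineup."""
    pursuers = list(all_lineups(team_size))
    X_query = pd.DataFrame([p + evader_lineup for p in pursuers], columns=features)
    preds = model.predict(X_query)
    best = max(range(len(pursuers)), key=lambda i: preds[i])
    return pursuers[best], preds[best]

# Best pursuer against evader 0111|0; the best evader is symmetric (take the minimum)
lineup, reward = best_pursuer(model, [0, 1, 1, 1, 0.0])
```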

In this simple scenario, we obtain some interesting insights. Given the evader team, the 90 pursuer teams with the highest rewards are included in Table 10 in the Appendix. Of these 90 pursuer teams, we can count the frequency of agents: DDPG leads with 93 appearances, followed by M3DDPG with 86, MADDPG with 60, and Random with 2. The sociality of the pursuer team is always 1. This confirms that no single agent dominates the rest by being the best in all combinations. We can also see that homogeneous teams are not so frequent, so diversity seems to be beneficial. The clear findings for the pursuer team are that random agents are bad in almost all cases, and the selfish regime is best in all cases.

Given the pursuer team, the 90 best evader lineups are shown in Table 9 in the Appendix. Of these 90 evader teams, M3DDPG leads with 84 appearances, followed by MADDPG with 76, DDPG with 67, and Random with 13. In this case, there are 11 records where the evader team's sociality is not 1 (selfish), as can be seen in Table 2. In particular, 6 records are 0.33 or 0.5 (egalitarian) and the remaining 5 records are 0 (altruistic). The evader team has 79 records with sociality 1, accounting for \(88.8\%\) of the total. How can we explain these results? It is important to note that these exceptional cases happen for very specific given pursuer teams. Among them, there are two records where the given pursuer team is composed only of random agents. For the cases where \(\alpha _d\) is not 1, the pursuer teams contain only 3 MADDPG, 5 DDPG and 3 M3DDPG agents, against 15 random agents. In brief, these 11 pursuer lineups are so poor that almost every evader combination has the same effect, or at least the assessor does not find a pattern for these cases, hence the result.

Table 2 The 11 pursuer teams whose best evader lineup has a sociality regime that is not selfish

Having access to the opponent team's formation is an actual possibility in situations where the other team has already been observed. However, the most common situations are when this information is not given to either team and the teams have to decide their lineups before the game, without changing them afterwards (e.g., cricket), known as offline team formation, and when teams can see each other's lineups and change their own at any time (e.g., basketball), known as online team formation. We explore these two cases in the following two subsections.

Offline team formation strategies

In the offline team formation case, even if we do not see the lineups until the match starts, we can assume several degrees of information about the estimates available to either team, which may make some lineups more likely than others. Given our assessor, we have estimates for the \(30 \times 30 = 900\) combinations for teams of two agents and the \(60 \times 60 = 3600\) combinations for teams of three agents. Assuming increasingly detailed use of the information from the assessor, the strategies are:

1. Uniform: chooses the lineup randomly following a uniform distribution among all possible combinations. This does not require the assessor.

2. Best: chooses the lineup that gives the best reward (highest for the pursuer and lowest for the evader) over all possible combinations. Basically, each team optimistically chooses the best lineup across all possible situations of the opponent.

3. Weighted\(^{Exp}\): chooses the lineup randomly with a probability that is directly (resp. inversely) proportional to R for the pursuer (resp. the evader). In this case, we can apply an exponent to R before calculating the weights, giving more relevance to the weighting (see the sketch after this list).
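A sketch of these three samplers is given below, assuming the full matrix of assessor predictions has been precomputed as a numpy array `R_hat` (rows indexed by pursuer lineups, columns by evader lineups, entries being the predicted pursuer reward). Reading the weighted strategy as sampling proportionally to the mean predicted reward over all opponent lineups is one plausible interpretation, and the names are ours.

```python
import numpy as np

rng = np.random.default_rng()

def choose_offline(R_hat, side, strategy, exp=1):
    """Pick a lineup index for 'pursuer' (maximising, rows) or 'evader' (minimising, columns)."""
    n = R_hat.shape[0] if side == "pursuer" else R_hat.shape[1]
    if strategy == "uniform":
        return int(rng.integers(n))
    if strategy == "best":
        # Optimistic: take the lineup appearing in the overall best predicted cell
        flat = R_hat.argmax() if side == "pursuer" else R_hat.argmin()
        i, j = np.unravel_index(flat, R_hat.shape)
        return int(i) if side == "pursuer" else int(j)
    if strategy == "weighted":
        # One plausible reading: weight each lineup by its mean predicted R over all
        # opponent lineups, directly for the pursuer and inversely for the evader
        # (rewards are positive in these experiments, so 1/R is well defined).
        expect = R_hat.mean(axis=1) if side == "pursuer" else R_hat.mean(axis=0)
        w = (expect if side == "pursuer" else 1.0 / expect) ** exp
        return int(rng.choice(n, p=w / w.sum()))
    raise ValueError(f"unknown strategy: {strategy}")

# R_hat: precomputed matrix of assessor predictions (assumed available)
# One simulated match: pursuer plays 'best', evader plays 'weighted' with exponent 2
i = choose_offline(R_hat, "pursuer", "best")
j = choose_offline(R_hat, "evader", "weighted", exp=2)
outcome = R_hat[i, j]
```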

None of these strategies assumes anything about the other team, but we explore the case where each team can choose any of them. As we consider exponents 1 and 2 for the weighted strategy, we have four strategies per team and \(4 \times 4 = 16\) combinations. For each of them we perform 1000 repetitions and calculate the average. The results are shown in Table 3 for teams of size two, and in Table 4 for teams of size three.

For teams of size two, we know that the average result over all combinations is 2784. With no information about previous games, both teams would choose uniformly at random among all lineups, and the result would be 2639, very close to that average. From there, we see that using the information from the model (the assessor built from previous games) is especially beneficial for the pursuer. Weighted sampling according to the estimated R increases the reward, and more so if the weights are squared. The best (highest) results for the pursuer team are always achieved with the Best strategy (last column), independently of what the evader does. For the evader team, the most favourable strategy is Weighted with exponent 1 (second row), but this only improves slightly on the uniform strategy. Actually, the Best strategy is not very good for the evader, because in this particular scenario the games with the estimated minimum and maximum both involve the same evader lineup (0011|0). This explains the extreme value in the bottom-right cell (12,178), where both teams play the Best strategy. This suggests that the Best strategy is naive and can be very risky because it is too optimistic: in this case, the choice turned out to be the worst possible for the evader, as this row is the worst for that team in all situations.

Table 3 Results of different offline team formation strategies for teams of two agents

For three agents, the behaviour is slightly different, as shown in Table 4. In this case, the evader lineup in the estimated best combination for the evader does not also appear in the best combination for the pursuer. As a result, we see a gradient that generally improves for the pursuer towards the top right of the table and improves for the evader towards the bottom left. Still, the case where both teams choose the Best strategy is usually more beneficial for the pursuer, although far from the maximum (19,691) in this situation.

Table 4 Results of different offline team formation strategies for teams of three agents
Table 5 Results for the locally optimal online strategy for teams of two agents

Online team formation strategies

In situations where changes are allowed at any point, when one team reveals its lineup, the other will follow suit with a change of its own. This may lead to very complex dynamics, with changes in one team being counteracted by changes in the other team, and vice versa. One straightforward observation is that, because the number of players to choose from is finite, the number of lineups is finite, and if the strategies are deterministic there will be cycles. We expect cycles to be short, and in some cases we may even find convergence (fixed points).

Let us analyse the following natural strategy: if team A sees the lineup of team B, then team A changes to the best lineup for A given the observed lineup of B. We assume that changes are made in an alternating way, because if a team has just switched to its optimal lineup, it makes sense that the next change is made by the other team. This strategy can be implemented with the whole table of estimated values from the model, but we can also compress the \(30 \times 30 = 900\) combinations for teams of two agents and the \(60 \times 60 = 3600\) combinations for teams of three agents into simply the best responses given the pursuer and the best responses given the evader, both of size 30 for two agents and of size 60 for three agents. These were illustrated in Tables 9 and 10 in the Appendix.

Even if both teams use this locally optimal strategy, which team starts and the initial lineup might have an important effect, possibly leading to different final cycles or states. To analyse this, we performed an experiment in which we first let the pursuer start from every possible lineup, and we tracked what happens as both teams alternately apply their optimal changes. For each starting lineup, we simulate 50 cycles, corresponding to 100 changes, and calculate the average reward over these 100 situations.
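A sketch of this alternating best-response simulation over the same precomputed matrix `R_hat` as above; a fixed point shows up as a constant tail in the visited states and a cycle as a repeating pattern.

```python
import numpy as np

def online_best_response(R_hat, start_pursuer, cycles=50):
    """Alternate locally optimal changes: the evader reacts to the pursuer, then vice versa.

    Returns the average reward over all changes and the visited (pursuer, evader) pairs.
    """
    rewards, states = [], []
    i = start_pursuer                      # the pursuer reveals its lineup first
    for _ in range(cycles):
        j = int(R_hat[i].argmin())         # evader's best response to pursuer i
        rewards.append(R_hat[i, j]); states.append((i, j))
        i = int(R_hat[:, j].argmax())      # pursuer's best response to evader j
        rewards.append(R_hat[i, j]); states.append((i, j))
    return float(np.mean(rewards)), states

# R_hat: precomputed matrix of assessor predictions (assumed available)
# Average reward over 50 cycles (100 changes), starting from pursuer lineup 0
mean_reward, trajectory = online_best_response(R_hat, start_pursuer=0)
```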

For teams of two agents, Table 5 shows the mean, best, worst and fixed-point (if one exists) reward when the pursuer starts and when the evader starts. The values are very similar, with a small advantage for the evader when the pursuer starts and vice versa, which makes sense, since rewards are first generated once both teams have completed their formation and only the second team has been able to see the first. One surprising observation is that the results are in general lower than the best result in Table 3. Knowing the opponent's lineup is especially beneficial for the evader.

For teams of three agents, Table 6 shows a similar picture, but here a fixed point is only found occasionally, and one cycle becomes very frequent: pursuer 0300\(\mid \)1, evader 3000\(\mid \)1, pursuer 2100\(\mid \)1, and evader 1200\(\mid \)1. This happens independently of who starts, and the results are again very similar. The values are also lower than the best result in Table 4, but this time the difference is much larger. Knowing the opponent's lineup and reacting accordingly is especially beneficial for the evader in the three-agent teams.

Table 6 Results for the locally optimal online strategy for teams of three agents
Fig. 6 Tree of the XGBoost regression model predicting the pursuer team reward. The tree is only shown up to depth 4

Conclusion

Team formation is a very common problem in any situation where we need to compose a team from a pool of different agents, and it also arises in artificial intelligence. However, as AI systems become more complex, it is difficult to parametrise the agents with a predefined or estimated set of skills and use these to determine the best lineup for any given situation. In many cases, especially when machine learning comes into play, we end up observing how well the team behaves in different situations, depending on the heterogeneous set of agents, their architectures and hyperparameters. Modelling this history of games with a predictive model, an assessor, opens up many possibilities for choosing the appropriate lineup. We have explored this in a MARL scenario, using a popular pursuit–evasion game with three reinforcement learning algorithms and a random agent, and several sociality regimes. We have seen different results for the pursuer and evader teams, and very interesting insights when using different procedures to choose the lineup. Our approach is straightforward and general, and can easily be adapted to any heterogeneous game.

Our work has some limitations. First, it requires results from previous games as training data to build a high-quality assessor. With an appropriate representation, a dataset of a few hundred lineups (540 in this paper) may be sufficient. Second, the use of a non-linear assessor makes it difficult to apply an analytical approach to the search for the best combination given information about the other team. However, once the model has been learnt, exploring many games (4500 in this paper) was very quick. Also, calculating the maximum for each of the 10 or 20 team combinations for teams of 2 or 3 agents can be done once, and these results (Tables 9 and 10 in the Appendix) reused again and again for any strategy. In our case we did not need them, but gradient-descent approaches could also be used, provided the assessor is a continuous function.

The most important take-away is that the information from previous games can be used to model insightful dynamics about how team formation affects results. As we have shown in this paper, the assessor can then be exploited to find convenient team formation strategies. The approach is flexible enough to be extended to a wide range of situations, applications and kinds of multi-agent systems.