1 Introduction

The process of team composition, which aims to discover an appropriate set of individuals with relevant expertise to achieve common goals efficiently, has been a major area of interest in the science of teamwork. Because football (also called “soccer” in some countries) requires a high level of teamwork and is characterized by a large amount of communication, interaction, and collaboration between team members, it is one of the best settings for studying the team composition problem. In reality, it is difficult to assess the effectiveness of a football team composition result because doing so may require a considerable amount of money and is labor-intensive. Fortunately, the emergence of a wide variety of football video games, such as Pro Evolution Soccer (PES), Electronic Arts Sports FC (also known as FIFA), and Football Manager, offers an opportunity to compose a team based on human preferences and evaluate outcomes efficiently. This opportunity exists not only because gamers can completely redesign clubs and edit any player in the game but also because the platforms can fully simulate on-pitch football matches. Consequently, the football team composition task becomes interesting and important on the game platforms.

In football, a multiplayer game, the process of player selection and team composition is designed to select the most suitable player for a particular playing position and role [1], which is vital for clubs to be able to deliver high sports and financial returns [2]. Such a process is crucial since a poor selection result can affect player loyalty as well as cost a football team millions of dollars [3]. However, the multicriteria complexity and decision-making difficulty make the selection of players a challenging task. Although team managers and coaches use a variety of assessments to choose players by considering many aspects, including player productivity and limited wage budgets, a manual selection process is often too time-consuming to be practical, and the accurate evaluation of a player’s suitability for a team remains a considerable puzzle. Thus, a systematic approach, such as mathematical modeling, is urgently needed.

Many studies have attempted to address the football team composition problem, but most of them rely on attributes such as players’ skills and physical status. For instance, most researchers utilize anthropometric measurements (e.g., age, height, and weight), fitness-related indices (e.g., vertical jump ability and speed), and players’ techniques (e.g., short passing and shooting) for the football player selection problem [4]. In addition, the market value and salary of football players are taken into account [5, 6]. Specifically, Zeng et al. [5] considered the players’ total salary as a budget constraint and resorted to a submodular function to solve the team composition problem. However, such attributes are not sufficient to measure a football team’s competitiveness. Achieving good results depends on not only the high-level players who are involved but also how effectively they collaborate, communicate, and work together as a team.

Consider, for example, a team manager who wants to build a football team consisting of players with distinguished skills in the following areas: {attacking prowess, ball control, defensive prowess, physical contact, and speed}. We also assume that there is a network of five football players {P1,P2,P3,P4,P5}, as shown in Fig. 1. Each player is labeled with the corresponding skills, and an edge between two football players indicates that they can collaborate effectively. Such a network is referred to as an attributed collaboration network (ACN) [7]. Without considering the connections among players, the manager can select either C1 = {P1,P2,P3} or C2 = {P1,P4,P5}, since both C1 and C2 cover the required skill set. However, the candidate set C1 is the better choice since the network indicates that P1 cannot work with P4 and P5 effectively.

The existence of an ACN among football players is quite common. In a football league, an obvious type of player collaboration is developed upon whether they are from the same team or country, which is often used to organize players in a club. In this case, the network encodes the fact that football players from the same club or country can communicate more easily and cooperate more harmoniously with each other than those serving in different teams. In addition, it is known that defensive and offensive positions differ in player composition because they are conjunctive and disjunctive tasks respectively [8]. The success of driving off each attack is dependent on completing a joint action. Here, the weakest defender is detrimental to the team’s defensive performance because he or she limits the team’s defensive capabilities. In contrast, a team’s offensive capacity is determined by the output of the best-performing member. Moreover, the distribution of a team’s offensive (defensive) performance can be measured by the network heterogeneity (homogeneity) [9, 10]. Low heterogeneity (or high homogeneity) indicates that all players share a similar level of interaction through the match, and vice versa. Thus, attacking benefits from heterogeneous players, while homogeneity ensures that there are no weak links among defensive players. This insight facilitates our understanding of the underlying functional mechanism of collaboration and motivates us to develop players’ attributed collaboration networks for the football team composition problem.

Fig. 1 An example of an ACN with five individual players, each of whom is equipped with several skills

In this paper, we consider the team composition problem in the context of the Football Players’ Attributed Collaboration Network (FPACN). Each node in the network is a football player with certain skills, such as attacking prowess, ball control, and dribbling, while edges between nodes are constructed based on the clubs the players have played for and their nationalities, which reflect the affinity between players. After obtaining the attributed collaboration network, given a certain budget, we propose the TC-FPACN model (Team Composition based on the Football Players’ Attributed Collaboration Network) to identify a set of highly qualified football players and form a remarkably cohesive team. In the TC-FPACN, we evaluate the cohesiveness of a football team on the basis of three predefined network metrics, namely, network ability, network density, and network heterogeneity&homogeneity, and the goal is to discover a football team that maximizes a combination of the three metrics. As we present the team’s properties through the attributed collaboration network, the constrained optimization problem can be converted to finding a maximum density subgraph in a graph, which turns out to be NP-hard [11]. The problem becomes more complicated when players’ ability and heterogeneity (or homogeneity) are considered. We propose an approximation algorithm that finds the best team based on greedy algorithms and further improve the algorithm using pruning methods under a budget constraint. We summarize the main contributions of this paper below.

  • We propose a Team Composition based on the Football Players’ Attributed Collaboration Network (TC-FPACN) model, which incorporates three network metrics (i.e., network ability, network density, and network heterogeneity&homogeneity) to define players’ cooperation mechanism.

  • We formulate the team composition task as a constrained optimization problem for the TC-FPACN that finds the optimal subgraph based on the network metrics. Since the problem is NP-hard, we propose a greedy algorithm with a pruning technique to solve it.

  • We conduct an empirical study on two video game platforms, i.e., Pro Evolution Soccer 2018 (PES2018) and EA SPORTS FIFA 22 (FIFA2022) to evaluate the effectiveness of the proposed model. Simulation results show that our model achieves favorable performance in competition against other teams.

The remainder of the paper is organized as follows. We review related works in Section 2. In Section 3, we first formally introduce the team composition task, then describe the three network metrics of the TC-FPACN and finally, formulate the team composition problem. We propose the new algorithms in Section 4. Section 5 demonstrates the performance of the proposed method. Finally, Section 6 concludes our work and discusses future research directions.

2 Related work

Since this paper considers finding a cohesive football team based on football players’ capabilities and collaborations, we start with a review of football decisions, especially for player selection and team composition, and proceed with related research on the evaluation of personal ability and the retrieval of the team from collaboration networks in general.

2.1 Football player selection and team composition

The process of football player selection and football team composition is a complex problem with conflicting objectives. The traditional solution to this problem is to assess several quantitative factors that coaches and their technical committees rely on to identify the most elite players. These factors include the player’s anthropometric measurements [4], fitness-related indices [12], and skills [5, 13]. For example, Inan and Cavas [13] analyzed the offensive and defensive characteristics of Turkish Super League football players, such as the long pass accuracy, and developed an artificial neural network model for talent selection. Zeng et al. [5] defined a submodular function that represents the team’s skill coverage and used improved greedy algorithms to solve the optimization problem. Given that football players have different duties on the field, many researchers have also argued that the relevant skill criteria must be assigned according to each player’s position [3, 14, 15]. Ozceylan [3], for example, used an analytic hierarchic process to prioritize the criteria for each player based on their position and developed a 0-1 integer linear programming model to determine the top players in a team.

Most approaches mentioned above emphasize the on-pitch sport success. In addition, there are other factors worth considering, such as financial aspects [16, 17] and the future potential of professional football players [18, 19]. For instance, Singh and Lamba [16] resorted to machine learning models including decision tree and gradient boost to identify the factors that affect the financial market values of football players and then used the selected factors to predict the player’s market value. In [18], the authors projected a target player’s potential by searching the corresponding historical attributes to identify other football players with a similar profile. Zhao et al. [19] defined three attributes, including the potential factor, to evaluate the performance of teams and football players.

Nevertheless, forming a winning football team involves more than having the required mix of skills under the budget limit. Player selection is a difficult decision-making problem that needs to take into account the collaboration mechanisms among football players, which are ignored in the literature.

2.2 Personal ability evaluation

Personal ability is always an important guideline for team composition. Player selection needs to consider quantitative attributes, and the most widely used rating systems for a player are based on performance data. Since there are multiple attributes to consider when assessing a player’s ability, algorithms based on multicriteria decision-making (MCDM) are regarded as simple and suitable for developing solutions [20]. As a key component of the MCDM method, the analytic hierarchic process (AHP) is widely used to determine the weights of the selected criteria [21]. Using the AHP methods, each player’s attributes are ranked according to their importance in a given position. In parallel, the technique for order of preference by similarity to ideal solution (TOPSIS) – the well-known MCDM method – is applied extensively to rank the alternatives, partly due to its mathematical clarity. A plethora of methods have been developed following this breakthrough, such as TOPSIS-IPA [22] and Fuzzy-TOPSIS [23]. More recently, Sałabun et al. [24] developed a multicriteria model based on the characteristic objects method to evaluate players in team sports.

In addition to MCDM-based models, Liu et al. [25] introduced the text information of postmatch reports written by professional soccer journalists or editors and proposed an affective computing model for the player’s performance rating. Furthermore, Pantzalis and Tjortjis [26] conducted an intensive study to define the main attributes that influence a defender’s match rating. They found that classic defensive actions such as interceptions and clearances, along with player attributes such as jumping reach and strength, are more suitable for evaluating defenders.

2.3 Collaboration networks for a team formation

A successful team relies on not only individual ability but also communication and collaboration. The study of scientific collaboration aims to compute the fitness level of an expert for collaborating with other experts on a set of skills [27]. Given an expertise collaboration network, Lappas et al. [28] first considered team formation in the presence of a collaboration network and measured effectiveness using communication cost. Furthermore, density-based measurements were proposed [29,30,31], and the authors generalized the approach [28] by considering the team formation problem as a multiobjective optimization task. For example, Selvarajah et al. [31] aimed to build a more effective team by analyzing various scenarios, such as how frequently team members had worked together in the past. In parallel, Datta et al. [32] proposed a composite mechanism to exploit different elements of individuals and the community given by their expertise and connections. Furthermore, Awal and Bharadwaj [33] quantified and optimized a team’s collective ability based on a collective intelligence index, which encodes individuals’ knowledge competence and their collaboration competence.

Given that the major limitations of the class of solutions mentioned above are that they fail to capture complex interactions and are computationally intractable, more recent work adopted neural architectures to learn a mapping between the skills and experts’ space [34,35,36]. For instance, Hamidi et al. [36] focused on state-of-the-art neural network methods to learn the dense representations for nodes in the collaboration network and bootstrapped the training process through transfer learning. Similarly, in this paper, we focus on the team formation problem based on the collaboration network and explore an efficient way to find a team. Specifically, we consider a network structure of football players as an attributed collaboration network, where nodes representing players are associated with their skills and the weights attached to edges reflect their degree of affinity.

3 TC-FPACN model

In this section, we present the TC-FPACN model, which is formed by three network metrics that contribute to determining the cohesiveness of a football team, including network ability, network density, and network heterogeneity&homogeneity. We first formally introduce the team composition task and then detail the network metrics. Finally, we formulate the objective function of TC-FPACN, which is to discover a subnetwork by maximizing the three metrics simultaneously.

3.1 Task formulation

Let P = {Pn} (1≤n≤N) be a set of football players, and S = {Sm} (1≤m≤M) be a set of players’ skills, where N and M are the number of football players and skills, respectively. Assume that football players are organized in a weighted and undirected graph (i.e., FPACN), denoted as \(\mathcal {G}(\mathcal {V}, \mathcal {E})\) with a set of nodes \(\mathcal {V}\) and a set of edges \(\mathcal {E}\). Each node \(v_{n} \in \mathcal {V}\) is associated with a football player Pn equipped with a set of skills, while an edge \((i,j) \in \mathcal {E}\) models the relationship between the pair of players (i.e., Pi and Pj). In addition, for readability, we present the main notations used throughout the paper in Appendix A, Table 10.

In football, it is intuitive that different positions on the pitch highlight different skills, which means that some skills are common (e.g., body control and jump) while others (e.g., goalkeeping) are unique to a particular position (e.g., goalkeeper). Thus, we divide football players into three groups - Forward/Midfielder, Backward, and Goalkeeper - according to a player’s position in the football field, with the corresponding collaboration network \(\mathcal {G} = \mathcal {G}_{\mathrm {F}} \cup \mathcal {G}_{\mathrm {B}} \cup \mathcal {G}_{\mathrm {G}}\), where \(\mathcal {G}_{\mathrm {F}}\), \(\mathcal {G}_{\mathrm {B}}\), and \(\mathcal {G}_{\mathrm {G}}\) are subgraphs for Forward/Midfielder, Backward, and Goalkeeper respectively. We define the task of football team composition as follows:

Definition 1

Given an attributed collaboration network of all football players and a limited budget, the goal of our team composition task is to form a cohesive subnetwork (i.e., football team) \(\mathcal {G}^{\prime }(\mathcal {V}^{\prime }, \mathcal {E}^{\prime }) \subseteq \mathcal {G}(\mathcal {V}, \mathcal {E})\), where the node set \(\mathcal {V}^{\prime }\) represents the selected football players.

3.2 Three network metrics

The TC-FPACN model considers the cohesiveness of a football team from three aspects: a) network ability, b) network density, and c) network heterogeneity&homogeneity. We now describe the three network metrics in detail.

3.2.1 Network ability

Given a football player Pn ∈ P (1≤n≤N) with a set of skills, each of which is labelled with the corresponding weight and personal level, we first build a model to calculate the personal ability of Pn, denoted \(\phi _{\mathrm {P}_{n}}\), in (1).

$$ \phi_{\mathrm{P}_{n}} = \sum\limits_{m=1}^{M} W_{\mathrm{S}_{m}} L_{\mathrm{P}_{n},\mathrm{S}_{m}}, $$
(1)

where \(W_{\mathrm {S}_{m}}\) is the weight of skill Sm, and \(L_{\mathrm {P}_{n}, \mathrm {S}_{m}}\) is the personal level of Sm for player Pn. With the personal ability defined in (1), we calculate the network ability of \(\mathcal {G}^{\prime }(\mathcal {V}^{\prime }, \mathcal {E}^{\prime })\) for a football team (i.e., the competency of the whole team), which gives

$$ {\Phi}(\mathcal{G}^{\prime}) = \sum\limits_{n=1}^{\lvert \mathcal{V}^{\prime} \rvert} \phi_{\mathrm{P}_{n}}, $$
(2)

where \(\lvert \mathcal {V}^{\prime } \rvert \) is the number of selected football players in a team. We can see from (2) that it is the sum of the personal abilities of the selected players, which means that a higher network ability score contributes to forming a better football team.
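As a minimal sketch of (1) and (2), the following Python snippet computes the personal ability of each player and the network ability of a team. The skill names, weights, and levels are hypothetical values chosen only for illustration, and treating a missing skill as level 0 is an assumption that the text does not state.

```python
from typing import Dict, Iterable


def personal_ability(levels: Dict[str, float], weights: Dict[str, float]) -> float:
    """Eq. (1): weighted sum of one player's skill levels.

    A skill missing from `levels` is treated as level 0 (an assumption).
    """
    return sum(weights[skill] * levels.get(skill, 0.0) for skill in weights)


def network_ability(team: Iterable[Dict[str, float]], weights: Dict[str, float]) -> float:
    """Eq. (2): sum of the personal abilities of the selected players."""
    return sum(personal_ability(levels, weights) for levels in team)


# Hypothetical skill weights and player levels, for illustration only.
weights = {"attacking_prowess": 0.5, "ball_control": 0.3, "speed": 0.2}
p1 = {"attacking_prowess": 80, "ball_control": 70, "speed": 90}
p2 = {"attacking_prowess": 60, "ball_control": 90, "speed": 70}

team_score = network_ability([p1, p2], weights)
```

Because (2) is a plain sum, adding any player with positive ability always increases the score, which is why the model needs the additional metrics introduced next.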

3.2.2 Network density

As shown in (2), a naive scheme for building a football team is to identify suitable players with good skills for each position and then put them together. However, the team’s victory depends on not only the number of football stars but also the collaboration of the players, enabling them to function as a cohesive team in the field. Intuitively, good collaboration is commonly built upon players’ relationships. To establish relationships among football players, in this paper, we consider whether they come from the same team or country, which is often used for organizing players in a club. Formally, let us consider the graph \(\mathcal {G}(\mathcal {V}, \mathcal {E})\). Given any two nodes \(v_{i}, v_{j} \in \mathcal {V}\) associated with two football players Pi and Pj, if they come from the same country, the same club, or both, we add the edge (i,j) to \(\mathcal {E}\), and the relationship is weighted by calculating the Jaccard similarity, denoted as ωi,j, in (3).

$$ \omega_{i,j} = \frac{\lvert \mathbf{V}_{\mathrm{P}_{i}} \cap \mathbf{V}_{\mathrm{P}_{j}} \rvert}{\lvert \mathbf{V}_{\mathrm{P}_{i}} \cup \mathbf{V}_{\mathrm{P}_{j}} \rvert}, $$
(3)

where \(\mathbf {V}_{\mathrm {P}_{i}}\) is the vector of player Pi with the elements team name and nationality.
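The edge weight in (3) can be sketched as follows, under the assumption that each player's vector is treated as a two-element set containing the club and the nationality. The field names `club` and `nation` and the sample values are hypothetical; each element is tagged with its field so that a club and a country sharing a name cannot collide.

```python
from typing import Dict


def affinity(player_i: Dict[str, str], player_j: Dict[str, str]) -> float:
    """Eq. (3): Jaccard similarity between the {club, nationality} sets."""
    set_i = {("club", player_i["club"]), ("nation", player_i["nation"])}
    set_j = {("club", player_j["club"]), ("nation", player_j["nation"])}
    return len(set_i & set_j) / len(set_i | set_j)


# Hypothetical players, for illustration only.
a = {"club": "Club X", "nation": "Country 1"}
b = {"club": "Club X", "nation": "Country 1"}  # same club and same country
c = {"club": "Club X", "nation": "Country 2"}  # same club only
d = {"club": "Club Y", "nation": "Country 3"}  # nothing in common

same_both = affinity(a, b)
same_club = affinity(a, c)
no_link = affinity(a, d)
```

Under this reading the weight takes only three values: 1 when both club and country match, 1/3 when exactly one of them matches, and 0 otherwise, in which case no edge is added.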

Based on the relationships among football players, we now turn to define the network density for measuring team cohesiveness. Although many methods have been used to define a team’s cohesion based on social networks, such as the diameter communication cost [28], density-based measurement [29], and local clustering coefficient [32], the definition of a team’s cohesiveness is still an open issue. Different from the existing works, we define the network density to measure the strength of inner-team interaction in the subnetwork \(\mathcal {G}^{\prime }(\mathcal {V}^{\prime }, \mathcal {E}^{\prime })\) for a football team in (4).

$$ {\Psi}(\mathcal{G}^{\prime}) = \frac{{\sum}_{(i,j){\in}\mathcal{E}^{\prime}} \omega_{i,j}}{\lvert \mathcal{E}^{\prime} \rvert}, $$
(4)

where (i,j) is an edge in \(\mathcal {E}^{\prime }\), ωi,j is the corresponding weight defined in (3), and \(\lvert \mathcal {E}^{\prime } \rvert \) is the number of edges. If there is no edge between two nodes, we set ωi,j = 0. A larger value of \({\Psi }(\mathcal {G}^{\prime })\) suggests that football players are better able to interact with each other, while a smaller value indicates the presence of more ambiguous relationships. To better understand the importance of the network density, we give a toy example below.

Example 1

Considering the two undirected, weighted graphs in Fig. 2, each node denotes a football player, and the edges reflect the relationship between any two players. The values of \(\phi _{\text {P}_{i}}\) and ωi,j are also shown in the figure. If we ignore the collaborative relationships between football players, it is intuitive that the two players {P2,P3} are highly scored and shall be selected into a team (see the left-hand side of Fig. 2); however, their relationship (the right-hand side of Fig. 2) is rather weak. In contrast, the players {P1,P3} would be the better candidates, as they have the strongest connection, which suggests that the connection strength (network density) among players helps to build and reinforce a cohesive team.

Fig. 2 Two types of networks of three football players. The left-hand side is an edgeless graph, while the graph on the right-hand side shows the connections among players
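The network density in (4) can be sketched in Python as the mean weight of the edges induced by the selected players. The edge weights below are hypothetical and are not the values shown in Fig. 2. Returning 0 for a team with no edges is an assumption, since (4) is undefined when the edge set is empty.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def network_density(team: Sequence[str], omega: Dict[FrozenSet[str], float]) -> float:
    """Eq. (4): mean weight of the edges among the players in `team`.

    `omega` maps an unordered pair frozenset({i, j}) to its weight;
    pairs absent from `omega` have no edge.
    """
    edge_weights = [
        omega[frozenset(pair)]
        for pair in combinations(team, 2)
        if frozenset(pair) in omega
    ]
    if not edge_weights:
        return 0.0  # assumption: an edgeless team has zero density
    return sum(edge_weights) / len(edge_weights)


# Hypothetical edge weights for three players, for illustration only.
omega = {
    frozenset({"P1", "P3"}): 1.0,
    frozenset({"P2", "P3"}): 1.0 / 3.0,
    frozenset({"P1", "P2"}): 1.0 / 3.0,
}

strong_pair = network_density(["P1", "P3"], omega)
weak_pair = network_density(["P2", "P3"], omega)
```

With these weights, the pair {P1,P3} scores higher than {P2,P3}, mirroring the qualitative point of Example 1.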

3.2.3 Network heterogeneity & homogeneity

In this section, we proceed to define the network heterogeneity&homogeneity, which is also an important factor for team cohesiveness in the TC-FPACN. It is well known that heterogeneity and homogeneity are opposites, which means that improving heterogeneity may compromise homogeneity and vice versa. Specifically, heterogeneity highlights the diversity of attributes and behaviors among group members; in contrast, homogeneity emphasizes the within-group similarities regarding these shared attributes.

We adopt the Gini coefficient [37] to measure heterogeneity (or homogeneity) for the set of players, denoted Gc. Since the Gini coefficient can be calculated in many forms [38,39,40], we use an approximate calculation method [38] as follows:

$$ Gc = \frac{1}{M}\sum\limits_{m=1}^{M} \frac{1}{2{N}^{2}u}\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{N} W_{\mathrm{S}_{m}} \lvert L_{\mathrm{P}_{i}, \mathrm{S}_{m}} - L_{\mathrm{P}_{j}, \mathrm{S}_{m}} \rvert, $$
(5)

where \(\lvert L_{\mathrm {P}_{i}, \mathrm {S}_{m}} - L_{\mathrm {P}_{j}, \mathrm {S}_{m}} \rvert \) measures the difference in the skill level related to Sm between two players Pi and Pj, and u is the average value of skill Sm. In (5), we see that Gc = 1 indicates the maximum heterogeneity, while Gc = 0 is the maximum homogeneity, which means that they are interdependent [8].
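The following sketch evaluates (5) directly, taking u as the mean level of skill Sm over the players in the set. Skipping a skill whose mean level is zero is an assumption made to avoid division by zero, and the sample levels are hypothetical.

```python
from typing import Sequence


def gini_coefficient(levels: Sequence[Sequence[float]], weights: Sequence[float]) -> float:
    """Eq. (5): weighted Gini coefficient averaged over the M skills.

    levels[i][m] is the level of skill m for player i; weights[m] is W_{S_m}.
    """
    n = len(levels)
    m_count = len(weights)
    if n == 0 or m_count == 0:
        raise ValueError("need at least one player and one skill")
    total = 0.0
    for m in range(m_count):
        column = [player[m] for player in levels]
        mean = sum(column) / n
        if mean == 0:
            continue  # assumption: an all-zero skill contributes nothing
        pairwise = sum(abs(a - b) for a in column for b in column)
        total += weights[m] * pairwise / (2 * n * n * mean)
    return total / m_count


# Hypothetical data: two players, one skill with weight 1.
identical = gini_coefficient([[5.0, 5.0], [5.0, 5.0]], [1.0, 1.0])
unequal = gini_coefficient([[0.0], [10.0]], [1.0])
```

For identical players every pairwise difference vanishes, so the coefficient is 0 (maximum homogeneity). For the two-player case with levels 0 and 10, the double sum is 20 and the normalizer is 2 · 4 · 5 = 40, which gives 0.5.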

In the context of football games, the two main tasks are attack and defense, and they require different mechanisms to select players to successfully complete the tasks. Attacks on a goal benefit from players who have different skills and require a set of heterogeneous forward players. However, defense requires homogeneous players since it is expected that most defense players can play in any position in the defense area. Considering that Forward and Midfield players are involved in the attack and Backward players are responsible for the defense, based on the Gini coefficient defined in (5), we measure the network heterogeneity&homogeneity for \(\mathcal {G}^{\prime }(\mathcal {V}^{\prime }, \mathcal {E}^{\prime })\) for a football team as follows:

$$ {\Upsilon}(\mathcal{G}^{\prime}) = \left\{ \begin{array}{rl} Gc, & \quad \text{if}~ v_{n} \in \mathcal{G}^{\prime}{\cap} \mathcal{G}_{\mathrm{F}} \\ \frac{1}{Gc}, & \quad \text{if}~ v_{n} \in \mathcal{G}^{\prime}{\cap} \mathcal{G}_{\mathrm{B}} \end{array} \right., $$
(6)

where vn (\(1 {\leq } n {\leq } {\lvert {\mathcal {V}^{\prime }}\rvert }\)) represents a football player selected from the two graphs (i.e., \(\mathcal {G}_{\mathrm {F}}\) and \(\mathcal {G}_{\mathrm {B}}\)) simultaneously. Equation (6) shows that a cohesive team should maximize network heterogeneity for the Forward/Midfielder while minimizing it for the Backward in the team composition.

3.3 Team composition via three network metrics

As described above, the TC-FPACN model relies on three network metrics that lay the foundation for building a cohesive football team. To combine them, we introduce the trade-off parameters α and β, where 0≤α + β≤ 1, which configure the acceptable combinations of network ability, network density, and network heterogeneity&homogeneity. Formally, given the attributed collaboration network of football players \(\mathcal {G}(\mathcal {V}, \mathcal {E})\) and a fixed budget (Bu) for recruiting players, we use σ to denote the objective function of the TC-FPACN and formulate the team composition task as the following optimization problem.

$$ \begin{array}{ll} \mathop{\max}\limits_{\mathcal{G}^{\prime} \subseteq \mathcal{G}} & \sigma(\mathcal{G}^{\prime}) := {\alpha} {\Phi}(\mathcal{G}^{\prime}) + {\beta} {\Psi}(\mathcal{G}^{\prime}) + (1 - \alpha - \beta) {\Upsilon}(\mathcal{G}^{\prime}), \\ \text{s.t.} & {\sum}_{n=1}^{\lvert \mathcal{V}^{\prime} \rvert} Cost(\mathrm{P}_{n}) \leq \text{Bu}, \\ & \lvert \mathcal{V}^{\prime} \rvert = 11, \end{array} $$
(7)

where \({\sum }_{n=1}^{\lvert \mathcal {V}^{\prime } \rvert } Cost(\mathrm {P}_{n})\) denotes the total cost of the football team, in which the function Cost(Pn) measures the cost of player Pn based on his personal rating, which we will explain in Section 5.1.
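As a rough sketch, the piecewise term in (6), the objective in (7), and the two constraints can be written as small Python helpers. The function names are hypothetical. The text does not say how the Forward/Midfielder and Backward terms are merged for a full team or what happens when Gc = 0, so the sketch evaluates one position group at a time and raises an error in the latter case.

```python
from typing import Sequence


def heterogeneity_term(gc: float, group: str) -> float:
    """Eq. (6): Gc for Forward/Midfielder ("F"), 1/Gc for Backward ("B")."""
    if group == "F":
        return gc
    if group == "B":
        if gc == 0:
            raise ValueError("1/Gc is undefined for Gc = 0")  # not covered by the text
        return 1.0 / gc
    raise ValueError("group must be 'F' or 'B'")


def objective(ability: float, density: float, heterogeneity: float,
              alpha: float, beta: float) -> float:
    """Eq. (7): sigma = alpha*Phi + beta*Psi + (1 - alpha - beta)*Upsilon."""
    if not 0.0 <= alpha + beta <= 1.0:
        raise ValueError("alpha + beta must lie in [0, 1]")
    return alpha * ability + beta * density + (1.0 - alpha - beta) * heterogeneity


def is_feasible(costs: Sequence[float], budget: float, team_size: int = 11) -> bool:
    """Constraints of (7): exactly `team_size` players and total cost within budget."""
    return len(costs) == team_size and sum(costs) <= budget


# Hypothetical metric values, for illustration only.
sigma = objective(ability=100.0, density=0.5, heterogeneity=0.2, alpha=0.5, beta=0.3)
```

Note that the three metrics live on different scales in this sketch (network ability is an unnormalized sum), so the choice of α and β strongly affects which term dominates.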

As shown in problem (7), the goal of TC-FPACN is to find a subgraph \(\mathcal {G}^{\prime }(\mathcal {V}^{\prime }, \mathcal {E}^{\prime })\) containing a set of football players that maximize the function considering the three metrics simultaneously. The subgraph \(\mathcal {G}^{\prime }\) contains players for three types of positions in (8):

$$ \mathcal{G}^{\prime} = \mathcal{G}^{\prime}_{\mathrm{F}} \cup \mathcal{G}^{\prime}_{\mathrm{B}} \cup \mathcal{G}^{\prime}_{\mathrm{G}}, $$
(8)

where \(\mathcal {G}^{\prime }_{\mathrm {F}} \subseteq \mathcal {G}_{\mathrm {F}}\), \(\mathcal {G}^{\prime }_{\mathrm {B}} \subseteq \mathcal {G}_{\mathrm {B}}\), and \(\mathcal {G}^{\prime }_{\mathrm {G}} \subseteq \mathcal {G}_{\mathrm {G}}\). Note that we focus on choosing suitable players in the field and neglect bench players, which means that the number of nodes in \(\mathcal {G}^{\prime }\) is 11 (i.e., \(\lvert {\mathcal {V}^{\prime }}\rvert = 11\)), and \(\mathcal {G}^{\prime }_{\mathrm {G}}\) contains one goalkeeper.

4 Optimization method based on greedy algorithm

Given that finding the optimal subgraph based on the optimization function of problem (7) is NP-hard [11], we develop a greedy algorithm to solve the aforementioned team composition problem. We consider a team with a 4-3-3 formation, which is widely used in international competition. This formation means that there is one goalkeeper, four defenders, three midfielders and three forwards on a team. We first leave out the goalkeeper and develop two algorithms to find the best players from Forward/Midfielder (i.e., \(\mathcal {G}_{\mathrm {F}}\)) and Backward (i.e., \(\mathcal {G}_{\mathrm {B}}\)), respectively. Next, we propose a pruning technique to organize the final football team.

We show the process to find the best Forward/Midfielder players in Algorithm 1. For brevity, we omit the pseudocode for finding the best Backward players because the two algorithms differ only in the input: the former selects players from \(\mathcal {G}_{\mathrm {F}}\), while the latter chooses players from \(\mathcal {G}_{\mathrm {B}}\). As shown in Algorithm 1, we start with an empty graph (line 1), which poses a difficulty to the direct application of the three network metrics; therefore, we need to choose the starting football player. In this paper, we consider a key player with a good trade-off between personal ability and connections to other players. Specifically, for each player, we first extract the subnetwork that consists of the player and the player’s neighbors (lines 2-4), and then determine the key player (denoted vc) that maximizes both personal ability and network density (lines 5-6). The algorithm then proceeds through multiple iterations (lines 7-13). In each loop, the algorithm adds the most suitable player v in \(\mathcal {G}_{\mathrm {F}}\), who maximizes the value of the objective function of problem (7) (lines 8-10). Note that we remove the player who is selected from \(\mathcal {V}_{\mathrm {F}}\) at the end of each iteration, which avoids the same players being selected into the team (line 11). Finally, once the total number of players reaches the size requirement, the algorithm returns the final subgraph \(\mathcal {G}^{\prime }_{\mathrm {F}}\) (line 14).

Algorithm 1 Finding Forward/Midfielder based on a greedy algorithm.
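The greedy procedure described above can be sketched in Python as follows. This is a reconstruction from the prose, not the authors' pseudocode: the names `greedy_select`, `key_score`, and `objective` are hypothetical, the way personal ability and density are combined to pick the key player is an assumption, and ties are broken by sorted player name. The toy data are invented for illustration.

```python
from itertools import combinations
from typing import Callable, Iterable, List


def greedy_select(candidates: Iterable[str], size: int,
                  key_score: Callable[[str], float],
                  objective: Callable[[List[str]], float]) -> List[str]:
    """Start from a key player, then repeatedly add the best remaining candidate."""
    pool = sorted(candidates)  # sorted for deterministic tie-breaking (assumption)
    if size < 1 or size > len(pool):
        raise ValueError("size must be between 1 and the number of candidates")
    key_player = max(pool, key=key_score)
    team = [key_player]
    pool.remove(key_player)
    while len(team) < size:
        best = max(pool, key=lambda v: objective(team + [v]))
        team.append(best)
        pool.remove(best)  # a selected player cannot be chosen again
    return team


# Hypothetical abilities and edge weights, for illustration only.
ability = {"S": 9, "A": 5, "B": 7, "C": 6}
omega = {frozenset("SA"): 1.0, frozenset("SB"): 0.1,
         frozenset("SC"): 1 / 3, frozenset("AC"): 1.0}


def density(team: List[str]) -> float:
    w = [omega[frozenset(p)] for p in combinations(team, 2) if frozenset(p) in omega]
    return sum(w) / len(w) if w else 0.0


def toy_objective(team: List[str]) -> float:
    # Simplified two-term stand-in for the objective of problem (7).
    return 0.1 * sum(ability[v] for v in team) + density(team)


def toy_key_score(v: str) -> float:
    # Assumption: ability plus the total weight of the player's incident edges.
    return ability[v] + sum(w for edge, w in omega.items() if v in edge)


team = greedy_select(ability, 3, toy_key_score, toy_objective)
```

In this toy run, S is the key player, A is added first because of its strong tie to S, and C is then preferred over the individually stronger B because C is well connected to both S and A.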

The results from the algorithms above are used as inputs for the final team composition. Since we need to ensure that the total cost of a team does not exceed the budget, we add a pruning strategy to the greedy algorithm. We propose the idea of cost performance, denoted Cp, as a measurement to decide which player must be cut if the total cost exceeds the given budget. Specifically, for a football player Pn, the corresponding cost performance Cp is computed in (9).

$$ Cp(\mathrm{P}_{n}) = \frac{\phi_{\mathrm{P}_{n}}}{Cost(\mathrm{P}_{n})}. $$
(9)

We frame the new approach for solving the objective function of the TC-FPACN in problem (7) as the FBTP (Finding the Best Team with Pruning) algorithm presented in Algorithm 2. We first find the best goalkeeper (line 1). The best team under no budget constraint then consists of \(\mathcal {G}^{\prime }_{\mathrm {F}}\), \(\mathcal {G}^{\prime }_{\mathrm {B}}\) and the selected goalkeeper (line 2). The pruning operations are embedded in the greedy algorithm (lines 3-8). Specifically, we use a loop to check whether the total cost of the football team exceeds the budget. If the cost does not satisfy the budget requirement, we perform a pruning strategy that determines the football player vcut with the lowest cost performance (line 4) and removes vcut from the football team \(\mathcal {G}^{\prime }\) (line 5). Next, we choose another suitable candidate for the position of vcut (lines 6-7) based on the greedy algorithm. For example, if the position of vcut belongs to Forward/Midfielder, we execute the procedures in lines 8-11 of Algorithm 1 to select v.

Algorithm 2 Finding the Best Team with Pruning (FBTP).
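The pruning loop can be sketched as below, again as a reconstruction from the prose and not the authors' pseudocode. The function and parameter names are hypothetical, costs are assumed to be strictly positive, a removed player is assumed not to return to the candidate pool, and the replacement rule is passed in as a callable that stands in for the greedy step of Algorithm 1. All numbers are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set


def prune_to_budget(team: Iterable[str], pool: Iterable[str],
                    ability: Dict[str, float], cost: Dict[str, float],
                    position: Dict[str, str], budget: float,
                    pick: Callable[[List[str], Set[str]], str]) -> List[str]:
    """Swap out the lowest cost-performance player until the budget is met."""
    team = list(team)
    pool = set(pool) - set(team)
    while sum(cost[v] for v in team) > budget:
        # Eq. (9): cost performance Cp = ability / cost (cost assumed > 0).
        cut = min(sorted(team), key=lambda v: ability[v] / cost[v])
        team.remove(cut)
        # Only candidates playing in the same position group may replace `cut`.
        eligible = {v for v in pool if position[v] == position[cut]}
        if not eligible:
            raise ValueError("no replacement left for position " + position[cut])
        new = pick(team, eligible)
        team.append(new)
        pool.remove(new)  # the pool shrinks each round, so the loop terminates
    return team


# Hypothetical data loosely following the narrative of the example in the text.
ability = {"S": 90, "A": 85, "B": 80, "C": 70, "E": 75, "D": 72}
cost = {"S": 30, "A": 25, "B": 20, "C": 35, "E": 15, "D": 10}
position = {name: "F" for name in ability}


def pick_most_able(team: List[str], eligible: Set[str]) -> str:
    # Stand-in for the greedy step: choose the most able eligible player.
    return max(sorted(eligible), key=lambda v: ability[v])


final_team = prune_to_budget(["S", "A", "B", "C", "E"], ability, ability, cost,
                             position, budget=110, pick=pick_most_able)
```

Here the initial team costs 125, which exceeds the budget of 110. Player C has the lowest cost performance (70/35 = 2.0) and is replaced by D, bringing the total cost to 100.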

To better illustrate the workflow for constructing a football team based on the algorithms mentioned above, we provide an example in Fig. 3, which illustrates the process of finding five football players from Forward/Midfielder. We first focus on choosing players without the budget constraint (see the left-hand side of the figure). We start with the key player S and proceed to find the most suitable forward (or midfielder) in each iteration through Algorithm 1. For instance, in step 1, we choose the football player A that maximizes the objective function of problem (7). We return the final selection result (i.e., {S,A,B,C,E}) in step 4, once the required number of players is reached. Since this selection does not take the budget into account, on the right-hand side of Fig. 3 we apply Algorithm 2 to conduct the pruning operation by removing the player with the lowest cost performance and then finding another football player, i.e., we remove C and add D. We output the candidate set {S,A,B,E,D} if its total cost is no larger than the budget; otherwise, the pruning and selection processes are repeated until the budget requirement is satisfied.

Fig. 3 An example of the process for finding five football players with a budget constraint

5 Empirical study: data analysis and team evaluation

As discussed in Section 1, it is difficult to form a series of football teams in the real world to evaluate the performance of the proposed model. Fortunately, football video games provide a convenient and quick way to assess the effectiveness of our model. In this paper, we implement and test our method on the two most popular game platforms (i.e., PES2018 and FIFA2022). Figure 4 shows screenshots of the two platforms; both are classical and full-fledged platforms that not only simulate real-life football players in detail but also support multiplayer play, including the simulation of a football match. We conduct a series of experiments with the quick games of PES2018 and FIFA2022 on a Windows PC. All code is implemented in Python, and the numerical computations are conducted on a server with a 12-core Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz and 16 GB of memory. The source code of our method is publicly available at https://github.com/misterbobo/TCFPACN.

Fig. 4 Football game interfaces of the game platforms. The left-hand side is the playing field of PES2018, and the right-hand side shows the user interface of FIFA2022

5.1 Data analysis

Since the values of many attributes of the team composition are calculated from game data, we first analyze the original data from PES and FIFA and preprocess the data (see Footnote 6). For PES2018, we retrieve data on 9,563 football players; for FIFA2022, we collect data on 18,278 players from the official website (see Footnote 7). Table 1 provides a brief overview of the two datasets, both of which list player IDs, positions, and names, as well as descriptions of each player’s skills, such as a player’s attacking prowess in Table 1a.

Table 1 The structure of the original data in two game platforms

As seen from Table 1, each player serves in a particular position in a football team, and each position has different skill requirements. For example, in Table 1a, attacking prowess is crucial for a forward, while it has no relevance for a goalkeeper. Table 2 shows the assessments of 23 skills for some well-known players in PES2018. The numerical values reflect each player’s performance on each skill. As seen from Table 2, it is necessary to link the skills to distinct positions.

Table 2 A sample of four players’ assessment by 23 criteria in PES2018

For each dataset, we first divide the raw data into three groups (i.e., Forward/Midfielder, Backward, and Goalkeeper) according to each player’s position on the pitch. For each group, we rank the skills by their average values and select the top-10 skills, which are presented in Table 3. Notably, Table 3 omits the criteria for goalkeepers: both datasets have only a few skills that are relevant to goalkeepers, so we include all of them. The weight of each skill is assigned following the principle described in [3], as provided in the last column of Table 3.

Table 3 The criteria for forward/midfielder and backward

The main goal of this paper is to form a cohesive team under a budget constraint (see problem (7)). Therefore, it is necessary to know the salary of each football player. However, many football players in both the PES and FIFA datasets have missing salaries. Because a player’s cost is positively correlated with his rating, the rating is a good indicator of cost. Given a football player \(\mathrm{P}_{n}\), we therefore use the fitting function from [5] to estimate his cost as follows; these estimates are then used to formulate the total team cost.

$$ Cost(\mathrm{P}_{n}) = \eta e^{\theta r(\mathrm{P}_{n})}, $$

where \(\eta = 6.375 \times 10^{-4}\) and \(\theta = 0.1029\). In addition, \(r(\mathrm{P}_{n})\) denotes the rating (or overall) of \(\mathrm{P}_{n}\) (see Table 1).
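As a quick illustration of how the fitted cost function behaves, the following sketch evaluates it for individual ratings and sums it over a team. The constants are the ones given above; the function names and the assumption that the team cost is the plain sum of the player costs are ours.

```python
import math

ETA = 6.375e-4   # eta from the fitted cost function
THETA = 0.1029   # theta from the fitted cost function


def player_cost(rating: float) -> float:
    """Estimated cost of a player with the given rating: eta * exp(theta * rating)."""
    return ETA * math.exp(THETA * rating)


def team_cost(ratings) -> float:
    """Assumed total team cost: the sum of the estimated player costs."""
    return sum(player_cost(r) for r in ratings)
```

Under this fit, a rating of 90 corresponds to a cost of roughly 6.7 and a rating of 80 to roughly 2.4, and each additional rating point raises the estimated cost by about 10.8%, so top-rated players consume a disproportionate share of the budget.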

5.2 Performance metrics and parameter settings

To make game results easy to interpret, we use goal difference (GD) and team points (Tps), which are commonly used in international competitions, as the metrics to evaluate team performance. Specifically, given a competition set \({\Delta} = \{{\Delta}_{1}, {\Delta}_{2}, \cdots, {\Delta}_{Z}\}\), where \({\Delta}_{z}\) \((1 \le z \le Z)\) represents a match and Z is the total number of matches, the value of GD for a football team is calculated as the number of goals scored in all matches minus the number of goals conceded, which gives

$$ \text{GD} = \sum\limits_{z=1}^{Z} \left( \delta_{s}({\Delta}_{z}) - \delta_{c}({\Delta}_{z}) \right), $$

where \(\delta_{s}({\Delta}_{z})\) and \(\delta_{c}({\Delta}_{z})\) are the numbers of goals scored and conceded in match \({\Delta}_{z}\), respectively. Tps denotes the total match points of a team, as shown in (10).

$$ \text{Tps} = \sum\limits_{z=1}^{Z} \text{Tp}({\Delta}_{z}), $$
(10)

where \(\text{Tp}({\Delta}_{z})\) is the team point for one match, which gives

$$ \text{Tp}({\Delta}_z) = \left\{ \begin{array}{rl} 3, & \quad \text{if win} \\ 0, & \quad \text{if draw} \\ -1, & \quad \text{if lose} \end{array} \right.. $$
(11)
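The two metrics can be computed directly from a list of scorelines. The sketch below assumes each match is recorded as a hypothetical `(goals_scored, goals_conceded)` pair from the evaluated team's point of view; the per-match point values are taken from (11) as stated.

```python
def goal_difference(matches):
    """GD: goals scored minus goals conceded, summed over all matches."""
    return sum(scored - conceded for scored, conceded in matches)


def match_points(scored, conceded):
    """Tp for a single match, using the point values given in (11)."""
    if scored > conceded:
        return 3   # win
    if scored == conceded:
        return 0   # draw
    return -1      # loss


def team_points(matches):
    """Tps: the sum of the per-match points, as in (10)."""
    return sum(match_points(scored, conceded) for scored, conceded in matches)
```

For instance, a three-match series with scorelines 2-1, 1-1, and 0-3 gives GD = -2 and Tps = 2 under these definitions.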

Unless stated otherwise, we set the number of matches Z = 30 and set Bu = 100 to simulate the unconstrained budget case. In addition, for PES2018, we use the FBTP algorithm with the settings α = 0.6 and β = 0.2 in the TC-FPACN to solve the optimization problem (denoted TC-FPACN+FBTP); similarly, we set α = 0.4 and β = 0.4 for FIFA2022. We further present a sensitivity analysis of parameters α and β based on our new evaluation strategy in Section 5.5.

5.3 Simulation results

As the team budget has a large impact on team composition, we investigate the capability of the TC-FPACN+FBTP to deal with different team composition scenarios (i.e., with or without the budget constraint).

5.3.1 Team performance without a budget constraint

In this subsection, we conduct experiments to show the effectiveness of the team generated by the TC-FPACN+FBTP when the budget constraint is ignored. We show our team formation results in Table 4. Based on the recommended players, we compose our DREAM TEAM in PES2018 and FIFA2022, denoted DT-PES and DT-FIFA, respectively (see the left-hand sides of Fig. 5a and 5b). To conduct a performance comparison and ensure the fairness of competitions, we select a team in PES2018 with a cost approximately equal to that of DT-PES, namely, MD WHITE (see Footnote 8; the right-hand side of Fig. 5a), which is one of the most competitive teams in the game. In FIFA2022, we choose MANCHESTER UNITED (see Footnote 9) as the competitor (pictured on the right in Fig. 5b), which not only has a similar cost to DT-FIFA but also has the leading record in its football league.

Table 4 Selected football players in PES2018 and FIFA2022 with Bu = 100
Fig. 5 Recommended players to compose DT-PES vs. MD WHITE in PES2018 and DT-FIFA vs. MANCHESTER UNITED in FIFA2022

Table 5 shows the match results on the two game platforms, including the scoreline of each match, the total cost, Tps, and GD. A close inspection of the match results shows that DT-PES wins more matches than MD WHITE in PES2018, and DT-FIFA achieves better performance than MANCHESTER UNITED in FIFA2022. Moreover, the cost of our team is slightly smaller than that of MD WHITE (or MANCHESTER UNITED). On both platforms, our team dominates the 30-match series, which highlights the effectiveness of the proposed model.

Table 5 Match results for DT-PES vs. MD WHITE and DT-FIFA vs. MANCHESTER UNITED without a budget constraint

To further demonstrate the strength of our team, we simulate matches in which random teams play against MD WHITE and MANCHESTER UNITED, respectively. We generate random teams in two ways, in both cases using the total cost of MD WHITE (or MANCHESTER UNITED) as the budget constraint. The first way picks a player for each position randomly based on the average budget; the second first picks a few players that consume most of the budget and then selects the other players based on the remaining budget. We name the resulting teams RAND 1 and RAND 2, respectively, in PES2018, and RAND 3 and RAND 4, respectively, in FIFA2022. The simulated results are shown in Table 6. In terms of Tps and GD, our teams perform better than all the random teams when competing against MD WHITE in PES2018 and MANCHESTER UNITED in FIFA2022.

Table 6 Match results for random teams against MD WHITE in PES2018 and MANCHESTER UNITED in FIFA2022

5.3.2 Team performance considering different budget constraints

Football player recruitment is commonly constrained by a limited budget. In this subsection, we discuss the performance of the TC-FPACN+FBTP under varying budget constraints. In PES2018, since MD WHITE is one of the best teams with the highest cost burden, we use its cost as the upper budget limit (denoted as Buhi) and vary the budget from Bulo to Buhi, where Bulo = 10 and Buhi = 60. Similarly, in FIFA2022, we set Buhi = 40, which is close to the cost of MANCHESTER UNITED, and Bulo = 0. We define the budget levels in Table 7.

Table 7 The budget levels and the corresponding range of values

Since Table 5 already shows the outcomes of the simulated matches against MD WHITE and MANCHESTER UNITED, both of which have a cost of Level V, we select four further typical teams on each game platform whose costs fall within Level I to Level IV. Specifically, in PES2018, we choose AS RED WHITE, VALENCIA, LONDON FC, and PM BLACK WHITE; in FIFA2022, the four teams are CD TONDELA, FC NANTES, REAL SOCIEDAD, and AC MILAN. For each competitor, we use the corresponding budget level as the constraint to select the football players who constitute the DREAM TEAM based on the TC-FPACN+FBTP. We show the match results in Table 8. As shown in the table, all eight teams generated by the TC-FPACN+FBTP outperform their opponents in terms of Tps. In addition, except for a two-goal deficit over the 30-match series against REAL SOCIEDAD in FIFA2022, the teams formed with our method also win their series with a superior goal difference. The match results suggest that the proposed method can assemble a team that wins nearly all the competitions at a given budget level.

Table 8 Match results under different budget constraints

5.4 Method comparisons

In this subsection, we compare the TC-FPACN+FBTP with other approaches from two aspects. We first compare the TC-FPACN+FBTP with another football team composition method, namely, CEFG (Cost-Effective Forward selection Greedy) [5]. Next, we replace the FBTP search with a strategy based on the random walk algorithm (RW) [41], which is widely used in many areas (e.g., recommender systems [42, 43], community detection [44, 45], and sampling algorithms [46]), to solve the constrained optimization problem (7); we denote this variant TC-FPACN+RW.

5.4.1 Comparison with the CEFG method

We first compare the team composition quality of our method with that of CEFG. We again use the PES2018 and FIFA2022 game platforms and focus on the Tps and GD of the two methods at different budget levels. For a given budget constraint, we first generate two football teams on a platform with the TC-FPACN+FBTP and CEFG and then simulate 30 matches between the two teams. Figure 6 compares the simulation results, from which we can conclude that the team generated by the TC-FPACN+FBTP dominates play on the football pitch. In addition, all four panels show that the Tps (or GD) first increases and then decreases as the budget level increases, peaking at Level II in PES2018 and Level III in FIFA2022. Interestingly, a closer look reveals that the cost of our team at Level II in PES2018 is approximately equal to that at Level III in FIFA2022. A possible reason for the weaker performance of CEFG is that the team it recommends tends to contain a few superstars, while the remaining players may lack competitiveness, especially at a small budget level, thereby leading to poor match results. In contrast, the TC-FPACN+FBTP is more effective at building a cohesive team that balances the ability in each position and facilitates collaboration among players. Thus, the results suggest that the proposed method delivers reliable and promising performance and is not constrained by the choice of game platform.

Fig. 6 The performance of the TC-FPACN+FBTP and CEFG under different budget levels in terms of Tps and GD

5.4.2 Comparison with the random walk strategy

In this subsection, we compare the performance of the TC-FPACN+FBTP with that of the TC-FPACN+RW. Let \(\mathcal {G}(\mathcal {V}, \mathcal {E})\) be the attributed collaboration network of football players. The TC-FPACN+RW begins at a randomly chosen node \(v_{i}\), and at each step, it moves to another node \(v_{j}\) with a probability proportional to the weight of edge (i,j). We set this weight based on the value σ of the objective function of problem (7) when \(v_{j}\) is included, which means that a higher value of σ results in a greater probability of choosing node \(v_{j}\). The search stops once the required number of football players is reached, and all the nodes selected in this way form the final football team.
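The random walk selection described above can be sketched as follows. This is an illustrative simplification under our own assumptions: the network is given as a hypothetical adjacency dictionary `neighbors`, the callable `sigma` stands in for the objective of problem (7) evaluated on a candidate team and is assumed to be non-negative, already selected players are not revisited, and the walk simply stops early if the current node has no unselected neighbor.

```python
import random


def random_walk_team(neighbors, sigma, team_size, rng=None):
    """Build a team by a weighted random walk over the collaboration network."""
    rng = rng or random.Random()
    current = rng.choice(sorted(neighbors))  # random starting node
    team = [current]
    while len(team) < team_size:
        # Only unselected neighbors of the current node are considered.
        options = [v for v in neighbors[current] if v not in team]
        if not options:
            break  # assumption: stop early when the walk is stuck
        # Transition weight: objective value of the team extended by v.
        weights = [sigma(team + [v]) for v in options]
        if sum(weights) <= 0:
            weights = [1.0] * len(options)  # fall back to a uniform choice
        current = rng.choices(options, weights=weights, k=1)[0]
        team.append(current)
    return team
```

Because each step looks only at the neighbors of the current node, a walk that starts in a weak region of the network can fill the team before it ever reaches the strongest players.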

Following the simulation procedure described in Section 5.4.1, we let the team formed by the TC-FPACN+FBTP compete against the team formed by the TC-FPACN+RW in PES2018 and FIFA2022. The simulation results are compared in Fig. 7. Figure 7a and b show that the team generated by the proposed method wins all matches against the team produced by the TC-FPACN+RW on the PES2018 platform. In addition, Fig. 7c and d show that the team formed by the TC-FPACN+FBTP is also dominant enough to win matches under four budget constraints (i.e., from Level II to Level V). A possible explanation is that the RW strategy considers only the neighbors of the current node in the players’ network at each search step, so it is easily trapped in a local optimum, which compromises the discovery of the most suitable players. Note that at Level I, the Tps and GD of the team built via our method are smaller than those of the team produced by the TC-FPACN+RW (see the rightmost bars in Fig. 7c and d), which means our team lost most of the matches. This result is likely related to the very low budget, which is insufficient to recruit even one competitive football player. Nevertheless, the overall results show the effectiveness of the proposed FBTP searching algorithm.

Fig. 7 The performance of the TC-FPACN+FBTP and TC-FPACN+RW under different budget levels in terms of Tps and GD

5.5 Sensitivity analysis of the parameters

In this subsection, we discuss the sensitivity of the TC-FPACN model to its parameters α and β under no budget constraint. We again use the FBTP algorithm to choose football players. Since it is time-consuming to simulate all matches for different parameter settings, we seek another indicator to evaluate team performance efficiently instead of using Tps and GD. In PES2018, a prominent number, namely, Team Spirit (TS; see Footnote 10), is displayed when we complete the configuration of a football team (e.g., the upper right corner of the left-hand side in Fig. 5a). TS indicates how good the relationships are on the pitch; a high TS value arises when players have an affinity for the manager’s team instructions, which naturally leads to better teamwork. In FIFA2022, which lacks a concept similar to TS, we use the overall rating (OR), which is the average of the ratings of all football players on a team (e.g., see the players’ ratings on the left-hand side of Fig. 5b). We assume that a higher value of OR indicates better team performance.

Armed with TS and OR, as well as the total team cost, we loop through all values of α and β with an increment of 0.1 per iteration. A smaller increment would make the evaluation more labor-intensive without changing the recommended players much. Figure 8 shows the results of tuning α and β. If α = 0 and β = 0, the objective function of problem (7) maximizes only the heterogeneity&homogeneity of a team, which results in poor TS and OR values. Similarly, if α = 1 and β = 0, the function considers only the network ability, which not only degrades the TS or OR value but also increases the cost burden. Additionally, the values of TS and OR increase slightly as β increases. This observation suggests that network density is an important factor that benefits team spirit, which is consistent with football being a team sport. Given the results in Fig. 8a and b, we choose α = 0.6 and β = 0.2 for PES2018 because this setting achieves the highest team spirit value while incurring a relatively low cost. For FIFA2022, Fig. 8c and d show that the grid point (α,β) = (0.4,0.4) offers a good balance between a relatively high OR value and a low total cost; hence, we use this pair of parameters as the tuning result.
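The parameter sweep can be sketched as a simple grid enumeration. The sketch rests on two assumptions of ours: that the feasible region is α + β ≤ 1 (inferred from the boundary cases discussed above, where the remaining weight goes to heterogeneity&homogeneity), and that a hypothetical callable `evaluate(alpha, beta)` returns a `(quality, cost)` pair, with quality being TS or OR. The selection rule shown (highest quality, ties broken by lower cost) is only one possible formalization of the trade-off.

```python
def parameter_grid():
    """All (alpha, beta) pairs on a 0.1 grid with alpha + beta <= 1."""
    # Integer tenths avoid accumulating floating-point error.
    return [
        (a / 10, b / 10)
        for a in range(0, 11)
        for b in range(0, 11 - a)
    ]


def select_parameters(grid, evaluate):
    """Pick the pair with the highest quality, breaking ties by lower cost."""
    scored = [(evaluate(alpha, beta), (alpha, beta)) for alpha, beta in grid]
    _, best = max(scored, key=lambda item: (item[0][0], -item[0][1]))
    return best
```

With a step of 0.1 this grid contains 66 parameter pairs, each of which requires composing a team and reading off TS or OR, which is why a cheap indicator is preferable to simulating 30 matches per setting.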

Fig. 8 The values of TS (for PES2018) and OR (for FIFA2022) and the corresponding total costs under different settings of α and β

6 Conclusions

In this paper, we study the problem of optimizing football team composition in the context of the attributed collaboration network of football players. Since the team’s success requires full cooperation between football players, we propose a team scoring function that considers three network metrics, namely, network ability, network density, and network heterogeneity&homogeneity. We then convert the constrained team composition task into the problem of finding an optimal subgraph in the attributed collaboration network. To tackle this problem, we present a novel approach that searches a subgraph by using a greedy algorithm with pruning techniques. We conduct an empirical study of the proposed techniques on two simulated game platforms (PES2018 and FIFA2022). The experimental results show that our method can build a competitive team.

Despite achieving good performance, we have barely scratched the surface of football players’ cooperation mechanisms. In particular, the search strategy tends to be trapped in a local optimum in our study. Further work needs to be conducted to investigate sophisticated social factors and delve into how they interact, as well as to explore other search optimization algorithms based on a given budget constraint. In addition, although our new approach aims to determine a football team composition, the investigation of the cooperation factors in this paper can be generalized to solve the team cohesion problem. We will investigate such a generalization and its applications in other problem domains.