1 Introduction

Handball is an important and popular sport, especially in Europe. The most powerful leagues, such as the German and French Premier Leagues, have very large budgets (DPA 2019). It is also one of the most widely-played sports, in multiple leagues (DHB 2019). Therefore, it is a sport with economic and human importance at the European level. Today, there is growing interest in evaluating the performance of handball teams, which involves establishing an impartial method of assessing players. In the worst case, assessments are merely subjective. The aim is to be able to make an unbiased evaluation of players based on their statistics for the main actions in each match.

The most recent study, Oytun et al. (2020), evaluates different machine-learning models for predicting particular types of athletic performance in female handball players, as well as determining the significant factors influencing expected performance. The main difference between their study and this paper is that in that case, the authors applied a methodology to evaluate athletic performance in female handball players, while the present paper aims to assess the performance of handball goalkeepers.

Goalkeepers play a crucial role in handball. In the coaching community, it is well known that the goalkeeper’s performance can predict the team’s ranking in big events (see Hansen et al. 2017). There have, however, been relatively few research studies carried out regarding elite goalkeepers. For example, the methodology presented in Schwenkreis (2019) allows the effectiveness of a handball team, or of a single handball player, to be quantified, but it is insufficient for evaluating the performance of a handball goalkeeper in a match.

Recently, Hatzimanouil et al. (2017) conducted a full study of the shot effectiveness of each handball position with respect to the goalkeeper. Cveniç (2000) and Hatzimanouil (2019) proposed methodologies based on statistical studies for evaluating the performance of goalkeepers in handball. Cveniç (2000) applied three methods to the goalkeepers of the European Men’s Handball Championship 2018 based on (i) the goalkeeper’s save percentage, (ii) the goalkeeper’s save percentage and time played, and (iii) further taking into account the distance from which the shots are taken. Hatzimanouil (2019), meanwhile, carried out a statistical study to classify the goalkeepers at the European Women’s Handball Championship 2018 into three categories, based on their save efficiency, time played and number of matches played. In this paper, our purpose is to establish the weights of the most outstanding actions of handball goalkeepers from different points of view during the course of a tournament. The most outstanding actions are those reported by the European Handball Federation (EHF) in their statistics for each match. Specifically, we address the following research questions: Is it possible to obtain a weighting scheme for the actions of handball goalkeepers to:

  1. RQ1: identify the best goalkeeper in the tournament?

  2. RQ2: identify the top five goalkeepers in the tournament?

  3. RQ3: rank the best goalkeepers in the tournament correctly?

This paper uses three different approaches to answer the above research questions:

  • Multi-criteria—Group Decision-Making approach The first of the methodologies for weighting the actions of goalkeepers is based on the application of multi-criteria decision-making (MCDM) theories, which provide a suitable solution for processing the results of the assessment. For this purpose, experts (30) were consulted in order to label the importance of each of the performance indicators linguistically. The final weighting scheme is obtained from these opinions after applying different fuzzy logic techniques of consensus and aggregation (Romero et al. 2020).

  • Metaheuristic Weighting The weighting process can be transformed into an optimisation process in which we seek the maximum alignment of the weights obtained with the judgement of the experts in the competition (Chỳna et al. 2013; Romero et al. 2021). With this aim, those weights that optimise the choice of the most valued goalkeeper or the choice and ranking of the five best goalkeepers in the competition are then found. Metaheuristic algorithms are used for this purpose.

  • Statistical Weighting This method is based on the reference values established in Antón García (2005), which are widely used in handball today. Reference scales are thereby established for each type of player action over several high-level matches and championships. From these scales, the reference weights are calculated by solving a number of balance equation systems between the indicators representing the result of the same action (Goal/Save).

The features of these techniques are particularly interesting, in terms of accuracy and significance, for handball goalkeeper assessment. Firstly, each approach has been proven to establish weighting schemes in a decision-making process (Pérez et al. 2018; Paul and Das 2015; Sáez et al. 2014). Furthermore, all three allow the use of expert knowledge in order to solve the problem, which is particularly recommended in this context (Kvam 2011). Finally, they do so from three different viewpoints and via three different procedures: as a general opinion on the problem (MCDM), as a provider of relevant elements through which weights are subsequently optimised (metaheuristic weighting, here Particle Swarm Optimisation, PSO), or by setting the weights of the indicators directly (statistical weighting).

The design of the proposed approaches was different in terms of the sources of the weighting scheme. For the MCDM approach, we use the opinion of a set of experts; the metaheuristic approach is aimed at optimising the weighting scheme according to the choice of best goalkeeper in the competition, and the statistical approach is based on reference levels and the statistics of the Championship. However, to the best of our knowledge, there are no studies quantitatively assessing the impact that these approaches have on the choice of the best goalkeeper. To address this deficiency, this paper presents an assessment of the 2020 European Men’s Handball Championship. The data have been gathered, analysed and compared in order to answer the research questions set out above.

Consequently, it is worth highlighting the following contributions in this study: The primary contribution of this paper is the empirical analysis of three representative strategies for weighting evaluation criteria (a fuzzy approach, a metaheuristic optimisation strategy and a statistical method), whose results are compared to the tournament’s best goalkeepers. A significant secondary contribution is that the results obtained will offer guidance and valuable information for data selection and will be good predictors of goalkeeper performance. Thus, our results provide significant insights into the features that distinguish an outstanding goalkeeping performance in a short tournament from those of the other players. Additional contributions of this work are the following:

  • Building a dataset from play-by-play data from official statistics of the 2020 Men’s EHF European Championship (Men’s EHF EURO 2020), with the complete set of criteria for evaluating the performance of a handball goalkeeper.

  • Estimating the weights to evaluate handball goalkeepers, taking into account each meaningful action in a handball match.

  • Putting forward a real-world application and evaluation for the sport of handball by a comparison of several weighting schemes using classical ranking metrics.

The novelty of our work lies in the fact that we address a number of difficult matters not previously considered in the literature: 1) data collection and performance indicator calculation from play-by-play data, 2) comparison of different strategies (fuzzy logic, metaheuristics, statistical) for weighting performance criteria, 3) comparison with the rankings of the best goalkeepers in a tournament.

The paper is organised as follows: Sect. 2 describes the process for evaluating the performance of handball goalkeepers. Section 3 explains the different alternatives for establishing the weights of the evaluation criteria: fuzzy, metaheuristic and statistical. Section 4 presents our Case Study for assessing the performance of goalkeepers at the 2020 European Handball Championship. Conclusions and future work are outlined in Sect. 5.

2 Handball goalkeeper evaluation process

In this section, we present the proposed model for the evaluation of handball goalkeepers in short competitions. There are three significant steps, as shown in Fig. 1. First comes the identification of criteria. In fact, the selection of suitable criteria depends on the tournament. Next, the criteria weights must be determined. Then, the third step is to implement the multi-criteria decision-making method, which consists of choosing the method to be used to select the best player.

Fig. 1 Best Goalkeeper Selection Process

2.1 Motivation

Choosing the “Player of the Match” is one of the most ambiguous decision-making issues in player evaluation. The selection process is critical for identifying the crucial components of a team, which is why it has become an essential focus for every team. However, there is no standard for player evaluation, which tends to be carried out according to the situation. It is always challenging to choose the most valuable player because the selection criteria keep changing.

Moreover, if there is one position that requires individual assessment in a sport like handball, it is the goalkeeper. Some studies in the literature quantify the influence of a goalkeeper’s performance at more than 50% of the final result. Accordingly, the majority of coaches think that a different set of indicators should be used to evaluate this specific position.

The position of the goalkeeper as a player is privileged; he is always the last defender when the opposing team is trying to score a goal, and the first to start the attack when the opposing team takes a shot that is not intercepted by the defense. Goalkeepers are the only players in handball who perform most of their actions without the help of their teammates. No one can compensate for their mistakes, and they play in a space that only they can access.

Fig. 2 Player-of-the-Match Selection Process

For all these reasons, goalkeepers have a more significant influence than other players on the development of the game and the final result of the match. We thus need different indicators to assess their performance in each game.

2.2 Selection criteria identification

The identification of decision-making criteria is very important in player-of-the-match selection. Many approaches still limit themselves to using the count of goals/saves as the single determining factor when choosing the best player.

The evaluation of only one criterion (goals or saves) is not the most suitable approach, since many other factors must be considered in the selection process. Nowadays, it is important to structure the problem and to assess pertinent criteria explicitly before reaching a decision. A number of methods exist to solve multi-criteria problems, and at the root of many of them is the idea that most decision-making can be improved by breaking down the general evaluation of alternatives into evaluations of a number of relevant criteria.

The quantitative and qualitative criteria for evaluating handball players have been selected after an expert-based evaluation process, choosing those that seemed most appropriate for the specific evaluation according to the experts (see Fig. 2):

  • A collection of basic assessment criteria is chosen as a baseline for the decision process, to determine which indicators need to be considered to model the player/performance. This set can be obtained from the literature.

  • This basic set is extended using the criteria of a National Handball Coach. The extension achieves a higher level of granularity than the Basic Indicators Set.

  • The list of criteria is made available to a large number of people who can give their opinion on it (fans, journalists, players) and contribute new indicators to the set.

  • A small number of high-level experts study the results of the process to make the final decision on the criteria to be taken into account.

The indicators used in this paper for evaluating the specific position of goalkeeper can be summarised as shown in Table 1:

Table 1 Selected Criteria

3 Determination of criteria weights

The weights of the related criteria have to be determined with the assumption that the weights represent the level of contribution to the overall performance of the team.

3.1 Multi-criteria—group decision-making approach

The proposed method is based on that set out in previous work by the authors for handball outfield players (Romero et al. 2020). In this approach, the set of criteria obtained is given to a set of experts who are weighted according to their coaching experience. Fuzzy linguistic terms, such as most important, important, and normal, were used to determine the experts’ influence weightings, which were represented by fuzzy membership functions.

Every expert, depending on his/her own experience, must express an opinion regarding the chosen criterion. In this case, each criterion is evaluated by linguistic labels belonging to one of the following predefined label sets.

$$\begin{aligned} S^3&= \{ s^3_1 = H - High,\ s^3_2 = M - Medium,\ s^3_3 = L - Low\} \\ S^5&= \{ s^5_1 = VH - Very\_High,\ s^5_2 = H - High, \\&\quad \ \ s^5_3 = M - Medium,\ s^5_4 = L - Low,\ s^5_5 = VL - Very\_Low\} \end{aligned}$$
(1)

These label sets are represented by using Trapezoidal and Triangular Fuzzy Numbers (Fig. 3).

Fig. 3 Linguistic label definitions as fuzzy sets

Once all the experts have provided the required information, if they applied different sets of linguistic labels, the information is standardised (represented on the same scale) to be used later in the aggregation processes. For this purpose, multi-granular fuzzy linguistic modelling methods (Morente-Molinera et al. 2020) can be applied. Our study adopts a model based on the concept of 2-tuple fuzzy linguistic representation (Martínez and Herrera 2012). This approach has produced excellent results in practical applications, especially in terms of efficiency (Serrano-Guerrero et al. 2020).

A high level of consensus in the expert opinions is mandatory to assure the viability of the aggregation process, i.e. we require that the experts’ estimates have a common intersection at some \(\alpha \)-level cut (Hsu and Chen 1996). For this purpose, we compute a global consensus degree that is compared against a consensus threshold provided by the set of experts prior to the application of the consensus. If this degree is greater than or equal to this consensus threshold, then the consensus-reaching process is considered successful, and hence it should end. Otherwise, the experts need to discuss it further.
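
As an illustration of the \(\alpha \)-level cut condition only, the following sketch checks whether a set of expert estimates shares a common intersection at a given \(\alpha \). It assumes triangular fuzzy numbers written as (left, peak, right); the function names and the sample values are hypothetical, and the way the global consensus degree itself is computed is not reproduced here.

```python
from typing import List, Tuple

# Assumption: each estimate is a triangular fuzzy number (left, peak, right)
Triangular = Tuple[float, float, float]


def alpha_cut(tfn: Triangular, alpha: float) -> Tuple[float, float]:
    """Interval of points whose membership degree is at least alpha."""
    left, peak, right = tfn
    return (left + alpha * (peak - left), right - alpha * (right - peak))


def have_common_intersection(estimates: List[Triangular], alpha: float) -> bool:
    """True if the alpha-cuts of all estimates share at least one point."""
    cuts = [alpha_cut(t, alpha) for t in estimates]
    lower = max(lo for lo, _ in cuts)   # largest lower bound
    upper = min(hi for _, hi in cuts)   # smallest upper bound
    return lower <= upper


# Hypothetical estimates from three experts for one criterion
experts = [(0.0, 0.25, 0.5), (0.25, 0.5, 0.75), (0.2, 0.4, 0.6)]
print(have_common_intersection(experts, alpha=0.0))
print(have_common_intersection(experts, alpha=0.9))
```

Raising \(\alpha \) narrows every interval, so agreement at a high \(\alpha \) is a stronger requirement than agreement at a low one.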

Once the expert opinions are validated, we use fuzzy techniques to find a compact and synthesised representation of expert opinions. The aggregation process is carried out in the following steps:

  1. Each of the n experts defines the linguistic vector \(V^e = \{v^e_1, v^e_2, \cdots , v^e_n \}\) indicating the importance that they give to each of the chosen criteria. The union of all these linguistic vectors generates the global assessment matrix containing all the valuations from all the experts.

  2. Individual Aggregation: The linguistic vector \(V^e = \{v^e_1, v^e_2, \cdots , v^e_n \}\) representing the expert evaluations of all indicators \(S_e\) is aggregated, using the weights assigned to the experts \(W_e\). This is done through the following steps:

    (a) Baseline: A fuzzy definition of each label is used (balanced or unbalanced).

    (b) Aggregated Valuation Matrix: We obtain the weight of each label in each criterion according to its occurrence frequency and the expert weighting \(w_{e_n}\). The weighted mean operator is used to do this.

    (c) Fuzzification: We use a fuzzy operator to obtain a fuzzy set for each criterion, for example the minimum t-norm, which truncates each baseline membership function according to the previously computed weighting of each label.

    Then, the aggregated output will be the fuzzy set representing how relevant each indicator would be in measuring player performance.

  3. Defuzzification: The centre of area (COA) defuzzification method computes the centre of mass of the membership function of the fuzzy set (the centroid). The COA method maintains the underlying semantic ranking relation within the set of linguistic labels, i.e. given two linguistic labels \(s_i,s_j \in S\) such that \(s_i < s_j\) then \(u_{COA}(s_i) < u_{COA}(s_j)\). Thus, the centroid of a type-1 fuzzy set A in a continuous domain X is calculated as follows:

    $$\begin{aligned} u_{COA} (A) = \frac{\int _X x\,\mu _{A}(x)\,dx}{\int _X \mu _{A}(x)\,dx} \end{aligned}$$

    and this will be the method used here to obtain a numerical output.

  4. Stratification: A stratification process was carried out to reduce the sparsity of the weighting distribution. The purpose is to group criteria with similar evaluations using a hierarchical clustering algorithm. As a result, we obtain the criteria segmented into groups according to their importance.

  5. Collective Aggregation: The aggregation process obtains the final solution according to the opinions given by the experts. These results allow us to define a weighting scheme to evaluate the performance in accordance with this set of criteria. For this purpose, we aggregate this value for each group of indicators and normalise the final result.
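
The individual aggregation and defuzzification steps above can be sketched for a single criterion as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the triangular label definitions in `LABELS` are hypothetical (the actual shapes are those of Fig. 3), the truncated label sets are combined with a pointwise maximum (a common choice that the text does not specify), and the centroid is approximated on a regular grid.

```python
from typing import Dict, List, Tuple

# Assumption: hypothetical triangular definitions (left, peak, right) on [0, 1]
LABELS: Dict[str, Tuple[float, float, float]] = {
    "L": (0.0, 0.0, 0.5),
    "M": (0.0, 0.5, 1.0),
    "H": (0.5, 1.0, 1.0),
}


def membership(x: float, tfn: Tuple[float, float, float]) -> float:
    """Membership degree of x in a triangular fuzzy number."""
    left, peak, right = tfn
    if x < left or x > right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def label_weights(votes: List[str], expert_weights: List[float]) -> Dict[str, float]:
    """Weighted frequency of each label for one criterion (weighted mean)."""
    total = sum(expert_weights)
    weights = {label: 0.0 for label in LABELS}
    for label, w in zip(votes, expert_weights):
        weights[label] += w / total
    return weights


def centre_of_area(weights: Dict[str, float], steps: int = 1000) -> float:
    """Truncate each label with the minimum t-norm, combine the truncated
    sets with max (assumption) and return the centroid of the result."""
    num = den = 0.0
    for k in range(steps + 1):
        x = k / steps
        mu = max(min(weights[label], membership(x, tfn))
                 for label, tfn in LABELS.items())
        num += x * mu
        den += mu
    if den == 0.0:
        raise ValueError("empty fuzzy set: no label received any weight")
    return num / den


# Hypothetical survey: four experts with influence weights 3, 2, 2 and 1
score = centre_of_area(label_weights(["H", "H", "M", "L"], [3.0, 2.0, 2.0, 1.0]))
print(round(score, 3))
```

Repeating this for every criterion gives one crisp value per criterion, which is the input to the stratification and normalisation steps.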

3.2 Metaheuristic weighting

The process of evaluating goalkeepers in a handball championship and choosing the best of them can also be addressed as an optimisation problem.

Given the set of indicators that the experts establish for each goalkeeper in a match, two kinds of problem can be identified: (i) approximating the position of a player in the ranking of goalkeepers, and (ii) approximating the total score of a goalkeeper. The first is frequently a combinatorial problem, while the second is usually a continuous problem.

This paper combines approaches (i) and (ii), seeking an optimal set of weights from which the total score of the players can be calculated and an optimal ranking built. To do this, three objective functions based on the three metrics described in Sect. 4.1 were used, leading to three different optimisation problems. The first optimises the weights of the indicator set to obtain a ranking where the first position is occupied by the best goalkeeper. The second optimises the weights to obtain a ranking in which as many as possible of the top 5 goalkeepers indicated by the experts appear in the top 5, regardless of their order. Finally, the third optimises the weights of the indicator set to obtain a ranking in which the top 5 is the same, and in the same order, as in the ranking provided by the experts.

The solution of these problems will produce interesting results: on the one hand, it will be possible to analyse the subjectivity introduced by the experts, showing which indicators had more weight in their judgement. On the other hand, experts will have feedback to assess and improve their own criteria, since they usually evaluate the players according to qualitative criteria and opinions. Finally, it will be possible to detect whether there is subjectivity when choosing the best goalkeepers of the tournament by the European Handball Federation (EHF).

In order to solve these problems, metaheuristic algorithms were employed. Metaheuristics are algorithms designed to solve a broad range of optimisation problems by guiding the search process and trying to explore the whole search space (see Talbi 2002; Boussaïd et al. 2013). In comparison with heuristics, they operate at a higher level of abstraction: they can incorporate and control different heuristics, and they add mechanisms to escape from local optima in the search for the global optimum. Specifically, Particle Swarm Optimisation (PSO) was used in this study (see Shi and Eberhart 1998). It is a state-of-the-art population-based metaheuristic inspired by bird flocking. PSO starts by initialising a population of particles, which are the initial solutions, i.e. the initial weights for each indicator in the indicator set, to be optimised during the PSO execution. Each solution has position and speed attributes, which are initialised first. Then, the particles start to move, updating their positions and speeds according to the best position each particle has visited and the best position found by the swarm. These rules provide a good tradeoff between exploration and exploitation and support the algorithm’s convergence (see Jiang et al. 2014; Samma et al. 2016). The particular expressions for updating the position and speed of each particle depend on the specific PSO variant or implementation used. The pseudocode of a canonical PSO algorithm is shown in Algorithm 1.

Algorithm 1 Canonical PSO algorithm
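
For readers without access to the pseudocode, a minimal global-best PSO with an inertia weight can be sketched as below. It is an illustration under stated assumptions rather than the variant used in the experiments: the parameter values (`inertia`, `c_cognitive`, `c_social`), the \([0, 1]\) bounds on the weights and the toy objective are all hypothetical. In the setting of this paper, the objective would be one of the ranking metrics of Sect. 4.1 evaluated on the ranking induced by a weight vector.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_maximise(
    objective: Callable[[Sequence[float]], float],
    dim: int,
    n_particles: int = 20,
    iterations: int = 100,
    inertia: float = 0.7,        # assumed value
    c_cognitive: float = 1.5,    # assumed value
    c_social: float = 1.5,       # assumed value
    lower: float = 0.0,
    upper: float = 1.0,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Global-best PSO that maximises `objective` over [lower, upper]^dim."""
    rng = random.Random(seed)
    positions = [[rng.uniform(lower, upper) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    g = max(range(n_particles), key=lambda i: personal_score[i])
    global_best, global_score = personal_best[g][:], personal_score[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to own best + pull to swarm best
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + c_social * r2 * (global_best[d] - positions[i][d])
                )
                # Move and clip the weight to the allowed interval
                moved = positions[i][d] + velocities[i][d]
                positions[i][d] = min(upper, max(lower, moved))
            score = objective(positions[i])
            if score > personal_score[i]:
                personal_best[i], personal_score[i] = positions[i][:], score
                if score > global_score:
                    global_best, global_score = positions[i][:], score
    return global_best, global_score


# Toy objective (hypothetical): the maximum is at weights (0.3, 0.3, 0.3)
def toy_objective(w: Sequence[float]) -> float:
    return -sum((x - 0.3) ** 2 for x in w)


best_weights, best_score = pso_maximise(toy_objective, dim=3)
print(best_weights, best_score)
```

Because ranking metrics are piecewise constant in the weights, many weight vectors can reach the same score, so repeated runs with different seeds may return different but equally good solutions.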

3.3 Statistical weighting

In this section, two methodologies are established to assess the performance of a goalkeeper in a handball match. For this purpose, the value or weight of each type of significant action by the goalkeeper during a handball match is estimated. For example, each type of shot will be assessed as a goal, as a save by the goalkeeper or a miss by the shooter. Moreover, if the goalkeeper scores a goal or gives an assist it will also be evaluated among the other actions.

First of all, the official European statistics established by the European Handball Federation (EHF) include 25 types of action. To be able to use both methodologies based on the statistics of the goalkeepers, we have to make the following simplifications:

  1. There are 7 goalkeeper actions which ultimately have little weight because they occur so infrequently in a handball match. These are: Turnover, Assist, Received 7 meters, Missed shot to goal, Exclusion, Goalkeeper’s goal and Steal. Across all the goalkeepers of the European Championship and all the matches played, these 7 actions account for only 79 events, compared with 4611 events for the other actions. We give these 7 parameters a constant reference value based on the study of Antón García (2005), adding this weight for positive actions (for example, giving an assist) and subtracting it for negative actions (for example, missing a ball).

  2. In addition, when a player shoots, the outcome must be a goal, a save or a miss (throwing wide or high of the goal). To simplify matters, we consider only two possible outcomes, goal or no goal; that is, a save and a missed shot are treated as the same. It is assumed that a shot misses the goal because of the good performance of the goalkeeper; in either case, the goalkeeper’s team does not concede a goal. With this second simplification we go from calculating the weights for 6 (types of shots) \(*\) 3 (possibilities goal, save or miss) \(=\) 18 variables to 6 (types of shots) \(*\) 2 (possibilities goal or no goal) \(=\) 12 variables. In addition, there are far fewer missed shots than saved shots or goals; for example, 7-meter shots were missed (rather than saved) only 28 times out of 338 in total.

The first methodology will be grounded on the reference values established by Antón García (2005), which is an important study of handball, where the reference values for each type of action by handball players during various high-performance matches and championships are analysed.

After the simplifications explained above, 12 variables remain whose weights must be estimated. They can be grouped in pairs, since they are directly related: each type of shot is either a goal or not a goal. For each type of shot, we solve the following system of equations:

$$\begin{aligned} \begin{aligned} \beta _1 X + (1-\beta _1) Y = K_1 \\ \beta _2 X + (1-\beta _2) Y = K_2 \end{aligned} \end{aligned}$$
(2)

where X is the weight of each non-goal shot, Y is the weight of each goal shot, \(\beta _1\) is the normal reference value of the goalkeeper’s non-goal probability for this type of shot, and \(\beta _2\) is the reference value of the goalkeeper’s non-goal probability for this type of shot in a very negative performance (\(\beta _1\) and \(\beta _2\) are calculated from Antón García (2005)). The scoring is normalised to the range \([-5,5]\). Therefore, good performances will have a value \(K_1=2\) and a normal performance \(K_2=0\).

This system of equations is solved with its own reference values (\(\beta _1\) and \(\beta _2\)) for each type of shot, giving all the weights for actions based on the mean and variance of the global data from the 2020 European Handball Championship for the indicator set.
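
Since system (2) is linear in X and Y, it can be solved in closed form: subtracting the two equations gives \(X - Y = (K_1 - K_2)/(\beta _1 - \beta _2)\). The sketch below does exactly this for one type of shot. The function name is hypothetical and the \(\beta \) values in the example are placeholders, not the reference values of Antón García (2005).

```python
from typing import Tuple


def solve_balance(beta1: float, beta2: float, k1: float, k2: float) -> Tuple[float, float]:
    """Solve beta1*X + (1-beta1)*Y = k1 and beta2*X + (1-beta2)*Y = k2.

    Returns (X, Y): the weight of a non-goal shot and of a goal shot.
    """
    if beta1 == beta2:
        raise ValueError("beta1 and beta2 must differ for a unique solution")
    diff = (k1 - k2) / (beta1 - beta2)   # this is X - Y
    y = k1 - beta1 * diff                # from the first equation
    x = y + diff
    return x, y


# Placeholder reference values for one type of shot (assumption)
x_weight, y_weight = solve_balance(beta1=0.4, beta2=0.3, k1=2.0, k2=0.0)
print(x_weight, y_weight)
```

The same call is repeated once per type of shot, each with its own pair of reference values.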

The second methodology uses the same simplifications as the first, but \(\beta _1\), \(\beta _2\), \(K_1\) and \(K_2\) are calculated differently. In this case, they are calculated based on the total data of goalkeepers of the Handball European Championship held in 2020. Specifically, the average probability of goals over the total number of shots is calculated as \(\bar{x}\) and the standard deviations are calculated as \(\sigma \) for each type of shot, using the statistics from all the matches of the European Championship. The same formula as in (2) is applied, but in this second methodology:

  • \(\beta _1 = \bar{x} + \sigma \).

  • \(\beta _2 = \bar{x} - \sigma \).

  • \(K_1=5\), represents the score of a great performance.

  • \(K_2=-5\), represents the score of a very negative performance.

With these premises, both parameters are recalculated for each type of shot and the weight is calculated in each case, for shots that are goals and not goals.
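
A possible sketch of the second methodology for one type of shot is given below. It assumes that \(\bar{x}\) and \(\sigma \) are taken over a list of per-match rates and that \(\sigma \) is the population standard deviation; both are assumptions, as are the function name and the sample rates. The rates are simply whatever proportion \(\beta \) refers to in system (2).

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def second_method_weights(rates: Sequence[float]) -> Tuple[float, float]:
    """Weights (X, Y) from system (2) with beta1 = mean + sd, beta2 = mean - sd,
    K1 = 5 and K2 = -5."""
    x_bar = mean(rates)
    sigma = pstdev(rates)            # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("rates must vary for the system to have a solution")
    beta1, beta2 = x_bar + sigma, x_bar - sigma
    k1, k2 = 5.0, -5.0
    diff = (k1 - k2) / (beta1 - beta2)   # X - Y
    y = k1 - beta1 * diff
    x = y + diff
    return x, y


# Hypothetical per-match rates for one type of shot
sample_rates = [0.2, 0.3, 0.4, 0.3]
x_weight2, y_weight2 = second_method_weights(sample_rates)
print(x_weight2, y_weight2)
```

With these choices the system reduces algebraically to \(X = 5(1-\bar{x})/\sigma \) and \(Y = -5\bar{x}/\sigma \), so a type of shot with little variation between matches receives weights of larger magnitude.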

4 Experiment

For the implementation of the experiments, we analyse the best goalkeepers at the Handball European Championship held in 2020.

4.1 Evaluation metrics

This subsection describes the different metrics used in the paper to assess the results of the decision-making methods.

  • Mean Reciprocal Rank (MRR): This metric answers the question “Where is the first relevant item?” or, in this case, “Where is the Most Valuable Goalkeeper?”. More formally, MRR calculates the reciprocal of the rank at which the first relevant document was retrieved (see Craswell 2018). The Most Valuable Goalkeeper in the dataset used is Gonzalo Pérez de Vargas. Thus, given a dataset D where best is the player ranked first, the aim is to obtain a set of weights for the indicators from the indicator set in order to build a ranking where best ranks first. Formally, it is computed as the inverse of the position of the best player in the ranking built, according to Equation (3).

    $$\begin{aligned} MRR = \frac{1}{n}\sum _{i=1}^{n}\frac{1}{rank_{best}} \end{aligned}$$
    (3)

    where n is the number of times the ranking is built or the number of repetitions of the experiment and \(rank_{best}\) is the position occupied in the ranking by the best player.

    The MRR function is bounded between 0 and 1. Therefore, when the experiment is formulated as an optimisation problem, the aim is the following: given a dataset D, the best player best and a fixed number of experimental repetitions n, the aim is to obtain a set of weights which maximises MRR. This means the best player will rank first in the ranking built by the optimisation algorithm in each repetition. This problem is formulated by means of Eq.  4.

    $$\begin{aligned} \underset{rank_{best} \in \mathbb {Z}}{\hbox {Maximise}} \, MRR(D, best, n) \end{aligned}$$
    (4)
  • Mean Average Precision (MAP): This is a stricter metric than MRR. It evaluates the complete list of recommended items up to a specific cut-off N. In this way, MAP accounts for the elements of the top N that are found, weighting errors made at the top of the list heavily and gradually decreasing the importance of errors further down the list. More formally, MAP is the arithmetic mean of the average precision values for an information retrieval system over a set of N query topics (see Sakai (2007) and Beitzel et al. (2018)). The drawback of this metric is that it does not consider the recommended list as an ordered list.

    This paper uses a simplified version of MAP, computed using only the successes, that is, the elements of the top N that are identified. The errors made are neither weighted nor counted. Thus, MAP is computed according to Eq. (5).

    $$\begin{aligned} MAP = \frac{1}{N}\sum _{i=1}^{N}\frac{i}{rank_{i}} \,\,\, if \,\,\, i = rank_i \end{aligned}$$
    (5)

    Thus, MAP is a function bounded between 0 and 1 whose value reflects the number of top N goalkeepers detected out of the total of N. In the experiments carried out, \(N = 5\) has been fixed.

    When the problem is formulated as an optimisation problem, the formalisation is the following: given a dataset D and a cut-off N, the aim is to obtain a set of weights which maximises MAP formula according to Eq. (6).

    $$\begin{aligned} \underset{rank_i \in \mathbb {Z}}{\hbox {Maximise}} \, MAP(D, N) \end{aligned}$$
    (6)
  • A modified Manhattan distance (MMD): Inspired by Ekstrøm et al. (2019), this metric is defined to compare two rankings based on the distance between them. For each item, the absolute error can be interpreted as the Manhattan distance of the individual rankings from the ground-truth ranking. It therefore evaluates whether the top N elements of the ranking provided by the experts in the dataset D are exactly the same, and in the same order, as in the ranking built. In the experiments carried out, \(N = 5\) is fixed. This metric is bounded between 0 and 1 and provides a continuous value, which gives more accurate information about the deviation of the ranking built from the ranking provided by the experts. It is computed according to Eq. (7).

    $$\begin{aligned} MMD = \frac{1}{N}\sum _{i=1}^{N} \frac{1}{|i - rank_i| + 1} \end{aligned}$$
    (7)

    where N is the specific cut-off used to compute the metric, i is the position of a player in the ranking provided by the experts, and \(rank_i\) is the position held by that player in the ranking built. The metric computes the distance between the position of element i in the ranking provided by the experts and its position in the ranking built. It then takes the inverse and averages the results over the top N elements.

    When the problem is addressed as an optimisation problem, the formulation is the following: given a dataset D, and a cut-off N, the aim is to obtain a set of weights which maximises MMD according to Eq. (8).

    $$\begin{aligned} \underset{rank_i \in \mathbb {Z}}{\hbox {Maximise}} \, MMD(D, N) \end{aligned}$$
    (8)
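
The three metrics can be sketched as follows. MRR and MMD follow Eqs. (3) and (7) directly. For MAP, the sketch follows the prose description (the share of the experts' top N that appears in the top N of the ranking built, regardless of order), which is an interpretation rather than a literal transcription of Eq. (5). Function names and the goalkeeper labels are hypothetical, and the ranking built is assumed to contain every goalkeeper of the experts' top N.

```python
from typing import Sequence


def mrr(rank_best_per_run: Sequence[int]) -> float:
    """Eq. (3): mean over runs of 1 / (position of the best goalkeeper)."""
    return sum(1.0 / r for r in rank_best_per_run) / len(rank_best_per_run)


def simplified_map(expert_top: Sequence[str], built: Sequence[str]) -> float:
    """Share of the experts' top N found in the top N of the built ranking
    (order ignored; interpretation of the simplified MAP)."""
    n = len(expert_top)
    built_top = set(built[:n])
    return sum(1 for g in expert_top if g in built_top) / n


def mmd(expert_top: Sequence[str], built: Sequence[str]) -> float:
    """Eq. (7): mean over the experts' top N of 1 / (|i - rank_i| + 1)."""
    n = len(expert_top)
    position = {g: pos for pos, g in enumerate(built, start=1)}
    return sum(
        1.0 / (abs(i - position[g]) + 1)
        for i, g in enumerate(expert_top, start=1)
    ) / n


# Hypothetical example with N = 5 and placeholder goalkeeper labels
expert_ranking = ["A", "B", "C", "D", "E"]
built_ranking = ["A", "C", "B", "F", "D", "E"]

print(mrr([1, 2]))                                    # 0.75
print(simplified_map(expert_ranking, built_ranking))  # 0.8
print(mmd(expert_ranking, built_ranking))             # 0.6
```

In the example, four of the five expert-selected goalkeepers appear in the top 5 of the ranking built, while only the first is in exactly the right position, which is why MMD is lower than the simplified MAP.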

4.2 Alternative selection

For the implementation of the experiment, we have chosen the five goalkeepers selected as the best goalkeepers in the competition (see Table 2). The selection is based on their performances during the tournament (see Table 3). The Most Valuable Goalkeeper was determined 40 percent by the votes received from fans; a panel of EHF experts decided the remaining 60 percent.

Table 2 Selected GoalKeepers
Table 3 Excerpt of GoalKeepers Statistics in 2020 EHF Championship

4.3 Weighting results

This section presents the processes for calculating the weighting schemes for each evaluation criterion using the alternatives set out above, together with the results obtained.

4.3.1 Fuzzy decision-making

The set of criteria defined to evaluate a goalkeeper is given by 30 Spanish handball coaches who are weighted according to their coaching experience. As mentioned above, fuzzy linguistic terms, such as most important, important, and normal, were used to determine the coaches’ experience. For example, a National Handball coach was assigned the most important influence, while a Level 1 coach was assigned an important influence weighting.

Each expert can decide individually on the level of importance that should be given to each criterion value. For this purpose, they used the following linguistic label set:

$$\begin{aligned} S^3 =&\{ s^3_1 = \text{H (High)}, s^3_2 = \text{M (Medium)},\\&\,\, s^3_3 = \text{L (Low)}\} \end{aligned}$$

The results of the survey can be seen in Fig. 4 (left) (dark blue: very important, light blue: important, white: not very important).

Fig. 4 Expert Opinions (left) and Clustering (right)

Using the fuzzy definition of the linguistic labels (see Fig. 3), each criterion assessment can be fuzzified. Once it has been verified that there is enough consensus among the opinions collected from the experts, an aggregation process is carried out to obtain a weight for each criterion. For this purpose, a linear combination of the elements, according to the weight of the criteria, is applied.

Then, after a defuzzification process, we obtain a weighting for each criterion, represented as a weight vector (see Table 7, MCDM-R). This approach does not guarantee that the criteria weights are normalised, which makes comparison with other methods difficult. Thus, to ensure that the weighting scheme is consistent and comparable with the other techniques, we perform a normalisation process (see Table 7, MCDM-N).
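
The fuzzification, aggregation, defuzzification and normalisation steps described above can be sketched as follows. This is only an illustrative sketch under several assumptions that are not taken from the paper: the labels are modelled as triangular fuzzy numbers with made-up supports (the actual definitions are those of Fig. 3), the experts' influence is reduced to crisp numeric weights, centroid defuzzification is used, and the opinions shown are invented.

```python
# Hypothetical triangular fuzzy numbers (a, b, c) for the label set S^3.
# The real membership functions are those defined in Fig. 3.
LABELS = {"L": (0.0, 0.0, 0.5), "M": (0.0, 0.5, 1.0), "H": (0.5, 1.0, 1.0)}


def aggregate(opinions, expert_weights):
    """Weighted linear combination of the experts' triangular fuzzy numbers."""
    total = sum(expert_weights)
    agg = [0.0, 0.0, 0.0]
    for label, w in zip(opinions, expert_weights):
        for k, value in enumerate(LABELS[label]):
            agg[k] += w * value / total
    return tuple(agg)


def defuzzify(tfn):
    """Centroid of a triangular fuzzy number (a, b, c)."""
    return sum(tfn) / 3.0


def normalise(weights):
    """Rescale the crisp weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Invented opinions of three experts on two criteria, and assumed
# crisp experience weights for those experts.
opinions = {"7mSaves": ["H", "H", "M"], "9mGoals": ["M", "L", "L"]}
expert_weights = [3.0, 2.0, 2.0]

raw = {c: defuzzify(aggregate(labels, expert_weights))
       for c, labels in opinions.items()}
normalised = normalise(raw)
```

Here `raw` plays the role of the unnormalised vector (MCDM-R) and `normalised` that of the normalised one (MCDM-N), on toy data only.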

In addition, with the aim of reducing the sparsity of the weightings obtained, we carry out a clustering/stratification process as can be seen in Fig. 4 (right).

Consequently, different groups of indicators, with their corresponding weightings, are found. We then aggregate these values for each group of indicators and normalise the final result. See the weighting scheme obtained in Table 7, MCDM-G. Table 8 shows the results obtained according to the evaluation metrics using these weighting schemes. A priori, a weighting scheme based on expert opinion could be an excellent strategy to evaluate goalkeeper performance. However, the results obtained show that this procedure is not especially useful for identifying the best goalkeepers in a tournament. The main problem is that the frequency of actions is ignored when evaluating the goalkeeper. This weakness penalises those goalkeepers who play longer during the tournament: they are the most likely candidates for best goalkeeper in the competition, yet they may accumulate more of the negative actions considered important by the experts. Nevertheless, the results obtained allow us to order the top 5 goalkeepers almost correctly (see Table 6). The technique can therefore be useful for ranking goalkeepers previously chosen by other criteria.

4.3.2 Metaheuristic weighting

The process of finding the optimal weights for the indicators from the indicators set using metaheuristic algorithms is simple, except for two issues that must be defined a priori: the first is related to the codification of the individuals or particles in the case of PSO. The second is concerned with the fitness function which will be used to run the metaheuristic algorithm.

Each individual is encoded as a real-valued vector with as many elements as there are indicators in the indicator set. Each element of each individual is initialised within the bounds defined for the corresponding indicator. The algorithm therefore takes an input matrix of dimension \(m \times n\), where m is the number of individuals in the population (30 individuals were used) and n is the number of indicators. The fitness function that the algorithm optimises must be chosen according to the nature of the problem (combinatorial or continuous) and the implementation of the algorithm. The metrics described in Subsect. 4.1 were used, leading to three different optimisation problems. To validate the results obtained, each problem was run 100 times.
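
The encoding and the fitness evaluation can be sketched as follows. This is a minimal sketch that deliberately omits the search step itself (PSO or any other metaheuristic). The function names, the bounds and the toy statistics are assumptions, goalkeepers are assumed to be scored by a weighted sum of their indicators, and the fitness shown uses the standard reciprocal-rank definition, which is assumed to match the one in Subsect. 4.1.

```python
import random


def init_population(bounds, m=30, seed=0):
    """Build an m x n matrix: one real-valued weight vector per individual."""
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(m)]


def rank_goalkeepers(weights, stats):
    """Score each goalkeeper as a weighted sum and sort from best to worst."""
    scores = {name: sum(w * x for w, x in zip(weights, row))
              for name, row in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(expert_best, ranking):
    """1 / position of the experts' best goalkeeper in the ranking built."""
    return 1.0 / (ranking.index(expert_best) + 1)


def fitness(weights, stats, expert_best):
    """Fitness of one individual: the metric applied to its ranking."""
    return reciprocal_rank(expert_best, rank_goalkeepers(weights, stats))


# Toy data: two hypothetical indicators (saves, goals conceded) per goalkeeper.
stats = {"A": [10, 2], "B": [5, 1], "C": [8, 6]}
bounds = [(0.0, 1.0), (-1.0, 0.0)]  # assumed bounds for each indicator weight
population = init_population(bounds)
best = max(population, key=lambda ind: fitness(ind, stats, "A"))
```

A metaheuristic would repeatedly update `population` using the fitness values; selecting `best` from the initial random population is shown only to illustrate how the fitness function is consumed.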

The results obtained in the metaheuristic weighting experiments are shown in Table 7. The table lists the attributes (indicators) of the indicator set, together with their values for each metric. Mean and median values for each of the described metrics are reported to provide more accurate information. Furthermore, the values of the metrics for each set of weights shown in Table 7 were computed. Table 8 shows these results. Although MeanManh and MedManh are not capable of identifying the best goalkeeper in the tournament, they give good results when identifying and ranking the top 5 goalkeepers.

4.3.3 Statistical weighting

The process of calculating weights for the indicators is explained in Subsect. 3.3. After simplification, the weights are estimated by solving Eq. (2). Solving it yields the weight of each goal and no goal for each type of shot.

Tables 4 and 5 show the values (\(\beta _1\), \(\beta _2\), \(K_1\) and \(K_2\)) for methodology 1 and methodology 2, respectively. Methodology 1 is based on the reference values indicated by Antón García (2005), and methodology 2 is based on the mean and variance of the global data from the Men’s EHF EURO 2020 for the indicator set.

Table 4 Parameters for methodology 1
Table 5 Parameters for methodology 2

The results obtained in the statistical weighting experiment are shown in Table 7. The results are consistent if we take into account the possible game-changing actions for a goalkeeper in a match and the value of these actions. The main problem of this methodology is that it does not take into account the number of matches played by each goalkeeper. In high-performance handball the players are more effective, so the goalkeepers who play the most games concede more goals and their score is thus penalised.

4.4 Results and discussion

Table 6 shows the ranking results obtained for each technique and their relative order (in parentheses) among the top 5 selected. Table 7 shows the weights calculated per action for every technique implemented. The evaluation of these results according to the performance metrics is shown in Table 8. The most suitable strategy for answering the three research questions is to use a metaheuristic optimisation algorithm in order to optimise the distance-based performance metric.

Table 6 Selected Goalkeepers Ranks
Table 7 Weights according to the method
Table 8 Value of the metrics for the weights obtained in each methodology

The results obtained through the use of a decision-making mechanism based on fuzzy logic are not suitable for any of the three metrics. This is because the proposed mechanism is based on the importance of the actions without considering their frequency. Thus, a negative action that is very frequent will hurt goalkeepers who play many games. Nevertheless, it is capable of ordering the top five goalkeepers correctly (see Table 6), and so it can be considered a proper technique for sorting alternatives previously chosen through other techniques, based on the frequency of actions. Regarding the weights obtained by this approach in comparison to the best fit (which is obtained by using metaheuristic algorithms) shown in Table 7, it should be noted that five features are not weighted (TO, ST, Miss, Excl and P7) because they are not considered relevant by the experts. In addition, 6mCSaves, WingGoals, BTSaves, BTGoals, FBGoals and 9mGoals features show a large deviation from the weights obtained by using metaheuristic algorithms because the fuzzy approach ignores the frequency of the actions.

The best results, according to the selected performance metrics, are obtained by using the metaheuristic-based weighting method. However, from the point of view of handball experts, it is not reasonable that WingSaves and 9mSaves are more valuable for a goalkeeper than 7mSaves, FBSaves and 6mSaves. This is because this methodology optimises by fitting the 5 best goalkeepers in the tournament, and it is likely that the variables that make the difference are in the types of shot a goalkeeper has more chance of saving. Moreover, this is a “blind” approach which tries to optimise a performance metric, and therefore the results are not always easily interpretable, as the algorithm can over-fit the data to maximise the metric. Despite this, metaheuristic weighting provides the best result in all three metrics considered. Finally, we should not forget that the choice of the 5 best goalkeepers in the tournament is subjective and is established by the organisation.

Concerning the results obtained with statistical weighting, they are not the best among the metrics used, but neither are they bad, because they place 3 goalkeepers within the best 5 (see Table 6). As for the weights obtained for each indicator from the point of view of handball experts, they are the most reasonable since they are based on the study carried out by Antón García (2005) and on the European Handball Federation’s own statistics. Comparing the weights obtained by the statistical approach to the weights obtained by metaheuristic algorithms, it should be noted that the variability of the statistical weights is very low, in both positive and negative features, which makes it difficult to differentiate properly between two goalkeepers.

Finally, with respect to the results obtained by the different approaches in the three metrics considered (see Table 8) the results can be grouped by metric. Regarding reciprocal rank, the fuzzy approach obtained the worst result with a maximum value of 0.125, followed by the statistical approach which returned 0.200 as the maximum value. Finally, the metaheuristic approach was able to identify the best goalkeeper in the tournament, returning a value of 1. For the MAP metric, the fuzzy approach gave the worst value, 0.200, followed by the statistical approach which returned a value of 0.600. The best value was again provided by the metaheuristic algorithm, which achieved a value of 0.800. In this case, the difference between the statistical and metaheuristic algorithms was not so great. Finally, concerning the modified Manhattan distance, the fuzzy and statistical approaches gave near-zero values, while the metaheuristic approach obtained 0.828 as the maximum value. Therefore, the data shown in Table 8 confirm that metaheuristic algorithms are the best quantitative approach for evaluating the performance of handball goalkeepers.
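
As a reading aid, and assuming that the reciprocal rank of Subsect. 4.1 follows the standard definition (the inverse of the position p at which the experts’ best goalkeeper appears in the ranking built), the values reported above translate into positions as follows:

$$\begin{aligned} RR = \frac{1}{p} \;\Rightarrow \; p = \frac{1}{RR}: \quad \frac{1}{0.125} = 8, \quad \frac{1}{0.200} = 5, \quad \frac{1}{1} = 1 \end{aligned}$$

Under that assumption, the fuzzy schemes place the best goalkeeper eighth at best, the statistical schemes fifth at best, and the metaheuristic scheme first.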

5 Conclusions

The evaluation of handball goalkeepers is a highly complex procedure that focuses on examining several different criteria. This paper shows how to implement this process and compares several soft-computing techniques to process the evaluation results.

To this end, an evaluation scenario based on the Men’s EHF EURO 2020 was designed to compare several soft-computing techniques in a real tournament. This experiment shows that the metaheuristic-based method outperformed the other approaches in terms of identifying and ranking the best handball goalkeepers. This implies that the criteria weights to be used to identify and rank the best goalkeepers can be obtained by using this algorithm. The method also allows the expert evaluations, which are often difficult to obtain in a disaggregated form, to be estimated and quantified. On the other hand, the use of statistical techniques allows the best goalkeepers in the tournament to be identified and sorted with acceptable results. Furthermore, it offers more consistent weights from the point of view of the game, which is advantageous to the general evaluation of the performance of handball goalkeepers.

In contrast, the use of a fuzzy technique based on expert opinions produced poor results in terms of choosing the best goalkeepers in the tournament. The main weakness of this technique is that it ignores the frequency of the actions when assigning a specific weight in the evaluation process. This factor is detrimental to some goalkeepers in comparison to others.

In future work, this method can be explored or optimised along the following lines. Firstly, a new soft-computing technique could be developed by applying a fuzzy aggregation operator to combine expert opinion with the statistical characteristics of each criterion. Secondly, it is possible to develop an adaptive metaheuristic algorithm that considers the insights obtained by the statistical techniques. The algorithm developed will therefore be “less blind”, since it will seek to optimise the objective function defined, and will do so taking into account the observations of the statistical approach, which provide practical information about the reality of each game.

Finally, the results obtained can be used as part of an evaluation model for handball goalkeepers. The application of this model could help those who make decisions about team line-ups, who sign new players, who choose national team goalkeepers, or who need to choose better goalkeepers for matches, among many other applications.