Abstract
Judging by its significant potential to affect the outcome of a game in one single action, the penalty kick is arguably the most important set piece in football. Scientific studies on how the ability to convert a penalty kick is distributed among professional football players are scarce. In this paper, we consider how to rank penalty takers in the German Bundesliga based on historical data from 1963 to 2021. We use Bayesian models that improve inference on ability measures of individual players by imposing structural assumptions on an associated high-dimensional parameter space. These methods prove useful for our application, coping with the inherent difficulty that many players only take few penalties, making purely frequentist inference rather unreliable.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Publications from many disciplines deal with the analysis of the world’s most popular sport, football. Particular attention is paid to the study of set pieces, especially those that offer imminent opportunity for scoring a goal. For example, Baranda and Lopez-Riquelme (2012) and Strafford et al. (2019) study how corner kicks are carried out depending on the current score and examine their relevance for the match outcome. Penalty kicks have also been investigated. Kuhn (1988), Loy (1998), Kamp (2006), and Noël et al. (2014) examine penalty shootouts from psychological and cognitive perspectives based on field experiments and devise strategies for penalty takers. This interest is hardly surprising: for instance, since 1978 about every fifth knockout game at the FIFA World Cup has been decided by penalty shootouts. This results in an estimated chance of 67% that a future world champion must successfully complete at least one penalty shootout on his way to the title (Memmert and Noël 2020).
Penalty kicks in German leagues have been investigated by, e.g. Bornkamp et al. (2009) who use a nonparametric Bayesian approach to rank Bundesliga keepers by penalty saving abilities based on historical data. They find little support for significant differences between goalkeepers, but motivate an investigation of penalty taker ability. Using logistic regression, Kuss et al. (2007) refute the widespread myth that a fouled player should not take the penalty kick because of the alleged greater risk of failure. They argue that this may be due to a selection process that results in penalty shooters being largely homogeneous with respect to their resilience in stress situations. Furthermore, coaches presumably select penalty takers already before the game due to knowledge of such special characteristics and skill.
Historical data such as the ranking of Bundesliga penalty shooters published by the German football magazine Kicker (2021) suggest the existence of significant differences in players’ ability to score penalties. However, such rankings are often based on the absolute number of penalties converted and thus fail to capture how efficient players are in exploiting chances. Even rankings based on relative frequencies are questionable: for players frequently nominated as penalty takers, the relative frequency may be a viable estimate. However, for players with small \(n_i\), the actual conversion probability is likely to be estimated imprecisely.
In this paper, we discuss shrinkage estimates from Bayesian beta-binomial models for ranking penalty takers in the German Bundesliga. This approach offers statistical power in estimating the skill of players with little information in borrowing information from players who have taken many penalties. We analyse historical data on 993 players who took a total of 4955 penalties in first division matches from season 1963/64, when the Bundesliga was founded, to 2020/21. We have compiled the necessary dataset from German websites kicker.de and transfermarkt.de. Section 2 examines empirical Bayes estimates from beta-binomial models. We discuss moment matching estimates and estimates based on informative priors obtained from frequentist regression. Section 3 presents more flexible fully Bayesian hierarchical models and compares findings with the results of Sect. 2 using a Bayesian information criterion and out of sample analysis. Based on the outcomes, we compile rankings for active players in the German Bundesliga.Footnote 1 Section 4 concludes.
2 Empirical Bayes estimates of penalty conversion ability
The empirical distribution of penalties taken is shown in the left panel of Fig. 1. It is evident that the distribution is right skewed with mode at 1. The empirical distribution of estimated success rates \({\widehat{\theta }}_i:=y_i/n_i\), shown in the right panel of Fig. 1, illustrates why rankings based on such a measure are problematic: we find a multimodal distribution with much probability mass near 0 and 1. This is due to many players with few (often a single) attempts who have successfully converted all kicks or, at the other extreme, did not score at all. Owing to insufficient information, the skill of these players is thus likely to be estimated imprecisely by purely frequentist methods.
A preliminary approach for obtaining more reliable estimates of penalty conversion abilities \(\theta _i\in [0,1]\) for a set of players indexed by \(i=1,\dots ,N\) is to use a shrinkage estimator in a compound model
where \(\theta _i\) are latent realisations from the ability distribution \(\pi (\theta )\). For modelling success rates in sports, this concept was introduced by Efron and Morris (1972, 1977) who analysed baseball hit ratios using the James–Stein estimator which they motivate as a Bayesian point estimator assuming Gaussian \(\pi (\cdot )\).
Particularly, for samples with many small \(n_i\) it seems more adequate to assume a beta-binomial model, where
i.e. a player’s score \(y_i\) in \(n_i\) attempts is generated from a binomial distribution with success rate \(\theta _i\). We model \(\theta _i\) via a reparameterised beta prior with expectation \(\mu := \frac{\alpha }{\alpha + \beta }\) and precision measure \(\nu := \alpha + \beta\). We thus have
where \(\pi (y_i\vert \theta _i)\) is the usual binomial likelihood and \(\pi (\theta _i \vert \mu ,\nu )\) denotes the prior. By conjugacy, the posterior \(\pi (\theta _i\vert y_i)\) is \({Beta}(y_i+\nu \mu , n_i-y_i+\nu (1-\mu ))\). An empirical Bayes (EB) approach to conduct inference via (4) specifies an informative prior from point estimates for \(\mu\) and \(\nu\).
2.1 Empirical Bayes estimates using moment matching
An instructive approach discussed by, e.g. Carlin and Louis (2000) uses moment matching (MM) to estimate the hyperparameters. When the \(n_i\) are equal across samples, MM estimators are directly derived from the expected value and variance formulae for a marginal distribution of binomial proportions with success probability \(\theta _i\sim {Beta}(\nu \mu ,\nu (1-\mu ))\). The case of unequal \(n_i\) is more complicated, as in the compound model (2) the \({\widehat{\theta }}_i\) are independently distributed with \(E ({\widehat{\theta }}_i) = \mu\), but \(Var ({\widehat{\theta }}_i) = \mu (1-\mu )n_i^{-1}+(\nu +1)^{-1}\mu (1-\mu )(1-n_i^{-1})\) is a function of \(n_i\). This makes it preferable to give different weight to the \({\widehat{\theta }}_i\). For known variances, Kleinman (1973) suggests to estimate \(\mu\) by the (best linear unbiased) estimator
with inverse variance weights \(w_i\). The precision \(\nu\) is then estimated by
which obtains from matching \({\widehat{\sigma }}^2 := \sum _{i=1}^N w_i({\widehat{\theta }}_i-{\widehat{\mu }})^2\) with the variance of a \({Beta}(\nu \mu ,\nu (1-\mu ))\) random variable.Footnote 2 If \(w_i\) are unknown, \({\widehat{\mu }}\) and \({\widehat{\nu }}\) can be computed iteratively using (5) and (6) with initial values \(w_i = n_i\) and updates \(w_i = n_i/[1+({\tilde{\nu }}+1)^{-1}(n_i-1)]\), where \({\tilde{\nu }}\) is the current value of \({\widehat{\nu }}\).
Using (4) and estimates \({\widehat{\mu }}\) and \({\widehat{\nu }}\), frequentist conversion probability estimates \({\widehat{\theta }}_i := y_i/n_i\) are then updated by computing posterior means
where \({\widehat{S}}_i\) is the shrinkage factor towards the beta prior for player i. It is evident from (7) that more information on individual ability (that is, large \(n_i\)) lessens the effect of the prior on \(\breve{\theta }_i\), i.e. information from the population of penalty takers has less weight for individual estimates. At the other extreme, the prior information dominates estimates of conversion probabilities when \(n_i\) is small, resulting in an adjustment towards the overall mean.
Applied to our data, we obtain hyperparameter estimates \({\widehat{\mu }} = .6788\), and \({\widehat{\nu }} = 1.3405\), resulting in a Beta(.9099, .4306) prior. The corresponding bimodal density is shown in red in Fig. 1. Posterior quantities are presented in Fig. 2. The left panel of Fig. 2 compares relative frequencies against MM estimates from the corresponding posterior distribution. It can be seen that including prior information has a plausible effect: shrinkage is low for observations where the data provides much information whereas point estimates for small \(n_i\) are markedly adjusted. This is especially relevant for observations with \({\widehat{\theta }}_i\in \{0,1\}\) and small \(n_i\) where estimates are most strongly corrected towards \({\widehat{\mu }}\). This is sensible as accurate estimates are possible for players with many penalties whereas players with only a few penalties are shrunken towards the mean.
Furthermore, the posterior distributions shown in the right panel of Fig. 2 show that for players with a small \(n_i\), dominance of the prior over the likelihood leads to greater uncertainty of the estimated conversion probability. The effect of shrinkage can also be judged from the dashed black line which shows a kernel density estimate (KDE) of the distribution of posterior means. The shape of the KDE indicates that MM gives more realistic estimates of expected scoring probabilities than the pure frequentist estimates. This is especially evident for the tails of the distribution which have substantially less probability mass compared to the density of the \({\widehat{\theta }}_i\) in Fig. 1.
Table 2 lists the top and bottom 10 historical ranks of Bundesliga penalty kickers based on MM estimates along with 95% credible intervals (CIs). Interestingly, compared to the score-based ranking in Table 1 we find that none of the top listed players made it into the best 10, which are now led by Hans-Joachim Abel with 100% scoring frequency in 16 trials. However, the other top players also converted all their penalties, i.e. the correction due to the prior distribution is small and score probability estimates and CIs are consistently close to 1. For the last 10 ranks, we find that shrinkage leads to somewhat stronger adjustments of the \({\widehat{\theta }}_i\). Tim Borowski, a former German international, for example, is attested a posterior mean scoring probability of 21% despite 3 failed penalties out of 3 attempts. However, this estimate is also subject to considerable uncertainty as his 95% CI is [.005, .643].
2.2 Empirical Bayes estimates from informative beta priors
Several points should be emphasised when interpreting the estimates from the MM approach in Sect. 2.1. First, the prior has most probability mass at extreme abilities and high variance, i.e. MM causes the prior weight to be strongly shaped by players for whom frequentist ability estimates \({\widehat{\theta }}_i\) are most unreliable. Therefore, few penalties are needed for the data to dominate the prior. However, adequacy of the implied shrinkage is an empirical issue and relates to how fast the \({\widehat{\theta }}_i\) converge towards the true abilities as players progressively take penalties. To get an impression of the convergence speed Fig. 3 visualises trajectories of the \({\widehat{\theta }}_i\) over time for a sub-sample of the players: as ability estimates inherently stabilise later in the career, we consider players with \(n_i\ge 10\). Even for players with \(n_i\approx 20\), we observe enough time-variability in scoring rates for shrinkage towards the prior to be attractive. Notably, given the MM estimate \({\widehat{\nu }} = 1.3405\) it can be computed from (7) that the shrinkage factor \({\widehat{S}}_i\) decays quickly with \(n_i\) and is already at .4 for only \(n_i=2\) attempts, which seems rather fast.
Figure 3 also shows evidence of the selection process outlined in Sect. 1: we do not observe players who repeatedly fail to score and yet continue taking penalties: players in the sub-sample who start off with a failed attempt usually have a scoring streak afterwards and thus keep getting nominated. This selection might result in overestimation of abilities for players who have performed poorly and only had a few attempts.
We next consider EB estimates based on informative beta priors
For this, we estimate prior hyperparameters incorporating player performance data \({\varvec{x}}_i\) by maximising the frequentist regression beta-binomial likelihood
with \(\Gamma (\cdot )\) the Gamma function. Prior mean \(\mu ^s_i := \alpha _i^s/\nu ^s\) and precision \(\nu ^s := \alpha _i^s + \beta _i^s\) have logit and log links in linear predictors,
To get a notion of the selection effect and the impact of likelihood-based prior parameter estimates, we compare posterior means from two preliminary models first: one where the \(\mu _i^s\) are assumed constant across players and one where \(\mu _i^s\) depend on \(n_i\). Note that players who are known as good penalty kickers will take penalties more often, making \(n_i\) a natural predictor of the scoring ability.
For the baseline model, we obtain hyperparameter estimates \({\widehat{\mu }}^s = .74\) and \({\widehat{\nu }}^s = 41.04\). The latter can be interpreted as the prior sample size and gives the threshold beyond which the data dominates the prior in contributing to posterior means, see (7). This suggests that the impact of the prior is significantly stronger than for MM estimates, cf. Fig. 2. Figure 4 shows posterior distributions and means for both preliminary models and provides evidence of stronger shrinkage. This is particularly noticeable for players with extreme scoring ratios and few attempts, although players with large \(n_i\) are now also subject to stronger adjustment towards the prior. The red dots above the \(45\deg\)-line show that players with few attempts tend to be rated worse in the model adjusted for \(n_i\) than in the baseline model with only a constant. Consequently, frequently nominated players tend to be better rated once we accommodate for the effect of selection: the colour transition in posterior means indicates that penalty takers are systematically better rated from approximately 20 penalties onwards as priors become more informative. This is also reflected in estimation uncertainty displayed by posterior distributions where the variance is generally lower for the adjusted model and decreases markedly with \(n_i\), cf. the densities shown above the square. We discuss ability point estimates, the implied shrinkage, and resulting rank reversals with respect to results from MM in more detail for fully Bayesian estimation in Sect. 3.1.
We also consider priors that additionally depend on miss, the number of missed penalties, i.e. penalty kicks that did not require activity by the goalkeeper because the ball went past the goal, hit the crossbar or one of the posts, see Table 3 for summary statistics on the regressors. The rationale behind this is as follows: players whose penalties are often harmless for the keeper are presumably bad candidates and are also less likely to be nominated again. Our dataset contains 265 players with at least one penalty that did not require intervention by the goalkeeper. In total, 54 of these players were not nominated again after missing their first and only penalty. Including miss as a predictor for the scoring probability in \({\varvec{x}}_i'\varvec{\gamma }\) should attenuate the correction towards the population average for these players and also allow better discrimination between players with similar \({\widehat{\theta }}_i\) but different failure rate. Frequentist coefficient estimates \(\widehat{\varvec{\gamma }}\) are given in Table 8 in [Appendix]. We find evidence for the expected effects, i.e. significant positive (negative) estimates for n (miss).
Posterior distributions and their means for informative priors are shown in Fig. 5. As anticipated from the results for the basic models in Fig. 3, the shrinkage in the extended model is significantly stronger than for MM. The KDE for the posterior means indicates that point estimates are significantly concentrated here as well and only occasionally fall below 50% (the median is .6978). Nevertheless, our prior choice yields substantial differences to relative frequency estimates in some cases. For example, former HSV player Sergej Barbarez with four missed penalties in nine attempts is now the worst rated player with a posterior mean of .426 (cf. Table 4 and the leftmost orange-coloured posterior distribution in Fig. 4). Though below the average posterior mean of 69.7%, Tim Borowski (2 saves and 1 miss in 3 attempts) is rated much better now with a posterior scoring probability of 60.7% (and 95% CI [.503, .706]). At the other extreme, prior information yields more optimistic and less uncertain estimates for frequently nominated and thus potentially better players. The ranking is led by Michael Zorc whose posterior mean of 92% exceeds his score ratio of 86%, cf. Table 1.
3 Fully Bayesian analysis of penalty conversion probabilities
A potential drawback of the methods discussed in Sect. 2 is the choice of prior: we obtain inference on the \(\theta _i\) using individual as well as prior information which depends on aggregate information from all observations through hyperparameter point estimates. However, using plug-in estimators for \(\mu\) and \(\nu\) (or \(\alpha\) and \(\beta\)) implies using an estimated posterior distribution. Doing so bears the risk of overfitting and being overly optimistic about the precision of the results. This is especially problematic for players with small \(n_i\). A refined approach hence is to assume a joint probability model for all parameters: instead of letting prior information reflect a (conditional) population average, we allow information on the individual conversion abilities to contribute probabilistically to the posterior of the hyperparameters, which then shapes posteriors of the \(\theta _i\). This is achieved in fully Bayesian hierarchical models where we put priors (hyperpriors) on the model hyperparameters.
3.1 A fully Bayesian beta-binomial hierarchical model
For the baseline hierarchical model (HB) we retain model distributions as in (2) but obtain the joint posterior \(\pi ({\varvec{\theta }},\alpha ,\beta \vert {\varvec{y}})\) from a fully Bayesian approach that assumes an exchangeable beta prior, that is, \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ) = \prod _{i=1}^N \pi (\theta _i\vert \alpha ,\beta )\). Together with the binomial likelihood \(\pi ({\varvec{y}}\vert {\varvec{\theta }}, \alpha , \beta )\) and a hyperprior \(\pi (\alpha , \beta )\) we thus have
By standard results, the marginal posterior of \((\alpha , \beta )\) follows as
Regarding the choice of an “objective” hyperprior we follow Gelman et al. (2013) and use the Pareto kernel
which obtains by placing a uniform density on \(\varphi (\alpha ,\beta ):=(\mu , \nu ^{-1/2})\) and using the Jacobian of \(\varphi\) to transform back to the original scale.Footnote 3 Showing that using (15) as a prior for \((\alpha ,\beta )\) in (14) yields a proper marginal posterior if \(0<y_i<n_i\) for at least one i follows from standard arguments.Footnote 4
Contours of \(\pi (\alpha ,\beta \vert {\varvec{y}})\) based on a numeric evaluation of the functional in (14) with \(\pi (\alpha ,\beta )\) as in (15) are shown in the left panel of Fig. 6. The marginal posterior density has a mode at (23.42, 8.87) and mean at (33.01, 11.83) which, chosen as hyperparameters in a beta prior, roughly agrees with expected value and prior sample size of the beta prior in the baseline likelihood-based EB model of Sect. 2.Footnote 5 This seems reasonable since both models rely on the full likelihood but \(\pi (\alpha ,\beta )\) in HB introduces additional uncertainty. We thus expect that shrinkage towards the prior will be stronger than for MM.
The left panel of Fig. 6 reveals two marked features of \(\pi (\alpha ,\beta \vert {\varvec{y}})\). First, the contours indicate that regions with \(\alpha + \beta \le 2\) have virtually zero density. Second, the correlation between \(\alpha\) and \(\beta\) is strongly positive. By properties of the beta distribution, the marginal posterior thus places nearly all probability mass on hyperparameter combinations \((\alpha ,\beta )\) that yield unimodal beta priors with similar central tendency. This is markedly different from the bimodal prior obtained from MM in Sect. 2.1 which has most probability mass in the tails, giving substantial weight to extreme outcomes. Furthermore, the additional uncertainty from imposing the hyperprior (15) on \((\alpha ,\beta )\) implies higher uncertainty in outcomes from the (conditional) posterior distribution in the hierarchical model. Consequently, we expect estimates of posterior quantities to have higher variance than MM estimates. We will address these aspects in more detail in Sect. 3.2.
Although the \(\pi (\theta _i\vert \alpha ,\beta ,{\varvec{y}})\) are beta distributions, uncertainty in \((\alpha ,\beta )\) in HB does not allow direct simulation from the conditional posteriors as for the empirical Bayes models where using a standard routine for generating beta variates with fixed hyperparameters suffices. Instead, a procedure that incorporates variability in the hyperparameters due to the hyperprior (15) is needed. The idea is to first generate samples from the marginal posterior \(\pi (\alpha ,\beta \vert {\varvec{y}})\) which are then used to draw from \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ,{\varvec{y}})\). We use the following easily implemented discrete-grid sampling scheme suggested in Gelman et al. (2013) to generate random variates from the joint posterior \(\pi ({\varvec{\theta }}, \alpha , \beta \vert {\varvec{y}})\).
Sampling from the joint posterior \(\pi ({\varvec{\theta }}, \alpha , \beta \vert {\varvec{y}})\).
-
1.
Simulate from \(\pi (\alpha ,\beta \vert {\varvec{y}})\). Draw \((\alpha _\ell ,\beta _\ell )\), \(\ell =1,\dots ,L\) for sufficiently large L using the following discrete-grid sampling procedure:
-
1.1
Compute the (unnormalised) marginal posterior \(\pi (\alpha ,\beta \vert {\varvec{y}})\) with hyperprior (15) over a sufficiently large grid. Normalise by approximating the density as a step function with unit integral over the grid.
-
1.2
Obtain \(\pi (\alpha \vert {\varvec{y}})\) by discrete summation.
For \(\ell =1,\dots ,L\):
-
1.3
Draw \(\alpha _\ell\) from \(\pi (\alpha \vert {\varvec{y}})\) and draw \(\beta _\ell\) from \(\pi (\beta \vert \alpha ,{\varvec{y}})\) given \(\alpha _\ell\) using the discrete inverse CDF method.
-
1.4
Add a uniform zero-mean jitter to \(\alpha _\ell\) and \(\beta _\ell\) to get continuously distributed samples.
-
1.1
-
2.
Simulate from \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ,{\varvec{y}})\). For each \(i=1,\dots ,N\), use the \((\alpha _\ell ,\beta _\ell )\) to obtain L draws of \(\theta _i\) from the conditional posterior distribution \(\theta _i\vert \alpha ,\beta ,{\varvec{y}}\sim {Beta}(\alpha + y_i, \beta + n_i - y_i).\)
3.1.1 Results
Posterior quantities are approximated using \(L=50000\) draws generated with the above sampling scheme. The right panel of Fig. 6 shows KDEs of the posterior distributions of the \(\theta _i\vert \alpha ,\beta ,{\varvec{y}}\). We find that modelling uncertainty in the hyperparameters by the joint probability model results in the \(\pi (\theta _i\vert \alpha ,\beta ,{\varvec{y}})\) being significantly shrunk towards each other. As for the EB models, this correction is most pronounced for penalty takers with little information since the beta prior in the hierarchical model is shaped significantly by players with many attempts—a sub-sample whose information is more relevant for inference on players where data is scarce. Comparing the kernel density estimate for the distribution of posterior means to the distribution of MM posterior means in Fig. 2 further illustrates this feature.
Figure 7 compares posterior means and relative frequencies \({\widehat{\theta }}_i\) to examine shrinkage of frequentist estimates implied by the fully Bayesian approach somewhat more closely. Consistent with the discussion of implications of the hyperprior in the previous section we find more pronounced shrinkage on the estimated penalty kicking ability of the players. It is noticeable that modelling the hyperparameters via the joint probability distribution results in a pronounced correction of scoring probabilities towards the posterior population average for players with an intermediate number of attempts and \({\widehat{\theta }}_i\in (0.5,1)\)—a sub-sample for which MM estimates often differ only marginally from \({\widehat{\theta }}_i\). Two further features distinguish the results from those of the MM approach. First, failed penalties are more influential in HB than for MM, i.e. the posterior is adjusted towards lower scoring probabilities if a player demonstrates lack of skill in several attempts. Consequently, we expect order reversals in the outcomes relative to MM. Furthermore, for \({\widehat{\theta }}_i\) close to unity the uncertainty as visualised by vertical grey bars in Fig. 7 is larger than for MM where the precision of the results seems to be exaggerated, cf. Fig. 2.
Table 5 shows the top and bottom 10 ranks for posterior mean estimates in the fully Bayesian hierarchical model along with 95% CIs. Overall, we find that the posterior means are corrected more strongly compared to MM. Lower bounds of the 95% CIs for top players fall in a more realistic range near the average posterior scoring probability. We find that the all-time ranking in the hierarchical model is led by Manfred Kaltz, now with Lothar Matthäus in second place. Hans-Joachim Abel holds the third place but is closely followed by several renowned players, amongst them Michael Zorc and the presumably best active penalty taker in first-tier Bundesliga, Bayern Munich’s Robert Lewandowski, see also Table 7.Footnote 6
As expected, the upward correction for players in the lower tail of the ability distribution is more pronounced in the hierarchical model. Again, the 95% CIs appear more reasonable, the more so as upper bounds overlap with the intervals of upper-ranking players. This appears natural: given pressure through the public, trainers tend to waive assigning players to further penalties once they have failed some, especially when further candidates are on the team. Yet, these players—being trained professionals—of course would be able to perform with a substantially higher success rate than observed so far if they were assigned to further penalties.
For the bottom ranks, we observe more order reversals which, as explained above, result from higher failure weighting implied by the prior. Tim Borowski, for instance, is again ranked amongst the worst penalty takers in Bundesliga history but ranks slightly better than Bruno Labbadia with only one goal in five attempts and posterior mean conversion probability of about 67.5%.Footnote 7 Interestingly, Borowski ranks almost as high as Klaas-Jan Huntelaar who failed in 7 out of 16 attempts.
In view of these results, it is worth discussing how appropriate the implied shrinkage is. Mean and mode as well as the highest density region in the conditional posterior distribution for (\(\alpha ,\beta\)) in Fig. 6 suggests that about 30–40 penalties are needed for the data to dominate the prior. This finding fits well with the maximum likelihood estimate \({\widehat{\nu }}\approx 41\) for the baseline model in Sect. 2.2. However, considering the convergence of relative frequencies reported in Fig. 3 this value may seem too high. That said, model adequacy should not be assessed entirely by discrepancies between Bayesian and frequentist estimates. For a more informed evaluation of model fits Sect. 3.2 compares models using Bayesian information criteria and a test sample covering penalties in season 2020/21.
3.2 Additional Markov chain Monte Carlo estimates and model comparison
It seems worthwhile—e.g. for predictive purposes—to know which active first division players are the most likely to convert future penalties. To this end, we first assess adequacy of the discussed models. We also consider further models that satisfy the Bayesian paradigm and incorporate uncertainty in hyperparameters. Here, we use the same regressor sets as for the EB models and fit fully Bayesian beta-binomial regressions (BBBR) with logit and identity links for \(\mu\) and \(\nu\), setting vague Gaussian and gamma priors. Posterior quantities for these models are obtained from Markov Chain Monte Carlo (MCMC) sampling.Footnote 8
For Bayesian applications, metrics based on an estimate of the (expected log) predictive density (ELPD) have been suggested for evaluating out-of-sample predictive performance. For comparison of the Bayesian models, we use the leave-one-out cross-validation criterion (LOOIC) proposed by Vehtari et al. (2017) which approximates the LOOCV estimate of ELPD by regularised importance sampling. Results are given in Table 6. Generally, the EB models with informative priors and their MCMC counterparts perform very similarly, yet MCMC fits are reported to be slightly inferior in each case, which is likely due to prior induced uncertainty in parameters. The MM approach is clearly dominated by the other models. As expected, the beta-binomial regression models with constant linear predictor from Sect. 2.2 perform very similarly as HB with Pareto prior for (\(\alpha , \beta\)).Footnote 9 LOOIC indicates that the most complex beta-binomial regression model performs best among the fully Bayesian models.Footnote 10
For a comparison with the naïve frequentist strategy of using independent binomial proportions \({\widehat{\theta }}_i\) as ability estimates, we compute binary cross-entropy loss using predicted scoring probabilities for 94 penalties kicked in the 2020/21 season by players with \(n_i\ge 1\).Footnote 11 As is seen from the ll column in Table 6, log loss for the test dataset also indicates that MM performs worst among the Bayesian models, while full BBBR does best. Clearly, the frequentist approach performs worst.
We now discuss the results for the BBBR model with \({\varvec{x}}_i = (1, n_i, miss_i)'\) in more detail and examine ability estimates of top and bottom active Bundesliga players.Footnote 12 The results are shown in Table 7.Footnote 13 The ranking is topped by Robert Lewandowski with a posterior scoring probability of 88.4%, closely followed by Max Kruse and Andrej Kramaric.Footnote 14 At the bottom of the first division ranking, we find Borussia Mönchengladbach’s Breel Embolo and Sebastian Rudy of TSG Hoffenheim who both missed their only Bundesliga penalties to date. Again, we conclude that rank reversals from the BBBR model with \({\varvec{x}}_i = (1, n_i, miss_i)'\) can be insightful. For example, Erling Haaland (one save and one miss in four attempts) with all-time score-based rank 406 is rated almost as Thorgan Hazard (2 saves and 2 misses out of 12 attempts) and all-time rank 119.
4 Discussion and outlook
This article conducts an analysis of the penalty-kicking abilities of Bundesliga players for a historical sample covering seasons 1963/64 to 2020/21. An inherent problem in estimating scoring probabilities and establishing a ranking is that the data are imbalanced among players. From a statistical perspective, sparsity in observations resulting from many penalty takers having only taken a few penalties is challenging. Purely likelihood-based estimation where players are treated as independent then yields imprecise results.
We have shown that a ranking based on hierarchical Bayesian models in which the accuracy of all players is modelled by a joint probability distribution have advantages over a purely frequentist approach. In comparison to common rankings based on the relative scoring frequency or even the absolute number of goals, such as that of Kicker (2021), we find some notable rank reversals using posterior estimates from hierarchical models. This predominantly concerns lower quantiles of the distribution where the shrinkage is greatest. Changes in the top ranks are usually small. This is as expected, as players in the top ranks include those that tend to convert many penalties. Hence, they get assigned to further penalties, resulting in many data points for those players. Thus, purely likelihood-based inference already is fairly informative.
Although the Bayesian results may be statistically more convincing than a comparison of plain scoring frequencies, the posterior mean results should be interpreted with consideration of the associated uncertainty. While differences in posterior means are not negligible—suggesting that there are indeed players who are significantly more accurate on penalties than others—credible intervals tend to overlap considerably. This pattern is evident both for the historical overall ranking and also for separate rankings of active players in the first and second Bundesliga.
For further research, it might be interesting to use a dataset that provides more detailed information on the individual penalty kicks for each player to develop a model that can be used for predicting outcomes of penalties in future Bundesliga games.
Notes
Note that our parameterisation differs from Kleinman (1973) who uses \(\mu :=\alpha /(\alpha +\beta )\) and \(r := (\alpha + \beta + 1)^{-1}\), i.e. \(\theta \sim {Beta}((r^{-1} - 1)\mu , (r^{-1}-1)(1-\mu ))\).
Being a precision measure of the beta prior, \(\nu\) inversely affects the variance of the scoring probabilities. Placing a uniform prior on \(\nu ^{-1/2}\) is thus approximately equivalent to placing a diffuse prior on the standard deviation of \({\varvec{\theta }}\).
See Gelman et al. (2013) for further discussion of prior choice.
We have also considered alternative weakly informative prior specifications for \(\mu\) and \(\nu\), e.g. independent uniform and gamma. However, this led to virtually the same results which we omit for brevity.
We recommend googling Michael Kutzop for the remarkable story of his single missed penalty.
Note that discrepancies in \({\widehat{\theta }}_i\) and CIs for players with equal performance are an artefact from approximating posterior distributions by simulation.
In BBBR models, we use \(\mathcal N(0,1.5)\) priors for the \(\gamma _i\) and set a Gamma(4.5, 0.1) prior for \(\nu\) which, similarly as for the fully Bayesian hierarchical model, has expectation 45 but roughly \(95\%\) probability mass on [1, 85].
Note that the HB model can be cast as a BBBR with linear predictor \(\mu = \gamma _0 \sim {\mathcal {U}}[0,1]\) and \(\nu =\sigma _0\sim {Pareto}(\eta ,-1.5)\) with reasonably small lower bound \(\eta >0\).
Following a suggestion of an anonymous referee, it would also be interesting to jointly model the (categorical) probability distribution of the triplet (\({y}\), \({miss}\), \({save}\)) instead of using counts of misses as a predictor for the scoring probability in BBBR, in order to account for differing \(\textit{rates}\) instead of \(\textit{counts}\) of misses. We however leave this idea for future research for brevity here.
For ll to be defined under the frequentist paradigm, we let \({\widehat{\theta }}_i = 1-10^{-5}\) to handle cases where \(y_i = n_i\).
Posterior quantities for model parameters are shown in Table 9 in [Appendix].
Complete results for active first division players are given in Table 10 of [Appendix], which also gives player performance data at the end of season 2020/21. We have also estimated conversion probabilities for historical data on \(2^{nd}\) tier Bundesliga. See Table 12 of [Appendix] for the emerging ranking of active players.
Estimates based on a dataset that also includes DFB Cup penalties put Union Berlin’s Max Kruse ahead of Robert Lewandowski. However, credible intervals have considerable overlap, see Table 11 in [Appendix].
References
Baranda, P.S.D., Lopez-Riquelme, D.: Analysis of corner kicks in relation to match status in the 2006 world cup. Eur. J. Sport Sci. 12(2), 121–129 (2012)
Bornkamp, B., Fritsch, A., Kuss, O., Ickstadt, K.: Penalty specialists among goalkeepers: a nonparametric Bayesian analysis of 44 years of German Bundesliga, pp. 63–76. Physica-Verlag HD, Heidelberg (2009)
Bürkner, P.C.: Brms: an R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80(1), 1–28 (2017)
Carlin, B.P., Louis, T.A.: Bayes and empirical Bayes methods for data analysis, 2nd edn. Chapman & Hall, London (2000)
Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A.: Stan: a probabilistic programming language. J. Stat. Softw. 76(1),(2017)
Efron, B., Morris, C.: Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika 59(2), 335–347 (1972)
Efron, B., Morris, C.: Stein’s paradox in statistics. Sci. Am. 236, 119–127 (1977)
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D.: Bayesian Data Analysis, 3rd edn. Chapman & Hall, London (2013)
Kamp, J.V.D.: A field simulation study of the effectiveness of penalty kick strategies in soccer: late alterations of kick direction increase errors and reduce accuracy. J. Sports Sci. 24(5), 467–477 (2006)
Kicker: Elfmeterstatistik der Spieler: Bundesliga. (2021). https://www.kicker.de/bundesliga/elfmeter-schuetzen-historie. Accessed 10 June 2021
Kleinman, J.C.: Proportions with extraneous variance: single and independent samples. J. Am. Stat. Assoc. 68(341), 46–54 (1973)
Kuhn, W.: Penalty-kick strategies for shooters and goalkeepers. In Reilly, T. (ed). Science and Football: Proceedings of the First World Congress of Science and Football, Liverpool, 13–17th April 1987. E. & F.N. Spon, London (1988)
Kuss, O., Kluttig, A., Stoll, O.: The fouled player should not take the penalty himself: an empirical investigation of an old German football myth. J. Sports Sci. 25(9), 963–967 (2007)
Loy, R.: Handlungsstrategien von Torhütern und Schützen in der Strafstosssituation des Fussballsports. Sportwissenschaft rund um den Fussball, pp. 67–78 (1998)
Memmert, D., Noël, B.: The Penalty Kick: The Psychology of Success. Meyer & Meyer Sport, London (2020)
Noël, B., Furley, P., van der Kamp, J., Dicks, M., Memmert, D.: The development of a method for identifying penalty kick strategies in association football. J. Sports Sci. 1–10 (2014)
Strafford, B.W., Smith, A., North, J.S., Stone, J.A.: Comparative analysis of the top six and bottom six teams’ corner kick strategies in the 2015/2016 English premier league. Int. J. Perform. Anal. Sport 19(6), 904–918 (2019)
Vehtari, A., Gelman, A., Gabry, J.: Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 27(5), 1413–1432 (2017)
Acknowledgements
We thank Janine Langerbein and Jens Klenke for excellent research assistance.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Christoph Hanck gratefully acknowledges support by the German Research Foundation (DFG) through the Collaborative Research Center “Statistical modelling of nonlinear dynamic processes” (SFB 823, Project A4). The authors thank the associate editor and two anonymous reviewers for valuable comments which significantly improved the paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hanck, C., Arnold, M.C. Hierarchical Bayes modelling of penalty conversion rates of Bundesliga players. AStA Adv Stat Anal 107, 177–204 (2023). https://doi.org/10.1007/s10182-021-00420-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10182-021-00420-w