1 Introduction

Publications from many disciplines deal with the analysis of the world’s most popular sport, football. Particular attention is paid to the study of set pieces, especially those that offer an immediate opportunity to score a goal. For example, Baranda and Lopez-Riquelme (2012) and Strafford et al. (2019) study how corner kicks are carried out depending on the current score and examine their relevance for the match outcome. Penalty kicks have also been investigated. Kuhn (1988), Loy (1998), Kamp (2006), and Noël et al. (2014) examine penalty shootouts from psychological and cognitive perspectives based on field experiments and devise strategies for penalty takers. This interest is hardly surprising: for instance, since 1978 about every fifth knockout game at the FIFA World Cup has been decided by a penalty shootout. This results in an estimated 67% chance that a future world champion must successfully complete at least one penalty shootout on his way to the title (Memmert and Noël 2020).

Penalty kicks in German leagues have been investigated by, e.g., Bornkamp et al. (2009), who use a nonparametric Bayesian approach to rank Bundesliga keepers by their penalty-saving abilities based on historical data. They find little support for significant differences between goalkeepers, but motivate an investigation of penalty-taker ability. Using logistic regression, Kuss et al. (2007) refute the widespread myth that a fouled player should not take the penalty kick because of an allegedly greater risk of failure. They argue that this may be due to a selection process that results in penalty takers being largely homogeneous with respect to their resilience in stress situations. Furthermore, coaches presumably select penalty takers before the game based on their knowledge of such characteristics and skills.

Table 1 Score-based all-time top and bottom penalty takers in Bundesliga

Historical data such as the ranking of Bundesliga penalty takers published by the German football magazine Kicker (2021) suggest the existence of significant differences in players’ ability to score penalties. However, such rankings are often based on the absolute number of penalties converted and thus fail to capture how efficiently players exploit their chances. Even rankings based on relative frequencies are questionable: for players frequently nominated as penalty takers, the relative frequency may be a viable estimate. For players with only a few attempts (small \(n_i\), where \(n_i\) denotes the number of penalties taken by player i), however, the actual conversion probability is likely to be estimated imprecisely.

In this paper, we discuss shrinkage estimates from Bayesian beta-binomial models for ranking penalty takers in the German Bundesliga. This approach gains statistical power in estimating the skill of players with little information by borrowing information from players who have taken many penalties. We analyse historical data on 993 players who took a total of 4955 penalties in first division matches from season 1963/64, when the Bundesliga was founded, to 2020/21. We have compiled the necessary dataset from the German websites kicker.de and transfermarkt.de. Section 2 examines empirical Bayes estimates from beta-binomial models. We discuss moment matching estimates as well as estimates based on informative priors obtained from frequentist regression. Section 3 presents more flexible fully Bayesian hierarchical models and compares their findings with the results of Sect. 2 using a Bayesian information criterion and an out-of-sample analysis. Based on the outcomes, we compile rankings for active players in the German Bundesliga.Footnote 1 Section 4 concludes.

2 Empirical Bayes estimates of penalty conversion ability

The empirical distribution of penalties taken is shown in the left panel of Fig. 1. It is evident that the distribution is right skewed with mode at 1. The empirical distribution of estimated success rates \({\widehat{\theta }}_i:=y_i/n_i\), shown in the right panel of Fig. 1, illustrates why rankings based on such a measure are problematic: we find a multimodal distribution with much probability mass near 0 and 1. This is due to many players with few attempts (often just a single one) who have successfully converted all kicks or, at the other extreme, did not score at all. Owing to insufficient information, the skill of these players is thus likely to be estimated imprecisely by purely frequentist methods.

Fig. 1

Historic distribution of penalties taken and penalty conversion rates. Notes: Data from Kicker (2021). Left: absolute frequencies of penalties taken in seasons 1963/64–2019/20. Right: distribution of penalty conversion rates along with the corresponding MM-based beta prior density (red line) (colour figure online)

A preliminary approach for obtaining more reliable estimates of penalty conversion abilities \(\theta _i\in [0,1]\) for a set of players indexed by \(i=1,\dots ,N\) is to use a shrinkage estimator in a compound model

$$\begin{aligned} y_i\sim \pi (y_i\vert \theta _i),\quad \theta _i\sim \pi (\theta ), \end{aligned}$$
(1)

where \(\theta _i\) are latent realisations from the ability distribution \(\pi (\theta )\). For modelling success rates in sports, this concept was introduced by Efron and Morris (1972, 1977), who analysed baseball hit ratios using the James–Stein estimator, which they motivate as a Bayesian point estimator assuming a Gaussian \(\pi (\cdot )\).

Particularly, for samples with many small \(n_i\) it seems more adequate to assume a beta-binomial model, where

$$\begin{aligned} y_i&\sim {Bin}(\theta _i, n_i), \\ \theta _i\vert \alpha ,\beta&\sim iid\ {Beta}(\alpha ,\beta ), \end{aligned}$$
(2)

i.e. a player’s score \(y_i\) in \(n_i\) attempts is generated from a binomial distribution with success rate \(\theta _i\). We model \(\theta _i\) via a reparameterised beta prior with expectation \(\mu := \frac{\alpha }{\alpha + \beta }\) and precision measure \(\nu := \alpha + \beta\). We thus have

$$\begin{aligned} \pi (\theta _i\vert y_i)&\,\propto \pi (y_i\vert \theta _i)\pi (\theta _i\vert \mu ,\nu ) \end{aligned}$$
(3)
$$\begin{aligned}&\,\propto \theta _i^{y_i+\nu \mu -1}(1-\theta _i)^{n_i-y_i+\nu (1-\mu )-1}, \end{aligned}$$
(4)

where \(\pi (y_i\vert \theta _i)\) is the usual binomial likelihood and \(\pi (\theta _i \vert \mu ,\nu )\) denotes the prior. By conjugacy, the posterior \(\pi (\theta _i\vert y_i)\) is \({Beta}(y_i+\nu \mu , n_i-y_i+\nu (1-\mu ))\). An empirical Bayes (EB) approach conducts inference via (4) by specifying an informative prior through point estimates of \(\mu\) and \(\nu\).
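For illustration, the conjugate update in (4) is straightforward to compute once plug-in values for \(\mu\) and \(\nu\) are available. The following minimal sketch (in Python; the function name and the example values in the comment are ours and purely illustrative):

```python
def eb_posterior(y_i, n_i, mu, nu):
    """Conjugate update (4): posterior Beta parameters for one player,
    given plug-in prior hyperparameters mu and nu."""
    a_post = y_i + nu * mu                 # first Beta parameter
    b_post = n_i - y_i + nu * (1 - mu)     # second Beta parameter
    return a_post, b_post

# e.g. 3 goals in 4 attempts under a prior with mu = 0.7 and nu = 2:
# eb_posterior(3, 4, 0.7, 2.0) -> (4.4, 1.6)
```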

2.1 Empirical Bayes estimates using moment matching

An instructive approach discussed by, e.g., Carlin and Louis (2000) uses moment matching (MM) to estimate the hyperparameters. When the \(n_i\) are equal across samples, MM estimators are directly derived from the expected value and variance formulae for a marginal distribution of binomial proportions with success probability \(\theta _i\sim {Beta}(\nu \mu ,\nu (1-\mu ))\). The case of unequal \(n_i\) is more complicated, as in the compound model (2) the \({\widehat{\theta }}_i\) are independently distributed with \(E ({\widehat{\theta }}_i) = \mu\), but \(Var ({\widehat{\theta }}_i) = \mu (1-\mu )n_i^{-1}+(\nu +1)^{-1}\mu (1-\mu )(1-n_i^{-1})\) is a function of \(n_i\). This makes it preferable to give different weight to the \({\widehat{\theta }}_i\). For known variances, Kleinman (1973) suggests estimating \(\mu\) by the (best linear unbiased) estimator

$$\begin{aligned} {\widehat{\mu }} :=\, \frac{\sum _{i=1}^N w_i{\widehat{\theta }}_i}{w},\quad w := \sum _{i=1}^N w_i \end{aligned}$$
(5)

with inverse variance weights \(w_i\). The precision \(\nu\) is then estimated by

$$\begin{aligned} {\widehat{\nu }} :=\, \frac{{\widehat{\mu }}(1-{\widehat{\mu }}) \left[ \sum _{i=1}^N w_i\left( 1-\frac{w_i}{w}\right) - \sum _{i=1}^N\frac{w_i}{n_i}\left( 1-\frac{w_i}{w}\right) \right] }{{\widehat{\sigma }}^2 - {\widehat{\mu }}(1-{\widehat{\mu }}) \left[ \sum _{i=1}^N\frac{w_i}{n_i}\left( 1-\frac{w_i}{w}\right) \right] } - 1, \end{aligned}$$
(6)

which obtains from matching \({\widehat{\sigma }}^2 := \sum _{i=1}^N w_i({\widehat{\theta }}_i-{\widehat{\mu }})^2\) with the variance of a \({Beta}(\nu \mu ,\nu (1-\mu ))\) random variable.Footnote 2 If \(w_i\) are unknown, \({\widehat{\mu }}\) and \({\widehat{\nu }}\) can be computed iteratively using (5) and (6) with initial values \(w_i = n_i\) and updates \(w_i = n_i/[1+({\tilde{\nu }}+1)^{-1}(n_i-1)]\), where \({\tilde{\nu }}\) is the current value of \({\widehat{\nu }}\).
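A compact sketch of this iteration (assuming `y` and `n` are NumPy arrays of goals scored and attempts per player; the function name and convergence settings are ours):

```python
import numpy as np

def mm_hyperparameters(y, n, tol=1e-8, max_iter=200):
    """Iterative moment-matching estimates of (mu, nu), cf. (5) and (6)."""
    theta_hat = y / n
    w = n.astype(float).copy()                     # initial weights w_i = n_i
    nu_hat = np.inf
    for _ in range(max_iter):
        W = w.sum()
        mu_hat = np.sum(w * theta_hat) / W                          # (5)
        sigma2 = np.sum(w * (theta_hat - mu_hat) ** 2)
        a = np.sum(w * (1 - w / W))
        b = np.sum((w / n) * (1 - w / W))
        nu_new = mu_hat * (1 - mu_hat) * (a - b) / (
            sigma2 - mu_hat * (1 - mu_hat) * b) - 1                 # (6)
        if np.isfinite(nu_hat) and abs(nu_new - nu_hat) < tol:
            nu_hat = nu_new
            break
        nu_hat = nu_new
        w = n / (1 + (n - 1) / (nu_hat + 1))       # updated inverse-variance weights
    return mu_hat, nu_hat
```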

Using (4) and estimates \({\widehat{\mu }}\) and \({\widehat{\nu }}\), frequentist conversion probability estimates \({\widehat{\theta }}_i := y_i/n_i\) are then updated by computing posterior means

$$\begin{aligned} \breve{\theta }_i := {\widehat{S}}_i {\widehat{\mu }} + (1-{\widehat{S}}_i){\widehat{\theta }}_i, \quad {\widehat{S}}_i := \frac{{\widehat{\nu }}}{{\widehat{\nu }} + n_i} \end{aligned}$$
(7)

where \({\widehat{S}}_i\) is the shrinkage factor towards the beta prior for player i. It is evident from (7) that more information on individual ability (that is, large \(n_i\)) lessens the effect of the prior on \(\breve{\theta }_i\), i.e. information from the population of penalty takers has less weight for individual estimates. At the other extreme, the prior information dominates estimates of conversion probabilities when \(n_i\) is small, resulting in an adjustment towards the overall mean.
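Given \({\widehat{\mu }}\) and \({\widehat{\nu }}\), the posterior means (7) and credible intervals for all players can then be computed in a vectorised manner. A brief sketch, with names of our own choosing:

```python
import numpy as np
from scipy.stats import beta

def mm_posterior_summary(y, n, mu_hat, nu_hat, level=0.95):
    """Shrinkage estimates (7) and equal-tailed credible intervals per player."""
    shrink = nu_hat / (nu_hat + n)                          # shrinkage factor S_i
    theta_breve = shrink * mu_hat + (1 - shrink) * y / n    # posterior means (7)
    a_post = y + nu_hat * mu_hat                            # Beta posterior parameters, cf. (4)
    b_post = n - y + nu_hat * (1 - mu_hat)
    lo, hi = beta.interval(level, a_post, b_post)
    return theta_breve, lo, hi
```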

Applied to our data, we obtain hyperparameter estimates \({\widehat{\mu }} = .6788\) and \({\widehat{\nu }} = 1.3405\), resulting in a Beta(.9099, .4306) prior. The corresponding bimodal density is shown in red in Fig. 1. Posterior quantities are presented in Fig. 2. The left panel of Fig. 2 compares relative frequencies against MM estimates from the corresponding posterior distribution. It can be seen that including prior information has a plausible effect: shrinkage is low for observations where the data provide much information, whereas point estimates for small \(n_i\) are markedly adjusted. This is especially relevant for observations with \({\widehat{\theta }}_i\in \{0,1\}\) and small \(n_i\), where estimates are most strongly corrected towards \({\widehat{\mu }}\). This is sensible: accurate estimates are possible for players with many penalties, whereas players with only a few penalties are shrunk towards the mean.

Fig. 2

Empirical Bayes—shrinkage towards a beta prior from moment matching. Notes: Penalties taken in seasons 1963/64–2019/20. Left: relative frequencies \({\widehat{\theta }}_i\) and posterior means \(\breve{\theta }_i\) with horizontal \(\mathcal U(-.005, .005)\) jitter. The solid red line marks the set of equal estimates. The dashed red line shows the prior mean estimate \({\widehat{\mu }} = .67\). Grey bars show 95% credible intervals. Right: Beta posterior distributions. The dashed black line shows a kernel density estimate of the distribution of posterior means (colour figure online)

Furthermore, the posterior distributions in the right panel of Fig. 2 show that for players with a small \(n_i\), dominance of the prior over the likelihood leads to greater uncertainty in the estimated conversion probability. The effect of shrinkage can also be judged from the dashed black line, which shows a kernel density estimate (KDE) of the distribution of posterior means. The shape of the KDE indicates that MM gives more realistic estimates of expected scoring probabilities than the purely frequentist estimates. This is especially evident in the tails of the distribution, which have substantially less probability mass compared to the density of the \({\widehat{\theta }}_i\) in Fig. 1.

Table 2 lists the top and bottom 10 historical ranks of Bundesliga penalty takers based on MM estimates along with 95% credible intervals (CIs). Interestingly, compared to the score-based ranking in Table 1 we find that none of the top-listed players makes it into the best 10, which is now led by Hans-Joachim Abel with a 100% scoring rate in 16 attempts. However, the other top players also converted all their penalties, i.e. the correction due to the prior distribution is small and scoring probability estimates and CIs are consistently close to 1. For the last 10 ranks, we find that shrinkage leads to somewhat stronger adjustments of the \({\widehat{\theta }}_i\). Tim Borowski, a former German international, for example, is assigned a posterior mean scoring probability of 21% despite having failed all 3 of his attempts. However, this estimate is also subject to considerable uncertainty, as his 95% CI is [.005, .643].

Table 2 Empirical Bayes estimates—top and bottom 10 penalty takers

2.2 Empirical Bayes estimates from informative beta priors

Several points should be emphasised when interpreting the estimates from the MM approach in Sect. 2.1. First, the prior has most probability mass at extreme abilities and high variance, i.e. MM causes the prior weight to be strongly shaped by players for whom frequentist ability estimates \({\widehat{\theta }}_i\) are most unreliable. Therefore, few penalties are needed for the data to dominate the prior. However, the adequacy of the implied shrinkage is an empirical issue and relates to how fast the \({\widehat{\theta }}_i\) converge towards the true abilities as players progressively take penalties. To get an impression of the convergence speed, Fig. 3 visualises trajectories of the \({\widehat{\theta }}_i\) over time for a sub-sample of the players: as ability estimates inherently stabilise later in the career, we consider players with \(n_i\ge 10\). Even for players with \(n_i\approx 20\), we observe enough time-variability in scoring rates for shrinkage towards the prior to be attractive. Notably, given the MM estimate \({\widehat{\nu }} = 1.3405\), it can be computed from (7) that the shrinkage factor \({\widehat{S}}_i\) decays quickly with \(n_i\) and is already down to \({\widehat{S}}_i = 1.3405/(1.3405+2) \approx .40\) after only \(n_i=2\) attempts, which seems rather fast.

Fig. 3

Convergence of relative frequencies. Note: Trajectories of relative frequency estimates for \(n_i\ge 10\). \(r_{ij} := (j-1)/(n_i-1),\, p_{ij} :=\sum _{l=1}^{j}y_{il}/j,\, j\in \{1,\dots ,n_i\}\) with \(y_{il}\in \{0,1\}\) the outcome of penalty l for player i. Spline fits are shown for better readability

Figure 3 also shows evidence of the selection process outlined in Sect. 1: we do not observe players who repeatedly fail to score and yet continue taking penalties. Players in the sub-sample who start off with a failed attempt usually have a scoring streak afterwards and thus keep getting nominated. This selection might result in an overestimation of abilities for players who have performed poorly and only had a few attempts.

We next consider EB estimates based on informative beta priors

$$\begin{aligned} \theta _i|{\widehat{\mu }}^s_i,{\widehat{\nu }}^s\sim {Beta}({\widehat{\nu }}^s{\widehat{\mu }}^s_i, {\widehat{\nu }}^s(1-{\widehat{\mu }}^s_i)). \end{aligned}$$
(8)

For this, we estimate the prior hyperparameters incorporating player performance data \({\varvec{x}}_i\) by maximising the frequentist beta-binomial regression likelihood

$$\begin{aligned}&L(\varvec{\gamma }, \sigma _0\vert {\mathbf {y}}) = \prod _{i=1}^N f(y_i\vert \varvec{\gamma }, \sigma _0), \end{aligned}$$
(9)
$$\begin{aligned}&f(y_i\vert \varvec{\gamma }, \sigma _0)=\left( {\begin{array}{c}n_i\\ y_i\end{array}}\right) \frac{\Gamma (y_i+\alpha _i^s)\Gamma (n_i-y_i+\beta _i^s)\Gamma (\alpha _i^s + \beta _i^s)}{\Gamma (n_i + \alpha _i^s + \beta _i^s)\Gamma (\alpha _i^s)\Gamma (\beta _i^s)}, \end{aligned}$$
(10)

with \(\Gamma (\cdot )\) the Gamma function. Prior mean \(\mu ^s_i := \alpha _i^s/\nu ^s\) and precision \(\nu ^s := \alpha _i^s + \beta _i^s\) have logit and log links in linear predictors,

$$\begin{aligned} \mu ^s_i := logit ^{-1}\left( {\varvec{x}}_i'\varvec{\gamma }\right) ,\quad \nu ^s := \exp (\sigma _0)^{-1}. \end{aligned}$$
(11)
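One way to obtain the maximum likelihood estimates of (9)–(11) is direct numerical optimisation of the log-likelihood. A minimal sketch, assuming arrays `y` and `n` and a design matrix `X` built from the regressors (function names and optimiser settings are ours; dedicated beta-binomial regression packages could be used instead):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def neg_loglik(params, X, y, n):
    """Negative log-likelihood of the beta-binomial regression (9)-(11)."""
    gamma, sigma0 = params[:-1], params[-1]
    mu = expit(X @ gamma)                  # logit link for the prior mean
    nu = np.exp(-sigma0)                   # nu^s = exp(sigma_0)^{-1}
    a, b = nu * mu, nu * (1 - mu)
    ll = (gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
          + gammaln(y + a) + gammaln(n - y + b) + gammaln(a + b)
          - gammaln(n + a + b) - gammaln(a) - gammaln(b))
    return -ll.sum()

# usage sketch, e.g. X = np.column_stack([np.ones_like(n), n, miss]):
# fit = minimize(neg_loglik, x0=np.zeros(X.shape[1] + 1), args=(X, y, n), method="BFGS")
```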

To get a notion of the selection effect and the impact of likelihood-based prior parameter estimates, we first compare posterior means from two preliminary models: one where the \(\mu _i^s\) are assumed constant across players and one where the \(\mu _i^s\) depend on \(n_i\). Note that players who are known as good penalty takers will take penalties more often, making \(n_i\) a natural predictor of scoring ability.

For the baseline model, we obtain hyperparameter estimates \({\widehat{\mu }}^s = .74\) and \({\widehat{\nu }}^s = 41.04\). The latter can be interpreted as the prior sample size and gives the threshold beyond which the data dominate the prior in contributing to posterior means, see (7). This suggests that the impact of the prior is considerably stronger than for the MM estimates, cf. Fig. 2. Figure 4 shows posterior distributions and means for both preliminary models and provides evidence of stronger shrinkage. This is particularly noticeable for players with extreme scoring ratios and few attempts, although players with large \(n_i\) are now also subject to stronger adjustment towards the prior. The red dots above the \(45^\circ\) line show that players with few attempts tend to be rated worse in the model adjusted for \(n_i\) than in the baseline model with only a constant. Consequently, frequently nominated players tend to be rated better once we account for the effect of selection: the colour transition in posterior means indicates that penalty takers are systematically rated better from approximately 20 penalties onwards as priors become more informative. This is also reflected in the estimation uncertainty displayed by the posterior distributions, whose variance is generally lower for the adjusted model and decreases markedly with \(n_i\), cf. the densities shown above the square. We discuss ability point estimates, the implied shrinkage, and resulting rank reversals with respect to the MM results in more detail for fully Bayesian estimation in Sect. 3.1.

Fig. 4

Likelihood-based empirical Bayes–comparison of posterior quantities. Note: Dots show posterior means based on priors obtained from beta-binomial regression on \({\varvec{x}}_i\), with horizontal \(\mathcal U(-.005, .005)\) jitter. Beta posterior distributions are displayed at the margins. The dashed black lines show kernel density estimates of posterior means

Table 3 Summary statistics of Bundesliga penalty takers

We also consider priors that additionally depend on miss, the number of missed penalties, i.e. penalty kicks that did not require any action by the goalkeeper because the ball went past the goal, hit the crossbar, or hit one of the posts; see Table 3 for summary statistics on the regressors. The rationale behind this is as follows: players whose penalties are often harmless for the keeper are presumably bad candidates and are also less likely to be nominated again. Our dataset contains 265 players with at least one penalty that did not require intervention by the goalkeeper. In total, 54 of these players were not nominated again after missing their first and only penalty. Including miss as a predictor for the scoring probability in \({\varvec{x}}_i'\varvec{\gamma }\) should attenuate the correction towards the population average for these players and also allow better discrimination between players with similar \({\widehat{\theta }}_i\) but different failure rates. Frequentist coefficient estimates \(\widehat{\varvec{\gamma }}\) are given in Table 8 in the Appendix. We find evidence for the expected effects, i.e. significant positive (negative) estimates for n (miss).

Posterior distributions and their means for the informative priors are shown in Fig. 5. As anticipated from the results for the basic models in Fig. 4, the shrinkage in the extended model is considerably stronger than for MM. The KDE for the posterior means indicates that point estimates are strongly concentrated here as well and only occasionally fall below 50% (the median is .6978). Nevertheless, our prior choice yields substantial differences from the relative frequency estimates in some cases. For example, former HSV player Sergej Barbarez, with four missed penalties in nine attempts, is now the worst-rated player with a posterior mean of .426 (cf. Table 4 and the leftmost orange-coloured posterior distribution in Fig. 4). Though below the average posterior mean of 69.7%, Tim Borowski (2 saves and 1 miss in 3 attempts) is now rated much better, with a posterior scoring probability of 60.7% (and 95% CI [.503, .706]). At the other extreme, prior information yields more optimistic and less uncertain estimates for frequently nominated and thus potentially better players. The ranking is led by Michael Zorc, whose posterior mean of 92% exceeds his score ratio of 86%, cf. Table 1.

Fig. 5

Empirical Bayes estimates from informative beta priors–posterior quantities. Notes: \({\varvec{x}}_i = (1, n_i, miss_i)'\). Left: the solid red line marks the set of equal relative frequencies \({\widehat{\theta }}_i\) and posterior means \({\tilde{\theta }}_{i,{\varvec{x}}_i}^s\). The dashed red line shows the prior mean estimate \(\overline{{\widehat{\mu }}} = .7033\). Grey bars show 95% credible intervals. Right: beta posterior distributions. The dashed black line shows a kernel density estimate of the distribution of posterior means (colour figure online)

Table 4 Subjective Empirical Bayes estimates—top and bottom penalty takers

3 Fully Bayesian analysis of penalty conversion probabilities

A potential drawback of the methods discussed in Sect. 2 is the choice of prior: we obtain inference on the \(\theta _i\) using individual as well as prior information which depends on aggregate information from all observations through hyperparameter point estimates. However, using plug-in estimators for \(\mu\) and \(\nu\) (or \(\alpha\) and \(\beta\)) implies using an estimated posterior distribution. Doing so bears the risk of overfitting and being overly optimistic about the precision of the results. This is especially problematic for players with small \(n_i\). A refined approach hence is to assume a joint probability model for all parameters: instead of letting prior information reflect a (conditional) population average, we allow information on the individual conversion abilities to contribute probabilistically to the posterior of the hyperparameters, which then shapes posteriors of the \(\theta _i\). This is achieved in fully Bayesian hierarchical models where we put priors (hyperpriors) on the model hyperparameters.

3.1 A fully Bayesian beta-binomial hierarchical model

For the baseline hierarchical model (HB) we retain model distributions as in (2) but obtain the joint posterior \(\pi ({\varvec{\theta }},\alpha ,\beta \vert {\varvec{y}})\) from a fully Bayesian approach that assumes an exchangeable beta prior, that is, \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ) = \prod _{i=1}^N \pi (\theta _i\vert \alpha ,\beta )\). Together with the binomial likelihood \(\pi ({\varvec{y}}\vert {\varvec{\theta }}, \alpha , \beta )\) and a hyperprior \(\pi (\alpha , \beta )\) we thus have

$$\begin{aligned} \pi ({\varvec{\theta }}, \alpha , \beta \vert {\varvec{y}}) \propto&\, \pi ({\varvec{y}}\vert {\varvec{\theta }}, \alpha , \beta ) \pi ({\varvec{\theta }}\vert \alpha , \beta ) \pi (\alpha , \beta ) \end{aligned}$$
(12)
$$\begin{aligned} \propto&\, \prod _{i=1}^N \theta _i^{y_i}(1 - \theta _i)^{n_i - y_i} \prod _{i=1}^N \frac{\Gamma (\alpha + \beta )}{\Gamma (\alpha )\Gamma (\beta )}\theta _i^{\alpha - 1}(1 - \theta _i)^{\beta - 1} \pi (\alpha , \beta ). \end{aligned}$$
(13)

By standard results, the marginal posterior of \((\alpha , \beta )\) follows as

$$\begin{aligned} \pi (\alpha , \beta \vert {\varvec{y}}) \propto \prod _{i=1}^N\frac{\Gamma (\alpha + \beta )}{\Gamma (\alpha )\Gamma (\beta )}\frac{\Gamma (\alpha + y_i)\Gamma (\beta + n_i - y_i)}{\Gamma (\alpha + \beta + n_i)} \pi (\alpha , \beta ). \end{aligned}$$
(14)

Regarding the choice of an “objective” hyperprior we follow Gelman et al. (2013) and use the Pareto kernel

$$\begin{aligned} \pi (\alpha ,\beta ) \propto (\alpha + \beta )^{-5/2}, \end{aligned}$$
(15)

which obtains by placing a uniform density on \(\varphi (\alpha ,\beta ):=(\mu , \nu ^{-1/2})\) and using the Jacobian of \(\varphi\) to transform back to the original scale.Footnote 3 That (15), used as a prior for \((\alpha ,\beta )\) in (14), yields a proper marginal posterior whenever \(0<y_i<n_i\) for at least one i follows from standard arguments.Footnote 4
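For later use, the (unnormalised) log marginal posterior in (14) with the Pareto kernel (15) can be evaluated as in the following sketch (the function name is ours; `y` and `n` are assumed to be NumPy arrays of scores and attempts):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_post(alpha, beta, y, n):
    """Unnormalised log pi(alpha, beta | y), cf. (14), with hyperprior (15)."""
    ll = np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                + gammaln(alpha + y) + gammaln(beta + n - y)
                - gammaln(alpha + beta + n))
    return ll - 2.5 * np.log(alpha + beta)     # Pareto kernel on the log scale
```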

Contours of \(\pi (\alpha ,\beta \vert {\varvec{y}})\) based on a numeric evaluation of the functional in (14) with \(\pi (\alpha ,\beta )\) as in (15) are shown in the left panel of Fig. 6. The marginal posterior density has a mode at (23.42, 8.87) and mean at (33.01, 11.83) which, chosen as hyperparameters in a beta prior, roughly agrees with expected value and prior sample size of the beta prior in the baseline likelihood-based EB model of Sect. 2.Footnote 5 This seems reasonable since both models rely on the full likelihood but \(\pi (\alpha ,\beta )\) in HB introduces additional uncertainty. We thus expect that shrinkage towards the prior will be stronger than for MM.

Fig. 6

Fully Bayesian hierarchical model–simulated posterior distributions. Note: Left: unnormalised marginal posterior \(\pi (\alpha ,\beta \vert {\varvec{y}})\) computed over \((\alpha ,\beta )\in [0,80]\times [0,30]\). Contour lines indicate \(5\%, 15\%, \dots , 85\%, 95\%\) of the density at the mode (23.42, 8.87). Right: Kernel estimates of posterior densities \(\pi (\theta _i\vert \alpha ,\beta ,{\varvec{y}})\)

The left panel of Fig. 6 reveals two marked features of \(\pi (\alpha ,\beta \vert {\varvec{y}})\). First, the contours indicate that regions with \(\alpha + \beta \le 2\) have virtually zero density. Second, the correlation between \(\alpha\) and \(\beta\) is strongly positive. By properties of the beta distribution, the marginal posterior thus places nearly all probability mass on hyperparameter combinations \((\alpha ,\beta )\) that yield unimodal beta priors with similar central tendency. This is markedly different from the bimodal prior obtained from MM in Sect. 2.1 which has most probability mass in the tails, giving substantial weight to extreme outcomes. Furthermore, the additional uncertainty from imposing the hyperprior (15) on \((\alpha ,\beta )\) implies higher uncertainty in outcomes from the (conditional) posterior distribution in the hierarchical model. Consequently, we expect estimates of posterior quantities to have higher variance than MM estimates. We will address these aspects in more detail in Sect. 3.2.

Although the \(\pi (\theta _i\vert \alpha ,\beta ,{\varvec{y}})\) are beta distributions, uncertainty in \((\alpha ,\beta )\) in HB does not allow direct simulation from the conditional posteriors as for the empirical Bayes models, where a standard routine for generating beta variates with fixed hyperparameters suffices. Instead, a procedure that incorporates variability in the hyperparameters due to the hyperprior (15) is needed. The idea is to first generate samples from the marginal posterior \(\pi (\alpha ,\beta \vert {\varvec{y}})\) which are then used to draw from \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ,{\varvec{y}})\). We use the following easily implemented discrete-grid sampling scheme suggested in Gelman et al. (2013) to generate random variates from the joint posterior \(\pi ({\varvec{\theta }}, \alpha , \beta \vert {\varvec{y}})\); a code sketch follows the listing.

Sampling from the joint posterior \(\pi ({\varvec{\theta }}, \alpha , \beta \vert {\varvec{y}})\).

1. Simulate from \(\pi (\alpha ,\beta \vert {\varvec{y}})\). Draw \((\alpha _\ell ,\beta _\ell )\), \(\ell =1,\dots ,L\) for sufficiently large L using the following discrete-grid sampling procedure:

   1.1 Compute the (unnormalised) marginal posterior \(\pi (\alpha ,\beta \vert {\varvec{y}})\) with hyperprior (15) over a sufficiently large grid. Normalise by approximating the density as a step function with unit integral over the grid.

   1.2 Obtain \(\pi (\alpha \vert {\varvec{y}})\) by discrete summation.

   For \(\ell =1,\dots ,L\):

   1.3 Draw \(\alpha _\ell\) from \(\pi (\alpha \vert {\varvec{y}})\) and draw \(\beta _\ell\) from \(\pi (\beta \vert \alpha ,{\varvec{y}})\) given \(\alpha _\ell\) using the discrete inverse CDF method.

   1.4 Add a uniform zero-mean jitter to \(\alpha _\ell\) and \(\beta _\ell\) to get continuously distributed samples.

2. Simulate from \(\pi ({\varvec{\theta }}\vert \alpha ,\beta ,{\varvec{y}})\). For each \(i=1,\dots ,N\), use the \((\alpha _\ell ,\beta _\ell )\) to obtain L draws of \(\theta _i\) from the conditional posterior distribution \(\theta _i\vert \alpha ,\beta ,{\varvec{y}}\sim {Beta}(\alpha + y_i, \beta + n_i - y_i).\)
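A compact implementation of this scheme is sketched below, reusing the hypothetical log_marginal_post function from the sketch after (15). For brevity, steps 1.2–1.3 are collapsed into a single draw over grid cells with probabilities proportional to the normalised posterior, which is equivalent to drawing \(\alpha _\ell\) from its discrete marginal and \(\beta _\ell\) from the corresponding conditional; grid ranges and the default L are our own illustrative choices:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def grid_sampler(y, n, alpha_grid, beta_grid, L=5000, seed=1):
    """Discrete-grid sampling from pi(alpha, beta | y), then, for every player,
    L draws from the conditional posterior theta_i | alpha, beta, y."""
    rng = np.random.default_rng(seed)
    # step 1.1: evaluate the unnormalised marginal posterior over the grid
    logp = np.array([[log_marginal_post(a, b, y, n) for b in beta_grid]
                     for a in alpha_grid])
    p = np.exp(logp - logp.max())
    p /= p.sum()                                   # step-function normalisation
    # steps 1.2-1.3 (collapsed): draw grid cells with their posterior probabilities
    idx = rng.choice(p.size, size=L, p=p.ravel())
    a_idx, b_idx = np.unravel_index(idx, p.shape)
    # step 1.4: uniform zero-mean jitter within each grid cell
    da, db = alpha_grid[1] - alpha_grid[0], beta_grid[1] - beta_grid[0]
    a_draw = alpha_grid[a_idx] + rng.uniform(-da / 2, da / 2, L)
    b_draw = beta_grid[b_idx] + rng.uniform(-db / 2, db / 2, L)
    # step 2: conditional posterior draws for each player
    theta = beta_dist.rvs(a_draw[:, None] + y, b_draw[:, None] + (n - y))
    return a_draw, b_draw, theta

# usage sketch (grid roughly covering the region shown in Fig. 6):
# a_draw, b_draw, theta = grid_sampler(y, n, np.linspace(0.5, 80, 160), np.linspace(0.5, 30, 120))
```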

3.1.1 Results

Posterior quantities are approximated using \(L=50000\) draws generated with the above sampling scheme. The right panel of Fig. 6 shows KDEs of the posterior distributions of the \(\theta _i\vert \alpha ,\beta ,{\varvec{y}}\). We find that modelling uncertainty in the hyperparameters by the joint probability model results in the \(\pi (\theta _i\vert \alpha ,\beta ,{\varvec{y}})\) being substantially shrunk towards each other. As for the EB models, this correction is most pronounced for penalty takers with little information, since the beta prior in the hierarchical model is shaped significantly by players with many attempts—a sub-sample whose information is more relevant for inference on players for whom data are scarce. Comparing the kernel density estimate for the distribution of posterior means to the distribution of MM posterior means in Fig. 2 further illustrates this feature.

Figure 7 compares posterior means and relative frequencies \({\widehat{\theta }}_i\) to examine more closely the shrinkage of frequentist estimates implied by the fully Bayesian approach. Consistent with the discussion of the implications of the hyperprior in the previous section, we find more pronounced shrinkage of the estimated penalty-kicking abilities of the players. It is noticeable that modelling the hyperparameters via the joint probability distribution results in a pronounced correction of scoring probabilities towards the posterior population average for players with an intermediate number of attempts and \({\widehat{\theta }}_i\in (0.5,1)\)—a sub-sample for which MM estimates often differ only marginally from the \({\widehat{\theta }}_i\). Two further features distinguish the results from those of the MM approach. First, failed penalties are more influential in HB than for MM, i.e. the posterior is adjusted towards lower scoring probabilities if a player demonstrates a lack of skill in several attempts. Consequently, we expect order reversals in the outcomes relative to MM. Second, for \({\widehat{\theta }}_i\) close to unity the uncertainty, as visualised by the vertical grey bars in Fig. 7, is larger than for MM, where the precision of the results seems to be exaggerated, cf. Fig. 2.

Fig. 7

Fully Bayesian hierarchical model–conditional posterior means. Note: the solid red line marks the set of equal relative frequencies \({\widehat{\theta }}_i\) and posterior mean estimates \({\tilde{\theta }}_i\vert \alpha ,\beta ,{\varvec{y}}\). The dashed red line shows the average posterior mean. The \({\widehat{\theta }}_i\) are jittered to reduce overplotting. Grey bars show 95% credible intervals (colour figure online)

Table 5 shows the top and bottom 10 ranks for posterior mean estimates in the fully Bayesian hierarchical model along with 95% CIs. Overall, we find that the posterior means are corrected more strongly compared to MM. Lower bounds of the 95% CIs for top players fall in a more realistic range near the average posterior scoring probability. We find that the all-time ranking in the hierarchical model is led by Manfred Kaltz, now with Lothar Matthäus in second place. Hans-Joachim Abel holds third place but is closely followed by several renowned players, amongst them Michael Zorc and the presumably best active penalty taker in the first-tier Bundesliga, Bayern Munich’s Robert Lewandowski; see also Table 7.Footnote 6

Table 5 Fully Bayesian hierarchical model—top and bottom 10 penalty takers

As expected, the upward correction for players in the lower tail of the ability distribution is more pronounced in the hierarchical model. Again, the 95% CIs appear more reasonable, the more so as their upper bounds overlap with the intervals of higher-ranking players. This appears natural: given public pressure, coaches tend to refrain from assigning further penalties to players once they have failed a few, especially when other candidates are on the team. Yet these players—being trained professionals—would likely be able to perform with a substantially higher success rate than observed so far if they were assigned further penalties.

For the bottom ranks, we observe more order reversals which, as explained above, result from the higher weight given to failures by the prior. Tim Borowski, for instance, is again ranked amongst the worst penalty takers in Bundesliga history but now ranks slightly better than Bruno Labbadia, who scored only one goal in five attempts and has a posterior mean conversion probability of about 67.5%.Footnote 7 Interestingly, Borowski ranks almost as high as Klaas-Jan Huntelaar, who failed in 7 of his 16 attempts.

In view of these results, it is worth discussing how appropriate the implied shrinkage is. The mean and mode as well as the highest density region of the marginal posterior distribution for \((\alpha ,\beta )\) in Fig. 6 suggest that about 30–40 penalties are needed for the data to dominate the prior. This finding fits well with the maximum likelihood estimate \({\widehat{\nu }}\approx 41\) for the baseline model in Sect. 2.2. However, considering the convergence of relative frequencies reported in Fig. 3, this value may seem too high. That said, model adequacy should not be assessed entirely by discrepancies between Bayesian and frequentist estimates. For a more informed evaluation of model fits, Sect. 3.2 compares models using Bayesian information criteria and a test sample covering penalties in season 2020/21.

3.2 Additional Markov chain Monte Carlo estimates and model comparison

It seems worthwhile—e.g. for predictive purposes—to know which active first division players are the most likely to convert future penalties. To this end, we first assess the adequacy of the models discussed so far. We also consider further models that satisfy the Bayesian paradigm and incorporate uncertainty in the hyperparameters. Here, we use the same regressor sets as for the EB models and fit fully Bayesian beta-binomial regressions (BBBR) with logit and identity links for \(\mu\) and \(\nu\), setting vague Gaussian and gamma priors. Posterior quantities for these models are obtained from Markov chain Monte Carlo (MCMC) sampling.Footnote 8
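For concreteness, a BBBR of this kind could be specified as in the following minimal PyMC sketch. The prior scales, sampler settings, and function name are our own illustrative choices and not necessarily the authors’ implementation:

```python
import pymc as pm

def fit_bbbr(X, y, n, draws=2000, tune=1000):
    """Fully Bayesian beta-binomial regression: logit link for mu,
    identity link for nu, vague Gaussian and gamma priors."""
    with pm.Model():
        gamma = pm.Normal("gamma", mu=0.0, sigma=10.0, shape=X.shape[1])  # vague Gaussian prior
        nu = pm.Gamma("nu", alpha=2.0, beta=0.1)                          # vague gamma prior
        mu = pm.Deterministic("mu", pm.math.invlogit(pm.math.dot(X, gamma)))
        pm.BetaBinomial("y", alpha=nu * mu, beta=nu * (1 - mu), n=n, observed=y)
        return pm.sample(draws=draws, tune=tune, target_accept=0.9)
```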

Table 6 Comparison of Bayesian beta-binomial models

For Bayesian applications, metrics based on an estimate of the expected log predictive density (ELPD) have been suggested for evaluating out-of-sample predictive performance. For the comparison of the Bayesian models, we use the leave-one-out cross-validation criterion (LOOIC) proposed by Vehtari et al. (2017), which approximates the LOOCV estimate of the ELPD by regularised importance sampling. Results are given in Table 6. Generally, the EB models with informative priors and their MCMC counterparts perform very similarly, yet the MCMC fits are reported to be slightly inferior in each case, which is likely due to prior-induced uncertainty in the parameters. The MM approach is clearly dominated by the other models. As expected, the beta-binomial regression models with constant linear predictor from Sect. 2.2 perform very similarly to HB with the Pareto prior for \((\alpha , \beta )\).Footnote 9 LOOIC indicates that the most complex beta-binomial regression model performs best among the fully Bayesian models.Footnote 10

For a comparison with the naïve frequentist strategy of using independent binomial proportions \({\widehat{\theta }}_i\) as ability estimates, we compute the binary cross-entropy loss using predicted scoring probabilities for 94 penalties kicked in the 2020/21 season by players with \(n_i\ge 1\).Footnote 11 As seen from the ll column in Table 6, log loss for the test dataset also indicates that MM performs worst among the Bayesian models, while the full BBBR does best. The frequentist approach clearly performs worst overall.
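The log loss used here can be computed as in the brief sketch below; the clipping constant is our own safeguard against infinite loss for degenerate frequentist estimates \({\widehat{\theta }}_i\in \{0,1\}\):

```python
import numpy as np

def log_loss(outcomes, probs, eps=1e-12):
    """Binary cross-entropy: mean negative log-likelihood of observed penalty
    outcomes (1 = scored, 0 = failed) under predicted scoring probabilities."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    o = np.asarray(outcomes, dtype=float)
    return -np.mean(o * np.log(p) + (1 - o) * np.log(1 - p))
```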

We now discuss the results for the BBBR model with \({\varvec{x}}_i = (1, n_i, miss_i)'\) in more detail and examine ability estimates of the top and bottom active Bundesliga players.Footnote 12 The results are shown in Table 7.Footnote 13 The ranking is topped by Robert Lewandowski with a posterior scoring probability of 88.4%, closely followed by Max Kruse and Andrej Kramaric.Footnote 14 At the bottom of the first division ranking, we find Borussia Mönchengladbach’s Breel Embolo and Sebastian Rudy of TSG Hoffenheim, who both missed their only Bundesliga penalties to date. Again, we conclude that rank reversals from the BBBR model with \({\varvec{x}}_i = (1, n_i, miss_i)'\) can be insightful. For example, Erling Haaland (one save and one miss in four attempts), with all-time score-based rank 406, is rated almost as highly as Thorgan Hazard (2 saves and 2 misses in 12 attempts), who holds all-time rank 119.

Table 7 BBBR—top and bottom active penalty takers in Bundesliga

4 Discussion and outlook

This article analyses the penalty-kicking abilities of Bundesliga players for a historical sample covering seasons 1963/64 to 2020/21. An inherent problem in estimating scoring probabilities and establishing a ranking is that the data are imbalanced among players. From a statistical perspective, the sparsity of observations resulting from many penalty takers having taken only a few penalties is challenging. Purely likelihood-based estimation that treats players as independent then yields imprecise results.

We have shown that a ranking based on hierarchical Bayesian models, in which the accuracy of all players is modelled by a joint probability distribution, has advantages over a purely frequentist approach. In comparison to common rankings based on the relative scoring frequency or even the absolute number of goals, such as that of Kicker (2021), we find some notable rank reversals using posterior estimates from hierarchical models. This predominantly concerns lower quantiles of the distribution, where the shrinkage is greatest. Changes in the top ranks are usually small. This is as expected: players in the top ranks tend to be those who convert many penalties and hence keep getting assigned further penalties, resulting in many data points for them, so purely likelihood-based inference already is fairly informative.

Although the Bayesian results may be statistically more convincing than a comparison of plain scoring frequencies, the posterior means should be interpreted with due consideration of the associated uncertainty. While differences in posterior means are not negligible—suggesting that there are indeed players who are significantly more accurate at penalties than others—credible intervals tend to overlap considerably. This pattern is evident both for the historical overall ranking and for the separate rankings of active players in the first and second Bundesliga.

For further research, it might be interesting to use a dataset that provides more detailed information on the individual penalty kicks for each player to develop a model that can be used for predicting outcomes of penalties in future Bundesliga games.