1 Introduction

Contest designers put a lot of effort into designing contests in various fields. For operational research in sports contests, fairness for the contestants and attractiveness for the spectators are the main restrictions (Arlegi & Dimitrov, 2020; Szymanski, 2003; Wright, 2014). The main fairness criterion for contests is fulfilled if equally skilled contestants have equal winning probabilities (e.g. Arlegi & Dimitrov, 2020). To ensure this, most contest designs balance potential advantages, e.g. by scheduling home and away matches within one contest to balance the home advantage. Hence, many contest studies analyse symmetric contests that are designed to have either a balanced or no technical advantage but empirically uncover side effects, such as effects of the order of actions (Ginsburgh & Van Ours, 2003) or scheduling effects (Goller & Krumer, 2020).

In contrast, we investigate the fairness of an unbalanced, one-match sequential contest with two competitors in the sport of darts. This contest has a built-in advantage (BIA) for the first-moving contestant, who has an advantage in potentially more, but never fewer, situations within a match.Footnote 1 In this study, we analyse a simple question: Is the two-player sequential contest with BIA a fair contest, or are there specific groups of individuals who are systematically disadvantaged by the contest design?

The outcome of a contest is about human behaviour, and even though behavioural responses are potentially among the most individual, there is hardly any personalized evidence. The importance of individual behavioural responses calling for more individualized effects estimation is demonstrated in the work of González-Díaz et al. (2012). They found large differences of individual ‘critical abilities’ of tennis players affecting the performance and success. While it is impossible to observe or estimate a true individual effect owing to the unobservable outcomes in situations of non-realized interventions, there is a huge potential to find or describe those individuals who benefit most or least from some given treatment, which is the BIA in the contest design. For designing rules for competitions, identifying the most benefiting group of individuals might help operational researchers to improve contest design in terms of fairness, attractiveness, or competitive balance. This is especially true if specific groups of individuals systematically realize differential effects and are therefore (dis-)advantaged by the contest design.

The relation between incentives and performance is widely discussed in the literature and is among the fundamentals in contests (Ariely et al., 2009; Baumeister, 1984; Ehrenberg & Bognanno, 1990; Rosen, 1986). Humans behave differently in situations with an increased level of incentives, which are present in many fields of daily life, such as job talks, presentations or public speeches, competitions, and decisions at the workplace or in sports. Often those situations are subject to higher rewards or reputation that might increase or decrease the performance of individuals. In darts, social incentives might play a crucial role, owing to euphoric spectators and the otherwise very standardized environment. For athletes competing near their hometown the stakes are higher. Having a supportive crowd around might be motivating, which we call social support from now on. Social pressure is the counterpart, as being watched can also be a burden inducing pressure.Footnote 2 The increased focus magnifies the reputation gained if performing well, as well as the loss of reputation otherwise. This is a known phenomenon in the social facilitation literature (e.g. Butler & Baumeister, 1998; Zajonc, 1965), which is found to have an influence even in cases in which not the athlete, but only the audience has high expectations (Baumeister et al., 1985; Strauss, 1997).

This motivated us to analyse differential effects of the BIA at an individualized level, as well as for the groups associated with social incentives and ability, in a flexible way. To the best of our knowledge we are the first to empirically analyse a BIA in an asymmetric two-player sequential contest. This study especially contributes to operations research in investigating the fairness of such a contest with a thorough analysis of the BIA using novel statistical approaches in a causal framework.

This work relies on recent advances in econometric methods made in the emerging Causal Machine Learning literature. These new methods are more flexible compared to parametric approaches. Moreover, they can deal flexibly with rich data, which makes some identifying assumptions more credible, and they are useful for the analysis of heterogeneous treatment effects.

For this, we use a dataset from the increasingly popular sport of darts, which offers several benefits when investigating individual responses. Particular to darts is a friendly but euphoric atmosphere created by the spectators. Moreover, conditions at different venues are highly standardized. A noteworthy advantage in comparison to various other sports is that there is no direct interaction between the contestants, which could otherwise lead to problems in assessing individual performance. Furthermore, the outcome is precisely measurable, the rewards are high enough for competitors to take the task seriously, humans are observed while performing their usual job (Levitt & List, 2008), clear rules avoid subjective decisions by referees, and there are almost no external influences.

As expected, we find that the technical advantage of moving first raises the probability of winning a match by about 8.6% points on average. Despite this advantage, the contest is ex-ante fair for players with symmetric BIAs and a randomized allocation to be the first-mover. The contest cannot be regarded as fair in case of systematically differential BIAs, that is, systematic effect heterogeneity associated with specific groups. The empirical analysis shows equal BIAs for equally skilled contestants, but the personalized effects span from about − 5 to + 15% points for individuals with different characteristics, showing that there is substantial heterogeneity in the effect of the BIA. Those contestants with lower performance measures and less experience profit most from the BIA. Differential effects are found in line with the social pressure hypothesis, as contestants playing in a neutral environment benefit from an 8.8% points higher treatment effect compared to those playing in a supportive environment. Interestingly, this differential effect in the BIA is found only for the first-moving player, while the second-moving player is not affected by this social pressure. Therefore, this is not a general home disadvantage but is related to social pressure only for the first-moving contestant. This results in an unfair contest design, since equally skilled contestants do not necessarily benefit from equal winning probabilities.

In the following section the related literature is discussed. Section 3 provides an insight into the setting of darts and details the data used. Section 4 presents the methodological challenges and the estimation procedures used to obtain the results in Sect. 5, followed by a conclusion in Sect. 6.

2 Literature review

There are few studies on the sport of darts, starting with Tibshirani et al. (2011), who analysed the statistics behind darts and the ways to get the highest expected payoffs. Liebscher and Kirschstein (2017) predicted winning probabilities for individual players in the world darts championship. While they implicitly incorporated the BIA in their prediction, they did not intend to estimate causal effects. Ötting et al. (2020) and Klein Teeselink et al. (2020) discovered no or low performance decrements under pressure for professional darts players by looking at more or less pressure owing to the varying importance of situations within a match. Substantial decrements among youth and amateur players are found by Klein Teeselink et al. (2020), which might suggest a selection of more choking-resistant individuals among professionals.

One of the most fundamental relations in contests is the role of incentives on performance, discussed in the behavioural and psychological (e.g. Ariely et al., 2009; Baumeister, 1984; Butler & Baumeister, 1998; Masters, 1992) and the economics literature (e.g. Lazear, 2000; Prendergast, 1999; Rosen, 1986; Shapiro & Stiglitz, 1984; Stiglitz, 1976).Footnote 3 Several works have investigated the role of social incentives, for example Cao et al. (2011) and Toma (2017), who used data on basketball and found no effect of pressure on performance when a team plays at home compared to road games. Other works find this to be a source of pressure that impacts performance negatively. Ariely et al. (2009) found high social rewards to be counterproductive for performance in a set of experiments. Harb-Wu and Krumer (2019) investigated shooting performance in biathlon and found home athletes to miss more shots compared to athletes from other countries. Furthermore, the authors suggested that performance decrements are present only for the best-ranked quarter of athletes. Besides the direct psychological effect, there is an indirect effect from public expectations (Baumeister et al., 1985; Butler & Baumeister, 1998; Strauss, 1997). In particular, Baumeister et al. (1985) and Strauss (1997) found performance decrements among individuals in cases in which the audience, but not the individual, expects success.

Symmetric contests with either a balanced or no advantage are analysed with regard to differential effects for specific groups in several works. Ginsburgh and Van Ours (2003) found different winning probabilities in a musical contest depending on the order of actions. Dohmen (2008) and Harb-Wu and Krumer (2019) reported performance decrements among home contestants compared to non-home contestants. In other works, the groups of first-movers (e.g. Apesteguia & Palacios-Huerta, 2010) or the second-movers (e.g. Page & Page, 2007) are found to have higher winning probabilities than the respective other. Recently, Goller and Krumer (2020) established differential winning probabilities in home matches for underdog teams depending on the scheduling of the matches. Asymmetric contests, in which certain contestants are favoured by the contest design, are examined in various theoretical works. Examples are the role of the incumbency advantage for autocrats’ investment decisions (Konrad, 2002) or in political campaigns (Meirowitz, 2008), and more generally the influences of head-starts or handicaps in contests (Kirkegaard, 2012; Segev & Sela, 2014). In a more general scheme, a BIA is found for relative age effects in youth sports, in which children born at a certain time of year are favoured by the calendar year system, which groups children for competitive purposes (for a review, see Musch & Grondin, 2001; for a solution on how operational research can improve fairness, see Hurley, 2009).

With sports data, machine learning methods have so far been used almost exclusively for prediction tasks, rarely in causal studies for average effects and, to the best of our knowledge, not yet for the systematic estimation of heterogeneous effects.Footnote 4 Especially in recent years, a number of methods have been proposed in the Causal Machine Learning literature for estimating effect heterogeneity (Athey et al., 2019; Lechner, 2018; Wager & Athey, 2018; Zimmert & Lechner, 2019; among others). Machine learning methods (or statistical learning methods, see e.g. Hastie et al., 2009) were implemented or modified to suit the classical causal framework and developed to become useful in analysing causal questions (Athey, 2017; Athey & Imbens, 2019). While there is some work establishing theoretical guarantees (e.g. Chernozhukov et al., 2018a, 2018b; Wager & Athey, 2018; Zimmert & Lechner, 2019) and simulating the performance of the newly developed estimators (notably Knaus et al., 2021), empirical applications of those new estimators and the systematic estimation of heterogeneous effects are becoming increasingly popular (examples are Davis & Heller, 2017; Athey & Wager, 2019; Knaus et al., 2020; Goller et al., 2021).

3 Setting and data

3.1 Setting

Darts players commit themselves to play in one of the two major federations, the British Darts Organization (BDO) or the Professional Darts Corporation (PDC). While both federations hold their own tournaments, the PDC receives more media attention and distributes higher prize money than the BDO.Footnote 5 The most important tournaments are the (PDC/BDO) World Darts Championships held in December and January each year. Additionally, during the year, several other major tournaments are held in different places. The majors are joined by different minor tournaments throughout the year, relevant for accumulating ranking-relevant prize money and for qualification for the major tournaments. For a full list of tournaments considered in this work, see “Appendix 5”.

All BDO and PDC darts tournaments operate under the rules of the Darts Regulation Authority. Most matches are played in the (best-of-K) legs format, where K is an odd number. This implies that a player who wins (K + 1)/2 legs wins the match. To win a leg, the contestants must score exactly 501 points and complete with a ‘double’—a special region on the border of the darts board in which the score achieved is doubled. The two contestants perform their moves sequentially; for each move, three darts are to be thrown. Once a player reaches 501 points, the opponent is not allowed to ‘catch-up’. Therefore, the first-mover in a given leg potentially has up to three darts more compared to the non-starting player. Having an odd number of maximum legs played, the first-mover has a technical advantage for the whole match.
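To illustrate why an odd number of legs turns the leg-level advantage into a match-level advantage, the following sketch computes the exact probability that the player who starts the first leg wins a best-of-K match. It rests on assumptions that go beyond the text above: the right to start is assumed to alternate from leg to leg, and the starter of any leg is assumed to win it with a fixed probability `p_start`, independently across legs and identically for both players. The function name and the numerical values are hypothetical and chosen for illustration only.

```python
from functools import lru_cache


def first_mover_match_prob(best_of: int, p_start: float) -> float:
    """Probability that the starter of leg 1 wins a best-of-`best_of` match.

    Assumption (illustrative): starts alternate between legs and the starter
    of a leg wins that leg with probability `p_start`.
    """
    if best_of % 2 == 0:
        raise ValueError("best_of must be odd")
    target = (best_of + 1) // 2  # legs needed to win the match

    @lru_cache(maxsize=None)
    def win(a: int, b: int) -> float:
        # a, b: legs won so far by the first-mover (A) and the opponent (B)
        if a == target:
            return 1.0
        if b == target:
            return 0.0
        leg = a + b  # 0-based index of the next leg; A starts the even ones
        p_a = p_start if leg % 2 == 0 else 1.0 - p_start
        return p_a * win(a + 1, b) + (1.0 - p_a) * win(a, b + 1)

    return win(0, 0)


if __name__ == "__main__":
    for k in (1, 3, 11):
        print(k, round(first_mover_match_prob(k, 0.6), 4))
```

Under these assumptions the first-mover starts one leg more than the opponent if the match goes the full distance, so the match-winning probability exceeds one half whenever `p_start` does; with `p_start = 0.5` the contest is symmetric.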

Before each match the starter is determined in a shootout, in which each player throws one dart and the player closest to the centre of the darts board starts the match. How the challenge of this non-random determination of moving first is addressed is discussed in Sect. 4.2. For a more formal model the interested reader can refer to “Appendix 3”.

3.2 Data base

To investigate the BIA, we use data from the sport of darts. For the years 2009 to 2019, matches from the most important BDO and PDC tournaments were extracted from the software dartsforwindows, which contains information on match statistics and outcomes. This resulted in a total of 11,604 matches played by 394 different players.Footnote 6 Importantly, the variable of interest, i.e. starting the first leg, as well as the outcome variable, winning the match, is generated. This is complemented by venue and tournament characteristics, such as prize money, which is standardized to make it comparable over the studied time period.Footnote 7 Personal characteristics of the players, such as nationality, hometown, and date of birth, among others, are collected and utilized to create additional variables. Those contain the age or the number of years the player has played darts at the time of the respective match, or the distance between the hometown and the venue. The Home and Venue in country of birth variables are created if the player lives within 100 km of the venue, and was born in that country, respectively.Footnote 8 Since matches vary in their potential length, the data base contains the maximum number of legs to play (bestoflegs) in the respective match. Finally, (pre-match) performance measures and player statistics, such as the 3-darts average (cumulated 3 darts score, averaged over all the matches in the past 2 years), rankings, etc., are added. For the full list of covariates and sources, the reader can refer to Appendices A and F, respectively.

3.3 Descriptive statistics

The outcome variable under consideration is winning the match and is depicted in the descriptive statistics presented in Table 1. Further information is provided by categorising the players according to the variable of interest, i.e. starting the first leg, in columns (2) and (3). The difference in the outcome variables for starting and non-starting players can be observed in Panel A, column (4). If starting the first leg were perfectly randomized (instead of being determined by the shootout), these numbers would represent the average treatment effects. Descriptive evidence that moving first is subject to selection effects, i.e., that starting first is not randomly determined and there are confounding influences in the treatment variable, can be observed in the standardized differences (SD) reported in Panel C, column (4) for the players’ characteristics.Footnote 9 Starting players have on average a higher 3-darts average, a better ranking position, and more accumulated matches. For the other characteristics the standardized difference is low. Section 4.2 sets out how this non-random selection into treatment is approached. The full list of variables can be found in Table 5 in “Appendix 1”.

Table 1 Descriptive statistics

4 Methodology

4.1 Notation and framework

The typical notation for binary treatment effects estimation, following Rubin (1974), is used. Suppose the outcome obeys the observational rule \(Y_{i} = D_{i} Y_{i} \left( 1 \right) + \left( {1 - D_{i} } \right)Y_{i} \left( 0 \right)\), where \(D_{i}\) denotes the treatment status \(d \in \left\{ {0,1} \right\}\) and \(Y_{i} \left( d \right)\) the potential outcome under treatment status \(d\). Furthermore, we define \(X_{i}\) to contain the covariates necessary to account for confounding, and \(Z_{i}\) to contain those variables investigated in the heterogeneity analysis.Footnote 10

The first estimand of interest is the average treatment effect (ATE), \(\theta = E\left( {Y_{i} \left( 1 \right) - Y_{i} \left( 0 \right)} \right)\). This represents the average effect for all units on the highest level of aggregation. In contrast, the estimand on the lowest aggregation level is the individualized average treatment effect (IATE), \(\theta \left( x \right) = E(Y_{i} \left( 1 \right) - Y_{i} \left( 0 \right)|X_{i} = x)\). The group average treatment effect (GATE) represents an intermediate aggregation level according to heterogeneity variables \(Z_{i}\), \(\theta \left( z \right) = E(Y_{i} \left( 1 \right) - Y_{i} \left( 0 \right)|Z_{i} = z)\). Both conditional average treatment effects (CATEs), the GATEs and IATEs, are useful for detecting heterogeneity among the observed units, which is ‘hidden’ in ATE estimates.Footnote 11

The three estimands of interest are related as follows: integrating the IATEs over the characteristics of the groups \(Z_{i} = z\) leads to the GATEs, while integrating over the characteristics of the entire population results in the ATE. Finally, only one of the potential outcomes is observable, as a unit can either be treated \(\left( {D_{i} = 1} \right)\) or non-treated \(\left( {D_{i} = 0} \right)\); the other remains counterfactual. Contrary to an individual treatment effect, which cannot be identified as knowing both potential outcomes is impossible, identification of the IATE, which comes closest to the individual effect, and of any coarser level of aggregation is possible if all confounding influences are captured by the observed control variables \(X_{i}\) (Knaus et al., 2021). In the following section, we shall discuss how to ‘solve’ this fundamental problem of causal inference (Holland, 1986).
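Written out, and under the simplifying assumption (made here for illustration) that the heterogeneity variables \(Z_{i}\) are a subset or a function of \(X_{i}\), the aggregation follows from the law of iterated expectations:

$$ \theta \left( z \right) = E\left( {\theta \left( {X_{i} } \right)|Z_{i} = z} \right),\quad \theta = E\left( {\theta \left( {X_{i} } \right)} \right) = E\left( {\theta \left( {Z_{i} } \right)} \right), $$

where \(\theta \left( {X_{i} } \right)\) denotes the IATE and \(\theta \left( {Z_{i} } \right)\) the GATE evaluated at the characteristics of unit \(i\). Averaging the finer effects with the appropriate population weights thus always recovers the coarser ones.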

4.2 Identification

Most crucial for estimating causal effects is a credible identification strategy. In the case of randomized treatment assignment, non-treated units can directly be used to construct the counterfactual outcome. In the underlying case, with non-random treatment assignment, we impose the following assumptions to identify the estimands of interest based on selection-on-observables. First, the conditional independence assumption (CIA): \(Y_{i} \left( 1 \right), Y_{i} \left( 0 \right) \bot D_{i} |X_{i} = x\), and second, common support (CS): \(0 < P\left[ {D_{i} = 1{|}X_{i} = x} \right] < 1\).

The first assumption, the CIA, states that the potential outcomes are independent of the treatment assignment conditional on the confounders. This implies that all variables affecting both treatment assignment and outcome are observed and contained in \(X_{i}\). The second assumption, CS, requires that treatment probabilities are bounded away from 0 and 1. This assumption is testable and of no issue in this work (see Fig. 5 in “Appendix 2.1”).
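The role of the two assumptions can be made explicit in a short derivation, given here as a sketch. For any \(x\) in the support of \(X_{i}\),

$$ E\left( {Y_{i} \left( 1 \right) - Y_{i} \left( 0 \right)|X_{i} = x} \right) = E\left( {Y_{i} \left( 1 \right)|D_{i} = 1,X_{i} = x} \right) - E\left( {Y_{i} \left( 0 \right)|D_{i} = 0,X_{i} = x} \right) = E\left( {Y_{i} |D_{i} = 1,X_{i} = x} \right) - E\left( {Y_{i} |D_{i} = 0,X_{i} = x} \right). $$

The first equality uses the CIA, with CS guaranteeing that both conditioning events have positive probability; the second uses the observational rule. The right-hand side contains only observable quantities, so the IATE, and by aggregation the GATEs and the ATE, are identified.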

At this point, we would like to highlight the two different roles of covariates. First, covariates (or confounders; \(X_{i}\)) are required to make the CIA credible. In other words, the characteristics that are responsible for selection into the treatment status have to be accounted for to obtain an ‘as-good-as-randomized’ situation. This is necessary to obtain causal treatment effects, which are free from selection bias. Second, covariates (or heterogeneity variables; \(Z_{i}\)) are used to form groups of observations for which heterogeneous effects are to be estimated.

In this application, the non-random determination of the treatment, i.e. starting the first leg, is a result of the shootout to select the starter as already discussed in Sect. 3.1. The outcome of this shootout is arguably mainly influenced by the ability of the players and their experience with this situation. As indicated in Sect. 3.2, there are two measures of ability, the previous ranking and the performance in the form of the previous average scores achieved. Experience is measured by the cumulated number of matches played, age, number of years of playing darts and the number of years played at a professional level. Furthermore, we control for other variables potentially influencing the selection into treatment as described in Sect. 3. Since it is unclear in which (functional) form the potential confounders are most relevant to account for the selection into treatment, a flexible estimation technique, namely, a double machine learning approach with non-parametrically estimated nuisance functions, described in the following section, will be used. Having this set of observed potential confounders, it is credible that the CIA is satisfied.

4.3 Methods

If one considers using a classical linear regression approach to estimate a treatment effect, there are some implicit assumptions to think about. As already discussed, the conditional independence assumption is crucial. Further, there are the assumptions of a constant treatment effect and a linear effect of the confounders \(X_{i}\) on \(Y_{i}\). In most studies, there are neither scientific nor methodological reasons to motivate these latter assumptions. Therefore, adopting a methodology not imposing those assumptions is desirable. The methods of the emerging Causal Machine Learning literature do not rely on these assumptions.

The growing literature offers several solutions to estimate treatment effects in a flexible way. For a summary and simulation study covering many of the methods, refer to Knaus et al. (2021). They found four causal machine learning methods with good performance in all their simulated settings, while other methods were unstable or did not perform well for estimating the CATE and ATE. Specifically, those are LASSO with covariate modification and efficiency augmentation (Tian et al., 2014), LASSO with R-Learning (Nie & Wager, 2021), Causal Forest with local centring (Athey et al., 2019; Lechner, 2018), and Random Forest with Double Machine Learning (Chernozhukov et al., 2018a; 2018b).

For this work we chose to use Double Machine Learning, for which we have the theoretical guarantees required for the strategy of our application, for the ATE (Chernozhukov et al., 2018a, 2018b) and the CATE (Semenova & Chernozhukov, 2021; Zimmert & Lechner, 2019). For investigating the effects on the lowest level of aggregation (IATE), we use the Sorted Effects method (Chernozhukov et al., 2018a, 2018b). The average and group effects are investigated using the Best Linear Prediction (Semenova & Chernozhukov, 2021) and a nonparametric and non-linear CATE estimator (Zimmert & Lechner, 2019). Section 4.3.2 introduces and motivates these methods. To check whether the results depend on the chosen method, we additionally perform the analysis using a Modified Causal Forest (Lechner, 2018).Footnote 12

In the following subsections, we build on the observation of the two different roles of covariates. First, the procedure to account for selection into treatment by controlling for (potential) confounding factors is introduced in Sect. 4.3.1. Second, the two ways in which heterogeneity variables are used to investigate granular treatment effects are described in Sect. 4.3.2.

4.3.1 Double machine learning

The first stage of the estimation procedure uses Double Machine Learning (DML), introduced by Chernozhukov et al. (2018a, 2018b). This approach to overcoming the selection problem builds on the augmented inverse probability weighting procedure, going back to Robins et al. (1994, 1995):

$$ Y_{i}^{*} = \mu_{1} \left( {X_{i} } \right) - \mu_{0} \left( {X_{i} } \right) + \frac{{D_{i} \left( {Y_{i} - \mu_{1} \left( {X_{i} } \right)} \right)}}{{p\left( {X_{i} } \right)}} - \frac{{(1 - D_{i} )\left( {Y_{i} - \mu_{0} \left( {X_{i} } \right)} \right)}}{{1 - p\left( {X_{i} } \right)}}, $$

and involves three nuisance parameters: \(\mu_{1} \left( {X_{i} } \right) = E(Y_{i} |D_{i} = 1,X_{i} )\), the conditional outcome mean if treated; \(\mu_{0} \left( {X_{i} } \right) = E(Y_{i} |D_{i} = 0,X_{i} )\), the conditional outcome mean if not treated; and \(p\left( {X_{i} } \right) = E(D_{i} |X_{i} )\), the conditional probability to be treated.Footnote 13 This first stage uses all available potential confounders as introduced in Sect. 3.2. Plugging in the estimated nuisances results in the orthogonal score (\(Y_{i}^{*}\)), which has various applications. The expected value of the orthogonal scores, i.e., \(E\left( {Y_{i}^{*} } \right)\), is the average treatment effect. For the different investigated levels of aggregation, different sets of covariates are used. The lowest level of aggregation is obtained as \(E(Y_{i}^{*} |X_{i} = x)\), while group average treatment effects can be obtained by the expected value conditional on the groups defined in \(Z_{i}\) as \(E(Y_{i}^{*} |Z_{i} = z)\), which are discussed in detail in Sect. 4.3.2. The three nuisance parameters in general can be estimated by any well-suited estimation technique.Footnote 14
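The construction of the orthogonal score can be sketched in a few lines. The example below is only an illustration, not the implementation used in this study: it takes the nuisance predictions \(\mu_{1}\), \(\mu_{0}\) and \(p\) as given (in DML they would typically come from cross-fitted learners such as Random Forests), and the function names `aipw_scores` and `ate_from_scores` are hypothetical. The standard error shown is the simple sample standard deviation of the scores divided by \(\sqrt{N}\).

```python
from math import sqrt
from statistics import mean, stdev


def aipw_scores(y, d, mu1, mu0, p):
    """Orthogonal (AIPW) score Y*_i for each observation.

    y, d: observed outcomes and binary treatment indicators.
    mu1, mu0, p: nuisance predictions, assumed to be supplied by the caller.
    """
    scores = []
    for yi, di, m1, m0, pi in zip(y, d, mu1, mu0, p):
        if not 0.0 < pi < 1.0:
            raise ValueError("propensity score must lie strictly in (0, 1)")
        score = (m1 - m0
                 + di * (yi - m1) / pi
                 - (1 - di) * (yi - m0) / (1.0 - pi))
        scores.append(score)
    return scores


def ate_from_scores(scores):
    """ATE estimate (mean of the scores) and its standard error."""
    n = len(scores)
    return mean(scores), stdev(scores) / sqrt(n)


if __name__ == "__main__":
    # Toy numbers, purely illustrative
    y = [1, 0, 1, 0]
    d = [1, 0, 0, 1]
    scores = aipw_scores(y, d, [0.6] * 4, [0.4] * 4, [0.5] * 4)
    print(scores, ate_from_scores(scores))
```

Keeping the score computation separate from the nuisance estimation mirrors the two-step logic of the procedure: the same vector of scores can afterwards be averaged for the ATE or related to covariates for more granular effects.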

To overcome the linearity assumption of effects by the confounders on the outcome, the non-linear and non-parametric Random Forest is used to estimate the nuisance parameters. The Random Forest algorithm, developed in Breiman (2001), is built as an ensemble of single Regression Trees, which are to some extent randomly constructed.Footnote 15 Each Regression Tree recursively splits the space of covariates into non-overlapping areas to minimize the MSE of the outcome prediction until it reaches some stopping criteria. The resulting structure is reminiscent of a rotated tree; one can observe the trunk gradually splitting up into finer branches. The averages of the outcomes of those observations falling into the same end-nodes (leaves) provide the predictions of the tree. Combining several of those tree predictions results in the final predictions of the Random Forest.Footnote 16 This DML step is to remove any (potential) confounding issues. The resulting orthogonal score is free from selection effects, and treatment effects can be constructed at various levels of aggregation.Footnote 17

4.3.2 Conditional average treatment effects

As already mentioned, the scores can be used to directly obtain the average treatment effect by \(\theta = E\left( {Y_{i}^{*} } \right)\). To estimate effects beneath the average level, the obtained orthogonal scores from the DML procedure can be used in different ways. The first method we use is the Best Linear Predictor (BLP) framework proposed in Semenova and Chernozhukov (2021). Here, the orthogonal scores are used as pseudo-outcome in an ordinary least squares regression on covariates to solve the minimization problem: \(\hat{\beta } = \mathop {\arg \min }\limits_{\beta } \mathop \sum \nolimits_{i = 1}^{N} \left( {\widehat{{Y_{i}^{*} }} - \widetilde{{x_{i} }}\beta } \right)^{2}\), with \(\widetilde{{x_{i} }}\) containing a constant and \(x_{i}\). The fitted values, \(\hat{\theta }\left( {\widetilde{{x_{i} }}} \right) = \widetilde{{x_{i} }}\hat{\beta }\), are the best linear predictors of the IATEs. In fact, \(\widetilde{{x_{i} }}\) can be replaced by any subset of the covariate space. For example, replacing it by only a constant leads to the ATE; replacing \(\widetilde{{x_{i} }}\) by \(\widetilde{{z_{i} }} = \left( {constant, z_{i} } \right)\) leads to the GATEs. Interpretation of the resulting coefficients is equivalent to interpreting coefficients estimated using an OLS regression, except that the level of a causal effect is modelled, rather than the level of the outcome. Moreover, with (single) binary heterogeneity variables, GATEs are estimated nonparametrically and the linearity of effects assumption does not play a role. Standard errors can be computed as heteroscedasticity-robust standard errors, which are valid, as shown in Semenova and Chernozhukov (2021).Footnote 18
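For the special case of a single heterogeneity variable, the BLP regression has a closed form, which the following sketch uses to show why a binary \(z_{i}\) yields nonparametric GATEs: the intercept equals the mean score in the group with \(z_{i} = 0\) and the slope equals the difference in mean scores between the two groups. The function name `blp_single_regressor` and the toy numbers are hypothetical, and the heteroscedasticity-robust standard errors are deliberately omitted.

```python
from statistics import mean


def blp_single_regressor(scores, z):
    """OLS of orthogonal scores on a constant and one variable z.

    Returns (intercept, slope). With binary z the intercept is the GATE for
    z = 0 and the slope is the difference between the two GATEs.
    """
    z_bar, s_bar = mean(z), mean(scores)
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    if sxx == 0:
        raise ValueError("z must vary across observations")
    sxy = sum((zi - z_bar) * (si - s_bar) for zi, si in zip(z, scores))
    slope = sxy / sxx
    return s_bar - slope * z_bar, slope


if __name__ == "__main__":
    # Toy scores; z = 1 could mark, e.g., a neutral environment (illustrative)
    scores = [0.1, 0.3, 0.0, 0.2]
    z = [1, 1, 0, 0]
    print(blp_single_regressor(scores, z))
```

With several heterogeneity variables the same regression would be run with a general least-squares routine; the closed form is used here only to keep the sketch self-contained.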

Once the IATEs are estimated, they can be analysed in various ways. One possibility is to evaluate the distribution of the effects, to observe how different the effects are across the range of observations. To do this, we use the Sorted Effects method suggested in Chernozhukov et al. (2018a, 2018b), which sorts the individualized effects according to the effect size and comes with valid confidence intervals. Comparing the resulting distribution with the average effect gives some insight into the range of individualized effects. Furthermore, this can be used to analyse whether the most and least affected individuals differ according to their characteristics. For this, Chernozhukov et al. (2018a, 2018b) propose a classification analysis that splits the data according to the size of the effect and compares the characteristics of the individuals belonging to the respective groups of lowest and highest treatment effects.
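The point-estimate part of such a classification analysis can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Chernozhukov et al.: it only sorts estimated effects and compares the mean of one characteristic between the most and least affected shares, and it leaves out the bootstrap inference, bias correction and joint p-values. The function name `classification_difference` and the toy data are hypothetical.

```python
from statistics import mean


def classification_difference(effects, characteristic, share=0.10):
    """Mean characteristic of the most affected minus the least affected.

    effects: estimated individualized effects, one per observation.
    characteristic: one covariate value per observation.
    share: fraction of observations in each tail group.
    """
    n = len(effects)
    if n != len(characteristic):
        raise ValueError("inputs must have the same length")
    k = max(1, int(n * share))  # size of each tail group
    order = sorted(range(n), key=lambda i: effects[i])
    least = [characteristic[i] for i in order[:k]]
    most = [characteristic[i] for i in order[-k:]]
    return mean(most) - mean(least)


if __name__ == "__main__":
    effects = [0.3, -0.1, 0.2, 0.0]
    ranking = [10, 40, 20, 30]  # illustrative characteristic
    print(classification_difference(effects, ranking, share=0.5))
```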

The discussed method has its strength in analysing the total range of individualized effects and compactly summarising heterogeneity. The drawback is that an assumption of linearity of effects is imposed, as one quickly runs into the ‘curse of dimensionality’ with more than a few heterogeneity variables when using non-parametric methods. To investigate specific hypotheses involving few dimensions of the covariate space, Zimmert and Lechner (2019), as well as Fan et al. (2019), proposed a non-parametric approach that does not depend on the linearity assumption of the best linear predictor. The already estimated orthogonal scores can be used with (few) selected heterogeneity variables in a classical non-parametric kernel regression as follows:

$$ \hat{\theta }\left( z \right) = \mathop \sum \limits_{i = 1}^{N} \frac{{{\mathcal{K}}_{h} \left( {z_{i} - z} \right)\widehat{{Y_{i}^{*} }}}}{{\mathop \sum \nolimits_{j = 1}^{N} {\mathcal{K}}_{h} \left( {z_{j} - z} \right)}} $$

Here, \({\mathcal{K}}_{h}\) is a kernel function with bandwidth \(h\), determined by cross validation and 90% undersmoothing, as suggested in Zimmert and Lechner (2019). The drawback of this procedure is that the dimension of \(Z_{i}\) has to be limited to obtain the required asymptotic guarantees. \(Z_{i}\), in this case, includes one or two heterogeneity variables, which is unproblematic from a theoretical perspective and sufficient to analyse the hypotheses of interest in this work.
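The kernel regression of the scores on a single heterogeneity variable can be sketched as below. The Gaussian kernel is an assumption made for illustration, since the kernel is not specified here, and the bandwidth `h` is simply passed in rather than chosen by cross validation and undersmoothing. The names `gaussian_kernel` and `kernel_gate` are hypothetical.

```python
from math import exp


def gaussian_kernel(u: float) -> float:
    """Gaussian kernel; the normalizing constant cancels in the ratio."""
    return exp(-0.5 * u * u)


def kernel_gate(z0, z, scores, h):
    """Kernel-weighted average of orthogonal scores at the point z0.

    z: values of one heterogeneity variable; scores: orthogonal scores;
    h: bandwidth (assumed given, e.g. from cross validation).
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    weights = [gaussian_kernel((zi - z0) / h) for zi in z]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations with positive weight near z0")
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    z = [-1.0, 0.0, 1.0]
    scores = [0.02, 0.08, 0.14]  # toy values
    print([round(kernel_gate(g, z, scores, h=0.5), 4) for g in (-1.0, 0.0, 1.0)])
```

Evaluating `kernel_gate` on a grid of values of the heterogeneity variable traces out the estimated effect curve; smaller bandwidths follow the scores more closely at the cost of higher variance.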

5 Results

5.1 Average treatment effect

The first result of the analysis is the average effect of the BIA for the starting player on winning the match. Column (1) in Table 3 shows the average treatment effect. The effect of the technical advantage, 0.0865, is large and precise enough to be statistically different from zero. The average effect for the starting contestants, therefore, amounts to an 8.65% points higher winning probability for the match. This effect is in line with what we would expect owing to the technical advantage implicit in the contest design. While this is a sizeable effect, it is of interest which kinds of players profit more or less from this advantage. To analyse this, we look into more granular effects in the following sections.

5.2 Individualized average effects

Starting at the most granular level, Fig. 1 shows the individualized average treatment effects for the BIA, sorted by size. The solid black line represents the average treatment effect, with the dotted black lines marking its confidence interval. The solid blue line represents the sorted individualized effects, accompanied by the shaded blue confidence intervals.

Fig. 1 Sorted Effects. Notes: 999 (weighted) bootstrap replications. Bias corrected. The blue line represents the sorted Conditional Average Treatment Effects accompanied by the shaded 90% confidence interval. The black line with squares represents the ATE, accompanied by the dashed line representing the 90% confidence interval. (Color figure online)

While most of the players have positive effects, there are also negative point estimates for specific players. For those players, the technical advantage might not be an actual advantage, even though this is not backed by statistical confidence. In total, the individualized effects range from around −5 to +15 percentage points, suggesting that there is some heterogeneity in the effect.

A comparison of the 10% most affected (highest treatment effect) with the 10% least affected (lowest treatment effect) individuals in selected characteristics can be found in Table 2. This helps to evaluate the differences in specific characteristics between those with the highest and lowest treatment effects. The estimate represents the difference in the characteristics of the most and least affected. In contrast to the ‘usual’ p-values, joint p-values account for the fact that several estimates, in this case 10, are tested simultaneously (for more details the interested reader is referred to Chernozhukov et al., 2018a, 2018b).

Table 2 Difference in characteristics of 10% most and least affected groups

In general, the most affected, i.e. those with the highest realization of the technical advantage, have lower performance measures and less experience, and they more often compete away from home and in tournaments with lower prize money than the least affected. In contrast to the other experience proxies, an insignificant estimate is found for age. Experience generally increases with age, because older athletes have had more time to accumulate it than younger athletes. However, age can also be a poor indicator of experience in darts, as older players who started playing darts later in life or who play irregularly are not necessarily more experienced than a middle-aged player who has played all their life. Furthermore, only home, defined as the hometown being within a radius of 100 km of the tournament venue, seems to be distributed differently, while being born in the country of the venue (Venue in country of birth) is almost equally frequent among the most and least affected.
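The comparison of the most and least affected groups can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the toy data are hypothetical, and the sketch returns only the point estimate of the difference, omitting the bootstrap inference and the joint p-values described in Chernozhukov et al. (2018a, 2018b).

```python
import numpy as np


def most_vs_least_affected(effects, characteristic, share=0.10):
    """Difference in a characteristic's mean: most minus least affected group.

    effects        : estimated individualized effects, shape (N,)
    characteristic : covariate to compare (e.g. a home indicator), shape (N,)
    share          : fraction of observations in each tail group
    """
    effects = np.asarray(effects, dtype=float)
    characteristic = np.asarray(characteristic, dtype=float)
    if effects.shape != characteristic.shape:
        raise ValueError("inputs must have the same shape")
    n_group = max(1, int(np.floor(share * effects.size)))
    order = np.argsort(effects)            # ascending: smallest effects first
    least = characteristic[order[:n_group]]
    most = characteristic[order[-n_group:]]
    return most.mean() - least.mean()


if __name__ == "__main__":
    # Toy data (hypothetical): the two smallest effects belong to home players.
    toy_effects = np.arange(20) / 100.0
    toy_home = np.array([1, 1] + [0] * 18)
    print(most_vs_least_affected(toy_effects, toy_home))
```

A negative value for a home indicator means that home players are more frequent among the least affected than among the most affected.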

5.3 Group average treatment effects

We now turn to investigating more specific effect heterogeneities. Two group effects are investigated. First, in Sect. 5.3.1, we analyse whether there are differential effects in the BIA associated with the ability of the contestants. Second, in Sect. 5.3.2, we evaluate whether, motivated by the social pressure hypothesis, there are differential effects associated with competitors performing near their hometown.

5.3.1 Ability

A contest is fair if equally skilled contestants have equal winning probabilities. To discover if the BIA differs for equally skilled contestants on any level of ability, we therefore analysed the BIA of the first moving player (contestant i) associated with both competitors’ ability (of i and j). As already discussed, our preferred measure of ability is the performance measure 3-darts average.

In general, Fig. 2 shows the GATEs to be smaller for higher ability athletes, in line with the results in Table 2. However, both contestants’ GATEs are close to each other for every value of the 3-darts average. In other words, there is no evidence that the BIAs are different for equally skilled contestants. Note that, especially for low and high 3-darts average, there are fewer observations, resulting in larger confidence intervals.

Fig. 2 Built-in advantage of starting contestant by ability. Notes: The broken lines represent the 90% confidence intervals. Contestant i is the starting contestant, j the opponent. (Color figure online)

5.3.2 Social incentives

We observe in Table 2 that those with the highest treatment effects tend to compete away from home, with an estimated difference of −0.49 compared to those with the lowest treatment effects. To investigate this more specifically, the group average effects are shown in Table 3.

Table 3 Base results—best linear predictors

Column (2) presents the effect associated with competing at home (−0.0885) relative to not competing at home. The technical advantage therefore exists only for the group of individuals not performing at home (0.0939) and is close to zero for those performing at home. The winning probability, which is about 8.8 percentage points lower for contestants performing in a friendly environment with support, i.e. at home, is in line with the social pressure hypothesis and contrary to the social support hypothesis, as found in Dohmen (2008) and Harb-Wu and Krumer (2019).Footnote 19 Column (6) provides a summary of effect heterogeneities for a larger set of covariates, confirming the home effect while holding the other variables constant.
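How a home coefficient of this kind translates into group effects can be sketched with a best linear predictor that uses a single binary regressor. The function name and the toy numbers are assumptions for illustration; the sketch regresses already estimated orthogonal scores on a constant and a home indicator by ordinary least squares and leaves out standard errors and any further covariates.

```python
import numpy as np


def blp_home(scores, home):
    """OLS of orthogonal scores on a constant and a home indicator.

    Returns (intercept, home_coefficient). With one binary regressor, the
    intercept equals the mean score of the not-home group and the home
    coefficient equals the difference in mean scores (home minus not home).
    """
    scores = np.asarray(scores, dtype=float)
    home = np.asarray(home, dtype=float)
    design = np.column_stack([np.ones_like(home), home])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return coef[0], coef[1]


if __name__ == "__main__":
    # Toy scores (hypothetical), two not-home and two home observations.
    intercept, home_coef = blp_home([0.10, 0.20, 0.00, 0.02], [0, 0, 1, 1])
    print(intercept, home_coef, intercept + home_coef)
```

Reading the reported estimates in the same way, the effect for contestants performing at home is the sum of the two coefficients, 0.0939 − 0.0885 = 0.0054, which is the near-zero effect referred to in the text.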

No substantial difference is found for the group of competitors born in the country of the venue (0.0032, column (5) in Table 3). This indicates that the differential effect associated with competing at home can be attributed to social pressure from the audience, while there is no difference in the effect associated with the audience cheering for their compatriots.

Interestingly, there is no evidence that the second moving contestant (j) is affected by social pressure or any type of home (dis-)advantage, as seen in Table 3. Columns (3) and (4) show the differential effect for non-starting contestants associated with competing in a venue near their hometown. In other words, competing in a friendly environment has no (negative) effect on the BIA for non-starting contestants. Therefore, there is no evidence for a general home (dis-)advantage, but there is evidence in line with social pressure for those starting the match. The F-statistics presented in the last line of Table 3 provide a joint significance test of all covariates in the BLP regression; they are statistically significant in columns (2), (4), and (6), each of which involves the home indicator.

By contrast, the finding of Harb-Wu and Krumer (2019) that the effect is driven by athletes in the top quartile of the ability distribution cannot be confirmed here.

Figure 3 displays the treatment effect for the starter associated with performing at home or not at home with respect to the 3-darts average, the most accurate measure of individual ability in darts. The effect associated with playing at home fluctuates around zero, while the effect for those playing away from home is always above zero. At the boundaries of the figure, i.e. for low and high 3-darts averages, there are fewer observations, so the estimates become imprecise. Replacing the 3-darts average with the previous position of the player in the ranking leads to similar conclusions; the results can be found in Fig. 6 in “Appendix 2.2”. Further, no significant differences are found in a more formal test that interacts ranking and 3-darts average with home in a best linear prediction, reported in Table 6 in “Appendix 2.2”.

Fig. 3 Differential built-in advantage by home and ability. Notes: The dark blue line represents the GATEs for contestants not home, the light blue line for contestants performing at home. The broken lines represent the 90% confidence intervals. (Color figure online)

5.4 Sensitivity checks

To assess whether the results depend on the estimation method, we additionally performed the analysis using a Modified Causal Forest (Lechner, 2018). For more details on the method and its implementation, the interested reader can refer to “Appendix 4”.

All estimated effects are in a range comparable to the estimates in the previous sections. In addition, the conclusions drawn hold in general. Table 4 shows the ATE estimate, as well as the GATE associated with performing at home, resulting from the Modified Causal Forest estimation. Further, in the lower part of the table, we repeated the estimation but accounted for contestant-specific clusters. For both estimations, we found a large differential effect associated with performing at home.

Table 4 Average and group average effects, MCF

For the ability measure, the 3-darts average, the pattern is in line with the findings from Sect. 5.3.1. In Fig. 7 in “Appendix 2.4” we observe lower GATEs for higher 3-darts averages of the starting contestant (i) and the opponent (j). In particular, the GATEs are similar for equal levels of ability. We can therefore conclude that the BIA is symmetric for equally skilled contestants. Note that, since the 3-darts average is discretized to form groups with roughly equal numbers of observations, we can see that there are fewer observations especially for low and high 3-darts averages. In conclusion, this conceptually different method produces similar results, providing some evidence that the results presented in the previous sections are robust.

To account for potential contestant-specific clusters, the best linear prediction results from Table 3 are replicated with cluster-based sampling for the nuisance estimation, as well as standard errors clustered at the individual level, in Table 8 in “Appendix 2.4”. While the standard errors are slightly higher, the conclusions drawn previously remain valid.

5.5 Discussion

A contest design in which one group systematically realizes differential BIAs is unfair. Especially when home contestants are affected, the competition design must be reconsidered and changed to obtain fairer and more interesting contests. One potential idea is to increase the length of the contest, such that the BIA decreases. While that looks like a neat idea in theory, we cannot provide clear evidence for it. Figure 4 shows the BIA by the maximum number of legs to play in a match.

Fig. 4 BIA by maximum number of legs. Notes: The GATE is shown as solid black line; the broken lines represent the 90% confidence intervals. The green line represents the ATE. (Color figure online)

The GATE fluctuates around the average effect, with no clear tendency towards a lower or no BIA for longer matches. A second option is to remove the technical advantage from the contest design by using an even maximum number of legs to play, as, for example, in tennis tournaments.Footnote 20 Further, Cohen-Zada et al. (2018) found that the order of serves in the tennis tiebreak, which follows the ABBA sequence, does not provide an advantage to either player. Adopting a similar structure might help to improve fairness in darts contests.
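The theoretical intuition that longer matches dilute the BIA can be illustrated with a stylized calculation. The sketch rests on strong assumptions that are not taken from the data: contestants are equally skilled, legs are independent, the contestant throwing first in a leg wins it with a fixed probability `p_leg_starter`, and the right to throw first alternates between legs, so that the first mover starts one more leg than the opponent if the match goes the full distance. The function name and the value 0.6 are hypothetical.

```python
from functools import lru_cache


def first_mover_win_prob(max_legs, p_leg_starter):
    """Match-winning probability of the contestant who starts the first leg.

    max_legs      : odd maximum number of legs (best-of-max_legs match)
    p_leg_starter : assumed probability that the player starting a leg wins it
    """
    if max_legs < 1 or max_legs % 2 == 0:
        raise ValueError("max_legs must be a positive odd number")
    if not 0.0 <= p_leg_starter <= 1.0:
        raise ValueError("p_leg_starter must lie in [0, 1]")
    legs_to_win = max_legs // 2 + 1

    @lru_cache(maxsize=None)
    def win_prob(won_a, won_b):
        # A is the first mover; legs are numbered from zero.
        if won_a == legs_to_win:
            return 1.0
        if won_b == legs_to_win:
            return 0.0
        a_starts = (won_a + won_b) % 2 == 0   # A starts legs 0, 2, 4, ...
        p_a = p_leg_starter if a_starts else 1.0 - p_leg_starter
        return (p_a * win_prob(won_a + 1, won_b)
                + (1.0 - p_a) * win_prob(won_a, won_b + 1))

    return win_prob(0, 0)


if __name__ == "__main__":
    for legs in (1, 3, 5, 11, 21):
        advantage = first_mover_win_prob(legs, 0.6) - 0.5
        print(legs, round(advantage, 4))
```

Under these assumptions the first mover's advantage over an even match shrinks as the maximum number of legs grows but never reaches zero for an odd number of legs. The empirical GATEs in Fig. 4 do not show such a pattern, which suggests that these stylized assumptions do not fully describe actual matches.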

6 Conclusion

The study found that the BIA in the sequential two-player darts contest raises the winning probability by about 8.6 percentage points on average. With the first-mover position randomly allocated among equally skilled contestants, the contest is fair for contestants with symmetric potential BIAs. While we found non-differential BIAs for equally skilled contestants, the personalized point estimates suggested substantial heterogeneity in the effect, ranging from about −5 to +15 percentage points, with athletes who have more experience and better performance measures benefiting least. In addition, the estimated differential effects were in line with the social pressure hypothesis found in previous works, especially in the behavioural economics literature. For the group of individuals competing in a friendly environment, the BIA was lower than for those competing on neutral grounds, which was in line with the works of Dohmen (2008) and Harb-Wu and Krumer (2019), among others. Contestants with equal abilities therefore do not necessarily have equal winning probabilities, which makes the contest unfair.

When designing contests, the heterogeneous nature of the BIA in such types of contests should be considered so as not to unintentionally favour specific groups of contestants. Especially if fairness is a precondition for a contest, the presented findings should be taken into account. Apart from fairness issues, local athletes might be important for the attractiveness of the contest. Introducing a fairer contest design may, therefore, improve the attractiveness as well. We did not find any evidence that an increase in the maximum length of a match leads to diminished BIAs, so the suggestion would be to change the design of the contest such that there is no BIA for any contestant. Furthermore, the findings are of interest to individuals performing in situations of increased importance, whether as contestants in tournaments or in daily life. When preparing for such situations, one should put more emphasis on mental training.

Another insight from this study is that Causal Machine Learning is highly valuable for empirical studies in operations research. It enables us to provide a more complete picture of the particular effects under investigation and to investigate specific hypotheses in a flexible way. Furthermore, the two steps of removing potential selection bias and estimating treatment effects offer an intuitive, flexible, and robust way to improve the credibility of empirical research.

Finally, we call for more fine-grained investigations of contest designs and in operations research in general. Since we only have data on male professional players, we encourage further studies to analyse how the BIA affects women, as well as youth and amateur players, in darts or other asymmetric contests. Furthermore, future studies could use the restrictions on audience presence owing to the current coronavirus pandemic to analyse the BIA without the direct influence of the audience. Methods from the Causal Machine Learning toolbox appear to be especially useful for this and should be considered for further empirical analyses.