# Caveats and pitfalls in crowdsourcing research: the case of soccer referee bias

## Abstract

In a recent crowdsourcing project, 29 teams analyzed the same data set to address the following question: “Are football (soccer) referees more likely to give red cards to players with dark skin tone than to players with light skin tone?” The major finding was that the results of the individual teams varied widely, from no effect to highly significant correlations between skin color and the rate of red cards, which some teams interpreted as indicative of a referee bias. We analyzed the same data using a Poisson log-linear regression model and obtained an odds ratio of 1.34 (95%-CI, 1.13–1.59), which means that players with a darker skin tone have in fact a slightly higher odds of receiving a red card. This result is in agreement with the median odds ratio of 1.31 from all 29 teams. We then extended the original study by investigating the likelihood of receiving yellow cards. If a referee bias was in fact present, it would be plausible to see a similar association. However, players with darker skin tone were significantly *less* likely to receive a yellow card, with an odds ratio of 0.94 (95%-CI, 0.91–0.97). The risk of receiving a card is most strongly affected by a player’s position, and there are significantly more players with darker skin tone at center back and defensive midfield where receiving red cards is generally more likely. Taken together, our results do not support the hypothesis of a referee bias. Our most important finding, however, is that the perceived diversity of results from the crowdsourcing teams is due to placing too much emphasis on dichotomous decisions (significant vs. nonsignificant). When we focus on point estimates and their reasonable bounds, the individual substudies predominantly reinforce each other. We argue that data scientists should put less emphasis on statistical significance and instead focus more on the careful interpretation of confidence intervals or alternative methods for measuring the effect size and its precision.

## Keywords

Crowdsourced research, Statistical significance, Confidence interval, Effect size estimation, Soccer referee bias

## 1 Introduction

Crowdsourcing research—the recruitment of a “crowd” of scientists for an online collaboration—is a relatively new approach to tackling interdisciplinary research projects. Crowdsourcing research is an interesting paradigm that offers several advantages over more conventional research practices; for example, it can reveal how conclusions depend on analytical choices made by different data scientists.

In a recent crowdsourcing project [28, 29], 29 teams analyzed the same data set to investigate the same question: “Are football (soccer) referees more likely to give red cards to players with darker skin tone than to players with lighter skin tone?” The individual teams analyzed the data independently using a variety of statistical techniques. After their initial analyses, the teams discussed their analytical choices, but did not disclose their preliminary findings. Since all teams analyzed the same data, they acted as peer-reviewers with an unusually high problem understanding. In the final result, 20 teams found a significant positive correlation between the number of red cards received and the players’ skin tone, whereas nine teams did not find any significant relation. The constructive comments on the analytical choices were expected to lead to a convergence of results [29]; however, a “disturbing” [28, p. 191] range of effect sizes was reported. How can this discrepancy be explained? The answer to this question has implications far beyond the scope of the present study, as similar discrepancies between collaborating teams might be observed in future crowdsourcing projects.

In this paper, we address two questions:

1. Is there a relation between a player’s skin tone and the number of cards (both yellow and red) that he received, and, provided that such a relation exists, is referee bias a plausible explanation?
2. How should significant and nonsignificant findings by different crowdsourcing teams be interpreted?

Table 1: Stratification of players with respect to skin tone and position, with percentage of received cards. Columns 0 to 1 give the percentage of players at each position with the respective skin tone rating; the columns "Yellow" and "Red" give the percentage of all yellow and red cards received at that position.

Position | 0 | 0.125 | 0.25 | 0.375 | 0.5 | 0.625 | 0.75 | 0.875 | 1 | Yellow | Red
---|---|---|---|---|---|---|---|---|---|---|---
Attacking midfielder | 27.6 | 18.9 | 32.3 | 5.5 | 9.4 | 1.6 | 3.9 | 0.0 | 0.8 | 7.7 | 4.5
Center back | 31.5 | 11.7 | 29.4 | 4.8 | 8.5 | 1.6 | 3.6 | 2.4 | 6.5 | 20.1 | 23.4
Center forward | 24.1 | 13.1 | 24.1 | 5.8 | 7.9 | 2.1 | 8.9 | 3.7 | 10.5 | 9.6 | 8.7
Center midfielder | 36.7 | 15.0 | 30.0 | 3.3 | 6.7 | 0.0 | 5.0 | 0.0 | 3.3 | 4.5 | 2.8
Defensive midfielder | 26.2 | 14.5 | 29.7 | 5.2 | 4.7 | 0.6 | 6.4 | 3.5 | 9.3 | 14.8 | 10.4
Goalkeeper | 32.9 | 19.9 | 34.2 | 4.3 | 2.5 | 0.6 | 1.2 | 2.5 | 1.9 | 3.5 | 11.1
Left fullback | 25.6 | 14.5 | 26.5 | 8.5 | 6.8 | 4.3 | 7.7 | 2.6 | 3.4 | 8.2 | 8.3
Left midfielder | 21.1 | 21.1 | 36.8 | 1.3 | 3.9 | 3.9 | 3.9 | 1.3 | 6.6 | 4.8 | 4.4
Left winger | 23.2 | 5.4 | 32.1 | 1.8 | 8.9 | 3.6 | 14.3 | 3.6 | 7.1 | 2.9 | 1.8
Right fullback | 26.6 | 11.0 | 28.4 | 8.3 | 8.3 | 2.8 | 9.2 | 0.9 | 4.6 | 8.3 | 5.7
Right midfielder | 31.7 | 13.3 | 30.0 | 1.7 | 13.3 | 0.0 | 6.7 | 1.7 | 1.7 | 3.7 | 2.0
Right winger | 8.9 | 19.6 | 23.2 | 7.1 | 12.5 | 5.4 | 8.9 | 5.4 | 8.9 | 2.8 | 3.0
Unknown | 21.7 | 10.5 | 32.2 | 2.0 | 7.9 | 3.3 | 11.8 | 3.9 | 6.6 | 9.3 | 13.9

In the present study, we obtained an odds ratio of 1.34 (95%-CI, 1.13–1.59). This means that, after controlling for position, players with a darker skin tone have in fact a slightly higher odds of receiving a red card. This result is in agreement with the median odds ratio of 1.31 from all 29 teams. But is referee bias a plausible explanation? We believe that this particular hypothesis has a very low prior probability, as professional referees have gone through extensive training and should be assumed to be fair. Thus, “referee bias” is an extraordinary claim, and in the words of the astrophysicist Carl Sagan, extraordinary claims require extraordinary evidence. We therefore investigated the relation between *yellow* cards and skin tone. Red cards are normally given because of a clear foul or serious misconduct and result in immediate dismissal from the pitch and possibly a suspension for one or more future games. By contrast, yellow cards are normally given as an official warning in more ambiguous cases, including unsporting behavior. A red card means not only a severe punishment for the player who receives the card, but it also represents a major (and possibly decisive) intervention in the game. Consequently, when showing a red card, a referee is under a much higher level of scrutiny from the public, coaches, and soccer associations. Our assumption was that if a referee bias was in fact present, then it would be plausible to see a similar relation between the number of yellow cards and skin tone. However, that was not the case. On the contrary, the odds of receiving a yellow card *decrease* with darker skin tone.

Regarding our second question, the central problem hinges on the difference between a significance test and a confidence interval [5]. If we focus on statistical significance, then we somehow have to reconcile nine nonsignificant with 20 significant findings by the crowdsourcing teams. Here, we show that the focus on significance gives a message that is *opposite* of the appropriate interpretation. By focusing on the effect size and the correct interpretation of overlapping confidence intervals, we see that the individual studies, overall, reinforce each other.

This paper is organized as follows. First, we describe the data pre-processing and the distributions of players with respect to skin tone, positions played, and cards received. We then investigate whether some referees tend to give disproportionally more cards to players with darker skin tone. In a regression analysis, we then predict the rate of cards using four different methods. We conclude the paper with a discussion of our main findings.

## 2 Materials and methods

### 2.1 Data pre-processing

We retrieved the raw data from the crowdsourcing project website at https://osf.io/47tnc/. This data set contains demographics from all players (\(N_1 = 2053\)) who played in the first male divisions of England, Germany, France, and Spain in the 2012–2013 season. The data set contains the number of red and yellow cards that each player received in his professional career, as well as data about the referees (\(N_2 = 3147\)) who issued the cards.

Based on the players’ pictures, two raters had assessed the players’ skin tone individually on a scale from 1 (very light skin tone) to 5 (very dark skin tone). These scores were normalized to [0, 1]. We first checked the inter-rater reliability using Pearson’s correlation coefficient. Given the strong inter-rater reliability (\(r = 0.92\)), we decided to average the scores to obtain the predictor variable “skin tone” for the regression analysis. Averaging the ratings led to nine different scores: 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, and 1, where 0 indicates very light and 1 indicates very dark skin tone.
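The following Python sketch illustrates why averaging two normalized ratings yields exactly nine possible scores. It assumes a linear min-max mapping of the 1 to 5 scale onto [0, 1]; the text only states that the scores were normalized, so the exact formula and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def normalize(rating: int) -> Fraction:
    """Map a rating on the 1..5 scale linearly onto [0, 1] (assumed mapping)."""
    return Fraction(rating - 1, 4)


def averaged_score(rating_a: int, rating_b: int) -> float:
    """Average the two raters' normalized scores for one player."""
    return float((normalize(rating_a) + normalize(rating_b)) / 2)


# Enumerate every combination of two ratings and collect the distinct averages.
possible_scores = sorted(
    {averaged_score(a, b) for a, b in product(range(1, 6), repeat=2)}
)
print(possible_scores)
```

Each normalized rating is a multiple of 0.25, so the average of two ratings is a multiple of 0.125, which gives the nine values listed above.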

### 2.2 Distribution of players with respect to skin tone and position

We first investigated the distribution of players with respect to skin tone and position. For example, 0.8% of all players at position attacking midfielder have skin tone rating 1, and 23.4% of all red cards were given at position center back (Table 1). There is a significant correlation between the players’ skin tone and the position played (\(P = 0.02\), Kruskal–Wallis test). For example, consider the positions attacking midfielder and right midfielder versus the positions center back and defensive midfielder. At attacking midfielder and right midfielder, 6.4% of players have a darker skin tone (0.75 and above), whereas more than twice that many players (15.2%) at center back and defensive midfielder have this skin tone. The ensuing question is, are cards more frequent at center back and defensive midfielder?

### 2.3 Distribution of red cards with respect to position

The percentage of red cards received depends strongly on a player’s position. For example, 6.51% of all red cards were received at the positions attacking midfielder and right midfielder, whereas 33.79% of all red cards were received at center back and defensive midfielder. To address the problem of multiple testing, we performed all pair-wise comparisons between positions using the Marascuilo procedure (\(\alpha = 0.05\)). We observed that players at center back received significantly more red cards (23.4%) than players at left winger (1.8%) and right midfielder (2.0%). Relative to games played, players at center back and defensive midfielder received significantly more red cards than players at attacking midfielder and right midfielder: 591 red cards in 105,488 games (0.56%) versus 116 red cards in 45,974 games (0.25%) (\(P = 3.76 \times 10^{-12}\), two-sample test of proportion; Holm–Bonferroni correction for multiple testing). Interestingly, we make the same observation when we consider only players with very light skin tone: they received significantly more red cards at center back and defensive midfielder than at attacking midfielder and right midfielder (\(P = 0.025\)). Thus, the positions center back and defensive midfielder—positions with significantly more players with darker skin tone—involve a higher risk of receiving a red card.
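As an illustration of the comparison relative to games played, the sketch below applies a pooled two-sample z-test of proportions to the red card counts quoted above. This is a plain textbook version written in Python; it applies no continuity correction and no Holm–Bonferroni adjustment, so it should not be expected to reproduce the reported *P* value exactly.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of two proportions (two-sided)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p1, p2, z, p_value


# Red cards per game: center back and defensive midfielder versus
# attacking midfielder and right midfielder (counts from the text).
p_cb_dm, p_am_rm, z, p_value = two_proportion_z_test(591, 105_488, 116, 45_974)
print(f"{p_cb_dm:.2%} vs {p_am_rm:.2%}, z = {z:.2f}, p = {p_value:.2e}")
```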

### 2.4 Distribution of yellow cards with respect to position

Relative to games played, players at center back and defensive midfielder received significantly more yellow cards than players at attacking midfielder and right midfielder: 4846 yellow cards in 105,488 games (4.6%) versus 5750 yellow cards in 45,974 games (12.51%) (\(P < 9.44 \times 10^{-13}\), two-sample test of proportion; Holm–Bonferroni correction for multiple testing). Again, when we consider only players with light skin tone, they also received more yellow cards at center back and defensive midfielder than at attacking midfielder and right midfielder (16.2 vs. 12.8%, \(P < 9.44 \times 10^{-13}\), two-sample test of proportion; Holm–Bonferroni correction for multiple testing). We performed all pair-wise comparisons between positions using the Marascuilo procedure (\(\alpha = 0.05\)), but no pair-wise comparison was significant.

The conclusion is that position strongly influences the number of cards. We also checked whether soccer club and league country are potential confounders, but we failed to see any significant association between these attributes and cards given. Furthermore, the variable “club” is not static, as it is not uncommon for a player to change his club. Therefore, we controlled only for the variable “position” in a Poisson regression model (Sect. 2.6).

### 2.5 Trend analysis of cards received

Are there any referees who tend to give more cards to players with darker skin tone than to players with lighter skin tone? To answer this question, we proceeded as follows. From our preprocessed data set containing a total of 2978 referees, we removed all referees who never showed a card (\(n = 1056\)). For the remaining 1922 referees, we counted how many yellow or red cards each referee showed to players stratified based on skin tone. Then, we performed a \(\chi ^2\)-test for a trend in proportions. Increasingly higher proportions of cards for players with increasingly darker skin tone could point to a referee bias. For the proportions of red cards, we found 29 significant trends (\(P < 0.05\), no corrections for multiple testing). However, these results must be interpreted cautiously, as the median number of red cards is only 1. For example, referee #2426 showed only one red card, and the player has the darkest skin tone. Clearly, this number is too small to say anything about a trend. We observed the clearest positive trend (\(P = 0.026\)) for referee #1852 who showed eight red cards in total (Fig. 1a). For example, this referee encountered 155 players with skin tone 0 and showed them 2 red cards, whereas he encountered 13 players of skin tone 0.75 and also showed them 2 red cards. If we assumed this referee is biased, then we would expect to observe a similar trend for the proportions in yellow cards. However, this trend is not obvious (Fig. 1b), and the bias is therefore questionable.

Yellow cards are given far more frequently than red cards and are therefore a better indicator of a potential referee bias. For 1922 referees, we observed 217 significant trends in the proportions of yellow cards (\(P < 0.05\), no corrections for multiple testing). Among these significant trends, 107 have a negative slope, which could point to a decrease in proportions for increasingly darker skin tone. Thus, among 1922 referees, only 217 (11%) are associated with a significant trend in proportions of yellow cards, and about half of these significant trends suggest that yellow cards are more frequently given to players with lighter skin tone.
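A test for a trend in proportions can be sketched as follows. The code implements the Cochran–Armitage form of the \(\chi ^2\)-test with one degree of freedom, which we assume corresponds to the test used here; the counts for the single referee in the example are invented for illustration and are not taken from the data set.

```python
import math


def trend_in_proportions(events, totals, scores):
    """Cochran-Armitage chi-square test (1 df) for a linear trend in proportions."""
    n = sum(totals)
    p = sum(events) / n  # overall proportion of cards
    # Score-weighted deviation of observed from expected card counts;
    # a positive value means proportions rise with the score.
    t = sum(s * (x - m * p) for s, x, m in zip(scores, events, totals))
    var = p * (1 - p) * (
        sum(m * s * s for s, m in zip(scores, totals))
        - sum(m * s for s, m in zip(scores, totals)) ** 2 / n
    )
    chi2 = t * t / var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value, t


# Hypothetical referee: players encountered and yellow cards shown per stratum.
scores = [0.0, 0.25, 0.5, 0.75]  # skin tone strata
totals = [200, 150, 100, 50]     # players encountered
events = [20, 18, 15, 10]        # yellow cards shown
chi2, p_value, direction = trend_in_proportions(events, totals, scores)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, positive trend: {direction > 0}")
```

The sign of the score-weighted deviation distinguishes positive from negative trends, which is the distinction drawn above between the 217 significant trends.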

### 2.6 Poisson log-linear regression model

For a response variable *Y* following a Poisson distribution with mean \(\mu \), the expected count is \(E(Y)= \mu \). Let \(X = (X_1, X_2, ..., X_n)\) denote the *n* explanatory variables and let \(\boldsymbol{\beta }= (\beta _0, \beta _1, \beta _2, ..., \beta _n)\) be the regression parameters. Here, we have to consider rate data, as the number of cards depends on the number of games played; obviously, the more games a person plays, the higher the chances of receiving a card. The Poisson regression model for the expected rate of the occurrence of an event is given by Equation 1,

\[ \log \left( \frac{\mu }{t}\right) = \beta _0 + \beta _1 X_1 + \beta _2 X_2 + \cdots + \beta _n X_n, \tag{1} \]

where *t* is the number of games, and \(\log (t)\) denotes the offset. The expected value of the response variable *Y* is then given by Equation 2,

\[ \mu = t \exp \left( \beta _0 + \beta _1 X_1 + \beta _2 X_2 + \cdots + \beta _n X_n\right) . \tag{2} \]
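To make the role of the offset concrete, the following Python sketch fits a Poisson rate model of the form \(\log (\mu ) = \log (t) + \beta _0 + \beta _1 X_1\) with a single predictor by Newton–Raphson. It is a minimal illustration, not the analysis code of this study (which relied on R's glm()); the data are synthetic, noise-free expected counts generated from assumed parameters, so the fit should simply recover them.

```python
import math


def fit_poisson_rate(x, y, t, iterations=25):
    """Fit log(mu_i) = log(t_i) + b0 + b1 * x_i by Newton-Raphson."""
    b0 = math.log(sum(y) / sum(t))  # start from the overall rate per game
    b1 = 0.0
    for _ in range(iterations):
        mu = [ti * math.exp(b0 + b1 * xi) for xi, ti in zip(x, t)]
        # Score vector: gradient of the Poisson log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} u
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


# Synthetic data (assumed values): skin tone scores, games played, and
# expected card counts for a base rate of 0.004 and a rate ratio of 1.34.
skin_tone = [0.0, 0.25, 0.5, 0.75, 1.0]
games = [400, 300, 350, 200, 250]
true_b0, true_b1 = math.log(0.004), math.log(1.34)
cards = [g * math.exp(true_b0 + true_b1 * s) for s, g in zip(skin_tone, games)]

b0_hat, b1_hat = fit_poisson_rate(skin_tone, cards, games)
print(f"exp(b1) = {math.exp(b1_hat):.3f}")
```

Exponentiating the fitted coefficient gives the multiplicative change in the expected rate of cards per game for a one-unit increase in the predictor.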

### 2.7 Methods used to predict the rates of cards

We predicted the rate of cards using the offset log(games) and two predictor variables: “position” and “skin tone”. We used four different regression techniques: (1) Poisson log-linear model (Eq. 2); (2) a binary regression tree [7]; (3) a random forest [6] with 50 trees and a minimum terminal node size of 3 (sampling without replacement); and (4) a deep neural network with three hidden layers of 50 nodes each, trained with backpropagation and maxout [14] (500 epochs; all other parameters with default settings [1]). As a baseline model, we included a null model that predicts 0 for all instances, irrespective of the predictor variables. For the Poisson regression model, we used the R function glm() of the package stats [22]. The regression tree and random forest were implemented with tree() [23] and randomForest() [18], respectively. The deep neural network was implemented with h2o.deeplearning() of the package h2o [1]. The performance measure was the mean squared error (MSE) in leave-one-out cross-validation (LOOCV). All analyses were carried out in the R environment [22].

## 3 Results

### 3.1 Likelihood of receiving a red card

For the predictor variable skin tone, we obtained an odds ratio of 1.34 (95%-CI, 1.13–1.59), which means that for a one-unit increase in skin tone, we expect a 34% increase in the odds of receiving a red card. Thus, a darker skin tone is associated with an increased odds of receiving a red card. Position, however, has a far stronger influence on the odds. Particularly, if the position is center back, then the odds ratio is 2.56 (95%-CI, 2.02–3.28).

Next, we investigated whether our missing value imputation could have had an effect on the results. In the data set, there was no information given for the position of 152 (9.6%) players. We deleted these cases and carried out the regression analysis again; however, the effect was negligible: the odds ratio decreased from 1.34 to 1.32 (95%-CI, 1.11–1.57).

Finally, we checked whether outliers could have biased the results. We considered a player an outlier if his red card count (normalized by the number of games) was more than three times the interquartile range above the 75th percentile. In total, 22 players were identified as outliers and excluded from further analysis. Again, the change was negligible: the odds ratio decreased from 1.34 to 1.33 (95%-CI, 1.12–1.58).
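The outlier rule can be sketched in Python as follows. The per-game red card rates are hypothetical, and the quantile method (here the "inclusive" method of the standard library) is an assumption, since different software packages compute quartiles slightly differently.

```python
import statistics


def upper_outlier_threshold(rates):
    """Return Q3 + 3 * IQR, the cut-off above which a rate counts as an outlier."""
    q1, _, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return q3 + 3 * (q3 - q1)


# Hypothetical red cards per game for nine players.
rates = [0.0, 0.0, 0.002, 0.004, 0.005, 0.006, 0.008, 0.01, 0.09]
threshold = upper_outlier_threshold(rates)
outliers = [r for r in rates if r > threshold]
print(f"threshold = {threshold:.3f}, outliers = {outliers}")
```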

### 3.2 Likelihood of receiving a yellow card

For the predictor variable skin tone, we obtained an odds ratio of 0.94 (95%-CI, 0.91–0.97), which means that the odds of receiving a yellow card *decrease* with darker skin tone. The odds ratio is lowest for position goalkeeper (OR \(= 0.37\), 95%-CI, 0.35–0.39) and highest for a defensive midfielder (OR \(= 1.54\), 95%-CI, 1.45–1.60).

Table 2: MSE from LOOCV for generalized linear model (GLM), regression tree (RT), random forest (RF), deep neural network (DNN), and the null model

Cards | GLM | RT | RF | DNN | Null model
---|---|---|---|---|---
Red cards | \(5.59 \times 10^{-5}\) | \(6.05 \times 10^{-5}\) | \(5.79 \times 10^{-5}\) | \(5.71 \times 10^{-5}\) | \(7.85 \times 10^{-5}\)
Yellow cards | \(4.66\times 10^{-3}\) | \(4.64\times 10^{-3}\) | \(4.69\times 10^{-3}\) | \(5.68\times 10^{-3}\) | \(24.32 \times 10^{-3}\)

### 3.3 Predicting rates of cards

Table 2 shows the mean squared error (MSE) of the regression models in leave-one-out cross-validation (LOOCV). The Poisson regression model achieved the lowest MSE for the prediction of the rate of red cards, whereas for yellow cards, the regression trees performed slightly better. Overall, the differences between the models are very small. All models outperformed the null model.

To investigate the significance of the results, we used a random permutation test, which is a nonparametric test involving a Monte Carlo procedure [4, 27]. In short, the test statistic is first calculated based on the original, unpermuted data set. Then, the values of the covariates are randomly permuted many times, and each time, the test statistic is calculated again. This procedure generates the empirical distribution of the statistic under the null hypothesis of no association between the covariates and the target (or response) variable. Finally, the test statistic resulting from the unpermuted data set is compared to the empirical distribution, so that an empirical *p* value can be computed. In general, random permutation tests make no assumptions about the underlying distribution of the data or the correlation structure of the covariates. Such tests are particularly useful when parametric tests are not available or not suitable, for example, when their assumptions are violated. Here, we randomly permuted the variables “games” and “red cards” and performed LOOCV with the Poisson regression model. This procedure was repeated 1000 times to generate the empirical distribution of MSE under the null hypothesis of no association between the predictor variables (“position” and “skin tone”) and the outcome (“rate of cards”). The empirical distributions of MSE under the null hypothesis for red and yellow cards are shown in Fig. 2. The MSE resulting from the unpermuted data (blue vertical lines, Fig. 2) is significant, which indicates that position and skin tone are predictive of the rate of cards.
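The logic of a random permutation test can be sketched in a few lines of Python. To keep the example short, the test statistic is the in-sample MSE of a model that predicts each player's card rate by the mean rate of his position, rather than the LOOCV error of the Poisson model used in this study; the data, the function names, and the random seed are illustrative assumptions.

```python
import random


def group_mean_mse(groups, rates):
    """MSE when each rate is predicted by the mean rate of its group."""
    sums, counts = {}, {}
    for g, r in zip(groups, rates):
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    errors = [(r - sums[g] / counts[g]) ** 2 for g, r in zip(groups, rates)]
    return sum(errors) / len(errors)


def permutation_p_value(groups, rates, n_permutations=1000, seed=0):
    """Empirical p value for the MSE under the null of no association."""
    rng = random.Random(seed)
    observed = group_mean_mse(groups, rates)
    shuffled = list(rates)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between position and rate
        if group_mean_mse(groups, shuffled) <= observed:
            at_least_as_good += 1
    # The add-one correction keeps the empirical p value strictly positive.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical card rates for two positions.
positions = ["center back"] * 6 + ["left winger"] * 6
card_rates = [0.010, 0.012, 0.011, 0.013, 0.009, 0.012,
              0.002, 0.003, 0.001, 0.002, 0.004, 0.003]
observed_mse, p_value = permutation_p_value(positions, card_rates)
print(f"observed MSE = {observed_mse:.2e}, empirical p = {p_value:.4f}")
```

A small empirical *p* value indicates that an MSE as low as the observed one rarely arises when the association between predictor and outcome is destroyed by shuffling.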

Which variable is more important, “position” or “skin tone”? To answer this question, we performed LOOCV again, but using only one of these variables at a time. For red cards, the model achieved \(\mathrm {MSE_{glm}} = 5.64 \times 10^{-5}\) using “position” only. Using “skin tone” only, the model achieved \(\mathrm {MSE_{glm}} = 5.76 \times 10^{-5}\). Using both variables, \(\mathrm {MSE_{glm}} = 5.59 \times 10^{-5}\) (Table 2). For yellow cards, the model achieved \(\mathrm {MSE_{glm}} = 4.67\times 10^{-3}\) using “position” only, whereas it achieved \(\mathrm {MSE_{glm}} = 6.12\times 10^{-3}\) using “skin tone” only. Using both variables, \(\mathrm {MSE_{glm}} = 4.66\times 10^{-3}\) (Table 2). Thus, most information is contained in the variable “position”, whereas “skin tone” adds only little to the predictive performance.

Interestingly, the performance of the regression tree and random forest slightly deteriorated when “skin tone” was included. When we used “position” only for the prediction of red card rates, we obtained \(\mathrm {MSE_{rt}} = 5.75 \times 10^{-5}\) and \(\mathrm {MSE_{rf}} = 5.65 \times 10^{-5}\). For yellow cards, we obtained \(\mathrm {MSE_{rt}} = \mathrm {MSE_{rf}} = 4.64\times 10^{-3}\).

## 4 Discussion

Are football (soccer) referees more likely to give red cards to players with dark skin tone than to players with light skin tone? Our result (OR \(= 1.34\); 95%-CI, 1.13–1.59) indicates a positive association between a player’s skin tone and the rate of red cards received. This result is in agreement with the median OR \(= 1.31\) of the 29 teams from the crowdsourcing project (cf. Fig. 1, p. 20 in [29]). However, if we assumed that our observation points to a referee bias, then it would be plausible to see a similar association between skin tone and the rate of yellow cards. In fact, as team #17 pointed out [19], red cards are often given in clear-cut cases, whereas yellow cards tend to be given in more ambiguous situations that leave more room to the referee’s judgement and thereby could allow a possible unfairness to manifest itself. Poisson regression analysis revealed that players with darker skin tone were significantly *less likely* to receive a yellow card (OR = 0.94; 95%-CI, 0.91–0.97). Furthermore, the trend analysis of the proportions of cards for players of different skin tone did not provide evidence in favor of referee bias. Therefore, referee bias is not a convincing explanation for the observed effect.

Importantly, note that the data set allows us to investigate only associations; it does not allow us to draw any conclusions about cause-and-effect relations. This limitation is also clearly stated in the original study [29]. The hypothesis of a referee bias is only one of many possible hypotheses, and we believe that it is one with a very small prior probability. When we assessed the predictive performance of the Poisson model in LOOCV and applied a random permutation test, we observed that most information is contained in the variable “position.” Thus, the unspectacular finding is that, among the variables considered in this study, a player’s position has the strongest influence on his likelihood of receiving a red card, a finding that certainly does not come as a surprise to soccer aficionados.

Odds ratios are notoriously difficult to interpret [20]. When expressed in natural language, the sentences often become cumbersome. For example, the odds ratio of 1.34 means the following: for every player with darker skin tone not receiving a red card, 1.34 times as many players with darker skin tone received a red card than the number of players with lighter skin tone receiving a red card for every player with lighter skin tone not receiving a red card. This convoluted sentence does not mean that players with darker skin tone are 1.34 times more likely than players with lighter skin tone to receive a red card. That interpretation would refer to the relative risk (RR). If the response variable is a very rare event, as in this study, then the relative risk differs only slightly from the odds ratio; still, an odds ratio greater than 1 always overestimates the relative risk. Even if we assume that players with darker skin tone have a 34% increased risk of receiving a red card, we need to take into account that receiving a red card is a very rare event, so the absolute increase in risk is still small. Recall the “1995 pill scare”, which associated a new generation of contraceptive pills with a doubled risk (i.e., an increase of 100% in relative risk) of a potentially fatal side-effect, whereas the increase in absolute risk was only from \(\frac{1}{7000}\) to \(\frac{2}{7000}\) [13].
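The relation between odds ratio, relative risk, and absolute risk can be checked with a short calculation. The Python sketch below uses the standard conversion \(\mathrm {RR} = \mathrm {OR} / (1 - p_0 + p_0 \cdot \mathrm {OR})\), where \(p_0\) is the baseline risk. The baseline values are assumptions chosen for illustration: 0.25% loosely mirrors the per-game red card rates in Sect. 2.3, and 30% stands in for a common event.

```python
from fractions import Fraction


def relative_risk_from_odds_ratio(odds_ratio, baseline_risk):
    """Convert an odds ratio to a relative risk for a given baseline risk."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


# Rare event: the relative risk stays close to the odds ratio of 1.34.
rr_rare = relative_risk_from_odds_ratio(1.34, 0.0025)
# Common event (assumed baseline): the odds ratio clearly overstates the risk.
rr_common = relative_risk_from_odds_ratio(1.34, 0.30)
print(f"RR (rare) = {rr_rare:.4f}, RR (common) = {rr_common:.4f}")

# Pill scare: the relative risk doubles, yet the absolute increase is tiny.
risk_before, risk_after = Fraction(1, 7000), Fraction(2, 7000)
print(f"relative risk = {risk_after / risk_before}, "
      f"absolute increase = {risk_after - risk_before}")
```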

The *p* value is the probability of observing data as extreme as, or more extreme than, the actual data at hand, given that the null hypothesis is true. Here, the null hypothesis is that skin tone has no effect on the response variable. Under this hypothesis, the probability of obtaining an odds ratio as extreme as (or more extreme than) the observed one is 0.00056. There exists a vast body of literature showing that the *p* value is not an evidential measure for or against a hypothesis because it does not say anything about an alternative hypothesis; see, for example, [15]. Rare events happen all the time without being interpreted as evidence against a null hypothesis. For example, suppose that we observe the numbers 1 and 13 in a row in a roulette wheel. Under the null hypothesis of a fair wheel, the probability (not the *p* value) of this event is \((\frac{1}{38})^2 = 0.00069\). But we would certainly not interpret it as evidence that the wheel is unfair. What is missing in this example and the present study is the probability of the data under an alternative hypothesis, which would then enable us to calculate an evidential measure: the likelihood ratio. But an alternative hypothesis has nowhere been stated, let alone tested. Many alternative hypotheses could be conceived, and referee bias is only one of them. Thus, the *p* value of 0.00056 should not be given too much weight; it should rather be interpreted as a “crude indicator that something surprising is going on” [2, p.329]. But note that this “something surprising” could also be a problem with the model specification or data collection.

There are two further commonly encountered claims of referee bias in soccer, which are scope for future work. First, when a top team plays a lesser team, the lesser team might argue that the referee is biased against them. Potentially, such bias (if it exists) could, to some extent, be due to the enormous media presence of players and representatives of the top team. Presumably, it might be harder for referees to decide against such famous players. Second, another common claim is that referees are biased against the away team. Such bias (if it exists) could be explained by the large number of home team supporters in the stadium. Presumably, it might be harder for a referee to decide against the home team in front of a large supporter crowd.

The major finding of the original study was that the results of the 29 teams varied widely, with possible conclusions ranging from no bias in referee decisions to a huge bias; an outcome that was described as “disturbing” [28, p. 191]. However, the 29 teams used different statistical tools and different data pre-processing approaches, so differences in the estimated effect size and its precision are not at all surprising. On the contrary, they are to be expected. Dozens of data mining challenges, such as the annual KDD Cup, showed that when the same data set is analyzed by different researchers, a wide variety of analytical approaches (and results) is the norm, not the exception. Nonetheless, the confidence intervals from 27 of 29 teams largely overlap. In fact, the point estimates of 20 teams are actually remarkably close. Their results can therefore, overall, be considered confirmatory, not contradictory.

We believe that the key problem is the focus on statistical significance. A confidence interval is often interpreted as a mere significance test: if the interval does not include the null value (here, OR \(= 1\)), then the result is significant; if not, not. But this interpretation reduces the result of an individual study to a dichotomous outcome. The statistical literature is replete with examples illustrating the problems of significance tests [5, 10, 11, 12, 26, 31]. For example, Rothman *et al.* discuss clinical trials on the effect of the drug flutamide for the treatment of advanced prostate cancer [24]. Based on the results of ten studies, the drug seemed to be associated with a small beneficial effect, with a summary odds ratio of 0.88 (95%-CI, 0.76–1.02). A new study, however, reported an odds ratio of 0.87 (95%-CI, 0.70–1.10) and a nonsignificant *p* value, leading to the conclusion that flutamide has no beneficial effect. Thus, the new study was interpreted as refuting the earlier studies. But in fact, the confidence interval for the effect size suggests that new data are readily compatible with a small beneficial effect.

But isn’t there a contradiction between two confidence intervals when one interval includes the null value, whereas the other one does not? To answer that question, let us consider the two 95%-CIs shown in Fig. 3a. We assume that two independent studies produced these intervals for the same effect. When this effect refers to a relative risk or odds ratio, the null (or “nil”) value of no effect is \(\delta _0 = 1\). In this example, the wider interval includes the null value, but the narrower interval does not. Interpreted in terms of significance, the result might seem inconclusive: nonsignificant versus significant. But the data from which the wider interval was constructed are readily compatible with large effect sizes. Specifically, note that the (rather large) effect size of \(\delta _1 = 7\) is as compatible with the data as the nil value. Suppose that we carry out a significance test for \(\mathrm {H}_0: \delta = 1\) and obtain a *p* value of, say, \(P_0 = 0.08\). For \(\mathrm {H}_1: \delta = 7\), we would obtain exactly the same *p* value, \(P_1 = 0.08\), assuming that the interval is symmetric around the point estimate.
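The symmetry argument can be verified numerically. The Python sketch below computes two-sided Wald-type *p* values for two hypothesized effect sizes that lie equally far from the point estimate. The point estimate of 4 and the standard error are hypothetical values, chosen so that the *p* value for \(\delta = 1\) is approximately 0.08; like the argument above, the sketch assumes an interval that is symmetric around the point estimate on the scale shown.

```python
import math


def wald_p_value(estimate, standard_error, hypothesized):
    """Two-sided p value for a hypothesized effect, assuming normality."""
    z = (estimate - hypothesized) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical study: point estimate 4, standard error chosen for p of about 0.08.
estimate, standard_error = 4.0, 1.7136
lower = estimate - 1.96 * standard_error
upper = estimate + 1.96 * standard_error

p_nil = wald_p_value(estimate, standard_error, 1.0)    # H0: delta = 1
p_large = wald_p_value(estimate, standard_error, 7.0)  # H1: delta = 7
print(f"95%-CI = ({lower:.2f}, {upper:.2f}), "
      f"p(delta=1) = {p_nil:.3f}, p(delta=7) = {p_large:.3f}")
```

Both hypothesized values lie inside the interval and yield the same *p* value, so a "nonsignificant" result for \(\delta = 1\) is no more evidence for the nil value than for an effect size of 7.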

Significance testing gives undue emphasis to the nil hypothesis of no effect. Confidence intervals, on the other hand, enable us to judge how compatible the data are with various hypotheses. Importantly, compatibility is not an all-or-nothing decision. Assuming that there were no strong biases or other serious problems with the two studies, the appropriate interpretation of Fig. 3a is that the narrower interval reflects a higher precision (perhaps due to a larger sample size) than the wider interval. Both intervals point to a reasonably large effect, thereby reinforcing each other. Taken together, the results point to an effect size of approximately 4. Whether this magnitude is relevant remains to be discussed in the context of the concrete investigation. This is where the informed judgement of the researcher is needed.

Figure 3b illustrates that a conclusion based on statistical significance can be the opposite of the appropriate interpretation. This example is inspired by a hypothetical study described in [24]. We assume again that the statistical models used to construct the intervals are correct in both studies. The first study (with the wider interval) points to a relatively large effect size, whereas the second study (with the narrower interval) does not. The extreme narrowness of the interval reflects a very high precision due to a very large sample size. Both the lower and upper bound are very close to the null value, which is evidence for the absence of any strong effect. Still, the null value is not included, so the result is statistically significant. By contrast, despite the lack of significance, the wider interval provides evidence for a reasonably large effect size. Note that the discussed examples illustrate a fundamental problem that data scientists face. For a recent analysis of the problems of null hypothesis significance testing and further examples, see [3].
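A small numerical sketch may help to see how such a pattern can arise. The 2x2 counts below are entirely hypothetical and were chosen only to mimic the situation in Fig. 3b. The interval is the usual Wald (Woolf) interval for the odds ratio, computed on the log scale, which is an assumption on our part and not necessarily the method behind the figure.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a, b: events / non-events in the exposed group
    c, d: events / non-events in the unexposed group
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    or_hat = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return or_hat, exp(log(or_hat) - z * se), exp(log(or_hat) + z * se)


# Hypothetical counts, chosen only to mimic the pattern in Fig. 3b.
small_study = odds_ratio_ci(12, 8, 6, 14)
large_study = odds_ratio_ci(52_000, 948_000, 50_000, 950_000)

for name, (or_hat, lo, hi) in (("small", small_study), ("large", large_study)):
    verdict = "significant" if lo > 1 or hi < 1 else "nonsignificant"
    print(f"{name}: OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f}), {verdict}")
```

With these counts, the small study is nonsignificant although its interval is compatible with large effects, whereas the very large study is significant although both of its bounds lie close to the null value.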

It has been argued that statistical significance testing should be abandoned [8, 16, 26]. Our stance, however, is that significance tests can have a role to play. Indeed, in the present study, we used several statistical tests, but we did not base our overall evaluation on statistical significance (or lack thereof). Our conclusion is therefore not a new one. In fact, it has been echoed in various forms for at least six decades by numerous scientists, e.g., [9, 11, 17, 21, 25, 30, 32]. What is needed is a change of the current research practice. Data scientists should not concern themselves so much with statistical significance (i.e., whether a confidence interval includes the null value or not) but focus more on the appropriate interpretation of confidence intervals or alternative methods for measuring the effect size and its precision.

## Notes

### Acknowledgements

We are grateful to Alfons Schuster and the anonymous reviewers for their valuable comments on our manuscript.

## References

- 1. Aiello, S., Kraljevic, T., Maj, P.: With contributions from the H\(_2\)O.ai team: h2o: R Interface for H\(_2\)O (2015). https://CRAN.R-project.org/package=h2o. R package version 3.6.0.8, Accessed 3 Nov 2016
- 2. Berger, J., Delampady, M.: Testing precise hypotheses. Stat. Sci. **2**(3), 317–352 (1987)
- 3. Berrar, D.: Confidence curves: An alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn. 1–39 (2016). doi: 10.1007/s10994-016-5612-6
- 4. Berrar, D., Granzow, M., Dubitzky, W.: Introduction to genomic and proteomic data analysis. In: Dubitzky, W., Granzow, M., Berrar, D. (eds.) Fundamentals of Data Mining in Genomics and Proteomics, pp. 1–37. Springer, Berlin (2007)
- 5. Berrar, D., Lozano, J.: Significance tests or confidence intervals: which are preferable for the comparison of classifiers? Journal of Experimental and Theoretical Artificial Intelligence **25**(2), 189–206 (2013)
- 6. Breiman, L.: Random forests. Mach. Learn. **45**(1), 5–32 (2001)
- 7. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman and Hall, London (1984)
- 8. Carver, R.: The case against statistical significance testing. Harv. Educ. Rev. **48**(3), 378–399 (1978)
- 9. Cohen, J.: Things I have learned (so far). Am. Psychol. **45**(12), 1304–1312 (1990)
- 10. Cohen, J.: The earth is round (\(p <\).05). Am. Psychol. **49**(12), 997–1003 (1994)
- 11. Cox, D.: The role of significance tests. Scand. J. Stat. **4**(2), 49–70 (1977)
- 12. Cummings, G.: Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, Taylor & Francis Group, New York/London (2012)
- 13. Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L., Woloshin, S.: Helping doctors and patients to make sense of health statistics. Psychol. Sci. Public Interest **8**(2), 53–96 (2008)
- 14. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: ICML (3), *JMLR Proceedings*, vol. 28, pp. 1319–1327. JMLR.org (2013)
- 15. Goodman, S., Royall, R.: Evidence and scientific research. Am. J. Public Health **78**(12), 1568–1574 (1988)
- 16. Guttman, L.: The illogic of statistical inference for cumulative science. Appl. Stoch. Models Data Anal. **1**, 3–10 (1985)
- 17. Hubbard, R., Lindsay, R.: Why \(p\) values are not a useful measure of evidence in statistical significance testing. Theory Psychol. **18**(1), 69–88 (2008)
- 18. Liaw, A., Wiener, M.: Classification and regression by randomforest. R News **2**(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/
- 19. Morey, R., Wagenmakers, E.: Seven outliers produce the false impression of skin tone bias in soccer referees: A Bayesian logistic regression analysis (2015). https://osf.io/rkieb. Accessed 3 Nov 2016
- 20. Osborne, J.: Bringing balance and technical accuracy to reporting odds ratios and the results of logistic regression analyses. Pract. Assess. Res. Eval. **11**(7) (2006). http://pareonline.net/getvn.asp?v=11&n=7. Accessed 3 Nov 2016
- 21. Poole, C.: Low \(p\)-values or narrow confidence intervals: which are more durable? Epidemiology **12**(3), 291–294 (2001)
- 22. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2015). https://www.R-project.org/
- 23. Ripley, B.: Tree: Classification and Regression Trees (2016). https://CRAN.R-project.org/package=tree. R package version 1.0-37. Accessed 3 Nov 2016
- 24. Rothman, K., Greenland, S., Lash, T.: Modern Epidemiology, 3rd edn. Wolters Kluwer, Alphen aan den Rijn (2008)
- 25. Rozeboom, W.: The fallacy of the null hypothesis significance test. Psychol. Bull. **57**, 416–428 (1960)
- 26. Schmidt, F.: Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods **1**(2), 115–129 (1996)
- 27. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman and Hall, London/New York (2007)
- 28. Silberzahn, R., Uhlmann, E.: Many hands make tight work. Nature **526**, 189–191 (2015)
- 29. Silberzahn, R., Uhlmann, E., Martin, D., Anselmi, P., Aust, F., Awtrey, E., Bahník, S., Bai, F., Bannard, C., Bonnier, E., Carlsson, R., Cheung, F., Christensen, G., Clay, R., Craig, M., Dalla Rosa, A., Dam, L., Evans, M., Flores Cervantes, I., Fong, N., Gamez-Djokic, M., Glenz, A., Gordon-McKeon, S., Heaton, T., Hederos Eriksson, K., Heene, M., Hofelich Mohr, A., Högden, F., Hui, K., Johannesson, M., Kalodimos, J., Kaszubowski, E., Kennedy, D., Lei, R., Lindsay, T., Liverani, S., Madan, C., Molden, D., Molleman, E., Morey, R., Mulder, L., Nijstad, B., Pope, N., Pope, B., Prenoveau, J., Rink, F., Robusto, E., Roderique, H., Sandberg, A., Schlüter, E., Schönbrodt, F., Sherman, M., Sommer, S., Sotak, K., Spain, S., Spörlein, C., Stafford, T., Stefanutti, L., Tauber, S., Ullrich, J., Vianello, M., Wagenmakers, E., Witkowiak, M., Yoon, S., Nosek, B.: Many analysts, one dataset: Making transparent how variations in analytical choices affect results https://osf.io/gvm2z. Accessed 3 Nov 2016
- 30. Stang, A., Poole, C., Kuss, O.: The ongoing tyranny of statistical significance testing in biomedical research. Eur. J. Epidemiol. **25**, 225–230 (2010)
- 31. Tukey, J.: The philosophy of multiple comparisons. Stat. Sci. **6**(1), 100–116 (1991)
- 32. Yates, F.: The influence of statistical methods for research workers on the development of the science of statistics. J. Am. Stat. Assoc. **46**(253), 19–34 (1951)