Preliminary remarks

This commentary deliberately refrains from formal derivations and equations, as it is intended solely as a suggestion for how certain values, which by now (should) belong to the standard repertoire of empirical research, can be meaningfully used and interpreted. For detailed mathematical–statistical derivations and justifications, we refer to the cited literature. We deliberately chose the format of a commentary in order to initiate an open discussion in empirical sports science via this journal, rather than conducting an internal debate about different positions with the reviewers of a submitted article via the submission platform, especially since we do not intend to present anything new in terms of content but rather to pick out an aspect with which we, and probably most colleagues, are repeatedly confronted in teaching and research. Specifically, we focus on the interpretation of results in intervention studies (de Vet, Terwee, Mokkink, & Knol, 2011), which, following the differentiated validity discussion by Westermann (2000), can also be described as improving the validity of interpretation. We address two examples that are located at different hypothesis levels (Hussy & Möller, 1994) but equally illustrate that results of intervention studies need not only be interpreted as “applies” or “does not apply”; under certain conditions in the research and “translation” context, they can and should also be interpreted in a more differentiated way.

Objectives

In our view, the following two points of discussion are illustrative examples of content-related gradations in a differentiated interpretation: (1) An attempt is made to answer the question of how many times more likely, for example, the alternative hypothesis is than the null hypothesis, in order to strengthen confidence in the confirmation or rejection of a research hypothesis. (2) An attempt is made to answer the question of whether the effect estimated in a study, or for the population, is greater, with a certain degree of certainty, than a minimum effect that can reasonably be expected.

In the first case, the so-called Bayes factor offers a solution; for a number of common inferential statistical methods it can be determined with freely available programs such as R 4.2.0 (R-Project, Vienna, Austria, 2022) or JASP 0.17.2.1 (JASP Team, Amsterdam, Netherlands, 2023) (van Doorn et al., 2021). In the second case, an extension of the approach for determining effect sizes is required, to include a chance-adjusted minimum effect as well as corresponding confidence intervals (Cumming, 2014; Herbert, 2019; Lakens, 2013). Readers interested in applying the procedures illustrated below are directed to the online Supplementary Material, which includes an example dataset along with an overview of statistical outputs from frequentist and Bayesian analyses run in both SPSS 29.0 (IBM, Armonk, NY, USA) and JASP.
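
As a purely illustrative sketch, and not a substitute for the Supplementary Material, the following lines show how a Bayes factor for a paired comparison could be obtained in R alongside the familiar frequentist t-test; the fictitious data, the use of the BayesFactor package and its default Cauchy prior (rscale = "medium") are our assumptions, not part of the original example.

```r
# Minimal illustrative sketch with fictitious data (not the Supplementary Material)
# install.packages("BayesFactor")  # one possible route to Bayes factors in R
library(BayesFactor)

set.seed(1)
pre  <- rnorm(20, mean = 100, sd = 10)      # fictitious baseline performance
post <- pre + rnorm(20, mean = 4, sd = 6)   # fictitious performance after an intervention

# Frequentist paired t-test (p-value)
t.test(post, pre, paired = TRUE)

# Bayesian paired t-test (BF10 with the package's default Cauchy prior)
bf <- ttestBF(x = post, y = pre, paired = TRUE)
extractBF(bf)$bf        # BF10: evidence for H1 relative to H0
1 / extractBF(bf)$bf    # BF01: evidence for H0 relative to H1
```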

Interpretation aid at the level of the test hypotheses

In frequentist statistics, the null hypothesis (H0) usually states that there is no difference or relationship in the population, whereas the alternative hypothesis (H1) usually postulates such a difference or relationship. The empirical data are then tested against a theoretical model that always assumes the validity of the null hypothesis (see e.g. Lakens, 2017; Lakens, Scheel, & Isager, 2018b, for explanations of equivalence tests, which aim to test whether an effect lies below an a priori determined threshold for the phenomenon of interest). Accordingly, the notorious p-value (also referred to as the empirical probability of error) expresses the conditional probability of observing the empirical data (D), or data even more extreme than those actually observed, under the assumption that the null hypothesis is true, i.e. p = P(D|H0). If the p-value is small and falls below the “magic limit” of the significance level α (also referred to as the critical error probability; Lakens et al., 2018a) set before the statistical analysis, it is classically concluded that the empirical data are probably not compatible with the assumption of the null hypothesis. Accordingly, the null hypothesis is provisionally rejected and the decision is made in favour of the alternative hypothesis. In the opposite case, if the p-value is above the significance level, the null hypothesis is maintained, but not confirmed (!), and the alternative hypothesis is provisionally rejected (Hussy & Jain, 2002; see Otte, Vinkers, Habets, van Ijzendoorn, & Tijdink, 2022, for an analysis of how statistically nonsignificant results are worded in scientific articles). This is probably the most widespread dichotomous decision rule in empirical (sports) science, although it has been and continues to be the subject of critical discussion and repeated misinterpretation since its introduction (Nickerson, 2000). Among other things, the rule “forbids” comparing p-values with one another in such a way that a smaller p-value is interpreted as evidence for a larger effect under the alternative hypothesis, or vice versa (Büsch & Strauß, 2016).

In contrast to the frequentist approach, Bayesian statistics treats the alternative hypothesis (H1) and the null hypothesis (H0) as “competing models” for the empirical data (D) at hand. The resulting Bayes factor (BF) quantifies the strength of evidence for H1 relative to H0 and, depending on whether the research hypothesis corresponds to H1 or H0, can also be calculated in favour of either hypothesis (Kass & Raftery, 1995). A key advantage of the BF is that, by comparing the two hypotheses, we can express how many times more likely the data are under one hypothesis, e.g. H1, than under the other, e.g. H0. Assuming that we want to calculate the BF in favour of H1 [BF10 = P(D|H1)/P(D|H0)], the interpretation options summarised in Table 1 apply. The BF can alternatively be specified for H0 as the inverse of BF10 (i.e. BF01 = 1/BF10), which, in contrast to the frequentist approach, allows the statistical evidence in favour of H0 (relative to H1) to be estimated and quantified (Dienes, 2014).

The Bayes factor can also be interpreted as an update factor for repeated measurements (e.g. study replications), indicating what we can learn from newly acquired data. The question that can be answered with repeated measurements is to what extent new data change our confidence (belief) regarding, for example, the validity of the two competing hypotheses H1 and H0. For example, if before collecting new data the probability of H1 is assumed to be 60% [P(H1) = 0.6] and the probability of H0 to be 40% [P(H0) = 0.4], and the new data yield BF10 = 1, then we would not learn anything new from the data (no evidence). With BF10 = 1, the probability of observing the data at hand (D) under the validity of H1, P(D|H1), is identical to the probability of observing them under the validity of H0, P(D|H0). Conversely, the larger the BF10, the higher our confidence in the validity of H1 over H0 in the before–after comparison (Table 1). In order to determine the updated confidence in the validity of H1 relative to H0 (posterior beliefs), i.e. the ratio of the probabilities in favour of H1 and H0 given the new data, the ratio of the probability assumptions before the new data (prior beliefs) has to be combined with the Bayes factor BF10 (for a mathematical justification see e.g. Wagenmakers, Morey, & Lee, 2016). A high value for BF10 shifts the initial probability ratio of H1 to H0 quantified in the prior beliefs towards H1 in the posterior beliefs. Assuming that, in the above example with prior beliefs = 0.6/0.4 = 1.5, the new data yield BF10 = 10, the posterior beliefs are 1.5 × 10 = 15. Whereas before the new data H1 was classified as only 1.5 times more likely than H0, considering the new data we now have stronger confidence in the validity of H1 compared to H0, because H1 now appears 15 times more likely than H0. Conversely, new data can also increase our confidence in the validity of H0 over H1. This is always the case when the value of BF10 is smaller than 1, and the evidence becomes stronger the closer BF10 approaches zero (Table 1).
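
To make the updating step tangible, the following sketch simply reproduces the numbers from the example above in R; nothing here goes beyond the arithmetic already described.

```r
# Updating prior beliefs with a Bayes factor (numbers from the example above)
prior_H1 <- 0.6
prior_H0 <- 0.4
BF10     <- 10

prior_odds     <- prior_H1 / prior_H0   # 1.5: H1 initially 1.5 times more likely than H0
posterior_odds <- prior_odds * BF10     # 15:  H1 now 15 times more likely than H0

# Optional back-transformation into posterior probabilities
posterior_H1 <- posterior_odds / (1 + posterior_odds)   # approx. 0.94
posterior_H0 <- 1 - posterior_H1                         # approx. 0.06
```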

Table 1 Evidence categories for the Bayes factor based on Jeffreys (1961) taken from Wetzels et al. (2011)

Despite the clear differences between the p-value on the one hand and the Bayes factor on the other, it should nevertheless be noted that the Bayes factor tends to increase as the p-value decreases, although the p-value only permits a statement against the null hypothesis (Wetzels et al., 2011). A p-value from a frequentist analysis can be converted into an estimated upper bound for the Bayes factor BF10 (the so-called Bayes factor bound, BFB), with or without including the sample size and, if necessary, considering restrictions on the magnitude of the p-value (for detailed explanations see, among others, Benjamin & Berger, 2019; Held & Ott, 2016, 2018; Sellke, Bayarri, & Berger, 2001). The Bayes factor bound represents the highest possible Bayes factor BF10 that is compatible with the observed p-value and thus provides, e.g. for reviewers of a submitted paper with a frequentist data analysis, a helpful measure for assessing the statistical evidence of a result in favour of H1 relative to H0. The Bayes factor bound for p = 0.05 is BFB = 2.46, for p = 0.01 it is BFB = 7.99, and for p = 0.001 it is BFB = 53.26. In particular, a p-value just below the often-used significance level of α = 0.05 is thus associated with only marginal, anecdotal evidence (Table 1), which, in the words of Jeffreys (1961), is “worth no more than a bare mention”. Even where the p-value and Bayes factor correspond, the p-value does not allow a direct assessment of the probabilities of two competing assumptions, which are of interest to us in science as well as in everyday life, e.g. when assessing the trustworthiness of strangers (Tschirk, 2019; Wasserstein & Lazar, 2016). From the perspective of the Bayes factor, however, it would also be necessary to reconsider for the frequentist approach whether, instead of the usual critical error probability of α = 0.05, it would be more consistent to use α = 0.01 or smaller (for discussion see e.g. Benjamin et al., 2018; Lakens et al., 2018a) as the standard criterion if substantial evidence is aimed for (Benjamin & Berger, 2019; Cohen, 1994; Wetzels et al., 2011). One side effect to be considered is that tightening the requirements for statistical evidence, i.e. stricter testing, would also have direct consequences, e.g. for the minimum sample size needed (see Brysbaert, 2019, for a tutorial covering both the frequentist and Bayesian perspectives).
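
For readers who want to check the figures just quoted, one simple variant of the Bayes factor bound discussed by Sellke, Bayarri and Berger (2001) is BFB = −1/(e · p · ln p) for p < 1/e; the following sketch assumes this particular bound and ignores the sample size.

```r
# Bayes factor bound (BFB) following Sellke, Bayarri & Berger (2001):
# BFB = -1 / (e * p * ln(p)), valid for p < 1/e and without reference to sample size
bfb <- function(p) {
  stopifnot(p > 0, p < exp(-1))
  -1 / (exp(1) * p * log(p))
}

round(bfb(0.05), 2)    # 2.46
round(bfb(0.01), 2)    # 7.99
round(bfb(0.001), 2)   # 53.26
```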

Interpretation aid on the level of empirical-content hypotheses

While the interpretation of the p-value is of decisive importance at the level of test hypotheses, i.e. of statistical prediction, interpretation at the level of an empirical-content hypothesis requires a focus on the effect size. Despite all justified criticism in the scientific community, the conventions of Cohen (1988) are still largely preferred for interpretation, although the author himself pointed out the context-dependency of his suggestions (see also e.g. Caldwell & Vigotsky, 2020; Durlak, 2009; Mesquida, Murphy, Lakens, & Warne, 2022). A small intervention effect in competitive athletes, for whom marginal differences can determine success or failure, is to be interpreted differently than a small intervention effect in novice athletes, for whom even some regularity of physical activity can lead to short-term, sometimes exponential improvements in performance (Rhea, 2004).

Effect sizes quantify the size of a population or sample effect and can be represented either as a difference or distance measure, e.g. the effect size d for the mean difference relative to the pooled standard deviation, or as a correlation measure between two variables, e.g. the effect size r (Cohen, 1988). Since distance measures can be converted into correlation measures and vice versa, we limit ourselves in the following to the most frequently used effect size d for interpreting the magnitude of a difference between two groups or of a change within a sample.
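
As a side note, the conversion between the two effect size families mentioned above is straightforward; the formulas in the sketch below assume the simple two-group case with (approximately) equal group sizes.

```r
# Conversion between d and r for two groups of (approximately) equal size
d_to_r <- function(d) d / sqrt(d^2 + 4)
r_to_d <- function(r) 2 * r / sqrt(1 - r^2)

d_to_r(0.5)    # approx. 0.24
r_to_d(0.24)   # approx. 0.49
```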

In contrast to the conventions for d, i.e. small effect d ≥ 0.2, medium effect d ≥ 0.5 and large effect d ≥ 0.8, a context-dependent interpretation of the effect requires the following question to be asked before the study begins: “How large should a potential effect be for the intervention to be interpreted as worthwhile?” or, the other way round: “How large could a potential effect be without the intervention being interpreted as worthwhile?” Both questions can be reduced to: “What is the minimum size an effect should have for the intervention to be interpreted as worthwhile?” A corresponding minimum effect size can be determined either on the theoretical and/or content level or on the methodological level of measurement. In the first case, the minimum effect size could be derived from theory, taking into account the thematically relevant studies. In the second case, a measurement-methodological effect size would have to be defined that is larger than a null effect (Cumming, 2014) but too small or trivial for a substantive interpretation. Possible reasons for such small, trivial and thus negligible effects include, for example, the limited reliability of the measurement procedure or the homogeneity of the variances of the differences. The basic assumption of this approach is reminiscent of the distinction in individual case analyses between the minimal important change (MIC), which can be based, for example, on a consensus of experts, and the minimal detectable change (MDC), which essentially depends on the accuracy of the measurement procedure, i.e. the standard error of measurement (SEM) (de Vet et al., 2006; King, 2011; Terwee et al., 2021).
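
To illustrate the measurement-methodological route via the MDC, the standard relations SEM = SD · √(1 − ICC) and MDC95 = 1.96 · √2 · SEM can be applied directly; the SD and ICC values in the sketch are fictitious.

```r
# Minimal detectable change (MDC95) derived from the standard error of measurement (SEM)
# SEM = SD * sqrt(1 - ICC);  MDC95 = 1.96 * sqrt(2) * SEM   (fictitious values)
sd_outcome <- 8.0    # fictitious between-subject SD of the outcome measure
icc        <- 0.90   # fictitious test-retest reliability

sem   <- sd_outcome * sqrt(1 - icc)   # approx. 2.53
mdc95 <- 1.96 * sqrt(2) * sem         # approx. 7.01: smallest change beyond measurement error
```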

The approach of minimum effect sizes goes back, among others, to considerations by Murphy and Myors (1999) on the testing of minimum-effect hypotheses. The authors address the issue that, although null hypothesis testing (in the sense of an effect being exactly zero) is easy to implement and has become widely accepted in empirical research despite all criticism and misinterpretation, true null effects, i.e. effects of exactly zero, do not reflect reality. Minimum-effect hypotheses, on the other hand, ask whether an effect is “good enough” to describe an intervention as worthwhile, or as more worthwhile than other interventions, for example. This assumption corresponds to the considerations on the Bayes factor explained earlier with regard to the question of whether new data change our confidence in the likelihood ratio of two competing hypotheses (see also Rouanet, 1996). In Murphy and Myors (1999, 2023), the decision on minimum effects is based, within the framework of the F‑statistic (e.g. for analyses of variance), on the percentage of explained variance that can be regarded as negligible, i.e. justifiably treated as equivalent to a true null effect. Assuming that, for example, up to 1% of explained variance in the F‑statistic could be neglected, this would correspond to an effect size of η2 = 0.01. Transferred to the t-statistic, this assumption corresponds to a d-value of approximately 0.20 (for the conversion of effect sizes, see e.g. https://www.psychometrica.de/effect_size.html). If the reliability of a measurement procedure can be assumed to be very high, e.g. ICC ≥ 0.95 or ICC ≥ 0.99, the effect to be neglected can also be set lower. For example, an explained variance of 0.5% (η2 = 0.005) would correspond to d ≅ 0.14 and of 0.1% (η2 = 0.001) to d ≅ 0.06. Ultimately, it must be decided which minimum effect appears negligible or, conversely, at which threshold an interesting or worthwhile effect can be assumed (see also minimum-effect tests, Jovanovic, Torres, & French, 2022).
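
The conversions quoted in this paragraph can be reproduced with the usual relation between η2 and d via Cohen's f; which threshold is ultimately adopted remains a substantive decision.

```r
# Converting a negligible proportion of explained variance (eta squared) into d
# via Cohen's f: f = sqrt(eta2 / (1 - eta2)) and d = 2 * f (two-group case)
eta2_to_d <- function(eta2) 2 * sqrt(eta2 / (1 - eta2))

round(eta2_to_d(0.010), 2)   # 0.20
round(eta2_to_d(0.005), 2)   # 0.14
round(eta2_to_d(0.001), 2)   # 0.06
```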

The minimum effect size can also be referred to as the smallest effect size of interest (SESOI) and marks the threshold or reference point below which the effect size (ES) in the sample under investigation, including its confidence interval (CI), should not fall (Anvari & Lakens, 2021). The SESOI needs to be chosen in a justified and transparent manner and documented accordingly. The confidence interval serves to estimate whether an intervention effect is large enough that, with a certain probability, it can be reproduced and remains larger than the minimum effect (Herbert, 2000, 2019; Kamper, 2019). This means that one can be somewhat more certain of not having backed the wrong horse, i.e. the wrong intervention, and that the effect is highly likely to be meaningful. This effect-size-oriented (magnitude-based) approach can be illustrated with the help of a tree plot (Fig. 1).

Fig. 1

Tree plot of an effect size with confidence interval and minimum effect. ES effect size, CI confidence interval, SESOI smallest effect size of interest. (Adapted from Herbert, 2000, p. 232)

By considering the effect size and its confidence interval in relation to both the minimum effect and the null effect, i.e. the value indicating no (substantial) difference between two measurement points, intervention effects can be interpreted validly. If, for example, the effect size and its confidence interval lie above the minimum effect (SESOI), it can be assumed that the intervention is beneficial or meaningful. If, on the other hand, the effect size and its confidence interval lie between the null effect and the minimum effect, it can be assumed that the intervention is effective but not worthwhile, whereas if both lie below the null effect the intervention would be assessed as not beneficial or not meaningful, possibly even as harmful. The validity of the interpretation is always limited, however, if the confidence interval includes a reference point, i.e. either the minimum effect or the null effect (Fig. 2).
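
The decision logic just described (cf. Fig. 2) can be mimicked in a few lines; the fictitious data, the assumed SESOI of d = 0.20 and the large-sample approximation for the confidence interval of d are our assumptions and only sketch one possible implementation.

```r
# Classifying an intervention effect relative to a SESOI and the null effect (cf. Fig. 2)
set.seed(2)
intervention <- rnorm(30, mean = 24, sd = 5)   # fictitious outcome, intervention group
control      <- rnorm(30, mean = 20, sd = 5)   # fictitious outcome, control group
sesoi        <- 0.20                           # assumed smallest effect size of interest

n1 <- length(intervention); n2 <- length(control)
sd_pooled <- sqrt(((n1 - 1) * var(intervention) + (n2 - 1) * var(control)) / (n1 + n2 - 2))
d  <- (mean(intervention) - mean(control)) / sd_pooled
se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))  # large-sample SE of d
ci <- d + c(-1, 1) * 1.96 * se                             # approximate 95% CI for d

verdict <- if (ci[1] > sesoi) {
  "beneficial/meaningful: entire CI above the SESOI"
} else if (ci[1] > 0 && ci[2] < sesoi) {
  "effective but not worthwhile: CI between null effect and SESOI"
} else if (ci[2] < 0) {
  "not beneficial, possibly harmful: entire CI below the null effect"
} else {
  "inconclusive: CI includes a reference point (null effect or SESOI)"
}

print(round(c(d = d, lower = ci[1], upper = ci[2]), 2))
print(verdict)
```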

Fig. 2

Interpretation of intervention effects in consideration of a minimum effect size and a null effect. (Adapted from Kamper, 2019, p. 764)

Concluding remarks

In this commentary we focused on two selected issues at different hypothesis levels. Of course, many other aspects of empirical (sports) research need to be considered for sufficient interpretive validity, such as a sound theoretical foundation and methodological rigour (Fiedler, McCaughey, & Prager, 2021), visualisation of the (raw) data (Loffing, 2022), and so on. The aspects considered here represent only a small section of a complex overall research process whose interactions are often underestimated, which holds traps at various points that each of us has stepped into at some time, and which unfortunately entails limitations for the validity of studies. What the different approaches have in common is the striving for improved methodological precision that also offers substantial added value for sports science research. Consequently, this commentary is to be understood as the starting whistle of a continuous and constantly developing discussion, and it should not be misunderstood as the final whistle of a discussion that may at times be tiring (Mesquida et al., 2022). As a kick-off for the discussion to be initiated, we finally formulate two deliberately provocative suggestions for the interpretation of study results: (1) Always calculate the Bayes factor to enable a (better) estimation or interpretation of the trustworthiness of your (statistical) test hypothesis of interest! (2) Always determine and justify the minimum effect size in advance to enable a (better) estimation or interpretation of the substantial benefit of your intervention!