Abstract
Sports science, as an empirical science, produces study results that should be interpreted in a hypothesis-driven manner. The validity with which statistically and practically significant results are interpreted depends both on the theoretical foundation of the research question and on the concrete methodological procedure of intervention studies. When hypotheses are considered at the empirical-content and the statistical level, recurring difficulties of interpretation arise as soon as numbers are translated into words or recommendations for action. On the basis of two examples, we aim to initiate a discussion in the scientific community, which could be continued in this journal if there is corresponding interest in methodological issues.
Preliminary remarks
This commentary deliberately refrains from using numbers and equations, as it is intended solely as a suggestion for how certain values, which by now (should) belong to the standard repertoire of empirical research, can be meaningfully used and interpreted. For detailed mathematical–statistical derivations and justifications, we refer to the cited literature. We deliberately chose the format of a commentary in order to initiate an open discussion in empirical sports science via this journal, rather than conducting an internal exchange with reviewers about different positions via the submission platform, especially since we do not present anything new in terms of content but pick out an aspect that we, and probably most colleagues, are repeatedly confronted with in teaching and research. Specifically, we focus on the interpretation of results in intervention studies (de Vet, Terwee, Mokkink, & Knol, 2011), which, following the differentiated discussion of validity by Westermann (2000), can also be framed as improving the validity of interpretation. We address two examples that are located at different hypothesis levels (Hussy & Möller, 1994) but equally illustrate that results of intervention studies should not only be interpreted as “applies” or “does not apply”; under certain conditions in the research and “translation” context, they can and should also be interpreted in a more differentiated way.
Objectives
In our view, the following two points of discussion are illustrative examples of content-related gradations of a differentiated interpretation: (1) an attempt to answer the question of how many times more likely, for example, the alternative hypothesis is than the null hypothesis, in order to strengthen confidence in the confirmation or rejection of a research hypothesis; (2) an attempt to answer the question of whether the effect estimated in a study, or for the population, exceeds, with a certain degree of certainty, a minimum effect that can reasonably be expected.
In the first case, the so-called Bayes factor offers a solution, which can be determined with freely available programs such as R 4.2.0 (R-Project, Vienna, Austria, 2022) or JASP 0.17.2.1 (JASP Team, Amsterdam, Netherlands, 2023) for a number of common inferential statistical methods (van Doorn et al., 2021). In the second case, the determination of effect sizes needs to be extended to include a chance-adjusted minimum effect as well as corresponding confidence intervals (Cumming, 2014; Herbert, 2019; Lakens, 2013). Readers interested in applying the procedures illustrated below are directed to the online Supplementary Material, which includes an example dataset along with an overview of the statistical outputs obtained from frequentist and Bayesian analyses run in both SPSS 29.0 (IBM, Armonk, NY, USA) and JASP.
Interpretation aid at the level of the test hypotheses
In frequentist statistics, the null hypothesis (H0) usually states that there is no difference or relationship, whereas the alternative hypothesis (H1) usually postulates a difference or relationship. One then tests the empirical data against a theoretical model that assumes the validity of the null hypothesis (see e.g. Lakens, 2017; Lakens, Scheel, & Isager, 2018b, for explanations of equivalence tests, which examine whether an effect lies below an a priori determined threshold for the phenomenon of interest). Accordingly, the notorious p-value (also referred to as the empirical probability of error) expresses the conditional probability of observing the empirical data (D), or data even more extreme, under the assumption that the null hypothesis is true, i.e. p = P(D|H0). If the p-value is small and falls below the “magic limit” of the significance level α (also referred to as the critical error probability; Lakens et al., 2018a) set before the statistical analysis, it is classically concluded that the empirical data are probably not compatible with the null hypothesis. Accordingly, the null hypothesis is provisionally rejected and the decision is made in favour of the alternative hypothesis. In the opposite case, if the p-value is above the significance level, the null hypothesis is retained, but not confirmed (!), and the alternative hypothesis is provisionally rejected (Hussy & Jain, 2002; see Otte, Vinkers, Habets, van Ijzendoorn, & Tijdink, 2022, for an analysis of the formulations used in scientific articles for statistically nonsignificant results). This is probably the most widespread dichotomous decision rule in empirical (sports) science, although it has been the subject of critical discussion and repeated misinterpretation since its introduction (Nickerson, 2000).
Among other things, this rule “forbids” comparing p-values with each other in such a way that a smaller p-value is interpreted as evidence for a larger effect under the alternative hypothesis, or vice versa (Büsch & Strauß, 2016).
In contrast to the frequentist approach, Bayesian statistics treats the alternative (H1) and null hypothesis (H0) as “competing models”, given that empirical data (D) are available. The resulting Bayes factor (BF) quantifies the strength of evidence for H1 relative to H0, and depending on whether the research hypothesis corresponds to H1 or H0, the BF can be calculated in favour of either hypothesis (Kass & Raftery, 1995). A key advantage of the BF is that, by comparing the two hypotheses, we can express how many times more likely the data are under one hypothesis, e.g. H1, than under the other, e.g. H0. Assuming we want to calculate the BF in favour of H1 [BF10 = P(D|H1)/P(D|H0)], the interpretation options summarised in Table 1 apply. The BF can alternatively be specified for H0 as the inverse of BF10 (i.e. BF01 = 1/BF10), which, in contrast to the frequentist approach, allows the statistical evidence in favour of H0 (relative to H1) to be estimated and quantified (Dienes, 2014). The Bayes factor can also be interpreted as an update factor for repeated measurements (e.g. study replications), indicating what we can learn from newly acquired data. The question that repeated measurements can answer is to what extent new data change our confidence (belief) in, for example, the validity of two competing hypotheses H1 and H0. For example, if before collecting new data the probability of H1 is assumed to be 60% [P(H1) = 0.6] and the probability of H0 to be 40% [P(H0) = 0.4], and the new data yield BF10 = 1, then we would learn nothing new (no evidence) from the data: with BF10 = 1, the probability of observing the data at hand (D) under H1, P(D|H1), is identical to the probability of observing them under H0, P(D|H0).
Conversely, the larger the BF10, the higher our confidence in the validity of H1 over H0 in the before–after comparison (Table 1). To determine the updated confidence in the validity of H1 relative to H0 (posterior beliefs), i.e. the ratio of the probabilities in favour of H1 and H0 in the presence of new data, the ratio of the probability assumptions before the new data (prior beliefs) needs to be combined with the Bayes factor BF10 (for a mathematical justification see e.g. Wagenmakers, Morey, & Lee, 2016). A high value of BF10 shifts the initial probability ratio of H1 to H0, quantified in the prior beliefs, in favour of H1 in the posterior beliefs. Assuming the Bayes factor associated with the above study example (prior beliefs = 0.6/0.4 = 1.5) is BF10 = 10, the posterior beliefs are 1.5 × 10 = 15. Whereas before the new data H1 was classified as only 1.5 times more likely than H0, with the new data taken into account we have stronger confidence in the validity of H1 over H0, because H1 now appears 15 times more likely than H0. Conversely, new data can also increase our confidence in the validity of H0 over H1. This is the case whenever BF10 is smaller than 1, and the evidence becomes stronger the closer BF10 approaches zero (Table 1).
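The updating rule just described (posterior odds = prior odds × Bayes factor) can be sketched in a few lines of Python; the numbers reproduce the worked example from the text (prior probabilities 0.6 and 0.4, BF10 = 10):

```python
def posterior_odds(p_h1: float, p_h0: float, bf10: float) -> float:
    """Posterior odds for H1 over H0: prior odds multiplied by BF10."""
    prior_odds = p_h1 / p_h0
    return prior_odds * bf10

# Worked example from the text: P(H1) = 0.6, P(H0) = 0.4, BF10 = 10
print(posterior_odds(0.6, 0.4, 10))  # 15.0 -> H1 now 15 times more likely
# With BF10 = 1 the odds stay at the prior odds of 1.5: nothing is learned
print(posterior_odds(0.6, 0.4, 1))
```

This is only an illustration of the arithmetic; in practice the Bayes factor itself would come from software such as JASP or R.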
Despite the clear differences between the p-value on the one hand and the Bayes factor on the other, it should be noted that the Bayes factor tends to increase as the p-value decreases, although the p-value only permits a statement against the null hypothesis (Wetzels et al., 2011). A p-value based on the frequentist approach can be converted into an estimated upper bound for the Bayes factor BF10 (the so-called Bayes factor bound, BFB), with or without including the sample size and, if necessary, considering restrictions on the magnitude of the p-value (for detailed explanations see, among others, Benjamin & Berger, 2019; Held & Ott, 2016, 2018; Sellke, Bayarri, & Berger, 2001). The Bayes factor bound represents the highest possible Bayes factor BF10 that is compatible with the observed p-value and thus provides, e.g. for reviewers of a submitted paper with a frequentist data analysis, a helpful measure for assessing the statistical evidence of a result in favour of H1 relative to H0. The Bayes factor bound for p = 0.05 is BFB = 2.46, for p = 0.01 it is BFB = 7.99 and for p = 0.001 it is BFB = 53.26. In particular, a p-value just below the often-used significance level of α = 0.05 is thus associated with only marginal, anecdotal evidence (Table 1), which, in the words of Jeffreys (1961), is “worth no more than a bare mention”. Even where p-value and Bayes factor correspond in tendency, the p-value does not allow a direct assessment of the probabilities of two competing assumptions, which interest us in science as well as in everyday life, e.g. when assessing the trustworthiness of strangers (Tschirk, 2019; Wasserstein & Lazar, 2016). From the perspective of the Bayes factor, it would also be worth reconsidering, for the frequentist approach, whether instead of the usual critical error probability of α = 0.05 it would be more consistent to use α = 0.01 or smaller as the standard criterion if substantial evidence is aimed for (Benjamin & Berger, 2019; Cohen, 1994; Wetzels et al., 2011; for discussion, see e.g. Benjamin et al., 2018; Lakens et al., 2018a). One side effect to be considered is that tightening the requirements for statistical evidence, i.e. stricter testing, would also have direct consequences, e.g. for the minimum sample size needed (see Brysbaert, 2019, for a tutorial covering both the frequentist and the Bayesian perspective).
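The quoted bound values can be reproduced with the simplest of the calibrations discussed by Sellke, Bayarri, and Berger (2001), BFB = 1/(−e · p · ln p), which holds for p < 1/e and does not involve the sample size. A minimal Python sketch:

```python
import math

def bayes_factor_bound(p: float) -> float:
    """Upper bound on BF10 implied by a p-value (Sellke-Bayarri-Berger
    calibration), valid for 0 < p < 1/e: BFB = 1 / (-e * p * ln p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound is defined only for 0 < p < 1/e")
    return 1 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 2))
# 0.05 -> 2.46, 0.01 -> 7.99, 0.001 -> 53.26
```

Other variants of the bound (e.g. sample-size dependent ones, Held & Ott, 2016) give different numbers; the formula above is the one matching the values cited in the text.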
Interpretation aid on the level of empirical-content hypotheses
While the interpretation of the p-value is of decisive importance at the level of test hypotheses, i.e. statistical prediction, interpretation at the level of an empirical-content hypothesis requires a focus on the effect size. Despite all justified criticism in the scientific community, the conventions of Cohen (1988) are largely preferred for interpretation, although the author himself pointed out the context-dependency of his suggestions (see also e.g. Caldwell & Vigotsky, 2020; Durlak, 2009; Mesquida, Murphy, Lakens, & Warne, 2022). A small intervention effect in competitive athletes, for whom marginal differences can determine success or failure, must be interpreted differently from a small intervention effect in novice athletes, for whom even some regularity of physical activity can lead to short-term, sometimes exponential improvements in performance (Rhea, 2004).
Effect sizes denote the magnitude of a population or sample effect and can be represented either as a difference or distance measure, e.g. the effect size d for the mean difference relative to the common dispersion, or as a correlation measure between two variables, e.g. the effect size r (Cohen, 1988). Since distance measures can be converted into correlation measures and vice versa, we limit ourselves in the following to the most frequently used effect size d for interpreting the magnitude of a difference between two groups or of a change within a sample.
In contrast to the conventions for d, i.e. small effect d ≥ 0.2, medium effect d ≥ 0.5 and large effect d ≥ 0.8, a context-dependent interpretation of the effect requires the following question to be asked before the study begins: “How large should a potential effect be for the intervention to be interpreted as worthwhile?” or the other way round: “How large could a potential effect be without the intervention being interpreted as worthwhile?” Both questions can be simplified to: “What should be the minimum size of an effect so that the intervention could be interpreted as worthwhile?” A corresponding minimum effect size can be determined either on the theoretical and/or content level or on the methodological level of measurement. In the first case, the minimum effect size could be derived from theory considering the thematically relevant studies. In the second case, a measurement–methodological effect size would have to be defined that would be larger than a null effect (Cumming, 2014), but that would be too small or trivial for a substantive interpretation. Possible reasons for small or trivial and thus negligible effects could result, for example, from the reliability of the measurement procedure, the homogeneity of the variances of the differences, etc. The basic assumption of this approach is reminiscent of the differentiation in individual case analyses between the minimal important change (MIC), which can be based on a consensus of experts, for example, and the minimal detectable change (MDC), which is essentially dependent on the accuracy of the measurement procedure or the standard error of measurement (SEM) (De Vet et al., 2006; King, 2011; Terwee et al., 2021).
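The measurement-methodological threshold mentioned above, the minimal detectable change, is commonly derived from the standard error of measurement via MDC = z · √2 · SEM (with z = 1.96 for a 95% confidence level), a standard formula in the MDC literature (e.g. de Vet et al., 2006). A short sketch, with a hypothetical SEM value for illustration:

```python
import math

def mdc(sem: float, z: float = 1.96) -> float:
    """Minimal detectable change at the given confidence level
    (default 95%): MDC = z * sqrt(2) * SEM."""
    return z * math.sqrt(2) * sem

# Hypothetical example: a test with a standard error of measurement of 1.2 units
print(round(mdc(1.2), 2))  # changes smaller than ~3.33 units are not
                           # distinguishable from measurement error
```

Any observed change below this value cannot be separated from measurement noise, which is one concrete way to justify a measurement-based minimum effect.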
The approach of minimum effect sizes goes back, among others, to the considerations of Murphy and Myors (1999) on testing minimum-effect hypotheses. The authors address the issue that, although null hypothesis testing (in the sense of an effect being exactly zero) is easy to implement and has become widely accepted in empirical research despite all criticism and misinterpretation, true null effects do not reflect reality. Minimum-effect hypotheses, in contrast, ask whether an effect is “good enough” to describe an intervention as worthwhile, or as more worthwhile than other interventions. This corresponds to the considerations on the Bayes factor explained earlier, concerning the question of whether new data change our confidence in the likelihood ratio of two competing hypotheses (see also Rouanet, 1996). In Murphy and Myors (1999, 2023), the decision on minimum effects is based, within the framework of the F-statistic (e.g. for analyses of variance), on what percentage of explained variance can be regarded as negligible, i.e. treated as equivalent to a true null effect. Assuming, for example, that up to 1% of explained variance in the F-statistic could be neglected, this would correspond to an effect size of η2 = 0.01. Transferred to the t-statistic, this assumption would correspond to a d-value of approximately 0.20 (for the conversion of effect sizes, see e.g. https://www.psychometrica.de/effect_size.html). If the reliability of a measurement procedure can be assumed to be very high, e.g. ICC ≥ 0.95 or ICC ≥ 0.99, the effect to be neglected can be set lower. For example, an explained variance of 0.5% (η2 = 0.005) would correspond to d ≅ 0.14 and of 0.1% (η2 = 0.001) to d ≅ 0.06.
Ultimately, it must be decided which minimum effect appears negligible or, vice versa, at which threshold value an interesting or worthwhile effect can be assumed (see also minimum effect tests, Jovanovic, Torres, & French, 2022).
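The η² values quoted above can be checked against the standard conversion d = 2 · √(η²/(1 − η²)), the relation used by common effect-size calculators such as the one referenced in the text:

```python
import math

def eta2_to_d(eta2: float) -> float:
    """Convert explained variance (eta squared) to Cohen's d
    via d = 2 * sqrt(eta2 / (1 - eta2))."""
    return 2 * math.sqrt(eta2 / (1 - eta2))

for e2 in (0.01, 0.005, 0.001):
    print(e2, round(eta2_to_d(e2), 2))
# 0.01 -> 0.2, 0.005 -> 0.14, 0.001 -> 0.06
```

The three outputs match the approximate d-values of 0.20, 0.14 and 0.06 given in the text.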
The minimum effect size can also be referred to as the smallest effect size of interest (SESOI) and marks the threshold or reference point below which the effect size (ES) in the sample under investigation, including its confidence interval (CI), should not fall (Anvari & Lakens, 2021). The SESOI needs to be documented in a justified and transparent manner. The confidence interval serves to estimate whether an intervention effect is large enough to be reproduced with a certain probability and to remain consistently larger than the minimum effect (Herbert, 2000, 2019; Kamper, 2019). This means one can be somewhat more certain of not having backed the wrong horse, i.e. the wrong intervention, and that the effect is highly likely to be meaningful. This effect-size-oriented (magnitude-based) approach can be illustrated with the help of a tree plot (Fig. 1).
Tree plot of an effect size with confidence interval and minimum effect. ES effect size, CI confidence interval, SESOI smallest effect size of interest. (adapted based on Herbert, 2000, p. 232)
By considering the effect size and confidence interval as well as the minimum effect and the null effect, i.e. the absence of a (substantial) difference between two measurement points, intervention effects can be interpreted validly. If, for example, the effect size and the confidence interval lie above the minimum effect (SESOI), the intervention can be assumed to be beneficial or meaningful. If, on the other hand, the effect size and the confidence interval lie between the null effect and the minimum effect, the intervention can be assumed to be effective but not worthwhile, whereas if both lie below the null effect the intervention would be assessed as not beneficial or not meaningful, possibly even as increasingly harmful. However, the validity of the interpretation is always limited if the confidence interval includes a reference point, i.e. either the minimum effect or the null effect (Fig. 2).
Interpretation of intervention effects in consideration of a minimum effect size and a null effect (adapted based on Kamper, 2019, p. 764)
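The decision logic just described can be sketched as a small classifier. This is only an illustrative reading of the interpretation rules, assuming a two-sided confidence interval given by its lower and upper bounds on the d-scale; the verbal labels are ours:

```python
def interpret_effect(ci_low: float, ci_high: float, sesoi: float) -> str:
    """Classify an intervention effect from its confidence interval
    relative to the null effect (0) and a minimum effect (SESOI).
    Intervals that include a reference point remain inconclusive."""
    if ci_low > sesoi:
        return "worthwhile"             # whole CI above the minimum effect
    if 0 < ci_low and ci_high < sesoi:
        return "effective but trivial"  # CI between null effect and SESOI
    if ci_high < 0:
        return "harmful"                # whole CI below the null effect
    return "inconclusive"               # CI includes a reference point

# Hypothetical examples with SESOI = 0.20
print(interpret_effect(0.45, 0.85, 0.20))   # worthwhile
print(interpret_effect(-0.10, 0.30, 0.20))  # inconclusive
```

The fourth branch makes explicit the caveat from the text: whenever the interval spans the null effect or the SESOI, no firm verbal conclusion is warranted.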
Concluding remarks
In this commentary we focused on two selected issues at different hypothesis levels. Of course, many other aspects of empirical (sports) research need to be considered for sufficient interpretive validity, such as a sound theoretical foundation and methodological rigour (Fiedler, McCaughey, & Prager, 2021), visualisation of the (raw) data (Loffing, 2022), etc. The aspects considered here represent only a small section of a complex research process, often underestimated in its interactions, which holds traps at various points that each of us has stepped into at some time and which limit the validity of studies. What the different approaches have in common is the striving for an improvement in methodological precision that also offers substantial added value for sports science research. Consequently, this commentary is to be understood as the starting whistle of a continuous and constantly developing discussion, not as the final whistle of a discussion that may also be tiring to some degree (Mesquida et al., 2022). As a kick-off, we finally formulate two provocative suggestions for the interpretation of study results: (1) always calculate the Bayes factor to enable a (better) estimation or interpretation of the trustworthiness of your (statistical) test hypothesis of interest! (2) Always determine and justify the minimum effect size in advance to enable a (better) estimation or interpretation of the substantial benefit of your intervention!
References
Anvari, F., & Lakens, D. (2021). Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96, 104159. https://doi.org/10.1016/j.jesp.2021.104159.
Benjamin, D. J., & Berger, J. O. (2019). Three recommendations for improving the use of p‑values. The American Statistician, 73(sup1), 186–191. https://doi.org/10.1080/00031305.2018.1543135.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M., Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., . . ., & Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition. https://doi.org/10.5334/joc.72.
Büsch, D., & Strauß, B. (2016). Wider die „Sternchenkunde“! Sportwissenschaft, 46(2), 53–59. https://doi.org/10.1007/s12662-015-0376-x.
Caldwell, A., & Vigotsky, A. D. (2020). A case against default effect sizes in sport and exercise science. PeerJ, 8, e10314. https://doi.org/10.7717/peerj.10314.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Cumming, G. (2014). The new statistics: why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966.
De Vet, H. C. W., Terwee, C. B., Ostelo, R. W., Beckerman, H., Knol, D. L., & Bouter, L. M. (2006). Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health and Quality of Life Outcomes, 4(1), 54. https://doi.org/10.1186/1477-7525-4-54.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2014.00781.
van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., Etz, A., Evans, N. J., Gronau, Q. F., Haaf, J. M., Hinne, M., Kucharský, Š., Ly, A., Marsman, M., Matzke, D., Gupta, A. R. K. N., Sarafoglou, A., Stefan, A., Voelkel, J. G., & Wagenmakers, E.-J. (2021). The JASP guidelines for conducting and reporting a Bayesian analysis. Psychonomic Bulletin & Review, 28(3), 813–826. https://doi.org/10.3758/s13423-020-01798-5.
Durlak, J. A. (2009). How to select, calculate, and interpret effect sizes. Journal of Pediatric Psychology, 34(9), 917–928. https://doi.org/10.1093/jpepsy/jsp004.
Fiedler, K., McCaughey, L., & Prager, J. (2021). Quo vadis, methodology? The key role of manipulation checks for validity control and quality of science. Perspectives on Psychological Science, 16(4), 816–826. https://doi.org/10.1177/1745691620970602.
Held, L., & Ott, M. (2016). How the maximal evidence of p‑values against point null hypotheses depends on sample size. The American Statistician, 70(4), 335–341. https://doi.org/10.1080/00031305.2016.1209128.
Held, L., & Ott, M. (2018). On p‑values and Bayes factors. Annual Review of Statistics and Its Application, 5(1), 393–419. https://doi.org/10.1146/annurev-statistics-031017-100307.
Herbert, R. (2019). Significance testing and hypothesis testing: meaningless, misleading and mostly unnecessary. Journal of Physiotherapy, 65(3), 178–181. https://doi.org/10.1016/j.jphys.2019.05.001.
Herbert, R. D. (2000). How to estimate treatment effects from reports of clinical trials. I: Continuous outcomes. Australian Journal of Physiotherapy, 46(3), 229–235. https://doi.org/10.1016/S0004-9514(14)60334-2.
Hussy, W., & Jain, A. (2002). Experimentelle Hypothesenprüfung in der Psychologie. Hogrefe.
Hussy, W., & Möller, H. (1994). Hypothesen. In T. Herrmann & W. Tack (Eds.), Methodologische Grundlagen der Psychologie. Enzyklopädie der Psychologie: Themenbereich B Methodologie und Methoden, Serie I Forschungsmethoden der Psychologie, (Vol. 1, pp. 475–507). Hogrefe.
Jeffreys, H. (1961). Theory of probability (3rd edn.). Oxford University Press.
Jovanovic, M., Torres, R. L., & French, D. N. (2022). Statistical modeling. In D. N. French & L. R. Torres (Eds.), NSCA’s essentials of sport science (pp. 644–701). Human Kinetics.
Kamper, S. J. (2019). Confidence intervals: Linking evidence to practice. Journal of Orthopaedic & Sports Physical Therapy, 49(10), 763–764. https://doi.org/10.2519/jospt.2019.0706.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.2307/2291091.
King, M. T. (2011). A point of minimal important difference (MID): A critique of terminology and methods. Expert Review of Pharmacoeconomics & Outcomes Research, 11(2), 171–184. https://doi.org/10.1586/erp.11.9.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t‑tests and ANOVAs. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2013.00863.
Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Van Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., . . ., & Zwaan, R. A. (2018a). Justify your alpha. Nature Human Behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x.
Lakens, D., Scheel, A. M., & Isager, P. M. (2018b). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963.
Loffing, F. (2022). Raw data visualization for common factorial designs using SPSS: A syntax collection and tutorial. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2022.808469.
Mesquida, C., Murphy, J., Lakens, D., & Warne, J. (2022). Replication concerns in sports and exercise science: a narrative review of selected methodological issues in the field. Royal Society Open Science, 9(12), 220946. https://doi.org/10.1098/rsos.220946.
Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234–248. https://doi.org/10.1037/0021-9010.84.2.234.
Murphy, K. R., & Myors, B. (2023). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (5th edn.). Routledge.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. https://doi.org/10.1037/1082-989X.5.2.241.
Otte, W. M., Vinkers, C. H., Habets, P. C., van Ijzendoorn, D. G. P., & Tijdink, J. K. (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLoS Biology, 20(2), e3001562. https://doi.org/10.1371/journal.pbio.3001562.
Rhea, M. R. (2004). Determining the magnitude of treatment effects in strength training research through the use of the effect size. Journal of Strength and Conditioning Research, 18(4), 918–920. https://doi.org/10.1519/14403.1.
Rouanet, H. (1996). Bayesian methods for assessing importance of effects. Psychological Bulletin, 119, 149–158. https://doi.org/10.1037/0033-2909.119.1.149.
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71. https://doi.org/10.1198/000313001300339950.
Terwee, C. B., Peipert, J. D., Chapman, R., Lai, J.-S., Terluin, B., Cella, D., Griffiths, P., & Mokkink, L. B. (2021). Minimal important change (MIC): A conceptual clarification and systematic review of MIC estimates of PROMIS measures. Quality of Life Research, 30(10), 2729–2754. https://doi.org/10.1007/s11136-021-02925-y.
Tschirk, W. (2019). Bayes-Statistik für Human- und Sozialwissenschaften. Springer. https://doi.org/10.1007/978-3-662-56782-1.
de Vet, H. C. W., Terwee, C. B., Mokkink, L. B., & Knol, D. L. (2011). Measurement in medicine: A practical guide. Cambridge University Press. https://doi.org/10.1017/CBO9780511996214.
Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. Current Directions in Psychological Science, 25(3), 169–176. https://doi.org/10.1177/0963721416643289.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p‑Values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108.
Westermann, R. (2000). Wissenschaftstheorie und Experimentalmethodik. Hogrefe.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E. J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298. https://doi.org/10.1177/1745691611406923.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
D. Büsch and F. Loffing declare that they have no competing interests.
For this article no studies with human participants or animals were performed by any of the authors. All studies mentioned were in accordance with the ethical standards indicated in each case.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12662_2023_915_MOESM1_ESM.docx, 12662_2023_915_MOESM2_ESM.jasp, 12662_2023_915_MOESM3_ESM.sav, 12662_2023_915_MOESM4_ESM.csv, 12662_2023_915_MOESM5_ESM.pdf
The supplementary files contain (1) an exemplary question/hypothesis for an intervention study, (2) two fictitious datasets with syntax files (SPSS 29.0 and JASP 0.17.2.1), (3) the statistical results for the frequentist and Bayesian approaches, (4) graphical representations of the results in SPSS and JASP, (5) instructions for reporting statistical results according to APA guidelines (7th edition), (6) a forest plot illustrating the different effect sizes considering the smallest effect size of interest (SESOI), and (7) suggestions for the interpretation of the different results in the fictitious intervention study.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Büsch, D., Loffing, F. Interpretation of empirical results in intervention studies: a commentary and kick-off for discussion. Ger J Exerc Sport Res (2023). https://doi.org/10.1007/s12662-023-00915-5