Abstract
Sports science, as an empirical science, produces study results that should be interpreted in a hypothesis-driven manner. The validity with which statistically and practically significant results are interpreted depends both on the theoretical foundation of the research question and on the concrete methodological procedure of intervention studies. When hypotheses are considered at the empirical-content and the statistical level, recurring difficulties of interpretation arise as soon as numbers are translated into words or recommendations for action. On the basis of two examples, we aim to initiate a discussion in the scientific community, which could be continued in this journal if there is corresponding interest in methodological issues.
Preliminary remarks
This commentary deliberately refrains from using numbers and equations, as it is intended solely as a suggestion for how certain values, which by now (should) belong to the standard repertoire of empirical research, can be meaningfully used and interpreted. For detailed mathematical–statistical derivations and justifications, we refer to the cited literature. We deliberately chose the format of a commentary in order to initiate an open discussion in empirical sports science via this journal, rather than conducting an internal exchange with reviewers about different positions via the submission platform, especially since we do not present anything new in terms of content but pick out an aspect that we, and probably most colleagues, are repeatedly confronted with in teaching and research. Specifically, we focus on the interpretation of results in intervention studies (de Vet, Terwee, Mokkink, & Knol, 2011), which, following the differentiated discussion of validity by Westermann (2000), can also be framed as improving the validity of interpretation. We address two examples that are located at different hypothesis levels (Hussy & Möller, 1994) but equally illustrate that results of intervention studies should not only be interpreted as “applies” or “does not apply”; under certain conditions in the research and “translation” context, they can and should also be interpreted in a more differentiated way.
Objectives
In our view, the following two points of discussion are illustrative examples of content-related gradations of a differentiated interpretation: (1) an attempt to answer the question of how many times more likely, for example, the alternative hypothesis is than the null hypothesis, in order to strengthen confidence in the confirmation or rejection of a research hypothesis; (2) an attempt to answer the question of whether the effect estimated in a study, or for the population, exceeds, with a certain degree of certainty, a minimum effect that can reasonably be expected.
In the first case, the so-called Bayes factor offers a solution, which can be determined with freely available programs such as R 4.2.0 (R-Project, Vienna, Austria, 2022) or JASP 0.17.2.1 (JASP Team, Amsterdam, Netherlands, 2023) for a number of common inferential statistical methods (van Doorn et al., 2021). In the second case, the determination of effect sizes needs to be extended to include a chance-adjusted minimum effect as well as corresponding confidence intervals (Cumming, 2014; Herbert, 2019; Lakens, 2013). Readers interested in applying the procedures illustrated below are directed to the online Supplementary Material, which includes an example dataset along with an overview of the statistical outputs obtained from frequentist and Bayesian analyses run in both SPSS 29.0 (IBM, Armonk, NY, USA) and JASP.
Interpretation aid at the level of the test hypotheses
In frequentist statistics, the null hypothesis (H0) usually states that there is no difference or relationship, whereas the alternative hypothesis (H1) usually postulates a difference or relationship. One then tests the empirical data against a theoretical model that assumes the validity of the null hypothesis (see e.g. Lakens, 2017; Lakens, Scheel, & Isager, 2018b, for explanations of equivalence tests, which examine whether an effect lies below an a priori determined threshold for the phenomenon of interest). Accordingly, the notorious p-value (also referred to as the empirical probability of error) expresses the conditional probability of observing the empirical data (D), or data even more extreme, under the assumption that the null hypothesis is true, i.e. p = P(D|H0). If the p-value is small and falls below the “magic limit” of the significance level α (also referred to as the critical error probability; Lakens et al., 2018a) set before the statistical analysis, it is classically concluded that the empirical data are probably not compatible with the null hypothesis. Accordingly, the null hypothesis is provisionally rejected and the decision is made in favour of the alternative hypothesis. In the opposite case, if the p-value is above the significance level, the null hypothesis is retained, but not confirmed (!), and the alternative hypothesis is provisionally rejected (Hussy & Jain, 2002; see Otte, Vinkers, Habets, van Ijzendoorn, & Tijdink, 2022, for an analysis of the formulations used in scientific articles for statistically nonsignificant results). This is probably the most widespread dichotomous decision rule in empirical (sports) science, although it has been the subject of critical discussion and repeated misinterpretation since its introduction (Nickerson, 2000).
Among other things, this rule “forbids” comparing p-values with each other in such a way that a smaller p-value is interpreted as evidence for a larger effect under the alternative hypothesis, or vice versa (Büsch & Strauß, 2016).
In contrast to the frequentist approach, Bayesian statistics treats the alternative (H1) and null hypothesis (H0) as “competing models”, given that empirical data (D) are available. The resulting Bayes factor (BF) quantifies the strength of evidence for H1 relative to H0, and depending on whether the research hypothesis corresponds to H1 or H0, the BF can be calculated in favour of either hypothesis (Kass & Raftery, 1995). A key advantage of the BF is that, by comparing the two hypotheses, we can express how many times more likely the data are under one hypothesis, e.g. H1, than under the other, e.g. H0. Assuming we want to calculate the BF in favour of H1 [BF10 = P(D|H1)/P(D|H0)], the interpretation options summarised in Table 1 apply. The BF can alternatively be specified for H0 as the inverse of BF10 (i.e. BF01 = 1/BF10), which, in contrast to the frequentist approach, allows the statistical evidence in favour of H0 (relative to H1) to be estimated and quantified (Dienes, 2014). The Bayes factor can also be interpreted as an update factor for repeated measurements (e.g. study replications), indicating what we can learn from newly acquired data. The question that repeated measurements can answer is to what extent new data change our confidence (belief) in, for example, the validity of two competing hypotheses H1 and H0. For example, if before collecting new data the probability of H1 is assumed to be 60% [P(H1) = 0.6] and the probability of H0 to be 40% [P(H0) = 0.4], and the new data yield BF10 = 1, then we would learn nothing new (no evidence) from the data: with BF10 = 1, the probability of observing the data at hand (D) under H1, P(D|H1), is identical to the probability of observing them under H0, P(D|H0).
Conversely, the larger the BF10, the higher our confidence in the validity of H1 over H0 in the before–after comparison (Table 1). To determine the updated confidence in the validity of H1 relative to H0 (posterior beliefs), i.e. the ratio of the probabilities in favour of H1 and H0 in the presence of new data, the ratio of the probability assumptions before the new data (prior beliefs) needs to be combined with the Bayes factor BF10 (for a mathematical justification see e.g. Wagenmakers, Morey, & Lee, 2016). A high value of BF10 shifts the initial probability ratio of H1 to H0, quantified in the prior beliefs, in favour of H1 in the posterior beliefs. Assuming the Bayes factor associated with the above study example (prior beliefs = 0.6/0.4 = 1.5) is BF10 = 10, the posterior beliefs are 1.5 × 10 = 15. Whereas before the new data H1 was classified as only 1.5 times more likely than H0, with the new data taken into account we have stronger confidence in the validity of H1 over H0, because H1 now appears 15 times more likely than H0. Conversely, new data can also increase our confidence in the validity of H0 over H1. This is the case whenever BF10 is smaller than 1, and the evidence becomes stronger the closer BF10 approaches zero (Table 1).
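The updating rule just described (posterior odds = prior odds × Bayes factor) can be sketched in a few lines of Python; the numbers reproduce the worked example from the text (prior probabilities 0.6 and 0.4, BF10 = 10):

```python
def posterior_odds(p_h1: float, p_h0: float, bf10: float) -> float:
    """Posterior odds for H1 over H0: prior odds multiplied by BF10."""
    prior_odds = p_h1 / p_h0
    return prior_odds * bf10

# Worked example from the text: P(H1) = 0.6, P(H0) = 0.4, BF10 = 10
print(posterior_odds(0.6, 0.4, 10))  # 15.0 -> H1 now 15 times more likely
# With BF10 = 1 the odds stay at the prior odds of 1.5: nothing is learned
print(posterior_odds(0.6, 0.4, 1))
```

This is only an illustration of the arithmetic; in practice the Bayes factor itself would come from software such as JASP or R.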
Despite the clear differences between the p-value on the one hand and the Bayes factor on the other, it should be noted that the Bayes factor tends to increase as the p-value decreases, although the p-value only permits a statement against the null hypothesis (Wetzels et al., 2011). A p-value based on the frequentist approach can be converted into an estimated upper bound for the Bayes factor BF10 (the so-called Bayes factor bound, BFB), with or without including the sample size and, if necessary, considering restrictions on the magnitude of the p-value (for detailed explanations see, among others, Benjamin & Berger, 2019; Held & Ott, 2016, 2018; Sellke, Bayarri, & Berger, 2001). The Bayes factor bound represents the highest possible Bayes factor BF10 that is compatible with the observed p-value and thus provides, e.g. for reviewers of a submitted paper with a frequentist data analysis, a helpful measure for assessing the statistical evidence of a result in favour of H1 relative to H0. The Bayes factor bound for p = 0.05 is BFB = 2.46, for p = 0.01 it is BFB = 7.99 and for p = 0.001 it is BFB = 53.26. In particular, a p-value just below the often-used significance level of α = 0.05 is thus associated with only marginal, anecdotal evidence (Table 1), which, in the words of Jeffreys (1961), is “worth no more than a bare mention”. Even where p-value and Bayes factor correspond in tendency, the p-value does not allow a direct assessment of the probabilities of two competing assumptions, which interest us in science as well as in everyday life, e.g. when assessing the trustworthiness of strangers (Tschirk, 2019; Wasserstein & Lazar, 2016). From the perspective of the Bayes factor, it would also be worth reconsidering, for the frequentist approach, whether instead of the usual critical error probability of α = 0.05 it would be more consistent to use α = 0.01 or smaller as the standard criterion if substantial evidence is aimed for (Benjamin & Berger, 2019; Cohen, 1994; Wetzels et al., 2011; for discussion, see e.g. Benjamin et al., 2018; Lakens et al., 2018a). One side effect to be considered is that tightening the requirements for statistical evidence, i.e. stricter testing, would also have direct consequences, e.g. for the minimum sample size needed (see Brysbaert, 2019, for a tutorial covering both the frequentist and the Bayesian perspective).
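The quoted bound values can be reproduced with the simplest of the calibrations discussed by Sellke, Bayarri, and Berger (2001), BFB = 1/(−e · p · ln p), which holds for p < 1/e and does not involve the sample size. A minimal Python sketch:

```python
import math

def bayes_factor_bound(p: float) -> float:
    """Upper bound on BF10 implied by a p-value (Sellke-Bayarri-Berger
    calibration), valid for 0 < p < 1/e: BFB = 1 / (-e * p * ln p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound is defined only for 0 < p < 1/e")
    return 1 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 2))
# 0.05 -> 2.46, 0.01 -> 7.99, 0.001 -> 53.26
```

Other variants of the bound (e.g. sample-size dependent ones, Held & Ott, 2016) give different numbers; the formula above is the one matching the values cited in the text.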
Interpretation aid on the level of empirical-content hypotheses
While the interpretation of the p-value is of decisive importance at the level of test hypotheses, i.e. statistical prediction, interpretation at the level of an empirical-content hypothesis requires a focus on the effect size. Despite all justified criticism in the scientific community, the conventions of Cohen (1988) are largely preferred for interpretation, although the author himself pointed out the context-dependency of his suggestions (see also e.g. Caldwell & Vigotsky, 2020; Durlak, 2009; Mesquida, Murphy, Lakens, & Warne, 2022). A small intervention effect in competitive athletes, for whom marginal differences can determine success or failure, must be interpreted differently from a small intervention effect in novice athletes, for whom even some regularity of physical activity can lead to short-term, sometimes exponential improvements in performance (Rhea, 2004).
Effect sizes denote the magnitude of a population or sample effect and can be represented either as a difference or distance measure, e.g. the effect size d for the mean difference relative to the common dispersion, or as a correlation measure between two variables, e.g. the effect size r (Cohen, 1988). Since distance measures can be converted into correlation measures and vice versa, we limit ourselves in the following to the most frequently used effect size d for interpreting the magnitude of a difference between two groups or of a change within a sample.
In contrast to the conventions for d, i.e. small effect d ≥ 0.2, medium effect d ≥ 0.5 and large effect d ≥ 0.8, a context-dependent interpretation of the effect requires the following question to be asked before the study begins: “How large should a potential effect be for the intervention to be interpreted as worthwhile?” or the other way round: “How large could a potential effect be without the intervention being interpreted as worthwhile?” Both questions can be simplified to: “What should be the minimum size of an effect so that the intervention could be interpreted as worthwhile?” A corresponding minimum effect size can be determined either on the theoretical and/or content level or on the methodological level of measurement. In the first case, the minimum effect size could be derived from theory considering the thematically relevant studies. In the second case, a measurement–methodological effect size would have to be defined that would be larger than a null effect (Cumming, 2014), but that would be too small or trivial for a substantive interpretation. Possible reasons for small or trivial and thus negligible effects could result, for example, from the reliability of the measurement procedure, the homogeneity of the variances of the differences, etc. The basic assumption of this approach is reminiscent of the differentiation in individual case analyses between the minimal important change (MIC), which can be based on a consensus of experts, for example, and the minimal detectable change (MDC), which is essentially dependent on the accuracy of the measurement procedure or the standard error of measurement (SEM) (De Vet et al., 2006; King, 2011; Terwee et al., 2021).
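The measurement-methodological threshold mentioned above, the minimal detectable change, is commonly derived from the standard error of measurement via MDC = z · √2 · SEM (with z = 1.96 for a 95% confidence level), a standard formula in the MDC literature (e.g. de Vet et al., 2006). A short sketch, with a hypothetical SEM value for illustration:

```python
import math

def mdc(sem: float, z: float = 1.96) -> float:
    """Minimal detectable change at the given confidence level
    (default 95%): MDC = z * sqrt(2) * SEM."""
    return z * math.sqrt(2) * sem

# Hypothetical example: a test with a standard error of measurement of 1.2 units
print(round(mdc(1.2), 2))  # changes smaller than ~3.33 units are not
                           # distinguishable from measurement error
```

Any observed change below this value cannot be separated from measurement noise, which is one concrete way to justify a measurement-based minimum effect.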
The approach of minimum effect sizes goes back, among others, to the considerations of Murphy and Myors (1999) on testing minimum-effect hypotheses. The authors address the issue that, although null hypothesis testing (in the sense of an effect being exactly zero) is easy to implement and has become widely accepted in empirical research despite all criticism and misinterpretation, true null effects do not reflect reality. Minimum-effect hypotheses, in contrast, ask whether an effect is “good enough” to describe an intervention as worthwhile, or as more worthwhile than other interventions. This corresponds to the considerations on the Bayes factor explained earlier, concerning the question of whether new data change our confidence in the likelihood ratio of two competing hypotheses (see also Rouanet, 1996). In Murphy and Myors (1999, 2023), the decision on minimum effects is based, within the framework of the F-statistic (e.g. for analyses of variance), on what percentage of explained variance can be regarded as negligible, i.e. treated as equivalent to a true null effect. Assuming, for example, that up to 1% of explained variance in the F-statistic could be neglected, this would correspond to an effect size of η2 = 0.01. Transferred to the t-statistic, this assumption would correspond to a d-value of approximately 0.20 (for the conversion of effect sizes, see e.g. https://www.psychometrica.de/effect_size.html). If the reliability of a measurement procedure can be assumed to be very high, e.g. ICC ≥ 0.95 or ICC ≥ 0.99, the effect to be neglected can be set lower. For example, an explained variance of 0.5% (η2 = 0.005) would correspond to d ≅ 0.14 and of 0.1% (η2 = 0.001) to d ≅ 0.06.
Ultimately, it must be decided which minimum effect appears negligible or, vice versa, at which threshold value an interesting or worthwhile effect can be assumed (see also minimum effect tests, Jovanovic, Torres, & French, 2022).
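The η² values quoted above can be checked against the standard conversion d = 2 · √(η²/(1 − η²)), the relation used by common effect-size calculators such as the one referenced in the text:

```python
import math

def eta2_to_d(eta2: float) -> float:
    """Convert explained variance (eta squared) to Cohen's d
    via d = 2 * sqrt(eta2 / (1 - eta2))."""
    return 2 * math.sqrt(eta2 / (1 - eta2))

for e2 in (0.01, 0.005, 0.001):
    print(e2, round(eta2_to_d(e2), 2))
# 0.01 -> 0.2, 0.005 -> 0.14, 0.001 -> 0.06
```

The three outputs match the approximate d-values of 0.20, 0.14 and 0.06 given in the text.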
The minimum effect size can also be referred to as the smallest effect size of interest (SESOI) and marks the threshold or reference point below which the effect size (ES) in the sample under investigation, including its confidence interval (CI), should not fall (Anvari & Lakens, 2021). The SESOI needs to be documented in a justified and transparent manner. The confidence interval serves to estimate whether an intervention effect is large enough to be reproduced with a certain probability and to remain consistently larger than the minimum effect (Herbert, 2000, 2019; Kamper, 2019). This means one can be somewhat more certain of not having backed the wrong horse, i.e. the wrong intervention, and that the effect is highly likely to be meaningful. This effect-size-oriented (magnitude-based) approach can be illustrated with the help of a tree plot (Fig. 1).
Tree plot of an effect size with confidence interval and minimum effect. ES effect size, CI confidence interval, SESOI smallest effect size of interest. (adapted based on Herbert, 2000, p. 232)
By considering the effect size and confidence interval as well as the minimum effect and the null effect, i.e. the absence of a (substantial) difference between two measurement points, intervention effects can be interpreted validly. If, for example, the effect size and the confidence interval lie above the minimum effect (SESOI), the intervention can be assumed to be beneficial or meaningful. If, on the other hand, the effect size and the confidence interval lie between the null effect and the minimum effect, the intervention can be assumed to be effective but not worthwhile, whereas if both lie below the null effect the intervention would be assessed as not beneficial or not meaningful, possibly even as increasingly harmful. However, the validity of the interpretation is always limited if the confidence interval includes a reference point, i.e. either the minimum effect or the null effect (Fig. 2).
Interpretation of intervention effects in consideration of a minimum effect size and a null effect (adapted based on Kamper, 2019, p. 764)
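The decision logic just described can be sketched as a small classifier. This is only an illustrative reading of the interpretation rules, assuming a two-sided confidence interval given by its lower and upper bounds on the d-scale; the verbal labels are ours:

```python
def interpret_effect(ci_low: float, ci_high: float, sesoi: float) -> str:
    """Classify an intervention effect from its confidence interval
    relative to the null effect (0) and a minimum effect (SESOI).
    Intervals that include a reference point remain inconclusive."""
    if ci_low > sesoi:
        return "worthwhile"             # whole CI above the minimum effect
    if 0 < ci_low and ci_high < sesoi:
        return "effective but trivial"  # CI between null effect and SESOI
    if ci_high < 0:
        return "harmful"                # whole CI below the null effect
    return "inconclusive"               # CI includes a reference point

# Hypothetical examples with SESOI = 0.20
print(interpret_effect(0.45, 0.85, 0.20))   # worthwhile
print(interpret_effect(-0.10, 0.30, 0.20))  # inconclusive
```

The fourth branch makes explicit the caveat from the text: whenever the interval spans the null effect or the SESOI, no firm verbal conclusion is warranted.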
Concluding remarks
In this commentary we focused on two selected issues at different hypothesis levels. Of course, many other aspects of empirical (sports) research need to be considered for sufficient interpretive validity, such as a sound theoretical foundation and methodological rigour (Fiedler, McCaughey, & Prager, 2021), visualisation of the (raw) data (Loffing, 2022), etc. The aspects considered here represent only a small section of a complex research process, often underestimated in its interactions, which holds traps at various points that each of us has stepped into at some time and which limit the validity of studies. What the different approaches have in common is the striving for an improvement in methodological precision that also offers substantial added value for sports science research. Consequently, this commentary is to be understood as the starting whistle of a continuous and constantly developing discussion, not as the final whistle of a discussion that may also be tiring to some degree (Mesquida et al., 2022). As a kick-off, we finally formulate two provocative suggestions for the interpretation of study results: (1) always calculate the Bayes factor to enable a (better) estimation or interpretation of the trustworthiness of your (statistical) test hypothesis of interest! (2) Always determine and justify the minimum effect size in advance to enable a (better) estimation or interpretation of the substantial benefit of your intervention!
References
Anvari, F., & Lakens, D. (2021). Using anchor-based methods to determine the smallest effect size of interest. Journal of Experimental Social Psychology, 96, 104159. https://doi.org/10.1016/j.jesp.2021.104159.
Benjamin, D. J., & Berger, J. O. (2019). Three recommendations for improving the use of p‑values. The American Statistician, 73(sup1), 186–191. https://doi.org/10.1080/00031305.2018.1543135.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M., Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., . . ., & Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition. https://doi.org/10.5334/joc.72.
Büsch, D., & Strauß, B. (2016). Wider die „Sternchenkunde“! Sportwissenschaft, 46(2), 53–59. https://doi.org/10.1007/s12662-015-0376-x.
Caldwell, A., & Vigotsky, A. D. (2020). A case against default effect sizes in sport and exercise science. PeerJ, 8, e10314. https://doi.org/10.7717/peerj.10314.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Cumming, G. (2014). The new statistics: why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966.
De Vet, H. C. W., Terwee, C. B., Ostelo, R. W., Beckerman, H., Knol, D. L., & Bouter, L. M. (2006). Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health and Quality of Life Outcomes, 4(1), 54. https://doi.org/10.1186/1477-7525-4-54.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2014.00781.
van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., Etz, A., Evans, N. J., Gronau, Q. F., Haaf, J. M., Hinne, M., Kucharský, Š., Ly, A., Marsman, M., Matzke, D., Gupta, A. R. K. N., Sarafoglou, A., Stefan, A., Voelkel, J. G., & Wagenmakers, E.-J. (2021). The JASP guidelines for conducting and reporting a Bayesian analysis. Psychonomic Bulletin & Review, 28(3), 813–826. https://doi.org/10.3758/s13423-020-01798-5.
Durlak, J. A. (2009). How to select, calculate, and interpret effect sizes. Journal of Pediatric Psychology, 34(9), 917–928. https://doi.org/10.1093/jpepsy/jsp004.
Fiedler, K., McCaughey, L., & Prager, J. (2021). Quo vadis, methodology? The key role of manipulation checks for validity control and quality of science. Perspectives on Psychological Science, 16(4), 816–826. https://doi.org/10.1177/1745691620970602.
Held, L., & Ott, M. (2016). How the maximal evidence of p‑values against point null hypotheses depends on sample size. The American Statistician, 70(4), 335–341. https://doi.org/10.1080/00031305.2016.1209128.
Held, L., & Ott, M. (2018). On p‑values and Bayes factors. Annual Review of Statistics and Its Application, 5(1), 393–419. https://doi.org/10.1146/annurev-statistics-031017-100307.
Herbert, R. (2019). Significance testing and hypothesis testing: meaningless, misleading and mostly unnecessary. Journal of Physiotherapy, 65(3), 178–181. https://doi.org/10.1016/j.jphys.2019.05.001.
Herbert, R. D. (2000). How to estimate treatment effects from reports of clinical trials. I: Continuous outcomes. Australian Journal of Physiotherapy, 46(3), 229–235. https://doi.org/10.1016/S0004-9514(14)60334-2.
Hussy, W., & Jain, A. (2002). Experimentelle Hypothesenprüfung in der Psychologie. Hogrefe.
Hussy, W., & Möller, H. (1994). Hypothesen. In T. Herrmann & W. Tack (Eds.), Methodologische Grundlagen der Psychologie. Enzyklopädie der Psychologie: Themenbereich B Methodologie und Methoden, Serie I Forschungsmethoden der Psychologie, (Vol. 1, pp. 475–507). Hogrefe.
Jeffreys, H. (1961). Theory of probability (3rd edn.). Oxford University Press.
Jovanovic, M., Torres, R. L., & French, D. N. (2022). Statistical modeling. In D. N. French & L. R. Torres (Eds.), NSCA’s essentials of sport science (pp. 644–701). Human Kinetics.
Kamper, S. J. (2019). Confidence intervals: Linking evidence to practice. Journal of Orthopaedic & Sports Physical Therapy, 49(10), 763–764. https://doi.org/10.2519/jospt.2019.0706.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.2307/2291091.
King, M. T. (2011). A point of minimal important difference (MID): A critique of terminology and methods. Expert Review of Pharmacoeconomics & Outcomes Research, 11(2), 171–184. https://doi.org/10.1586/erp.11.9.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t‑tests and ANOVAs. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2013.00863.
Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Van Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., . . ., & Zwaan, R. A. (2018a). Justify your alpha. Nature Human Behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x.
Lakens, D., Scheel, A. M., & Isager, P. M. (2018b). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963.
Loffing, F. (2022). Raw data visualization for common factorial designs using SPSS: A syntax collection and tutorial. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2022.808469.
Mesquida, C., Murphy, J., Lakens, D., & Warne, J. (2022). Replication concerns in sports and exercise science: a narrative review of selected methodological issues in the field. Royal Society Open Science, 9(12), 220946. https://doi.org/10.1098/rsos.220946.
Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234–248. https://doi.org/10.1037/0021-9010.84.2.234.
Murphy, K. R., & Myors, B. (2023). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (5th edn.). Routledge.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. https://doi.org/10.1037/1082-989X.5.2.241.
Otte, W. M., Vinkers, C. H., Habets, P. C., van Ijzendoorn, D. G. P., & Tijdink, J. K. (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLoS Biology, 20(2), e3001562. https://doi.org/10.1371/journal.pbio.3001562.
Rhea, M. R. (2004). Determining the magnitude of treatment effects in strength training research through the use of the effect size. Journal of Strength and Conditioning Research, 18(4), 918–920. https://doi.org/10.1519/14403.1.
Rouanet, H. (1996). Bayesian methods for assessing importance of effects. Psychological Bulletin, 119, 149–158. https://doi.org/10.1037/0033-2909.119.1.149.
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71. https://doi.org/10.1198/000313001300339950.
Terwee, C. B., Peipert, J. D., Chapman, R., Lai, J.-S., Terluin, B., Cella, D., Griffiths, P., & Mokkink, L. B. (2021). Minimal important change (MIC): A conceptual clarification and systematic review of MIC estimates of PROMIS measures. Quality of Life Research, 30(10), 2729–2754. https://doi.org/10.1007/s11136-021-02925-y.
Tschirk, W. (2019). Bayes-Statistik für Human- und Sozialwissenschaften. Springer. https://doi.org/10.1007/978-3-662-56782-1.
de Vet, H. C. W., Terwee, C. B., Mokkink, L. B., & Knol, D. L. (2011). Measurement in medicine: A practical guide. Cambridge University Press. https://doi.org/10.1017/CBO9780511996214.
Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. Current Directions in Psychological Science, 25(3), 169–176. https://doi.org/10.1177/0963721416643289.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p‑Values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108.
Westermann, R. (2000). Wissenschaftstheorie und Experimentalmethodik. Hogrefe.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E. J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298. https://doi.org/10.1177/1745691611406923.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
D. Büsch and F. Loffing declare that they have no competing interests.
For this article no studies with human participants or animals were performed by any of the authors. All studies mentioned were in accordance with the ethical standards indicated in each case.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12662_2023_915_MOESM1_ESM.docx, 12662_2023_915_MOESM2_ESM.jasp, 12662_2023_915_MOESM3_ESM.sav, 12662_2023_915_MOESM4_ESM.csv, 12662_2023_915_MOESM5_ESM.pdf
The supplementary files contain (1) an exemplary question/hypothesis for an intervention study, (2) two fictitious datasets with syntax files (SPSS 29.0 and JASP 0.17.2.1), (3) the statistical results for the frequentist and Bayesian approaches, (4) graphical representations of the results in SPSS and JASP, (5) instructions for reporting statistical results according to APA guidelines (7th edition), (6) a forest plot illustrating the different effect sizes considering the smallest effect size of interest (SESOI), and (7) suggestions for the interpretation of the different results in the fictitious intervention study.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Büsch, D., Loffing, F. Interpretation of empirical results in intervention studies: a commentary and kick-off for discussion. Ger J Exerc Sport Res (2023). https://doi.org/10.1007/s12662-023-00915-5