Historically, physicians selected treatments based on opinion, religious belief or ill-defined experience. This included the use of bloodletting for various ailments [2] and mercury for syphilis [3], treatments that were not only ineffective but harmful. Over time, medical researchers began to carry out intervention studies, which aim to compare the effectiveness of an intervention either to no treatment or to the current best treatment. Perhaps the earliest intervention study in sport and exercise science was reported by Ben Cao Tu Jing in the eleventh century:
“It was said that in order to evaluate the effect of genuine Shangdang ginseng, two persons were asked to run together. One was given the ginseng while the other ran without. After running for approximately three to five li [≈1500 to 2500 m], the one without the ginseng developed severe shortness of breath, while the one who took the ginseng breathed evenly and smoothly” [4].
Whilst this intervention study provided evidence for a beneficial effect of ginseng, the evidence is of low quality: the breathing measurement is subjective, the subjects probably differed at baseline, there was only one subject per group, and the researcher probably had a conflict of interest, as he seemed enthusiastic about ginseng. This highlights that scientific evidence is not automatically a truth but needs to be interpreted critically. Over time, scientists learned to design more robust intervention experiments as they improved scientific methods and reduced bias and other sources of error.
Since Ben Cao Tu Jing’s historical experiment, researchers have used many study types to investigate the effect of a given medical treatment or exercise on an outcome. The most important distinction is between observational studies, in which the investigator merely observes the effect of a treatment but does not control it, and experimental studies, in which the experimenter administers the treatment or intervention. A key step forward was the advent of randomized controlled trials (RCTs) with additional control measures such as blinding and placebos, as advocated by Archie Cochrane, who deemed these designs a superior alternative to observational studies and anecdotal opinions [5].
So how has the refinement of studies that test the effectiveness of an intervention led to “evidence-based practice”? The term “evidence-based medicine”, or more broadly “evidence-based practice”, was first used in the 1990s by researchers at McMaster University and has been especially linked to David Sackett, who, from 1994 to 1999, led the Centre for Evidence-Based Medicine at the University of Oxford. In a review titled “Evidence based medicine: what it is and what it isn’t”, Sackett and colleagues defined evidence-based medicine as …
“the conscientious, explicit, and judicious use of the current best evidence in making decisions about the care of individual patients” [1].
By that definition, David Sackett and colleagues instruct evidence-based practitioners to do two things: first, gather all of the currently available scientific evidence of sufficient quality and, second, interpret and apply that evidence. Importantly, the phrase “conscientious, explicit, and judicious use” means that we do not necessarily need to do exactly what the evidence says; rather, we have the option to make alternative, subjective decisions if we feel that there is a good reason to do so.
Several issues must be considered when judging the available evidence. First, not all scientific data are of equal quality. Single pieces of evidence include subjective “expert” opinion, case reports and series, non-randomized controlled trials, and randomized controlled trials. On the path towards becoming an evidence-based practitioner, we must, therefore, develop the skill of critically interpreting scientific evidence. To aid this, Greenhalgh, for example, has published the book “How to Read a Paper” [6].
A second issue is that it is impractical and not time-effective to systematically search for and read all the current best evidence for an intervention decision, as there may be dozens of observational or experimental studies in relation to a single intervention such as the use of aspirin for treating headaches. This demand has led to systematic reviews and meta-analyses, two publication types that support evidence-based practice by aiming to synthesize the current best evidence for treatment decisions. These forms of evidence also form the tip of the so-called evidence pyramid (Fig. 1), which illustrates the hierarchy of different levels of evidence [7]. In this hierarchy, expert opinion ranks as the lowest form of evidence, randomized controlled trials rank as the highest form of evidence for a single study, and systematic reviews and meta-analyses rank as the highest form of overall evidence, as they attempt to synthesize the scientific evidence for a treatment decision or other topic.
Importantly, the evidence pyramid does not imply that less robust study types are flawed, should not be carried out, or should be ignored as evidence for training decisions. For example, case studies or case series are an essential study type for generating scientific evidence when it is impossible to recruit sufficient numbers of subjects or when the interventions are too long for well-controlled trials. Examples are case studies of world-class athletes or astronauts, or studies of an athlete over, e.g. a full 4-year Olympic cycle.
The quality of an evidence statement can be graded using different options including the detailed grading system by the Oxford Centre for Evidence-Based Medicine [8]. A much simpler grading system is the GRADE (short for Grading of Recommendations Assessment, Development and Evaluation) system [9], which has four levels of evidence in relation to the effect of an intervention such as exercise training:
- Very low quality: Any estimate of effect is very uncertain.
- Low quality: Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.
- Moderate quality: Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.
- High quality: Further research is unlikely to change our confidence in the estimate of effect.
Grading the quality of evidence is good practice as it indicates the confidence that can be placed in its veracity. Importantly, a “very low quality” rating does not mean that, e.g. a given exercise training intervention does not cause an adaptation; it just means there is a lack of supporting scientific evidence for its use or, put simply, that we do not know whether it works. In contrast, if the evidence for a training intervention is rated “high quality”, then we can be confident that it is likely to cause the desired effect. Caveats are that the effects sometimes differ across populations, and that the evidence refers only to the average response, which is problematic if there is large variation in trainability (as is common in applied exercise interventions).
Evidence-based medicine has now become the gold standard in medical decision-making, i.e. in deciding how to treat a patient. That said, evidence-based medicine is not without issues, as summarized by Greenhalgh et al. [10]. One such problem is that conflicts of interest can lead to the design of biased trials, the non-reporting of negative results, and/or the biased reporting of data [11]. This is of particular concern for company-funded medical and exercise intervention trials. For example, if Sports Drink A had a high sugar content, then a biased randomized controlled trial could use glycogen-depleted individuals and test whether subjects run a faster marathon whilst drinking the sports drink versus water. Because sufficient glycogen stores and carbohydrate ingestion are important for marathon running [12], we would expect this biased design to yield the desired result; namely, the glycogen-depleted subjects would probably run the marathon faster when consuming the sugary sports drink compared to water.
A second problem is that the statistical analyses often employed in the applied sciences are inherently flawed from a decision-making standpoint. A primary problem here is the focus on null hypothesis significance testing (i.e. p values < 0.05 used for accept-or-reject-hypothesis conclusions), which has been termed “dichotomania”. The issue is that we should never conclude ‘no difference’ or ‘no relationship’ in an intervention simply because a p value is above a given threshold of “statistical significance” [13]. A p value only indicates how likely the observed data (or more extreme data) would be if there were no true effect, and it should be considered just one of several decision-making criteria. It can be argued that magnitude-based statistics such as effect sizes and confidence intervals are potentially more valuable for deciding whether, e.g. the adaptation to a training intervention is likely to exceed measurement variability and to be biologically meaningful [13, 14]. Moreover, when interpreting research findings, one should also examine individual effects via graphs that present the data points for each participant, as there may be responders and non-responders [15,16,17,18]. We know, for example, that the adaptation to the same endurance [19, 20] or resistance training [21, 22] varies greatly, and so a training intervention might trigger the desired adaptation in some individuals but not in others.
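To make this distinction concrete, the sketch below contrasts a p value with magnitude-based statistics for a hypothetical two-group training study. It is an illustration only, not an analysis from the cited studies: the group sizes, simulated change scores and outcome measure are assumptions.

```python
# Minimal, hypothetical example: compare simulated change scores of a
# "training" and a "control" group, then report a p value alongside
# magnitude-based statistics (mean difference, Cohen's d, 95% CI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated change scores (e.g. change in VO2max in ml/kg/min); group
# sizes, means and SDs are invented for illustration.
training = rng.normal(loc=3.0, scale=4.0, size=12)
control = rng.normal(loc=0.5, scale=4.0, size=12)

# Null hypothesis significance test (independent-samples t test).
t_stat, p_value = stats.ttest_ind(training, control)

# Magnitude-based statistics.
mean_diff = training.mean() - control.mean()
pooled_sd = np.sqrt((training.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = mean_diff / pooled_sd
se_diff = np.sqrt(training.var(ddof=1) / training.size
                  + control.var(ddof=1) / control.size)
dof = training.size + control.size - 2
ci_low, ci_high = stats.t.interval(0.95, dof, loc=mean_diff, scale=se_diff)

print(f"p = {p_value:.3f}")
print(f"mean difference = {mean_diff:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Cohen's d = {cohens_d:.2f}")

# Printing (or plotting) individual change scores reveals potential
# responders and non-responders that the group mean conceals.
print("individual training responses:", np.round(training, 1))
```

Whatever the p value, the confidence interval and the spread of individual responses give a better sense of whether the effect exceeds measurement variability and is practically meaningful.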
A third point concerns the research focus on hypotheses: a hypothesis is essentially a subjective statement of the outcome of an experiment before that experiment is conducted, and so, arguably, it introduces bias. Despite these flaws, and even though few current papers in esteemed journals such as Nature, Science or Cell appear to state explicit hypotheses, hypothesis testing has nonetheless turned into a dogma of scientific research. David Glass has questioned the use of hypotheses and Popper’s falsification idea; instead of hypotheses, he recommends open research questions, as open questions are not only less biased but also more intuitive than hypotheses [23]. This is a controversial point.
A fourth issue is that the volume of evidence, or specifically the number of publications, is large for many treatment decisions [10]. For example, a PubMed search for “exercise” and “training” revealed 461,115 publications as of May 2021, far beyond the reading capacity of any individual. However, systematic searches for specific keywords can generally reduce the number of publications to a manageable level. In addition, systematic reviews and meta-analyses can help to synthesize the results of many studies related to specific interventions. There are caveats, however, as meta-analyses may pool data from studies of variable quality. Consequently, conclusions from such analyses are only as good as the quality of the included individual studies. Thus, evidence-based practitioners should avoid treating systematic reviews and meta-analyses as scientific truths. Rather, it is essential to critically assess whether they have been well conducted and whether the population, intervention, controls and outcomes are similar to those of the planned intervention about which we wish to learn.
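As an illustration of how keyword-driven searches narrow the literature, the sketch below queries PubMed programmatically with Biopython’s Entrez module. The query strings and the contact e-mail address are assumptions for illustration, and the record counts returned will change as new studies are indexed.

```python
# Illustrative sketch only: compare the size of a broad PubMed search with
# a narrower, keyword-restricted search using Biopython's Entrez module.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

def pubmed_count(query: str) -> int:
    """Return the number of PubMed records that match a query."""
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

broad = "exercise AND training"
# Hypothetical, narrower question: RCTs on resistance training and hypertrophy.
specific = ('"resistance training"[Title/Abstract] AND '
            '"muscle hypertrophy"[Title/Abstract] AND '
            'randomized controlled trial[Publication Type]')

print(f"{pubmed_count(broad):>7} records for: {broad}")
print(f"{pubmed_count(specific):>7} records for: {specific}")
```

In practice, such a query would be refined with terms describing the population, intervention, controls and outcomes of interest before the remaining records are screened by hand.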