Introduction

Health professions education (such as medicine, physiotherapy, and clinical psychology) covers a large amount of theoretical and practical content across a broad range of subjects to prepare students for entering the workplace. Distributed practice (spaced practice) is spacing out study over time, as opposed to massing (or cramming) study. Retrieval practice is the act of recalling information from memory, such as by using practice tests. Both strategies have been reported as effective at improving knowledge retention in a number of contexts and are thus considered to benefit health professions education (Dunlosky et al., 2013). These strategies are also considered ‘desirable difficulties’, a term coined to describe study strategies that feel challenging but are often more effective than those that feel easy (Bjork & Bjork, 2011). Previous research in health professions education demonstrates that students and educators often hold misconceptions about which study strategies are effective, and commonly use strategies that are considered less effective (Piza et al., 2019). Even with a clear concept of effective study strategies, students will often revert during a unit of learning to less effective strategies than originally intended (Blasiman et al., 2017). Exploring the effectiveness of distributed practice and retrieval practice in health professions education is therefore indicated to help guide students and educators.

Distributed practice in previous research is often compared to no intervention, massed study, or varying the inter-study interval (ISI). The ISI is the interlude separating study sessions, and ISI schedules consist of three main types: expanding, contracting, and equal. Expanding schedules gradually increase the ISIs, contracting schedules gradually decrease them, and equal schedules keep them constant, with research demonstrating varying effectiveness (Gerbier et al., 2015; Karpicke & Roediger, 2007, 2010; Küpper-Tetzel et al., 2014). The overall length of an ISI also affects learning outcomes, with ISIs of up to 29 days demonstrating improved long-term memory outcomes when compared to shorter ISIs (Cepeda et al., 2006; Rohrer, 2015). Furthermore, as the retrieval interval increases, concurrently increasing the ISIs improves outcomes compared to shorter ISIs (Cepeda et al., 2008).
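As a concrete illustration of the three schedule types, the minimal sketch below converts lists of ISIs into study-session days. The interval values are hypothetical, chosen for demonstration only, and are not drawn from any of the reviewed studies.

```python
# Hypothetical illustration of expanding, contracting, and equal ISI
# schedules; the interval values are assumed, not from the reviewed studies.

def session_days(isis):
    """Convert a list of ISIs (days between sessions) into session days,
    starting from day 0."""
    days = [0]
    for isi in isis:
        days.append(days[-1] + isi)
    return days

expanding = session_days([1, 3, 7])    # ISIs grow     -> sessions on days [0, 1, 4, 11]
contracting = session_days([7, 3, 1])  # ISIs shrink   -> sessions on days [0, 7, 10, 11]
equal = session_days([4, 4, 4])        # ISIs constant -> sessions on days [0, 4, 8, 12]
```

Note that all three hypothetical schedules cover the same total span (11 to 12 days) and the same number of sessions; they differ only in how the gaps are distributed, which is the variable the cited studies manipulate.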

Retrieval practice includes three main types, each varying in cognitive load: recognition, cued recall, and free recall. Recognition questions, such as multiple-choice, allow students to select an answer that they recognise but may have been unable to recall without the suggestion. Cued recall questions refer to fill-in-the-blank and short-answer questions, which increase cognitive demand, and free recall is considered the most cognitively demanding, as no question cue or answer suggestion is provided (Adesope et al., 2017). Retrieval practice that increases cognitive demand correlates with improved assessment scores (Adesope et al., 2017; Rowland, 2014), and retrieval practice questions that are identical to assessment questions are reported to be more effective than non-identical retrieval practice questions (Veltre et al., 2015).

The comparison groups for retrieval practice can include no study or normal class, restudying (rereading or rewatching content), concept mapping, or comparisons between varying types of retrieval practice, with retrieval practice generally demonstrating superior outcomes in all comparisons (Adesope et al., 2017). Including feedback with retrieval practice has shown mixed results, with positive outcomes found in lab-based studies but a null effect in classroom-based studies (Adesope et al., 2017). Further, some studies showed a reduced effect when feedback was added to retrieval practice (Kliegl et al., 2019; Racsmány et al., 2020). Longer retrieval intervals (the interval between practice and final assessment) favour retrieval practice over restudy (Rowland, 2014). One specific use of retrieval practice is pre-questions, which involve attempting to retrieve information that has yet to be covered, and which may also enhance retention of that material (Little & Bjork, 2016; Richland et al., 2009).

Time on task is also an important variable to track in comparison trials. Increasing time on task has shown a strong correlation with improved academic grades (Chang et al., 2014). Time on task could therefore confound results if the distributed practice or retrieval practice group does not spend the same time on task as the comparison or control group. Controlling for time on task in trials reduces this risk.

The stakes of an assessment may also be relevant. Assessments are defined here as formative (no-stakes) or summative, with summative assessments ranging from low-stakes (a low weighting or grade) to high-stakes, such as exams that must be passed to complete a unit. Mixed outcomes have been found when increasing the stakes of assessments. High stakes may increase the motivation to engage with the learning strategy, thereby improving outcomes (Phelps, 2012). However, increased stakes may induce test anxiety, thereby reducing final performance (Hinze & Rapp, 2014). Learning setting is also important: interventions applied to assessments and coursework are most relevant to educators, whereas interventions applied to self-directed learning, such as homework, are also applicable to students.

How distributed practice and retrieval practice are implemented may affect the outcome. Therefore, this review also summarises key implementation variables, including the type of retrieval practice and distributed practice, the type of comparison group, the inclusion of feedback with retrieval practice, the retrieval interval, time on task, and the stakes of the assessment. This review also includes a critical appraisal of the methodological quality of the studies, and therefore the strength of their results. No current systematic review appraises the distributed practice and retrieval practice literature in a health professions education context; however, related work includes a scoping review of spaced learning in health professions education (Versteeg et al., 2020), a systematic review of instructional design in simulation-based education (Cook et al., 2013), and a review of brain-aware teaching strategies for health professions education (Ghanbari et al., 2019).

The purpose of this review is to determine the effect of distributed practice and retrieval practice on academic grades in health professions education. This review will highlight directions for future research and guide educators and students towards more effective learning strategies to assist in improving knowledge acquisition.

Methods

A systematic review method was applied according to the PRISMA guidelines to answer the review question: Are distributed practice and retrieval practice effective learning strategies for improving academic grades in health professions education?

The inclusion criteria are outlined in Table 1, and articles were only included from peer-reviewed journals. Studies with either control or comparison groups were included in this review; however, case series were excluded. Studies were excluded if the intervention, control, or comparison groups did not have equivalent outcome measures. Laboratory studies were excluded to improve the applicability of the research to health professions education. Content relevant to tertiary healthcare programs was searched via the healthcare professions included in the search criteria listed below. These were further screened for applicability, with graduate programs and studies that included non-clinical content, such as cognitive psychology studies, excluded. There were no exclusion criteria for comparison groups, so both control groups and a variety of comparison groups were included in this review. Studies were excluded if the only outcome measure was students’ subjective rating of their performance, as this is often an inaccurate judgement of learning (Dunlosky & Rawson, 2012). Academic grades were therefore a required outcome measure for inclusion, despite satisfaction, judgement of learning, and engagement also benefitting from both distributed practice and retrieval practice (Browne, 2019; Bruckel et al., 2016; Karpicke, 2009; Son & Metcalfe, 2000).

Table 1

Search strategy

Identification

The population and intervention inclusion criteria were used to create search terms, including alternate terms such as spaced practice for distributed practice. The search was applied in November 2022 to EBSCOhost (Education Source, CINAHL Complete, ERIC, MEDLINE Complete, Psychology and Behavioral Sciences Collection), Web of Science, and Scopus. Search terms: (health OR physiotherap* OR “physical therap*” OR “allied health” OR pharmacy OR medic* OR nursing OR “occupational therap*” OR “speech patholog*” OR dentist* OR psycholog*) AND (student* OR undergrad* OR postgrad* OR tertiary OR universit*) AND (“retrieval practice” OR “retrieval-based practice” OR “spaced practice” OR “distributed practice”). Search modes: EBSCOhost ‘find all my search terms’, Web of Science ‘TOPIC’, Scopus ‘article title, abstract and keywords’.
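As an illustration of how the three concept blocks combine, the short sketch below (not the authors’ actual tooling) assembles the boolean query string from the population, learner, and strategy term lists given above.

```python
# Sketch (not the review's actual search tooling) assembling the boolean
# search string from the three concept blocks described above.

population = ['health', 'physiotherap*', '"physical therap*"', '"allied health"',
              'pharmacy', 'medic*', 'nursing', '"occupational therap*"',
              '"speech patholog*"', 'dentist*', 'psycholog*']
learners = ['student*', 'undergrad*', 'postgrad*', 'tertiary', 'universit*']
strategies = ['"retrieval practice"', '"retrieval-based practice"',
              '"spaced practice"', '"distributed practice"']

def or_block(terms):
    """Join terms with OR and wrap the block in parentheses."""
    return '(' + ' OR '.join(terms) + ')'

# Concept blocks are ANDed together, matching the search string above.
query = ' AND '.join(or_block(block) for block in (population, learners, strategies))
print(query)
```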

Screening, eligibility, and inclusion

After removal of duplicate articles, the remaining articles were screened for eligibility by title, then abstract, and finally the full article against the inclusion and exclusion criteria. The results of this screening process are displayed in Fig. 1. The most common reasons for exclusion were populations outside tertiary health professions education, such as non-clinical healthcare disciplines, education of qualified healthcare professionals, or clinical patient populations.

Figure 1

Critical appraisal

The Medical Education Research Study Quality Instrument (MERSQI) and Newcastle-Ottawa Scale-Education (NOS-E) were used to critically appraise the eligible articles (Cook & Reed, 2015). One review of these appraisal methods suggests that whilst the MERSQI focuses on more objective design issues, the NOS-E is more subjective but covers more information on the implications of study design; they therefore complement each other when used together (Cook & Reed, 2015). See Table 1 of Cook and Reed (2015) for further information on the criteria definitions and scoring systems of the MERSQI and NOS-E.

Summary of articles

The summary also highlights the key variables described in the introduction. Statistical significance is only briefly reported, and for studies that compared multiple timepoints, the statistical significance summary focuses on the longest retrieval interval.

Results

The MERSQI score for each included study is provided in Table 2 and the NOS-E score in Table 3. The studies’ variables and statistical significance are summarised in Table 4.

Table 2
Table 3
Table 4

Summary of articles

Of the 56 included studies, some conducted more than one experimental intervention, giving a total of 63 experiments in this review. Of these experiments, 43 demonstrated a significant positive effect for distributed practice and/or retrieval practice compared to massed practice, rereading, normal class, or no intervention. One study demonstrated a negative effect for distributed practice. Retrieval practice alone was the most studied (n = 33), the spacing out of retrieval practice was commonly studied (n = 16), and distributed practice alone was less frequently studied (n = 14).

The most common units were introductory psychology (n = 16), physiology (n = 8), anatomy (n = 6), and anatomy & physiology (n = 4). Interventions were generally classroom based rather than homework based. Fifteen studies assessed content from one class only, with a reported retrieval interval of generally seven days (five days for two studies). Many other studies were longer, covering the content of an entire unit, with the retrieval interval based on the final exam. Nine studies included an assessment of knowledge after unit completion. Interventions and comparison groups varied widely.

Recognition or cued recall were the most common retrieval types, with only a few studies using free recall. Four studies compared types of retrieval practice and predominantly found fill-in-the-blank words more effective than fill-in-the-blank letters, short-answer questions more effective than fill-in-the-blank, and free recall more effective than recognition. Feedback was common in retrieval practice interventions; however, some studies did not report on it at all.

Of the distributed practice studies, five compared types of distributed practice: two found an expanding schedule more effective than an equal schedule, one found no difference between the two, one found an expanding schedule more effective than contracting and equal schedules, and another found a contracting schedule more effective than expanding and equal schedules. An expanding schedule was therefore superior in three of the five studies.

Time on task was frequently not reported, and those studies that did measure time on task often did not control for this variable. Assessments were most frequently summative (n = 24) compared to formative (n = 14). Studies that used a small grade (for example, 2%) as an incentive to complete the assigned work, independent of final assessment outcomes, were classified as formative assessment. Notably, many studies did not report on the stakes of assessments at all (n = 25), nor what percentage grade was assigned to the assessment. Two studies compared assigned versus optional homework and found assigned homework to be more effective (Janes et al., 2020; Trumbo et al., 2016). Five summative and two formative assessments showed no effectiveness of distributed practice and/or retrieval practice.

Study quality

For studies that completed multiple experiments, those that used the same methodology were rated only once in the MERSQI or NOS-E; if the methodology differed, the experiments were rated separately. Within-subject studies were considered randomised if the order of interventions was randomised so that time until assessment was averaged over the conditions. The between-subject studies most often selected their comparison group from the same community, and historical cohorts were generally described as having the same class structure and content. Some studies did not mention or require ethical approval, and therefore did not include informed consent preceding randomisation. As a result, a few of these studies did not report any allocation concealment and therefore scored lower on the NOS-E. Non-randomised studies occasionally recorded subject characteristics, such as gender, age, and ethnicity, but infrequently recorded baseline scores such as a pre-test, and rarely controlled for these characteristics with a statistical covariate analysis, resulting in lower comparability of groups.

Most studies sampled from only a single institution. The retention of participants, which is scored in both the MERSQI and NOS-E, was generally high. The representativeness of the intervention group was scored low in some studies, most often because the numbers were not reported. Blinding of assessment scored high in most studies on the NOS-E, as most assessments were multiple-choice questions. No experiments included outcomes of ‘Behaviours’ or ‘Patient or healthcare outcomes’.

On the MERSQI, ‘Validity evidence for evaluation instrument’ was generally scored low, with only two studies reporting evidence of internal structure, such as reliability. Many studies did not report the source of content for their assessment, whether from a textbook or an expert. The data analysis was generally sophisticated and appropriate in all but a couple of studies.

Discussion

This systematic review was undertaken to determine the effect of distributed practice and retrieval practice on academic grades in health professions education and to summarise a range of interventional variables that may affect study outcomes. This review indicates that distributed practice and/or retrieval practice were effective at improving test and examination scores in most of the studies, across several comparison and control groups, and are therefore worthwhile learning strategies to consider in health professions education. Only one study showed a negative effect for distributed practice, but several variables such as time on task and test delay may explain this result (Cecilio-Fernandes et al., 2018). Although retrieval practice was the most studied learning strategy, many retrieval practice studies did not report on what content was being reviewed. The number of spaced retrieval practice interventions may therefore be underreported, if content from numerous weeks was being reviewed.

Interventions were generally applied to specific units of learning rather than entire programs. Studies of introductory units did not always include clear healthcare program information; however, it was assumed that these introductory units are a requirement for many health professions programs.

Not reporting the time on task for intervention and comparison groups was common in this review and is a strong confounding factor limiting the strength of the results. A false positive may occur when there is more time on task, or a false negative when there is less. One study reported a higher time on task for the distributed condition compared to the massed condition but still found no significant benefit (Kerdijk et al., 2015); in this case, having students distribute their practice would be particularly ineffective and a poor use of time. In contrast, another study showed a higher time on task for the restudy group compared to the retrieval group, and even though the retrieval group spent less time studying, they had significantly superior outcomes (Schneider et al., 2019). Retrieval practice would then be considered a particularly effective, and indeed efficient, learning strategy in this context. Conclusions may be erroneously drawn about the effectiveness of the learning strategies when time on task is not monitored, and this should be a focus of future research.

There are many variables mentioned in this review that could be affected by an overarching variable of student motivation. These include whether interventions were classroom-based or homework-based, optional homework or assigned homework, and summative or formative assessments.

Classroom-based interventions may result in students having fewer competing distractions and challenges with time management than in a home environment for homework-based interventions (Xu, 2013). Although only a couple of experiments mentioned online homework specifically, it is important to be aware that a face-to-face intervention compared to an online intervention may affect learning. Many health professions programs moved elements of learning online during and after COVID (Kumar et al., 2021; Naciri et al., 2021; Schmutz et al., 2021). Results are mixed as to which may be more effective for students, but issues around student motivation, engagement, and academic integrity may be relevant (Miller & Young-Jones, 2012; Platt et al., 2014). This will be an area to watch as more studies report on the effect of online learning compared to face-to-face learning in general, as well as with distributed practice and retrieval practice.

The two studies that found assigned homework to be more effective than optional homework may also be affected by student motivation (Janes et al., 2020; Trumbo et al., 2016). Assigning compulsory homework is an external motivator that will likely rise to the top of a student’s priority list compared to optional homework. To better understand the effect of motivation in these studies, measuring time on task for each group would give further insight. One study did not measure time on task (Janes et al., 2020), and the other found that, on average, the more effective assigned homework group spent approximately 5.5 times as much time on task as the optional homework group. Therefore, although assigned homework is more effective, this is likely due to students’ motivation to complete it and the additional time spent on task.

There was no clear benefit in this review when comparing summative and formative assessments. Assessment stakes were a common area of missing information in the experiments, and the study methodologies varied significantly. Further research should report this information and the percentage of the grade, as well as other factors that may explain the variability in outcomes, such as time spent studying and test anxiety.

There were also three interventions that used the placement of exams as distributed retrieval practice. Considering that most exams are a requirement to complete a unit and progress through a degree, this would be considered a high external motivation to increase the effort to study and recall information, and therefore improve the retrieval practice effect. Two of these studies assessed increasing the number of exams, with one showing no significant benefit (Kerdijk et al., 2015) and the other showing a significant benefit (Keus et al., 2019). Time on task was not reported for either study; however, it likely increased as the number of assessments increased. The time that students spent studying for each exam was also not reported, and likely increased due to the external motivation of a summative assessment. The third study looked at a post-final-exam assessment of knowledge retention and found that including content in the final exam increased the likelihood of improved long-term knowledge retention (Glass et al., 2013). This study also highlights the pitfalls of assessments in general, as many students may cram for an assessment and show positive outcomes but forget the knowledge in the long term. Overcoming this challenge may involve future research including more long-term, post-unit formative assessments that students do not necessarily know about in advance, to gauge clearly the effectiveness of different learning strategies.

Student motivation is a complex, multifactorial topic and is not heavily addressed in this paper, nor was it considered by any of the interventional experiments included in this review. However, it may be an important variable to consider in future research, as students who participate in an intervention will of course benefit from these learning strategies more than students who choose not to participate, or who only partially participate.

Study duration was another factor that may affect replication in ‘real world’ classrooms. Many studies had only a single intervention point, which was then assessed a week later. This design is frequently reported in prior research (Karpicke & Blunt, 2011; McDaniel et al., 2009), and research demonstrates that forgetting of new content may plateau by one week (Loftus, 1985). The longer duration studies, however, may be more relevant to educators’ goals of long-term memory in health professions education. This long-term memory is needed for the scaffolding of information in future units (Belland et al., 2017) and to ultimately be applied in students’ professions post-graduation. Interventions and comparison groups varied widely, including studies that educated students on the benefits of distributed practice and/or retrieval practice as the intervention (n = 5). The goal of this type of intervention may be to improve students’ own self-directed studying. Considering educators’ limited time (Inclan & Meyer, 2020) and students reporting low use of these strategies in previous research (Persky & Hudson, 2016), this could be an interesting direction for future research. Such research should aim to measure the duration and type of students’ self-directed learning, as well as academic scores, to best understand whether this is an effective strategy.

This paper supports the theory that increasing cognitive demand improves outcomes (Adesope et al., 2017), with short-answer and free recall questions showing greater benefit than recognition questions. An expanding schedule was the most effective inter-study interval; however, study methodology such as the retrieval interval differed between studies and the results are mixed. Further research is needed to determine the most effective inter-study interval. There was insufficient information to determine if providing feedback with retrieval practice was more effective than not providing feedback, as there were no comparison groups directly assessing this, and most studies either provided feedback or did not report on it at all. Future research has many avenues to explore here, including the timing of feedback, with delayed feedback showing superior outcomes in other research (Butler et al., 2007).

Study quality

The MERSQI and NOS-E scores in this review were affected by the inclusion criteria, which avoided low scoring in the ‘Study design’, ‘Type of data’, and ‘Outcome’ sections (Cook & Reed, 2015). Between-subject studies were most commonly unable to be randomised when comparison groups were entire classes. A benefit of the within-subject design is an identical population comparison group, which therefore scored well in the NOS-E.

Most studies sampled from only a single institution, which will have limited the generalisability of their results. However, a single institution makes it simpler for researchers to control variables such as the content delivered and assessed. Considering that no two units are the same, a complete replication of these findings in future research or in educators’ classrooms cannot be expected; however, this review still provides useful direction for future application of these strategies.

Studies that did not receive a full score for participant retention generally involved large class sizes of an introductory unit, which often have higher student attrition (Trumbo et al., 2016), or were of a longer duration (Dobson, 2013). High participant retention was noted in interventions that only addressed a single class, most likely due to the short duration of the study (one week). Incentives such as small amounts of course credit or money also likely improved retention in other studies.

No experiments had outcomes that included ‘Behaviours’ or ‘Patient or healthcare outcomes’, likely due to the nature of undergraduate education studies. Undergraduate health professions education does not generally apply learning to a patient population in a clinical context, compared to graduate training and continuing professional development (Chen et al., 2004). However, even in continuing professional development environments, assessing learning outcomes via patient or healthcare outcomes is poorly researched (Chen et al., 2004; Prystowsky & Bordage, 2001).

The low ‘Validity evidence for evaluation instrument’ found in most studies will limit the ability to generalise these outcomes to other settings, such as future high-stakes examinations and professional practice (Beckman et al., 2005). Many studies did not report the source of content for their assessment, whether from a textbook or an expert. This domain of the MERSQI commonly scores poorly in previous research (Reed et al., 2007). The time and resources required to validate an assessment are likely the cause of this finding; however, future research should report on the source of content for assessment where possible. Data analysis scored highly in this review, a domain that also typically scores high in other research (Reed et al., 2007).

Recommendations

Based on this systematic review, educators and students may find distributed practice and retrieval practice effective at improving academic grades in their own classroom or self-directed study contexts. Foundational units, such as introductory psychology, anatomy, and physiology, could particularly benefit from these learning strategies. Educators could trial increasing the number of formative and summative assessments as a method of providing students with retrieval practice and distributed practice. This may improve academic grades and long-term memory of content for future units and professional practice. Expanding the distributed practice schedule may provide greater benefits compared to equal or contracting schedules. Free recall or cued recall questions are likely to improve learning more than recognition questions such as multiple choice. Educators could also educate students on the benefits and practical applications of distributed practice and retrieval practice, which may improve self-regulated learning. Students could be encouraged to trial spacing out their revision, writing their own retrieval questions, or sourcing questions from peers and external sources to improve their academic grades and long-term memory.

Limitations

Due to the heterogeneity of studies, a meta-analysis was not possible. A single reviewer scored the MERSQI and NOS-E for the included papers, which may increase the risk of errors. The MERSQI and NOS-E do not cover all aspects of study quality; missing elements include trial registration, which aims to reduce many types of bias, such as citation bias (Pannucci & Wilkins, 2010) and inflation bias or ‘p-hacking’ (Head et al., 2015). Summarising the key variables and statistical significance from each study is also simplistic and can therefore be misleading if read in isolation. This review should be used to help navigate readers to source articles for the full picture, and not as standalone evidence of a study’s significance.

Conclusion

Distributed practice and retrieval practice are often effective at improving academic grades in health professions education. Of the 63 experiments, 43 demonstrated significant benefits of distributed practice and/or retrieval practice over control and comparison groups. Study quality was generally good, with an average of 12.23 out of 18 on the MERSQI and an average of 4.55 out of 6 on the NOS-E. Key areas for improving study quality are the validity of assessment instruments and the number of institutions included within a study. Future studies should consider measuring and reporting time on task, which may illuminate the efficiency of distributed practice and retrieval practice. The stakes of the assessments, which may affect student motivation and therefore outcomes, should also be considered. Educators can note that the use of multiple exams, particularly if they are summative, will result in most students participating in the spacing out of retrieval practice. Introductory psychology, anatomy, and physiology educators and students have a variety of retrieval practice and distributed practice applications that could be trialled and may successfully transfer to their own contexts.