Background

Traditional methods of teaching focus on the use of quizzes or tests as a summative assessment, i.e., the assignment of a grade or score at the end of a learning activity. However, literature suggests that the use of quizzes as a formative assessment, i.e., with the goal of monitoring progress and providing ongoing feedback, can improve motivation and content retention. Specifically, studies have demonstrated improved retention of material following repeated testing compared with a single study period, repeated study without testing, or concept mapping [1,2,3,4,5,6]. The premise is that, in contrast to repeated study, testing requires active retrieval of information, a key component of long-term retention [2,3,4]. This is often referred to as the testing effect or test-enhanced learning. Additionally, recurring quizzes can take advantage of the spacing effect, in which spaced repetitions lead to better long-term retention than non-spaced repetitions [7,8,9,10].

Aside from content retention, formative assessment through quizzes can benefit both learners and teachers in a number of ways. Learners can identify areas of weakness and become accustomed to the exam timing and format. Moreover, quizzes can improve performance on summative assessments [2, 11, 12]. Teachers can assess the efficacy of curricula and instruction methods, as well as identify learners who may be struggling or at risk for failure, allowing early, targeted intervention [1].

In medical education, spaced and test-enhanced learning have been used to improve both content retention and skill performance for medical students and residents [5, 12,13,14]. Despite evidence of their effects on knowledge retention, little data exist on their value as preparation tools for board certification examinations. With the goal of helping pediatric residents pass their initial American Board of Pediatrics (ABP) Certifying Examination (CE), in 2011 we initiated a series of online quizzes known as the Board Examination Simulation Exercise (BESE). BESE was developed by two of the authors (JDM and KS) and included multiple-choice questions written and peer reviewed by the study authors; the quizzes utilized principles of both the spacing and testing effects. Five years after implementation, we sought to evaluate the utility of these quizzes as both a resident preparation tool for the ABP CE and a predictor of ABP CE performance. Specifically, we hypothesized that 1) higher resident participation in and performance on BESE quizzes would be associated with higher performance on the final year In-Training Examination (ITE) and the ABP CE, and 2) introduction of BESE would be associated with improved ABP CE pass rates in the participating programs.

Methods

Settings and subjects

BESE was implemented in 2011 at Nationwide Children’s Hospital (NCH) and McGovern Medical School at The University of Texas Health Science Center (UTH), and in 2013 at University of Louisville (UL). Residents included in this initial analysis were Pediatrics or Internal Medicine-Pediatrics residents at NCH, UTH, and UL who completed residency between 2013 and 2016 at NCH and UTH and between 2015 and 2016 at UL. Therefore, residents at each program had the opportunity to complete at least 2 years of quizzes. Residents who did not take the final year ITE were excluded from ITE analyses, and residents who did not take the ABP CE were excluded from ABP CE analyses. This study was submitted to the NCH Institutional Review Board (NCH IRB #14–00720) and was determined to not meet the definition of human subjects research.

Intervention

BESE structure has been previously described without outcome data [15]. Twenty-three online quizzes (22 in Year 1) were offered to all residents each academic year (July–June), approximately every 2 weeks. Each quiz consisted of 20 multiple-choice questions, drawn randomly for each participant from a test item bank of 40–50 questions per topic area, to be completed in 25 consecutive minutes. Question content was derived from the ABP content specifications [https://www.abp.org/content/general-pediatrics-content-outline] and divided into 23 sections (Additional file 1: Table S1). Given the length of the ABP content specifications, we did not aim to cover all of the content but instead used the quizzes to sample knowledge in a given topic area. Questions were written by JDM, KB, KGS, and MDH, and peer reviewed for accuracy by JDM, RRD, and RW. Question format was modeled after widely accepted guidelines (http://www.nbme.org/IWW/). Quizzes were taken online via each institution’s learning management system (LMS). Feedback was provided immediately at the conclusion of each quiz: the quiz score was reported and the correct answer to each question was revealed, along with 2–5 paragraphs of teaching points explaining the right and wrong answer choices. To minimize recall, questions and answers were available to the participant for review for only 1 h after the quiz and could not be copied.
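As a minimal illustration of this sampling scheme (a sketch only, with a hypothetical 45-item bank and a function name of our own; this is not the production LMS code), each participant's quiz can be generated by drawing 20 items at random without repetition:

```python
import random

def draw_quiz(question_bank, n_questions=20, seed=None):
    """Draw a random, non-repeating set of questions for one participant."""
    rng = random.Random(seed)
    return rng.sample(question_bank, n_questions)

# Hypothetical 45-item bank for a single topic area (BESE banks held 40-50 items per topic)
bank = [f"Q{i:02d}" for i in range(1, 46)]
quiz = draw_quiz(bank)  # 20 questions; a different draw for each participant
```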

Data collection and definitions

Individual resident BESE participation and scores were collected in each institution’s LMS. ITE and ABP CE performance were routinely made available to each Program Director by the ABP. Resident data collected included the number and topics of quizzes completed each year, raw score on each quiz, ITE scores, and first-attempt ABP CE score. Data were deidentified by program coordinators at each site prior to aggregation and analysis by the authors.

BESE years were categorized according to academic year. Residents were categorized into one of the following groups: Post Graduate Year (PGY) 1, PGY-2, or PGY-3/4. Participation was defined as the number of quizzes completed per year. Quiz score was the number of correct responses, with a maximum score of 20. The annual BESE score was the mean of a resident’s quiz scores during a given academic year. The composite BESE score was the mean of a resident’s scores during all years of training. The PGY-1 BESE score was the mean of a resident’s scores during their first year of training. Quizzes not taken were not included in score calculations.
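As a minimal sketch of these definitions (hypothetical scores; the function names are ours, not part of BESE), the annual, composite, and PGY-1 scores are simple means over completed quizzes only:

```python
from statistics import mean

def annual_bese_score(quiz_scores):
    """Mean raw score (0-20) of the quizzes a resident completed in one academic year.
    Quizzes not taken are simply absent from the list rather than scored as zero."""
    return mean(quiz_scores) if quiz_scores else None

def composite_bese_score(scores_by_year):
    """Mean of a resident's quiz scores across all years of training."""
    all_scores = [s for year in scores_by_year.values() for s in year]
    return mean(all_scores) if all_scores else None

# Hypothetical resident, raw quiz scores keyed by academic year
scores = {"PGY-1": [14, 16, 12, 15], "PGY-2": [17, 13], "PGY-3": [18]}
pgy1_score = annual_bese_score(scores["PGY-1"])  # 14.25
composite = composite_bese_score(scores)         # 15.0 (mean of all seven scores)
```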

Outcomes

The primary outcome was failure on first attempt of the ABP CE. Secondary outcomes were final ITE score and ABP CE score on first attempt. The final ITE score was the percentage of correct answers on the ITE in the final year of training. The ABP CE score was the scaled score reported by the ABP.

Data analysis

Normally distributed continuous variables were compared using the t-test or one-way analysis of variance (ANOVA), and results were expressed as means and standard deviations (SD). Non-normally distributed continuous variables were compared using the Mann–Whitney or Kruskal–Wallis tests, and results were expressed as medians and interquartile ranges (IQR, 25th–75th percentile). Correlations were performed using Spearman’s test. Correlation coefficients of 0.3–0.5 were considered weak, 0.5–0.7 moderate, and > 0.7 strong. Contingency tables were analyzed using the Chi-square or Fisher’s exact test, as appropriate. Receiver operating characteristic (ROC) curves were created to determine the diagnostic ability of BESE as a tool to predict failure of the ABP CE. Since the prevalence of failure may change over time, positive and negative predictive values were computed for hypothetical prevalence values. Statistical analyses were performed using GraphPad Prism version 7 (La Jolla, CA).
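Because predictive values depend on the prevalence of failure, they can be recomputed from a test's sensitivity and specificity for any assumed prevalence using the standard Bayes-rule relationships. A minimal sketch (the sensitivity and specificity below are illustrative placeholders, not the study's values):

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values at a given prevalence (all inputs are proportions)."""
    tp = sensitivity * prevalence              # true-positive fraction of the population
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)      # (PPV, NPV)

# Illustrative sensitivity/specificity only, evaluated at several hypothetical prevalences
for prev in (0.05, 0.10, 0.20):
    ppv, npv = predictive_values(sensitivity=0.6, specificity=0.8, prevalence=prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```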

Results

Data were available for 329 residents, comprising all eligible residents across the three programs: 173 (53%) from NCH, 106 (32%) from UTH, and 50 (15%) from UL. Final year ITE scores were unavailable for 5 (1.5%) residents, and ABP CE scores were unavailable for 37 (11%) residents due to individual decisions not to take the ABP CE. As shown in Table 1, the median number of quizzes completed per year was 15 [IQR 7–20], with PGY-1 and PGY-2 residents completing more quizzes than PGY-3/4 residents. Further data on participation and scores are shown in Table 1.

Table 1 Number of residents, quizzes completed, and annual scores during the study period

Overall, 26 (9%) residents failed the ABP CE on the first attempt. There were no differences in ABP CE failure rates or scores by number of quizzes completed, either in the final year or throughout training (data not shown). ABP pass rates significantly increased from 2009 to 2016 at UTH (p = 0.009), UL (p = 0.003), and all programs combined (p = 0.0001), χ2 for trend. Further data on ABP CE pass rates are shown in Fig. 1.

Fig. 1

Percentage and number of residents passing the ABP CE on the first attempt. The graph above displays the percentage of residents passing the ABP CE on the first attempt for Pediatric and IM-Pediatric residents completing residency 2009–2016 from NCH (diamonds and dotted line), UTH (squares and long-dashed line), UL (triangles and short-dashed line), and all BESE programs (black circles and solid line). National first-attempt pass rates are represented by the asterisks and dashed-dotted line. The table beneath displays the number and percentage of residents passing the ABP CE on the first attempt. BESE quizzes were implemented in 2011 at NCH and UTH and in 2013 at UL. There were significant increases in the pass rate from 2009 to 2016 at UTH (p = 0.009), UL (p = 0.003), and all programs combined (p = 0.0001), χ2 for trend. NCH = Nationwide Children’s Hospital; UTH = McGovern Medical School at The University of Texas Health Science Center; UL = University of Louisville; ABP CE = American Board of Pediatrics Certifying Examination

As shown in Fig. 2, there was an association between the composite BESE score and both the final ITE score (n = 318, r = 0.53, p < 0.0001) and the ABP CE score (n = 287, r = 0.39, p < 0.0001). The PGY-1 BESE score was also associated with the final ITE score (n = 190, r = 0.47, p < 0.0001) and the ABP CE score (n = 181, r = 0.44, p < 0.0001). Rates of ABP CE failure increased with composite score quartile rank, with 3, 5, 7, and 19% of residents failing the ABP CE in quartiles 1 (top) through 4 (bottom), respectively (p = 0.0005, χ2 for trend). Residents with composite scores in the bottom quartile were more likely to fail the ABP CE than residents in the top three quartiles combined [OR 4.6 (95% CI: 1.9–10.3), p = 0.006].

Fig. 2

Correlation of Composite BESE Scores with ITE and ABP CE Scores. The composite BESE score showed weak to moderate correlation with both final year ITE (a – circles, n = 318) and 1st attempt ABP CE scores (b – inverted triangles, n = 287); Spearman correlation. ITE = In-Training Examination; ABP CE = American Board of Pediatrics Certifying Examination

The PGY-1 and composite BESE scores performed similarly in predicting failure of the ABP CE. The area under the curve (AUC) for the PGY-1 score was 0.77 (p = 0.0011), indicating fair accuracy in predicting ABP CE failure. The AUC for the composite score was 0.69 (p = 0.013), indicating poor accuracy. A cutoff score of 11 was chosen because it is nearest to the 25th percentile of all composite scores (11.02). Table 2 shows the sensitivity and specificity of PGY-1 and composite scores ≤11 for failure of the ABP CE, as well as positive and negative predictive values at hypothetical prevalence values.
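For illustration of how such a cutoff is applied (hypothetical scores and outcomes, not study data; the helper function is ours), the sketch below flags residents with scores at or below the threshold and tallies sensitivity and specificity against first-attempt ABP CE failure:

```python
def cutoff_performance(scores, failed, cutoff=11.0):
    """Sensitivity and specificity of 'BESE score <= cutoff' for predicting ABP CE failure.
    Assumes the data contain at least one resident who failed and one who passed."""
    flagged = [s <= cutoff for s in scores]
    tp = sum(f and y for f, y in zip(flagged, failed))       # flagged and failed
    fn = sum(not f and y for f, y in zip(flagged, failed))   # missed failures
    fp = sum(f and not y for f, y in zip(flagged, failed))   # flagged but passed
    tn = sum(not f and not y for f, y in zip(flagged, failed))
    return tp / (tp + fn), tn / (tn + fp)                    # (sensitivity, specificity)

# Hypothetical composite scores and first-attempt failure outcomes
scores = [9.5, 10.8, 14.3, 12.0, 16.1, 13.2]
failed = [True, False, False, True, False, False]
print(cutoff_performance(scores, failed))  # (0.5, 0.75) on this toy data
```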

Table 2 Positive and Negative Predictive Values for BESE Scores at Hypothetical Prevalences of ABP CE Failure

Discussion

Our programs incorporated online quizzes into the educational curriculum of our pediatric and internal medicine-pediatrics residents with the goal of improving performance on the ABP CE. Participation was not associated with performance on the ABP CE, but composite BESE scores were associated with both final year ITE and ABP CE scores. Furthermore, composite scores in the bottom quartile were associated with a higher risk of failure on the ABP CE. Most importantly, overall board pass rates improved for the three programs combined, and specifically for UTH and UL, during the period of BESE implementation. While national pass rates for first-time test-takers also increased during this time, all three programs were above national rates in 2016.

Studies show that both medical knowledge and clinical skills decay with time. Medical students retain approximately 40% of basic science knowledge while still in medical school and approximately 25% after 50 years of practice [16]. In a study of cardiopulmonary resuscitation (CPR) skills, only 2.4% of those trained 3 years earlier could successfully perform CPR [17]. However, spaced and test-enhanced learning can diminish knowledge loss and enhance medical education at all levels. Among medical students, the use of spaced education on urology topics was associated with higher scores on an end-of-year test [18, 19]. In another study, students who were tested at the end of resuscitation skills training performed better on a skills assessment than those who were not tested [14], and the difference was preserved 6 months following the intervention [20]. Residents and faculty have shown both short- [5, 21,22,23] and long-term [24] improvements in medical knowledge following spaced or test-enhanced education. Beyond knowledge retention, these principles can change practice. Kerfoot et al. used spaced learning to decrease the percentage of inappropriate prostate-specific antigen screening [25]. Matzie and colleagues demonstrated improvements in the frequency and quality of feedback to medical students when surgical residents received spaced emails containing teaching points on feedback [26].

While multiple studies have demonstrated the benefits of spaced, test-enhanced learning in medical education, little data exist on its association with outcomes on board certification examinations. Peterson et al. examined the effects of self-assessment modules on certification exam results among family medicine residents. Module completion during residency was associated with both increased odds of passing the certification exam and a higher score [27]. However, the authors evaluated only completion of the modules and did not correlate module scores with exam outcomes. In the present study, we identified a cutoff for both PGY-1 and composite scores associated with increased odds of failing the ABP CE. First-year risk stratification is particularly important because it allows earlier identification of, and intervention for, residents at risk. ITEs may serve a similar purpose but are offered only annually, and specific question data are not reported. The frequency of BESE provides learners with continual assessment and immediate, more specific feedback.

There are many strengths to the curriculum structure described. First, the quizzes mimic the ABP CE format and time per question. This is important, as formative assessments are most effective when their format matches the summative assessment [1]. Second, rapid scoring provides immediate, explanatory performance feedback to residents [2]. Third, organization of quizzes by topic and their annual repetition allow residents to track longitudinal performance in specific subjects. Both residents and program directors can use these results to identify areas of weakness for targeted intervention, and normative data allow comparison with peer performance. Finally, the use of the spacing and testing effects promotes long-term retention.

There are also some limitations to this study. First, ITE and ABP CE scores in these learners are influenced by multiple factors, including different program educational activities and individual board preparation routines. Second, BESE participation and scores may reflect general attitudes toward board preparation: residents who took BESE seriously may have prepared for the certification exam more intensely than those who did not. Nevertheless, identifying residents who display lower performance early in training can still be beneficial to program directors. Third, we recognize that standardized test scores may better reflect test-taking abilities and not capture true medical knowledge, and these results may not indicate differences in clinical skills or effectiveness. Additionally, since BESE was designed to sample resident knowledge in specific topic areas, we cannot make a statement on residents’ comprehensive knowledge in any area. Fourth, only two of the three programs saw statistically significant increases in ABP CE pass rates during the study period. This may have been due to a ceiling effect, as pass rates at NCH were already above national rates prior to BESE implementation and were the highest among the three programs. Finally, test validity has not yet been established, and the small number of items may affect the reliability and validity of the quizzes. Though the questions were peer reviewed for accuracy prior to use, item quality may have been variable. Future plans include performing test item analysis and validation of these questions and extending this process to additional residency programs to better define the generalizable impact of BESE.

Conclusion

Using five years of data from three residency programs, we demonstrated an association between performance on biweekly quizzes and pediatric board certification exam results and, more importantly, an improvement in board pass rates over time. Though additional data are needed, BESE is a promising approach for resident learning and board preparation. Residency programs should consider incorporating spaced, test-enhanced learning sessions into their curricula, simulating high-stakes examination conditions for their discipline.