What does this study add to clinical practice?

This study is the first to investigate the effectiveness of test-enhanced learning in the field of gynaecology and obstetrics (OBGYN). Repeated testing was more effective than repeated case-based learning in OBGYN. Curricular implementation of longitudinal key feature testing can thus improve learning outcomes in OBGYN.

Introduction

The ultimate goal of undergraduate medical education is to equip future physicians with a sound knowledge base and the abilities necessary to manage their patients. Applying information to new clinical settings in order to make informed decisions about diagnostic procedures and treatment alternatives is one of the higher-order cognitive tasks that must be mastered.

In this context, clinical reasoning is an essential competence in medical education [1]. It describes a physician's process of making informed decisions about diagnostic and therapeutic procedures, drawing on knowledge, intuition, experience, and guidelines, starting from a patient's initial situation and symptoms and taking possible differential diagnoses into account. Two cognitive approaches underlying clinical reasoning are currently described in the literature: the intuitive and the analytical approach. In practice, a combination of both approaches is realistic and important for clinical reasoning [2].

Thus, clinical reasoning is an essential skill, the foundations of which must be acquired during basic medical training. Students’ performance in clinical reasoning can be assessed summatively by means of key feature questions [3].

In 1995, Page et al. [4] introduced the concept of a key feature and described its function as the cornerstone of key feature problems, a new format for the written assessment of medical students’ and practitioners’ clinical decision-making skills. A key feature is defined as a critical step in solving a clinical problem, and a key feature problem consists of a clinical case scenario followed by questions focusing only on these critical steps (key feature questions, KFQs). Hrynchak et al. [5] summarised the evidence on the reliability and validity of KFQs for assessing clinical reasoning and showed, based on a systematic literature review, that KFQs are an adequate format for this purpose. Taken together with the current evidence on test-enhanced learning, there is scope for using formative examinations composed of KFQs to enhance clinical reasoning competencies.

Test-enhanced learning represents a pedagogical intervention that is consistent with the current emphasis on using assessment to improve pedagogical practice in medical education [6]. It is a fitting complement to the tools that educators can use to help medical students, residents, and practicing physicians retain information and progress towards greater clinical expertise [7]. It can also be implemented in digital teaching during the current COVID-19 pandemic [8]. The available research demonstrates robust effects across health professions, learners, learning formats, and learning outcomes.

The application of test-enhanced learning in the field of gynecology and obstetrics has not yet been investigated. Thus, the following questions arise:

  • Does repeated exposure to clinical cases in OBGYN with interspersed key feature questions (KF cases) lead to better learning outcomes than repeated exposure to read-only cases (RO cases) with the same content but without questions?

  • What influence does the temporal distance of the intervention package from the final examination have on the retention of the respective contents?

The following hypotheses are formulated for the primary study question:

  • H0: There is no difference when clinical cases with built-in questions are used instead of cases with the same content but without built-in questions.

  • H1: There is a difference when clinical cases with built-in questions are used instead of cases with the same content but without built-in questions.

For the secondary question, the following hypotheses are formulated:

  • H0: The temporal distance of the intervention package from the final exam has no effect on the overall outcome and effectiveness of exam-supported or unsupported learning.

  • H1: The temporal distance of the intervention package from the final examination has an effect on the overall outcome and the effectiveness of examination-supported or unsupported learning.

Methods

This was a non-randomised, controlled, non-blinded crossover trial involving medical students in the fifth year of medical school. Undergraduate medical education at the institution where the study was conducted is divided into two phases. Basic sciences such as anatomy, biochemistry, physiology, and medical psychology are taught during the first two years. Students advance to the clinical part of the curriculum after passing a high-stakes examination. The clinical phase consists of numerous individual courses, with the subject of OBGYN being taught in the fifth year. In the first term of that year, students attend lectures on OBGYN, and in the following term they complete a 1-week clinical attachment in groups of 8–11 students each. In the context of the COVID-19 pandemic, the clinical attachment was moved to a virtual space (ILIAS open source e-Learning e. V., Cologne, Germany). The inclusion criteria for participants were as follows:

  • Enrolled medical student at the University of Bonn in the one-week clinical attachment of OBGYN

  • Consent to study participation

  • Age of the participants > 18 years

The exclusion criteria were as follows:

  • Lack of informed consent to participate in the study

  • Age of the participants < 18 years

Course design

During the virtual one-week clinical attachment, each student was given access to four clinical case vignettes: two obstetric and two gynecological cases. The content presented was aligned to the “Learning Opportunities, Objectives and Outcomes Platform” (LOOOP) of the Medical School Association (Medizinischer Fakultaetentag, MFT), a nationwide online resource for the study of human medicine and dentistry (https://looop-share.charite.de). In addition, a total of 40 short answer questions (SAQs) were written based on the four case vignettes, representing key feature elements of these cases (Table 1).

Table 1 Key feature case topics and individual key feature elements

Each clinical attachment group was exposed to two cases containing key feature questions (KF cases) and two RO cases (i.e., identical content but without KFQs). A RO case consisted of the main text and some background information. A description of the patient's symptoms and history was followed by physical examination findings and results of diagnostic tests, as well as information on how the clinical case progressed, including complications. The background information covered aspects that were the subject of KFQs in the corresponding presentation format of the same case.

A specific case that was presented as a RO case to the student group in week 1 was presented as a KF case to the student group in week 2, and vice versa. This pattern was continued in the following weeks, resulting in the crossover design of the study. As spacing is essential for test-enhanced learning, each group had to revisit its specific KF cases and RO cases two weeks after the actual virtual clinical attachment. During this process, the students' activities were tracked and recorded in the learning management system. Likewise, each clinical attachment group had to complete the 40 SAQs as a pre-test at the beginning of its one-week attachment. A total of 60 min was available to complete the test. At the end of term, all students received the same SAQ test as a formative final exam for the subject of gynecology (Fig. 1).

Fig. 1
figure 1

Cross-over study design and case vignettes in the weekly clinical block placement

All students participating in the one-week OBGYN clinical attachment in winter term 2020/2021 were required to work on the online cases. On the first day of term, they were invited to provide written consent to participate in the study, i.e. to have their data anonymized and analyzed for study purposes.

Statistical analysis

The pre-test and post-test each consisted of 40 SAQs. A given student would have been exposed to the content covered in 20 of these items in KF cases (‘intervention items’), while the content covered in the remaining 20 items would have been presented in RO cases (‘control items’). For a different student, this assignment would have been reversed. Dummy coding was used to define intervention and control items for each student, regardless of the group they had been assigned to during the clinical attachment. Descriptive analyses were carried out for the demographic data of the sample as well as the pre-test and post-test. For nominally scaled data, such as demographic variables, simple frequency calculations were carried out for the different characteristics.
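The per-student dummy coding described above can be sketched as follows. This is a minimal illustration, not the study's actual code (SPSS was used); the group labels "A" and "B" and the item ordering are invented assumptions for demonstration.

```python
# Dummy-code the 40 SAQ items per student: 1 = intervention item (content seen
# as a KF case), 0 = control item (content seen as a RO case).
# Hypothetical assignment: group "A" saw items 1-20 as KF cases, group "B"
# saw items 21-40 as KF cases (the crossover condition).
def code_items(group: str, n_items: int = 40) -> list:
    half = n_items // 2
    if group == "A":
        return [1] * half + [0] * half
    return [0] * half + [1] * half

coding_a = code_items("A")
coding_b = code_items("B")
```

With this coding, each item serves as an intervention item for one group and a control item for the other, so every item contributes to both conditions across the sample.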

A descriptive item or scale analysis was conducted for the 40 items in both the pre-test and the post-test. Cronbach's α was used as a measure of internal consistency.
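For illustration, Cronbach's α can be computed directly from a students × items score matrix using its standard formula. The sketch below is in Python rather than SPSS, and the small score matrix is invented, not taken from the study data.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) score matrix."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical score matrix: 5 students x 4 items
scores = np.array([
    [2, 3, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [2, 2, 3, 2],
    [1, 2, 1, 1],
])
alpha = cronbach_alpha(scores)
```

Values above roughly 0.7 are conventionally taken to indicate acceptable internal consistency.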

To answer the research question, the differences between percent scores on intervention and control items in the post-test were analyzed using a paired t-test.
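The paired comparison can be sketched as follows; each student contributes one intervention and one control percent score. The sketch uses `scipy.stats.ttest_rel` rather than SPSS, and the per-student scores are invented for illustration and do not reproduce the study data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-student percent scores on the post-test
intervention = np.array([74.2, 78.0, 69.5, 81.0, 72.3, 76.8, 70.1, 79.4])
control      = np.array([71.0, 73.5, 70.2, 75.8, 68.9, 72.4, 69.0, 74.1])

# Paired t-test on the within-student differences
t_stat, p_value = stats.ttest_rel(intervention, control)
```

A paired test is appropriate here because intervention and control scores come from the same student, so between-student variability cancels out in the differences.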

The second research question was addressed by repeating this analysis for three student groups created specifically for this purpose, each with a different time interval between the intervention and the post-test: Group 1 (G1) had a 1- to 4-week gap from the post-test (n = 35), Group 2 (G2) had a 5- to 8-week gap (n = 29), and Group 3 (G3) had the longest gap (9–12 weeks) (n = 30).
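Repeating the test across three subgroups calls for a multiple-testing correction; a Bonferroni correction (as applied in the Results) simply divides the alpha level by the number of comparisons. The subgroup p-values below are invented placeholders for illustration.

```python
alpha = 0.05
subgroup_p = {"G1": 0.04, "G2": 0.20, "G3": 0.35}  # hypothetical p-values

# Bonferroni: divide the alpha level by the number of subgroup comparisons
adjusted_alpha = alpha / len(subgroup_p)  # 0.05 / 3
significant = {g: p < adjusted_alpha for g, p in subgroup_p.items()}
```

Note that a raw p-value of 0.04 would pass an uncorrected 0.05 threshold but fails the Bonferroni-adjusted threshold of roughly 0.0167.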

For statistical analysis, IBM SPSS Statistics for Windows version 27.0 was used (IBM Corp., Armonk, NY, USA). Data are presented as mean and standard deviation (SD), or as numbers and percentages, unless otherwise stated. The alpha level was set at 0.05. This study was approved by the University of Bonn's local ethics commission (application no. 014/21).

Results

The flow of participants through the study is displayed in Fig. 2. Four of the 118 students who were eligible to participate in the study did not provide written consent. Complete data were acquired for 94 students, yielding an effective response rate of 79.7% for this longitudinal sample. Of the study sample, 56.4% were female. There were no statistically significant differences between the study groups in terms of gender; since the students' dates of birth were not recorded on a personalised basis, a differentiation by age cannot be made.

Fig. 2
figure 2

Study cohort and data enrollment

To assess the measurement quality of the pre-test and post-test, item characteristics were determined for each of the 40 items in both tests. These item characteristics are given in Table 2.

Table 2 Item characteristics of the SAQs in the pre- and post-test

Primary aim

To answer the primary research question (“Does repeated exposure to clinical cases in OBGYN with built-in key feature questions lead to higher learning success than repeated exposure to cases with the same content but without built-in questions?”), the “intervention” and “control” conditions were analysed independently of the content focus of the items.

In the total sample, an average of 69.3 ± 7.3% of the maximum possible 84 points was achieved in the pre-test and 72.8 ± 6.4% in the post-test. Across groups, 69.8 ± 12.0% of the maximum possible points was achieved on the intervention items in the pre-test and 68.3 ± 10.6% on the control items (p = 0.410). In the post-test, a mean of 74.2 ± 8.6% of the maximum possible points was achieved on the intervention items and 71.0 ± 9.2% on the control items. Thus, post-test scores on the intervention items were on average 3.2 ± 12.4% higher, relative to the maximum possible score, than scores on the control items. This difference in favour of the intervention items is statistically significant (p = 0.017) and thus leads to rejection of the null hypothesis for the primary research question (Fig. 3).

Fig. 3
figure 3

Average scores achieved in % in pre-test and post-test. The error bars represent the standard errors. *p = 0.017 in the paired t-test to examine the differences between intervention and control items

Secondary aim

To answer the secondary research question, three groups were formed to determine the influence of the temporal distance between the intervention and the post-test on the outcome: Group 1 (G1) had a gap of 1–4 weeks to the post-test (n = 35), Group 2 (G2) a gap of 5–8 weeks (n = 29), and Group 3 (G3) the largest gap of 9–12 weeks (n = 30). Table 3 displays the results of the analysis for the secondary study question. Following a Bonferroni correction for multiple testing, there was no significant difference in any of the subgroups. However, Table 3 shows clearly that retention decreases markedly only for the control items. On average, the intervention items therefore not only tend to produce a higher immediate learning gain, as the higher value for Group 1 shows, but also a longer persistence of this effect.

Table 3 Percent scores in intervention and control items in the post-test by time intervals between the intervention and the post-test

With regard to the secondary research question, there are thus indications that greater temporal distance negatively influences retention of the control items. The statistical results, however, are not strong enough to justify rejecting the null hypothesis for the secondary question in a methodologically and substantively sound manner.

Discussion

In the present study, the paradigm of repeated testing with KFQs was implemented for the first time with medical students to consolidate differential diagnostic and therapeutic competences in the field of gynecology and obstetrics. Despite rather moderate item characteristics, the intervention (KF cases) was shown to lead to greater learning success than the control condition (RO cases) in OBGYN medical education.

Regarding the secondary research question, there was no effect of the delay between the intervention and the final exam on the difference in performance between intervention and control items.

All of the assessments in this study were formative in nature. Because the students did not face any penalties associated with the assessments, it is reasonable to assume that the indirect testing effect was minor. On the one hand, this avoided the disruptive issue of prospective point loss (and hence a change in learning behavior); on the other hand, it may have had a detrimental influence on the students' diligence in working through the e-case seminars and the final formative assessments.

Repeated testing with key feature questions can be an attractive alternative to more resource-intensive teaching methods for specific learning objectives. Given the scalability associated with e-learning interventions, as well as the pedagogical rationale for using key feature questions to promote complex cognitive functions, our study contributes to the growing body of literature on how e-learning can be used effectively to improve student learning outcomes, especially during pandemics such as the COVID-19 pandemic [8].

Another study examined the use of script concordance examinations in the context of clerkships and the assessment of clinical reasoning in OBGYN [10]. The authors demonstrated satisfactory reliability for assessment in training and a favorable association with the assessment of clinical reasoning using key features.

In line with our null result regarding the impact of the time interval between the intervention and the post-test, a meta-analysis of self-directed learning found no significant relationship between the observed effect size and the duration of the intervention with self-directed learning or the time gap between the conclusion of the intervention and the evaluation of outcomes [11].

A study in the field of gynecological endocrinology showed that case-based learning can be beneficial in postgraduate training [12]. The residents interviewed agreed that case-based, interactive training was superior to traditional lecture-based training. The authors concluded that a non-traditional curriculum can be successfully implemented in a residency training curriculum and significantly improves understanding and confidence in critical endocrinology concepts.

Limitations

The quality of the SAQ examination items was at best moderate. The data suggest that a number of items were too easy. This could have been avoided by pilot-testing the exam and making necessary adjustments before using it in the context of a study [13, 14]. The poorly performing questions showed no subject-specific focus.

Students' initial level of knowledge was quite high, with a mean score of 69.3%. This might be explained by the fact that the SAQs focused on knowledge that had already been acquired in the preceding term. According to a literature review, while SAQs have high discriminatory power owing to the possibility of point differentiation and produce tests with high reliability in the digital domain, SAQ correction and evaluation are time-consuming because automatic evaluation of open questions is not possible. Furthermore, to improve clinical reasoning, the review postulates the use of key feature questions rather than SAQs as the item format of choice [14]. In contrast, other studies point to new methods within the SAQ format, such as Short Answer Management Problems (SAMPs). These are designed to measure a candidate's problem-solving skills and knowledge in the context of a clinical situation and thus strengthen clinical reasoning [15]. New subject-specific assessment rubrics are also being developed for SAQs to strengthen clinical reasoning, such as in the area of manual physical therapy [16].

Despite the statistically significant increase from pre- to post-test scores, it remains uncertain whether this reflects a clinically meaningful gain in clinical reasoning.

The study's monocentric nature restricts the generalizability of our findings. Because the goal of this study was to provide insight into the real-world effectiveness of test-enhanced learning, certain potential confounding factors were not experimentally controlled. Most crucially, we did not gather data on how much time was spent on self-study. Unlike laboratory studies of test-enhanced learning, we did not limit the amount of time students spent on the case vignettes, but instead allowed them to complete their sessions whenever they wanted. Finally, this study did not investigate whether repeated testing with KFQs affected students' clinical performance. Although one study suggests an association, further research is needed to establish a causal link between frequent testing and improved patient outcomes [17].

Conclusion

Our data demonstrate improved retention following repeated formative testing with KFQs in obstetrics and gynecology. The time interval between the intervention and the point of final data collection did not influence this effect.