1 Introduction

Medical emergencies require executive functions and the ability to prioritize tasks. Although emergency medicine is integrated into most curricula of medical degree programs, graduates are often poorly prepared to deal with acute injury or illness. Recent research has identified deficits in the early management of medical emergencies, including clinical reasoning/diagnosis, prescribing the appropriate medication, multidisciplinary teamwork, and handover of patients [1,2,3]. A lack of preparedness has also been self-reported by students and graduates and, above all, observed by clinical educators [2, 4]. Indeed, junior doctors frequently describe feeling overstretched when dealing with acutely unwell or critically ill patients at high risk [5]. Others report that responsibility for severely ill patients during night shifts is one of the major sources of excessive stress and anxiety during the first year of residency [6].

Educational interventions need to address this lack of competency in emergency medicine. To date, different simulation training settings, ranging from simulated patients to full-scale and technology-enhanced analogous trauma rooms, have been implemented [7, 8]. Although the outcomes of these physical training environments have been generally positive, their widespread use has been limited by the cost in terms of time and money, as well as by the availability of rooms and trained personnel. The development of computer-based simulation has added a new dimension to education and has been shown to support the learning of procedures, e.g., cardiopulmonary resuscitation [9, 10], and even more complex tasks, such as the management of multiple patients in an emergency room [11, 12]. Unlike the above-mentioned physical simulation facilities, computer-based simulation can be provided without elaborate infrastructure and made available to individuals. Furthermore, users can determine their own learning pace. Moreover, modern software and hardware systems are capable of calculating multiple physiological parameters in real time, which allows for a dynamic simulation of diseases covering a wide range of underlying conditions. Accordingly, clinical and laboratory parameters as well as the effects of actions taken by the user (e.g., drug administration, diagnostic and therapeutic interventions) may be programmed and visualized, and direct feedback may be provided by future systems.

Most recently, virtual reality (VR) technologies have been developed to deliver an immersive, interactive experience that can be conducted privately and inexpensively. Head-mounted displays (HMD) and motion sensors provide the basis for the creation of highly immersive three-dimensional environments. The concept of immersion describes the ability of this technology to create a complete and vivid illusion of reality. While “immersion” refers to representational fidelity from a technical viewpoint, the term “presence” in the context of VR relates to the subjective feeling of being present in the virtual environment [13]. VR-enhanced learning environments with high representational fidelity, and thus a high degree of perceived presence, can increase learner motivation, enhance spatial knowledge representation, and improve contextualization of learning [14,15,16]. Of note, early VR technology was known to elicit unwanted symptoms (e.g., nausea or disorientation), summed up under the terms “cyber/simulation sickness”. These undesirable effects are most likely related to discrepancies between the optical information that the user receives via the HMD and other sensory input such as spatial perception [17]. In the meantime, technical improvements, such as higher HMD resolution and lower latency, have been implemented to minimize such symptoms.

Without additional devices, VR technology still lacks convincing haptic representation, for example of performing surgical procedures and handling tissue. Nevertheless, there is substantial evidence supporting VR applications in the training of different surgical and interventional procedures or hygiene measures [18,19,20]. Although internal and emergency medicine rely less on haptic fidelity, a conclusive concept for these fields has not yet been reported [21].

Here, we implemented and evaluated a novel training concept for medical emergencies that uses up-to-date VR technology alongside a simulated dynamic physiology system. The program “Simulation-based Training of Emergencies for Physicians using Virtual Reality (STEP-VR)” was tested in a pilot phase, during which we evaluated its technical feasibility and the students’ tolerance of simulation sickness. Afterward, STEP-VR was implemented as a compulsory training session into the curriculum of a medical degree program.

In order to take advantage of the educational potential offered by VR, we had to develop an understanding of how to maximize the desirable impacts on student learning. We thus aimed to address the following questions:

  1. Whether the implementation of STEP-VR into the medical curriculum for a semester cohort is feasible and whether students accept the didactic concept;

  2. How students perceive VR-based training with respect to simulation sickness and the degree of presence;

  3. What level of psychological distress users experience during exposure to the VR content and which factors are associated;

  4. How students evaluate the training benefits and whether the active and passive roles adopted during training differ in the estimated learning success.

2 Materials and methods

2.1 VR-based simulation training

STEP-VR was aimed at advanced medical students who find themselves in the role of physicians being confronted with a variety of critically ill patients in an emergency department (Fig. 1A). The training software (version 0.11b) was provided by ThreeDee GmbH® (Munich, Germany), a startup specializing in 3D visualization. STEP-VR is designed to train cognitive abilities [22] including executive functions, situational awareness, time management, and prioritization of tasks. Relevant diagnostic tools include medical history, laboratory tests, and various imaging modalities, and are based on real patient data (Fig. 1B, C). Specific procedures (e.g., placement of an IV cannula) can only be performed in an abstract and non-haptic manner. After determining the most likely diagnosis, the user must administer therapeutic measures, such as infusions, blood transfusions, numerous pharmaceuticals, and (non)-invasive ventilation, and arrange endoscopic or cardiologic interventions. The applied measures exert real-time effects on the simulated patients as a pathophysiological continuum. Based on the literature, realistic courses of the underlying conditions are modeled from more than 100 clinical parameters. Accordingly, dose-dependent effects of drugs, transfusions, or interventions provide an immediate response and/or feedback on monitoring parameters (Fig. 1D). Ultimately, a guidelines-based checklist summarizes whether the correct measures have been taken.

Fig. 1

In-simulation views and parameter monitoring of STEP-VR. The emergency department of a university hospital was transferred to the VR simulation: General interior of the emergency department (A). Patient file containing paramedic protocol and computer menu illustrating chest X-ray (B). Treatment room with patient (C). Parameter monitoring for the scenario sepsis following the administration of i.v. crystalloid fluids, broad-spectrum antibiotics and norepinephrine via syringe pump (D). The Y-axis shows the respective units (arbitrary units for infectious load in the bloodstream (INF), mmHg for mean arterial pressure (MAP), % for hematocrit (HCT) and mmHg*min/l for total peripheral resistance (TPR)); the X-axis shows time in intervals of five minutes

The VR hardware setup for this study employed a Schenker XMG Core 15 Laptop (chipset: Intel Core i7-9750H, 6 × 2.6 GHz; graphics adapter: Nvidia GeForce GTX 1650, 4 GB GDDR6 VRAM) and an Oculus Rift S VR HMD. The equipment enabled STEP-VR to run with a constant framerate > 60 frames per second on “high quality” display settings of the HMD.

2.2 Study design and data collection

The study was conducted at a medical school in Germany offering a six-year curriculum. It took place in the fifth year of studies within the framework of short clinical rotations in internal medicine lasting two weeks, including four compulsory afternoon sessions to consolidate clinical reasoning. One of these was STEP-VR, held as a small-group training session of eight to ten students followed by debriefing/discussion:

  1. VR-based training (120 min) performed in three to four randomly chosen scenarios and guided by a student assistant.

  2. Debriefing (60 min) of the emergency cases in the form of a review together with a tutor to summarize and further discuss the issues of problem solving.

In total, five different VR scenarios from the subject of internal medicine were provided, each addressing a specific leading symptom: (1) fever and abdominal pain due to sepsis, (2) hematemesis due to acute esophageal variceal bleeding, (3) abdominal pain due to acute biliary pancreatitis, (4) chronic cough due to pulmonary tuberculosis, and (5) chest pain due to acute myocardial infarction. A more detailed description of the scenarios and their respective learning objectives is presented in Supplement Table 1. To avoid sequence effects, student groups completed the scenarios in a random order. Each scenario was handled by a different student who volunteered to enter the VR scenario. Owing to the restricted time schedule, a maximum of three to four students per group were exposed to the VR environment. They were named the active participants (AP). The other students taking part in the training watched from the first-person perspective on a monitor and were able to assist verbally. They were named the observers (OBS). In addition, one student was assigned to take notes on the key features of the scenario in order to present them to the tutor in the second part of the session. Students were asked to fill out questionnaires that were specific to the role of AP (APQ, immediately after the VR exposure) and of OBS (OBSQ, at the end of the session). However, because we suspected bias among those OBS who had previous immersive VR experience as AP, their response data were excluded, i.e., cleaned, from the comparative analysis of the measures “presence” and “estimated learning success”, as well as the two motivation items (we term the remaining participants “cleaned” OBS).

Figure 2 provides an overview of the study design and data collection. The pilot phase ran from April until September 2020 with volunteers who attended the VR-based training (approximately 30 min) with the aim of assessing the technical feasibility. Students were asked to fill out pre- and post-intervention questionnaires to rate simulation sickness. The main study was conducted in two subsequent semester cohorts from October 2020 until July 2021. Participants of the study provided written informed consent after being informed in detail about the study conditions.

Fig. 2

Schematic design depicting the data collection by questionnaires that were administered to participants. A pilot study on simulation sickness was conducted. For the main study, students fulfilled two roles: all as observers (OBS, n = 227) and some as active participants (AP, n = 97)

2.3 Questionnaires and measures

Questionnaires were developed; the measures they addressed are listed in Table 1. In the main study, students were asked to enter their matriculation number in order to match the corresponding data between the questionnaires given to AP and OBS.

Table 1 Overview of the measures addressed in the questionnaires

2.4 Statistical analysis

Analyses were performed with descriptive statistics, including mean (M), minimum (Min), maximum (Max), standard deviation (SD), and skewness (Skew) of items. Skewness between − 2 and 2 indicated that data were distributed roughly symmetrically. We evaluated the psychometric properties by conducting a maximum likelihood exploratory factor analysis (EFA). The Kaiser criterion was considered for all factors [23]. According to this rule, only the factors that have eigenvalues greater than one are retained for interpretation. Bartlett’s test of sphericity and the Kaiser–Meyer–Olkin coefficient (KMO coefficient) were used to evaluate whether we could subject data to EFA. The factor solution had to meet the following criteria [24]: a significant Bartlett’s test of sphericity (p < 0.05) and a KMO coefficient > 0.50. Internal consistency was assessed by employing Cronbach’s alpha (α). A value exceeding 0.7 was considered acceptable, and a value greater than 0.8 good.
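To illustrate the internal-consistency criterion described above, Cronbach’s alpha for k items can be written as α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). The following Python sketch is not the analysis code used in this study; the function name `cronbach_alpha` and the example ratings are hypothetical and serve only to show the computation.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of ratings per item).

    All columns must have the same length (one rating per respondent).
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("all items need the same number of respondents")
    # Sum score of each respondent across all items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    # Sum of the sample variances of the individual items.
    item_var_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical five-point Likert ratings (NOT study data): 3 items, 6 respondents.
example_items = [
    [4, 3, 5, 2, 4, 3],
    [4, 2, 5, 3, 4, 3],
    [5, 3, 4, 2, 5, 3],
]
alpha = cronbach_alpha(example_items)
print(f"alpha = {alpha:.2f}")
```

Under the thresholds stated above, a result above 0.7 would count as acceptable and above 0.8 as good.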

The Welch test was used to compare means between groups. Nominally scaled variables were compared with the Chi-square test. The Wilcoxon test was used with the level of significance set to p < 0.05 to compare simulation sickness in the pre- and post-intervention questionnaires as two matched samples. Correlations were investigated using Pearson’s coefficient [25]. Additionally, curvilinear relationships were analyzed [26]. Furthermore, we applied Little’s Missing Completely at Random (MCAR) test [27] to check whether values were systematically missing.

2.5 Ethics approval

The local institutional review and ethics board judged the project as not representing medical or epidemiological research on human subjects and as such adopted a simplified assessment protocol. The project was approved without any reservation under the proposal number 20201012-01. Survey data from the questionnaires were retrieved anonymously using the EvaSys® platform (Lüneburg, Germany). A student’s decision to participate or not, as well as the results of the questionnaire, had no consequences on the students’ academic progress. Data were processed and stored in accordance with the local data protection laws.

3 Results

3.1 Characteristics of the participants and evaluation of the training session

A total of 38 students participated in the pilot study, of whom 36 answered the questionnaire assessing simulation sickness. Two hundred fifty-two students were enrolled in the VR-based training sessions, of whom 227 took part in the main study and adopted the role of OBS at some time during the training sessions. The OBSQ was completed by 183 participants, of whom 130 were so-called “cleaned” OBS, meaning that they had not assumed the role of an AP before. The APQ was completed by all 97 AP.

Characteristics of all participants who filled out the OBSQ are listed in Table 2. More than 60% of the participants were aged between 22 and 25 years. Sixty-three students were male (34.5%) and 119 were female (65.0%). The result of Little’s MCAR test for the key variables provided a χ2 distance of 1689.6 (df = 1706; p = 0.607), indicating that there was no systematic accumulation of missing data.

Table 2 Characteristics of the participants in the main study (n = 183)
Table 3 Descriptive data of simulation sickness sub-scores among participants of the pilot study (n = 36). Items were rated on a four-point Likert scale with a maximum value of 4

The training session was evaluated as part of the OBSQ (Supplement Table 2). Students rated the quality of the training session highly and as well suited to conveying the learning content. They assessed the didactic concept of the training session as a good mixture of practical exercise and discussion. They emphasized the good working atmosphere. Moreover, the results highlighted the importance of the helpful comments by the student assistant and additional value of the tutor to clarify uncertainties. The overall difficulty of the training session was perceived as appropriate, as was the density of learning input.

3.2 Simulation sickness

The pilot study served to assess the feasibility of the training session and to rule out disturbing side effects such as simulation sickness. Mean values for simulation sickness were significantly higher after the exposure in the subscales of nausea and disorientation, but remained in a fairly low to acceptable range (Table 3).

3.3 “Presence”, “estimated learning success”, and motivation rated by the group of AP and OBS

As the participation of students in the VR pilot study was voluntary, we first compared both groups of AP and OBS with respect to demographic characteristics. Using the Welch test, no differences were found with respect to age, career choice, or experience with VR. Interestingly, both the chi-square test for the nominally scaled gender variable (p < 0.05) and the Welch test for the question on the frequency of playing games from the first-person perspective (FPG) (p < 0.05) demonstrated that AP were more likely to be male and to have previous gaming experience. Accordingly, a highly significant correlation was found between male gender and the frequency of playing FPG (r = − 0.52).

Students’ ratings of the measures “presence” and “estimated learning success” and of the motivation items are summarized in Table 4. As expected, AP rated the degree of presence significantly higher than “cleaned” OBS (M = 3.81, SD = 0.98, n = 97 and M = 3.26, SD = 0.74, n = 130, respectively, p < 0.001). However, the rating of “cleaned” OBS was still above average. Looking in more detail at the item level (Supplement Table 2), there was large agreement that the simulation created a realistic environment, while interaction with the virtual patient was only rated above average, as was the feeling of being in a real emergency. The “estimated learning success” was rated higher in the group of AP than in “cleaned” OBS (M = 3.87, SD = 0.75, n = 97 and M = 3.72, SD = 0.77, n = 130, respectively, p = 0.05), although with only marginal significance. The usefulness of the learning tool was generally rated highly, and students would strongly embrace more opportunities to train in VR simulations. However, the effect on the improvement in confidence or in the ability to prioritize tasks was only rated just above average. Furthermore, students were asked to rate their motivation before (pre-training) and after the session (post-training). Baseline motivation in both groups was similar. However, a significantly higher motivation post training was reported by AP in comparison to “cleaned” OBS (M = 3.97, SD = 0.88, n = 53 and M = 3.61, SD = 0.84, n = 130, respectively, p < 0.01).
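As a worked illustration of the group comparison reported above, the Welch statistic and its Welch–Satterthwaite degrees of freedom can be recomputed from the published summary statistics alone. The sketch below is an assumption-laden reconstruction, not the original analysis: the function name `welch_from_summary` is hypothetical, and the inputs are the rounded M, SD, and n values for “presence” given in the text, so the result can only approximate the value obtained from the raw data.

```python
from math import sqrt

def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Return (t, df) of Welch's unequal-variance t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t = (m1 - m2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# "Presence": AP (M = 3.81, SD = 0.98, n = 97) vs. "cleaned" OBS (M = 3.26, SD = 0.74, n = 130)
t_stat, df = welch_from_summary(3.81, 0.98, 97, 3.26, 0.74, 130)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

With these rounded inputs the statistic comes out at roughly t ≈ 4.6 on about 172 degrees of freedom, which is consistent with the reported p < 0.001. Converting t into a p-value requires the t-distribution (available in common statistics packages) and is omitted here.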

Table 4 Comparison between the group of “cleaned” OBS and AP in the measures “presence” and “estimated learning success”

3.4 Impact of “perceived stress” on “estimated learning success”

The group of AP (n = 97) rated the “perceived stress” induced by the VR training. We subjected the respective items to an EFA with maximum likelihood estimation and promax rotation. The KMO-coefficient measure of sampling adequacy was very good (0.83); Bartlett’s test of sphericity was significant (χ2 = 465.91, df = 36, p < 0.001). A three-factor solution provided the best fit and good internal consistency in accordance with the criteria outlined in the statistical analysis section. We defined the factors (1) “sense of control” (the self-assessed competence to manage the patient cases, with extremes, such as under- and over-estimation being deleterious, number of items = 6, α = 0.755), (2) “challenge to perform” (the challenge to make correct decisions based on specific factual and procedural knowledge, number of items = 5, α = 0.698) and (3) “social interaction to learn” (the perception of being observed and commented on by others, number of items = 4, α = 0.669) (Table 5). Altogether, the three factors explained 52% of the variance. As an indication of discriminant validity, correlation between the three scales was only moderate (r = 0.322–0.406, p < 0.05).

Table 5 Exploratory factor analysis for items of the “perceived stress” measure for the AP (n = 97)

Discriminant validity was analyzed by inspecting group differences. We found no significant differences for gender in any of the three factors (data not shown). A direct comparison of age groups revealed that the younger participants up to 22 years of age experienced “social interaction to learn” to a greater extent than participants older than 29 years (M = 1.75, SD = 0.0, n = 6 and M = 1.28, SD = 0.28, n = 23, respectively, p < 0.01). Marginally significant differences in two factors were found depending on how frequently participants indicated that they play FPG. “Sense of control” was rated higher by those who responded that they never play FPG than by those playing occasionally (M = 3.21, SD = 0.63, n = 117 and M = 2.92, SD = 0.67, n = 63, respectively, p < 0.1). Items in the factor “challenge to perform” were also rated more highly by those who never played games than by those playing occasionally (M = 2.14, SD = 0.62, n = 117 and M = 2.02, SD = 0.59, n = 63, respectively, p < 0.1). Additionally, the three factors of the “perceived stress” measure were found not to correlate with the five items from the “presence” measure.

The “estimated learning success” measure rated by AP (n = 97) was also subjected to EFA. The 2-factor solution was the best-fitting model that also had conceptual coherence as (1) “didactic value” (the general potential attributed to VR simulation by participants, number of items = 4, α = 0.83) and (2) “individual learning benefit” (the positive effects on theoretical/practical skills for the respective individual, number of items = 4, α = 0.82) (Table 6). The internal consistency of the two factors was considered good. They explained 71% of the variance (λ1 = 57%; λ2 = 14%). In this case, there was a high degree of correlation between the two factors (r = 0.71, p < 0.01). According to the quality criteria of factor solutions, one item (number 4) of the “estimated learning success” measure had to be eliminated. The following statistical analysis was performed with this shorter version as the final inventory.

Table 6 Exploratory factor analysis for items of the “estimated learning success” measure, which was assessed by AP (n = 97)

Finally, we investigated the relationship between the measures of “perceived stress” and “estimated learning success”. The factor “sense of control” was associated quadratically with “didactic value” and “individual learning benefit” (Fig. 3). In more detail, both scales increased as the “sense of control” improved from low up to medium levels. However, when reaching higher levels of “sense of control”, the relationship reversed and the scale value declined, as described by the Yerkes–Dodson law [28]. Optimal levels of “didactic value” and “individual learning benefit” were experienced when “sense of control” was in the medium range, which we saw as acceptable. The factor “challenge to perform” correlated negatively with “didactic value” (r = − 0.26) and “individual learning benefit” (r = − 0.24, p < 0.05). There was some correlation between “social interaction to learn” and “didactic value” as well as “individual learning benefit”, although it was not statistically significant.
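The comparison of a linear and a quadratic relationship described above can be sketched as two least-squares polynomial fits, with the vertex of the fitted parabola marking the level of “sense of control” at which the outcome peaks. The text does not specify the fitting procedure, so ordinary least squares is an assumption here; the helper names `fit_polynomial` and `r_squared` and all data values are hypothetical and are not study data.

```python
def fit_polynomial(xs, ys, degree):
    """Ordinary least-squares polynomial fit; returns c with y = c[0] + c[1]*x + ..."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        rest = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - rest) / a[i][i]
    return coeffs

def r_squared(xs, ys, coeffs):
    """Share of the variance in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    pred = [sum(c * x ** power for power, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical scale means (NOT study data): benefit peaks at medium control.
control = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
benefit = [2.1, 2.9, 3.5, 3.9, 4.1, 4.0, 3.6, 3.0, 2.2]

linear = fit_polynomial(control, benefit, 1)
quadratic = fit_polynomial(control, benefit, 2)
peak = -quadratic[1] / (2 * quadratic[2])  # vertex of the fitted parabola
print(f"R2 linear = {r_squared(control, benefit, linear):.2f}")
print(f"R2 quadratic = {r_squared(control, benefit, quadratic):.2f}")
print(f"peak at sense of control = {peak:.2f}")
```

A negative quadratic coefficient together with a clearly higher explained variance than the linear fit is what an inverse U-shaped pattern would look like in such a comparison.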

Fig. 3

Curvilinear relationship between the factors “sense of control” and “didactic value” (A) and “individual learning benefit” (B), respectively. Observed individual data points are portrayed as dots referring to their average Likert values. Linear and quadratic relationships are illustrated by a continuous and a dashed line, respectively

4 Discussion

Advances in VR technology enabling the production of high-quality simulation environments that train the management of patient cases are dramatically affecting educational concepts, teaching activities, and the outcome of student learning. In the present study, we developed and implemented a novel VR-based training environment for medical emergencies, evaluated its acceptance by students, and explored selected psychological and didactic aspects that were associated with the benefits of the VR method and teaching intervention.

A primary objective of the study was to demonstrate that implementation of a VR-based training session as a compulsory course in the medical degree curriculum is feasible and well accepted by students. With the chosen format, we did not experience any administrative or technical difficulties and thus were able to enroll 252 participants, of whom 227 took part in our study. Compared to other recent work examining the feasibility and outcomes of VR-based training [20, 29], the sample size of our study was comparatively large. Investigating sociodemographic parameters, we observed that the willingness to act as AP was greater among male participants, of whom a higher percentage had prior experience in FPG. This is in line with recent publications demonstrating that twice as many men compared to women play computer games several times a week [30]. Possibly as a consequence, female students, especially those with less computer affinity, exhibit greater skepticism toward VR technology for teaching purposes [31]. For any broad application in the future, the reservations of these individuals in particular should therefore be identified and addressed specifically. However, considering the whole teaching approach, evaluation of the VR-based training session revealed a wide acceptance among both male and female participants. Of note, debriefing and the discussion of open questions with a tutor following the exposure to VR were seen as an integral part of the course. This speaks in favor of embedding VR into a sound didactic concept rather than having students experiment on their own (e.g., in a “VR lab” or at home). Beyond any didactic principles, the guided application of VR-based teaching is mandatory to promote the necessary media competence among medical students [31].

Our second aim was to investigate the technical aspects of the simulation, namely simulation sickness and presence. In our pilot study, VR exposure caused simulation sickness. However, reported values in the three subscales nausea, oculomotor, and disorientation did not indicate high levels of discomfort in our sample. Referring to a comprehensive literature review including 55 studies recording simulator sickness scores among > 3000 VR participants [32], values obtained in our study even appeared to be relatively low in comparison. In light of the fact that only a few participants had any active experience with VR prior to the study, STEP-VR as a simulation-based training environment can clearly be offered to a wider audience without any concern relating to adverse events such as vegetative reactions. One reason for the good tolerance might be that actions could only be completed in a single ground plane, which was congruent with the actual bodily position of the participant. Additionally, the restricted use of content-rich text boxes and the high average frame rate attained by the equipment we employed may have reduced the perception, and perhaps the effects, of simulation sickness. It is worth noting that research in healthy participants and vestibular patients revealed a decline in motion sickness susceptibility with increasing age [33]. In our study, the majority of participants were 22–25 years old and therefore fairly young.

We observed a high degree of presence among AP, who perceived the VR simulation as very realistic. This was confirmed by OBS, but to a significantly lower degree. However, when exploring the interaction with patients on item level, authenticity was only rated slightly above average by AP and average by OBS. One explanation for this finding may be attributed to the fact that the virtual patients were portrayed in a rather motionless manner and communication could be performed only via a dialog menu. Owing to the focus lying on training procedural skills and acceptance of the lack of haptic handling in the VR environment, many procedures were only depicted in a simplified manner, e.g., the insertion of an IV cannula as “drag & drop” to the patient’s arm. Therefore, future development of the software should aim to include more realistic patient modeling and characteristics including responsive and life-like gestures. Another possible way to enhance authenticity could be to incorporate physical objects into the virtual environment, providing a more realistic and haptic experience for students. This approach, described by the terms augmented reality (AR) or mixed reality (MR), can also facilitate interaction with tutors and student observers, as the objects are visible during the simulation. However, the downside of using AR/MR technology is that the required HMD is more expensive, and additional equipment (e.g., phantoms) is needed, which can hinder its widespread implementation.

More research is also needed to clarify how real-time communication affects the perception of high-level patient interaction in VR. However, this requires the integration of sophisticated applications to enable speech recognition. Recent developments in natural language processing in virtual standardized patients only demonstrated a response accuracy of 74–92% in dialogs [34, 35]. Especially in the management of emergencies, even higher levels of precision might prove mandatory, in order to prevent the misconception of information under time pressure.

Our third aim was to evaluate the extent and quality of psychological distress among the participants of the VR simulation. On average, AP reported being moderately challenged during the management of virtual patients. More specifically, EFA identified three distinct factors corresponding to subtypes of perceived stress. The first two, “sense of control” and “challenge to perform”, represent more classic concepts of stress and have been reported in various contexts of medical education [36, 37]. Both parameters correlated negatively with the use of FPG, which is in line with enhanced spatial navigation and visuomotor coordination in this subgroup [38, 39], presumably facilitating the use of VR-based simulations and decreasing perceived stress. A third identified factor of perceived stress, entitled “social interaction to learn”, is based on the assumption that the possibility of publicly unmasking knowledge deficits, as well as the feeling of being observed, also causes discomfort for AP. This might be a particular concern, as it is not possible to notice the surrounding audience when wearing an HMD. Interestingly, by comparing age groups, we found that young participants experienced this factor to a significantly greater extent than their older counterparts. This suggests that age might have a moderating/mediating effect on the social experience in virtual reality, a relationship that has not yet been investigated in more detail. Future studies are needed to elucidate the role of (social) stress in VR-based simulations by comparing different settings (with and without observers).

Finally, we were also interested in the perceived learning benefits and the relation to putative associated factors. Estimated learning success was certainly high for both AP and OBS at the end of the training session. In our comparative analysis, AP rated learning benefit significantly higher. This might be attributable to the fact that a strong feeling of presence has been shown to enhance skill acquisition and factual memory [29, 40, 41]. Higher motivation scores among the AP compared to OBS, possibly owing to higher vigilance and increased activity while training, might serve as another explanation for the better learning success for AP. The promotion of motivation and self-efficacy are key concepts of simulation-based learning and have already been reproduced in the setting of VR-based approaches [14, 42].

Performing EFA on the “estimated learning success” measure revealed that items contributing to the “didactic value” of the VR simulation were rated very highly among AP, while the items comprising translation into an “individual learning benefit” were rated somewhat lower. A possible explanation for this difference is the limited time in the training session, which was approximately 30 min per AP. Furthermore, the version of the VR software in use did not provide an immediate and specific feedback interface (such as a text-based performance report) to the active student. Instead, feedback was only provided by the tutor during debriefing, which might have led to a lower rating for the learning benefit acquired through the training scenario.

Investigating the potential influence of perceived stress on estimated learning success among AP, we were able to demonstrate that the factor “sense of control” showed a typical inverse U-shaped relationship to both subtypes of “estimated learning success”, resembling the relationship that Yerkes and Dodson described for arousal and performance in challenging tasks [28]. According to one study we found in the literature, stress responsiveness was a predictor of good performance [43], while most work highlights the impairment of cognitive skills under excessive stress in high-fidelity surgical or medical emergency simulations [44,45,46]. The latter findings are also concordant with the negative correlation that we discovered between the second factor, “challenge to perform”, and “estimated learning success”. These findings indicate that support should be provided, e.g., by means of a student assistant or a digital tutor, to mitigate excessive stress caused by the demands of this teaching method. Interestingly, on the other side of the inverse U shape, students with a high “sense of control” also reported poor learning. These participants may have had previous experience in dealing with medical emergencies, and/or the theoretical content presented was already familiar to them such that they found no gain in competence. However, owing to the lack of objective performance data, we cannot conclude whether this self-estimation was justified or misleading. In contrast, “social interaction to learn” was the only factor of “perceived stress” that exhibited no correlation to the two factors of “estimated learning success”. Presence of an audience is described as a stressor in conventional healthcare simulation approaches [47, 48] and correlated with signs of increased stress and a trend towards worse learning outcomes in a physical intubation model [49]. However, the audience in our training session was generally perceived as supportive, as indicated by the results on item level, which might have balanced positive and negative effects, thus making it difficult to establish any meaningful correlation.
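
As an illustration of how an inverse U-shaped relationship of the kind described above can be checked, the following minimal sketch fits a quadratic model and tests whether the curve is concave with its peak inside the observed range. It is not the analysis performed in this study: the function names `fit_quadratic` and `is_inverse_u` are hypothetical, the data are synthetic and chosen only for demonstration, and a full analysis would additionally test the significance of the quadratic term.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2; returns (b0, b1, b2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.polyfit returns coefficients from the highest power downwards.
    b2, b1, b0 = np.polyfit(x, y, deg=2)
    return b0, b1, b2

def peak_location(x, y):
    """x value of the fitted parabola's vertex, or None if the curve is not concave."""
    _, b1, b2 = fit_quadratic(x, y)
    if b2 >= 0:
        return None
    return -b1 / (2.0 * b2)

def is_inverse_u(x, y):
    """True if the fit is concave and the vertex lies strictly inside the observed x range."""
    x = np.asarray(x, dtype=float)
    peak = peak_location(x, y)
    return peak is not None and x.min() < peak < x.max()

# Synthetic illustration only (NOT study data): hypothetical "sense of control"
# scores and hypothetical "estimated learning success" ratings.
control = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
learning = np.array([2.0, 3.5, 4.0, 3.5, 2.0])

if __name__ == "__main__":
    print("inverse U-shaped:", is_inverse_u(control, learning))
    print("peak at control score:", peak_location(control, learning))
```

In this sketch, a negative quadratic coefficient alone is not treated as sufficient; requiring the vertex to fall within the observed scores avoids labelling a merely flattening monotonic trend as an inverse U.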

4.1 Strengths of the study

In our study, we used state-of-the-art technology to enable the best possible learning success through high-presence VR training and to mitigate major side effects to a large extent. The number of participants was comparatively large and came from two unbiased semesters, thus yielding a statistically meaningful dataset. In contrast to studies that recruit students voluntarily or through incentives, this compulsory training session is also likely to have included participants with more critical attitudes towards VR technology. In this way, the participants of our study may well prove to be representative of the heterogeneity of medical students in general.

4.2 Limitations of the study

One limitation of the study is that the sample was restricted to undergraduate students in one academic year and at one institution. As the study was performed at a single medical school in Germany, further studies at other universities and with different student cohorts will be required to assess the validity of the results. In addition, as the course was performed exclusively in small groups under guidance, the additional influence of both the student assistant and the observing peers could not be clarified. As participation in the role of AP was voluntary, we cannot exclude selection bias in this group.

A second limitation is the use of adapted or newly created questionnaires. As the combination of the teaching method (VR-based simulation) and setting (small group teaching) is a relatively new approach, the currently available questionnaires, especially those evaluating perceived stress, were not considered applicable and had to be adapted. These strongly modified or newly created questionnaires await re-validation and currently restrict the comparability of our results with other work.

Most importantly, all the results presented are subjective data recorded using surveys. While, at least to our knowledge, no concept exists for the objective measurement of simulation sickness or feeling of presence (based on physiologic parameters, for example), the reported “estimated learning success” needs confirmation through a randomized controlled trial and test performance data derived from a pre- and post-intervention study design. Additionally, future work might place emphasis on objective measurements of stress, such as heart rate variability or electrodermal activity [50, 51], and investigate any possible habituation as a result of the repeated use of VR-based simulation.

5 Conclusions

Curricular implementation of a highly immersive VR-based training session on medical emergencies is feasible. The chosen teaching modality, including guidance by a student assistant and subsequent debriefing by a tutor, enjoys a high level of acceptance among medical students. Both AP and OBS reported a high level of self-rated learning success, with the effect on learning being more pronounced in students taking the role of AP.

Moreover, this study provides insights into how different conceptions of perceived stress moderate subjective learning success in distinct ways. Whether peer observers are perceived as helpful or seen as stressors could not be elucidated entirely and warrants further studies comparing group with individual teaching settings. As the learning benefit in this study was reported subjectively by the participants, an objective evaluation of the learning outcomes achieved by STEP-VR is needed in a controlled assessment setting, ideally with different levels of explicit and implicit feedback.