1 Introduction

Learning refers to a psychological process where an individual's response to a situation evolves based on their experiences, as defined by Pithers (1998). This transformation may manifest in various ways, such as behavioral changes, the acquisition of knowledge or the formation of attitudes, as noted by Dewey and Boydston (1985). The National Academies of Sciences, Engineering and Medicine defines “learn” as:

“… an active verb; it is something people do, not something that happens to them. People are not passive recipients of learning, even if they are not always aware that the learning process is happening. Instead, through acting in the world, people encounter situations, problems, and ideas. By engaging with these situations, problems, and ideas, they have social, emotional, cognitive, and physical experiences, and they adapt. These experiences and adaptations shape a person’s abilities, skills, and inclinations going forward, thereby influencing and organizing that individual’s thoughts and actions into the future.” (2018, p. 12).

Consistent with the given definition, Experiential Learning Theory (ELT) characterizes learning as the process by which knowledge is formed through the transformation of experience, where knowledge is the result of both grasping and transforming experiences, as stated by Kolb (1984, p. 41). ELT is founded on the notion that learning is not solely an end result, but rather a continuous process, as emphasized by Kolb and Kolb (2005). Based on motor skills learning theory, learning begins with (1) a cognitive or knowledge-acquisition phase, followed by (2) an associative or skill-acquisition phase in which experience is gained through repetition and practice, and (3) an autonomous phase in which trainees perform tasks with low error rates that can be improved with minimal correction (Rogers et al. 2001). The causal cues and repetitions are meant to trigger individual experiences and enhance mental models (Jou and Wang 2013). The goal of this iterative, context-based learning approach is to reduce human errors to an acceptable level, as competence improves significantly through the development of skillsets (Deaton et al. 2005).

For many years, medical education has relied on Halsted's apprenticeship model (McKnight et al. 2020), where residents and fellows acquire skills through supervised practice with patients as the primary training opportunity. The operating theater has traditionally served as the main classroom for surgeons during their apprenticeship and for medical students, as surgical skills necessitate repeated exposure to diverse scenarios and hands-on experience. However, the increasing complexity of procedures, the introduction of new technologies and techniques, stricter work hour restrictions, greater legal scrutiny and heightened awareness of trainees' roles in patient care have created a need for more advanced training approaches. Although cadaveric training has been the gold standard, it is constrained by high costs, limited accessibility, non-pathologic states and ethical considerations (Mao et al. 2021). Similarly, simulated patients, role-play and mannequins are commonly used to improve clinical, surgical and decision-making skills, but they are often constrained by time, location, cost and limited reusability for most scenarios.

In recent decades, computer-assisted training, particularly through technology-mediated learning platforms such as Virtual Reality (VR), has gained significant traction (Jou and Wang 2013). VR has gained popularity in high-risk industries such as defense, aviation and medicine (Mehrotra and Markus 2021). According to the International Nursing Association for Clinical Simulation and Learning (INACSL) standards of best practice, VR is defined as “a simulation-based learning activity designed to provide an experience through the use of an alternative medium. Learners can complete specific tasks in a variety of potential environments, use information to provide assessment and care, make clinical decisions, and observe the results in action” (INACSL Standards Committee 2016, p. 41). While VR has a long history in medical education and numerous studies have demonstrated its effectiveness in improving skill mastery outside the operating room, assessing the translation of these skills to the clinical environment remains challenging (McKnight et al. 2020).

The aims of the research were to:

  • Determine whether VR training improves the skill acquisition of the candidates.

  • Describe the short-term skill acquisition obtained by simulation training.

  • Determine the factors affecting the magnitude of the skill acquisition.

Therefore, an empirical study was performed using a within- and between-subject design approach to validate immersive VR as a training platform for medical students. The training scenario covered the topic of Arterial Blood Gas (ABG) collection for second-year medical students. ABG collection involves drawing a blood sample from an artery, primarily to determine arterial blood gases, which are used to assess patients in critical care. This procedure is a fundamental skill that students need to acquire during their education.

Following this brief introduction, the paper is structured into a further eight sections. Section 2 summarizes the recent literature around the applications of VR for surgical training, and Sect. 3 outlines a requirements framework derived from the literature, which was then tailored for the system of interest within the case study. A methodology is then provided that describes the requirements-driven approach taken to design the study, gather data as evidence of requirements fulfillment and, finally, evaluate the system against the requirements. Section 5 reports the results of the study, with Sect. 6 specifically providing the results of the validation exercise. The paper finishes with a discussion of the research findings against the initial aims described above and considers the limitations of the work. The paper concludes with a summary of the work and its implications for the field of research.

2 Application of VR for surgical training

Within the literature on the applications of VR for surgical training, VR tends to be used in two ways: as a surgical training tool, as a pre-surgical training and planning tool, or both.

2.1 Application of VR for surgical training skills and education

VR-based simulators have been extensively used for healthcare faculty education and nursing studies over the past ten years (Mäkinen et al. 2022). Barteit et al. (2021), in their review of HMD-VR for medical training, reported that HMDs were most often used for training in the fields of surgery and anatomy and that trainees perceived them as a salient, motivating and engaging training platform. Previous literature has also demonstrated the benefits of VR for surgical residency programs, where, post-training, residents were able to complete procedures more quickly and displayed better accuracy in performing the surgical tasks (Mao et al. 2021). While training is important, it is also crucial to formulate proper skill acquisition and assessment programs, and VR can create an opportunity for novice or experienced surgeons to perform or practice a surgery while instructors identify individual strengths, weaknesses and any areas for improvement (Mehrotra and Markus 2021). Bracq et al. (2021) investigated the use of VR for training scrub nurses to identify errors and maintain situational awareness in the operating room. The results of their study suggested that students perform better than experienced nurses, which indicates the importance of providing refresher training for healthcare professionals. This can be useful in tracking the development and acquisition of skills in residents, as well as helping to maintain skill levels in senior surgeons (Pelargos et al. 2017). Rahman et al. (2020) highlight that 10% of surgical applications used HMD-VR for education and training, which was found to be most prevalent in urology, neurosurgery and craniomaxillofacial surgery. Neurosurgeons must contend daily with critically important small anatomical structures wrapped in fragile vascular networks, which require micro-granularity of control and operation of surgical tools. The success (or otherwise) of these operations relies on exceptional knowledge, great technical prowess and meticulous preparation. This review highlights the opportunity for a trainee's competency level to be assessed while they perform a simulated surgery in VR. This creates an opportunity for transparent communication among the surgical team, patients and their families, as well as increased efficiency, improved patient care and reduced technical errors associated with the learning curve of surgery. However, Bernardo (2017) asserts that the translation of skills from neurosurgical VR simulation to real-life situations has yet to be definitively proven, as users do not receive the typical physical feedback or interaction that is a fundamental aspect of traditional surgical training. A study by Masuoka et al. (2019) shows that the training outcome is highly dependent on the technology. In their study, 17 medical students used Google Cardboard, which allowed multiple students to observe a 3D model. While participants reported largely favorable responses in the evaluation of the dissection model, low scores were also observed because of visually induced motion sickness and eye fatigue. Their study reports that the choice of headset and the quality of the content have a direct impact on the students' learning experience. Furthermore, researchers have explored the application of VR as a self-learning tool that allows teachers to monitor and guide the learning process of the student (Fairén et al. 2020). The results revealed that students preferred the presence of the teacher within the virtual space to provide support and feedback and guide them throughout the learning process. Lecturers also supported the application of VR as a collaborative learning method where students benefit from sharing knowledge and experience with other classmates.

The other main application of VR is to provide a deeper anatomical understanding. For example, the application of VR for temporal bone model representation is a valuable educational tool for the acquisition of this important anatomical knowledge (Yamazaki et al. 2021). The VR-based learning tool enabled spatial learning and allowed the user to interact with, manipulate and observe the model from multiple angles. The tool also allowed users to change their position relative to the model, which can increase spatial awareness and memory. Silva et al. (2018) discuss a recent development by Stanford University which allows students to use an immersive VR headset to inspect a heart model, and to manipulate and walk through the model, thus providing a more complete understanding of the heart's anatomy and physiology. Additionally, these models have been used to help parents better understand their child's anatomical pathology in a more interactive and effective manner than traditional diagrams and plastic models have allowed. This enhanced level of understanding may help families better engage with the treatment plan.

Another application is familiarizing surgeons with surgical instruments. Huber et al. (2018) discussed the benefit of VR for laparoscopy training, where trainees conduct surgical tasks with standard laparoscopic instruments. They report on studies in which VR laparoscopy simulations were used for skill acquisition and these skills were transferred to the operating room. VR has the potential to be beneficial in maxillofacial training, as it can increase the level of comfort among dental students before their first experience with using needles and administering local anesthesia, and can also enhance instrument knowledge among neurosurgery residents (Mehrotra and Markus 2021).

2.2 Application of VR for pre-surgery training

A VR environment in which trainees can collaborate facilitates surgery planning through four key stages: data initialization, virtual resection specification, virtual resection modification, and volume estimation and surface reconstruction. This allows for training and knowledge exchange among surgeons, whether in a shared or remote environment, thus enhancing their surgical expertise (Chheang et al. 2021). According to research findings, VR systems enable surgeons to rehearse upcoming procedures using patient-specific imaging in a simulation, which can enhance familiarity and shorten the time required for orientation during actual surgery (Pelargos et al. 2017; Vaughan et al. 2016; Kenngott et al. 2022; Huber et al. 2018). The creation of patient-specific VR models for pre- and intra-operative use creates opportunities to discuss several surgical options prior to the actual surgery (Yamazaki et al. 2021). For example, Chheang et al. (2021) investigated the use of a VR environment to support surgeons in tumor surgery planning and reported on its benefits for training, planning and knowledge-sharing among surgeons. Hooper et al. (2019) found that when VR training was conducted just before the actual surgery, it had a particularly positive impact on the doctor's technical skills and knowledge. Moreover, Pulijala et al. (2018) also highlighted that surgeons who received VR training prior to surgery reported higher self-esteem during procedures; this appeared to be particularly relevant for doctors with limited clinical experience. Furthermore, Bernardo (2017) reported that VR-based training has resulted in reductions in operation time and in the number of errors committed, by increasing confidence and reducing the surgeon's unnecessary movements.

Research shows a wide range of ad hoc applications of VR in surgical training and planning. Evaluations are often superficial and conducted by technology vendors, based on assumed environments and tasks and envisaged (as opposed to actual) users, with claims about the effectiveness of learning outcomes underpinned by little or no research and little focus on a requirements-driven validation approach. This presents decision-making challenges for those seeking to adopt, implement and embed such systems in teaching practice. In the next section a requirements-driven approach will be presented and discussed.

3 Validation approach

There are a number of frameworks aimed at identifying the factors impacting the learning process and outcomes of VR-based training (Pedram et al. 2020; Makransky and Petersen 2019; Petersen et al. 2022). In these studies, the aims were to present research-based theoretical models to guide research and development in the absence of comprehensive guidelines, to generate frameworks to guide the design process of medical training platforms or to assist researchers in validating such platforms. Recently, Pedram et al. (2023) proposed a requirements-driven approach (supported by a tailorable requirements framework hierarchy) to support validation efforts for medical VR training systems. These requirements were grouped into 11 key areas (and 3 "super-categories") as follows:

Design Considerations:

(1) Interaction—Controls, (2) Interaction—Feedback, (3) VR Features, (4) Usability

Learning Mechanisms:

(5) Learning Experience, (6) Learner’s State of Mind, (7) Learning, (8) Trainer and Feedback

Implementation Considerations:

(9) Expertise, (10) Technology Adoption, (11) Technology

Figure 1 displays the radial wheel chart of the hierarchy of requirements presented by Pedram et al. (2023). There is a set of eleven key top-level requirements (Level 1), fifty-one sub-requirements (Level 2) and, from these, another thirty-nine requirements (Level 3). The hierarchical decomposition of the requirements means that validation of the system may be performed from the 'bottom-up' (i.e., satisficing lower-level child or sub-requirements builds the case for meeting the parent requirement), eventually resulting in a traceable landscape from which one can claim full validity of the system against all its requirements (and thus demonstrate that it is fit for purpose).

Fig. 1
figure 1

Requirements hierarchy wheel; Pedram et al. (2023)
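To make the bottom-up roll-up described above concrete, the following minimal Python sketch shows one way such a hierarchy could be represented; the requirement IDs, names, scores and the use of a simple mean as the aggregation rule are illustrative assumptions rather than the scoring scheme used by Pedram et al. (2023).

```python
# Illustrative sketch only: a parent requirement's case is built from its
# satisficed children. IDs, names, scores and the mean-based roll-up are
# assumptions for demonstration, not the published scoring scheme.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str                          # e.g., "4.2" denotes a child of "4"
    name: str
    met_score: int | None = None         # leaf-level decision: 1 (not met) .. 5 (fully met)
    children: list[Requirement] = field(default_factory=list)

    def rolled_up_score(self) -> float:
        """Leaf: its own decision; parent: mean of its children's rolled-up scores."""
        if not self.children:
            return float(self.met_score or 0)
        return sum(c.rolled_up_score() for c in self.children) / len(self.children)


usability = Requirement("4", "Usability", children=[
    Requirement("4.1", "Ease of use", met_score=4),
    Requirement("4.2", "Learnability of controls", met_score=5),
])
print(usability.rolled_up_score())  # 4.5 -> evidence toward the parent requirement
```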

4 Methodology

The validation case was planned as a mixed approach of qualitative and quantitative judgments that took into account the perspectives of the key stakeholders (i.e., the learners, the teachers, the educational organization and the VR developers), using the following five validation areas:

  1. Survey Factors (the VR factors as described in Tables 1 and 2)

  2. Thematic Analysis (analysis based on the free-form strengths and weaknesses provided by the participants)

  3. Observations (experimental results as well as the observer’s anecdotal findings)

  4. Lecturer Judgment (SME (subject matter expert) judgment from the medical teaching area)

  5. Design Specifications (technical team representatives confirming design functionality or constraints)

Table 1 Sample of questions in pre-training questionnaire
Table 2 Sample of questions in post-training questionnaire

The validation table consists of seven key sets of information (a minimal illustrative record is sketched after the list below):

  (A) The requirement (consisting of a unique ID which also indicates any hierarchical relationships, the short-form name for the requirement and the requirement statement)

  (B) Envisaged measures (plans for how such a requirement might be validated)

  (C) Validation areas (the evidence gathered against each of the areas)

  (D) Validation decision (given the evidence presented in the validation areas, the stakeholders make a decision as to how well the current system met that requirement; 5 fully met, 4 strongly met, 3 moderately met, 2 weakly met, 1 not met, N/A)

  (E) Validation comments (used to record any further comments around the validation decision)

  (F) Weighting (how important that requirement is for the system of interest; 5 essential, 4 important, 3 good to have, 2 extra to requirements, 1 not needed)

  (G) Recommendations (potential improvements)
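As an illustration of how a single row of this validation table might be captured for analysis, the sketch below defines a hypothetical record type in Python; all field values shown are placeholders, not data from the study.

```python
# Hypothetical sketch of one validation-table row (fields A-G described above).
# All concrete values are illustrative placeholders.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ValidationRecord:
    req_id: str                 # (A) unique ID encoding hierarchy, e.g., "X.1"
    req_name: str               # (A) short-form name
    req_statement: str          # (A) requirement statement
    envisaged_measures: str     # (B) how the requirement might be validated
    evidence: dict[str, str]    # (C) evidence gathered per validation area
    decision: int | None        # (D) 5 fully met .. 1 not met; None for N/A
    comments: str               # (E) further comments on the decision
    weighting: int              # (F) 5 essential .. 1 not needed
    recommendations: str        # (G) potential improvements


example = ValidationRecord(
    req_id="X.1",
    req_name="Example requirement",
    req_statement="The system shall ... (placeholder statement).",
    envisaged_measures="Survey factors; lecturer judgment",
    evidence={"Survey Factors": "...", "Lecturer Judgment": "..."},
    decision=4,
    comments="Placeholder comment.",
    weighting=5,
    recommendations="Placeholder recommendation.",
)
```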

4.1 Sample

As summarized in Table 3, a total of 44 medical students from the University of Wollongong participated in the experiment as part of a Bachelor of Medicine, Bachelor of Surgery (MBBS) graduate program. The final sample consisted of 25 females, 19 males and no non-binary participants. The majority were between 23 and 30 years old (n = 41), and the remainder were 31–45 (n = 3). Participants in the two conditions had varying experience with VR; in total, half of the participants had no prior exposure to VR and the other half had tried VR in some form at least once before the study.

Table 3 Participants' demographic data

4.2 Virtual reality training scenario

The Arterial Blood Gas (ABG) collection scenario was developed by a VR software vendor (specializing in the development of medical procedures) using the Unreal Engine platform (Fig. 2); the procedure involves inserting a small-gauge needle into the radial artery (typically) and collecting a blood sample. This sample is then inserted into an analysis machine to support diagnosis and interpretation of a patient’s clinical state. ABG collection is a common procedure performed daily in hospitals and remains an essential skill for all medical students, junior doctors and, in some contexts, other healthcare practitioners such as nurses. In Australia, ABG collection is a core diagnostic procedure within medical programs, to be completed in both simulated experiences/environments and clinical settings (Medical Deans of Australia and New Zealand 2020). The key learning components required for proficiency are the process and technique of locating the artery, accessing it with the needle, and preparing and analyzing the sample, all while maintaining appropriate safety and hygiene conditions. The procedure can be further extended in critical care settings (such as Emergency Medicine or Intensive Care) as an Arterial Line Insertion, which provides persistent vascular access for the purposes of obtaining blood samples on an ongoing basis and for the monitoring of arterial pressures.

Fig. 2
figure 2

© UOW Media

Student undertaking Vantari VR ABG collection training

The module was designed and developed by Vantari VR (https://www.vantarivr.com). The training module is accessed and stored on Vantari’s cloud-based platform, which hosts a library of medical procedures for training. The user (student/clinician) can access this platform with an internet connection and an active subscription. The user needs a Virtual Reality headset (HMD) and a computer. For this study we used an Oculus Quest 2, a laptop (with an Intel i7 processor and an NVIDIA GeForce RTX graphics card) and a connecting USB-C Link cable to run the scenario.

The user is then immersed in a medical environment with a virtual patient, virtual instruments and a step-by-step guide of written and narrated instructions on how to perform the procedure. As the user performs the procedure, a real-time performance tracking module gives the user feedback on how they are performing. Haptic feedback also replicates feel and tactile sense through vibrations delivered to the controllers. The system collects the user’s performance data, which can be viewed post-procedure, visualizing key data points such as steps successfully completed and duration.

4.3 Procedure

To begin with, a within- and between-subject design approach was used to measure the survey factors. The surveys included open-ended questions on the weaknesses and strengths of the training platform, which enabled us to conduct a thematic analysis of students’ responses. The training scenario covered the topic of Arterial Blood Gas (ABG) collection, which involves using a needle and syringe to directly sample blood from an artery. In total, 44 medical students volunteered to participate in this study, and they were randomly assigned to either the study or the control group. The participants were provided with a consent form to read and sign, which included a general description of the experiment and outlined their rights as research participants. This study (including the Participant’s Informed Consent) was approved as a Human Ethics submission in July 2021 by the UOW Ethics Committee (Protocol #2021/258). Participants were then allocated to one of the two conditions:

The control group had access to the Student Study Guide and other usual pre-readings/resources and attended a theory session before the hands-on practical session.

The study group had access to the Student Study Guide, other usual pre-readings/resources and the theory session, and additionally received VR training before the hands-on practical session.

4.3.1 Pre-VR training measure and data collection

Each participant was assigned a unique ID before being randomly assigned to a group. Students in the study group (n = 25) completed a knowledge test, a pre-training questionnaire and a pre-practical questionnaire. The knowledge test was prepared based on the narration used during the virtual lesson and consisted of seven multiple-choice questions with four possible answers (see A1. The Knowledge Test Questions). A sample question was: “Which 2 needle sizes are generally used for the collection of ABGs?”. The study group completed the same knowledge test prior to VR training and again prior to the practical session (theoretical knowledge gain was measured through pre-test to post-test changes in the number of correct responses). The control group completed the knowledge test prior to the practical session.

Students in the study group also completed a pre-training questionnaire designed to assess students’ background, experience and state of mind prior to the VR training. The pre-test collected demographic information and assessed participants’ prior levels of well-being, intrinsic motivation, self-efficacy and stress. The intrinsic motivation and stress scales (Pedram et al. 2020) and the self-efficacy scale (Makransky and Petersen 2019) consisted of three, two and four items, respectively, with each item measured on a 7-point Likert scale where 1 was strongly disagree, 4 was neutral and 7 was strongly agree. Additionally, screening questions were used to identify which students had completed the reading materials and had seen or performed ABG collection prior to the training.

The pre-practical questionnaire measured the same factors as the pre-training questionnaire, with the addition of two factors: “Sense of Engagement” and “Sense of Enjoyment.”

4.3.2 Post-VR training measures and data collection

Students from the study group attended an individual virtual practical session on the topic of ABG collection at a designated location. On entry to the virtual training environment, the students were given a short, guided tutorial to familiarize them with the look and feel of the VR visuals and the identification of controls through manipulation of virtual objects. Immediately following the VR training, the study group completed the post-training questionnaire, which quantitatively recorded their perceptions of the experience in VR, measured cognitive load and finished with two open-ended qualitative questions asking about the strengths and weaknesses of VR training in ABG collection. The NASA Raw Task Load Index (Hart and Staveland 1988) was used to measure participants’ cognitive load, followed by items to measure simulator sickness (Pedram et al. 2020), realism/representational fidelity (Pedram et al. 2020), immediacy of control and self-efficacy (Makransky and Petersen 2019), usefulness, ease of use, presence and enjoyment (Pedram et al. 2020), control and active learning, cognitive benefit, reflective thinking, stress, intrinsic motivation, tool functionality and attitude toward use (Pedram et al. 2020). Each item was measured on a 7-point Likert scale where 1 was strongly disagree, 4 was neutral and 7 was strongly agree.

4.3.3 Practical session

Four days after the VR training session, all students from the control group (n = 19) and the study group (n = 23, as 2 of the 25 students were unable to attend the practical session) returned to the university campus to attend their practical session. The practical ABG collection training is a compulsory component of the second-year Clinical Skills curriculum, and it occurs on campus using a 3D simulated wrist-hand array that pumps a red liquid through simulated arteries within the wrist area. At the beginning of the one-hour practical session, the lecturer provided a 20-minute step-by-step demonstration of the ABG collection procedure and answered any questions from the students (comprising both study and control participants). Figure 3 shows an example of the demonstration on the 3D wrist-hand array. In an effort to minimize any confounding variables, the students involved in the study were all assigned to practical sessions taught by the same lecturer using the same equipment. Following the demonstration, the students worked in groups of three around a model wrist-hand, taking turns performing ABG collection. The lecturer supervised the students, answering questions and offering advice when appropriate. The lecturer was blinded as to which group each student belonged to.

Fig. 3
figure 3

© S. Pedram

Lecturer practical demonstration of ABG collection

Immediately prior to the practical session, all students completed the pre-practical training questionnaire and the knowledge test. Four observers (tutors/teaching assistants within the General School of Medicine) were recruited to assess the students’ performance based on the checklist (see A2. Observation Template) on four different measures: (1) performance, (2) safety and hygiene, (3) confidence and (4) whether assistance was needed. Observers were also blinded to which students in the cohort had received the VR training.

4.3.4 Experimental data summary

Table 4 shows the full set of measures/constructs obtained during the study. The primary outcome measures were trainees’ self-assessment scores of their learning experience in VR, level of confidence and subjective learning (collected via Likert-scale questionnaires prior to and after training), together with objective assessment during the practical session.

Table 4 Full set of measures

5 Results

The results of the study are presented in three sub-sections:

  1. Prior Experience. To establish the extent to which students had experience and knowledge of the procedure and any existing familiarization with VR.

  2. VR Training. The pre- and post-VR training results concerning training factors, procedural knowledge and students’ experience.

  3. Practical and VR Training Comparison. Assessor ratings and post-practical training results concerning training factors and procedural knowledge.

5.1 Prior experience

As summarized in Table 5, questions were included to identify the pre-training experience baseline. Twenty-four out of 44 students had never seen ABG collection before, and only 5 students had performed ABG collection prior to attending the training session.

Table 5 Participant’s pre-training exposure to the ABG collection

5.2 VR Training results

5.2.1 Pre-VR training factors-study group

Table 6 summarizes the students’ state of mind prior to attending the experiment. Each factor was measured using 7-point Likert scale items where 1 was strongly disagree, 4 was neutral, and 7 was strongly agree. Only 64% had tried VR before the training (Table 4); they also reported below-average levels of gaming experience (M = 3.32, σ = 1.99). In general, students felt good prior to the VR training (M = 5.48, σ = 1.19) and were highly motivated (M = 5.88, σ = 0.952) to be attending the training session and to learn how to perform ABG collection. They reported a very low level of stress prior to the VR training and felt mentally ready/prepared for the session (M = 2.58, σ = 1.23), and they felt moderately confident about their knowledge of ABG collection (M = 4.33, σ = 1.07). In summary, students approached the VR training in a positive state of mind for learning.

Table 6 Pre-VR training factors—study group N = 25

5.2.2 Post-VR training factors

Students on average reported positive experiences during VR training (Table 7); they reported very low levels of simulator sickness following the VR training session (M = 1.18, σ = 0.405) and very low levels of physical impact (M = 1.84, σ = 0.800) such as physical pain or strain while wearing the VR headset. Similarly, we observed low workload scores on mental demand (M = 3.56, σ = 1.083), physical demand (M = 2.28, σ = 0.842), temporal demand (M = 2.04, σ = 0.978), effort (M = 3.84, σ = 1.247) and frustration level (M = 1.960, σ = 1.135).

Table 7 Post-VR training factors—N = 25

Students found VR training very enjoyable (M = 6.77, σ = 0.427) and engaging (M = 5.40, σ = 1.217), to the point that they embraced their role, became deeply involved with the scenario and virtual patient and experienced a high level of presence (M = 5.60, σ = 0.859). They reported high levels of representational fidelity (M = 5.83, σ = 0.708) and immediacy of control or agency (M = 6.37, σ = 0.633) over the environment and their actions, where they could manipulate the objects (e.g., pick up, cut, wear gloves) and use the tools in the manner intended. This made their experience in the virtual operating room seem and feel consistent with their experiences in the real world.

They reported high scores for control and active learning (M = 6.464, σ = 0.607), highlighting that this type of virtual reality learning program helps them to take more control over their learning and to learn at their own pace. This has great cognitive benefit (M = 6.44, σ = 0.574), helping students to memorize, comprehend and analyze the learning material more easily, and it creates the opportunity for high reflective thinking (M = 6.31, σ = 0.719), where students can reflect on their understanding and link new knowledge with previous knowledge and experiences. Post-VR training, students felt very low stress (M = 2.68, σ = 1.63) and high self-efficacy (M = 6.08, σ = 0.648) prior to attending the practical session and reported high intrinsic motivation (M = 6.44, σ = 0.541) to learn more about medical procedures.

In respect of usability, students scored the training platform highly for ease of use (M = 5.93, σ = 0.833) and usefulness (M = 6.66, σ = 0.544). This presumably contributed to the high scores for tool functionality (M = 6.16, σ = 0.746) and attitude toward using the technology (M = 6.84, σ = 0.374) for medical training purposes.

In regard to the trainer and their role in the learning process, as per the results presented in Fig. 4, 36% of students believed that it is preferable to have a trainer or teacher in the physical room providing feedback/guidance, while 32% preferred to have the trainer or teacher interacting virtually within the same VR synthetic environment as the student. With respect to joint learning experiences, few students preferred to have other learners either in the virtual environment (12%) or in the room observing them (4%). Only 16% preferred to undertake VR training alone (without a trainer or other learners).

Fig. 4
figure 4

Participants’ assisted learning preference

The participants perceived high levels of enjoyment and were highly likely to recommend VR training to others (Table 8). Over ¾ of the participants gave the maximum rating for recommendations, with the remainder at the second highest rating.

Table 8 Participant’s recommendation of VR training to others

A survey to measure perceived workload was administered immediately post-VR training. A reduced version of the NASA Task Load Index, the Raw TLX (R-TLX), was used for this study; the R-TLX differs from the TLX in that it does not include weighting of the separate indices. The Likert scale values were factored onto a 100-point scale. Table 9 shows the results of the R-TLX (along with the interpretation bands). The R-TLX showed that for the VR task, the level of perceived temporal demand (i.e., whether the participant felt time pressure) and the level of frustration were in the “medium” band; physical demand and performance (how much the participant thought the workload affected their performance) were in the “somewhat high” band; and mental demand and effort were in the “high” band. The overall Global Workload Score for the task came out at M = 40.476, σ = 9.380, placing this training experience in the middle of the “somewhat high” band.

Table 9 Post-VR NASA R-TLX (N = 25)
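For clarity, the sketch below shows how an unweighted (raw) global workload score of this kind can be computed; the linear rescaling of 7-point ratings onto 0–100 and the example ratings are assumptions for illustration, not the exact transformation used in the study.

```python
# Sketch of an R-TLX-style global score: unweighted mean of the six dimensions
# after rescaling. The rescaling choice and the ratings below are illustrative.
from statistics import mean

DIMENSIONS = ["mental", "physical", "temporal", "performance", "effort", "frustration"]


def rescale_to_100(rating: float, levels: int = 7) -> float:
    """Map a 1..levels Likert rating onto a 0-100 scale (one plausible rescaling)."""
    return (rating - 1) / (levels - 1) * 100


def raw_tlx_global(ratings: dict[str, float]) -> float:
    """Raw (unweighted) TLX global workload: mean of the rescaled dimension ratings."""
    return mean(rescale_to_100(ratings[d]) for d in DIMENSIONS)


example = {"mental": 3.5, "physical": 2.5, "temporal": 2.0,
           "performance": 3.0, "effort": 4.0, "frustration": 2.0}
print(round(raw_tlx_global(example), 2))  # hypothetical participant's global score
```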

It is important to understand that the low to high workload bands must be interpreted appropriately for the task being performed. For example, one would expect a task with high levels of activity or vigilance to carry a high workload, whereas more sedentary tasks would carry a low workload. Although high levels of workload can be linked to stress and the causation of human errors, seeking the lowest workload is not always the goal, as low workload can cause attentional deficits and a lack of challenge. Therefore, benchmarking against similar types of activities is essential to determine whether a task is “too high” or “too low.” Grier (2015) conducted a meta-analysis of NASA-TLX Global Workload Scores, examining over 1000 global TLX scores to benchmark based on types of task/context. Of the contexts provided, two are of interest for this study: (1) Medical and (2) Video Games. This enables us to benchmark against 105 other studies in Table 10.

Table 10 NASA-TLX benchmarking

The global workload score for our VR task was 40.48 (Table 9). Benchmarked against similar tasks, this puts the workload just above the 25th centile for medical tasks and just below the 25th centile for video game tasks (shown in bold in Table 10). Thus, the VR training exerted a perceived global workload lower than that of roughly 75% of the other medical or video game tasks reported in the published academic literature. This appears appropriate, as a simulated medical task could potentially involve less workload because the individual performing it is shielded from potential decision-making and environmental changes and would likely feel less stress due to the low-risk nature of simulation. In terms of video game tasks, the ABG collection VR training was relatively sedentary (it did not involve the user walking or moving large distances) and time-coordination of movement was not an issue. The environment was also static in terms of objects or displays moving around independently of the user.

5.2.3 Training factors (pre- and post-VR training comparison)

Students’ intrinsic motivation was compared before and after the training; a higher score was observed post-VR training (M = 6.09, σ = 0.519) compared with pre-training (M = 5.88, σ = 0.952), and the difference reached statistical significance (t(25) = − 3.301, p = 0.03) (Table 11). This observation, in conjunction with the results presented above, indicates that students enjoyed the training and were even more motivated to learn about medical procedures after attending the training session in VR. Furthermore, self-efficacy was significantly higher post-VR training compared to pre-VR training (t(25) = − 11.45, p < 0.001) (Table 12). High self-efficacy can result in greater learning (Makransky and Petersen 2019). This is an encouraging observation, indicating that after attending VR training students perceived themselves as having a higher capability for performing the medical procedure. In regard to stress levels, students generally felt less stress post-VR training than pre-VR training, but the difference did not reach statistical significance.

Table 11 Participants intrinsic motivation level comparison (prior and post-VR training)–paired sample t test
Table 12 Participants self-efficacy level comparison (prior and post-VR training) paired sample t test
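The paired comparisons reported above can be reproduced with a standard paired-sample t test; the sketch below shows the general pattern using SciPy, with hypothetical per-participant composite scores rather than the study data.

```python
# Sketch of a paired-sample t test (e.g., intrinsic motivation pre- vs post-VR).
# The two aligned score lists are hypothetical, not the study data.
from scipy import stats

pre_scores = [5.7, 6.0, 5.3, 6.3, 5.0, 6.7, 5.7, 6.0]    # hypothetical pre-VR composites
post_scores = [6.0, 6.3, 5.7, 6.3, 5.7, 6.7, 6.0, 6.3]   # hypothetical post-VR composites

t_stat, p_value = stats.ttest_rel(pre_scores, post_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```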

5.2.4 Theoretical knowledge score (pre- and post-VR training comparison)

As summarized in Table 13, students’ post-VR theoretical knowledge score (M = 3.60, σ = 1.339) was higher than their pre-VR theoretical knowledge score (M = 2.84, σ = 1.545), and the increase reached statistical significance (t(25) = − 2.55, p = 0.009).

Table 13 Paired sample t test for pre- and post-VR knowledge score

To confirm that the growth we observed was due to the VR training and not other factors, students’ scores were also compared based on whether students: (i) had seen ABG collection before or (ii) had read the lecturer’s notes prior to attending the VR training. The analysis revealed no significant difference (p > 0.05) between students’ scores when they were grouped on these conditions. This is an important finding, indicating that prior procedural exposure will not necessarily have an impact on the learning outcome of VR-based training.

5.3 Practical training

The observation data enabled comparisons to be made between the VR training and the practical session, and between the practical performance of the two groups (the study group, who had VR training, and the control group, who had no VR training).

The observers were asked to provide ratings against the following four questions:

  • Performance (How well did the student reach the goal?)

  • Confidence (How confidently did the student approach the task?)

  • Safety and Hygiene (How well did the student follow safety and hygiene procedures?)

  • Trainer Support (How much support did the student require from the trainer?)

The observers were also asked to timestamp when the different tasks in the procedure were completed, to record whether the student met the goal of each task adequately, and to note whether the student had made any errors and what they had done wrong (an error analysis is not included within the scope of this paper and is published by Kennedy et al. (2023)).

Table 14 shows a summary of the average observer ratings.

Table 14 Observer ratings (mean value)

5.3.1 Comparison VR training to practical study group

The average ratings for the VR study group showed a marginal increase in performance (MVR = 5.20, Mstudy = 5.23) and confidence in approach (MVR = 4.68, Mstudy = 4.91) between the VR training and the practical training. The ratings showed an apparent decrease for safety and hygiene (MVR = 4.72, Mstudy = 4.26); however, it was difficult to compare these ratings across the different types of training because there were limitations on what the participants were physically able to do within the virtual task. For example, the safety and hygiene tasks had to be completed in a specific manner in VR before the procedure could continue, whereas the practical tasks may have been judged on poor aseptic technique or omission of the task altogether. These differences did not meet the threshold for statistical significance.

The final rating, on trainer support, showed a decreased need for support in the practical session after VR training (MVR = 4.64, Mstudy = 6.14), with a significant difference, p < 0.001 (Table 15). Again, it is difficult to make a direct comparison for this rating because the nature of the support needed tended to differ between the two types of training. Anecdotally, although students in both the VR training and practical training might have sought help with the puncture location and drawing of blood, the VR training group was also requesting support to use the controls (for example, two areas of difficulty were getting the pulse task to complete due to insufficient virtual downward pressure and working out how to put on the gloves within the simulation, issues which naturally did not arise in the practical session).

Table 15 Did the student require ongoing support from the trainer?

The procedure duration, completion rate and error rate cannot be directly compared between the VR training and practical training. The average procedure duration was longer for the VR training because the students were learning to use the VR controls and the criteria for passing a stage were very specific, meaning that a student needed to perform the action in a specific location, at a specific angle and for a specific period of time for the task to complete. The procedure would not continue if the task was not completed, so the VR training was far more stringent than the practical training, where students could even bypass entire tasks if they decided to. In terms of errors, the VR error rate was much lower, as the students had less freedom to make errors (constrained by what had been programmed within the simulation).

5.3.2 Comparison practical study group to practical control group

We then compared the two groups (study and control) on their practical performance using the measures of “performing the ABG collection,” “confidence,” “safety and hygiene” and “requiring assistance to perform the task.” The results revealed that, even though there was no statistically significant difference between the two groups in theoretical knowledge (knowledge test), the practical observations showed significant differences. The study group demonstrated significantly higher completion rates than the control group (t(41) = 7.679, p = 0.008) (Table 16). In terms of performance rating, the study group outperformed the control group (32.4% growth in performance). In regard to safety and hygiene measures (t(41) = 4.259, p = 0.045), the study group adhered significantly better than the control group (representing an improvement of 39.7% in adherence to safety and hygiene measures) (Table 17). On the measures of confidence and requiring ongoing support to perform the task, the study group performed better than the control group, but the differences did not reach statistical significance. The study group’s error rate was 40% lower than the control group’s error rate.

Table 16 How well did the student reach the goal?
Table 17 Did the student follow safety and hygiene procedures?

The study and control groups were compared on measures of “Enjoyment,” “Engagement,” “Stress,” “Self-efficacy” and “Intrinsic motivation.” These data were collected four days after VR training and immediately prior to attending the practical session. As part of the participants’ course requirements, both study and control groups were expected to have worked through the pre-reading and attended the theoretical session prior to the practical session.

Students were asked if they found the pre-reading material and preparation “engaging” (Mstudy = 4.27 and Mcontrol = 3.96) and “enjoyable” (Mstudy = 3.72 and Mcontrol = 3.96). The differences on these measures did not reach statistical significance. Moreover, students were compared on their “Stress” (Mstudy = 2.404 and Mcontrol = 2.763) and “Motivation” (Mstudy = 5.82 and Mcontrol = 5.61) prior to attending the practical session; the study group reported a lower score for stress and a higher score for motivation, but the differences did not reach statistical significance.

However, when students were compared on the measure of “Self-efficacy” (Mstudy = 5.04 and Mcontrol = 4.48), the study group reported higher self-efficacy and the difference between the two groups reached statistical significance (t(42) = 1.980, p = 0.027) (Table 18).

Table 18 Study vs. control group on self-efficacy t test independent sample test
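The between-group comparison uses an independent-samples t test instead; a minimal sketch (again with hypothetical scores, not the study data) is shown below.

```python
# Sketch of an independent-samples t test (e.g., study vs. control self-efficacy).
# Group score lists are hypothetical, not the study data.
from scipy import stats

study_scores = [5.2, 4.8, 5.5, 5.0, 4.6, 5.3]     # hypothetical study-group composites
control_scores = [4.4, 4.6, 4.2, 4.9, 4.3, 4.5]   # hypothetical control-group composites

t_stat, p_value = stats.ttest_ind(study_scores, control_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```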

5.3.3 Comparison theoretical knowledge score study group to control group

Although the knowledge score increased significantly from pre- to post-VR training for the study group, only a marginal difference was seen between the study and control groups (Mstudy = 3.601, Mcontrol = 3.474), and this was not statistically significant (Table 19).

Table 19 Knowledge test study-control comparison

This result is perplexing; from the data, the starting knowledge score for the study cohort (Mpre-VR = 2.73) was lower than the starting knowledge score of the control group (Mcontrol = 3.47). In a randomized trial, an even distribution would have been expected, with pre-training scores of a similar magnitude. The implication for the research is that the study group’s starting knowledge was not an advantage over that of the control group.

5.3.4 Comparison procedure duration and completion rate study to control group

The observers were asked to monitor whether the participants completed each sub-task. In total, 91% of the study group successfully completed the procedure, compared to only 57.9% of the control group (Table 20).

Table 20 Completions study to control groups

The procedure time was on average marginally lower for the study group (Mstudy = 406 s) compared to the control group (Mcontrol = 443 s); however, this was not statistically significant. As described in Table 20, the completion rate was much lower in the control group, and omission of some tasks was prevalent in that group, which would skew the total procedure duration. Therefore, a further analysis was performed to compare procedure times for only those procedures that were completed. The completed-procedure times showed no statistical difference between the study group (Mstudy = 414 s) and the control group (Mcontrol = 435 s).

6 Validation result

Using the validation framework described in Fig. 1, 87 of the total list of 109 requirements within the framework were selected for validation within this study, with one further requirement added by the stakeholders (Depth Perception/Parallax Adaptation), making a total of 88 requirement statements to validate against. During the study, a further 14 were excluded (ranked as N/A) as not applicable at this stage in the development of the system (e.g., social skills, diagnosis skills, ethical implications), leaving 74 unique requirement statements for validation; an example of the completed validation areas is presented in Table 21.

Table 21 Example of validation areas against requirements statements

The stakeholders of the system were asked to assess the evidence provided against the validation areas and decide whether (and to what extent) each requirement had been met. Table 22 shows a summary distribution of the requirements against how well they were validated (met) in the VR training system being studied, expressed as the percentage of requirements meeting different levels of validation. The distribution is skewed toward the higher validation bands: 89.2% of the 74 requirements were met to some level (2, weakly met, or better).

Table 22 Validation bands

During the validation workshop, in addition to assessing how well each requirement was met, a discussion was held over how important each requirement was. This was primarily based on the customer stakeholder view (lecturers) with input from the other workshop attendees. This is an important step, as it uncovers requirement gaps for prioritization. For example, a weakly met requirement that is low in importance would not be of as much interest as a weakly met requirement that is essential. Table 23 shows the weightings distribution. Reassuringly, the distribution of weightings is skewed toward the upper left end of the table (meaning that, in general, more important requirements are being better met); however, there are some areas of concern (shown in the red rectangle). For the purposes of this study, a green rectangle has been used to bound the acceptable region (i.e., where the requirements should be if the customer is to accept the system). Based on the current ratings, a customer acceptance level of 75% has been reached.

Table 23 Validation weightings

As described in Table 23, there are areas (in the red rectangle) where the current VR training system is under-delivering (i.e., where the customer’s needs are not being met sufficiently). To make the most difference, efforts for improvement should therefore be targeted at better meeting those specific requirements. This validation approach allows system developers not only to identify but also to prioritize improvements against specific requirements; developers might, for example, start by directing attention to the requirements that are not met. Calculating the weighted average for each requirement area enables a comparison to be made between the relative strengths and weaknesses of the VR training system (Fig. 5). Each requirement area is therefore assigned a weighted average between a minimum of 1 and a maximum of 25. In this validation study, Usability (4.0), Learner’s State of Mind (6.0) and Technology Adoption (10.0) were catered for excellently (> 20, i.e., 4×5 or 5×5) and should be considered strengths of the current system capability. Expertise (9.0), Learning Experience (5.0), Technology (11.0), Learning (7.0) and VR Features (3.0) are also good strengths in the validation (> 16, 4×4). Interaction Feedback Sensory (2.0) and Controls (1.0) were of moderate strength (< 15, 3×5). With the current customer acceptance range of > 12 (acceptance boundary shown as a red dashed circle), there is one area of concern, Trainer and Feedback (8.0), which scored a weighted average of 6.7.
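One plausible reading of this weighted scoring is sketched below: each requirement contributes the product of its validation decision (1–5) and its weighting (1–5), the products are averaged per requirement area, and the result is compared against the acceptance boundary; the example ratings are hypothetical, not the study data.

```python
# Sketch of per-area weighted scoring: decision (1-5) x weighting (1-5), averaged.
# The interpretation of the scoring and the example ratings are assumptions.
ACCEPTANCE_BOUNDARY = 12  # customer acceptance range used in this study (> 12)


def weighted_area_score(requirements: list[tuple[int, int]]) -> float:
    """requirements: (validation decision 1-5, weighting 1-5) pairs for one area."""
    return sum(decision * weight for decision, weight in requirements) / len(requirements)


# Hypothetical ratings for two requirement areas
areas = {
    "Usability": [(5, 5), (4, 5), (5, 4)],
    "Trainer and Feedback": [(2, 5), (1, 4), (3, 3)],
}

for name, reqs in areas.items():
    score = weighted_area_score(reqs)
    status = "within" if score > ACCEPTANCE_BOUNDARY else "below"
    print(f"{name}: {score:.1f} ({status} the acceptance boundary)")
```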

Fig. 5
figure 5

Average weighted validation scores per level 1 requirement areas

7 Discussion

The overall aim of the paper was to provide a validation of the HMD-VR training system. This was performed utilizing both qualitative and quantitative approaches. The study began by identifying and measuring a number of factors affecting skill acquisition using the VR-based training platform. Prior to training, students reported a positive state of mind for learning. Post-VR training, students reported a highly positive learning experience with the HMD-VR training system. Low levels of simulation sickness, physical strain and stress, coupled with high levels of enjoyment, engagement, presence and fidelity, were identified as factors affecting the overall training experience. Students felt significantly greater levels of intrinsic motivation and self-efficacy after VR training. Students’ perceived global workload was lower than that reported in approximately 75% of benchmarked workload studies involving medical and video game type tasks. However, students reported that they would prefer to have a trainer present (either physically co-located or available within the virtual environment). In terms of learning outcomes, high scores were recorded for active learning, cognitive benefit and reflective thinking. Students performed significantly better in the knowledge test after they had undertaken the VR training. Interestingly, prior experience (having performed or observed ABG collection) did not have an impact on students’ knowledge test scores.

Moreover, the results also revealed that the training improved the skill acquisition of the candidates, and VR-trained students had improved self-efficacy after VR training which carried over into their practical training. The observer ratings showed that students with VR training required significantly less support in the practical session, significantly outperformed the control group (32.4% improvement) and followed safety and hygiene procedures significantly better than the control group (39.7% improvement). Students in the study group had a significantly higher completion rate than the control group; however, the procedure duration was not statistically different between the study and control groups.

Lastly, the VR-HMD training system was validated against 11 key areas, with these requirements grouped into 3 super-categories: (1) Design Considerations, (2) Learning Mechanisms and (3) Implementation Considerations. The validation demonstrated validity against 90% of the requirements (at a weakly met or higher level), with 73% strongly or fully met.

7.1 Design considerations

Some of the design considerations of an ideal VR system are identified below, based on the literature and studies conducted previously. Firstly, the simulation needs to use scenarios from real case studies (i.e., what you wish to train for) (Vaughan et al. 2016), and secondly, the simulation should provide a controlled environment that is low risk and safe for both the trainee and others present (Bielsa 2021; Bernardo 2017). Consideration should also be given to the unintended consequences of low-risk training (e.g., a trainee’s translation of the training experience may result in perception of real-life risks becoming artificially low, particularly if the simulation was a no-fail scenario). It is therefore important not to reduce risk perception for students, particularly in no-fail simulations; this can be addressed by displaying the consequences of their actions/errors (such as a blood spurt if no gauze is applied, or a wrist flinch if the needle is placed incorrectly). In terms of skills acquisition, the VR system should also have appropriate levels of procedural realism so that skills learnt undertaking simulated tasks can be translated into real-life skills. For example, many surgical procedures involve fine movements and positioning; the VR control elements’ positional granularity must be at least equal to that required by the task (Lopes et al. 2017). It was observed that learners’ interaction with, and control over, the virtual training environment is crucial for training. For VR systems, the position of the user and dynamic changes in position need to be tracked and translated into both the controls and the corresponding field of view, such that the user feels immersed and mismatches between real and perceived interactions are avoided. Kinesthetic manipulation and control over virtual elements allow trainees to have a realistic learning experience that is more consistent with real-life training. It is crucial for the VR scenario to be perceived as credible. Mäkinen et al. (2022) also discussed the manipulation of VR content via hand controllers as being beneficial to learning. Meeting this requirement will have an impact on the perceived procedural realism. The level of physical fidelity should also be matched to the skill being acquired and is closely linked to the sense of immersion (Pedram et al. 2020). The simulation should provide the capability to display realistic cause and effect and feedback based on the actions of the user (Barteit et al. 2021). To support this, the hardware needs to provide optimal refresh rates, frame rates, display sizes and display resolutions and support interaction.

7.2 Learning mechanisms

In general, for VR to enhance the quality of training, the platform must match the capabilities of the technology to the demands of the task (Rahman et al. 2020). Capability matching means that the simulated environment should match the physical requirements of the given task; for example, if the procedure takes place in an operating theater, the simulated environment should be similar in terms of physical appearance and soundscape (Bernardo 2017). VR creates an opportunity to enhance the quality of training by recreating experiences and training conditions similar to those of the physical world (Rahman et al. 2020). The technology must facilitate training in a way that enhances users' learning behavior; when technology is appropriately employed to solve learning tasks, it can improve the learning experience through reflective thinking and influence learning outcomes.

The other important factor is social presence, which refers to the user's perception of the quality of the tool in fostering the social aspect of the experience (Pedram et al. 2020). This includes, for instance, having the trainer present in the form of a virtual tutor who can assist trainees through the training task. By providing feedback, trainers can support learners throughout the training process. Fairén et al. (2020) reported that students generally preferred to receive feedback while they were in the training environment. It was also observed that 68% of the students in this study preferred the trainer to be involved in the training process and to provide feedback. One of the greatest affordances of VR is the opportunity for repetitive training in which trainees' errors can be identified and corrected through feedback from the trainer or the system (Bielsa 2021), which in turn creates an opportunity for reflective thinking and learning. Consistent feedback during training can reduce the learner's cognitive load and allow learners to focus on the task (Breitkreuz et al. 2021). As reported by Bernardo (2017, p. 1026), ideal learning can occur under the following conditions: "feedback during training, repetitive practice, curriculum integration, range of difficulty level, multiple learning strategies, capturing of clinical variation, a controlled environment, individualized learning, defined outcomes, and validity."
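
As a conceptual Python sketch of how immediate, per-step feedback and repetition might be structured in such a system (the step names, checks and simulated trainee below are invented for illustration and do not describe the implementation evaluated here):

    # Conceptual sketch of a repetitive-practice loop with immediate,
    # per-step feedback. Steps, checks and the simulated trainee are
    # illustrative assumptions only.
    STEPS = [
        "perform hand hygiene",
        "apply gauze after needle withdrawal",
    ]

    def simulated_trainee(step_name, attempt_no):
        # Pretend the trainee performs each step correctly on the second try.
        return attempt_no >= 2

    def run_session(steps, performs_correctly, give_feedback=print):
        """Let the trainee repeat each step until correct, giving immediate
        corrective feedback on every error; return the total error count."""
        errors = 0
        for step in steps:
            attempt_no = 0
            while True:
                attempt_no += 1
                if performs_correctly(step, attempt_no):
                    give_feedback(f"{step}: correct.")
                    break
                errors += 1
                give_feedback(f"{step}: not yet correct, try again.")
        return errors

    print("errors before mastery:", run_session(STEPS, simulated_trainee))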

As reported by the lecturer, and as previously indicated by Bielsa (2021), mixed-modal training has proven essential and superior to the traditional Halstedian approach for the acquisition of surgical skills. Simulation-based training can help trainees acquire the necessary skills and knowledge more effectively. Virtual environments give trainees the opportunity to visualize 3D relations and gain experience in mixed reality in ways that would never be possible in the real world (Mehrotra and Markus 2021). That said, the simulation should become a component of a structured curriculum in which the trainee can benefit from a sequential and constructive learning process (Bernardo 2017).

7.3 Implementation considerations

In general, the system must be fit for purpose and provide improvement over, or alongside, existing training mechanisms if it is to be a suitable candidate for adoption. Although the acquisition of certain skills benefits from immersive VR, some applications have shown no advantage over less immersive training solutions (McKnight et al. 2020; Jensen and Konradsen 2018). For successful implementation of a VR-HMD training system, medical trainers or lecturers should be involved in the development of the system, providing knowledge of the medical scenarios, anatomical and physiological landscape, and technical skills and techniques, in addition to driving the learning outcome requirements and integration into the curriculum. Moreover, any new system adopted for use in medical education must adhere to the relevant regulations, e.g., medical image privacy and patient data protection (Masuoka et al. 2019), ethical implications for both users and patients, and HIPAA compliance (McKnight et al. 2020).

7.4 Limitations

This study has a number of limitations:

  • Our findings are based on 44 medical students; it would be beneficial to extend this study to include more participants and greater diversity, to investigate the extent to which the findings differ.

  • In our study, we focused on a single training scenario involving ABG (Arterial Blood Gas) collection. We acknowledge that the alignment of the technology with the task requirements, also known as technology-task fitness, is a critical factor in effective learning. Therefore, further investigation using different scenarios on the training platform would be beneficial in order to draw more conclusive findings.

  • The assessment of training transfer quality, which reflects effective learning, was evaluated through multiple methods: (1) trainee self-reported perceptions of learning, (2) pre–post-differences in trainee competency test scores (taken before training and prior to practical training), (3) error reduction rate and (4) performance observation by assessors. While an ideal approach would involve workplace evaluation of skills and competences, measuring training transfer in the actual clinical setting was not feasible in this study. Nonetheless, future research should aim to investigate the transfer of training from VR-based training to real-world medical practice, where feasible, to accurately assess the learning benefits of immersive VR training in the context of medical education.

  • Due to time constraints, the study was limited to the outcome of a single VR-based training session; a longitudinal study is required to understand the learning curve for VR clinical skills training. It would also be advantageous to compare the effect of repeated exposure and practice over time between VR-based training and other modes of training.

8 Conclusion

The overall aim of the research was to validate the VR-based training system for medical training. This was performed using a systematic approach, validating the system against predefined requirements, and the study found a user acceptance level of 75%. This enabled the identification of weaknesses in the current system and possible future directions. In this paper, within- and between-subject design approaches were used, firstly, to assess validity against the requirements, and then to determine whether VR training improves candidates' skill acquisition and which factors affect the magnitude of that acquisition. Students were therefore divided into study and control groups and compared on their level of skills acquisition. Results revealed that students exposed to VR training (the study group) outperformed the control group in practical clinical skills training tasks and also adhered to better safety and hygiene practices. The study group also had a greater procedural completion rate than the control group. Students showed increased self-efficacy and knowledge scores immediately post-VR training, and prior ABG training did not affect VR training outcomes. Low levels of simulation sickness, physical strain and stress, coupled with high levels of enjoyability, engagement, presence and fidelity, were identified as factors affecting the overall training experience. In terms of learning, high scores were recorded for active learning, cognitive benefit and reflective thinking.