1 Introduction

Safety training is widely recognized as an important means to reduce injuries, illnesses, and fatalities. Although safety training is considered beneficial, its effectiveness varies (Ricci et al. 2016; Robson et al. 2012). One of the main reasons for this variation is the different delivery methods used by training organizations (Burke et al. 2006). Although classroom training is the most commonly used method, it has been found to be ineffective in achieving several desired outcomes, such as safety knowledge acquisition and improvements in attitudes, beliefs, behavior, and health (Ricci et al. 2016). Traditional delivery methods have several recognized limitations, such as limited levels of engagement (Burke et al. 2006), difficulty in transferring the training to the real world (Gao et al. 2019), time inflexibility, and training inconsistencies due to instructor dependency. These limitations also apply to other types of skills training that have indirect safety outcomes, such as surgical, equipment assembly, operations, or maintenance training (i.e., safety-relevant training). Because the majority of workplace accidents are considered preventable, continual improvements in safety and safety-relevant training are required to improve safety outcomes and reduce the number of workplace accidents. Virtual reality (VR) technology presents an important opportunity to improve the effectiveness of safety and safety-relevant training because of its increased level of presence, the ability to fail safely, and its capability to present scenarios that are difficult to replicate in the real world due to financial constraints or safety concerns.

Advancements in VR technology and its availability, along with the realization of the opportunities that VR technology presents, are among several factors that have contributed to the rapid increase in research on the use of VR technology for safety-relevant training. Despite this increase, research focusing on how effectiveness evaluations for VR training are conducted remains limited. When evaluations are performed, the evidence supporting effectiveness is either limited (Gao et al. 2019; Narciso et al. 2021) or of low quality, for example because of small sample sizes or questionable research designs (Jensen and Konradsen 2018; Renganayagalu et al. 2021; Tichon and Burgess-Limerick 2011). Although the consensus on the prospects of VR for training is positive, any perceived benefits need to be validated. Renganayagalu et al. (2021) presented an important finding that review studies investigating the effectiveness of VR for training are lacking. Before efforts are put toward evaluating the effectiveness of VR for training, it is important to first understand how these evaluations should be performed to ensure results are both valid and meaningful. This paper aims to understand how VR training evaluations are currently performed to provide insights into the design of VR training evaluations. These insights are expected to help guide future work in determining a more consistent and standardized approach for VR training evaluation. A more standardized approach is expected to lead to an increase in consistent and comparable research to better determine if VR is an effective tool for training and how it can be applied to safety-relevant training in general.

Although numerous opportunities for the use of VR technology in safety-relevant training have been identified, existing review studies have targeted specific sectors, such as health services, construction, disaster preparedness, and mining. Other studies, such as Renganayagalu et al. (2021) and Narciso et al. (2021), examined all professional training, including safety-relevant training. In these studies, safety was not discussed in isolation, making the value of VR for safety-relevant training difficult to determine. This paper aims to extend existing research by reviewing evaluations used for safety-relevant training. It does so by first determining how VR safety-relevant training is evaluated and what methods are used (e.g., study objectives, study designs, and application domains). This paper also analyzes the evaluation measures used in VR safety-relevant training and categorizes them based on Kirkpatrick’s four-level model to provide an overview of common measures used.

2 Background

This section provides a theoretical framework of training effectiveness evaluation based on Kirkpatrick’s four-level model (Kirkpatrick 1976), followed by an overview of VR and safety-relevant training and related literature reviews.

2.1 Kirkpatrick’s four-level model for training evaluation

The definition of effectiveness must first be described before performing evaluations. According to the Oxford Dictionary, effectiveness is “the degree to which something is successful in producing a desired result.” In the case of training effectiveness, the desired result differs depending on what is being trained. Kirkpatrick (1976) divided the process of evaluating the effectiveness of training into achievable measures in his four-level model (Fig. 1). In level 1, the measure of reaction gauges trainees’ attitudes toward the training. In level 2, the measure of learning determines the level of knowledge and skills acquired. In level 3, the measure of behavior determines the level of change in trainees’ related behavior after training. In level 4, the measure of results relates to tangible improvements, such as a reduced number of safety incidents, increased revenue, or reduced operational costs. The Kirkpatrick model is hierarchical, meaning that each level is to be focused on and measured before progressing to the next higher level of the model. This is because higher-level measurements are not expected to change if the lower levels of the model have not been satisfactory (Salas et al. 2012). For example, if trainees consider the training to be designed and delivered poorly, the training is unlikely to produce a meaningful increase in learning and unlikely to lead to behavioral changes or tangible results, as a poor reaction tends to lead to trainee inattention and lack of engagement. According to Kirkpatrick and Kirkpatrick (2007, p. 123), even when increases in the higher-level measurements are observed after training, evaluating levels sequentially is essential for building a compelling chain of evidence. When a chain of evidence is formed, the value of training becomes more meaningful (Kirkpatrick and Kirkpatrick 2007, p. 123).
For example, an organization may implement new health and safety training with an observable reduction in accidents (level 4 of the Kirkpatrick model). This observable reduction in safety-related accidents can be due to a multitude of factors, with the introduction of new training being one possible factor. As the relationship between each factor and the observable result is indirect, it is difficult to ascertain which factors contributed to the reduction of accidents. Using Kirkpatrick’s framework makes it easier to find these relationships between the training and a reduction in safety-related accidents by using the four levels of training evaluation to build a chain of evidence.

Fig. 1 Kirkpatrick’s four-level model of training evaluation (Kirkpatrick 1976)

Several newer evaluation models have been proposed after the introduction of Kirkpatrick’s model. Kraiger et al. (1993) suggested that learning outcomes are multidimensional, that is, learning should be evaluated by assessing the changes in trainees’ cognitive, skill or affective capabilities. Kraiger et al. (1993) believed that Kirkpatrick’s model is unclear in specifying these expected changes and their corresponding assessment technique. Kraiger et al. (1993) proposed a classification scheme to address this concern. Holton (1996) raised concerns with Kirkpatrick’s model, particularly on the implied causality of the levels in the model and suggested that it is a taxonomy used beyond its scope rather than a model. Holton (1996) stressed the importance of intervening variables and identified three elements—ability, motivation, and environment—which can influence primary outcomes. Holton (1996) introduced three primary outcomes in his model: learning, individual performance, and organizational results. A model aimed at integrating multiple models together has also been established by Alvarez et al. (2004). This hybrid model combines four existing models proposed by Kirkpatrick (1976), Tannenbaum et al. (1993), Holton (1996), and Kraiger (2002). Alvarez et al. (2004) performed a review of empirical studies that analyzed the factors affecting the effectiveness of training and included the findings in their model. Although the hybrid model presented is more elaborate, Kirkpatrick’s footprints are still evident considering that the other three included models are derived from Kirkpatrick’s model. Upon examining seven other training evaluation models (including Holton’s (1996) model), Reio et al. (2017) found that most training evaluation models are variations of Kirkpatrick’s four-level model. 
Despite the criticism it has received over the last five decades and the proposals of newer evaluation models, Kirkpatrick’s model continues to be the most relevant and widely used model by different organizations as the basis for training evaluation (Bates 2004; Reio et al. 2017). The strength of Kirkpatrick’s model comes from its simplicity and practicality (Reio et al. 2017), making it the most widely accepted and influential model (Phillips 2003). On the other hand, criticism of the model mainly stems from a lack of evidence for the sequential relationships between the levels (Alliger and Janak 1989). After careful consideration of the literature, Kirkpatrick’s four-level model was used in this review as the basis for categorizing evaluation measures of VR safety-relevant training. Kirkpatrick’s levels were used solely as a taxonomy, drawing on the model’s applicability without relying on the relationships between the levels or their relative importance, thereby avoiding its major criticism.

2.2 VR and safety-relevant training

The term VR encompasses a wide range of technologies with one important characteristic, the ability to subject a user to an artificially generated environment. This range of technologies is often described by its level of immersion, which is the technical capabilities of a system to not only substitute real sensory information with computer-generated ones, but also to support natural actions for perceiving this information (Slater and Sanchez-Vives 2016; Slater 2009). For example, a highly immersive VR system such as the head-mounted display (HMD) allows its user to perceive visual and auditory (and sometimes tactile) information using natural actions such as reaching a hand to touch objects and turning the head to change viewpoint. In comparison, a low immersive VR system such as a desktop VR, despite also facilitating visual and auditory perceptions, requires unnatural actions such as using a computer mouse and keyboard to perceive the same sensory information. Another important concept in VR literature is presence, which is a user’s subjective experience of being in the virtual environment (Schubert et al. 2001). Acknowledging that presence and immersion are sometimes used interchangeably or have different definitions (e.g., Witmer and Singer 1998), this paper follows the distinction between presence as a subjective experience and immersion as system characteristics as described previously.

The first wave of VR in the 1980s and 1990s created widespread public attention with many recognizing VR’s potential to solve various real-world problems (Slater and Sanchez-Vives 2016). Researchers recognized the potential of VR in education and training with early applications in aviation (Blake 1996; 1995; Page 2000), health (Satava 1995), and military sectors (National Research Council 1995; Hill et al. 2003). During this period, VR was still uncomfortable, experimental, and expensive, meaning that it was not viable for most users. As a result, attention from the public towards VR technology subsided, ending the first wave of VR. However, recent advancements in VR technology through the release of the Oculus Rift DK1 HMD in early 2013 marked what many refer to as the second wave of VR. This release was followed by the introduction of reliable, comfortable, and affordable commercial HMD devices, which helped bring comfortable and affordable VR technology to the masses. Besides the reduction in price, improvements can also be seen in the usability and ergonomics of VR hardware such as a reduction in weight and simpler setup, as well as in the quality of its output such as higher refresh rate, wider field of view, and introduction of haptic feedback.

This recent advancement in VR technology offers several opportunities that may help overcome current safety-relevant training limitations. For example, VR can simulate dangerous and difficult situations while still maintaining the safety of trainees (Czarnek et al. 2020; Freina and Ott 2015; Pedram et al. 2020); help motivate learners and keep them engaged with the learning activity (Freina and Ott 2015; Gao et al. 2019; Li et al. 2015; Sacks et al. 2013); and expose trainees to realistic simulated hazards, which can improve the overall effectiveness of safety training (Burke et al. 2011). VR may also assist in transferring learning outcomes to the real world by recreating a learning context that is highly realistic (Freina and Ott 2015; Ganier et al. 2014; Rose et al. 2000). Another advantage of VR training is that it can be accessed on demand, reducing potential scheduling issues and supporting just-in-time training. Finally, because the training is delivered by a system, it can be delivered more consistently, removing discrepancies between trainers that arise from how they choose to deliver the content and from other human factors.

2.3 Related literature reviews

Several related literature reviews have been identified that address the use of VR for learning and training. Abich et al. (2021) provided a domain-agnostic review to identify evidence of improvements in knowledge, skills, and abilities when using VR for training. The review covered nine application domains: aviation or aerospace, industry or manufacturing, military, first responder, general, medical, safety, education, and assembly. Radhakrishnan et al. (2021) performed a systematic review of immersive VR for skills training, which also included a wide range of industries. Suh and Prophet (2018) provided an overview of immersive technology research in various areas, such as education, marketing, business, and healthcare. Jensen and Konradsen (2018) investigated the use of VR for the education and training sector and narrowed the scope to only include HMDs. Apart from assessing the quality of the studies, Jensen and Konradsen (2018) emphasized how factors such as immersion, presence, physical discomfort, and attitude toward HMD technology influence learning outcomes (i.e., cognitive, affective, and psychomotor skills acquisition).

Other studies have focused on narrower scopes, including education (Merchant et al. 2014; Radianti et al. 2020; Pellas et al. 2021) and training (Narciso et al. 2021; Renganayagalu et al. 2021). Education is the process of learning with the objective of acquiring knowledge, typically undertaken systematically in institutions such as schools and universities, whereas training refers to the process of learning with the objective of performing specific tasks or applying specific skills. Merchant et al. (2014) performed a meta-analysis examining the effectiveness of desktop-VR-based instruction with respect to learning outcomes for students in K-12 and higher education. Similarly, Pellas et al. (2021) targeted K-12 and higher education, although with a focus on immersive VR. The research by Radianti et al. (2020) complements the work of Merchant et al. (2014) and Pellas et al. (2021) by reviewing applications of immersive VR for higher education only. Narciso et al. (2021) synthesized the use of immersive VR for professional training, focusing on the applied domain, hardware, and stimuli used. The research provided a review of the evaluation methods used by studies and the overall effectiveness of immersive VR for professional training. However, the results were only presented as complementary because only 21 of the 66 studies included some form of evaluation. Renganayagalu et al. (2021) focused primarily on the effectiveness of VR HMDs for professional skill and safety training, including only those studies that performed evaluations. Apart from discussing the effectiveness within different domains, the authors provided a summary of the types of skills trained, the training evaluation methods used, and the types of participants studied.

With regard to safety-relevant training, review studies typically focus on a specific field, making it difficult to identify benefits that may exist for all safety-relevant training. Figure 2 presents this context gap in systematic reviews with respect to safety-relevant VR training. As can be seen, the medical and surgical field has the highest number of reviews (12 review studies) as identified by a review conducted in 2020 (Abich et al. 2021). Laparoscopic surgery was the most popular application for VR surgical training (Gurusamy et al. 2008; Nagendran et al. 2013; Yiannakopoulou et al. 2015; Alaker et al. 2016). Other reviews focused on specific surgical procedures such as microsurgery (Erel et al. 2003), orthopedic surgery (Vaughan et al. 2016; Aïm et al. 2016), neurosurgery (Pelargos et al. 2017), and ear, nose, or throat surgery (Piromchai et al. 2015). Other medical applications include renal interventions (Detmer et al. 2017) and dental medicine (Joda 2019). Besides the medical and surgical fields, other studies performed reviews for the construction (Gao et al. 2019; Li et al. 2018), disaster preparedness (Hsu et al. 2013; Feng et al. 2018), and mining (Tichon and Burgess-Limerick 2011) sectors. Li et al. (2018) conducted a critical review of the applications of VR and augmented reality technologies in construction safety. Despite having a similar context, Gao et al. (2019) focused on training effectiveness rather than the application. The research investigated whether traditional training tools or computer-aided technologies are effective for acquiring knowledge, rectifying unsafe behavior, and reducing injuries. Feng et al. (2018) conducted a systematic review focusing on immersive VR serious games aimed at indoor evacuation processes. The research focused on the development and implementation criteria and generated a conceptual framework specific to evacuation. Hsu et al. (2013) reviewed VR-based disaster preparedness and response training limited to the United States and presented the associated benefits and challenges of its implementation. Finally, Tichon and Burgess-Limerick (2011) reviewed the use of VR for safety training in the mining sector. The research identified and briefly described a series of studies, identified trends and limitations within those studies, and drew conclusions and suggestions for future research directions.

Fig. 2 Context gap

3 Methods

The following sections describe the methods used for this systematic review, including the information sources, search strategy, selection process, and data collection process.

3.1 Information sources

The literature search process was performed during the month of August 2021 and covered four interdisciplinary databases (Scopus, Web of Science, EBSCOhost, and ACM digital library) and two discipline-specific databases (IEEE Xplore for computer science and engineering and PubMed for biomedicine and health).

3.2 Search strategy

The following search strategy was performed on the six databases using the search string below.

(“virtual reality” OR VR OR “virtual environment”) AND
(evaluat* OR investigat* OR examin* OR assess* OR measur* OR compar*) AND
(effect* OR impact OR outcome) AND
(safe*) AND
(train*) AND NOT
(“machine learning” OR “deep learning”) AND NOT
(“artificial intelligence” OR “neural network”) AND NOT
(rehabilitation OR therapy)

This search string was used for initial screening in the title, abstract, and keywords fields in Scopus (i.e., TITLE-ABS-KEY), the topic field in Web of Science (i.e., TS), the abstract field in EBSCOhost (i.e., AB), all fields in PubMed, all metadata fields in IEEE Xplore, and the abstract field in the ACM digital library.

This paper focuses on studies that performed evaluations of VR training; therefore, six different synonyms of “evaluate” were used. An asterisk was used for each synonym to include different forms of the word such as “evaluate”, “evaluation”, and “evaluating”. Ensuring that these words were present helped limit the number of studies that focused on prototype development without performing any evaluation. Different words were also used to capture the outcome of the evaluation, such as “effectiveness”, “impact”, or “outcome”. The terms “safe” and “train” were separated, and asterisks were used to include extended forms and combinations, thereby capturing studies that did not explicitly use the phrase “safety training”. Similar to Radianti et al. (2020), this review excluded the terms “machine learning”, “deep learning”, “artificial intelligence”, and “neural network” to remove studies that focused on artificial intelligence without the human learning context. This review also excluded the terms “rehabilitation” and “therapy” to remove studies that focused on personal health and fitness or focused solely on physical training with no learning.
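The structure of the search string can be illustrated with a short Python sketch. This is only an illustration of the Boolean logic described above, not the tooling used for the review: the helper names (`or_group`, `build_query`, `wildcard_matches`) are hypothetical, the asterisk is emulated with simple pattern matching, and the exact operator syntax accepted by each database may differ from the generic form shown here.

```python
from fnmatch import fnmatch

# Term groups copied from the search string in Sect. 3.2.
INCLUDE_GROUPS = [
    ['"virtual reality"', "VR", '"virtual environment"'],
    ["evaluat*", "investigat*", "examin*", "assess*", "measur*", "compar*"],
    ["effect*", "impact", "outcome"],
    ["safe*"],
    ["train*"],
]
EXCLUDE_GROUPS = [
    ['"machine learning"', '"deep learning"'],
    ['"artificial intelligence"', '"neural network"'],
    ["rehabilitation", "therapy"],
]


def or_group(terms):
    """Join the alternatives of one concept with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query():
    """Combine the included concepts with AND and append each exclusion with AND NOT."""
    query = " AND ".join(or_group(group) for group in INCLUDE_GROUPS)
    for group in EXCLUDE_GROUPS:
        query += " AND NOT " + or_group(group)
    return query


def wildcard_matches(pattern, word):
    """Emulate the truncation asterisk for a single word (illustrative only)."""
    return fnmatch(word.lower(), pattern.lower())


query = build_query()
# Field wrapper for Scopus as named in the text; other databases use other field codes.
scopus_query = f"TITLE-ABS-KEY({query})"

print(scopus_query)
print([w for w in ("evaluate", "evaluation", "evaluating") if wildcard_matches("evaluat*", w)])
```

Keeping each concept as its own OR group makes it easy to see why “safe*” and “train*” are separate groups: a record only needs one word from each group, so the two words do not have to appear next to each other.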

3.3 Selection process

Studies identified through the database searches described above were imported into a single EndNote library. The authors then filtered out studies published prior to 2016, as that year was considered to represent the start of a new wave of VR technology signified by the commercial release of VR HMDs (i.e., Oculus Rift in March 2016 and HTC VIVE in April 2016). This second wave of VR can be observed in an increased interest in VR technology in both public and academic domains. Figure 3 shows results from Google Trends and the Scopus database that highlight this significant increase after 2016. The rationale for focusing on work conducted after 2016 is that significant advancements in VR technology at this time led to broader application and new research efforts. Although only papers published after 2016 were considered due to the release of significantly improved HMDs, this work also includes desktop VR to ensure that all levels of immersion are included for a comprehensive analysis.

Fig. 3 Google Trends searches worldwide and Scopus database number of documents by year for keywords “virtual reality” and “VR”

After isolating the appropriate period, duplicate results were removed following the guidelines by Bramer et al. (2016). Primary sorting was then conducted by scanning the title and abstract of each study based on the following inclusion and exclusion criteria.

3.3.1 Inclusion criteria

Primary studies that evaluate the effectiveness of VR for safety or safety-relevant training. Safety training is defined as training with direct safety outcomes, such as improvements in safety awareness, knowledge, or behavior and accident prevention. For example, hazard identification, electrical safety, and working at heights training are considered safety training. Safety-relevant training includes all types of skills training whose successful completion has a positive indirect effect on the trainee’s own safety or the safety of others. Examples of such training are surgical, equipment assembly, operations, or maintenance training.

3.3.2 Exclusion criteria

  • Nonempirical or secondary studies;

  • Studies that are not published in English;

  • Studies with unoriginal data;

  • Studies that do not use or focus on VR technology;

  • Studies that do not use VR for training;

  • Studies with training unrelated to safety or focus on personal health (e.g., physical therapy, rehabilitation, or sport);

  • Studies that do not evaluate the effectiveness of VR training;

  • Studies or training intended for children.

This paper limits the scope of the review to studies intended for adults, since adults learn differently from children (Knowles 1970) and therefore require separate analysis. The remaining articles that passed the title and abstract screening were then read in their entirety to confirm eligibility using the same criteria described above. For articles that developed VR training without an effectiveness evaluation but stated the intention to evaluate it in future research, efforts were made to identify these additional records. The selection and data collection processes were conducted primarily by the first author, with a list of suggested exclusions presented to the second and third authors for articles that were ambiguous. The second and third authors then reviewed the suggested articles and either rejected or accepted the exclusion. Further ambiguity in selection and data collection was resolved by discussion. The first author read all selected studies at least twice, once for selection and then for coding. A flow diagram illustrating the process and the number of articles is presented in Fig. 4.
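The selection steps described in this section (period filter, duplicate removal, and criteria-based screening) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure actually used: the review relied on EndNote and the de-duplication guidelines of Bramer et al. (2016), whereas the sketch swaps in a naive title-and-year key. The `Record` class, the `exclusion_flags` field (assumed to be filled in by a human reviewer), and the example titles are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    title: str
    year: int
    # Exclusion criteria a reviewer judged to apply (empty set = none apply).
    exclusion_flags: set = field(default_factory=set)


def normalise_title(title):
    """Lower-case the title and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def screen(records, first_year=2016):
    """Apply the period filter, a naive duplicate check, and the exclusion flags in order."""
    seen = set()
    kept = []
    for rec in records:
        if rec.year < first_year:      # studies prior to 2016 are filtered out
            continue
        key = (normalise_title(rec.title), rec.year)
        if key in seen:                # simplified stand-in for EndNote de-duplication
            continue
        seen.add(key)
        if rec.exclusion_flags:        # any exclusion criterion removes the record
            continue
        kept.append(rec)
    return kept


# Hypothetical records, for illustration only.
records = [
    Record("VR fire training", 2015),
    Record("VR Fire Training!", 2018),
    Record("vr fire training", 2018),
    Record("VR balance therapy", 2019, {"personal health"}),
]
print([r.title for r in screen(records)])
```

The order of the steps mirrors the text: the period is isolated first, duplicates are removed next, and only then are the inclusion and exclusion criteria applied.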

Fig. 4 PRISMA flow diagram of the review process

3.4 Data collection process

The following data was coded within a single spreadsheet after the study selection process was completed.

  • Bibliographic information;

  • Study purpose and training topic;

  • Application domain;

  • Study design of the experiment;

  • Whether the study focused on evaluation or development (study objective);

  • Methods and measures used for evaluating the effectiveness.

3.4.1 Bibliographic information and study purpose

Bibliographic information such as the author(s), year published, and title, along with the study purpose and training topic were coded first as the process was simply a duplication into the spreadsheet. The study purpose and training topic were obtained from explicit statements about both the aim and the training subject for each study. These were collected and used as references to the authors and were not used for any analysis.

3.4.2 Application domains

Application domain was coded primarily based on the list of industries and sectors described by the International Labour Organization (ILO; n.d.a) and the list of industries from WorkSafe Victoria (2022), with several additions from the authors. These additions were made to adjust the scope of the encompassing industry. For example, the term “maritime” or “offshore” safety was repeatedly used by studies and was therefore deemed more suitable than “shipping; ports; fisheries” or “inland waterways” as described by the ILO. Another example is the military industry, which is specific enough to warrant a separate domain rather than being included as part of “public service”. In addition, domains that were not described in either list, such as “general” and “space”, were added.

3.4.3 Study designs

The study design was coded based on the methodological criteria for safety intervention evaluation research by Shannon et al. (1999). The study design was mainly categorized into true-experimental design, quasi-experimental design, and non-experimental design. A true-experimental design includes studies that have two or more independent groups where the participant allocation is randomized. When the allocation is not randomized, or the randomization is not mentioned, it is considered a quasi-experimental design. Non-experimental design includes studies with limited baseline comparison such as post-only one-group design, post-only non-equivalent control design, and pre-post one-group design (Shannon et al. 1999). True-experimental design is the ideal research design for maximizing internal validity. A quasi-experimental design is ideal where it may not be feasible to conduct a true experiment due to practical, ethical, political, or financial reasons (Shannon et al. 1999). Non-experimental design is considered to provide the least accurate results as it lacks a baseline measure used to compare intervention results, making the conclusions regarding its effectiveness less reliable (Shannon et al. 1999). As the name suggests, post-only one-group design involves collecting evaluation data after exposing all participants to VR training. Pre-post one-group design is similar to post-only one-group design with additional data collected before the training to measure the changes after training. Post-only non-equivalent control design is similar to a quasi-experimental between-groups design, without controlling the baseline of each participant that is normally captured using a pretest or a demographic survey. Besides this, for true-experimental and quasi-experimental design, each study is also categorized into between-groups, within-groups, or mixed design. Between-group design involves studies comparing dependent variables or measures between two independent groups. 
In the criteria explained by Shannon et al. (1999), within-group design, as part of the quasi-experimental design, is limited to interrupted time-series design, which requires multiple data collection points both before and after participants undertake training. This review adopts a more general definition of within-group design, namely studies in which all participants experience all conditions. These studies include evaluations where two (or more) training conditions (e.g., traditional and VR training) are experienced by all participants, and data is collected either after each exposure (i.e., learning curve) or once at the end. A mixed design incorporates studies that have a combination of between-groups and within-groups designs.
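The coding rules in this section can be summarized as a small decision procedure. The Python sketch below is one possible reading of those rules, offered only as an illustration: the function name, its four inputs, and the order in which the rules are checked are assumptions made here, and borderline studies would still require the judgment described in Sect. 3.3.

```python
from typing import Optional, Tuple


def code_study_design(
    between_groups: bool,        # two or more independent groups are compared
    within_groups: bool,         # all participants experience all conditions
    randomised: Optional[bool],  # None = allocation method not mentioned
    baseline_captured: bool,     # pretest or demographic survey available
) -> Tuple[str, str]:
    """Return (main category, comparison structure) for one study."""
    if between_groups and within_groups:
        structure = "mixed"
    elif between_groups:
        structure = "between-groups"
    elif within_groups:
        structure = "within-groups"
    else:
        # A single group and a single condition: no comparison between conditions.
        one_group = "pre-post one-group" if baseline_captured else "post-only one-group"
        return ("non-experimental", one_group)

    # Independent groups with randomised allocation.
    if between_groups and randomised is True:
        return ("true-experimental", structure)
    # Independent groups, no randomisation, and no control of participants' baseline.
    if structure == "between-groups" and not baseline_captured:
        return ("non-experimental", "post-only non-equivalent control")
    # Allocation not randomised, or randomisation not mentioned.
    return ("quasi-experimental", structure)


print(code_study_design(True, False, True, True))
print(code_study_design(True, False, None, True))
print(code_study_design(False, False, None, True))
```

Treating an unreported allocation method (`None`) the same as a non-randomized one follows the statement that studies are considered quasi-experimental when randomization is not mentioned.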

3.4.4 Study objectives

The study objective was coded based on two categories namely, evaluation and development. Evaluation refers to studies with the main objective of evaluating the effectiveness of VR training, whether it is an existing or a newly developed prototype. A study was deemed to have evaluation as the study objective when it was stated as the aim of the study and where the primary focus of the results section was on presenting and discussing evaluation results. Development refers to studies where the main objective of the study was describing the software or hardware development of a VR training prototype with less emphasis on evaluation. This reduced emphasis was noted when the majority of the results were related to development with a separate and smaller section for evaluation of the training. The aim of defining studies using these categories is to provide insights into the research motivations and to distinguish the level of focus on improving training results. For example, a development study may focus on developing a working prototype to validate technological outcomes before improving training elements.

3.4.5 Evaluation measures

Reported measures for each study were recorded comprehensively based on the description given by the authors. The descriptions of measures for each of the studies were then grouped into categories of similar measures. The categories were then coded into one of the four levels of Kirkpatrick’s model (reaction, learning, behavior, and results). A measure is categorized as reaction if it is a subjective impression of the training obtained from surveys, comments, discussions, or interviews. A measure is categorized as learning if an objective test for either knowledge or skills was conducted. A measure is categorized as behavior if the behavior of participants was observed after the training during a period of work experience. A measure is categorized as results if it measures tangible improvements in the form of quantitative data after the training, such as human resources data on a reduction in the number of accidents. In the case of studies that reported more than one experiment in a single paper, the data collected for each experiment was weighted according to the number of experiments. For example, for papers reporting two experiments (Chen et al. 2020; Liang et al. 2019b; Clifford et al. 2019; Mirauda et al. 2020; Leder et al. 2019; Zhang et al. 2017), the coded data for each experiment was multiplied by 0.5. The purpose of this weighting is to maintain an equal value for each paper.
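The effect of the weighting can be illustrated with a short Python sketch. It assumes that each experiment has already been coded with the Kirkpatrick levels it measured, and that a paper with n experiments gives each experiment a weight of 1/n; the text states this explicitly only for two experiments (weight 0.5), so the general form is an assumption. The function name and the paper identifiers are hypothetical.

```python
from collections import defaultdict


def weighted_level_counts(papers):
    """Sum Kirkpatrick-level occurrences so that every paper carries a total weight of 1.

    `papers` maps a paper identifier to a list of experiments; each experiment
    is the set of Kirkpatrick levels for which it reported a measure.
    """
    totals = defaultdict(float)
    for experiments in papers.values():
        weight = 1.0 / len(experiments)   # 0.5 for a paper with two experiments
        for levels in experiments:
            for level in levels:
                totals[level] += weight
    return dict(totals)


# Hypothetical coded data, for illustration only.
papers = {
    "paper_A": [{"reaction", "learning"}],                # one experiment
    "paper_B": [{"learning"}, {"reaction", "learning"}],  # two experiments
}
print(weighted_level_counts(papers))
```

In this example, paper_B contributes 0.5 + 0.5 = 1 to the learning level and 0.5 to the reaction level, so a paper with two experiments never counts for more than a paper with one.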

4 Results and discussion

This section presents the results and discussion of the systematic review to provide an overview of how VR safety-relevant training is currently being evaluated for its effectiveness (Sects. 4.3) with an analysis of evaluation measures used, which are then categorized into the four levels of Kirkpatrick’s model (Sect. 4.4).

4.1 Application domain

Figure 5 presents an overview of the number of studies in different application domains for VR safety-relevant training. Health services was the most active domain, accounting for 36.03% of all studies reviewed; the breakdown of the target audiences within this domain is also presented in Fig. 5. Construction was the second most active domain with 19.12% of all studies. Studies without a specific application domain (e.g., disaster preparedness or slip, trip, and fall prevention training) were categorized as “general” and represent 8.82% of all studies. Another 8.82% of all studies were conducted within the transportation domain, and 6.62% of all studies were related to engineering. Nine further application domains represented the remaining 20.59% of all studies and were categorized under “other domains”; each of these domains comprises five studies or fewer. Application domains represented in the “other domains” category included utilities, manufacturing, maritime or offshore, education, automotive, emergency services, mining, military, and space.

Fig. 5

Training application domain out of all 136 studies (left) and target audiences of VR training out of 49 studies in health services (right)

4.1.1 Health services

Within health services, three key professions had a strong focus on VR training: surgeons, medical doctors, and nurses. More than half of the studies within health services (55.10%) are aimed specifically toward surgeons, as can be observed in the right chart of Fig. 5. This is expected as surgery training is one of the earliest adopters of VR, and the abundance of commercially available VR simulators for surgery has resulted in a significant number of studies evaluating and validating their effectiveness. For example, dV-Trainer (Mimic Technologies Inc, Seattle, WA) has been frequently used and studied as a simulator for training robot-assisted surgery using the da Vinci Surgical System (Intuitive Surgical Inc, Sunnyvale, CA). Furthermore, since surgery is a highly dangerous procedure, competence must be confidently ensured. To do so, rigorous evaluations are conducted to validate the effectiveness of VR training and to confirm its ability to accurately measure the level of competency. Studies aimed at medical doctors (excluding surgeons) represented 14.29% of all studies in health services; this group includes general practitioners and medical specialists, such as neonatologists (Xiao et al. 2020), anesthesiologists (Casso et al. 2019; Shewaga et al. 2018), and cardiologists (Jensen et al. 2016). Studies aimed at nurses were similarly represented, accounting for 16.33% of all health services studies. VR training aimed at nurses is relatively diverse and ranges from direct safety training (e.g., operating room fire, infection control, and home hazards training; Polivka et al. 2019; Rossler et al. 2019; Yu et al. 2021) to procedural safety-relevant training (e.g., catheter insertion, chemotherapy administration, childbirth, surgery assistance, and nasal sample collection training; Butt et al. 2018; Cecil et al. 2021; Chan et al. 2021; Chang et al. 2019; Edwards et al. 2021). Other professions in health services included general healthcare staff (Rahouti et al. 2021) and laboratory technologists (Prendinger et al. 2016).

4.1.2 Construction

The construction domain also includes a substantial number of VR safety training studies with the majority of those studies focusing on general occupational health and safety and hazard identification and management training. In addition to a strong focus on typical hazards found on construction sites, a smaller number of studies focused on specific hazards and machinery, such as demolition robots (Adami et al. 2021), cranes (Dhalmahapatra et al. 2021; Song et al. 2021), precast/prestressed concrete (Joshi et al. 2021), and rollers (Vahdatikhaki et al. 2019). The focus on improving the safety of construction workers is expected to continue as fatality rates in construction have remained relatively consistent over the past 5 years with 3.2 and 3.1 deaths per 100,000 workers in Australia during 2015 and 2020 respectively (Safe Work Australia 2016a, 2021) and 10.1 and 10.2 in the USA during 2015 and 2020 respectively (Bureau of Labor Statistics 2016, 2021). Construction represented the third highest fatality rate in Australia in 2020 behind agriculture, fishing, and forestry; and transport, postal, and warehousing (Safe Work Australia 2021). The high number of studies that focus on evaluating the effectiveness of VR training applications in the construction domain suggests that improvement is a strong focus, with researchers attempting to use innovative technology to improve safe behavior to reduce accidents. However, unlike VR surgical simulators, VR safety-relevant training in the construction domain is largely one-off experimental prototypes, often lacking in follow-up improvements and evaluations or actual implementations. This is perhaps due to the difference in the level of incentive to implement VR training between the construction and health services domain such as surgery training. 
In construction, traditional training is relatively easy and inexpensive to deliver, with the training typically being conducted by outsourcing the training to a third party (i.e., training providers). There are often multiple training providers available in an area providing compliance training of similar quality with competitive fees. In comparison, the need for effective VR surgery simulators is greater due to the expensive and logistically demanding nature of traditional surgery training. Expert surgeons able to assist in training are less available, and traditional surgery training typically involves the elaborate use of cadavers or expensive realistic models. Although the implementation rate of VR safety-relevant training may be slower for the construction domain compared to surgery training, if the validity and effectiveness of VR safety-relevant training in construction is increasingly established, it is likely to increase demand and development and lead to commercially viable VR training systems rather than prototype solutions. Recently, attempts have been made to commercialize VR training in the construction domain such as by Pixo VR (n.d.a) and Next World Enterprises (2022). Future research should focus on continually evaluating the effectiveness of VR training in the construction domain with a focus on implementing such training into organization training practices to progress from the current experimental prototypes to a more widely adopted and commercially viable training system.

4.1.3 General

The category “general” presented in Fig. 5 (left) represents VR training for which the application is broader than a single application domain. Examples included studies that focused on general public training on preparing for disasters such as fire (Benvegnù et al. 2021; Fu and Li 2020; Liang et al. 2020, 2019a; Lovreglio et al. 2021; Saghafian et al. 2020), earthquake (Li et al. 2017; Liang et al. 2018), and active shooters (Sharma et al. 2020); slip, trip, and fall prevention training (LoJacono et al. 2018; Weber et al. 2020); and general indoor hazard awareness and response training (Cavalcanti et al. 2021).

4.1.4 Transport

Transport safety training studies focus on road (6 studies), civil aviation (5 studies), and railway safety (1 study). VR training for improving road safety is typically aimed at training regular drivers’ risk awareness and safe driving habits (Agrawal et al. 2017; Lang et al. 2018; Suto et al. 2020). Other road safety studies focus on efficient paths for performance driver training (Simpson and Rafferty 2020), interactions with automated cars (Sportillo et al. 2018), and bicycle safety training (Tsuboi et al. 2018). Four of the five studies that focus on civil aviation trained passengers and flight crews on emergency-related procedures. The remaining transport study focuses on railway safety with respect to crane operations and signalling training for clearing railways after an accident (Xu et al. 2019b).

4.1.5 Engineering

The engineering domain includes studies that are applicable to all fields of engineering, such as job safety analysis training (Ta et al. 2019), workshop and laboratory safety training (Makransky et al. 2019; Mondragón-Bernal 2020; Sim et al. 2019), and discipline-specific training aimed at chemical (Chen et al. 2020; Colombo and Golzio 2016; Ouyang et al. 2018), mechanical (Keßler et al. 2020), and gas engineers (Asghar et al. 2019).

4.1.6 Other domains

A strong focus on VR training applications for safety-relevant training is prominent within safety-critical fields, such as health services, construction, transport, and engineering. Although not as strong, interest in VR training also exists in other safety-critical domains, such as utilities (Avveduto et al. 2017; García et al. 2016; Herrington and Tacy 2020; Kwegyir-Afful and Kantola 2021; Mirauda et al. 2020), manufacturing (Caporusso et al. 2019; Lacko 2020; Leder et al. 2019; Torres-Guerrero et al. 2019), maritime or offshore (Chae et al. 2021; Jung and Ahn 2018; Jung and Kim 2017; Smith et al. 2019; Smith and Veitch 2019), automotive (Borsci et al. 2016; Cooper et al. 2021; Sebastian et al. 2018), emergency services (Chen et al. 2021; Druzhinina et al. 2019; Prasolova-Førland et al. 2017), mining (Liang et al. 2019b; Zhang 2017), military (Clifford et al. 2019; Salcedo et al. 2016), and space (Liu et al. 2016). These findings are consistent with the need for safety-critical fields to always improve safety and prevent accidents with effective training considered an important factor in reducing safety incidents.

4.1.7 Agriculture, fishing, and forestry

It is interesting to note that no studies were found that have a focus on VR training for the agriculture, fishing, and forestry industry. This is unexpected considering this industry had the highest fatality rates for 2015 both in the USA (Bureau of Labor Statistics 2016) and Australia (Safe Work Australia 2016a) with 22.8 and 16.7 fatal injuries per 100,000 workers respectively. Although these fatality rates are decreasing, it remains the industry with the highest fatality rate in 2020 with a rate of 21.5 and 13.1 in the USA and Australia respectively (Bureau of Labor Statistics 2021; Safe Work Australia 2021). One possible factor for the lack of VR safety training in this domain could be the lack of safety training applied in general. The agriculture, fishing and forestry domain is unique in that a large proportion of the workforce are self-employed (Chapman and Husberg 2008; Safe Work Australia 2016b) without access to dedicated safety officers (McBain-Rigg et al. 2017) making it more difficult to ensure safety compliance when compared to other domains where the majority of the workforce is directly employed with an organization. The high fatality rate and the inability to reduce accidents using conventional methods demonstrate that the agriculture, fishing and forestry domain is still lagging behind other domains when it comes to the implementation of occupational health and safety. This lack of focus has likely resulted in a low motivation to continually improve and therefore there is a lack of desire to use innovative technologies such as VR. However, the shortcomings of conventional methods may provide an opportunity for alternative methods such as VR training as a potential training solution. The majority of fatalities in the agriculture, fishing and forestry domain involved the (mis)use of vehicles such as tractors (Safe Work Australia 2016b). 
VR training may provide a safe alternative to train employees on how to safely operate such vehicles without exposing them to physical risk. This can be done through the use of vehicle simulation as has been done for other types of vehicles such as those solutions in the transportation domain. Furthermore, the recent availability of mobile and portable VR hardware may also provide a greater opportunity to deliver safety-related training in rural and remote areas where work in this domain is commonly performed. Future researchers are encouraged to develop and evaluate the effectiveness of VR training applicable to the agriculture, fishing, and forestry domain.

4.2 Study objectives

This review categorizes papers into two main study objectives: evaluation and development. As illustrated in Fig. 6, 61.03% of the studies aimed to evaluate the effectiveness of VR training. Effectiveness was typically evaluated directly by assessing the effectiveness measures of the overall VR training either before and after training or against a control group (see Sect. 4.3). Besides this direct evaluation, other studies evaluated effectiveness indirectly by focusing on specific factors expected to influence the effectiveness of training. Examples of such factors were the immersion of the input and output devices, different learning strategies, anxiety levels (Kwon 2020), confidence levels, and prior experiences (Truong et al. 2021). Studies that investigated how the level of immersion can impact VR training effectiveness evaluated the use of different interfaces and display types, such as desktop or HMD (Buttussi and Chittaro 2018; Jung and Ahn 2018), virtually generated or 360° imagery (Moore et al. 2019), stereoscopic display (Tawadrous et al. 2017), and different spatial references (Fu and Li 2020); different controls (e.g., keyboard and mouse, controllers, hand tracking or no control (Burigat and Chittaro 2016)); and presence of additional multisensory stimuli, including haptic feedback (Cooper et al. 2021; Francone et al. 2019; Kim et al. 2020; Simpson and Rafferty 2020). The influences of learning strategies in VR training were evaluated in terms of utilizing gamification (Cavalcanti et al. 2021), presenting safety instructions with positive or negative consequences (Shi et al. 2019), lecture-based or mastery learning (Smith and Veitch 2019), and other strategies (Orland et al. 2020; Salcedo et al. 2016; Sebastian et al. 2018). The remaining 38.97% of studies focus on describing the development of VR training with supplementary effectiveness evaluation.

Fig. 6

Percentage study objective overall (inner circle) and percentage study designs by study objective (two outer circles)

4.3 Study designs

As observed in Fig. 6, the majority of studies with evaluation as the main objective performed either a true experiment or a quasi-experiment, representing 27.94% and 22.79% of all studies, respectively. Together, these account for 83.12% of “evaluation” studies. In contrast, over half of the “development” studies used a non-experimental design with post-only one-group design being the most common (i.e., 11.76% of all studies). It is suggested that a non-experimental design is preferred for development studies due to its ability to quickly and easily identify quantifiable values of development, such as using simple usability or user experience questionnaires. The facts that the majority of studies focus on evaluation (i.e., 61.03%) and most of these studies utilize good study designs (i.e., 83.12%) are encouraging in terms of the state of VR for safety-relevant training. This is because researchers often focus on development studies when the technology is relatively new, and only when the technology is better understood do researchers attempt to evaluate how effective it is. Although development studies are necessary to determine and advance the capabilities of VR technology, studies aiming to understand how effective VR training is and how VR training could be implemented effectively may now warrant higher priority. This is because VR technology has matured to the point that it can now provide highly functional, reliable, and usable training experiences. A shift in priority towards evaluation studies helps validate if and where VR training is effective for safety-relevant training in general and, more importantly, helps to understand the role VR technology could play in safety-relevant training. 
As with the different training methods available (e.g., traditional classroom, computer-based/e-learning, and hands-on), understanding which types of safety-relevant training are best suited to take advantage of VR is worthwhile to investigate. This could help determine if current safety training could be improved through the use of this relatively new technology. Despite the importance of evaluation, development studies play a key role in advancing the current state of VR technologies. When development studies are undertaken, it is recommended that such studies consider validating their VR training by employing either a true- or quasi-experimental design as opposed to a simple post-only one-group design as is typically done. This review encourages future researchers, particularly those undertaking development studies, to consider improving their experimental design to increase the quality of their results. For example, instead of a post-only one-group or pre-post one-group design, researchers may benefit from multiple data collection points before and after training and move away from a non-experimental design to a quasi-experimental interrupted time series design. Another suggestion is for researchers to expose all participants to a control such as traditional training in addition to the VR training (i.e., within-groups design), and capture the outcome measure after each exposure, preferably allowing considerable interval between each exposure. Finally, a post-only non-equivalent control group may be enhanced by controlling a group baseline using a pretest.
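
As a quick check of how the 83.12% figure relates to the other reported shares, the following sketch recomputes it from the percentages quoted above. The variable names are illustrative, and the interpretation (that 27.94% and 22.79% are shares of all 136 studies contributed by evaluation studies) is an assumption based on the wording of the text.

```python
# Shares of all reviewed studies, as reported in the text (in percent).
evaluation_share = 61.03        # studies with evaluation as the main objective
true_experiment_share = 27.94   # evaluation studies using a true experiment
quasi_experiment_share = 22.79  # evaluation studies using a quasi-experiment

# Share of all studies that are evaluation studies with an experimental design.
experimental_share = true_experiment_share + quasi_experiment_share

# The same group expressed as a percentage of evaluation studies only.
within_evaluation = 100 * experimental_share / evaluation_share

print(round(experimental_share, 2))  # 50.73
print(round(within_evaluation, 2))   # 83.12
```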

With respect to true- and quasi-experimental design, a mixed design is the most common approach used by both “evaluation” and “development” studies, followed by between-groups and then within-groups design. Overall, mixed design covers 44.12% of all studies with 20.22% being a true experiment and 23.90% a quasi-experiment. Between-groups design represents 19.11% of all studies with 12.86% representing true experiment and 6.25% a quasi-experiment, whereas within-groups design only represents 6.25% of all studies. While there is nothing inherently good or bad about each of the design approaches, they do have their own advantages and disadvantages. Therefore, the correct use of these designs is best determined on a case-by-case basis, which is out of the scope of this review. This review instead presents the findings on study design as an overview of the current study design trend seen by research conducted in the use of VR for safety-relevant training.

When analyzing study design based on the application domain, VR training in health services represented a large portion of true- and quasi-experimental design with 13.97% and 12.50% respectively (see Fig. 7). This supports the previous rationale (Sect. 4.1.1) that existing commercially available VR simulators and the dangerous nature of this domain, particularly for surgeons, require rigorous testing with reliable experimental design. Other than health services, transportation is the only other application domain where non-experimental design is the least used approach. The majority of studies in construction, engineering, and other domains adopted either non- or quasi-experimental design with a smaller portion adopting true experiments. Less rigorous evaluation in “other domains” is reasonable considering the utilization of VR for training in these fields is still uncommon and largely exploratory. However, it was interesting to find that the construction domain, with the second largest application, still had a limited number of studies utilizing true-experimental design. Researchers conducting studies in the construction domain are encouraged to continue undertaking evaluative studies prioritizing a more comprehensive experimental design approach (i.e., true or quasi-experimental).

Fig. 7

Percentage study design based on application domain

4.4 Evaluation measures categorized using Kirkpatrick’s four level model

This section presents an analysis of the evaluation measures used by reviewed studies which are categorized into the four levels in Kirkpatrick’s model as presented in Fig. 8. It is important to note that some of the reviewed studies evaluate more than one measure and therefore may be represented in multiple categories. Learning (level 2 in Kirkpatrick’s model) was the most commonly used evaluation measure with 98 (72.06%) studies. This was closely followed by studies that measured reaction (level 1 in Kirkpatrick’s model) with 90 (66.18%) studies. A relatively low number of studies evaluated the effectiveness of VR training using behavior or result measures, which are beyond level 2 of Kirkpatrick’s model. In fact, no study reviewed evaluated behavior (level 3 of Kirkpatrick’s model), and only three studies evaluated results (level 4 of Kirkpatrick’s model). One study had measures that are inapplicable to any of the levels in this model (Torres-Guerrero et al. 2019). Torres-Guerrero et al. (2019) collected electroencephalogram (EEG) signals to measure the stress and concentration level when performing a welding task.

Fig. 8

Evaluation measures categorized into Kirkpatrick’s four level model

4.4.1 Level 1—reaction

Studies evaluating reaction typically used a post-survey in the form of a Likert scale or open-ended questionnaires; the surveys were either self-developed or based on existing work. Reaction was also measured using interviews, discussions, and observations. As presented in Table 1, reaction is categorized into 12 measures for the purpose of this review, namely affective reaction, realism, quality of instruction and feedback, motivation, usability, perceived learning effectiveness, subjective comparison, presence, engagement and interactiveness, intention to use, confidence level, and comments, impressions or feedback. These categories were developed by combining similar measures. However, it is important to note that each category may not perfectly align with each included measure, as each measure has its own distinct definition. Classifying each measure into the most appropriate category was deemed necessary to extract meaningful insights that aligned with Kirkpatrick’s model. Researchers intending to perform an evaluation study using one of the measure categories are encouraged to assess the suitability of the exact measure to be used.

Table 1 Categorized reaction measures and associated studies

Out of the 136 studies reviewed, 90 (66.18%) evaluated at least one measure of reaction. Usability is the most common reaction measure, used by 35.66% of the reviewed studies, followed by perceived learning effectiveness (30.15%) and affective reaction (20.59%). Usability can be described as the “quality of a user’s experience when interacting with products or systems” (U. S. General Services Administration n.d.a). Existing surveys such as the system usability scale (SUS; Brooke 1996) and simulator sickness questionnaire (Kennedy et al. 1993) were commonly used, although other studies also measured other metrics such as ease of use and user-friendliness. Perceived learning effectiveness is the trainees’ perception of how effective the VR training system is for learning. Metrics such as the helpfulness of the system for learning, perceived benefits, learning gains, and usefulness are included as part of this measure. Affective or emotional reaction comprises trainees’ overall feeling toward the training, their perceived enjoyment, how pleasant, attractive, and impressive the training is, and their degree of satisfaction. In addition to these three measures, evaluators collected general comments, impressions, and feedback from trainees in 17.28% of the studies. These are typically obtained by using interviews or open-ended post-surveys. This process allows trainees to express their perceived strengths and weaknesses of the system and recommend changes and improvements.

Presence represents the trainees’ sense of being in a virtual environment, whereas realism measures how realistic or similar a virtual environment is with respect to the real-world equivalent. The two measures were used for evaluation in 16.54% and 15.81% of all studies, respectively. Existing surveys, such as the Igroup Presence Questionnaire (IPQ; Schubert et al. 2001), Presence Questionnaire (PQ; Witmer and Singer 1998), and the Slater–Usoh–Steed presence questionnaire (Usoh et al. 2000), are commonly used for evaluating presence. Intention to use measures the desire of trainees to undertake or integrate the evaluated VR training in the future, their likelihood to practice using the VR system, and their likelihood to recommend the training to others. This measure is used in 14.71% of all studies. Confidence level is measured in 13.97% of the studies; this includes studies that measure “self-efficacy”, which is defined as one’s belief in their own ability to execute the necessary behaviors to achieve specific goals (Bandura 1977). Engagement or interactiveness includes metrics such as flow, involvement, attention, active learning, and sense and immediacy of control. The measure of engagement is used in 13.24% of studies. Subjective comparison, measured by 11.03% of studies, is defined as a trainee’s subjective preference when comparing VR training with traditional training methods or other methods of interest. The final two measures, quality of instruction and feedback, and motivation, are used as evaluation measures in 8.82% and 5.88% of the studies, respectively.

Figure 9 illustrates a diminishing number of studies that evaluated a higher number of reaction measures with 43.89% of studies evaluating reaction using two or fewer reaction measures and 81.67% of studies using four or fewer measures. Usability and presence were the two leading measures for studies with two or fewer measures. Out of the 23.5 studies with only one measure, 25.53% measured usability and 21.28% measured presence. Out of the 16 studies with two measures, 28.13% measured usability and 21.88% measured presence. Usability and presence can provide validation that the VR system used was of an acceptable standard and are therefore recommended to be used in all studies evaluating the effectiveness of VR training. If a standardized questionnaire is used such as SUS for usability or PQ, IPQ, or Slater–Usoh–Steed presence questionnaire for presence then comparisons of usability and presence can be conducted between studies. However, these two measures, while important, are not very useful when measured in isolation when investigating training effectiveness. Rather, they should be complemented with other measures depending on the intended outcome of the training. For example, if training is intended to increase trainees’ confidence in performing a task, then confidence level questions should be included in addition to usability and presence. Similarly, if training is intended to promote trainees’ intention to train using the system or to increase engagement during training, then the respective measures should be implemented. In addition to providing a subjective assessment of training, evaluating reaction has the added benefit of identifying areas of the training that need improvement. For example, if the realism of a VR training system is rated inadequately, then realism should be the focus for future improvement. 
This benefit can be capitalized on by evaluating a comprehensive set of measures rather than focusing on a single measure, such as usability or presence. Additionally, greater insights can be obtained by providing open-ended sections that invite trainees to elaborate on their responses.
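
The shares quoted for Fig. 9 can be reproduced from the reported counts, as in the sketch below. The fractional count of 23.5 is taken directly from the text and presumably results from the 0.5 weighting of two-experiment papers. The back-calculated counts in the second half are derived here for illustration and are not values reported by the review.

```python
studies_measuring_reaction = 90  # out of 136 reviewed studies
one_measure = 23.5               # weighted count of studies with one reaction measure
two_measures = 16                # studies with two reaction measures

# Percentage of reaction studies that used two or fewer reaction measures.
two_or_fewer = 100 * (one_measure + two_measures) / studies_measuring_reaction
print(round(two_or_fewer, 2))  # 43.89

def implied_count(percentage, group_size):
    """Back-calculate a weighted study count, rounded to the nearest 0.5."""
    return round(2 * percentage / 100 * group_size) / 2

# Derived (not reported) counts behind the usability and presence percentages.
print(implied_count(25.53, one_measure))   # 6.0
print(implied_count(21.28, one_measure))   # 5.0
print(implied_count(28.13, two_measures))  # 4.5
print(implied_count(21.88, two_measures))  # 3.5
```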

Fig. 9

Number of studies measuring reaction and the number of reaction measures evaluated (out of 90 studies that measured reaction)

4.4.2 Level 2—learning

Evaluations that measured learning used three methods: pen-and-paper or computer-based knowledge tests, evaluation of trainees’ performance in the VR system, and measurement of performance in more realistic settings (i.e., transfer test). As previously shown in Fig. 8, 98 (72.06%) of all studies reviewed evaluated learning using either one or a combination of these methods. Performance test in VR was the most common method, used in 58 (42.65%) of all studies. This was followed by the knowledge test, used in 33.5 (24.63%) of all reviewed studies. Finally, a transfer test was implemented by 26 (19.12%) of the studies (Fig. 10). Further analyses of each of the three learning methods are presented in the following order: knowledge test, performance test in VR, and transfer test.

Fig. 10

Learning methods used by review studies as a percentage of use

Studies that evaluated learning using a knowledge test after training (post-survey) represented 24.63% of the 136 studies reviewed. In addition to measuring knowledge post-training, several studies performed a test prior to training using a pre-survey, with some also including a knowledge test at an extended period after training (e.g., one month) using a follow-up survey. A pre-survey knowledge test is useful in measuring an individual’s baseline knowledge prior to training; this helps determine the different levels of knowledge between trainees and the knowledge gained from the training being evaluated. This test is recommended when evaluation includes training subjects who are likely to have prior knowledge or experience directly related to the training (Kirkpatrick and Kirkpatrick 2007, p. 49). A baseline measurement for the outcome variable (in this case, the knowledge level) is recommended in general to account for any unequal distribution, even in a randomized true-experimental design (Shannon et al. 1999). As observed in Table 2, out of 33.5 studies that performed a post-knowledge test, almost half (43.28%) did not obtain baseline knowledge. A follow-up survey administering a knowledge test is useful to gain insights into how well the knowledge is retained after an extended period. This is particularly important for safety training where refresher training is recommended, as retention results are useful in determining training frequency and period.
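
To make the role of the three measurement points concrete, the sketch below computes a mean knowledge gain (post-test minus pre-test) and a mean retention loss (post-test minus follow-up) for matched trainees. The function name and the scores are hypothetical and are not data from any reviewed study; real evaluations would normally add an appropriate statistical test.

```python
from statistics import mean

def knowledge_summary(pre, post, follow_up):
    """Return (mean gain, mean retention loss) for matched trainees.

    Each argument is a list of test scores in the same trainee order.
    """
    if not (len(pre) == len(post) == len(follow_up)):
        raise ValueError("score lists must cover the same trainees")
    gain = mean(b - a for a, b in zip(pre, post))        # post minus pre
    loss = mean(b - c for b, c in zip(post, follow_up))  # post minus follow-up
    return gain, loss

# Hypothetical scores out of 10 for three trainees.
pre = [4, 6, 5]
post = [8, 9, 7]
follow_up = [7, 8, 7]

gain, loss = knowledge_summary(pre, post, follow_up)
print(gain)            # 3
print(round(loss, 2))  # 0.67
```

Without the pre-test scores, the gain cannot be separated from knowledge that trainees already had before training.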

Table 2 Number and percentage of studies that evaluate learning using knowledge test before, directly after, and an extended period after training

A total of 58 (42.65%) of all studies reviewed evaluated learning by measuring trainee performance in VR either during or after training. Performance measures are domain-specific in nature as VR training is used in different application domains (Sect. 4.1). However, the measures can be generalized into categories such as economy or efficiency, general performance, safety, completion or pass rate, autonomy, response time, and hazard identification and management. Similar to reaction measures, each category is defined by combining similar (but slightly different) measures together for the purpose of this review. A list of the measures classified into the categories is available in Table 3, which is also applicable to the transfer test described later in this section. Table 4 presents the studies using each categorized measure for both the performance test in VR and the transfer test.

Table 3 Categorized performance test in VR and transfer test measures
Table 4 Studies categorized by performance test in VR (PVR) and transfer test (TRF)

General performance or accuracy measured using a score was the most commonly used measure when evaluating performance in VR and was used by 39 (28.68%) of all studies. The number of errors during training was another key metric used to measure performance. A multivariate score for total performance, whether calculated systematically within the VR system or graded by experts, was also commonly used.

Economy or efficiency was used to evaluate learning by 31 (22.79%) of all studies reviewed and covered the efficiency of time, movement, resource usage, and the number of attempts until achieving a predetermined proficiency. Efficiency of time is measured as the time spent during training or the completion time to perform an assessment task. Trainees are deemed to be proficient if they can perform tasks accurately and quickly. The efficiency of movement is evaluated within a VR system by calculating the number of movements and the total path distance of these movements. This measure is relevant for training psychomotor skills where precise movements are necessary to maintain safety (e.g., surgical procedures). Reduced resource usage, wastage and the number of attempts are also important measures used to evaluate economy and efficiency.
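The efficiency-of-movement measures described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that the VR system logs a tracked tool or hand position as a sequence of (x, y, z) samples, and it defines a "movement" as a run of consecutive samples whose step exceeds a hypothetical threshold `min_step`. The reviewed studies do not prescribe a particular definition.

```python
import math


def path_length(positions):
    """Total path distance: sum of straight-line steps between consecutive samples."""
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))


def count_movements(positions, min_step=0.01):
    """Count movements, taken here as runs of consecutive steps longer than min_step.

    min_step is a hypothetical threshold (in the same units as the positions)
    used to ignore tracking jitter while the trainee is holding still.
    """
    count = 0
    moving = False
    for p, q in zip(positions, positions[1:]):
        step_is_motion = math.dist(p, q) > min_step
        if step_is_motion and not moving:
            count += 1  # a new movement starts after a pause
        moving = step_is_motion
    return count


# Hypothetical log: move, pause, then move again.
samples = [(0, 0, 0), (3, 4, 0), (3, 4, 0), (3, 4, 12)]
print("Path length:", path_length(samples))
print("Movements:", count_movements(samples))
```

Under these assumptions, a lower path length and fewer movements for the same task would indicate more economical motion.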

A safety score, including negative scores for physical damage (in the virtual environment) to the trainees and others (e.g., virtual patients), was used by 9.56% of all reviewed studies. Time spent exposed to virtual hazards and the time used to avoid them were also common measures of safety.

Some performance tasks involved using a predetermined benchmark measure for which a pass or fail was determined. This measure allowed trainees to complete the task and then compared the final score with a predetermined benchmark (i.e., pass or fail). Other studies terminated the task once mistakes were made (i.e., complete or incomplete). Studies that evaluated the effect of VR training using completion and pass rates accounted for 6.62% of all studies reviewed. Similarly, studies that evaluated the ability of trainees to identify, diagnose, and manage virtual hazards accounted for 6.62% of the studies.

A small portion of the studies reviewed evaluated the response or reaction time of the trainees (2.21%) and their autonomy or the ability to perform the task without additional help (1.47%).

The fact that performance tests conducted in VR were the most used method for evaluating learning suggests a further opportunity for VR in training assessment. VR offers not only the ability to deliver training, but also the ability to objectively assess competency. The ability to simulate reality safely is beneficial for both realistic hands-on training and assessment. The prime example of simulation for assessment is in the aviation industry, where prospective pilots are required to log their time on simulators to be qualified (Bradley and Abelson 1995). Rapid advancements in VR (second VR wave) have brought a range of new VR technologies, leading to greater interest in a more diverse set of fields. Future studies reviewing the current state of VR as an assessment tool for competency may therefore provide valuable insights.

A transfer test is a performance test conducted in an environment similar to that of the corresponding real task. This measure can be administered by asking trainees to perform the actual task or a simulated version of the task. In the health services domain, a simulation may be performed on a manikin, cadaver, 3D-printed or artificial physical model, while the actual task may be performed via surgery or assisting a supervised procedure. As presented in Fig. 10, 26 (19.12%) of all studies reviewed evaluated transfer. Similar to the performance test in VR, general performance or accuracy measured using a score was the most common measure, used in 15.44% of all studies, followed by economy and efficiency in 8.46% of studies. With regard to evaluations of economy and efficiency using a transfer test, all included studies measured the task completion time. Domain-specific performance graded by experts was commonly used for performance scores, particularly for health services-related training where standardized assessment rating scales exist. These scales are used as part of a traditional training assessment where the assessor observes each trainee’s performance and provides a score for each of the competencies listed. Examples of such scales include the Arthroscopic Surgical Skill Evaluation Tool and Cochlear Implant Surgery Assessment Tool. Five studies measured the completion or pass rate, and another three studies measured either safety, autonomy or hazard identification and management.

4.4.3 Level 3—behavior

None of the included studies evaluated behavior. The lack of studies evaluating this category (Level 3) is reasonable and likely due to the logistical requirement for a specialized trainer to observe the job behavior of multiple trainees.

4.4.4 Level 4—results

Three studies, namely, García et al. (2016), Wu et al. (2020b), and Butt et al. (2018) performed result (level 4) evaluations. García et al. (2016) evaluated the reported number of accidents, working days lost, and reduction in expenses after the implementation of a desktop-based VR training experience for the maintenance and operations of high-voltage overhead power lines. The data was gathered between one and three years after the implementation of the VR training. Wu et al. (2020b) recorded the number of self-reported injuries two months after undertaking a needle stick or sharp injury prevention VR training experience for nursing and medical students. Butt et al. (2018) focused on whether the use of VR for urinary catheterization training would encourage practice and recorded the time each trainee spent practicing after training. However, the considered time interval for recording the result was short (i.e., two-week intervals).

5 General discussions, recommendations, and limitations

This review found that existing studies that evaluated the effectiveness of VR for safety-relevant training were mostly applicable to health services and construction domains. Although health services and construction domains represented approximately 55% of the studies reviewed, interest from a wide range of domains suggests that VR may provide benefits to many fields. Surgical training, an early adopter of VR for training, had the greatest number of effectiveness evaluation studies. The majority of these studies also utilized either true or quasi-experimental design, which is considered to provide more reliable results than non-experimental design. Health services, particularly surgery, also had the greatest number of VR applications for safety-relevant training. In addition to the availability and applicability of commercially available surgery simulators, the dangerous nature of surgical procedures is likely to be an important contributing factor that encourages researchers in this domain to perform rigorous evaluations and validations of the system. Similarly, in construction, the inherent dangers to people within the industry, as indicated by high fatality rates, have likely led to an appetite for innovation in safety training delivery such as exploring the benefits of using VR. While interest has been strong within the construction domain, the industry lacks a standardized approach to its use of VR in training and applicable simulators, with the majority of studies performing evaluation using a non-experimental design for one-off prototypes. The difference in evaluation design between health services and construction domains may be due to the two domains having different levels of experience in implementing virtual training simulators, particularly from a historical point of view.
Researchers investigating the construction domain are encouraged to perform evaluations that implement true-experimental design to improve the reliability of results. If effectiveness is validated using a true-experimental design, it may motivate others working in the construction sector to invest in VR technology and may encourage both an effort towards standardization and the development of commercially available construction safety simulators.

Although the agriculture, fishing, and forestry sector has the highest fatality rate both in the USA and in Australia, no study was found in this sector. The ineffectiveness of current traditional training methods for preventing accidents in this sector should prompt researchers to investigate alternative solutions such as the use of VR technology for training. The sector has a high number of self-employed workers with limited access to safety officers and training facilities, who can potentially leverage VR training that is system-based, less reliant on instructors, and flexible in terms of the time and place of training. Developing VR training on how to safely operate vehicles such as tractors should be feasible considering that other vehicle simulators are being developed for different domains. As with other domains, the challenge is in the successful implementation of training systems such that workers in the sector improve their safe behavior, which eventually results in fewer accidents and fatalities. Future research should aim to investigate the implementation of VR safety training in the agriculture, fishing, and forestry sector to evaluate whether it can provide benefits for safety-related training.

The majority of studies reviewed focused on evaluating the effectiveness of VR training as opposed to prototype development. This result may suggest that VR is starting to mature into an effective training tool and is outgrowing its infancy stage. Despite this, a significant portion of the studies reviewed were still focused on evaluating the general usability of the hardware and side effects of the technology, especially studies that focused on prototyped systems. Many studies concentrated on evaluating specific features of VR and their effect on training effectiveness rather than the general effectiveness of the training system. This is likely due to the technology still being relatively new, with many open questions about the interplay of different features, such that each feature is best evaluated individually. As VR matures and becomes more reliable, each of these features is likely to be better understood, allowing evaluators to focus more on the effectiveness of the system in delivering training. The combination of studies focusing on development and evaluation is crucial to push VR technology forward. A developmental study focuses on testing new technology, whereas an evaluation study either validates or rejects the claim that the technology being investigated is effective and that efforts in further advancements are worthwhile. While effectiveness evaluation is beyond the scope of some developmental studies, it is recommended that when evaluation is performed, post-only one-group design is avoided by taking additional steps to satisfy quasi-experimental design requirements.

This review categorized evaluation measures using Kirkpatrick’s four-level model to determine how studies currently evaluate the effectiveness of VR safety-relevant training. The categories included reaction, learning, behavior, and result measures. The majority of studies evaluated the effectiveness of VR safety-relevant training in terms of learning (level 2, 72.06%), followed closely by reaction (level 1, 66.18%). No studies reviewed evaluated effectiveness using behavior measures (level 3), and only a limited number of studies used result measures (level 4). The limited number of studies evaluating effectiveness using behavior and result measures is not surprising considering that these measures are substantially more complicated, time-consuming, and resource-intensive to implement than reaction and learning measures. In fact, Kirkpatrick considered evaluating behavior to be “the most difficult and time consuming of the four levels” (Kirkpatrick and Kirkpatrick 2007, p. 105). The categories for outcome measures are presented both in Sect. 4.4.1. (Level 1–reaction) and Sect. 4.4.2. (Level 2–learning) and aim to provide some guidance for researchers and assist in making informed choices about evaluating training effectiveness. When reaction is measured, it is recommended to implement designs that evaluate the usability of the VR system and the sense of presence using established questionnaires. This approach is likely to provide readers with confidence that the evaluation of the VR training has been performed on a VR system of an acceptable standard. Additionally, when established questionnaires are used appropriately, researchers can estimate the quality of their VR training by comparing the values of these results against existing systems.
If the intended outcome of using VR technology in training is to improve learning outcomes, then it is recommended to use at least one of the three commonly used evaluation methods, which are the knowledge test, the performance test in VR, and the transfer test. It is important that each training solution is appropriately evaluated using a test that best represents the objective of the training. For example, if an assembly training solution aims to teach trainees how to perform real assembly tasks, then performing a knowledge test or a performance test in VR may not be sufficient. Instead, a transfer test is likely required to test their ability after training. On the other hand, if a training solution aims to convey important information, then a knowledge test should suffice. Finally, while the common aim for most safety training is to reduce the number of accidents and fatalities, which requires an evaluation of results (level 4), it is understandable that there is still a lack of studies evaluating training using results measures, given how difficult such measures are to implement, particularly in a relatively new research area. Efforts toward evaluating the implementation of VR safety-relevant training in the workplace by observing trainees’ behavior (level 3) and the results from implementation (level 4) should be a focus of future research.

One of the limitations of this review is that relevant studies may have been overlooked due to the search strategy. Although this paper attempts to comprehensively identify all relevant studies, restricting the scope of the search using a combination of search strings was needed to ensure the feasibility of this review. As a result, some relevant studies may not have been identified. For example, this review includes studies that use forms of the words “safe” and “train” to generally cover safety training. Studies that focus on safety-relevant training such as disaster preparedness training, or other types of professional skills training, may not recognize or explicitly state the safety implications of their training and thus may not include the term “safe” in their paper. Similarly, studies that use different terminology for their training outcomes instead of “effectiveness”, “impact”, or “outcome” may also be overlooked. Another limitation is the exclusion of studies prior to 2016, which may limit the generalizability of the findings, specifically to VR safety-relevant training prior to 2016. Finally, the current review does not analyze the findings or assess the quality of the studies (beyond study design). Investigating whether VR for safety-relevant training is effective as well as assessing the quality of the evaluation processes and the findings may provide valuable insights. These are beyond the scope of the current review and present an important opportunity for future research to address.

6 Conclusion

This review presents the current state of evaluation studies that focus on the effectiveness of VR safety-relevant training. Research domains in which VR safety-relevant training was applied are identified and described, with results showing that health services followed by construction are the most active domains. The study objective and design of each evaluation study are analyzed, with results showing that the majority of studies focus on evaluation and utilize true- or quasi-experimental design. Besides evaluation studies, developmental studies should also be performed to continue to push the boundaries of what VR technology can provide. However, it is recommended that these studies also evaluate their prototypes using an acceptable experimental design (e.g., true- or quasi-experimental design). Finally, the paper categorizes evaluation measures for each study using the four levels of the widely used Kirkpatrick's model. Reaction (level 1) and learning (level 2) are the two most commonly used evaluation measures, while none of the reviewed studies evaluated behavior (level 3) and only three studies evaluated results (level 4). Usability, perceived learning effectiveness, and affective reactions are the three most commonly used reaction (level 1) measures. When evaluating learning (level 2), there are three commonly used evaluation methods: the knowledge test, the performance test in VR, and the transfer test. Categories of measures are also identified and analyzed for the performance test in VR and the transfer test. By describing the space of VR safety-relevant evaluation studies, this review aims to extend the existing body of literature by providing information and guidance on how such VR training evaluations can be performed.
Discussions and recommendations for future work are presented with the aim of achieving more comprehensive, standardized, and consistent evaluations when measuring the effectiveness of VR for safety-related training.