1 Introduction

Proximal humerus fractures (PHF) are the seventh most frequent fractures in adults and the third most frequent in patients over 65 years old (Roux et al. 2012), which means their incidence is likely to increase in parallel with population ageing. In Sweden, for example, the incidence of PHF increased 31% from 92.7 per 100,000 person-years in 2001 to 121.9 per 100,000 person-years in 2012, according to the National Register (Sumrein et al. 2017). Moreover, PHF can involve high costs for healthcare systems. For example, it is estimated that the total cost of PHF was 48.5 M€ for the Netherlands between 1986 and 2012, including both direct and indirect costs (Mahabier et al. 2015). While the cost can be quite high, it also depends heavily on the treatment provided. Among the main treatment options, Reverse Shoulder Arthroplasty (RSA) is the most expensive at 90 days post-surgery ($16,151), followed by hemiarthroplasty (HA) ($9348), Open Reduction with Internal Fixation (ORIF) ($6745) and non-operative treatment ($1932) (Levy et al. 2017). Due to these differences in cost and in the resources required for different treatments, some authors emphasize the need to select the correct treatment for each patient. Further, in medicine, assessment of treatment cost-effectiveness must include the trade-offs involved in choosing among alternative interventions to maximize overall health outcomes given the available resources. As a result, cost-effectiveness includes considerations of immediate patient treatment expenses, available healthcare resources, economic and societal impacts of long-term injury, and time spent in therapies (Neuman and Sanders 2017). Thus, some have argued that ORIF might sometimes be a more cost-effective choice than HA, or that surgical treatments are not cost-effective versus non-operative treatments for PHF involving the humeral neck (Corbacho et al. 2016; Thorsness et al. 2016).

Given the likely increasing incidence of PHF and the range of treatment costs, it is critical to have effective methods for determining appropriate treatment. Commonly, the number of fragments and their displacement are two important variables relevant to PHF treatment decisions. However, how exactly these variables relate to treatment selection remains controversial, and there is significant heterogeneity about how to weigh the variables relative to different fracture patterns. Generally, it is accepted that minimally displaced fractures are treated conservatively, while displaced, comminuted or angulated fractures are usually treated with surgical methods (Schumaier and Grawe 2018).

Three-dimensional printed models (3DPM) have constituted a solid alternative to traditional methods (i.e. CT-Scan and conventional radiographs). They have shown a higher interobserver and intraobserver agreement in the classification of PHF using Neer classification (Bougher et al. 2021; Iordens et al. 2016). By providing a physical 3D model of the fracture, 3DPM allows observers to classify the fractures properly, using both visual and tactile interactions with the 3DPM to diagnose fractures accurately. A further benefit of 3DPM is that the models can be used for surgical planning and for practising the procedure after selecting an initial diagnostic route.

While 3DPM has proven effective, it can be limited by the need for specialized printing technologies and the additional time it takes to print the fracture after a patient is scanned. Moreover, plastic one-time-use prints are not ideal in the context of environmental sustainability. Other disadvantages of 3DPM include the space requirements for a 3D Printing laboratory (or at least the 3D printer), printing failure rates, and material and electricity costs.

Virtual reality (VR) can be considered a possible new tool in the daily workflow of Orthopaedic Surgery. This technology has been used for various purposes in shoulder and elbow surgery, such as surgical planning or education (Lohre et al. 2020). However, to the best of our knowledge, no previous literature has presented a study of the effectiveness of VR in PHF classification. Given that VR does not require a physical 3D model to be made, as in 3DPM, it may provide a more sustainable alternative to 3DPM.

In the present work, we propose VR as a new tool for achieving accurate PHF classification among orthopaedic surgery residents without the disadvantages of 3DPM noted above. This study aims to assess the accuracy of classifications made using VR compared with classifications made using 3DPM. Classifications were considered accurate or correct when they agreed with those provided by a committee of experts, and incorrect when they differed. To accomplish our aim, we performed a study with residents of orthopaedic surgery from different hospitals who were randomised into two groups (3DPM versus VR models).

2 Materials and methods

2.1 Sample obtention and participants

2.1.1 Sample: patients with PHF

We analysed the entire population of patients of Hospital de Antequera (Malaga, Andalusia, Spain) diagnosed with PHF from 2012 to 2022 with a CT-Scan (n = 63).

Three inclusion criteria were established: (1) patients must have had a CT-Scan in the acute phase of the fracture with at least 80 slices uploaded to the image database of the hospital, (2) all treatment and follow-up of the patients was done at Hospital de Antequera and (3) patients signed an informed consent permitting the use of their images for medical or research purposes. Further, a single exclusion criterion was used: any previous fracture or pathology of the fractured shoulder, or a fracture-dislocation.

A total of 34 patients met all the previous conditions, and their images were used as the sample for this study. The CT-Scan model used was SOMATOM Emotion 16-slice 2007 (Siemens), and DICOM files of the mentioned CT-Scans were processed.

Of the patients who met the established criteria, 16 had fractures on their right humerus and the other 18 on their left humerus.

2.1.2 Participants: orthopaedic surgery residents

The study involved all 37 residents from three hospitals of different levels of care in Andalusia. One of these participants was excluded due to technical issues, leaving 36 participants included in the presented analyses (n = 36; male = 25, female = 11; average age = 27.75 years). All participants signed an informed consent form allowing the anonymized use of their responses.

Hospital levels are defined according to the definitions adopted by WHO-CHOICE: primary-level hospitals have few specialities and under 200 beds, secondary-level hospitals are highly differentiated by function with 200 to 800 beds, and tertiary-level hospitals are highly specialized and equipped with 300 to 1500 beds (Mulligan et al. 2003).

Each of the 36 orthopaedic surgery residents surveyed classified a total of 34 PHF using both Neer and HGLS classifications and selected a potential treatment.

2.2 Gold standard classification

While many accepted classifications exist for PHF, Neer classification is probably the most common. Traditional Neer classification identifies a fragment as an independent entity when it is displaced by over 1 cm or angulated over 45°, although Neer has noted that he specified these values somewhat arbitrarily (Carofino and Leopold 2013; Schumaier and Grawe 2018). Other widely accepted classifications for PHF are AO/OTA (Arbeitsgemeinschaft für Osteosynthesefragen/Orthopaedic Trauma Association) and Hertel’s HGLS (Head–Greater–Lesser–Shaft) classifications. HGLS shows higher interobserver and intraobserver agreement than the other mentioned classifications on plain radiographs and CT-Scans (Sukthankar et al. 2013; Iordens et al. 2016). All the classifications mentioned above specify that the humerus can be fractured in different patterns that mainly affect four different pieces: head, shaft, greater tuberosity and lesser tuberosity. Otherwise, each classification method is subtly but meaningfully distinctive. Regardless of the classification method, the delimitation of the morphological characteristics of each fracture line and fragment is essential for the correct selection of a patient’s treatment. Nevertheless, there is a need to improve the interobserver and intraobserver agreement, and recently many new diagnostic methods have appeared in pursuit of improving understanding of PHF mechanics and classification.

To establish an accurate reference classification of the sampled PHF, a committee of three senior orthopaedic surgeons specialized in upper limb surgery was constituted, with an average of over 20 years of experience in the field. We provided all available images for each of the 34 selected patients (simple radiographs, CT-Scan slices, conventional 3D reconstructions, 3DPM and VR models) to each committee member for assessment. Committee members worked independently to provide an initial classification of each fracture by both the Neer and HGLS classifications. In this first round, the experts gave a concordant classification for 30 of the 34 patients. The remaining four were discussed by the committee until there was consensus on a single classification for each. The classifications of the expert committee were considered the correct classifications for the remainder of the study, and, as such, a PHF classification in this study is considered accurate or correct when it is the same as the classification provided by the expert committee for the presented PHF model.

The appropriate treatment was determined through a review of the clinical histories, and the best treatment was defined as the course of treatment actually provided to each patient. Participants were offered the three main treatment options available for any fracture in orthopaedic surgery: conservative treatment (also known as non-operative, typically performed with a sling), osteosynthesis (classically ORIF performed with wires, plates and/or screws) and arthroplasty (including hemiarthroplasty, total shoulder arthroplasty and reverse shoulder arthroplasty).

2.3 Graphical design: classifications and models

2.3.1 Classifications and treatment forms

Classification forms for both the 3DPM and VR environment were designed by the lead author using a Huion H1060P Graphics Drawing Tablet and GIMP software (v2.10.30). The classification forms replicate the standard forms used for Neer and HGLS classifications and were implemented so that the forms in both conditions are as similar in appearance as possible. For the 3DPM condition, the forms were printed in colour on two laminated A4 sheets, one for the left humeri and another for the right ones (Fig. 4); in the VR environment, they were presented as UI images (Fig. 2).

2.3.2 Design of the 3D models

3D models were obtained from DICOM files. These files were processed by InVesalius 3.1 (CTI Renato Archer, Brazil), distinguishing and delimiting different fracture fragments and excluding irrelevant elements until a useful STL file was obtained (Moldovan et al. 2021; Ejnisman et al. 2021). Images were then processed on Meshmixer v3.5 (Autodesk, 2018). Tissue and bone defects unrelated to the PHF and scan artefacts were excluded from the 3D model of the humerus to make it printable (Raeker-Jordan et al. 2021). A refined STL file was obtained for each model, which was used as the final model for both the 3DPM and the VR model. Model names were enumerated with a random index from 1 to 34.

2.4 Experiment in VR

2.4.1 Hardware and software

The VR experiment was run on an Oculus Quest 2. This system consists of a standalone head-mounted display with a resolution of 1832 by 1920 pixels per eye, a 90 Hertz refresh rate, and two controllers. The virtual environment was created in Blender 2.93 and exported into Unity 2020.3.26 LTS. The VR framework used to handle the hardware interaction was the XR interaction toolkit 1.0. This framework is publicly available in the Unity repository and provides tools to manage the hardware input and translate them into actions in the virtual environment. The cleaned STL files were imported into the Unity environment for presentation in VR as virtual 3D objects.

2.4.2 Virtual environment

Navigation and interaction with the virtual environment were performed with the Oculus Quest 2 controllers (Fig. 1). Participants could grab the humeri with either controller by pressing the grab button. Using the joystick on the right controller, they could navigate the 3D environment. The left controller was used to interact with the PHF visualization and classification system (Fig. 2). Participants could switch between interface and bone interaction modes by pressing the ‘X’ button on the left controller. When the interface interaction mode was activated, a line (ray cast) extended from the controller to facilitate aiming. The ray cast was green when the user pointed at an interactable interface element and red when pointing at anything else. Individual interface elements were highlighted in green while being pointed at. Participants could select an option by pressing the trigger button on the left controller, after which the selected option was highlighted in red to show the participant which option was selected.

Fig. 1 Controller bindings

Fig. 2 Setup viewed from the centre of the virtual room. The Neer and HGLS classifications for right humeri are on the left, and those for left humeri are on the right (as when examining a patient or an image of the shoulder facing the observer); the recommended treatment options and the “next” button are in the centre. Note that only one humerus was presented at a time; left and right are shown together here for demonstration

When in bone interaction mode, participants could grab and move the humerus models by pressing the grab button while hovering over the model. To avoid unnecessary distractions, gravity did not affect the humeri: a grabbed humerus followed the movement of the hand holding it, and when released it stayed in the position where it was released.
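The mode switching and ray-cast colouring described above can be summarised as a small state model. The following is only an illustrative sketch in Python, not the Unity/XR Interaction Toolkit implementation used in the study; the names `Mode`, `LeftController`, `press_x` and `ray_colour` are hypothetical, and the choice of bone mode as the starting mode is an assumption, since the text does not state which mode was active first.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BONE = "bone"            # grab and move the humerus model
    INTERFACE = "interface"  # point at and select classification options


@dataclass
class LeftController:
    # Assumption: the session starts in bone interaction mode.
    mode: Mode = Mode.BONE

    def press_x(self) -> None:
        """Toggle between bone and interface interaction modes."""
        self.mode = Mode.INTERFACE if self.mode is Mode.BONE else Mode.BONE

    def ray_colour(self, target_is_interactable: bool):
        """Return the ray colour, or None when no ray is shown."""
        if self.mode is not Mode.INTERFACE:
            return None  # the ray is only drawn in interface mode
        return "green" if target_is_interactable else "red"


controller = LeftController()
controller.press_x()                    # switch to interface mode
colour = controller.ray_colour(True)    # pointing at an interface element
```

Keeping the two modes mutually exclusive mirrors the description in the text, where the same controller buttons either manipulate the bone or operate the classification interface, but never both at once.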

2.4.3 VR experiment mechanics

When the experiment started, the participants were immersed in a virtual room. The main elements were the interface containing HGLS and Neer classifications criteria and the humeri models (see Fig. 2 for the virtual room layout). Since each humerus could be either left or right, there was a diagram of HGLS and Neer classifications on the right side corresponding to the left humerus, and a diagram of HGLS and Neer classifications on the left side corresponding to the right humerus. To make it easier for the participants, the right humeri were presented on the left side, and the left humeri were presented on the right side (like seeing a patient or an image in front of the observer). The models were presented near the participants, so they did not have to navigate or walk to see them well. In the middle of the interface, the participants could choose the recommended treatment from a list of options (conservative, osteosynthesis, and arthroplasty). The button to render the next humerus was placed at the bottom centre of the interface (see Fig. 2).

The humeri were shown to the participants one by one. The presentation order of the humeri was randomised to avoid order bias. For each humerus, the participants were asked to select one of three treatment options (conservative, osteosynthesis, or arthroplasty) and classify the fracture by both the Neer and HGLS systems. These selections could be made in any order. Once a treatment option and both classification answers were selected, participants pressed a button to move on to the next humerus. The responses were saved in a .CSV file for further analysis after the experiment.
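As a rough illustration of this procedure, the sketch below shuffles the 34 model indices and writes one response row in CSV form. It is written in Python for readability and is not the Unity code used in the experiment; the column names, the participant identifier, the placeholder labels and the fixed seed are all assumptions made for the example.

```python
import csv
import io
import random


def presentation_order(n_models: int, rng: random.Random) -> list[int]:
    """Return the model indices 1..n_models in a random order."""
    order = list(range(1, n_models + 1))
    rng.shuffle(order)
    return order


def write_responses(rows, stream) -> None:
    """Write response rows as CSV with a header line."""
    # Hypothetical column layout; the source does not describe the file format.
    fieldnames = ["participant", "model", "neer", "hgls", "treatment"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


rng = random.Random(2022)  # fixed seed only so the sketch is reproducible
order = presentation_order(34, rng)

buffer = io.StringIO()  # stands in for the .CSV file written by the application
rows = [
    {
        "participant": "P01",          # hypothetical identifier
        "model": order[0],
        "neer": "placeholder",         # not a real Neer label
        "hgls": "placeholder",         # not a real HGLS label
        "treatment": "conservative",
    },
]
write_responses(rows, buffer)
```

Drawing a fresh permutation for each participant is one simple way to keep any single model from always appearing first or last.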

2.5 Experiment in 3DPM

2.5.1 3D models printing and storage

The cleaned STL files were also used for printing the 3DPM (Fig. 3). Each model was processed for 3D printing in Cura 4.5 (Ultimaker 2020) and sliced by the specific predefined parameters of this software for a Creality Ender 3 printer. Processing for 3D printing did not visibly alter the external structure or presentation of the PHF.

Fig. 3 Three-dimensional printed model of a proximal humerus fracture. Each model was printed in white 1.75 mm PLA at 1:1 scale. Thirty-four of these models were evaluated by each participant

3D Printing was performed on a Creality Ender 3 v2, with an average time of around 7 h per model. Models were printed using 1:1 scale. The material used was 1.75 mm white PLA, with a total amount of 1.3 kg for the whole experiment. Printing supports were removed post-print, and no further post-printing treatment was needed.

Models were stacked on a foam mould and randomly identified with a number from 1 to 34. Each humerus was marked with its number and a letter ‘I’ (Izquierdo, Left in Spanish) or ‘D’ (Derecho, Right in Spanish).

2.5.2 3DPM experiment environment and materials

Each participant received two reference cards, one for left humeri and another for the right ones, printed in High Quality with Neer and HGLS classifications and treatment options (Fig. 4). Then, a pack of 34 cards numbered from 1 to 34 was randomly ordered and given to each participant in order to determine model classification order. The residents were asked to classify each 3DPM according to both the Neer and HGLS classifications and to indicate a proper treatment approach from a list of 3 options (conservative, osteosynthesis, or arthroplasty).

Fig. 4 Original card for right-side fractures shown in the 3DPM questionnaire. Left-side fractures had analogous cards depicting left humerus fractures. The number ‘1’ in the bottom right corner refers to the corresponding model. Cards were shown in a randomised order. Translation: ‘1. Classify the corresponding model by HGLS classification’ ‘2. Classify the corresponding model by Neer classification’ ‘3. Provide a proper treatment for this patient’

2.6 Study design and data collection

The study structure was based on a randomised controlled trial design. Participants were tested individually and randomly assigned to a condition, VR or 3DPM. Each participant received a brief seminar about PHF and the different classification approaches as well as a description of the experiment. All participants were individually randomised and had a 50% probability of being assigned to either group. After the experiment was introduced, the participants signed an informed consent form and classified the humeri using the method (3DPM or VR) corresponding to their assigned condition.
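A minimal sketch of individual randomisation with equal allocation probability is given below. It assumes simple (unrestricted) randomisation; the text does not state whether any restriction such as blocking was applied, and the function name `assign_condition` is hypothetical.

```python
import random


def assign_condition(rng: random.Random) -> str:
    """Assign one participant to VR or 3DPM with probability 0.5 each."""
    return "VR" if rng.random() < 0.5 else "3DPM"


rng = random.Random()  # unseeded: a new allocation sequence on every run
assignments = [assign_condition(rng) for _ in range(36)]
```

Note that unrestricted randomisation of this kind does not by itself guarantee groups of equal size.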

Age, sex and previous experience with the technologies (3DPM or VR) used were registered, as well as the year of residency, the WHO-CHOICE hospital each participant belongs to, and time spent (in minutes) completing the 34 humeri questionnaire. Experience with VR/3DPM was specified in 3 levels: no previous experience, low experience (less than 10 previous uses of VR or 3DPM for any purpose) or high experience (10 or more previous uses of VR or 3DPM for any purpose). After completing the classifications, participants were asked to complete a utility and confidence VAS (Visual Analogue Scale) and a questionnaire to assess the degree of satisfaction with the experiment's performance and the technology used.

2.7 Statistical analysis

Characteristics of participants are presented according to trial group assignment. We performed descriptive analyses on outcomes and covariates. Numerical variables are described by means and standard deviations, and categorical variables are described by absolute and relative frequencies.

Group comparisons were carried out using Mann–Whitney’s test, Student’s t-test, or Fisher’s exact test as appropriate. We assessed associations between the binary outcome variable (correct classification) and group assignment (VR/3DPM) using mixed-effects logistic regression models adjusted by age, sex and potential confounding variables to ensure that additional variables do not meaningfully impact statistical results, obtaining results as Odds Ratios (ORs). Potential confounders were included in the model when they were associated with the outcome variable at a level of statistical significance p < 0.25. A random intercept was included to account for correlation among measurements in the same subject (Hosmer et al. 2013). All statistical tests were two-sided, and p values < 0.05 were considered statistically significant. All statistical analyses were conducted using Stata 17.0 (StataCorp LLC).
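For reference, a random-intercept logistic model of the kind described above can be written as follows. The notation is ours and is only a sketch: $Y_{ij}$ indicates whether participant $i$ classified fracture $j$ correctly, $\mathrm{VR}_i$ is the group indicator, $\mathbf{x}_i$ collects the adjustment covariates, and the normal distribution of the random intercept $u_i$ is the usual default assumption rather than something stated in the text.

```latex
\begin{align}
  \operatorname{logit}\,\Pr(Y_{ij} = 1 \mid u_i)
    &= \beta_0 + \beta_1\,\mathrm{VR}_i
       + \boldsymbol{\gamma}^{\top}\mathbf{x}_i + u_i,
  \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \\
  \mathrm{OR}_{\mathrm{VR\ vs.\ 3DPM}} &= \exp(\beta_1).
\end{align}
```

Under this formulation, the shared intercept $u_i$ is what captures the correlation among the 34 classifications made by the same participant.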

3 Results

3.1 Homogeneity of the groups

3DPM and VR groups were compared by age, gender, time of residency (years), hospital level, residency level, previous experience with the given technologies, and time spent completing the evaluations (descriptive statistics can be found in Table 1). Within the 3DPM group (n = 18), 15 identified as male and 3 as female, with an average age of 27.9 years (Standard Deviation (SD) = 2.4 years). Within the VR group (n = 18), 10 identified as male and 8 as female, with an average age of 27.6 years (SD = 1.9 years). The average amount of time in residency was 2.8 years (SD = 1.4 years) for the 3DPM condition and 3.1 years (SD = 1.5 years) for the VR condition, with a detailed breakdown of Post-Grade Year (PGY) and hospital type by group presented in Table 1. No significant differences were found in any of the analysed variables (Table 1), so we can conclude that the two groups are homogeneous for these variables. Notably, both groups spent a similar amount of time analysing all 34 humeri, with an average total time of 32.5 min (SD = 13.4 min) for the 3DPM group and 31.5 min (SD = 11.6 min) for the VR group. This difference was also not statistically significant.

Table 1 Characteristics of the participants according to the trial group assignment

3.2 Association between different characteristics and correct classification

To study the impact of 3DPM and VR on fracture classification accuracy, we investigated the correct classification rate (or accuracy) relative to the expert classifications according to two classification methods (Neer and HGLS). Along with a comparison of accuracy between 3DPM and VR, we also considered how participant characteristics might impact classification accuracy to get a more complete picture of any differences or similarities observed between conditions (see Table 2 for summary).
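The correct classification rate used throughout this section is simply the proportion of a participant's answers that match the expert committee's classification. A minimal sketch follows, with made-up labels ("A", "B", "C") standing in for real Neer or HGLS categories and a hypothetical function name `accuracy`.

```python
def accuracy(responses: dict[int, str], gold: dict[int, str]) -> float:
    """Proportion of responses that match the gold-standard label."""
    correct = sum(
        1 for model, label in responses.items() if gold[model] == label
    )
    return correct / len(responses)


# Toy data: four models, one mismatch (model 2).
gold = {1: "A", 2: "B", 3: "C", 4: "A"}
responses = {1: "A", 2: "C", 3: "C", 4: "A"}

rate = accuracy(responses, gold)  # 3 of 4 answers match
```

The same calculation is applied separately to the Neer and HGLS answers, since each fracture has one gold-standard label per classification system.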

Table 2 Association between participant’s characteristics and correct classification

Regarding technologies, 3DPM showed higher fracture classification accuracy than VR for both the HGLS classification (55.9% vs. 51.8%) and Neer classification (67.5% vs. 64.9%). However, these differences were not statistically significant (p = 0.169 for HGLS and p = 0.365 for Neer).

Unsurprisingly, participants with more professional experience performed classifications more accurately than those with less. Professional experience, as measured by PGY, had a significant impact on the classification accuracy for both Neer (p < 0.001) and HGLS (p = 0.005) classifications, with higher PGY associated with improved accuracy for each classification method.

Previous experience with the technology (3DPM or VR, depending on the condition) also showed a statistically significant association with the classification accuracy, with more experience significantly improving accuracy for both classification methods. For HGLS, the difference on accuracy between no previous experience (49.9%), low experience (55.9%), and high experience (64.7%) had a span of 14.8% from no to high experience (p < 0.001). The difference between the same experience characteristics for Neer was 9.9% (no = 63.6%, low = 67.3%, high = 73.5%, p = 0.026). These results indicate that familiarity with the technology used to display the fracture plays a role in one’s ability to accurately classify the fracture.

Across all participants and conditions, a total of 1224 assessments (36 participants × 34 fractures) were provided for each of the Neer and HGLS classifications. Table 3 compares age and time spent between correct and incorrect answers for Neer and HGLS across both technologies. We did not find statistically significant differences.

Table 3 Association between numerical covariates and correct classification

3.3 Clinical significance and comparison of VR and 3DPM

Multivariate adjusted Odds Ratios (ORs) for PHF classification accuracy adjusted by gender, age, hospital level, PGY, previous experience and time spent on the test were estimated. Multivariate adjusted associations between the assigned group and the classification accuracy are shown in Table 4. These results are largely in line with the results presented in Sect. 3.2.

Table 4 Multivariate adjusted odds ratios for correct classification of proximal humeral fractures

When controlled by age, sex, hospital level, PGY, previous experience and time spent on the test, we did not find statistically significant differences in the correct classification of PHF comparing VR versus 3DPM.

Time of residency showed a direct relationship with fracture classification accuracy for both HGLS and Neer classification (OR = 1.24; 95% Confidence Interval (95% CI) 1.04–1.48, and OR = 1.21; 95% CI 1.05–1.40, respectively). This corresponds to an increase of approximately 24% for HGLS and 21% for Neer classification in the odds of a correct classification per year of residency.

Time spent on the whole test also showed a statistically significant direct relationship with classification accuracy for both HGLS and Neer (OR = 1.02; 95% CI 1.01–1.04, and OR = 1.02; 95% CI 1.00–1.03, respectively). Thus, each extra minute spent on analysis of the 34 humeri was associated with an increase of approximately 2% in the odds of a correct classification.
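Because an odds ratio acts on the odds rather than directly on the accuracy, the resulting change in accuracy depends on the starting point. The sketch below makes the conversion explicit. It pairs the observed HGLS accuracy of the 3DPM group (55.9%) with the adjusted OR per year of residency (1.24) purely for illustration; combining an unadjusted proportion with an adjusted OR in this way is our simplification, not an analysis reported in the study, and `apply_odds_ratio` is a hypothetical helper name.

```python
def apply_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Return the probability obtained after multiplying the odds by an OR."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Illustrative inputs only (see the caveat in the text above).
baseline = 0.559
after_one_year = apply_odds_ratio(baseline, 1.24)
```

With these inputs the accuracy moves from 55.9% to roughly 61%, an absolute gain of about five percentage points rather than 24.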

4 Discussion

New technologies that allow for rapid 3D interactive visualization and classification of humeral fractures are extremely valuable for assessing fractures and determining treatment plans. Fürnstahl et al. proposed a computer-assisted method for identification of humeral fragments based on a CT-Scan and comparative analysis of the 3D models of a fractured humerus and the contralateral humerus (Fürnstahl et al. 2012). While the value of these technologies is not in doubt, there is still a need to identify which technologies are best suited for a given situation and user group. Currently, 3DPM and VR are prominent and rapidly developing technologies for interacting with 3D models whose potential is not yet fully explored in the context of fracture classification.

It is clear from previous literature that using 3DPM has become a Gold Standard method for the diagnosis of PHF (Bougher et al. 2021; Iordens et al. 2016). The use of 3DPM in preoperative planning has improved shoulder surgery processes and outcomes, reducing operation time, radiation exposure and pain, while improving anatomic healing and increasing post-operative mobility (Li et al. 2022; You et al. 2016). The use of 3DPM has also been tested and validated for fracture classification and diagnosis in other regions of the body (Yammine et al. 2021). In this context, we have reported the results from a non-inferiority randomised controlled trial to determine if there could be any significant differences between using 3DPM and VR for PHF classification and diagnosis. We observed no statistically significant differences between classifications made using these technologies, indicating that diagnostic outcomes using either technology are likely equivalent. Observed fracture classification accuracy was 51.8% (VR) vs. 55.9% (3DPM) for HGLS and 64.9% (VR) vs. 67.5% (3DPM) for Neer classification, which is consistent with interobserver agreement results reported in the literature (Iordens et al. 2016). In the present work, we also directly compared PHF classification accuracy using VR and 3DPM. We obtained an adjusted OR for VR versus 3DPM of 1.14 (0.77–1.67) on HGLS classification and 1.08 (0.79–1.48) on Neer classification, both favouring 3DPM though without statistical significance. Thus, according to our study VR is non-inferior to 3DPM for PHF classification and we can consider both technologies as equivalent.

To our knowledge, this is the first study evaluating the use of VR on PHF classification. In at least one case, a comparison of Augmented Reality (AR) combined with other methods for diagnosing PHF, including 3DPM, showed the highest interobserver agreement using 3DPM versus other technologies (Cocco et al. 2020). There are also some works that argue for the use of both AR and VR separately in orthopaedic trauma surgery, highlighting specifically the combination of AR with 3DPM, though in this case there is no direct empirical comparison of the technologies (Negrillo-Cárdenas et al. 2020). Some other works evaluate the utility of VR technologies for pre-surgical tasks, demonstrating that VR can be an appropriate technology (Negrillo et al. 2020; Verhey et al. 2020; Zhang et al. 2021). For example, Zhang and colleagues (2021) investigated the utility of VR for the evaluation and understanding of craniofacial structures.

4.1 Comparison between 3DPM and VR

While the non-inferiority design can indicate equivalent diagnostic outcomes between 3DPM and VR, there are obviously additional differences between these technologies that may be considered. The cost and time of making 3DPM has diminished considerably in recent years along with a reduction in the expertise required to prepare and produce models from 3D scan data. However, production of a 3DPM still requires a 3D printer, often a fixed location for the printer, and a regular supply of materials. This can often mean that valuable office or equipment space must be dedicated to the process. Perhaps the most notable challenge for 3DPM is the time it takes to produce a 3DPM of sufficient quality. For this study, we spent an average of about seven hours printing each model and typically one model is printed on a 3D printer at a time. Adding in the possibility for print failures, a 3DPM could take up to a full day to produce, though failure rates are constantly improving. Finally, 3DPM may be made of one-time use plastics which can have negative climate impacts, though bioplastics can reduce this impact.

Because VR does not require a physical production step, once the digital model is converted and cleaned for presentation it can be directly imported and viewed in VR. This results in a notable reduction in time for model availability in VR compared to 3DPM, which requires a similar conversion and cleaning process. With VR, time-to-presentation can be reduced to minutes instead of hours and no additional space is required as the same computer used for preparing the model can export it to the VR headset. Modern VR headsets can be deployed without much setup and often require only the headset and hand controllers to function. The VR scenario used for the current test required minimal movement and could be designed for seated or standing workspaces. An additional advantage of VR is that it can be used for instantaneous multi-site and multi-user evaluations, making remote consultations more robust and reducing ecological impacts of traveling or sending the 3DPM to another hospital. Also, the decrease in plastic use with VR may mean that VR has a lower environmental impact over time. It is precisely because of these advantages that we devised this study as one of non-inferiority. In our study, we also observed a shorter time for PHF classifications using VR when compared to 3DPM, although this difference was not statistically significant.

While there is a lot of promise around VR, there are a few possible downsides. Some users may find VR headsets uncomfortable and while simulator sickness is much less common, some users may still experience dizziness or nausea in VR. Moreover, not all users with glasses will be able to wear their glasses while using the headset, possibly resulting in reduced visual acuity in VR. Finally, there are currently no off-the-shelf solutions for the workflow presented in this study, making implementation with current systems more technically demanding than more mature 3D printing technologies. However, this is something we expect to change rapidly in coming years.

4.2 Future questions and work

Potentially countering the demonstrated benefits of 3DPM, Spek and colleagues (2022) reported that the use of 3DPM did not improve agreement among observers when identifying fracture characteristics and patterns. While observer agreement is important, this paper should be interpreted carefully, both because of the small sample size and because the possibility of anchoring bias due to the combination of onscreen images and 3DPM was not considered. Thus, it is not clear whether 3DPM failed to improve agreement because it added nothing to the classification process or because participants primarily relied on more familiar non-3DPM data when it was available. However, building on the current research demonstrating that VR produced outcomes equivalent to 3DPM, it would be interesting to see how the introduction of VR to the diagnostic toolkit would affect a study design similar to that used by Spek and colleagues. Of particular interest is the possibility of multi-user and instantaneous multi-site interactions with models in VR. We believe that while isolated studies, like this one, provide important insight into the baseline benefits of new technology applications, it is also important to investigate the added benefit (if any) provided when new technologies are combined with existing tools commonly used in daily clinical workflows, such as plain radiographs or CT-Scans.

VR is widely accepted as a useful technology for learning during medical residency and is often more effective than benchtop trainers or synthetic models (Verhey et al. 2020). However, there is a lack of evidence supporting the usefulness of VR technology for fracture classifications. This is due, in part, to a focus in the Orthopaedics field on using VR for rehabilitation and for training technical surgical skills (Hasan et al. 2021; Berton et al. 2020; Matamala-Gómez et al. 2022). When used for surgical pre-planning, but not diagnosis, of shoulder fractures, VR demonstrated better outcomes than conventional training methods and faster time-to-surgery than 3DPM used for this purpose, making preoperative planning better and shorter (Chen et al. 2018). The current research suggests that including VR methods earlier in the planning process, e.g. for diagnostic classification, may further streamline preoperative planning, allowing for a consistent tool and visualization throughout the process. Future research into using VR and 3DPM throughout the entire diagnostic and treatment process could be valuable in this regard.

One potentially useful insight from the current work is that there seemed to be a significant, and unsurprising, correlation between time in residency and classification outcomes. Future work comparing these kinds of technological interventions for treatment determinations should consider and report training time and/or years of experience. More importantly, care should be taken when comparing results between studies in which participant training level or experience may differ. Additionally, while the sample size is fairly large for this type of study, especially given the difficulty of accessing a concentrated population of orthopaedic surgical specialists and residents, the width of the confidence interval on both Neer and HGLS classifications suggests that future work should consider an even bigger sample. Although the confidence intervals may not be important for the non-inferiority study, it is still worth considering in future work the benefit of recruiting an even larger participant population.

5 Conclusions

Our findings support VR as a valuable alternative to 3DPM for the classification of PHF, providing both non-inferior classification power and advantages with regard to time and resource costs. This work demonstrates a significant direct relationship between PHF classification accuracy, PGY, and time spent analysing PHF. Importantly, the data obtained demonstrate that while there is no significant difference in PHF classification outcomes between VR and 3DPM, the time needed to analyse a fracture using VR is marginally, but not significantly, shorter than using 3DPM. When these results are considered in the context of the difference in time to prepare a 3DPM for viewing, they strongly favour VR, suggesting that VR should be included as an additional weapon in the presurgical armoury of orthopaedic surgeons for PHF.

The main strengths of the current work are the high number of expert participants, the use of validated models from actual patient CT-Scans, and the randomised study design. This work represents the first paper in the literature that evaluates the feasibility of classifying PHF in virtual reality. Moreover, the study is prospective, multicenter and multidisciplinary, incorporating engineers, statisticians, and orthopaedic surgeons in its development and implementation. Further investigation is needed into the effects of combining VR with more conventional radiological techniques and into the utility of VR for preoperative planning in other surgical contexts. In addition, user-friendly and accessible software tools for VR-based preoperative planning are needed for the broader adoption of the methods presented in this paper. Given the results of this paper, we believe efforts in this direction will be valuable for both doctors and patients, improving presurgical planning and reducing treatment time.