1 Introduction

When attempting to learn spatial information from two-dimensional displays, such as when students study gross anatomy using textbook images, a high level of workload is placed on the learner [1]. This workload has been associated with decreases in knowledge acquisition [2]. Emerging three-dimensional visualization technologies (3DVTs) support a learner’s understanding of spatial depth information by providing realistic representations of three-dimensional objects [3]. 3DVTs include physical models, virtual or augmented reality, and holographic displays. Indeed, numerous studies support the utilization of 3DVTs over traditional 2D displays [4,5,6,7,8].

Anatomical science is a domain that requires effective display of spatial information. Digital images are present not only in anatomical training and instruction, but also in medical diagnosis, pre-operative planning, and minimally invasive surgery [9]. Cadavers are commonly held as the gold standard for anatomical training as they enable hands-on experience with actual human tissue [10, 11]. However, cadavers can be costly to maintain, challenging to store, and require extra work for instructors [10,11,12]. For these reasons, 3DVTs are considered valuable supplements to traditional anatomical training.

A traditional supplement used in anatomical training is a physical model [12, 13]. Physical models replicate an anatomical system/structure using any variety of materials (e.g., plastic, fiberglass, clay). Physical models afford a learner hands-on experience with anatomical structures through rotation and often disassembly to aid in spatial comprehension. In general, physical models are easy to obtain, highly portable, and provide a useful tool to increase a person’s base knowledge of anatomy at a low cost [14].

In addition, recent technological advances have increased the use of virtual reality in anatomical training. Virtual reality, the computer-generated simulation of three-dimensional objects/environments, provides capabilities similar to those of physical models for rotation, manipulation, and enhanced spatial understanding [15]. Much of the work to date on the use of virtual reality for knowledge acquisition in anatomical training has involved computer-based applications and modules (i.e., “desktop VR”). Research comparing these computer-based models to physical models has found benefits from using physical models for training in anatomical identification [16, 17]. While some work has been done with respect to more immersive virtual reality technologies (e.g., with head-mounted displays) for procedural training, much less has been done regarding the use of immersive VR for anatomical knowledge acquisition [18,19,20].

As physical and immersive virtual reality (VR) models share many features (e.g., both present 3D information and allow interaction and study of multiple views), an important criterion for evaluating their use for educational purposes is the level of workload they impose on the learner. Cognitive load theory [21] suggests increased workload is only detrimental if and when it exceeds a learner’s working memory capacity. Currently, differences in detrimental workload imposed on a learner by 3DVTs such as physical or virtual reality models are unknown. Some 3DVT types (e.g., monoscopic 3D displays, digital holograms) have shown lower workload compared to 2D displays [22, 23], but this may not be true for physical and VR models.

The present work was designed to address the question: What are the differences in workload between physical and virtual reality models used for supporting knowledge acquisition in gross brain anatomy? This research question was addressed through both physiological and subjective measures of workload, with knowledge gain assessed through pre- and post- brain anatomy tests. The present experiment showed that workload differences during anatomical knowledge acquisition may stem from limitations in the typical use of physical models compared to models presented in virtual reality.

2 Methods

2.1 Participants

Sixty-one students from the University of Central Florida (29 Males, 32 Females), between the ages of 18 and 28 (Mdn = 18, IQR = 1), completed the experiment for course credit. All participants provided written informed consent prior to participation and were at least 18 years old with normal, or corrected-to-normal, vision.

2.2 Experimental Design

Participants were assigned to the physical model (PM) or virtual reality (VR) learning condition.

2.3 Materials

Physical Model.

The physical model was presented along with a label sheet defining the numbered structures on the model (see Fig. 1). The numbered labels were added by the researchers and color-coded to best match the colored regions of the virtual model in the VR condition. The physical model (6 × 5.5 × 5.5 inches) contained eight pieces and weighed 2.5 lb. The model could be examined as a whole or in any combination of its eight pieces.

Fig. 1. Plastic physical brain model with eight removable pieces (Axis Scientific) and numbered structures.

Virtual Models.

The HTC Vive virtual reality (VR) system includes a head-mounted display, two controllers (one for each hand), and two “light house” sensors that track the headset and controllers, and project them into the virtual environment. The Vive connects to a desktop computer and displays a virtual environment through SteamVR.

Within the virtual environment, two brain models were displayed on a table (see Fig. 2). One brain model showed the external view of the brain, with label sets corresponding to a ventral and lateral view, while the second model displayed labeled structures from the medial view of the brain. Label sets could be toggled on or off by the participant using the controllers. The VR system allowed the participant to fully rotate the brain models to study the brain structures and spatial relationships from different viewpoints.

Fig. 2. Virtual brain models used in the VR condition.

2.4 Measures

Workload Measures

NASA-TLX. The NASA-Task Load Index (NASA-TLX) [24] was administered following the learning task to assess workload. Six items addressing workload (i.e., mental demand, physical demand, temporal demand, effort, frustration, performance) were presented on 100-point scales in 5-point increments. High scores on each scale indicate high workload; on the performance subscale, high scores correspond to “poor” perceived performance.

MRQ.

The Multiple Resources Questionnaire (MRQ), developed by Boles and Adair [25], was also administered after the learning task to assess workload in terms of cognitive resources used while studying with the 3DVT. Eleven of the original 17 items were utilized for purposes of this experiment (see Table 1). Responses were measured on a 100-point scale, with 0 indicating no usage and 100 indicating extreme usage.

Table 1. MRQ scores by 3DVT condition

Performance Measures

Spatial Anatomy Test. A pre- and post-task Spatial Anatomy Test (SAT) was administered to evaluate knowledge gain. Accuracy and completion time were measured on identification (16 questions; one for each labeled brain structure), spatial knowledge (15 multiple-choice questions), and mental rotation questions (4 questions). Identification questions required the participant to select the correct label for a designated structure. Participants chose from a list of 32 brain structures (16 targets, 16 distractors). Four-alternative multiple-choice questions measured an understanding of spatial relationships between brain structures. In the mental rotation section, a target image of the brain model was provided along with four rotated images (two of which were mirror-images). The participants selected which two of the four new images were simple rotations of the target image. Brain images used for identification and mental rotation test questions were matched to the respective physical or virtual reality model condition.

Regional Saturation of Oxygen (rSO2)

Changes in regional saturation of oxygen (rSO2) in the left and right prefrontal cortex were measured with near-infrared light using the Somanetics INVOS Cerebral/Somatic Oximeter [26]. This non-invasive, indirect neuroimaging measurement, referred to as functional near-infrared spectroscopy (fNIRS), sheds light on cognitive functions such as workload [27].

2.5 Procedure

All participants provided written informed consent prior to participation. Each participant then completed a demographics survey and restrictions checklist, along with an Ishihara Color Blindness Test. These items did not serve as exclusion criteria; rather, they provided background information for use in later analyses. Next, the researcher fitted the participant with the fNIRS sensors. A five-minute resting baseline was recorded as a reference for any changes in oxygenation during the experiment. The participant then completed the pre-task Spatial Anatomy Test (pre-SAT) to assess their prior knowledge of spatial brain anatomy. The participant was then given their assigned 3DVT (either the physical model or virtual reality system) and had ten minutes to use the technology to study the 16 labeled brain structures. The participant had the option to end the ten-minute study period early if they felt confident. Once the study time was complete, the participant completed a series of post-task surveys on the computer, including the NASA-TLX, post-task SAT (identical to the pre-task SAT but in randomized order), and MRQ. Upon completion of the experiment, the fNIRS sensors were removed, and the participant was thanked, granted credit, and dismissed. The experiment took no longer than three hours to complete.
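The resting baseline serves as the reference against which later oxygenation values are compared. A minimal sketch of one way to express this, assuming that "change from baseline" means the mean rSO2 during a task period minus the mean rSO2 during the resting baseline; the source does not spell out the exact computation, and the function name and sample readings below are hypothetical.

```python
from statistics import fmean

def change_from_baseline(baseline_samples, task_samples):
    """Mean rSO2 during the task minus mean rSO2 during the resting baseline.

    ASSUMPTION: a simple difference of means; the source does not state
    how change from baseline was actually computed.
    """
    if not baseline_samples or not task_samples:
        raise ValueError("both sample lists must be non-empty")
    return fmean(task_samples) - fmean(baseline_samples)

# Hypothetical rSO2 readings (percent) for one hemisphere; not data from the study.
baseline = [70.0, 71.0, 69.0]
learning_task = [72.0, 73.0, 71.0]
print(change_from_baseline(baseline, learning_task))  # 2.0
```

A positive value indicates higher oxygenation during the task than at rest, and a negative value indicates lower oxygenation.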

3 Results

3.1 Workload

NASA-TLX.

Independent-samples t-tests were conducted for four of the six subscales of the NASA-TLX (i.e., Mental Demand, Temporal Demand, Effort, and Frustration) to examine the effect of each 3DVT on workload. Nonparametric Mann-Whitney U tests were conducted for the two other subscales (i.e., Physical Demand and Performance), which were determined to come from non-normal distributions. Average scores on each subscale can be seen in Fig. 3.

Fig. 3. NASA-TLX scores by 3DVT condition. Error bars represent standard error.

There was a significant difference between 3DVTs for Mental Demand scores, t(59) = −2.56, p = .012, with higher Mental Demand in the PM (M = 69.31, SD = 19.31) than VR condition (M = 55.94, SD = 20.73). There was also a significant difference between PM and VR conditions for Frustration, with higher Frustration in the PM (M = 33.62, SD = 27.35) than VR condition (M = 20.47, SD = 20.26). There was no significant difference between the two 3DVTs with respect to Physical Demand, U = 415.50, Z = −0.706, p = .480, r = −0.09, Temporal Demand, t(59) = −1.22, p = .226, Effort, t(59) = −1.89, p = .064, or Performance scores, U = 569.00, Z = 1.53, p = .127, r = .20.
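The effect sizes r reported with the Mann-Whitney U tests are consistent with the common convention r = Z/√N, where N is the total sample size. The source does not state which formula was used, so the sketch below is an assumption; it reproduces the reported Physical Demand and Performance values with N = 61. The function name is illustrative.

```python
import math

def mann_whitney_effect_size(z: float, n_total: int) -> float:
    """Effect size r = Z / sqrt(N) (assumed convention, not stated in the source)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return z / math.sqrt(n_total)

# Z values as reported in the text; N = 61 participants in total.
physical_demand_r = mann_whitney_effect_size(-0.706, 61)
performance_r = mann_whitney_effect_size(1.53, 61)
print(f"{physical_demand_r:.2f}, {performance_r:.2f}")  # -0.09, 0.20
```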

MRQ.

Independent-samples t-tests were conducted for 8 of the 11 included MRQ subscales (Manual, Short-Term Memory, Spatial Attentive, Spatial Concentrative, Spatial Emergent, Spatial Quantitative, Visual Lexical, and Visual Phonetic processes) to examine the effect of 3DVT on workload.

Nonparametric Mann-Whitney U tests were conducted for the remaining three subscales (i.e., Spatial Categorical, Spatial Positional, and Tactile processes), which were determined to violate normality assumptions. See Table 1 for each of the 11 MRQ scale scores by 3DVT condition.

There was a significant difference between 3DVTs on Spatial Concentrative process scores, t(59) = −2.94, p = .005, with higher scores in the PM (M = 74.66, SD = 18.29) than VR condition (M = 58.47, SD = 23.95). There was a significant difference between 3DVTs on Spatial Emergent process scores, t(59) = −2.73, p = .008, with higher scores in the PM (M = 66.45, SD = 25.22) than VR condition (M = 48.88, SD = 25.08). There was a significant difference between 3DVTs on Spatial Positional process scores, U = 628.50, Z = 2.38, p = .017, r = .31, such that the PM condition (Mdn = 90.00, IQR = 36) had significantly higher scores than the VR condition (Mdn = 69.00, IQR = 22). No other significant differences were found between 3DVTs for the remaining subscales (all p’s > .089).
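For readers who want to check a reported t statistic against the means and standard deviations, the following sketch computes a pooled-variance independent-samples t from summary statistics. Group sizes are not reported in the text; the split of 32 VR and 29 PM participants is an assumption chosen only because it sums to 61 and gives df = 59. The VR-minus-PM sign convention is likewise inferred from the negative t values.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's independent-samples t (equal variances assumed) from summary statistics.

    Returns (t, df), with t computed as group A minus group B.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df

# Spatial Concentrative scores from the text; group sizes are ASSUMED (not reported).
N_VR, N_PM = 32, 29  # hypothetical split of the 61 participants
t, df = pooled_t(58.47, 23.95, N_VR, 74.66, 18.29, N_PM)
print(f"t({df}) = {t:.2f}")  # t(59) = -2.94
```

Under these assumptions the result agrees with the reported Spatial Concentrative statistic; a different group split or an unequal-variance (Welch) test would shift the value slightly.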

3.2 Performance

Independent-samples t-tests were conducted on pre-task SAT accuracy scores to investigate any differences in prior knowledge between groups. There was no significant difference between PM and VR conditions for overall, identification, or multiple choice SAT accuracy (all p’s > .059). However, there was a significant difference between PM and VR conditions for mental rotation accuracy scores (p = .039).

Separate 2 (3DVT: PM, VR) × 2 (Testing time: pre, post) mixed factor ANOVAs were conducted on average SAT accuracy scores and completion times for the overall test and for identification, multiple choice, and mental rotation questions to examine differences in the level of spatial knowledge acquired between PM and VR conditions.

Level of Spatial Knowledge Acquired.

There was a significant main effect of testing time on each of the following: average overall accuracy, F(1, 59) = 629.23, p < .001, ηp² = .91, average identification accuracy, F(1, 59) = 445.67, p < .001, ηp² = .88, average multiple choice accuracy, F(1, 59) = 396.57, p < .001, ηp² = .87, and average mental rotation accuracy, F(1, 59) = 6.49, p = .013, ηp² = .10. For each measure, post-task SAT scores were significantly higher than pre-task SAT scores (see Fig. 4).
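Partial eta squared (ηp²) can be recovered from each F statistic and its degrees of freedom through the identity ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies that identity to the F values reported for the main effect of testing time; the function name is illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F values for the main effect of testing time, as reported in the text.
reported = {
    "overall": 629.23,
    "identification": 445.67,
    "multiple choice": 396.57,
    "mental rotation": 6.49,
}
for label, f_value in reported.items():
    print(f"{label}: {partial_eta_squared(f_value, 1, 59):.2f}")
```

The computed values round to .91, .88, .87, and .10, matching the effect sizes reported for accuracy.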

Fig. 4. Average overall (A), identification (B), multiple choice (C), and mental rotation (D) accuracy on SAT by 3DVT condition. Error bars represent standard error.

There was no main effect of 3DVT condition on average overall, F(1, 59) = 0.01, p = .942, ηp² < .01, identification, F(1, 59) = 2.23, p = .141, ηp² = .04, or multiple choice accuracy scores, F(1, 59) = 0.47, p = .494, ηp² = .01. There was a significant main effect of 3DVT condition on average mental rotation accuracy, F(1, 59) = 5.73, p = .020, ηp² = .09, with higher accuracy in the PM (M = 34.05, SD = 22.39) than VR (M = 20.31, SD = 22.39) condition.

There was no significant interaction between testing time and 3DVT on any accuracy measure (all p’s > .430).

Completion Times.

There was a significant main effect of testing time on each of the following: average overall SAT completion time, F(1, 59) = 12.06, p = .001, ηp² = .17, multiple choice completion time, F(1, 59) = 66.93, p < .001, ηp² = .53, and mental rotation completion time, F(1, 59) = 15.51, p < .001, ηp² = .21. For overall and multiple choice questions, participants took significantly longer to complete the post-task SAT than the pre-task SAT. Conversely, participants took significantly longer to complete the pre-task SAT than the post-task SAT for mental rotation questions. There was no main effect of testing time on average identification completion time, F(1, 59) = 0.22, p = .642, ηp² < .01 (see Table 2).

Table 2. Completion times on SAT by 3DVT condition and overall

There was no main effect of 3DVT condition on average overall, F(1, 59) = 0.04, p = .836, ηp² < .01, identification, F(1, 59) = 1.31, p = .257, ηp² = .02, multiple choice, F(1, 59) = 0.02, p = .887, ηp² < .01, or mental rotation completion times, F(1, 59) = 0.87, p = .355, ηp² = .02.

There was an interaction between testing time and 3DVT for overall, F(1, 59) = 4.93, p = .030, ηp² = .08, and multiple choice completion times, F(1, 59) = 6.15, p = .016, ηp² = .09. Overall test and multiple choice questions showed significantly longer completion times for post- than pre-test. Average completion times were longer for the VR condition compared to the PM condition at pre-test with the reverse trend at post-test. There was no significant interaction between testing time and 3DVT for identification, F(1, 59) = 3.78, p = .057, ηp² = .06, or mental rotation completion times, F(1, 59) = 2.52, p = .118, ηp² = .04.

3.3 Regional Saturation of Oxygen (rSO2)

A 2 (3DVT: PM, VR) × 2 (Testing Time: pre, post) × 2 (Hemisphere: left, right) mixed factor ANOVA was conducted on average rSO2 values during pre- and post-task SAT. There were no significant main effects or interactions among the included variables (all p’s > .225). A 2 (3DVT: PM, VR) × 2 (Hemisphere: left, right) mixed factor ANOVA was conducted on average rSO2 values during the learning task. There was no main effect of hemisphere on average rSO2 values during the learning task, F(1, 57) = 1.88, p = .175, ηp² = .03. There was a main effect of 3DVT condition, F(1, 57) = 9.17, p = .004, ηp² = .14, with higher change from baseline in VR (M = 1.78, SD = 3.51) than PM (M = −0.65, SD = 2.99). Critically, there was a significant interaction of hemisphere and 3DVT condition, F(1, 57) = 6.90, p = .011, ηp² = .11. Bonferroni pairwise comparisons revealed a significant hemispheric difference in the PM condition (p = .008), but not in the VR condition (p = .367). Pairwise comparisons revealed a significant difference in rSO2 change from baseline in the left hemisphere between 3DVT conditions (p = .001), while this difference between 3DVTs did not reach statistical significance for the right hemisphere (p = .053). See Table 3 for average rSO2 change from baseline values by 3DVT condition and hemisphere.

Table 3. Average rSO2 change from baseline values by 3DVT condition and hemisphere

4 Discussion

The present experiment provides a workload comparison between a physical model (PM) and models presented in virtual reality (VR) for supporting knowledge acquisition in anatomical training. While participants in both three-dimensional visualization technology (3DVT) conditions showed similar levels of knowledge gain, the VR condition decreased test completion time (pre- to post-test) to a greater extent than the PM condition. The PM condition was found to impose a higher degree of workload in terms of mental demand, frustration, and spatial processes, while average rSO2 values suggested higher workload in the VR condition.

These results differ from previous studies comparing physical models to computer-based virtual models, which showed benefits for physical models in anatomical knowledge acquisition [16, 17]. Thus, the more immersive virtual reality condition explored here may provide a closer match to the use of tangible, physical models. Benefits of physical models (e.g., tangible, portable) must be weighed against the benefits of models presented in virtual reality (e.g., immersive, readily accessible online library of models).

The workload differences found in the present experiment may stem from different presentation formats between the physical and virtual reality models. The virtual reality model had structure labels fixed to the models (see Fig. 2) that could be toggled on and off. The physical model differed in that the model was labeled by numbers 1-16 with structure identifiers listed on a sheet of paper next to the model. Thus, studying with the physical model required an extra mental step to connect numbered labels to structure identifiers. It is possible that this difference is responsible for the higher workload seen in the physical model condition. Future work could better match the information presentation formats to provide a more direct comparison between 3DVT types.

Previous studies have shown workload affects a person’s ability to learn from a 3D anatomical model [28,29,30]. The workload differences here were not associated with detriments in knowledge gain, but rather in test completion time. This suggests that the higher workload with the physical model relative to virtual reality models may not have been high enough to hinder overall knowledge gain. In other words, when the workload associated with a knowledge acquisition task is moderate at most, workload differences among 3DVT types may manifest in completion or response times rather than knowledge gain. When task load increases, the selection of 3DVT may become more important for knowledge acquisition. This question is left open to future work.

In sum, educators and trainers should be aware of the capabilities and limitations of 3DVTs to ensure they do not impose a level of workload that hinders knowledge acquisition. The present experiment suggests that the selection of 3DVT for supporting anatomical knowledge acquisition may be made on factors such as cost, accessibility, and interest since minor differences in workload did not hinder learning. Still, further research is needed to better understand the link between workload and performance in spatial knowledge acquisition.