Introduction

Even though virtual reality (VR) technology and the idea of using it for learning and teaching have existed for several decades (e.g., Bricken, 1991), VR applications and environments are currently developing rapidly and, at the same time, becoming affordable to an ever broader audience of potential users (e.g., Buehler & Kohne, 2019; Hilfert & König, 2016). VR applications offer various advantages not only for entertainment but also for different learning contexts (e.g., Computer Science, Pirker et al., 2020), such as experiencing scenarios that are impossible to attend because they happened in the past (e.g., Anne Frank House in VR: Hartmann, 2013; battle of Thermopylae: Christopoulos et al., 2011), scenarios that are too dangerous (e.g., virtual firefighter training: Wheeler et al., 2021), physically impossible to attend or experience in real life (e.g., solar system events: Huang et al., 2022; body transfer: Slater et al., 2010; virtual wings: Egeberg et al., 2016), or too costly in time or money (e.g., virtual tourism: Wagler & Hanus, 2018; virtual scuba diving: Jain et al., 2016).

Given these advantages of VR for interactive and engaging (learning) experiences (e.g., Hamilton et al., 2021), it is reasonable to assume that VR might also enhance recognition and understanding in dynamic domains that comprise highly complex movements, such as the different movement patterns fish use to generate propulsion (e.g., Imhof et al., 2012). One particularly exciting aspect of VR in this domain is its potential to offer multiple perspectives on the dynamics and movements depicted in virtual objects. Thus, the present work aims to answer whether and how interacting with different perspectives on movements affects learning in VR environments.

Learning About Movements: Dynamic Visualizations and 3D-Models

Dynamic (2D) visualizations are frequently used as instructional tools to visualize learning contents comprising continuous processes and dynamic changes because they explicitly show the visuospatial changes over time (e.g., congruence principle, Tversky et al., 2002). Dynamic visualizations have proven highly beneficial for understanding dynamic phenomena and outperform static visualizations, particularly when learners with lower visuospatial ability must be supported (e.g., Höffler, 2010) and when the learning content involves biological movement (e.g., Berney & Bétrancourt, 2016; Höffler & Leutner, 2007; Plötzner et al., 2021).

Dynamic visualizations have been used successfully for learning about human biological movements through observation, such as tying knots or playing the piano (e.g., De Koning et al., 2019; Marcus et al., 2013; Mierowsky et al., 2020), but also for learning about non-human movements (e.g., fish movements; Brucker et al., 2015; Imhof et al., 2011, 2012). However, dynamic visualizations usually provide viewers with only one perspective on the displayed (dynamic) contents or, in some rare cases, with predefined camera movements showing multiple perspectives one after another. Boucheix et al. (2018) showed that mixed camera viewpoints in instructional videos are more effective for learning complex medical hand procedures than single viewpoints or no video instructions. These findings suggest that incorporating diverse perspectives can enhance learning.

One benefit of using 3D-models for educational content is that they empower learners to adjust their viewing proximity, enabling a closer and more detailed examination of specific features of the 3D-model (e.g., Burkhardt et al., 2010). At the same time, such features are typically already prepared to facilitate a comprehensive learning experience (Burkhardt et al., 2010). Moreover, interactive 3D-visualizations can encourage learners to explore the (dynamic) contents from various perspectives by interactively changing viewpoints (e.g., interaction with animation, De Koning & Tabbers, 2011). It has been shown that 3D-models are beneficial compared to 2D-visualizations during learning (e.g., Remmele et al., 2015) and that particularly learners with higher visuospatial ability profit from 3D-models (e.g., Huk, 2006). Furthermore, active exploration of 3D-models resulting in multiple perspectives is an effective method to promote learning (e.g., James et al., 2002; Jang et al., 2017; Keehner et al., 2008).

James et al. (2002) showed that actively exploring 3D-objects in a VR learning phase (i.e., actively rotating the 3D-objects) leads to better recognition performance than passively observing the objects. Jang et al. (2017) came to a similar conclusion: in their study, the learning performance of participants who directly interacted with a 3D-model of the inner ear exceeded that of participants who only passively watched a video of such an exploration. According to Jang et al. (2017), interaction favors the creation of a better internal representation of the depicted virtual objects. Keehner et al. (2008) demonstrated that interactively rotating virtual objects led to better learning performance than passive observation. However, this advantage of interactivity vanished when participants in the passive observing condition were presented with the same visual input (i.e., the same multiple perspectives).

Furthermore, Keehner et al. (2008) showed that participants who passively observed optimal rotations of objects, as well as those who actively interacted with the visualizations in a skillful manner, both performed better than participants who interacted with the visualizations ineffectively. Taken together, the results of Keehner et al. (2008) indicate that merely giving participants the possibility to interact actively with the visualization is not enough; it must also be ensured that participants can perceive the most task-relevant information. Thus, the question arises as to which interaction possibilities in VR environments best allow participants to view 3D-objects from various perspectives.

Multiple Perspectives: Interaction Formats and Hand Proximity

When using 3D-models to depict dynamic virtual objects in VR, there are two different ways to change the perspective (i.e., interaction formats) on the depicted virtual objects interactively. Firstly, one could change the orientation of the virtual object in the scene by manually rotating the object to look at it from different perspectives, whereas the position of the viewer (as well as the background of the scene) stays the same. This is called object movement in the following. Secondly, one could change the position of the viewer (e.g., by walking around the object) and thereby the viewpoint from where one looks at the object (as well as what is seen as the background behind the object), whereas the position of the virtual object in the scene stays the same. This is called viewer movement (or camera movement, respectively) in the following. Interacting with and moving around objects can be considered a biological primary knowledge concept (e.g., Ayres et al., 2009; Geary, 2007; Sweller, 2020; Van Gog et al., 2009). Due to evolutionary processes, we—as humans—can learn and use such biological primary knowledge with minimal effort. This should hold true if the way to interact in VR environments follows the same principles that apply in the real world. However, complex interaction mechanisms, in terms of input devices implemented in a way that does not fit the principles humans are used to in real-life interactions, can be considered secondary knowledge (see Geary, 2007).

One factor that could potentially influence interactions with VR environments is the possible presence of the user’s hands in close proximity to the processed and to-be-learned materials. Hand proximity has been linked to the idea that objects close to the hands are readily considered for immediate interaction and action (e.g., Gozli et al., 2012). The potential positive impact of hand proximity on learning might be explained through at least two principal theoretical frameworks. First, the Attentional Prioritization Theory, as proposed by Reed et al. (2006), suggests that stimuli near the hands receive enhanced visuospatial attention due to the increased activity of visuo-tactile bimodal neurons (Reed et al., 2006; see also Brown et al., 2015; Reed et al., 2010; Tseng et al., 2012). This hypothesis is supported by further research indicating that these neurons possess overlapping tactile and visual receptive fields extending into the space surrounding the hands. Visual receptive fields near the hands exhibit a spatially graded increase in activity, with higher firing rates for objects closer to the hands, as demonstrated in studies by Graziano and Gross (1993) and Graziano et al. (1997). Graziano et al. (1994) speculated that an approximately 20-cm zone around the hands exists where bimodal neurons sensitive to both touch and visual stimuli are activated by visual input.

The second theory is the Dual Visual Pathway Theory, introduced by Gozli and colleagues in 2012. The theory explains that the proximity of objects to the hands selectively enhances the activation of one of two distinct visual processing streams identified by Goodale and Milner (1992). Specifically, it suggests that objects near the hands amplify the activity within the dorsal pathway while simultaneously diminishing the activity within the ventral pathway (Goodhew et al., 2014; Gozli et al., 2012; Taylor et al., 2015). Beyond that, Goodhew and Clarke’s (2016) research indicates that the activation of magnocellular over parvocellular cells is influenced by the demands of spatial attention, suggesting that visual processing is task-dependent.

Critics have noted that no single theory comprehensively explains all observed effects of object proximity to the hands, as evidenced by various studies (e.g., Brockmole et al., 2013; Davoli et al., 2012; Tseng & Bridgeman, 2011). The precise processes driving the impact of hand proximity on perception remain a subject of ongoing discussion and require further exploration through primary research.

Aside from the precise reasoning for the effect, extensive research has highlighted several advantages related to hand proximity, including improved visuospatial processing, heightened attention, and enhanced visuospatial learning (e.g., Brockmole et al., 2013; Brucker et al., 2021; Reed et al., 2006; Tseng et al., 2012). In the context of using dynamic 3D-models in interactive VR environments, this leads to the question of the potential benefits of direct interactions with these models in such settings. If the mere proximity of the hands can already yield positive effects on cognitive processes and learning outcomes, then actively engaging with and manipulating 3D-models in VR may outperform simply moving around the virtual objects without directly manipulating them with the user’s hands. Understanding how direct interaction with VR elements contributes to improved comprehension and learning could have significant implications for designing and optimizing VR-based educational and training experiences. When developing more effective virtual learning environments, and given the potential benefits of multiple perspectives as well as direct manipulation, it also becomes essential to explore and compare these effects in both Desktop-VR and truly immersive VR environments (e.g., Johnson-Glenberg et al., 2021).

Differentiation of VR Environments

Fuchs et al. (2011) highlight the significance of understanding the purpose of virtual reality, which is to facilitate sensorimotor and cognitive activity within a digitally created artificial world. They define VR as follows: “Virtual reality is a scientific and technical domain that uses computer science (1) and behavioral interfaces (2) to simulate in a virtual world (3) the behavior of 3D entities, which interact in realtime (4) with each other and with one or more users in pseudo-natural immersion (5) via sensorimotor channels” (Fuchs et al., 2011, p. 8). Moreover, they emphasize immersion and interaction as key elements within VR environments.

These conditions are met by several types of devices, and Smith (2019) differentiates between desktop and headset VR. According to Smith (2019), the term Desktop-VR covers all experimental setups that use an ordinary computer screen as a display and in which the interaction runs via a computer mouse, keyboard, or trackpad. Alternatively, alongside the term “Desktop-VR”, the term virtual reality learning environments (VRLE) is also used in this context (see Al Amiri et al., 2020; Thorsteinsson, 2013).

The advantages of Desktop-VR applications lie in the low-cost procurement of the required hardware and software as well as the familiarity of the participants with these input devices (thereby avoiding long familiarization and practice phases; Smith, 2019). In Desktop-VR settings, content is displayed on a two-dimensional screen, limiting depth perception to monocular (instead of stereoscopic) cues. Furthermore, the use of mouse and keyboard for interaction significantly differs from the real-world motor activities being simulated (Smith, 2019).

In contrast, according to Smith (2019), in headset VR environments (called immersive VR in the following), the digitally generated environment is projected directly onto a head-mounted display (HMD), which users wear directly in front of their eyes, and the interaction runs in these environments via controllers or hand-based gesture interactions (e.g., Li et al., 2019) represented in the VR as virtual hands. HMDs not only allow for stereoscopic vision by presenting the two eyes with images differing minimally in perspective (as is the case with natural vision due to human anatomy) but also interpret the (head) movements of the user and adapt the visual information accordingly (Smith, 2019).

Immersion and Presence in VR Environments

Independent of the differentiation of VR systems, two central characteristics of VR environments are immersion and presence (see Slater & Wilbur, 1997). Immersion is defined by Slater and Wilbur (1997) as an objectively detectable technological property with the following subcategories: (1) inclusiveness (degree to which the physical reality is excluded), (2) extensiveness (number of sensory modalities represented), (3) surroundedness (spatial extent of the VR), (4) vividness (degree of resolution, fidelity, and quality of the stimuli presented), (5) matching (degree of correspondence between proprioceptive information received by users and information provided by the display), and (6) plot (extent of a plot in VR that is self-contained and clearly distinguishable from reality, including autonomy as the degree of independent behavior of objects and interaction as the degree to which users can modify what is happening).

Recently, the emphasis on the concept of presence has become more pronounced with the advent of immersive technologies (see Cummings & Bailenson, 2016). The growing emphasis on presence is attributable to the necessity of comprehending the psychological effects of recent technological innovations and the superior experiences they can produce. According to Slater and Wilbur (1997), presence describes the subjectively perceived feeling of “being there”. Current research is diverse regarding the conceptual description of presence (Grassini & Laumann, 2020). According to Hartmann and colleagues (2015), two pivotal elements of presence are the social and spatial dimensions. In the present context of one person learning within VR without social interaction, spatial presence is of particular interest: Spatial presence is described as the subjective experience of a user perceiving themselves to be physically placed “within” a digitally mediated space, notwithstanding its origin as a technology-crafted illusion (Hartmann et al., 2015). Schubert et al. (2001) assume that (spatial) presence arises through constructing a spatial-functional-mental model of the non-physical environment. The cognitive processes involved are, on the one hand, the formation of mental representations of one’s own body movements in virtual space and, on the other hand, the suppression of information from physical space that contradicts the impressions from the virtual environment (Grassini & Laumann, 2020; Schubert et al., 2001). Slater and Wilbur (1997) describe presence as an increasing function of immersion. Thus, as the degree of immersion increases, users will likely feel more present in the situation.

However, immersive VR environments might be perceptually more demanding than Desktop-VR environments (see Skulmowski, 2023a). Overall, it can be stated that Desktop-VR environments provide a lower level of immersion and, accordingly, a lower level of presence than immersive (headset) VR environments (Smith, 2019). In line with this, Johnson-Glenberg et al. (2023) showed in one study that more immersive VR conditions (large 2D- and 3D-screens) outperformed a small 2D-screen condition on learning outcomes. However, not only the degree of immersion and level of presence influence the effectiveness of different VR environments; embodiment and the resulting agency might also play a role (as shown by another study with a more detailed result pattern by Johnson-Glenberg, 2018).

Embodiment and Agency in VR Environments

Johnson-Glenberg (2018) further emphasizes a third fundamental aspect of immersive VR environments: embodiment and the resulting agency linked to content manipulation in three dimensions. This aspect pertains to the level of personal empowerment or control that users have over the digital content. Embodiment and agency are fostered by newly developed VR hand controllers enabling meaningful and congruent movements that align with the to-be-learned contents (Johnson-Glenberg, 2018). Skulmowski and Rey (2018) highlight in their taxonomy of embodied learning the importance of both bodily engagement, in terms of the extent to which bodily activity is involved during learning, and task integration, in terms of whether bodily activities are meaningfully related to the respective learning tasks or not. Lachmair et al. (2022) emphasize that the whole body of a person can potentially be involved in immersive VR interactions and that the resulting number of interaction possibilities might overwhelm users.

Makransky and Petersen (2021) introduce the Cognitive Affective Model of Immersive Learning (CAMIL) that also highlights presence and agency as the key benefits of immersive VR, because these aspects are facilitated by immersion, interactive control, and representational fidelity (see also Dalgarno & Lee, 2010). CAMIL identifies six factors influencing learning that can be enhanced by presence and agency: interest, motivation, self-efficacy, embodiment, cognitive load, and self-regulation (Makransky & Petersen, 2021, see also Petersen et al., 2022).

Fischer and Brugger (2011) present a model for understanding cognition that particularly emphasizes the role of the body: the GES framework, an acronym for grounding, embodiment, and situatedness. The GES framework has also recently been used as a cognitive framework that can inform interaction guidelines for user interface design, especially in VR (see Lachmair et al., 2022). Grounding refers to universal principles applicable to all individuals, such as physical laws (e.g., gravity, object impermeability, causality), which constitute a foundational element of our cognitive architecture and are resistant to change. Embodiment captures unique, individual-specific characteristics, including bodily traits (e.g., handedness, sensory-motor proficiency), highlighting the variability in cognitive experiences based on physical constitution. Lastly, situatedness encompasses the context-specific experiences arising from interaction within a particular environment, underscoring how situational factors profoundly shape cognitive outcomes (see Fischer & Brugger, 2011).

Research conducted thus far has yielded inconclusive results when comparing Desktop-VR and immersive VR technologies. A recent meta-analysis demonstrated beneficial effects for Desktop-VR (e.g., Cromley et al., 2023), whereas another meta-analysis indicated advantages for immersive VR (e.g., Wu et al., 2020). Johnson-Glenberg et al. (2021) investigated whether learning in a 3D VR environment is superior to learning through a 2D monitor of a personal computer (PC): participants on both platforms (3D VR versus 2D PC) learned either by watching a playback video (low level of embodiment and agency) or by using a controller or mouse to manipulate the content (high level of embodiment and agency). Results revealed that higher embodiment and agency in terms of active manipulation were beneficial. However, immersive VR (3D VR) did not simply outperform Desktop-VR (2D PC) platforms, because the low embodied VR group performed significantly worse than the three other groups. In contrast, the high embodied VR group demonstrated the highest level of learning and retention (Johnson-Glenberg et al., 2021). This discrepancy may help elucidate the inconclusive findings obtained thus far. Nonetheless, higher levels of embodiment and agency in VR need to be explored further.

There are various forms of interactive control that enable embodiment and agency in different ways (i.e., object movement versus viewer movement), and they need to be examined to identify differences between them as well as differences across alternative platforms, as it cannot be assumed that results from one platform apply directly to the other. The exploration of interactive control of perspectives in different VR environments constitutes a major objective of this paper. Through this investigation, we address how the immersive nature of VR environments influences users’ ability to engage with and manipulate perspectives within the VR environment, with particular emphasis on its implications for learning.

Beyond interaction possibilities and the differentiation of VR environments, visuospatial ability plays a pivotal role in learning about dynamic phenomena, as we will outline in the following.

Learners’ Visuospatial Ability and Learning About Movements

When considering learning about movements with dynamic visualizations, 3D-models, and VR environments, learners’ visuospatial ability emerges as a crucial factor to be considered. Understanding how learners’ characteristics in terms of their visuospatial ability interact with different interaction formats and how they influence the learning process within different VR applications can provide valuable insights for designing learning environments. As Mayer and Sims (1994) described, visuospatial ability encompasses mental manipulation and visualization of spatial information about objects in two or three dimensions. This includes mentally rotating objects, understanding spatial relationships, and interpreting 3D-representations (e.g., Hegarty & Waller, 2005). These abilities are crucial for observing continuous dynamic changes of 3D-models from different perspectives in VR environments (e.g., Sun et al., 2019). Extensive research on learners’ visuospatial ability (e.g., Höffler, 2010) consistently demonstrates that learners with higher visuospatial ability outperform those with lower visuospatial ability when engaged in learning with visualizations. This suggests that visuospatial abilities are critical in effectively comprehending and utilizing visual learning materials. Moreover, previous research reveals that visuospatial ability may moderate learning effectiveness with different instructional formats (e.g., dynamic versus static visualizations, e.g., Höffler, 2010; 2D- versus 3D-models, e.g., Huk, 2006; or different VR environments, e.g., Sun et al., 2019). There are two alternative interaction assumptions: the ability-as-compensator versus the ability-as-enhancer assumption (e.g., Höffler, 2010).

According to the ability-as-compensator assumption (Höffler, 2010), learners with higher visuospatial ability may not require well-designed instructional formats, as they achieve high learning outcomes even when exposed to suboptimal instructional formats, whereas learners with lower visuospatial ability may struggle when faced with suboptimally designed instructions. As a result, specific instructional formats (e.g., dynamic visualizations) act as compensatory tools for learners with lower visuospatial ability, enabling them to achieve learning outcomes similar to those of learners with higher visuospatial ability (e.g., Lee, 2007). In contrast, the ability-as-enhancer assumption suggests that learners with higher visuospatial ability are particularly able to utilize specific instructional formats, even if these formats are less optimally designed (for example, due to their necessary complexity). Their enhanced visuospatial capabilities facilitate more effective engagement with such instructional formats, which could result in improved learning outcomes. For example, their advanced imagination may enable a more accurate evaluation of how a model can be rotated to achieve an optimal viewing position. In sum, considering learners’ visuospatial ability is relevant when studying different interaction formats in different VR environments during learning about movements.

Present Research

The present research investigated the effectiveness of different interaction formats with dynamic 3D-models, allowing observation from multiple perspectives in desktop and immersive VR settings. To achieve these objectives, we addressed two different interaction formats (object movement versus camera/viewer movement) in two separate studies. Study 1 focused on a Desktop-VR environment, whereas Study 2 focused on an immersive VR environment. Additionally, in Study 1, we included a no interaction condition, in which the 3D-models were displayed dynamically on the screen but participants had no opportunity to interact with them, being able to influence neither the environment nor the object. Beyond the effect of the interaction formats, the potential moderating influence of learners’ visuospatial ability during learning through different interaction formats was explored.

With regard to the interaction format object movement, the possibility to directly manipulate and rotate the virtual 3D-models allowed participants to change the viewpoint and thereby observe different perspectives on the depicted dynamic content, while the viewer stayed in the same position. In contrast, with the interaction format camera/viewer movement, the possibility to move the camera or the viewer’s body around the virtual 3D-model allowed participants to change the viewpoint and thus the perspective, while the position and orientation of the 3D-model in the virtual scene stayed the same.

Within Study 1, both interaction formats were controlled via the mouse as input device. In contrast, in Study 2, the interaction formats were controlled in the object movement condition via VR controllers and in the viewer movement condition via body movements by walking around the 3D-model in the virtual scene. We have formulated the following hypotheses.

Hypothesis 1 (H1): object movement > camera/viewer movement > no interaction

We were interested in the effectiveness of the two interactive formats in a desktop and a truly immersive VR environment. We expected that the possibility to directly manipulate the orientation of the virtual 3D-objects (object movement) would outperform the possibility of changing the perspective from which one looks at the 3D-models by changing the viewpoint of the camera or the viewer (camera/viewer movement), due to possible effects of hand proximity: incorporating the dynamic 3D-models into the peripersonal space around the hands possibly frees additional cognitive resources for learning and thus results in a more accurate dynamic mental model. Moreover, we expected both interactive formats to outperform a condition in which no interaction is allowed (i.e., the dynamics must be observed from a given predefined perspective).

Hypothesis 2 (H2): higher visuospatial ability > lower visuospatial ability

We expected a main effect for learners’ visuospatial ability: learners with higher visuospatial ability will be better able to mentally manipulate and understand the 3D-models and virtual scenes than learners with lower visuospatial ability. This should improve learning outcomes for learners with higher visuospatial ability.

Research Question 1 (R1): moderating role of visuospatial ability – enhancer or compensator

Additionally, we explored how learners’ visuospatial ability might moderate the effectiveness of various interaction formats, examining whether these abilities act more as compensators or enhancers according to the respective assumptions.

Finally, we were interested in whether these effects are the same in different VR environments, exploring them in a Desktop-VR in Study 1 and an immersive VR in Study 2.

Study 1

Method

Participants and Design

In total, 164 participants were recruited through an online platform (www.prolific.com; selection criteria: native German speakers; age 18–40 years), from which they were redirected to IWM-study (Klemke & Halfmann, 2020), a specially developed online test environment accessible through a browser (compatible with all commonly used web browsers). The questionnaires and the Unity-based Desktop-VR learning environment were presented to the participants there. Due to technical issues, four individuals had to be excluded from the analysis, resulting in a final sample size of 160 participants (99 males, 57 females, 4 diverse; 140 right-handed, 18 left-handed, 2 ambidextrous; 74 university students, 68 employed professionals, 17 identified as either job seekers, unemployed, pupils, or housewives/husbands, 1 missing for occupation) with an average age of 27.26 years (SD = 5.53). Participants were reimbursed £8.75 for approximately 45 min. The study was approved by the local ethics committee of the Leibniz-Institut für Wissensmedien (IWM, LEK 2020/024). Participants were randomly assigned to one of three conditions in a between-subject design investigating the influence of the first independent factor interaction format with three factor levels: camera movement (n = 54), object movement (n = 53), and no interaction (n = 53; see the detailed description of the three conditions below in the section “Interaction Formats”). Moreover, participants’ visuospatial ability was assessed as a second continuous independent factor (higher visuospatial ability was operationalized as one standard deviation above the mean, lower visuospatial ability as one standard deviation below the mean). As dependent variables, we measured pictorial recognition (easy, medium, difficult), factual knowledge about movement patterns (see Appendix 1 for the factual knowledge questions), feeling of presence, and motion sickness.

Materials and Domain

Within the Desktop-VR learning environment, participants were exposed to highly realistic 3D-models of four fish, visualizing the movement patterns these fish use for propulsion. Identifying these movement patterns in real fish is challenging, given that fish may also deploy other movements (e.g., for navigational purposes). This complex dynamic learning domain has already been investigated and tested in previous studies (e.g., Imhof et al., 2012), in which dynamic 2D-visualizations have been used that were rendered based on the 3D-models used in the present study. Participants were asked to study the 3D-models to learn how to classify fish based on the used body parts (i.e., several fins or the body itself) and how they move in the three-dimensional space (i.e., paddle-like or wave-like). Four different movement patterns were shown (i.e., subcarangiform, balistiform, labriform, and tetraodontiform). In this study, the 3D-models were integrated into a virtual underwater environment containing no elements except a sandy bottom and the surrounding water. In addition to the 3D-models, verbal explanations of the movement patterns were provided. These spoken texts were consistent across all experimental conditions and covered the principle of locomotion (undulation or oscillation), physical parameters of the movement (amplitude and wavelength), body parts involved, and the typical fish species associated with each movement pattern. The verbal information was not synchronized with the movements of the 3D-models, as it referred to more general characteristics of the movement patterns and was not time-dependent.

Interaction Formats

The three different interaction formats, object movement, camera movement, and no interaction, were implemented as follows. In the object movement condition (see Fig. 1), participants were given the ability to manipulate the orientation of the 3D-model in the virtual learning environment by rotating it via mouse interaction (movement of the virtual object, no movement of the camera/viewer).

Fig. 1

These static screenshots show different perspectives in the object movement condition, obtained by manipulating the object’s orientation. The background remained constant as a point of reference, because only the object was rotated while the camera position stayed fixed.

To modify the orientation, users needed to position the mouse cursor over the 3D-model and then press and hold the left mouse button to “lock” the cursor in place. Releasing the mouse button preserved the current orientation of the 3D-model, ensuring it was retained in the last adopted position. Because the 3D-models needed to be rotatable around three axes, but the mouse as an input device operates only in a two-dimensional space, we implemented the following types of rotation to enable participants to rotate the model freely around all three axes (Fig. 2; a code sketch of this mapping follows the figure caption).

1. Rotation around the pitch axis: By holding down the right mouse button and circularly moving the mouse, the 3D-model rotated in the vertical plane, like being mounted on a steering wheel.

2. Rotation around the roll axis: By holding down the left mouse button and moving the mouse up and/or down, the 3D-model rotated away from and/or towards the viewer. This action figuratively rolled the 3D-model.

3. Rotation around the yaw axis: By holding down the left mouse button and moving the mouse to the left and/or right, the 3D-model rotated like being placed on a turntable.

Fig. 2

Representation of the pitch, roll, and yaw axes around which the model could be rotated. The initial position of the observer relative to the object was lateral.
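To illustrate how such a mapping from two-dimensional mouse input to three rotation axes can work, the following minimal sketch accumulates mouse deltas into a rotation matrix. It is a simplified reconstruction under our own assumptions (the function names, the pixel-to-radian gain, and the approximation of the circular right-button motion by its horizontal component are ours), not the study’s actual Unity implementation.

```python
import numpy as np

def axis_rotation(axis, angle):
    """Rotation matrix about a unit axis (Rodrigues' formula)."""
    x, y, z = axis
    k = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def update_orientation(r, dx, dy, left_down, right_down, gain=0.01):
    """Accumulate mouse deltas (dx, dy, in pixels) into the model's
    rotation matrix r, following the three rotation types above.
    Screen axes: x points right, y up, z towards the viewer; given the
    lateral starting view, rotations about z, x, and y correspond to the
    described pitch ("steering wheel"), roll, and yaw ("turntable") moves."""
    if right_down:
        # Circular mouse motion (here approximated by its horizontal
        # component) rotates the model in the frontal plane.
        r = axis_rotation((0.0, 0.0, 1.0), gain * dx) @ r
    elif left_down:
        # Up/down tips the model away from or towards the viewer ...
        r = axis_rotation((1.0, 0.0, 0.0), gain * dy) @ r
        # ... and left/right spins it like a turntable.
        r = axis_rotation((0.0, 1.0, 0.0), gain * dx) @ r
    return r
```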

In the camera movement condition (see Fig. 3), participants could manipulate the camera’s perspective on the 3D-model within the virtual learning environment via mouse interaction (this time: movement of the camera/viewer, no movement of the virtual object).

Fig. 3

These static screenshots of different perspectives in the camera movement condition were obtained through the manipulation of the camera’s position and orientation. The background changed as a point of reference, because the camera position changed while the fish’s position in the virtual scene remained fixed.

To adjust the camera’s viewpoint, and thereby the perspective from which the viewer looked at the virtual object, the mouse cursor had to be positioned anywhere in the virtual scene and “locked” to it by pressing and holding the left mouse button while simultaneously moving the mouse (up and down as well as left and right) to change the camera’s position. Releasing the left mouse button maintained the current perspective. The camera moved on an invisible sphere centered on the 3D-model, so its distance to the model remained constant, and the camera was always oriented towards the fish (see the sketch below). Furthermore, the camera, and thereby the view of the whole scene, could not be rotated around the pitch axis, so the bottom could never appear sloped or upside down; therefore, no implementation of a third rotation axis was needed in this condition. In the no interaction condition, the perspective on the 3D-models corresponded to the perspective used in previous studies (e.g., Imhof et al., 2012), and participants were not able to interact with the 3D-models in the Desktop-VR environment.
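As a minimal sketch of this orbit constraint, the camera position can be computed from two angles in spherical coordinates. The function name, radius, center, and elevation limits below are hypothetical placeholders, not the study’s actual implementation.

```python
import math

def orbit_camera(azimuth_deg, elevation_deg, radius=3.0, center=(0.0, 1.0, 0.0)):
    """Place the camera on an invisible sphere centered on the 3D-model.

    Horizontal mouse movement changes the azimuth, vertical movement the
    elevation; the camera always looks at the center and never rolls, so
    the sandy bottom stays level. Clamping the elevation (assumed limits)
    keeps the view from tipping over the poles of the sphere.
    """
    elevation_deg = max(-80.0, min(80.0, elevation_deg))
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    x = center[0] + radius * math.cos(el) * math.sin(az)
    y = center[1] + radius * math.sin(el)
    z = center[2] + radius * math.cos(el) * math.cos(az)
    return (x, y, z)  # camera position; orientation: look at the center
```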

Measures

Dependent Variables for Learning Outcomes: Recognition and Factual Knowledge

First, we administered a movement pattern recognition test comprising 45 dynamic multiple-choice items to assess recognition as a learning outcome. These items (see Fig. 4) consisted of underwater videos of real fish in their natural environment performing one of the four to-be-learned or distractor movement patterns. The videos were displayed system-paced and lasted 7 s each. After 7 s, the video disappeared, leaving only the five answer options visible (one for each of the four to-be-learned movement patterns and the additional option “none of the above”). To correctly identify the movement pattern, learners had to identify the used body parts relevant for propulsion as well as how they were moved.

Fig. 4

Example screenshots of easy, medium, and difficult recognition items (from left to right)

The videos were categorized into three difficulty levels: easy, medium, or difficult. The selection of test videos followed the methodology outlined in the study by Imhof et al. (2012), which categorized the videos based on factors such as visibility of the movement pattern, absence of secondary movements (e.g., for navigation), and similarity to other movement patterns. Easy videos showed the relevant movement pattern solely and continuously (6 videos). Medium videos continuously showed the relevant movement, but additional navigational movements occurred (20 videos). Difficult videos did not continuously show the relevant movement pattern and included several additional movements (19 videos). The score for each difficulty level was calculated by summing up the number of correct answers in the category. Each correct answer was awarded one point, and the percentage correct was calculated.
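For illustration, the following sketch shows one way this per-level scoring could be computed; the data layout (mappings from item IDs to chosen options and to difficulty labels with answer keys) is our assumption, not the study’s actual pipeline.

```python
from collections import defaultdict

def score_recognition(responses, items):
    """Percent correct per difficulty level, one point per correct answer.

    responses: {item_id: chosen_option}
    items:     {item_id: (difficulty, correct_option)}
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for item_id, (difficulty, correct) in items.items():
        totals[difficulty] += 1
        hits[difficulty] += int(responses.get(item_id) == correct)
    # e.g., returns {"easy": ..., "medium": ..., "difficult": ...} in percent
    return {d: 100.0 * hits[d] / totals[d] for d in totals}
```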

Secondly, the study assessed participants’ factual knowledge about locomotion principles employed in the movement patterns (undulation or oscillation), the body parts involved, and the physical parameters of the movement (such as amplitude). This knowledge was evaluated through eight multiple-choice questions (see Appendix 1). The necessary information for answering these questions was provided in the verbal materials of the study (consistent across all experimental conditions). Participants were informed that multiple correct answers could exist without specifying the exact number. Participants were asked two questions for each movement pattern, each offering four possible answer options, of which two were correct.

Assessment of Learners’ Visuospatial Ability

We used a short version of the paper folding test to assess the second independent factor—learners’ visuospatial ability (PFT, Ekstrom et al., 1976). Following Blazhenkova and Kozhevnikov (2009, p. 640), this test measures the ability to form representations of “object location, movement, spatial relationships, and transformations.” This makes the PFT well-suited to cover the domain of fish movements. The short version of the PFT comprises ten multiple-choice items, where participants view depictions of papers being folded and punched. Participants must choose the correct answer from five options showing unfolded papers with punches in various positions. Correct answers earn one point (max. 10 points). Participants have 3 min to complete the task.

Additional Variables: Familiarity with the Domain, Motion Sickness, and Presence

In line with prior studies in this domain (e.g., Brucker et al., 2015), we assessed, besides demographic data (age, gender, handedness, occupation), participants’ familiarity with the domain (as an indicator of prior knowledge that might affect learning outcomes in VR environments, e.g., Dengel & Mägdefrau, 2021). The questions on familiarity with the domain asked about participants’ biology school education; experiences with marine activities like diving, snorkeling, swimming, rowing, or owning an aquarium; interest in related topics such as fish, biology, zoology, physics, aircraft construction, and shipbuilding; as well as their exposure to related media like documentaries, books, or aquarium visits. The questionnaire comprised 22 questions and was developed with input from a domain expert. Participants received one point for each indication of familiarity with the domain and could earn additional points for more extensive diving and/or snorkeling experience (max. 32 points).

Simulator motion sickness caused by the VR environment was measured with the Virtual Reality Sickness Questionnaire (VRSQ) by Kim et al. (2018). The VRSQ comprises nine items, all rated on a five-point Likert scale. Participants’ perceived presence was measured with the Igroup Presence Questionnaire (IPQ) developed by Schubert et al. (2001). The IPQ consists of 14 questions in total: one question captures general presence, whereas the remaining questions assess three specific components of presence: spatial presence, involvement, and realness of the environment (on seven-point Likert scales specific to each question).

Procedure

Participants were recruited through the online platform Prolific (London, UK). They were required to use a computer or laptop (touch devices were excluded) equipped with active loudspeakers and an external mouse (a touchpad was not permitted). Participants who confirmed these criteria were directed from Prolific to IWM-study. The study comprised a preliminary phase, a learning phase in the Desktop-VR environment (where the experimental manipulation took place), the assessment of additional variables, and a testing phase. Participants navigated through the study using buttons. In the preliminary phase, participants were provided with general information about the study and were required to confirm their participation eligibility and informed consent. Following the initial instructions, participants were instructed to proceed with the study in full-screen mode for optimal viewing. Before starting the study itself, a VR check of the Unity application was conducted to ensure that the hardware met the requirements of the Unity application. If the hardware did not pass the VR check, the participants could not proceed with the study and were redirected to Prolific. If the VR check was passed, a questionnaire was administered (demographics, familiarity with the domain). Additionally, visuospatial ability was measured with the PFT (Ekstrom et al., 1976), from which participants were automatically directed to the next page after 3 min.

Following the preliminary phase, participants were given instructions on how to proceed during the learning phase. They were informed about the respective interaction format in the Desktop-VR depending on their experimental condition. The learning phase started with an exercise to familiarize participants with the respective interaction format. For this purpose, participants were provided with a 3D-model of a submarine. In both interactive conditions (object movement and camera movement), participants were allowed to practice their respective way of interacting (object movement: manipulating the orientation of the 3D-model; camera movement: changing the camera’s viewpoint) while receiving a 72-s verbal explanation about submarines. This practice task lasted at least 2 and at most 5 min: after 2 min, participants could choose to proceed with the study by clicking the respective next button. Participants in the no interaction condition saw the submarine without interaction possibilities for 2 min and heard the verbal explanation.

Subsequently, participants in all experimental conditions were provided with a written introduction to “underwater locomotion”, covering the basic principles of locomotion (undulation, oscillation), the body parts possibly involved in different movements, and some physical parameters of fish movements (amplitude, wavelength). Then, the four 3D-models of the to-be-learned movement patterns were presented one after another each for 2 min, accompanied by a 72-s verbal explanation that provided information about the specific movement pattern, body parts involved, and its physical parameters. Participants could pause briefly between each movement pattern before clicking the next button to see the following pattern. Depending on their respective experimental interaction format conditions, participants could or could not interact with the 3D-models during the presentation of the movement patterns. Following the learning phase, participants had to answer the VRSQ (Kim et al., 2018) and the IPQ (Schubert et al., 2001). In the following testing phase, learning outcomes in terms of recognition and factual knowledge were assessed. Finally, participants were informed about the study goals and redirected to Prolific.

Results

All data (in both studies) were analyzed with the statistical analysis software SPSS® (version 28, IBM Corp. released in 2022). A p-value of .050 was used to indicate significance and a p-value of .100 was used to indicate a (non-significant) tendency in all analyses (in both studies). Regarding the comparability between the experimental conditions, we found no significant differences (all ps > .282) between groups (see Appendix Table A1 for means and values) in participants’ age, gender, handedness, occupation, familiarity with the domain, or visuospatial ability.

Learning Outcomes

Recognition performance in terms of easy, medium, and difficult recognition, as well as factual knowledge were analyzed with ANCOVAs with the factor interaction format (object movement versus camera movement versus no interaction) and the second continuous factor learners’ visuospatial ability. The continuous factor learners’ visuospatial ability was z-standardized. The interaction term between interaction format and learners’ visuospatial ability was inserted into the model of all analyses to test the moderating effect of learners’ visuospatial abilities on the different interaction formats. Higher visuospatial ability was defined as the mean plus one standard deviation, whereas lower visuospatial ability was defined as the mean minus one standard deviation. Moreover, learners' familiarity with the domain was added as a covariate (without interaction terms) into the model of all analyses (see Table 1 for adjusted means and standard errors).
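The analyses were conducted in SPSS; purely as an illustration of the model specification, the following sketch shows how an equivalent analysis could be set up in Python with statsmodels’ formula API. File and variable names are hypothetical assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import zscore

df = pd.read_csv("study1.csv")        # hypothetical data file
df["z_vsa"] = zscore(df["vsa"])       # z-standardize visuospatial ability

# Interaction format (3 levels) crossed with the continuous z_vsa factor;
# familiarity with the domain enters additively (no interaction terms).
model = smf.ols(
    "difficult_recognition ~ C(interaction_format) * z_vsa + familiarity",
    data=df,
).fit()
print(model.summary())

# "Higher"/"lower" visuospatial ability correspond to evaluating the
# fitted model at z_vsa = +1 and z_vsa = -1 (mean plus/minus one SD).
```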

Table 1 Means and standard errors (in parentheses) for easy, medium, and difficult recognition and factual knowledge (in % correct) for higher and lower visuospatial ability (VSA) in all three experimental conditions of Study 1

There was no significant main effect of interaction format on easy recognition, medium recognition, difficult recognition (all Fs < 1), or factual knowledge (F(2, 153) = 1.128, MSE = 139.033, p = .326, ns, η2p = .002). Contrary to our hypothesis H1, we observed no overall positive effect of interactivity or direct manipulation in the object movement condition compared to the camera movement condition on learning outcomes.

Regarding learners’ visuospatial ability, there was a significant main effect on all three difficulty levels of recognition performance (easy: F(1, 153) = 4.122, MSE = 740.494, p = .044, η2p = .026; medium: F(1, 153) = 4.299, MSE = 370.965, p = .040, η2p = .027; difficult: F(1, 153) = 12.111, MSE = 243.692, p < .001, η2p = .073) as well as on factual knowledge (F(1, 153) = 9.182, MSE = 139.033, p = .003, η2p = .057). Participants with higher visuospatial ability achieved higher recognition performance and factual knowledge than those learners with lower visuospatial ability. This finding supports our hypothesis H2, suggesting that higher visuospatial ability positively influences learning performance.

Moreover, there was a significant interaction between interaction format and learners’ visuospatial ability for difficult recognition performance (F(2, 153) = 3.328, MSE = 243.692, p = .038, η2p = .042). Bonferroni-adjusted pairwise comparisons revealed that even though the overall interaction pattern reached statistical significance (p = .038; see Fig. 5), the pairwise post-hoc comparisons were not statistically significant (all ps > .102). Additionally conducted LSD-adjusted pairwise comparisons revealed that participants with higher visuospatial ability achieved higher values for difficult recognition when using the camera movement condition compared to the no interaction condition (p = .034), but not compared to the object movement condition (p = .120), whereas there were no differences for participants with lower visuospatial ability for difficult recognition (comparison camera movement and no interaction condition, p = .204; comparison camera movement and object movement, p = .162). There were no other significant main effects or interactions concerning learning outcome measures (all ps > .317).

Fig. 5

Interaction between interaction format and learners’ visuospatial ability on difficult recognition in Study 1

Motion Sickness and Presence

Motion sickness as well as the four subcategories of presence (general presence item, spatial presence, involvement, and realness) were also analyzed with ANCOVAs with the factor interaction format (object movement versus camera movement versus no interaction) and the second continuous factor learners’ visuospatial ability (z-standardized with interaction term) as well as learners’ familiarity with the domain as a covariate (without interaction terms; see Table 2 for adjusted means and standard errors).

Table 2 Means and standard errors for motion sickness, general presence item, spatial presence, involvement, and realness for higher and lower visuospatial ability (VSA) in all three experimental conditions of Study 1

There was no main effect of interaction format on motion sickness (F(2, 153) = 1.151, MSE = .289, p = .319, ns, η2p = .015). However, there was a main effect of learners’ visuospatial ability on motion sickness (F(1, 153) = 5.055, MSE = .289, p = .026, η2p = .032), such that learners with higher visuospatial ability experienced less motion sickness than learners with lower visuospatial ability. In addition to this main effect, there was a (non-significant) tendency for an interaction between interaction format and learners’ visuospatial ability on motion sickness (F(2, 153) = 2.850, MSE = .289, p = .061, ns, η2p = .036; see Fig. 6). Bonferroni-adjusted pairwise comparisons revealed that for learners with lower visuospatial ability the no interaction condition led to higher motion sickness than the camera movement condition (p = .036), but not than the object movement condition (p = .251), whereas there were no differences for learners with higher visuospatial ability (all ps > .999).

Fig. 6

(Non-significant) tendency for an interaction between interaction format and learners’ visuospatial ability on motion sickness in Study 1

There was a significant main effect of interaction format on spatial presence (F(2, 153) = 4.604, MSE = 1.150, p = .011, η2p = .057). Bonferroni-adjusted pairwise comparisons showed that camera movement led to higher spatial presence ratings than the no interaction condition (p = .009), whereas there was no difference between object movement and no interaction (p = .427). Moreover, there was a (non-significant) tendency for an effect of interaction format on the general presence item (F(2, 153) = 2.700, MSE = 1.989, p = .070, ns, η2p = .034). Bonferroni-adjusted pairwise comparisons showed that object movement tended to lead to higher ratings on the general presence item than the no interaction condition (p = .077), whereas there was no difference between camera movement and no interaction (p = .325).

Moreover, there was a significant main effect of familiarity with the domain on the general presence item (F(1, 153) = 9.361, MSE = 1.989, p = .003, η2p = .058) and on spatial presence (F(1, 153) = 4.884, MSE = 1.150, p = .029, η2p = .031): higher familiarity with the domain led to higher presence ratings. There were no other main effects or interactions concerning the four subcategories of presence (all ps > .266).

Discussion of Study 1

In the first study, we set out to investigate the influence of different interaction formats (object movement, camera movement, no interaction) on learning performance in a Desktop-VR setting. Our hypothesis H1 stated that object movement would lead to better learning outcomes than camera movement and that both interactive formats would outperform the no interaction condition. Additionally, hypothesis H2 stated that learners with higher visuospatial ability would outperform those with lower visuospatial ability across all learning outcome measures (easy, medium, and difficult recognition, and factual knowledge).

However, contrary to our hypothesis H1 on interaction format, we did not find a significant main effect of interaction format on learning performance. The absence of this main effect led us to conclude that there is no evidence in support of a near-hand effect in our Desktop-VR study with mouse interaction. In Study 1, “hand proximity” was operationalized as holding the mouse cursor directly on the 3D-models of the fish (object movement) rather than using the mouse on the whole virtual scene displayed on the monitor (camera movement). The results do not indicate that the mouse interaction gave the impression of directly “touching” the 3D-objects in the Desktop-VR (e.g., in terms of using the mouse as a tool to enlarge the distance that we can reach, see Brockmole et al., 2013). Furthermore, it is worth noting that participants in the camera movement condition may have placed the mouse cursor next to or even on the 3D-models of the fish, potentially affecting the results.

A possible explanation for the lack of superiority of the interactive conditions over the no interaction condition could be that the restricted perspective used in the no interaction condition had already been very well chosen. This perspective was selected for prior studies by a domain expert, and it might have been well suited for learning about fish movement patterns. Allowing participants to interact with the 3D-objects or the virtual scene to change the perspective might therefore have masked the relevant movements. In this domain, the saliency of visual details and movements does not necessarily indicate how relevant a certain aspect is for recognizing the movement pattern relevant for propulsion (see also Lowe, 2003), so it might be disadvantageous for learners to orient the 3D-models or the virtual scene according to more salient aspects of the 3D-models or the depicted dynamics.

In relation to learners’ visuospatial ability, our findings align with our hypothesis H2, demonstrating that learners with higher visuospatial ability outperformed those with lower visuospatial ability across all learning outcome measures. This finding is consistent with previous research on learning with dynamic visualizations (e.g., Höffler, 2010), 3D-models (e.g., Huk, 2006), and virtual environments (e.g., Lee & Wong, 2014; Sun et al., 2019), as well as studies specifically focusing on classifying fish movement patterns (e.g., Imhof et al., 2011).

For the difficult recognition task, we observed an interesting interaction between learners’ visuospatial ability and the interaction format (R1) of the Desktop-VR. Accordingly, the main effect of visuospatial ability on difficult recognition has to be interpreted in light of this significant interaction. While the object movement condition and the no interaction condition led to similar learning outcomes regardless of learners’ visuospatial ability, the camera movement condition showed a more differentiated pattern. Learners with higher visuospatial ability achieved better results on difficult recognition items when controlling the camera viewpoint in the virtual scene, whereas learners with lower visuospatial ability achieved worse results on these items when allowed to control the camera viewpoint. This finding suggests that visuospatial ability plays a crucial role in learning in VR environments (e.g., Sun et al., 2019) and moderates the effectiveness of different interaction formats. The ability-as-enhancer assumption (e.g., Höffler, 2010) seems to align best with this pattern: having high visuospatial ability at their disposal enabled learners to learn better when they could change perspective by controlling the camera viewpoint on the invisible virtual sphere along two axes (similar to interacting with, for example, a real globe), but not when they could directly manipulate the orientation of the 3D-objects using complex mouse interactions. This result pattern occurred only for the most demanding difficult items and further speaks against an effect of hand proximity (for learners with higher visuospatial abilities) during mouse interaction in Desktop-VR environments. The mouse interactions during object movement were very complex (not to say very unnatural from an embodied perspective) because the virtual objects had to be rotated around three axes, whereas the mouse as an input device operates only in a two-dimensional space (see also Fröhlich & Plate, 2000). The learners may not have extensively utilized this complex interaction pattern (object movement), which could account for its similar outcome to the no interaction condition.

Moreover, higher visuospatial ability seemed to protect participants from simulator motion sickness, whereas learners with lower visuospatial ability tended to experience motion sickness, particularly in the no interaction condition. This indicates that visuospatial ability might play a compensating role regarding motion sickness across different interaction formats. Regarding the relationship between motion sickness and the level of interaction, often referred to as navigational control within an application, it has been shown that greater control over the environment can reduce simulator sickness (e.g., Stanney & Hash, 1998). Further research has explored the occurrence of motion sickness (e.g., Boletsis & Cedergren, 2019; Coomer et al., 2018; Rebenitsch & Owen, 2016). However, a common limitation among these studies is that visuospatial ability as an individual factor is often neglected. In the present study, the higher motion sickness values did not result in worse learning outcomes for learners with lower visuospatial ability in the no interaction condition. Nonetheless, having control over the Desktop-VR might play a stronger role for learners with lower visuospatial ability.

Familiarity with the domain did not foster learning but helped participants to feel more present in the Desktop-VR environment. Moreover, both interaction conditions led to higher presence than the no interaction condition: camera movement led to higher spatial presence, and object movement at least tended to lead to higher ratings on the general presence item.

In conclusion, Study 1 served as a first investigation into the impact of different interaction formats (object movement, camera movement, no interaction) and learners’ visuospatial ability, as well as its moderating role, during learning (recognition, factual knowledge) in a low immersive desktop “VR” environment. The results shed light on the complexities of interaction design and learners’ characteristics in VR learning environments. Building upon the insights from Study 1, Study 2 was designed to further explore these factors in an immersive VR environment, adapting the methodological approach from Study 1 for the two interaction formats.

Study 2

In the following, we explain how Study 2 differs from Study 1. It is important to note that the essential measurements (e.g., recognition and factual knowledge, as well as visuospatial ability) in Study 2 are identical to those in Study 1 and are therefore not reported again.

Method

Participants and Design

In Study 2, 65 participants (19 males, 46 females, 0 diverse; 54 right-handed, 11 left-handed, 0 ambidextrous; 59 university students, 1 employed professional, 5 identified as either job seekers, unemployed, pupils, or housewives/husbands) with an average age of 25.18 years (SD = 4.96) were recruited via a participant recruiting system (www.iwm.sona-systems.com; selection criteria: native German speaker, age: 18–40 years). Participants were reimbursed 10 € for approximately 60 min. The study was approved by the ethics committee of the IWM (LEK 2020/043). Participants were randomly assigned to one of two conditions in a between-subjects design. The first independent factor, interaction format, had two levels: object movement (n = 33) and viewer movement (n = 32; this was termed camera movement in Study 1). In contrast to Study 1, this study was conducted not as a Desktop-VR in an online setting but as an immersive VR in a controlled laboratory environment. Thus, we operationalized the two interaction formats according to this immersive setup: all participants conducted the study with a VR headset and controller. Object movement was implemented via VR controller input, and viewer movement was implemented by allowing participants to walk and move in the VR environment. Moreover, participants' visuospatial ability was assessed as a second, continuous independent factor.

Measures

Dependent variables were the same measures as in Study 1 (easy, medium, and difficult recognition, factual knowledge, presence, and motion sickness).

Materials and Environment

Within the immersive VR learning environment, participants were exposed to the same highly realistic 3D-models of fish with the same instruction (i.e., “study the 3D-models to learn how to classify fish based on their movement patterns”) and the same additional verbal explanations used in Study 1. However, in Study 2, the virtual learning environment was presented to the participants through a VR headset (HTC Vive Pro V1), which affords a high degree of immersion (Villena-Taranilla et al., 2022). The HTC Vive Pro provided stereoscopic vision and controller-based interaction. Each eye had a resolution of 1440 × 1600 pixels, and the field of view was 110°. The test room was equipped with two base stations that allowed the participants to move around with a significant degree of freedom. Participants used the associated HTC Vive Pro controllers as input devices.

Interaction Format

The two interaction formats, object movement and viewer movement, were implemented as follows. In the object movement condition, participants were instructed to remain in a fixed position within the real room (and thereby also in the immersive VR environment) and not to step away from it. Participants were allowed to interact with the virtual 3D-models of fish via the controller to change their perspective on the fish (objects were rotated, viewers stayed in their position). Within the VR environment, a virtual hand representing the controller was displayed (see Fig. 7). In the viewer movement condition, participants were allowed to walk and move around the displayed 3D-models of the fish (e.g., bend down to look at them from below) without any visible avatar. However, in this condition, participants were not allowed to touch or rotate the 3D-objects (objects stayed in their position, viewers moved in the virtual scene). The available space for movement within the virtual environment was approximately 3 m × 2.85 m.

Fig. 7 Screenshot of the participants’ view in the object movement condition
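The division of labor between the two conditions just described can be summarized in a schematic per-frame sketch (an illustrative Python/numpy example under our assumptions, not the engine code actually used; the function names and the play-area clamp are hypothetical):

    import numpy as np

    def update_object_movement(object_rotation, controller_delta, grab_pressed):
        # Object movement: while the grab button is held, the controller's
        # frame-to-frame rotation delta (a 3 x 3 matrix) is applied to the
        # fish model; the viewer's own position never changes.
        return controller_delta @ object_rotation if grab_pressed else object_rotation

    def update_viewer_movement(tracked_head_position,
                               half_extents=np.array([1.5, np.inf, 1.425])):
        # Viewer movement: the tracked headset pose moves the viewpoint while
        # the object stays fixed; walking is bounded by the available play
        # area of roughly 3 m x 2.85 m (height unconstrained).
        return np.clip(tracked_head_position, -half_extents, half_extents)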

Procedure

In contrast to Study 1, participants in Study 2 did not take part online but attended individual appointments in our VR lab. When arriving at the lab, participants were welcomed, read the general information about the study, and gave their informed consent. Study 2 comprised the preliminary phase, the learning phase, the assessment of additional variables, and the testing phase; the procedure was kept as comparable as possible to the one used in Study 1. Participants answered the questionnaires from the preliminary phase (demographics, familiarity with the domain of fish movements, PFT, Ekstrom et al., 1976) on a PC. Following the preliminary phase, participants were instructed about the respective interaction format in the immersive VR depending on their experimental condition: either having the option to rotate the object via the controller (object movement) or having the option to walk and move around the object (viewer movement). Subsequently, participants put on the VR headset, and the learning phase started with the submarine exercise, before they read the written introduction to “underwater locomotion”. Then, the four 3D-models of the to-be-learned movement patterns were presented in the same manner as in Study 1, but this time through the VR headset in the immersive VR environment. Following the learning phase, participants removed the VR headset and answered the VRSQ (Kim et al., 2018) and the IPQ (Schubert et al., 2001), before completing the learning outcome measures of the testing phase on a PC. Finally, participants were informed about the study goals and compensated for their participation.

Results

Regarding the comparability between the experimental conditions, we found no significant differences (all ps > .323) between the two groups (see Appendix Table A2 for means and values) in participants’ age, gender, handedness, occupation, familiarity with the domain, or visuospatial ability.

Learning Outcomes

Recognition performance in terms of easy, medium, and difficult recognition, as well as factual knowledge, was analyzed with ANCOVAs with the factor interaction format (object movement, viewer movement) and the continuous factor learners’ visuospatial ability (z-standardized, with interaction term), as well as learners’ familiarity with the domain as a covariate (without interaction terms; see Table 3 for adjusted means and standard errors). For reporting purposes, higher visuospatial ability was defined as one standard deviation above the mean, and lower visuospatial ability as one standard deviation below the mean.
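In sketch form, each of these ANCOVAs corresponds to a linear model with the condition-by-ability interaction and the covariate (a minimal Python/statsmodels illustration; the data frame, column names, and helper function are assumptions for illustration, not our actual analysis scripts):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def run_ancova(df: pd.DataFrame, outcome: str):
        # z-standardize visuospatial ability so that "higher"/"lower" can be
        # evaluated at +1 / -1 SD around the mean.
        df = df.assign(z_vsa=(df["vsa"] - df["vsa"].mean()) / df["vsa"].std(ddof=1))
        # Interaction format crossed with z-standardized VSA; familiarity
        # enters only as a covariate (no interaction terms). For a faithful
        # type-III test, sum-to-zero contrasts would be set on condition.
        model = smf.ols(f"{outcome} ~ C(condition) * z_vsa + familiarity",
                        data=df).fit()
        return sm.stats.anova_lm(model, typ=3), model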

Table 3 Means and standard errors for easy, medium, and difficult recognition and factual knowledge (in % correct) for higher and lower visuospatial ability (VSA) in both experimental conditions of Study 2

There were no significant main effects of interaction format on easy recognition, medium recognition, difficult recognition, or factual knowledge (all ps > .245). Contrary to our hypothesis H1, we observed no overall positive effect of direct manipulation in the object movement condition compared to the viewer movement condition on learning outcomes.

Regarding learners’ visuospatial ability, there was a significant main effect on difficult recognition (F(1, 60) = 4.916, MSE = 267.881, p = .030, η2p = .076) as well as on factual knowledge (F(1, 60) = 6.631, MSE = 165.928, p = .013, η2p = .100), but not on easy and medium recognition (both ps > .113). Participants with higher visuospatial ability achieved higher performance for difficult items and factual knowledge than learners with lower visuospatial ability. This partially supports our hypothesis H2 that higher visuospatial ability positively influences learning performance compared to lower visuospatial ability.

Moreover, there was a significant interaction between interaction format and learners’ visuospatial ability for easy recognition performance (F(1, 60) = 5.524, MSE = 846.796, p = .022, η2p = .084; see Fig. 8). Bonferroni-adjusted pairwise comparisons revealed that participants with higher visuospatial ability profited from object movements compared to viewer movements (p = .046) within easy recognition tasks. In contrast, there was no difference between the two interaction formats for learners with lower visuospatial ability (p = .172).

Fig. 8 Interaction between interaction format and learners’ visuospatial ability on easy recognition in Study 2

Furthermore, there was a (non-significant) tendency for an interaction between interaction format and learners’ visuospatial ability for medium recognition performance (F(1, 60) = 3.481, MSE = 558.674, p = .067, ns, η2p = .055; see Fig. 9). Bonferroni-adjusted pairwise comparisons revealed that participants with lower visuospatial ability tended to suffer from object movements compared to viewer movements within medium recognition tasks (p = .068). In contrast, there was no difference between the two interaction formats for learners with higher visuospatial ability (p = .405).

Fig. 9 (Non-significant) tendency for an interaction between interaction format and learners’ visuospatial ability on medium recognition in Study 2

There was a main effect of the covariate familiarity with the domain for medium recognition (F(1, 60) = 4.961, MSE = 558.674, p = .030, η2p = .076) and factual knowledge (F(1, 60) = 4.500, MSE = 165.928, p = .038, η2p = .070). Higher familiarity with the domain led to better performance for medium recognition and factual knowledge. There were no other significant main effects or interactions concerning learning outcome measures (all ps > .109).

Motion Sickness and Presence

Motion sickness, as well as the four subcategories of presence (general presence item, spatial presence, involvement, and realness), were again analyzed with ANCOVAs with the factor interaction format (object movement versus viewer movement) and the second continuous factor learners’ visuospatial ability (z-standardized, with interaction term), as well as learners’ familiarity with the domain as a covariate (without interaction terms; see Table 4 for adjusted means and standard errors).

Table 4 Means and standard errors (in parentheses) for motion sickness, general presence item, spatial presence, involvement, and realness for higher and lower visuospatial ability (VSA) in both experimental conditions of Study 2

There were no main effects or interactions on motion sickness in Study 2 (all ps > .119). Regarding presence, there was a (non-significant) tendency for a main effect of interaction format on the general presence item (F(1, 60) = 3.952, MSE = 1.932, p = .051, ns, η2p = .062). Participants in the viewer movement condition tended to rate the general presence item higher than participants in the object movement condition. In addition to this (non-significant) tendency towards a main effect, a significant interaction was found between interaction format and visuospatial ability regarding the general presence item (F(1, 60) = 10.297, MSE = 1.932, p = .002, η2p = .146; see Fig. 10). Bonferroni-adjusted pairwise comparisons revealed that only participants with lower visuospatial ability rated general presence higher in the viewer movement condition than in the object movement condition (p < .001), whereas there was no difference between the two conditions for participants with higher visuospatial ability (p = .339).

Fig. 10 Interaction between interaction format and learners' visuospatial ability on the general presence item in Study 2

Moreover, there was a significant interaction between interaction format and visuospatial ability on spatial presence (F(1, 60) = 7.868, MSE = .999, p = .007, η2p = .116; see Fig. 11). Bonferroni-adjusted pairwise comparisons revealed an almost identical pattern as for the general presence item: only participants with lower visuospatial ability rated spatial presence higher in the viewer movement condition than in the object movement condition (p = .003), whereas there was again no difference between the two conditions for participants with higher visuospatial ability (p = .351).

Fig. 11 Interaction between interaction format and learners’ visuospatial ability on spatial presence in Study 2

Furthermore, there was a main effect of familiarity with the domain on the general presence item (F(1, 60) = 7.387, MSE = 1.932, p = .009, η2p = .110) and spatial presence (F(1, 60) = 4.996, MSE = .999, p = .029, η2p = .077) in the direction that higher familiarity with the domain led to higher ratings for the general presence item and spatial presence. There were no other main effects or interactions concerning the general presence item, spatial presence, involvement, and realness (all ps > .115).

Discussion of Study 2

In the second study, we investigated the influence of different interaction formats (object movement, viewer movement) on learning performance in a real immersive VR setting. Our initial hypothesis H1 suggested that object movement would lead to better learning outcomes than viewer movement because during object manipulation the virtual 3D-objects in the VR environment are presented near the (real) hands of the users and thus should be readily considered for immediate interaction and action (e.g., Gozli et al., 2012), thereby making available additional cognitive resources for learning (e.g., Agostinho et al., 2016). Moreover, we hypothesized in H2, based on prior research (e.g., Höffler, 2010), that learners with higher visuospatial ability would outperform those with lower visuospatial ability across all learning outcome measures.

Regarding hypothesis H1 on interaction format, we found no general evidence for a superiority of the object movement condition. Nevertheless, the significant interaction between interaction format and visuospatial ability (R1) on easy recognition revealed that learners with higher visuospatial ability achieved better results on the easy recognition items when they were allowed to directly manipulate the virtual objects manually instead of walking around them in the virtual scene. This result pattern on easy recognition can be interpreted as an effect of hand proximity (e.g., Brockmole et al., 2013; Brucker et al., 2021; Reed et al., 2006; Tseng et al., 2012) for learners with higher visuospatial ability. That this positive effect of hand proximity occurred only for easy recognition in our study might be explained by the correspondence of these test items to the task focus and the respective task demands (see Goodhew & Clarke, 2016; Liepelt & Fischer, 2016). In prior studies on hand proximity, we showed that hand proximity effects are particularly pronounced in tasks that align closely with the instructional intent and the task requirements (e.g., Brucker et al., 2017; Weber, 2016): we found positive effects of hand proximity for a task on memorizing different dynamic sequences when learning about mitosis, and for a task on recognizing specific movements when learning about dance steps (it should be noted that both studies, on mitosis and on dance steps, included tasks on memorizing different dynamic sequences as well as on recognizing specific movements). Moreover, in the domain of learning about dance steps, which is visually fast and highly complex and thereby very similar to the present task of learning about biological movements, particularly learners with higher visuospatial ability profited from hand proximity (e.g., Brucker et al., 2017; Weber, 2016).

The easy items in the present study correspond most closely to the dynamics depicted repeatedly in the 3D-models during the learning phase, because they show only relevant information. In the learning phase, there was a strong focus on visualizing the correct movements for propulsion, but not on discriminating propulsive movements from navigational ones or on identifying partly invisible movement patterns. Therefore, the easy items can be seen as largely pure perceptual retrieval, exactly fitting the present task demands, whereas the medium and particularly the difficult items (as they show more irrelevant information or only parts of the relevant information) are farther away from the present task demands and thereby go more in the direction of transferring knowledge, based on inferences, to new situations and challenges. Particularly learners with higher visuospatial ability seem to profit from hand proximity when recognizing movement patterns. These learners might have more general competencies to process complex dynamic phenomena, and their advanced imagination may enable a more accurate evaluation of how a dynamic 3D-model can be rotated to achieve an optimal viewing position; thus, they might be able to use the additional attentional or cognitive resources afforded by hand proximity more effectively (see ability-as-enhancer assumption, Höffler, 2010).

However, the descriptive result pattern and the tendency for an interaction between interaction format and visuospatial ability on medium recognition indicated a potential reversal of the hand proximity effect during direct object manipulation in immersive VR for individuals with lower visuospatial ability, as they tended to profit from viewer movement instead of object movement.

Our results support hypothesis H2, showing that participants with higher visuospatial ability performed better than those with lower visuospatial ability in several areas: difficult recognition (in general, across both interaction formats) and factual knowledge, as well as medium and easy recognition in the object movement condition (see the respective interaction patterns). With these observed interaction patterns, the data also provide evidence for research question R1 regarding the possible interaction between visuospatial ability and different interaction formats. This clearly demonstrates the influence of participants’ characteristics in terms of their visuospatial ability during learning about movements in VR environments. In both Study 1 and Study 2, greater familiarity with the domain resulted in higher presence ratings. Moreover, in Study 2, greater familiarity with the domain also contributed to improved learning outcomes (for medium recognition and factual knowledge).

However, no differences in motion sickness were observable in the immersive VR environment. This is still in line with the results of Study 1, as Study 2 compared the two interactive conditions but did not include the no interaction condition, which accounted for this effect for learners with lower visuospatial ability in Study 1. Furthermore, participants with lower visuospatial ability experienced higher presence in the viewer movement condition than in the object movement condition, whereas there was no such difference between the two conditions for participants with higher visuospatial ability. This was in line with the result pattern on learning outcomes, as learners with lower visuospatial ability tended to learn better in the viewer movement condition.

Comparison Between Study 1 and 2

In an additional exploratory step, we directly compared learning outcomes (in terms of easy, medium, and difficult recognition, as well as factual knowledge), motion sickness, and presence for the two studies with independent samples t-tests for the factor immersion (low = Desktop-VR versus high = immersive VR; see Table 5 for means and standard deviations as well as Tables 6 and 7 for correlations between the dependent variables in Study 1 and Study 2, respectively).

Table 5 Means and standard errors for easy recognition, medium recognition, difficult recognition, and factual knowledge (in % correct). Absolute values for motion sickness, general presence item, spatial presence, involvement, and realness for low and high immersive learning environments
Table 6 Correlation matrix for dependent and additional variables in Study 1 (low immersive learning environment = Desktop-VR; N = 160)
Table 7 Correlation matrix for dependent and additional variables in Study 2 (high immersive learning environment = immersive VR; N = 65)

Learning Outcomes

There were no significant differences between the two VR environments for any learning outcome variable: easy recognition (95%-CI[−2.311, 13.946], t(223) = 1.410, p = .160, ns, d = 0.207); medium recognition, with a (non-significant) tendency (95%-CI[−.553, 11.596], t(223) = 1.791, p = .075, ns, d = 0.263); difficult recognition (95%-CI[−4.426, 5.074], t(223) = −.134, p = .447, ns, d = 0.020); and factual knowledge (95%-CI[−1.976, 5.287], t(223) = .898, p = .185, ns, d = 0.132).
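In sketch form, each of these comparisons reduces to the following computation (a minimal Python/scipy illustration with hypothetical array names, not our actual analysis scripts):

    import numpy as np
    from scipy import stats

    def compare_immersion(desktop_scores, immersive_scores):
        # Independent-samples t-test for the factor immersion (two-sided by
        # default) plus Cohen's d based on the pooled standard deviation.
        t, p = stats.ttest_ind(desktop_scores, immersive_scores)
        n1, n2 = len(desktop_scores), len(immersive_scores)
        pooled_sd = np.sqrt(((n1 - 1) * np.var(desktop_scores, ddof=1) +
                             (n2 - 1) * np.var(immersive_scores, ddof=1)) /
                            (n1 + n2 - 2))
        d = (np.mean(desktop_scores) - np.mean(immersive_scores)) / pooled_sd
        return t, p, d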

Motion Sickness and Presence

There was a significant difference for all of these variables, in the direction that the more immersive VR resulted in higher ratings on motion sickness (95%-CI[−.354, −.021], t(223) = −2.276, p = .012, d = −0.335), the general presence item (95%-CI[−2.376, −1.518], t(223) = −8.942, p < .001, d = −1.315), spatial presence (95%-CI[−2.036, −1.404], t(223) = −10.733, p < .001, d = −1.578), involvement (95%-CI[−1.421, −0.708], t(223) = −5.875, p < .001, d = −0.865), and realness (95%-CI[−0.936, −0.356], t(223) = −4.394, p < .001, d = −0.646).

The direct comparison between the two studies indicated that the immersive VR in Study 2 yielded more presence and higher motion sickness but did not lead to better learning outcomes than the Desktop-VR in Study 1 (for medium recognition, the values even tended to be higher in the Desktop-VR than in the immersive VR environment). This must be considered when deciding on a specific type of VR environment while designing learning environments.

General Discussion

In two studies, we investigated how the possibility of interacting with the perspective (object movement versus camera/viewer movement) on virtual 3D-models in a low (Study 1: desktop) and a high (Study 2: headset) immersive VR environment influences learning to classify different movement patterns. Beyond the possibility of interacting and the degree of immersion in the VR, we addressed participants’ visuospatial ability because it plays a pivotal role in learning about dynamic movements in VR environments (e.g., Höffler, 2010; Skulmowski, 2023b).

It is crucial to highlight that our research concentrates on materials that depict complex dynamic movements as the focal point of investigation. This represents a significant strength of our research, particularly when contrasted with other studies in this domain, which often concentrate on knowledge acquisition from static (e.g., Jang et al., 2017) or less complex subjects.

In both studies, we compared two interactive conditions: The object movement condition allowed participants to rotate the dynamic virtual 3D-objects (while the camera or viewer, respectively, was not moved), whereas the camera/viewer movement condition allowed changing the viewpoint of the camera (or the viewer, respectively) by moving around the virtual 3D-objects (while the virtual 3D-objects themselves were not moved).

Results revealed a generally positive effect of learners’ visuospatial ability on learning (in line with hypothesis H2). Future studies should explore how visuospatial skills and familiarity with the domain (as a possible indicator of prior knowledge) influence learning efficiency, because both are connected to general intelligence (e.g., Buckley et al., 2018; Carroll, 1993). Such research could further examine the specific contributions of each factor to general intelligence and to what extent participants who have these abilities at their disposal are better learners overall, showing enhanced individual learning efficiency and capacity to manage cognitive load (e.g., Sweller, 2020). Insight into these relationships could enhance educational methods and our understanding of intelligence and learning.

Besides the general positive effects of learners’ visuospatial ability, our results indicate that it also moderates the effectiveness of different interaction formats in VR environments (in line with research question R1). Interestingly, not only the characteristics of the learners (between-person differences, Lachmair et al., 2022) but also the type of VR environment in terms of how immersive it was (situatedness, Lachmair et al., 2022; see also Fischer & Brugger, 2011) played a role in the effectiveness of more or less embodied interaction formats in this domain. In the low immersive Desktop-VR (Study 1), learners with higher visuospatial ability profited from camera movement. In contrast, in the high immersive VR (Study 2), learners with higher visuospatial ability profited from object movement, whereas learners with lower visuospatial ability tended to suffer from object movement. This interaction pattern and its possible explanations are discussed in more detail in the following section. Although we did not find a dominant main effect of interaction format (as initially hypothesized in H1) in either study, we did find, in line with research question R1, that learners’ visuospatial ability moderated the effectiveness of the different interaction formats.

Practical and Theoretical Implications

In comparison, the interaction possibility in the object movement condition in the immersive VR environment of Study 2 is much more an actual “near-hand” experience than the mouse cursor interaction in the Desktop-VR environment of Study 1. This is because the manipulation of the virtual 3D-objects via the controllers (Study 2) is done by the real hands of the participants (with the controllers in them). The hands with the controller must be held near the virtual 3D-objects because the object cannot be manipulated if the hands are too far away. Thus, proprioceptive information about the hands being near the stimuli is available (Reed et al., 2006). Moreover, to ensure some kind of visual information, we implemented a virtual hand representing the real hand with the controller in the immersive VR environment (effects of rubber hands, see Reed et al., 2006). However, this object manipulation was only beneficial for easy recognition items and learners with higher visuospatial ability in the immersive VR. This seems to be the combination under which potential beneficial effects of hand proximity can unfold. To the best of our knowledge, no studies have directly investigated hand proximity effects on learning in VR environments (see Peck & Tutar, 2020, for a study on avatar hand proximity and working memory in VR). Thus, our study is a first promising step, and future research is needed to further disentangle hand proximity effects in different virtual learning environments in combination with the moderating role of participants’ visuospatial ability. We are convinced that with the increasing importance of VR environments in the future, this effect has great potential for further research.

In addition to the rationale that object manipulation in immersive VR is advantageous due to the close proximity between the learners’ hands and the virtual (dynamic) 3D-objects, complementary explanations could be the level of embodiment and agency (Johnson-Glenberg, 2018; Makransky & Petersen, 2021; Skulmowski & Rey, 2018), in terms of the degree of embodied interaction experienced in the different interaction formats, as well as the cognitive resources demanded by the different experimental conditions (e.g., Sweller, 2020). Even though both studies implemented the two ideas of either manipulating the virtual object or manipulating the camera’s (respectively, the viewer’s) viewpoint, the four resulting variants of interaction differed due to the different VR environments. The object movement via mouse control in the Desktop-VR was very complex and unnatural because the 3D-models had to be rotated on three axes with an input device designed to operate in a two-dimensional space. Thus, this interaction format is the least embodied; participants could not rely on existing experiences or even evolved abilities, so this format must be considered secondary knowledge (instead of primary knowledge, see Van Gog et al., 2009). At the same time, this format of object movement in the Desktop-VR might be the one demanding the most cognitive resources, as learners had to interleave the new information gained by changing the viewing angle of the fish with the new manipulation technique of rotating the virtual object on three axes with the mouse. In contrast, the viewer movement condition in the true immersive VR was very embodied and natural, giving haptic feedback through the feet touching the ground and using the whole body to make viewpoint changes. Walking around a virtual object in an immersive VR works exactly like walking around a real object (e.g., a sculpture in a museum): the motoric behavior is the same and is not translated via any input device, even though the visual input differs. Thus, this viewer movement is entrenched, the most grounded, embodied, and maybe even well-situated (see Fischer & Brugger, 2011), and can be considered primary knowledge (Van Gog et al., 2009), thereby building the other pole of a continuum of the level of embodiment of our four variants of interaction formats.

The two remaining conditions can be considered to lie between these two extremes (object movement in Desktop-VR = highly unnatural secondary knowledge versus viewer movement in immersive VR = highly natural primary knowledge). The object movement via controller in immersive VR and the camera movement via mouse interaction in the Desktop-VR can be considered more natural than the object movement in Desktop-VR. At the same time, they are still less natural than the viewer movement in immersive VR, as both make use of input devices to interact with the virtual elements: the camera movement in the Desktop-VR via mouse input on two axes on the invisible sphere, and the object movement in the immersive VR via a controller on three axes (participants using only their own body parts to hold and move the controller).

In both of these conditions, participants had to interact with an entity in front of them via an input device. In the object movement via controller in the immersive VR, this entity is the virtual 3D-object, whereas in the camera movement via mouse interaction in the Desktop-VR, this entity is the invisible sphere on the computer screen that surrounds the 3D-models. To move and rotate these entities into a position that allows observing a particular perspective on the depicted dynamics of the fish model, it is necessary to build a dynamic mental model (e.g., Lowe & Boucheix, 2017), based on which the final perspective the learner wants to see can be imagined. The change from the respective starting position of the perspective (either of the fish model itself or of its surrounding virtual sphere) to the desired output situation invokes mental rotation and, therefore, the visuospatial ability of the learners. This might be why participants with higher visuospatial ability could improve their learning in these two conditions. Moreover, it might be the case that incongruences between the changes needed to view a new angle of the fish and the required manipulation technique for the input device (either on the invisible sphere or with the VR controller) were more difficult for learners with lower visuospatial ability to integrate, whereas learners with higher visuospatial ability managed to do so.

It is essential to acknowledge that the same assumption applies to the object movement via mouse in the Desktop-VR condition. However, our experience from Study 1 indicates that rotating a virtual object with the mouse on three axes is considerably more challenging than rotating the same object with the controller held in hand on three axes. Additionally, it is more complex than rotating a similar virtual object with the mouse on two axes (familiar mouse control, see Smith, 2019). This is because there is no intuitive mapping between the to-be-executed movements in the two-dimensional mouse operation space and the imagined dynamic mental model transformations on the three rotation axes.

In sum, our two studies highlighted the importance of taking learners’ visuospatial ability into account (e.g., Skulmowski, 2023b) and showed that results on interaction formats in Desktop-VR environments are not transferable on a one-to-one basis to true immersive VR environments. Thus, when designing virtual learning environments, it is essential to consider not only bodily engagement in terms of interaction possibilities (cf. Johnson-Glenberg, 2018; Skulmowski & Rey, 2018) but also learners’ characteristics (e.g., Höffler, 2010) and the properties of the different environmental situations (e.g., Lachmair et al., 2022).

Limitations and Future Research

One area for potential improvement in future studies is to enhance the level of agency (Johnson-Glenberg, 2018) or bodily engagement and task integration, particularly concerning the taxonomy of embodiment in education proposed by Skulmowski and Rey (2018). However, when considering agency, it is important to note that it may not always be optimal to allow participants complete freedom of choice; people need to understand what they are doing and why. The perspectives we offered were selected based on prior studies by a domain expert and thus were very likely well suited for learning about fish movement patterns. Therefore, allowing participants to interact with the 3D-models or the virtual scene to change perspectives might have obscured relevant movements. Another aspect that could be further explored is whether full agency or guided scaffolding (e.g., suggested ways of interaction) is more effective for learning in VR (see Doo et al., 2020). This should be investigated in future VR research, as we did not examine which positions or viewpoints were chosen and for how long they were viewed, either by participants performing rather low or by those performing rather high on learning outcomes.

In Study 1, bodily engagement was limited, as participants conducted the study while seated. In contrast, Study 2 involved higher bodily engagement, as participants were encouraged to stand and move around during the learning task. However, both studies still employed an incidental approach, with movement not directly integrated into the learning process. Moreover, the movements were neither aligned with the depicted content (task integration, Skulmowski & Rey, 2018) nor with the underlying processes needed for comprehension (e.g., predicting or imagining the dynamics, see De Koning & Tabbers, 2011; Lowe & Boucheix, 2017). To achieve a higher level of bodily engagement and task integration, future research could, for instance, require participants to perform specific bodily movements or gestures to imitate locomotion patterns (e.g., De Koning & Tabbers, 2013; Scheiter et al., 2020). Doing so would maximize the potential for investigating the relationship between bodily engagement, task integration, and learning outcomes. Such an approach would provide valuable new insights into the benefits of incorporating physical movements directly into the learning process and their impact on learning.

Moreover, we acknowledge that the findings from the immersive VR study should be replicated with a larger participant sample and should include a no interaction condition as a control condition to strengthen the study's validity. Particularly in light of the (non-significant) tendencies observed across both studies, a replication effort aiming for higher statistical power, possibly through a within-subject design, should be considered to clarify and solidify these findings.

Furthermore, we did not directly investigate which perspectives are best suited for learning about the depicted dynamics (see Keehner et al., 2008). It might also be of interest to address how much participants interacted in the different interaction formats, which perspectives they observed (e.g., using eye-tracking, cf. Drai-Zerbib et al., 2022), how long they observed the different perspectives, and how often they changed between perspectives (see Boucheix et al., 2018), as well as how these variables relate to learning outcomes. Therefore, future research that, for example, records and evaluates interactions in VR environments is needed as a feasible step to disentangle these effects. A further interesting step would be to investigate which brain areas are increasingly activated during the different interaction formats for learners with different levels of visuospatial ability, by means of electroencephalography (EEG, e.g., Scharinger, 2018) or functional near-infrared spectroscopy (fNIRS, a non-intrusive method of optical imaging, e.g., Brucker et al., 2015), to shed light on the moderating role of learners’ visuospatial ability during learning in VR environments.

Conclusion

In conclusion, the two studies highlighted the importance of considering participants’ visuospatial ability and the influence of low and high immersive VR environments on the effectiveness of different interaction formats. They provide valuable insights for designing virtual learning environments that maximize learning outcomes based on learners’ characteristics and environmental settings. Future research in this field should explore the integration of bodily movements into learning tasks and continue to investigate the effects of learners’ characteristics and hand proximity in different VR environments on learning processes.