Introduction

Learning with virtual reality (VR) is becoming increasingly popular and is already being used for the acquisition of knowledge and skills in areas such as medicine, engineering and psychology (Radianti et al., 2020; Wu et al., 2020). Although VR learning environments (VRLE) can be designed to be highly detailed and realistic, in most cases they require additional verbal information to convey the learning material in its entirety. Exploring the modality in which this verbal information is presented in a VRLE is therefore essential for optimizing the learning experience. VRLE can be divided into non-immersive and immersive VR. Immersion refers to the technological capacity of a VR application to give a person the impression of being part of the virtual world or of perceiving it as real (Slater & Sanchez-Vives, 2016). In non-immersive VR, learners view a VRLE externally on a computer and thus continue to perceive the real world in their periphery (Slater & Sanchez-Vives, 2016). In immersive VRLE, learners are completely surrounded by the learning material through head-mounted displays (HMDs). The user’s entire field of view is thus occupied by the VRLE, and the real world is not visible to the learner (Slater & Sanchez-Vives, 2016; Wu et al., 2020). This article focuses on immersive VR. As far as learning outcomes in VR are concerned, the findings are heterogeneous: while some studies report positive effects on learning outcomes, others show no effects or even negative effects (Makransky, Borre-Gude, et al., 2019; Merchant et al., 2014; Parong & Mayer, 2020). To better understand these heterogeneous findings, it is necessary to understand the challenges of learning with VR. What distinguishes VRLE from traditional learning environments is that VR confronts the learner with far more visual input.
If there is too much visual input, the learner may be overloaded, which has a negative impact on learning (Howard & Lee, 2020; Parong & Mayer, 2020). Reducing the visual fidelity to avoid such overload is in many cases not a solution, as it can also reduce immersion, an essential aspect of VRLE (Mills et al., 2016). Moreover, the well-documented motivational effects of VRLE can be impaired if visual fidelity is reduced (Makransky & Mayer, 2022). To avoid potential visual overload, additional information, such as supplementary verbal explanations, should not also be presented visually but should be offered via a different sensory modality. This idea is known as the modality principle, which is well established in classical multimedia environments (Mayer, 2005). The question now arises whether it also applies to VR. From a theoretical point of view, one would clearly assume that using the auditory channel could reduce the overload of the visual channel in VRLE as well. However, a recent empirical study of the modality principle in VR showed that it cannot simply be transferred to VR; it found the opposite, a reverse modality effect (Baceviciute et al., 2020). To better understand the conditions for a modality effect in a VRLE, our study therefore investigates the modality effect in a VRLE with a closer focus on the underlying cognitive processes, measuring both learning outcomes and cognitive load in a differentiated way.

Theoretical background

In contrast to classical learning material in 2D presentation form, VRLE hold various advantages and disadvantages that should be taken into account when designing the learning material. Regarding the advantages, learning in VR can be interactive, realistic, and rendered in real time (Cobb & Fraser, 2005; Slater & Sanchez-Vives, 2016), making it exciting and motivating for users (Makransky & Lilleholt, 2018; Parong & Mayer, 2020). Properly designed VRLE can lead to learners being more motivated, experiencing more enjoyment, and investing more effort in learning (Makransky, Borre-Gude, et al., 2019; Parong & Mayer, 2018). Numerous studies associate immersive VRLE with higher motivation (Makransky, Borre-Gude, et al., 2019; Parong & Mayer, 2018). Such higher motivation may be associated with increased investment in learning, which is referred to as germane cognitive load (GCL). This in turn positively influences learning outcomes (Makransky & Mayer, 2022; Moreno, 2006). Research supports such a positive effect of motivation on learning outcomes (Cerasoli et al., 2014).

Another advantage of VR is that it can stimulate the learners’ senses in a way that is as close to reality as possible, creating a feeling of immersion in the virtual environment (Slater & Sanchez-Vives, 2016). The teaching of spatial processes and motor skills in particular could benefit greatly from immersive 3D presentations and interactivity in VR (Parong & Mayer, 2020; Wu et al., 2020). For example, biological or chemical processes can be animated while the learner views the process from multiple perspectives and interacts with the learning material independently (Parong & Mayer, 2020; Radianti et al., 2020). The results of the meta-analysis by Wu, Yu, and Gu (2020) indicate that science content is particularly suitable for learning with HMDs. Moreover, immersive VR learning sessions using HMDs could support both knowledge acquisition and skill learning with longer-term learning effects. Compared to static images or animations, VRLE can react dynamically to the learner, e.g., to the learner’s position in the environment. In this way, hidden structures can be revealed when learners walk towards certain objects out of interest, making the underlying three-dimensional structures visible. This also creates different images for each learner, so that each has an individual learning experience. However, images alone are often not enough to convey the intended information and require additional text. In a VRLE on biology, for example, the mere pictorial information without additional verbal information would not be sufficient to name the depicted cell bodies and structures or to explain their functions and interrelationships. In order to convey such information, additional text is needed. Therefore, when designing a VRLE, it is important to consider the modality through which this additional textual information is presented, so that all learners receive the information they need to fully process the learning material.

However, there are also disadvantages. From a development perspective, the cost and effort required to create an immersive VR learning environment can be very high (Richards & Taylor, 2015). From a learner perspective, the current field of view must be constantly updated, which can lead to temporal shifts in the representation of the learning material, especially in detailed or complex VRLE. While technological advances continue to minimize these temporal delays, this discrepancy between visual and physical perception may cause discomfort for the user, which is not conducive to learning (Slater & Sanchez-Vives, 2016). These disadvantages mainly concern technical drawbacks rather than those that may emerge in learning processes. Radianti and colleagues (2020) also found such a technical focus in the studies on learning in VR covered by their meta-analysis. They point out that the majority of the analysed studies focus only on usability and do not consider learning outcomes or underlying cognitive processes. It is thus all the more important for research to investigate these processes empirically (Radianti et al., 2020). So what processes are necessary for learning, and why might they be challenging in a VRLE? VRLE are characterized primarily by their constant and varied visual impressions. The multitude of visual stimuli in VR could pose the risk of overloading the limited working memory capacity (Mayer & Moreno, 1998). This indicates that features of the VRLE itself could impair learning processes. As mentioned before, one distinctive feature of VR, compared to other learning environments, is immersion. It enables the user to experience presence when the simulation is perceived as real, making it more authentic (Fabris et al., 2019). However, these features do not come without strain on working memory: to provide such an experience, more details and visual effects need to be added.
In natural science studies, for example, three-dimensional views and detailed images of cell structures can be shown (Parong & Mayer, 2018). Viewing angles and depth levels may vary, which is quite different from images in a textbook. However, a high degree of realism necessarily includes some features that are not conducive to the actual learning goal but still need to be processed in working memory. According to Cognitive Load Theory (CLT), this additional processing demand is termed extraneous cognitive load (ECL). Thus, the VRLE itself can cause increased ECL, which may distract from relevant learning content (Makransky, Borre-Gude, et al., 2019).

During learning, learners also need to orient themselves spatially in order to follow the dynamic content. This continuous visual stream is usually accompanied by verbal information, such as terms or formulas embedded in the field of view, integrated text fields, or annotated overlay text. Especially when visual text is presented on top of the already visually rich presentation in VR, learning could become more challenging, because the many visual stimuli and the resulting overload of the visual channel could impair learning performance.

Overall, given the advantages and disadvantages for learning in immersive VR, the question is how to design VRLEs that counteract visual overload and are conducive to learning (Radianti et al., 2020; Sattar et al., 2019). Starting from classical learning environments, the modality principle could be an answer to this question.

The modality principle

According to the modality principle, a visual presentation, such as a picture, should be accompanied by spoken rather than written text (Mayer, 2005). The modality principle is one of the design principles that emerged from the Cognitive Theory of Multimedia Learning (CTML; Mayer, 2005). Together with Baddeley’s working memory model, it describes a sequence of cognitive processes that occur in multimedia learning and, in particular, the cognitive processes required when information is processed visually only versus audio-visually. According to Baddeley, visual and verbal information are processed separately: visual information is processed in the visuo-spatial sketchpad and verbal information in the phonological loop (Baddeley, 1992). Both memory systems have independent capacities. In a multimedia presentation, such as a VRLE, both channels should be used so that the information can be allocated across the two channels and cognitive overload is avoided. Thus, in addition to the visual environment in VR, accompanying text should be presented auditorily. While the visual information is processed in the visuo-spatial sketchpad, the accompanying auditory explanation can be processed with independent capacity in the phonological loop. If the text were presented visually, this information would also have to be processed visually; only after this reading process is the visual text transformed into phonemic information that can be processed in the phonological loop. Moreover, learners would first need to split their visual attention between two visual sources of information and would thus strain only their visual memory resources. With auditory text, such a split-attention effect is avoided, because learners can direct their attention to the visual input and the auditorily presented text at the same time (Rummer et al., 2010).
Thus, when both visual and verbal information are simultaneously available in working memory, learners can more easily connect and integrate them, which facilitates comprehension and transfer performance.

The positive effect of the modality principle on learning outcomes has been supported by a large number of studies in traditional multimedia learning environments (Ginns, 2005; Harskamp et al., 2007; Moreno, 2006). However, there are also studies in which the modality principle had no effect on learning outcomes or even negative effects (Inan et al., 2015; Tabbers & van der Spoel, 2011). Indeed, a number of studies have observed a reverse modality effect, according to which learners with visual-only presentation showed better learning performance than learners with audio-visual presentation of the learning material (Crooks et al., 2012; Inan et al., 2015; Oberfoell & Correia, 2016). The modality effect can be reversed, for example, when longer or more complex texts are presented or when the learning environment is self-paced. Why the modality effect can be reversed in VRLE, and which boundary conditions it is subject to, will be explained in more detail in a later section.

The discussion so far shows that it is important to consider resource and capacity constraints when designing learning environments. The CLT rests on the same assumption of limited working memory as the CTML and can likewise be used to explain the processing of different modalities. Based on the split-attention effect, the modality effect is associated with ECL because learners must divide their attention between two stimuli carrying complementary information (Youssef-Shalala et al., 2014). Brünken and colleagues (2004) have argued that when the visual channel in working memory is already strained, further visual cognitive resources are not available for another task. If the complementary information is distributed between the visual and auditory channels, the ECL should be reduced accordingly (Mayer, 2005). If the ECL is reduced because both channels are used, more capacity remains for GCL, which promotes learning according to the addition hypothesis (Sweller, 2005). The question is whether the modality principle can be adapted to VRLE to achieve the same effects as in classical environments.

Modality principle in VR

Theoretically, the modality effect should be particularly prominent in immersive VRLE: adding visual text on top of already strong visual stimuli could potentially overload the visual channel. However, the limited empirical evidence so far shows only a reverse modality effect in VRLE. A recent study by Baceviciute and colleagues (2020), in which 78 students underwent a learning session about cancer in VR, manipulated the presentation modality. In the first visual-only condition, the text was presented semantically embedded in a virtual book. In the second visual-only condition, the text overlaid the VR against a transparent background, and in the audio-visual condition, the text was presented auditorily. They found a reverse modality effect, whereby subjects in the two visual conditions performed significantly better on recall than subjects in the audio-visual condition. In contrast, no significant difference was found between conditions in terms of transfer performance. The results of this single study suggest that learning in VR may be subject to a reverse modality effect. To explain the reverse modality effect in VR, Baceviciute and colleagues (2020) invoke the Colavita visual dominance effect, according to which visual stimuli dominate over auditory stimuli when multisensory stimuli are presented simultaneously (Colavita, 1974). This visual dominance may occur at the sensory level even before attentional processes emerge (Sinnett et al., 2007). However, this effect has primarily been studied in audio-visual discrimination tasks, not in learning environments. Nevertheless, to better understand these results, we should consider the boundary conditions for the modality effect. The effectiveness of the modality effect can be limited or reversed by certain learning conditions, for example if the text is long or the learning material is very complex (Leahy & Sweller, 2016).
Especially with longer and more complex texts, the possibility of self-regulation during text reading can compensate for the processing disadvantages (Kürschner et al., 2006). With a visual presentation of text, it is possible to revisit individual sections because the text is stable. With a transient auditory presentation, in contrast, learners are dependent on the speed of the presentation, which can lead to comprehension or retention problems with longer or more complex content (Kürschner et al., 2006; Singh et al., 2012). In the study by Baceviciute and colleagues (2020), the modality effect may have been reversed because the presented texts were very long, with a total of 25 paragraphs of 300–400 characters each, on the rather complex topic of cancer.

In Baceviciute and colleagues’ (2020) study, learners were able to decide for themselves when to advance to the next learning content. A reverse modality effect can also occur when the learning environment is learner-paced as opposed to system-paced (Ginns, 2005). Thus, if learners have enough time to process the visual learning material, the visual overload can be compensated by revisiting relevant information at their own pace or by using text processing strategies (Tabbers & de Koeijer, 2010). Since Baceviciute and colleagues (2020) did not record how long participants spent reading the visual text compared to the non-repeatable auditory condition, a repetition effect in reading could also be responsible for the better results of the visual conditions compared to the auditory condition.

Another reason may be that, unlike permanent text in a visual-only presentation, auditorily presented learning content cannot be repeated, nor is it possible to jump back and forth between elements, as is the case when reading permanent text (Seufert et al., 2009). However, these possibilities of text presentation could facilitate important cognitive processes in multimedia learning environments (Inan et al., 2015).

In addition, the learning material of Baceviciute and colleagues (2020) was rather static, so that the advantages, and possibly the added value, of a VRLE may have remained unused. Learners were instructed to sit as still as possible and restrict their body movement, so that dynamic interplay between the learner and the VRLE was very limited and the image the learner saw was static, more like an animation. If this limitation contributed to learners not experiencing overload in the visual channel, then no facilitation can arise from using both channels, i.e., no modality effect.

Additionally, the visual text was not integrated into the learning environment but was statically pinned into the learner’s field of view, either in the form of an overlay or in the form of a book. At the same time, the learning material did not require a connection between the text and the image. If no such connection is necessary to understand the learning material, the learner does not have to divide their attention between visual sources, so there is no split-attention effect, for which an auditory presentation of the learning material would be considered superior to a visual-only one.

Furthermore, Baceviciute and colleagues (2020) only analyzed recall and transfer. To examine the modality effect in VRLE more closely, it would be desirable to also look at comprehension, which, according to Bloom’s taxonomy, is located between recall and transfer. Baceviciute and colleagues (2020) also measured cognitive load during learning with different modalities in VR, using single-item questionnaires and EEG. Subjects in the audio-visual condition reported significantly lower intrinsic cognitive load (ICL) than subjects in the visual-only condition: although the element interactivity did not differ between the conditions, subjects’ subjective ICL ratings indicated that they found the auditory condition easier than the visual book condition. The authors attributed this difference to the transience of the auditory text, because of which learners could not invest their cognitive capacity in selection and organization processes as actively as in the visual condition, where they could re-read the text to integrate information between sentences. However, no differences in ECL were found between these conditions. Moreover, it would be informative to additionally investigate GCL, to determine whether the textual condition actually invests more cognitive capacity in germane processes.

Present study

The present study investigated the influence of the modality of additional instructional text (visual-only vs. audio-visual) on learning outcomes and cognitive load in a VRLE. Furthermore, we wanted to investigate the cognitive learning processes in more detail, which is why we measured both learning outcomes and cognitive load in a differentiated way. As the theory presented shows, the modality effect is well supported in classical learning environments. On this basis, a modality effect should theoretically be even more pronounced in VRLE due to their high visual fidelity. However, there is currently a single empirical study showing a reverse modality effect for VRLE. Thus, there are theoretical arguments for a modality effect in VRLE and an empirical one against it. To resolve this dilemma and derive unambiguous hypotheses, we explained in the preceding section why we believe that the modality effect reversed in the study by Baceviciute and colleagues (2020) due to its boundary conditions. We therefore tried to take these boundary conditions into account in our study. In contrast to the study of Baceviciute and colleagues (2020), we kept the text length considerably shorter to accommodate working memory capacity in both conditions.

We were also particularly careful to ensure that the visual texts were integrated into the VRLE. The visual texts were attached to the corresponding visual elements and dynamically adjusted their rotation according to the learner’s position, so that they could be read from any position in the VRLE while the learner moved around freely and viewed the relevant elements. In the visual-only condition, learners who wanted to connect text and image had to alternate their attention between the two. In the audio-visual condition, learners did not have to split their attention to process the learning material simultaneously and connect it in working memory, which facilitates the modality effect. We also decided to make the VRLE learner-paced; however, we measured the time-on-task in both conditions. This allows us to check whether one condition spent significantly longer learning in the VRLE, which may influence the modality effect, and whether possible differences in learning outcome are related to time spent on learning. While Baceviciute and colleagues (2020) differentiated the learning outcome into recall and transfer, we differentiated the learning outcome more finely and also recorded comprehension. Since we adhered to the boundary conditions of a modality effect as far as possible in our VRLE study, we derive our hypotheses strictly from the established multimedia learning theories of the modality effect in classical learning environments. Nevertheless, in order to be able to reflect a possible reverse modality effect in the results, we will additionally evaluate the following hypotheses with Bayesian independent-samples t-tests.

Our first research question is whether different modalities in VRLE have an effect on learning outcomes. Studies attempting to investigate the cognitive learning processes in VR often measure learning outcomes as a general factor rather than in a differentiated way (Makransky, Borre-Gude, et al., 2019; Parong & Mayer, 2018). This can be considered problematic, as the presentation of the learning material in different modalities could affect the depth of processing. By using both channels, working memory is relieved, which could help to increase recall performance. This relief should be even more beneficial for comprehension and transfer performance (Seufert et al., 2009). If an audio-visual presentation frees up capacity in working memory, multiple elements can be processed simultaneously without the learners having to divide their attention between different visual sources. This allows learners to connect and even integrate the learning material, which is crucial for comprehension and transfer processes (Seufert et al., 2009; Mayer, 1999). Whether a modality effect in a VRLE also applies to these different processing levels, however, needs to be investigated more closely. Therefore, in this study we differentiated learning outcomes according to Bloom’s taxonomy into recall, comprehension and transfer (Bloom, 1956).

Based on established multimedia learning theories, the present study contributes to the empirical research on learning outcomes and their cognitive processes in VRLE as urged by Radianti and colleagues (2020) in their meta-analysis. Derived from the theory presented, we hypothesize that audio-visual presentation of learning material in VR will lead to higher recall (H1), comprehension (H2) and transfer scores (H3) compared to visual-only presentation.

Our second research question is whether different modalities in VRLE have an effect on cognitive load. According to the CLT in its original version, there are three main sources of cognitive load that can enhance or impede learning (Sweller, 2005). This differentiation is still valid when it comes to evaluating different aspects of instructional design, as in our study (Klepsch & Seufert, 2020). In order to better understand the underlying processes, it is therefore important to measure cognitive load in a differentiated way. Recent studies on cognitive load in VRLE also indicate the importance of differentiating cognitive load (Andersen & Makransky, 2020; Parong & Mayer, 2020). Furthermore, Klepsch and Seufert (2020) recommend that studies on the design of multimedia learning material record cognitive load in a differentiated way, in order to shed more light on the processes underlying learning outcomes and on the multimedia design principles used, and thus to identify design options conducive to learning. The first of the three types of load is the ICL. It results from the complexity or difficulty of the learning material: individual elements of the learning material may be more or less interrelated, and a high degree of element interactivity, i.e., dependency between elements, places a greater demand on the learner’s working memory than when elements are weakly interrelated or not interrelated at all. Objectively, the ICL is not affected by the study design (Schrader & Bastiaens, 2012), and since both conditions are presented with the same VR learning content, we do not expect any differences in ICL between the audio-visual and visual-only conditions.

Especially in immersive VRLE with a strong visual component, applying the modality principle could provide relief by reducing the load on the visual channel (Low & Sweller, 2005). Theoretically, this relief minimizes the ECL and allows learners to devote their cognitive capacity to germane processing, which is conducive to learning (Harskamp et al., 2007).

However, no empirical differences in ECL have yet been found in VRLE with different modalities. Although the general ECL in VR may be high, one explanation for the lack of relief could be that text as a modality is more stable than a transient audio track. This makes it easier to jump back and forth while reading, as skipped text passages can easily be re-read (Seufert et al., 2009). With an audio track, the learner cannot listen to missed text again or use reading strategies. However, since an audio-visual presentation relieves the visual channel and additionally uses the auditory channel, possible bottlenecks in working memory can be prevented, which speaks for a reduction in ECL. Derived from theory, we expect ECL to be lower when learning with multiple modalities than when learning with only one modality. Accordingly, when information is allocated to both channels, learners should have more working memory capacity available due to a lower ECL (Low & Sweller, 2005), which could then be invested as GCL (Sweller et al., 1998).

Based on the theoretical assumptions presented, we hypothesize that there will be no significant difference between the conditions regarding ICL (H4), but that subjects in the audio-visual condition will show a lower ECL than subjects in the visual-only condition (H5). It is also expected that subjects in the audio-visual condition will show a higher GCL than subjects in the visual-only condition (H6).

When processing learning content, it is important that learners have sufficient working memory capacity to promote germane processes (Kalyuga, 2007; Low & Sweller, 2005). This might be favoured in individuals with high rather than low working memory capacity (Kozan et al., 2015). Conversely, individuals with low rather than high working memory capacity have been found to benefit more from the modality principle (Seufert et al., 2009). Therefore, in addition to the design of the learning material, learner characteristics such as motivation (Seli et al., 2016), prior knowledge (Zambrano et al., 2019), and working memory capacity (Seufert et al., 2009) may also influence learning outcomes and were assessed as control variables.

Method

A priori power analysis

To estimate the necessary sample size, we performed an a priori power analysis. A meta-analysis of studies investigating the modality principle’s effect on learning outcomes reports a median effect size of d = 0.72 (95% CI [0.52, 0.92]; Ginns, 2005). Since this meta-analysis refers to classical multimedia settings rather than VR studies, we conservatively used the lower bound of the confidence interval, d = 0.52, as the expected effect size. For this effect size, with alpha = 0.05 and power = 0.95, the projected total sample size needed is approximately N = 52 (Faul et al., 2009; G*Power Version 3.1.9.2).

Participants and study design

We collected data from a total of 61 subjects aged 19 to 37 years (M = 23.16, SD = 3.15). The majority of the subjects were female (n = 42; 69%) and 90% were students (n = 55). Requirements for study participation were a minimum age of 18 years, proficiency in German, and correction of any visual impairments with contact lenses or glasses. The experiment was based on a between-subjects design with the between-subjects factor modality (visual-only vs. audio-visual). As dependent measures we assessed learning outcomes (recall, comprehension, transfer) and cognitive load (ICL, ECL, GCL). Based on previous research, relevant control variables were included in the analysis: prior knowledge (Zambrano et al., 2019), motivation (Makransky, Borre-Gude, et al., 2019), and working memory capacity (Seufert et al., 2009).

Materials and instruments

Virtual reality learning unit

The immersive VR learning material (see Figs. 1 & 2) focused on the human immunodeficiency virus (HIV). The learning material is based on the tool cellVIEW, on top of which a VRLE was developed for this study (Le Muzic et al., 2015). The learning phase, which lasted 11 minutes on average, comprised a guided tour through the virus, in which the interrelationships and processes within the virus were shown in 3D and macromolecular biological processes were explained. Relevant external and internal components of the virus, such as proteins or capsid structures, were shown as subjects were immersed in the virus, providing 360° views and detailed images of various structures. One feature that distinguishes our learning material in VR from animations or videos is that it reacts dynamically to the position of the learner. For example, macromolecular structures only became visible when the learner moved towards or into a component of the virus. Learners could move freely in the VRLE and observe structures such as cell membranes from all sides, both inside and outside. Verbal explanatory texts for the learning content were presented either visually in text form (visual-only) or auditorily by a speaker (audio-visual), depending on the condition; the spoken and written texts were identical. The verbal explanatory texts had an average length of M = 38.94 words (SD = 16.02) and were created by subject matter experts and translated from English to German for the study. We made sure that the visual text was smoothly integrated into the VRLE. The rotation of the text always adapted dynamically to the position of the learner to ensure good readability from all sides. Although the static screenshots in Fig. 1 might appear a little blurry, the text in the VRLE is sharp and clear to read. A high-contrast outline makes the text stand out well against any background to ensure readability.
Due to the dynamic adjustment to the position of the learner, the text also did not obscure the relevant structures. The audio in the audio-visual condition was presented through the headphones attached to the VR headset, with the volume kept the same across subjects. The auditory text was played when the learners clicked on the corresponding speaker icon. In both conditions, white circles were placed around the currently relevant object in the VRLE to ensure that learners knew at all times which image belonged to the text. Learners were able to decide for themselves when to go on to the next learning content. Replaying content or returning to previous content was not possible. A Vive Pro HMD (HTC Corporation, 1080 × 1200 pixels per eye, frame rate 90 Hz, field of view 110 degrees) with two Vive controllers was used for the VR simulation. To enable full room-scale tracking, we mounted four Vive Basestation 2.0 sensors on the ceiling of the laboratory. Participants could move freely within the tracked area (approx. 2 × 2 m). The windowless lab provided stable lighting conditions so that the VR headset's sensor technology was not affected by external interference. The HMD was connected to an HP Omen computer (Intel Core i7-9750H, NVIDIA GeForce RTX 2080, 16 GB SDRAM). The questionnaires and constructs collected in the present study are described below. Demographic data, such as age, gender, and semester, as well as participants' previous contact with VR, were also collected.

Fig. 1
figure 1

Virtual Reality Learning Environment cellVIEW VR. Shows the visual-only condition

Fig. 2
figure 2

Virtual Reality Learning Environment cellVIEW VR. Shows the audio-visual condition

Prior knowledge

At the beginning of the study, learners' prior knowledge was assessed with a total of 18 questions: 14 open-ended questions and four multiple-choice questions with four answer options each (e.g., "What are the four nucleic bases that make up RNA?"). The test assessed domain-specific knowledge about HIV as well as basic biochemical knowledge. Subjects could score a maximum of 28.5 points; no points were deducted for incorrect answers. Three independent raters evaluated the subjects' answers on the basis of a previously developed sample solution, and interrater reliability was calculated using the intraclass correlation coefficient (ICC = 0.98, 95% CI [0.974, 0.990], p < .001).
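For illustration, a two-way ICC of this kind can be computed directly from the subjects × raters score matrix. Below is a minimal sketch in Python, assuming an ICC(2,1) model (two-way random effects, absolute agreement, single rater); the exact ICC variant used in the study is not specified in the text, so this is an illustrative reconstruction, not the authors' analysis script.

```python
import numpy as np

def icc_2way(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater ICC.

    scores: (n_subjects, k_raters) matrix of points awarded by each rater.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
    ss_err = ((scores - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Perfect agreement between raters yields an ICC of 1; any disagreement lowers the coefficient toward 0.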

Cognitive load

Cognitive load was assessed immediately after the learning session using the three subscales (ICL, ECL, and GCL) of the differentiated cognitive load questionnaire (Klepsch et al., 2017). The questionnaire comprises seven items rated on a 7-point scale (1 = "absolutely not true" to 7 = "completely true") and distinguishes between three types of cognitive load. ICL was measured with two items (e.g., "This task was very complex."), ECL with three items (e.g., "The design of this task was very inconvenient for learning."), and GCL with two items (e.g., "For this task, I had to think intensively what things meant."). Satisfactory reliability scores between Cronbach's α = 0.80 and α = 0.86 were reported for these scales in the validation of the questionnaire (Klepsch et al., 2017).

Learning outcomes

To assess learning outcomes, subjects were given a total of 13 questions (12 open-ended and one multiple-choice), differentiated into recall, comprehension, and transfer according to Bloom's (1956) taxonomy. All questions were based on the content of the preceding VR learning material. With five recall questions (e.g., "To which family does the HI virus belong?"), four comprehension questions (e.g., "Explain how reverse transcriptase contributes to the construction of new viruses."), and four transfer questions (e.g., "Which of the two illustrations represents an HIV virus? Please give reasons for your choice."), a maximum total score of 20.5 could be achieved: 7 points for recall, 6 points for comprehension, and 7.5 points for transfer. Where appropriate, the questions also addressed three-dimensional aspects, for instance when three-dimensional images of different viruses were shown (as in the transfer example above) or when the structures of two capsid forms (hexamer and pentamer) could be distinguished visually by exploring three-dimensional space. Cronbach's alpha was α = 0.25 for recall, α = 0.41 for comprehension, and α = 0.57 for transfer. However, in our view, Cronbach's alpha is not an adequate measure of the reliability of a prior knowledge test or post-test: its basic assumption is that the items assess one coherent underlying construct, whereas the prior knowledge test and post-test cover diverse content facets, e.g., referring to different "chapters". To nevertheless ensure the quality of the instruments, we relied on construct validity, i.e., expert development of the items. To ensure reliability, we used interrater reliability as an alternative and legitimate approach. The test was evaluated by the same three independent raters as the prior knowledge test, using a sample solution, and interrater reliability was very high (ICC = 0.983, 95% CI [0.973, 0.989], p < .001).
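For reference, the Cronbach's alpha values discussed above follow the standard formula α = k/(k−1) · (1 − Σ item variances / variance of the sum score). A minimal sketch (illustrative only, not the software used in the study):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

As the formula makes explicit, alpha is high only when items covary strongly, which is exactly why it penalizes tests that deliberately sample heterogeneous content facets, as argued above.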

Motivation

The questionnaire on current motivation (QCM; Rheinberg et al., 2001) was used to measure the subjects' motivation. Using a total of 18 items on a 7-point Likert scale ranging from 1 (totally disagree) to 7 (totally agree), it measures four dimensions of motivation: current interest with five items (e.g., "I don't need a reward for tasks like this because I enjoy it anyway."), probability of success with four items (e.g., "I believe I am up to the difficulty of this task."), fear of failure with five items (e.g., "I feel pressured to do well on the task."), and challenge with four items (e.g., "This VR learning lesson is a real challenge for me."). Reliability was α = 0.83 for the subscale interest, α = 0.72 for probability of success, α = 0.91 for fear of failure, and α = 0.58 for challenge. For further analyses, a total motivation score was computed; Cronbach's alpha for the total scale was α = 0.72.

Working memory

To measure working memory capacity, we used Oberauer and colleagues' (2000) numerical memory updating test. This test, which is often used in multimedia studies, is particularly well suited for VR learning material, since the arrangement of the numbers in a matrix adds a spatial component to the numerical one. In addition, it requires learners not only to store information but also to process it. We developed an online version with the same algorithm as the original. Participants were presented with a 3 × 3 matrix in which each cell could contain a number. Starting with three numbers, each number was presented for 1,300 ms. For each number, several simple arithmetic instructions could be given, indicated by arrows in the cell (up = plus 1; down = minus 1). The cells were then queried for the results one after the other. If more than 75% of the cells were answered correctly in one attempt, the number of cells was increased by one, up to a maximum of nine cells. If a participant achieved less than 75%, two further attempts were granted; the test ended automatically if the participant failed again.
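The adaptive level logic described above can be sketched as follows. This is a simplified illustration under our reading of the procedure, not the original online implementation; the hypothetical `respond` callback stands in for a participant's recall accuracy on one attempt.

```python
def run_memory_updating(respond, start_cells=3, max_cells=9,
                        threshold=0.75, max_attempts=3):
    """Simplified sketch of the adaptive staircase of the updating test.

    respond(n_cells) -> proportion of cells recalled correctly (0.0-1.0).
    Returns the highest cell count reached before the test ended.
    """
    cells = start_cells
    attempts = 0
    while cells <= max_cells:
        accuracy = respond(cells)
        if accuracy > threshold:        # passed this attempt
            if cells == max_cells:
                break                   # ceiling reached
            cells += 1                  # level up
            attempts = 0
        else:
            attempts += 1
            if attempts >= max_attempts:  # initial try plus retries failed
                break
    return cells
```

A participant who answers every attempt perfectly climbs from three to nine cells; one who keeps failing exhausts the allowed attempts and stays at the starting level.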

Immersion

To measure immersion, we used the immersion subscale of the Technology Usage Inventory (Kothgassner et al., 2013). The subscale consisted of four items (e.g., "During the virtual simulation, I totally forgot about the world around me."), rated on a 7-point scale ranging from 1 ("do not agree at all") to 7 ("totally agree"). Cronbach's alpha showed a reliability of α = 0.81.

Study procedure

First, all participants were informed about the content of the study and then asked to carefully read and sign the subject information and informed consent forms. This was followed by an online questionnaire assessing demographic data, prior knowledge, and current motivation. Participants were then randomly allocated to either the visual-only or the audio-visual condition. A short introduction to the use of the controllers and to the control of the self-paced learning unit followed. The subjects then put on the VR HMD with the integrated headphones to start learning. Since subjects in the visual-only condition could theoretically have read the texts several times, the time spent with the learning material was recorded as a control variable. Immediately after the learning unit, cognitive load was assessed. Learners then answered the questionnaires assessing learning outcomes and immersion and completed the task assessing working memory capacity.

Results

All data analyses were performed with the Statistical Package for the Social Sciences, Version 26, with the α level set to 0.05 for all calculations, except for the Bayesian independent-samples t-tests, which were performed with JASP (JASP Team, 2022). The effect size partial eta squared was interpreted according to Cohen (1988); accordingly, the thresholds are 0.01 for a small, 0.06 for a medium, and 0.14 for a large effect. The requirements of normally distributed residuals and homogeneity of variance for the calculation of an ANCOVA were checked before the analyses. The covariates were tested for homogeneity of the regression slopes, and we also checked that the correlation of the covariate with the dependent variable did not differ across groups.
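One way to probe the homogeneity-of-regression-slopes assumption mentioned above is to fit the covariate-outcome regression separately per condition and compare the slopes. The sketch below illustrates this idea with numpy; it is illustrative only (the study's analyses were run in SPSS), and a full test would examine the group × covariate interaction term in the ANCOVA model.

```python
import numpy as np

def group_slopes(x, y, group):
    """Per-group OLS slopes of y on x; similar slopes across groups support
    the homogeneity-of-regression-slopes assumption of ANCOVA."""
    x, y, group = map(np.asarray, (x, y, group))
    slopes = {}
    for g in np.unique(group):
        xs, ys = x[group == g], y[group == g]
        # OLS slope: cov(x, y) / var(x)
        slopes[g] = np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)
    return slopes
```

If the per-group slopes diverge substantially, the covariate's relationship with the dependent variable differs between conditions and the ANCOVA adjustment is not appropriate.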

Descriptive data and effects of control variables

T-tests showed no differences between the conditions regarding age (p = .64), prior knowledge (p = .47), working memory capacity (p = .63), motivation (p = .54), prior contact with VR (p = .52), immersion (p = .22), or time spent in the VR learning session (p = .22). A χ²-test found no difference in gender between the conditions (χ²(1, N = 61) = 0.13, p = .79). Descriptive data for all control variables per condition can be found in Table 1.

Table 1 Descriptive data for all control variables in the visual-only and audio-visual conditions

In order to identify potential covariates, we analyzed correlations for each covariate with each dependent variable. Correlation analyses revealed relevant covariates for the following variables: motivation (r = .35, p = .01) for recall, prior knowledge (r = .61, p < .001) for comprehension, and motivation (r = .23, p = .047) and prior knowledge (r = .395, p = .002) for transfer.

Concerning ECL, we found a correlation with immersion (r = − .44, p < .001). For GCL, a significant correlation with motivation (r = .42, p < .001) and immersion (r = .40, p = .03) was found. In the following analyses, the effects of the control variables were controlled. The descriptive statistics and ANCOVA results for learning outcomes and cognitive load are shown in Table 2.

Table 2 Means, standard deviations and ANCOVA results in the different experimental conditions

ICL = intrinsic cognitive load; ECL = extraneous cognitive load; GCL = germane cognitive load.

Effects on learning outcomes

Contrary to our hypotheses, the visual-only condition scored significantly higher than the audio-visual condition on recall (H1: F(1,58) = 9.60, p = .002, ηp² = 0.14), comprehension (H2: F(1,58) = 5.74, p = .01, ηp² = 0.09), and transfer (H3: F(1,57) = 2.85, p = .049, ηp² = 0.05). The covariate motivation had a significant impact on recall, F(1,58) = 10.67, p = .002, ηp² = 0.16. The covariate prior knowledge had a significant impact on comprehension, F(1,58) = 34.01, p < .001, ηp² = 0.37. The covariates prior knowledge (F(1,57) = 9.09, p = .004, ηp² = 0.14) and motivation (F(1,57) = 4.03, p = .05, ηp² = 0.06) had a significant impact on transfer. Results on learning outcomes are shown in Fig. 3.

Fig. 3
figure 3

Mean scores for learning outcomes

Effects on cognitive load

In line with our expectations, no significant difference was found between the conditions for ICL (H4: F(1,59) = 2.69, p = .106, ηp² = 0.044). Contrary to our hypothesis, we found no significant difference between the conditions for ECL (H5: F(1,58) = 0.73, p = .39, ηp² = 0.012). Against our expectations, we found significantly higher GCL scores in the visual-only condition than in the audio-visual condition (H6: F(1,57) = 10.02, p = .002, ηp² = 0.15).

The covariate immersion had a significant impact on ECL, F(1,58) = 12.90, p < .001, ηp² = 0.18. The covariates immersion (F(1,57) = 5.87, p = .019, ηp² = 0.09) and motivation (F(1,57) = 12.04, p < .001, ηp² = 0.17) had a significant impact on GCL. Results on cognitive load are shown in Fig. 4.

Fig. 4
figure 4

Mean cognitive load. ICL = intrinsic cognitive load; ECL = extraneous cognitive load; GCL = germane cognitive load

Bayesian independent samples T-test

Based on the only evidence so far that the modality effect can be reversed in VRLE (Baceviciute et al., 2020), we additionally analyzed the dependent variables with Bayesian independent-samples t-tests using BF01 (which quantifies evidence for the null hypothesis relative to the alternative hypothesis) and an alternative hypothesis specifying that the location of the visual-only group is smaller than that of the audio-visual group.

The prior and posterior plot for our recall data shows that the Bayes factor indicates evidence for H0, specifically BF0− = 12.48, which means that our recall data are approximately 12 times more likely under H0 (no difference between the groups) than under H− (i.e., that the audio-visual group shows higher learning outcome scores than the visual-only group). This result indicates strong evidence in favor of H0. The median of the resulting posterior distribution was δ = -0.059 (95% CI [-0.274, -0.002]), which indicates uncertainty about its size. The same analysis for the comprehension data shows strong evidence in favor of H0, with BF0− = 11.60, δ = -0.063 (95% CI [-0.288, -0.002]). The data for transfer show moderate evidence in favor of H0, with BF0− = 9.12, δ = -0.078 (95% CI [-0.337, -0.003]).
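The evidence labels used here (anecdotal, moderate, strong) follow the conventional Jeffreys-style thresholds reported by JASP (1-3 anecdotal, 3-10 moderate, 10-30 strong). As a quick reference, a small sketch of this mapping and of the reciprocal relationship between BF01 and BF10 (function names are ours, purely illustrative):

```python
def evidence_label(bf: float) -> str:
    """Map a Bayes factor to the conventional Jeffreys-style category
    labels used by JASP, in the direction the data favor."""
    if bf < 1:
        bf = 1 / bf  # flip to the favored hypothesis
    if bf < 3:
        return "anecdotal"
    if bf < 10:
        return "moderate"
    if bf < 30:
        return "strong"
    return "very strong"

def bf01_from_bf10(bf10: float) -> float:
    """BF01 is simply the reciprocal of BF10 (likewise for one-sided BFs)."""
    return 1.0 / bf10
```

For example, the recall result (BF0− = 12.48) falls in the strong band, while the transfer result (BF0− = 9.12) falls just inside the moderate band.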

Regarding cognitive load, we tested ICL with BF10, as we hypothesized no difference between the groups. The analysis revealed BF10 = 0.80, δ = -0.355 (95% CI [-0.846, -0.111]), indicating only anecdotal evidence that the groups do not differ.

For ECL we used BF0+ and found BF0+ = 8.04, δ = 0.088 (95% CI [0.003, 0.365]). The ECL data are thus about 8 times more likely under H0 (no difference between the groups) than under H+ (i.e., that the audio-visual group shows lower ECL scores than the visual-only group). Results on ECL indicate moderate evidence in favor of H0. For GCL we used BF0− and found BF0− = 13.20, δ = -0.056 (95% CI [-0.263, -0.002]). The GCL data are thus about 13 times more likely under H0 (no difference between the groups) than under H− (i.e., that the audio-visual group shows higher GCL scores than the visual-only group). Results on GCL indicate strong evidence in favor of H0.

Discussion

Based on classical instructional design research and its underlying cognitive models, the aim of this study was to investigate the influence of the modality principle in a VRLE on learning outcomes and cognitive load.

Effects of the modality principle on learning outcomes

Contrary to our theoretically derived hypotheses, we found a reverse modality effect for recall, comprehension, and transfer. In addition, the results of the Bayesian independent-samples t-tests for recall, comprehension, and transfer suggest a reverse modality effect in our data with strong to moderate evidence. The visual-only condition showed significantly higher scores on recall, comprehension, and transfer than the audio-visual condition. This is in line with the only previous study on the modality principle in VR (Baceviciute et al., 2020), which also found a reverse modality effect, but it contrasts with the theoretical assumptions outlined above. The question is: why does visual text help learners achieve higher learning outcomes at all levels compared to auditory text? In our view, this can be explained firstly by the characteristics of the VRLE itself, secondly by the characteristics of the text, and thirdly by the content.

First, the VR technology itself could be a reason for the reverse modality effect. VRLE are very different from conventional learning environments. One important difference is immersion. As mentioned at the beginning, immersion is achieved, among other things, by visually detailed or realistic VRLE and by the perspective of being in the environment. Moreover, learners have to walk and look around in order to fully grasp the VRLE. This active interaction is another important reason why learning in our VRLE is fundamentally different from classical multimedia learning environments. The amount of visual input and the interaction within the VRLE could encourage learners to explore and make as many observations as possible. This could lead to a superficial focus on the subject matter rather than striving for a deeper understanding of the learning material. This assumption might be reflected in the mediocre recall performance and the overall low comprehension and transfer performance. What complicates the visual input, compared to classical multimedia learning environments, is that the VR image the learner perceives is constantly changing: the learner has to constantly move and look around to find the right perspective on the visual input in combination with the text. Another difference is that VRLE present a variety of visual stimuli which are not necessarily directly related to the learning task. Thus, one problem with VR as a learning medium is that learners are often distracted from the essential learning content (Makransky, Mayer, et al., 2019). Because selective attention is more easily distracted, learners might memorize non-relevant concepts (Baceviciute et al., 2020).

In summary, the VRLE itself presents a challenge in learning related to immersion, the interaction with the learning material, and seductive details. But why do learners achieve a higher learning outcome with visual text? The visual text helps learners to initially focus on the relevant element because the text is located close by. Moreover, in the visual-only condition, learners must keep their focus of attention on the text in order to read the additional information and cannot look around or interact with the VRLE during this process. Thus, the visual text presentation additionally supports the hold component of learners’ attention (Bolkan et al., 2018). Even though learners in the visual-only condition had to divide their attention between the two visual sources in order to combine the text and VR image, they still outperformed the audio-visual condition in every learning outcome, even though the latter could theoretically select, process and connect both sources simultaneously without splitting their visual attention.

Aside from these potential benefits, learning with auditory texts in a VRLE with the specific characteristics mentioned above can be challenging. Although the initial focus must still be on the relevant learning material to start the audio track, learners do not have to keep their focus on the relevant element while the audio track is playing. However, we do not know for certain whether learners actually look around and interact with the VRLE during the narration. Therefore, future research would need to include eye tracking in VRLE to determine where and for how long the learners' gaze actually lingers.

As mentioned above, the second reason for the reverse modality effect may lie in the characteristics of the text. The first text characteristic is text length. In our study we tried to use the optimal text length: on the one hand, it should be long enough to enable deeper learning processes that require the integration of visual and auditory information; on the other hand, it should not be too long, as Kürschner and colleagues (2006) suggest that the modality effect may disappear or reverse with longer texts. They argue that the ability to self-regulate while reading can compensate for the processing disadvantages that arise from the transience of auditory text. The auditory information is only temporarily accessible to the subjects while they are listening, whereas the visually presented texts remain available to the learners until they deliberately move on to the next text (Leahy & Sweller, 2016; Singh et al., 2021; Wong et al., 2012). The longer the texts, the more critical it becomes when the information can no longer be retained in working memory or previous information is discarded. Visual text, in contrast, is stable and can be read several times to retain it in working memory. With an average length of about 40 words, the texts in our study may still have been too long, overloading the phonological loop, which could explain why the modality effect was reversed. This is especially important for the second text characteristic, the use of unfamiliar words. This is particularly relevant for the text we used, since many unknown macromolecular biological structures are described. To memorize such terms, it would be helpful to repeat them and see how they are spelled. In terms of Paivio's (1991) dual coding theory, one might assume that seeing a term and mentally recalling it in the phonological loop would result in a dually encoded representation of the term and thus better recall performance. However, this is only possible with visual text; in the auditory version, the transience again hinders learners from repeating or seeing these terms. In future studies, it would be important to vary text length and complexity to determine if and at what point a possible modality effect is reversed in a VRLE.

A third reason for the reverse modality effect might be the learning content itself and its inherent complexity. The overall learning outcome of the study can be rated as medium: learners showed a rather medium overall performance for recall and a medium to low overall performance for comprehension and transfer. This might indicate that the material was too complex to achieve high learning outcomes on average, which is also reflected in the high ICL scores. In conjunction with this, we found the reverse modality effect to be large for recall, medium for comprehension, and only small for transfer. As argued above, audio-visual presentation is particularly beneficial when learners need to integrate the given information, whereas visual text presentation, as a counterbalance, could support learners in repeating the material in a more self-regulated way. Combining these arguments, the integrative support provided by the audio-visual presentation would partly offset the advantages of the stable visual text when deeper learning processes are required. Therefore, the strong reversal effect decreases for comprehension and even more so for transfer, but does not disappear completely.

Effects of the modality principle on cognitive load

In line with our hypothesis (H4), we found no significant differences in ICL between the visual-only and the audio-visual condition. ICL is characterized by the complexity or difficulty of the learning material. Since both groups received exactly the same information content, we could show that the modality of textual information has no influence on the perceived difficulty and thus on ICL.

Regarding ECL, we found no difference between the visual-only and audio-visual conditions; therefore, our hypothesis (H5) could not be supported. The ECL-reducing advantage of the audio-visual condition, namely being able to use both channels to process the information, was probably offset by the stability of the visual text in the visual-only condition, which can likewise reduce ECL. In fact, learners with visual text also had the advantage over the transient audio-visual condition that they did not have to retain all the information in working memory at the same time, but could view the permanently presented information repeatedly (Seufert et al., 2009).

The aforementioned possibility of self-regulation may also influence cognitive load, as attention must first be divided between different visual information and then cognitively recombined (Tabbers & de Koeijer, 2010). Consequently, the audio-visual condition can benefit from relief of ECL, as the presented information can be processed simultaneously. In the visual-only condition, by contrast, an overload could be compensated if learners can take the time they need to make cognitive connections between different visual information (Tabbers & de Koeijer, 2010). The results of a study by Liu and colleagues (2020) also support the assumption that the reverse modality effect can be observed in simulated learning environments when the environment is learner-paced.

Contrary to our hypothesis (H6), we found that a visual-only presentation of the learning material in VR can significantly increase GCL compared to an audio-visual presentation. Since no differences between the two conditions were found in ECL, a difference in ECL cannot be responsible for the differences in GCL. However, this is consistent with our finding that learners in the visual-only condition performed better on all three levels of learning outcome and consequently invested more GCL. Greater investment in GCL could in turn lead to better learning outcomes (Paas et al., 2005). One reason for a higher GCL in the visual-only compared to the audio-visual condition could be differences in the way learners actually learn with different modalities in VR. This is also consistent with previous research: Baceviciute and colleagues (2020), who also investigated the modality principle in VR, found that more mental effort is invested in a visual-only condition than in an audio-visual condition. According to Baceviciute and colleagues (2020), reading in VR, compared to listening in VR, is more active learning because sentences are repeated and information is more easily integrated across sentences. The use of the self-regulation strategies referred to above can also be understood as an investment that can be reflected in higher GCL. In addition, text reading and processing strategies are used that might automatically come along with more investment in learning than more passive listening (Kolić-Vrhovec et al., 2011).

Strengths and limitations

VRLE usually present additional learning content as auditory text. Thus, the modality principle is already applied in practice in VRLE, but there have hardly been any studies so far that have examined its effectiveness and the underlying cognitive processes when learning in VR. Moreover, as the technology becomes more available and more widely used in learning contexts, empirical research should be available to inform how VRLE are designed (Fabris et al., 2019).

One strength of this study is its theoretical foundation, which draws on established learning theories (Mayer, 2005; Sweller, 2005). In order to better understand the cognitive processes involved in learning in VR, the learning outcomes were examined in detail at different levels of processing. Up to now, the modality effect in a VRLE has only been investigated for recall and transfer. In our study, we also examined comprehension, which, according to Bloom's taxonomy, lies between recall and transfer in terms of cognitive complexity and reflects the specific demands of dealing with high complexity (Bloom, 1956).

The focus on cognitive processes is also reflected in the differentiated measurement of cognitive load, divided into ICL, ECL, and GCL. Klepsch and Seufert (2020) recommend that studies on the design of multimedia learning material assess cognitive load in a differentiated way in order to shed more light on the processes underlying learning success and on the multimedia design principles used, and to identify design choices that are conducive to learning.

Another strength of this study is the learning material. It was designed to be highly immersive, taking learners on an interactive journey through a virus cell. To allow learners to experience the benefits of learning materials made specifically for VRLE, high-end HMDs were used in combination with a VR lab allowing free movement to immerse the learners in the learning material. This was also supported by the presentation of dynamic biostructures, which allowed the learner to walk into each component of the cell to see its underlying molecular structure. This presentation exploits the unique possibilities that VR offers the learner, rather than merely transferring traditional learning material into VR.

In contrast, the present study has limitations that could be addressed in future studies. First, the present study exclusively used subjective self-assessment of cognitive load. Measuring cognitive load with self-report questionnaires requires that learners be able to assess their own cognitive load, so erroneous assignment of perceived load to ICL, ECL, or GCL cannot be ruled out (Ayres, 2020; Inan et al., 2015). However, 90% of the sample consisted of students, who are quite capable of metacognitively assessing their cognitive load (Klepsch et al., 2017). In addition, the instrument we used showed quite promising results regarding its reliability and validity in the meta-analysis by Krieglstein and colleagues (2022). Because of the strong visual component in VR, as well as the increased cognitive load that accompanies it, Andersen and Makransky (2020) developed their own questionnaire, the Multidimensional Cognitive Load Scale for Virtual Environments. Its questions are specifically adapted to VR applications and measure load in a differentiated way, but with a clearer focus on ECL, including questions about instruction, interaction in VR, and the VR environment (Andersen & Makransky, 2020). The present study might nevertheless have benefited from the additional collection of objective data. For example, Baceviciute and colleagues (2020) collected objective EEG data alongside a series of self-report questionnaires, which supported the interpretation of the subjects' cognitive load results.

A learning outcome test that measures not only the required content but also additional details of the VRLE might help to explain why this and previous studies on the modality effect in VR found a reverse modality effect. Baceviciute and colleagues (2020) found that subjects in the audio-visual condition were able to reproduce more unnecessary details of the VRLE than subjects in the visual-only condition. In addition, Rummer and colleagues (2010) found that the eye movements required for reading a written text may limit the retention of pictures, in contrast to text presentations that do not require eye movements. Thus, it could be investigated whether the visual-only condition would still be superior to the audio-visual condition even for questions that require increased recall of details of the VRLE. Accordingly, the present study could have benefited from recording the subjects' eye movements, which could shed more light on where subjects direct their attention in the immersive VRLE. This could clarify whether learners in the visual-only condition paid more attention to the text than to the visual content in VR (Baceviciute et al., 2020). In addition, subjects' eye movements could reveal whether learners in the visual-only condition read certain passages of the text more than once and were therefore able to process the information more deeply, which in turn could have a positive impact on learning outcomes (Catrysse et al., 2018; Inan et al., 2015).

Conclusion

The results of the present study contribute to the scarce research on designing immersive VRLE to promote learning (Radianti et al., 2020). In summary, our data suggest that the modality effect is reversed in VRLE. This contradicts the classical multimedia theories from which we derived our hypotheses, but it is consistent with preliminary empirical evidence for a reverse modality effect in VRLE. Further investigations of the modality effect in VRLE are therefore necessary to determine whether the effect is actually reversed in VRLE and, if so, under which conditions. The results of this study support preliminary assumptions that subjects in immersive VR could achieve better learning outcomes in recall, comprehension, and transfer when learning with written rather than spoken texts.

This is partly consistent with the findings of Baceviciute and colleagues (2020), who also found positive effects on recall, but not on transfer, for subjects learning with written rather than spoken texts in a VRLE. Moreover, our results suggest that written rather than spoken text presentation in a VRLE could promote higher GCL, which could explain the higher overall learning outcomes of the visual-only condition compared with the audio-visual condition. By contrast, we found no reduction in ECL through the use of the modality principle (Low & Sweller, 2005). The assumption that spoken rather than written text presentation counteracts an overload of the visual channel, especially given the highly visual nature of immersive VRLE, and thus promotes lower ECL, was not supported in our study (Low & Sweller, 2005). We also showed that the modality of the additional information has no influence on ICL. VRLE may have high potential for future learning scenarios. However, for learning in VRLE to be successful, further empirical research is needed to investigate which cognitive demands are challenging and how they could be supported (Radianti et al., 2020).

Another recent study, on the redundancy effect in VR, likewise found the reverse of a multimedia design principle (Liu et al., 2021). It should be noted, however, that this reverse effect was found in a VR classroom, and the authors attribute it to classroom interferences rather than to features of the VR learning environment. If multimedia design principles not only fail to appear but even reverse, this could indicate that learning in VR requires support for different cognitive processes than learning in classical environments. Makransky and Petersen (2021) reach a similar conclusion and have made a first attempt to provide a framework for this with the Cognitive Affective Model of Immersive Learning, specifying presence and agency as two important affordances of learning with VR. However, current evidence suggests that immersive VR requires additional cognitive affordances. Identifying these affordances and how they can be supported is, in our opinion, an important goal for future research on learning in VR. This could lead to design principles specifically for VRLE that exploit their advantages, which lie primarily in the realistic, dynamic representation of and interaction with the learning content, for example teaching courses of action or practical skills in simulated environments that would otherwise be too dangerous. As such environments are already being used in industry and education, empirical research is needed on how these learning environments can be supported through instructional design.