
1 Introduction

This chapter explores the pedagogical setting of hard skills training that takes place in immersive virtual reality (VR), guided by artificial intelligence (AI) tutoring software. Since the commercial introduction of sophisticated but affordable immersive virtual reality hardware around 2016, immersive VR technology has generated widespread interest for training practical skills. This could be due to the technology’s profound educational affordances: (1) providing a strong sense of presence and (2) affording agentic embodiment of operational activity (Johnson-Glenberg 2018). As such, VR is considered to have promise across many educational domains. In the same time frame, a family of machine learning methods based on deep hierarchies of neural network layers, known as “deep learning,” has made major advances in enabling practical AI systems that act on classifications of large amounts of observed data. The key aspect of deep learning is that the constituent features making up different classes are not engineered by humans but learned from training data (see e.g., Goodfellow et al. 2015).

Our research aims to examine how the inherent features of VR, such as modifiability and observability, could benefit AI-based tutoring software. A software program performing real-time inference on models observed from a learner’s behavior in VR – what we call an AI tutor – could observe more patterns in learner activity than what is within the capabilities of a human trainer. Based on these observations, the tutor could modify the VR environment dynamically to support learning. Also, unlike most human trainers, an AI tutor could maintain its attention on the learner constantly.

The focus of our current work is training hard skills in industrial settings. In these settings, immersive virtual training environments (IVRTEs) are used to simulate real-life operational environments where learners can practice the use of equipment or the mechanisms of machinery and perform safety and work procedures. This domain offers an interesting area for research as: (1) knowledge to be learned is mostly procedural, allowing experimental setups that better isolate phenomena attributable to VR technology and (2) in this domain VR training is increasingly seen to address current hard skills training challenges of timeliness, cost, authenticity, accuracy, and scalability, and thus many industrial organizations are already implementing training using VR technology.

Realizing AI-based tutoring software that can produce richer and more consequential learning in an IVRTE requires both extensive development and experimental studies. In this chapter, we elaborate on a theoretical framework that could inform such work. We will first explore the application of intelligent tutoring systems (ITS) to immersive learning, then review applicable learning theory and conceptualize it within a proposed AI tutor framework, and finally suggest reasonable VR-native pedagogical approaches that could inform empirical research.

2 From ITS to AI Tutor

2.1 Intelligent Tutoring System (ITS)

In the industrial training domain, hard skills learning normally takes place under a human trainer’s supervision and control. The need for better scalability calls for computer systems that would allow learners to learn or assess their knowledge and skills by themselves, without human trainer guidance. However, without personalized pedagogical guidance on both selecting the next learning task and completing the task, the learner may not achieve the performance and conceptual learning goals, or may fail to do so within a target time, undermining the very scalability being sought. A computer system providing such pedagogical guidance is known as an intelligent tutoring system (ITS).

A large body of research, dating back to the late 1960s, exists to inform the construction of an ITS (Alkhatlan and Kalita 2018). The canonical structure of an ITS divides its functions between four interconnected modules (Wenger 1987), see Fig. 1. The expert knowledge module (or domain model) serves as a repository of expert knowledge about the task being tutored. In the procedural training context, this knowledge, captured from subject matter experts, defines the steps of the procedure to be learned. The student model module (or learner model) enables personalized learning by capturing the system’s current understanding of the learner’s mastery of the domain model tasks and the student’s cognitive state. The ITS makes its decisions in the tutoring module, which, following the tutoring strategies known to the ITS, executes two decision loops: (1) the outer loop, selecting a task that would best help the learner learn, and (2) the inner loop, guiding the learner by instruction through the right steps constituting a task (VanLehn 2006). Each of these modules has spawned its own rich research topic and literature.
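The interplay of the domain model, student model, and the two tutoring loops can be illustrated with a minimal sketch. All class and function names below are our own illustrative choices, not taken from any cited ITS; mastery is simplified to a single number per task.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of a procedure in the domain model."""
    name: str
    hint: str

@dataclass
class Task:
    """A tutored task, defined by subject matter experts."""
    name: str
    steps: list  # list of Step

@dataclass
class StudentModel:
    """The system's current estimate of per-task mastery (0..1)."""
    mastery: dict = field(default_factory=dict)

    def estimate(self, task):
        return self.mastery.get(task.name, 0.0)

def outer_loop(tasks, student):
    """Outer loop: select the task the learner has mastered least."""
    return min(tasks, key=student.estimate)

def inner_loop(task, perform_step):
    """Inner loop: guide the learner step by step through one task,
    offering the authored hint after each failed attempt."""
    for step in task.steps:
        while not perform_step(step):  # learner attempt
            print(f"Hint for {step.name}: {step.hint}")
```

In a real ITS the mastery estimate and step evaluation would come from the student model and domain model research literatures; the point here is only the division of responsibilities between the modules.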

Fig. 1

Traditional ITS architecture (adapted from Wenger 1987), augmented to show the interaction modalities with the learner when the user interface module is provided by an IVRTE. Sensors such as head, eye, and face trackers provide the computer with information about the learner. Sensory simulators, including head-mounted binocular displays, headphones, and haptic vibrators, simulate sensory experiences for the learner

2.2 Observability

To adapt an ITS that assumes the canonical structure to an IVRTE, one needs to replace the fourth module, the user interface module, with the IVRTE user interface (see Fig. 1). Recent ITS research has naturally been directed toward the user interfaces with the widest availability: web browsers and mobile apps. As such, the input from the learner consists of typed keyboard input, pointing using a mouse, and selections through mouse clicks/taps. In addition, directional input through device acceleration sensors has been utilized. While some systems allow user audio input, the predominant method of conversational input is typing. Additional sensors, such as eye trackers or heart rate monitors, have been used in experiments that aim to enrich the student model with information on learner affect, with the aim of implementing the principles of affective computing (Picard 1997).

In contrast, a standard IVRTE user interface in 2021 consists of sensors that provide kinematic tracking of the position and rotation of the user’s head, as well as of the controllers the user holds in, or has strapped to, each hand. A standard VR headset also includes headphones and a noise-cancelling microphone for audio input. Eye and face tracking as well as heart rate tracking are readily available as commercial options. Tracking of finger joints is available in some hand controllers, and as a camera-based option if controllers are not used. As such, an IVRTE provides much more input data than what is utilized by a traditional ITS that tracks learner interactions with a graphical user interface. As the learner’s representation in the virtual-physical space is mediated through sensor hardware, the VR environment uniquely affords extensive observability of the learner’s location, posture, and interaction.
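As a concrete illustration, this observability can be represented as a stream of kinematic frames from which the tutor derives higher-level observables. The data structure and the “reach distance” feature below are our own illustrative choices, not a standard VR runtime API.

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    """Position (meters) and rotation (Euler degrees) of one tracked device."""
    x: float; y: float; z: float
    pitch: float; yaw: float; roll: float

@dataclass
class KinematicFrame:
    """One tracking sample: head plus both hand controllers."""
    t: float  # seconds since session start
    head: Pose
    left: Pose
    right: Pose

def reach_distance(frame):
    """Euclidean distance from head to right hand: a simple derived
    observable a tutor could monitor during a manipulation task."""
    dx = frame.right.x - frame.head.x
    dy = frame.right.y - frame.head.y
    dz = frame.right.z - frame.head.z
    return math.sqrt(dx * dx + dy * dy + dz * dz)
```

A production system would sample such frames at the headset’s tracking rate and compute many features of this kind as input for the student model.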

2.3 Modifiability

While the output of a traditional ITS user interface is a two-dimensional page or screen, the output of an IVRTE is generated by devices that simulate the learner’s sensory experience. The main modality is vision through a head-mounted binocular display, supported by spatially simulated audio sources and haptic stimulators in the hand controllers. With sufficient presence (Slater and Wilbur 1997), the world sensed by the learner – an imagined sociotechnical space – becomes fundamentally different compared to real-life experience. As this space is generated by a computer program, it exhibits inherent modifiability.

By modifying the learner’s simulated experience, the learner can potentially be assisted in reaching their dynamic zone of proximal development (Vygotsky 1978). Toward that end, tasks and scenarios can be presented with variation, refining their features until sufficient skills are demonstrated. These manifestations of modifiability explain VR’s popularity for traditional simulator training targeting special learner groups, such as pilots, astronauts, soldiers, and athletes. A less obvious manifestation of modifiability is the capability to modify the experience in subtle ways to support, or scaffold, the learner’s cognitive processes during learning tasks.
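One simple way to operationalize such variation is a staircase rule that widens scenario variation while the learner succeeds and narrows it when they struggle. The function below is a deliberately simplified sketch; the target success rate and step size are hypothetical tuning parameters, not values from the literature.

```python
def adjust_variation(variation, success_rate,
                     target=0.7, step=0.1, lo=0.0, hi=1.0):
    """Staircase adaptation of scenario variation (0..1): widen the
    variation when the learner's recent success rate exceeds the
    target, narrow it when they fall below the target."""
    if success_rate > target:
        variation = min(hi, variation + step)
    elif success_rate < target:
        variation = max(lo, variation - step)
    return variation
```

In practice the tutor would apply such a rule per skill component rather than globally, keeping each learner near their zone of proximal development.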

2.4 AI Tutor

Various IVRTEs have been implemented in the industrial training context, but very few report using an ITS (Laine et al. 2022). Typical solutions that assume a self-study setting (e.g., Hirt et al. 2019) guide the learner using authored hard-coded logic or branched programming (Pavlik et al. 2013), with no learner-specific adaptation. Some systems repurpose an ITS originally designed for traditional user interfaces (e.g., Ashenafi et al. 2020), limiting its pedagogical capabilities.

Examples of ITSs specifically designed for controlling procedural training in immersive VR do exist, for example, STEVE (Rickel and Johnson 1998) and PEGASE (Buche et al. 2010). However, while these systems achieve impressive functionality in adapting to learner actions, they do that by producing actions based on rules that trigger on changes in world simulation state.

With our notion of an AI tutor, we aim for more meaningful learning than what is possible with such triggers. We look for a framework that would assume a model of learner cognition based on emerging theories of grounded cognition. In such a framework, tutoring logic could modify the learner’s experience on a fine-grained level based on its observations of the learner’s cognitive state.

3 Grounded Cognition

Learning that takes place in virtual reality is immersive in nature (Dede 2009); this means learning by diving into a simulated environment that provides a strong sense of presence together with affordances of acting and functioning in the artificial environment. The “imagined” property of VR allows us to simulate any immersive physical experience. Such immersion appears to also require expanded ways of understanding the cognitive processes involved in learning. Toward that end, the 4E approach to cognition appears to provide important resources (Newen et al. 2018). This framework assumes that cognition does not take place only in the human head but is distributed (Clark 2003; Pea 1993), i.e., embodied, embedded, enacted, or extended across external tools, processes, structures, and environments. The term 4E cognition, attributed to Mark Rowlands (2010), stands for “embodied, embedded, enacted, and extended (4e) cognition.” The 4E approach involves a collection of interrelated but also conflicting viewpoints, which highlight the materially and socially distributed aspects of cognition (Pea 1993).

Embodied cognition

Investigations of hard skill training highlight the importance of embodied cognition. Skills and their training are inherently dependent on the human body and the tools manipulated, and learned skills become “carved” or “sculpted” into the body (e.g., bicycling and boxing). Embodied cognition succeeded the computational theory of mind (Fodor 1981), which had replaced behaviorism in the 1950s. Embodied cognition is anti-dualistic in nature; it claims that psychological processes (“software”) cannot be investigated without the “hardware” that the human body provides. Varela et al.’s (1991, revised 2016) book is commonly seen as the starting point of the “embodied cognition movement.” The pioneering research of Pea (1993) and Hutchins (1995) established the distributed cognition approach, which has long roots in sociocultural psychology (Rogoff and Lave 1984; Vygotsky 1978) and philosophy (Clark 2003; Clark and Chalmers 1998). The embodied approach builds on the phenomenological tradition of philosophy, such as Merleau-Ponty (1945), according to which cognition is grounded in “lived experiences.” Moreover, many cognitive scientists have rejected the computational theory of cognition, according to which the human mind processes abstract (“amodal”) symbols independent of the modalities of perception, action, and self-reflection. Knowledge is instead grounded in sensorimotor routines and experiences (Barsalou 1999, 2008, 2020; Lakoff and Johnson 1999) that form the basis for language and “wording.” Accumulating behavioral and neural evidence across research on perception, memory, knowledge, language, thought, social cognition, and human development supports this view. Lakoff stated, in his foreword to Bergen (2012), “the ball game is over; the mind is embodied.”

Embodied learning

The role of active bodily engagement in learning has been highlighted (Stolz 2015; Shapiro and Stolz 2019). It is argued that the practice of teaching [declarative] knowledge before it can be applied (formalisms first) is rooted in the dualistic view of knowledge; in this view intellectual work is associated with the “mind” and practical work with the “body.” Separating knowledge from activity and application easily leads to inert knowledge that cannot be applied in context. Shapiro and Stolz (2019, p. 27) anchor embodied learning in an assumption summarized from the Maturana and Varela (1998) account of embodied cognition: “learning is contingent upon the cognitive activity that is triggered by the environment and is determined by the dynamic nature of living beings engaged in the self-organizing activities by which they sustain themselves.” Learning conceptual knowledge should be integrated with firsthand (direct experience) and secondhand (description of experience) experiences and with both physical and imagined manipulation. Anchoring learning in physical manipulation is critical because it assists the experiential grounding of the abstract symbols that are used to build embodied mental models (Glenberg 2008). The other three Es (embedded, enacted, extended) more or less “break out” some aspects of the original “embodied” thinking into separate areas (see e.g., Newen et al. 2018).

Embedded cognition

Embedded cognition may be seen as the aspect of embodied learning that describes how the environment is partially involved in cognitive processing. For instance, when an outfielder in baseball catches a fly ball, it may appear that they are dependent on sophisticated cognitive operations, when in fact they are exploiting features of the environment in a way that reduces cognitive load (Shapiro and Stolz 2019). Human activity in general takes place in deliberately designed and built cultural environments (e.g., schools, learning labs) fostering learning and development. Embedded cognition can be harnessed by creating artificial worlds open to exploration, designing complex open-ended challenges and tasks that can be worked with virtual physical and semiotic tools, and by manipulating the environment so that desired aspects become opaque or transparent, depending on the purpose. Through deliberate and iterative design efforts, it is possible to create structures, functions, and processes that support training activity, adapt to learners’ developing competences, and foster building and stretching the skill being developed.

Enacted cognition

This perspective emphasizes real-time dynamic interaction between a human and the environment as a crucial aspect of cognition. The world is experienced through exploratory sensorimotor interaction with the environment. Learning is not a property of mind or located in a person but is enacted through dynamic interaction between learners and environments. Enaction refers to a dynamic process in which a learner adaptively couples their actions to the requirements of unfolding situations. One aspect of enaction is gesturing. Gestures used in conversations (even in telephone calls) may, for instance, be considered a form of communication (Shapiro and Stolz 2019). Also, certain gestures may signify readiness to learn (Shapiro and Stolz 2019, p. 28). “A living organism enacts the world it lives in; its effective, embodied action in the world actually constitutes its perception and thereby grounds its cognition” (Stewart et al. 2010). From the enactive perspective, learning is not the passive reception of information but involves active and deliberate exploration of the environment, entailing motivation, planning activity, and observing and transforming the environment, as emphasized by Bruner (1966). Interacting with one’s cultural environment structures experiences according to patterns of sociocultural practices (see e.g., Nasir et al. 2020).

Extended cognition

The extended mind thesis assumes that rather than being encapsulated within the brain or the body, cognitive processes extend into the physical world (Clark 2003; Clark and Chalmers 1998). Learners can off-load their cognitive work onto the environment (Donald 1991; Wilson 2002), for example, using paper and pencil as an external memory field to support calculation. The human and the environment of their activity gradually develop to support one another and constitute a coupled cognitive system. Insofar as the IVRTE structures support the learner’s activity, such as by reminding them of the purpose of the tasks, they do not have to invest as much effort in the cognitive task of remembering. The environment can also represent the tools and objects needed for subsequent tasks, as in allowing the learner to pick up the parts and tools they intend to use next. Here the learner is engaged in a developmental process of appropriating and internalizing the tools used in the activity to the extent that the tools become a part of their minds (Galperin 1992) and invisible in their hands (they are aware of the object of activity rather than of the tool, which is seamlessly integrated with their activity).

The above examination, summarized in Fig. 2, indicates that learning in general and hard skills learning in particular is an embodied, embedded, enactive, and extended process. While embedded in an IVRTE, the learner does not employ an isolated set of processes. Instead, cognition emerges from interactions of processes in the domains of the modalities, the body, the physical environment, and the social environment with processes traditionally associated with solo cognition, such as knowledge, attention, memory, thought, and language (Barsalou 2020). Barsalou (2020, p. 2) summarizes this interplay of processes as grounded cognition:

From the 4E perspective, cognition, affect, and behavior emerge from the body being embedded in environments that extend cognition, as agents enact situated action reflecting their current cognitive and affective states.

Fig. 2

Summary of 4E cognition for a learner immersed in a task in an IVRTE. The learner’s cognition is embodied through their active bodily engagement with the IVRTE. Breaking out aspects of the original embodied thinking, the learner’s cognition is embedded in the virtual world generated by the IVRTE, enacted by their dynamic interaction with the virtual world, and extended to objects in the virtual world

It follows that the research and development of digital tools and environments does not represent the creation of neutral and external instruments but may instead radically remediate a learner’s cognitive processes; the same concerns also apply to the creation of IVRTEs that reshape embodied, embedded, enactive, and extended processes and provide resources for training. Integration of external tools with the human activity is, however, a developmental process of its own, called instrumental genesis (Rabardel and Bourmaud 2003; Ritella and Hakkarainen 2012). Only after the tools have been seamlessly merged and fused with the human activity system are they likely to enhance various aspects of 4E cognition. Organizational researchers use the concept of sociomateriality (Orlikowski and Scott 2008) to examine how epistemic, social, and material processes of using technologies are intertwined. Such entanglement of technology and human activity also concerns immersive virtual technology.

4 VR-Native AI Tutor Framework

4.1 Situated Conceptualizations

To develop a conceptual framework for an AI tutor that could natively utilize observability and modifiability in an IVRTE, we assume the 4E cognition perspective that the learner’s cognitive state emerges from interactions between cognitive activity domains in terms of grounded cognition (Barsalou 2020), see Fig. 3. In this perspective, the physical and social environment domains of the learner’s cognition form conceptualizations of the virtual world they are embedded in. Whether these are represented as amodal symbols or through some other knowledge representation is an open area of research. However, considerable evidence shows that sensory-motor modalities become active as people process conceptual and semantic information, a phenomenon known as multimodal simulation.

Fig. 3

Domains of grounded cognition (adapted from Barsalou 2020), mapped for a learner to the IVRTE functions that attempt to simulate and sense them. The conceptualizations in the physical and social environment domains arise through grounded simulators (Barsalou 1999). The learner’s external perception is partially replaced by simulated sensory percepts generated by a physical environment simulation and delivered through sensory simulator devices, with input from sensors that quantify the learner’s body kinematics. This part of the IVRTE constitutes a minimal IVE. The social environment experienced by the learner is formed through physical environment percepts generated by a social environment simulation. A simulated tutor adapts the physical and social environments for the learner, based on a simulation of the learner’s cognition informed by sensing of the physical and social environments and by additional sensors. Dashed arrows indicate inputs

Barsalou’s (2020) examination of the accumulation of memories that underlie skill acquisition inspires the following example of how a learner in an IVRTE could form a multimodal simulator for the concept of an electric screwdriver. When a learner encounters a task requiring a tool, their cognitive processes in the different modalities that would normally process the tool’s features become active. These can include how the tool looks (vision) and what it feels like (tactile). Importantly, these activations are not limited to static ontological representations of the tool concept but span the multiple domains of cognition that participate in cognitive processing while a person is working with the tool. Barsalou offers the “situated action cycle” as one account of the involvement of different cognitive domains in the sequence of processing phases from observing the environment to taking action and ultimately reaching an outcome (reward, punishment, prediction error). According to this account, situated conceptualizations are formed in memory during the processing cycle and are recallable when the cycle runs again in a similar manner (Barsalou 2020).

4.2 Physical Environment Simulation

To outline a systemic view of the interaction between the learner and the proposed components of an IVRTE featuring an AI tutor (see Fig. 3), we first recognize that the learner’s cognition must necessarily interface with the external world through the learner’s body, which provides for action and mediates the external modalities. The body interacts physically with sensory simulators provided by the IVRTE, primarily the binocular vision simulators (displays), that activate external perception. The sensory simulation is generated in software controlled by sensors that sense the learner’s body kinematics, creating an illusion of the virtual-physical space. This part of the IVRTE, providing a physical environment simulation, essentially describes any VR-based immersive virtual environment (IVE).

4.3 Specifiers

Physical environment simulation elicits external perception activations that, through interacting cognitive processes, form the learner’s perceived environment. However, the same IVRTE simulation may not result in the same concept in the learner unless it also incorporates features that sufficiently activate all cognitive domains that contribute properties of the concept. To invoke or form a situated conceptualization, the physical environment simulator in the IVRTE should thus be instructed to add physical phenomena with features that would be expected to activate the multimodal sensory experiences.

For the social environment, such additions are provided as part of the social environment simulation. To elicit recall of a social situation, it may not suffice to show the appropriate visual representations we normally associate with the situation (such as avatars for the participants). In addition, the social simulation may need to instruct the simulation of physical representations such as additional objects, sounds, or interactions that for an outside observer would seem to be extraneous but which, when perceived by the learner, would be essential for invoking the correct situated conceptualization.

We call these extra pieces of simulation added for the purpose of forming the desired cognitive state “specifiers,” as without their presence the situated conceptualizations formed in the learner in response to the physical simulations may remain unspecific, differing considerably from what was intended for the purpose of supporting learning. Specifiers need not be purely visual; for example, additions to the simulation that elicit gesturing action may offer a way to guide the learner’s cognition toward the intended situated conceptualization (Goldin-Meadow 2011).

4.4 Learner and Tutor Simulations

The responsibility for adding the correct specifiers should lie with a function that models the learner’s grounded cognition state. The learner cognition simulator provides this model, utilizing cues from the current physical and social environment simulation states as well as from non-kinematic sensing inputs.

The remaining function, which we call a tutor simulation, is analogous to the tutoring module in a traditional ITS. Based on the current state of the simulations, the tutor simulation instructs the creation of appropriate specifiers needed to invoke the correct conceptualizations. Extending the terminology of traditional ITSs, we denote the tutor simulation as operating in the innermost loop, compared to the inner (guiding through task steps) and outer (selecting tasks) loops of the traditional ITS. The target of this additional loop is to select from a repertoire of specifiers the ones that are most likely to elicit the intended situated conceptualizations in the current learner, allowing the tutor simulation to build or modify situated conceptualizations that may be necessary and/or sufficient for skill acquisition.
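The innermost loop’s selection step can be sketched as follows. The estimator `elicitation_prob` is a placeholder for whatever learned or hand-authored model scores how likely a specifier is to elicit the intended situated conceptualization in the current learner; both the function and its signature are our own illustrative assumptions.

```python
def select_specifier(specifiers, intended_concept, learner_state,
                     elicitation_prob):
    """Innermost loop: from a repertoire of specifiers, pick the one
    estimated most likely to elicit the intended situated
    conceptualization, given the current learner-model state.
    `elicitation_prob` returns a score in [0, 1]."""
    return max(specifiers,
               key=lambda s: elicitation_prob(s, intended_concept,
                                              learner_state))
```

The hard research problem lies entirely inside the estimator; the loop itself only ranks candidates against the learner cognition simulator’s current state.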

ITSs providing a conversational interface that mimics the conversation between a learner and a human tutor have achieved significant improvements in learning effectiveness. The most well-known of such efforts is AutoTutor (Graesser et al. 1999). In an IVRTE, such an interface could be implemented as part of the social environment simulation. Learner utterances recognized by the simulation could be used as inputs for the learner cognitive model. Correspondingly, when selecting specifiers that would elicit the desired situated conceptualization, the tutor could instruct the social environment simulation to produce the appropriate conversational utterances. Here it should be noted that while typing is impractical with current VR technology, paralinguistic cues in learner speech, such as hedges, pauses, and disfluencies, may allow the tutor to infer more information about learner cognition (Pon-Barry et al. 2004). This approach could work especially well in our domain (industrial settings), where learners may not be comfortable having a conversation with a computer. Any conversational approach should consider the cultural traditions of the learning domain (compare Pea 2004).
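A minimal sketch of mining one such cue, hedging, from recognized utterances is given below. The hedge list is illustrative only; a real system would draw on validated lexicons and prosodic features (pauses, disfluencies) rather than plain substring counts.

```python
def hedge_count(utterance,
                hedges=("maybe", "perhaps", "i think",
                        "i guess", "sort of")):
    """Count hedge phrases in a recognized learner utterance: a crude
    proxy for learner uncertainty. The hedge list is a hypothetical
    example, not a validated lexicon."""
    text = utterance.lower()
    return sum(text.count(h) for h in hedges)
```

A score like this could feed the learner cognition simulator as one of its non-kinematic inputs.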

4.5 Implementing the Framework

The functional arrangement described above could form the basis for the implementation of VR-native pedagogical agent software or AI tutor. Deep learning-based AI methods promise powerful ways to implement key parts of the framework. The extensive sensor data already used to inform the physical environment simulation can be utilized to train machine learning models that may be able to recognize specific learner cognitive states. Additional sensors such as eye and face trackers as well as bio-signal sensors may improve the models.

Experimental results suggesting the feasibility of inferring learner cognitive state from sensor data are already available. User body tracking data has been used to identify individual users (Miller et al. 2020; Moore et al. 2021). Pfeuffer et al. (2019) identified characteristic behavior of users in VR by monitoring their head, hand, and eye motion data. Holzwarth et al. (2021) correlated head yaw in VR with the user’s affective state. Won et al. (2014) were able to automatically distinguish between low- and high-success learning interactions by monitoring body movement. Marín-Morales et al. (2018) used electroencephalography (EEG) and electrocardiography (ECG) sensors to distinguish between emotional states of users embedded in a virtual environment. Hussain et al. (2011) used machine learning methods to detect learners’ affective states from multichannel physiological data, including heart rate, respiration, facial muscle activity, and skin conductance. In social psychology, VR-based behavioral tracing has been operationalized for quantifying social approach and avoidance, evaluation of a social other, and engagement and attention (Yaremych and Persky 2019).
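The general pipeline shared by these studies, windowing raw sensor streams into features and classifying the features against labeled states, can be sketched with stdlib Python. The feature choice (head-yaw velocity statistics) and the nearest-centroid classifier below are deliberate simplifications standing in for the trained models used in the cited work.

```python
import statistics

def window_features(yaw_samples):
    """Summarize a window of head-yaw samples (degrees) into two
    features: mean and population standard deviation of yaw velocity."""
    vel = [b - a for a, b in zip(yaw_samples, yaw_samples[1:])]
    return (statistics.mean(vel), statistics.pstdev(vel))

def nearest_centroid(feature, centroids):
    """Classify a feature vector by its nearest labeled centroid:
    a minimal stand-in for a trained machine learning model."""
    def dist(label):
        return sum((f - c) ** 2 for f, c in zip(feature, centroids[label]))
    return min(centroids, key=dist)
```

In a deep learning implementation, the hand-crafted features would be replaced by representations learned directly from the tracking streams.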

5 Toward VR-Native Pedagogy

In this section we provide a preliminary outline of principles based on the proposed framework that can guide the pedagogical design of an IVRTE and its AI tutor functionality.

5.1 Simulation Environment

The physical environment simulation in an IVRTE for procedural learning is built to simulate the mechanisms and causal relationships involved in the procedure. The extent to which a simulator attempts to imitate the real world is determined by task analysis. Time and cost concerns often necessitate the prioritization of simulating the parts of the environment the learner is most likely to interact with. However, the learner should be able to freely manipulate the simulation toward the desired end state of the procedure, possibly taking pathways that prevent further progress or cause known problems.

The highest achievable fidelity (in terms of both visual and task fidelity) may not always be desired. While a higher-fidelity simulation adds to learner presence (Dalgarno and Lee 2010), it may impact learning negatively from the grounded cognition point of view, as the learned situated conceptualizations may not transfer to semantically similar but different situations exhibiting altered details. When designing specifiers that can be added to the physical world simulation, one consideration is the learner’s emotional and aesthetic engagement with the world (Stolz 2015). The VR environment should simulate professionally adequate ways of working with tools. An object becomes an instrument (and, therefore, an “invisible” tool in the hand) only through learning and internalizing the IVRTE system (instrumental genesis); when disturbances or breakdowns occur, the instrument again becomes an object of deliberate inspection (Engeström 1987).

If a learner can achieve the desired performance just by interacting with the IVRTE simulation and the simulation has been implemented to account for the failure modes identified during task analysis, the learner has effectively demonstrated their possession of the targeted knowledge and skills. Such a simulation with no tutoring actions can still make use of the inherent observability of the VR environment by producing a detailed analysis of the learner performance, as well as suggestions for improvement where the learner has exhibited weaker results.

5.2 Task Sequencing

Grounded cognition principles can already be considered when the tutor selects the next tasks for the learner from those created by the instructional designer (the ITS outer loop). Existing instructional design guidelines prescribe a theory-first approach (Fowler 2015). However, this approach may not let the learner benefit from the IVRTE's ability to ground theoretical concepts in the learner's situated conceptualizations. Better results may be achieved by transforming theory topics into experiences in which the learner engages in goal-directed but open-ended operational procedures anchored in the relevant cognitive domains. Grounded cognition emphasizes affording learners opportunities to be active in a congruent way, i.e., allowing and encouraging movements and interactions that resemble the actual operational procedures and mechanisms (Johnson-Glenberg 2018). Whenever the learner thinks about something (tries to build a mental model or solve a problem), their cognitive process is shaped by the virtual environment they are located in and the affordances they interact with (Newen et al. 2018).
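An experience-first outer loop of this kind can be sketched very simply: prefer the task whose prerequisite experiences the learner has already grounded, rather than following a fixed theory-first order. The function below is a hypothetical illustration; the task/experience tags and the greedy selection rule are assumptions, not a prescription from the literature.

```python
# Illustrative outer-loop selector. Each task lists the experience tags it
# presupposes; the tutor prefers tasks whose prerequisites are already
# grounded in the learner's prior activity (experience-first sequencing).
def select_next_task(tasks, experienced, completed):
    """tasks: dict task_id -> set of prerequisite experience tags.
    experienced: set of tags the learner has already enacted.
    completed: set of task_ids already finished.
    Returns the uncompleted task with the highest fraction of
    grounded prerequisites, or None if all tasks are completed."""
    best, best_score = None, -1.0
    for tid, prereqs in tasks.items():
        if tid in completed:
            continue
        grounded = len(prereqs & experienced) / len(prereqs) if prereqs else 1.0
        if grounded > best_score:
            best, best_score = tid, grounded
    return best
```

Under this rule, a theory topic (many experiential prerequisites) is deferred until the learner has enacted the procedures it abstracts over.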

5.3 Scaffolding

Under normal circumstances, a learner is not expected to succeed in the IVRTE simulation without external help. Thus, the key function of the inner loop becomes selecting appropriate scaffolding actions for the learner. Scaffolding provides structure and guides the learner's activity without necessarily prescribing only one or a few "correct" lines of activity. Accordingly, there are likely to be several pathways to the desired learning outcome (completing the currently targeted step). It is also critical to engage the learners themselves in agentic efforts of analyzing situations, selecting promising lines of activity, and assessing the advancement of their efforts. The effectiveness of any proposed scaffolding action needs to be assessed for its impact on learner performance through design-based or experimental research.

As pointed out by Pea (2004), scaffolding is a complex theoretical concept concerning the relations between people, tools, and environments (Engeström 1987) rather than one anchored in analyses of disconnected cognitive tasks. If we subscribe to the grounded cognition perspective and model the learner through a simulation of the learner's cognitive state inferred from observational data (the learner model), the scaffolding activation function within the tutor simulation should map the cognitive state and the desired domain model state to scaffolding actions. To implement such a function, the domain model must be expressed using concepts compatible with the learner model. Thus, the inner loop could consider what the learner has already experienced and what they should experience next – following, for instance, the theory of comprehensive learning by Jarvis (2012). Note that this does not assume one normatively correct performance, because there can be multiple pathways to the targeted objective.
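The shape of such an activation function can be sketched as a mapping from an inferred cognitive state and the targeted domain step to a scaffolding action. The state keys, thresholds, and action names below are hypothetical placeholders; in an actual system, the state would come from a learned inference model rather than hand-set rules.

```python
# Sketch of a scaffolding activation function: maps (inferred cognitive
# state, targeted domain step) -> scaffolding action, or None when the
# learner is progressing unaided. All keys and thresholds are assumptions.
def activate_scaffold(cognitive_state, target_step):
    # High inferred confusion: ground the concept by augmenting the world
    # with text and graphics overlaid on the relevant objects.
    if cognitive_state.get("confusion", 0.0) > 0.7:
        return ("augment_world", target_step)
    # Prolonged visual search: emphasize the relevant object/affordance.
    if cognitive_state.get("search_time_s", 0.0) > 20.0:
        return ("emphasize_object", target_step)
    # No scaffolding needed; let the learner continue unassisted.
    return None
```

Returning `None` by default reflects the principle above that the tutor should not prescribe a single "correct" line of activity, intervening only when the learner model indicates a need.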

In an industrial setting, work instructions and other task and performance supports provide distributed cognitive resources (Pea 1993); such resources include manuals, labels, checklists, and affordances of tools that prevent their incorrect use. A key function of such resources is quality control, but they may simultaneously be used to scaffold learning. Sometimes scaffolds are part of the procedural instruction, but professionals tend to adapt and "devise their own aides" by arranging their tools, materials, and workspaces.

As the learner demonstrates through the learner model that a specific scaffold is no longer needed, the scaffold is faded, and the learner is expected to continue achieving the same performance without it. Below, we list possible VR scaffolds that could be tested:

  • Abstracting the world (making physics, mechanisms, structures, affordances simpler)

  • Automating things (mechanisms, other person’s actions, processes etc.)

  • Providing the learner new abilities (X-ray vision)

  • Augmenting the world with text and graphics overlaid on objects

  • Extending the learner’s body (remote manipulation of objects)

  • Allowing the learner to move large distances effortlessly (teleport)

  • Allowing the learner to manipulate large or heavy objects

  • Selectively silencing or emphasizing sounds

  • Emphasizing or hiding objects and affordances
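The fading behavior described above can be sketched with a simple rule: a scaffold is withdrawn once the learner model registers a run of successful performances while it is active. The class, its fields, and the streak threshold are assumptions for illustration; a real tutor would drive fading from the full learner model rather than a success counter.

```python
# Illustrative fading rule: withdraw a scaffold after a streak of
# successful step completions. The threshold is an assumed parameter.
class Scaffold:
    def __init__(self, name, fade_after=3):
        self.name = name
        self.fade_after = fade_after  # consecutive successes before fading
        self.streak = 0
        self.active = True

    def record_attempt(self, succeeded):
        """Update the success streak; fade the scaffold once the learner
        has shown stable performance, resetting the streak on failure."""
        self.streak = self.streak + 1 if succeeded else 0
        if self.streak >= self.fade_after:
            self.active = False  # faded: learner continues without it
```

Resetting the streak on failure makes fading conservative, so a scaffold reappears in the learner's support set only after the tutor reactivates it.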

Pea (2004) asks a good question: if a scaffold improves learning, why should it be faded? Why can scaffolds not simply become an accepted part of performance support and of the distributed system of intelligence? The inherent modifiability of VR provides a straightforward answer to this question, and a key principle for designing scaffolds for an IVRTE: to improve learning beyond what non-fading performance support can achieve, an IVRTE should primarily provide "impossible" scaffolds – actions and events that could not be implemented in the real world. These scaffolds must fade precisely because they cannot be carried over into the real world as ongoing performance support. What are such impossible scaffolds? The exciting opportunity of VR technology is that, within the bounds of achievable presence, anything can be implemented and tested.

6 Discussion

In an article titled "Where's the pedagogy?," Fowler (2015) calls for working out the missing pedagogical principles in VR-based self-study training solutions. The suggested solution is to add pedagogy through a step-by-step design process. Similarly, although recognizing the unique modifiability afforded by VR, Johnson-Glenberg (2018) focuses on guidelines for designing better VR learning experiences. In general, there may be a tendency to treat VR technology as just another medium for the "pedagogically well-designed interaction" tradition of web-based self-study and, going further back, of the proper organization of textbooks. However, this approach may fail to produce the learning results expected from the increasingly complex simulations of real-world tasks and the associated hard skills training afforded by IVRTEs. In these contexts, we should also ask "where is the teacher?" and focus more on automatic systems that can support the learner in a personalized manner through real-time scaffolding decisions.

Determining the learning benefits of a VR-native ITS that utilizes the observability and modifiability of the VR environment requires design-based and experimental work on automatically inferring the learner's cognitive state and situational scaffolding needs from real-time sensor data. Further research and development work is also needed to assess the learning benefits of any proposed automatic scaffolding interventions based on the general principles presented. Our work resides at the intersection of IT, psychology, and learning science. Each field approaches experiments on VR technology from its own tradition, which complicates the interpretation and application of existing experimental results.

The conceptual work is not without challenges either. Despite substantial evidence for grounded principles of human cognition, an understanding of how these principles actually work – for example, how knowledge is represented under these premises – remains as elusive as ever. On the other hand, a full account of the mechanisms underlying grounded cognition may not be necessary for practical applications of the concept, as demonstrated by the largely unobservable inner workings of highly useful deep neural network architectures.

To the extent that the presented AI tutor framework proves implementable and its theoretical underpinnings have merit, one must also raise the question of the ethical use of such technology. Should a VR-native tutor implementation prove capable of modifying situated conceptualizations for skill acquisition, it may be able to modify such conceptualizations for any other purpose. Those purposes may be highly beneficial (correcting adverse habitual learning) but also questionable (making learning overly dependent on tutoring software).

Further, we should examine how IVRTE-mediated hard skills learning complements conventional training with human educators. A critical concern is the extent to which VR training transfers to working with conventional tools and instruments, and how VR and conventional training support one another. We expect VR training to assist learners in developing the orienting basis (Galperin 1992) for training and further refining their vocational skills. Our work focuses on a specific domain (procedural training in industrial settings), which may not present many of the challenges typical of other educational settings (e.g., social interaction, developmental psychology, abstract concept formation). However, the core insights of the work may be applicable to other educational settings.