1 Introduction

Immersion, the subjective state of intense involvement in an experience [7], is widely acknowledged as a vital ingredient of many human-crafted experiences. Across books [19, 28], films [32, 34], theatre [27] and games [2, 4, 17, 18, 25], immersion is recognised as key to entertaining and engaging experiences.

In such artificial constructs the level of immersion is often determined by the quality of the work and the skill of its creation and delivery. These examples, however, rarely take advantage of additional sensory modalities to facilitate effective immersion, being limited by the form of their chosen medium. Even where technology is leveraged to break the bounds of the medium, increase immersion and enrich the experience, as with the growing and controversial domain of 3D films, the experience still relies on a single modality, in this case vision [35]. Virtual experiences, however, need not be shackled by such modal limitations.

In digital experiences such as those created through virtual worlds - the domain examined in this paper - there are many varying definitions of the concepts of immersion and engagement. Some, for example, focus on the engagement of learners [25]. The authors of that work make the important distinction that, when trying to achieve immersion, it is necessary to consider not only the level of engagement attained through interactions with, and feedback from, the virtual environment, but also the engagement promoted by the activities and concepts the user is involved in.

Mount’s interpretation [25] is applicable from the point of view of our research objective: to design and implement a digital game experience that achieves a high level of immersion and engagement, and to investigate the hypothesis that unimodal and multimodal experiences create different levels of immersive response. Beyond effective activity engagement, this research aims to leverage the power of multimodal interaction in achieving high levels of immersion, lowering the barriers to effective environmental engagement through seamless, natural input interfaces and refined multimodal output.

2 Background and Related Work

In digital experiences, engagement is a vital ingredient of immersion and there are, broadly speaking, two forms of engagement: mental and physical [10, 16].

Mental engagement is highly subjective and requires the experience to feature a context that the participant can relate to. The scenario may be one that is easily generalizable and acceptable to a variety of users. Such scenarios tend to be abstract, providing little detail and relying on the immediacy of their nature, such as a base psychological need, in order to function; fear and reflex responses, such as dodging oncoming objects, are examples. Alternatively, a mentally engaging scenario can be much more specialised or themed, allowing an interested user to relate to it directly. Such cases require substantial investment, planning and design, and cater to specific user populations with particular thematic tastes.

Physical engagement, on the other hand, is primarily a matter of utilising interfaces that closely emulate the physical actions of the virtual scenario. Such engagement can directly promote physical immersion [30]. Examples include controllers that are physically identical to the object manipulated in the virtual world, or motion controllers and head mounted displays that achieve suspension of disbelief with regard to physical presence and overcome the ‘uncanny valley’ [14, 15, 24]. The uncanny valley is a concept primarily associated with robotics and comes into play when the anthropomorphism of a robot is such that it is very similar to a human but not fully accurate; in such cases, humans experience instinctive feelings of revulsion. Similar effects can be experienced with virtual avatars, and conflicts can arise when high-fidelity manipulable physical controllers do not exactly match their virtual counterparts.

Turning to human-human interaction, all of the primary senses are utilised. These feedback mechanisms are used by the human body to gather information about its condition and surroundings. More importantly, human interaction and communication is a multimodal affair. Everyday conversations go beyond verbal and auditory speech cues; they include non-verbal elements such as eye contact, touch, gestures, body language and facial expressions. This both facilitates communication and delivers the visceral experience that humans are accustomed to.

Human-computer interaction, however, remains relatively limited in comparison to the richness of human interaction. There are numerous and substantial forms of multimodal input [21], including touch, voice, gaze, motion, gestures and brain-machine interfaces. Output, in contrast, is still mostly limited to screens and speakers, with significant weight on the dominant visual modality [12, 31].

Research shows that auditory feedback and cues are important to human multimodal experiences. Larsson et al. (2002) [20] tested unimodal (aural only) and multimodal (aural and visual) cues in navigation and memory tasks in virtual environments, finding that auditory information may greatly improve the experience and sometimes directly improve performance. Researchers have also investigated aural feedback in supporting orientation and navigation in virtual worlds, where sound is used either as a support for the visual mode or as a substitute when no other sensory information is available. Such aural cues are further categorised as either localisation or sonification: localisation is the inclusion of lifelike 3D sounds that aid navigation and situational awareness, while sonification is the use of sound to represent certain types of information. Finally, regarding the somatosensory modalities, commercially available haptic feedback is usually limited to vibration, which does not always correspond to the real-life feedback the user would expect; exceptions do exist, such as the well-engineered and supported examples of force feedback controllers [12, 32].

Thus, motivated by the concept of immersion in crafted works across all fields of creativity and inspired by the related research, this study investigates the role of the various modalities in achieving high levels of immersion and engagement.

In terms of requirements, the design and implementation had to provide a balanced set of modal cues in order to explore the value of sensory modalities in achieving effective immersion and presence in virtual worlds. The implementation of the digital experience was evaluated through small-scale pilot user tests to ensure that the amount of information delivered across the modalities was as similar as possible. The first two modalities investigated were visual and auditory.

Regarding the game design, the implementation had to satisfy the guidelines for effective and engaging gameplay [1, 29].

3 Defining Immersion in Games

Characterising and measuring user experiences in virtual worlds are inherently difficult tasks, and substantial research has been carried out to attain a level of understanding and to establish methodologies for evaluation. Concepts like immersion have been adopted as metrics by the digital experience design and creation research communities. Research such as that of the Eindhoven University Game Experience Research Lab has postulated [17] that immersion and flow are candidates for gameplay evaluation. The nature of immersion in the context of games was further investigated in search of a concrete and applicable definition [18] and with regard to the methods of its measurement [26].

In games, Brown and Cairns (2004) identify [4] three levels of immersion: ‘engagement’, ‘engrossment’ and ‘total immersion’. ‘Engagement’, the lowest level, requires a user to invest “time, effort and attention” in order to learn the controls of a virtual environment. Next, ‘engrossment’ requires users to become familiar with the environment and to be emotionally affected by its narratives and activities. Finally, by overcoming barriers such as empathy and atmosphere, ‘total immersion’ can be experienced, whereupon users feel disconnected from reality and from the passage of real-world time. However, as has been noted by others [25], the findings of Brown and Cairns are a useful classification of engagement-based immersion but do not offer insight into the components involved at each level of immersion.

Efforts to disambiguate immersion in games have led to research that attempts to differentiate types of immersion based on the different types of information processing of the human user rather than the degree of immersion. These include Vicarious, Action Visceral and Mental Visceral Immersion [7], Sensory-Motoric, Cognitive and Emotional Immersion [2], Diegetic and Non-Diegetic Immersion [23], Diegetic and Situated Immersion [33], Sensory, Challenge-Based and Imaginative Immersion [11], Mental and Physical Immersion [5, 31] and Perceptual and Psychological Immersion [22].

A recent, empirically grounded classification of types of immersion in games [7] depicts immersion as having three distinct types: Vicarious Immersion, Mental Visceral Immersion and Action Visceral Immersion. This classification emerged from an extensive qualitative content analysis of online discussions with gamers and was confirmed using an experimental survey-based analysis, with these types also emerging spontaneously during the factor analysis stage of the development of the IMX Questionnaire. The three types share the characteristic of a subjective experience of intense involvement in a game. Vicarious Immersion occurs when the player becomes intensely involved in a fictional world in which they may adopt the feelings, thoughts and mannerisms of a character, and forget themselves and the real world; in the most intense experiences of vicarious immersion, the character appears to take on a life of its own. Action Visceral Immersion occurs when the player feels a rush of adrenaline and is swept away by being caught up in the action of the game. Mental Visceral Immersion involves becoming deeply engaged in, and excited by, the strategising and tactics of a game. It is important to note that one of the keys to achieving immersion is the Suspension of Disbelief [8].

The different levels and types of engagement and immersion all contribute to the experience [6]. The following sections describe a design and implementation, motivated by these concerns, that draws on these theories and findings to create a digital entertainment experience achieving high levels of immersion.

4 Technical Implementation

4.1 Rationale and Design

Bearing in mind the research surrounding immersion in virtual environments, work began on designing a virtual digital entertainment experience that would allow the immersion of participants to be measured across separate modalities and combinations thereof. The aim was to design and implement a modular experimental setup featuring a mentally engaging scenario and a contemporary interface that would promote high levels of physical engagement. The system would provide output across multiple modalities that could be isolated and combined as needed and would deliver a similar amount of information to the participant.

Therefore, a study was designed featuring a simulated collision avoidance scenario as the basic gameplay premise. Specifically, participants would be introduced to a digital environment representing a motorway and would have to avoid the oncoming traffic. This task provided a relatively generalizable and easy-to-relate-to context, thereby allowing for mental engagement. In order to also achieve physical engagement, a natural input method was utilised: motion sensing technology. By directly mapping the mental objective of avoiding oncoming objects onto a natural reflexive reaction, in this case physical movement, the suspension of disbelief would be easier to achieve and the barriers to immersion lowered.

In terms of game mechanics, the scenario and premise are quite simple, with the context and interface being the elements that promote immersion. The oncoming vehicles must be avoided and their numbers increase as time passes. The ultimate goal of the player is to avoid the vehicles for as long as possible, an objective that becomes increasingly difficult as the vehicles increase in number. The expected survival time of each playthrough was designed to be no more than 90 s.
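As an illustration of this kind of time-based difficulty ramp, the minimal sketch below shows one way it could be expressed; the constants, function names and lane count are assumptions for illustration only and are not taken from the paper's implementation.

```python
import random

# Illustrative difficulty ramp: the gap between vehicle spawns shrinks as
# elapsed time grows, so avoiding the oncoming traffic becomes progressively
# harder. All constants are assumed values, not those used in the study.
INITIAL_SPAWN_INTERVAL = 3.0   # seconds between vehicles at the start
MIN_SPAWN_INTERVAL = 0.5       # hard floor so the game stays playable
RAMP_RATE = 0.03               # interval reduction per elapsed second


def spawn_interval(elapsed_seconds: float) -> float:
    """Return the current gap between vehicle spawns for a given elapsed time."""
    interval = INITIAL_SPAWN_INTERVAL - RAMP_RATE * elapsed_seconds
    return max(MIN_SPAWN_INTERVAL, interval)


def lane_for_next_vehicle(num_lanes: int = 3) -> int:
    """Pick a random lane for the next oncoming vehicle."""
    return random.randrange(num_lanes)


if __name__ == "__main__":
    for t in (0, 30, 60, 90):
        print(f"t={t:>2}s  spawn every {spawn_interval(t):.2f}s, "
              f"next lane {lane_for_next_vehicle()}")
```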

4.2 Implementation Details

Physical Setup. Based on the requirement for low physical engagement barriers and the subsequent decision to use motion sensing interfaces, the setup required a considerable amount of space for the participant to move unencumbered and for the interface devices to function properly. Additionally, a suitable display method had to be established for the visual feedback. Initially it was planned to use a projected display in order to maximise the viewable virtual area and minimise the possibility of display limitations breaking immersion. In the process of establishing the physical setup, however, it was determined that a large screen would serve the purpose adequately and with numerous practical advantages, such as reliability and relative portability.

Software Implementation. Following the physical setup, a high fidelity implementation was necessary. For this purpose, the Unity3D engine was used to create a representation of a highway road with appropriate visual elements and a physical make-up to enhance and facilitate the gameplay. For instance, the road representation allowed for mapping of the virtual side barriers to the physical objects limiting the movement of the participants.

The moving vehicles that act as the objects to be avoided are also visible, except where the simulated fog limits the participant's long-range visibility, thus balancing the level of information delivered by the visual and auditory modalities.

User Interaction. As mentioned above, it was decided to utilise motion sensing interfaces in order to achieve an engaging user interaction. For this purpose, a Microsoft Kinect device was set up to map the movements of the participants on a one-to-one basis onto their virtual avatar. Figure 1 illustrates this effect, with the onscreen avatar matching the pictured volunteer's body stance, position and gesture.

Fig. 1. Scenario test run showing the One-to-One Motion Mapping.
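A minimal sketch of such a one-to-one mapping is given below. The paper does not describe the implementation, so the sensor-reading function, joint names and coordinate conventions are hypothetical; the point is simply that tracked joint positions are copied to the avatar each frame without scaling or smoothing that would weaken the sense of direct control.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A joint position in sensor space, in metres relative to the sensor.
Vec3 = Tuple[float, float, float]


@dataclass
class Avatar:
    """Holds the virtual skeleton driven by the tracked participant."""
    joints: Dict[str, Vec3]


def read_tracked_joints() -> Dict[str, Vec3]:
    """Hypothetical stand-in for the Kinect skeletal stream.

    A real implementation would query the sensor SDK every frame; here a
    fixed pose is returned so the sketch is runnable.
    """
    return {"hip_centre": (0.1, 0.9, 2.0),
            "hand_left": (-0.4, 1.1, 1.9),
            "hand_right": (0.5, 1.2, 1.9)}


def update_avatar(avatar: Avatar, origin: Vec3 = (0.0, 0.0, 2.0)) -> None:
    """One-to-one mapping: each tracked joint is copied to the avatar,
    offset only by a calibration origin (no scaling, no smoothing)."""
    for name, (x, y, z) in read_tracked_joints().items():
        avatar.joints[name] = (x - origin[0], y - origin[1], z - origin[2])


if __name__ == "__main__":
    avatar = Avatar(joints={})
    update_avatar(avatar)
    print(avatar.joints)
```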

Modality Balance. The balance of the provided modalities was also considered. The purpose of this was to create a digital entertainment experience that explores and utilises the effects of separate and combined modalities in achieving effective immersion.

The two modalities utilised were Vision and Audio. Specifically, the primary objective was to enable the implementation to separate and combine the modalities, creating different gameplay situations and assisting in understanding, or even determining to some extent, the role of each modality in achieving immersion.

Therefore, the amount of information that each modality delivered needed to be as similar as possible. With the Visual modality being particularly dominant in collision avoidance tasks [12, 31], fog was utilised to limit the range of this modality and bring it into line with the Auditory modality. Additionally, significant effort went into implementing a robust audio system that would accurately represent the auditory signature of the oncoming vehicles, enabling participants to determine their location using sound. For this purpose, noise cancelling headphones were used to isolate the participants from the physical environment and to allow them the full benefit of the virtual soundscape.
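The balancing described above can be sketched in the abstract as follows. The ranges and the attenuation model are assumptions made for illustration, not values from the study; the idea is only that a vehicle becomes visible and clearly audible at roughly the same distance, with panning providing a crude localisation cue.

```python
# Assumed ranges chosen for illustration: the fog hides vehicles beyond
# FOG_RANGE_M metres, and the audio attenuation is tuned so a vehicle only
# becomes clearly audible at roughly the same distance.
FOG_RANGE_M = 40.0
REFERENCE_DISTANCE_M = 5.0   # distance at which a vehicle plays at full gain


def vehicle_visible(distance_m: float) -> bool:
    """Visual modality: vehicles beyond the fog range are effectively hidden."""
    return distance_m <= FOG_RANGE_M


def vehicle_gain(distance_m: float) -> float:
    """Auditory modality: simple inverse-distance attenuation, clamped to [0, 1]."""
    if distance_m <= REFERENCE_DISTANCE_M:
        return 1.0
    return REFERENCE_DISTANCE_M / distance_m


def stereo_pan(lateral_offset_m: float, max_offset_m: float = 10.0) -> float:
    """Crude localisation cue: pan in [-1, 1] from the vehicle's lateral offset."""
    return max(-1.0, min(1.0, lateral_offset_m / max_offset_m))


if __name__ == "__main__":
    for d in (10.0, 40.0, 60.0):
        print(f"{d:5.1f} m  visible={vehicle_visible(d)}  gain={vehicle_gain(d):.2f}")
```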

5 Study Methodology

5.1 Participants

A group of 19 participants played and evaluated the game in each of the modality combinations. Participants ranged in age from their late teens to mid-40s; 4 were female and 15 male. Basic background information relevant to the experiment was gathered, such as experience with technology and games. In the latter case especially, there was a balanced variety within the participant group, which included both self-described “gamers” and individuals with limited or no gaming experience.

5.2 Procedure

Each participant played the game a total of 12 times, with a set of one trial run and three timed runs for each modality combination: audio only, visual only, and audio and visual combined. Their stated task was to play the game by avoiding the oncoming vehicles for as long as possible. For the set of runs with both modalities enabled, the participant experienced the full breadth of the visual and auditory cues. For the visual-only set of runs, all sound was disabled but the participant wore noise cancelling headphones in order to maintain isolation from the physical environment. For the audio-only set of runs, participants were blindfolded, so their only sensory input was aural via the isolating headphones. The order in which participants experienced the modality combinations was counterbalanced to control for order effects.
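Counterbalancing three conditions across 19 participants can be done by cycling through the six possible presentation orders. The sketch below illustrates one such scheme; the paper only states that orders were counterbalanced, so this particular assignment is an assumption.

```python
from itertools import permutations

CONDITIONS = ("audio only", "visual only", "audio + visual")


def counterbalanced_orders(n_participants: int):
    """Cycle through all 3! = 6 condition orders so each is used near-equally."""
    orders = list(permutations(CONDITIONS))
    return [orders[i % len(orders)] for i in range(n_participants)]


if __name__ == "__main__":
    for pid, order in enumerate(counterbalanced_orders(19), start=1):
        print(f"P{pid:02d}: {' -> '.join(order)}")
```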

With regard to data collection, the IMX Questionnaire, a quantitative measure of immersive response [7], was completed by each participant at the end of each four-run set, in relation to the specific modality combination just experienced. This version of the IMX consisted of 16 items answered on 5-point Likert scales and measures three factors of game immersion: “General Immersion”, “Action Visceral Immersion” and “Mental Visceral Immersion”. Upon completion of all three sets, participants completed a Critical Incident Technique [13] style questionnaire, reported their preferred modality combination for game play, and took part in a short semi-structured interview exploring their reactions to the game. Additionally, survival times from each trial were recorded.
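Scoring a questionnaire of this kind typically reduces to averaging the Likert responses belonging to each factor, per participant and condition. The item-to-factor mapping below is purely illustrative; the actual IMX item assignments are defined in [7] and are not reproduced in this paper.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical item-to-factor mapping for a 16-item, 5-point Likert questionnaire.
# The real IMX assignments come from [7] and are not reproduced here.
FACTOR_ITEMS: Dict[str, List[int]] = {
    "General Immersion": [1, 2, 3, 4, 5, 6],
    "Action Visceral Immersion": [7, 8, 9, 10, 11],
    "Mental Visceral Immersion": [12, 13, 14, 15, 16],
}


def score_imx(responses: Dict[int, int]) -> Dict[str, float]:
    """Return the mean 1-5 score per factor for one participant/condition."""
    return {factor: mean(responses[i] for i in items)
            for factor, items in FACTOR_ITEMS.items()}


if __name__ == "__main__":
    example = {i: 4 for i in range(1, 17)}   # a participant answering 4 throughout
    print(score_imx(example))
```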

6 Results and Conclusions

Following the execution of the experiment, a number of findings, both qualitative and quantitative, were produced and are presented in the following sections. Additionally, a number of conclusions were drawn from these findings.

6.1 Quantitative Results

Immersion. The IMX questionnaire results were scored for each mode of interaction (audio, visual or both) and each game immersion scale (General Immersion, Action Visceral Immersion and Mental Visceral Immersion); see Fig. 2. A 3 × 3 repeated measures MANOVA was performed to explore the impact of the output modality on these variables. The MANOVA was found to be significant (F = 338.4, df = 16, p < 0.001). Univariate testing with a Greenhouse-Geisser correction for sphericity showed significant effects on General Immersion (F = 5.714, df = 1.943, p = 0.0076), Action Visceral Immersion (F = 7.156, df = 1.856, p < 0.0032) and Mental Visceral Immersion (F = 3.507, df = 1.806, p = 0.046).

Fig. 2. Mean score and 95% confidence intervals for IMX immersion scales.
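For readers wishing to reproduce this style of analysis, a univariate repeated-measures ANOVA for one immersion scale could be run as sketched below. This is not the authors' analysis script: the long-format column names and placeholder scores are assumptions, and statsmodels' AnovaRM does not apply the Greenhouse-Geisser correction reported above, so it only approximates the published procedure.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)

# Long-format data: one row per participant x modality condition, holding the
# score of a single IMX factor. Column names and values are placeholders.
df = pd.DataFrame({
    "participant": np.repeat(np.arange(1, 20), 3),
    "modality": np.tile(["audio", "visual", "both"], 19),
    "general_immersion": rng.uniform(1, 5, size=57).round(2),
})

# One-way repeated-measures ANOVA with modality as the within-subjects factor.
result = AnovaRM(data=df, depvar="general_immersion",
                 subject="participant", within=["modality"]).fit()
print(result)
```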

Post hoc within-subjects t-tests revealed a number of significant effects, as shown in Table 1.

Table 1. Post Hoc testing of modality combinations on aspects of immersion.

Performance and Preference. Game play time was examined in order to explore performance differences between the modality combinations, with longer play times indicating a more successful session, as illustrated in Fig. 3.

Fig. 3. Mean play time and 95% confidence intervals for both modalities, audio only and visual only.

A repeated measures ANOVA and post hoc repeated measures t-tests revealed the following (an illustrative computation sketch follows the list):

  • a significant multivariate effect (F = 121.724, df = 2, 17, p < 0.001)

  • a significant difference between ‘Both modalities’ and ‘Audio only’ (T = 15.01, df = 18, p < 0.001)

  • a significant difference between ‘Visual only’ and ‘Audio only’ (T = 11.781, df = 18, p < 0.001)

  • a non-significant difference between ‘Both modalities’ and ‘Visual only’ (T = 1.129, df = 18, p = 0.274)
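The paired comparisons listed above could be computed along the following lines; the per-participant play times below are placeholders with assumed means and spreads, not the study's recorded values.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)

# Placeholder per-participant mean play times (seconds) for each condition;
# real values would come from the logged trials of the 19 participants.
both_modalities = rng.normal(60, 10, size=19)
visual_only = rng.normal(58, 10, size=19)
audio_only = rng.normal(25, 8, size=19)

comparisons = [
    ("Both vs Audio only", both_modalities, audio_only),
    ("Visual only vs Audio only", visual_only, audio_only),
    ("Both vs Visual only", both_modalities, visual_only),
]

for label, a, b in comparisons:
    res = ttest_rel(a, b)   # paired (within-subjects) t-test
    print(f"{label}: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```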

Despite the audio-only mode being much more difficult, 4 participants reported it as their preferred mode of play, with all others (15) preferring to use both modalities. None preferred the visual-only mode of play.

6.2 Qualitative Results

Responses to the Critical Incident Technique style questionnaire and interviews were collated and a thematic analysis performed, identifying three common themes: rapid immersion, the importance of sound, and engaging audio experiences.

Rapid immersion. Responses indicated that a high level of immersion was attained very quickly, which was especially noteworthy considering the very short duration of the gameplay sessions. Participants stated that they found the experience “engrossing” and “entertaining”, as they would expect from a game. Familiarisation time was very brief, with participants quickly becoming engrossed in the scenario. They genuinely tried to survive and felt they reacted as they would have in such a situation in real life: diving to the sides, turning sideways to fit between oncoming vehicles and sometimes panicking.

“Damn this is intense!” - Participant 2

“It’s going to crush my toes!” - Participant 13

The importance of sound. Most participants agreed that the combination of the visual and aural modalities led to the best overall experience. They also stated that the visual-only version was lacking in comparison and felt “disconnected” and “muted”. Some participants found that their attention wandered; the expectation of aural cues was “noticeable” and their absence “disconcerting”.

“Something was missing, I was waiting for the crunch.” - Participant 17

“It felt mechanical, like playing Candy Crush.” - Participant 13

Engaging audio experiences. Many shared the opinion that the audio-only version was “intense” and “thrilling” and that being deprived of their vision required them to immediately engage with the game and concentrate on survival, thus very rapidly creating an intensely immersive experience.

“That was absolutely terrifying! I was all there!” - Participant 6

“Oh god oh god oh god…” - Participant 4

7 Conclusions

The research objective of this study was to investigate the hypothesis that unimodal and multimodal digital experiences create different levels of immersive response.

From the above results, and keeping in mind the context and scope of the study, a number of conclusions emerge. Firstly, with regard to achieving immersion, the combination of natural input interfaces and relatable scenarios matched to that input can lead to high levels of immersion. As outlined, the sample included individuals who self-identify as gamers as well as individuals with limited exposure to games, and high levels of immersion occurred regardless of familiarity with game playing. Secondly, in terms of unimodal output, the removal of the primary sense, vision in this case, had a dramatic effect, leading to intense experiences, emotions and reactions, and subsequently ‘total’ immersion. Conversely, the removal of a secondary sense (audio) without any context had the exact opposite effect, with users becoming disconnected from the experience. This supports the hypothesis that unimodal and multimodal digital experiences create different levels of immersive response, and further indicates that the level of immersive response depends on which output modality is utilised.

8 Future Work

Future work should investigate other modalities and combinations thereof. Participants commented on how they grew to expect some form of physical feedback, such as “wind disturbance from a near miss” and “vibrations under their feet”, while playing. It would be beneficial to explore these possibilities with other modalities, but it would also be constructive to further refine and experiment with the visual and auditory modalities. For instance, the use of immersive displays such as the Oculus Rift DK2 [3] could further enhance immersion and therefore the experience.

These technologies would also allow for the creation of a digital game experience in which the suppression of specific modalities is contextualised. For example, vision could be suppressed by fog, or auditory perception suppressed by loud environmental noises such as rain or heavy machinery. It would be interesting to determine whether such contextual suppression of the senses affects the immersion, or lack of immersion, that users reported in the conducted experiment. These findings could then be formed into HCI guidelines for the design of immersive experiences.

Finally, future work could also take an inclusive “Design for All” stance, to determine whether this work on isolating modalities and immersion could be used to design games that can be enjoyed by players with and without sensory impairments, i.e. by all. Understanding the contribution of the various sensory modalities, and their inherent power in communicating information and achieving immersion, could open the door to a level playing field for users of mixed abilities in cooperative and competitive contexts.