1 Introduction

In HCI research, a user-centered design approach aims to address the explicit and implicit needs of users in order to minimize barriers to technology use. Intuitive user interfaces allow users to draw on prior knowledge and experience, making the interfaces easier to understand and to master. For example, gestures and metaphors such as swipe, pinch, or roll are used to interact with smartphones; all of them are based on analogies to interacting with physical paper. Prior knowledge can stem from experience, such as learned conventions for arbitrary actions (e.g. pressing a button on a keyboard to close an application). Ideally, a natural mapping (NM) allows users to infer meaning from real-world experiences and analogies when interacting with technology, whether symbolic (e.g. clicking ‘x’ to close an application) or natural (e.g. swiping the application off the screen) [1]. Norman introduced various concepts to utilize these experiences, e.g. spatial analogies, cultural standards, perceptual effects, or biological effects [1]. From a cognitive psychology viewpoint, these experiences are stored as mental models [2], cognitive schemes [3], or scripts [4].

NM interaction activates existing mental models and allows for their transfer to the interaction at hand. As a consequence of this transfer, the interaction requires a lower mental workload, freeing up cognitive resources for processing the actual content of the interaction. Furthermore, NM also allows mental models constructed through an interaction with technology to be transferred to the physical world: Virtual simulators (e.g. medical training or driving simulators) can be employed to prepare for real-world situations. The extent of these mappings can be modified freely by interaction designers, and there have been many attempts to create authentic virtual counterparts of real-world interactions. However, studies showed that highly naturally mapped interactions (e.g. stereoscopic images or authentic game controllers) did not automatically enhance performance or user experience (UX), but were only effective for certain interactions [5, 6].

In addition to the mapping of input actions, virtual environments can differ greatly in their mapping of spatial relations: The user perceives both a physical space (e.g. a C.A.V.E. environment) and a virtual space (e.g. the virtual scene depicted by the C.A.V.E.) in which the interaction takes place. For example, a system with a high degree of (natural) spatial mapping may use an isomorphic mapping of distances, object sizes, and travel speeds as well as a subjective, head-tracked viewing perspective, resulting in a very natural overall experience. Furthermore, users make assumptions about possible interaction affordances based on their real-world experiences, drawing on existing mental models. Like NM, this spatial mapping should reduce the required mental workload. Even with very naturalistic input devices (e.g. gestures), systems with lower degrees of spatial mapping (e.g. video games) require more mental transformation processes during the interaction to account for the mismatch between the two perceived spaces. Yet a high degree of available spatial input and output information alone should not automatically benefit the interaction process.

In this paper, we discuss the spatial mapping of virtual and physical spaces and the impact of spatial mapping on task performance and user experience. Specifically, as with NM, we argue that the combination of perceived spatial multisensory stimuli has to be meaningful for the specific user task to show any benefits. We examine spatial relationships, because body-centered interaction [7] primarily aims to combine corresponding proprioceptive and exteroceptive sensations to create a sense of embodiment in a virtual environment [8].

2 Natural Mapping and Natural User Interfaces

Natural Mapping is a specific form of input mapping [9] that focuses on intuitive controls for interactive systems. Natural user interfaces (NUI) are often described as direct, authentic, motion-controlled, gesture-based, or controller-less. Instead of relying on buttons, keyboards, joysticks, or mice – which require users to issue abstract and arbitrary input commands – they rely on intuitive, physical input methods. These methods are often modeled on their real-life counterparts and do not require the user to learn the controls before interacting with a video game or virtual environment.

Natural Mapping originally refers to a proper and natural arrangement of the relations between controls, their manipulation, and the outcome of that manipulation [1]. The interactions are based on prior knowledge: Physical and spatial analogies are used to imitate physical objects within a virtual context, e.g. ‘buttons’ that can be pressed, ‘sliders’ that can be dragged, and so on. Cultural standards give the user an idea of the outcome of an interaction, e.g. rotating an object clockwise or counterclockwise to increase or reduce a value. What we call ‘intuitive’ means that our cognitive system can adapt to the situation more easily. Based on previous knowledge, mental models about the objects and the interaction are constructed.

2.1 Mental Models

Mental models (MM) are subjective, functional models of technical, physical, and social processes in complex situations. They are representations of the surrounding world and include relationships between its different parts [2]. MM include only reduced aspects of a situation: Quantitative relationships are reduced to qualitative relations within the models [10], which relate to a specific object in the form of structural or functional analogies. MM are constructed to organize and structure knowledge through the processing of experiences. Schemes, frames, and scripts are similar, related concepts. Mental models are used in theories of media reception, such as text [11], film [12], or interactive media [13].

Two mechanisms provide information to construct mental models: In a top-down process, existing knowledge and experiences from other knowledge domains are used as a base of the MM. In a bottom-up process, situation-specific information is integrated into the model. Whenever new, situation-specific information is available, the model adapts to the new circumstances. Both the cognitive processing and the construction of MM are automatic processes.

The benefit of natural mapping comes from the inclusion of previous knowledge in the construction of mental models. NM allows for a transfer from other knowledge domains, thus enhancing the retrieval of existing mental models for the interaction or allowing for an easy top-down adaptation of new models [14]. As a result, fewer cognitive resources are needed for the interaction, and more resources are available to process the actual content of the media experience.

2.2 Task-Specific Benefits of Natural Mapping

Interaction techniques in virtual environments are typically divided into natural and magical techniques. Whereas natural interaction aims at high interaction fidelity and the simulation of real-world counterparts, magical techniques are intentionally less natural and focus on usability and performance [15]. Object selection and manipulation, as well as travel/translation, system control, and symbolic input, are key tasks within a virtual environment [15]. Depending on the context of the interaction or application, the focus may not be on NM at all. In productive software such as engineering or office applications, the efficiency and precision of the controls are more important than an intuitive interaction, favoring a magical or abstract technique (such as using a keyboard). Using previously learned hotkeys to achieve a task may not be intuitive, but it is very efficient. As productive software is aimed at experts, intuitive controls for novices are less important.

There are, however, tasks that clearly benefit from NM: Tasks designed for novice users should be intuitive, allowing for a fast learning process. Furthermore, when the task involves sensorimotor transfer processes (e.g. medical training simulations), a NUI should employ natural input devices (e.g. a virtual scalpel) to achieve the best transfer results. Also, if the goal of the task is not performance-based but focuses on entertainment or is meant to provoke body movements (e.g. fitness, sports), NM can be employed effectively [16, 17]. These examples stress the importance of the task-specific context for NM. Depending on the complexity and the goal of the interaction, it may not be necessary to completely simulate the real-world counterpart of an interaction – a simplification may be sufficient.

2.3 Natural User Interfaces and Spatial Information

Often, NUIs aim for high naturalism, combining spatial input capabilities and multisensory output [18]. Bowman [19] emphasizes the precision problems of spatial input. Spatial tracking systems still lag far behind the modern computer mouse in terms of precision (e.g. jitter), accuracy, and responsiveness (e.g. latency), and they have several basic disadvantages: (1) spatial input is often performed in the air and not on a flat surface, (2) in-air movements are often jittery because of natural body tremor, (3) pointing techniques using ray-casting (e.g. magic wands) amplify natural hand tremor, and (4) 3D spatial trackers usually do not stay in the same position when the user lets go of them [19].
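Such jitter is commonly mitigated by low-pass filtering the raw tracker signal, trading responsiveness for stability – the same precision/latency trade-off noted above. A minimal sketch (our illustration, not part of the cited work; the function name and smoothing factor are assumed) using exponential smoothing over 3D tracker samples:

```python
import numpy as np

def smooth_positions(raw_positions, alpha=0.3):
    """Exponentially smooth a stream of raw 3D tracker samples.

    alpha close to 1 keeps the signal responsive but jittery;
    alpha close to 0 suppresses tremor at the cost of added latency.
    """
    smoothed = [np.asarray(raw_positions[0], dtype=float)]
    for sample in raw_positions[1:]:
        smoothed.append(alpha * np.asarray(sample, dtype=float)
                        + (1.0 - alpha) * smoothed[-1])
    return smoothed

# Example: tremor around a nominally fixed hand position.
samples = [(0.50, 1.20, 0.30), (0.52, 1.19, 0.31), (0.49, 1.21, 0.29)]
print(smooth_positions(samples))
```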

Despite these problems, the fidelity of spatial input capabilities is unparalleled. NM allows for three-dimensional input, e.g. through gestures or tangible objects. For example, a virtual environment could allow users to play virtual golf with a real golf club whose position and movements, along with those of the player, are tracked by the system. The amount of spatial input information used can vary greatly: The system could process the information on a basic level, registering the overall movement of the club as a single event. On the other hand, it could process all available information (6 DOF of movement) for the interaction. In reality, most systems fall between these extremes. Interactions can be simplified to make them easier to perform (e.g. in video games such as Nintendo Wii Sports or Tiger Woods PGA Tour 13 for Microsoft Kinect) or maintained as complex sequences (e.g. in training simulators).
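To make the two extremes concrete, consider the following hypothetical sketch: it collapses a tracked 6-DOF pose stream into a single ‘swing’ event, as a simplified game might, whereas a full simulator would feed the entire trajectory into its physics model. The Pose layout, function name, and speed threshold are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    """One 6-DOF tracker sample of the club: position and orientation."""
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float

def detect_swing(poses, dt, speed_threshold=3.0):
    """Simplified processing: collapse the pose stream into one event.

    Returns True once the club exceeds an (illustrative) speed
    threshold in m/s between consecutive samples taken dt seconds
    apart; all remaining spatial detail is discarded.
    """
    for prev, curr in zip(poses, poses[1:]):
        speed = ((curr.x - prev.x) ** 2 + (curr.y - prev.y) ** 2
                 + (curr.z - prev.z) ** 2) ** 0.5 / dt
        if speed > speed_threshold:
            return True
    return False

# A full simulator would instead feed every Pose (all 6 DOF) into its
# ball-flight physics rather than reducing the swing to one boolean.
```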

Simplified interactions usually do not require highly elaborate previous knowledge or skill. Novice users may apply simple concepts (e.g. “swing the golf club and hit the ball”) from common knowledge, and their assumptions about the interaction are based on these basic models. Complex interactions in virtual environments are rarely perceived to be as complex as their real-world counterparts [18], so even these are simplified for experts. However, a seemingly complex system (e.g. a training simulator) may invoke the assumption of real-world complexity, resulting in frustration and a bad user experience if this assumption is not met. Novices, in contrast, may not notice the simplification because of their basic mental model of the real-world interaction.

To exploit all the benefits of complex spatial input, complex multisensory spatial output is required: If users cannot perceive spatial depth cues, they are not able to make precise spatial inputs. Visual depth cues can be classified into static and dynamic monocular cues and binocular cues [20–22]. Monocular cues constitute the majority of cues for human depth perception, e.g. occlusion, relative height in the visual field, relative size and brightness of objects, texture gradient, linear and aerial perspective, and shadows. Spatial cues requiring binocular vision are parallax and stereopsis, i.e. convergence and accommodation of the optical lenses [23]. In media technology, binocular spatial cues are primarily simulated through stereoscopy [24]. Head-mounted displays, shutter/polarized glasses, or autostereoscopic techniques are used to present two separate stereo images, one for each eye. Combined, these visual spatial cues should allow displays to convey highly accurate spatial information. Furthermore, head-tracking can be used to ensure a correct subjective perspective of the virtual scene and maximize the effect.
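As a minimal sketch of how such a system could combine head-tracking and stereoscopy (our illustration; the matrix conventions and the interocular distance value are assumptions), each eye’s view can be derived by offsetting the tracked head pose by half the interocular distance before inverting it into a view matrix:

```python
import numpy as np

def eye_view_matrices(head_pose_world, ipd=0.063):
    """Derive per-eye view matrices from a tracked head pose.

    head_pose_world: 4x4 matrix placing the head in the virtual scene
    (supplied by the head-tracker). Each eye is offset by half the
    interocular distance (ipd, in meters) along the head's local
    x-axis; rendering the scene once per eye yields the two slightly
    different images that drive stereopsis.
    """
    views = {}
    for eye, dx in (("left", -ipd / 2), ("right", +ipd / 2)):
        offset = np.eye(4)
        offset[0, 3] = dx                       # shift along head x-axis
        eye_pose = head_pose_world @ offset     # eye pose in world space
        views[eye] = np.linalg.inv(eye_pose)    # world -> eye (view) matrix
    return views

# Example: head at 1.7 m height, looking down the world z-axis.
head = np.eye(4)
head[1, 3] = 1.7
print(eye_view_matrices(head)["left"])
```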

Systems with a high degree of naturalism often combine high degrees of spatial input and output capabilities. The mapping of spatial relations within the system can also be designed differently, which we refer to as spatial mapping.

3 Spatial Mapping

We conceptualize spatial mapping (SM) as an extension of the natural mapping process, in which spatial relationships are included in the mental models for a specific interaction in a virtual environment [25–27]. High (natural) SM is considered an isomorphic mapping between the perceived physical (real) interaction space and the virtual interaction space: Distances and sizes of objects are identical in both perception spaces. Building on the theory of NM, the high similarity of the two spaces favors the transfer of mental models of the physical world into the virtual environment (and vice versa) (Fig. 1).

Fig. 1. Left: System with low spatial mapping (system A), requiring the user to transform spatial information between the virtual and the physical space. Right: System with high spatial mapping (system B), requiring no cognitive transformations. Source of images: [28].

Although NM can be quite authentic in a given system (e.g. using gesture input), the spatial relationships during the interaction can be mapped differently: For example, when playing virtual table tennis on a Nintendo Wii video game console, users control a racket with a naturally mapped input controller that captures movement in 6 DOF in front of a TV screen (system A). The user is represented by an avatar on the screen that mirrors the user’s movements to a certain degree. Even with a high NM, SM is low, because cognitive transformation processes are required to combine the physical and virtual perception spaces. Furthermore, the system reduces relevant spatial information to compress the physical space needed for the interaction. System B could employ an isomorphic spatial mapping using a C.A.V.E.: There is no representation of the user other than their physical self, and all objects perceived in the environment have the same size and distance as in the real world. Only a few transformation processes are necessary, and more cognitive resources remain for processing the content of the interaction itself.
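The contrast between the two systems can be expressed as a coordinate transform from tracked physical space into virtual space. In the following minimal sketch (our illustration; the scale and offset values are assumed rather than taken from either system), an isomorphic mapping is simply the identity, while a low-SM system compresses and relocates the interaction space – precisely the discrepancy the user has to compensate for cognitively:

```python
import numpy as np

def physical_to_virtual(p_physical, scale=1.0, offset=(0.0, 0.0, 0.0)):
    """Map a tracked physical-space point to virtual-space coordinates.

    With scale == 1 and zero offset the mapping is isomorphic
    (system B): distances and object sizes match across both spaces.
    A low-SM system (system A) compresses the interaction space
    (scale < 1) and relocates it onto the screen, a discrepancy the
    user must compensate for cognitively.
    """
    return scale * np.asarray(p_physical, dtype=float) + np.asarray(offset)

# System B (C.A.V.E.): identity mapping, no transformation needed.
assert np.allclose(physical_to_virtual([1.0, 1.5, 0.5]), [1.0, 1.5, 0.5])
# System A (TV + avatar): compressed, displaced interaction space.
print(physical_to_virtual([1.0, 1.5, 0.5], scale=0.25, offset=(0.0, 0.0, 2.0)))
```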

3.1 Adequacy and Relevance of Spatial Mapping for Different Types of Tasks

In theory, the combination of spatial input and output technologies allows for very high levels of interaction fidelity in interface design [18]. In practice, NUIs are often seen as more engaging and interesting, but also as physically more exhausting. They can be implemented successfully for certain types of interaction, but may result in bad UX for others. An often-cited example [29, 30] for this argument is the NUI from the movie Minority Report [31], in which the protagonist uses a gesture-based interaction system to search an audiovisual database. The system looks visually impressive, but the mapping of the input modalities is completely inadequate for the task of searching for information. It is exhausting to use and provides no essential benefit over a mouse and keyboard with a two-dimensional display. Had the task included a detailed manipulation of several objects within a three-dimensional scene, the high degree of spatial information in the input actions could have been applied reasonably.

A high degree of detail in the input and output modalities is the ideal precondition for a high degree of user experience. However, many interactions do not require high spatial mapping; it is not relevant and thus does not affect UX or task performance. An application may offer a visually rich stereoscopic presentation with highly natural body posture and gesture recognition as input modality. But when the user’s task is to react to acoustic stimuli with a wave gesture, the additional spatial information is irrelevant to the task. It should not benefit the UX – on the contrary, the additional information could impair UX through side effects such as simulator sickness [32] or fatigue due to the physical interaction with the system. As a result, the user may perceive the system as inadequate for the task. Simple tasks requiring just one or two spatial dimensions do not benefit from a high degree of spatial information, which only makes the interaction unnecessarily difficult.

This notion is also supported by Bowman [19], who argues that the mapping between input devices and actions in the interface is critical. He recommends reducing the number of DOFs the user is required to control, e.g. by using lower-DOF input devices or by ignoring some of the input DOFs.
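A minimal sketch of the latter option – ignoring unneeded input DOFs in software (our illustration; the axis names and the steering example are assumptions): the application pins every DOF the task does not require, so a two-dimensional task only ever receives two-dimensional input:

```python
def constrain_dofs(pose, active=("x", "y")):
    """Pass through only the DOFs a task actually needs.

    pose: dict of 6-DOF tracker values keyed by axis name (our own
    illustrative convention). DOFs not listed in `active` are pinned
    to zero, so a two-dimensional task only ever sees 2D input.
    """
    return {axis: (value if axis in active else 0.0)
            for axis, value in pose.items()}

# Example: a steering task that needs lateral movement only.
raw = {"x": 0.4, "y": 1.2, "z": -0.3, "yaw": 10.0, "pitch": 5.0, "roll": 2.0}
print(constrain_dofs(raw, active=("x",)))
```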

Three key components can be identified that characterize spatial mapping:

  • Degree of detail of spatial input modalities

  • Degree of detail of spatial output modalities

  • Interaction task requiring a certain degree of spatial input/output modalities

The interaction task can be simple, using only one-dimensional spatial information (e.g. moving along the x-axis only). Video games like Space Race [33] use two-way joysticks; the user’s task is only to steer left or right, as the avatar accelerates automatically. Complex spatial information is not necessary for the task.

More often, two-dimensional spatial information is required (e.g. interaction within the plane spanned by the x-axis and y-axis). Many modern video games include such interaction tasks by allowing inputs for left, right, up, and down. Racing simulations as well as side-scrolling games or games with a bird’s-eye view fall into this category. Binocular depth cues (i.e. stereoscopy) are not relevant for the task itself.

Many studies on stereoscopic presentation in games found no effect on performance or UX [5, 34, 35]. However, some studies [36, 37] report positive effects of stereoscopy on fun and enjoyment. These could be explained by a novelty effect: players may have enjoyed the stereoscopic technology because it was new to them. Even studies in VR simulators using scenes with simple selection tasks report no benefits of stereoscopic presentation [38, 39], which supports the assumptions made here.

Complex tasks involving three-dimensional spatial information require users to interact in a 3D space. It is insufficient for virtual environments merely to present high degrees of spatial information in complex three-dimensional scenes (e.g. shooter games, C.A.V.E.s); the user’s task has to involve truly three-dimensional interaction to make the available detail of spatial information meaningful. Studies using, e.g., selection and manipulation of 3D objects in a 3D space [40] show positive effects of stereoscopy and head-tracking on task performance and UX.

Overall, the design of the interaction task determines which degree of spatial information is relevant for the input and output modalities. ‘The more, the better’ does not apply here: Higher degrees of spatiality have to be meaningful for the user’s task to significantly enhance UX or task performance. A truly isomorphic spatial mapping should therefore require a three-dimensional task to show any benefits compared to lesser degrees of spatial mapping. The right combination of task and spatial information should show the best results for UX and task performance.

3.2 User Studies

We conducted a series of studies with virtual environments using low and high degrees of spatial mapping to test the assumptions of this theory. A first study (N = 265) compared two systems by manipulating the degree of spatial mapping (high: stereoscopic presentation, isomorphic spatial relations, subjective perspective; low: monoscopic presentation, non-isomorphic spatial relations, objective perspective) and using two different user tasks [28, 41]. The task in system A (power wall setup with a VR table tennis simulation [42]) required a three-dimensional interaction to manipulate objects within a virtual scene, whereas the task in system B (racing game simulation Gran Turismo 5 [43]) required only a two-dimensional interaction. In both systems, UX (measured with the questionnaires MEC-SPQ [44], UEQ [45], and IMI [46]), task performance, and various user variables were recorded and analyzed. The results confirm our hypotheses: High spatial fidelity resulted in better UX and task performance only for users with the three-dimensional task. For users with the two-dimensional task, the additional spatial information enhanced neither performance nor UX, as it was rated inadequate and unnecessary by the participants.

A second study (N = 94) examined different spatial mappings in the video game The Elder Scrolls V: Skyrim [47]. Using an Oculus Rift HMD and a Razer Hydra controller, we manipulated stereoscopic presentation and natural input mapping. In all groups, the task required a complex three-dimensional interaction (i.e. placing and navigating objects through a custom environment created for the experiment). We used the same measures as in the first study. Overall, the results confirmed that the high spatial mapping was rated as more adequate and relevant for the complex interaction task and led to higher task performance compared to lower degrees of spatial mapping.

4 Discussion and Implications

In this paper we introduced the concept of spatial mapping as an extension to natural mapping. SM refers to the mapping of spatial relations, sizes, and distances of objects as well as visual perspectives within a given virtual environment. High degrees of spatial mapping use an isomorphic mapping from perceived real-world spatial relations to the virtual world, thus enabling users to apply previous knowledge and skills based on the real world. By using high degrees of spatial mapping, the cognitive workload for the interaction can be reduced, as fewer transformation processes are required to learn the interaction. Furthermore, the transfer of mental models constructed within the virtual environment (e.g. in virtual training simulations using spatial tasks) to real-world applications should be easier as well. However, natural user interfaces must reflect the context of the user’s tasks. High degrees of spatial information, both for input and output capabilities, have to be relevant for the interaction to enhance UX or task performance. For example, spatial depth cues provided by stereoscopic presentation or subjective head-tracking are only beneficial for complex three-dimensional tasks. A system may provide a very natural interaction with high interaction fidelity, but when only simple one- or two-dimensional interactions are required, it may prove no better or even worse than a more basic system.