1 Introduction

... human kind

Cannot bear very much reality...

T.S. Eliot [1]

Sticking your fingers in your ears to block incoming sound waves is the simplest and oldest form of sound stimuli reduction, while the earplug is probably the oldest technology that allows humans to achieve this without using their fingers. The first known mention of such earplug-like technology is in Homer’s Odyssey from the eighth century BC:

First you will come to the Sirens who enchant all who come near them. If any one unwarily draws in too close and hears the singing of the Sirens, his wife and children will never welcome him home again, for they sit in a green field and warble him to death with the sweetness of their song…Therefore pass these Sirens by, and stop your men’s ears with wax that none of them may hear [2]

Interestingly, this story links the use of earplugs—a technology that controls mind-external sound stimuli—with the ability to maintain cognitive control. The seamen on the ship did not use earplugs to prevent physical damage to their ears due to great sound pressure or to reduce annoying noise. The earplugs were used as a shield against the cognitive distraction the particular sound of the Sirens might cause them, a distraction that would have a negative impact on their cognitive control (executive functions) and ability to keep rowing. Thus, the story highlights the need to reduce the external world’s sensory potential in order to avoid its dangers.

Extended reality [XR] (the umbrella term representative of Mixed Reality [MR] and Augmented Reality [AR] techniques within the broader field of Virtual Reality [VR]) is often used to add virtual objects to the user’s surroundings or to create hyper-realistic situations. These technologies may allow users to immerse themselves in realities removed in space and time from their actual spatiotemporal locations, or to experience how virtual objects visually (and it is almost always visually) interact with actual objects in their surroundings.

This is an essay proposing and exploring the antithetical position—opposing the motivations and need for XR. The essay is provoked by two questions with regard to the topic of XR: which reality is under discussion and why would it need to be extended? Each of these two questions covers a host of others—what is reality? can it indeed be extended? and so on—and these we answer along the way as we argue the case not for extended but rather reduced reality [RR]. The increasing complexity of sensory stimuli in our daily life calls for new ways to cope with potentially stressful situations, for instance, in nursing (e.g. see [3,4,5]) and surgery (e.g. Luz et al. [6] who also argue for a less-is-more approach with regard to visual imagery), and one way is to reduce sensory and thus, cognitive clutter. Could VR design and technologies be used beneficially to subtract something from experience rather than add something to it, and, our focus here, what could be the role of sound in this?

Certainly, the argument could be made that VR technologies already present a reduction of reality or a fantasy world that, in its presentation of sensory potential, cannot match the quality and quantity of sensory potential available in the external world. A number of modalities are typically not available in the worlds of VR—for example, there is no sense of mass or gravity that emanates from the VR world—and the technology itself (even the original recording technologies if cameras and audio recorders have been used) reduces external-world data from its indiscrete forms to discrete forms through digitization processes, with further reductions deriving from choices related to, for instance, frame rate and bit depth. But such reduction is not our concern here.

Our concern, rather, is that a focus on addition—the purpose of most XR—will not lead to better engagement, let alone presence, in virtual worlds when used with present-day approaches. Instead, an approach that focusses on what is important to the perceiver can reduce clutter and allow the required perceptions to emerge. This is one of the reasons we propose an RR concept that focuses on filtering things out and a balanced equalization (to use a term from the art of sound engineering). This has the secondary benefit, we believe, of reducing the need to excessively direct the power of CPUs, GPUs, and sound chips towards providing a sensory, so-called realistic virtual world, and so such processing power could be used for other purposes.

As individuals in an external world, like other species we already reduce or filter the amount, type, and bandwidth of the sensory information derived from that external world. This reduction can be operatively grouped into four categories: physical subconscious; physical conscious; cognitive subconscious; and cognitive conscious. In the first category, it is the very physicality of our bodies which does the reduction. That is, we do not, for example (and unlike some other species), possess the sensory apparatus to detect magnetic fields, and, within the sensory channels we have access to, our size dictates certain thresholds—our sensitivity to sound wave frequencies, for instance, or our ability to tactilely distinguish between rough and smooth. Consciously turning to focus on an event, reaching out to touch an object, or cupping one’s hand behind the ear are examples of the second category. Cognitively, our brains exert some subconscious control over what becomes conscious perception (category three, which can also include illusions such as the McGurk Effect [7]), but (category four) we can also consciously concentrate on limited areas and aspects of the external world—for example, the cocktail party effect [8]. In our physicality and cognitive design, we are always already, necessarily and naturally reducing input from the external world. Where relevant, we discuss some of these natural forms of reduction in more detail below.

Although not the focus of this essay, it is also worth mentioning that reduction is not the only strategy species have evolved to cope with the mass of potential sensation in the external world. We also rearrange the external world temporally and spatially in order to make sense of it and thus be able to act within and upon it. As a temporal example, one might reasonably assume, given the differential in the velocities of light waves and sound waves (approximately 300,000,000 m s−1 versus 330 m s−1 in air at 0 °C), that, from a single audio-visual event, the light wave would be perceived as an image long before the sound wave is perceived as a sound. Beyond about 15 m, this assumption (for humans) would be correct. Within that distance, though, it turns out that vision registers in the brain after sound. Up to a distance of about 100 m, the two sensations are still perceived as one percept [9]. The ability to locate a sound event in the cinema not on the loudspeakers to the sides of the theatre but on the screen in front is a good example of a primarily spatial re-locating of sensory information. Chion [10] called this ability to perceptually fuse sound with visual stimuli in the cinema—despite their different origins—synchresis to highlight how we naturally seek to connect synchronously paired stimuli to create a unified and coherent perception. Both these examples show that we can rearrange sensory information from the external world to provide a more commodious environment.
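The crossover distance in this temporal example can be sketched with a back-of-the-envelope calculation. Note that the figure of roughly 45 ms, by which visual neural processing is assumed to lag auditory processing, is our illustrative assumption and not a value taken from the cited source; with it, the crossover lands near the ~15 m mentioned above.

```python
# Illustrative sketch: within what distance does sound "win" the race to
# perception? Assumption (not from the source [9]): visual processing takes
# roughly 45 ms longer in the brain than auditory processing.

SPEED_OF_SOUND = 330.0         # m/s in air at 0 degrees C (as in the text)
SPEED_OF_LIGHT = 3.0e8         # m/s
VISUAL_LATENCY_EXCESS = 0.045  # s; assumed extra processing time for vision

def perceptual_gap(distance_m: float) -> float:
    """Seconds by which the sound percept trails the visual percept.

    Negative values mean sound registers first despite travelling more
    slowly: its shorter neural processing path outweighs the travel-time
    deficit at short range.
    """
    sound_delay = distance_m / SPEED_OF_SOUND
    light_delay = distance_m / SPEED_OF_LIGHT  # effectively zero here
    return (sound_delay - light_delay) - VISUAL_LATENCY_EXCESS

# Crossover: where sound's travel delay equals vision's processing excess,
# i.e. roughly latency difference times the speed of sound.
crossover = VISUAL_LATENCY_EXCESS * SPEED_OF_SOUND
print(f"crossover distance ~ {crossover:.1f} m")  # close to the essay's ~15 m
print(f"gap at 5 m:  {perceptual_gap(5):+.4f} s")   # negative: sound first
print(f"gap at 50 m: {perceptual_gap(50):+.4f} s")  # positive: vision first
```

The qualitative point survives any reasonable choice of latency figure: because light's travel time is negligible at these scales, the crossover distance is simply the latency differential multiplied by the speed of sound.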

We aim to show that VR designers would benefit from a less-is-more approach and discuss ways to equalize our daily environment using VR technologies to make them more manageable. An intelligent reduction of our sensory and thus perceptual field will improve the differential processing of external stimuli (selective attention) and will allow for better concentration, focus, and ability to achieve flow and presence. This is relevant, for instance, for people working in stressful surroundings or for people unable to filter sensory information (see [11] for a study of the inability to select specific sensory inputs for enhanced processing in people with autism spectrum disorders), but it is also relevant to the design of virtual worlds such as computer games.

Our approach involves an understanding of the nature of reality, listening, and crossmodality and an examination of the role of sound in the creation of a reduced reality. We begin with a brief discussion of reality and related terminology as a means to make a case for a future paradigm shift in VR design. This discussion includes an argument for the need to reduce reality in artificial worlds, such as those found in VR, and suggestions for how to utilize the auditory domain to accomplish this reduction.

2 What is reality?

While we do not argue for a definitive answer to the question what is reality?—this has been argued over for millennia and no doubt will be argued over for millennia to come—we are prepared to state that there is no one reality but rather multiple realities which are conceptions of a subset of an external world or externality which is ungraspable in its full extent. In other words, although there is a common substrate to our realities, each individual has their own conception of reality due not only to their unique spatiotemporality in relation to that externality but also to their unique cognitive alchemy formed through their own experiences.

In this essay, reality is related to the perceiving self’s beliefs about, and directedness towards, the external world, and these are connected to the process of forming a perceptual environment. Reality is what we seek to comprehend when we choose to—or are made to—become aware of the external world. The perceiving subject is already bodily embedded in the external world in the process of constituting the experience that is reality. Reality, however, is something different from the external world. The external world is connected to reality only through the environment that functions as a perceptual model of the external world—a model that maps onto reality in a metaphorical sense. When the environment is remodelled, reality changes; the external world does not.

We expand this further below but first must deal with the confusion in the field of XR research and development over the concept of reality.

2.1 Reality according to XR

Our brief definition of reality above might seem a matter of philosophical semantics that places us somewhere on the spectrum of constructivist phenomenology were it not for the fact that it is vitally important to get definitions right in the field of XR in order to ensure that its foundations are built on rock rather than shifting sand. That the field of XR is built on shifting sand will become clear as we discuss the multiple meanings of the term reality in XR and the broader VR field.

In most studies of extended reality, the concept reality equals the empirical world (virtual or actual). As we have noted before [12], our conception of the environment is that it is an emergent perception resulting from a hypothetical modelling of a subset of the external world. We acknowledge that this externality is what is normally referred to as reality in the XR field. This belief is apparent in the terminology itself—extended reality, augmented reality, virtual reality—and the uses to which XR technologies are put (the claims that they emulate and enhance reality). Yet there remains ambiguity and inconsistency.

Take the term virtual reality. Leaving aside arguments as to whether the correct understanding of ‘virtual’ has ever been used in VR (see [13] for further discussion), we assume that the term refers to an emulation of something outside virtuality that, today, uses digital techniques (what might be called digitality). What precisely is being emulated is something that is external to the digital world of the emulation (assuming the ideal is achieved or achievable, otherwise the digital world is merely a simulation). As most VR systems (for the purposes of this essay, this encompasses XR systems) provide the same pool of stimuli to all users, then it must be assumed that VR designers have the conception of a reality that is uniform and singular and that this comprises externality.

There are several issues here. First, putting to one side solipsistic philosophies (the idea that only your own mind can be known to exist) which doubt the existence of externality (and everything else that can be doubted [e.g. 14]), one of the main philosophical threads from Plato to Kant through to the phenomenologists of the nineteenth and twentieth centuries is that it is not possible perceptually to fully grasp the external world. Our sensory modalities, those boundaries between ourselves and externality, filter out much of what could be sensed and is sensed by other creatures even before our cognition gets to fashion the remaining sensory gruel into something perceptual. How then can we seek to simulate reality, let alone emulate it, where that reality is equated with an ungraspable externality?

Second, if reality according to XR is the sum of externality to ourselves, then, obeying the law of conservation of energy, it certainly cannot be changed in any way. If the external world is the totality of existent things (perceivable or not)—there can be no reduction or augmentation of it. The external world is always already everything there is.

Third, and here lies the paradox in the VR field, mixed reality presupposes two or more realities combined to produce another. Yet, in VR, there is only the one reality (that which is external to the subject—see comments below relating to the positivist paradigm prevailing in VR), and that is the one apparently being modelled in VR or added to or extended in XR.

Fourth, there is a thread in VR presence research stating that presence in such worlds is enhanced by increasingly immersive technologies [cf. 15]. Thus, the better the fidelity of sensory stimuli delivered by the VR technologies to similar stimuli in externality (i.e. the greater the realism), the better the chances of presence being achieved in the VR world. However, as we cannot grasp all of externality and so cannot emulate it with digital tools, it will never be possible to be present in such worlds. Surely this is a dead-end route to presence, and so, if we are indeed present in VR-related worlds, immersive fidelity must refer to something else.

The fundamental question of what, or which reality, is being modelled in VR is also relevant in the context of the belief in the direct relationship between realism of the VR system and the system’s ability to induce presence (we briefly discuss presence in Section 3.2), but it also is the basis of the argument over whether ‘virtual reality’ is an oxymoron or a pleonasm. The positivist adherents of VR tend to the oxymoronic viewpoint, believing that reality is external to the subject, and so believe that that which is virtual is not real. Yet, with the naïve optimism of the visionary positivist, they also claim that the term can transform into a pleonasm when reality is, as it no doubt will be, emulated with VR systems: ‘VR makes the artificial as realistic as the real’ [16]. But, as we note above, reality in this sense is an externality which cannot be grasped; thus, the reality being modelled in VR is the Kantian reality, the only reality we can know and thus attempt to model, and so we agree with commentators such as den Hertog [17] that the term ‘virtual reality’ is already a pleonasm as reality is virtual.

2.2 Clarifying the terminology

In an attempt to sort out this conceptual muddle, we build on our previous work on definitions [12]. We have stated that there is an externality which is available to sense (it has sensory potential) and which can be equated to previous concepts of an ‘external world’. Humans, as a species, have sensory horizons that differ from those of other creatures and thus our sensory apparatus filters out much of what is available to be sensed by other organisms (see [18, 19], for instance, and Section 4 below). An individual has a certain sensory horizon by dint of its corporeal spatiotemporal positioning in externality and its particular sensory aptitudes, and this sensory horizon encloses sensory potential (what is sensed and what could be sensed dependent upon focus, attention, spatiotemporal position, and so on). We focus on and/or are made to be attentive to a subset of this sensable externality and so have a lesser, highly dynamic salient horizon encompassing a salient subset of externality (a horizon is sensorially multimodal and thus is spatiotemporal in nature—our hearing, for example, allows us to sense the past in order to create the present [see the brief note above and Section 4 below]). It is from within this salient externality that we sense and thus attempt to model what we sense. The dynamic model so created is a perceptual hypothesis [cf. 20] which we call an environment. The environment is thus an emergent perception, fashioned from sensation and cognition (knowledge and reasoning), and is a fair working approximation of that part of salient externality which we sense. It is within the environment that we are present because the formation of the environment is the process of distinguishing externality from self, and the sense of presence must have somewhere for our self to be present in.

Externality itself is ungraspable, and we only perceive what is inside our saliency horizon. This forms the basis for our definition of environment, a perceptual construct which arises from a confluence of sensation and cognition and which functions as a metonym for the world. Thus, we will never experience externality itself, and, accordingly—if externality does indeed equal reality—we will never have a direct experience of such reality where this would be defined as a direct experience of externality.

This brings us back to our definition of reality as our conscious experience of the perceptually emergent environment. From this experience and past experiences of previous environments (salient memories of which are stored in our cognition), we have a conception of externality which we term the world. Such a definition of reality—the experience of a perceptual environment—accounts for our different conceptions of the world because each of our environments has a strong element of our different cognitions and our different sensory capabilities. We each have different pasts, the experiences of which affect the creation of our environments as do the facts that we have different auditory and visual sensory capabilities (to name just two senses). Such a definition also accounts for our common conception of the world because each of our environments is formed in part from similar sensory capabilities (most of us as adults have a hearing range somewhere between 20 Hz and 15 kHz) and a heritage common to our species (we learn what animals are, what is good to eat, what writing is, how a single concept can encapsulate complex philosophies, we might have a theory of mind, and so forth) (see Fig. 1 for a schematic of our conception).

Fig. 1

A basic schematic of self and externality

At first sight, it might seem, from the figure and the position of ‘body’ in relation to ‘self’, that our perspective is that of Cartesian mind–body dualism or substance dualism rather than the mostly phenomenological stance we take. Yet, this would be to assume that the phenomenologist views mind and body as one indivisible monist system—we follow Merleau-Ponty when he states that: ‘There is, then, another subject beneath me, for whom a world exists before I am there, and who marks out my place in that world. This captive or natural mind is my body’ [21]. Merleau-Ponty uses the concept of the ‘body schema’ to underscore the fundamental role of the body in shaping and interacting seamlessly with our environment. In essence, the body schema is an embodied representation of our body that guides the self’s unconscious and pre-reflective understanding of the body’s spatiality, capabilities, and potential for action within externality. The body schema is not fixed, but dynamic and adaptive. It incorporates sensory information from various modalities to create a unified sense of the body, and it can change based on experiences, learning, and bodily changes. It is the body’s sense of possibilities that activates the body schema and structures the environment. As Merleau-Ponty writes: ‘To understand is to experience the harmony between what we aim at and what is given—between the intention and the performance—and the body is our anchorage in the world’ [21].

In our model, then, all that comprises the self would be nothing without the body which acts as the pre-existent interface between externality and self. The body is the effector of the self’s projection into and upon externality, but it is also the gatekeeper and actualizer of external potential to the self.

Accordingly, if the term extended reality, and its counterpoint reduced reality, are to have any meaning in the field of VR, the design of virtual realities—reduced or extended—must be something different from merely designing sensory worlds. In actuality, VR technologies (part of externality) are used to create the potential (viz. virtuality) from which we model perceptual environments, the experience of which forms our reality.

If we take reality to be a perceptual experience—rather than a mind-external world of sensory things, viz. externality—one may also claim that the merging of the virtual with the actual is essentially a reduction of reality as much as it is an augmentation of it. Changes in the design of externalities (actual or virtual) lead to a change in our awareness of that externality—that is, simultaneously a reduced awareness of something and an augmented or enhanced awareness of something else.

Finally, we should note that our model represents not a radical provocation but rather an attempt, for our own purposes, to clarify confusing terminology and concepts. It should be quite obvious that there is an element of constructivism in the model, particularly where we state that reality is not tangible but is instead an experience. Equally, we are not the first to suggest that it is conceptually useful to define environment as something relative rather than as an absolute and as a synonym for world—Gibson [22], for example, used the term to conceptualize an organism’s relationship to the external world.

As presence is an important motivating factor for the design of XR and VR, a brief definition is necessary before proceeding further. Within the field of VR and its progenitor telepresence, a fair definition of presence would be that it is the feeling of being in a place and being able to act in that place [15, 23, 24]. In this sense, presence in VR can be aligned with immersion in the field of computer games [25, 26]. Some, though, make a distinction: thus, immersion relates to the potential of the VR technology (its, typically, visual and auditory fidelity to an external reality), while presence is the psychological feeling resulting from exposure to and use of that technology [15]. For the purposes of this article, our use of the term ‘presence’ can be equated to the concept of immersion as found in the literature on computer games, notwithstanding questions as to correlations between presence in virtual worlds and presence in the real world [e.g., 27] or, indeed, the questions about the locus of presence (for us, the perceptual environment) which we raise here. We return to presence in Section 3.2.

3 Why reduce reality?

In this section, we will discuss some of the pitfalls of XR technologies and will argue how our concept of RR is better suited to address some of the key challenges associated with attention, presence, and stress in externality.

Scholars have found much promise in the idea of XR technologies: the ability to form a representation of how a place or person might look in the future; the ability to feed the user with navigational information without shifting attention from the field of agency; and much more. In 2011, Hugues and colleagues [28] devised an augmented reality taxonomy based on the functionalities of augmented realities. The taxonomy was grounded in the belief that a better grasp of reality (i.e. in this case, the external world) is achieved through the path of more information:

Although any increase in the quantity of information—and consequently, any increase in our understanding of reality—admitted by AR aims for greater mastery of what is real, it is clear that, from a technological point of view, AR can offer interfaces which propose either, more explicitly, information, or, more explicitly, a better mastery of our actions with regard to real events [28].

The question remains whether the ambition behind AR technologies—greater mastery over what is real—is achievable, or, indeed, whether mastery over reality is an attainable goal at all given the multiple conceptions of what reality is.

So far, most approaches to XR design focus on adding more sensory information: Mihejl and colleagues [29] argue that ‘the purpose of augmented reality is to improve user perception and increase his/her effectiveness through additional information’ (our italics). Bae and co-authors [30] argue that the purpose ‘is to provide additional information and meaning about observing the real object or a place’ (our italics), and Corvino and colleagues [31] state that: ‘The goal of Augmented Reality systems is to add information and multimedia elements to the natural space and to “increase” the natural space through digital contents’ (our italics). Thus, the main AR design principles deduced from these purpose statements include the following: addition of information, addition of meaning, ‘increase’ of natural (i.e. actual) spaces, and improvement of perception.

We believe that the focus on addition is the wrong approach. Achieving greater mastery over what is ‘real’ often entails (paradoxically) perceiving less of what is ‘real’ rather than more—in our terminology, reducing salient externality in order to achieve a more focussed reality. To avoid the temptations provided by new technologies to add more information, with the risk of creating cognitive clutter and overload, we propose that XR designers shift their focus to the creation of conditions for perceptual environments that enhance users’ ease of working and living, their ability to focus on tasks, and their capacity to achieve presence in virtual worlds such as games (something different from ease of access to as much information as possible).

The concept diminished reality has already flourished in several papers that describe technologies to conceal or see through objects in the visual field [see 32, 34]. While ‘diminish’ and ‘reduce’ are often considered synonyms, we prefer the latter. Both diminish and reduce may refer to the process of making something smaller or lesser in amount, volume, or extent. To reduce, however, more often refers to the process of removing something from an object or phenomenon in order to enhance the qualities of the remaining—non-reduced—part. The analogy is found in cooking, where you make gravies, syrups, and stocks by reducing a liquid to a thicker consistency resulting in a richer and more concentrated flavour. Furthermore, ‘diminished’ has negative connotations where, for example, one might say Trump or Putin are diminished by their actions.

Related to this, the tradition of phenomenology which emerged with Husserl argues for a perceptual reduction characterized by a focus on the essential horizon of consciousness—a shift in attitude where the facticity of externality is bracketed. The perceptual reduction is seen as a return to a pre-reflective level in human experience (epoché), in which taken-for-grantedness is deactivated and what is given in experience is foregrounded. It is a way to take a step back from common beliefs about the world [i.e. externality] and to re-perceive it with a new mindset [35]. Thus, perception is not diminished (like in a category 4 reduction, where we consciously concentrate on a subset of our sensory world) but rather restored to a primordial mode where sensed externality is reduced to presences, allowing new experiences to emerge. Merleau-Ponty later presents a philosophy of being-in-the-world, which he, along the lines of Husserl, frames in opposition to Descartes’ dualism (i.e. the view that the self is something separate from externality). But unlike Husserl, he argues that we can only withdraw partially from our engagement with externality: ‘The most important lesson which the reduction teaches us is the impossibility of a complete reduction’ [21]. Instead, he focusses on the body’s being-in-the-world, as something prior to conscious reflection. Since we are always present in externality with our bodies, we cannot fully withdraw from it. But still, it is possible to loosen our ties to externality and neutralize dogmatic attitudes to it in order to foreground what is present in experience.

In our thinking, reduced reality is the antithesis of extended reality. It emerges from a specific form of directedness to externality that changes appearances and alters the process whereby the perceiving self is constructing an environment. Designing the conditions for reduced realities, thus, is to facilitate a perceptual reduction through sensory and cognitive alterations. In what follows, we briefly argue for a paradigm shift from extending reality to reducing reality where enhanced attention is required for specific tasks, where optimal sensory stimulation is needed to increase hedonic experiences, where presence in a virtual world is desired, and where stress might result from cognitive overload, before moving on to discussing specifically auditory strategies to reduce reality.

3.1 Attention

Simon [36], who coined the concept of the attention economy, discussed how an abundance of information in the modern world leads to scarcity of attention: ‘What information consumes (…) is the attention of its recipients’ [36]. The consequences are a decrease in the quality of decisions due to information overload relative to attention capacity [see 37].

In everyday life, RR technologies may be used to reduce the sensory complexity of the surroundings by removing or diminishing the impact of specific sensable things to allow for an enhanced focus on other things. Our embodied cognitive system is already capable of suppressing sensory input (a category 4 type reduction often referred to as sensory gating [see [38] and our brief note above]) to reduce the complexity of our environment. This cognitive process is largely automatic, and it allows us to segregate incoming stimuli and focus on relevant sensory information (e.g. the cocktail party effect [8]).

Bregman [39] calls this process ‘primitive stream segregation’, a foundational aspect of his theory of auditory scene analysis. Bregman proposed that the auditory system segregates incoming auditory information into distinct perceptual streams based on various perceptual cues. Primitive stream segregation refers to the initial separation of the auditory scene into basic perceptual units, and it is a low-level automatic process in the auditory system. This leads to the perception of distinct sound sources and auditory objects within a complex auditory environment. Bregman further suggests that we also organize and group auditory information that we do not pay attention to—a heuristic process he calls ‘wrap up all your garbage in the same bundle’ [39]. This process happens parallel to the perceptual grouping of the auditory material we pay attention to and aids our ability to stay focused.
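As a toy illustration of one heuristic cue in primitive stream segregation, frequency proximity, consider grouping a tone sequence into streams. This sketch is not Bregman’s model; the 20% relative-frequency threshold and the grouping rule are arbitrary assumptions made for the illustration.

```python
# Toy sketch of frequency-proximity grouping, one cue among several in
# primitive stream segregation. Not Bregman's actual model: the threshold
# and the greedy assignment rule are assumptions for illustration only.

def segregate_streams(tones_hz, rel_threshold=0.2):
    """Assign each tone to the first existing stream whose most recent
    frequency lies within rel_threshold (relative difference); otherwise
    start a new stream."""
    streams = []  # each stream is a list of frequencies, in order heard
    for f in tones_hz:
        for stream in streams:
            last = stream[-1]
            if abs(f - last) / last <= rel_threshold:
                stream.append(f)
                break
        else:
            streams.append([f])
    return streams

# An alternating high/low tone sequence splits into two streams, loosely
# echoing classic auditory streaming demonstrations.
tones = [400, 1000, 420, 980, 410, 1020]
print(segregate_streams(tones))  # [[400, 420, 410], [1000, 980, 1020]]
```

Even this crude rule captures the essay’s point: segregation happens before, and independently of, what the listener chooses to attend to; attention then selects among the streams the low-level process delivers.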

Yet, prolonged exposure to sensory overload is often detrimental to cognitive performance as sensory segregation exhausts cognitive resources and causes fatigue. Using RR technologies to reduce irrelevant sensory input could free up cognitive processing power to enhance the performance of other tasks. What is wanted and not wanted depends on the particular domain and the task at hand and, in some cases, is a matter of subjectivity.

Several studies [e.g. 11] have argued for the need to filter sensory information to improve selective attention (e.g. in patients with autism). RR technologies could serve to minimize failure both to notice relevant sources of information (by reducing perceptual clutter) and to focus attention (by reducing the impact of potentially distracting objects or events in the user’s externality) (see [40] for more on selective attention in cognitive engineering). Also, RR is a useful concept in the design of sensable externalities that create the optimal conditions for the flow experience, where users invest all cognitive energy in a specific task and ‘forget’ everything else [41].

3.2 Presence in VR

As noted above, Hugues and colleagues [28] argue that more information in AR leads to greater mastery over reality (i.e. externality). A similar more-is-more paradigm is implicated in concepts of presence in virtual worlds [e.g. 15] and in computer games [e.g. 42]. In both cases, the belief is that increasing the realism (that is, fidelity to the sensory characteristics of externality) delivered by the digital technology will increase presence: as Slater states, ‘[o]ne way to induce presence is to increase realism’ [15]. This equation derives from the positivist view prevailing in VR: that reality is external to the subject (it thus equates to our concept of externality) rather than being, in our terms, the experience of a perceptual environment, which itself is a model of a subset of externality. Hence the naïve optimism that technology alone drives experiences such as presence.

There are several issues with this approach to attaining presence, including the limitations of technology in emulating the sensations of externality in all their potential. However, as the purpose of our sensory apparatus is to filter externality and to direct attention to certain aspects of it while creating the conditions for presence in the perceptual environment [43], it becomes apparent that the more-is-more approach is the wrong one. Rather, we argue for reducing technologically derived sensation in virtual worlds (thereby freeing up processing power for other tasks) and for fine-tuning our perceptual environments by designing such worlds in accord with our natural filtering of externality and our crossmodal perception.

With regard to crossmodality (which we discuss further below) in the context of presence, it should be noted that, in these days of video-conferencing and stressed digital networks, using the auditory modality only (thereby reducing or omitting entirely the visual modality) lessens the occurrence of the cognitive dissonance and loss of presence experienced with drop-outs and image-audio synchronization issues. Reality, as the experience of environments modelled from externality, is fragile and does not readily tolerate cognitive dissonance.

3.3 Stress

Excessive noise or unwanted sound has been implicated in stress and in both negative health issues and disease arising from that stress. For example, the WHO Environmental Noise Guidelines for the European Region [44] highlights noise from road traffic, airplanes, railways, wind turbines, and leisure activities and their potential health consequences: ‘cardiovascular and metabolic effects; annoyance; effects on sleep; cognitive impairment; hearing impairment and tinnitus; adverse birth outcomes; and quality of life, mental health and well-being’. As a specific example, noise, as unwanted or excessive sound, is a particular problem in hospitals (for a review, see [4]). This can lead to stressful lives for health professionals such as doctors and nurses (see, e.g. [3]), mistakes in operating theatres (see [5]), and negative health outcomes for patients (see, for instance, [45, 46]).

It seems quite clear that an RR paradigm that targets excessive or unwanted sound in hospitals would contribute to better outcomes for both staff and patients. Approaches so far mainly comprise the use of music in operating theatres (a form of XR which may have a masking effect, though studies have yet to produce conclusive results on its efficacy; see [5] for instance) and treatment of the operating theatre or ward acoustics (e.g., [47]). The use of VR technologies to reduce everyday auditory realities experienced in hospitals, either through masking or filtering unwanted stimuli, remains to be comprehensively tested.

3.4 Hedonic experiences

RR techniques can also be used to filter auditory stimuli to enhance the enjoyment of sensory experiences primarily related to other modalities. In a review of research on the hedonic in human–computer interaction (HCI), Diefenbach and colleagues [48] show how the hedonic is often contrasted with the task-oriented, and that users often spend more time performing a task (or lose track of time) when the hedonic is valued rather than the task itself (see also [49]). The function of food consumption, for example, is often both nutritional and hedonic [50], and several studies show how sound waves may increase the hedonic value of food consumption by modulating taste [51] and odour perception [52]. In this way, filtering auditory information functions as a form of sonic flavouring [53] (or de-flavouring) that enhances or suppresses specific aspects of a taste.

While sound stimuli and music are often added as complementary stimuli in restaurants to attempt to create a pleasant atmosphere around food and beverage consumption, excessive background noise may also have a detrimental impact on the dining experience. Loud background sound (music, conversations, fan noise, and so on) has been shown to decrease taste perception [54], lower interest in food, and diminish the overall dining experience [55]. Block and colleagues [56] argued that if we are exposed to overstimulation that we are unable to regulate (either through physical or cognitive regulation [category 1–4] or through regulation of our externality, e.g. if we are unable to turn down excessive background noise), we often achieve sensory balance by regulation in another sense (e.g. ordering less spicy food). Also, restaurants with excessive background noise are often associated with low-quality food [57].

Clearly, sound stimuli and music in restaurants, bars, at home, and so on are useful for many purposes, for instance, enjoying music, but, as we argue here, RR approaches may be useful to manage loud levels of sound and music during other hedonic activities, such as enjoying the taste of food.

4 Strategies to reduce reality

While research articles specifically on auditory RR (or auditory diminished reality for that matter) are virtually non-existent (on noise cancellation techniques, see below), several studies have presented interfaces that alter the user’s visual field and applications for mobile devices that remove undesired objects or persons from real-time video recordings (see [32] and [58]). Future generations of smart glasses will likely be able to identify specific unwanted elements and remove them from our visual surroundings. Examples of possible uses include the removal of objects from the driver’s vision in cars to visualize otherwise occluded road scenes [59], the removal of unnecessary things such as visual clutter on an office worker’s desk to maintain focus on work [60], the removal of irrelevant body parts in anatomy teaching to increase the learning effect [61], and real-time, real-life ad-blockers [62]. Other similar technologies aim to change the user’s perspective rather than removing objects. Sakata and colleagues [63], for instance, proposed an RR system that allows the user to control the visual distance to other persons, thus reducing the discomfort when other people get too close. Also, different self-regulated auditory strategies to cope with sensory stimuli have been proposed, for instance, using metacognitive strategies (see [64]) for attention regulation when listening critically to sound in music production sessions [65].

Risko and Gilbert [66] distinguish between cognitive strategies that use internal normalization (i.e. mental processing to manage the environment) and external normalization (i.e. using the body or technology in externality to aid cognitive processing). Mentally zooming in using the cocktail party effect to better hear someone speaking in a noisy room is an example of internal normalization (category 4), while cupping your hand around your ear to more effectively funnel sound waves to the auditory channel or using your hands to bend your pinna in the direction of the sound source is an example of external normalization (category 2). They further argue that the choice of external strategies over mind-internal strategies to regulate and manage the environment is related to internal demand (i.e. available cognitive resources in the specific situation) and metacognitive evaluation (e.g. the user’s confidence in internal strategies versus their trust in external strategies). Thus, while we already and always reduce reality using our brain and body, an increase in sensory information or a lack of confidence to cope with this information (in a specific domain) may call for external RR strategies.

In the following subsections, we discuss the role of auditory and listening strategies that reduce or alter auditory stimuli and aid the user’s construction of a less cluttered and potentially less stressful perceptual environment.

4.1 Current auditory strategies

Maintaining a personal space while still being physically present in externality by removing only distracting or uncomfortable elements from reality (derived from externality or not) is one of the big challenges for today’s auditory reduction technologies. There are several examples demonstrating what is currently possible, or at least attempted, of which we briefly describe just a few. While most of the strategies are thoroughly entrenched in other fields, they figure little in the VR field because of that field’s more-is-more paradigm.

To start with a somewhat unpleasant example, Windsor’s [67] study of interviews with former war detainees shows how music in interrogation is used for sensory deprivation and perceptual distortion of externality. When detainees are exposed to loud and foreign (i.e. unfamiliar) music, the ability to reduce sensory information—both consciously and subconsciously (category 1–4)—is interrupted, and background sounds are masked. This situation, Windsor argues, not only masks causal relations in externality but makes the search for causation pointless: ‘The only causation to be perceived is that of the interrogator choosing to play or stop playing the music’ [67].

Altering or reducing our everyday mode of listening (see [10]), where we listen for sound sources (locations and characteristics of things in externality), however, might serve a more noble purpose. Different forms of aestheticization and medialization processes [68], for example, promote listening modes that reduce or remove our awareness of certain sound sources in externality in order to focus on other auditory elements of that externality. Famously, Schaeffer [69] argued that a new listening mode—reduced listening—emerges from the removal of our everyday causal listening strategies (i.e. the ‘search’ for sound sources in the listener’s externality). Reduced listening is achieved by perceptually removing a direct reference to externality—either cognitively (by insisting on reduced listening to otherwise recognizable sounds) or by designing sounds that promote reduced listening. This allows the listener to focus on the appreciation of the sonorous (timbral) qualities of sounds. Thus, while reduced listening precludes—visually and auditorily—the perceptual relation to the sound source, it is an enhanced form of aesthetic listening (see [70]). According to Schaeffer, reduced listening does not occur naturally in everyday life, as we are inclined as humans to search for information about sound sources in our externality (see also [10]). Reduced listening is a learned practice that we must make an active cognitive effort to perform to fully appreciate sound’s timbral characteristics. Here, we argue that reduced listening fits the RR framework as a metacognitive strategy (self-planning for selective attention) to filter auditory information and reduce cognitive load.

The use of acoustic isolation technologies must be mentioned here not only because such usage is long-standing but also because it is so pervasive an approach to auditorily reducing our experiential reality. Such technologies traditionally comprise building materials and construction techniques but more recently include active noise cancellation devices (more on this below) in specialist settings such as concert halls. The physical construction of public and personal space is the primary means by which humankind, over millennia, has used auditory strategies which filter sonic externality and so reduce our realities. Yet, it is costly, time-consuming, non-portable, and generally inflexible once built.

Tinnitus (that is, subjective tinnitus) and its treatment are good instances with which to exemplify the perceptual basis for our realities and strategies to reduce those realities. Tinnitus, as a result of prolonged exposure to damaging sound pressure levels or certain cases of hearing damage such as illness or physical trauma, manifests itself as anything from a high-pitched or low-pitched, sinewave-like sound to a low-volume, white noise-like sound. There are no sound waves present, but such disturbing sounds are very much a part of the individual’s reality, so much so that the sounds themselves often reduce the complexity of that reality by masking. They can be disturbing enough that a variety of devices and therapies of varying efficacy are available to sufferers, all of which function by reducing reality—devices use external sounds to mask or distract the patient, and habituation therapies modulate any neural hyperactivity and trick the brain into treating the tinnitus as an unimportant sound [71].

Headphones with active noise cancellation are good examples of technologies that cancel out only a part of the incoming audio frequency spectrum (generally static, low-frequency noise), which might allow for better intelligibility of other sounds (presently, it is used mainly in combination with music listening). There have also been attempts to produce earphones which equalize and filter incoming sound waves (e.g. the ill-fated Hear One earbuds). Sound masking—the opposite principle to noise cancellation—adds unstructured noise (white noise) to the disturbing sound signals. This diminishes the intelligibility of, for instance, speech sounds in the room and the impact of other abrupt sounds. Thus, sound-masking systems hide sound sources by reducing the listener’s cognitive awareness of existing abrupt sounds (specific sound sources) in externality. In contrast, active noise cancellation reduces the amplitude of sound waves before they reach the listener’s ears, thus functioning as a form of hear-through technology.
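The difference between the two principles can be sketched numerically. The following is a minimal, idealized illustration (it assumes the low-frequency hum is perfectly known in advance, which real adaptive ANC systems only approximate): cancellation removes acoustic energy by summing the signal with its phase-inverted copy, while masking adds energy so the disturbing component loses salience.

```python
import numpy as np

SR = 16_000                       # sample rate (Hz)
t = np.arange(SR) / SR            # one second of audio

# Disturbing externality: a loud 120 Hz hum plus quieter speech-band content.
hum = 0.8 * np.sin(2 * np.pi * 120 * t)
speech = 0.2 * np.sin(2 * np.pi * 1000 * t)
incoming = hum + speech

# Active noise cancellation: emit a phase-inverted copy of the predictable
# low-frequency component; the waves sum to (near) silence at the ear.
after_anc = incoming + (-hum)     # only the speech band remains

# Sound masking: leave the incoming sound untouched but add unstructured
# (white) noise on top, diminishing the intelligibility of the hum.
rng = np.random.default_rng(0)
after_masking = incoming + 0.5 * rng.standard_normal(SR)

def rms(x):
    """Root-mean-square level, a simple proxy for acoustic energy."""
    return float(np.sqrt(np.mean(x ** 2)))

print(f"incoming:      {rms(incoming):.3f}")       # baseline level
print(f"after ANC:     {rms(after_anc):.3f}")      # lower: energy removed
print(f"after masking: {rms(after_masking):.3f}")  # higher: energy added
```

The RMS levels make the contrast concrete: cancellation lowers the total level (leaving the speech band audible), whereas masking raises it, hiding the hum perceptually rather than physically.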

Yet, while active noise cancellation and sound masking reduce the impact of incoming sounds, thus preventing them from interfering with the user’s mind-internal cognitive tasks (e.g. imagination) or music listening through headphones, they do little to guide the user’s attention to specific events or tasks in that part of externality that is within the user’s auditory sensory horizon. In the following section, we discuss potential auditory strategies that could form the basis for future audio RR technologies.

4.2 Potential auditory strategies

The strategies and technologies discussed above involved filtering and/or masking solely in the domain of sound. More promising, we suggest, is the use of filtering and/or masking approaches that are crossmodal or multimodal in concept and design.

Examples of technologies that aestheticize externality’s sounds and promote reduced listening include the smartphone app Hear and the now-discontinued app RjDj (from the same company—Reality Jockey Ltd.) which generate a non-linear form of music or sound design which reacts to, and is created from, the listener’s immediate externality and the listener’s movement in that externality in real time. This form of sound generation has latterly been defined as reactive music [see 72]. While the makers of RjDj and Hear call the auditory feedback of their products augmented sound, we argue that the apps point towards a reduced reality paradigm since the primary effect is the aestheticization of sensory stimuli to facilitate a perceptual reduction of reality rather than an addition to it. The potential of apps such as RjDj and Hear is not only to aestheticize externality auditorily and thereby filter out its causal contextual features, but also—by creating the foundation for the emergence of reduced listening—to further a renewed focus on the intrinsic features of sound. This renewed focus also leads to new forms of reasoning about sound: When listening to sounds-in-themselves (i.e. sounds perceptually detached from their physical source, or what Pierre Schaeffer called ‘sound objects’ to distinguish the auditory units that emerge in reduced listening from sound sources emerging in causal listening [69]), listeners access new forms of crossmodal perceptions grounded in embodied sensory-motor experiences [see 73].

Research in HCI has long used affordance theory [22] to guide the creation of multimodal interfaces that better align with the user’s mental models to reduce cognitive load and make interactions more seamless and natural. This helps in creating, for instance, more inclusive and adaptable interfaces that allow users to interact with systems using their preferred mode or a combination of modes [74]. Here, we focus on the intentional activation of action-oriented cognitive images as the basis for a strategy for RR designers. For instance, several experimental studies [75, 76] have shown how sound can activate motor areas of the brain and aid in the performance of specific gestures and specific tasks. Furthermore, theories of neuronal grouping have shown that if different sensory stimuli (processed cognitively at the same time) fit into the same overall cognitive scheme, the stimuli are easier to process [see 77] and are more easily recalled from memory [78]. Conversely, if two stimuli afford inconsistent ways of perceiving them, one stimulus can sometimes inhibit or reduce the experience of the other, leading to confusion or slower processing. Here, and as noted above, such cognitive dissonance has negative consequences for a required perception and consequent cognition, but, we suggest, it might be possible to use certain forms of cognitive dissonance as a strategy VR designers can use to create technologies that balance the user’s environment.

Spence and Wang [51] noted a similar principle in a study of the effect of sound on the taste of food, observing how specific auditory stimuli reduce the effect of particular taste stimuli—a cognitive effect called crossmodal masking. For example, Yan and Dando [79] found that loud, low-frequency noise (such as noise from airplane cabins) significantly reduces the perceived sweetness of food. Hence, if these same sounds are attenuated, for instance with noise cancellation systems, the full potential of the sweetness experience can be restored.

Auditory strategies for RR may also aid in balancing mind-internal activities, such as mind-wandering. Mind-wandering is typically characterized as task-unrelated thinking, a common and mostly constructive activity in everyday life. However, mind-wandering may appear as a cognitive disturbance that competes for attention when the aim is to attend to other tasks in the user’s environment. This is particularly problematic for persons with poor attentional control or other attention deficiencies, such as ADHD. Recent studies suggest that it is possible to reduce mind-internal distraction with music and background sound, such as white noise [80], and this may improve task performance and attention to specific sensory stimuli in the user’s externality.

Related to this, Seiça and colleagues [81] call for a greater focus on aestheticization in sonification design. The authors argue that aestheticization does not hinder data communication (the purpose of sonification design); instead, they suggest that an aesthetic turn, away from mere functional data translation, could awaken ‘the natural potential of listening’ and develop more meaningful and enjoyable interactions in the process of decoding sound.

Future technologies will likely provide us with more sophisticated ways of segregating external sound stimuli according to the user’s needs. For instance, using eye-tracking technologies to determine what the user is looking at (see [82] for an example of such a system) in combination with future audio-scene segregation technologies, it might be possible to enhance or even isolate sound from the sound sources the user is looking at by reducing the impact of sound sources outside the user’s visual horizon.
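As a thought experiment, such a gaze-contingent filter could be reduced to a simple per-source gain rule. The function below is purely hypothetical: the cone width, the linear roll-off, and the gain floor are our illustrative assumptions, not parameters of any existing eye-tracking or scene-segregation system.

```python
def gaze_weighted_gains(gaze_azimuth_deg, source_azimuths_deg,
                        fov_deg=60.0, floor=0.1):
    """Hypothetical attenuation rule: sound sources within the gaze cone
    keep full gain; sources outside it are attenuated linearly toward a
    floor, the further from the gaze direction, the quieter. A sketch of
    the idea only, not a real audio-scene-segregation API."""
    gains = []
    for az in source_azimuths_deg:
        # Shortest angular distance between gaze and source (0..180 degrees).
        delta = abs((az - gaze_azimuth_deg + 180) % 360 - 180)
        if delta <= fov_deg / 2:
            gains.append(1.0)                       # inside the gaze cone
        else:
            # Linear roll-off from the cone edge out to the rear (180 deg).
            roll = (delta - fov_deg / 2) / (180 - fov_deg / 2)
            gains.append(max(floor, 1.0 - roll * (1.0 - floor)))
    return gains

# Looking straight ahead (0 deg): a talker at 10 deg keeps full gain,
# while traffic noise behind the listener at 150 deg is strongly attenuated.
print(gaze_weighted_gains(0.0, [10.0, 150.0]))
```

In a complete system, these gains would be applied per segregated source stream before remixing, so that the auditory scene follows the user’s visual attention rather than competing with it.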

5 Summary

We have highlighted the evident confusion about exactly which reality is being discussed in the broad field of VR, which includes XR, and how the concept is misused, and have presented a definition of reality as the conscious experience of a perceptual environment that is itself an abstraction of an ungraspable-in-its-entirety externality. Such a concept allows for the very different experiences of external worlds, both virtual and actual, that individuals have due to their different spatiotemporalities in that externality, past experiences, and varied sensory capabilities. Equally, the concept of reality as the experience of a perceptual environment allows for a more nuanced approach to VR design which includes the effects of crossmodality on our perception.

We have argued for the advantages of RR in cases where attention to the task at hand is required, presence in VR is desired, and stress is to be avoided. We have briefly listed and discussed some past, current, and potential future approaches to RR that are solely sound based and then have provided some examples where a crossmodal approach to RR might have advantages or allow for new VR designs. In our opinion, in most cases, the more-is-more approach of XR and VR generally is the wrong approach, and we propose instead the RR concept of less-is-more. While we do not suggest the complete elimination of the XR paradigm—it has its uses—we hope we have argued convincingly enough to persuade the reader of the benefits of a reduction approach to reality, particularly where the increasing digitalization and commercialization of our lives, and an always-on existence, have the deleterious consequence of increasing our exposure to more varied, insistent, and saturated sensation.