1 Introduction

Storytelling and narrative have traditionally been key for cultural heritage preservation throughout human history, embracing diverse contexts ranging from entertainment to culture and knowledge [1]. With the progressive evolution of communication means and the multiplication of supports, stories started to be passed down not only verbally but also with graffiti, paintings, poetry, theatrical plays, musical pieces, and then books, photos, and films, until the revolutionary carriers made available by the electronic and finally the digital era.

In recent years, the use of technology within digital humanities has become essential for dissemination, education, and research in the cultural heritage field. In particular, augmented and virtual reality (AR and VR, which together form a common ground of mixed reality) are relevant enabling technologies and are nowadays widely used in both academic and corporate research and development across multiple application domains, ranging from neuroscience to sports training, medicine, education, and entertainment [2]. Cultural heritage can also benefit from mixed reality experiences in a variety of areas, ranging from the reconstruction of cultural sites [3] to the creation of accessible knowledge made possible by technology-mediated direct experience [4, 5]. A soundscape, i.e., the real or virtual acoustic environment as perceived or experienced and as understood by a person in context, makes a valuable contribution to tourism and cultural heritage by enhancing cultural proposals [6]. In this context, immersive storytelling refers to the use of mixed reality techniques in storytelling, with the goal of eliciting a deeper sense of immersion and presence. Mixed reality offers innovative ways for storytelling, providing immersive experiences that can captivate and emotionally engage audiences and facilitate learning [7]. Moreover, it introduces new affordances that allow more ecological interactive experiences, as suggested by various studies (e.g., [8, 9]). Heightened engagement, emotional connection, and enhanced interaction contribute to increased interest in the subject, thereby enhancing knowledge fruition and dissemination [7]. Audio is a key element in allowing users to become more immersed in the virtual experience and can also be an important channel for information about the environment [10, 11].

Focusing on cultural heritage, disparate studies and reviews are available on mixed reality storytelling [12, 13], audio [6], and technology for audio AR [14]. However, very few comprehensive contributions that deal with these aspects all together have been found. Some recent studies put a general focus on audio mixed reality and storytelling [15], but no systematic review is available about the specific topic of cultural heritage, especially including audio as a key aspect.

The insights presented here aim to define a state of the art in the available interactive audio virtual environments for cultural heritage storytelling, and to illustrate the most useful techniques for enabling audio mixed reality experiences by reviewing the current literature in the field. Additionally, we discuss key elements that create an immersive experience in cultural heritage fruition, as well as how interactivity and personalized, emotional content can be used for educational purposes and knowledge dissemination. Finally, we investigate how audio mixed reality can help storytelling in cultural heritage and the limitations of the current research in discovering and proposing new forms of advanced interactions. From here, we will outline some promising directions. In particular, along with a systematic review of available works, we propose the following research questions:

  • Q1. What enabling technologies and platforms are available for immersive storytelling in audio mixed reality for cultural heritage? These topics will be discussed in Section 3.

  • Q2. What are the available works? What insights does the literature research provide about storytelling in audio mixed reality for cultural heritage? The theme will be analysed in Sections 4 and 5.

  • Q3. Which aspects of User eXperience (UX) and interaction are being studied with regard to storytelling in audio mixed reality for cultural heritage research? The reader will find the topics in Section 5.

  • Q4. Which implications can be drawn from the literature regarding the design of future research? These matters will be discussed in Sec. 6.

In order to identify a common characterization and to provide some answers to the questions at hand, the paper is organised as follows: Section 2 introduces the most important concepts and surveys the state of the art. Section 3 illustrates tools and technological platforms available for the design and implementation of storytelling in audio mixed reality for cultural heritage. Section 4 describes the PRISMA review method and the criteria used for the selection of works. Sections 5 and 6 present analysis results and a discussion based on the previously described criteria. Finally, Section 7 discusses the results and proposes some research insights and application scenarios.

2 Background

In the context of cultural heritage storytelling, audio rendering and mixed reality have been introduced and investigated since the seventies. In one of the early theoretical examples, the idea of enhancing the experience with multisensory elements was introduced by Youngblood [16]. The author coined the term expanded cinema to describe a form of art that goes beyond traditional cinematic media. This includes special effects, computer art, and multimedia elements. One of the first prototypes for audio augmentation in VR [17] was an automated audio guide that inserted synthetic audio into the environment based on the visitors’ position, to avoid isolation and enhance social interaction during museum visits. In another seminal study [18], the authors introduced a classification of four categories of museum visitors based on physical navigation, artwork enjoyment, and information browsing for creating an interactive audio guide prototype in AR. The model records the user’s physical movement inside the museum to dynamically classify visitors with a non-intrusive approach. Finally, in a pioneering work [19], AréViJava was presented as one of the first platforms for virtual tourism in which an avatar guides tours through virtual places. In particular, the authors describe the design process for the reconstruction of the Brest harbour site (France) as it was in 1810, enabling a virtual guided tour. During the tour, different viewpoints were suggested by the virtual guide, together with additional audio and video documents accessible through a website.

In most such works, audio is an integral part of a broader multi-sensory perspective (audio/visual, audio/haptic, audio in service of system feedback) or just one element part of a complex systemic experience, in which its specific contribution is difficult to evaluate, or no audio-specific evaluation is considered. In such a context, we believe that a systematic study highlighting the contribution of audio and mixed reality in cultural heritage storytelling would be of help to improve the sonic interaction design of cultural experiences and improve research in the field.

2.1 Sonic interactions in virtual environments: The immersion-coherence-entanglement model

This review exploits the theoretical and philosophical lens offered by the new field of study of Sonic Interactions in Virtual Environments (SIVE), which refers to the human-computer interplay through auditory feedback: “the study and exploitation of sound being one of the principal channels conveying information, meaning, aesthetic and emotional qualities in immersive and interactive contexts” [20]. SIVE provides a framework for the investigation and identification of advanced interaction opportunities. The authors propose three top-level categories that need to be addressed through interdisciplinary design work: Immersion, Coherence, and Entanglement. In the AR/VR research community, these terms have multiple definitions and slightly different meanings. In this work, we refer to the terms as defined in the SIVE framework.

According to Slater’s definition of immersion [2], two key concepts are introduced concerning the capture of subjective internal states: plausibility illusion and place illusion. The former determines the overall subjective credibility of a virtual environment; the latter is intended as the quality of a simulation in providing the sensation of “being in a real place” that can be crucial in various cultural heritage experiences, such as museum exhibitions [21]. Immersion, hence, can be defined as “the degree in which the range of sensory channels is engaged by the virtual simulation” [22]. This degree measures the technological level and its enactive potential. In mixed reality, audio is fundamental in creating a sense of immersion [23]. Immersive mixed reality should be designed in an egocentric perspective [20, 24], referring to the coordination of the perceptual/cognitive individuality of the user’s self, identity, or consciousness with multi-sensory information processed by the technological system.

Coherence concerns the plausibility of the rendering, the interactions, and possible behaviours in the virtual environment in realistic fictional experience [25]. It measures the effectiveness of the sonic interaction design and is concerned with various factors, including subjective expectations [26] and social rules [27].

Entanglement [28] describes the effectiveness of the overall sonic experience in terms of dynamic and mutual adaptations among its key actors. It measures the level of active participation of the user, the technology, and the content in what is called the locus of agency [29]. This refers to a meta-environment with technological and digital features in which each actor involved in the experience (including the user and the technological platform) can act. According to Frauenberger’s research in entanglement human-computer interaction [30], the design of computers and interaction design cannot be approached directly. Instead, the focus should be facilitating specific configurations that bring about certain phenomena. The term “agency” denotes a performative mechanism that constructs one’s sense of self by establishing boundaries. The key principle is the shift from inter-action between defined objects to intra-action within a phenomenon, where the boundaries between actors are fluidly determined within a system, i.e., a locus, similar to the Gibsonian ecological theory of perception [31]. In an egocentric perspective, the locus of agency takes shape around the listener or the natural world that is meaningful to them.

2.2 Cultural heritage storytelling

Some notable efforts to come up with guidelines for effective storytelling have recently started: according to The Center for Digital Storytelling in Berkeley (Footnote 1), digital narratives should be designed following seven specific criteria:

  • Point of View of the author

  • Gift of voice, and the register of the narration: colloquial, formal, etc.

  • Dramatic Question, the intrinsic message conveyed by the story

  • Emotional Content, the emotions that are transmitted by the narrator

  • Power of Soundtrack, audio elements in the narration

  • Economy, the amount of content and information conveyed by the narrative

  • Pacing, the rhythm of the story, in terms of time storytelling structure

Scientific literature has widely adopted criteria whose primary purpose was education (e.g. [32], for the analysis of autobiographical literacy).

With a specific focus on digital storytelling, Meadows [33] highlights four essential components:

  • Perspective: the perspective that the author wants to convey with the story consisting of emotions, presentation, and the process of encoding/decoding

  • Narrative: the story that is narrated in the specific medium

  • Interactivity: a peculiar characteristic of digital media that can be implemented, e.g., with the design of multiple storylines or choices that the user can make

  • Medium: storytelling message interpretation can be strongly influenced by the type of medium used [34].

No specific framework for digital storytelling evaluation has been found in the literature, especially in the audio domain, but some efforts have been made in this direction. Sitters et al. [35] explore digital storytelling in the health research domain, suggesting that storytelling should be evaluated for its validity in four different ways. Storytelling material should convey emotions (empathic validity), be credible (intersubjective validity), be sound and just (ethical validity), and have stakeholders play an active part in the design process (participatory validity). This last aspect is also underlined by many other studies, including a framework that considers three different dimensions for the design and evaluation [36]: aesthetics, cognition, and sociality. In sound-driven design, Dalle Monache et al. [11] argue that participatory design should involve stakeholders of different fields in order to develop a deeper understanding of the media and its potential applications. Moreover, we would like to emphasize the perspective of Sonic Interaction Design (SID, [37]), which considers sound as a primary channel for conveying not only information and meaning but also aesthetics and emotional qualities in interactive contexts. SID goes beyond mere information transmission. It encloses the wholeness of the user experience, enriching it with auditory cues that enhance immersion and engagement. In this context, storytelling can be used to enhance the overall experience, and the SIVE framework aims to help understand how SID principles contribute to crafting compelling narrative experiences. Furthermore, it is important to underline the differences between music and sound. Indeed, sound encompasses a rich spectrum of elements that contribute to the overall immersive experience, including aesthetic qualities and emotional elements. 
Since music is an art form with specific canons [38], storytelling through music can be considered a special form of storytelling that can be used to enhance the sonic environment or can be enhanced with other sonic or multimodal elements. In this work, we have recognized music’s important role in storytelling and included it as a key component. To maintain the inclusive breadth of our work, however, we have presented it as one of several options available. On the other hand, as specified in the two frameworks described above, music can be an integral part of the immersive storytelling experience.

Lugmayr et al. [39] introduce the concept of serious storytelling, namely storytelling designed with a specific purpose other than entertainment, where “the narration progresses as a sequence of patterns impressive in quality, relates to a serious context, and is a manner of thoughtful process”.

Cultural heritage storytelling can be considered a specific case of serious storytelling with some interesting peculiarities. The most important requirements of storytelling for the cultural heritage domain are preserving the correct reconstruction of historical and artistic aspects and conveying a non-ambiguous message. Various works use storytelling techniques to enhance visitor experience in cultural heritage sites or museums, e.g., navigating through a site and following a story. In this kind of situation, personal experiences and emotions can influence information comprehension and interpretation [40] and how users infer and interpret a story [39]. In other words, storytelling can be used to effectively convey cultural heritage notions and concepts for educational purposes. The storytelling creation process itself can also be used to help understand concepts and historical timelines, e.g., in classroom experience [41]. In this case, the storytelling design process should be correctly used to convey cultural heritage knowledge.

Another important aspect in various cultural heritage-related applications, such as tourism and education, is engagement. Storytelling can be a powerful means for transmitting knowledge, but it should be designed in such a way as to correctly fulfill the task, convey the desired message, and avoid errors in the interpretation [12]. In other words, storytelling for cultural heritage can include engaging elements and gamification; however, such elements should not be the primary aim. Moreover, Podara et al. [42] suggest that interactive and non-linear storytelling can be useful to elicit engagement, especially in younger people. On the other hand, the amount and type of interactive elements should be carefully calibrated to avoid the risk that users lose interest in the storyline or in the storytelling message [43]. Again, no comprehensive framework can be found to evaluate serious storytelling. However, various tests are available to evaluate specific elements such as engagement, memorability, cognitive load, etc. The methods used in the reviewed papers are discussed in Sec. 5.5.

3 Audio tools and technological platforms

The analysis of the reviewed papers shows a very diverse situation in terms of frameworks and technologies used. Thus, it is difficult to identify the main trends and durable tools over the years, especially when considering open-source or reusable solutions without license fees. Most of these technologies have not found practical application in specific cultural heritage contexts: they offer valuable insights into the landscape of available technologies but may not directly contribute to identifiable trends or prevalent practices. However, many recent immersive storytelling applications were realized using commercial software, such as Unity3D (Footnote 2) or Unreal Engine (Footnote 3), or free SDKs, such as Vuforia, ARKit, and AR Foundation (Footnote 4), during the design and development stages, while some examples of custom solutions can be found for specific purposes.

Fig. 1 Examples of technological platforms: (a) M5SAR platform, an audio guide eliciting all five senses (photo © [44]); (b) screenshot of the PlugSonic user app (photo © [45])

Platforms specifically designed for exploring spaces and providing navigation and spatial audio are described in various studies [46, 47, 48, 49]. In particular, AVIE is a system for directional audio in VR theaters [50], while other audio technologies for spatialisation in theatres are included in Beck’s compendium [51] along with a historical overview.

Various technologies are available for creating personalized sonic experiences, using both multi-channel speakers for room audio spatialization (Ambisonics, Dolby Atmos, etc.) and headphones for mobile applications. With the latter, head and body movement tracking along with dynamic binaural audio technology are crucial for auralisation, i.e., the mathematical modeling of physical audio sources in space [52], and for convincingly localizing a sound source for attentional and navigation purposes [53]. User movement can be tracked by various types of sensing devices. For outdoor position tracking, GPS information is used [54], often combined with inertial measurement units (IMU) [46, 55], while gyroscopes [56] are used for determining the precise indoor position and/or head orientation.
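To make the role of head tracking in dynamic binaural rendering concrete, the following minimal Python sketch computes the direction of a virtual source relative to the listener's tracked head orientation and the resulting interaural time difference (ITD). It is an illustration only and is not taken from any of the reviewed systems: the spherical-head (Woodworth) approximation, the head radius value, the coordinate convention, and the function names `relative_azimuth` and `itd_seconds` are all assumptions made here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 degrees Celsius
HEAD_RADIUS = 0.0875    # m, assumed average head radius (illustrative value)


def relative_azimuth(listener_xy, head_yaw_deg, source_xy):
    """Source direction relative to the head, in degrees in [-180, 180).

    Assumed convention: x points east, y points north, head_yaw_deg is a
    compass-like heading (0 = north, clockwise positive). A positive result
    means the source is to the listener's right.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    bearing = math.degrees(math.atan2(dx, dy))  # clockwise from north
    return (bearing - head_yaw_deg + 180.0) % 360.0 - 180.0


def itd_seconds(azimuth_deg):
    """Interaural time difference from the Woodworth spherical-head formula.

    ITD = (a / c) * (theta + sin(theta)), where theta is the lateral angle.
    Rear directions are folded onto the front hemisphere, so front/back
    differences are ignored in this simplified model.
    """
    lateral = math.asin(math.sin(math.radians(azimuth_deg)))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (lateral + math.sin(lateral))


# A source due east of a listener facing north is 90 degrees to the right.
azimuth = relative_azimuth((0.0, 0.0), 0.0, (1.0, 0.0))
delay = itd_seconds(azimuth)
```

With these assumed constants, a fully lateral source yields an ITD of roughly 0.66 ms. When the tracked yaw changes, recomputing the relative azimuth keeps the virtual source anchored in the environment rather than fixed to the head, which is the effect that tracking provides in headphone-based experiences. A full renderer would use head-related transfer functions rather than a single delay.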

Audio is combined in a multimodal interaction through the five senses in the M5SAR platform (Fig. 1-a) [44, 57], by using a custom-built mobile device along with a mobile app that recognizes artwork in a museum. PlugSonic [45] (Fig. 1-b) is a web-based platform for spatial audio storytelling in cultural heritage. Another storytelling design tool is MOGRE [58, 59], a software for children aimed at the creation of 3D scenarios and stories, based on the Ogre (Footnote 5) graphics engine.

Ec(h)o [60, 61] is an engine for audio material classification and real-time search based on user position tracking and preferences. Proper audio material, classified by using a given ontology, is inserted in a database and presented in real time to the user, who interacts with a tangible interface (cube) while moving. Practical guidelines and conceptual frameworks are available on cultural experience design. For example, “Adaptive Augmented Reality” (A2R, [62]) is an AR museum guide architecture that recommends content based on gestural and biometric information to elicit interest and engagement during museum visits. Katz et al. [63] suggest that acoustic space is an important aspect to consider in communicating acoustic heritage and producing audio experiences. Moreover, they introduce Past Has Ears (PHE), a hardware/software prototype for the presentation of immersive audio experiences adaptable to multiple platforms. Polaris~ [64] is a wearable platform creating privacy-respectful audiovisual AR experiences to foster artistic and musical expression. Polaris~ comprises an open-source AR headset (Project North Star), a pair of bone-conduction headphones, and software developed in Unity and PureData. Meyer [65] suggests guidelines for story design in a spatial context, based on a literature review, suitable for interactive audio-visual narrative and 360° films. In their work, Fujinawa et al. [66] hypothesized that a comfortable sound field could induce different behaviours in human movement and navigation in a physical space. In a laboratory study, users are left free to move in a room where sound is diffused with different pressure levels by motor-controlled moveable speakers in three different audio stimuli conditions (white noise, jazz music, no sound).
By using questionnaires and position tracking, user preferences and time spent in areas with different loudness levels have been analyzed, overall showing longer stays in quieter areas when audio stimuli were evaluated as unpleasant. Kenderdine et al. [67] describe the potential application of computational practices within archival and museological domains that could improve the preservation and cataloging of sources, provide new forms of representation and knowledge, and empower new forms of art. In this direction, Jazz Luminaries (Footnote 6) is an interactive installation that displays the connection between jazz, blues, and Latin artists using a 3D visualization in a fulldome. Other contributions can be found in interactive musical performances such as Carillon (Footnote 7) and Membrana Neopermeable [68], which use a virtual environment in which the performers interact with virtual instruments. In the supplementary materials, Appendix A provides a historical excursus of platforms and available studies regarding audio features, cultural heritage field, and storytelling capabilities.

4 Methods

While structuring this review, we followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method [6] as a guide for our work. PRISMA is widely used in diverse research areas such as medicine [69], psychology [70], and computer science [71]. The data search and paper selection process is described here.

4.1 Eligibility criteria

Selected papers or books must include research, reviews, or case studies about the use of virtual/augmented/mixed reality for cultural heritage storytelling, specifically referring to audio as a part of the discussion. Table 1 lists the inclusion criteria. It is worth noting that the authors found no work including all the considered topics, probably due to limitations binding the different fields. See, for example, the work in [72], which presents the implementation of a mobile virtual environment through a PDA as a future possibility, should technological advancements allow for it, or “KidsRoom” [73], which describes an AR interactive experience for storytelling not specifically designed for cultural heritage.

Table 1 Inclusion Criteria

4.2 Search strategy

Our search was conducted by performing an extensive query on online publication databases: Elsevier Scopus, ACM Digital Library, IEEE Xplore, and Google Scholar. In all these databases, four different clusters of keywords were used to include synonyms and related keywords. The four clusters are listed below:

  • Storytelling: storytelling, narrative, guide, guidance.

  • Audio: sound, audio, acoustic, acoustics, auditory, sonic, hearing, soundscape, soundscapes, voiceover, narrator.

  • Immersive media: Virtual Reality, Augmented Reality, VR, AR, Spatialized, Spatialised, Spatialization, Spatialisation, virtual environment, XR, extended reality, immersive media, metaverse, immersive space, immersion, virtual agent.

  • Cultural heritage: cultural heritage, museum, history, tourism, culture, archaeology, archaeological, historic, conservation, preservation, tangible heritage, intangible heritage, living heritage.

Keywords were connected by using OR inside clusters and AND between clusters. The search was performed in Title, Keyword, and Abstract in Scopus, IEEE, and ACM by using their specific query language. Due to Google Scholar’s limitation in search string length (256 characters), a different string format has been used to include a wider selection of papers. The string formats are specified in Appendix B of the supplementary materials.
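The OR-within, AND-between combination rule can be sketched in a few lines of Python. The snippet below is a generic illustration built from the keyword clusters listed above; it does not reproduce the database-specific strings of Appendix B, and the names `CLUSTERS`, `build_query`, and `fits_limit` are introduced here for illustration only.

```python
# Keyword clusters as listed in the search strategy.
CLUSTERS = {
    "storytelling": ["storytelling", "narrative", "guide", "guidance"],
    "audio": [
        "sound", "audio", "acoustic", "acoustics", "auditory", "sonic",
        "hearing", "soundscape", "soundscapes", "voiceover", "narrator",
    ],
    "immersive media": [
        "Virtual Reality", "Augmented Reality", "VR", "AR", "Spatialized",
        "Spatialised", "Spatialization", "Spatialisation",
        "virtual environment", "XR", "extended reality", "immersive media",
        "metaverse", "immersive space", "immersion", "virtual agent",
    ],
    "cultural heritage": [
        "cultural heritage", "museum", "history", "tourism", "culture",
        "archaeology", "archaeological", "historic", "conservation",
        "preservation", "tangible heritage", "intangible heritage",
        "living heritage",
    ],
}


def build_query(clusters):
    """Join terms with OR inside each cluster and clusters with AND."""
    groups = []
    for terms in clusters.values():
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


def fits_limit(query, limit=256):
    """Check a query against a maximum string length (256 for Google Scholar)."""
    return len(query) <= limit


query = build_query(CLUSTERS)
needs_shorter_format = not fits_limit(query)
```

The full four-cluster string built this way exceeds 256 characters, which illustrates why a separate, shorter string format is needed for Google Scholar.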

Fig. 2 PRISMA flow diagram

4.3 Study selection

Starting from the previously mentioned inclusion criteria, Fig. 2 depicts and characterizes the PRISMA phases, which considered a total of 574 articles retrieved from four scientific databases. Duplicate papers (118 in total) were removed, and the remaining 456 papers were screened by title, abstract, and scanning, using the exclusion criteria listed in Table 2. The selected papers must have a strong research focus on cultural heritage and storytelling; therefore, works with different main topics (e.g., technologies that can also be used for cultural heritage, general sound localization studies, generic AR platforms, solutions for urban navigation, etc.) and works in which one topic was absent have been excluded. Since the audio-sensory modality was particularly important for our systematic review, we introduced an ad-hoc classification:

  • Category 1, Audio First: audio is the leading modality in the paper.

  • Category 2, Audio and Multimodal Integration: audio is considered together with other modalities (e.g., haptics, video, etc.), but specific information is given.

  • Category 3, Audio in Multimodal Comparison: audio experience is considered in comparison with others (e.g., Audio vs. Haptic feedback).

  • Category 4, Multimodal Experience: audio is present, but it is not possible to ascertain its specific audio contributions.

  • Category 5, Other experience: audio is present but not considered.

  • Category 6, No Audio: audio is not present.

All papers in Category 5 or 6 were excluded since they were of little or no interest for this review.
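The selection steps above (duplicate removal followed by the audio-category cut-off) can be summarised in a short Python sketch. It only illustrates the logic: the record structure, the matching of duplicates by normalised title, and the function names are assumptions made here, and the actual screening was carried out by the authors against the criteria in Tables 1 and 2.

```python
from enum import IntEnum


class AudioCategory(IntEnum):
    """Ad-hoc audio involvement classification used for screening."""
    AUDIO_FIRST = 1
    AUDIO_AND_MULTIMODAL_INTEGRATION = 2
    AUDIO_IN_MULTIMODAL_COMPARISON = 3
    MULTIMODAL_EXPERIENCE = 4
    OTHER_EXPERIENCE = 5
    NO_AUDIO = 6


def deduplicate(records):
    """Keep the first record per title (hypothetical matching rule:
    case-insensitive title with collapsed whitespace)."""
    seen, unique = set(), []
    for record in records:
        key = " ".join(record["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def retained(records):
    """Drop Category 5 and 6 papers, keep Categories 1 to 4."""
    return [
        record for record in records
        if record["category"] <= AudioCategory.MULTIMODAL_EXPERIENCE
    ]


# Counts reported in the text: 574 retrieved, 118 duplicates removed.
screened = 574 - 118
```

The subtraction reproduces the 456 papers that entered title and abstract screening.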

Table 2 Exclusion Criteria

4.4 Data collection process

4.4.1 Type of immersion

A general classification for mixed reality systems, based on system flexibility and experience, has been proposed by Bekele et al. [7]; it introduces two different categories for AR systems and three for VR systems. We adapted this classification to auditory information:

  • Outdoor audio AR: By using GPS or other markerless tracking techniques, people can navigate inside a vast open-air field and obtain location-aware audio information.

  • Indoor audio AR: Indoor tracking needs precise information about head orientation and user position in a limited space. The most used techniques are IMUs, gyroscopes, and AR toolkits like ARKit (Footnote 8) or Vuforia (Footnote 9).

  • Non-Immersive audio VR: The virtual environment is viewed from a desktop or screen, and audio content is conveyed from a speaker and not spatialized.

  • Semi-Immersive audio VR: Audio content is conveyed through multichannel or directional speaker systems, usually in rooms with multiple users.

  • Immersive audio VR: A user is fully immersed in an auralized virtual environment or audio content that is realistically integrated inside a physical environment, with a high level of presence.

4.4.2 Purpose of each study

Mixed reality can provide important support to cultural heritage for different purposes. Different surveys have already been published on this topic [74]. Based on these works, we identify the following main purposes for the use of immersive media technologies:

  • Education: We identify systems designed with the aim of learning and dissemination, such as mixed reality books for children [75] and audiovisual storytelling experiences [76].

  • Exhibit enhancement: A work about cultural heritage audio augmentation applied to tour visits [77] and an app for augmentation of a physical diorama [78].

  • Exploration: Applications or methods for navigating or discovering spaces or contents, such as augmented audio guides [54, 79].

  • Reconstruction: Re-creation of elements of the past inside today’s world, such as the reconstruction of Fort Sant Jean [80] and AR interviews [81].

  • Virtual museums and interactive installations: A virtual cultural heritage experience such as virtual drama [55].

4.5 Computer-aided Qualitative Data Analysis Software (CAQDAS)

As part of the research process, we conducted an automatic content analysis on the papers we considered. For this analysis, we utilized Leximancer (Footnote 10), a Computer-aided Qualitative Data Analysis Software (CAQDAS) that extracts statistical properties from a text to identify a list of terms and concepts [82]. The software identifies highly connected concepts and clusters them into higher-level groups, defined as themes.

Leximancer uses an analysis method that examines a unified body of text (in this case, the reviewed articles). The program selects a ranked list of emerging lexical terms based on their frequency and co-occurrence usage. These terms are used for creating a thesaurus, which in turn builds a set of classifiers from the text by iteratively extending the seed word definitions. The result is a set of weighted term classifiers known as concepts. The text is then classified using these concepts at a high resolution, typically every three sentences. This produces a concept index for the text and a concept co-occurrence matrix. An asymmetric co-occurrence matrix is created by calculating the concepts’ relative co-occurrence frequencies. This matrix produces a two-dimensional concept map using a proprietary clustering algorithm based on the spring-force model for the many-body problem [83]. The connectedness of each concept in this semantic network generates a third hierarchical dimension, which displays the more general parent concepts at the higher levels.
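The idea of an asymmetric co-occurrence matrix built over three-sentence blocks can be illustrated with a small Python sketch. This is not Leximancer's algorithm, which is proprietary and learns weighted classifiers from the text: here concept detection is reduced to plain seed-word matching, and the function names, the seed dictionary, and the use of non-overlapping blocks are all simplifying assumptions.

```python
import re
from collections import Counter
from itertools import permutations


def concepts_in(text, seeds):
    """Return the concepts whose seed words appear in the text.

    Stand-in for a learned classifier: a concept is 'present' if any of
    its seed words occurs as a whole word.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, terms in seeds.items() if words & terms}


def relative_cooccurrence(sentences, seeds, block_size=3):
    """Map (a, b) to the share of blocks containing a that also contain b.

    The result is asymmetric: the value for (a, b) generally differs from
    the value for (b, a) because each is normalised by its own first concept.
    """
    occurrences, pairs = Counter(), Counter()
    for start in range(0, len(sentences), block_size):
        block = " ".join(sentences[start:start + block_size])
        found = concepts_in(block, seeds)
        occurrences.update(found)
        pairs.update(permutations(sorted(found), 2))
    return {(a, b): count / occurrences[a] for (a, b), count in pairs.items()}
```

For instance, if a hypothetical concept "sound" appears in two blocks and "user" only in one of them, the relative frequency of "user" given "sound" is 0.5, while that of "sound" given "user" is 1.0. A layout step (spring-force or otherwise) would then place strongly co-occurring concepts close together on the map.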

5 Analysis and results

A total of 103 papers have been considered eligible for the study. Works were read and analyzed to collect useful information about the study and disclose research trends. All contributions, available in Appendix C of the supplementary materials, have been classified based on the purposes of the study. Among all papers, 63 describe a cultural heritage site application, and 40 are theoretical or applied studies (Table 3).

Table 3 Eligible papers divided by purpose and audio involvement level

Fig. 3 Count of articles categorized by (top left) research motivation, (top right) audio involvement level, and (bottom) publication year of eligible studies

A first important consideration concerns the role of audio components in the context of mixed reality storytelling in cultural heritage. Results in Fig. 3 show that 30 studies out of 103 have been specifically designed to evaluate the auditory component, and 37 studies consider audio in a combined [54] or comparative [176] multimodal perspective. On the other hand, in 27% of the studies, it is not possible to determine a specific contribution of audio since this component is ambiguously considered in combination with video or other elements. However, the interest in auditory evaluation is increasing, since the number of papers in the field has grown in the last few years (see Fig. 3). As also noted by Jerald [177], the interest in mixed reality is undoubtedly increasing, also thanks to new technologies and platforms available on the market.

Fig. 4 Topical map of discovered concepts. Clusters are displayed as heated circles

Figure 4 shows the main themes identified by CAQDAS analysis, and the main concepts considered:

  • Sound: includes concepts regarding audio and spatialisation, e.g., spatial, sound, position, environment, etc.

  • Story: represents storytelling and narrative elements, including narrative, emotional, music, media, history, landscape, world, etc.

  • Experience: refers to the overall user experience and strongly intersects with both content and narrative terms (such as immersive, history, natural, create, cultural heritage, digital media) and technological elements (technology, interaction, interface).

  • Social: includes concepts related to cultural heritage spaces and collaborative experiences, e.g., urban, public, site, city, landscape, performance.

  • User: underlines the centrality of the user in the SIVE experience and the need for personalisation; it includes user, audio, interaction, guide, visit.

Storytelling is central to the mixed reality experience and requires careful design in order to improve user interest in different aspects (engagement, education, experience time, etc.) while maintaining the characteristics of effectiveness and correctness of information. Content has to be built around the user, following an egocentric perspective, and should be personalised respecting each person’s particular traits and special needs. Technology must help make the experience realistic and plausible inside the virtual environment, including using collaborative elements. The main concepts discovered in CAQDAS analysis are discussed in greater detail in the following subsections.

5.1 Sound reproduction and spatialisation

Technology plays an important part in conveying a sense of presence in terms of the Immersion, Coherence, and Entanglement dimensions, corresponding to the Sound CAQDAS concepts. A plausible audio rendering requires several types of reproduction devices and spatialisation techniques [178], as also underlined by the CAQDAS analysis, which includes spatial and position in the domain, as well as real, virtual, and place, emphasizing the interplay between all experience elements, whether tangible or not. Moreover, the intersection between sound and user underlines the importance of an egocentric audio perspective. The devices identified in the reviewed works have been divided into three different categories:

  • Headphones/earphones: personal reproduction devices with two small speakers available on the market in different dimensions, ear fit (in-ear, outer-ear, over-ear), and connectivity (wired or wireless). Audio headphones can reproduce monophonic sound, diffusing the same audio information in both speakers, stereophonic sound accounting for a horizontally localized source, or binaural sound considering listener acoustics in the so-called head-related transfer functions (HRTF, [179]). Of particular interest here, stereophony conveys a sense of direction, especially if in conjunction with spatialisation models simulating a physical environment [180]. Moreover, audio headphones can isolate totally or partially from the external acoustic environment (audio transparency [181]).

  • Directional speaker: a particular type of loudspeaker with a narrow acoustic beam. Directional audio speakers provide personal content while maintaining environmental audio transparency [112], with performance limits for listeners who occupy areas outside a known sweet spot.

  • Multichannel: a loudspeaker set conveying spatial audio and permitting sound source localisation. Multichannel audio can be composed of similar or different speakers in terms of frequency reproduction (frequency response). In the reviewed papers, some particular multichannel sets have been identified in two works [55, 157]. The former uses four bone conduction speakers installed in a headband to create audio spatialisation and orient the user in a cultural site using GPS and compass sensors. The latter uses a combination of custom-built headphones and room speakers to enable a social experience in a theatre (more information is given later in this chapter).
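As a rough illustration of how stereophonic reproduction can place a source on the horizontal axis, the following sketch applies constant-power amplitude panning to a mono sample. It is a generic textbook technique, not a method taken from the reviewed works; binaural rendering would instead filter the signal with a measured HRTF pair, which is not shown. The function name `constant_power_pan` and the azimuth convention are assumptions.

```python
import math


def constant_power_pan(sample, azimuth_deg):
    """Return (left, right) values for a mono sample.

    azimuth_deg ranges from -90 (fully left) to +90 (fully right);
    values outside this range are clamped.
    """
    azimuth_deg = max(-90.0, min(90.0, azimuth_deg))
    # Map the azimuth onto a quarter circle so that left^2 + right^2 is constant.
    angle = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return sample * math.cos(angle), sample * math.sin(angle)
```

With this mapping, a centred source is reproduced with equal gain in both channels, and the summed power stays the same for every azimuth.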

Information about audio reproduction and spatialisation is available in Fig. 5. In the considered studies, the most widely used audio reproduction devices are headphones/earphones, which, as opposed to other solutions like mobile phone integrated speakers, can limit social/collaborative interaction inside museums or cultural sites [182] and modify the presence and immersion level of the system [183]. However, it is important to highlight that no sound reproduction device is specified in 48 studies (46.7%), even though it could be an important aspect of experience evaluation, especially in studies based on portable devices.

Fig. 5 Count of articles categorized by sound reproduction device and immersion type

It is worth noting that audio spatialisation and algorithm information is still sparsely treated, but interest in the topic is increasing. In fact, only 35 studies involve a technology making use of spatial audio or more than two speakers. Interestingly, 18 of these studies have been published in the last two years.

Commercial headphone spatialisers based on HRTFs (Unity or Unreal Engine with Steam Audio or FMOD) are used in various works [46, 104, 116, 133, 144, 145, 146, 153, 164, 165, 175, 184, 185, 186]. Other commercial or open-source headphone-based rendering models and libraries have been considered in the reviewed papers: OpenAL [108, 109, 142], Audiokinetic Wwise [155], Apple Xcode [175], Processing [130], or not specified [56, 110, 131, 143, 150, 159, 174, 187]. Andolina et al. [134] use a “headset capable of 3D audio”.

In works with a multichannel setup, SensiMAR [148] uses a 4-speaker setup with the Ambisonics 3D audio format, a method for reproducing audio on a spherical spatial sound field [188], at the outdoor Conimbriga archaeological site to recreate the soundscape of ancient Roman activity. “Trying to Get Trapped in the Past” [163] uses Wave Field Synthesis on a linear loudspeaker array to spatially render a virtual theatrical play. “Pigments” [166] uses Dolby Atmos for the sonification of the Pigments of Imagination audiovisual installation. Finally, a Dolby 5.1 system is used for the sonification of Roald Dahl’s “The Sound Machine” story [162].
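To give an idea of what encoding a source into an Ambisonics sound field involves, the sketch below encodes one mono sample into first-order B-format using the traditional convention in which the omnidirectional W channel is scaled by 1/√2. It does not reproduce the processing chain of any reviewed system; decoding to a specific loudspeaker layout is omitted, and the function name `encode_first_order` is hypothetical.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into traditional B-format (W, X, Y, Z).

    Azimuth is measured counter-clockwise from the front, elevation upwards.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front-back component
    y = sample * math.sin(az) * math.cos(el)  # left-right component
    z = sample * math.sin(el)                 # up-down component
    return w, x, y, z
```

A source straight ahead contributes only to W and X, while a source at 90 degrees azimuth contributes to W and Y.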

An interesting approach is used in the installation called CAVE [154, 157], a VR theatrical play on nomadic tribes of Northern Europe, whose audio setup includes a combination of custom-built off-ear headphones providing spatial audio and room speakers. The headphones convey sound effects and dialogues, while room speakers and a subwoofer diffuse soundscape and music.

Fig. 6 Distribution of articles by (top left) immersion type, (top right) orientation and position tracking, and (bottom) interaction type

5.2 Interaction with the virtual environment

Correct interaction design and evaluation are fundamental for achieving a high level of coherence and immersion in a system; in fact, the experience theme resulted from an interaction design activity in 83 out of 92 papers. In the CAQDAS analysis, experience, along with related concepts such as immersion, support, interaction, and interface, emphasizes SIVE as a key area. Fig. 6 classifies the user interactions with the system grouped by tracking method: user body position and head orientation inside the physical or virtual space, object recognition in rooms or sites, physical actions on tangible interfaces, etc.

Fig. 7 Examples of tangible devices for interaction: (a) AR-controlled playing card, photo ©[169]; (b) playback control device in [153]; (c) haptic vest with built-in sensors, photo ©[134]

The most common tracking method (55 papers) selects and personalises audio and video content based on head orientation and position in space, especially in indoor situations. In outdoor experiences, GPS is used along with compass sensors [55, 56, 79, 108, 109, 122, 135, 136, 138]; otherwise, tracking is left unspecified [77, 134, 138]. GPS is also used in an interactive artwork called “Hybrid Gifts” [161], in which a smartphone app lets museum visitors communicate with peers through geolocated audio messages in order to express and transmit their emotions while looking at paintings in a museum. For a more precise indoor user position estimate, tracking is performed by inertial measurement unit (IMU) sensors [46, 121, 123]; otherwise, it is not specified. In VR contexts, user orientation in virtual space is usually tracked using head-mounted displays (HMDs) integrating IMU sensors.
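How a GPS fix and a compass heading can be combined to drive spatialised audio may be sketched as follows: the initial great-circle bearing from the user to a point of interest is computed, and the compass heading is subtracted to obtain the direction of the source relative to where the user is facing. This is a generic geodesy formula, not the implementation of any reviewed system, and the function names are hypothetical.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2 (0 = north, 90 = east)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def relative_azimuth_deg(user_lat, user_lon, heading_deg, poi_lat, poi_lon):
    """Direction of the point of interest relative to the user's heading.

    The result lies in [-180, 180); positive values are to the user's right.
    """
    b = bearing_deg(user_lat, user_lon, poi_lat, poi_lon)
    return (b - heading_deg + 180.0) % 360.0 - 180.0
```

The relative azimuth could then feed a panning or HRTF-based renderer so that the virtual source appears to come from the direction of the site.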

Direct interaction with a Graphical User Interface (GUI) on a device screen is widely used (33 papers), especially in AR experiences, often relying on a smartphone as an interaction device, enabling at the same time the use of a wide range of built-in sensors such as IMUs and cameras as well as internet connectivity. It is important to note that no immersive or semi-immersive experience uses GUI interaction, preferring user movement tracking or tangible controllers.

Computer vision algorithms enabled by AR frameworks (ARKit, ARCore, Vuforia, OpenCV, etc.) are used to recognize objects and elements in space and to interact with an audio virtual environment in 22 works. AR is used to present context- and position-aware touristic information [111, 123, 134, 139, 148, 185] and to navigate in an AR diorama by using a portable device (smartphone or tablet) [78]. Various physical books [75, 88, 96, 97] have been digitally augmented by using devices that recognize book pages or QR codes [111, 127, 129] and show related content.

In War children [81], stories narrated by WWII eyewitnesses can be watched inside the user’s room through a smartphone screen. Huang et al. [168] describe a series of video games for touristic purposes, one of which asks visitors to find a specific image inside a cultural site by using AR to get a game reward.

Twenty-three studies use tangible devices, such as game console or VR headset controllers [86, 128, 133, 144, 145, 149, 152, 159, 165, 168, 175, 186], haptic interfaces [158], or custom-built devices. In ec(h)o [61], users indicate a preference for the presented audio content by rotating a wooden cube with coloured faces whose position is detected by a camera. A similar approach is used in [169], a comparison test in a virtual museum involving four interaction methods with different degrees of control over movement in the virtual environment and over audio guide playback. During the experiment, users can control audio by rotating a playing card (Fig. 7-a), whose movement is recognized by using the Vuforia AR framework. Users showed a preference for controlling movements directly.

In an interactive installation, Kenderdine [160] presents a custom-built console for interaction, while Kortbek et al. [112] use a three-step staircase with pressure sensors to detect visitors’ steps. Moreover, Geronazzo et al. [153] implemented a custom-built pipe controller (Fig. 7-b) with a button and a 9-axis IMU in order to control movement in an audio virtual environment. On the other hand, Andolina et al. [134] use a haptic vest with different mounted sensors (Fig. 7-c) and actuators in a navigation task inside a city. The preliminary study compares different interaction/feedback modalities (visual AR, haptic-audio guidance, and the two combined), showing a preference for the haptic-audio modality. Finally, The Time Machine [79, 135, 136] uses vibrotactile feedback together with audio produced by a smartphone for navigating a cultural heritage site. The audio-vibrotactile feedback informs the user about the distance from a specific position on the site.

Other interesting multimodal interactions were described in 8 studies. Three of them use different eye-tracking techniques to detect the user’s gaze, for example to present information about the artwork or artistic building the user is looking at in a virtual environment. In particular, Kelling et al. [167] use eye position estimation based on the IMU sensor of an HMD, Sanchez et al. [130] use gaze to control the playback of audio elements, such as effects or noises, while reading a children’s tale book, and Kwok et al. [113] use eye-tracking glasses. In a fourth study [158], one of the author’s interactive installations uses voice and pitch tracking to induce vibration in objects inside a virtual environment based on voice intensity. Voice is also used in an AR experience [185] to receive information based on the narration.

5.2.1 Users with special needs

In reviewing papers, special attention has been paid to SIVE accessibility for cultural heritage experience users with special needs. Unfortunately, available solutions often provide limited accessibility because of the unnatural or insufficient interaction methods they propose [111]. This is also noticeable in the CAQDAS analysis, which presents no concepts related to this theme (such as special needs, impairment, accessibility, etc.). Nevertheless, users with disabilities, especially low vision, can benefit from audio technologies, and some contributions can be found in audio storytelling for cultural heritage as well. In the following, we mention five research projects that provide meaningful examples of the inclusive role of audio in such a context. The Time Machine [79, 135, 136] is a touristic guide that provides vibrotactile feedback to convey distance and directional information about interesting sites, so as to keep the focus on the environment and help the navigation of visually impaired people and elderly users. MuSA [111] is an application specifically designed for people with low vision, which reads information about art pieces in a museum and then renders it through AR. By receiving an image from a smartphone camera pointing to a specific artwork, the application recognizes various elements and then provides speech information, augments the image with colour optimizations, and zooms in on its details. Greta [131] is a mobile application, specifically designed for people with low vision, that provides audio descriptions of films in cinemas. The preliminary evaluation study shows that the use of voiceovers improves enjoyment and immersion but also affects engagement. Trying to Get Trapped in the Past [163] is a virtual drama whose narrative relies on spatial audio content and which was designed with visually impaired people in mind. Finally, in a recent work, Kelly [119] describes the early-stage design of a sonification aimed at improving the accessibility of Irish cultural heritage sites.

5.3 Storytelling and personalisation

As underlined by the CAQDAS analysis (themes story, user, and experience), narrative structure (narrative, history, landscape) and the level of sonic interaction between user and audio virtual environment (sense, experience, process) are key aspects to consider when dealing with cultural heritage storytelling in mixed realities. The intersection between the story and experience themes highlights that stories should be immersive and that context is important for the narration. To enhance the sense of presence in mixed reality, especially in the entanglement dimension, storytelling should be designed to be interactive and non-linear, allowing the active participation of both users and narrations and thus eliciting higher engagement and immersion [189]. Non-linearity in narrative can be achieved by using different techniques, such as inverting the chronological order of events, creating different parallel storylines, and dynamically modifying the narrative.

The majority of studies use non-linear storytelling (67 out of 92 papers) to let users freely explore the augmented physical environment or the virtual space; conversely, 35 papers follow a linear narrative. No specific information about storytelling architecture can be found in the three remaining papers. An overview of storytelling personalisation features is given in Table 4. A particularly relevant work [105] exploits a dual non-linear structure for encouraging a collaborative approach. As soon as they start the visit to the St. Fagan Historical Museum in Wales, users have to choose between two possible partial storylines. Information about the discarded storyline can be obtained only by asking other users and discussing it with them. Moreover, artificial intelligence (AI) was employed in [159] to adapt storytelling in a VR video game story and, in Exhibot [108, 109], to generate storylines about the history of the central square in Heraklion (Greece) based on user position and orientation and third-party content services. Surprisingly, these works were the only contributions using AI techniques, although a large number of works can be found in other interactive and non-linear storytelling contexts. For instance, Riedl et al. [190] present a review of AI techniques in computer game storytelling, Hernandez et al. [191] discuss the application of AI for eliciting emotions during storytelling, and Pisoni et al. [192] introduce AI techniques for accessibility in cultural heritage, including interactive storytelling.

Table 4 Content personalisation and personalisation method in non-linear storytelling

Position and head/body orientation are often tracked and used to select a storyline. In this context, content can be pre-determined or personalized in different aspects (point of view in space, content presentation order, dynamically generated elements, and directions) by a user’s movement or explicit command. City tour maps are dynamically generated [54, 135, 136] by using the user’s navigation path, along with historical and touristic insights about buildings and monuments nearby and at the destination. In CAVE [157], linear storytelling is spatialised according to the user’s specific position inside the room. In the SARIM system [184], sound zones are associated with audio samples coherent with the specific exhibited art piece or historical device.
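The idea of position-driven content selection through sound zones can be sketched as follows. The sketch assumes circular zones in room coordinates and picks the closest zone that contains the tracked position; the `SoundZone` class, the `active_zone` function, and the sample file names are hypothetical and do not describe the SARIM implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundZone:
    name: str
    x: float       # zone centre in room coordinates (metres)
    y: float
    radius: float  # zone radius (metres)
    sample: str    # hypothetical audio file associated with the exhibit


def active_zone(zones, x, y):
    """Return the closest zone whose circle contains (x, y), or None."""
    inside = []
    for zone in zones:
        distance = math.hypot(x - zone.x, y - zone.y)
        if distance <= zone.radius:
            inside.append((distance, zone))
    if not inside:
        return None
    # Compare on distance only, so overlapping zones are resolved by proximity.
    return min(inside, key=lambda item: item[0])[1]
```

For example, with two overlapping zones, a visitor standing in the overlap would hear the sample of the zone whose centre is nearer, and silence outside all zones.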

In terms of audio content, we identify three main high-level categories: (synthetic or real) speech, environmental sounds, and music (Fig. 8). While voice is the most used material, especially in tourist or educational experiences, sound effects are used to create soundscapes and personalized content. In ec(h)o [61], audio soundscape and content are based on user preferences selected by choosing the orientation of a physical cube. In a work by Fu et al. [56], the audio soundscape is dynamically generated based on the user’s position. On the other hand, music is used mainly in theatrical or linear storytelling experiences.

Fig. 8 Distribution of articles by audio material/storytelling structure

5.4 Collaborative experiences

Collaborative and social experiences are important, especially for improving the level of Immersion and Entanglement with both technology and the user’s peers: in different cultural heritage subfields, such as music or multimedia production, eliciting collaboration among users is crucial [193]. The CAQDAS analysis underlined that social is one of the key themes found in the reviewed works. It is closely connected with the story theme (place, history, urban), which emphasizes that users can benefit from collaboration when narratives occur in public places, and with the experience theme, which, together with the previous one, suggests the possibility of new digital media that can include collaborative narratives. In our review, 10 works consider collaborative and social aspects. The majority of them are ad hoc installations in which users can interact in real/virtual shared spaces with tangible interfaces [160], voice intensity and pitch [158], or movement [142]. In [112], social interaction is fostered by a particular physical room setup in which directional speakers mounted on the ceiling deliver personal information about a physical museum without isolating visitors from environmental sounds. Hättich and Schweizer underline the importance of cinemas in fostering sociality [131]. In This Land AR [171], three users can interact with virtual musical instruments implemented in mobile devices to create a collaborative audio performance. Moreover, storytelling can be efficiently designed for eliciting discussion during or after the experience, as in the Traces [105] artistic installation/touristic guide, where different complementary linear narratives are used to enhance discussion among different visitors.

Another remarkable point of view [91] states that the use of audio guides or other mediation devices during a cultural heritage visit has the twofold effect of enhancing learning performance and diminishing social interaction. This effect was particularly evident when wearable devices such as headphones were employed, facilitating the creation of personal virtual environments without any collaboration support.

Previous works suggest that the perception of a virtual space changes with the proposed experience, and can help achieve different specific purposes. A careful design of sonic interactions is a powerful tool for inducing collaboration in a virtual environment and has to be encouraged as a means of enhancing immersion in shared virtual spaces and providing different communication channels.

The concept of proxemics, defined by psychologists as the set of implicit social rules of interpersonal distance among people, has well-known cognitive effects [194] and can help in describing immersive sonic experiences in shared spaces.

5.5 Quality of the experience and evaluation

Evaluation of different aspects is a fundamental task for creating an effective, immersive, and engaging experience. Fig. 6 shows the immersion type in the considered papers. A wide variety of experimental protocols have been used to evaluate system usability, engagement, immersion, learnability, emotional content, cognitive load, and educational aspects, and most studies rely on questionnaires and semi-structured interviews. From the authors’ point of view, the proposed user tests show an overall good acceptance level and engagement, although it is difficult to obtain quantitative information due to differences in the experimental procedure and measured features. The CAQDAS analysis does not present any reference to the evaluation of user experience or interaction among the concepts represented, probably because of the heterogeneity of the measurement instruments and traits measured. However, some common traits can be identified in current research.

In order to test user experience, validated questionnaires are used in many works. The NASA-TLX questionnaire is used [79, 113, 136] to evaluate technology in terms of Mental and Physical Demand, Performance, Effort, and Frustration and has been proven valuable to test users’ technology acceptance levels. The System Usability Scale (SUS, [195]) is a 10-item questionnaire and has been used [54, 111, 113, 148] for testing system usability. The User Experience Questionnaire (UEQ) is a 26-item questionnaire used [113, 169] to evaluate the attractiveness, perspicuity, efficiency, dependability, stimulation, and novelty of the technology. Both questionnaires have been adopted in order to avoid fatigue in users while obtaining comparable and valid results. The Simulation Sickness Questionnaire for Cybersickness (SSQC, [196]) has been used for evaluation in the different study conditions [169]. The Narrative Engagement Scale (NES, [197]) is used [131] to evaluate the narrative engagement difference elicited by an audio description of films in people with low vision. In [153], user experience in a spatialized narration task is evaluated in terms of immersion and elicited emotions. Users were asked to navigate in an audio-virtual spatialised soundscape composed of several different virtual rooms while listening to a story by using a tangible controller with different degrees of control over movement, audio spatialisation, and audio playback. In the study, the Immersive Response Questionnaire (IMX) was used to evaluate immersion and the Discrete Emotion Questionnaire (DEQ, [198]) to measure emotional content, along with heart rate measured with a non-intrusive device (a wearable band), showing an increased immersion and emotional content in conditions with spatial audio and controlled playback.
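Since SUS scores are reported across several of the reviewed studies, a short sketch of the standard SUS scoring rule may help in comparing them: odd-numbered items contribute their response minus 1, even-numbered items contribute 5 minus their response, and the sum is multiplied by 2.5 to obtain a score between 0 and 100. The function name `sus_score` is hypothetical, and responses are assumed to be on the usual 1 to 5 scale.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) from ten answers on a 1-5 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5
```

A participant answering 3 to every item obtains 50, which is the midpoint of the scale rather than a percentage of satisfied users.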

In addition to Geronazzo et al., other studies use biosignals for measuring emotional parameters, particularly arousal. Jurica et al. [106] compare Electro-Dermal Response (EDR) with a custom questionnaire for measuring arousal in soundscape navigation tasks inside a cultural heritage site, highlighting a better accuracy of the questionnaire for this specific task. On the other hand, Mansilla et al. [170] use EDR to measure the emotional content of the Virtual Acousmatic Storytelling Environment (VASE) virtual play, showing a correlation between EDR and emotional content increase. Biosignals are confirmed to be useful for emotion measurement and are suggested by the authors of this paper to be promising for use within tests employing auditory stimuli.

Moreover, mixed reality technologies have been proven to enhance performance in the education domain applied to cultural heritage in various studies [75, 76, 78, 91, 93, 95, 96]. In the majority of these studies, a custom questionnaire designed by the researchers involved in the study has been proposed to participants before and after test sessions to evaluate the differences in terms of mechanical and meaningful information memorization for educational purposes.

6 General discussion

Most of the reviewed studies in the cultural heritage domain are related to tourism and museums. In particular, audio guides are augmented with personalized content, non-linear storytelling, user localisation, and positional tracking to (i) propose an enhanced cultural heritage experience fitting the user preferences, (ii) elicit interest and engagement in users, and (iii) foster cultural heritage dissemination. In some applications, tourist information and street directions are shown, helping tourists orient themselves within a cultural site or city. In others, visitors were immersed in audio AR in which synthetic sonic environments aimed at conveying emotional engagement, re-creating historical events, or presenting position-based narratives.

Notwithstanding the vast amount of literature about narrative and storytelling, to our surprise, we found very little information about narrative content and structure. Practically speaking, because the inclusion of all the identified clusters (storytelling, immersive media, audio, cultural heritage) was mandatory, only works at the intersection of all of them were retrieved. Accordingly, this issue suggests the necessity to design new methods for a combined analysis of immersive and non-linear narrative experiences in interdisciplinary contexts. Since none of the reviewed papers explicitly considered a systematic approach to digital narratives, our work would also provide different perspectives for sonic interaction designers, practitioners, and technologists.

In our opinion, the criteria described in Sec. 2.2 are useful in defining a starting point for advanced sonic interactions in immersive storytelling. In the following, we suggest some promising research directions based on the previous guiding principles.

  • Perspective and Point of View. To provide an immersive experience, most narratives should be designed using an egocentric audio approach [20], in which sonic interactions have to be defined in terms of personalized listener-virtual environment relations. The term egocentric refers to the perceptual reference system for the acquisition of multi-sensory information in immersive mixed reality environments, as well as the sense of subjectivity and perceptual/cognitive individuality that shape the self, identity, or consciousness of the user. Such a level of personalization should avoid the mediating action of the immersive technology, which might result in a break in presence that can hardly be restored after a pause [199]. These cognitive illusions depend, for example, on the level of hearing training and familiarity with a stimulus/sound environment, and they should be taken into account from an egocentric perspective to create an immersive, coherent, and entangled experience.

  • Gift of voice. When speech is employed, the narrative is usually held in the third person for informative purposes or city directions and in the first person for conveying emotional content, such as witness experiences or event reconstructions [77, 81]. Popp [133] suggests that the narrator’s role can influence the trustworthiness of the conveyed message. In the analyzed studies, no information can be found about the narrator’s register. However, we suggest that the narration register should be personalized according to the user’s psychological state and cultural background to create a more immersive and user-specific experience.

  • Narrative, Dramatic Question, Emotional Content, and Power of Soundtrack. From the perspective of the conveyed message, various works [105, 106, 153, 156] are specifically designed for emotion elicitation. Music is mainly used in experiences with artistic purposes and linear storytelling. Non-musical elements, such as speech and sound effects, are more common in the reviewed works and are sometimes personalized or dynamically generated based on the user’s position or interaction. Again, sonic interactions in virtual environments mediated by physical devices or GUIs help modify the users’ internal psychological state during personal or collective situations, resulting in entangled experiences. Empathic technologies have a pivotal role in such modulation, tracing fluid boundaries between humans and technologies [200], and eliciting internal emotional states following the user’s expectations and emotions.

  • Economy, Pacing and Interactivity. In many works, the total time and pacing of the experience depend on the interest and involvement of the user, who seems to prefer a higher degree of control in terms of movement and interaction with the virtual environment and storytelling structure. Storytelling personalisation based on user interaction with the sonic or visual virtual environment, as well as non-linear narrative structure, provide promising evidence of eliciting users’ interest in cultural heritage and enhancing dissemination and education [92, 97]. Such a complexity might be managed by AI agents, i.e., non-human entities capable of interacting with ecological behaviors [201], that would be able to predict the user’s intentions in an intelligent environment for storytelling. Their ability to monitor listeners’ behavioural responses could balance users’ expectations and cognitive capabilities to adapt and modulate specific interactions and events [202]. More importantly, AI algorithms have the potential to encourage the exploration of different and meaningful paths within a non-linear narrative tailored to the user’s needs.

  • Medium. In the reviewed works, the immersive medium in use was clearly specified and the experience was usually designed considering its peculiarities. Based on the reality-virtuality continuum proposed by Milgram and Kishino [203], the level of isolation and the combination of real and synthetic elements can differ in audio mixed realities. Fig. 9 illustrates a continuum in the audio-specific domain, where different levels of isolation range from a completely real environment (a) to a completely isolated virtual environment. Passing through different degrees of audio augmented environment allows users to experience synthesized audio elements in the real world by using headphones or other hearing devices with a high degree of audio transparency [204]. Skarbez et al. [26] state that it is impossible to completely avoid conflict between sensory information originating in the real environment and sensory information originating in a virtual or augmented environment. Therefore, every VR experience is actually a combination of virtual (e.g., computer-generated stimuli) and real elements (e.g., the feeling of gravity), hence ending up with a mixed reality. In this perspective, the medium to consider in both an AR and a VR experience should be studied as part of a unique mixed reality continuum, thus simplifying the design work and the production of guidelines [205].

Fig. 9 Milgram and Kishino’s Mixed Reality on the Reality-Virtuality Continuum [203] with audio reproduction devices. Made with Microsoft Image Creator and Adobe Firefly

6.1 Sonic interactions - notable mentions

In the realm of sonic interaction, tangible and haptic devices are widely employed, especially in non-linear storytelling experiences. Interaction within the virtual environment is used by the system to select a specific storyline and guide the user in the virtual or augmented world. To foster a more natural interaction, several works [112, 134, 153, 160, 169] have adopted custom-built tangible devices, which typically serve as the main element of the interactive experience and are usually rated as engaging by users.
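As a rough illustration of how interaction events can select a storyline, the following minimal sketch models a non-linear audio story as a small graph. It is not taken from any of the reviewed systems: the node names, audio clip names, and interaction events (e.g., `touch_statue`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StoryNode:
    """One narrative segment; `choices` maps an interaction event to the next node id."""
    audio_clip: str
    choices: dict[str, str] = field(default_factory=dict)


# Hypothetical story graph: node ids, clip names, and events are illustrative only.
STORY = {
    "entrance": StoryNode("intro.wav", {"touch_statue": "sculptor", "walk_left": "courtyard"}),
    "sculptor": StoryNode("sculptor.wav", {"walk_left": "courtyard"}),
    "courtyard": StoryNode("courtyard.wav"),
}


def follow_storyline(start: str, events: list[str]) -> list[str]:
    """Return the audio clips played for a sequence of interaction events."""
    current = start
    played = [STORY[current].audio_clip]
    for event in events:
        next_id = STORY[current].choices.get(event)
        if next_id is None:
            continue  # unrecognised event: stay on the current segment
        current = next_id
        played.append(STORY[current].audio_clip)
    return played


print(follow_storyline("entrance", ["touch_statue", "shake", "walk_left"]))
```

In this sketch, events that a segment does not recognise are simply ignored, so the listener stays in the current segment until a meaningful interaction occurs. A real system would likely add timing, spatial audio rendering, and sensor handling on top of such a structure.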

Although few of the reviewed works address it, we would like to mention eye tracking as another promising opportunity to (i) provide information in an ecological manner during interaction with the virtual environment (e.g., [113, 167]), (ii) make the experience accessible to users with physical impairments ([135]), and (iii) augment the experience with multisensory elements such as environmental sound or tactile feedback (e.g., [130]).

Focusing on interactive installations only, we extracted some interesting insights regarding socialization and collaboration aspects. With the help of particular audio setup configurations such as extra-aural headphones [157] or directional speakers [112], or by using standard headphones with custom sensors [142], researchers recreated a social experience taking place in an augmented shared space. From the technological point of view, it was difficult to retrieve specific information about the audio speaker/headphones setup and spatialisation models, especially in studies involving mobile phones and tablets. For the sake of repeatability in research, more technical audio specifications are thus needed to evaluate immersion and to create design guidelines for cultural heritage augmented audio storytelling platforms. This is even more relevant because studies including cognitive evaluations of emotional content, engagement, interest, usability, learnability, acceptability, cognitive load, etc., report better results for audio mixed reality compared to static audio guides and onsite information panels [96, 206]. Educational aspects also benefit from augmented books and virtual AR exhibitions in different contexts: children’s narratives, travel guides, and interactive installations. In particular, children show interest in this specific technology, which opens room for future research, even though ethical issues about technology addiction must be considered.

6.2 Research directions

From our perspective, entanglement in human-computer interaction is highly desirable, as suggested by Frauenberger [28]. Medium- to long-term guidelines should integrate technological aspects and content to pave the way for truly immersive storytelling. In particular, our theoretical framework suggests a binding role of the auditory component in its fluidity of perception-action within virtual environments through an ecological perspective [207]. This means that the listener might enact cultural heritage experiences in exploring the virtual environments [208]. Accordingly, we should consider an embodied, environmentally situated perceiver where sensory and motor processes are enabled by technology and inseparable from exploratory action in a narrative space. It is also important to note that in the field of immersive storytelling for cultural heritage, the new methods introduced by AI may open up new design possibilities in terms of ease of use, involvement, and accessibility. To better understand the possibilities enabled by a proper design of SIVE, some potential research directions are illustrated below.

6.2.1 Emotional museum

In [209], Perry et al. argue that the state of the art of virtual museums does not take full advantage of all available possibilities in terms of interaction between participants, and emotional and personal development. An interesting scenario can be the use of audio cultural heritage storytelling to create a virtual museum or augmented cultural heritage site specifically designed to convey emotional content through the exploration of the environment. This kind of experience permits new forms of knowledge dissemination for educational purposes or entertainment. Specific guidelines for cultural heritage audio storytelling can be a useful starting point for the design of storytelling that conveys emotional historical reconstruction in an effective and engaging way.

Challenges: Some historical and cultural experiences can have a significant emotional impact. In this context, the SIVE framework can embrace the Research through Design (RtD) approach [210], helping to develop methods to identify the emotions to be expressed and to handle users’ emotional arousal with respect and care. Research through Design combines design practice and inquiry to better understand complex scenarios. It involves iteratively developing prototypes that allow emerging patterns to be captured. This approach recognizes that design practices are not meant only for implementation or creativity but are also valuable methods for conducting research. Such practices could be a valid methodology for exploring interactions and possibilities with non-human agents, e.g., the virtual narrator [211]. The authors recently applied this research perspective in designing an augmented reality audioguide for museums [212].

6.2.2 Immersive analytics

Especially with the new possibilities made available by Artificial Intelligence, immersive media, such as mixed reality and audio mixed reality, can be used as a new methodology for research, knowledge analysis and decision-making [213]. Immersive analytics consists of using interaction with the immersive media to support analytical reasoning and decision making, providing multimodal interfaces allowing users to immerse themselves in data [214]. A similar approach is already employed in sonification, which consists in transforming data into sound and auditory features in order to facilitate communication and interpretation. Immersive analytics, on the other hand, aims to enhance a bidirectional and entangled interaction between the user and the virtual environment [215].
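To make the idea of sonification more concrete, the sketch below shows one simple and widely used strategy, parameter mapping, in which each data value is mapped to a pitch. It is only an illustrative example under stated assumptions: the frequency range (220 to 880 Hz), the logarithmic mapping, and the `visitors_per_hour` data are all hypothetical and not drawn from the reviewed works.

```python
def sonify(values, low_hz=220.0, high_hz=880.0):
    """Map each data value to a pitch in Hz on a logarithmic (musical) scale.

    The smallest value maps to `low_hz` and the largest to `high_hz`.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    freqs = []
    for v in values:
        # Normalise to [0, 1]; constant data falls in the middle of the range.
        t = 0.5 if span == 0 else (v - lo) / span
        freqs.append(low_hz * (high_hz / low_hz) ** t)
    return freqs


visitors_per_hour = [10, 40, 25, 70]  # hypothetical data series
print([round(f, 1) for f in sonify(visitors_per_hour)])
```

The logarithmic mapping is a design choice: equal steps in the data then correspond to equal musical intervals rather than equal steps in Hz, which tends to match pitch perception more closely.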

The few tools available in the literature [216] address immersive analytics of visual media only. However, cultural heritage research can benefit from the auditory component. Many musical sources are available, especially in fields related to audio and music such as musicology and musical aesthetics, and historical audio sources can also be analyzed along with written documents. In this context, audio storytelling can be used in two ways. Firstly, it helps researchers orient themselves within the vast amount of audio material available by creating specific story paths, e.g., with particular themes or timelines based on historical facts. Secondly, researchers can implement tools to create cultural heritage audio storytelling and use the design process as a reasoning method for knowledge discovery. Moreover, the use of AI can foster collaboration between the user and the virtual environment in an entangled experience [20].

Challenges: In this type of entangled experience, proper verification and processing of sources are crucial to avoid any misleading storylines that may be generated or introduced by interaction with the virtual environment and AI’s hallucinations [217]. Moreover, integrating diverse sources of audio data, as well as designing a convincing interactive experience in mixed reality, can be complex and require robust data management and the integration of multiple 3D user interactions.

6.2.3 Archaeoacoustics, virtual musical instruments, historical voices, and personalities

The analysis of the acoustics of ancient historical places is used by many cultural heritage researchers to better understand different historical aspects of everyday living (Footnote 14). In a work by Ciaburro et al. [218], the acoustic properties (reverberation time) of an ancient Roman catacomb were measured and analyzed in order to better understand why the ancient population chose that specific prayer space. Similarly, Warusfel and Emerit [219] used available historical documents to simulate the acoustics of the ancient Egyptian temple of Dendara, dedicated to Hathor, the goddess of music, love, and joy, in order to investigate the role of sound in worship ceremonies. Moreover, the reconstruction of virtual ancient or modern musical instruments is widely used to study the main traits of instrument sound throughout history [220]. Again, in this scenario, audio storytelling can help reconstruct historical events and ancient spaces in virtual spaces with natural acoustics and sound for research purposes, education and entertainment, e.g., in museums and cultural heritage sites. Historical talks and documents played or read within a virtual or augmented reconstructed audio environment could help understand the role of space in history and better comprehend the emotions elicited by historical figures.
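For readers less familiar with reverberation time, the following sketch estimates it with Sabine's classical formula, RT60 ≈ 0.161 · V / A, where V is the room volume in cubic metres and A is the total absorption (surface area times absorption coefficient) in square metres. This is a textbook approximation and not the measurement procedure used in [218]; the room dimensions and the absorption coefficient below are hypothetical values chosen only for illustration.

```python
def sabine_rt60(volume_m3, surfaces):
    """Estimate reverberation time (s) with Sabine's formula.

    `surfaces` is a list of (area_m2, absorption_coefficient) pairs.
    """
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    if total_absorption <= 0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / total_absorption


# Hypothetical 5 m x 4 m x 3 m chamber with hard, reflective walls.
length, width, height = 5.0, 4.0, 3.0
volume = length * width * height
surface_area = 2 * (length * width + length * height + width * height)
assumed_alpha = 0.03  # assumed absorption coefficient for a hard surface

rt60 = sabine_rt60(volume, [(surface_area, assumed_alpha)])
print(f"Estimated RT60: {rt60:.2f} s")
```

Sabine's formula assumes a diffuse sound field, so it is best read as a rough first estimate; irregular or strongly coupled spaces generally call for measurements or more detailed simulation.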

Challenges: This type of scenario requires advanced simulation and reconstruction techniques that rely on technology that may not always be accessible or sufficiently advanced to represent historical acoustic environments accurately. Especially in VR and AR experiences, this can affect the correct representation and perception of spaces (e.g., positions and distances [9]), limiting or altering the planned experience. It is also essential to consider the evolution of sound environments and listening habits in historical/architectural reconstructions. For example, at a given location, external noises such as car traffic, which were not present in the reconstructed period, could affect the perception of the experience. Another example is the modification of the tuning of musical instruments, such as that introduced by J.S. Bach in the 17th century [221]. Using the previous configuration, which was conventional for listeners of the time, might sound out of tune to users.
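The perceptual effect of different tuning systems can be quantified in cents, where one cent is one hundredth of an equal-tempered semitone and an interval with frequency ratio r spans 1200 · log2(r) cents. The sketch below compares the major third in three well-known systems. These systems are chosen as generic examples and are not necessarily the ones discussed in [221].

```python
import math


def cents(ratio):
    """Size of an interval in cents, given its frequency ratio."""
    return 1200 * math.log2(ratio)


# Major third in three tuning systems (illustrative choice of systems).
major_thirds = {
    "just intonation (5/4)": cents(5 / 4),
    "12-tone equal temperament": cents(2 ** (4 / 12)),
    "Pythagorean (81/64)": cents(81 / 64),
}

reference = major_thirds["12-tone equal temperament"]
for name, size in major_thirds.items():
    print(f"{name}: {size:.2f} cents ({size - reference:+.2f} vs equal temperament)")
```

Deviations of this kind, between the interval a modern listener expects and the one a historical tuning produces, give a rough sense of why a historically faithful configuration might be perceived as out of tune.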

6.2.4 Universal fruition, multimodality

Multimodal interaction is widely used in museums and cultural heritage sites for entertainment and education but often has few or no audio elements [222], even though audio is an important feature in many experiences designed for fruition by users with special needs [223]. It is worth noting that audio storytelling has been proven to be an important accessibility medium for people with disabilities. Visually impaired people benefit greatly from it when visiting museums and cultural heritage sites, both for access to historical and touristic information and for urban navigation [224]. Moreover, storytelling was employed in connection with sound spatialisation and 3D elements in the therapy and rehabilitation of cognitively impaired children to facilitate child-therapist interaction [225]. Audio storytelling can help not only users with physical disabilities but also users with cognitive disabilities in cultural heritage knowledge fruition, improving attention and engagement [226].

Challenges: The addition of too many multimodal elements could confuse users with special needs and make it difficult for them to maintain attentional focus on the topic [43]. Moreover, ensuring the sustainability of multimodal interaction systems involves continuous maintenance, content updates, and technical support, which can be challenging. This may be problematic when moving from prototyping to implementation and ultimately to maintenance due to limited budget and organisational and political issues.

6.3 Limitations

The review process of our study may be limited due to the intrinsic constraints of our search method, i.e., the specific keyword clusters we selected for analysis. Although we carefully chose these clusters employing a multidisciplinary approach, the proposed selection might have restricted the results. Moreover, there may be some specific terminologies related to the study that we did not consider, which could have affected the retrieved works.

Although we used specific scientific repositories universally considered highly reliable, they may have missed some important studies. Furthermore, it is essential to keep in mind that the field of cultural heritage is closely tied to cultural heritage institutions such as museums and public or private foundations. These institutions may have restrictions on content production and delivery for research purposes. They already have a corpus of content and narratives related to artworks and historical sites, including written guides, audio guides, commentaries, and other multimedia products. However, such sources may not be disseminated due to copyright issues or political constraints.

Finally, creating interesting and engaging storylines and material for various platforms is difficult because of the costs associated with changing the medium, the fast pace of technological advancement, which is not financially viable to follow in the long run, and the lack of standard technologies and frameworks.

7 Conclusion

This paper systematically reviewed works on audio mixed reality storytelling for cultural heritage by analyzing recent results, identifying promising trends, discussing notable works, and proposing directions for future research. Moreover, a brief overview of platforms and applications specifically designed for augmenting audio in cultural heritage storytelling has been presented and discussed.

Audio mixed reality technologies were mainly used in cultural heritage for tourism and education, especially in mobile solutions. Information about audio setups was often incomplete. Nevertheless, audio mixed reality appeared to be a promising technology for the field, and its implementation in a broader context should be examined more closely by sonic interaction designers, especially concerning its potential to convey immersive experiences to users.

We exploited the scientific context of SIVE in analyzing the reviewed works from a more comprehensive point of view to highlight common aspects and valuable examples for engaging storytelling experiences for cultural heritage in terms of immersion, coherence, and entanglement. Audio storytelling was widely used in cultural heritage and mixed reality with a de facto dichotomy between technical discussions and narrative content. In particular, the latter aspect seems weakly connected to identifying each sound source and structure and even more to the relationship between storytelling and immersion in terms of experience, enjoyment, and learnability. Nonetheless, personalised and non-linear storytelling, along with sonic interaction, strongly contribute to a complex interrelation towards the creation of a sense of presence in virtual environments. User experience evaluations, mainly based on questionnaires and semi-structured interviews, showed an overall good acceptance level and indicated that cultural heritage storytelling in immersive environments is engaging. Users preferred interactive experiences over controlled ones, obtaining a higher level of overall entanglement and coherence, which can also be fostered by collaborative experiences.

The knowledge acquired with this overview and analysis suggests that a joint effort among sonic interaction designers coming from different backgrounds would be desirable, with the aim of connecting a multiplicity of viewpoints into a shared research agenda.