
1 Introduction

Audio description (AD) is a widely used accessibility technique for blind and visually impaired (BVI) individuals. The general task of an audio describer is an intersemiotic translation (Vandaele, 2012), that is, a verbalization of the visual information a viewer needs to understand visual content without being able to see it. A wide body of research has demonstrated the effectiveness of audio description in various applications for BVI (e.g., Szarkowska et al., 2013) and sighted people (e.g., Krejtz et al., 2012a, 2012b). As an accessibility technique, AD has remained relatively unchanged since its introduction, which suggests its robustness. However, recent advances in visual attention research using eye-tracking methods have demonstrated clear differences between experts and non-experts in the way they perceive the visual environment (Krejtz et al., 2023b). Prior expertise may change attention patterns over objects related to that domain and induce biases when experts create audio descriptions of these objects. In turn, such biases might lower the usability of expert-created audio descriptions for non-experts. Drawing on insights from eye-movement research, we may increase AD’s usability, making it more universally designed. Extending the idea suggested by Orero (2007), we proposed an audio description organized around the gaze patterns of non-experts, the Gaze-Led Audio Description (GLAD) (Krejtz et al., 2023a, 2023b).

In this chapter, we start with a literature review on AD and the challenges that professional audio describers face in constructing it. Presenting the concept of attentional biases, we point out the potential biases in audio descriptions that might lower their accessibility for non-experts. We claim that this risk is especially relevant to fields that require domain-specific knowledge (e.g., art or historical heritage) for object comprehension. We also present early ideas for overcoming these biases. Using the example of existing accessibility systems for BVI individuals, we describe the idea of audio description of places and a novel approach to it called gaze-led audio description (GLAD). The core of the concept is to use the attentional patterns of non-experts to adjust audio descriptions of architectural art objects and thus facilitate their comprehension. The concept and its requirements are supported by the results of two empirical studies. The first presents a qualitative analysis of potential BVI users’ needs and expectations toward an accessibility system designed for sightseeing in the city space (Friendly City) and toward the audio descriptions of places included in that system (Krejtz et al., 2023a; Pawłowska et al., 2023). The second qualitatively describes eye-tracking results on the attentional bias of experts in comparison with non-experts, both presented with AD while sightseeing in the historical city of Łódź in Poland. The chapter ends with a proposal for gaze-led audio description and its application in a wide variety of fields, facilitating the inclusion of BVI individuals in cultural heritage.

2 Background

2.1 Enhancing Access with Audio Description

Audio description is a narration translating what is visible into what is heard (Audio Description International, 2005; Fryer, 2010; Vandaele, 2012). AD was primarily created for blind and visually impaired (BVI) individuals as an accessibility tool allowing inclusive access to the visual content of cultural heritage (Independent Television Commission, 2000). AD is widely studied and used in practice as an additional narration between dialogues in films (Kruger, 2010; Orero, 2007; Remael & Vercauteren, 2007; Salway, 2007), theater (Fryer, 2010), opera performances (Cabeza-Caceres, 2010), sports shows (Mazur, 2020), and museum spaces (Cabeza-Caceres, 2010; Pawłowska & Sowińska-Heim, 2016; Szarkowska et al., 2016). For example, Szarkowska et al. (2016) created accessible content in a multimedia guide application containing descriptions of selected works of modern and contemporary art in museum spaces (see also Pacinotti, 2017). In response to the needs of visitors to modern and contemporary art galleries, the proposed guide with AD explained the exhibition’s meaning and content (Szarkowska et al., 2016).

Several studies have indicated the beneficial effects of AD on media content comprehension among visually impaired adult viewers (Frazier & Coutinho-Johnson, 1995; Pawłowska et al., 2019; Schmeidler & Kirchner, 2001) and children (Palomo, 2008). For example, Frazier and Coutinho-Johnson (1995) demonstrated that visually impaired participants who watched movies accompanied by AD achieved the same level of movie content comprehension as sighted viewers. Their comprehension was also significantly better than that of visually impaired viewers who watched the same movies without AD (see also Peli et al., 1996). Similar results were found for educational TV shows (Schmeidler & Kirchner, 2001).

Listening to AD is beneficial to a much broader audience including sighted viewers when it is inserted between dialogues in the film. Eardley et al. (2017) noted that by stimulating imagery, AD can potentially enhance the experience of both sighted and BVI individuals. By creating a multi-sensory experience, enriched with narrative information, AD may increase the memorability of audio-described objects.

In an experimental eye-tracking study with sighted primary school children, Krejtz et al. (2012a) demonstrated that AD guides children’s attention toward described objects, resulting, e.g., in more fixations on specific regions of interest in educational movies. AD also sustained the attention of sighted viewers, resulting in better comprehension of the movie content (Krejtz et al., 2012a). After watching audio-described educational movies, children retrieved visual elements of the movies more easily than their peers who watched the clips without AD, and they relied more on recognition than on an elimination heuristic when making their decisions (Krejtz et al., 2012b). Another series of multimedia learning experiments (Krejtz et al., 2016) corroborated that, in a group of sighted young adults, AD facilitates focal attention (see also Velichkovsky et al., 2005) when looking at still images of visual art, which in turn enhances their comprehension and memorability.

In summary, studies on AD have gathered sufficient evidence supporting its use with both sighted and BVI users to increase the accessibility of visual content in various contexts. Therefore, creating audio descriptions is vital not only for extending access to visual content for BVI individuals but also for supporting the understanding and experience of visual content for a broader audience.

2.1.1 Audio Description Challenges. What and How to Describe

The creation of accurate audio description faces several challenges: what to describe vs. how to describe it (Vandaele, 2012), audio describers’ cognitive and emotional biases, and the specificity of the visual content (movies, live shows, and places). Vandaele (2012) addressed the issue of the narrativity of audio description in movies. He distinguished the problems of how and what to convert from a visual input into a verbal medium. The latter problem of “what to detect and select” to preserve and enhance the visual narration of movies causes more difficulties for audio describers (Vandaele, 2012). He also recognized that audio describers may be biased by their personal understanding of a movie’s narration. The proposed solution, “a double hermeneutic-heuristic procedure”, should help audio describers understand and attend to their narrative emotions (narrative states of mind) to avoid bias. In the first step, the procedure focuses on identifying narrative states. In the second step, it looks for the discursive triggers prompting these states (Vandaele, 2012).

Orero and Vilaró (2012) proposed using eye-tracking analysis of regular movie viewers to decide which details of the visual content should be audio-described. Following Navarrete (2005), they claim that AD should not extend to subtle visual elements that are hardly noticed by the sighted audience. According to Orero and Vilaró (2012, p. 304), the number of details “may reach the point of saturation where an audience can neither process nor remember any further details.” That observation is in line with the literature on cognitive load theory, which points out the limited cognitive resources available for information processing (Sweller, 2010a, 2010b). Orero and Vilaró (2012) proposed that audio descriptions in films should be adjusted on the basis of prior eye-tracking research with the potential audience, e.g., BVI individuals. To our knowledge, this postulate has not yet been implemented.

2.1.2 Attention Bias of Audio Description Experts

Attention bias is the selection of some information and the neglect of other information based on the viewer’s characteristics (e.g., expertise, knowledge, preferences, and affect). Attention biases may originate in the emotional state or in individual differences, e.g., the level of social anxiety when attending to socio-emotional signals (e.g., Krejtz et al., 2018). Cognitive biases are also related to expertise in a certain domain when attending to and/or processing information related to that domain. For example, the theory of long-term working memory proposes that experts have a higher working memory limit than non-experts when processing visual information related to their domain of expertise (Cowan, 2001). On the other hand, the information reduction hypothesis claims that expertise leads to better selection of relevant and neglect of irrelevant visual information (Haider & Frensch, 1999). Following the eye-mind hypothesis of Just and Carpenter (1980), which relates fixation duration to the cognitive processing of attended information, better selectivity in experts should lead to longer fixation durations on relevant visual information (Kieras & Just, 2018). In the context of architectural heritage, for example, such relevant information would include most elements of the façade or other characteristics of the built environment.

Based on a meta-analysis of 296 effect sizes from eye-tracking research on expertise differences in the comprehension of visualizations, Gegenfurtner et al. (2011) demonstrated consistent differences in attention patterns and characteristics between experts and non-experts. On average, experts had shorter fixation durations, more fixations on task-relevant areas, and fewer fixations on task-irrelevant areas than non-experts. These findings support, in general, the hypothesis that experts encode and retrieve information more effectively than non-experts, select more relevant information, and demonstrate a wider visual span.

In their meta-analysis, Gegenfurtner et al. (2011) also found significantly longer fixation durations for intermediates than for novices, which suggests that the processing of domain-relevant information is not yet fully automated in intermediates. Interestingly, the results showed that visual attention biases in experts are moderated by task and professional domain; however, the effects of the latter were not consistent (Gegenfurtner et al., 2011).

A more recent systematic review of the relation between gaze behavior and expertise (Brams et al., 2019) found more evidence for the domain specificity of attentional bias and information processing. Summarizing 73 eye-tracking articles, the authors concluded that selective allocation of attention toward relevant visual information is the dominant element in most experts (most studies reported more fixations of longer duration on relevant visual information), with the exception of experts in medicine. Interestingly, some studies assessing expertise in medicine report more structured scanning patterns in experts. Systematic visual attention strategies can enhance performance (e.g., Augustyniak & Tadeusiewicz, 2006; Vitak et al., 2012). Brams et al. (2019, p. 1) concluded that “large discrepancies in the outcomes of the papers reviewed suggest that there is not one theory that fits all domains of expertise.”

Recently, Krejtz et al. (2023b) revealed differences between experts (art-history and architecture students) and non-experts (psychology students) in the visual scanning of city architecture (buildings, churches, and monuments). In this experimental eye-tracking study, participants were instructed to scan and remember two-dimensional photos of architectural objects in the city of Łódź in Poland while their eye movements were recorded with a remote GazePoint HD (150 Hz) eye tracker. Experts’ attentional patterns were more focal, meaning that longer fixations were followed by shorter saccades. Additionally, the more focal the attentional pattern, the better the memory of the stimulus image. These results suggest that experts in architecture and art history tend to process visual information about architectural artifacts in a more deliberative way than non-experts, making sense of each detail of the building, e.g., those presented in its façade (Krejtz et al., 2023b). Also, Chmiel et al. (2010) used eye tracking and verbal reports to observe differences between what people normally looked at when watching scenes from a film and what audio describers included in the AD of these scenes.
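
The idea of a "more focal" pattern (longer fixations followed by shorter saccades) can be made concrete with a small computation. The sketch below is only an illustration and is not necessarily the measure used in the cited study. It assumes that each fixation duration and the amplitude of the saccade that follows it are standardized against data pooled over all viewers, and that their difference is then averaged per viewer. The function name `focal_index` and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def focal_index(scanpath, pooled):
    """Average of z(fixation duration) - z(amplitude of the next saccade).

    scanpath: list of (fixation_duration_ms, next_saccade_amplitude_deg)
              pairs for one viewer.
    pooled:   the same kind of pairs from all viewers; used only to
              standardize the two measures on a common scale.
    Positive values indicate longer-than-average fixations followed by
    shorter-than-average saccades (a more focal pattern).
    """
    durations = [d for d, _ in pooled]
    amplitudes = [a for _, a in pooled]
    mu_d, sd_d = mean(durations), pstdev(durations)
    mu_a, sd_a = mean(amplitudes), pstdev(amplitudes)
    if sd_d == 0 or sd_a == 0:
        raise ValueError("pooled data must vary in duration and amplitude")
    return mean([(d - mu_d) / sd_d - (a - mu_a) / sd_a for d, a in scanpath])


# Hypothetical data, invented for illustration only.
viewer_a = [(400, 2.0), (450, 1.5), (500, 1.0)]  # long fixations, short saccades
viewer_b = [(200, 6.0), (180, 7.0), (220, 5.0)]  # short fixations, long saccades
pooled = viewer_a + viewer_b

print(focal_index(viewer_a, pooled) > 0)  # True: more focal than the pooled average
print(focal_index(viewer_b, pooled) < 0)  # True: more ambient than the pooled average
```

Standardizing against the pooled data rather than within a single scanpath matters here: within one scanpath the two z-scores each average to zero, so the index would carry no information.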

The above-mentioned differences in attention patterns between experts and non-experts suggest that experts’ perception of objects might be very different from that of non-experts (Castner et al., 2018; Chmiel et al., 2010). Consequently, audio descriptions created solely by experts in the field might be “unnatural” and less accessible for non-experts. This supports our claim that AD should be created from the perspective of non-experts, drawing on knowledge from the analysis of their visual attention patterns.

2.2 Accessibility of City Built-Environments: Audio Description of Places

Audio description in city space can be implemented in smart-city assistive systems based on a combination of information and communication technology (ICT) and the proliferating Internet of things (IoT), to help BVI visitors fully experience and appreciate the built environment, public art, or other landmarks. In a more general sense, those systems implement the Industry 4.0 conceptual framework and solutions (Ustundag & Cevikcan, 2018). To our knowledge, such systems mainly focus on navigation through the city space, indoor navigation (Kuribayashi et al., 2021), or maintaining social distance (Kayukawa et al., 2020). Those accessibility solutions are commonly based on various technologies, such as Bluetooth, to provide audio or tactile cues, audible traffic signals at intersections, tactile paving (raised bumps or patterns on the ground), audio-based way-finding systems in public buildings, and audio guides.

Among the most current ICT-based solutions for facilitating the accessibility of city space for BVI individuals, we might mention the following systems: Mobility as a Service (MaaS) (MaaS, 2016), Safe Smart CLE, Wayfindr (Wayfindr, 2018), NaviLens (NEOSISTEC, 2017), and aBeacon (Tech, 2018). Several cities, e.g., Barcelona in Spain, Marburg in Germany, Seattle in the USA, Louisville in the USA, Helsinki in Finland, Antwerp in Belgium, and Nijmegen in the Netherlands have implemented such systems to provide accessibility to the city space and enhance the well-being of BVI citizens and tourists. An interesting example is the NaviLens system in Barcelona, Spain, which currently boasts deployment of 9100 fiducial markers affixed in 161 metro stations and 2600 bus stops. NaviLens markers are visual fiducial markers, similar to well-known ArUco markers (Garrido-Jurado et al., 2014) or QR codes in color or monochrome (Saez et al., 2020), which can be scanned with smartphones to provide extended information about the navigation route and current location.

In terms of user interaction, He et al. (2020) proposed PneuFetch, a wearable device based on light haptic cues that helps BVI people locate and reach objects in a new environment. Kuribayashi et al. (2022) successfully tested Corridor-Walker, a mobile assistive system equipped with a LiDAR sensor that helps blind individuals avoid obstacles and recognize intersections while walking along indoor corridors. By generating a 2D map of the environment, the system alerts the user with vibration and audio feedback when an obstacle or intersection appears on the selected walking path. As a result, blind individuals reported being less wall-dependent while walking straight and benefited from feedback about intersections, which made their walking path less challenging. In another project, Kayukawa et al. (2020) combined an RGB-D camera (to detect the positions of target objects) and a LiDAR sensor (to create a 2D map of the surrounding environment) in BlindPilot, a system that navigates blind individuals in city spaces, e.g., to an empty seat in public transport. Compared to sound-based navigation, BlindPilot allowed users to reach destinations faster and with greater safety. Kuribayashi et al. (2021) also proposed LineChaser, an assistive system based on an RGB smartphone camera that helps blind individuals find a queue and its end and move forward as the line gets shorter. By estimating the position of a nearby person, the system detects whether that person is standing in a queue and updates information about the distance to them. Audio and vibration signals navigate the user through the line, suggesting when to move forward and when to stop.

Extending the use of AD to the outdoor experience can potentially improve understanding of and access to the city’s built environment. Although still relatively under-investigated, there are attempts to use AD in city space to increase the accessibility of architectural heritage and to enhance understanding of the built environment (Boys, 2014; Pacinotti, 2022). For example, Pacinotti (2022) provided guidelines for AD aimed at bringing churches as places of worship to a broader audience within the “A Sense of Place” project, aimed at providing audio descriptions of buildings based on the joint work of design students and blind and visually impaired volunteers. As an outcome, VocalEyes offers audio-described city tours that provide the experience of examining architectural heritage and the broader built environment from an audio describer’s perspective (VocalEyes, 2007).

The present Friendly City project aims to design a city space accessibility system that will directly implement AD of city spaces, especially of historical architectural artifacts. The AD will be played on the user’s smartphone when the user approaches a point of interest in the city space (see Pawłowska et al., 2023).
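
How such a proximity trigger might work can be sketched in a few lines. The source does not specify how the Friendly City system detects that a user is approaching a spot, so everything below is an assumption made for illustration: positions are taken to be GPS coordinates, distance is computed with the haversine formula, and the 30 m trigger radius, the function names, and the coordinates are all hypothetical placeholders.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def spots_in_range(user_lat, user_lon, spots, radius_m=30.0):
    """Return the names of spots whose AD could be offered at this position.

    spots: list of (name, latitude, longitude) tuples.
    radius_m: assumed trigger radius; not a value taken from the project.
    """
    return [
        name
        for name, lat, lon in spots
        if distance_m(user_lat, user_lon, lat, lon) <= radius_m
    ]


# Placeholder coordinates, not the locations of real buildings.
spots = [
    ("spot_near", 51.7701, 19.4500),  # roughly 11 m north of the user
    ("spot_far", 51.7800, 19.4500),   # roughly 1.1 km north of the user
]
print(spots_in_range(51.7700, 19.4500, spots))  # ['spot_near']
```

A real deployment would also have to cope with GPS error in dense urban areas, which is one reason marker-based or beacon-based systems like those described above are used for localization.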

3 Study 1. Needs and Expectations Toward Audio Description of Places

Audio descriptions of architectural objects in the city space are crucial to the architectural heritage accessibility system, especially for BVI users. The aim of Study 1 was to collect the needs and expectations of potential BVI users of the Friendly City system. Knowing these expectations might help adjust audio descriptions to everyday usage. The nature of these expectations and needs calls for a qualitative approach to the data obtained during in-depth interviews (IDIs).

Different expectations might also be related to the specific characteristics of the architectural objects. For the present qualitative study, eighty-six architectural objects were selected. They all come from one industrial city that was formed in the early nineteenth century and grew dynamically in the late nineteenth and early twentieth centuries. The selected architectural objects present various Neostyle forms, from Neo-Romanesque to Neo-Baroque. We also used twentieth-century modernist or social-realist buildings of various functions: sacral (14%), residential (34%), public utility (40%), and sculptures/monuments/murals (12%).

3.1 Study 1. Method and Qualitative Analysis Results

We conducted in-depth interviews with 21 visually impaired volunteers (15 female) aged between 20 and 95 years, recruited from members of the Polish Association of the Blind. None of the participants had been visually impaired from birth. Eleven participants had a good knowledge of the city’s built environment and architecture. The interviews started with questions about the use of and experience with digital media and technology and continued with questions about knowledge of the city’s architecture. Next, the interview moderator presented the general concept of the Friendly City architectural heritage accessibility system (Pawłowska et al., 2023). The next set of questions concerned expectations of the system and, more importantly, expectations toward the audio descriptions of the city’s built environment. Participants were presented with examples of prepared audio descriptions for selected buildings representing various Neostyle forms and functions. The notes taken by the interview moderator were the subject of the following qualitative analysis. The analysis followed a basic interpretive approach with elements of thematic analysis (Lester et al., 2020) with respect to the features of audio description requested by potential BVI users.

BVI potential users expected architectural object descriptions and the Friendly City system to contain four groups of information: functional, basic metrics, core audio description, and contextual.

  • Functional information. This type of information is expected to help BVI users navigate from the bus stop to the architectural object, find the location of the object, etc. Participants unanimously stated that distances to the object should be given in meters (the local unit of distance). They claimed it would help direct their attention toward the historical or urban context of the place and object. Participants suggested locating the object with reference to nearby significant thoroughfares or squares. In their experience, such cues help them make contact with other people and passers-by while visiting the city independently, without a human guide.

  • Basic metrics. This information is expected to include the name of the architectural object, the authors’ names, and the year of creation. Participants expected to hear the basic metrics of the site before the core audio description. That information would give a better understanding of the site by providing the historical or urban context of the place and object.

  • Core audio description. The AD presents a detailed description of the object’s appearance. Study participants stressed that the audio description should be relatively short (2–3 min). In the participants’ opinion, longer audio descriptions were too demanding for their attention and negatively influenced comprehension. Participants claimed that a shorter AD would be easier to remember and to listen to attentively.

  • Context and general information. BVI users explicitly expected that the Friendly City system and object descriptions would extend their general knowledge about architecture and cultural heritage. That is why they appreciated the use of professional terms, e.g., pilaster, paneling, rustication, and tympanum. Concordantly, they expected all terms to be defined and explained. It was also emphasized that context and general information needed to be kept as optional, available in the system on demand. Nearly half of the participants also wanted information about the interiors of the audio-described buildings.
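
The four groups of information above could be captured in a simple record. The following is a minimal sketch, not the data model of the Friendly City system: the class name, field names, and the 180-second limit (taken as the upper end of the 2–3 min range requested by participants) are assumptions, and all values are placeholders. Playing basic metrics before the core AD follows the participants' expectation; placing functional information first is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_CORE_AD_SECONDS = 180  # assumed limit: upper end of the requested 2-3 min


@dataclass
class PlaceDescription:
    # Basic metrics
    name: str
    authors: List[str]
    year_of_creation: str
    # Functional information (distances expressed in meters)
    directions: str
    # Core audio description
    core_ad_text: str
    core_ad_seconds: float
    # Context and general information, optional and available on demand
    context: str = ""
    glossary: Dict[str, str] = field(default_factory=dict)  # term -> definition

    def core_ad_too_long(self) -> bool:
        """True if the core AD exceeds the assumed duration limit."""
        return self.core_ad_seconds > MAX_CORE_AD_SECONDS

    def playback_order(self, include_context: bool = False) -> List[str]:
        """Sections in playback order; context is added only on demand."""
        order = ["functional", "basic_metrics", "core_ad"]
        if include_context:
            order.append("context")
        return order


# Placeholder content, invented for illustration only.
example = PlaceDescription(
    name="Example building",
    authors=["Unknown"],
    year_of_creation="n/a",
    directions="The entrance is 50 meters from the bus stop.",
    core_ad_text="...",
    core_ad_seconds=200.0,
    glossary={"pilaster": "placeholder definition"},
)
print(example.core_ad_too_long())                    # True
print(example.playback_order(include_context=True))
# ['functional', 'basic_metrics', 'core_ad', 'context']
```

Keeping the glossary as a separate optional field reflects the participants' request that professional terms be defined while contextual material stays available on demand.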

3.2 Study 1. Results Summary

The qualitative results of the in-depth interviews with visually impaired people on the content of audio descriptions show that they expect more than a detailed description of the architectural objects. Potential BVI users of the Friendly City system requested features related to the historical context of the architectural artifacts and to navigation to them from different parts of the city. According to the BVI participants, audio description could include information on the cultural and spatial context of the buildings and other information extending their general knowledge of art and history. They also suggested that AD should be relatively short (up to 3 min). This poses another challenge for audio describers: how to include all relevant information in a short amount of time. Eye-tracking studies may help mitigate this challenge by showing how visual attention is naturally allocated to certain elements of buildings within relatively short epochs, which can inform the pacing of particular moments of the audio description.

4 Study 2. Eye-Tracking Research on Audio Description of Places

The concept of audio description of places in city space is relatively unaddressed in empirical research. There is no empirical evidence that AD can effectively guide the users’ attention in the city space, e.g., to the audio-described elements of the architectural artifacts. Similarly, there is a lack of evidence that attention patterns differ between experts and non-experts in response to the AD of architecture in city space. To address these issues, we present preliminary results of an eye-tracking study on AD of places conducted in the wild, namely in the city space, as a part of the Friendly City project in Łódź (Pawłowska et al., 2023).

The architectural structure of the center of Łódź constitutes a unique historic complex. The compact, untransformed urban layout and buildings preserved to this day are a testimony to the development of an industrial city in the historicist period. Its classicist, symmetrical layout was laid out on a north–south axis and consisted of three main parts: Old Town, New Town, and Łódka, connected by the main axis defined by Piotrkowska Street. The streets intersect at right angles creating a characteristic checkerboard layout, strongly associated with the industrial cities of England and the United States. The legibility of this layout created in the nineteenth century and its basic development until the beginning of the twentieth century means that the main cultural and administrative institutions are concentrated in a relatively small area. This layout not only makes it easier to move around the city on foot for visually impaired people but also allows them to get to know the historic architecture in a relatively short time.

The vast majority of the buildings included in the Friendly City project are related to the period of the city’s large-scale industrial development (the second half of the nineteenth century and early twentieth century). Objects created in recent decades were also included. They represent a variety of Neostyle forms: neo-Gothic, neo-Renaissance, neo-Baroque, eclecticism as well as Art Nouveau and Modernism. Each building thus has a different message expressed by the forms and details used, closely related to the building’s function and its time of construction. In broader terms, one can say that the divisions of the massing, forms, and decoration used in architecture contain the cultural code of the city. The architectural heritage of Łódź is “decoded” with audio descriptions adapted to the needs of visually impaired people as well as tourists. The Friendly City system includes both individual descriptions of objects as well as 360° presentations of important points grouping monuments, such as the area of Plac Wolności (Church of the Holy Spirit, Town Hall, Tadeusz Kościuszko monument).

4.1 Study 2. Method

Thirty volunteers (28 female), both experts in fine arts and non-experts, participated in the study. The study was conducted in the city of Łódź in Poland, which has a rich and unique architectural heritage. Participants were asked to follow a route visiting three selected historical sites from the outside.

The routes were selected in terms of proximity and ease of travel between sites, as well as diversity of form, function, and time of construction. For example, Route 1 comprised the Former Polonia Hotel (neo-classical building from the early twentieth century, one of the most important hotels in Łódź, now adapted for residential purposes), St. Olga Church (sacred, orthodox building), and Gustav Schreer Palace (representative building with neo-renaissance forms, with a garden and factory premises). Route 2 comprised the Former Esplanade (Art Nouveau building with service and commercial functions, formerly one of the most important confectioner’s houses), Tramway Station Centre (object from the beginning of the twenty-first century, modern tram shelter structure), and the Unicorn sculpture, Place for the Stars (early twenty-first-century modern sculpture by Japanese artist Tomohiro Inaba).

The sites were audio-described, and the descriptions were presented to the participants while they were standing in front of the sites. Table 1 presents an example fragment of the original AD that was read to participants while they looked at the Esplanada Building (Fig. 1c, d). Participants’ eye movements were recorded with PupilLabs (120 Hz) mobile video eye trackers while they listened to the audio descriptions. We present a qualitative analysis of visual scanning patterns, focusing on the comparison between two women experts in art history and two non-expert women while they listened to the core part of the AD (presenting elements of the building). The participants were presented with the audio descriptions of the aforementioned architectural objects to check the descriptions’ ability to guide the viewer’s attention in the most natural settings. The audio description is an essential part of the designed Friendly City system, which is why it was crucial to test its usefulness in this experiment before further development. In this very first study, the original AD was used to guide the users’ attention. The main reason was to identify potential problems in guiding attention with AD created in the original approach, which is potentially biased by the audio describers’ expert knowledge.

Table 1 Fragments of the core AD (original and gaze-led) of the Esplanada building in the city of Łódź, depicted in Fig. 1c, d

Fig. 1 Fragments of eye-movement scanpaths of an expert and a non-expert looking at buildings while listening to AD in Study 2, shown on four photographs of two multi-story buildings: Expert scanpath on Gustav Schreer Palace (a), Non-expert scanpath on Gustav Schreer Palace (b), Expert scanpath on Esplanade building (c), Non-expert scanpath on Esplanade building (d). Note: Red and blue circles represent fixations, white lines represent saccadic eye movements, and numbers next to the blue circles represent the fixation sequence

4.2 Study 2. Qualitative Results

Attention patterns of an expert and a non-expert over two selected buildings, the Gustav Schreer Palace (Fig. 1a, b) and the Esplanada (Fig. 1c, d), were qualitatively analyzed by an expert in art history. Each recording lasted approximately two minutes. Figure 1 presents a snapshot of the eye-movement patterns. As in Study 1, we present the results of a basic interpretive qualitative analysis (Lester et al., 2020; see also Krejtz & Krejtz, 2005).

4.2.1 Expert’s Attention on the Gustav Schreer Palace

The expert initially focused her attention on the body of the building, the most decorative detail of the façade, and the upper window frame. She then looked at the entrances, briefly examining the spatial relationships of the building. While listening to the core AD, the observer focused her attention directly on the indicated elements and dwelt on them for a longer time, e.g., on the whole of the window frames and the horizontal stripes of the ground floor decoration. Her attention also lingered during the presentation of the ground floor window decoration, cornice, and balcony. She did not explore the whole arrangement of divisions of the symmetrical massing; instead, she focused on the details indicated in the AD that were located on the right side of the building only.

4.2.2 Non-expert Attending the Gustav Schreer Palace

The non-expert initially focused her attention on the upper part of the body of the building and its central decoration in the form of a balcony with a decorative window and a finial (a cornice). Then her gaze followed the central axis from top to bottom, stopping for a longer time at the door decoration. She spent more time analyzing the individual parts of the object, going from left to right. While the spatial context information was being read, she followed with her eyes the parts indicated in the audio description. While listening to the core AD, she attended to the indicated parts but, rather than focusing on each element as a whole, divided it into individual details and compared the same elements on the left and right. Her gaze moved in jumps. It then followed the description being read, which focused on the central part of the building mass, again moving in jumps.

4.2.3 Expert’s Attention on the Esplanade Building

The expert initially focused on the relationship of the building to the neighboring buildings. Then her gaze settled on the ground floor and moved toward the top of the building. Later, she returned to the lower floor, parts of which she examined from right to left. She successively examined each story according to this pattern. While listening to the core AD, her gaze followed the description. It stopped for a longer time at the boundary between the first story and the finial, examining their relationship and the symmetrically arranged details. Without following the final part of the description any further, she examined the relationship of each part of the building to the neighboring buildings (height, width, and window frames). When the notion of a caduceus appeared in the AD, she fixed her gaze on it immediately, without having to search for it on the façade.

4.2.4 Non-expert Attending the Esplanade Building

The non-expert examined the building by moving her gaze from the ground floor upwards and from left to right, focusing on the visible details and accents of the story divisions. She then explored the object’s relationship to neighboring buildings (the height of the floor layout) while listening to the basic metrics. Her attention then moved to the object’s axis, with the gaze running from the lower toward the upper story and resting for longer on the accent in the finial, the caduceus. As she heard the information about the neighboring objects, she shifted her gaze first to the left and then to the right. While listening to the core AD, she focused on the indicated part of the building. When the detail in the finial, the caduceus, was presented, she focused on it only once the meaning of the term had been explained.

4.3 Study 2. Results Summary

The perception of the audio-described buildings was probably influenced by their location: the Gustav Schreer Palace was more isolated from noise and traffic than the Esplanade Building. In the first stage, the viewers mostly focused on the spatial relationships of the object, comparing heights and details (this is particularly evident for the Esplanade Building). For the Gustav Schreer Palace, the increased visual fixation of the non-expert is evident: it is three times that of the art historian. For the Esplanade Building, these differences are not as clear. The non-experts were more likely to focus first on the dominant features visible in the elevations, such as the window frames, the entrance, the finial, and the accompanying details. The attention of non-experts was also more likely to follow the read description than that of art historians. One might be tempted to say that the art historian’s way of looking covers a broader field. When the AD points out parts of the building and their details, she does not dwell on them for as long as the non-expert, concentrating more on examining the relationships between the elements.

5 Discussion and Conclusion. Toward Gaze-Led Audio Description

In this chapter, we reviewed the literature on audio description, exploring its diverse applications and associated challenges. Subsequently, we provided concise summaries of the results of two empirical studies, one involving qualitative interviews with BVI individuals and the other employing the eye-tracking method, both centered on the audio description of places.

Firstly, when these findings are integrated with the tenets of user-centered design (Abras et al., 2004; Norman, 1986, 1988), it becomes apparent that the development of audio description demands the active engagement of visually impaired users. Their qualitative perspectives, stated requirements, and opinions play a pivotal role in improving the effectiveness and relevance of the AD creation process.

Secondly, following universal design principles (Błaszczak & Przybylski, 2010; Goldsmith, 2007), we posit that audio description should also be crafted with the subjective and attentional insights from typical users (non-experts). This inclusive approach seeks to address their specific needs, cognitive constraints (Carpena et al., 2019), and other relevant considerations, thereby promoting a more universally accessible and comprehensible audio description experience.

In their work, Krejtz et al. (2023a, 2023b) introduced the concept of Gaze-Led Audio Description (GLAD), a novel approach that leverages aggregated scanpaths of novices to adjust the content of audio descriptions. Their primary aim was to align the order of visual elements in the description with more natural, non-expert viewing patterns. This methodology presents an opportunity to guide professionals in crafting gaze-cued narratives, merging the expertise of art historians and professional audio describers with the attentional patterns exhibited by non-experts. An example of GLAD implemented for architectural artifacts, compared with the original AD, is shown in Table 1.
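The reordering idea behind GLAD can be illustrated with a minimal sketch. It is not the procedure used by Krejtz et al. (2023a, 2023b), whose aggregation method is not specified here. We assume, purely for illustration, that each non-expert scanpath has already been coded as a sequence of area-of-interest (AOI) labels, that aggregation means averaging the rank at which each AOI is first visited, and that each AD segment is tied to exactly one AOI. The AOI names, the placeholder segment texts, and the function names are all hypothetical.

```python
from statistics import mean


def first_visit_order(scanpath):
    """Return the AOI labels of one scanpath in order of first visit."""
    seen = []
    for aoi in scanpath:
        if aoi not in seen:
            seen.append(aoi)
    return seen


def mean_first_visit_rank(scanpaths, aois):
    """Average, across viewers, the rank at which each AOI is first visited.

    Assumption: an AOI a viewer never looked at receives the worst
    possible rank, len(aois), for that viewer.
    """
    ranks = {aoi: [] for aoi in aois}
    for scanpath in scanpaths:
        order = first_visit_order(scanpath)
        for aoi in aois:
            ranks[aoi].append(order.index(aoi) if aoi in order else len(aois))
    return {aoi: mean(values) for aoi, values in ranks.items()}


def reorder_segments(segments, scanpaths):
    """Sort (aoi, text) AD segments by mean first-visit rank.

    Python's sort is stable, so segments with equal rank keep the
    order chosen by the audio describer. Without gaze data the
    original order is returned unchanged.
    """
    if not scanpaths:
        return list(segments)
    rank = mean_first_visit_rank(scanpaths, [aoi for aoi, _ in segments])
    return sorted(segments, key=lambda segment: rank[segment[0]])


# Hypothetical AD segments in a conventional bottom-to-top order.
segments = [
    ("ground_floor", "placeholder text about the ground floor"),
    ("upper_stories", "placeholder text about the upper stories"),
    ("finial", "placeholder text about the finial"),
    ("entrance", "placeholder text about the entrance"),
]

# Hypothetical AOI-coded scanpaths of three non-expert viewers.
scanpaths = [
    ["finial", "entrance", "finial", "ground_floor", "upper_stories"],
    ["finial", "entrance", "upper_stories", "ground_floor"],
    ["entrance", "finial", "ground_floor"],
]

glad_order = [aoi for aoi, _ in reorder_segments(segments, scanpaths)]
print(glad_order)
```

With these illustrative inputs, the segments on the finial and the entrance move to the front of the description. The wording of each segment, and any added explanation of terms, would still be left to the audio describer and the art historian.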

Within the context of the Friendly City project, adaptations were made to the original audio descriptions to align with the layperson’s perspective. The observation that individuals not versed in art history interpret the sequence and details of a building differently prompted the need for a tailored approach. The conventional AD structure, starting with the building’s plan, location, and compositional scheme, followed by a systematic progression from lower to upper stories, was reconsidered.

Non-experts tended to focus their attention on key architectural features, such as the decorative finial and the entrance area of the ground floor. This alternative approach to viewing effectively captures the architect’s intention, drawing the observer’s attention to visually prominent elements enriched with symbolic detailing. Consequently, the audio description accents were repositioned to align with the natural gaze patterns of non-experts. The modified description not only adheres to their viewing habits but also incorporates additional information about the significance of specific details.

Notably, this enrichment of information responds to the preferences expressed by interviewees, who sought immediate explanations of symbols and terms during the audio description. By seamlessly integrating gaze-cued narrative adjustments and providing relevant context, the GLAD approach not only enhances accessibility for non-experts but also contributes to a more comprehensive and engaging experience of architectural descriptions.

In this chapter, we presented audio description as a core element of accessibility systems in various contexts, with a special focus on bringing the architectural heritage of city space closer to a broader audience. We presented a qualitative analysis of scanning patterns comparing experts in fine art with non-experts, which suggested that the visual attention of non-experts was guided by the audio description presented at the site. Traditional audio description, created by experts, is not free from biases stemming from their expertise. Referring to the empirical evidence about differences in visual scanning of architectural objects between experts and non-experts, Krejtz et al. (2023a) proposed an innovative approach to the creation of AD. Eye-movement patterns of sighted non-expert viewers may serve as guidelines for audio describers, so that their narrations focus on the parts of historical buildings that attracted the attention of non-experts. In turn, gaze-led audio description may offer a more natural experience of “seeing” spaces with audio description. Whether sighted tourists and tourists with vision impairments prefer gaze-led AD over traditional AD requires additional testing. Nevertheless, the GLAD approach is consistent with current literature suggesting that the use of eye tracking for enabling gaze-guided narratives enriches the tourist experience (Kiefer et al., 2018) and is in line with the Tourism 5.0 idea, promoting creative development of inclusive travel experiences that are engaging and relevant for consumers of all backgrounds (see Chap. 10 in this Volume).