1 Introduction

One of the enduring tenets of audio description (AD) is the command to ‘say what you see’ [2]. However, while adhering to a principle of linguistic simplicity avoids placing an additional cognitive burden on blind and partially-sighted (BPS) audiences, the interpretation and contextualisation of linguistic means of expression and narrative markers in multimodal texts are key elements in effective meaning-making. Regardless of the style of AD on offer, whether descriptive, interpretive or poetic, the BPS viewer is required to infer much of the information available on screen by ‘reading between the lines’: both those supplied by the AD, and those available through other audible channels (dialogue, musical score, incidental music, special effects, etc.) which together comprise the non-visual text.

For a fully-sighted audience, deriving the meaning necessary to make sense of audiovisual narrative demands a vast range of acquired skills: verbal interpretation, visual acuity, inductive and deductive reasoning, the ability to seek out and marry together seemingly disparate audible and visual cues, the application of life experience and world or common knowledge to visual scenarios and social situations and, perhaps the most cognitively demanding process of all, inferring meaning where key information has been omitted. For the average sighted audience, the visual cues are often key to meaning-making, evidencing characters, actions, locations, iconic scenarios (e.g., a party, a day at the beach), time shifts and so forth.

Without the aid of visual cues, meaning-making becomes significantly more problematic for audiences experiencing some degree of sight loss. For instance, in a film narrative about a relationship breakdown, there would be significance in the fact that a man who previously wore a wedding ring no longer appeared to do so; and in a scenario where one protagonist is walking down the street, sees a black car and remarks to another protagonist, “the robber drove a black car”, the audience is required to make the inference that the protagonist believes he may have just witnessed the thief passing him in a black car. Yet the BPS viewer may miss these visual details and therefore experience gaps in the narrative without sufficient prompts from the AD. At the same time, the richness of the visual mode and the time constraints for inserting AD fragments into the existing audio track generally do not allow the audio describer to comprehensively describe every salient element of the narrative action. Many aspects of an unfolding narrative are therefore intentionally omitted from the AD, leading to AD having previously been referred to as a ‘partial translation’ [3].

To this extent, AD differs from other types of video-to-text translation, e.g., video content description created for the purposes of archive retrieval ([7]: 18–19). Furthermore, as well as being complete texts, video content descriptions tend to be descriptive rather than interpretive or poetic in nature, lending themselves more readily to (semi-)automation. Machine-based video description techniques are being explored widely in the computer vision world as a way to enhance the commercial efficiency and speed of retrieval in relation to video content archives, but also as a forerunner to broadening narrative assistance for consumers of video material in applications like social media. Fundamental differences between human AD and machine-based video content description, namely, prioritisation and omission strategies versus completeness, and idiosyncratic exposition based on the style choices of an individual (human) describer versus prosaic/unelaborated description by the machine, are significant to the future evolution of AD, particularly in relation to the development of machine-assisted AD.

AD research to date has considered how the audio describer selects and prioritises the visual elements to be included in the AD (e.g., [22]) and how best to describe the selected elements (narrative/descriptive, alternative styles, etc.; [23]). However, the impact that omissions in AD have on the target audience has received little attention.

Visual cues that might be considered narratively salient but are not captured within the AD may conceivably still be retrievable by sight-impaired audiences using other audio cues and prompts. Whilst narrative salience is difficult to operationalise, visually salient narrative cues are defined here as cues that are central to the plot and contribute to creating a coherent story in the viewer’s mind. From an ‘incomplete’ AD text, the BPS viewer may be able to create meaning by combining the AD prompts with cues derived from dialogue, musical scoring, non-verbal utterances, sound effects and life experience (situational knowledge and social sensitivities) in a process of inferential meaning-making [6] that culminates in a sufficiently detailed mental model. However, in terms of cognitive processing, the additional level of inferencing required by a BPS viewer reliant on AD as a substitute for visual cues in order to build that mental model may be significant ([14]: 121–124). Some information will inevitably remain irretrievable to the audience, but even where this is the case, understanding the types of omission that are likely to be irretrievable, and the difficulties this creates for both the BPS audience and efforts to semi-automate AD, makes a study of omissions both relevant and timely. Hence, this paper considers the issue of omissions in the AD, addressing key questions in relation to the production of traditional (human-derived) AD, as well as problematising the concept of omissions where machines are used to replicate parts of the human AD workflow. To this end, our study sought to discover: (i) the extent to which prompts or cues omitted from standard AD may be retrieved from other sources by sight-impaired audiences, including instances of achronological (to visuals) AD interventions; (ii) the strategies to be applied by the BPS audience in order to retrieve the ‘lost’ cues, either from the broader narrative or through the application of other human-centric knowledge and resources; (iii) whether some ‘lost’ cues remain irretrievable to the BPS audience and why this might be the case; and (iv) the implications of AD omissions for the future of machine-assisted audio description production.

Furthermore, in an age of increasing automation within the audiovisual industry, and a drive to broaden the reach of AD through (semi-)automated approaches, inferential meaning-making poses a seemingly intractable problem for current computer models. Where human beings may derive meaning from the ‘unenacted’ (i.e., those aspects of the story arc that are left implicit), contemporary computer models are grounded in action-object recognition (detecting only the visually ‘enacted’, i.e., what is made explicit) and, critically, lack multimodal integration. Our focus on omissions in AD will contribute to a better understanding of these issues through the discussion of narrative comprehension in a holistic sense: derived from multiple competing sources, and essential to all forms of storytelling, human- and machine-based.

In the sections to follow, this paper reviews the main tenets of multimodal meaning-making in AD, and then considers a number of worked examples of omissions occurring in the audio descriptions of extracts drawn from the MeMAD500 film corpus [7, 20], the manner by which the human mind makes meaning notwithstanding such omissions, and the problems this poses for the future of automating the description of audiovisual content. In conclusion, we will summarise our findings and consider possible solutions to the issue of rendering machine descriptions more sensitive to human patterns of AD omissions.

2 Multimodal meaning-making in AD

In the context of AD, the human process of deriving meaning has three dimensions: (i) in order to make sense of the audiovisual source material, audio describers, like all sighted viewers, use their ability to assemble different elements from this material (e.g. visuals, dialogue, music, song lyrics, sound effects) into a coherent narrative in their minds [5, 6, 21]; (ii) guided by the time constraints for AD, the audio describer then decides which of the visual elements (and also non-identifiable sound effects) are crucial for understanding the narrative, before describing these elements with the aim of enabling sight-impaired audiences to create a similarly coherent story; and (iii) finally, sight-impaired consumers of the professional AD will use their own interpretative powers to create a narrative that is unique to them, from a combination of the AD and original film audio.

In generating an AD text (item (ii) above), describers are confronted with a series of audio hiatuses, each of which presents a complex decision-making opportunity regarding which of the visual elements (and non-identifiable sound effects) are crucial for understanding the audiovisual narrative and which may be omitted. Our purpose, in this paper, is to explore how audiences who rely on AD texts for narrative understanding may be able to fill the gaps left between film narrative and supplementary description, and to what extent it is possible to compensate for different types of omissions in the AD. Specifically, we will explore which types of omissions are likely to be most readily retrievable and which are more likely to be lost to a sight-impaired audience, and the factors that dictate which of these two outcomes is more likely.

As outlined above, in exploring this question, we will first review the theoretical foundations of the comprehension process at work in AD (this section), and then consider a number of worked examples (next section).

In the transfer of meaning between film text and target audience (points (ii) and (iii) above), AD can be characterised as a form of ‘hyposemiotic translation’, since the mediatory text uses fewer modes of communication than the source text (Gottlieb, 2005). However, an audio described version of a film or TV programme is still a multimodal text, which retains and combines elements of the spoken verbal mode and the auditory mode of communication. The audio description text is not a stand-alone text: it is interpreted by the audience in conjunction with the original verbal elements such as film dialogue, narration and song lyrics, as well as auditory elements such as music and sound effects. Some cues conveyed through the visual channel, e.g., light contrast, peripheral visual elements, etc., may also remain accessible to some sight-impaired audiences. Understanding audio described content is therefore a case of multimodal meaning-making, in the same way that understanding the original audiovisual narrative is. In each case, recipients use their ability to combine the different elements and to fill apparent gaps in coherence to form a meaningful narrative in their mind.

The ‘mechanisms’ that enable us to do this successfully have been captured in models explaining textual and multimodal comprehension. In order to ascertain where an omission can be retrieved and where this is likely to be difficult or impossible, we therefore draw on such models, including Mental Model Theory (MMT; [15]) and Relevance Theory (RT; Sperber & Wilson, 1995). Although originally developed for the processing of verbal discourse, these frameworks can also be applied to explain discourse processing of audiovisual narrative [6, 10] and other multimodal texts [1, 11, 12], given that the process of building narrative understanding is comparable in all of these cases.

One of the basic tenets of MMT is that when we try to understand stories or perceive the world around us, we construct mental models of the corresponding situation [15]. In this process, we use initial cues from the film or other source text to activate common and world knowledge structures (iconic vignettes) that contain defaults for the situation we are experiencing or watching or hearing, e.g., dining in a restaurant, an office meeting, playing a game of tennis, etc. Knowledge of this type is thought to be stored in the form of frames, ‘schemata’ or ‘story grammars’, e.g., about places, activities and/or event sequences. Activated knowledge schemata provide a structure for understanding incoming information. In a restaurant schema/frame, an image of people gathered around a table is normally associated with eating a meal. In an office frame, a similar image is more likely to mean that people are gathering for a meeting, and we can expect that one of them will open or chair the meeting. We normally operate within the activated knowledge schemata, unless a change of topic or scene change is indicated, in which case existing schemata must be re-visited and re-framed according to the new information.
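As a rough computational analogue, knowledge of this kind has long been modelled in AI as Minsky-style frames with default slot values. The sketch below is purely illustrative: the slot names and defaults are invented for this example and are not drawn from any particular implementation.

```python
# Illustrative Minsky-style frames: situational schemata with defaults.
# All slot names and values are invented for this example.
RESTAURANT_FRAME = {
    "setting": "restaurant",
    "defaults": {
        "people_around_table": "eating a meal",
        "expected_sequence": ["order", "eat", "pay", "leave"],
    },
}

OFFICE_FRAME = {
    "setting": "office",
    "defaults": {
        "people_around_table": "gathering for a meeting",
        "expected_sequence": ["open meeting", "discuss", "close meeting"],
    },
}

def default_interpretation(active_frame: dict, cue: str) -> str:
    """Read an incoming cue against the currently active schema; a cue
    with no default signals that the frame may need to be re-visited."""
    return active_frame["defaults"].get(cue, "no default; re-framing needed")

# The same visual cue receives different default readings per frame:
print(default_interpretation(RESTAURANT_FRAME, "people_around_table"))
# -> eating a meal
print(default_interpretation(OFFICE_FRAME, "people_around_table"))
# -> gathering for a meeting
```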

Equally important, the process of deriving meaning is shaped by the context of reception, from which we create situational knowledge. We interpret new input not only in light of our previous world knowledge, but also in line with the cues we derive from the context, e.g., situation, location and chronology. Both our prior knowledge and the context of reception raise expectations as to what will happen as the film narrative unfolds.

In the process of combining the cues from the audiovisual material or the AD (bottom-up processing) with our world knowledge (top-down processing) and cues from the context, we rely on inferences, i.e., predictions or ‘informed guesswork’ about what is plausible. For example, in our ‘multiple people around a large table’ scenario, we are more likely to infer that animated conversations and hand gestures are about to take place than that people will dance on the table, since the latter is far less probable.

When we are presented with audiovisual material, the mental models built by sighted viewers and audio describers essentially arise from the visual-verbal co-narration, further aided by non-verbal audio such as sound effects and music. Visual, verbal and auditory cues are typically complementary or sometimes redundant, i.e., more than one mode is used to express the same or similar meanings (multimodal redundancy and multimodal complementarity; [17, 18]). Since part of their role is to anticipate the needs of the sight-impaired viewer, audio describers sometimes explicate or insert elements that are not directly visible and only inferable [16], as a measure aimed at compensating for losses in the visual-verbal co-narration. By contrast, at other times, as outlined above, some visual meaning will remain implicit or be omitted due to time constraints.

A recipient’s knowledge about, or familiarity with, the storyline of a film or TV drama or with the genre conventions will contribute to resolving some omissions by allowing them to draw upon appropriate inferences. This knowledge will also create expectations about what is likely to happen as the narrative unfolds. By contrast, a lack of knowledge will increase the dependence on cues from within the audiovisual content and on untested inferences to derive the intended meaning.

Relevance Theory (RT; Sperber & Wilson, 1995) has provided more detailed accounts of how we understand verbal utterances, which can also be applied to audiovisual material. RT starts from the assumption that what is said (or what we see/hear) is always under-specified, for example, because it contains ambiguities that have to be resolved. However, we are normally able to develop the input we receive, via texts, film material or AD, into richer semantic (logical, language-related) representations, as a first step towards deriving the intended meaning.

RT has also highlighted the important role of a recipient’s cognitive environment in meaning-making, which comprises everything that a recipient can perceive, remember or infer. It subsumes knowledge but seems to be a broader notion, also encompassing an individual’s beliefs, lived experiences, etc. Naturally, differences in our cognitive environments (including aspects like neurodiversity) will lead to intersubjective differences in story interpretation, and this applies equally to audio describers as it does to their audiences, making the outcome of the meaning-making process uncertain and, to some extent, individual.

However, at the heart of RT is the idea that comprehension processes are guided by two overarching principles of relevance: first, the human tendency to maximise relevance, i.e., our continuous search for meaning (the cognitive principle of relevance); second, our assumption that a speaker, storyteller or filmmaker wants to be understood and therefore chooses the optimally relevant way of communicating their intentions (the communicative principle of relevance) (Sperber & Wilson, 1995). Following these principles, we are entitled to stop processing when we derive an interpretation that we find sufficiently relevant, meaning that we generally strike a balance between processing effort and finding a satisfactory explanation.

This also points to the importance of considering mutual knowledge and shared cognitive environments. In order to fine-tune a description to audience needs, an audio describer needs to assess what the audience knows. On an individual viewer level, this is clearly an impossible task, but at the level of the average audience member with typical life experience, it is a matter of assessing ‘best case’ scenarios, although this approach clearly risks leaving atypical viewers at a disadvantage.

In summary, theories of meaning-making can explain why it is not necessary for a filmmaker or audio describer to make everything explicit, as creators can generally rely on recipients to use other sources to fill in gaps. This notion is explored further next, using a series of film extracts taken from the MeMAD500 film corpus, each of which had been professionally audio described and was subsequently used to generate machine-derived content descriptions as part of the MeMAD project research programme.

3 Data and methods

As a principal workstream of the Horizon 2020 MeMAD (‘Methods for Managing Audiovisual Data’) project, a dataset of feature film extracts (‘MeMAD500’) was compiled for the purposes of comparing human- and machine-derived video content descriptions [7, 8]. One of the main aims of the project was to use this analysis to inform improvements to AI-driven content description models, both for archive retrieval purposes and as a first step towards semi-automated AD. However, quality assessments confirmed the naïve state of current audiovisual content descriptions produced via computer models, which still tend to be trained on still-image datasets. Despite improvements to the computer vision algorithm, and experimentation with feature extraction, the standard of description produced by the machine currently falls a long way short of human acuity [9]. Nevertheless, the audio descriptions, content descriptions and machine descriptions compiled during the MeMAD project have become a useful resource for observing the types of visual information typically omitted from audio described material, and the impact this has on an audience’s attempts at meaning-making. In this context, human-generated content descriptions, created by the research team, represented a basic description of the action on screen (‘say what you see’) without supplementary interpretation or poetic embellishment, but also without the omissions resulting from the time constraints that apply to AD.

Our worked examples are framed around a systematic analysis of extracts from the MeMAD500 dataset, which were initially studied as a resource for comparing human- and computer-derived content descriptions. For the current study, narratively salient omissions were first identified within each extract. This step involved discussions within the research team, drawing on the theoretical frameworks outlined in the previous section. It was followed by an investigation of the source of potentially compensatory cues. Where no compensatory cues were retrievable from the broader narrative material or through the application of common knowledge, the nature of the omission was assessed from a pragmatic standpoint (e.g., lack of audio hiatus opportunity for the insertion of AD). Patterns in compensatory cue-seeking that emerged from our analysis were first categorised according to retrievability versus irretrievability, followed by an assessment (where retrievable) of the cues’ temporal situation within the broader narrative, or their degree of accessibility through the application of knowledge resources external to the film narrative but typically available to the average audience member.

The MeMAD human–computer comparative study of audio and content descriptions included an analysis of the way human audio describers approach the translation of narratively salient prompts and, in particular, examined occurrences where markers that would appear narratively salient to the plotline required a degree of inferencing on the part of the audience. When omissions in the AD were observed, a qualitative analysis was conducted to establish whether elements that were not covered in the AD could be inferred by combining cues in the extant AD with other cues from the original film audio track and/or by drawing upon common knowledge. Where possible, the source of each of these inferences was determined: either from within the given AD, from across the broader film material (both audible and visual), or from the kinds of world knowledge and social understanding common to the average viewer.

4 Omissions in MeMAD film corpus extracts: Worked examples

As discussed above, the human–computer comparative study of AD and content descriptions (CD) undertaken during the MeMAD project incorporated an analysis of the way human audio descriptions reflected the transfer of narratively salient prompts, and examined occurrences where cues that appeared narratively salient to the plotline required some degree of inferencing on the part of the audience. As mentioned, this analysis was informed by the communication models outlined in Sect. 2. As part of the analysis, we classified each of the identified AD omissions according to the proximity of the source from which full retrieval of the inferred information was required, i.e.: proximal cues, signifying ‘within scene/extract’ information retrieval; distal cues, signifying ‘within narrative but external to current scene/extract’ retrieval; and different types of knowledge and/or life experience. We also recorded the extent to which an omission appeared to be driven by intentionality on the part of the audio describer. Although this distinction is by necessity somewhat hypothetical due to the exploratory and post-hoc nature of our study, we classified as intentional those instances where the audio describer appeared to choose omission due to prioritisation of other information, and as unintentional those instances where an item of narrative relevance was not included in the AD, seemingly due to time or other constraints including synchronisation issues. Examples drawn from our analysis are discussed below, illustrating the frequency of omissions occurring in film material, the breadth of variation in causality and resolution, and the problems this creates for systematising machine-generated content captions.
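By way of illustration, the classification scheme just described can be rendered as a simple annotation structure. The Python sketch below is hypothetical: the class and field names are ours, introduced only to make the scheme concrete, and do not correspond to any MeMAD annotation tooling.

```python
from dataclasses import dataclass
from enum import Enum

class CueSource(Enum):
    """Proximity of a compensatory cue to the omitted information."""
    PROXIMAL = "within the current scene/extract"
    DISTAL = "within the narrative, external to the current scene/extract"
    COMMON_KNOWLEDGE = "common/world knowledge or life experience"
    SITUATIONAL = "situational knowledge from the context of reception"

class Intent(Enum):
    """Apparent driver of the omission (a post-hoc judgement, as noted above)."""
    INTENTIONAL = "prioritisation of other information"
    UNINTENTIONAL = "time, synchronisation or other constraints"

@dataclass
class ADOmission:
    """One narratively salient visual cue absent from the AD track."""
    extract_id: str               # MeMAD500 clip identifier
    omitted_cue: str              # the visual information absent from the AD
    intent: Intent
    retrievable: bool             # can the BPS viewer recover it at all?
    cue_sources: list             # CueSource values; empty if irretrievable

# Hypothetical annotation of the Pretty Woman omission discussed in Sect. 4.1:
bonbonniere = ADOmission(
    extract_id="#102403",
    omitted_cue="silver bonbonniere containing the strawberries",
    intent=Intent.UNINTENTIONAL,
    retrievable=True,
    cue_sources=[CueSource.PROXIMAL, CueSource.COMMON_KNOWLEDGE],
)
```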

4.1 Example 1: Pretty Woman–strawberries and champagne

The first omission examined occurs in an extract from the film Pretty Woman [28] (MeMAD500 #102403). In the selected scene, the principal protagonists, Edward and Vivian, become acquainted in a hotel room while sharing snacks delivered by room service. Heavily focused on the social differences between the protagonists, the scene illustrates the contrasting worlds inhabited by the two characters: Edward, a member of the super-rich elite, a successful businessman, high-society socialite; and Vivian, a well-meaning Los Angeles prostitute living on her wits and street-smarts, while fostering dreams of a college career and finding love. Since this information is key to the unfolding narrative, it is ‘front-loaded’ into the visual cueing using scenographic techniques showing images of a luxurious hotel penthouse suite and the wonder of Vivian as she surveys the room.

In the moment, however, our attention is focused on the manner of delivery of a container of strawberries, which Edward offers to Vivian as she sits nervously in the penthouse. Expanding on the notion of contrasting lifestyles, the scene shows an extravagant solid silver bonbonniere, from which Edward removes the lid to reveal a clutch of strawberries that he offers up to Vivian (Fig. 1).

Fig. 1 Pretty Woman [MeMAD500, #102403]

The audio description is rendered as shown in Table 1.

Table 1 Pretty Woman [MeMAD500, #102403; Time In/Out 21:59/22:55]

In the AD text, the container carrying the strawberries is not referenced (Table 1). The description simply states that Edward extends an invitation to Vivian, ‘[h]e offers her a strawberry’, which fails to explain the manner in which the strawberries are presented, i.e., the BPS viewer may wonder if a single strawberry is offered from a plate or other container. Furthermore, the AD does not state that the lid is being taken off the dish, although the audience hears a ringing metallic sound as the lid is being removed from the body of the container. This metallic sound is potentially a clue to the nature of the vessel, but for the BPS audience there is no certainty that the two things are related. Hence, the sight-impaired viewer is left to infer from a combination of the dialogue and soundtrack/sound effects that the ringing metallic sound pertains to the offering of the strawberries, and that the most likely explanation is that they are contained in a metal dish. An additional cue comes from Edward’s ‘try a strawberry’, which is employed as a synecdochic term (and probably understood by the audience as such): it is unlikely that there is just one strawberry in the container, or that he is suggesting Vivian select only a single piece from a larger collection; rather, the phrase implies that he is offering her a container of multiple strawberries. Vivian does select just one strawberry, which may be inferred from the fact that Edward refers to ‘it’ bringing out the flavour of the champagne, and is also reinforced by the AD (‘she takes one and bites into it’). A further point to note is that the AD contains an internal inconsistency in terms of the number of strawberries being offered: firstly we learn that ‘he offers her a strawberry’, suggesting Edward could be handing Vivian a single fruit; but this is followed by the comment ‘she takes one’, which then implies that more than one strawberry is available to her (‘she takes it’ would, in theory, provide greater coherence). In this respect, the AD performs as an interpretive rather than descriptive text, perhaps reflecting the need for condensation due to the absence of a suitable audio hiatus for further expansion and clarification.

As the extract concludes, Vivian finishes the strawberry and mutters ‘pretty good’, from which the audience needs to establish cohesion between the antecedent events (Edward suggesting strawberries bring out the flavour of the champagne, Vivian tasting one) and Vivian’s final verdict, recognising the latter as a reflection on the effect of the strawberry on the taste of the beverage. At the conclusion of the extract, the AD mentions a ‘silver salver’ with a ‘lid’ (sic) which, although not strictly correct, indicates the source of the prior metal-on-metal sound when the lid of the bonbonniere is removed. Use of the term ‘silver salver’ is, in itself, problematic: it is an archaic term, which requires cognitive processing as a matter of lexical retrieval; the link must then be established between the ‘silver salver’ and the strawberries (i.e., that it is being used as a container).

In all of these instances of inferred meaning-making, the sight-impaired viewer is required to draw on proximal audio cues (i.e., from within the micro-narrative film extract), only some of which are addressed in the AD, while other cues are asynchronous and hence distal, as well as on common knowledge, in order to build a cohesive narrative around the dialogue. Although most of the visual information is ultimately retrievable by the BPS viewer in this scene, Sperber & Wilson’s (1995) ‘cognitive principle of relevance’ (see Sect. 2 above) suggests that this is likely to require significant additional cognitive effort.

4.2 Example 2: Extremely Loud and Incredibly Close–messages and memories

Not all omissions can be resolved by reference to proximal audio cues. Example 2 focuses on another crucial source for resolving omissions, namely different types of audience knowledge. Many gaps in explicit storytelling require a knowledge of the broader narrative from which a particular scene derives, and from where it is possible to draw prompts and cues which fulfil the minimal requirements for establishing narrative cohesion. Other omissions call for the viewer to apply other types of knowledge or life experience, either in the form of computing most likely outcomes in any given social situation, or in terms of drawing on historical knowledge in order to make sense of the events underpinning a narrative. In an extract from Extremely Loud and Incredibly Close [26] (MeMAD #100902) a young boy is involved in solving a mystery about his father, who perished in the terrorist attacks on the US World Trade Center in 2001. This micro-narrative uses the device of analepsis, consisting of a flashback to when the young boy is returning to his parents’ apartment alone, having been sent home early from school on the morning of 9/11. After finding a snack, he turns the television on in the background while listening to an answerphone message from his father (Fig. 2, Table 2).

Fig. 2 Extremely Loud and Incredibly Close [MeMAD500, #100902]

Table 2 Extremely Loud and Incredibly Close [MeMAD500, #100902; Time In/Out 00:10:50/00:11:20]

It is important to note that there is no dialogue in this scene, since the boy is alone in his parents’ apartment; nevertheless, the audio track is replete with prompts and cues to lead the viewer to the focal point of the narrative: events unfolding on the morning of 9/11 and the attacks on the World Trade Center.

For the sighted viewer, visual markers are available to situate the action, but these are more difficult for the BPS audience to retrieve. A significant omission occurs when the camera shifts from the boy to the television screen in the corner of the room, which is emitting a cacophony of traffic horns and screaming voices [a proximal audio cue]. On screen, a CNN rolling news programme displays the unspoken headlines: ‘Breaking News, America under Attack, two Planes Crash into Towers of World Trade Center’. Since the headlines are neither read audibly by the boy, nor voiced within the AD, this key visual cue as to the nature of unfolding events on screen is omitted from the information available to the visually impaired audience. However, the lost information may still be retrieved by the BPS viewer through the soundtrack: firstly, through the situational and temporal markers established via the answerphone machine, which commences each message with the date (September 11) and time (mid-morning), which we know historically to be a critical point in the unfolding of the 9/11 disaster; and secondly, through the background noise of sirens and screaming emerging into the boy’s sitting room from the television set at the end of the clip. Of course, establishing relevance and attributing source to these sounds requires a degree of common knowledge (i.e., that answer machines save messages chronologically for review/replay at a later point in time; that major television stations run ‘rolling news’ of key stories as events unfold, in real time) and world knowledge (e.g., that September 11th marks the date of historically significant terrorist attacks on the US; that these attacks were focused primarily on New York; that many workers were trapped inside the World Trade Center buildings awaiting rescue by fire crews who could not reach them). Furthermore, the BPS consumer can draw on situational knowledge retrieved from the context of reception, namely that New York, the main location of the attacks, is where the boy lives [a distal cue introduced earlier in the film/AD].

Prosodic cues and the speed of the man’s speech relayed through the answerphone messages (this is the second message the boy replays, from a series of phone calls made by his father that morning) suggest the rising panic experienced by the child’s father, along with a growing realisation that he is trapped inside a burning building without a means of escape. Such vocal devices can be considered proximal to the micro-narrative at the heart of the scene, meaning that they should be relatively accessible to and retrievable by the BPS audience.

In the manner discussed here, BPS audiences with sufficient common knowledge (of the New York terrorist attacks, the date on which they took place and the nature of the incident at the Twin Towers), should be able to retrieve the AD omission using prompts from the diegetic audio.

4.3 Example 3: 500 Days of Summer–the break-up

As highlighted in Sect. 1, a well-known problem audio describers face is that, due to a lack of sufficient audio hiatus in the original film soundtrack, visual cues cannot always be relayed in a timely fashion within the AD. In this case they are either wholly omitted, or fall out of sequence with the visual action, forcing the BPS viewer to engage in a potentially burdensome level of cognitive processing in order to make sense of events. Taking an extract from 500 Days of Summer [25] to illustrate this point, we can observe the way many visual cues are omitted from the AD, only some of which can be retrieved from other sources (e.g., the musical scoring), while others may be entirely lost to those without access to the visual resource.

The scene, set in a diner, recounts the final throes of a relationship [distal cue] between Summer and her boyfriend Tom, which unfolds as a series of scenes presented in random date order across the timespan of the couple’s relationship (Fig. 3, Table 3). Interestingly, from an audio perspective, neither the phrase ‘break up’ nor synonymous terms, including phrases like ‘taking a break’ or ‘parting on good terms’, are used in the dialogue. Instead, viewers join the scene at a point where the couple is dissecting the reasons for their break-up, a segment that occurs achronologically within the general plot exposition, thereby placing the entire audience at an inferential disadvantage by increasing the cognitive load required to connect the current scene to the broader narrative timeline [distal cueing].

Fig. 3 500 Days of Summer [MeMAD500, #000100]

Table 3 500 Days of Summer [MeMAD500, #000100; Time In/Out 00:05:36/00:06:16]

Visually, the mise-en-scène of this brief micro-narrative speaks of misery, desperation and unease between two formerly intimate people: the quasi-monochromatic and depressing diner setting, the protagonists’ facial expressions, their physical distance from each other, and the insipid-looking food served by a disengaged waitress. Since dialogue between the couple dominates the audio, there is little opportunity for the audio describer to interject with these highly salient visual details, leaving the BPS audience with a higher cognitive load than their sighted peers in order to make sense of the unfolding plot. Summer’s opening gambit, ‘[t]his can’t come as a total surprise to you’, paired with her matter-of-fact vocal inflexion, certainly enables the inference that something negative is being discussed [proximal cue]; the subsequent likening of their relationship to that of a famous historical couple (Sid Vicious and Nancy Spungen) involved in a tragic stabbing incident requires world knowledge to confirm the prior inferences about the negative mood of the scene.

Furthermore, plodding background music [proximal cue] hints at a tragi-comedic turn as, despite both parties displaying hangdog facial expressions (Tom’s nuanced with depression; Summer’s heavy with disdain), Summer confesses that she sees herself as Sid Vicious and Tom as Nancy [proximal cue]. With the arrival of some strikingly unappetising-looking food, Summer proclaims that she is enjoying the meal and is ‘really glad’ they met, a sentiment that common knowledge of relationship break-ups might suggest would be unlikely. Moreover, Summer’s facial expression fails to match her vocal positivity, leaving the BPS viewer with a problematic mismatch of information: they surely suspect that Summer is unlikely to be happy, but her words indicate otherwise, and since there is no reference to her miserable facial expression in the AD, an omission has occurred that takes some work to retrieve. In this case, the omission of relevant visual cues in the AD is precipitated by the fact that the segment lacks a sufficient audio hiatus in which to insert such sentiment. Nevertheless, the audio describer’s simple interjection that ‘the waitress brings two plates’, whilst prioritising the arrival of a new character on the scene, fails to address the most narratively salient information in terms of plot accessibility and may be considered an inadvertent omission. Likewise, the AD of the parting shot, ‘Tom gets up’, is factually correct, but his expression of shock would be a more relevant description for an audience with sight impairment [irretrievable/non-inferable omission]. In short, the lack of AD paired with the highly visual cueing evident in this scene results in the loss of key narrative information that cannot be inferred by an audience without dependable vision.

4.4 Example 4: Goal II: Living the Dream–breakfast

On occasion, the audio describer’s prioritisation choices result in intentional omissions, from which the audience must attempt to salvage some semblance of meaning. In an extract from Goal II: Living the Dream [27], a scene unfolds in which Santi’s girlfriend, Roz, brings him breakfast in the sitting room where he has spent the night sleeping alone on the sofa. An argument ensues: Santi has overslept and missed his flight to football training, which he blames on his girlfriend’s failure to wake him earlier (Fig. 4). A series of omissions occur, some of which are intentional, and others wrought from necessity due to a lack of suitable audio hiatus (Table 4). Key omissions are enumerated in the description of the scene that follows.

Fig. 4 Goal II: Living the Dream [MeMAD500, #101206]

Table 4 Goal II: Living the Dream [MeMAD500, #101206; Time In/Out 00:52:36/00:53:14]

When Roz walks into the room, the AD locates Santi as being asleep on the couch (via pronominalisation, in keeping with narrative sequencing), but does not identify the female voice as that of his girlfriend [#1]. In this case, it appears that the describer may have assumed the audience can infer this fact from distal cues such as voice recognition, the intimate nature of the conversation which suggests that they know each other well, and the fact that he relies on her to ensure he gets to work on time (perhaps implying that they live together). In other words, this may be an instance of an intentional omission. Her footsteps point to the fact that she is wearing shoes [proximal cue] and therefore unlikely to have just left the sofa after having slept there all night with Santi. She says that she has brought him breakfast, leaving the describer free to prioritise his position of being asleep on the couch over details of the domestic scene [#2], or the fact that he is still wearing clothes from the night before [proximal, #3]. Due to time constraints, the facial expressions, gesturing and discarded breakfast tray, which are all features of the argument that ensues, are omitted from the AD [#4]. Instead, the AD references Santi’s exit in a sports car while the visual setting remains on the sitting room scene, subsequently prioritising information about Santi’s behaviour at the airport when the sports car eventually comes into view. This saves the BPS audience from having to infer/retrieve a shift in location using just the ambient audio [proximal] as a prompt.

From the short dialogue exchange and the events which follow, the BPS viewer is required to retrieve the mood of both protagonists from their vocal prosody (Santi, annoyed at Roz; Roz, kindness turning to anger), and to infer from the AD note ‘he drives off’ that Santi is racing to the airport in his car [proximal cues]. When Santi says that he is ‘going to miss the team plane’, both BPS and fully-sighted audiences must retrieve from the broader narrative the fact that Santi is a professional footballer [distal, #5], and apply the common knowledge that premier football teams travel to match locations together on chartered planes [#6]. Most likely for reasons of information prioritisation, the describer has chosen to use the hiatus created by a shot change between the house and airport scenes to introduce the location transition rather than describe Roz’s anger, which is emblematic of the couple’s troubled relationship and closes out the micro-narrative [prioritisation/determined omission, #7]. In all likelihood, the BPS viewer will already have inferred the mood from the nature and tone of the earlier dialogue, as well as the sound of the drinking glass which Roz smashes down on the breakfast tray, cueing [proximal] her mood at the end of the scene.

This very brief scene contains a poignant turning point in the film narrative, marking a deterioration in the relationship between the featured couple. The BPS audience is required to assimilate a large number of audiovisual cues within a short space of time, even when the visual representations are asynchronous with the soundtrack. However, in this instance, the relevant cues seem largely retrievable through a combination of vocal memory, attention to audio cueing and the application of common knowledge.

4.5 Example 5: The Jane Austen Book Club

Intentional omission of character names during long-form narrative is a common strategy when scripting AD, since it is generally assumed the BPS viewer will recognise a recurring voice across the breadth of an unfolding storyline, and from this recognition infer the character’s identity. However, in examples such as the one below, selected from The Jane Austen Book Club [29], where multiple storylines and character pairings are intersected at speed, the cognitive load imposed by this constant need to infer character identities from vocalisations may become burdensome. To some extent, success in this regard depends on characters’ voices sounding sufficiently different in timbre and tone for the audience to be able to differentiate between them with ease. This may not always be the case, and the viewer would then be required to infer character attribution from the evolving plot, location or audio landscape (e.g., the intermittent splicing of scenes between a rowdy nightclub and a tennis court).

In this excerpt, Prudie (named by the describer) is waiting for her mother to pick her up by car, using the time to sit on a wall and read (Fig. 5, Table 5). As a second character approaches, the describer informs the audience that a shadow is cast over her book, but the source of the shadow, a character called Trey, remains unnamed in the AD. To the sight-impaired viewer, it is unclear whether the shadow belongs to a person or an inanimate object, and while this is the same for the fully sighted audience, any uncertainty is quickly assuaged by the sight of Trey’s face looming over Prudie. Trey then utters a short comment about rehearsals before the scene ends, followed by a shot of Brigadoon being rehearsed in a school theatre (where Trey is also not identified). Thus, the BPS audience is required to infer Trey’s identity, in spite of there being sufficient hiatus to introduce him via the AD. Since he is the character whose storyline most frequently intersects with that of Prudie, the natural inference would be that his voice is the one to be heard, imploring her to help with his rehearsals. Yet this is not a simple inference: there are multiple pairings of characters featured in the plotline, and Trey’s voice [proximal cue] is not particularly distinctive or unique, although his ongoing role in the Brigadoon theatre production is a relevant distal (and thus perhaps less readily retrievable) cue.

Fig. 5 The Jane Austen Book Club [MeMAD500, #103906]

Table 5 The Jane Austen Book Club [MeMAD500, #103906; Time In/Out 37:44/37:55]

In summary, the omission of a naming protocol for Trey would appear to be a strategic decision on the part of the audio describer who has chosen a poetic description of the shadow cast over Prudie’s book, and the effects of the sun on her eyes, rather than the alternative strategy of reducing the level of inferencing required of the BPS viewer by labelling the focal source of her conversational engagement. In this case, retrieval is therefore dependent on the viewer’s ability to recognise the male protagonist’s voice from previous scenes. Where this is not possible, the omission will be irretrievable, and a degree of narrative coherence will be lost.

5 Discussion

From these worked examples, it is possible to observe instances where omissions occur in the AD delivery, requiring sight-impaired audiences to retrieve narratively salient cues from within the proximal or wider narrative, or from different types of knowledge, and use them to infer meaning. Needless to say, any cinematographic text (dialogue, scenography, soundtrack, physical action) contains omissions, but these have generally been incorporated as a deliberate dramatic device, either to build suspense or to make the audience work to retrieve coherence from disparate sources. Retrieval, in this case, is therefore not necessarily challenging for the standard viewer with access to all cues from both audio channels and visual representations. By contrast, the retrieval of visual cues that have not been included in an audio description, whether intentionally or inadvertently, carries an additional cognitive tariff for the BPS audience, because the visual cues which bring cohesion to sequential action and/or context to the dialogue and other elements of the soundtrack across a long-form narrative may only be retrievable if they are supplied as part of the AD.

We have, however, observed different types of AD omissions, with different implications. The first type, omissions resulting from prioritisation choices, will often be retrievable from elsewhere in the text. In our examples, we observed several types of cues which offer the BPS audience an opportunity to compensate for AD omissions: (i) proximal cues, whereby the cue is available from within the current scene, shot or conversation (e.g., in the dialogue or other audio elements); and (ii) distal cues, whereby the relevant cue may be available at a greater distance from the omitted information but still from within the broader narrative, either from earlier exchanges of dialogue or through visual cues or character naming conventions previously adopted and verbalised. Alternatively, the viewer may draw upon (iii) common or world knowledge (information the average adult with normal life experience might be expected to have gathered and be able to apply from the general to the particular for the purposes of meaning-making), or situational knowledge retrieved from the context of reception (i.e., from earlier parts of the film text), which enables the viewer to infer an item of otherwise ‘lost’ information relating to the current scene. Each of these opportunities offers the audience mitigating information for retrieving critical cohesive cues for the comprehension of narrative.

There is, of course, always the danger that the BPS viewer will fail in their attempt to detect narratively relevant cues and cohesive ties because they are simply irretrievable, resulting in an overall loss of narrative coherence. This was illustrated by the extract from The Jane Austen Book Club, where the audio describer chose to omit narratively salient information (in this case, the name of the momentarily unseen protagonist) in order to invoke an alternative translation strategy, which is arguably more poetic and serves to build an element of suspense, but which may not be successful. However, our analysis suggests that such cases are the exception in relation to omissions resulting from prioritisation.

By contrast, the second type, overt omissions, which occur due to a lack of available audio hiatus in which to situate the relevant description in a timely manner (i.e., without delay), or a hiatus which cannot accommodate a lengthy or complex explanation, or indeed potentially even describer error, appears to carry a greater risk of being irretrievable for the BPS audience, as seen in Examples 3 (500 Days of Summer) and 4 (Goal II). Figure 6 illustrates the different types of omission and the range of potentially available cues (or lack thereof), categorising them by availability to the BPS audience.

Fig. 6 Summary of omissions: cues, retrieval and description strategies

As the BPS audience engages in narrative cue retrieval, the level of cognitive load required to arrive at an understanding of plot exposition that is roughly equivalent to that of the sighted viewer is likely to be considerably greater [13, 22]: firstly, because there is an additional channel of audio to assimilate into the narrative ‘whole’; and secondly, because cues delivered through audio means, although relating to visual markers, require higher-level cognitive ‘decoding’ or processing [14]. It would seem reasonable to assume that proximal cues, since they rely on the retrieval of more recently consumed information, are likely to be easier to retrieve than distal cues. Common knowledge is likely to be readily retrievable, most notably in the socio-behavioural domain where social norms and nuanced behaviours are deeply ingrained in the neurotypical psyche. However, whilst this study has identified potential differences in the processing and impact of different types of AD omission, based on the availability of various types of cues and knowledge, the logical next step will be to test our observations for generalisability in an experimental research environment.

If confirmed, our observations indicate, on the one hand, that many AD omissions are compensable, i.e., retrievable as part of routine inferential processes of understanding, because they are the result of a (more or less) strategic decision-making process by the audio describer, drawing on informed assumptions about what can be left implicit for the audience to infer. This has implications for the increasing efforts to (semi-)automate AD and current computer vision capability. On the other hand, our exploratory study suggests that the AD omissions which are more difficult to mitigate tend to be omissions over which an audio describer has little control (e.g., due to insufficient hiatus). This raises questions for alternative, more flexible ways of delivering AD without violating the ‘golden’ rule of avoiding overlap between AD and original audio track. Both aspects will be briefly discussed in the next section.

6 AD omissions and the implications for new ways of AD delivery

Returning first to the human–computer audiovisual challenge posed by the MeMAD project, namely the shift from wholly human-derived to semi-automated descriptions of audiovisual content, there is clearly much to be unpicked in terms of omissions.

Perhaps the most important issue, at least in the immediate future, is that the automation of audiovisual descriptions is largely dependent upon the application of computer vision models which (currently) fail to produce nuanced, narratively salient and coherent descriptions, and fail to incorporate the type of complex multimodal analysis (including an analysis of the audio output) that is such an important aspect of engaging with audiovisual narrative for a human audience. Human audiences assimilate the audible and the visual simultaneously and relatively seamlessly, and from this holistic approach to interpreting film narrative they extract implicatures and inferences with ease. By contrast, the description-generating computer is typically unable to incorporate audio cues from dialogue, including vocal prosody, and additional layers of soundscaping available through the musical score, incidental music and sound effects. By comparison, the BPS viewer may have reduced or minimal vision, or none at all, yet be able to retrieve many elements of visual explication through audio channels (for example, the sound of birdsong and an absence of traffic noise may suggest a bucolic location). That is to say, the human being has the ability and cognitive resource to seek complementary information which mitigates impending loss/omissions, especially when these are planned or calculated according to the human ability to retrieve the missing information, whereas the machine has only the information provided in the visually oriented training data from which to draw resources targeted at audiovisual content description.

Even assuming that computer models were capable of assimilating narrative holistically, i.e., drawing time-aligned cues across the different modalities in the audiovisual source, the capability for establishing narrative coherence currently remains out of reach. It is therefore unlikely that omissions in automatically produced descriptions of audiovisual content follow patterns similar to those identified in the present study; indeed, omissions in automatically generated descriptions tend to be random [8]. It is unlikely that a machine will be capable of ‘calculating’ what can be omitted safely because the audience can infer the information, versus what would be irretrievable if it were not present in the AD. In other words, computer-generated descriptions of audiovisual content are a long way from producing narratively coherent descriptions adapted to the available time (length of hiatuses) in the way that audio describers often, but, as our examples show, not always, manage to do.
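To make this gap concrete, the sketch below shows the shape such a ‘calculation’ would need to take, under the (currently unrealistic) assumption that reliable multimodal cue detection were available. Each input to the function stands in for an unsolved research problem rather than an existing capability.

```python
def safe_to_omit(cue: str,
                 scene_audio_cues: set,
                 narrative_history: set,
                 schema_defaults: set) -> bool:
    """Hypothetical decision rule: a visual cue may be safely omitted from
    the AD only if the audience can plausibly retrieve it from elsewhere.
    The three membership tests stand in for capabilities that no current
    content-description model possesses: multimodal cue detection,
    narrative memory and frame-based inference."""
    if cue in scene_audio_cues:    # (i) proximal: conveyed by in-scene audio
        return True
    if cue in narrative_history:   # (ii) distal: established earlier in the narrative
        return True
    if cue in schema_defaults:     # (iii) inferable from common/world knowledge
        return True
    return False                   # irretrievable: must be described if a hiatus allows

# Toy run, loosely modelled on the Pretty Woman extract (Sect. 4.1): the
# metallic ring of the lid acts as a proximal audio cue to the container.
print(safe_to_omit("metal container",
                   scene_audio_cues={"metal container"},
                   narrative_history=set(),
                   schema_defaults=set()))  # -> True
```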

An alternative way to achieve the required flexibility, especially in those instances where insufficient hiatuses also create problems for human describers, would be to re-investigate the concept of extended AD. The old idea of ‘freezing frames’ was difficult to achieve in the pre-digital age, but the digitisation of AV content, along with better support for media personalisation, opens up new opportunities for smart implementation of flexible/extended AD narratives [24] that would, among other things, help overcome the issues with problematic omissions.

In the future, there are certainly steps that can be taken as a way to move towards a more ‘humanlike’ (semi-)automated approach to addressing omissions. Principally, there needs to be an acknowledgement that semi-automated audio/content descriptions require some form of integrated dialogue analysis in order to explicate key aspects of plot. Without audio integration, cue omission is inherent and pervasive, since a highly significant number of narrative cues are already lost. Certainly, building feature extraction characteristics pertaining to dialogue assets would provide a useful adjunct to current machine-based content description models. Key lexical determinants such as words denoting temporal shifts (“see you tomorrow”), location changes (“I’ll meet you at the pub later”) and action words (“Let’s hire a boat”) could be extracted from dialogue using speech-to-text and natural language processing technologies, and compared with visually generated machine descriptions to supplement or revise the initial machine-based attempt at content description. While this approach is not a panacea for mitigating omissions by the machine, which would require a model for seeking cohesive ties and the pursuit of a template for narrative sequencing and coherence across the piece, as well as the application of world or common knowledge to problem-solving, it would generate a more complete description of the action on screen, providing cues as to intention as well as causality and consequence.
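By way of a concrete, if simplified, illustration of the lexical extraction envisaged here: the sketch below assumes a dialogue transcript has already been produced by a speech-to-text stage. The cue categories and patterns are purely illustrative (a production system would use full NLP rather than hand-written expressions), and nothing here corresponds to an existing MeMAD component.

```python
import re

# Illustrative cue lexicon for the 'key lexical determinants' mentioned
# above; categories and patterns are invented for this example.
CUE_PATTERNS = {
    "temporal_shift": r"\b(tomorrow|tonight|yesterday|next week|later)\b",
    "location_change": r"\b(meet (you|me) at|let's go to|off to) (the )?\w+",
    "action": r"\b(let's|we could|we should) \w+",
}

def extract_dialogue_cues(transcript: str) -> dict:
    """Scan a speech-to-text transcript for narrative cue candidates,
    grouped by cue type, for comparison with visually generated
    machine descriptions."""
    hits = {}
    for cue_type, pattern in CUE_PATTERNS.items():
        matches = [m.group(0) for m in
                   re.finditer(pattern, transcript, flags=re.IGNORECASE)]
        if matches:
            hits[cue_type] = matches
    return hits

# Example, using the dialogue fragments quoted above:
print(extract_dialogue_cues(
    "See you tomorrow. I'll meet you at the pub later. Let's hire a boat."))
# -> {'temporal_shift': ['tomorrow', 'later'],
#     'location_change': ["meet you at the pub"],
#     'action': ["Let's hire"]}
```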

7 Conclusions

Omissions in audio description for BPS audiences are commonly, although not reliably or comprehensively, retrievable by the decoding of proximal or distal cues from within the primary text, or from the application of common or world knowledge to contextualise the unfolding narrative. Sighted audiences are also required to retrieve these cues in order to fully comprehend the narrative but enjoy the added benefit of full access to the visual medium in order to do so. Where omissions are non-retrievable to the BPS viewer, they are most commonly created by the (generally justifiable) prioritisation choices made by the audio describer, or by technical barriers to providing AD such as the lack of a suitable audio hiatus.

The implications of this human ability to read a range of cues from across the multimodal text and draw narratively relevant inferences, as they relate to the future of machine description generation, are clear, but remain unresolved. Since machine learning relies on large volumes of representative data in order to train for desired outcomes, and given the complex cognitive nature of inferencing and meaning-making, there is no immediate sign that automated video content description models contain a solution for deciding what is retrievable through inference (and could therefore be omitted in the event of insufficient hiatuses in the audio track) versus what is key visual information that a BPS audience is unlikely to be able to retrieve without AD (and should therefore not be omitted). Possible avenues include, on the one hand, alternative delivery formats such as smart extended (human) AD based on a ‘play-pause-play’ viewing protocol.

At the same time, with regard to the further development of automated description, introducing dialogue analysis and a more comprehensive multimodal analysis into the feature extraction methods within computer vision models would prepare the way for the machine to potentially make more informed decisions about elements which are either necessary or omissible (inferable). When progress on delivering reliable automatic object and action recognition is sufficient, these more narratively important aspects of description will need to be incorporated into the relevant computer models.

The matter of overcoming omissions in video content description is equally relevant to other audiences, in particular those requiring help to access the nuances of plot development, the causality and consequence of storylines within an unfolding narrative or the verbal/visual emotional content of film material. Whilst these needs have been considered by initiatives to expand AD into the realms of cognitive assistance using a bespoke linguistic approach (e.g., AD for autistic audiences with emotion recognition difficulties; [19, 20]) and simplified ‘easy’ AD [4], the impact of omissions on these audience groups must be assessed separately.

Further research into the impact of omissions in video content descriptions on narrative comprehension, including reception studies with target audiences (both sight-impaired and cognitively diverse), will ultimately serve to corroborate, or challenge, the results of our qualitative analysis. Over time, this would ideally gauge the impact of omissions in machine-generated descriptions on viewers, alongside omissions resulting from the more traditional human-derived audio descriptions.