Multimodal Machine Translation through Visuals and Speech

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Figure caption (beginning missing): ..., and spoken language translation (SLT), shown in contrast to unimodal translation tasks, such as text-based machine translation (MT) and speech-to-speech translation (S2S), and multimodal NLP tasks that do not involve translation, such as automatic speech recognition (ASR), image captioning (IC), and video description (VD).
Today, the rising interest in MMT is largely driven by the state-of-the-art performance and the architectural flexibility of neural sequence-to-sequence models (Sutskever et al, 2014; Bahdanau et al, 2015; Vaswani et al, 2017). This flexibility, which is due to the end-to-end nature of these approaches, has the potential of bringing the vision, speech and language processing communities back together. From a historical point of view, however, there was already a great deal of interest in doing machine translation (MT) with non-text modalities, even before the arrival of successful statistical machine translation models. Among the earliest attempts is the Automatic Interpreting Telephony Research project (Morimoto, 1990), a 1986 proposal that aimed at implementing a pipeline of automatic speech recognition, rule-based machine translation, and speech synthesis, making up a full speech-to-speech translation system. Further research has led to several other speech-to-speech translation systems (Lavie et al, 1997; Takezawa et al, 1998; Wahlster, 2000).
In contrast, the use of the visual modality in translation has not attracted comparable interest until recently. At present, there is a variety of multimodal task formulations including some form of machine translation, involving image captions, instructional text with photographs, video recordings of sign language, subtitles for videos (and especially movies), and descriptions of video scenes. As a consequence, modern multimodal MT studies dealing with visual (or audiovisual) information are becoming as prominent as those tackling audio. We believe that multimodal MT is a better reflection of how humans acquire and process language, with many theoretical advantages in language grounding over text-based MT as well as the potential for new practical applications like cross-modal cross-lingual information retrieval (Gella et al, 2017; Kádár et al, 2018).
In the following, we will provide a detailed description of MMT tasks and approaches that have been proposed in the past. Section 2 contains an overview of the tasks of spoken language translation, image-guided translation and video-guided translation. Section 3 reviews the methods and caveats of evaluating MT performance, and discusses prominent evaluation campaigns, while Section 4 contains an overview of major datasets that can be used as training or test corpora. Section 5 discusses the state-of-the-art models and approaches in MMT, especially focusing on image-guided translation and spoken language translation. Section 6 outlines fruitful directions of future research in multimodal MT.

Tasks
While our definition of multimodal machine translation excludes both cross-modal conversion tasks with no cross-linguality (e.g. automatic speech recognition and video description), and machine translation tasks within a single modality (e.g. text-to-text and speech-to-speech translation), it is still general enough to accommodate a fair variety of tasks. Some of these tasks such as spoken language translation (SLT) and continuous sign language recognition (CSLR) meet the criteria because their source and target languages are, by definition, expressed through different modes. Other tasks like image-guided translation (IGT) and video-guided translation (VGT) are included on the grounds that they complement the source language with related visuals that constitute an extra modality. In some cases, a well-established multimodal machine translation task can be characterised by methodological constraints (e.g. simultaneous interpretation), or by domain and semantics (e.g. video description translation).
We observe that a shared modality composition is the foremost prerequisite that dictates the applicability of data, approaches and methodologies across multimodal translation tasks. For this reason, further in this article, we classify the studies we have surveyed according to the modality composition involved. We also restrict the scope of our discussions to the more well-recognised cases that involve audio and/or visual data in addition to text. In the following subsections, we explain our use of the terms spoken language translation, image-guided translation, and video-guided translation, and provide further discussions for each of these tasks.

Spoken language translation
Spoken language translation (SLT), also known as speech-to-text translation or automatic speech translation, comprises the translation of speech in a source language to text in a target language. As such, it differs from conventional MT in the source-side modality. The need to simultaneously perform both modality conversion and translation means that systems must learn a complex input-output mapping, which poses a significant challenge. The SLT task has been shaped by a number of influential early works (e.g. Vidal, 1997; Ney, 1999), and championed by the speech translation tasks of the IWSLT evaluation campaign since 2004 (see Section 3.2.2).
Traditionally, SLT was addressed by a pipeline approach (see Section 5 for more details), effectively separating multimodal MT into modality conversion followed by unimodal MT. More recently, end-to-end systems have been proposed, often based on NMT architectures, where the source language audio sequence is directly converted to the target language text sequence (Weiss et al, 2017; Bérard et al, 2018). Despite the short time during which end-to-end approaches have been developed, they have been rapidly closing the gap with the dominant paradigm of pipeline systems. The current state of end-to-end systems is discussed further in Section 5.2.3.
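The structural difference between the two paradigms can be illustrated with a minimal sketch. The components below (`toy_asr`, `toy_mt`, `TOY_LEXICON`) are hypothetical stand-ins invented for illustration, not real recognisers or translators; the point is only that a pipeline system is a composition of a modality conversion step and a text-to-text translation step, whereas an end-to-end system would replace the composed function with a single learned mapping from audio to target text.

```python
from typing import Callable, Dict, List

Audio = List[float]  # stand-in for a waveform or acoustic feature sequence


def make_pipeline_slt(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
) -> Callable[[Audio], str]:
    """Compose modality conversion (ASR) with unimodal MT."""

    def translate(audio: Audio) -> str:
        transcript = asr(audio)   # speech -> source-language text
        return mt(transcript)     # source-language text -> target-language text

    return translate


# Hypothetical toy components, for illustration only.
TOY_LEXICON: Dict[str, str] = {"hello": "hallo", "world": "welt"}


def toy_asr(audio: Audio) -> str:
    # Pretend any non-empty audio is recognised as the same utterance.
    return "hello world" if audio else ""


def toy_mt(text: str) -> str:
    # Word-by-word dictionary lookup; unknown words are copied through.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


pipeline_slt = make_pipeline_slt(toy_asr, toy_mt)
print(pipeline_slt([0.1, 0.2, 0.3]))
```

Because the MT component only ever sees the transcript, any recognition error is passed on unchanged, which is one motivation for the end-to-end alternative.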

Image-guided translation
Image-guided translation can be defined as a contextual grounding task, where, given a set of images and associated documents, the aim is to enhance the translation of the documents by leveraging their semantic correspondence to the images. Resolving ambiguities through visual cues is one of the main motivating forces behind this task.
A well-known realisation of IGT is image caption translation, where the correspondence is related to sentences being the descriptions of the images. Initial attempts at image caption translation were mostly pipeline approaches: Elliott et al (2015) proposed a pipeline of visually conditioned neural language models, while Hitschler et al (2016) approached the problem from a multimodal retrieval and reranking perspective. With the introduction of the WMT multimodal translation shared task (see Section 3.2.1), IGT attracted a lot more attention from the research community. Today, the prominent approaches rely on visually conditioning end-to-end neural MT systems with visual features extracted from state-of-the-art pretrained CNNs.
Although the utility of the visual modality has recently been disputed under specific dataset and task conditions (Elliott, 2018), using images when translating captions is theoretically very advantageous for handling grammatical characteristics (e.g. noun genders) in translating between dissimilar languages, and for resolving translational ambiguities. It has also been shown that state-of-the-art models become capable of leveraging the visual signal when source captions are deliberately deteriorated in a simulated low-resource scenario. We discuss the current state of the art and the predominant approaches in IGT in Section 5.1.

Video-guided translation
We posit the task of video-guided translation (VGT) as a multimodal machine translation task similar to image-guided translation, but tackling video clips (and potentially audio clips as well) rather than static images associated with the textual input. Within video-guided translation, there can be variants depending on the textual content. The source text can be transcripts of speech from the video, which would be typically segmented as standard subtitles, or a textual description of the visual scene or an action demonstrated in the clip, often created for visually impaired people. As such, video-guided translation can be subject to particular challenges from both SLT (time-variant audiovisual input) and IGT (indirect correspondence between source modalities). On the other hand, these similarities could also indicate that it might be possible to adapt or reuse approaches from both of those areas to bootstrap VGT systems.
One major challenge hindering progress in video-guided translation is the relative scarcity of datasets. While a large collection such as the OpenSubtitles corpus (Lison and Tiedemann, 2016) can provide access to a considerable amount of parallel subtitles, there is no attached audiovisual content since the corresponding movies are not freely available. Recent efforts to compile freely accessible data for video-guided translation, like the How2 (Sanabria et al, 2018) and VaTeX datasets (both described in Section 4.3), have started to alleviate this bottleneck. Although there has been decidedly little time to observe the full impact of such initiatives, we hope that they will inspire further research in video-guided translation.

Evaluation
Evaluating the performance of a machine translation system is a difficult and controversial problem. Typically, there are numerous ways of translating even a single sentence which would be acceptably produced by human translators (or systems), and it is often unclear which one is (or which ones are) good or better, and in what respect, given that the pertinent evaluation criteria are multi-dimensional, context-dependent, and highly subjective (see for example Chesterman and Wagner, 2002; Drugan, 2013). Traditionally, human analysis of translation quality has often been divided into the evaluation of adequacy (semantic transfer from source language) and fluency (grammatical soundness of target language) (Doherty, 2017). While this separation is considered somewhat artificial, it was created to make evaluation simpler and to allow comparison of translation systems in more specific terms. In practice, systems that are good at one criterion tend to be good at the other, and many of the more recent evaluation campaigns have moved away from scoring individual systems on these two criteria, and instead either rank systems against each other for general quality (relative ranking) or score each system for general quality (direct assessment).
Since human evaluation comes with considerable monetary and time costs (Castilho et al, 2018), evaluation efforts have in recent years converged on devising automatic metrics, which typically operate by comparing the output of a translation system against one or more human translations. While a number of metrics have been proposed over the last two decades, they are mostly based on statistics computed between the translation hypothesis and one or more references. Procuring reference translations in itself entails some costs, and any metrics and approaches that require multiple references to work well may therefore not be feasible for common use. Further in this section, we discuss the details of some of the dominant evaluation metrics as well as the most well-known shared tasks of multimodal MT that serve as standard evaluation settings to facilitate research.

Metrics
Among the various MT evaluation metrics in the literature, the most commonly used ones are BLEU (Papineni et al, 2001), METEOR (Lavie and Agarwal, 2007; Denkowski and Lavie, 2014) and TER (Snover et al, 2006). To summarise them briefly, BLEU is based on an aggregate precision measure of n-gram matches between the reference(s) and machine translation, and penalises translations that are too short. METEOR accounts for and gives partial credit to stem, synonym, and paraphrase matches, and considers both precision and recall with configurable weights for both criteria. TER is a variant of word-level edit distance between the translation hypothesis and the reference, with an added operation for shifting one or more adjacent words. BLEU is by far the most commonly used automatic evaluation metric, despite its relative simplicity. Most quantitative comparisons of machine translation systems are reported using only BLEU scores. METEOR has been shown to correlate better with human judgements (especially for adequacy) due to both its flexibility in string matching and its better balance between precision and recall, but its dependency on linguistic resources makes it less applicable in the general case. Both BLEU and METEOR, much like the majority of other evaluation metrics developed so far, are reference-based metrics. These metrics are inadvertently heavily biased towards the translation styles that they see in the reference data, and end up penalising any alternative phrasing that might be equally correct (Fomicheva and Specia, 2016).
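To make the description of BLEU more concrete, the following is a minimal sketch of a sentence-level, unsmoothed variant: clipped n-gram precisions for n = 1 to 4 are combined by a geometric mean and multiplied by a brevity penalty. The function names are our own, and the simplifications (sentence-level rather than corpus-level aggregation, whitespace-tokenised input, a score of zero whenever any n-gram order has no match) are assumptions made for brevity; the sketch is not a substitute for the standard scoring tools used in evaluation campaigns.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: List[str],
                  references: List[List[str]],
                  max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    if not hypothesis or not references:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hypothesis, n)
        # For each n-gram, the maximum count observed in any single reference.
        max_ref_counts: Counter = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # Clip hypothesis counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference closest in length (shorter on ties).
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on the mat".split(), [reference]))
print(sentence_bleu("the cat sat on".split(), [reference]))
```

In the second call every n-gram of the hypothesis appears in the reference, so the score is determined entirely by the brevity penalty, which illustrates how BLEU penalises translations that are too short.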
Human evaluation is the optimal choice when a trustworthy measure of translation quality is needed and resources to perform it are available. The usual strategies for human evaluation are fluency and adequacy rankings, direct assessment (DA) (Graham et al, 2013), and post-editing evaluation (PE) (Snover et al, 2006). Fluency and adequacy rankings are conventionally given on a scale of 1 to 5, while DA uses a general scale from 0 to 100 indicating how "good" the translation is, either with respect to the original sentence in the source language (DA-src), or the ground truth translation in the target language (DA-ref). On the other hand, in PE, human annotators are asked to correct translations by changing the words and the ordering as little as possible, and the rest of the evaluation is based on an automatic edit distance measure between the original and post-edited translations, or other metrics such as post-editing time and keystrokes. For pragmatic reasons, these human evaluation methods are typically crowdsourced to non-expert annotators to reduce costs. While this may still result in consistent evaluation scores if multiple crowd annotators are considered, it is a well-accepted fact that professional translators capture more details and are generally better judges than non-expert speakers (Bentivogli et al, 2018).
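The automatic part of post-editing evaluation can be sketched as a word-level edit distance between the system output and its post-edited version. The version below counts insertions, deletions and substitutions only, and normalises by the length of the post-edited text; both the omission of the shift operation used by TER and the choice of normaliser are simplifying assumptions of this sketch, and the function names are our own.

```python
from typing import List


def word_edit_distance(source: List[str], target: List[str]) -> int:
    """Levenshtein distance over words (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, src_word in enumerate(source, start=1):
        current = [i]
        for j, tgt_word in enumerate(target, start=1):
            substitution_cost = 0 if src_word == tgt_word else 1
            current.append(min(
                previous[j] + 1,                      # delete src_word
                current[j - 1] + 1,                   # insert tgt_word
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def post_edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits needed to turn the MT output into its post-edit, per post-edited word."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else float("inf")
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


print(post_edit_rate("a man rides a bike", "a man is riding a bike"))
```

A lower rate indicates that the annotator had to change less of the system output, which is taken as a proxy for higher translation quality.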
The problems recognised even in human evaluation methods substantiate the notion that no metric is perfect. In fact, evaluation methods are an active research subject in their own right (Ma et al, 2018, 2019). However, there is currently little research on developing evaluation approaches specifically tailored to multimodal translation. Fully-automatic evaluation is typically text-based, while methods that go beyond the text rely on manually annotated resources, and could rather be considered semi-automatic. One such method is multimodal lexical translation (MLT), which is a measure of translation accuracy for a set of ambiguous words given their textual context and an associated image that allows visual disambiguation. Even in human evaluation there are only a few examples where the evaluation is multimodal, such as the addition of images in the evaluation of image caption translations via direct assessment (Barrault et al, 2018), or via qualitative comparisons of post-editing. Having consistent methods to evaluate how well translation systems take multimodal data into account would make it possible to identify bottlenecks and facilitate future development. One possible promising direction is the work of Madhyastha et al (2019) for image captioning evaluation, where the content of the image is directly taken into account via the matching of detected objects in the image and concepts in the generated caption.
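As a rough illustration of what a lexical accuracy measure over ambiguous words might look like, the sketch below scores a system by the fraction of test instances in which the output contains one of the translations accepted for the ambiguous source word in its visual context. The data structure, the matching rule (exact token match) and the example words are all assumptions made for illustration; they are not taken from the MLT work itself.

```python
from typing import Iterable, List, NamedTuple, Set


class AmbiguousInstance(NamedTuple):
    """One hypothetical test item: an ambiguous word and a system output."""
    source_word: str                 # ambiguous word in the source caption
    accepted_translations: Set[str]  # translations valid given the image
    hypothesis: List[str]            # tokenised system output


def lexical_translation_accuracy(instances: Iterable[AmbiguousInstance]) -> float:
    """Fraction of instances whose output contains an accepted translation."""
    items = list(instances)
    if not items:
        return 0.0
    hits = sum(
        1 for item in items
        if item.accepted_translations & set(item.hypothesis)
    )
    return hits / len(items)


# Toy data: English "seal" with an image of the animal (German "Robbe"),
# as opposed to the stamp sense ("Siegel").
toy_instances = [
    AmbiguousInstance("seal", {"Robbe"}, "eine Robbe liegt am Strand".split()),
    AmbiguousInstance("seal", {"Robbe"}, "ein Siegel liegt am Strand".split()),
]
print(lexical_translation_accuracy(toy_instances))
```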

Shared tasks
A great deal of research into developing natural language processing systems is carried out in preparation for shared tasks held at academic conferences and workshops, and the relatively new subject of multimodal machine translation is not an exception. These shared tasks lay out a specific experimental setting for which participants submit their own systems, often developed using the training data provided by the campaign. Currently, there are not many datasets encompassing both multiple languages and multiple modalities that are also of sufficiently high quality and large size, and available for research purposes. However, multilingual datasets that augment text with only speech or only images are somewhat less rare than those with videos, given their utility for tasks such as automatic speech recognition and image captioning.
Adding parallel text data in other languages enables such datasets to be used for spoken language translation and image-guided translation, both of which are represented in shared tasks organised by the machine translation community. The Conference on Machine Translation (WMT) ran three shared tasks for image caption translation from 2016 to 2018, and the International Workshop on Spoken Language Translation (IWSLT) has led an annual evaluation campaign on speech translation since 2004.

Image-guided translation: WMT multimodal translation task
The Conference on Machine Translation (WMT) has organised multimodal translation shared tasks annually since the first event in 2016. In the first shared task, the participants were given images and an English caption for each image as input, and were required to generate a translated caption in German. The second shared task had a similar experimental setup, but added French to the list of target languages, and new test sets. The third shared task in 2018 added Czech as a third possible target language, and another new test set. This last task also had a secondary track which only had Czech on the target side, but allowed the use of English, French and German captions together along with the image in a multisource translation setting.
The WMT multimodal translation shared tasks evaluate the performances of submitted systems on several test sets at once, including the Ambiguous COCO test set, which incorporates image captions that contain ambiguous verbs (see Section 4.1). The translations generated by the submitted systems are scored by the METEOR, BLEU, and TER metrics. In addition, all participants are required to devote resources to manually scoring translations in a blind fashion. This scoring is done by direct assessment using the original source captions and the image as references. During the assessment, ground truth translations are shuffled into the outputs from the submissions, and scored just like them. This establishes an approximate reference score for the ground truth, and the submitted systems are analysed in relation to this.

Spoken language translation: IWSLT evaluation campaign
The spoken language translation tasks have been held as part of the annual IWSLT evaluation campaign since Akiba et al (2004). Following the earlier C-STAR evaluations, the aim of the campaign is to investigate newly-developing translation technologies as well as methodologies for evaluating them. The first years of the campaign were based on a basic travel expression corpus developed by C-STAR to facilitate standard evaluation, containing basic tourist utterances (e.g. "Where is the restroom?") and their transcripts. The corpus was eventually extended with more samples (from a few thousand to tens of thousands) and more languages (from Japanese and English, to Arabic, Chinese, French, German, Italian, Korean, and Turkish). Each year also had a new challenge theme, such as robustness of spoken language translation, spontaneous (as opposed to scripted) speech, and dialogue translation, introducing corresponding data sections (e.g. running dialogues) as well as sub-tasks (e.g. translating from noisy ASR output) to facilitate the challenges. Starting with Paul et al (2010), the campaign adopted TED talks as their primary training data, and eventually shifted away from the tourism domain towards lecture transcripts.
Until Cettolo et al (2016), the evaluation campaign had three main tracks: automatic speech recognition, text-based machine translation, and spoken language translation. While these tasks involve different sources and diverging methodologies, they converge on text output. The organisers have made considerable effort to use several automatic metrics at once to evaluate participating systems, and to analyse the outputs from these metrics. Traditionally, there has also been human evaluation of the most successful systems for each track according to the automatic metrics. These assessments have been used to investigate which automatic metrics correlate with which human assessments to what extent, and to pick out and discuss drawbacks in evaluation methodologies.
Additional tasks such as dialogue translation (Cettolo et al, 2017) and low-resource spoken language translation (Niehues et al, 2018) were reintroduced to the IWSLT evaluation campaign from 2017 on, as TED data and machine translation literature both grew richer. Niehues et al (2019) introduced a new audiovisual spoken language translation task, leveraging the How2 corpus (Sanabria et al, 2018). In this task, video is included as an additional input modality, for the general case of subtitling audiovisual content.

Datasets
Text-based machine translation has recently enjoyed widespread success with the adoption of deep learning model architectures. The success of these data-driven systems relies heavily on data availability. An implication of this for multimodal MT is the need for large datasets in order to keep up with the data-driven state-of-the-art methodologies. Unfortunately, due to its simultaneous requirement of multimodality and multilinguality in data, multimodal MT is subject to an especially restrictive bottleneck. Datasets that are sufficiently large for training multimodal MT models are only available for a handful of languages and domain-specific tasks. The limitations imposed by this are increasingly well-recognised, as evidenced by the fact that most major datasets intended for multimodal MT were released relatively recently. Some of these datasets are outlined in Table 1, and explained in more detail in the subsections to follow.

IAPR TC-12
The International Association of Pattern Recognition (IAPR) TC-12 benchmark dataset was created for the cross-language image retrieval track of the CLEF evaluation campaign (ImageCLEF 2006). The benchmark is structurally similar to the multilingual image caption datasets commonly used by contemporary image-guided translation systems. IAPR TC-12 contains 20,000 images from a collection of photos of landmarks taken in various countries, provided by a travel organisation. Each image was originally annotated with German descriptions, and later translated to English. These descriptions are composed of phrases that describe the visual contents of the photo following strict linguistic patterns, as shown in Figure 2. The dataset also contains light annotations such as titles and locations in English, German, and Spanish.

Flickr8k
Released in 2010, the Flickr8k dataset (Rashtchian et al, 2010) has been one of the most widely-used multimodal corpora. Originally intended as a high-quality training corpus for automatic image captioning, the dataset comprises a set of 8,092 images extracted from the Flickr website, each with 5 crowdsourced captions in English that describe the image. Flickr8k has shorter captions compared to IAPR TC-12, focusing on the most salient objects or actions, rather than complete descriptions. As the dataset has been a popular and useful resource, it has been further extended with captions in other languages such as Chinese and Turkish (Unal et al, 2016). However, as these captions were independently crowdsourced, they are not translations of each other, which makes them less effective for MMT.

Table 1 (excerpt, spoken language translation datasets):
Fisher & Callhome (Post et al, 2013): 38h audio; 171k segments; en, es
MSLT (Federmann and Lewis, 2017): 4.5-10h audio; 7k-18k segments; de, en, fr, ja, zh
IWSLT '18 (Niehues et al, 2018): 1,565 audio clips; 171k segments; de, en
LibriSpeech (Kocabiyikoglu et al, 2018): 236h audio; 131k segments; en, fr
MuST-C (Di Gangi et al, 2019a): 385-504h audio; 211k-280k segments; 10 languages
MaSS (Boito et al, 2019): 18.5-23h audio; 8.2k segments; 8 languages

Flickr30k / Multi30k
The Flickr30k dataset (Young et al, 2014) was released in 2014 as a larger dataset following in the footsteps of Flickr8k. Collected using the same crowdsourcing approach for independent captions as its predecessor, Flickr30k contains 31,783 photos depicting common scenes, events, and actions, each annotated with 5 independent English captions. Multi30k was initially released as a bilingual subset of Flickr30k captions, providing German translations for 1 out of the 5 English captions per image, with the aim of stimulating multimodal and multilingual research. In addition, the study collected 5 independent German captions for each image. The WMT multimodal translation tasks later introduced French and Czech extensions to Multi30k, making it a staple dataset for image-guided translation, and further expanding the set's utility to cutting-edge subtasks such as multisource training. An example from this dataset can be seen in Figure 2.

WMT test sets
The past three years of multimodal shared tasks at WMT each came with a designated test set for the task (Barrault et al, 2018). Totalling 3,017 images in the same domain as the Flickr sets (including Multi30k), these sets are too small to be used for training purposes, but could smoothly blend in with the other Flickr sets to expand their size. So far, test sets from the previous shared tasks (each containing roughly 1,000 images with captions) have been allowed for validation and internal evaluation. In parallel with the language expansion of Multi30k, the test set from 2016 contains only English and German captions, and the one from 2017 contains only English, German, and French. The 2018 test set contains English, German, French, and Czech captions that are not publicly available, though systems can be evaluated against it using an online server.

MS COCO Captions
Introduced in 2015, the MS COCO Captions dataset offers caption annotations for a subset of roughly 123,000 images from the large-scale object detection and segmentation training corpus MS COCO (Microsoft Common Objects in Context) (Lin et al, 2014b). Each image in this dataset is associated with up to 5 independently annotated English captions, with a total of 616,767 captions. Though originally a monolingual dataset, the dataset's large size makes it useful for data augmentation methods for image-guided translation, as demonstrated in Grönroos et al (2018). There has also been some effort to add other languages to COCO. A small subset with only 461 captions containing ambiguous verbs was released as a test set for the WMT 2017 multimodal machine translation shared task, called Ambiguous COCO, and is available in all target languages of the task. The YJ Captions dataset (Miyazaki and Shimizu, 2016) and the STAIR Captions dataset (Yoshikawa et al, 2017) comprise, respectively, 132k and 820k crowdsourced Japanese captions for COCO images. However, these are not parallel to the original English captions, as they were independently annotated.

Figure 2: Example captions from IAPR TC-12 and Multi30k.
IAPR TC-12. EN: the courtyard of an orange, two-storey building with a footpath to a swimming pool in the shape of an eight and small palm trees to the left and right; DE: der Innenhof eines zweistöckigen, orangen Gebäudes mit einem Weg zu einem achterförmigen Schwimmbecken und kleine Palmen rechts und links davon;
Multi30k. EN: Mexican women in decorative white dresses perform a dance as part of a parade. DE: Mexikanische Frauen in hübschen weißen Kleidern führen im Rahmen eines Umzugs einen Tanz auf. FR: Les femmes mexicaines en robes blanches décorées dansent dans le cadre d'un défilé. CS: Součástí průvodu jsou mexičanky tančící v bílých ozdobných šatech.

Spoken language translation datasets
The TED corpus
TED is a nonprofit organisation that hosts talks on various topics, comprising a rich resource of spoken language produced by a variety of speakers in English. Video recordings of all TED talks are made available through the TED website, as well as transcripts with translations in up to 116 languages. While the talks comprise a rich resource for language processing, the original transcripts are divided into arbitrary segments formatted like subtitles, which makes it difficult to get an accurate sentence-level parallel segmentation for use in translation systems. While resegmentation is possible with heuristic approaches, it comes with the additional challenge of aligning the new segments to the audiovisual content, and to each other in source and target languages. The Web Inventory of Transcribed and Translated Talks (WIT3) (Cettolo et al, 2012) is a resource with the aim of facilitating the use of the TED Corpus in MT. The initiative distributes transcripts organised in XML files through their website, as well as tools to process them in order to extract parallel sentences. Currently, WIT3 covers 2,086 talks in 109 languages containing anywhere between 3 and 575k segments in raw transcripts, and is continually growing.
Since 2011, the annual speech translation tracks of the IWSLT evaluation campaign (see Section 3.2.2) have used datasets compiled from WIT3. While each of these sets contains a high-quality selection of English transcripts aligned with the audio and the target languages featured each year, they are not useful for training SLT systems due to their small sizes. As part of the 2018 campaign, the organisers released a large-scale English-German corpus (Niehues et al, 2018) containing 1,565 talks with 170,965 segments automatically aligned based on time overlap, which allows end-to-end training of SLT models. The MuST-C dataset (Di Gangi et al, 2019a) is a more recent effort to compile a massively multilingual dataset from TED data, spanning 10 languages (English aligned with Czech, Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish translations). It relies on a rigorous alignment process that yields more reliable timestamps than those of the IWSLT '18 dataset. The dataset contains a large amount of data for each target language, corresponding to a selection of English speech ranging from 385 hours for Portuguese to 504 hours for Spanish.
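Alignment based on time overlap can be illustrated with a small sketch. The text above states only that segments were aligned by time overlap; the greedy rule used here (pair each source segment with the target segment it overlaps most, and drop it if there is no overlap) and the segment representation are assumptions for illustration, not a description of the procedure used to build any of these corpora.

```python
from typing import List, Tuple

# A segment is (start time in seconds, end time in seconds, text).
Segment = Tuple[float, float, str]


def overlap_seconds(a: Segment, b: Segment) -> float:
    """Length of the time interval shared by two segments (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align_by_time_overlap(source: List[Segment],
                          target: List[Segment],
                          min_overlap: float = 0.0) -> List[Tuple[str, str]]:
    """Pair each source segment with its maximally overlapping target segment."""
    pairs = []
    for src in source:
        best = max(target, key=lambda tgt: overlap_seconds(src, tgt), default=None)
        if best is not None and overlap_seconds(src, best) > min_overlap:
            pairs.append((src[2], best[2]))
    return pairs


english = [(0.0, 2.0, "Hello everyone."), (2.0, 5.0, "Thank you for coming.")]
german = [(0.1, 2.1, "Hallo zusammen."), (2.1, 5.2, "Danke, dass Sie gekommen sind.")]
for src_text, tgt_text in align_by_time_overlap(english, german):
    print(src_text, "|||", tgt_text)
```

Raising `min_overlap` trades coverage for precision, since segments whose timestamps only marginally coincide are then discarded rather than paired.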

LibriSpeech
The original LibriSpeech corpus (Panayotov et al, 2015) is a collection of 982 hours of read English speech derived from audiobooks from the LibriVox project, automatically aligned to their text versions available from the Gutenberg project for the purpose of training ASR systems. Kocabiyikoglu et al (2018) augment this dataset for use in training SLT systems by aligning chapters from LibriSpeech with their French equivalents through a multi-stage automatic alignment process. The result is a parallel corpus of spoken English to textual French, consisting of 1,408 chapters from 247 books, totalling 236 hours of English speech and approximately 131k text segments.
MSLT The Microsoft Speech Language Translation (MSLT) corpus (Federmann and Lewis, 2016) consists of bilingual conversations on Skype, together with transcriptions and translations. For each bilingual speaker pair, there is one conversation where the first speaker uses their native language and the second speaker uses English, and another with the roles reversed. The first phase transcripts were annotated for disfluencies, noise and code switching. In a second phase, the transcripts were cleaned, punctuated and recased. The corpus contains 7 to 8 hours of speech for each of English, German, and French. The English speech was translated to both German and French, while German and French speech was translated only to English. Federmann and Lewis (2017) repeat the process with Japanese and Chinese, expanding the dataset with 10 hours of Japanese and 4.5 hours of Chinese speech.

Fisher & Callhome
Post et al (2013) extend the Fisher 6 and Callhome 7 datasets of transcribed Spanish speech, developed by the Linguistic Data Consortium, with English translations. The original Fisher dataset contains about 160 hours of telephone conversations in various dialects of Spanish between strangers, while the Callhome dataset contains 20 hours of telephone conversations between relatives and friends. The translations were collected from non-professional translators on the crowdsourcing platform Mechanical Turk. Fisher & Callhome is distributed with predesignated development and test splits, a part of which contains four reference translations for each transcript segment. The data in the corpus also includes ground truth ASR lattices that facilitate the training of strong specialised ASR models, allowing pipeline SLT studies to focus on the MT component. As the largest SLT corpus available at the time of its release, the Fisher & Callhome corpus has been widely used, and remains relevant for SLT today.

MaSS
The Multilingual corpus of Sentence-aligned Spoken utterances (MaSS) (Boito et al, 2019) is a multilingual corpus of read bible verses and chapter names from the New Testament. It is fully multi-parallel across 8 languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian, and Spanish), comprising 56 language pairs in total. The multi-parallel content makes this dataset suitable for training SLT systems for language pairs not including English, unlike other multilingual datasets such as MuST-C. The data is aligned on the level of verses, rather than sentences. In rare cases, the audio for some verses is missing for some languages. MaSS contains a total of 8,130 eight-way parallel text segments, corresponding to anywhere between 18.5 and 23 hours of speech per language.

Video-guided translation datasets
The QED corpus
The QCRI Educational Domain (QED) Corpus (Guzman et al, 2013; Abdelali et al, 2014), formerly known as the QCRI AMARA Corpus, is a large-scale collection of multilingual video subtitles. The corpus contains publicly available videos scraped from massive open online courses (MOOCs), spanning a wide range of subjects. The latest v1.4 release comprises a selection of 23.1k videos in 20 languages (Arabic, Bulgarian, Traditional and Simplified Chinese, Czech, Danish, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai, and Turkish), subtitled by volunteers in the collaborative Amara environment 8 (Jansen et al, 2014). A sizeable portion of the videos has parallel subtitles in multiple languages, varying in size from 8k segments (for Hindi-Russian) to 335k segments (for English-Spanish). Of these, about 75% of the parallel segments align perfectly in the original data, while the rest were automatically aligned using heuristic algorithms. An alpha v2.0 of the QED corpus is currently underway, scheduled to appear in the OPUS repository (Tiedemann, 2012), containing a large amount of (noisy) re-crawled subtitles.

The How2 dataset
The How2 dataset (Sanabria et al, 2018) is a collection of 79,114 clips with an average length of 90 seconds, containing around 2,000 hours of instructional YouTube videos in English, spanning a variety of topics. The dataset is intended as a resource for several multimodal tasks, such as multimodal ASR, multimodal summarisation, spoken language translation, and video-guided translation. To establish cross-modal associations, the videos in the dataset were annotated with word-level alignments to ground truth English subtitles. There are also English descriptions of each video written by the users who uploaded the videos, added to the dataset as metadata corresponding to video-level summaries. For the purpose of multimodal translation, a 300-hour subset of How2 that covers 22 different topics is available with crowdsourced Portuguese translations. This dataset has also recently been used for multimodal machine translation (Sanabria et al, 2018; Wu et al, 2019b). An example from this dataset can be seen in Figure 3.

Figure 3 (example text). EN: I'm very close to the green but I didn't get it on the green so now I'm in this grass bunker. PT: Eu estou muito perto do green, mas eu não pus a bola no green, então agora estou neste bunker de grama. EN: A person dressed as a teddy bear stands in a bouncy house and then falls over.

The VaTeX dataset
The Video and TeXt (VaTeX) dataset is a bilingual collection of video descriptions, built on a subset of 41,250 video clips from the action classification benchmark DeepMind Kinetics-600 (Kay et al, 2017; Carreira et al, 2018). Each clip runs for about 10 seconds, showing one of 600 human activities. VaTeX adds 10 Chinese and 10 English crowdsourced captions describing each video, half of which are independent annotations, and the other half Chinese-English parallel sentences. With low-approval samples removed, the released version of the dataset contains 206,345 translation pairs in total. VaTeX is intended to facilitate research in multilingual video captioning and video-guided machine translation, and the authors keep a blind test set reserved for use in evaluation campaigns. The rest of the dataset is divided into training (26k videos), validation (3k videos), and public test splits (6k videos). The training and validation splits also have public action labels. An example from VaTeX is shown in Figure 3.

Models and Approaches
This section discusses the state-of-the-art models proposed to solve the multimodal machine translation (MMT) tasks introduced in Section 2. For some MMT tasks, the traditional approach is to build a pipeline that divides the task into several sub-tasks and cascades different modules to handle each of them. For instance, in the case of spoken language translation (SLT), this pipeline would first convert the input speech into text by an automatic speech recognition module (modality conversion), and then redirect the output to a text-based MT module. This is in contrast to end-to-end models, where the source language would be encoded into an intermediate representation, and decoded directly into the target language. Pipeline systems are less vulnerable to training data insufficiency compared to data-driven end-to-end systems, since each component can be pretrained in isolation on abundant sub-task resources. However, they carry the risk of error propagation between stages and ignore cross-modal transfer of implicit semantics. As an example of the latter, consider two languages which emphasise words via prosody and specific word order, respectively. Translating the transcript would make it impossible to reflect the word order in the target sentence, as the semantic correspondence would be lost at the transcription stage. Nevertheless, both pipeline and end-to-end approaches rely heavily on the sequence-to-sequence learning framework on account of its flexibility and good performance across tasks. In the following, we describe this framework in detail.
General purpose sequence-to-sequence learning is inspired by the pioneering works in unimodal neural machine translation (NMT). The state of the art in unimodal MT had been dominated by statistical machine translation (SMT) methodologies (Koehn, 2009) for at least two decades, until the field moved drastically towards NMT techniques around 2015. Inspired by the successful use of deep neural networks in language modelling (Bengio et al, 2003; Mikolov et al, 2010) and automatic speech recognition (Graves et al, 2013), there has been a plethora of NMT studies featuring different neural architectures and learning methods. These architectures often rely on continuous word vector representations to encode various kinds of linguistic information in a common vector space, thereby eliminating the need for hand-crafted linguistic features. One of the first NMT studies by Kalchbrenner and Blunsom (2013) combined recurrent language modelling (Mikolov et al, 2010) and convolutional neural networks (CNN) to improve the performance of SMT systems through rescoring. Later on, the application of recurrent architectures, such as bidirectional RNNs (Schuster and Paliwal, 1997), LSTMs (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005), and GRUs (Chung et al, 2014), introduced further diversity into the field, eventually leading to the fundamental encoder-decoder architecture (Cho et al, 2014; Sutskever et al, 2014). These more advanced neural units were not as susceptible to the problems initially perceived in NMT, dealing naturally with variable-length sequences, and having clear computational advantages as well as superior performance. However, the difficulty of learning long-range dependencies in translation sequences (e.g. grammatical agreement in very long sentences) remained an issue until the introduction of the attention mechanism (Bahdanau et al, 2015).
The attention mechanism addressed this issue by simultaneously learning to align translation units and to translate, supplying a context window with the relevant input units at each decoding step, i.e. for each generated word in the target language (Figure 4). The performance of the NMT systems that followed came close to, and soon surpassed, that of the state-of-the-art SMT systems. Successful non-recurrent alternatives have also been proposed, such as convolutional encoders and decoders with attention (Gehring et al, 2017), and the fully-connected deep transformers which employ the idea of self-attention in addition to the default cross-attention mechanism (Vaswani et al, 2017). The main motivation behind these is to allow for efficient parallel training across multiple processing units, and to prevent learning difficulties such as vanishing gradients.
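To make the notion of a per-step context concrete, the following is a minimal sketch of one attention step in plain Python. It assumes dot-product scoring for brevity, whereas Bahdanau et al (2015) use a small feed-forward scorer; the function names and toy vectors are illustrative and not taken from any cited system.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, context) for a single decoding step.

    Scores are plain dot products between the decoder state (query)
    and each encoder state (key); this is a simplifying assumption.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Three toy encoder states (one per source token) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states, encoder_states)
```

The weights form a soft alignment over source positions, and the context vector is the corresponding weighted average that the decoder consumes when predicting the next target word.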

Image-guided translation
In this section, we present the state-of-the-art models for the image-guided translation (IGT) task. We first discuss the visual feature extraction process, continue with reviews of the two main end-to-end neural approaches, and finally briefly cover retrieval and reranking methods.

Although CNNs are typically trained for image classification or object detection (Russakovsky et al, 2015), it has been shown that the learned representations transfer very well into vision-to-language tasks such as image captioning (Xu et al, 2015). Therefore, the majority of IGT approaches rely on features extracted from state-of-the-art CNNs (Simonyan and Zisserman, 2015; Ioffe and Szegedy, 2015; He et al, 2016) trained for the ImageNet (Deng et al, 2009) image classification task, where the output of the network is a distribution over 1000 object categories. These features usually come in two flavors (Figure 5): (i) spatial features, which are feature maps $V \in \mathbb{R}^{W \times H \times C}$ extracted from specific convolutional layers, and (ii) a pooled feature vector $v \in \mathbb{R}^{C}$, which is the outcome of applying a projection or pooling layer on top of spatial features. The main difference between these features is that the former is dense and preserves spatial information, while the latter is a compact, spatially-unaware representation. An even more compact representation is to use the posterior class probabilities ($v \in \mathbb{R}^{K}$) extracted from the output layer of a pretrained CNN, with $K$ denoting the size of the task-specific label set (for ImageNet, $K$ is 1000). Finally, it is also possible to obtain a set of pooled feature vectors (or local features) from salient regions of a given image, with regions predicted by object detection CNNs (Girshick et al, 2014).
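The relation between the two feature flavors can be sketched as follows. This is only an illustration on a toy nested-list feature map; it assumes global average pooling as the pooling operation, which is one common choice among the projection or pooling layers mentioned above, and the function names are hypothetical.

```python
def flatten_spatial(feature_map):
    """Turn a W x H x C map into a list of W*H local C-dimensional vectors."""
    return [cell for column in feature_map for cell in column]

def global_average_pool(feature_map):
    """Collapse a W x H x C map into one C-dimensional pooled vector."""
    positions = flatten_spatial(feature_map)
    channels = len(positions[0])
    return [sum(p[c] for p in positions) / len(positions) for c in range(channels)]

# Toy 2 x 2 x 3 map standing in for the output of a convolutional layer.
V = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]],
]
regions = flatten_spatial(V)   # spatial features: 4 vectors of size 3
v = global_average_pool(V)     # pooled feature: a single vector of size 3
```

The flattened list keeps one vector per image location, which is what a visual attention layer would attend over, while the pooled vector discards the location of each activation.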

Sequence-to-sequence grounding with pooled features
The simplest and the most intuitive way of visually conditioning a sequence-to-sequence model is to employ pooled features in a way that they will interact with various components of the architecture. These approaches are mostly inspired by the early works in neural image captioning (Kiros et al, 2014;Mao et al, 2015;Vinyals et al, 2015), and are categorised in Figure 6 with respect to their entry points.
The very first attempt for neural image-guided translation comes from Elliott et al (2015), where they formulate the problem as a semantic transfer from a source language model to a target language model, within an encoder-decoder framework without attention. They propose to initialise the hidden state(s) of the source language model (LM), the target LM, or both, using pretrained VGG features (Simonyan and Zisserman, 2015). Later initialisation variants are applied to attentive NMTs: Calixto et al (2016) and Libovický et al (2016) experiment with recurrent decoder initialisation while Ma et al (2017) initialise both the encoder and the decoder, with features from a state-of-the-art ResNet (He et al, 2016). Madhyastha et al (2017) explore the expressiveness of the posterior probability vector as a visual representation, rather than the pooled features from the penultimate layer of a CNN.
Huang et al (2016) take a different approach and enrich the source sentence representation with visual information by projecting the feature vector into the source language embedding space and then adding it to the beginning or the end of the embedding sequence. This allows the attention mechanism in the decoder to attend to a mixed-modality source representation instead of a purely textual one. Instead of the conventional ImageNet-extracted features, they make use of local features from RCNN (Girshick et al, 2014) to represent explicit visual semantics related to salient objects. In another model referred to as Parallel-RCNN, they build five different source embedding sequences, each being enriched with a visual feature vector extracted from a different salient region of the image. A shared LSTM encodes these five sequences and average pools them to end up with the final source representation. The idea of source enrichment has since been revisited and extended by simultaneously appending and prepending the projected visual features to the embedding sequence, and by combining it with encoder and/or decoder initialisation. Caglayan et al (2017a) explore different source and target interaction methods such as the element-wise multiplication between the visual features and the source/target word embeddings. Delbrouck and Dupont (2018) add another recurrent layer within the decoder in their DeepGRU model, conditioned on the visual features and the bottom layer hidden state. Both recurrent layers simultaneously decide on the output probability distribution by additively fusing their respective unnormalised logits.
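As a rough illustration of source enrichment, the sketch below projects a pooled visual vector into the word embedding space and places it at the beginning or the end of the embedding sequence. The projection matrix would be learned in practice; here it is a fixed toy matrix, and all names and values are hypothetical.

```python
def project(vector, weight):
    """Linear projection; weight holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def enrich_source(embeddings, visual_feature, weight, position="prepend"):
    """Add the projected visual vector as an extra pseudo-token."""
    visual_token = project(visual_feature, weight)
    if position == "prepend":
        return [visual_token] + embeddings
    if position == "append":
        return embeddings + [visual_token]
    raise ValueError("position must be 'prepend' or 'append'")

# Three source tokens with 2-dimensional embeddings (toy values).
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# A 4-dimensional pooled visual feature and a 2 x 4 projection (toy values).
visual_feature = [2.0, 1.0, 0.0, 3.0]
weight = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

prepended = enrich_source(embeddings, visual_feature, weight)
appended = enrich_source(embeddings, visual_feature, weight, position="append")
```

Because the visual vector becomes one more position in the source sequence, the encoder and the decoder attention can treat it like any other token without architectural changes.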
As for transformer-based architectures, Grönroos et al (2018) revisit the source enrichment by adding the visual feature vector to the beginning of the embedding sequence (Huang et al, 2016). They also experiment with modulating the output probability distribution through a time-dependent visual decoder gate. More interestingly, they explore different pooled visual representations such as scene-type associations (Xiao et al, 2010), action-type associations (Yao et al, 2011), and object features from Mask R-CNN (He et al, 2017).
Multi-task learning. Training an end-to-end neural model to perform multiple tasks at once can improve the model's task-specific performance by forcing it to exploit commonalities across the tasks involved (Caruana, 1997; Dong et al, 2015; Luong et al, 2015). The Imagination architecture, initially proposed by Elliott and Kádár (2017) and later integrated into transformer-based NMTs by Helcl et al (2018b), attempts to leverage the benefits of multi-tasking by proposing a one-to-many framework which shares the sentence encoder between the translation task and an auxiliary visual reconstruction task. Besides the usual cross-entropy translation objective, the model weights are also optimised through a margin-based loss which minimises the distance between the ground-truth visual feature vector and the one predicted from the sentence encoding. The visual features are only used at training time and are not needed when generating translations. Zhou et al (2018) further extend the Imagination network by incorporating an attention 9 over source sentence encodings, with the query vector being the visual features. In this approach, the auxiliary margin-based loss is modified so that the output of the attention layer is considered a reconstruction of the pooled feature vector.
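A margin-based objective of this kind can be sketched as follows. This is one common max-margin formulation based on cosine similarity with contrastive (mismatched) image vectors; the exact loss used in the cited works may differ, and the margin value and function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def margin_loss(predicted, target, contrastive, margin=0.1):
    """Penalise contrastive images that score within `margin` of the true one."""
    positive = cosine(predicted, target)
    return sum(
        max(0.0, margin - positive + cosine(predicted, negative))
        for negative in contrastive
    )

# Prediction matches the true image vector: no penalty.
loss_good = margin_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Prediction matches a mismatched image instead: large penalty.
loss_bad = margin_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

In a multi-task setup, a term like this would be added to the translation cross-entropy during training only, which is consistent with the images not being required at translation time.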
Other approaches. All grounding approaches covered so far rely on the maximum-likelihood estimation (MLE) principle for the sequence transduction task, i.e. they try to maximise the log-probability of target sentences given the source sentences. Zheng et al (2018) extend MLE with a fine-tuning step, where they use reinforcement learning to find the model parameters which directly maximise the translation metric BLEU. In terms of multimodality, they simply initialise the decoder with pooled features. Toyama et al (2016), Calixto et al (2018) and Delbrouck and Dupont (2019) cast the problem as a latent variable model and resort to techniques such as variational inference and generative adversarial networks (GANs). Finally, Nakayama and Nishida (2017) approach the problem from a zero-resource perspective: they encode {source caption, image} pairs into a multimodal vector space using a max-margin loss. In a second step, they train the decoder using {target caption, image} pairs. Specifically, they do a forward pass with the image as input and obtain the multimodal embedding, from which the recurrent decoder is trained to generate the target caption as usual. The image encoder is a pretrained VGG CNN. The zero-resource aspect comes from the fact that the sets of pairs do not overlap, i.e. the approach does not require a parallel IGT corpus.

Inspired by the previous success of visual attention in image captioning, attentive approaches explore how to efficiently integrate a visual attention (approach A in Figure 6) over the spatial features, alongside the language attention in NMTs. The most interesting research questions about visual attention are as follows: where to apply the visual attention, what kind of parameter sharing should be preferred, and how to fuse the output of language and visual attention layers. Caglayan et al (2016a) and Calixto et al (2016) are the first works to tackle these questions, through a visual attention which uses the hidden state of the decoder as query into the set of W × H spatial features. Their implementation is quite similar to the language attention, which results in two modality-specific contexts that should be fused before the output layer of the network. One notable difference is that Caglayan et al (2016a) experiment with a single multimodal attention layer shared across modalities while Calixto et al (2016) keep the attention layers separate. Later on, Caglayan et al (2016b) evaluate both shared and separate attentions with additive and concatenative fusion, and discover that proper feature normalisation is crucial for their recurrent approaches. Delbrouck and Dupont (2017a) propose a different fusion operation based on compact bilinear pooling (Fukui et al, 2016), to efficiently realise the computationally expensive outer product. Unlike additive and concatenative fusions, the outer product ensures that each dimension of the language context vector interacts with each dimension of the visual context vector and vice versa. Follow-up studies extend the decoder-based visual attention approach in different ways: one line of work reimplements the gating mechanism to rescale the magnitude of the visual information before the fusion, while Libovický and Helcl (2017) introduce the hierarchical attention, which replaces the concatenative fusion with a new attention layer that dynamically weighs the modality-specific context vectors. Finally, Arslan et al (2018) and Libovický et al (2018) introduce the same idea into Transformer-based (Vaswani et al, 2017) architectures. Besides revisiting the hierarchical attention, Libovický et al (2018) also introduce parallel and serial variants. The former is quite similar to Arslan et al (2018) and simply performs additive fusion, while the latter first applies the language attention, which produces the query vector for the subsequent visual attention. Ive et al (2019) extend Libovický et al (2018) to add a 2-stage decoding process where visual features are only used in the second stage, through a visual cross-modal attention. They also experiment with another model where the attention is applied over the embeddings of object labels detected from the images.
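The fusion strategies for the two modality-specific context vectors can be contrasted in a short sketch. The hierarchical variant below is simplified: it scores each context with a dot product against the decoder state, whereas the cited models use learned projections. All names and values are illustrative assumptions.

```python
import math

def additive_fusion(lang_ctx, vis_ctx):
    """Element-wise sum; both contexts must share the same size."""
    return [l + v for l, v in zip(lang_ctx, vis_ctx)]

def concat_fusion(lang_ctx, vis_ctx):
    """Concatenation; the output size is the sum of both sizes."""
    return lang_ctx + vis_ctx

def hierarchical_fusion(lang_ctx, vis_ctx, decoder_state):
    """Second attention over the two contexts (simplified dot-product scoring)."""
    contexts = [lang_ctx, vis_ctx]
    scores = [sum(c * s for c, s in zip(ctx, decoder_state)) for ctx in contexts]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * ctx[d] for w, ctx in zip(weights, contexts))
            for d in range(len(lang_ctx))]

lang_ctx, vis_ctx = [2.0, 0.0], [2.0, 4.0]
decoder_state = [1.0, 0.0]

added = additive_fusion(lang_ctx, vis_ctx)
joined = concat_fusion(lang_ctx, vis_ctx)
weighted = hierarchical_fusion(lang_ctx, vis_ctx, decoder_state)
```

With these toy values both contexts receive the same score, so the hierarchical output is their plain average; in general the weights change at every decoding step, which is what lets the model decide how much to rely on each modality.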
In contrast to the decoder-based visual attention, encoder-based approaches are relatively less explored. To that end, Delbrouck and Dupont (2017b) propose conditional batch normalisation, a technique to modulate the batch normalisation layer (Ioffe and Szegedy, 2015) of ResNet. Specifically, they condition the mean and the variance of the batch normalisation layer on the source sentence representation for informed feature extraction. In the same work, Delbrouck and Dupont (2017b) also propose to apply an early visual attention inside the encoder, to yield inherently multimodal source encodings, on top of which the usual language attention would be applied by the decoder.

Reranking and Retrieval based approaches
The most typical pipeline for MT is to obtain an n-best list of translation candidates from an arbitrary MT system and select the best candidate amongst them after reranking with respect to an aggregated score. This score is often a combination of several models that are able to quantitatively assess translation-related qualities of a candidate sentence, such as adequacy or fluency. Each model is assigned a coefficient and an optimisation step is executed to find the best set of coefficients that maximise the translation performance on a held-out test set (Och, 2003). The challenge for the IGT task is notably how to incorporate the visual modality into this pipeline in order to assign a better rank to visually plausible translations. To this end, Caglayan et al (2016a) combine a feed-forward language model (Bengio et al, 2003; Schwenk et al, 2006) and a recurrent NMT to rerank the translation candidates obtained from an SMT system. The language model is special in the sense that it is not only conditioned on n-gram contexts but also on the pooled visual feature vector. In contrast, Shah et al (2016) conjecture that the posterior class probabilities may be more expressive than a pooled representation for reranking, and treat each probability $v_i$ as an independent score for which a coefficient is learned. In a recent work, Lala et al (2018) demonstrate that for the Multi30k dataset, better translations are available inside an n-best list obtained from a text-only NMT model, which allows up to 10 points absolute improvement in METEOR score. They propose the multimodal lexical translation (MLT) model where they rerank the n-best list with scores assigned by a multimodal word sense disambiguation system based on pooled features.
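The reranking step described above can be sketched as a weighted sum of per-model scores. The candidate strings, score names, score values, and coefficients below are all hypothetical; in practice the coefficients would be tuned on held-out data rather than set by hand.

```python
def rerank(nbest, coefficients):
    """Sort candidates by the weighted sum of their model scores (best first)."""
    def aggregated(candidate):
        return sum(coefficients[name] * score
                   for name, score in candidate["scores"].items())
    return sorted(nbest, key=aggregated, reverse=True)

# Toy 2-best list with log-probability style scores (higher is better).
nbest = [
    {"text": "candidate A", "scores": {"mt": -1.0, "visual": -3.0}},
    {"text": "candidate B", "scores": {"mt": -1.2, "visual": -0.5}},
]

text_only = rerank(nbest, {"mt": 1.0, "visual": 0.0})
multimodal = rerank(nbest, {"mt": 1.0, "visual": 0.5})
best_text_only = text_only[0]["text"]
best_multimodal = multimodal[0]["text"]
```

In this toy case the text-only ranking prefers candidate A, while giving the visual score a non-zero coefficient promotes candidate B, which is the effect that multimodal rerankers aim for.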
Another line of work considers the task as a joint retrieval and reranking problem. Hitschler et al (2016) construct a multimodal/cross-lingual retrieval pipeline to rerank SMT translation candidates. Specifically, they leverage a large corpus of target {caption, image} pairs, and retrieve a set of pairs similar to the translation candidates and the associated image. The visual similarity is computed using the Euclidean distance in the pooled CNN feature space. The initial translation candidates are then reranked with respect to their relevance, based on inverse document frequency, to the retrieved captions. Zhang et al (2017) also employ a combined framework of retrieval and reranking. For a given {caption, image} pair, they first retrieve a set of similar training images. The target captions associated with these images are considered as candidate translations. They learn a multimodal word alignment between source and candidate words and select the most probable target word for each source word. An n-best list from their SMT is reranked using a bidirectional NMT trained on the aforementioned source/target word sequences. Finally, Duselis et al (2017) and Gwinnup et al (2018) propose a pure retrieval system without any reranking involved. For a given image, they first obtain a set of candidate captions from a pretrained image captioning system. Two distinct neural encoders are used to encode the source and the candidate captions, respectively. A mapping is then learned from the hidden space of the source encoder to the target one, allowing the retrieval of the candidate caption which minimises the distance with respect to the source caption representation.

Table 2: Automatic scores of state-of-the-art IGT methods on Multi30k English→German test2016: the table is clustered (and sorted by METEOR) across years for constrained systems, followed by unconstrained ones. Systems marked with (†) are re-evaluated with tokenised sentences, while a second marker denotes the use of visual features other than ImageNet CNNs. The gains and losses are with respect to the MT baselines reported in the papers. The types refer to Figure 6.

Comparison of approaches
Table 2 presents BLEU and METEOR scores on the English→German test2016 set of the Multi30k dataset, as this is the test set that most studies report against. When possible, we annotate each score with the associated gain or loss with respect to the underlying unimodal MT baseline reported in the respective papers. The results concentrate around constrained systems, which only allow the use of the parallel Multi30k corpus during training. A few studies experiment with using external resources (Helcl and Libovický, 2017; Elliott and Kádár, 2017; Grönroos et al, 2018) for pretraining the MT system and then fine-tuning it on Multi30k, or directly training the system on the combination of Multi30k and the external resource. Two such unconstrained systems are also reported. At first glance, the automatic results reveal that (i) initially, neural systems were not able to surpass the SMT systems, (ii) the use of external resources is beneficial to boost the underlying baseline performance, which further manifests itself as a boost in the multimodal scores, and (iii) careful tuning allows RNN-based models to reach and even surpass Transformer-based models. From a multimodal perspective, the results are not very conclusive as there does not seem to be a single architecture, feature type or integration type that brings consistent improvements. Elliott (2018) attempted to answer the question of how efficiently state-of-the-art models were integrating information from the visual modality, and concluded that when models were adversarially challenged with wrong images at test time, the quality of the produced translations was not affected as much as one would expect.
Later on, it was shown that these seemingly insensitive architectures start to rely significantly on the visual modality once words are systematically removed from source sentences during training and test. We believe that this latter finding may also be connected to the fact that better baselines benefit less from the visual modality (Table 2), i.e. sub-optimal architectures may gain more from the visual information than well-trained NMT models. In fact, even the choice of vocabulary size may simulate systematic word removal, if a significant portion of the source vocabulary is mapped to unknown tokens. The same experimental pipeline also paved the way for assessing the particular strengths of some of the covered IGT approaches, and showed that the use of spatial features through visual attention is superior to initialising the encoders and the decoders using pooled features.

Lastly, if we take a look at the human evaluation rankings conducted throughout the WMT shared tasks, we see that the top three ranks for English→German and English→French are occupied by two unconstrained ensembles (Grönroos et al, 2018; Helcl et al, 2018b), the MLT Reranking and the DeepGRU (Delbrouck and Dupont, 2018) systems in 2018. In 2017, the multiplicative interaction (Caglayan et al, 2017a), unimodal NMT reranking, unconstrained Imagination (Elliott and Kádár, 2017), encoder enrichment and hierarchical attention were ranked as top three, again for both language pairs.

Spoken language translation
In spoken language translation, the non-text modality is the source language audio, which is translated into target language text. While source language transcripts may be available for training, at translation time the speech is typically the only input modality. We begin this section with a brief introduction to speech-specific feature extraction (Section 5.2.1). Section 5.2.2 reviews the current state of the art for the traditional pipeline methods and finally, Section 5.2.3 covers the end-to-end methods which saw a rapid development in recent years.

Feature extraction
Even though many deep learning applications use raw input data, it is still common to use somewhat engineered features in speech applications. The raw audio waveform consists of thousands of samples per second, and thus one-sample-at-a-time processing would be computationally very expensive. Instead, a spectrogram representation is computed. It shows the signal activity at different frequencies, as a function of time. The frequency content is computed over frames of suitable length. The frame length trades off time and frequency precision: longer frames capture finer spectral (i.e. frequency) detail, but also describe a longer segment of time, which can be problematic as certain speech events (e.g. the stop consonants p, t) can have a very short duration.
Next, a Mel-scale filterbank is applied to each frame, and the logarithm of each filter's output is computed. This leads to log Mel-filterbank features. The filterbank operation reduces the number of dimensions. However, these operations are also perceptually motivated: the filterbank by the masking of frequencies close to each other in the ear, the Mel-scale as it relates frequency to perceived pitch, and the logarithm by the relation of perceived loudness to signal activity (Pulkki and Karjalainen, 2015).
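The frame-length trade-off and the Mel mapping can be illustrated with a few lines of arithmetic. The sketch assumes a 16 kHz sampling rate and uses one widely used Mel-scale formula; neither is stated in the text above, and other Mel variants exist.

```python
import math

def frame_properties(sample_rate_hz, frame_ms):
    """Return (samples per frame, spacing of frequency bins in Hz)."""
    samples = int(sample_rate_hz * frame_ms / 1000)
    bin_spacing_hz = sample_rate_hz / samples
    return samples, bin_spacing_hz

def hz_to_mel(frequency_hz):
    """A commonly used Mel-scale formula (assumption; variants exist)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

long_frame = frame_properties(16000, 25)    # longer frame: finer frequency detail
short_frame = frame_properties(16000, 10)   # shorter frame: finer time detail

# The Mel scale compresses high frequencies: the same 1 kHz step
# spans fewer Mel units at the top of the range than at the bottom.
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

Under these assumptions a 25 ms frame holds 400 samples with bins 40 Hz apart, while a 10 ms frame holds 160 samples with bins 100 Hz apart, which shows why short events such as stop consonants favour shorter frames at the cost of spectral detail.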
Continued efforts in learning deep representations from raw samples exist, with some success (Sainath et al, 2015). However, log Mel-filterbank vectors as input to deep neural network models (Mohamed et al, 2012) remain the standard choice. Additional, more complex features may be used to aid robustness to speaker variability (Saon et al, 2013) or recognition in tonal languages (Ghahremani et al, 2014).

State of the art in pipeline methods
Pipeline approaches in SLT chain together separate ASR and MT modules, and these naturally follow progress in their respective fields. A popular ASR system architecture is an HMM-DNN hybrid acoustic model (Yu and Li, 2017), followed by an n-gram language model in the first decoding pass, and a neural language model for rescoring. This type of HMM-based ASR is essentially pipeline ASR. In addition to pipeline ASR, end-to-end ASR methods have recently gained popularity. In particular, encoder-decoder architectures with attention have been successful, although on standard publicly available datasets HMM-based models still narrowly outperform end-to-end ones (Lüscher et al, 2019). Chiu et al (2018) show that encoder-decoder with attention ASR can outperform HMM-based models on a very large (12,500 h) proprietary dataset. Another common end-to-end ASR method is Connectionist Temporal Classification (CTC) (e.g. Li et al (2019)).

Table 3: SLT formulated as Bayesian search, for translation $y$, source language transcript $z$, source language speech $x$, and set of all possible transcripts $Z$.
End-to-end search: $\arg\max_{y} P(y \mid x)$
General pipeline search: $\arg\max_{y} \sum_{z \in Z(x)} P(y \mid z) P(z \mid x)$
Loosely coupled pipeline: $Z(x) \subset Z$
Tightly coupled pipeline: $Z(x) = Z$

Wang et al (2018c) and Liu et al (2018) place first and second, respectively, in the IWSLT 2018 evaluation campaign. Both apply similar pipeline architectures: a system combination of multiple different HMM-DNN acoustic models and LSTM rescoring for ASR, followed by a system combination of multiple Transformer NMT models for translation. Liu et al (2018) additionally use an encoder-decoder with attention ASR to improve the system combination ASR results, although individually the end-to-end model is clearly outperformed by the HMM-DNN models. Wang et al (2018c) use an additional target-to-source NMT system for rescoring to improve adequacy. The systems also differ in interfacing strategies between ASR and MT.
In the latest IWSLT evaluation campaign in 2019, end-to-end SLT models were encouraged. However, the best performance was still achieved with a pipeline SLT approach, where Pham et al (2019) use end-to-end ASR and a Transformer NMT model. In the ASR module, an LSTM-based approach outperforms a Transformer model, though combining both in an ensemble proved beneficial. Weiss et al (2017) and Pino et al (2019) also report competitive results using end-to-end ASR, with Pino et al (2019) surpassing the state-of-the-art in SLT. End-to-end ASR has attracted attention in SLT, because it allows for parameter transfer in end-to-end SLT (e.g. Bérard et al (2018), and Figure 8).
Challenges in pipeline SLT. Research in pipeline SLT has specifically focused on the interface between ASR and MT. There is a clear mismatch between MT training data and ASR output, caused by the ASR noise characteristics (i.e. transcription errors), and by the dissimilarity of ASR output to written text, which is due to the lack of capitalisation and punctuation and to the disfluencies (e.g. repetitions and hesitations) that naturally occur in speech. Federico (2014, 2015); Ruiz et al (2017) quantify the effect of ASR errors on MT. In a linear mixed-effects model, the amount of WER added on top of gold standard transcripts has a direct effect on TER increase. The results do not vary over different ASR systems. Minor localised ASR errors can result in longer distance errors or duplication of content words in NMT. Homophonic substitution error spans (e.g. anatomy → and that to me) are shown to account for a significant portion of ASR errors and to have a large impact on translation quality. With regard to noise robustness, it is noted that the utterances which were best translated by phrase-based MT had higher average WER than utterances which were best translated by NMT. In general, NMT has been established as particularly sensitive to noisy inputs (Belinkov and Bisk, 2018;Cheng et al, 2018).
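Since the analysis above is expressed in terms of WER, a short sketch of how WER is computed may be helpful. The function below is a plain word-level edit distance normalised by the reference length. It is an illustrative implementation with a hypothetical function name, not the scoring tool used in the cited studies, and the example sentence around the "anatomy" error is invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A homophonic substitution span: one reference word becomes four ASR words.
print(word_error_rate("the anatomy of the ear",
                      "the and that to me of the ear"))  # prints 0.8
```

In this toy case a single misrecognised word costs one substitution and three insertions, which illustrates why such error spans weigh heavily both in WER and in the downstream translation.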
One approach to address the mismatch is training the MT system on noisy, ASR-like input. Peitz et al (2012) use an additional phrase-table trained on ASR-outputs on the SLT corpus. Tsvetkov et al (2014) augment a phrase-table with plausible ASR misrecognitions. These errors are synthesised by mapping each phrase to phones via a pronunciation dictionary, and randomly applying heuristic phone-level edit operations. Sperber et al (2017b) first train an NMT system on reference transcripts, and then fine-tune on noisy transcripts. The noise is sampled from a uniform distribution over insertions, deletions or substitutions, with optional unigram weighting for the substitutions and insertions. Additionally, a deletion-only noise is used. Smaller amounts of noise are shown to improve SLT results, but increasing noise levels to actual test-time ASR levels (rather high, at 40%) only degrades performance. Increased noise is noted to produce shorter outputs, which in turn are punished by the BLEU brevity penalty. A precision-recall tradeoff is observed: the system could either drop uncertain inputs (better precision) or try to guess translations (better recall). Fine-tuning with deletion-only noise biases the system to produce longer outputs, which is shown to counteract the effect of noisy inputs producing shorter outputs. Pham et al (2019) use the data augmentation method SwitchOut (Wang et al, 2018b), to make their NMT models more robust to ASR errors. During training, SwitchOut randomly replaces words in both the source and the target sentences.
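The kind of synthetic noise described above can be sketched as follows. This is a simplified illustration, not the exact procedure of Sperber et al (2017b): the function name, the per-word noise decision, and the uniform sampling of replacement words from a small vocabulary are assumptions made for the example, and the optional unigram weighting is omitted.

```python
import random


def add_asr_like_noise(words, vocab, noise_rate, rng, deletion_only=False):
    """Return a noisy copy of `words`.

    Each word is corrupted with probability `noise_rate`. The edit type is
    drawn uniformly from insertion, deletion and substitution, or is always
    a deletion when `deletion_only` is set.
    """
    noisy = []
    for word in words:
        if rng.random() >= noise_rate:
            noisy.append(word)              # keep the word unchanged
            continue
        if deletion_only:
            op = "delete"
        else:
            op = rng.choice(("insert", "delete", "substitute"))
        if op == "delete":
            continue                        # drop the word
        if op == "substitute":
            noisy.append(rng.choice(vocab)) # replace the word
        else:
            noisy.append(word)              # keep the word and insert
            noisy.append(rng.choice(vocab)) # a random word after it
    return noisy


rng = random.Random(0)
transcript = "this is a clean reference transcript".split()
vocab = ["uh", "the", "of", "and"]          # hypothetical toy vocabulary
print(add_asr_like_noise(transcript, vocab, noise_rate=0.2, rng=rng))
print(add_asr_like_noise(transcript, vocab, noise_rate=0.2, rng=rng,
                         deletion_only=True))
```

With deletion-only noise, the noisy source is never longer than the clean one while the target side stays unchanged, which is consistent with the observation that this setting biases the system towards longer outputs.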
Another approach to cope with the mismatch is to transform the ASR output into written text. Wang et al (2018c) apply a Transformer-based punctuation restoration and heuristic rules which remove disfluencies and transform written-out numbers and quantities into numerals. Liu et al (2018) experiment with NMT-based transformations in both directions: producing ASR-like text from written text for training the translation system, or producing written text from ASR-like text as a test-time bridge between ASR and translation. Transforming the MT training data into an ASR-like format consistently outperforms inverse normalization of ASR output, though both are beneficial in the final system combination.
Long audio streams typically need to be segmented into manageable length pieces using voice activity detection (Ramirez et al, 2007), or more elaborate speaker diarisation methods (Anguera et al, 2012). These methods may not produce clean sentence boundaries. This is a clear problem in MT, as the boundaries can cut between actual sentences. Liu et al (2018) alleviate the problem by applying an LSTM-based resegmenter after the ASR system. Pham et al (2019) combine resegmentation, and casing and punctuation restoration into a single ASR post-processing task, and apply an NMT model.
Coupling between ASR and MT. The SLT search is often described in Bayesian terms as shown in Table 3. Generally, pipeline search is based on the assumption that $P(y|z, x) = P(y|z)$, i.e. given the source language transcript, the translation does not depend on the speech. It is still possible to take the uncertainty of the transcription into account under this conditional independence assumption, but it rules out the use of paralinguistic cues, e.g. prosody. In pure serial pipeline search, first the 1-best ASR result is decoded, then only this 1-best result is translated. The hard choice in 1-best decoding is especially susceptible to error propagation. Early work in SLT found consistent improvements with loosely coupled search, where a rich representation carrying the ASR uncertainty, such as an N-best list or word lattice, is used in translation. Tightly coupled search, i.e. joint decoding, is also possible, although the application is limited by excessive computational demands. In tightly coupled search, the translation model would also influence which ASR hypotheses were searched further. This was done by representing both the ASR and the phrase-based MT search spaces as Weighted Finite State Transducers (WFST) (Matusov et al, 2006;Zhou, 2013).
Osamura et al (2018) implement a type of loose coupling by using the softmax posterior distribution from the ASR module as the input for NMT. Loose coupling via using lattices as input in NMT is not straightforward. Sperber et al (2017a) implement LatticeLSTM for lattice inputs in RNN-based NMT, and find that preserving the uncertainty in the ASR output is beneficial for SLT. Zhang et al (2019) further propose a Transformer model which can use lattice inputs, and find that it outperforms both a standard Transformer and a LatticeLSTM baseline in an SLT task. However, tight coupling of NMT and ASR has not been proposed in pipeline SLT.
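The difference between serial 1-best search and loosely coupled search over an N-best list can be illustrated with a toy example. The sketch below assumes that both the ASR posterior over transcripts and the MT distribution over translations are available as small dictionaries. All probabilities, sentences and function names are invented for illustration, and a real system would search over far larger hypothesis spaces.

```python
from collections import defaultdict


def one_best_search(asr_posterior, mt_model):
    """Serial pipeline: translate only the single most probable transcript."""
    best_z = max(asr_posterior, key=asr_posterior.get)
    return max(mt_model[best_z], key=mt_model[best_z].get)


def n_best_search(asr_posterior, mt_model, n):
    """Loosely coupled pipeline: sum P(y|z) P(z|x) over the n best transcripts."""
    n_best = sorted(asr_posterior, key=asr_posterior.get, reverse=True)[:n]
    scores = defaultdict(float)
    for z in n_best:
        for y, p_y_given_z in mt_model[z].items():
            scores[y] += p_y_given_z * asr_posterior[z]
    return max(scores, key=scores.get)


# Hypothetical P(z|x): ASR posterior over transcripts of one utterance.
asr_posterior = {"i scream": 0.40, "ice cream": 0.35, "ice creams": 0.25}
# Hypothetical P(y|z): MT distribution over translations of each transcript.
mt_model = {
    "i scream": {"je crie": 0.90, "glace": 0.10},
    "ice cream": {"glace": 0.95, "je crie": 0.05},
    "ice creams": {"glace": 0.70, "glaces": 0.30},
}

print(one_best_search(asr_posterior, mt_model))     # prints "je crie"
print(n_best_search(asr_posterior, mt_model, n=3))  # prints "glace"
```

In this constructed case the most probable transcript is wrong, and the 1-best pipeline propagates the error. Summing over the three-best list lets two lower-ranked transcripts that agree on a translation outweigh it, which is the intuition behind preserving ASR uncertainty in the interface.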
In addition to coupled decoding, end-to-end SLT leverages coupled training. This can avoid suboptimization; for phrase-based MT and HMM-GMM ASR, He et al (2011) show how optimizing the ASR component purely for WER can produce worse results in SLT. He and Deng (2013) foreshadow end-to-end neural SLT systems, proposing a joint, end-to-end optimization procedure for a pipeline of HMM-GMM ASR and phrase-based MT. In the proposed approach, the ASR and MT components are first trained separately, and then the whole pipeline is jointly optimized for sentence-level BLEU, by iteratively sampling sets of competing hypotheses from the pipeline and updating the parameters of the submodels discriminatively.

End-to-end spoken language translation
The first attempts to use end-to-end methods for SLT were published in 2016. This period saw experimentation with a wide variety of approaches, before research focus converged on sequence-to-sequence architectures. These early methods (Duong et al, 2016;Anastasopoulos et al, 2016;Bansal et al, 2017) were able to align source language audio to target language text, but they were not able to perform translation. The first true end-to-end SLT system is presented by Bérard et al (2016). Still a proof-of-concept, it was trained on BTEC French→English with synthetic audio containing a small number of speakers. Figure 7 shows the different types of training data applicable for SLT. The standard learning setup for end-to-end SLT is only able to train from untranscribed SLT data. The task is very challenging, as data of this type is scarce, and the representation gap between source audio and target text is large. The source transcript is useful as an intermediary representation, a stepping stone to divide the gap into two smaller ones: modality conversion and translation. Many learning setups (see Figure 8), e.g. pretraining, multi-task learning, and knowledge distillation, have been applied for exploiting the source transcripts. In early experiments, no new examples were introduced for the auxiliary task(s); only source transcript labels for the SLT examples were added. Later, the same learning setups were applied to exploit more abundant auxiliary ASR and MT data.
Fig. 7: Four types of data that can be used to train SLT systems. Untranscribed SLT is the minimal type of data for end-to-end systems. Adding source text transcripts completes the triple. The source text is an intermediate representation which divides the SLT mapping into a modality conversion and a translation. Two types of auxiliary data, ASR and MT data, form adjacent pairs in the triple, leaving one of the ends empty. The auxiliary data can be used as is for pretraining or multi-task learning, or it can be completed into synthetic triples using external TTS or MT systems.
An important milestone towards parity with pipeline approaches was to achieve better translation quality when both the end-to-end system and the pipeline system are trained on the same SLT data. This milestone was reached by Weiss et al (2017), training on the 163h Fisher&Callhome Spanish→English data set. Pipeline methods are naturally capable of exploiting the more abundant paired ASR and MT data, but in this case the data condition was unrealistically constrained. When the constraint is lifted, pipeline methods improve to a level that is difficult or impossible to reach on small amounts of source audio-translated text data. The effective use of auxiliary data was a key insight going forward towards achieving parity with pipeline approaches.
Figure 8 shows learning setups that have been applied for exploiting source transcripts and auxiliary data. Weiss et al (2017) use a multi-task learning procedure with ASR as the auxiliary task, training only on transcribed SLT data. In multi-task learning (Caruana, 1997), multiple tasks are trained in parallel, with some network components shared between the tasks. Bérard et al (2018) compare pretraining (sequential transfer) with multi-task learning (parallel transfer), finding very little difference between the two.
In pretraining, some of the parameters from a network trained to perform an auxiliary task are used to initialise parameters in the network for the main task. The system of Bérard et al (2018) is trained only on transcribed SLT data, with two auxiliary tasks: pretraining the encoder and decoder with ASR and textual MT respectively. Stoian et al (2019) compare the effects of pretraining on auxiliary ASR datasets of different languages and sizes, concluding that the WER of the ASR system is more predictive of the final translation quality than language relatedness.
Anastasopoulos and Chiang (2018) make the line between pipeline and end-to-end approaches more blurred by using a multi-task learning setup with two-step decoding. First the source transcript is decoded using the ASR decoder. A second SLT decoder attends to both the speech input and the hidden states of the ASR decoder. While the system is trained end-to-end, the two-step decoding is still necessary at translation time. The system is trained only on transcribed SLT data.
Liu et al (2019) focus on exploiting source transcripts by means of knowledge distillation. They train the student SLT model to match the output probabilities of a text-only MT teacher model, finding that knowledge distillation is better than pretraining. Inaguma et al (2019b) also see substantial improvements from knowledge distillation when adding auxiliary textual parallel data.
Wang et al (2019a) introduce the Tandem Connectionist Encoding Network (TCEN), which allows neural network components to be pretrained while minimising both the number of parameters not transferred from the pretraining phase, and the mismatch of components between pretraining and finetuning. The final network consists of four components: ASR encoder, MT encoder, MT attention and MT decoder. The ASR encoder is pretrained with a Connectionist Temporal Classification objective function, which does not require a separate ASR decoder that would go to waste after pretraining. The last three parts can be pretrained with a textual MT task.
Fig. 8: Learning setups for end-to-end SLT: The standard framework uses untranscribed SLT data. Auxiliary data can be exploited in different ways such as by pretraining the encoder through ASR, pretraining the decoder through MT, knowledge distillation, or multi-task learning. The optional link in multi-task learning results in 2-step decoding. TCEN combines multiple types of pretraining.
Jia et al (2019) show that augmenting auxiliary data is more effective than multi-task learning. MT data is augmented with synthesised speech, while ASR data is augmented with synthetic target text by forward translation using a text-only MT system (see Figure 7). These kinds of synthetic data augmentation are conceptually similar to the highly successful practice of using backtranslation (Sennrich et al, 2016a) to exploit monolingual data in textual MT. With both pretraining and multi-task learning, the end-to-end system slightly outperforms the pipeline. Adding synthetic data substantially outperforms the pipeline. The systems are both trained on exceptionally large proprietary corpora: ca 1300 h translated speech and 49000 h transcribed speech. Controversially, the system is also evaluated on a proprietary test set. The speech encoder is divided into two parts, of which only the first is pretrained on an ASR auxiliary task. The entire decoder is pretrained on the text MT task.
Pino et al (2019) evaluate several pretraining and data augmentation approaches. They use TTS to synthesise source audio for parallel text data, finding that the effect depends on the quality and quantity of the synthetic data. Using textual MT to synthesise target text from ASR data is clearly beneficial. Pretraining the speech encoder on an ASR task is useful for the lower resourced English→Romanian, but not for English→French. Pretraining on ASR is not a good substitute for using textual MT for augmenting the ASR data, but does speed up convergence of the SLT model. Using a combination of a VGG Transformer speech encoder and decoder, they very nearly reach parity with a strong pipeline system.
Bansal et al (2019) apply crosslingual pretraining, by pretraining on high-resource ASR to improve low-resource SLT. They use a small Mboshi→French SLT corpus without source transcripts. As Mboshi has no official orthography, transcripts may be difficult to collect. Pretraining the speech encoder using a completely unrelated high-resource language, English, effectively allows the model to account for acoustic variability, such as speaker and channel differences.
Di Gangi et al (2019c) train a one-to-many multilingual system to translate from English to all 8 target languages of the MuST-C corpus, with an additional task pair for English ASR. Prepending a target language tag to the input (Johnson et al, 2017) is not effective in multilingual SLT, resulting in many acceptable translations into the wrong language. Better results are achieved with a stronger language signal using merge, a language-dependent shifting operation. Inaguma et al (2019a) train multilingual models for {en, es} → {en, fr, de} SLT. They achieve better results with the multilingual models than with bilingual ones, including pipeline methods for some test sets.
Noise-based data augmentation methods have also been applied to the speech audio. Bahar et al (2019) and Di Gangi et al (2019) apply spectral augmentation (SpecAugment), which randomly masks blocks of features that are consecutive in time and/or frequency.
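The masking operation described above can be sketched on a feature matrix stored as a list of frames. This is a minimal sketch with one time mask and one frequency mask per utterance. The function name, the mask widths and the mask value are illustrative assumptions, and other components of the published SpecAugment policy, such as time warping, are omitted.

```python
import random


def spec_augment(features, max_time_mask, max_freq_mask, rng, mask_value=0.0):
    """Mask one block of consecutive frames and one block of consecutive bins.

    `features` is a list of frames, each a list of filterbank values.
    The input is left unchanged; a masked copy is returned.
    """
    n_frames, n_bins = len(features), len(features[0])
    masked = [row[:] for row in features]

    # Time mask: overwrite a random run of consecutive frames.
    t_width = rng.randint(0, min(max_time_mask, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [mask_value] * n_bins

    # Frequency mask: overwrite a random run of consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for row in masked:
        for f in range(f_start, f_start + f_width):
            row[f] = mask_value
    return masked


rng = random.Random(0)
utterance = [[1.0] * 8 for _ in range(20)]   # toy input: 20 frames, 8 bins
augmented = spec_augment(utterance, max_time_mask=5, max_freq_mask=3, rng=rng)
for row in augmented:
    print("".join("#" if v else "." for v in row))
```

Because the masks are resampled every time an utterance is presented, the model sees a different corrupted view of the same audio in each epoch, without any new data being collected.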

End-to-end SLT architectures
There is a large variety of architectures that have been applied to end-to-end SLT, with no clear favourite having emerged. However, recent architectures all follow some type of sequence-to-sequence architecture that makes use of attention mechanisms.
Two varieties of LSTM layers have been used: standard bi-LSTM (e.g. Jia et al, 2019) and pyramidal bi-LSTM (e.g. Duong et al, 2016;Bérard et al, 2016;Bahar et al, 2019). The pyramidal construction of the encoder downsamples the long speech input sequence, making subsequent bi-LSTM layers and the attention mechanism faster and alignment easier. Bérard et al (2016) use convolutional attention, finding it to be particularly useful with long input sequences. Following Weiss et al (2017), Bérard et al (2018) move away from the pyramidal bi-LSTM encoder architecture to convolution followed by bi-LSTM. The prepended convolutional layers perform the downsampling of the audio signal, making the pyramidal construction unnecessary. Transformers have also been used in many SLT systems. Liu et al (2019) propose an architecture in which all encoders and decoders are standard Transformer encoders and decoders respectively. Pino et al (2019) further prepend VGG-style convolutional blocks to Transformer encoders and decoders, in order to replace the positional embedding layer of the standard Transformer architecture and to downsample the signal. Di Gangi et al (2019c) use a speech encoder which begins with stacks of convolutional layers interleaved with 2D self-attention (Dong et al, 2018), followed by a stack of Transformer layers. Salesky et al (2019) revisit the network-in-network (Lin et al, 2014a) architecture to achieve downsampling: parameters are shared spatially in a similar way to CNN, but a full multi-layer perceptron network is applied to each window.
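The downsampling effect of a pyramidal encoder can be illustrated in isolation. The sketch below assumes the variant in which adjacent time steps are concatenated between layers; other variants exist, the recurrent layers themselves are left out, and the function name and sizes are illustrative assumptions.

```python
def pyramid_downsample(frames, factor=2):
    """Concatenate each group of `factor` consecutive frames into one frame.

    The sequence becomes `factor` times shorter and each frame `factor`
    times wider. Trailing frames that do not fill a group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]


# 10 s of speech at an assumed 100 frames per second, 40 features per frame.
sequence = [[0.0] * 40 for _ in range(1000)]
for layer in range(3):
    sequence = pyramid_downsample(sequence)
    print(f"after layer {layer + 1}: {len(sequence)} steps x {len(sequence[0])} dims")
```

Three such reductions shorten the 1000-step input to 125 steps, which is much closer to the length of a text sequence and therefore easier for the attention mechanism to align.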
Convolutional Neural Networks are used in many SLT architectures, but only in combination with LSTM or Transformer, not in isolation. The combined CNN-LSTM architecture is popular in end-to-end ASR (Watanabe et al, 2018). The CNN is well suited for reduction of the time scale to something manageable, and modeling short range dependencies. The appended LSTM or Transformer is useful for encoding the semantic information for translation. The CNNs used in SLT are typically 2D convolutions (parameter sharing across both time and frequency). Time Delay Neural Networks (TDNN) are still popular in ASR, but have not, to the best of our knowledge, been used in end-to-end SLT. TDNNs can be seen as a 1D convolution, only sharing parameters across time. The VGG (Simonyan and Zisserman, 2015) architecture of CNNs is used in SLT, but not ResNet (He et al, 2016).
Comparison of architectures. In SLT, the choice between LSTM and Transformer architectures does not seem to be a settled matter: recent papers use both. Both architectures are powerful enough when stacked into sufficiently deep networks. Pino et al (2019) present a result in favour of the Transformer, as they only reach parity with their pipeline using Transformers, but not LSTMs. Inaguma et al (2019b) find that Transformers consistently outperform LSTMs in their experiments. A downside of LSTM is slow training on the very long sequences encountered in speech translation. While the Transformer parallelises to a larger extent, making training fast, it is not immune to long sequences, as the self-attention is quadratic in memory w.r.t. the length. The Transformer also lacks explicit modelling of short range dependencies, due to the self-attention learning dependencies of any range with equal difficulty. Di Gangi et al (2019b) attempt to augment the Transformer to alleviate some of its shortcomings.
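The quadratic memory cost of self-attention is easy to make concrete with a back-of-the-envelope calculation. The sequence lengths below are assumptions chosen for illustration (a 20-token sentence, a 10 s utterance at 100 frames per second, and the same utterance after 4x downsampling), and the helper function is hypothetical.

```python
def attention_matrix_entries(seq_len: int, n_heads: int = 1) -> int:
    """Number of attention weights in one self-attention layer."""
    return n_heads * seq_len * seq_len


text_len = 20                   # assumed length of a text sentence in tokens
speech_len = 10 * 100           # assumed 10 s of speech at 100 frames/s
downsampled_len = speech_len // 4

for name, length in [("text", text_len),
                     ("speech", speech_len),
                     ("speech, 4x downsampled", downsampled_len)]:
    print(f"{name}: {attention_matrix_entries(length):,} weights per head")
```

Under these assumptions the raw speech input needs 1,000,000 attention weights per head, against 400 for the text sentence, and 4x downsampling reduces this by a factor of 16 to 62,500. This is one reason why convolutional or pyramidal downsampling is commonly placed in front of the attention layers.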
Decoding units. In textual NMT, subword-level decoders have become the standard choice (Sennrich et al, 2016b). Most end-to-end SLT systems use character-level decoders. Word-level decoding is rare, although Bansal et al (2018), focusing on a low-computation setting, use it to shorten the sequence length. Some well-performing recent systems use subword units (Liu et al, 2019;Jia et al, 2019;Pino et al, 2019;Bansal et al, 2019). Wang et al (2019a) find characters to work better than subwords in their system.
Has parity with pipeline approaches been reached? Recent results (Jia et al, 2019;Pino et al, 2019) show that on certain tasks with large enough high-quality datasets, end-to-end systems can reach the same or even better performance than pipeline systems. In low-resource settings, end-to-end systems do not perform as well. However, in the IWSLT 2019 evaluation campaign, the pipeline system of Schneider and Waibel (2019) clearly outperforms all end-to-end submissions. Sperber et al (2019) find that current methods do not use auxiliary data effectively enough. The amount of transcribed SLT data is critical: when the size of the data containing all three of source audio, source text and target text is sufficient, end-to-end methods outperform pipeline methods. In lower resource settings where the amount of SLT data is insufficient, pipeline methods are better. Table 4 shows results on the English→French Augmented LibriSpeech test set, which is one of the most commonly used test sets for SLT, particularly end-to-end SLT. It shows the rapid increase in performance during the last two years, and the importance of maximally exploiting available training data.

Future Directions
The previous sections provide a detailed overview of resources, definitions of various kinds of multimodal MT, and the extensive work that has been devoted to developing models for the different tasks. However, multimodal MT is still in its infancy. This is especially the case for truly end-to-end models, which have only appeared in recent years. Future work should explore more realistic settings that go beyond restricted domains and rather artificial problems such as visually-guided image caption translation.

Datasets and resources
Image-guided translation has, thus far, been studied with small-scale datasets, and there is a need for larger-scale datasets that bring the resources for this task closer to the size of image captioning and machine translation datasets (Tiedemann, 2012). Larger-scale datasets have started to appear for video-guided translation (Sanabria et al, 2018;Wang et al, 2019b). Spoken-language translation datasets (Kocabiyikoglu et al, 2018;Niehues et al, 2018) are smaller than standard automatic speech recognition datasets. A common challenge in multimodal translation is the need for crosslingually aligned resources, which are expensive to collect, or can result in a small dataset of clean examples (Kocabiyikoglu et al, 2018). Future work will obviously benefit from larger datasets; however, researchers should further explore the role of data augmentation strategies (Jia et al, 2019) in both spoken language translation and visually-guided translation.

Evaluation and "verification"
A significant challenge in image-guided translation has been to demonstrate that a model definitively improves translation with image guidance. This has resulted in more focused evaluation datasets that test noun sense disambiguation (Lala and Specia, 2018) and verb sense disambiguation (Gella et al, 2019). In addition to new evaluations, researchers are focusing their efforts on determining whether image-guided translation models are sensitive to perturbations in the inputs. Elliott (2018) showed that the translations of some trained models are not affected when guided by incongruent images (i.e. the translation models were not guided by the image that the source language sentence describes, but by a randomly selected image; see Section 5.1.5 for more details); training models with masked tokens has been shown to increase the sensitivity of models to incongruent image guidance; and, more recently, Dutta Chowdhury and Elliott (2019) showed that trained models are more sensitive to textual perturbations than to incongruent image guidance. Overall, there is a need for more focused evaluations, especially in a wider variety of language pairs, and for models to be explicitly evaluated in these more challenging conditions. Future research on visually-guided translation should also ensure that new models are actually using the visual guidance in the translation process.
In spoken language translation, this line of research into focused evaluations might involve digging into the cases where a good transcript is not enough to disambiguate the translation. One possible case is translating into a language where the speaker's gender matters, such as French or Arabic (Elaraby et al, 2018). End-to-end SLT systems have the potential to use non-linguistic information from the speech signal to tackle these challenges, but it is currently unknown to what extent they are able to do so.

Shared tasks
In addition to stimulating research interest, shared task evaluation campaigns enable easier comparison of results by encouraging the use of standardised data conditions. The choice of data condition can be made with many aims in mind. To set up a race for state-of-the-art results using any and all available resources, it is enough to define a common test set. For this goal, any additional restrictions are unnecessary or even detrimental. For example, the GLUE natural language understanding task (Wang et al, 2018a) takes this approach.
On the other hand, if the goal is to achieve as fair a comparison as possible between architectures, then strict limitations on the training data are required as well. Most evaluation campaigns choose this approach. However, it is far from trivial to select an appropriate set of data types to include in the condition. In many tasks, the use of auxiliary or synthetic data has proved vitally useful, e.g. exploiting monolingual data in textual MT using backtranslation (Sennrich et al, 2016a). In spoken language translation, the use of auxiliary data has prompted some discussion of when end-to-end systems are considered to have reached parity with pipeline systems. To answer this question in a fair comparison, both types of systems should be evaluated under standardised data conditions.

Multimodality and new tasks
Most previous work on multimodal translation emphasises multimodal inputs and unimodal outputs, mainly text. The integration of speech synthesis, and also a better integration of visual signals in generated communication, is required for improved intelligent systems and interactive artificial agents. In addition to multimodal outputs, there should be a stronger emphasis on real-time language processing and translation. This new emphasis would also result in a closer integration of models for spoken language translation and visually-guided translation.
In SLT, the visual modality could contribute both complementary and disambiguating information. In addition, visual speech recognition, automatic lip reading in particular (e.g. Chung et al, 2017), could aid SLT, for example in audio noise robustness. The How2 dataset should allow a flurry of research in the nascent field of audio-visual SLT. Wu et al (2019a) present exploratory first results. BLEU improvements over the best non-visual baseline are not found, although the visual modality improves results when comparing between models using cascaded deliberation.
In zero-shot translation, a multilingual model is used for translating between a language pair that was not included in the parallel training data (Firat et al, 2016;Johnson et al, 2017). For example, if a model does zero-shot French→Chinese translation, the training data contains language pairs with French as the source language and Chinese as the target language but no parallel French→Chinese data. Considering ongoing research into multilingual translation models also in multimodal translation (e.g. Inaguma et al, 2019a), and the fact that multimodal translation training data of sufficient size is available for a very limited number of language pairs, we expect an interest in zero-shot multimodal language translation in the future.

Conclusions
Multimodal machine translation provides an exciting framework for further development in grounded cross-lingual natural language understanding combining work in NLP, computer vision and speech processing. This paper provides a thorough survey of the current state of the art in the field focusing on specific tasks and benchmarks that drive the research. This survey details the essential language, vision, and speech resources that are available to researchers, and discusses the models and learning approaches in the extensive literature on various multimodal translation paradigms. Combining these different paradigms into truly multimodal end-to-end models of natural cross-lingual communication will be the goal of future developments, given the foundations laid out in this survey.