1 Introduction

Machine translation (MT) has long been one of the most important challenges in natural language processing. Whether in traditional statistical machine translation (SMT) (Koehn 2009) or in modern neural machine translation (NMT) (Sutskever et al. 2014), methods and data have always been mutually indispensable. Indeed, the success of corpus-based MT depends mainly on the quality and scale of the parallel corpora available to train MT systems. Recent state-of-the-art NMT systems have shown that translation quality improves remarkably given sufficiently large-scale data and high computational power (Shen et al. 2016).

On the other hand, how to prepare such corpora remains a major problem. In some specific domains, such as Web news, patents, and Wikipedia, relatively high-quality multilingual translations are made available by content holders or volunteer workers and have been exploited by researchers for decades (Koehn 2005; Taeger 2011). In more general cases, however, it is not always possible to collect a sufficient amount of parallel data because most generic Web documents are monolingual. The human cost of manual translation is quite high, and it is particularly prohibitive for minor language pairs where resources are severely limited.

To tackle the situation where no or only a few parallel corpora are available, a branch of MT called pivot-based machine translation has been developed. The idea of the pivot-based approach is to indirectly learn the alignment between the source and target languages with the help of a third modality (e.g., texts in another language). Although previous studies along this line have mainly used a third language as the pivot, in this work we propose a novel and more general framework that utilizes arbitrary multimedia content (e.g., images) as the pivot. Nowadays, we can easily find abundant monolingual text documents accompanied by rich multimedia content as side information, e.g., text with photos or videos posted to social networking sites and blogs. Such visual media are expected to be more or less correlated with the accompanying text, reflecting the purpose of the document. Considering that we can generally understand the content of images taken in other countries regardless of our own language, visual information can serve as a universal representation for grounding different languages.

Moreover, in recent years, the performance of visual recognition has improved dramatically owing to the huge success of deep learning, and it is now considered to be at a human level for generic image recognition (Krizhevsky et al. 2012). We expect that these state-of-the-art visual recognition techniques are now mature enough to accurately extract language-agnostic semantics from images and thereby help improve natural language processing (NLP) tasks. If multimedia pivot-based machine translation is established, we could utilize the abundant monolingual multimedia documents naturally provided by Web users to build high-performance, open-domain MT systems.

Our contributions in this study are as follows:

  1. To the best of our knowledge, we are the first to propose a zero-resource (i.e., no direct parallel corpus) machine translation method that utilizes multimedia as the pivot. Importantly, pivot images are required only in the training phase.

  2. To realize this, we propose a neural-network-based method combining multimodal (cross-modal) representation learning and encoder-decoder models. Notably, our model can align the source encoder and target decoder without the source-to-target path during training that pseudo-corpus-based methods often rely on. Moreover, our idea is agnostic to the implementations of the encoder and decoder networks.

  3. We categorized several possible approaches in terms of model topology and learning strategy and extensively investigated their performance.

2 Related work

2.1 Resource problem in cross-lingual learning

Dealing with the limited availability of good-quality parallel or comparable corpora has been one of the most important issues in cross-lingual learning. One straightforward approach is to automatically mine parallel corpora, typically from noisy Web repositories. Some methods exploited a bootstrap approach starting from base translation systems (Uszkoreit et al. 2010), whereas others utilized external meta information, such as links to the same URL, to couple bilingual texts (Riesa and Marcu 2012). Images have also been exploited as a key for cross-lingual document matching in relatively early works. However, those methods simply relied on OCR reading or near-duplicate (copy) detection of images (Oard 1999), and thus they cannot identify similarities in semantics, which is a fundamental limitation compared to our work.

Another line of work trains an MT system from non-parallel data with the help of another modality for indirect knowledge transfer, which is called pivot-based machine translation. Most recent works have focused on using an existing popular language as the pivot (Wu and Wang 2007, 2009; Firat et al. 2016). While creating direct parallel corpora for minor language pairs is practically very difficult, major languages (e.g., English) are relatively often paired with each of them. Source-to-target translation can then be realized by first translating the source language into the pivot language and then translating it into the target language. Nonetheless, this method still assumes that source-pivot and pivot-target parallel corpora are available, which would require the effort of human experts if the languages are minor ones. Moreover, it is difficult to use images as the pivot in this approach because explicitly decoding an image from text is not a well-established technique. Therefore, image-based pivots have mainly been used in relatively easier tasks such as bilingual lexicon learning, where image similarity is used as the criterion to estimate relevance between tag words attached to images (Bergsma and Van Durme 2011; Kiela et al. 2015; Vulić et al. 2016).

2.2 Computer vision for machine translation

Grounding natural language in real-world representations has always been an important topic in NLP, for which computer vision is the first natural choice (Silberer and Lapata 2014). After the huge breakthrough brought by convolutional neural networks (CNNs) (Krizhevsky et al. 2012), visual recognition has advanced significantly in terms of both accuracy and flexibility, enabling the development of many brand-new technologies. Among them, image captioning, which automatically annotates an input image with a natural language description, has become one of the hottest topics in recent years (Vinyals et al. 2015; Johnson et al. 2016). Because image captioning is essentially interpreted as “translation” from an image to a sentence, it has drawn more and more attention in the NLP community as well.

Recently, a new research field called multimodal machine translation was proposed (Huang et al. 2016; Hitschler and Riezler 2016) and became a subtask in WMT 2016.\(^{1}\) The aim of this task is to use images in addition to the source language as inputs to improve translation performance, hopefully resolving alignment ambiguities that cannot be solved by text alone. The feasibility of this approach has been demonstrated by several methods, such as visual-based reranking of SMT results (Hitschler and Riezler 2016). However, this task assumes that images are available as part of the query in the testing phase, and thus its objective and setup are entirely different from ours.

2.3 Multimodal embedding

To use non-language multimedia as the pivot for MT, we need a more flexible mechanism to semantically align different types of data. The key idea here is to derive one common representation shared by all modalities. In other words, whatever the modality is, observed data belonging to the same implicit concept should be mapped into roughly the same point in the embedding space. The most classical and standard method for multimodal learning is probably linear canonical correlation analysis (CCA) (Hotelling 1936), which has been successfully used in image-language collaborations such as semantic image retrieval and annotation (Hardoon et al. 2004), as well as cross-lingual information retrieval (Udupa and Khapra 2010; Funaki and Nakayama 2015). In more recent methods based on deep neural networks, pairwise ranking loss has been shown to significantly improve multimodal embedding (Frome et al. 2013) owing to its natural capability of learning discriminative nearest-neighbor metrics and stability in gradient-based learning. It was successfully used for image captioning within the framework of the deep encoder-decoder model (Kiros et al. 2015).

In this work, we simultaneously optimize the source-pivot (image) and pivot-target losses with a shared pivot encoder to implicitly align the two languages in the multimodal space, which is the core of our zero-shot learning. We further put a target sequence decoder on top of the multimodal representation to compose an end-to-end encoder-decoder model. From a theoretical viewpoint, our work is in line with some recently proposed methods in the form of multi-stream encoder-decoder models. We examine these models and ours in detail and describe our contribution in Sect. 3.4.

Fig. 1

Our models for neural translation based on pivot images (Top: two-way model. Bottom: three-way model). In the training phase, the language encoders \(E^s\) and \(E^t\) are forced to have high correlations with the image encoder \(E^v\) in the multimodal space, on which the decoder of the target language, \(D^t\), is trained. In the testing phase, translation is realized by simply feeding the input forward through \(E^s\) and \(D^t\)

3 Our approach

3.1 Overview

Our goal is to build a translation model from a source language s to a target language t by utilizing the side information (images) as the pivot. Below, we call a pair of a text description d and its counterpart image i a “document.” For training the system, suppose that we have \(N^s\) monolingual documents in the source language, \(\mathcal {T}^s=\left\{ d^s_k,i^s_k\right\} ^{N^s}_{k=1}\). Similarly, we also have \(N^t\) documents in the target language, \(\mathcal {T}^t=\left\{ d^t_k,i^t_k\right\} ^{N^t}_{k=1}\). Importantly, \(\mathcal {T}^s\) and \(\mathcal {T}^t\) do not overlap; they do not share the same images at all. While \(d^s\) and \(d^t\) obviously appear in different spaces, \(i^s\) and \(i^t\) share a common visual space and can be handled by the same encoder. We let \(E^s(d^s)\), \(E^t(d^t)\), and \(E^v(i)\) denote non-linear encoders (i.e., feature extractors) for source language descriptions, target language descriptions, and images, respectively.

Our model can be divided roughly into two components. The first is multimodal representation learning, in which the parameters of the encoders \(E^s(d^s)\), \(E^t(d^t)\), and \(E^v(i)\) are optimized so that their outputs are mapped into the same semantic space, which we call “the multimodal space.” If such a good multimodal space is obtained, instances of all modalities should have roughly the same vector representation as long as they are tied to similar semantic concepts. The second is to build a target language decoder, \(D^t\), on top of the multimodal space so that the final translation can be realized by \(D^t\left( E^s(d^s)\right) \). It should be emphasized that we only need text as input during the testing phase, as in standard machine translation.
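To make the test-time path concrete, the following is a minimal sketch (not our released code) of greedy decoding through \(E^s\) and \(D^t\). Here, `enc_s`, `dec_t_init`, and `dec_t_step` are hypothetical helpers standing for \(E^s\), the initialization of \(D^t\) from a multimodal vector, and one decoding step of \(D^t\); the BOS/EOS token ids are placeholders.

```python
# Minimal sketch of the test-time path D^t(E^s(d^s)); all helper names are assumptions.
import torch

def translate_greedy(source_tokens, max_len=30, bos_id=1, eos_id=2):
    z = enc_s(source_tokens)                     # map the source description into the multimodal space
    state = dec_t_init(z)                        # initialize the target decoder D^t from that point
    word = torch.tensor([bos_id])
    output = []
    for _ in range(max_len):
        logits, state = dec_t_step(word, state)  # one decoding step of D^t
        word = logits.argmax(dim=-1)
        if word.item() == eos_id:
            break
        output.append(word.item())
    return output                                # word ids of the generated target sentence
```

Note that no image is involved at test time; the image encoder \(E^v\) is used only during training.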

Figure 1 illustrates our approach. There are several options in the model topology and training strategies that are thoroughly compared in the experiments. We describe the details in the following sections.

3.2 Model topologies

We use the pairwise ranking loss proposed by Frome et al. (2013) to train the encoders so that their outputs are mapped into one common multimodal space. For the two-way model, we take the source-image loss as follows (Fig. 1: Top):

$$\begin{aligned} J_{2w}^E(\mathcal {T}^s) = \sum _{i^s}\sum _{ng} \text{ max }\left\{ 0,\alpha -s\left( E^v\left( i^s\right) ,E^s\left( d^s\right) \right) +s\left( E^v(i^s),E^s\left( d^s_{ng}\right) \right) \right\} , \end{aligned}$$
(1)

where \(\alpha \) is the margin hyperparameter and the similarity score function \(s(\cdot )\) is the dot product. Note that the outputs of each encoder are unit-normalized, so this is equivalent to cosine similarity. \(d^s_{ng}\) denotes negative (uncoupled) descriptions for \(i^s\), sampled from the same mini-batch.
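As an illustration, a minimal PyTorch sketch of this loss is shown below (not the exact implementation), assuming the encoders already return unit-normalized vectors and that, as described above, the negatives are the other descriptions in the same mini-batch.

```python
# Sketch of the pairwise ranking loss in Eq. (1); assumes unit-normalized encoder outputs.
import torch

def pairwise_rank_loss(img_emb, txt_emb, margin=0.1):
    """img_emb, txt_emb: (batch, dim) tensors; row k of img_emb is paired with row k of txt_emb."""
    scores = img_emb @ txt_emb.t()                 # all pairwise dot products (= cosine similarities)
    pos = scores.diag().unsqueeze(1)               # s(E^v(i^s), E^s(d^s)) for the true pairs
    hinge = (margin - pos + scores).clamp(min=0)   # hinge term for every candidate negative
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return (hinge * mask).sum()                    # exclude the diagonal (positive) terms
```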

For training the decoder, the images \(i^t\) in \(\mathcal {T}^t\) are fed forward and used as inputs to measure the decoder loss against \(d^t\). We use the standard cross-entropy loss:

$$\begin{aligned} J^D_{im}\left( \mathcal {T}^t\right) = -\sum _{d^t} \frac{1}{|d^t|}\sum _{k=1}^{|d^t|}\text{ log }P\left( w_k|D^t\left( E^v\left( i^t\right) \right) \right) , \end{aligned}$$
(2)

where \(P(w_k)\) is the probability that the model outputs the ground-truth word at step k. Our two-way model is closely related to the image-captioning model proposed by Kiros et al. (2015), except that we apply different languages to the encoder and decoder parts. It can be viewed as an end-to-end fusion of multimodal embedding and image-captioning models.

In the three-way model, we further incorporate the ranking loss on \(\mathcal {T}^t\) in addition to that on \(\mathcal {T}^s\) (Fig. 1: Bottom) for training the encoders. The encoder loss for the three-way model is defined as follows:

$$\begin{aligned} J_{3w}^E\left( \mathcal {T}^s,\mathcal {T}^t\right)= & {} \sum _{i^s}\sum _{ng} \text{ max }\left\{ 0,\alpha -s\left( E^v\left( i^s\right) ,E^s\left( d^s\right) \right) +s\left( E^v\left( i^s\right) ,E^s\left( d^s_{ng}\right) \right) \right\} \nonumber \\&+ \sum _{i^t}\sum _{ng} \text{ max }\left\{ 0,\alpha -s\left( E^v\left( i^t\right) ,E^t\left( d^t\right) \right) +s\left( E^v\left( i^t\right) ,E^t\left( d^t_{ng}\right) \right) \right\} .\nonumber \\ \end{aligned}$$
(3)

The three-way model has several advantages over the two-way model. First, while image-target alignment is ignored in the two-way model, in the three-way model the images implicitly bind the source and target languages by jointly enforcing high correlations with both. Thus, the multimodal representation itself is expected to be improved for bridging the gap between the two languages. Moreover, simultaneously optimizing the two constraints should have a positive regularization effect, in a manner similar to multi-task learning. Second, unlike the two-way model, the three-way model can utilize both the images and the descriptions in \(\mathcal {T}^t\) for training the decoder of the target language, because they are now mapped into a common representation. This can be interpreted as a form of data augmentation and is expected to further improve robustness. The loss for reconstructing target descriptions is as follows:

$$\begin{aligned} J^D_{de}\left( \mathcal {T}^t\right) = -\sum _{d^t} \frac{1}{|d^t|}\sum _{k=1}^{|d^t|}\text{ log }P\left( w_k|D^t\left( E^t\left( d^t\right) \right) \right) . \end{aligned}$$
(4)

We can use \(J^D_{im}\), \(J^D_{de}\), or both for training the decoder in the three-way approach. The model can then be viewed as a fusion of multimodal embedding, image captioning, and an autoencoder of the target language.
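The sketch below illustrates how the three losses could be combined in code for the image + description configuration; it is a hedged sketch rather than the exact implementation. It reuses the `pairwise_rank_loss` sketch above, and `enc_v`, `enc_s`, `enc_t`, and `dec_t` are assumed modules standing for \(E^v\), \(E^s\), \(E^t\), and \(D^t\), with `dec_t` returning per-token logits under teacher forcing.

```python
# Hedged sketch combining Eqs. (2)-(4); the encoder/decoder modules are assumed to exist.
import torch.nn.functional as F

def three_way_losses(batch_s, batch_t, margin=0.1):
    # batch_s comes from T^s (source descriptions + images); batch_t comes from T^t (target side)
    v_s, e_s = enc_v(batch_s["images"]), enc_s(batch_s["descs"])
    v_t, e_t = enc_v(batch_t["images"]), enc_t(batch_t["descs"])

    # Eq. (3): ranking losses on both language sides, sharing the image encoder E^v
    J_E = pairwise_rank_loss(v_s, e_s, margin) + pairwise_rank_loss(v_t, e_t, margin)

    # Eqs. (2) and (4): decode target sentences from image embeddings and from description embeddings
    logits_im = dec_t(v_t, batch_t["in_tokens"])   # (batch, len, vocab), teacher forcing
    logits_de = dec_t(e_t, batch_t["in_tokens"])
    labels = batch_t["out_tokens"].flatten()       # gold next-word ids (padding handling omitted)
    J_D = F.cross_entropy(logits_im.flatten(0, 1), labels) + \
          F.cross_entropy(logits_de.flatten(0, 1), labels)
    return J_E, J_D
```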

3.3 Training strategy

We investigate two strategies for training the whole model. The first strategy is the two-step approach, in which we first optimize the encoder loss, \(J^E\). Then we fix the parameters for all encoders and start optimizing the decoder with respect to \(J^D\). The second strategy is the end-to-end approach, in which we jointly optimize encoder and decoder losses. Here we use the combined loss,

$$\begin{aligned} J^{all} = J^D + \lambda J^E, \end{aligned}$$
(5)

where \(\lambda \) is a weighting parameter.
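The two strategies can be sketched as follows, reusing the hypothetical `three_way_losses` helper above; this is a simplified illustration (e.g., it runs for a fixed number of epochs, whereas we actually stop on validation loss).

```python
# Simplified sketches of the two-step and end-to-end training strategies.
import torch

def train_two_step(encoder_params, decoder_params, loader, enc_epochs=10, dec_epochs=10):
    opt_e = torch.optim.Adam(encoder_params)
    for _ in range(enc_epochs):                    # step 1: optimize the encoder loss J^E only
        for batch_s, batch_t in loader:
            J_E, _ = three_way_losses(batch_s, batch_t)
            opt_e.zero_grad(); J_E.backward(); opt_e.step()
    opt_d = torch.optim.Adam(decoder_params)       # step 2: optimize J^D; encoder params are not updated
    for _ in range(dec_epochs):
        for batch_s, batch_t in loader:
            _, J_D = three_way_losses(batch_s, batch_t)
            opt_d.zero_grad(); J_D.backward(); opt_d.step()

def train_end_to_end(all_params, loader, epochs=10, lam=100.0):
    opt = torch.optim.Adam(all_params)
    for _ in range(epochs):
        for batch_s, batch_t in loader:
            J_E, J_D = three_way_losses(batch_s, batch_t)
            loss = J_D + lam * J_E                 # joint loss J^all = J^D + lambda * J^E (Eq. 5)
            opt.zero_grad(); loss.backward(); opt.step()
```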

3.4 Difference from closely related methods

Although our work is, as far as we know, the first attempt at zero-resource machine translation using a multimedia pivot, there are some theoretically close methods that inspired our model. The topology of our network is similar to the recently proposed many-to-one sequence-to-sequence model (Luong et al. 2016). However, their model is designed for standard multi-task learning and does not have a multimodal embedding layer like ours; therefore, it cannot align a source encoder and a target decoder in zero-shot situations. To deal with the zero-shot problem, Firat et al. (2016) incorporated synthetic parallel corpora to explicitly include a source-to-target path during training. In other words, they approach the zero-shot problem on the data side, whereas we approach it on the model side with the help of multimodal embedding.

As for pivot-based multimodal representation learning, Funaki and Nakayama (2015) and Rajendran et al. (2016) used basically the same idea as our multimodal space, implemented with generalized CCA and neural encoders, respectively. A major difference is that their models have no cross-modal decoders because their interest was the multimodal representation (embedding) itself. We will show that simultaneously optimizing decoders has positive effects not only on decoding but also on the learned representation itself. Saha et al. (2016) proposed an end-to-end model of multimodal embedding and a target decoder, which is almost identical to our two-way model except that their multimodal fusion is based on a correlation loss. As described above, our three-way model, which includes the target encoder in multimodal learning, has many advantages; in fact, it significantly improves both the multimodal representation and the decoding performance compared to the two-way model, as we show in the experiments.

Table 1 Statistics and data splits for each preprocessed dataset

4 Experiment

4.1 Data set

For our study, we used two publicly available multilingual image-description datasets. The IAPR-TC12 dataset (Grübinger et al. 2006) contains 20,000 images with English and German descriptions. The original descriptions were provided in German, and their English translations were added by professionals. The recently published Multi30K dataset (Elliott et al. 2016) is specifically designed for research on multimodal machine translation. It contains 31,014 images with English and German descriptions for each image and is an extension of Flickr30K (Young et al. 2014), an English image-caption dataset, for which German translations were provided by Elliott et al. (2016). Multi30K provides two types of bilingual annotation: one for the machine translation task and the other for the multilingual image captioning task. We used the former for our experiments and followed the official training, validation, and testing splits.

For preprocessing, all words were converted to lowercase and tokenized using the Natural Language Toolkit, and words appearing fewer than five times in the training splits were replaced by the UNK symbol. Table 1 summarizes the statistics of the datasets and our experimental setup. We randomly split the data into non-overlapping sets for training, validation, and testing. Modalities unnecessary for a given split (e.g., German descriptions for the Image-English split) were ignored. Notably, we had no direct English-German parallel data, even in the validation sets.

Although these are, to our knowledge, the largest multilingual image-description datasets currently available, they are relatively small compared to those used in standard studies on neural machine translation. We note that our work is at an early stage, where the focus is to show the feasibility of zero-shot translation using a multimedia pivot and to investigate how each component of our model affects the relative improvements in performance.

4.2 Experimental setup

Because the choice of encoders and decoders for each modality is not within the scope of this paper, we used the most standard neural models for each domain. For the visual encoder \(E^v\), we employed the public VGG-19 network (Simonyan and Zisserman 2015), one of the most powerful and widely used CNNs, pre-trained on the ImageNet dataset (Deng et al. 2009). We used features from the “fc7” layer of VGG-19 and added two fully connected (FC) layers with 1024 hidden units each on top; only these layers are tuned during training. For implementation, we used the pre-computed features for Multi30K provided by the WMT’16 Multimodal Machine Translation task\(^{1}\). We extracted the same features for IAPR-TC12 using Caffe (Jia et al. 2014). For the language encoders and decoder \(E^s\), \(E^t\), and \(D^t\), we used recurrent neural networks (RNNs) with long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997), with 512-dimensional word embeddings and 1024-dimensional hidden units. Note that the output dimensions of all encoders must be equal so that they can be coupled in the multimodal space. We used the Adam optimizer (Kingma and Ba 2015) with a mini-batch size of 32 for training the network, and we stopped optimization when the validation loss no longer improved. We fixed \(\alpha =0.1\) and \(\lambda =100\) throughout our experiments.
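For concreteness, the following sketch shows one way the encoders described above might be implemented in PyTorch. It is an assumption-laden illustration, not the exact architecture: for instance, the ReLU between the two FC layers and the use of the final LSTM hidden state as the sentence vector are our guesses.

```python
# Illustrative encoder modules (assumptions marked in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEncoder(nn.Module):                      # E^v: two trainable FC layers on fixed fc7 features
    def __init__(self, feat_dim=4096, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)

    def forward(self, feats):
        h = F.relu(self.fc1(feats))                  # non-linearity between FC layers is an assumption
        return F.normalize(self.fc2(h), dim=-1)      # unit norm, so dot product equals cosine similarity

class TextEncoder(nn.Module):                        # E^s / E^t: LSTM over 512-d word embeddings
    def __init__(self, vocab_size, emb_dim=512, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return F.normalize(h[-1], dim=-1)            # final hidden state used as the sentence vector
```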

To compose a mini-batch for the language encoders \(E^s\), \(E^t\) and the decoder \(D^t\), we pad special NULL symbols to align the lengths of the sentences in the same input (or output) batch, which is standard practice in sequence-to-sequence learning. As for the image encoder \(E^v\), because the visual representation of each example is a static 4096-dimensional vector, a mini-batch is simply a 32 \(\times \) 4096 matrix whose rows hold the feature vectors of the batch examples in the same order as the corresponding text-side mini-batch. The training data are randomly shuffled at the beginning of each epoch and then fed into mini-batches in order.
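The sketch below shows this batching scheme; the NULL id of 0 and the helper name are assumptions.

```python
# Hypothetical mini-batch construction: NULL-padded token matrix plus a 32 x 4096 image-feature matrix.
import torch

NULL_ID = 0  # assumed index of the NULL padding symbol

def make_batch(token_seqs, image_feats):
    """token_seqs: list of 32 lists of word ids; image_feats: list of 32 pre-extracted 4096-d fc7 vectors."""
    max_len = max(len(seq) for seq in token_seqs)
    text = torch.full((len(token_seqs), max_len), NULL_ID, dtype=torch.long)
    for row, seq in enumerate(token_seqs):           # pad every sentence to the batch maximum length
        text[row, :len(seq)] = torch.tensor(seq)
    images = torch.stack([torch.as_tensor(f, dtype=torch.float32) for f in image_feats])
    return text, images                              # shapes: (32, max_len) and (32, 4096), rows aligned
```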

For evaluation, we mainly used the standard corpus-level BLEU metric (Papineni et al. 2002), computed with the multi-bleu.pl script in the Moses toolkit. We also evaluated the sentence-level BLEU+1 metric (Lin and Och 2004), a modification of BLEU that adds smoothing terms for higher-order n-grams, making it possible to evaluate MT performance on short sentences. We report BLEU scores in plain text and the corresponding BLEU+1 scores in parentheses.
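For illustration only, the snippet below shows corpus-level BLEU computed with NLTK; the reported numbers themselves come from the Moses script and the BLEU+1 variant, so this stand-in is not the exact evaluation pipeline.

```python
# Stand-in illustration of corpus-level BLEU (the experiments use Moses multi-bleu.pl and BLEU+1).
from nltk.translate.bleu_score import corpus_bleu

references = [[["ein", "mann", "steht", "auf", "der", "strasse"]]]  # one list of references per hypothesis
hypotheses = [["ein", "mann", "steht", "auf", "der", "strasse"]]
print(corpus_bleu(references, hypotheses))                          # 1.0 for an exact match
```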

Table 2 BLEU (BLEU+1) scores of the supervised sequence-to-sequence baselines for varying training data sizes

Table 2 summarizes the results for the baseline models. To demonstrate the performance obtainable with a supervised parallel corpus, we show the scores of sequence-to-sequence NMT (Sutskever et al. 2014) trained on the same RNN architecture while varying the number of randomly sampled parallel sentences.

4.3 In-depth study of multimodal space

To separately evaluate the effectiveness of the multimodal space, we first focused on simple nearest-neighbor-based translation. Namely, for a query description in the source language, \(d^s_q\), we retrieved its nearest-neighbor training sample in \(\mathcal {T}^t\) and then simply output its description. This experiment essentially measures the retrieval performance and is appropriate for evaluating the multimodal representation itself.

As a baseline, we implemented a naive method based on TFIDF and CNN visual features. For a query, we first retrieve the most similar document in \(\mathcal {T}^s\) in terms of the cosine similarity of TFIDF text features. Then, using the image coupled with that document, we retrieve the nearest document in \(\mathcal {T}^t\) in terms of the L2 distance between CNN features (i.e., the VGG-19 fc7 layer) and output its caption as the translation result.
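A hedged sketch of this baseline is given below; the function name and the use of scikit-learn are our assumptions rather than the original implementation.

```python
# Sketch of the TFIDF + CNN-feature baseline retrieval chain.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def baseline_translate(query_desc, src_descs, src_feats, tgt_descs, tgt_feats):
    """src_feats / tgt_feats: (N, 4096) fc7 feature matrices aligned with the description lists."""
    vec = TfidfVectorizer().fit(src_descs + [query_desc])
    sims = cosine_similarity(vec.transform([query_desc]), vec.transform(src_descs)).ravel()
    j = int(np.argmax(sims))                                  # most similar document in T^s by TFIDF cosine
    dists = np.linalg.norm(tgt_feats - src_feats[j], axis=1)  # L2 distances between fc7 features
    return tgt_descs[int(np.argmin(dists))]                   # caption of the nearest T^t image
```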

In the multimodal space obtained by our two-way model, we retrieve the nearest image on the target side, \(\mathcal {T}^t\), by computing the dot-product score, which is the criterion used in the ranking loss (Eq. 1).

$$\begin{aligned} \hat{d}^t_{2w} = d_y^t \;\; \text{ where }\;\; y = {\mathop {\hbox {arg max}}\limits _k}\; s\left( E^v\left( i^t_k\right) ,E^s\left( d^s_q\right) \right) \end{aligned}$$
(6)

For the three-way model, in addition to image-based retrieval, we can directly retrieve the nearest description in \(\mathcal {T}^t\), which is an interesting characteristic of this model.

$$\begin{aligned} \begin{aligned}&\hat{d}^t_{3w} = d_y^t \;\; \text{ where } \\&\quad y = \left\{ \begin{array}{l} {\mathop {\hbox {arg max}}\limits _k}\; s\left( E^v\left( i^t_k\right) ,E^s\left( d^s_q\right) \right) \;\; \text{(image-based) } \\ {\mathop {\hbox {arg max}}\limits _k}\; s\left( E^t\left( d^t_k\right) ,E^s\left( d^s_q\right) \right) \;\; \text{(description-based) } \end{array} \right. \end{aligned} \end{aligned}$$
(7)
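Both retrieval rules reduce to an arg max over dot products once the embeddings are precomputed, as in the following sketch (helper names are assumptions).

```python
# Sketch of Eqs. (6)-(7): nearest-neighbor retrieval in the multimodal space.
import torch

def retrieve_translation(query_emb, tgt_img_embs, tgt_desc_embs, tgt_descs, mode="image"):
    """query_emb: E^s(d^s_q) of shape (dim,); tgt_*_embs: (N_t, dim) matrices of E^v(i^t_k) / E^t(d^t_k)."""
    keys = tgt_img_embs if mode == "image" else tgt_desc_embs
    scores = keys @ query_emb                    # dot products s(., E^s(d^s_q)); unit norm => cosine
    return tgt_descs[int(torch.argmax(scores))]  # output the retrieved target description
```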

Table 3 shows the results of the nearest-neighbor methods. “with dec.” indicates a multimodal space jointly trained with the decoder, while the others were trained independently (i.e., the first step of the two-step approach). For reference, we also report the performance obtained by randomly sampling a description from \(\mathcal {T}^t\). As expected, the three-way model generally outperformed the two-way model. Interestingly, we observed that performance improved further when we directly retrieved descriptions on the target side. This indicates that descriptions projected into the multimodal space still carry useful information not apparent in the images.

We hypothesize that jointly optimizing the multimodal embedding loss and the decoder loss (the end-to-end model) may result in a better multimodal space, because decoder learning can act as a useful constraint in a multi-task learning framework. As the results show, the “with dec.” models generally achieve better performance on IAPR-TC12 but not on Multi30K. This is reasonable because the independently trained multimodal space is poor on IAPR-TC12 but relatively good on Multi30K, as the comparison with the “TFIDF + CNN feature” baseline suggests.

Table 3 BLEU (BLEU+1) scores of the nearest-neighbor methods. “with dec.” indicates joint optimization with the decoder in end-to-end training

4.4 Main results and discussion

Table 4 shows a detailed comparison of our approach in different configurations. Compared with the baselines (Table 2), our best results are roughly comparable to those of the sequence-to-sequence models when the number of parallel sentences is limited to about 20% of the size of our monolingual data. We summarize our findings below.

A comparison of model topologies shows that the three-way models generally outperformed their two-way counterparts. However, when only images were fed forward for training the decoder, the differences in performance were subtle, and the two-way model sometimes outperformed the three-way model. The most attractive aspect of the three-way approach is that both images and descriptions can be used for decoder training, which always provided the best results. We could possibly utilize external monolingual corpora to further improve the decoder, which we would like to investigate in future work.

As for the training strategy, end-to-end training generally achieved better results than the two-step approach, although the difference was not very large on Multi30K. For IAPR-TC12, as the results of the nearest-neighbor experiment suggest, the multimodal space itself was relatively poor (sometimes outperformed by the TFIDF baseline). In such a case, jointly optimizing the multimodal space (encoders) and the decoder significantly improved performance. This result is consistent with the observation in the previous section.

Table 4 BLEU (BLEU+1) score comparison of different models and training strategies
Fig. 2

Loss curves of English to German translation task on our three-way models (image + description) for the IAPR-TC12 dataset

We show the loss curves for the English-to-German translation task with our three-way models. Figures 2 and 3 show the results on IAPR-TC12 and Multi30K, respectively. Note that the training (validation) loss cannot be directly compared to the test loss because they are based on entirely different criteria. Nonetheless, we can see that the validation and test losses converge at similar points, making it possible to tune the network properly. Another observation is that decoder training seems to overfit earlier on Multi30K, which could be another reason the end-to-end approach showed no significant improvement on this dataset.

Finally, we present some qualitative results of zero-shot translation in Table 5. We observed that our method can indeed translate many sentences correctly. Besides the successful cases, we also see many interesting errors. In many translations, although the overall description of a scene is more or less relevant, attributes (e.g., color) and the numbers of objects are often missed. This is reasonable because we currently use only a single global CNN feature vector, which makes it difficult to correctly align fine-grained local information in the images. To tackle this problem, it would be promising to integrate more sophisticated object detection and segmentation methods in the future. We also observe a number of small grammatical errors, possibly due to the lack of sufficient training data. We expect that this problem can be mitigated by utilizing external monolingual data in the target language.

Fig. 3

Loss curves of English to German translation task on our three-way models (image + description) for the Multi30K dataset

Table 5 Qualitative examples of German to English translation using our end-to-end three-way model (image + description). Ground truth English captions are noted in parentheses

5 Conclusion

In this work, we tackled the challenging task of training an NMT system from nothing but monolingual data accompanied by multimedia side information. Unlike many previous studies that used multimedia simply as additional input alongside text to reinforce machine translation, we used no parallel corpora for training and no image inputs in the testing phase. Our system was made possible by training multimodal encoders to share a common, modality-agnostic semantic representation using images as the pivot. We compared several possible implementations and showed the feasibility of our approach. Notably, we found the three-way model to be particularly promising in terms of both performance and flexibility in handling various modality-specific data. Although our target in this paper was a fully unsupervised setup, we could naturally include some parallel data in a semi-supervised manner, or external monolingual text corpora in the target language, to further enhance performance, which is an attractive direction for future research.

Of course, the experimental results also suggest that we have a long way to go. There is still a significant performance gap compared to the supervised sequence-to-sequence baselines. We expect this gap to shrink further as we adopt more expressive visual encoders, powerful attention mechanisms, and better multimodal learning methods, all of which have improved remarkably in recent years. Moreover, our current method is intrinsically limited to domains where text can be grounded in visual content, which is not always the case for generic documents. We would like to extend our approach to handle other types of side information and to investigate how far we can go on automatically crawled, noisy Web data, which is an important milestone toward realizing true zero-resource MT that utilizes the abundant multimedia monolingual documents on the Web.