1 Introduction

Neural machine translation (NMT) (Forcada and Ñeco 1997; Cho et al. 2014; Sutskever et al. 2014; Bahdanau et al. 2015) is nowadays the most popular machine translation (MT) approach, following two decades of predominant use by the community of statistical MT (SMT) (Brown et al. 1991; Och and Ney 2000; Zens et al. 2002), as indicated by the majority of MT methods employed in shared tasks during the last few years (Bojar et al. 2016; Cettolo et al. 2016). A recent and widely-adopted NMT architecture, the so-called Transformer (Vaswani et al. 2017), has been shown to outperform other models in high-resource scenarios (Barrault et al. 2019). However, in low-resource settings, NMT approaches in general and Transformer models in particular are known to be difficult to train and usually fail to converge towards a solution which outperforms traditional SMT. Only a few previous studies tackle this issue, whether using NMT with recurrent neural networks (RNNs) for low-resource language pairs (Sennrich and Zhang 2019), or for distant languages with the Transformer but using several thousands of sentence pairs (Nguyen and Salazar 2019).

In this paper, we present a thorough investigation of training Transformer NMT models in an extremely low-resource scenario for distant language pairs involving eight translation directions. More precisely, we aim to develop strong baseline NMT systems by optimizing hyper-parameters proper to the Transformer architecture, which relies heavily on multi-head attention mechanisms and feed-forward layers. These two elements compose a usual Transformer block. While the number of blocks and the model dimensionality are hyper-parameters to be optimized based on the training data, they are commonly fixed following the model topology introduced in the original implementation.

Following the evaluation of baseline models after an exhaustive hyper-parameter search, we provide a comprehensive view of synthetic data generation making use of the best performing tuned models and following several prevalent techniques. Particularly, the tried-and-tested back-translation method (Sennrich et al. 2016a) is explored, including the use of source side tags and noised variants, as well as the forward translation approach.

The characteristics of the dataset used in our experiments allow for multilingual (many-to-many languages) NMT systems to be trained and compared to bilingual (one-to-one) NMT models. We propose to train two models: using English on the source side and four Asian languages on the target side, or vice-versa (one-to-many and many-to-one). This is achieved by using a data pre-processing approach involving the introduction of tags on the source side to specify the language of the aligned target sentences. By doing so, we leave the Transformer components identical to one-way models to provide a fair comparison with baseline models.

Finally, because monolingual corpora are widely available for many languages, including the ones presented in this study, we contrast low-resource supervised NMT models trained and tuned as baselines to unsupervised NMT making use of a large amount of monolingual data. The latter being a recent alternative to low-resource NMT, we present a broad illustration of available state-of-the-art techniques making use of publicly available corpora and tools.

To the best of our knowledge, this is the first study involving eight translation directions using the Transformer in an extremely low-resource setting with fewer than 20k parallel sentences available. We compare the results obtained with our NMT models to SMT, with and without large language models trained on monolingual corpora. Our experiments deliver a recipe to follow when training Transformer models with scarce parallel corpora.

The remainder of this paper is organized as follows. Section 2 presents previous work on low-resource MT, multilingual NMT, as well as details about the Transformer architecture and its hyper-parameters. Section 3 introduces the experimental settings, including the datasets and tools used, and the training protocol of the MT systems. Section 4 details the experiments and obtained results following the different approaches investigated in our work, followed by an in-depth analysis in Sect. 5. Finally, Sect. 6 gives some conclusions.

2 Background

This section presents previous work on low-resource MT, multilingual NMT, followed by the technical details of the Transformer model and its hyper-parameters.

2.1 Low-resource MT

NMT systems usually require a large amount of parallel data for training. Koehn and Knowles (2017) conducted a case study on an English-to-Spanish translation task and showed that NMT significantly underperforms SMT when trained on fewer than 100 million words. However, this experiment was performed with standard hyper-parameters that are typical for high-resource language pairs. Sennrich and Zhang (2019) demonstrated that better translation performance is attainable by performing hyper-parameter tuning when using the same attention-based RNN architecture as in Koehn and Knowles (2017). The authors reported higher BLEU scores than SMT on simulated low-resource experiments in English-to-German MT. Both studies have in common that they explored neither realistic low-resource scenarios nor distant language pairs, leaving unanswered the question of the applicability of their findings to these conditions.

In contrast to parallel data, large quantities of monolingual data can be easily collected for many languages. Previous work proposed several ways to take advantage of monolingual data in order to improve translation models trained on parallel data, such as integrating a separately trained language model into the NMT system architecture (Gulcehre et al. 2017; Stahlberg et al. 2018) or exploiting monolingual data to create synthetic parallel data. The latter was first proposed for SMT (Schwenk 2008; Lambert et al. 2011), but remains the most prevalent one in NMT (Barrault et al. 2019) due to its simplicity and effectiveness. The core idea is to use an existing MT system to translate monolingual data to produce a sentence-aligned parallel corpus, whose source or target is synthetic. This corpus is then mixed with the initial, non-synthetic, parallel training data to retrain a new MT system.

Nonetheless, this approach only leads to slight improvements in translation quality for SMT. It was later exploited in NMT (Sennrich et al. 2016a) using the original monolingual data on the target side and the corresponding generated translations on the source side to retrain NMT. A large body of subsequent work proposed improvements to this approach, which we denote “backward translation” in this paper. In particular, Edunov et al. (2018) and Caswell et al. (2019) respectively showed that adding synthetic noise or a tag to the source side of the synthetic parallel data effectively enables the exploitation of very large monolingual data in high-resource settings. In contrast, Burlot and Yvon (2018) showed that the quality of the backward translations has a significant impact on training good NMT systems.

As opposed to backward translation, Imamura and Sumita (2018) proposed to perform “forward translation”Footnote 1 to enhance NMT by augmenting their training data with synthetic data generated by MT on the target side. They only observed moderate improvements, but in contrast to backward translation, this approach does not require a pre-existing NMT system trained for the reverse translation direction.
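
The two ways of building synthetic parallel data described above can be sketched as follows. This is only an illustrative sketch: the function names, the `tgt_to_src` and `src_to_tgt` translation callables, and the `<BT>` tag string are hypothetical placeholders, not part of any toolkit or of the systems discussed here.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def backward_translation(
    target_mono: Iterable[str],
    tgt_to_src: Callable[[str], str],
    tag: Optional[str] = None,
) -> List[Pair]:
    """Pair original target sentences with synthetic source sentences.

    `tgt_to_src` stands for any existing target-to-source MT system
    (hypothetical callable). If `tag` is given, it is prepended to the
    synthetic source side to mark it as machine-generated.
    """
    pairs = []
    for tgt in target_mono:
        src = tgt_to_src(tgt)
        if tag is not None:
            src = f"{tag} {src}"
        pairs.append((src, tgt))
    return pairs


def forward_translation(
    source_mono: Iterable[str],
    src_to_tgt: Callable[[str], str],
) -> List[Pair]:
    """Pair original source sentences with synthetic target sentences."""
    return [(src, src_to_tgt(src)) for src in source_mono]


def mix(parallel: Iterable[Pair], synthetic: Iterable[Pair]) -> List[Pair]:
    """Concatenate genuine and synthetic pairs before retraining."""
    return list(parallel) + list(synthetic)
```

In backward translation the synthetic text ends up on the source side, so the model is still trained to produce human-written target sentences; in forward translation the synthetic text is on the target side.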

Recent work has also shown remarkable results in training MT systems using only monolingual data in so-called unsupervised statistical (USMT) and neural (UNMT) machine translation (Artetxe et al. 2018; Lample et al. 2018; Artetxe et al. 2018; Lample et al. 2018). These approaches have only worked in configurations where we do not necessarily need them, i.e., in high-resource scenarios and/or for very close language pairs. Marie et al. (2019) showed that USMT can almost reach the translation quality of supervised NMT for close language pairs (Spanish–Portuguese) while Marie et al. (2019) showed that this approach is unable to generate a translation for distant language pairs (English–Gujarati and English–Kazakh).

To the best of our knowledge, none of these approaches has been successfully applied to NMT systems for extremely low-resource language pairs, such as the Asian languages that we experiment with in this paper.

2.2 Multilingual NMT

Most NMT architectures are flexible enough to incorporate multiple translation directions, through orthogonal approaches involving data pre-processing or architecture modifications. One such modification is to create separate encoders and decoders for each source and target language and couple them with a shared attention mechanism (Firat et al. 2016). However, this method requires substantial NMT architecture modifications and leads to an increase in learnable model parameters, which rapidly becomes computationally expensive. At the data level, a straightforward and powerful approach was proposed by Johnson et al. (2017), where a standard NMT model was used as a black-box and an artificial token was added on the source side of the parallel data. Such a token could be for instance <2xx>, where xx indicates the target language.

During training, the NMT system uses these source tokens to produce the designated target language, and thus a single model can translate between multiple language pairs. This approach allows for so-called zero-shot translation, enabling translation directions that are unseen during training and for which no parallel training data is available (Firat et al. 2016; Johnson et al. 2017).
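
As a minimal sketch of this data-level approach, the snippet below prepends a `<2xx>` token to each source sentence. The function names and the language codes used in the example (`ja`, `vi`) are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def add_target_tag(source: str, target_lang: str) -> str:
    """Prepend an artificial <2xx> token naming the target language."""
    return f"<2{target_lang}> {source}"


def tag_corpus(
    pairs: Iterable[Tuple[str, str]], target_lang: str
) -> List[Tuple[str, str]]:
    """Tag the source side of every (source, target) pair; targets are unchanged."""
    return [(add_target_tag(src, target_lang), tgt) for src, tgt in pairs]


# The same English sentence can be routed to different target languages
# only by changing the tag (language codes are illustrative).
to_japanese = add_target_tag("The weather is nice today.", "ja")
to_vietnamese = add_target_tag("The weather is nice today.", "vi")
```

Because only the input text changes, the NMT architecture itself is left untouched, which is what makes the comparison with bilingual models straightforward.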

One of the advantages of multilingual NMT is its ability to leverage high-resource language pairs to improve the translation quality on low-resource ones. Previous studies have shown that jointly learning low-resource and high-resource pairs leads to improved translation quality for the low-resource one (Firat et al. 2016; Johnson et al. 2017; Dabre et al. 2019). Furthermore, the performance tends to improve as the number of language pairs (consequently the training data) increases (Aharoni et al. 2019).

However, to the best of our knowledge, the impact of joint multilingual NMT training with the Transformer architecture in an extremely low-resource scenario for distant languages through the use of a multi-parallel corpus has not been investigated previously. This motivates our work on multilingual NMT, which we contrast with bilingual models through an exhaustive hyper-parameter tuning. All the multilingual NMT models trained and evaluated in our work are based on the approach presented in Johnson et al. (2017), which relies on an artificial token indicating the target language, prepended to the source side of each parallel sentence pair.

2.3 Transformer

Encoder–decoder models consist in encoding an input sequence \(\mathbf {X} = \{x_1, x_2, \ldots , x_n\}\) (\(\mathbf {X} \in \mathbb {R}^{n \times d}\)) and producing a corresponding output sequence \(\mathbf {Y} = \{y_1, y_2, \ldots , y_m\}\) (\(\mathbf {Y} \in \mathbb {R}^{m \times d}\)), where d is the model dimensionality and n and m are the input and output sequence lengths, respectively. The Transformer is built on stacked \(L_e\) encoder and \(L_d\) decoder layers, each layer consisting of sub-layers of concatenated multi-head dot-product scaled attention (Eq. 1) and two position-wise feed-forward layers with a non-linear activation function in between, usually a rectified linear unit (ReLU) (Nair and Hinton 2010) (Eq. 2).

$$\begin{aligned} attn ( \varvec{\iota }_1, \varvec{\iota }_2 )&= [ head_1 ; \ldots ; head_i; \ldots ; head_h ] \mathbf {W}^O, \end{aligned}$$
$$\begin{aligned} head_i ( \varvec{\iota }_1, \varvec{\iota }_2 )&= softmax \left( \frac{ \mathbf {Q} \mathbf {W}^Q_i \cdot (\mathbf {K}\mathbf {W}^K_i)^T }{\sqrt{d}}\right) \mathbf {V} \mathbf {W}^V_i, \nonumber \\ ffn \left( \mathbf {H}^l_{ attn }\right)&= ReLU \left( \mathbf {H}^l_{ attn } \mathbf {W}_1 + b_1\right) \mathbf {W}_2 + b_2. \end{aligned}$$

Both encoder and decoder layers contain self-attention and feed-forward sub-layers, while the decoder contains an additional encoder–decoder attention sub-layer.Footnote 2 The hidden representations produced by self-attention and feed-forward sub-layers in the \(l_{e}\)-th encoder layer (\(l_{e}\in \{1,\ldots ,L_{e}\}\)) are formalized by \(\mathbf {H}^{l_{e}}_{ Eattn }\) (Eq. 3) and \(\mathbf {H}^{l_{e}}_{ Effn }\) (Eq. 4), respectively. Equivalently, the decoder sub-layers in the \(l_{d}\)-th decoder layer (\(l_{d}\in \{1,\ldots ,L_{d}\}\)) are formalized by \(\mathbf {H}^{l_{d}}_{ Dattn }\) (Eq. 5), \(\mathbf {H}^{l_{d}}_{ EDattn }\) (Eq. 6), and \(\mathbf {H}^{l_{d}}_{ Dffn }\) (Eq. 7). The sub-layer \(\mathbf {H}^{l_{e}}_{ Eattn }\) for \(l_{e} = 1\) receives the input sequence embeddings \(\mathbf {X}\) instead of \(\mathbf {H}^{l_{e}-1}_{ Effn }\), as it is the first encoder layer. The same applies to the decoder.

$$\begin{aligned} \mathbf {H}^{l_{e}}_{ Eattn }&= \nu \left( \chi \left( \mathbf {H}^{l_{e}-1}_{ Effn }, attn \left( \mathbf {H}^{l_{e}-1}_{ Effn }, \mathbf {H}^{l_{e}-1}_{ Effn } \right) \right) \right) , \end{aligned}$$
$$\begin{aligned} \mathbf {H}^{l_{e}}_{ Effn }&= \nu \left( \chi \left( \mathbf {H}^{l_{e}}_{ Eattn }, ffn \left( \mathbf {H}^{l_{e}}_{ Eattn } \right) \right) \right) , \end{aligned}$$
$$\begin{aligned} \mathbf {H}^{l_{d}}_{ Dattn }&= \nu \left( \chi \left( \mathbf {H}^{l_{d}-1}_{ Dffn }, attn \left( \mathbf {H}^{l_{d}-1}_{ Dffn }, \mathbf {H}^{l_{d}-1}_{ Dffn } \right) \right) \right) , \end{aligned}$$
$$\begin{aligned} \mathbf {H}^{l_{d}}_{ EDattn }&= \nu \left( \chi \left( \mathbf {H}^{l_{d}}_{ Dattn }, attn \left( \mathbf {H}^{l_{d}}_{ Dattn }, \mathbf {H}^{L_{e}}_{ Effn } \right) \right) \right) , \end{aligned}$$
$$\begin{aligned} \mathbf {H}^{l_{d}}_{ Dffn }&= \nu \left( \chi \left( \mathbf {H}^{l_{d}}_{ EDattn }, ffn \left( \mathbf {H}^{l_{d}}_{ EDattn } \right) \right) \right) . \end{aligned}$$

Each sub-layer includes LayerNorm (Ba et al. 2016), noted \(\nu \), parameterized by \(\mathbf {g}\) and \(\mathbf {b}\), with input vector \(\hat{\mathbf {h}}\), mean \(\mu \), and standard deviation \(\varphi \) (Eq. 8), and a residual connection noted \(\chi \) with input vectors \(\hat{\mathbf {h}}\) and \(\hat{\mathbf {h}'}\) (Eq. 9). In the default Transformer architecture, LayerNorm is placed after each non-linearity and residual connection, also called post-norm. However, an alternative configuration is pre-norm, placing the normalization layer prior to the non-linearity.

$$\begin{aligned} \nu ( \hat{\mathbf {h}} )&= \frac{\hat{\mathbf {h}} - \mu }{\varphi } \odot \mathbf {g} + \mathbf {b} \end{aligned}$$
$$\begin{aligned} \chi ( \hat{\mathbf {h}}, \hat{\mathbf {h}'} )&= \hat{\mathbf {h}} + \hat{\mathbf {h}'} \end{aligned}.$$
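
To make the difference between post-norm and pre-norm concrete, the following sketch implements \(\nu\) and \(\chi\) on plain Python lists and wraps an arbitrary sub-layer in both orders. It is a simplified illustration: the `eps` term is a numerical-stability assumption that does not appear in Eq. 8, the pre-norm form shown is the commonly used one (normalize the input of the sub-layer, then add the residual), and all function names are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(h: Vector, g: Vector, b: Vector, eps: float = 1e-5) -> Vector:
    """nu(h): normalize h to zero mean and unit variance, then scale and shift."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    std = math.sqrt(var + eps)  # eps is an added stability assumption
    return [(x - mu) / std * gi + bi for x, gi, bi in zip(h, g, b)]


def residual(h: Vector, h_prime: Vector) -> Vector:
    """chi(h, h'): element-wise sum of the two inputs."""
    return [x + y for x, y in zip(h, h_prime)]


def post_norm(h: Vector, sublayer: Callable[[Vector], Vector],
              g: Vector, b: Vector) -> Vector:
    """Post-norm: nu(chi(h, sublayer(h))), the ordering used in Eqs. 3 to 7."""
    return layer_norm(residual(h, sublayer(h)), g, b)


def pre_norm(h: Vector, sublayer: Callable[[Vector], Vector],
             g: Vector, b: Vector) -> Vector:
    """Pre-norm: chi(h, sublayer(nu(h))), normalization applied before the sub-layer."""
    return residual(h, sublayer(layer_norm(h, g, b)))
```

With pre-norm, the residual path carries the unnormalized input directly from one layer to the next, whereas post-norm normalizes the sum at every sub-layer.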

2.4 Hyper-parameters

The Transformer architecture has a set of hyper-parameters to be optimized given the training data. Based on the formalism introduced in Sect. 2.3, the architecture-dependent hyper-parameters to optimize are the model dimensionality d, which is equal to the input and output token embedding dimensionality, the number of heads h, the size of the feed-forward layers \( ffn \) and thus the dimensions of the parameter matrices \(\mathbf {W}_1\) and \(\mathbf {W}_2\), as well as the biases \(b_1\) and \(b_2\), and the number of encoder and decoder layers \(L_e\) and \(L_d\), respectively. In addition, it was shown in recent work that the position of the normalization layer \(\nu \), before or after the attention or feed-forward layers, leads to training instability (i.e., no convergence) for deep Transformer architectures or in low-resource settings using an out-of-the-box hyper-parameter configuration (Wang et al. 2019; Nguyen and Salazar 2019).

Our exhaustive hyper-parameter search for NMT models is motivated by the findings of Sennrich and Zhang (2019) where the authors conducted experiments in low-resource settings using RNNs and showed that commonly used hyper-parameters do not lead to the best results. However, the impact of varying the number of layers was not evaluated in their study. This limitation could lead to sub-optimal translation quality, as other studies on mixed or low-resource settings have opted for a reduced number of layers in the Transformer architecture, for instance using 5 encoder and decoder layers instead of the usual 6 (Schwenk et al. 2019; Chen et al. 2019). In these latter publications, the authors also used fewer attention heads, between 2 and 4, compared to the out-of-the-box 8 heads of the vanilla Transformer.

Moreover, a number of recent studies on the Transformer architecture have shown that not all attention heads are necessary. For instance, Voita et al. (2019) evaluated the importance of each head in the multi-head attention mechanism in a layer-wise fashion. They identified their lexical and syntactic roles and proposed a head pruning mechanism. However, their method is only applied to a fully-trained 8-head model. A concurrent study carried out by Michel et al. (2019), both for NMT and natural language inference tasks, focused on measuring the importance of each head through a masking approach. This method showed that a high redundancy is present in the parameters of most heads given the rest of the model and that some heads can be pruned regardless of the test set used. An extension to their approach allows for head masking during the training procedure. Results indicate that the importance of each head is established by the Transformer at an early stage during NMT training.

Finding the best performing architecture based on a validation set given a training corpus is possible through hyper-parameter grid-search or by relying on neural architecture search methods (cf. Elsken et al. 2018). The former solution is realistic in our extremely low-resource scenario and the latter is beyond the scope of our work.
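
A grid search of this kind can be sketched in a few lines. The search space below uses placeholder values chosen only for illustration (they are not the values explored in this paper), and `train_and_score` is a hypothetical callable standing for training one model and returning its validation score.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterator, List, Tuple

Config = Dict[str, Any]


def grid(search_space: Dict[str, List[Any]]) -> Iterator[Config]:
    """Yield every combination of hyper-parameter values."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(
    search_space: Dict[str, List[Any]],
    train_and_score: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Train one model per configuration and keep the best validation score."""
    best_config, best_score = None, float("-inf")
    for config in grid(search_space):
        score = train_and_score(config)  # hypothetical: train, then validate
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Placeholder values for illustration only.
SEARCH_SPACE = {
    "model_dim": [256, 512],
    "ffn_dim": [512, 2048],
    "encoder_layers": [2, 6],
    "decoder_layers": [2, 6],
    "heads": [1, 8],
    "norm": ["pre", "post"],
}
```

The number of systems to train is the product of the number of values per hyper-parameter (here \(2^6 = 64\)), which is why an exhaustive search is only practical when each training run is cheap.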

3 Experimental settings

The aim of the study presented in this paper is to provide a set of techniques that tackle the low-resource issues encountered when training NMT models and more specifically Transformers. This particular NMT architecture involves a large number of hyper-parameter combinations to be tuned, which is the cornerstone of building strong baselines. The focus of our work is a set of extremely low-resource distant language pairs with a realistic data setting, based on corpora presented in Sect. 3.1. We compare traditional SMT models to supervised and unsupervised NMT models with the Transformer architecture, whose specificities and training procedures are presented in Sect. 3.2. Details about the post-processing and evaluation methods are finally presented in Sect. 3.3.

3.1 Dataset

The dataset used in our experiments contains parallel and monolingual data. The former composes the training set of the baseline NMT systems, as well as the validation and test sets, while the latter constitutes the corpora used to produce synthetic parallel data, namely backward and forward translations.

3.1.1 Parallel corpus

The parallel training, validation, and test sets were extracted from the Asian Language Treebank (ALT) corpus (Riza et al. 2016).Footnote 3 We focus on four Asian languages, i.e., Japanese, Lao, Malay, and Vietnamese, aligned to English, leading to eight translation directions. The ALT corpus comprises a total of 20,106 sentences initially taken from the English Wikinews and translated into the other languages. Thus, the English side of the corpus is considered as original while the other languages are considered as translationese (Gellerstam 1986), i.e., texts that share a set of lexical, syntactic and/or textual features distinguishing them from non-translated texts. Statistics for the parallel data used in our experiments are presented in Table 1.

Table 1 Statistics of the parallel data used in our experiments

3.1.2 Monolingual corpus

Table 2 Statistics for the entire monolingual corpora, comprising 163M, 87M, 737k, 15M, and 169M lines, respectively for English, Japanese, Lao, Malay, and Vietnamese, and for the sampled sub-corpora made of 18k (18,088), 100k, 1M, 10M, or 50M lines. Tokens and types for Japanese and Lao are calculated at the character level

We used monolingual data provided by the Common Crawl project,Footnote 4 which were crawled from various websites in any language. We extracted the data from the April 2018 and April 2019 dumps. To identify from the dumps the lines in English, Japanese, Lao, Malay, and Vietnamese, we used the fastText (Bojanowski et al. 2016)Footnote 5 pretrained model for language identification.Footnote 6 The resulting monolingual corpora still contained a large portion of noisy data, such as long sequences of numbers and/or punctuation marks. For cleaning, we decided to remove lines in the corpora that fulfill at least one of the following conditions:

  • more than 25% of its tokens are numbers or punctuation marks.Footnote 7

  • contains fewer than 4 tokens.Footnote 8

  • contains more than 150 tokens.
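
The three conditions above can be expressed as a single line filter. The sketch below is an assumption-laden illustration: tokens are obtained by whitespace splitting by default, numbers and punctuation marks are detected through Unicode character categories, and the function names are hypothetical. For languages written without spaces, a different `tokenize` function (for instance `list`, for character-level tokens) would have to be supplied.

```python
import unicodedata
from typing import Callable, List


def is_number_or_punct(token: str) -> bool:
    """True if every character is a Unicode number (N*) or punctuation (P*)."""
    return all(unicodedata.category(ch)[0] in ("N", "P") for ch in token)


def keep_line(
    line: str,
    tokenize: Callable[[str], List[str]] = str.split,
    max_noise: float = 0.25,
    min_tokens: int = 4,
    max_tokens: int = 150,
) -> bool:
    """Return False if the line meets at least one removal condition."""
    tokens = tokenize(line)
    # Too short or too long (also guards against empty lines).
    if len(tokens) < min_tokens or len(tokens) > max_tokens:
        return False
    # More than 25% of the tokens are numbers or punctuation marks.
    noisy = sum(1 for token in tokens if is_number_or_punct(token))
    return noisy / len(tokens) <= max_noise


def clean(lines: List[str]) -> List[str]:
    """Keep only the lines that pass all three checks."""
    return [line for line in lines if keep_line(line)]
```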

Statistics of the monolingual corpora and their sub-samples used in our experiments are presented in Table 2.

3.2 MT systems

We trained and evaluated three types of MT systems: supervised NMT, unsupervised NMT, and SMT systems. The computing architecture at our disposal for our NMT systems consists of 8 Nvidia Tesla V100 GPUs with the CUDA library version 10.2. Each of the following sections gives the details of one type of MT system, including the tools, pre-processing, hyper-parameters, and training procedures used.

3.2.1 Supervised NMT systems

Our supervised NMT systems include ordinary bilingual (one-to-one) and multilingual (one-to-many and many-to-one) NMT systems. All the systems were trained using the fairseq toolkit (Ott et al. 2019) based on PyTorch.Footnote 9

The only pre-processing applied to the parallel and monolingual data used for our supervised NMT systems was a sub-word transformation method (Sennrich et al. 2016b). Neither tokenization nor case alteration was conducted in order to keep our method language agnostic as much as possible. Because Japanese and Lao languages do not contain spacing, we employed a sub-word transformation approach, sentencepiece (Kudo and Richardson 2018), which is based on sequences of characters. Models were learned on joint vocabularies for bilingual NMT systems and on all languages for the multilingual ones. When the same script is used between languages, the shared vocabulary covers all the observed sub-words of these languages. When different scripts are used between languages, only common elements (punctuation, numbers, etc.) are shared. We restricted the number of sub-word transformation operations to 8000 for all the bilingual NMT models and to 32,000 for all the multilingual NMT models.

Our main objective is to explore a vast hyper-parameter space to obtain our baseline supervised NMT systems, focusing on specific aspects of the Transformer architecture. Our choice of hyper-parameters (motivated in Sect. 2.4) and their values are presented in Table 3. These are based on our preliminary experiments and on the findings of previous work that showed which Transformer hyper-parameters have the largest impact on translation quality measured by automatic metrics (Sennrich and Zhang 2019; Nguyen and Salazar 2019). The exhaustive search of hyper-parameter combinations resulted in 576 systems trained and evaluated for each translation direction, using the training and validation sets specific to each language pair presented in Table 1. Because the experimental setting presented in this paper focuses on extremely low-resource languages, it is possible to train and evaluate a large number of NMT systems without requiring too much computing resources.

We trained all the supervised NMT systems following the same pre-determined procedure, based on gradient descent using the Adam optimizer (Kingma and Ba 2014) and the cross-entropy objective with smoothed labels based on a smoothing rate of 0.1. The parameters of the optimizer were \(\beta _1 = 0.9\), \(\beta _2 = 0.98\), \(\epsilon = 10^{-9}\). The learning rate was scheduled as in Vaswani et al. (2017), initialized at \(1.7^{-7}\) and following an initial 4k steps warmup before decaying at the inverse square root rate. The NMT architecture used the ReLU (Nair and Hinton 2010) activation function for the feed-forward layers and scaled dot-product attention for the self and encoder–decoder attention layers. A dropout rate of 0.1 was applied to all configurations during training and no gradient clipping was used. We used a batch of 1,024 tokens and stopped training after 200 epochs for the baseline systems, 80 epochs for the systems using additional synthetic data and 40 epochs for the largest (10M) data configurations. Bilingual models were evaluated every epoch, while multilingual models were evaluated every 5 epochs, both using BLEU on non post-processed validation sets (including sub-words). The best scoring models were kept for final evaluation.
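
The shape of such a learning rate schedule (linear warmup followed by inverse square root decay) can be sketched as below. This is a generic formulation written for illustration, not the toolkit's code: the peak learning rate `peak_lr` is not stated in the text and is therefore a free parameter here, and the function name is hypothetical.

```python
def inverse_sqrt_lr(step: int, init_lr: float, peak_lr: float,
                    warmup_steps: int = 4000) -> float:
    """Learning rate at a given update step.

    The rate grows linearly from `init_lr` to `peak_lr` during the first
    `warmup_steps` updates, then decays proportionally to 1 / sqrt(step).
    """
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# Illustrative values only: both rates below are assumptions.
schedule = [inverse_sqrt_lr(s, init_lr=1e-7, peak_lr=1e-3)
            for s in (0, 2000, 4000, 16000)]
```

The two branches meet at `step == warmup_steps`, where both give `peak_lr`; after that, multiplying the step count by four halves the learning rate.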

Table 3 Hyper-parameters considered during the tuning of baseline NMT systems

For each trained model, we introduced two additional hyper-parameters for decoding to be tuned, i.e., the decoder beam size and the translation length penalty, as shown in Table 3. The best combination of these two parameters was kept to decode the test set.

3.2.2 Unsupervised NMT systems

Our UNMT systems used the Transformer-based architecture proposed by Lample et al. (2018). This architecture relies on a denoising autoencoder as language model during training, on a latent representation shared across languages for the encoder and the decoder, and pre-initialized cross-lingual word embeddings. To set up a state-of-the-art UNMT system, we used for initialization a pre-trained cross-lingual language model, i.e., XLM, as performed by Lample and Conneau (2019).

UNMT was exclusively trained on monolingual data. For Lao and Malay we used the entire monolingual data, while we randomly sampled 50 million lines from each of the English, Japanese, and Vietnamese monolingual data. To select the best models, XLM and then UNMT models, we relied on the same validation sets used by our supervised NMT systems. Since pre-training an XLM model followed by the training of a UNMT system is costly, we chose to train a single multilingual XLM model and UNMT system for all translation directions. To construct the vocabulary of the XLM model, we concatenated all the monolingual data and trained a sentencepiece model with 32,000 operations. The sentencepiece model was then applied to the monolingual data for each language and we fixed the vocabulary size during training at 50,000 tokens.

For training XLM and UNMT, we ran the framework publicly released by Lample and Conneau (2019),Footnote 10 with default parameters, namely: 6 encoder and decoder layers, 8 heads, 1024 embedding and 4096 feed-forward dimensions, layer dropout and attention dropout of 0.1, and the GELU activation function. Note that XLM and UNMT models must use the same hyper-parameters and that we could not tune these hyper-parameters due to the prohibitive cost of training. Our XLM model is trained using the masked language model objective (MLM) which is usually trained monolingually. However, since we have trained a single model using the monolingual data for all the languages with a shared vocabulary, our XLM model is cross-lingual. The training steps for the denoising autoencoder component were language specific while the back-translation training steps combined all translation directions involving English.

3.2.3 SMT systems

We used Moses (Koehn et al. 2007)Footnote 11 and its default parameters to conduct SMT experiments, including MSD lexicalized reordering models and a distortion limit of 6. We learned the word alignments with mgiza for extracting the phrase tables. We used 4-gram language models trained, without pruning, with LMPLZ from the kenlm toolkit (Heafield et al. 2013) on the entire monolingual data. For tuning, we used kb-mira (Cherry and Foster 2012) and selected the best set of weights according to the BLEU score obtained on the validation data after 15 iterations.

In contrast to our NMT systems, we did not apply sentencepiece on the data for our experiments with SMT. Instead, to follow a standard configuration typically used for SMT, we tokenized the data using the Moses tokenizer for English, Malay, and Vietnamese, only specifying the language option “-l en” for these three languages. For Japanese, we tokenized our data with MeCabFootnote 12 while we used an in-house tokenizer for Lao.

3.3 Post-processing and evaluation

Prior to evaluating NMT system outputs, the removal of spacing and special characters introduced by sentencepiece was applied to translations as a single post-processing step. For the SMT system outputs, detokenization was conducted using the detokenizer.perl tool included in the Moses toolkit for English, Malay and Vietnamese languages, while simple space deletion was applied for Japanese and Lao.
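
For the NMT outputs, this post-processing step amounts to undoing the sub-word segmentation. A minimal sketch, assuming that the output pieces are separated by spaces and that word boundaries are marked with the sentencepiece meta symbol "▁" (U+2581), could look as follows; the function name is hypothetical.

```python
SPM_SPACE = "\u2581"  # "▁", the symbol sentencepiece uses to encode whitespace


def remove_subword_markers(line: str) -> str:
    """Join sub-word pieces and restore the original spacing."""
    # 1. Drop the spaces that separate sub-word pieces.
    # 2. Turn each meta symbol back into a regular space.
    # 3. Strip the leading space left by the sentence-initial meta symbol.
    return line.replace(" ", "").replace(SPM_SPACE, " ").strip()
```

For example, the segmented string `▁He llo ▁wor ld` is restored to `Hello world`.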

We measured the quality of translations using two automatic metrics, BLEU (Papineni et al. 2002) and chrF (Popović 2015) implemented in SacreBLEU (Post 2018) and relying on the reference translations available in the ALT corpus. For target languages which do not contain spaces, i.e., Japanese and Lao, the evaluation was conducted using the chrF metric only, while we used both chrF and BLEU for target languages containing spaces, i.e., English, Malay and Vietnamese. As two automatic metrics are used for hyper-parameter tuning of NMT systems for six translation directions, the priority is given to models performing best according to chrF, while BLEU is used as a tie-breaker. System output comparisons and statistical significance tests are conducted based on bootstrap resampling using 500 iterations and 1000 samples, following Koehn (2004).
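
The idea behind paired bootstrap resampling can be illustrated with the simplified sketch below. It makes two assumptions that should be kept in mind: it averages per-sentence scores, whereas corpus-level metrics such as BLEU and chrF are normally recomputed on each resampled set, and the mapping of "500 iterations and 1000 samples" onto the `n_iterations` and `sample_size` parameters is our reading of the setup. The function name is hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_iterations: int = 500,
    sample_size: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A beats system B.

    `scores_a[i]` and `scores_b[i]` are the scores of the two systems on
    the same test sentence i (simplifying assumption: sentence-level scores).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_iterations):
        # Draw sentence indices with replacement; the same indices are
        # used for both systems, which makes the comparison paired.
        indices = [rng.randrange(n) for _ in range(sample_size)]
        mean_a = sum(scores_a[i] for i in indices) / sample_size
        mean_b = sum(scores_b[i] for i in indices) / sample_size
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_iterations
```

Under this scheme, system A would be reported as significantly better than system B at \(p<0.05\) when the returned fraction is at least 0.95.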

4 Experiments and results

This section presents the experiments conducted in extremely low-resource settings using the Transformer NMT architecture. First, the baseline systems resulting from exhaustive hyper-parameter search are detailed. Second, the use of monolingual data to produce synthetic data is introduced and two setups, backward and forward translation, are presented. Finally, two multilingual settings, English-to-many and many-to-English, are evaluated.

4.1 Baselines

The best NMT architectures and decoder hyper-parameters were determined based on the validation set and, in order to be consistent for all translation directions, on the chrF metric. The final evaluation of the best architectures was conducted by translating the test set; the resulting evaluation scores are presented in the following paragraphs and summarized in Table 4.Footnote 13 The best configurations according to automatic metrics were kept for further experiments on each translation direction using monolingual data (presented in Sect. 4.2).
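
The selection rule described in Sect. 3.3 (chrF first, BLEU as a tie-breaker when it is available) can be written compactly as a sort key. The dictionary layout and the scores in the example are made up for illustration and do not correspond to any system in this paper.

```python
from typing import Any, Dict, List

NO_BLEU = float("-inf")  # used when BLEU is not computed (e.g. Japanese, Lao targets)


def select_best(candidates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the candidate with the highest chrF; break ties with BLEU."""
    return max(candidates, key=lambda c: (c["chrf"], c.get("bleu", NO_BLEU)))


# Hypothetical validation scores, for illustration only.
example = [
    {"name": "config-a", "chrf": 0.30, "bleu": 8.0},
    {"name": "config-b", "chrf": 0.35, "bleu": 7.0},
    {"name": "config-c", "chrf": 0.35, "bleu": 7.5},
]
best = select_best(example)  # config-c: same chrF as config-b, higher BLEU
```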

Table 4 Test set results obtained with our SMT and NMT systems before and after hyper-parameter tuning (NMT\(_{ v }\) and NMT\(_{ t }\), respectively), along with the corresponding architectures, for the eight translation directions evaluated in our baseline experiments. Tuned parameters are: d model dimension, ff feed-forward dimension, enc encoder layers, dec decoder layers, h heads, norm normalization position. Results in bold indicate systems significantly better than the others with \(p<0.05\)

4.1.1 English-to-Japanese

The vanilla Transformer architecture (base configuration with embeddings of 512 dimensions, 6 encoder and decoder layers, 8 heads, 2048 feed-forward dimensions, post-normalization) with default decoder parameters (beam size of 4 and length penalty set at 0.6) leads to a chrF score of 0.042, while the best configuration after hyper-parameter tuning reaches 0.212 (\(+0.170\)pts). The final configuration has embeddings of 512 dimensions, 2 encoder and 6 decoder layers with 1 attention head, 512 feed-forward dimensions with post-normalization. For decoding, a beam size of 12 and a length penalty set at 1.4 lead to the best results on the validation set.

4.1.2 Japanese-to-English

The out-of-the-box Transformer architecture leads to a BLEU of 0.2 and a chrF of 0.130. The best configuration reaches 8.6 BLEU and 0.372 chrF (\(+8.4\)pts and \(+0.242\)pts respectively). The best configuration according to the validation set has 512 embedding dimensions, 2 encoder and 6 decoder layers with 1 attention head, 512 feed-forward dimensions and post-normalization. The decoder parameters are set to a beam size of 12 and a length penalty of 1.4.

4.1.3 English-to-Lao

The default architecture reaches a chrF score of 0.131 for this translation direction. After hyper-parameter search, the score reaches 0.339 chrF (\(+0.208\) pts) with a best architecture composed of 1 encoder and 6 decoder layers with 4 heads, 512 dimensions for both embeddings and feed-forward, using post-normalization. The decoder beam size is 12 and the length penalty is 1.4.

4.1.4 Lao-to-English

Without hyper-parameter search, the results obtained on the test set are 0.4 BLEU and 0.170 chrF. Tuning leads to 10.5 (\(+10.1\)pts) and 0.374 (\(+0.204\)pts) for BLEU and chrF respectively. The best architecture has 512 dimensions for both embeddings and feed-forward, pre-normalization, 4 encoder and 6 decoder layers with 1 attention head. A beam size of 12 and a length penalty of 1.4 are used for decoding.

4.1.5 English-to-Malay

For this translation direction, the default setup leads to 0.9 BLEU and 0.213 chrF. After tuning, with a beam size of 4 and a length penalty of 1.4 for the decoder, a BLEU of 33.3 (\(+32.4\)pts) and a chrF of 0.605 (\(+0.392\)pts) are reached. These scores are obtained with the following configuration: embedding and feed-forward dimensions of 512, 6 encoder and decoder layers with 1 head and pre-normalization.

4.1.6 Malay-to-English

The default Transformer configuration reaches a BLEU of 1.7 and a chrF of 0.194. The best configuration found during hyper-parameter search leads to 29.9 BLEU (\(+28.2\)pts) and 0.559 chrF (\(+0.365\)pts), with both embedding and feed-forward dimensions at 512, 4 encoder and 6 decoder layers with 4 heads and post-normalization. For decoding, a beam size of 12 and a length penalty set at 1.4 lead to the best results on the validation set.

4.1.7 English-to-Vietnamese

With no parameter search for this translation direction, 0.9 BLEU and 0.134 chrF are obtained. With tuning, the best configuration reaches 26.8 BLEU (\(+25.9\)pts) and 0.468 chrF (\(+0.334\)pts) using 512 embedding dimensions, 4,096 feed-forward dimensions, 4 encoder and decoder layers with 4 heads and pre-normalization. The decoder beam size is 12 and the length penalty is 1.4.

4.1.8 Vietnamese-to-English

Using the default Transformer architecture leads to 0.6 BLEU and 0.195 chrF. After tuning, 21.9 BLEU and 0.484 chrF are obtained with a configuration using 512 dimensions for embeddings and feed-forward, 4 encoder and 6 decoder layers with 4 attention heads, using post-normalization. A beam size of 12 and a length penalty of 1.4 are used for decoding.

4.1.9 General observations

For all translation directions, the out-of-the-box Transformer architecture (NMT\(_v\)) does not lead to the best results and models trained with this configuration fail to converge. Models with tuned hyper-parameters (NMT\(_t\)) improve over NMT\(_v\) for all translation directions, while SMT remains the best performing approach. Among the two model dimensionalities (hyper-parameter noted d) evaluated, the smaller one leads to the best results for all language pairs. For the decoder hyper-parameters, a length penalty of 1.4 leads to the best results on the validation set for all language pairs and translation directions. A beam size of 12 is the best performing configuration for all directions except EN\(\rightarrow \)MS, for which a beam size of 4 is selected. However, for this particular translation direction, only the chrF score is lower when using a beam size of 12 while the BLEU score is identical.
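The tuned configurations reported in Sects. 4.1.1 to 4.1.8 can be gathered in a small data structure to check these observations directly. The sketch below only restates values given in the text; the class, field and key names are illustrative choices.

```python
from typing import NamedTuple


class Config(NamedTuple):
    d: int                 # embedding (model) dimension
    ff: int                # feed-forward dimension
    enc: int               # encoder layers
    dec: int               # decoder layers
    heads: int             # attention heads
    norm: str              # layer normalization position: "pre" or "post"
    beam: int              # decoder beam size
    length_penalty: float  # decoder length penalty


# Best configuration per translation direction, as reported in Sect. 4.1.
TUNED = {
    "en->ja": Config(512, 512, 2, 6, 1, "post", 12, 1.4),
    "ja->en": Config(512, 512, 2, 6, 1, "post", 12, 1.4),
    "en->lo": Config(512, 512, 1, 6, 4, "post", 12, 1.4),
    "lo->en": Config(512, 512, 4, 6, 1, "pre", 12, 1.4),
    "en->ms": Config(512, 512, 6, 6, 1, "pre", 4, 1.4),
    "ms->en": Config(512, 512, 4, 6, 4, "post", 12, 1.4),
    "en->vi": Config(512, 4096, 4, 4, 4, "pre", 12, 1.4),
    "vi->en": Config(512, 512, 4, 6, 4, "post", 12, 1.4),
}

pre_norm = [pair for pair, c in TUNED.items() if c.norm == "pre"]
six_layer_decoders = [pair for pair, c in TUNED.items() if c.dec == 6]
other_beam = [pair for pair, c in TUNED.items() if c.beam != 12]

print("pre-norm:", pre_norm)
print("6-layer decoders:", len(six_layer_decoders), "of", len(TUNED))
print("beam size other than 12:", other_beam)
```

Laid out this way, it is easy to see that no selected post-norm configuration uses more than 4 encoder layers, and that the only 6-layer encoder is paired with pre-normalization.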

4.2 Monolingual data

This section presents the experiments conducted by adding synthetic parallel data to our baseline NMT systems using the monolingual corpora described in Sect. 3.1. Four approaches were investigated, including three based on backward translation, where the monolingual corpora used were in the target language, and one based on forward translation, where the monolingual corpora used were in the source language. As backward translation variations, we investigated the use of a specific tag indicating the origin of the data, as well as the introduction of noise into the synthetic data.

4.2.1 Backward translation

Table 5 Test set results obtained when using backward-translated monolingual data in addition to the parallel training data. Baseline NMT\(_t\) uses only parallel training data and baseline SMT uses monolingual data for its language model

The use of monolingual corpora to produce synthetic parallel data through backward translation (or back-translation, i.e., translating from target to source language) has been popularized by Sennrich et al. (2016a). We made use of the monolingual data presented in Table 2 and the best NMT systems presented in Table 4 (NMT\(_t\)) to produce additional training data for each translation direction.

We evaluated the impact of different amounts of synthetic data, i.e., 18k, 100k, 1M and 10M, as well as two NMT configurations per translation direction: the best performing baseline as presented in Table 4 (NMT\(_t\)) and the out-of-the-box Transformer configuration, henceforth Transformer base, noted NMT\(_v\). Comparing the two architectures with different amounts of synthetic data allows us to evaluate how many backward translations are required in order to switch back to the commonly used Transformer configuration. These findings are illustrated in Sect. 5.

A summary of the best results involving back-translation is presented in Table 5. When comparing the best tuned baseline (NMT\(_t\)) to the Transformer base (NMT\(_v\)) with 10M monolingual data, NMT\(_v\) outperforms NMT\(_t\) for all translation directions. For three translation directions, i.e., EN → JA, EN → MS and EN → VI, SMT outperforms the NMT architectures. For the EN → LO direction, however, NMT\(_v\) outperforms SMT, which is explained by the small amount of Lao monolingual data, which does not allow for a robust language model to be used by SMT.

4.2.2 Tagged backward translation

Table 6 Test set results obtained when using tagged backward-translated monolingual data in addition to the parallel training data. Baseline NMT\(_t\) uses only parallel training data and baseline SMT uses monolingual data for its language model. Results in bold indicate systems significantly better than the others with \(p<0.05\)

Caswell et al. (2019) empirically showed that adding a unique token at the beginning of each backward translation, i.e., on the source side, acts as a tag that helps the system during training to differentiate backward translations from the original parallel training data. According to the authors, this method is as effective as introducing synthetic noise for improving translation quality (Edunov et al. 2018) (noised backward translation is investigated in Sect. 4.2.3). We believe that the tagged approach is simpler than the noised one since it requires only one editing operation, i.e., the addition of the tag.

To study the effect of tagging backward translations in an extremely low-resource configuration, we performed experiments with the same backward translations used in Sect. 4.2.1, modified by the addition of a tag “[BT]” at the beginning of each source sentence of the synthetic parallel data. Table 6 presents the results obtained with our tagged back-translation experiments along with the best tuned NMT baseline using only parallel data (NMT\(_t\)) and an SMT system using the monolingual data for its language model.
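The tagging step itself is a single editing operation per sentence pair. A minimal sketch follows, assuming that sentence pairs are held as (source, target) tuples and that the tag is separated from the sentence by one space; the function name and the toy sentences are hypothetical.

```python
def tag_back_translations(pairs, tag="[BT]"):
    """Prepend the tag to the synthetic source side of each sentence pair.

    The target side (original monolingual text) is left untouched.
    """
    return [(f"{tag} {source}", target) for source, target in pairs]


# Original parallel data is left untagged and simply concatenated
# with the tagged synthetic pairs to form the training set.
parallel = [("a b c", "x y z")]
synthetic = [("d e f", "u v w")]
training_data = parallel + tag_back_translations(synthetic)
```

At test time no tag is added, so the model treats the input as original text.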

The tagged back-translation results indicate that the Transformer base architecture trained in this way outperforms all other approaches for the eight translation directions. It also improves over the backward translation approach without tag (see Table 5), according to both automatic metrics. All translation directions benefit from adding 10M backward translations except for the EN → JA direction, for which we observe a plateau and no improvement over the system using 1M synthetic parallel data.

4.2.3 Noised backward translation

As an alternative to tagged backward translation, we evaluate the noised backward translation approach for the best performing configuration obtained in the previous section, namely the Transformer base architecture. Adding noise to the source side of backward-translated data has previously been explored in NMT and we followed the exact approach proposed by Edunov et al. (2018). Three types of noise were added to each source sentence: word deletion with a probability of 0.1, word replacement by a specific token with a probability of 0.1, and finally word swapping, with a random swap of words no further than three positions apart.
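The three noise operations can be sketched as follows. This is an illustration of the description above rather than the exact implementation used: the function name, the filler token `<BLANK>` and the way the local shuffle is realized (sorting positions after adding uniform random offsets, which bounds each word's displacement by `max_shift`) are assumptions.

```python
import random


def add_noise(tokens, p_drop=0.1, p_blank=0.1, max_shift=3,
              blank="<BLANK>", rng=None):
    """Apply word deletion, word replacement and local word swapping."""
    if rng is None:
        rng = random.Random()
    # 1. Delete each word independently with probability p_drop.
    #    (A sentence may end up empty; real pipelines often guard against this.)
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2. Replace each remaining word by a filler token with probability p_blank.
    blanked = [blank if rng.random() < p_blank else t for t in kept]
    # 3. Local shuffle: add a random offset in [0, max_shift + 1) to each
    #    position and sort; no word moves more than max_shift positions.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]


noisy = add_noise("this is a short source sentence".split(),
                  rng=random.Random(0))
```

Only the synthetic source side is noised; the target side remains the original monolingual sentence.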

Our experiments with noisy backward-translated source sentences were conducted using the Transformer base architecture only, as this configuration led to the best results when using tagged back-translation. The results are presented in Table 7 and show that tagged backward translation reaches higher scores for both automatic metrics in all translation directions. Systems trained on noised backward translations outperform our SMT system except for the EN → JA, EN → MS, and EN → VI translation directions, which are translation directions involving translationese as target.

Table 7 Test set results obtained when using noised backward-translated monolingual data in addition to the parallel training data. The NMT architecture is the Transformer base configuration. Baseline NMT\(_t\) uses only parallel training data and baseline SMT uses monolingual data for its language model

4.2.4 Forward translation

In contrast to backward translation, with forward translation we used the synthetic data produced by NMT on the target side. Basically, we obtained this configuration by reversing the training parallel data used to train NMT with backward translations in Sect. 4.2.1. One advantage of this approach is that we have clean and original data on the source side to train a better encoder, while a major drawback is that we have synthetic data on the target side, which potentially coerces the decoder into generating ungrammatical translations. Bogoychev and Sennrich (2019) showed that forward translation is more useful when translating from original texts compared to translating from translationese. We thus expected to obtain more improvement according to the automatic metrics for the EN → XX translation directions than for the XX → EN translation directions compared to the baseline NMT systems.
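The difference between the two kinds of synthetic data, and the reversal mentioned above, can be sketched as follows. The `translate` argument stands for any baseline MT system and is a hypothetical callable; the function names are illustrative only.

```python
def back_translation_pairs(target_monolingual, translate_target_to_source):
    """Synthetic source, original target: the decoder sees clean text."""
    return [(translate_target_to_source(t), t) for t in target_monolingual]


def forward_translation_pairs(source_monolingual, translate_source_to_target):
    """Original source, synthetic target: the encoder sees clean text."""
    return [(s, translate_source_to_target(s)) for s in source_monolingual]


def reverse_pairs(pairs):
    """Swap the two sides of each pair.

    Back-translation data built for one direction becomes
    forward-translation data for the opposite direction.
    """
    return [(target, source) for source, target in pairs]
```

For example, back-translation pairs built from English monolingual text for XX → EN can be reversed to obtain forward-translation pairs for EN → XX, with the original English on the source side.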

Table 8 Test set results obtained when using forward-translated monolingual data in addition to the parallel training data. Baseline NMT\(_t\) uses only parallel training data and baseline SMT uses monolingual data for its language model

The results obtained for our forward translation experiments show that this approach is outperformed by SMT for all translation directions. For translation directions with translationese as target, forward translation improves over the NMT baseline, confirming the findings of previous studies. Results also show that a plateau is reached when using 1M synthetic sentence pairs and only EN → MS benefits from adding 10M pairs (Table 8).

4.3 Multilingual models

Experiments on multilingual NMT involved jointly training many-to-one and one-to-many translation models. In particular, we examined two models, one translating from English into four target languages (EN → XX), and one translating from four source languages into English (XX → EN). For both models, the datasets presented in Table 1 were concatenated, while the former model required target-language-specific tokens to be pre-pended to the source sentences, as described in Sect. 2.2.
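The data preparation for the two multilingual models can be sketched as below. The token format `<2xx>` and the function names are assumptions made for illustration; the text only specifies that a target-language-specific token is pre-pended to the source sentences of the EN → XX model.

```python
def build_one_to_many(corpora):
    """Concatenate EN -> XX corpora, marking the desired target language.

    corpora maps a target language code to a list of
    (english_sentence, target_sentence) pairs.
    """
    data = []
    for lang, pairs in corpora.items():
        data.extend((f"<2{lang}> {source}", target) for source, target in pairs)
    return data


def build_many_to_one(corpora):
    """Concatenate XX -> EN corpora; the target is always English,
    so no language token is needed.

    corpora maps a source language code to a list of
    (source_sentence, english_sentence) pairs.
    """
    return [pair for pairs in corpora.values() for pair in pairs]
```

At inference time, the same token is pre-pended to an English input to select the output language of the one-to-many model.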

In order to compare our baseline results to the multilingual NMT approach, we conducted the same hyper-parameter search as in Sect. 4.1. Table 9 reports the best results along with the best NMT architecture for each translation direction. These results show that the multilingual NMT approach outperforms the bilingual NMT models for 4 translation directions, namely EN → JA, JA → EN, EN → LO, and EN → VI. For 5 translation directions, SMT reaches better performance, while multilingual NMT outperforms SMT for 2 translation directions. For JA → EN, SMT reaches better results than multilingual NMT based on the chrF metric, while the two approaches are not significantly different based on BLEU.

Table 9 Test set results obtained with the multilingual NMT approach (NMT\(_{ m }\)) along with the corresponding architectures for the eight translation directions. Two disjoint models are trained, from and towards English. Tuned parameters are: d model dimension, ff feed-forward dimension, enc encoder layers, dec decoder layers, h heads, norm normalization position. NMT\(_{ t }\) are bilingual baseline models after hyper-parameter tuning and SMT is the baseline model without monolingual data for its language model. Results in bold indicate systems significantly better than the others with \(p<0.05\)

4.4 Unsupervised multilingual models

In this section, we present the results of our UNMT experiments performed with the system described in Sect. 3.2. Previous work has shown that UNMT can reach good translation quality but experiments were limited to high-resource language pairs. We intended to investigate the usefulness of UNMT for our extremely low-resource language pairs, including a truly low-resource language, Lao, for which we had fewer than 1M lines of monolingual data. To the best of our knowledge, this is the first time that UNMT experiments and results for such a small amount of monolingual data are reported. Results are presented in Table 10.

Table 10 Test set results obtained with UNMT (NMT\(_ u \)) in comparison with the best NMT\(_ t \) systems, trained without using monolingual data, and systems trained on 10M (or 737k for EN\(\rightarrow \)LO) tagged backward translations with the Transformer base architecture (T-BT). SMT uses monolingual data for its language model

We observed that our UNMT model (noted NMT\(_ u \)) is unable to produce translations for the EN–JA language pair in both translation directions, as shown by BLEU and chrF scores close to 0. On the other hand, NMT\(_ u \) performs slightly better than NMT\(_ t \) for LO\(\rightarrow \)EN (+0.2 BLEU, +0.004 chrF) and MS\(\rightarrow \)EN (+2.0 BLEU, +0.011 chrF). However, tagged backward translation, which requires virtually less monolingual data than NMT\(_ u \) but also exploits the original parallel data, leads to the best system by a large margin of more than 10 BLEU points for all language pairs except EN–LO.

5 Analysis and discussion

This section presents the analysis and discussion relative to the experiments and results obtained in Sect. 4, focusing on four aspects.

  1. The position of the layer normalization component and its impact on the convergence of NMT models as well as on the final results according to automatic metrics.

  2. The integration of monolingual data produced by baseline NMT models with tuned hyper-parameters through four different approaches.

  3. The combination of the eight translation directions in a multilingual NMT model.

  4. The training of unsupervised NMT by using monolingual data only.

5.1 Layer normalization position

Layer normalization (Ba et al. 2016) is a crucial component of the Transformer architecture, since it allows for stable training and fast convergence. Some recent studies have shown that the position of the layer normalization mechanism has an impact on the ability to train deep Transformer models in high-resource settings or out-of-the-box Transformer configurations in low-resource settings. For the latter, Nguyen and Salazar (2019) show that when using a Transformer base configuration, one should always put layer normalization prior to the non-linear layers within a Transformer block (i.e., pre-norm).

Our experiments on eight translation directions in extremely low-resource settings validate these findings. More precisely, none of our baseline NMT systems using the Transformer base configuration with post-non-linearity layer normalization (i.e., post-norm) converge. Successfully trained baseline models using post-norm require fewer encoder layers, while pre-norm allows for deeper architectures even with extremely low-resource language pairs.
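The two placements differ only in where normalization is applied relative to the residual connection. A minimal sketch in plain Python follows, operating on a single vector and omitting the learned gain and bias of layer normalization for brevity; `sublayer` stands for either the attention or the feed-forward component, and all names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm(x, sublayer):
    """Post-norm: LayerNorm(x + sublayer(x)).

    The residual sum itself is normalized at every layer.
    """
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm(x, sublayer):
    """Pre-norm: x + sublayer(LayerNorm(x)).

    The input x passes through the residual path unchanged.
    """
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the pre-norm variant the identity path from input to output is never rescaled, which is commonly given as the reason why deeper stacks remain trainable.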

An interesting observation made on the baseline results is that the decoder part of the Transformer does not suffer from depth-related issues. Additionally, during hyper-parameter tuning, results show that there are larger gaps in terms of chrF scores between configurations using post-norm than between those using pre-norm. As a recommendation, we encourage NMT practitioners to perform a hyper-parameter search including at least the two layer normalization positions, pre- and post-norm, in order to reach the best performance in low-resource settings. Another recommendation, in order to save computing time during hyper-parameter search, is to limit the encoder depth to a maximum of 4 layers when using post-norm, as our experiments show that models combining 6-layer encoders and post-norm fail to converge for every language pair and translation direction of the training corpora presented in this paper.
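Both recommendations can be expressed as a pruned grid search. In the sketch below, the candidate values in `SEARCH_SPACE` are illustrative assumptions and not the exact search space used in our experiments; only the pruning rule (no post-norm encoder deeper than 4 layers) comes from the recommendation above.

```python
from itertools import product

# Illustrative candidate values (assumptions, not the actual search space).
SEARCH_SPACE = {
    "enc": [1, 2, 4, 6],
    "dec": [1, 2, 4, 6],
    "heads": [1, 4, 8],
    "ff": [512, 2048, 4096],
    "norm": ["pre", "post"],
}


def candidate_configs(space=SEARCH_SPACE, max_post_norm_enc=4):
    """Yield every configuration except post-norm ones with deep encoders."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        if config["norm"] == "post" and config["enc"] > max_post_norm_enc:
            continue  # skipped to save computing time
        yield config


configs = list(candidate_configs())
print(len(configs), "configurations to train")
```

With these illustrative values, the rule removes 36 of the 288 combinations before any model is trained.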

5.2 Monolingual data

Producing synthetic parallel data from monolingual corpora has been shown to improve NMT performance. In our work, we investigate four methods to generate such parallel data, including backward and forward translations. For the former method, three variants of the commonly used backward translation are explored, with or without source-side tags indicating the origin of the data, and with or without noise introduced in the source sentences.

Fig. 1 Test set results obtained by the best tuned baseline (NMT\(_{ t }\)) and Transformer base (NMT\(_{ v }\)) with various amounts of monolingual data for back-translation (bt), tagged back-translation (t-bt) and forward translation (ft)

When using backward translations without tags and up to 10M monolingual sentences, the Transformer base architecture outperforms the best baseline architecture obtained through hyper-parameter search. However, the synthetic data itself was produced by the best baseline NMT system. Our preliminary experiments using the Transformer base to produce the backward translations unsurprisingly led to lower translation quality, due to the low quality of the synthetic data. Nevertheless, using more than 100k backward translations produced by the best baseline architecture makes it possible to switch to the Transformer base. As indicated by Fig. 1, when using backward-translated synthetic data (noted bt), Transformer base (NMT\(_{ v }\)) outperforms the tuned baseline (NMT\(_{ t }\)), following the same trend for the eight translation directions.

Based on the empirical results reported in Table 6, tagged backward translation is the best performing approach among all approaches evaluated in this work, according to automatic metrics, and regardless of the quantity of monolingual data used, as summarized in Table 11. Noised backward translations and forward translations both underperform tagged backward translation. Despite our extremely low-resource setting, we could confirm the findings of Caswell et al. (2019) that pre-pending a tag to each backward-translated sentence is very effective in improving BLEU scores, irrespective of whether the translation direction is for original or translationese texts. However, while Caswell et al. (2019) report that tagged backward translation is as effective as introducing synthetic noise in high-resource settings, we observe a different tendency in our experiments. Using the noised backward translation approach leads to lower scores than tagged backward translation and does not improve over the use of backward translations without noise.

Since we only have a very small amount of original parallel data, we assume that degrading the quality of the backward translations is too detrimental to train an NMT system that does not have enough instances of well-formed source sentences to learn the characteristics of the source language. Edunov et al. (2018) also report a similar observation in a low-resource configuration using only 80k parallel training sentences. As for the use of forward translations, this approach outperforms the backward translation approach only when using a small quantity of monolingual data. As discussed by Bogoychev and Sennrich (2019), training NMT with synthetic data on the target side can be misleading for the decoder, especially when the forward translations are of very poor quality, such as translations generated by a low-resource NMT system.

Table 11 Test set results obtained with the best NMT systems using backward translations (BT), tagged backward translations (T-BT), forward translations (FT), and noised backward translations (N-BT), regardless of the hyper-parameters and quantity of monolingual data used. Results in bold indicate systems significantly better than the others with \(p<0.05\)

5.3 Multilingualism

Comparing the BLEU scores of bilingual and multilingual models for the EN → JA, EN → LO and EN → VI translation directions, multilingual models outperform bilingual ones. On the other hand, for the other translation directions, the performance of multilingual models is comparable to or lower than that of bilingual ones. When comparing SMT to multilingual NMT models, well-tuned multilingual models outperform SMT for the EN → JA and EN → LO translation directions, while SMT is as good or better for the remaining six translation directions. In contrast, well-tuned bilingual NMT models are unable to outperform SMT regardless of the translation direction.

Our observations are in line with previous works such as Firat et al. (2016) and Johnson et al. (2017), which showed that multilingual models outperform bilingual ones for low-resource language pairs. However, these works did not focus on extremely low-resource settings and multi-parallel corpora for training multilingual models. Although multilingualism is known to improve performance for low-resource languages, we observe drops in performance for some of the language pairs involved. The work on multi-stage fine-tuning (Dabre et al. 2019), which uses N-way parallel corpora similar to the ones used in our work, supports our observations regarding drops in performance. However, based on the characteristics of our parallel corpora, the multilingual models trained and evaluated in our work do not benefit from additional knowledge by increasing the number of translation directions, because the translation content for all language pairs is the same. Although our multilingual models do not always outperform bilingual NMT or SMT models, we observe that they are useful for difficult translation directions where BLEU scores are below 10pts.

With regard to the tuned hyper-parameters, we noticed slight differences between multilingual and bilingual models. By comparing the best configurations in Tables 9 and 4, we observe that the best multilingual models are mostly those that use pre-normalization of layers (7 translation directions out of 8). In contrast, there is no such tendency for bilingual models, where 3 configurations out of 8 reach the best performance using pre-normalization.

Another observation is related to the number of decoder layers, where shallower decoders are better for translating into English whereas deeper ones are better for the opposite direction. This tendency is different from the one that bilingual models exhibit, where deeper decoders are almost always preferred (6 layers in 7 configurations out of 8). We assume that the many-to-English models require a shallower decoder architecture because of the repetitions on the target side of the training and validation data. However, a deeper analysis is required, involving other language pairs and translation directions, to validate this hypothesis. To the best of our knowledge, no study of extensive hyper-parameter tuning, optimal configurations and performance comparisons exists in the context of multilingual NMT models for extremely low-resource settings.

5.4 Unsupervised NMT

Our main observation from the UNMT experiments is that UNMT only performs similarly or slightly better, for EN-LO and MS → EN respectively, than our best supervised NMT systems with tuned hyper-parameters, while it is significantly worse for all the remaining translation directions. These results are particularly surprising for EN-LO, for which we have only very little data to pre-train an XLM model and UNMT, meaning that we do not necessarily need a large quantity of monolingual data for each language in a multilingual scenario for UNMT. Nonetheless, the gap between supervised MT and UNMT is even more significant if we take SMT systems as the baseline, with a difference of more than 5.0 BLEU for all translation directions. Training an MT system with only 18k parallel sentence pairs is thus preferable to using only monolingual data for these extremely low-resource configurations. Furthermore, using only parallel data for supervised NMT is unrealistic since we also have access to at least the same monolingual data used by UNMT. The gap is then even greater when exploiting monolingual data as tagged backward translations for training supervised NMT.

Our results are in line with the few previous works on unsupervised MT for distant low-resource language pairs (Marie et al. 2019) but, to the best of our knowledge, this is the first time that results for such language pairs are reported using a UNMT system initialized with a pre-trained cross-lingual language model (XLM). We assume that the poor performance reached by our UNMT models is mainly due to the significant lexical and syntactic differences between English and the other languages, making the training of a single multilingual embedding space for these languages a very hard task (Søgaard et al. 2018). This is well exemplified by the EN-JA language pair, which involves two languages with completely different writing systems, and for which BLEU and chrF scores are close to 0. The performance of UNMT is thus far from the performance of supervised MT. Further research is necessary in UNMT for distant low-resource language pairs in order to improve it and rival the performance of supervised NMT, as observed for more similar language pairs such as French–English and English–German (Artetxe et al. 2018; Lample et al. 2018; Artetxe et al. 2018; Lample et al. 2018).

6 Conclusion

This paper presented a study on extremely low-resource language pairs with the Transformer NMT architecture involving Asian languages and eight translation directions. After conducting an exhaustive hyper-parameter search focusing on the specificity of the Transformer to define our baseline systems, we trained and evaluated translation models making use of various amounts of synthetic data. Four different approaches were employed to generate synthetic data, including backward translations, with or without specific tags and added noise, and forward translations, which were then contrasted with an unsupervised NMT approach built only using monolingual data. Finally, based on the characteristics of the parallel data used in our experiments, we jointly trained multilingual NMT systems from and towards English.

The main objective of the work presented in this paper is to deliver a recipe allowing MT practitioners to train NMT systems in extremely low-resource scenarios. Based on the empirical evidence validated on eight translation directions, we make the following recommendations.

  • First, an exhaustive hyper-parameter search, including the position of layer normalization within the Transformer block, is crucial for both a strong baseline and producing synthetic data of sufficient quality.

  • Second, a clear preference for backward compared to forward translations for the synthetic data generation approach.

  • Third, generating enough backward translations to benefit from the large number of parameters available in the commonly used Transformer architecture in terms of number and dimensionality of layers.

  • Fourth, adding a tag on the source side of backward translations to indicate their origin, which leads to higher performance than not adding a tag or introducing noise.

As future work, we plan to enlarge the search space for hyper-parameter tuning, including more general parameters which are not specific to the Transformer architecture, such as various dropout rates, vocabulary sizes and learning rates. Additionally, we want to increase the number of learnable parameters in the Transformer architecture to avoid reaching a plateau when the amount of synthetic training data increases. Finally, we will explore other multilingual NMT approaches in order to improve the results obtained in this work and to make use of tagged backward translation as we did for the bilingual models.