1 Introduction

Comparable corpora have long been identified as an important source of potentially useful data in natural language processing, notably to train data-driven machine translation systems (Munteanu & Marcu, 2005; Sharoff et al., 2016) or to extract bilingual dictionaries (Rapp, 1995).

Extracting sentence pairs with a significant degree of parallelism from this type of corpora faces important challenges, however, considering the vast amounts of unrelated or loosely related data that compose comparable datasets in different languages. Several methods have been designed to address the challenges of comparable document alignment (Azpeitia & Etchegoyhen, 2019; Sharoff et al., 2015) and comparable sentence alignment (Artetxe & Schwenk, 2019; Etchegoyhen & Azpeitia, 2016b; Fung & Cheung, 2004; Munteanu & Marcu, 2002; Smith et al., 2010), leading to new parallel datasets that can support machine translation, in particular for under-resourced languages (Etchegoyhen et al., 2016; Schwenk et al., 2019).

Neural Machine Translation (nmt) is currently the dominant paradigm in research and development in the field of Machine Translation, having led to significant advances in recent years for most language pairs (Bahdanau et al., 2015; Bojar et al., 2016, 2017; Vaswani et al., 2017). As a data-driven approach where model parameters are estimated and optimised on large volumes of parallel data, nmt is particularly sensitive to the presence of noise in the training data (Belinkov & Bisk, 2018; Cheng et al., 2018; Khayrallah & Koehn, 2018; Sperber et al., 2017). Due to the nature of the task, parallel data extracted from comparable corpora are likely to introduce unwarranted noise in the training process, and this potential issue warrants evaluation.

In this article, we describe the results of an in-depth evaluation of the exploitation of a large strongly comparable corpus for Basque-Spanish, composed of independently-produced news by the Basque public broadcaster eitb,Footnote 1 focusing on the impact of various techniques to exploit the original data for Neural Machine Translation and extending the study by Etchegoyhen & Gete (2020).

We first compare results obtained with two alignment methods that have demonstrated high accuracy on comparable data, namely laser (Artetxe & Schwenk, 2019) and stacc (Etchegoyhen & Azpeitia, 2016b), and evaluate various datasets and models based on the alignments obtained with each method via threshold filtering and data combination. We then measure the impact of filtering based on length difference statistical outliers between aligned source and target sentences, as comparable corpora typically feature information divergences between source and target sentences.

We also explore the impact of tags to identify comparable data in the training datasets, following the approach proposed by Caswell et al. (2019) for backtranslations and Etchegoyhen & Gete (2020) for comparable data, and the use of mixed tagging by limiting tagging to portions of the data identified as containing more imbalanced information in the source and the target.

Finally, we compare the impact of using aligned comparable sentences to the use of backtranslations for nmt (Sennrich et al., 2016b), as both approaches have demonstrated their usefulness to increase the coverage and quality of translation models. We evaluate in particular the impact of mixtures of aligned and backtranslated data on the quality of the resulting translation models.

The remainder of this paper is organised as follows: Sect. 2 describes related work on comparable corpora; Sect. 3 describes the corpora, alignment methods and tools used in this study; in Sect. 4, we describe the baseline translation models based on the initial datasets; Sect. 5 centres on threshold filtering using the selected alignment methods; in Sect. 6, a detailed analysis of the different datasets is presented, while Sect. 7 describes the results obtained with different combinations of the available data; Sect. 8 describes the results obtained via length filtering and Sect. 9 those obtained via data tagging; in Sect. 10, we report on different experiments using backtranslated data; Sect. 11 presents a comparative summary of results on translation models; finally, Sect. 12 draws conclusions from the experiments and analyses.

2 Related work

A significant body of work has been produced over the years to mine and exploit parallel sentences from large collections of comparable monolingual corpora (Abdul-Rauf & Schwenk, 2009, 2011; Irvine & Callison-Burch, 2013; Munteanu & Marcu, 2005; Sharoff et al., 2016; Smith et al., 2010), starting with seminal work by Resnik (1999) to exploit the World Wide Web as a source of potential parallel data.

A standard part in the process is the determination of document pairs, to reduce the space of computation in the typically large datasets involved in the task. Several approaches have exploited metadata information for Web page alignment, including url, structural tags or publication date (Chen & Nie, 2000; Munteanu & Marcu, 2005; Papavassiliou et al., 2016; Resnik & Smith, 2003). Alternatively, content-based document alignment approaches for comparable corpora have also been proposed, based on vector space models (Chen et al., 2004), token translation ratios (Ma & Liberman, 1999), mutual information (Fung & Cheung, 2004), expectation-maximisation (Ion et al., 2011) or n-gram matching using machine-translated documents (Uszkoreit 2010).

Several approaches have used a mixture of content and structural properties, notably the systems in the wmt 2016 document alignment shared task (Buck & Koehn, 2016a). Among those, Gomes and Lopes (2016) proposed a phrase-based approach combined with url-matching, Buck and Koehn (2016b) used cosine similarity between tf/idf vectors over machine-translated documents, and Germann (2016) performed alignment via vector space word representations, latent semantic indexing, and url matching. In Esplá-Gomis et al. (2016), document alignment is performed via a mixture of url similarity, structural features such as shared links, and bag of words similarity.

Comparability at the document level was notably evaluated in a dedicated shared task of the bucc workshop series (Sharoff et al., 2015). Among participating systems, Li and Gaussier (2013) used bilingual dictionaries and proportion of matching words to assess comparability, Morin et al. (2015) made use of hapax legomena and pigeon hole reasoning to enforce alignments, and Zafarian et al. (2015) used several components, including topic modelling, named entity detection and word features. A strictly content-based method was proposed by Etchegoyhen and Azpeitia (2016a), based on Jaccard similarity (Jaccard, 1901) over sets of lexical translations, expanded with surface-based entities and common prefix matching, demonstrating high accuracy in a large number of scenarios, including comparable corpora (Azpeitia & Etchegoyhen, 2019).

The extraction of parallel sentences from comparable corpora has also been addressed via a large variety of approaches, based on suffix trees (Munteanu & Marcu, 2002), maximum likelihood (Zhao & Vogel, 2002), binary classification (Munteanu & Marcu, 2005), cosine similarity (Fung & Cheung, 2004), reference metrics over statistical machine translations (Abdul-Rauf & Schwenk, 2009; Sarikaya et al., 2009) or rich features (Smith et al., 2010; Stefănescu et al., 2012). In recent years, deep learning approaches have been applied to the task as well, via bidirectional recurrent neural networks (Grégoire & Langlais, 2017) or cosine similarity over bilingual sentence embeddings (Schwenk, 2018).

The core document alignment method of Etchegoyhen and Azpeitia (2016a) was applied to comparable sentence alignment, improving significantly over more sophisticated feature-rich methods (Etchegoyhen & Azpeitia, 2016b). This method, referred to as stacc, which is also based on Jaccard similarity over expanded lexical translation sets, was further extended with lexical weighting (Azpeitia et al., 2017) and a named-entity penalty (Azpeitia et al., 2018), obtaining the best results across the board in the bucc shared task on parallel sentence identification in comparable corpora (Zweigenbaum et al., 2017, 2018).

Improvements over these results were obtained with the margin-based approach of Artetxe & Schwenk (2019), which uses cosine similarity over multilingual sentence embeddings, extended with a filtering mechanism based on nearest neighbours similarity. Large amounts of comparable corpora have been mined with this approach (Schwenk et al., 2019), by means of the laser toolkit, which we include in our experiments on the selected corpora.
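
To make the margin criterion concrete, the following is a minimal sketch of a ratio-based margin score, in which the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours in the other language. It is an illustration only and not the laser implementation: the function names, the toy vectors and the default k=4 are assumptions, and real systems rely on approximate nearest-neighbour search rather than exhaustive comparison.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_top_k(vec, candidates, k):
    """Average cosine similarity between vec and its k most similar candidates."""
    sims = sorted((cosine(vec, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / len(sims)


def ratio_margin_score(x, y, src_pool, tgt_pool, k=4):
    """Cosine of (x, y) relative to the average neighbourhood similarity.

    Neighbours of the source vector x are searched among target vectors,
    and neighbours of the target vector y among source vectors.
    """
    neighbourhood = (mean_top_k(x, tgt_pool, k) + mean_top_k(y, src_pool, k)) / 2
    # Guard for degenerate toy inputs (hypothetical handling, not from the paper).
    return cosine(x, y) / neighbourhood if neighbourhood else 0.0
```

A pair whose similarity stands out from its neighbourhood receives a score above 1, which is the quantity a threshold would then be applied to.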

Recently, unsupervised approaches to NMT have demonstrated the potential of directly exploiting monolingual data to train neural translation models (Artetxe et al., 2018; Lample et al., 2018). A detailed comparison between this type of approach and the use of parallel sentences mined from comparable corpora may shed light on their respective merits, in isolation or in combination. Although this type of evaluation is outside the scope of the present study, the results in Sect. 10 on the use of back-translated data, compared to mining parallel sentences, provide preliminary indications on the direct exploitation of monolingual data in this type of corpus.

A first version of the eitb corpus was prepared and shared with the community (Etchegoyhen et al., 2016), to provide further support to the under-resourced Basque-Spanish language pair. In Etchegoyhen & Gete (2020), a new version of the corpus was prepared and shared with the scientific community,Footnote 2 built with news generated in subsequent years. In what follows, we use the latter version of the original data, covering news produced between 2009 and 2018, and evaluate the impact of different methods to exploit these comparable data on Neural Machine Translation.

3 Corpora preparation

The eitb corpus is composed of news independently produced in Basque and Spanish by the journalists of the Basque Country’s public broadcast service, to report on the same specific events. The corpus can thus be considered strongly comparable (Skadiņa et al., 2012) and viewed as a rich source of parallel data for this language pair (Etchegoyhen et al., 2016).

This version of the original dataset covers 10 years of content, and is composed of 168,984 documents in Basque and 174,348 in Spanish, extracted from xml files.Footnote 3 With an original amount of over two million sentences per language, it is one of the largest available corpora for the Basque-Spanish language pair, covering political news, sports and cultural events, among others. Statistics of the original corpus are described in Table 1.

Table 1 Original eitb corpus 2009–2018 in number of sentences (Sent.) and tokens (Tok)

To perform the experiments described in this article, a second corpus was used, based on the translation memories made available by the Basque government as open data.Footnote 4 We used the corpus created from these translation memories and shared by Etchegoyhen et al. (2018),Footnote 5 to which we will refer as ode in what follows. The corpus is constituted mainly of translations of administrative content, notably from the Instituto Vasco de Administración Pública (ivap). As it consists of professionally produced translations, the ode corpus was used as a parallel basis for the experiments described in the remainder of this work.

The eitb corpus was first aligned at the document level with docal, the efficient component for parallel and comparable document alignment described in Etchegoyhen and Azpeitia (2016a) and Azpeitia and Etchegoyhen (2019). For sentence alignment, we used the two previously mentioned approaches:

  • stacc (Etchegoyhen & Azpeitia, 2016b): This method is based on sets of bilingual lexical mappings, set expansion operations to include unseen named entities determined by surface form as well as longest common prefixes, and Jaccard similarity (Jaccard, 1901) over the source to target and target to source set. We use the stacc variant with lexical weighting (Azpeitia et al., 2017), with lexical weights computed on the corpus to be aligned.

  • laser (Artetxe & Schwenk, 2019): This approach is based on multilingual sentence embeddings, cosine similarity over embedding pairs, and a margin computed over closest alignment candidates.
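
As a rough illustration of the lexical-overlap idea behind stacc, the sketch below computes Jaccard similarity between the translated token set of one sentence and the token set of the other, in both directions, and averages the two values. It is a simplified approximation rather than the stacc implementation: the translation tables are hypothetical dictionaries mapping a token to a set of translations, averaging the two directions is an assumption, tokens missing from the tables are kept in surface form as a crude stand-in for the named-entity expansion, and prefix matching and lexical weighting are omitted.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def translate_set(tokens, table):
    """Map each token to its lexical translations; unknown tokens are kept as-is."""
    translated = set()
    for tok in tokens:
        translated.update(table.get(tok, {tok}))
    return translated


def stacc_like_score(src_tokens, tgt_tokens, src2tgt, tgt2src):
    """Average of source-to-target and target-to-source Jaccard similarity."""
    src, tgt = set(src_tokens), set(tgt_tokens)
    forward = jaccard(translate_set(src, src2tgt), tgt)
    backward = jaccard(translate_set(tgt, tgt2src), src)
    return (forward + backward) / 2
```

With toy tables mapping `etxea` to `casa` and `handia` to `grande`, the pair (`etxea handia`, `casa grande`) scores 1.0, while replacing `grande` with an unrelated word lowers the score in both directions.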

To extract the translation tables necessary for both the docal and stacc approaches to document and sentence alignment, we used the fastalign toolkit (Dyer et al., 2013) on the ode corpus. Prior to performing either alignment, the sentences were tokenised and truecased using the scripts available in the moses toolkit (Koehn et al., 2007).

The statistics for the ode and eitb corpora, in number of aligned sentences, are shown in Table 2. For the eitb corpus, alignment results are based on one-to-one alignments without any filtering based on the alignment scores computed by stacc or laser, i.e. with an alignment threshold set to 0 in both cases.

Table 2 Corpora statistics (number of sentence pairs)

The development and test sets are random partitions for both corpora; for the eitb corpus, the extracted alignments were manually verified. For the test set, additional Basque translations were manually created by professional translators, to increase the robustness of the test against the relatively free word order in Basque (Etchegoyhen et al., 2018). In the Spanish to Basque translation direction, we thus use two translation references, whereas for Basque to Spanish, we use the manually translated sentences as source, to remove any potential bias that may arise from the original data being comparable.

4 Baseline models

All translation models were based on the Transformer architecture (Vaswani et al., 2017), built with the marian toolkit (Junczys-Dowmunt et al., 2018). The models consisted of 6-layer encoders and decoders, feed-forward networks of 2048 units, embedding vectors of dimension 512 and 8 attention heads. The dropout rate between layers was 0.1.

We used the Adam optimiser with \(\alpha =0.0003\), \(\beta _1=0.9\), \(\beta _2=0.98\) and \(\epsilon =10^{-9}\). The learning rate increased linearly for the first 16,000 training steps and decreased thereafter proportionally to the inverse square root of the corresponding step. We set the working memory to 6000 MB and automatically chose the largest mini-batch that fit the specified memory. The validation data was evaluated every 3500 steps, and the training process ended if there was no improvement in the perplexity of 10 consecutive checkpoints. Embeddings for source, target and output layer were tied and all datasets were segmented with bpe (Sennrich et al., 2016c), using 30,000 operations.
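
The learning rate schedule and stopping criterion can be sketched as follows. This is an illustrative approximation under stated assumptions, not the toolkit's code: the peak rate is assumed to be reached exactly at the end of warm-up, the decay is assumed to be scaled so that the curve is continuous at that point, and the function names are hypothetical.

```python
import math


def learning_rate(step, peak=0.0003, warmup=16000):
    """Linear warm-up to `peak`, then inverse square root decay.

    `step` is a non-negative training step count.
    """
    if step < warmup:
        return peak * step / warmup
    return peak * math.sqrt(warmup / step)


def should_stop(perplexities, patience=10):
    """True if the last `patience` validation checkpoints did not improve
    on the best perplexity seen before them."""
    if not perplexities:
        return False
    best_index = perplexities.index(min(perplexities))
    return len(perplexities) - 1 - best_index >= patience
```

Under these assumptions, the rate is halved both at step 8,000 (mid warm-up) and at step 64,000 (four times the warm-up length).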

We trained five baseline models, first on the ode corpus and on that same corpus merged with the aligned eitb corpus, with an alignment threshold set to 0 for both the stacc and laser methods, to measure the initial contribution of the selected comparable data. Additionally, we trained models only on the aligned eitb datasets, to measure the impact of the comparable data in isolation. These models and all subsequent ones were evaluated on the previously described test sets, which cover various topics. The results in terms of bleu (Papineni et al., 2002) for the baseline models are shown in Table 3 and were computed with SacreBLEU (Post, 2018).Footnote 6

Table 3 bleu results for the baseline models in both translation directions

In both translation directions, the contribution of the comparable data was significant, with large improvements in bleu scores on the in-domain test sets. Although this is not unexpected, since the ode model was trained on data in the administrative domain, this confirms the potential of the strongly comparable eitb data as a means to improve Basque-Spanish translation, as previously established in Etchegoyhen et al. (2016). Taken in isolation, the comparable data provided strong baseline results on the eitb test sets, but performed poorly on the out-of-domain ode test sets. Overall, the combination of eitb and ode data was optimal in both translation directions, when considering average results on the in-domain and out-of-domain test sets.

The initial alignments obtained with laser and stacc resulted in similar improvements across the board, except for Spanish to Basque translation where the former performed better on the in-domain eitb test set. In the next sections, we evaluate several variants of the initial alignments, both individually and in combination, to determine an optimal setup for the exploitation of the original comparable dataset.

5 Threshold filtering

Methods to extract parallel data from comparable corpora rely on metrics that measure translation equivalence in some form. The scores assigned by the core metrics usually need to be complemented with some form of filtering, based on thresholds determined over the training or development datasets (Artetxe & Schwenk, 2019; Etchegoyhen & Azpeitia, 2016b). This is necessary since similarity varies between comparable sentences and comparability metrics usually assign a continuous score to comparable sentence pairs.

To determine the impact of threshold selection, we extracted subsets of the aligned eitb corpus after applying different thresholds, selected according to the ranges of the stacc and laser metrics, which produced significant amounts of initial data. Each subset was then used in combination with the ode corpus, to train new nmt model variants.
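
The subset extraction can be pictured with the short sketch below, which builds one filtered dataset per threshold from scored alignments. The data layout, a list of (source, target, score) triples, and the inclusive comparison are assumptions made for illustration.

```python
def threshold_subsets(scored_pairs, thresholds):
    """Return, for each threshold, the sentence pairs scoring at or above it.

    `scored_pairs` is an iterable of (source, target, score) triples.
    """
    scored_pairs = list(scored_pairs)
    return {
        threshold: [(src, tgt) for src, tgt, score in scored_pairs if score >= threshold]
        for threshold in thresholds
    }
```

Each resulting subset would then be concatenated with the parallel corpus before training a model variant.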

Tables 4 and 5 show the results, in terms of bleu score over the eitb test sets and number of aligned sentences, for the threshold-filtered datasets with stacc and laser, respectively. For each method, we computed statistical significance via bootstrap resampling (Koehn, 2004) between each filtered variant and the 0.0 baseline.Footnote 7
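
For readers unfamiliar with paired bootstrap resampling, a minimal sketch follows. It repeatedly resamples test sentences with replacement and counts how often one system outscores the other on the resampled set. As a simplifying assumption, it sums hypothetical per-sentence quality scores; an actual bleu comparison would instead recompute the corpus-level score from n-gram statistics on each resample. The function name, the number of resamples and the seed are illustrative choices.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement, shared by both systems.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_resamples
```

A win rate of at least 0.95 for one system would correspond to significance at \(p < 0.05\) in this one-sided setting.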

Table 4 Alignment threshold results with STACC
Table 5 Alignment threshold results with LASER

In the case of stacc, neither the slight increases over the baseline in es-eu nor the slight decreases in eu-es up to threshold 0.17 were statistically significant. For further experiments, either dataset, up to a threshold of 0.20 in the first case and 0.17 in the latter case, could thus be considered equivalent, at least regarding the training of the complete models that include the ode datasets as well.

Higher stacc thresholds usually increase the accuracy of the aligner when measured in terms of precision and recall on various alignment tasks (Azpeitia et al., 2017; Etchegoyhen & Azpeitia, 2016b; Etchegoyhen et al., 2016), or when training NMT models via fine-tuning over aligned datasets (Etchegoyhen & Gete, 2020), and could have been selected accordingly. Our goal in these experiments, however, was to measure the impact of complementing parallel data with comparable data, in a realistic scenario where different datasets are merged to maximise nmt model coverage and accuracy. We therefore opted for an initial data selection based on the statistical significance of bleu scores for nmt models trained on different subsets. To measure the impact of additional filtering methods, we thus selected the largest of the equivalent datasets in terms of impact on bleu scores, i.e. the 0.0 baseline.

With laser, all thresholds led to statistically significant gains over the baseline in es-eu, whereas none did for eu-es. Additionally, we computed statistical significance for all improving variants against the model based on the 1.0 threshold, with no significance for \(p < 0.05\). Along the same lines as those that guided the selection of the stacc dataset, for the next experiments we selected the datasets that led to the best laser-based models, while also featuring the largest amount of data. Thus, for es-eu we selected as defaults the dataset based on the 1.0 threshold, and for eu-es the dataset based on the 0.0 threshold.

Table 6 summarises the main characteristics of the selected datasets for each method and language pair.

Table 6 Selected datasets for es-eu and eu-es based on threshold results

6 Data analysis

Each alignment method generates its own datasets according to the selected threshold, which may differ in terms of several characteristics, while also overlapping to a certain extent. In this section, we analyse the main differences in terms of corpus statistics.

We first extracted the union and intersection of the stacc and laser datasets, with results shown in Table 7. Overall, the selected methods mined a common core of 55.6% of aligned sentences for es-eu, and 50.2% for eu-es. This leaves an almost equivalently large amount of alignments uniquely mined by each method.
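
The overlap computation amounts to set operations over sentence pairs, as in the sketch below. The function name and the returned fields are hypothetical, and expressing the common core as a share of the union is an assumption, since the reference set for the percentages is not restated here.

```python
def overlap_stats(pairs_a, pairs_b):
    """Sizes of the union, intersection and method-specific parts of two datasets.

    Each dataset is an iterable of (source, target) sentence pairs.
    """
    a, b = set(pairs_a), set(pairs_b)
    union, common = a | b, a & b
    return {
        "union": len(union),
        "intersection": len(common),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Assumption: the common core is reported relative to the union.
        "common_share": len(common) / len(union) if union else 0.0,
    }
```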

Table 7 Number of aligned sentences for the union and intersection of the stacc and laser datasets

Determining the benefits these unique alignments can provide will be approximated in the next section by training models based on the union and intersection of the datasets. A full manual analysis of alignment accuracy for these large datasets was not feasible for this study; we therefore randomly sampled 100 aligned sentences from each one of the selected datasets and manually evaluated the alignments for approximate estimates of the quality of the unique alignments. An alignment was considered incorrect if there was information imbalance between source and target sentences, or crucial information missing, such as Named Entities.

For eu-es, where both laser and stacc datasets are based on 0 thresholds, the number of alignments identified as correct in the samples was 21% for the former, and 35% for the latter. For es-eu, the stacc dataset is the same as in the other direction, but the laser dataset is based on a 1.0 threshold for which we extracted a separate sample. In this case, 29% of the alignments were considered correct.

Although these results are only approximate estimates, they do indicate that each method addresses aspects of comparable data alignment that the other method fails to capture, while also featuring erroneous alignments based on the respective characteristics of the methods. From manual examination of the samples, the main weakness of laser seems to be the lack of robustness in identifying corresponding Named Entities and alignment of loosely related sentences, which is likely to be a by-product of the use of embeddings, whereas for stacc the main limitation appeared to be the alignment of sentences which are translations of each other, but where a significant portion of one side of the pair is missing in the other.

We provide examples of correct alignments in Table 8. As these examples illustrate, each method can mine valid parallel sentences in the comparable datasets that are not captured by the alternative, irrespective of the length of the aligned sentences.

Table 8 Examples of sentence pairs uniquely aligned by either stacc or laser, with corresponding English translations

We then compared the vocabulary distribution differences between the selected datasets, the union of the two, and their intersection. Figures 1 and 2 show the number of tokens in the union vocabulary sizes for es-eu and eu-es, respectively, along with the amount of common tokens between the union and intersection datasets, the stacc dataset and the laser dataset.

Fig. 1 Vocabulary distribution (number of tokens) over selected ES-EU datasets

Fig. 2 Vocabulary distribution (number of tokens) over selected EU-ES datasets

The intersection of vocabularies is smaller than the union, indicating that each method mines portions of data that differ in terms of vocabulary, with \(85\%\) common vocabulary for both es-eu and eu-es.

As indicated by the stacc-only and laser-only results, uniquely retrieved tokens range between 40,707 and 87,504 for eu-es, with larger counts for laser, and between 33,457 and 83,577 for es-eu, with larger counts for stacc in this case. Considering the reversed tendencies given the language pair, these differences are likely due in part to the differences in volumes of selected data, with 46,209 more sentence pairs selected by laser for eu-es, and 240,731 more by stacc for es-eu. The differences in vocabulary indicate however that the additional sentence pairs in each case differ significantly from the common pairs selected by both methods.

Since the two methods differ significantly in the way they mine parallel sentences, via bilingual embeddings in one case and lexical translation overlap in the other, the differences in terms of vocabulary are not surprising. To further explore these differences, we measured the amount of Named Entities (NE) and sentence pairs that are identical in each language, as retrieved by each method.Footnote 8 Results are shown in Table 9.

Table 9 Number (#) and percentage (%) of aligned named entities (ne) and identical sentences (id)

In terms of NEs, the results are in line with the previously discussed volume differences between the selected datasets, with either method retrieving at least 9 out of 10 NEs present in the union of the datasets.

The methods do contrast in terms of identical sentences in aligned pairs, with laser aligning virtually all of them in both cases, as opposed to stacc. This difference might be due to cases where the identical source and target sentences are in the same language, in which case the reliance of the stacc method on lexical translation mappings would discard these pairs as possible alignments, correctly so in these cases. Since the amount of such pairs is comparably marginal, we did not explore these cases any further.

Table 10 Average sentence length and sentence length differences (number of tokens) for pairs aligned with stacc or laser

Alignment methods may also differ in terms of the length of the sentences in aligned pairs. We measured both the average sentence length and the average length difference between source and target sentences, with the results shown in Table 10.

As indicated by these results, stacc tends to align pairs with larger length differences than laser, with a two-token difference on average. This is mainly due to the length of the selected Spanish sentences, as the length of Basque sentences is comparable with both methods on average. These differences add to the observation that the characteristics of each method lead to divergent mining of parallel sentences in comparable corpora, to some extent. In the next section, we investigate the impact of using different combinations of data to train nmt models.

7 Data combination

To measure the impact of combining the datasets generated by both stacc and laser, we created variants of the corpus with both the union and the intersections of these datasets and trained complete models with each variant.

Table 11 presents the results in terms of bleu scores for es-eu, along with the previous scores obtained with the individual datasets selected by each method. Table 12 shows statistical significance results of the scores obtained between each pair of models.

Table 11 bleu scores in ES-EU and number of aligned eitb sentences with union and intersection datasets
Table 12 Statistical significance results in es-eu with bootstrap resampling for models trained on ode+stacc0.0 and ode+laser1.0 (indicated in the table as stacc and laser, respectively)

The union of the two datasets results in the lowest score overall for es-eu, which is statistically significant against all other models, with differences in bleu scores ranging from 1 to 2.7 points. Although each method can uniquely retrieve valid alignments on its own, as previously discussed, in combination the results were lower on these test sets. Although this may indicate a tendency, additional test sets might be warranted to further confirm whether the union of alignment pairs is always detrimental for this translation pair. It should be noted in particular that translation into Basque is more likely to be sensitive to variations in word order between valid translations and fixed references, variations which are likely to increase with larger amounts of training data.

The intersection-based model was not significantly different from the laser-only variant, although it features lower amounts of data. The gains obtained over the stacc model were significant, indicating that the intersection-induced filtering was beneficial for part of the data selected by stacc.

Tables 13 and 14 present the results for eu-es, in terms of bleu and statistical significance, respectively.

Table 13 bleu scores in EU-ES and number of aligned eitb sentences with union and intersection datasets
Table 14 Statistical significance results in eu-es with bootstrap resampling for models trained on ode+stacc0.0 and ode+laser0.0 (indicated in the table as stacc and laser, respectively)

For this translation pair, the model trained on the union of the datasets produced results on a par with those trained on the datasets mined by either stacc or laser, with no statistically significant difference. In contrast, the intersection-based model was significantly worse than all other variants, indicating that lower amounts of training data were more impactful for this translation pair than was the case for es-eu. In the following sections, we further explore the differences between the two translation pairs.

8 Length filtering

Due to the nature of the task, aligned comparable sentences may display information imbalance, with one of the sentences in a pair missing part of the information in the other. In this section we evaluate the impact of information mismatch, via filtering based on length differences measured on the aligned sentence pairs.

We based our approach to length-based filtering on the method described in Etchegoyhen et al. (2018), which aims to identify statistical outliers in terms of length differences between aligned sentences. We first computed the median and standard deviation over length differences, measured in terms of tokens. These reference statistics were computed on the parallel ode corpus, to establish the relevant length-difference indicators on parallel human translations. A length-difference score (lgs), based on a modified z-score, was then computed on the aligned eitb datasets, according to the formula in Eq. 1:

$$\begin{aligned} {\textsc {lgs}} = \frac{0.6745 \times (x - \tilde{y})}{median(\{|y_i - \tilde{y}|\})}, \end{aligned}$$
(1)

where x is the length difference of a sentence pair in the eitb corpus, \(\tilde{y}\) is the median length difference in the reference corpus, and the denominator is the median absolute deviation, computed over the reference corpus as well.

The modified z-score was then used to identify outliers in the aligned eitb corpus, with sentence pairs having an absolute score over a given threshold identified as cases of information imbalance. Iglewicz and Hoaglin (1993) recommend a value of 3.5 to identify outliers when using a modified z-score, and we selected this value as our default to filter all identified outliers. Additionally, we selected two more thresholds with lower values, namely 2.0 and 1.5, to evaluate the impact of a more restrictive identification of length imbalance.Footnote 9
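
The filtering procedure can be sketched as follows: the median and median absolute deviation are estimated on length differences from the reference parallel corpus, and aligned pairs whose absolute modified z-score exceeds the threshold are discarded. This is a minimal sketch under stated assumptions: lengths are counted on whitespace-separated tokens, the length difference is taken as source length minus target length, and all function names are hypothetical.

```python
from statistics import median


def token_length_diff(src, tgt):
    """Length difference in whitespace tokens (assumed: source minus target)."""
    return len(src.split()) - len(tgt.split())


def fit_reference(reference_pairs):
    """Median and median absolute deviation of length differences in the reference corpus."""
    diffs = [token_length_diff(src, tgt) for src, tgt in reference_pairs]
    ref_median = median(diffs)
    ref_mad = median([abs(d - ref_median) for d in diffs])
    if ref_mad == 0:
        raise ValueError("median absolute deviation is zero; score is undefined")
    return ref_median, ref_mad


def lgs(diff, ref_median, ref_mad):
    """Modified z-score of a length difference, following Eq. 1."""
    return 0.6745 * (diff - ref_median) / ref_mad


def filter_length_outliers(pairs, ref_median, ref_mad, threshold=3.5):
    """Keep the pairs whose absolute length-difference score does not exceed the threshold."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if abs(lgs(token_length_diff(src, tgt), ref_median, ref_mad)) <= threshold
    ]
```

Lowering `threshold` from 3.5 to 2.0 or 1.5 discards more pairs, which corresponds to the more restrictive settings described above.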

Table 15 Amount of sentence pairs discarded via length filtering with laser and stacc

Table 15 summarises the amount of sentence pairs discarded with the above length filtering method, with different thresholds. Overall, the volumes of discarded data are significant, even with less restrictive thresholds, with a minimum of 25% of the data for the selected datasets with lgs\(_{2.0}\) filtering, for instance. Higher alignment thresholds gradually reduce the issue in both approaches, although it tends to persist throughout.

These results also reinforce one of the conclusions of the manual analysis of samples, namely that the stacc method tends to align more sentences than laser where part of the information is a correct translation, but a significant part of it is missing on one side of the pair.

In Table 16, we show examples of filtered pairs, where the information in the Spanish sentence that is missing in the Basque counterpart of the aligned pair is marked in italics.

Table 16 Examples of filtered sentence pairs with modified z-score above 2.0 for stacc and laser, with English translations

The results on models trained on selected corpora filtered by length outliers for es-eu and eu-es are shown in Tables 17 and 18, respectively (Footnote 10). Also indicated in the tables are the size of the filtered corpus, the bleu brevity penalty (bp), and the proportion of filtered sentences where the length of the Spanish sentence is larger than that of the Basque sentence. Statistical significance was computed for each model with respect to the unfiltered baselines.

Table 17 Results from length filtering for ES-EU on union and intersection datasets
Table 18 Results from length filtering for EU-ES on the union dataset

For Spanish to Basque translation, in terms of bleu scores, length filtering improved significantly over the unfiltered corpus for all variants of the union dataset and for the intersection dataset, where the results were statistically significant with the less restrictive of the three filtering thresholds. For this language pair, information imbalance was thus significantly detrimental in the datasets unfiltered for length differences. For Basque to Spanish, the results were reversed, with a gradual decrease of bleu scores with additional filtering, all statistically significant.

One interpretation of these opposite results may be based on the fact that the filtered Spanish sentences are systematically longer than their Basque counterparts. Although Spanish sentences are longer in general, given the morphological system of Basque with productive affixation, more aggressive filtering of length-difference outliers lowers the proportion of Spanish sentences that are longer than their Basque counterparts, as shown in the last column of the tables, indicating that the overall tendency in the corpus is for information imbalance to affect the Basque data more than its Spanish counterpart. In other words, the news items in the eitb corpus tend to summarise the information more in Basque than in Spanish. Training models to translate from Spanish to Basque on these data would thus have the effect of orienting the models towards summarisation, with a negative impact on translation quality that needs to be compensated with more length-based filtering. This conjecture is supported by the results in terms of brevity penalty, with lower brevity penalty values correlating with less length-based filtering.
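To make the link between brevity penalty and output length concrete, the standard corpus-level bleu brevity penalty can be sketched as follows. This is the textbook definition, given only as an illustration; it is not the scoring tool used in the experiments, and the function and parameter names are hypothetical.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty from total hypothesis and reference lengths (in tokens)."""
    if hyp_len == 0:
        return 0.0  # empty output: maximal penalty
    if hyp_len > ref_len:
        return 1.0  # no penalty when the output is longer than the reference
    return math.exp(1.0 - ref_len / hyp_len)
```

A value close to 1 means the system output is about as long as the references, whereas lower values indicate systematically shorter translations, as would be expected from a model pushed towards summarisation.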

For Basque to Spanish, translation quality seems to correlate instead with the volumes of data. This may be attributed to the fact that there is no marked tendency towards summarisation in this translation direction, given the fact that the target sentences are longer than the source, for the most part. The target monolingual data can thus contribute relevant decoding information in a way that is similar to synthetic data based on back-translations or on empty source sentences (Sennrich et al., 2016b), where the models can improve their modelling of the target sequences in the face of degenerate source input.

9 Data tagging

The use of tags identifying specific aspects of the data in the training corpora has proved effective in Neural Machine Translation. For instance, Sennrich et al. (2016a) used markers to control the translation of honorifics, Kobus et al. (2017) modelled domain control via tags identifying different domains, Yamagishi et al. (2016) used tags to control voice in Japanese-to-English translation, and Caswell et al. (2019) employed tags to identify back-translated synthetic data.

The latter work in particular demonstrates that tagging techniques can prove more effective than noising approaches, indicating also that the impact of noising for back-translated data essentially acts as an indicator of the type of data used for training and helps the models discriminate between natural and synthetic data. We extend their approach to comparable data, by prepending a <cc> tag to all source sentences of the comparable eitb training set.
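The tagging step itself amounts to a simple preprocessing pass over the training data. The following is a minimal sketch, assuming sentence pairs are held as (source, target) string tuples; the function names are hypothetical and the actual pipeline is not described in the text.

```python
def tag_source(pairs, tag="<cc>"):
    """Prepend the tag to the source side of every (source, target) pair."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]


def build_training_set(parallel_pairs, comparable_pairs, tag="<cc>"):
    """Merge untagged parallel data with tagged comparable data."""
    return list(parallel_pairs) + tag_source(comparable_pairs, tag)
```

In practice the tag would typically need to survive subword segmentation as a single token; how this was handled in the experiments is not stated, so it is left out of the sketch.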

We trained models by combining the ode corpus with selected variants of the eitb corpus, with and without tags indicating comparable data, on the best performing datasets as established in the previous sections. The results of these experiments are shown in Table 19.

Table 19 bleu scores on merged datasets with and without tags

For Spanish to Basque, tagging was only effective on the noisier dataset, i.e. the eitb variant with no filtering of length-difference outliers. For the less noisy dataset, based on length-based filtering, the use of tags was detrimental. Interestingly, the use of tags in this translation direction had a significant impact in terms of shortness of translations, from a brevity penalty of 0.801 for the untagged model to 0.997 for the tagged variant based on the same corpus.

These results tend to support the hypothesis that tagging helps the model discriminate between natural and noisy data, and becomes counterproductive when the tagged comparable data are closer to natural translations, as in the variant based on filtered length-imbalanced data.

For Basque to Spanish, tagging was detrimental in terms of bleu, despite minor improvements regarding the brevity of translations. This can be viewed in light of the previous hypothesis that the overall higher quantity of information in the Spanish target sentences is a dominating factor for this translation direction.

The negative impact of tagging in this case seems to indicate that comparable data with less information in the source than in the target are actually not noisy for the translation models, as discriminating between natural and comparable data leads to lower translation quality. Whether this hypothesis is correct can be further examined by comparing tagged back-translated data with tagged comparable data; we explore this comparison in the next section.

As a final experiment with tagged data, we evaluated a mixed use of tags for es-eu, the language pair for which tagging was beneficial. Given that tags were effective for the noisier dataset without length filtering and not for the filtered variant, we tagged only the data discarded via length-filtering, leaving the non-filtered data without tagging. The results are shown in Table 20.
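Mixed tagging can be sketched as a variant of the tagging pass in which the tag is conditioned on the length-difference score. The sketch below assumes that a modified z-score has already been computed for each pair and that the threshold is supplied by the caller, since the text does not state which threshold was used for this experiment; all names are hypothetical.

```python
def mixed_tagging(pairs, scores, threshold, tag="<cc>"):
    """Tag only the pairs whose absolute length-difference score exceeds the threshold."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    tagged = []
    for (src, tgt), score in zip(pairs, scores):
        if abs(score) > threshold:
            src = f"{tag} {src}"  # pair that length filtering would have discarded
        tagged.append((src, tgt))
    return tagged
```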

Table 20 bleu scores for ES-EU with different tagging methods on the ode + stacc\(_{0.0}\) \(\cup\) laser\(_{1.0}\) dataset

Mixed tagging produced worse results than generalised tagging, indicating that, for this language pair, identifying all comparable data in the training datasets remains the optimal option. A more fine-grained analysis might be needed to further explore the use of mixed tagging; we leave this type of evaluation for future research.

10 Backtranslations vs. alignment

Backtranslations have proved useful to complement parallel corpora with synthetic translations (Edunov et al., 2018; Poncelas et al., 2018; Sennrich et al., 2016b). To our knowledge, no comparison has yet been made between using the target side of comparable corpora via backtranslations vs. using the results of alignment, as is typically done to exploit this type of corpus and as we have done so far in this work. Given the results obtained in the previous sections, for translation from Basque to Spanish in particular, it might be the case that backtranslating target data sometimes provides similar or larger improvements over the baselines than aligned data.

To explore this hypothesis, we generated backtranslations for both translation directions, using the baseline ode model, and trained translation models by merging the resulting backtranslations with the ode corpus. We also trained a model where backtranslated data were marked with a <bt> tag (Caswell et al., 2019). The results are shown in Table 21.
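The construction of the synthetic data can be sketched as follows. The translation step is abstracted as a callable, `translate_to_source`, standing in for the baseline target-to-source model; this callable and the function name are assumptions made for illustration, not part of the described setup.

```python
def backtranslate(target_sentences, translate_to_source, tag=None):
    """Build (synthetic source, original target) pairs, optionally tagging the source."""
    pairs = []
    for tgt in target_sentences:
        synthetic_src = translate_to_source(tgt)  # hypothetical model call
        if tag is not None:
            synthetic_src = f"{tag} {synthetic_src}"
        pairs.append((synthetic_src, tgt))
    return pairs
```

Calling the function with `tag="<bt>"` gives the tagged variant, and with `tag=None` the untagged one; in both cases the target side remains the original human-written sentence.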

Table 21 Comparative results on the eitb test sets using backtranslations merged with ode data

In both translation directions, models trained on backtranslated data improved significantly over the ode baseline, with gains of approximately 7 and 10 bleu points for es-eu and eu-es, respectively. However, in both cases models based on aligned data performed significantly better, in particular for translation from Spanish to Basque. The model variants with tagged backtranslations improved marginally over the non-tagged version for es-eu but slightly degraded the performance for eu-es. These results are in line with those obtained in previous sections, where translation from Spanish to Basque required more filtering and data identification, whereas translation from Basque to Spanish seems to be optimal with the largest amount of available data.

As an additional experiment, we trained models based on a mixture of aligned and backtranslated data, where data discarded via the initial alignments, plus threshold and length filtering, were included as backtranslations. The main goal of this experiment was to evaluate the optimal combination of aligned and backtranslated data.
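One way to picture this mixture is as a partition of the target side of the comparable corpus: sentences covered by a retained aligned pair keep their aligned source, and all remaining target sentences receive a backtranslated source. The sketch below assumes this reading; the names and the `translate_to_source` callable are hypothetical.

```python
def mix_aligned_and_backtranslated(all_targets, retained_pairs, translate_to_source):
    """Keep retained aligned pairs and backtranslate every target sentence they do not cover."""
    covered = {tgt for _, tgt in retained_pairs}
    synthetic = [
        (translate_to_source(tgt), tgt)  # hypothetical model call
        for tgt in all_targets
        if tgt not in covered
    ]
    return list(retained_pairs) + synthetic
```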

The results, shown in Table 22, indicate no improvement in terms of bleu for es-eu, but a notable impact on the length of the translated sentences, as indicated by the bp results, with translations that are more similar in length to the human references overall. For eu-es, the model trained on mixed data improved over the best merged model, with a statistically significant gain of 0.9 bleu points and translated sentences that were also closer in length to the human references, although not to the extent observed for translation from Spanish to Basque.

Table 22 Comparative results with mixed aligned and backtranslated data

Overall, from the results on these datasets at least, backtranslations cannot be used as a replacement for comparable data alignment, but are a notable complement to reach optimal performance in terms of bleu and to approximate reference translations in length.

11 Comparative summary

Starting from the baseline trained on the ode corpus, various models have been trained and evaluated in the previous sections. Table 23 summarises the comparative results between the baselines and the best models achieved via data combination and filtering. For translation from Spanish to Basque, the best model was based on the union of the stacc\(_{0.0}\) and laser\(_{1.0}\) datasets, length filtering of outliers, source data tagging and inclusion of filtered backtranslations. For translation from Basque to Spanish, the best model was based on stacc\(_{0.0}\) and laser\(_{0.0}\) without length filtering or tagged data, along with backtranslations over the non-aligned data. The gains obtained by the best models were statistically significant over the baselines on the eitb test sets in all cases and both translation directions.

Table 23 Comparison of results with baselines and best models

Table 24 provides some examples of aligned sentences in the final corpora. These examples illustrate the quality of the parallel resource obtained from the original comparable datasets, and the variety of topics covered in the corpus, including politics, world affairs, weather, sports and culture. The specific challenges presented by the morpho-syntactic properties of Basque, including agglutinative morphology, ergativity or relatively free word order, among others (Footnote 11), make it even more necessary to prepare additional parallel resources for this language. Comparable corpora can provide useful data to build such resources, although they require dedicated analyses and careful selection to fully exploit their potential.

Table 24 Examples of aligned sentences in the final corpus, with English translations

12 Conclusions

In this work, we presented the results of a case study centred on the exploitation of Basque-Spanish comparable data for Neural Machine Translation. The original corpus is composed of news produced independently in these two languages by the Basque public broadcaster eitb. We applied and evaluated several techniques to create different variants of the corpus and measure their impact on machine translation quality.

Two efficient data alignment methods were evaluated, one based on lexical translation overlaps and the other on bilingual sentence embeddings. Both proved similarly efficient on their own to mine parallel data in the original corpus and train nmt models in both translation directions. The datasets extracted with each method were evaluated, and although they extracted a significant portion of similar data, manual analysis of the data showed that each method is able to mine valid parallel sentences that are not retrieved by the other approach. More work will thus be necessary to devise a single method with the combined accuracy of these two methods at least.

Different alignment thresholds were evaluated for each of the selected methods, with minor impact on translation quality for the models based on merged ode and eitb datasets. Other studies, on this dataset and others, have shown the benefits of higher alignment thresholds with both methods, though this was not the case for this particular experimental setup, where the goal was to measure the impact of complementing parallel data with comparable data on nmt models.

Additionally, the impact of further filtering based on length-difference outliers was measured, with the notable result that such filtering was necessary for Spanish to Basque translation, given information imbalance in the data, but not in the other translation direction, as nmt models proved able to benefit from target language information despite degenerate comparable source information. A related result was the tendency of nmt models to gear towards summarisation when provided with impoverished comparable target information, a phenomenon which is likely to arise with comparable corpora and needs to be controlled for an optimal exploitation of the data.

Results on the impact of tagging for comparable data were also discussed. This method was shown to be effective in helping the models discriminate noisy comparable data when there is less information in the target sentence than in the source, but detrimental in the opposite case, where the imbalanced comparable data may still strengthen target-side sequence modelling. Mixed tagging, where only lower quality alignments were signalled, resulted in lower quality overall, indicating that comparable data as a whole required identification in the training process, at least for the translation pair where tagging led to significant improvements.

Length-based filtering and tagging were the most impactful methods overall, although their impact depended on the specific information imbalance in the dataset, with translation from Spanish to Basque as the scenario benefitting from these techniques for the comparable data used in our experiments.

Finally, we compared the use of aligned comparable sentences with simply exploiting backtranslations. Overall, comparable sentence alignment achieved higher translation quality, although complementary backtranslations proved useful to further increase the coverage of the translation models, via mixtures of aligned sentences and backtranslations. Comparable data may thus be exploited for Neural Machine Translation beyond the standard combination of alignment and threshold-based filtering.

Significant improvements in translation quality can be obtained with comparable corpora, and may be useful in particular for under-resourced languages. However, their use requires a careful analysis of phenomena such as information imbalance, in specific translation pairs, to fully exploit their potential for Neural Machine Translation.