Abstract
Comparable corpora can benefit the development of Neural Machine Translation models, in particular for under-resourced languages. We present a case study centred on the exploitation of a large comparable corpus for Basque-Spanish, created from independently-produced news by the Basque public broadcaster eitb, where we evaluate the impact of different techniques to exploit the original data, in order to complement parallel datasets for this language pair in both translation directions. Two efficient methods for parallel sentence mining are explored, which identified a common core of approximately half of the total number of aligned sentences, while each method also uniquely identified valid parallel sentences not captured by the other. Filtering the data via identification of length-difference outliers proved highly effective to improve the models, as was the use of tags to discriminate between comparable and parallel data in the training corpora. The use of backtranslated data is also evaluated in this work, with results indicating that alignment-based datasets remain the most beneficial, although complementary backtranslations should also be included to fully exploit the available comparable data. Overall, the results in this work demonstrate that this type of data needs to be carefully analysed prior to its use as training data for Neural Machine Translation, since issues such as information imbalance between source and target data can lead to suboptimal results for a given translation pair.
1 Introduction
Comparable corpora have long been identified as an important source of potentially useful data in natural language processing, notably to train data-driven machine translation systems (Munteanu & Marcu, 2005; Sharoff et al., 2016) or to extract bilingual dictionaries (Rapp, 1995).
Extracting sentence pairs with a significant degree of parallelism from this type of corpora faces important challenges, however, considering the vast amounts of unrelated or loosely related data that compose comparable datasets in different languages. Several methods have been designed to address the challenges of comparable document alignment (Azpeitia & Etchegoyhen, 2019; Sharoff et al., 2015) and comparable sentence alignment (Artetxe & Schwenk, 2019; Etchegoyhen & Azpeitia, 2016b; Fung & Cheung, 2004; Munteanu & Marcu, 2002; Smith et al., 2010), leading to new parallel datasets that can support machine translation, in particular for under-resourced languages (Etchegoyhen et al., 2016; Schwenk et al., 2019).
Neural Machine Translation (nmt) is currently the dominant paradigm in research and development in the field of Machine Translation, having led to significant advances in recent years for most language pairs (Bahdanau et al., 2015; Bojar et al., 2016, 2017; Vaswani et al., 2017). As a data-driven approach where model parameters are estimated and optimised on large volumes of parallel data, nmt is particularly sensitive to the presence of noise in the training data (Belinkov & Bisk, 2018; Cheng et al., 2018; Khayrallah & Koehn, 2018; Sperber et al., 2017). Due to the nature of the task, parallel data extracted from comparable corpora are likely to introduce unwarranted noise in the training process and an evaluation of this potential issue is worth examining.
In this article, we describe the results of an in-depth evaluation of the exploitation of a large strongly comparable corpus for Basque-Spanish, composed of independently-produced news by the Basque public broadcaster eitb,Footnote 1 focusing on the impact of various techniques to exploit the original data for Neural Machine Translation and extending the study by Etchegoyhen & Gete (2020).
We first compare results obtained with two alignment methods that have demonstrated high accuracy on comparable data, namely laser (Artetxe & Schwenk, 2019) and stacc (Etchegoyhen & Azpeitia, 2016b), and evaluate various datasets and models based on the alignments obtained with each method via threshold filtering and data combination. We then measure the impact of filtering based on length difference statistical outliers between aligned source and target sentences, as comparable corpora typically feature information divergences between source and target sentences.
We also explore the impact of tags to identify comparable data in the training datasets, following the approach proposed by Caswell et al. (2019) for backtranslations and Etchegoyhen & Gete (2020) for comparable data, and the use of mixed tagging by limiting tagging to portions of the data identified as containing more imbalanced information in the source and the target.
Finally, we compare the impact of using aligned comparable sentences to the use of backtranslations for nmt (Sennrich et al., 2016b), as both approaches have demonstrated their usefulness to increase the coverage and quality of translation models. We evaluate in particular the impact of mixtures of aligned and backtranslated data on the quality of the resulting translation models.
The remainder of this paper is organised as follows: Sect. 2 describes related work on comparable corpora; Sect. 3 describes the corpora, alignment methods and tools used in this study; in Sect. 4, we describe the baseline translation models based on the initial datasets; Sect. 5 centres on threshold filtering using the selected alignment methods; in Sect. 6, a detailed analysis of the different datasets is presented, while Sect. 7 describes the results obtained with different combinations of the available data; Sect. 8 describes the results obtained via length filtering and Sect. 9 those obtained via data tagging; in Sect. 10, we report on different experiments using backtranslated data; Sect. 11 presents a comparative summary of results on translation models; finally, Sect. 12 draws conclusions from the experiments and analyses.
2 Related work
A significant body of work has been produced over the years to mine and exploit parallel sentences from large collections of comparable monolingual corpora (Abdul-Rauf & Schwenk, 2009, 2011; Irvine & Callison-Burch, 2013; Munteanu & Marcu, 2005; Sharoff et al., 2016; Smith et al., 2010), starting with seminal work by Resnik (1999) to exploit the World Wide Web as a source of potential parallel data.
A standard step in the process is the determination of document pairs, which reduces the computation space over the typically large datasets involved in the task. Several approaches have exploited metadata information for Web page alignment, including url, structural tags or publication date (Chen & Nie, 2000; Munteanu & Marcu, 2005; Papavassiliou et al., 2016; Resnik & Smith, 2003). Alternatively, content-based document alignment approaches for comparable corpora have also been proposed, based on vector space models (Chen et al., 2004), token translation ratios (Ma & Liberman, 1999), mutual information (Fung & Cheung, 2004), expectation-maximisation (Ion et al., 2011) or n-gram matching using machine-translated documents (Uszkoreit 2010).
Several approaches have used a mixture of content and structural properties, notably the systems in the wmt 2016 document alignment shared task (Buck & Koehn, 2016a). Among those, Gomes and Lopes (2016) proposed a phrase-based approach combined with url-matching, Buck and Koehn (2016b) used cosine similarity between tf/idf vectors over machine-translated documents, and Germann (2016) performed alignment via vector space word representations, latent semantic indexing, and url matching. In Esplá-Gomis et al. (2016), document alignment is performed via a mixture of url similarity, structural features such as shared links, and bag of words similarity.
Comparability at the document level was notably evaluated in a dedicated shared task of the bucc workshop series (Sharoff et al., 2015). Among participating systems, Li and Gaussier (2013) used bilingual dictionaries and the proportion of matching words to assess comparability, Morin et al. (2015) made use of hapax legomena and pigeonhole reasoning to enforce alignments, and Zafarian et al. (2015) used several components, including topic modelling, named entity detection and word features. A strictly content-based method was proposed by Etchegoyhen and Azpeitia (2016a), based on Jaccard similarity (Jaccard, 1901) over sets of lexical translations, expanded with surface-based entities and common prefix matching, demonstrating high accuracy in a large number of scenarios, including comparable corpora (Azpeitia & Etchegoyhen, 2019).
The extraction of parallel sentences from comparable corpora has also been addressed via a large variety of approaches, based on suffix trees (Munteanu & Marcu, 2002), maximum likelihood (Zhao & Vogel, 2002), binary classification (Munteanu & Marcu, 2005), cosine similarity (Fung & Cheung 2004), reference metrics over statistical machine translations (Abdul-Rauf & Schwenk, 2009; Sarikaya et al., 2009) or rich features (Smith et al., 2010; Stefănescu et al., 2012). In recent years deep learning approaches have been applied to the task as well, via bidirectional recurrent neural networks (Grégoire & Langlais, 2017) or cosine similarity over bilingual sentence embeddings (Schwenk, 2018).
The core document alignment method of Etchegoyhen and Azpeitia (2016a) was applied to comparable sentence alignment, improving significantly over more sophisticated feature-rich methods (Etchegoyhen & Azpeitia, 2016b). This method, referred to as stacc, which is also based on Jaccard similarity over expanded lexical translation sets, was further extended with lexical weighting (Azpeitia et al., 2017) and a named-entity penalty (Azpeitia et al., 2018), obtaining the best results across the board in the bucc shared task on parallel sentence identification in comparable corpora (Zweigenbaum et al., 2017, 2018).
Improvements over these results were obtained with the margin-based approach of Artetxe & Schwenk (2019), which uses cosine similarity over multilingual sentence embeddings, extended with a filtering mechanism based on nearest-neighbour similarity. Large amounts of comparable corpora have been mined with this approach (Schwenk et al., 2019), by means of the laser toolkit, which we include in our experiments on the selected corpora.
Recently, unsupervised approaches to NMT have demonstrated the potential of directly exploiting monolingual data to train neural translation models (Artetxe et al., 2018; Lample et al., 2018). A detailed comparison between this type of approach and the use of parallel sentences mined from comparable corpora may shed light on their respective merits, in isolation or in combination. Although this type of evaluation is outside the scope of the present study, the results in Sect. 10 on the use of back-translated data, compared to mining parallel sentences, provide preliminary results on the direct exploitation of monolingual data in this type of corpus.
A first version of the eitb corpus was prepared and shared with the community (Etchegoyhen et al., 2016), to provide further support to the under-resourced Basque-Spanish language pair. In Etchegoyhen & Gete (2020), a new version of the corpus was prepared and shared with the scientific community,Footnote 2 built with news generated in subsequent years. In what follows, we use the latter version of the original data, covering news produced between 2009 and 2018, and evaluate the impact of different methods to exploit these comparable data on Neural Machine Translation.
3 Corpora preparation
The eitb corpus is composed of news independently produced in Basque and Spanish by the journalists of the Basque Country’s public broadcast service, to report on the same specific events. The corpus can thus be considered strongly comparable (Skadiņa et al., 2012) and viewed as a rich source of parallel data for this language pair (Etchegoyhen et al., 2016).
This version of the original dataset covers 10 years of content, and is composed of 168,984 documents in Basque and 174,348 in Spanish, extracted from xml files.Footnote 3 With an original amount of over two million sentences per language, it is one of the largest available corpora for the Basque-Spanish language pair, covering political news, sports and cultural events, among others. Statistics of the original corpus are described in Table 1.
To perform the experiments described in this article, a second corpus was used, based on the translation memories made available by the Basque government as open data.Footnote 4 We used the corpus created from these translation memories and shared by Etchegoyhen et al. (2018),Footnote 5 to which we will refer as ode in what follows. The corpus is constituted mainly of translations of administrative content, notably from the Instituto Vasco de Administración Pública (ivap). As it consists of professionally produced translations, the ode corpus was used as a parallel basis for the experiments described in the remainder of this work.
The eitb corpus was first aligned at the document level with docal, the efficient component for parallel and comparable document alignment described in Etchegoyhen and Azpeitia (2016a) and Azpeitia and Etchegoyhen (2019). For sentence alignment, we used the two previously mentioned approaches:
- stacc (Etchegoyhen & Azpeitia, 2016b): This method is based on sets of bilingual lexical mappings, set expansion operations to include unseen named entities determined by surface form as well as longest common prefixes, and Jaccard similarity (Jaccard, 1901) over the source-to-target and target-to-source sets. We use the stacc variant with lexical weighting (Azpeitia et al., 2017), with lexical weights computed on the corpus to be aligned.
- laser (Artetxe & Schwenk, 2019): This approach is based on multilingual sentence embeddings, cosine similarity over embedding pairs, and a margin computed over the closest alignment candidates.
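The core of the stacc similarity described above can be illustrated with a minimal sketch. Note that this is a toy approximation, not the released implementation: the function name, the toy translation tables and the fallback-to-surface-form expansion are simplifying assumptions, and the actual method additionally handles longest common prefixes and lexical weighting.

```python
def stacc_score(src_tokens, tgt_tokens, s2t, t2s):
    """Toy sketch of STACC-style similarity: expand each side into the
    other language via a lexical translation table (falling back to the
    surface form, which also covers unseen named entities), then average
    the Jaccard similarity in both directions."""
    def expand(tokens, table):
        expanded = set()
        for tok in tokens:
            expanded |= table.get(tok, {tok})
        return expanded

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    src_expanded = expand(src_tokens, s2t)  # source rendered in target language
    tgt_expanded = expand(tgt_tokens, t2s)  # target rendered in source language
    return 0.5 * (jaccard(src_expanded, set(tgt_tokens))
                  + jaccard(tgt_expanded, set(src_tokens)))
```

A pair whose expanded translation sets fully overlap scores 1.0, while unrelated sentences score close to 0, providing the continuous alignment score that is later thresholded.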
To extract the translation tables necessary for both the docal and stacc approaches to document and sentence alignment, we used the fastalign toolkit (Dyer et al., 2013) on the ode corpus. Prior to performing either alignment, the sentences were tokenised and truecased using the scripts available in the moses toolkit (Koehn et al., 2007).
The statistics for the ode and eitb corpora, in number of aligned sentences, are shown in Table 2. For the eitb corpus, alignment results are based on one-to-one alignments without any filtering based on the alignment scores computed by stacc or laser, i.e. with an alignment threshold set to 0 in both cases.
The development and test sets are random partitions for both corpora; for the eitb corpus, the extracted alignments were manually verified. For the test set, additional Basque translations were manually created by professional translators, to increase the robustness of the test against the relatively free word order in Basque (Etchegoyhen et al., 2018). In the Spanish to Basque translation direction, we thus use two translation references, whereas for Basque to Spanish, we use the manually translated sentences as source, to remove any potential bias that may arise from the original data being comparable.
4 Baseline models
All translation models were based on the Transformer architecture (Vaswani et al., 2017), built with the marian toolkit (Junczys-Dowmunt et al., 2018). The models consisted of 6-layer encoders and decoders, feed-forward networks of 2048 units, embedding vectors of dimension 512 and 8 attention heads. The dropout rate between layers was 0.1.
We used the Adam optimiser with \(\alpha =0.0003\), \(\beta _1=0.9\), \(\beta _2=0.98\) and \(\epsilon =10^{-9}\). The learning rate increases linearly for the first 16,000 training steps and decreases thereafter proportionally to the inverse square root of the corresponding step. We set the working memory to 6000MB and automatically chose the largest mini-batch that fit the specified memory. The validation data was evaluated every 3500 steps, and the training process ended when there was no improvement in perplexity over 10 consecutive checkpoints. Embeddings for the source, target and output layers were tied, and all datasets were segmented with bpe (Sennrich et al., 2016c), using 30,000 operations.
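The learning-rate schedule described above (linear warmup followed by inverse-square-root decay) can be written as a small function; the function name is ours, and the peak value is the \(\alpha\) used with Adam:

```python
def transformer_lr(step, peak=3e-4, warmup=16000):
    """Transformer schedule described above: the rate rises linearly over
    the first `warmup` steps, then decays proportionally to the inverse
    square root of the training step."""
    step = max(step, 1)  # guard against step 0
    return peak * min(step / warmup, (warmup / step) ** 0.5)
```

The two branches meet at the warmup step, where the rate reaches its peak before decaying.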
We trained five baseline models, first on the ode corpus and on that same corpus merged with the aligned eitb corpus, with an alignment threshold set to 0 for both the stacc and laser methods, to measure the initial contribution of the selected comparable data. Additionally, we trained models only on the aligned eitb datasets, to measure the impact of the comparable data in isolation. These models and all subsequent ones were evaluated on the previously described test sets, which cover various topics. The results in terms of bleu (Papineni et al., 2002) for the baseline models are shown in Table 3 and were computed with SacreBLEU (Post, 2018).Footnote 6
In both translation directions, the contribution of the comparable data was significant, with large improvements in bleu scores on the in-domain test sets. Although this is not unexpected, since the ode model was trained on data in the administrative domain, this confirms the potential of the strongly comparable eitb data as a means to improve Basque-Spanish translation, as previously established in Etchegoyhen et al. (2016). Taken in isolation, the comparable data provided strong baseline results on the eitb test sets, but performed poorly on the out-of-domain ode test sets. Overall, the combination of eitb and ode data was optimal in both translation directions, when considering average results on the in-domain and out-of-domain test sets.
The initial alignments obtained with laser and stacc resulted in similar improvements across the board, except for Spanish to Basque translation where the former performed better on the in-domain eitb test set. In the next sections, we evaluate several variants of the initial alignments, both individually and in combination, to determine an optimal setup for the exploitation of the original comparable dataset.
5 Threshold filtering
Methods to extract parallel data from comparable corpora rely on metrics that measure translation equivalence in some form. The scores assigned by the core metrics usually need to be complemented with some form of filtering, based on thresholds determined over the training or development datasets (Artetxe & Schwenk, 2019; Etchegoyhen & Azpeitia, 2016b). This is necessary since similarity varies between comparable sentences and comparability metrics usually assign a continuous score to comparable sentence pairs.
To determine the impact of threshold selection, we extracted subsets of the aligned eitb corpus after applying different thresholds, selected according to the ranges of the stacc and laser metrics, which produced significant amounts of initial data. Each subset was then used in combination with the ode corpus, to train new nmt model variants.
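The subset extraction above amounts to a simple score-based filter applied once per threshold. A minimal sketch, with a hypothetical function name and assuming each alignment carries its stacc or laser score:

```python
def threshold_subsets(scored_pairs, thresholds):
    """Build one training subset per alignment threshold: a pair is kept
    when its alignment score (e.g. STACC or LASER) meets the threshold."""
    return {
        t: [(src, tgt) for src, tgt, score in scored_pairs if score >= t]
        for t in thresholds
    }
```

Each resulting subset is then merged with the parallel ode corpus to train a model variant.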
Tables 4 and 5 show the results, in terms of bleu score over the eitb test sets and number of aligned sentences, for the threshold-filtered datasets with stacc and laser, respectively. For each method, we computed statistical significance via bootstrap resampling (Koehn, 2004) between each filtered variant and the 0.0 baseline.Footnote 7
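Bootstrap resampling for significance testing can be sketched as follows. This is a simplified illustration in the spirit of Koehn (2004), not the exact procedure used in the experiments: the function name is ours, and `metric` stands for any corpus-level score such as bleu over (hypotheses, references).

```python
import random

def paired_bootstrap(metric, hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Paired bootstrap resampling: draw test sets of the same size with
    replacement and count how often system A outscores system B on the
    resampled sets."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    # A is typically declared significantly better at p < 0.05
    # when it wins on at least 95% of the resampled test sets
    return wins_a / n_samples
```

With 1000 samples, a win rate of 0.95 or higher corresponds to significance at \(p < 0.05\).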
In the case of stacc, neither the slight increases over the baseline in es-eu nor the slight decreases in eu-es up to threshold 0.17 were statistically significant. For further experiments, the datasets up to a threshold of 0.20 in the former case and 0.17 in the latter could thus be considered equivalent, at least regarding the training of the complete models that also include the ode datasets.
Higher stacc thresholds usually increase the accuracy of the aligner when measured in terms of precision and recall on various alignment tasks (Azpeitia et al., 2017; Etchegoyhen & Azpeitia, 2016b; Etchegoyhen et al., 2016), or when training NMT models via fine-tuning over aligned datasets (Etchegoyhen & Gete, 2020), and could have been selected accordingly. Our goal in these experiments, however, was to measure the impact of complementing parallel data with comparable data, in a realistic scenario where different datasets are merged to maximise nmt model coverage and accuracy. We therefore opted for an initial data selection based on the statistical significance of bleu scores for nmt models trained on different subsets. To measure the impact of additional filtering methods, we thus selected the largest of the equivalent datasets in terms of impact on bleu scores, i.e. the 0.0 baseline.
With laser, all thresholds led to statistically significant gains over the baseline in es-eu, whereas none did for eu-es. Additionally, we computed statistical significance for all improving variants against the model based on the 1.0 threshold, with no significance for \(p < 0.05\). Along the same lines as those that guided the selection of the stacc dataset, for the next experiments we selected the datasets that led to the best laser-based models, while also featuring the largest amount of data. Thus, for es-eu we selected as defaults the dataset based on the 1.0 threshold, and for eu-es the dataset based on the 0.0 threshold.
Table 6 summarises the main characteristics of the selected datasets for each method and language pair.
6 Data analysis
Each alignment method generates its own datasets according to the selected threshold, which may differ in terms of several characteristics, while also overlapping to a certain extent. In this section, we analyse the main differences in terms of corpus statistics.
We first extracted the union and intersection of the stacc and laser datasets, with results shown in Table 7. Overall, the selected methods mined a common core of 55.6% of aligned sentences for es-eu, and 50.2% for eu-es. This leaves an almost equivalently large amount of alignments uniquely mined by each method.
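Treating each dataset as a set of (source, target) pairs, the union, intersection and overlap statistics reported above reduce to basic set operations. A minimal sketch, with a hypothetical function name:

```python
def overlap_stats(stacc_pairs, laser_pairs):
    """Compare two sets of (source, target) alignments: sizes of their
    union and intersection, the share of the union mined by both methods,
    and the alignments unique to each method."""
    stacc, laser = set(stacc_pairs), set(laser_pairs)
    union, common = stacc | laser, stacc & laser
    return {
        "union": len(union),
        "intersection": len(common),
        "common_share": len(common) / len(union) if union else 0.0,
        "stacc_only": len(stacc - laser),
        "laser_only": len(laser - stacc),
    }
```

The `common_share` value corresponds to the 55.6% (es-eu) and 50.2% (eu-es) figures reported in Table 7.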
The benefits these unique alignments can provide will be approximated in the next section by training models based on the union and intersection of the datasets. A full manual analysis of alignment accuracy for these large datasets was not feasible for this study; we therefore randomly sampled 100 aligned sentences from each of the selected datasets and manually evaluated the alignments, to obtain approximate estimates of the quality of the unique alignments. An alignment was considered incorrect if there was information imbalance between the source and target sentences, or if crucial information, such as Named Entities, was missing.
For eu-es, where both laser and stacc datasets are based on 0 thresholds, the proportion of alignments identified as correct in the samples was 21% for the former, and 35% for the latter. For es-eu, the stacc dataset is the same as in the other direction, but the laser dataset is based on a 1.0 threshold, for which we extracted a separate sample. In this case, 29% of the alignments were considered correct.
Although these results are only approximate estimates, they do indicate that each method addresses aspects of comparable data alignment that the other method fails to capture, while also featuring erroneous alignments based on the respective characteristics of the methods. From manual examination of the samples, the main weakness of laser seems to be the lack of robustness in identifying corresponding Named Entities and alignment of loosely related sentences, which is likely to be a by-product of the use of embeddings, whereas for stacc the main limitation appeared to be the alignment of sentences which are translations of each other, but where a significant portion of one side of the pair is missing in the other.
We provide examples of correct alignments in Table 8. As these examples illustrate, each method can mine valid parallel sentences in the comparable datasets that are not captured by the alternative, irrespective of the length of the aligned sentences.
We then compared the vocabulary distribution differences between the selected datasets, the union of the two, and their intersection. Figures 1 and 2 show the vocabulary sizes of the union for es-eu and eu-es, respectively, along with the number of tokens shared between the union and the intersection dataset, the stacc dataset and the laser dataset.
The intersection of vocabularies is smaller than the union, indicating that each method mines portions of data that differ in terms of vocabulary, with \(85\%\) common vocabulary for both es-eu and eu-es.
As indicated by the stacc-only and laser-only results, uniquely retrieved tokens range between 40,707 and 87,504 for eu-es, with larger counts for laser, and between 33,457 and 83,577 for es-eu, with larger counts for stacc in this case. Considering the reversed tendencies given the language pair, these differences are likely due in part to the differences in volumes of selected data, with 46,209 more sentence pairs selected by laser for eu-es, and 240,731 more by stacc for es-eu. The differences in vocabulary indicate however that the additional sentence pairs in each case differ significantly from the common pairs selected by both methods.
Since the two methods differ significantly in the way they mine parallel sentences, via bilingual embeddings in one case and lexical translation overlap in the other, the differences in terms of vocabulary are not surprising. To further explore these differences, we measured the amount of Named Entities (NE) and sentence pairs that are identical in each language, as retrieved by each method.Footnote 8 Results are shown in Table 9.
In terms of NEs, the results are in line with the previously discussed volume differences between the selected datasets, with either method retrieving at least 9 out of 10 NEs present in the union of the datasets.
The methods do contrast in terms of identical sentences in aligned pairs, with laser aligning virtually all of them in both cases, as opposed to stacc. This difference might be due to cases where the identical source and target sentences are in the same language, in which case the reliance of the stacc method on lexical translation mappings would discard these pairs as possible alignments, correctly so in these cases. Since the amount of such pairs is comparatively marginal, we did not explore these cases any further.
Alignment methods may also differ in terms of the length of the sentences in aligned pairs. We measured both the average sentence length and the average length difference between source and target sentences, with the results shown in Table 10.
As indicated by these results, stacc tends to align pairs with larger length differences than laser, with a two-token difference on average. This is mainly due to the length of the selected Spanish sentences, as the length of Basque sentences is comparable with both methods on average. These differences add to the observation that the characteristics of each method lead to divergent mining of parallel sentences in comparable corpora, to some extent. In the next section, we investigate the impact of using different combinations of data to train nmt models.
7 Data combination
To measure the impact of combining the datasets generated by both stacc and laser, we created variants of the corpus with both the union and the intersections of these datasets and trained complete models with each variant.
Table 11 presents the results in terms of bleu scores for es-eu, along with the previous scores obtained with the individual datasets selected by each method. Table 12 shows statistical significance results of the scores obtained between each pair of models.
The union of the two datasets results in the lowest score overall for es-eu, which is statistically significant against all other models, with differences in bleu scores ranging from 1 to 2.7 points. Although each method can uniquely retrieve valid alignments on its own, as previously discussed, in combination the results were lower on these test sets. Although this may indicate a tendency, additional test sets might be warranted to confirm whether the union of alignment pairs is always detrimental for this translation pair. It should be noted in particular that translation into Basque is more likely to be sensitive to variations in word order between valid translations and fixed references, variations which are likely to increase with larger amounts of training data.
The intersection-based model was not significantly different from the laser-only variant, although it features lower amounts of data. The gains obtained over the stacc model were significant, indicating that the intersection-induced filtering was beneficial for part of the data selected by stacc.
Tables 13 and 14 present the results for eu-es, in terms of bleu and statistical significance, respectively.
For this translation pair, the model trained on the union of the datasets produced results on a par with those trained on the datasets mined by either stacc or laser, with no statistically significant difference. In contrast, the intersection-based model was significantly worse than all other variants, indicating that lower amounts of training data were more impactful for this translation pair than was the case for es-eu. In the following sections, we further explore the differences between the two translation pairs.
8 Length filtering
Due to the nature of the task, aligned comparable sentences may display information imbalance, with one of the sentences in a pair missing part of the information in the other. In this section we evaluate the impact of information mismatch, via filtering based on length differences measured on the aligned sentence pairs.
We based our approach to length-based filtering on the method described in Etchegoyhen et al. (2018), which aims to identify statistical outliers in terms of length differences between aligned sentences. We first computed the median and standard deviation over length differences, measured in terms of tokens. These reference statistics were computed on the parallel ode corpus, to establish the relevant length-difference indicators on parallel human translations. A length-difference score (lgs), based on a modified z-score, was then computed on the aligned eitb datasets, according to the formula in Eq. 1:

$$\begin{aligned} \mathrm{lgs}(x) = \frac{0.6745\,(x - \tilde{y})}{\mathrm{mad}} \end{aligned}$$
(1)

where x is the length difference of a sentence pair in the eitb corpus, \(\tilde{y}\) is the median length difference in the reference corpus, and the denominator \(\mathrm{mad}\) is the median absolute deviation, computed over the reference corpus as well.
The modified z-score was then used to identify outliers in the aligned eitb corpus, with sentence pairs having an absolute score over a given threshold identified as cases of information imbalance. Iglewicz and Hoaglin (1993) recommend a value of 3.5 to identify outliers when using a modified z-score, and we selected this value as our default to filter all identified outliers. Additionally, we selected two more thresholds with lower values, namely 2.0 and 1.5, to evaluate the impact of a more restrictive identification of length imbalance.Footnote 9
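The filtering step described above can be sketched as follows. This is a simplified illustration with a hypothetical function name: the reference length differences would come from the parallel ode corpus, whitespace tokenisation stands in for the actual tokeniser, and the 0.6745 constant is the standard scaling factor for the modified z-score recommended by Iglewicz and Hoaglin (1993).

```python
import statistics

def filter_length_outliers(pairs, reference_diffs, threshold=3.5):
    """Discard aligned pairs whose token-length difference is a
    statistical outlier, using the median and median absolute deviation
    (MAD) computed on a parallel reference corpus."""
    med = statistics.median(reference_diffs)
    mad = statistics.median([abs(d - med) for d in reference_diffs])
    kept, discarded = [], []
    for src, tgt in pairs:
        diff = len(src.split()) - len(tgt.split())
        lgs = 0.6745 * (diff - med) / mad if mad else 0.0
        (kept if abs(lgs) <= threshold else discarded).append((src, tgt))
    return kept, discarded
```

Lowering the threshold from the default 3.5 to 2.0 or 1.5, as in the experiments below, makes the outlier identification progressively more restrictive.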
Table 15 summarises the amount of sentence pairs discarded with the above length filtering method, with different thresholds. Overall, the volumes of discarded data are significant, even with less restrictive thresholds, with a minimum of 25% of the data for the selected datasets with lgs\(_{2.0}\) filtering, for instance. Higher alignment thresholds gradually minimise the issue in both approaches, although it tends to persist throughout.
These results also reinforce one of the conclusions of the manual analysis of samples, namely that the stacc method tends to align more sentences than laser where part of the information is a correct translation, but a significant part of it is missing on one side of the pair.
In Table 16, we show examples of filtered pairs, where the information in the Spanish sentence that is missing in the Basque counterpart of the aligned pair is marked in italics.
The results on models trained on selected corpora filtered by length outliers for es-eu and eu-es are shown in Tables 17 and 18, respectively.Footnote 10 Also indicated in the tables are the size of the filtered corpus, the bleu brevity penalty (bp), and the proportion of filtered sentences where the length of the Spanish sentence is larger than that of the Basque sentence. Statistical significance was computed for each model with respect to the unfiltered baselines.
For Spanish to Basque translation, in terms of bleu scores, length filtering improved significantly over the unfiltered corpus for all variants of the union dataset and for the intersection dataset, where the results were statistically significant with the least restrictive of the three filtering thresholds. In this translation direction, information imbalance was thus significantly detrimental in the datasets left unfiltered for length differences. For Basque to Spanish, the results were reversed, with a gradual decrease in bleu scores as filtering increased, all statistically significant.
One interpretation of these opposite results starts from the fact that the filtered Spanish sentences are systematically longer than their Basque counterparts. Although Spanish sentences are longer in general, given the morphological system of Basque with its productive affixation, more aggressive filtering of length-difference outliers lowers the proportion of Spanish sentences that are longer than their Basque counterparts, as shown in the last column of the tables. This indicates that the overall tendency in the corpus is for information imbalance to affect the Basque data more than the Spanish data. In other words, the news in the eitb corpus tend to summarise the information more in Basque than in Spanish. Translating from Spanish to Basque would thus orient the models towards summarisation, with a negative impact on translation quality that needs to be compensated by more length-based filtering. This conjecture is supported by the results in terms of brevity penalty, with lower brevity scores correlating with less length-based filtering.
For Basque to Spanish, translation quality seems to correlate instead with the volume of data. This may be attributed to the absence of a marked tendency towards summarisation in this translation direction, since the target sentences are, for the most part, longer than the source. The target monolingual data can thus contribute relevant decoding information in a way similar to synthetic data based on back-translations or on empty source sentences (Sennrich et al., 2016b), where the models can improve their modelling of the target sequences in the face of degenerate source input.
9 Data tagging
The use of tags identifying specific aspects of the data in the training corpora has proved effective in Neural Machine Translation. Thus, Sennrich et al. (2016a) used markers to control the translation of honorifics, Kobus et al. (2017) modelled domain control via tags identifying different domains, Yamagishi et al. (2016) used tags to control voice in Japanese-to-English translation, and Caswell et al. (2019) employed tags to identify back-translated synthetic data, for instance.
The latter work in particular demonstrates that tagging techniques can prove more effective than noising approaches, indicating that noising back-translated data essentially acts as an indicator of the type of data used for training, helping the models discriminate between natural and synthetic data. We extend their approach to comparable data by prepending a <cc> tag to all source sentences of the comparable eitb training set.
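A minimal sketch of this tagging scheme follows; the helper name and the toy sentences are ours, not from the paper:

```python
def tag_sentences(sentences, tag="<cc>"):
    """Prepend a data-origin tag to each source sentence."""
    return [f"{tag} {s}" for s in sentences]

# Illustrative mini-corpora; real data would be the ode and eitb sets.
parallel_src = ["kaixo mundua", "egun on"]
comparable_src = ["eguraldia ona da"]

# Only the comparable source sentences receive the <cc> tag; parallel
# sentences and all target sentences are left untouched.
training_src = parallel_src + tag_sentences(comparable_src)
```

At inference time, no tag is prepended, so the model decodes as if translating natural parallel data.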
We trained models by combining the ode corpus with selected variants of the eitb corpus, with and without tags indicating comparable data, on the best performing datasets as established in the previous sections. The results of these experiments are shown in Table 19.
For Spanish to Basque, tagging was only effective on the noisier dataset, i.e. the eitb variant with no filtering of length-difference outliers. For the less noisy dataset, based on length-based filtering, the use of tags was detrimental. Interestingly, the use of tags in this translation direction had a significant impact in terms of shortness of translations, from a brevity penalty of 0.801 for the untagged model to 0.997 for the tagged variant based on the same corpus.
These results tend to support the hypothesis that tagging helps the model discriminate between natural and noisy data, and becomes counterproductive when the tagged comparable data are closer to natural translations, as in the variant based on filtered length-imbalanced data.
For Basque to Spanish, tagging was detrimental in terms of bleu, despite minor improvements regarding the brevity of translations. This can be viewed in light of the previous hypothesis that the overall higher quantity of information in the Spanish target sentences is a dominating factor for this translation direction.
The negative impact of tagging in this case seems to indicate that comparable data with less information in the source than in the target are actually not noisy for the translation models, as discriminating between natural and comparable data leads to lower translation quality in this case. Whether this hypothesis is correct can be further examined by comparing tagged back-translated data with tagged comparable data; we explore this in the next section.
As a final experiment with tagged data, we evaluated a mixed use of tags for es-eu, the language pair for which tagging was beneficial. Given that tags were effective for the noisier dataset without length filtering and not for the filtered variant, we tagged only the data discarded via length-filtering, leaving the non-filtered data without tagging. The results are shown in Table 20.
Mixed tagging obtained degraded results compared with generalised tagging, indicating that, for this language pair, identifying all comparable data in the training datasets remains the optimal option. A more fine-grained analysis might be needed to further explore the use of mixed tagging; we leave this type of evaluation for future research.
10 Backtranslations vs. alignment
Backtranslations have proved useful to complement parallel corpora with synthetic translations (Edunov et al., 2018; Poncelas et al., 2018; Sennrich et al., 2016b). To our knowledge, no comparison has yet been made between using the target side of comparable corpora via backtranslations and using the results of alignment, the approach typically taken to exploit this type of corpora and the one we have adopted so far in this work. Given the results obtained in the previous sections, in particular for translation from Basque to Spanish, backtranslating target data might in some cases provide similar or larger improvements over the baselines than aligned data.
To explore this hypothesis, we generated backtranslations for both translation directions using the baseline ode model, and trained translation models by merging the resulting backtranslations with the ode corpus. We also trained a model where backtranslated data were marked with a <bt> tag (Caswell et al., 2019). The results are shown in Table 21.
In both translation directions, models trained on backtranslated data improved significantly over the ode baseline, with gains of approximately 7 and 10 bleu points for es-eu and eu-es, respectively. However, in both cases models based on aligned data performed significantly better, in particular for translation from Spanish to Basque. The model variants with tagged backtranslations improved marginally over the non-tagged version for es-eu but slightly degraded the performance for eu-es. These results are in line with those obtained in previous sections, where translation from Spanish to Basque required more filtering and data identification, whereas translation from Basque to Spanish seems to be optimal with the largest amount of available data.
As an additional experiment, we trained models based on a mixture of aligned and backtranslated data, where data discarded via the initial alignments, plus threshold and length filtering, were included as backtranslations. The main goal of this experiment was to evaluate the optimal combination of aligned and backtranslated data.
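The combination of aligned and backtranslated data described above might be sketched as follows. The helper names are hypothetical, and tagging the synthetic portion with <bt> follows the tagged back-translation setup discussed earlier, an assumption rather than a detail stated for this experiment:

```python
def build_mixed_corpus(aligned_pairs, discarded_targets, backtranslate,
                       bt_tag="<bt>"):
    """Combine aligned comparable pairs with backtranslations of the
    target sentences discarded by alignment and length filtering.

    `backtranslate` is a target-to-source translation function; in the
    experiments it would be the baseline model trained on ode.
    """
    src = [s for s, _ in aligned_pairs]
    tgt = [t for _, t in aligned_pairs]
    for t in discarded_targets:
        # Synthetic sources are tagged so the model can discriminate
        # them from aligned data (Caswell et al., 2019).
        src.append(f"{bt_tag} {backtranslate(t)}")
        tgt.append(t)
    return src, tgt
```

The discarded target sentences thus still contribute target-side information, while the aligned pairs remain the backbone of the training set.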
The results, shown in Table 22, indicate no improvement in terms of bleu for es-eu, but a notable impact on the length of the translated sentences, as indicated by the bp results, with translations that are overall closer in length to the human references. For eu-es, the model trained on mixed data improved over the best merged model, with a statistically significant gain of 0.9 bleu points and translations that were also closer in length to the human references, although not to the extent observed for translation from Spanish to Basque.
Overall, from the results on these datasets at least, backtranslations cannot be used as a replacement for comparable data alignment, but are a notable complement to reach optimal performance in terms of bleu and to approximate reference translations in length.
11 Comparative summary
Starting from the baseline trained on the ode corpus, various models have been trained and evaluated in the previous sections. Table 23 summarises the comparative results between the baselines and the best models achieved via data combination and filtering. For translation from Spanish to Basque, the best model was based on the union of the stacc\(_{0.0}\) and laser\(_{1.0}\) datasets, length filtering of outliers, source data tagging and inclusion of filtered backtranslations. For translation from Basque to Spanish, the best model was based on stacc\(_{0.0}\) and laser\(_{0.0}\) without length filtering or tagged data, along with backtranslations over the non-aligned data. The gains obtained by the best models were statistically significant over the baselines on the eitb test sets in all cases and both translation directions.
Table 24 provides some examples of aligned sentences in the final corpora. These examples illustrate the quality of the parallel resource obtained from the original comparable datasets, and the variety of topics covered in the corpus, including politics, world affairs, weather, sports and culture. The specific challenges presented by the morpho-syntactic properties of Basque, including agglutinative morphology, ergativity or relatively free word order, among others,Footnote 11 make it even more necessary to prepare additional parallel resources for this language. Comparable corpora can provide useful data to build such resources, although they require dedicated analyses and careful selection to fully exploit their potential.
12 Conclusions
In this work, we presented the results of a case study centred on the exploitation of Basque-Spanish comparable data for Neural Machine Translation. The original corpus is composed of news produced independently in these two languages by the Basque public broadcaster eitb. We applied and evaluated several techniques to create different variants of the corpus and measure their impact on machine translation quality.
Two efficient data alignment methods were evaluated, one based on lexical translation overlaps and the other on bilingual sentence embeddings. Both proved similarly efficient on their own to mine parallel data in the original corpus and train nmt models in both translation directions. The datasets extracted with each method were evaluated, and although they extracted a significant portion of similar data, manual analysis of the data showed that each method is able to mine valid parallel sentences that are not retrieved by the other approach. More work will thus be necessary to devise a single method with the combined accuracy of these two methods at least.
Different alignment thresholds were evaluated for each of the selected methods, with minor impact on translation quality for the models based on merged ode and eitb datasets. Other studies, on this dataset and others, have shown the benefits of higher alignment thresholds with both methods, though this was not the case for this particular experimental setup, where the goal was to measure the impact of complementing parallel data with comparable data on nmt models.
Additionally, the impact of further filtering based on length-difference outliers was also measured, with the notable result that such filtering was necessary for Spanish to Basque translation, given information imbalance in the data, but not in the other translation direction, as nmt models proved able to benefit from target language information despite degenerate comparable source information. A related result was the tendency of nmt models to gear towards summarisation when provided with impoverished comparable target information, a phenomenon which is likely to arise with comparable corpora and needs to be controlled for an optimal exploitation of the data.
Results on the impact of tagging for comparable data were also discussed. This method was shown to be effective in helping the models discriminate noisy comparable data when there is less information in the target sentence than in the source, but detrimental in the opposite case, where the imbalanced comparable data may still strengthen target-side sequence modelling. Mixed tagging, where only lower quality alignments were signalled, resulted in lower quality overall, indicating that comparable data as a whole required identification in the training process, at least for the translation pair where tagging led to significant improvements.
Both length-based filtering and tagging were the most impactful methods overall, although the impact of these methods was dependent on the specific information imbalance in the dataset, with translation from Spanish to Basque as the scenario benefitting from these techniques for the comparable data used in our experiments.
Finally, we compared the use of aligned comparable sentences with simply exploiting backtranslations. Overall, comparable sentence alignment achieved higher translation quality, although complementary backtranslations proved useful to further increase the coverage of the translation models, via mixtures of aligned sentences and backtranslations. Comparable data may thus be exploited for Neural Machine Translation beyond the standard combination of alignment and threshold-based filtering.
Significant improvements in translation quality can be obtained with comparable corpora, and may be useful in particular for under-resourced languages. However, their use requires a careful analysis of phenomena such as information imbalance, in specific translation pairs, to fully exploit their potential for Neural Machine Translation.
Notes
Euskal Irrati Telebista: https://www.eitb.eus.
The corpus is available in the opus repository (Tiedemann, 2012): https://opus.nlpl.eu/EiTB-ParCC.php.
See Etchegoyhen et al. (2016) for a detailed description of the original file structure.
In the version shared for the iwslt 2018 shared task, available at: https://sites.google.com/site/iwsltevaluation2018/TED-tasks.
For all translation model results hereafter, translation from Basque to Spanish is indicated as eu-es and the reverse direction as es-eu.
In both tables, \(^\dagger\) indicates statistical significance over the baseline, for \(p < 0.05\).
The identification of Named Entities was approximated by determining the amount of cased tokens and numbers in the truecased datasets. Identical sentences in the source and target may be due to alignment errors but also to sports results or movie listings, for example.
Length imbalance could have been computed by simply taking the average absolute difference for each sentence pair. However, this method would not lead to the identification of statistically significant deviations from the mean determined on a reference corpus, which was our goal for these experiments.
For Basque to Spanish translation, we discarded the intersection dataset, considering the significantly worse results obtained over the baseline, as discussed in the previous section.
See Etchegoyhen et al. (2018) and references therein for more details on Machine Translation of Basque.
References
Abdul-Rauf, S., & Schwenk, H. (2009). On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (pp. 16–23). Association for Computational Linguistics.
Abdul-Rauf, S., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.
Artetxe, M., Labaka, G., Agirre, E., & Cho, K. (2018). Unsupervised neural machine translation. In International Conference on Learning Representations.
Artetxe, M., & Schwenk, H. (2019). Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3197–3203). Association for Computational Linguistics.
Azpeitia, A., & Etchegoyhen, T. (2019). Efficient document alignment across scenarios. Machine Translation, 33, 205–237.
Azpeitia, A., Etchegoyhen, T., & Martínez Garcia, E. (2017). Weighted set-theoretic alignment of comparable sentences. In Proceedings of the Tenth Workshop on Building and Using Comparable Corpora (pp. 41–45).
Azpeitia, A., Etchegoyhen, T., & Martínez Garcia, E. (2018). Extracting parallel sentences from comparable corpora with STACC variants. In Proceedings of the Eleventh Workshop on Building and Using Comparable Corpora (pp. 48–52).
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.
Belinkov, Y., & Bisk, Y. (2018). Synthetic and natural noise both break neural machine translation. In Proceedings of the 6th International Conference on Learning Representations.
Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huang, S., Huck, M., Koehn, P., Liu, Q., Logacheva, V., Monz, C., Negri, M., Post, M., Rubino, R., Specia, L., & Turchi, M. (2017). Findings of the 2017 conference on machine translation. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers (pp. 169–214).
Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Jimeno Yepes, A., Koehn, P., Logacheva, V., Monz, C., Negri, M., Neveol, A., Neves, M., Popel, M., Post, M., Rubino, R., Scarton, C., Specia, L., Turchi, M., Verspoor, K., & Zampieri, M. (2016). Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation (pp. 131–198).
Buck, C., & Koehn, P. (2016a). Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation (pp. 554–563). Association for Computational Linguistics.
Buck, C., & Koehn, P. (2016b). Quick and reliable document alignment via TF/IDF-weighted cosine distance. In Proceedings of the First Conference on Machine Translation (pp. 672–678). Association for Computational Linguistics.
Caswell, I., Chelba, C., & Grangier, D. (2019). Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers) (pp. 53–63). Association for Computational Linguistics.
Chen, J., Chau, R., & Yeh, C.-H. (2004). Discovering parallel text from the World Wide Web. In Proceedings of the Second Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation - Volume 32, ACSW Frontiers ’04 (pp. 157–161). Australian Computer Society, Inc.
Chen, J., & Nie, J.-Y. (2000). Parallel web text mining for cross-language IR. In Content-Based Multimedia Information Access - Volume 1 (pp. 62–77). Centre des hautes études internationales d’informatique documentaire.
Cheng, Y., Tu, Z., Meng, F., Zhai, J., & Liu, Y. (2018). Towards robust neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1756–1766). Association for Computational Linguistics.
Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 489–500). Association for Computational Linguistics.
Esplá-Gomis, M., Forcada, M. L., Ortiz-Rojas, S., & Ferrández-Tordera, J. (2016). Bitextor’s participation in WMT’16: shared task on document alignment. In Proceedings of the First Conference on Machine Translation (pp. 685–691).
Etchegoyhen, T., & Azpeitia, A. (2016a). A portable method for parallel and comparable document alignment. Baltic Journal of Modern Computing, 4(2), 243–255.
Etchegoyhen, T., & Azpeitia, A. (2016b). Set-theoretic alignment for comparable corpora. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1: Long Papers (pp. 2009–2018).
Etchegoyhen, T., Azpeitia, A., & Pérez, N. (2016). Exploiting a large strongly comparable corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation.
Etchegoyhen, T., & Gete, H. (2020). Handle with care: A case study in comparable corpora exploitation for neural machine translation. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 3792–3800). European Language Resources Association.
Etchegoyhen, T., Martínez Garcia, E., Azpeitia, A., Labaka, G., Alegria, I., Cortes Etxabe, I., Jauregi Carrera, A., Ellakuria Santos, I., Martin, M., & Calonge, E. (2018). Neural machine translation of Basque. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (pp. 139–148).
Fung, P., & Cheung, P. (2004). Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and E.M. In Proceedings of Empirical Methods in Natural Language Processing (pp. 57–63).
Germann, U. (2016). Bilingual document alignment with latent semantic indexing. In Proceedings of the First Conference on Machine Translation (pp. 692–696). Association for Computational Linguistics.
Gomes, L., & Lopes, G. P. (2016). First steps towards coverage-based document alignment. In Proceedings of the First Conference on Machine Translation (pp. 697–702). Association for Computational Linguistics.
Grégoire, F., & Langlais, P. (2017). BUCC 2017 shared task: a first attempt toward a deep learning framework for identifying parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora (pp. 46–50).
Iglewicz, B., & Hoaglin, D. (1993). How to detect and handle outliers. The ASQC Basic References in Quality Control: Statistical Techniques, Vol. 16. ASQC Quality Press.
Ion, R., Ceauşu, A., & Irimia, E. (2011). An expectation maximization algorithm for textual unit alignment. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web (pp. 128–135). Association for Computational Linguistics.
Irvine, A., & Callison-Burch, C. (2013). Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation (pp. 262–270).
Jaccard, P. (1901). Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 241–272.
Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A. F., Bogoychev, N., Martins, A. F. T., & Birch, A. (2018). Marian: Fast neural machine translation in C++. In Proceedings of 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations (pp. 116–121).
Khayrallah, H., & Koehn, P. (2018). On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation (pp. 74–83).
Kobus, C., Crego, J., & Senellart, J. (2017). Domain control for neural machine translation. In Proceedings of Recent Advances in Natural Language Processing (pp. 372–378). INCOMA Ltd.
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 388–395).
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 177–180).
Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2018). Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.
Li, B., & Gaussier, E. (2013). Exploiting comparable corpora for lexicon extraction: Measuring and improving corpus quality. In Building and Using Comparable Corpora (pp. 131–149).
Ma, X., & Liberman, M. (1999). Bits: A method for bilingual text search over the web. In Proceedings of Machine Translation Summit VII (pp. 538–542).
Morin, E., Hazem, A., Boudin, F., & Clouet, E. L. (2015). Lina: Identifying comparable documents from Wikipedia. In Proceedings of the Eighth Workshop on Building and Using Comparable Corpora.
Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 289–295).
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Papavassiliou, V., Prokopidis, P., & Piperidis, S. (2016). The ILSP/ARC submission to the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation (pp. 733–739).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311–318).
Poncelas, A., Shterionov, D. S., Way, A., de Buy Wenniger, G. M., & Passban, P. (2018). Investigating backtranslation in neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (pp. 249–258).
Post, M. (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers (pp. 186–191).
Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics (pp. 320–322).
Resnik, P. (1999). Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 527–534).
Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.
Sarikaya, R., Maskey, S., Zhang, R., Jan, E.-E., Wang, D., Ramabhadran, B., & Roukos, S. (2009). Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In Proceedings of InterSpeech (pp. 432–435).
Schwenk, H. (2018). Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 228–234).
Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135m parallel sentences in 1620 language pairs from Wikipedia. CoRR, abs/1907.05791.
Sennrich, R., Haddow, B., & Birch, A. (2016a). Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 35–40). Association for Computational Linguistics.
Sennrich, R., Haddow, B., & Birch, A. (2016b). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86–96).
Sennrich, R., Haddow, B., & Birch, A. (2016c). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers (pp. 1715–1725).
Sharoff, S., Rapp, R., Zweigenbaum, P., & Fung, P. (2016). Building and using comparable corpora. Incorporated: Springer Publishing Company.
Sharoff, S., Zweigenbaum, P., & Rapp, R. (2015). BUCC shared task: Cross-language document similarity. In Proceedings of the 8th Workshop on Building and Using Comparable Corpora (pp. 74–78).
Skadiņa, I., Aker, A., Mastropavlos, N., Su, F., Tufis, D., Verlic, M., Vasiljevs, A., Babych, B., Clough, P., Gaizauskas, R., et al. (2012). Collecting and using comparable corpora for statistical machine translation. In Proceedings of the 8th International Conference on Language Resources and Evaluation.
Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 403–411).
Sperber, M., Niehues, J., & Waibel, A. (2017). Toward robust neural machine translation for noisy input sequences. In Proceedings of the 14th International Workshop on Spoken Language Translation (pp. 90–96).
Stefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (pp. 137–144).
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th Language Resources and Evaluation Conference (pp. 2214–2218).
Uszkoreit, J., Ponte, J. M., Popat, A. C., & Dubiner, M. (2010). Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1101–1109).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł, & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017) (pp. 5998–6008).
Yamagishi, H., Kanouchi, S., Sato, T., & Komachi, M. (2016). Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (pp. 203–210).
Zafarian, A., Aghasadeghi, A., Azadi, F., Ghiasifard, S., Alipanahloo, Z., Bakhshaei, S., & Ziabary, S. M. M. (2015). AUT document alignment framework for bucc workshop shared task. In Proceedings of the 8th Workshop on Building and Using Comparable Corpora (p. 79).
Zhao, B., & Vogel, S. (2002). Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 745–748).
Zweigenbaum, P., Sharoff, S., & Rapp, R. (2017). Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora (pp. 60–67). Association for Computational Linguistics.
Zweigenbaum, P., Sharoff, S., & Rapp, R. (2018). Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Rapp, R., Zweigenbaum, P., and Sharoff, S. (Eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
Acknowledgements
We wish to thank the Basque public broadcasting organisation eitb, for their support and their willingness to share the corpus with the community, and the anonymous LRE reviewers for their helpful comments. This work was partially supported by the Department of Economic Development of the Basque Government (Spri), via projects ITAI (ZL-2021/00038) and TANDO (KK-2020/00074).
About this article
Cite this article
Gete, H., Etchegoyhen, T. Making the most of comparable corpora in Neural Machine Translation: a case study. Lang Resources & Evaluation 56, 943–971 (2022). https://doi.org/10.1007/s10579-021-09572-2