Making the most of comparable corpora in Neural Machine Translation: a case study

Gete, Harritxu; Etchegoyhen, Thierry

doi:10.1007/s10579-021-09572-2

Open access
Published: 27 January 2022

Volume 56, pages 943–971, (2022)

Language Resources and Evaluation

Abstract

Comparable corpora can benefit the development of Neural Machine Translation models, in particular for under-resourced languages. We present a case study centred on the exploitation of a large comparable corpus for Basque-Spanish, created from independently-produced news by the Basque public broadcaster eitb, where we evaluate the impact of different techniques to exploit the original data, in order to complement parallel datasets for this language pair in both translation directions. Two efficient methods for parallel sentence mining are explored, which identified a common core of approximately half of the total number of aligned sentences, each one uniquely identifying valid parallel sentences not captured by the other method. Filtering the data via identification of length-difference outliers proved highly effective in improving the models, as was the use of tags to discriminate between comparable and parallel data in the training corpora. The use of backtranslated data is also evaluated in this work, with results indicating that alignment-based datasets remain the most beneficial, although complementary backtranslations should also be included to fully exploit the available comparable data. Overall, the results in this work demonstrate that this type of data needs to be carefully analysed prior to its use as training data for Neural Machine Translation, since issues such as information imbalance between source and target data can lead to suboptimal results for a given translation pair.

1 Introduction

Comparable corpora have long been identified as an important source of potentially useful data in natural language processing, notably to train data-driven machine translation systems (Munteanu & Marcu, 2005; Sharoff et al., 2016) or to extract bilingual dictionaries (Rapp, 1995).

Extracting sentence pairs with a significant degree of parallelism from this type of corpora faces important challenges, however, considering the vast amounts of unrelated or loosely related data that compose comparable datasets in different languages. Several methods have been designed to address the challenges of comparable document alignment (Azpeitia & Etchegoyhen, 2019; Sharoff et al., 2015) and comparable sentence alignment (Artetxe & Schwenk, 2019; Etchegoyhen & Azpeitia, 2016b; Fung & Cheung, 2004; Munteanu & Marcu, 2002; Smith et al., 2010), leading to new parallel datasets that can support machine translation, in particular for under-resourced languages (Etchegoyhen et al., 2016; Schwenk et al., 2019).

Neural Machine Translation (nmt) is currently the dominant paradigm in research and development in the field of Machine Translation, having led to significant advances in recent years for most language pairs (Bahdanau et al., 2015; Bojar et al., 2016, 2017; Vaswani et al., 2017). As a data-driven approach where model parameters are estimated and optimised on large volumes of parallel data, nmt is particularly sensitive to the presence of noise in the training data (Belinkov & Bisk, 2018; Cheng et al., 2018; Khayrallah & Koehn, 2018; Sperber et al., 2017). Due to the nature of the task, parallel data extracted from comparable corpora are likely to introduce unwarranted noise in the training process, and this potential issue is worth evaluating.

In this article, we describe the results of an in-depth evaluation of the exploitation of a large strongly comparable corpus for Basque-Spanish, composed of independently-produced news by the Basque public broadcaster eitb,^{Footnote 1} focusing on the impact of various techniques to exploit the original data for Neural Machine Translation and extending the study by Etchegoyhen & Gete (2020).

We first compare results obtained with two alignment methods that have demonstrated high accuracy on comparable data, namely laser (Artetxe & Schwenk, 2019) and stacc (Etchegoyhen & Azpeitia, 2016b), and evaluate various datasets and models based on the alignments obtained with each method via threshold filtering and data combination. We then measure the impact of filtering based on length difference statistical outliers between aligned source and target sentences, as comparable corpora typically feature information divergences between source and target sentences.
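As a rough illustration of the length-difference filtering step mentioned above, the following minimal sketch discards aligned pairs whose token-length difference is a statistical outlier. It assumes Tukey's interquartile-range rule with a factor of 1.5 and whitespace tokenisation; both are assumptions made for this example, and the criterion actually used in the study may differ. The function names are hypothetical.

```python
from statistics import quantiles


def length_difference(src: str, tgt: str) -> int:
    """Signed difference in whitespace-token counts (source minus target)."""
    return len(src.split()) - len(tgt.split())


def filter_length_outliers(pairs, k=1.5):
    """Keep the pairs whose length difference lies within Tukey's fences.

    `pairs` is a list of (source, target) sentence tuples; `k` is the
    usual interquartile-range multiplier (an assumption of this sketch).
    """
    diffs = [length_difference(s, t) for s, t in pairs]
    q1, _, q3 = quantiles(diffs, n=4)  # quartiles of the observed differences
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [pair for pair, d in zip(pairs, diffs) if low <= d <= high]
```

Computing the fences on signed differences, rather than absolute ones, lets the filter treat source-heavy and target-heavy pairs separately, which may matter when information imbalance is skewed towards one language.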

We also explore the impact of tags to identify comparable data in the training datasets, following the approach proposed by Caswell et al. (2019) for backtranslations and Etchegoyhen & Gete (2020) for comparable data, and the use of mixed tagging by limiting tagging to portions of the data identified as containing more imbalanced information in the source and the target.
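The tagging strategies described above can be sketched as follows. This is a minimal illustration only: the tag string `<cmp>`, the function names and the imbalance predicate are hypothetical, and the sketch assumes that the tag is prepended to the source side, as done for backtranslations in Caswell et al. (2019).

```python
COMPARABLE_TAG = "<cmp>"  # hypothetical tag token, not taken from the article


def tag_source(src: str, tag: str = COMPARABLE_TAG) -> str:
    """Prepend the tag to a source sentence."""
    return f"{tag} {src}"


def build_training_set(parallel, comparable, should_tag=lambda src, tgt: True):
    """Merge parallel and comparable pairs, tagging comparable sources.

    With the default predicate every comparable pair is tagged; passing a
    stricter predicate gives a form of mixed tagging, where only the pairs
    judged imbalanced receive the tag.
    """
    tagged = [
        (tag_source(src), tgt) if should_tag(src, tgt) else (src, tgt)
        for src, tgt in comparable
    ]
    return list(parallel) + tagged


def is_imbalanced(src: str, tgt: str, max_diff: int = 2) -> bool:
    """Toy imbalance test: token counts differ by more than `max_diff`."""
    return abs(len(src.split()) - len(tgt.split())) > max_diff
```

Calling `build_training_set(parallel, comparable, should_tag=is_imbalanced)` would tag only the comparable pairs flagged by the toy predicate, leaving the remaining pairs indistinguishable from parallel data.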

Finally, we compare the impact of using aligned comparable sentences to the use of backtranslations for nmt (Sennrich et al., 2016b), as both approaches have demonstrated their usefulness to increase the coverage and quality of translation models. We evaluate in particular the impact of mixtures of aligned and backtranslated data on the quality of the resulting translation models.

The remainder of this paper is organised as follows: Sect. 2 describes related work on comparable corpora; Sect. 3 describes the corpora, alignment methods and tools used in this study; in Sect. 4, we describe the baseline translation models based on the initial datasets; Sect. 5 centres on threshold filtering using the selected alignment methods; in Sect. 6, a detailed analysis of the different datasets is presented, while Sect. 7 describes the results obtained with different combinations of the available data; Sect. 8 describes the results obtained via length filtering and Sect. 9 those obtained via data tagging; in Sect. 10, we report on different experiments using backtranslated data; Sect. 11 presents a comparative summary of results on translation models; finally, Sect. 12 draws conclusions from the experiments and analyses.

2 Related work

A significant body of work has been produced over the years to mine and exploit parallel sentences from large collections of comparable monolingual corpora (Abdul-Rauf & Schwenk, 2009, 2011; Irvine & Callison-Burch, 2013; Munteanu & Marcu, 2005; Sharoff et al., 2016; Smith et al., 2010), starting with seminal work by Resnik (1999) to exploit the World Wide Web as a source of potential parallel data.

A standard part in the process is the determination of document pairs, to reduce the space of computation in the typically large datasets involved in the task. Several approaches have exploited metadata information for Web page alignment, including url, structural tags or publication date (Chen & Nie, 2000; Munteanu & Marcu, 2005; Papavassiliou et al., 2016; Resnik & Smith, 2003). Alternatively, content-based document alignment approaches for comparable corpora have also been proposed, based on vector space models (Chen et al., 2004), token translation ratios (Ma & Liberman, 1999), mutual information (Fung & Cheung, 2004), expectation-maximisation (Ion et al., 2011) or n-gram matching using machine-translated documents (Uszkoreit 2010).

Several approaches have used a mixture of content and structural properties, notably the systems in the wmt 2016 document alignment shared task (Buck & Koehn, 2016a). Among those, Gomes and Lopes (2016) proposed a phrase-based approach combined with url-matching, Buck and Koehn (2016b) used cosine similarity between tf/idf vectors over machine-translated documents, and Germann (2016) performed alignment via vector space word representations, latent semantic indexing, and url matching. In Esplá-Gomis et al. (2016), document alignment is performed via a mixture of url similarity, structural features such as shared links, and bag of words similarity.

Comparability at the document level was notably evaluated in a dedicated shared task of the bucc workshop series (Sharoff et al., 2015). Among participating systems, Li and Gaussier (2013) used bilingual dictionaries and proportion of matching words to assess comparability, Morin et al. (2015) made use of hapax legomena and pigeonhole reasoning to enforce alignments, and Zafarian et al. (2015) used several components, including topic modelling, named entity detection and word features. A strictly content-based method was proposed by Etchegoyhen and Azpeitia (2016a), based on Jaccard similarity (Jaccard, 1901) over sets of lexical translations, expanded with surface-based entities and common prefix matching, demonstrating high accuracy in a large number of scenarios, including comparable corpora (Azpeitia & Etchegoyhen, 2019).

The extraction of parallel sentences from comparable corpora has also been addressed via a large variety of approaches, based on suffix trees (Munteanu & Marcu, 2002), maximum likelihood (Zhao & Vogel, 2002), binary classification (Munteanu & Marcu, 2005), cosine similarity (Fung & Cheung, 2004), reference metrics over statistical machine translations (Abdul-Rauf & Schwenk, 2009; Sarikaya et al., 2009) or rich features (Smith et al., 2010; Stefănescu et al., 2012). In recent years deep learning approaches have been applied to the task as well, via bidirectional recurrent neural networks (Grégoire & Langlais, 2017) or cosine similarity over bilingual sentence embeddings (Schwenk, 2018).

The core document alignment method of Etchegoyhen and Azpeitia (2016a) was applied to comparable sentence alignment, improving significantly over more sophisticated feature-rich methods (Etchegoyhen & Azpeitia, 2016b). This method, referred to as stacc, which is also based on Jaccard similarity over expanded lexical translation sets, was further extended with lexical weighting (Azpeitia et al., 2017) and a named-entity penalty (Azpeitia et al., 2018), obtaining the best results across the board in the bucc shared task on parallel sentence identification in comparable corpora (Zweigenbaum et al., 2017, 2018).
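To make the idea of Jaccard similarity over expanded lexical translation sets more concrete, here is a heavily simplified sketch. It is not the stacc implementation: the toy lexicons, the four-character prefix length, the averaging of both translation directions and all function names are assumptions for illustration, and the lexical weighting and named-entity penalty mentioned above are omitted.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def translation_set(tokens, lexicon):
    """Union of the lexical translations of all tokens found in the lexicon."""
    translations = set()
    for token in tokens:
        translations.update(lexicon.get(token, ()))
    return translations


def expand_with_prefixes(translations: set, tokens: set, min_prefix: int = 4):
    """Add a shared prefix to both sets when two unmatched items share it."""
    a, b = set(translations), set(tokens)
    for x in translations - tokens:
        for y in tokens - translations:
            if len(x) >= min_prefix and x[:min_prefix] == y[:min_prefix]:
                a.add(x[:min_prefix])
                b.add(x[:min_prefix])
    return a, b


def similarity(src_tokens, tgt_tokens, src2tgt, tgt2src) -> float:
    """Average of the expanded Jaccard scores in both translation directions."""
    s, t = set(src_tokens), set(tgt_tokens)
    forward = jaccard(*expand_with_prefixes(translation_set(s, src2tgt), t))
    backward = jaccard(*expand_with_prefixes(translation_set(t, tgt2src), s))
    return (forward + backward) / 2
```

In this sketch, prefix expansion gives partial credit to morphological variants that a lexicon lookup alone would miss, which is plausibly useful for a highly inflected language such as Basque.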

Improvements over these results were obtained with the margin-based approach of Artetxe & Schwenk (2019), which uses cosine similarity over multilingual sentence embeddings, extended with a filtering mechanism based on nearest neighbours similarity. Large amounts of comparable corpora have been mined with this approach (Schwenk et al., 2019), by means of the laser toolkit, which we include in our experiments on the selected corpora.
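The margin-based scoring idea can be illustrated with the short sketch below, which implements the ratio variant: the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours in the other language. The embeddings are assumed to be given as plain, non-zero numeric vectors; the sketch does not call the laser toolkit, and the function names and the default value of k are assumptions for illustration.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_mean(vec, candidates, k: int) -> float:
    """Mean cosine similarity between `vec` and its k most similar candidates."""
    sims = sorted((cosine(vec, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / len(sims)


def margin_score(x, y, src_vecs, tgt_vecs, k: int = 4) -> float:
    """Ratio margin of the pair (x, y).

    The denominator averages the neighbourhood similarity of x among the
    target vectors and of y among the source vectors.
    """
    neighbourhood = (knn_mean(x, tgt_vecs, k) + knn_mean(y, src_vecs, k)) / 2
    return cosine(x, y) / neighbourhood
```

Under this scoring, a pair stands out only if it is markedly more similar than the typical neighbour of either sentence, which reduces the advantage of sentences that are close to many candidates.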

Recently, unsupervised approaches to NMT have demonstrated the potential of directly exploiting monolingual data to train neural translation models (Artetxe et al., 2018; Lample et al., 2018). A detailed comparison between this type of approach and the use of parallel sentences mined from comparable corpora may shed light on their respective merits, in isolation or in combination. Although this type of evaluation is outside the scope of the present study, the results in Sect. 10 on the use of back-translated data, compared to mining parallel sentences, provide preliminary evidence on the direct exploitation of monolingual data in this type of corpus.

A first version of the eitb corpus was prepared and shared with the community (Etchegoyhen et al., 2016), to provide further support to the under-resourced Basque-Spanish language pair. In Etchegoyhen & Gete (2020), a new version of the corpus was prepared and shared with the scientific community,^{Footnote 2} built with news generated in subsequent years. In what follows, we use the latter version of the original data, covering news produced between 2009 and 2018, and evaluate the impact of different methods to exploit these comparable data on Neural Machine Translation.

3 Corpora preparation

The eitb corpus is composed of news independently produced in Basque and Spanish by the journalists of the Basque Country’s public broadcast service, to report on the same specific events. The corpus can thus be considered strongly comparable (Skadiņa et al., 2012) and viewed as a rich source of parallel data for this language pair (Etchegoyhen et al., 2016).

This version of the original dataset covers 10 years of content, and is composed of 168,984 documents in Basque and 174,348 in Spanish, extracted from xml files.^{Footnote 3} With an original amount of over two million sentences per language, it is one of the largest available corpora for the Basque-Spanish language pair, covering political news, sports and cultural events, among others. Statistics of the original corpus are described in Table 1.

Table 1 Original eitb corpus 2009–2018 in number of sentences (Sent.) and tokens (Tok.)
