Machine Translation Customization via Automatic Training Data Selection from the Web

Vu, Thuy; Moschitti, Alessandro

doi:10.1007/978-3-030-72113-8_44

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12656)

Included in the following conference series:

European Conference on Information Retrieval

The original version of this chapter was revised: Table 3 was incorrect and has been corrected. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-72113-8_51

Abstract

Machine translation (MT) systems, especially those designed for industrial settings, are trained on general parallel data derived from the Web. Their style is therefore typically driven by a word and structure distribution averaged over many domains. In contrast, MT customers want translations specialized to their own domain, for which they can typically provide text samples. We describe an approach for customizing MT systems to specific domains by selecting data similar to the target customer data and using it to train neural translation models. We build document classifiers from monolingual target data, e.g., data provided by the customers, and use them to select parallel training data from Web-crawled data. Finally, we train MT models on the automatically selected data, obtaining a system specialized to the target domain. We tested our approach on the WMT-18 Translation Task benchmark for the News domain, which enables comparison with state-of-the-art MT systems. The results show that our models outperform the top systems while using less data and smaller models.
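The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' classifier: the abstract does not specify the model or features, so the sketch assumes a simple add-one-smoothed unigram log-odds score that compares in-domain monolingual text against general text and keeps a crawled pair when its target side scores above a threshold. All function names, the threshold, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count lowercase whitespace tokens over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def log_odds(text, in_counts, gen_counts):
    """Average per-token log-odds of in-domain vs. general text (add-one smoothing)."""
    vocab = len(set(in_counts) | set(gen_counts)) + 1  # +1 reserves mass for unseen tokens
    in_total = sum(in_counts.values())
    gen_total = sum(gen_counts.values())
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    score = 0.0
    for tok in tokens:
        p_in = (in_counts[tok] + 1) / (in_total + vocab)
        p_gen = (gen_counts[tok] + 1) / (gen_total + vocab)
        score += math.log(p_in / p_gen)
    return score / len(tokens)


def select_pairs(pairs, in_counts, gen_counts, threshold=0.0):
    """Keep (source, target) pairs whose target side looks in-domain."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if log_odds(tgt, in_counts, gen_counts) > threshold
    ]


# Toy data (invented for illustration): customer samples, general text, crawled pairs.
in_domain = ["the minister announced new elections", "parliament voted on the budget"]
general = ["click here to buy cheap shoes", "free shipping on all orders"]
crawled = [
    ("der Minister kündigte Wahlen an", "the minister announced elections"),
    ("kostenloser Versand für Schuhe", "free shipping on shoes"),
]

in_counts = train_unigram(in_domain)
gen_counts = train_unigram(general)
selected = select_pairs(crawled, in_counts, gen_counts)
```

In this toy setup only the news-like pair survives the filter. A real system would score whole documents with a trained classifier, and the selected pairs would then be used to train the translation model.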

Change history

10 June 2021
In the original version of this chapter, Table 3 was not correct. This has been corrected.

Notes

1. As of May 2020, Google Translate provided riunione condominiale, which, although correct, is a bit too formal a term for this kind of meeting.
2. https://github.com/awslabs/sockeye [11].
3. https://github.com/marian-nmt/marian-examples/tree/336740065d9c23e53e912a1befff18981d9d27ab/wmt2017-transformer.

References

1. Ahmed, F., Shafiq, M.Z., Liu, A.X.: The internet is for porn: measurement and analysis of online adult traffic. ICDCS 2016, 88–97 (2016)
2. Axelrod, A., He, X., Gao, J.: Domain adaptation via pseudo in-domain data selection. EMNLP 2011, 355–362 (2011)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR 2015 (2015)
4. Bañón, M., et al.: ParaCrawl: web-scale acquisition of parallel corpora. ACL 2020, 4555–4567 (2020)
5. Biesinger, R.: Is your software racist? Politico (2018). https://www.politico.com/agenda/story/2018/02/07/algorithmic-bias-software-recommendations-000631
6. Bojar, O., et al.: Findings of the 2018 Conference on Machine Translation (WMT 2018), Belgium, Brussels, pp. 272–307 (2018)
7. Buck, C., Koehn, P.: Quick and reliable document alignment via TF/IDF-weighted cosine distance. In: WMT 2016, Berlin, Germany, pp. 672–678 (2016)
8. Chen, B., Huang, F.: Semi-supervised convolutional networks for translation adaptation with tiny amount of in-domain data. In: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, Berlin (2016)
9. Dinu, G., Mathur, P., Federico, M., Al-Onaizan, Y.: Training neural machine translation to apply terminology constraints. In: ACL 2019, Florence, Italy, pp. 3063–3068 (2019)
10. Gao, J., Goodman, J., Li, M., Lee, K.F.: Toward a unified approach to statistical language modeling for Chinese. In: ACM TALIP (2002)
11. Hieber, F., et al.: Sockeye: a toolkit for neural machine translation. CoRR (2017)
12. Junczys-Dowmunt, M., et al.: Marian: fast neural machine translation in C++. CoRR (2018)
13. Liu, L., Hong, Y., Liu, H., Wang, X., Yao, J.: Effective selection of translation model training data. In: ACL 2014, Baltimore, Maryland (2014)
14. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP 2015, Lisbon, Portugal (2015)
15. McCulloch, G.: Covid-19 is history’s biggest translation challenge. wired.com (2020). https://www.wired.com/story/covid-language-translation-problem/
16. Moore, R.C., Lewis, W.: Intelligent selection of language model training data. In: ACL 2010, Uppsala, Sweden, pp. 220–224 (2010)
17. Post, M.: A call for clarity in reporting BLEU scores. In: WMT 2018 (2018)
18. Smith, J.R., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., Lopez, A.: Dirt cheap web-scale parallel text from the common crawl. In: ACL 2013, Sofia, Bulgaria, pp. 1374–1383 (2013)
19. Uszkoreit, J., Ponte, J., Popat, A., Dubiner, M.: Large scale parallel document mining for machine translation. In: COLING 2010, Beijing, China (2010)
20. Vaswani, A., et al.: Attention is all you need. In: NIPS 2017, pp. 5998–6008 (2017)
21. Vu, T., Moschitti, A.: CDA: a cost efficient content-based multilingual web document aligner. In: EACL 2021 (2021)
22. Yasuda, K., Zhang, R., Yamamoto, H., Sumita, E.: Method of selecting training data to build a compact and efficient translation model. In: IJCNLP 2008 (2008)

Author information

Authors and Affiliations

Amazon Alexa AI, Manhattan Beach, CA, USA
Thuy Vu & Alessandro Moschitti

Corresponding author

Correspondence to Thuy Vu.

Editor information

Editors and Affiliations

Radboud University Nijmegen, Nijmegen, The Netherlands
Djoerd Hiemstra
Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium
Marie-Francine Moens
Toulouse Institute of Computer Science Research, Toulouse, France
Josiane Mothe
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Raffaele Perego
Leipzig University, Leipzig, Germany
Martin Potthast
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Fabrizio Sebastiani

About this paper

Cite this paper

Vu, T., Moschitti, A. (2021). Machine Translation Customization via Automatic Training Data Selection from the Web. In: Hiemstra, D., Moens, MF., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_44

DOI: https://doi.org/10.1007/978-3-030-72113-8_44
Published: 27 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72112-1
Online ISBN: 978-3-030-72113-8
eBook Packages: Computer Science, Computer Science (R0)
