Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation

Part of the Communications in Computer and Information Science book series (CCIS, volume 838)

Abstract

This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm that requires only minor tweaking to adapt it to a particular language, a property that makes it potentially useful for morphologically rich languages for which no morphological analysers are available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. We conducted experiments with two different neural machine translation (NMT) systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, the improvements were relatively minor, and our experiments show that machine translation involving inflected languages remains challenging, especially when translating into a highly inflected language.
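
To make the idea of prefix-root-postfix segmentation concrete, the following is a minimal sketch in Python. It is not the PRPE algorithm: the affix inventories, the `min_root` threshold, and the function name `segment` are hypothetical and hand-picked for illustration, whereas PRPE extracts its candidate sub-words from a corpus. The sketch only shows the shape of the result, a word split into an optional prefix, a root, and an optional postfix.

```python
# Toy illustration only: hand-picked affixes, NOT the PRPE algorithm.
PREFIXES = ("un", "re", "pre")   # hypothetical prefix inventory
POSTFIXES = ("ing", "ed", "s")   # hypothetical postfix inventory


def segment(word, min_root=3):
    """Split a word into [prefix, root, postfix]; empty parts are omitted.

    An affix is only stripped if at least min_root characters remain.
    """
    prefix = ""
    postfix = ""
    # Try longer prefixes first so that "pre" wins over "re" when both match.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_root:
            prefix = p
            break
    rest = word[len(prefix):]
    # Same longest-first rule for postfixes, applied to what is left.
    for s in sorted(POSTFIXES, key=len, reverse=True):
        if rest.endswith(s) and len(rest) - len(s) >= min_root:
            postfix = s
            break
    root = rest[:len(rest) - len(postfix)]
    return [part for part in (prefix, root, postfix) if part]
```

With these assumed inventories, `segment("unlocked")` yields the three parts `un`, `lock`, `ed`, while a short word such as `red` is left whole because stripping an affix would leave too short a root.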

Keywords

  • Neural machine translation
  • Processing morphologically rich languages
  • Word segmentation

Notes

  1. Source code available at: https://github.com/zuters/prpe.

  2. http://www.statmt.org/wmt17/translation-task.html.

  3. https://github.com/rsennrich/subword-nmt.

  4. https://github.com/EdinburghNLP/nematus.

  5. https://github.com/facebookresearch/fairseq-py.

  6. http://www.statmt.org/wmt17/results.html.

  7. Statistical significance was estimated via bootstrap resampling using the script analysis/bootstrap-hypothesis-difference-significance.pl from the Moses MT system: https://github.com/moses-smt/mosesdecoder.

  8. http://data.statmt.org/wmt17_systems/training.
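
The bootstrap resampling mentioned in the note on statistical significance can be illustrated with a simplified sketch in Python. This is not the Moses script: the function name `paired_bootstrap` and its parameters are hypothetical, and the sketch assumes per-sentence quality scores that are averaged, whereas a real BLEU comparison would recompute corpus-level BLEU on each resampled test set.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which A's mean beats B's.

    scores_a and scores_b are per-sentence scores of two systems on the
    same test sentences (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; reuse the draw for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_samples
```

A returned fraction close to 1.0 suggests that system A's advantage is stable across resampled test sets; a value near 0.5 suggests the difference may be due to chance.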

Acknowledgements

The research has been supported by the European Regional Development Fund within the research project “Neural Network Modelling for Inflected Natural Languages” No. 1.1.1.1/16/A/215, and the Faculty of Computing, University of Latvia.

Author information

Corresponding author

Correspondence to Jānis Zuters.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zuters, J., Strazds, G., Immers, K. (2018). Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation. In: Lupeikiene, A., Vasilecas, O., Dzemyda, G. (eds) Databases and Information Systems. DB&IS 2018. Communications in Computer and Information Science, vol 838. Springer, Cham. https://doi.org/10.1007/978-3-319-97571-9_23

  • DOI: https://doi.org/10.1007/978-3-319-97571-9_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97570-2

  • Online ISBN: 978-3-319-97571-9

  • eBook Packages: Computer Science, Computer Science (R0)