Machine Translation, Volume 27, Issue 2, pp 139–166

Substring-based machine translation

  • Graham Neubig
  • Taro Watanabe
  • Shinsuke Mori
  • Tatsuya Kawahara
Article

DOI: 10.1007/s10590-013-9136-6

Cite this article as:
Neubig, G., Watanabe, T., Mori, S. et al. Machine Translation (2013) 27: 139. doi:10.1007/s10590-013-9136-6

Abstract

Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
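As a rough illustration of what "a single transduction between character strings" can mean in practice, the sketch below converts a whitespace-delimited sentence into a sequence of character tokens so that a standard phrase-based pipeline could treat each character as a "word". This is not taken from the article: the function names `to_char_tokens` and `from_char_tokens` and the use of `_` as a word-boundary marker are assumptions made for illustration only.

```python
def to_char_tokens(sentence: str, space_marker: str = "_") -> str:
    """Return the sentence as space-separated characters.

    Word boundaries are kept as an explicit marker token (hypothetical
    convention) so they can be restored after translation.
    """
    # split() with no arguments collapses runs of whitespace.
    words = sentence.split()
    # Join words with the marker, then put a space between every character.
    return " ".join(space_marker.join(words))


def from_char_tokens(tokens: str, space_marker: str = "_") -> str:
    """Invert to_char_tokens: drop token separators, restore word spaces.

    Note: lossy if the original text itself contains the marker character.
    """
    return "".join(tokens.split()).replace(space_marker, " ")


if __name__ == "__main__":
    chars = to_char_tokens("substring based translation")
    print(chars)
    print(from_char_tokens(chars))
```

With input in this form, an aligner sees only characters, which is why the abstract stresses many-to-many alignment: useful translation units are then arbitrary substrings rather than whitespace-delimited words.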

Keywords

  • Character-based translation
  • Alignment
  • Inversion transduction grammar

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Graham Neubig (1)
  • Taro Watanabe (2)
  • Shinsuke Mori (3)
  • Tatsuya Kawahara (3)

  1. Nara Institute of Science and Technology, Ikoma, Japan
  2. National Institute of Information and Communications Technology, Kyoto, Japan
  3. Kyoto University, Kyoto, Japan
