Machine Translation, Volume 27, Issue 2, pp 139–166

Substring-based machine translation

  • Graham Neubig
  • Taro Watanabe
  • Shinsuke Mori
  • Tatsuya Kawahara

Abstract

Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
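The character-string view described above can be illustrated with a minimal sketch. It is not the authors' implementation. It only shows how a sentence might be turned into a sequence of single-character tokens, and which substrings a many-to-many aligner could then consider. The boundary marker `_` and the function names `to_char_tokens`, `from_char_tokens`, and `substrings` are assumptions made for this illustration.

```python
from typing import List, Tuple

# Hypothetical marker standing in for a word boundary (an assumption of this
# sketch); the input is assumed not to contain this character already.
SPACE = "_"


def to_char_tokens(sentence: str, space: str = SPACE) -> List[str]:
    """Split a sentence into single-character tokens.

    Runs of whitespace are collapsed, and each word boundary becomes
    one marker token so the original spacing can be restored later.
    """
    normalized = " ".join(sentence.split())
    return [space if ch == " " else ch for ch in normalized]


def from_char_tokens(tokens: List[str], space: str = SPACE) -> str:
    """Invert to_char_tokens: join characters and restore spaces."""
    return "".join(" " if tok == space else tok for tok in tokens)


def substrings(tokens: List[str], max_len: int) -> List[Tuple[str, ...]]:
    """Enumerate every contiguous substring of at most max_len tokens.

    These are the candidate units that a many-to-many aligner could
    pair with substrings on the other side of a sentence pair.
    """
    n = len(tokens)
    return [
        tuple(tokens[i:j])
        for i in range(n)
        for j in range(i + 1, min(i + max_len, n) + 1)
    ]


if __name__ == "__main__":
    chars = to_char_tokens("the  cat")
    print(chars)
    print(from_char_tokens(chars))
    print(substrings(list("abc"), 2))
```

Because the output is an ordinary token sequence, a phrase-based pipeline can in principle consume it unchanged. The quality of the result then depends on the aligner, which is the point the abstract makes.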

Keywords

  • Character-based translation
  • Alignment
  • Inversion transduction grammar

References

  1. Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86
  2. Al-Onaizan Y, Knight K (2002) Translating named entities using monolingual and bilingual resources. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 400–408
  3. Bai MH, Chen KJ, Chang JS (2008) Improving word alignment by adjusting Chinese word segmentation. In: IJCNLP 2008, Proceedings of the 3rd International Joint Conference on Natural Language Processing, Hyderabad, pp 249–256
  4. Blunsom P, Cohn T (2010) Inducing synchronous grammars with slice sampling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Los Angeles, pp 238–241
  5. Blunsom P, Cohn T, Dyer C, Osborne M (2009) A Gibbs sampler for phrasal synchronous grammar induction. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 782–790
  6. Bojar O (2007) English-to-Czech factored machine translation. In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp 232–239
  7. Brown PF, Della-Pietra VJ, Della-Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19:263–312
  8. Brown RD (2002) Corpus-driven splitting of compound words. In: TMI-2002 Conference: Proceedings of the 9th International Conference on Theoretical and Methodological Issues in Machine Translation, Keihanna, pp 12–21
  9. Chang PC, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation. Proceedings of the Workshop, Columbus, pp 224–232
  10. Chomsky N (1956) Three models for the description of language. IRE Trans Inf Theory 2(3):113–124
  11. Chu C, Nakazawa T, Kawahara D, Kurohashi S (2012) Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese–Japanese machine translation. In: EAMT 2012, Proceedings of the 16th Annual Conference of the European Association for Machine Translation, Trento, pp 35–42
  12. Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: EMNLP 2009: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp 718–726
  13. Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Machine Translation: from Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, pp 48–57
  14. Cromières F (2006) Sub-sentential alignment using substring co-occurrence counts. In: COLING-ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Student Research Workshop, Sydney, pp 13–18
  15. DeNero J, Bouchard-Côté A, Klein D (2008) Sampling alignment structure under a Bayesian translation model. In: EMNLP 2008: 2008 Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Honolulu, pp 314–323
  16. Denkowski M, Lavie A (2011) Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the 6th Workshop on Statistical Machine Translation (WMT), Edinburgh, pp 85–91
  17. Denoual E, Lepage Y (2005) BLEU in characters: towards automatic MT evaluation in languages without word delimiters. In: Proceedings of the 2nd International Joint Conference on Natural Language Processing, IJCNLP-05, Jeju Island, pp 81–86
  18. Finch A, Sumita E (2007) Phrase-based machine transliteration. In: Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), Hyderabad, pp 13–18
  19. Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: HLT/EMNLP 2005: Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Vancouver, British Columbia, pp 676–683
  20. Haghighi A, Blitzer J, DeNero J, Klein D (2009) Better word alignments with supervised ITG models. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 923–931
  21. Jiang J, Ahmed Z, Carson-Berndsen J, Cahill P, Way A (2011) Phonetic representation-based speech translation. In: Proceedings of Machine Translation Summit XIII, Xiamen, pp 81–88
  22. Karlsson F (1999) Finnish: an essential grammar. Routledge, London
  23. Klein D, Manning CD (2003) A* parsing: fast exact Viterbi parse selection. In: HLT-NAACL 2003: Conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series, Edmonton, pp 40–47
  24. Kneser R, Ney H (1995) Improved backing-off for M-gram language modelling. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Detroit, pp 181–184
  25. Knight K, Graehl J (1998) Machine transliteration. Comput Linguist 24(4):599–612
  26. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT Summit X: The Tenth Machine Translation Summit, Phuket, pp 79–86
  27. Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D (2005) Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: International Workshop on Spoken Language Translation: Evaluation Campaign on Spoken Language Translation [IWSLT 2005], Pittsburgh, 8 pp [no page numbers]
  28. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: ACL 2007: proceedings of demo and poster sessions, Prague, Czech Republic, pp 177–180
  29. Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series, Edmonton, pp 48–54
  30. Kondrak G, Marcu D, Knight K (2003) Cognates can improve statistical translation models. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series, Edmonton, pp 46–48
  31. Lee YS (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Boston, Massachusetts, pp 57–60
  32. Levenberg A, Dyer C, Blunsom P (2012) A Bayesian model for learning SCFGs with discontiguous rules. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Jeju Island, pp 223–232
  33. Li H, Zhang M, Su J (2004) A joint source-channel model for machine transliteration. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, pp 159–166
  34. Li M, Zong C, Ng HT (2011) Automatic evaluation of Chinese translation output: word-level or character-level? In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, pp 159–164
  35. Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Montreal, pp 104–111
  36. Liu C, Ng HT (2012) Character-level machine translation evaluation for languages with ambiguous word boundaries. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Republic of Korea, pp 921–929
  37. Macherey K, Dai A, Talbot D, Popat A, Och F (2011) Language-independent compound splitting with morphological operations. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, pp 1395–1404
  38. Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, pp 133–139
  39. Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, pp 301–305
  40. Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, pp 895–904
  41. Neubig G (2011) The Kyoto free translation task. http://www.phontron.com/kftt. Accessed 16 May 2011
  42. Neubig G, Watanabe T, Mori S, Kawahara T (2012) Machine translation without words through substring alignment. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Republic of Korea, pp 165–174
  43. Neubig G, Watanabe T, Sumita E, Mori S, Kawahara T (2011) An unsupervised model for joint phrase alignment and extraction. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, pp 632–641
  44. Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Coling 2010, 23rd International Conference on Computational Linguistics. Proceedings of the Conference, Beijing, pp 815–823
  45. Nießen S, Ney H (2000) Improving SMT quality with morpho-syntactic analysis. In: The 18th International Conference on Computational Linguistics, COLING 2000 in Europe. Proceedings of the Conference, Saarbrücken, pp 1081–1085
  46. Och FJ (2003) Minimum error rate training in statistical machine translation. In: ACL-2003: 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, pp 160–167
  47. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
  48. Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 311–318
  49. Saers M, Nivre J, Wu D (2009) Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In: IWPT-09: Proceedings of the 11th International Conference on Parsing Technologies, Paris, pp 29–32
  50. Snyder B, Barzilay R (2008) Unsupervised multilingual learning for morphological segmentation. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 737–745
  51. Sornlertlamvanich V, Mokarat C, Isahara H (2008) Thai-Lao machine translation based on phoneme transfer. In: Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, Tokyo, pp 65–68
  52. Subotin M (2011) An exponential translation model for target language morphology. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, pp 230–238
  53. Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: COLING-ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Conference, Sydney, pp 969–976
  54. Tiedemann J (2009) Character-based PSMT for closely related languages. In: EAMT-2009: Proceedings of the 13th Annual Conference of the European Association for Machine Translation, Barcelona, pp 12–19
  55. Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: EACL 2012: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, pp 141–151
  56. Vilar D, Peter JT, Ney H (2007) Can we translate letters? In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp 33–39
  57. Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING-96: The 16th International Conference on Computational Linguistics, Proceedings, Copenhagen, pp 836–841
  58. Wang Y, Uchimoto K, Kazama J, Kruengkrai C, Torisawa K (2010) Adapting Chinese word segmentation for machine translation based on short units. In: LREC 2010: Proceedings of the Seventh International Conference on Language Resources and Evaluation, La Valetta, Malta, pp 1758–1764
  59. Wu D (1997) Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput Linguist 23(3):377–403
  60. Xu J, Zens R, Ney H (2004) Do we need Chinese word segmentation for statistical machine translation? In: Proceedings of the 3rd SIGHAN Workshop on Chinese Language Processing, Barcelona, pp 122–128
  61. Zhang H, Gildea D (2005) Stochastic lexicalized inversion transduction grammar for alignment. In: ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 475–482
  62. Zhang H, Quirk C, Moore RC, Gildea D (2008a) Bayesian learning of non-compositional phrases with synchronous parsing. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 97–105
  63. Zhang R, Yasuda K, Sumita E (2008b) Improved statistical machine translation by multiple Chinese word segmentation. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation, Proceedings of the Workshop, Columbus, pp 216–223

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Graham Neubig (1)
  • Taro Watanabe (2)
  • Shinsuke Mori (3)
  • Tatsuya Kawahara (3)

  1. Nara Institute of Science and Technology, Ikoma, Japan
  2. National Institute of Information and Communications Technology, Kyoto, Japan
  3. Kyoto University, Kyoto, Japan
