
Substring-based machine translation

Machine Translation

Abstract

Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
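To make the idea of translating between character strings concrete, the following is a minimal sketch of how a sentence might be turned into a stream of single-character tokens before being passed to a phrase-based system. The function names and the use of `_` as a word-boundary symbol are assumptions made for illustration; they are not taken from the paper.

```python
def to_char_tokens(sentence, boundary="_"):
    """Split a sentence into single-character tokens.

    Spaces are replaced by a boundary symbol (hypothetical choice: "_")
    so that word boundaries survive as ordinary tokens.
    """
    return [boundary if ch == " " else ch for ch in sentence]


def from_char_tokens(tokens, boundary="_"):
    """Invert to_char_tokens by joining tokens and restoring spaces."""
    return "".join(" " if tok == boundary else tok for tok in tokens)


tokens = to_char_tokens("machine translation")
restored = from_char_tokens(tokens)
```

The round trip is only lossless when the input does not already contain the boundary symbol, so a real pipeline would need to pick a symbol that never occurs in the data.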


Notes

  1. Some previous work has also performed alignment using morphological analyzers to normalize or split the sentence into morpheme streams (Corston-Oliver and Gamon 2004).

  2. Null alignments can be represented implicitly with no span in \(\boldsymbol{a}_1^K\) covering the unaligned words.

  3. Here we are specifically referring to a special case of ITGs with only a single symbol each for straight and inverted productions, which is also known as the bracketing ITG. ITGs with multiple straight and inverted terminals are also conceivable, but are generally not used in alignment as they significantly increase the computational burden of learning the ITG.

  4. It is also likely that the look-ahead probabilities could be integrated into the auxiliary variable sampling function for slice sampling to improve efficiency while maintaining correctness guarantees, an interesting challenge that we will leave to future work.

  5. It should be noted that we are not counting duplicate occurrences of substrings in a single sentence. This was a design choice to prevent the over-counting of one-character or very short strings that tend to occur many times in a single sentence.

  6. Using the open-source implementation esaxx http://code.google.com/p/esaxx/.

  7. http://www.statmt.org/wpt05/mt-shared-task/.

  8. The 100-character limit results in the use of somewhat shorter sentences than when using limits based on words. For example, using a more traditional limit of a maximum of 40 words on both sides for Japanese-English results in a total of 5.91M words of English, 2.7 times greater than when a 100-character limit is used. The 100-character limit was mainly for efficient experimentation in the character-based models, and we describe possible directions for raising this limit in the future work section.

  9. http://phontron.com/pialign/.

  10. This setup was chosen to minimize the effect of the tuning criterion on the comparison between the baseline and the proposed system, although it does imply that we must have access to tokenized data for the development set.

  11. We also performed experiments in which we incorporated a word-based language model in character-based translation, but found that this consistently gave neutral to negative results, a similar finding to that of Vilar et al. (2007). We suspect that this is due to the fact that word-based language models assign a sudden, large penalty when a word completes, hurting decoding. In addition, the modeling of unknown words is not trivial, and while we provided a fixed penalty for each unknown word (tuned using MERT), a more sophisticated unknown word model is probably necessary.

  12. Character-based BLEU and word-based BLEU showed similar relative gains.

  13. These numbers were produced with a different version of Moses than the numbers in previous sections, so should not be directly compared.
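Notes 5 and 8 describe two simple data-handling choices: counting each substring at most once per sentence, and discarding sentences longer than 100 characters. The sketch below illustrates both under stated assumptions. It enumerates substrings by brute force up to a hypothetical `max_len`, whereas the paper uses enhanced suffix arrays (note 6), and it assumes that the character limit is applied to both sides of a sentence pair. All function and parameter names are illustrative.

```python
from collections import Counter


def within_char_limit(src, trg, max_chars=100):
    """Keep a sentence pair only if both sides fit the character limit.

    Applying the limit to both sides is an assumption of this sketch.
    """
    return len(src) <= max_chars and len(trg) <= max_chars


def substring_sentence_counts(sentences, max_len=3):
    """Count, for each substring, the number of sentences that contain it.

    Duplicate occurrences within one sentence are counted once (note 5).
    max_len is a hypothetical cap that keeps the brute-force loop small.
    """
    counts = Counter()
    for sent in sentences:
        seen = set()  # substrings already found in this sentence
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                seen.add(sent[i:j])
        counts.update(seen)  # each distinct substring adds 1
    return counts


counts = substring_sentence_counts(["aaa", "ab"], max_len=2)
```

Here the substring "a" occurs three times in the first sentence but contributes only one count from it, which is the behavior note 5 describes for very short strings.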

References

  • Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86


  • Al-Onaizan Y, Knight K (2002) Translating named entities using monolingual and bilingual resources. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 400–408

  • Bai MH, Chen KJ, Chang JS (2008) Improving word alignment by adjusting Chinese word segmentation. In: IJCNLP 2008, Proceedings of the 3rd International Joint Conference on Natural Language Processing. Hyderabad, pp 249–256

  • Blunsom P, Cohn T (2010) Inducing synchronous grammars with slice sampling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Los Angeles, pp 238–241

  • Blunsom P, Cohn T, Dyer C, Osborne M (2009) A Gibbs sampler for phrasal synchronous grammar induction. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 782–790

  • Bojar O (2007) English-to-Czech factored machine translation. In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation. Czech Republic, Prague, pp 232–239

  • Brown PF, Della-Pietra VJ, Della-Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19:263–312


  • Brown RD (2002) Corpus-driven splitting of compound words. In: TMI-2002 Conference: Proceedings of the 9th International Conference on Theoretical and Methodological issues in Machine Translation, Keihanna, pp 12–21

  • Chang PC, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation. Proceedings of the Workshop, Columbus, pp 224–232

  • Chomsky N (1956) Three models for the description of language. IRE Trans Inf Theory 2(3):113–124


  • Chu C, Nakazawa T, Kawahara D, Kurohashi S (2012) Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese–Japanese machine translation. In: EAMT 2012, Proceedings of the 16th Annual Conference of the European Association for Machine Translation. Trento, pp 35–42

  • Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: EMNLP 2009: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp 718–726

  • Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Machine Translation: from Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, pp 48–57

  • Cromières F (2006) Sub-sentential alignment using substring co-occurrence counts. In: COLING—ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Student Research Workshop, Sydney, pp 13–18

  • DeNero J, Bouchard-Côté A, Klein D (2008) Sampling alignment structure under a Bayesian translation model. In: EMNLP 2008: 2008 Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Honolulu, pp 314–323

  • Denkowski M, Lavie A (2011) Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the 6th Workshop on Statistical Machine Translation (WMT), Edinburgh, pp 85–91

  • Denoual E, Lepage Y (2005) BLEU in characters: towards automatic MT evaluation in languages without word delimiters. In: Proceedings of the 2nd International Joint Conference on Natural Language Processing, IJCNLP-05, Jeju Island, pp 81–86

  • Finch A, Sumita E (2007) Phrase-based machine transliteration. In: Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), Hyderabad, pp 13–18

  • Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: HLT/EMNLP 2005: Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Vancouver, British Columbia, pp 676–683

  • Haghighi A, Blitzer J, DeNero J, Klein D (2009) Better word alignments with supervised ITG models. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 923–931

  • Jiang J, Ahmed Z, Carson-Berndsen J, Cahill P, Way A (2011) Phonetic representation-based speech translation. In: Proceedings of Machine Translation Summit XIII, Xiamen, pp 81–88

  • Karlsson F (1999) Finnish: an essential grammar. Routledge, London


  • Klein D, Manning CD (2003) A* parsing: fast exact Viterbi parse selection. In: HLT-NAACL 2003: Conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 40–47

  • Kneser R, Ney H (1995) Improved backing-off for M-gram language modelling. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Detroit, pp 181–184

  • Knight K, Graehl J (1998) Machine transliteration. Comput Linguist 24(4):599–612


  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT Summit X: The Tenth Machine Translation Summit, Phuket, pp 79–86

  • Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D (2005) Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: International Workshop on Spoken Language Translation: Evaluation Campaign on Spoken Language Translation [IWSLT 2005], Pittsburgh, 8pp [no page numbers]

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: ACL 2007: proceedings of demo and poster sessions. Czech Republic, Prague, pp 177–180

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 48–54

  • Kondrak G, Marcu D, Knight K (2003) Cognates can improve statistical translation models. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 46–48

  • Lee YS (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Boston, Massachusetts, pp 57–60

  • Levenberg A, Dyer C, Blunsom P (2012) A Bayesian model for learning SCFGs with discontiguous rules. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Jeju Island, pp 223–232

  • Li H, Zhang M, Su J (2004) A joint source-channel model for machine transliteration. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, pp 159–166

  • Li M, Zong C, Ng HT (2011) Automatic evaluation of Chinese translation output: word-level or character-level? In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). Portland, pp 159–164

  • Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL). Montreal, pp 104–111

  • Liu C, Ng HT (2012) Character-level machine translation evaluation for languages with ambiguous word boundaries. In: [ACL, 2012] Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Republic of Korea, pp 921–929

  • Macherey K, Dai A, Talbot D, Popat A, Och F (2011) Language-independent compound splitting with morphological operations. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 1395–1404

  • Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing. Philadelphia, pp 133–139

  • Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: [ACL, 2012] Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Republic of Korea, Jeju, pp 301–305

  • Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 895–904

  • Neubig G (2011) The Kyoto free translation task. http://www.phontron.com/kftt. Accessed 16 May 2011

  • Neubig G, Watanabe T, Mori S, Kawahara T (2012) Machine translation without words through substring alignment. In: [ACL, 2012] Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Republic of Korea, pp 165–174

  • Neubig G, Watanabe T, Sumita E, Mori S, Kawahara T (2011) An unsupervised model for joint phrase alignment and extraction. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 632–641

  • Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Coling 2010, 23rd International Conference on Computational Linguistics. Proceedings of the Conference, Beijing, pp 815–823

  • Nießen S, Ney H (2000) Improving SMT quality with morpho-syntactic analysis. In: The 18th International Conference on Computational Linguistics, COLING 2000 in Europe. Proceedings of the Conference, Saarbrücken, pp 1081–1085

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: ACL-2003: 41st Annual meeting of the Association for Computational Linguistics, Sapporo, pp 160–167

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51


  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 311–318

  • Saers M, Nivre J, Wu D (2009) Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In: IWPT-09: Proceedings of the 11th International Conference on Parsing Technologies, Paris, pp 29–32

  • Snyder B, Barzilay R (2008) Unsupervised multilingual learning for morphological segmentation. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 737–745

  • Sornlertlamvanich V, Mokarat C, Isahara H (2008) Thai-Lao machine translation based on phoneme transfer. In: Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, Tokyo, pp 65–68

  • Subotin M (2011) An exponential translation model for target language morphology. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp 230–238

  • Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: COLING—ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Conference, Sydney, pp 969–976

  • Tiedemann J (2009) Character-based PSMT for closely related languages. In: EAMT-2009: Proceedings of the 13th Annual Conference of the European Association for Machine Translation, Barcelona, pp 12–19

  • Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: [EACL 2012] Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, pp 141–151

  • Vilar D, Peter JT, Ney H (2007) Can we translate letters? In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation. Czech Republic, Prague, pp 33–39

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING-96: The 16th International Conference on Computational Linguistics, Proceedings, Copenhagen, pp 836–841

  • Wang Y, Uchimoto K, Kazama J, Kruengkrai C, Torisawa K (2010) Adapting Chinese word segmentation for machine translation based on short units. In: LREC 2010: proceedings of the seventh international conference on Language Resources and Evaluation. La Valetta, Malta, pp 1758–1764

  • Wu D (1997) Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput Linguist 23(3):377–403


  • Xu J, Zens R, Ney H (2004) Do we need Chinese word segmentation for statistical machine translation? In: Proceedings of the 3rd SIGHAN workshop on Chinese language processing. Barcelona, pp 122–128

  • Zhang H, Gildea D (2005) Stochastic lexicalized inversion transduction grammar for alignment. In: ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 475–482

  • Zhang H, Quirk C, Moore RC, Gildea D (2008a) Bayesian learning of non-compositional phrases with synchronous parsing. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 97–105

  • Zhang R, Yasuda K, Sumita E (2008b) Improved statistical machine translation by multiple Chinese word segmentation. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation, Proceedings of the Workshop, Columbus, pp 216–223


Author information


Correspondence to Graham Neubig.


About this article

Cite this article

Neubig, G., Watanabe, T., Mori, S. et al. Substring-based machine translation. Machine Translation 27, 139–166 (2013). https://doi.org/10.1007/s10590-013-9136-6


  • DOI: https://doi.org/10.1007/s10590-013-9136-6
