Abstract
Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
Notes
Some previous work has also performed alignment using morphological analyzers to normalize or split the sentence into morpheme streams (Corston-Oliver and Gamon 2004).
Null alignments can be represented implicitly with no span in \(\boldsymbol{a}_1^K\) covering the unaligned words.
Here we are specifically referring to a special case of ITGs with only a single symbol each for straight and inverted productions, which is also known as the bracketing ITG. ITGs with multiple straight and inverted terminals are also conceivable, but are generally not used in alignment as they significantly increase the computational burden of learning the ITG.
It is also likely that the look-ahead probabilities could be integrated into the auxiliary variable sampling function for slice sampling to improve efficiency while maintaining correctness guarantees, an interesting challenge that we will leave to future work.
It should be noted that we are not counting duplicate occurrences of substrings in a single sentence. This was a design choice to prevent the over-counting of one-character or very short strings that tend to occur many times in a single sentence.
Using the open-source implementation esaxx http://code.google.com/p/esaxx/.
The 100-character limit results in the use of somewhat shorter sentences than when using limits based on words. For example, using a more traditional limit of a maximum of 40 words on both sides for Japanese-English results in a total of 5.91M words of English, 2.7 times greater than when a 100-character limit is used. The 100-character limit was mainly for efficient experimentation in the character-based models, and we describe possible directions for raising this limit in the future work section.
This setup was chosen to minimize the effect of the tuning criterion on the comparison between the baseline and the proposed system, although it does imply that we must have access to tokenized data for the development set.
We also performed experiments in which we incorporated a word-based language model in character-based translation, but found that this consistently gave neutral to negative results, a similar finding to that of Vilar et al. (2007). We suspect that this is due to the fact that word-based language models assign a sudden, large penalty when a word completes, hurting decoding. In addition, the modeling of unknown words is not trivial, and while we provided a fixed penalty for each unknown word (tuned using MERT), a more sophisticated unknown word model is probably necessary.
Character-based BLEU and word-based BLEU showed similar relative gains.
These numbers were produced with a different version of Moses than the numbers in previous sections, so they should not be directly compared.
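The note above on not counting duplicate occurrences of a substring within a single sentence can be illustrated with a minimal sketch. It assumes brute-force enumeration of substrings up to a hypothetical `max_len`, which is not the enhanced-suffix-array approach (esaxx) that the notes mention; the function name and parameters are illustrative only.

```python
from collections import Counter


def sentence_level_substring_counts(sentences, max_len=3):
    """Count, for each substring, how many sentences contain it.

    Each substring is counted at most once per sentence, so a short
    string repeated many times in one sentence does not inflate its count.
    NOTE: brute-force enumeration, an assumption made for illustration.
    """
    counts = Counter()
    for sentence in sentences:
        seen = set()  # distinct substrings of this sentence only
        for start in range(len(sentence)):
            last = min(start + max_len, len(sentence))
            for end in range(start + 1, last + 1):
                seen.add(sentence[start:end])
        counts.update(seen)  # each distinct substring adds 1
    return counts


counts = sentence_level_substring_counts(["banana", "bandana"], max_len=2)
print(counts["a"], counts["an"], counts["nd"])  # prints: 2 2 1
```

Here "a" occurs three times in each sentence but contributes only one count per sentence, giving a total of 2 rather than 6.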
References
Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discret Algorithms 2(1):53–86
Al-Onaizan Y, Knight K (2002) Translating named entities using monolingual and bilingual resources. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 400–408
Bai MH, Chen KJ, Chang JS (2008) Improving word alignment by adjusting Chinese word segmentation. In: IJCNLP 2008, Proceedings of the 3rd International Joint Conference on Natural Language Processing. Hyderabad, pp 249–256
Blunsom P, Cohn T (2010) Inducing synchronous grammars with slice sampling. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Los Angeles, pp 238–241
Blunsom P, Cohn T, Dyer C, Osborne M (2009) A Gibbs sampler for phrasal synchronous grammar induction. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 782–790
Bojar O (2007) English-to-Czech factored machine translation. In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic, pp 232–239
Brown PF, Della-Pietra VJ, Della-Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19:263–312
Brown RD (2002) Corpus-driven splitting of compound words. In: TMI-2002 Conference: Proceedings of the 9th International Conference on Theoretical and Methodological issues in Machine Translation, Keihanna, pp 12–21
Chang PC, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation. Proceedings of the Workshop, Columbus, pp 224–232
Chomsky N (1956) Three models for the description of language. IRE Trans Inf Theory 2(3):113–124
Chu C, Nakazawa T, Kawahara D, Kurohashi S (2012) Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese–Japanese machine translation. In: EAMT 2012, Proceedings of the 16th Annual Conference of the European Association for Machine Translation. Trento, pp 35–42
Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: EMNLP 2009: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp 718–726
Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Machine Translation: from Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, pp 48–57
Cromières F (2006) Sub-sentential alignment using substring co-occurrence counts. In: COLING-ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Student Research Workshop, Sydney, pp 13–18
DeNero J, Bouchard-Côté A, Klein D (2008) Sampling alignment structure under a Bayesian translation model. In: EMNLP 2008: 2008 Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Honolulu, pp 314–323
Denkowski M, Lavie A (2011) Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the 6th Workshop on Statistical Machine Translation (WMT), Edinburgh, pp 85–91
Denoual E, Lepage Y (2005) BLEU in characters: towards automatic MT evaluation in languages without word delimiters. In: Proceedings of the 2nd International Joint Conference on Natural Language Processing, IJCNLP-05, Jeju Island, pp 81–86
Finch A, Sumita E (2007) Phrase-based machine transliteration. In: Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), Hyderabad, pp 13–18
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: HLT/EMNLP 2005: Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Proceedings of the Conference, Vancouver, British Columbia, pp 676–683
Haghighi A, Blitzer J, DeNero J, Klein D (2009) Better word alignments with supervised ITG models. In: ACL-IJCNLP 2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP. Proceedings of the Conference, Suntec, pp 923–931
Jiang J, Ahmed Z, Carson-Berndsen J, Cahill P, Way A (2011) Phonetic representation-based speech translation. In: Proceedings of Machine Translation Summit XIII, Xiamen, pp 81–88
Karlsson F (1999) Finnish: an essential grammar. Routledge, London
Klein D, Manning CD (2003) A* parsing: fast exact Viterbi parse selection. In: HLT-NAACL 2003: Conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 40–47
Kneser R, Ney H (1995) Improved backing-off for M-gram language modelling. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Detroit, pp 181–184
Knight K, Graehl J (1998) Machine transliteration. Comput Linguist 24(4):599–612
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT Summit X: The Tenth Machine Translation Summit, Phuket, pp 79–86
Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D (2005) Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: International Workshop on Spoken Language Translation: Evaluation Campaign on Spoken Language Translation [IWSLT 2005], Pittsburgh, 8pp [no page numbers]
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: ACL 2007: proceedings of demo and poster sessions. Prague, Czech Republic, pp 177–180
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 48–54
Kondrak G, Marcu D, Knight K (2003) Cognates can improve statistical translation models. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series. Edmonton, pp 46–48
Lee YS (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Proceedings of the Main Conference, Boston, Massachusetts, pp 57–60
Levenberg A, Dyer C, Blunsom P (2012) A Bayesian model for learning SCFGs with discontiguous rules. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Jeju Island, pp 223–232
Li H, Zhang M, Su J (2004) A joint source-channel model for machine transliteration. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, pp 159–166
Li M, Zong C, Ng HT (2011) Automatic evaluation of Chinese translation output: word-level or character-level? In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). Portland, pp 159–164
Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL). Montreal, pp 104–111
Liu C, Ng HT (2012) Character-level machine translation evaluation for languages with ambiguous word boundaries. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Republic of Korea, pp 921–929
Macherey K, Dai A, Talbot D, Popat A, Och F (2011) Language-independent compound splitting with morphological operations. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 1395–1404
Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing. Philadelphia, pp 133–139
Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Republic of Korea, pp 301–305
Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 895–904
Neubig G (2011) The Kyoto free translation task. http://www.phontron.com/kftt. Accessed 16 May 2011
Neubig G, Watanabe T, Mori S, Kawahara T (2012) Machine translation without words through substring alignment. In: ACL 2012: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Republic of Korea, pp 165–174
Neubig G, Watanabe T, Sumita E, Mori S, Kawahara T (2011) An unsupervised model for joint phrase alignment and extraction. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, pp 632–641
Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Coling 2010, 23rd International Conference on Computational Linguistics. Proceedings of the Conference, Beijing, pp 815–823
Nießen S, Ney H (2000) Improving SMT quality with morpho-syntactic analysis. In: The 18th International Conference on Computational Linguistics, COLING 2000 in Europe. Proceedings of the Conference, Saarbrücken, pp 1081–1085
Och FJ (2003) Minimum error rate training in statistical machine translation. In: ACL-2003: 41st Annual meeting of the Association for Computational Linguistics, Sapporo, pp 160–167
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Philadelphia, pp 311–318
Saers M, Nivre J, Wu D (2009) Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In: IWPT-09: Proceedings of the 11th International Conference on Parsing Technologies, Paris, pp 29–32
Snyder B, Barzilay R (2008) Unsupervised multilingual learning for morphological segmentation. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 737–745
Sornlertlamvanich V, Mokarat C, Isahara H (2008) Thai-Lao machine translation based on phoneme transfer. In: Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, Tokyo, pp 65–68
Subotin M (2011) An exponential translation model for target language morphology. In: ACL-HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp 230–238
Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: COLING-ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Conference, Sydney, pp 969–976
Tiedemann J (2009) Character-based PSMT for closely related languages. In: EAMT-2009: Proceedings of the 13th Annual Conference of the European Association for Machine Translation, Barcelona, pp 12–19
Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: EACL 2012: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, pp 141–151
Vilar D, Peter JT, Ney H (2007) Can we translate letters? In: ACL 2007: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic, pp 33–39
Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING-96: The 16th International Conference on Computational Linguistics, Proceedings, Copenhagen, pp 836–841
Wang Y, Uchimoto K, Kazama J, Kruengkrai C, Torisawa K (2010) Adapting Chinese word segmentation for machine translation based on short units. In: LREC 2010: proceedings of the seventh international conference on Language Resources and Evaluation. La Valetta, Malta, pp 1758–1764
Wu D (1997) Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput Linguist 23(3):377–403
Xu J, Zens R, Ney H (2004) Do we need Chinese word segmentation for statistical machine translation? In: Proceedings of the 3rd SIGHAN workshop on Chinese language processing. Barcelona, pp 122–128
Zhang H, Gildea D (2005) Stochastic lexicalized inversion transduction grammar for alignment. In: ACL-05: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 475–482
Zhang H, Quirk C, Moore RC, Gildea D (2008a) Bayesian learning of non-compositional phrases with synchronous parsing. In: ACL-08: HLT, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, Columbus, pp 97–105
Zhang R, Yasuda K, Sumita E (2008b) Improved statistical machine translation by multiple Chinese word segmentation. In: ACL-08: HLT: Third Workshop on Statistical Machine Translation, Proceedings of the Workshop, Columbus, pp 216–223
Cite this article
Neubig, G., Watanabe, T., Mori, S. et al. Substring-based machine translation. Machine Translation 27, 139–166 (2013). https://doi.org/10.1007/s10590-013-9136-6
DOI: https://doi.org/10.1007/s10590-013-9136-6