Syntax-Based Pre-reordering for Chinese-to-Japanese Statistical Machine Translation

Chapter in Hybrid Approaches to Machine Translation

Abstract

Translating between languages with different word orders poses additional difficulties. In this chapter, we introduce some of these difficulties and describe two syntax-based approaches to addressing them. First, we describe an approach that exploits regularities in the differences of phrase head locations between Chinese and Japanese, and we formalize rules that reorder branches of constituency trees. Second, we propose an approach that compensates for the differences in the typical locations of the Subject (S), the Verb (V), and the Object (O) between Chinese (SVO) and Japanese (SOV), and we devise rules that reorder word blocks from dependency trees. Both approaches are implemented as pre-reordering methods, and we evaluate their impact on the translation quality of a phrase-based machine translation system in the news and patent domains. Because these approaches rely on syntactic structures that are extracted automatically by parsers, they are sensitive to parse errors. We analyze the effect of these parse errors and obtain upper bounds on the translation performance that can be achieved with these syntax-based pre-reordering methods.
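The second idea, moving a verb behind its object so that an SVO source sentence resembles SOV order, can be illustrated with a minimal sketch. This is not the rule set of the chapter: the `Token` type, the `reorder_verb_final` function, and the single rule "a token tagged `VV` is emitted after all of its right dependents" are assumptions made for illustration only, and the sketch assumes a projective dependency tree with exactly one root.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str  # surface word
    pos: str   # POS tag (Penn Chinese Treebank style, e.g. VV for a verb)
    head: int  # index of the head token, or -1 for the root


def reorder_verb_final(tokens: List[Token]) -> List[str]:
    """Toy pre-reordering: emit each VV token after its right dependents.

    Hypothetical rule for illustration; assumes one root and a projective tree.
    """
    children: Dict[int, List[int]] = {i: [] for i in range(len(tokens))}
    root = -1
    for i, tok in enumerate(tokens):
        if tok.head < 0:
            root = i
        else:
            children[tok.head].append(i)

    def linearize(i: int) -> List[str]:
        left = [c for c in children[i] if c < i]
        right = [c for c in children[i] if c > i]
        out: List[str] = []
        for c in left:
            out.extend(linearize(c))
        if tokens[i].pos == "VV":
            # Verb: right dependents (e.g. the object) come first, verb last.
            for c in right:
                out.extend(linearize(c))
            out.append(tokens[i].form)
        else:
            # Any other head keeps its original position.
            out.append(tokens[i].form)
            for c in right:
                out.extend(linearize(c))
        return out

    return linearize(root)


# 我(wo3, I) 吃(chi1, eat) 苹果(ping2guo3, apple): SVO in the source order.
sentence = [Token("我", "PN", 1), Token("吃", "VV", -1), Token("苹果", "NN", 1)]
print(" ".join(reorder_verb_final(sentence)))  # prints: 我 苹果 吃
```

The reordered source sentence would then be passed to the translation system in place of the original, so that the phrase-based decoder has less long-distance reordering to perform.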

Notes

  1. They produce target phrases that correspond to source phrases at a very different relative position.

  2. http://www.nactem.ac.uk/enju.

  3. In the text, we represent Chinese characters in Pinyin together with a tone number and its English translation in parentheses, e.g., 我(wo3, I). In total, there are 5 tones (i.e., 0, 1, 2, 3, and 4) in Chinese.

  4. http://triplet.cc/software/corbit.

  5. We follow the POS tag guideline of the Penn Chinese Treebank v3.0 (Xia 2000). Table 6 in Appendix lists all POS tag definitions.

  6. However, it is still open for debate whether Chinese is a head-initial or a head-final language due to its flexible word order (Gao 2008). Nevertheless, the written form of Chinese behaves primarily as a head-initial language.

  7. http://mt.xmu.edu.cn/cwmt2011/document/papers/e00.pdf.

  8. http://champollion.sourceforge.net/.

  9. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.

  10. http://nlp.stanford.edu/software/segmenter.shtml.

  11. http://nlp.cs.berkeley.edu/Software.shtml.

  12. http://www.statmt.org/moses.

  13. http://www.kyloo.net/software/doku.php/mgiza:overview.

  14. http://www.kecl.ntt.co.jp/icl/lirg/ribes.

  15. http://www.cs.umd.edu/~snover/tercom.

  16. http://www.cis.upenn.edu/~chinese/.

References

  • Badr, Ibrahim, Rabih Zbib, and James Glass. 2009. Syntactic phrase reordering for English-to-Arabic statistical machine translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 86–93. Association for Computational Linguistics.

  • Brown, Peter F, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. 1990. A statistical approach to machine translation. Computational Linguistics 16(2):79–85.

  • Chang, Pi-Chuan, Michel Galley, and Christopher D Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, 224–232. Association for Computational Linguistics.

  • Chang, Pi-Chuan, Huihsin Tseng, Dan Jurafsky, and Christopher D Manning. 2009. Discriminative reordering with Chinese grammatical relations features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, 51–59. Association for Computational Linguistics.

  • Collins, Michael, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 531–540. Association for Computational Linguistics.

  • Costa-Jussà, Marta Ruiz, and José Adrián Rodríguez Fonollosa. 2006. Statistical machine reordering. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), 70–76. Association for Computational Linguistics.

  • Fukui, Naoki. 1992. Theory of projection in syntax. Stanford, CA/Tokyo: CSLI Publisher/Kuroshio Publisher.

  • Gao, Qian. 2008. Word order in Mandarin: Reading and speaking. In Proceedings of the 20th North American Conference on Chinese Linguistics (NACCL-20), vol. 2, pp. 611–626.

  • Gao, Qin, and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 49–57. Association for Computational Linguistics.

  • Genzel, Dmitriy. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 376–384. Association for Computational Linguistics.

  • Goto, Isao, Masao Utiyama, and Eiichiro Sumita. 2012. Post-ordering by parsing for Japanese-English statistical machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 311–316. Association for Computational Linguistics.

  • Han, Dan, Katsuhito Sudoh, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2012. Head finalization reordering for Chinese-to-Japanese machine translation. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-6), 57–66. Association for Computational Linguistics.

  • Han, Dan, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013a. Effects of parsing errors on pre-reordering performance for Chinese-to-Japanese SMT. In Proceedings of the 27th Pacific Asia Conference on Language Information and Computing (PACLIC). The PACLIC Steering Committee.

  • Han, Dan, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013b. Using unlabeled dependency parsing for pre-reordering for Chinese-to-Japanese statistical machine translation. In Proceedings of the 2nd Workshop on Hybrid Approaches to Translation (HyTra), 25–33. Association for Computational Linguistics.

  • Hatori, Jun, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2011. Incremental joint POS tagging and dependency parsing in Chinese. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), 1216–1224. Asian Federation of Natural Language Processing.

  • Isozaki, Hideki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010a. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 944–952. Association for Computational Linguistics.

  • Isozaki, Hideki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010b. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, 244–251. Association for Computational Linguistics.

  • Isozaki, Hideki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2012. HPSG-based preprocessing for English-to-Japanese translation. ACM Transactions on Asian Language Information Processing (TALIP) 11(3):8:1–8:16.

  • Kendall, Maurice G. 1938. A new measure of rank correlation. Biometrika 30(1/2):81–93.

  • Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, and Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions, 177–180. Association for Computational Linguistics.

  • Kudo, Taku, and Yuji Matsumoto. 2000. Japanese dependency structure analysis based on support vector machines. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, 18–25. Association for Computational Linguistics.

  • Lee, Young-Suk, Bing Zhao, and Xiaoqiang Luo. 2010. Constituent reordering and syntax models for English-to-Japanese statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 626–634. Association for Computational Linguistics.

  • Li, Charles N., and Sandra Annear Thompson. 1989. Mandarin Chinese: A functional reference grammar. Linguistics-Asian studies. Berkeley, CA: University of California Press.

  • Li, Chi-Ho, Minghui Li, Dongdong Zhang, Mu Li, Ming Zhou, and Yi Guan. 2007. A probabilistic approach to syntax-based reordering for statistical machine translation. In Proceedings of the 45th Annual Meeting on Association for Computational Linguistics (ACL), vol. 45(1), pp. 720–727. Association for Computational Linguistics.

  • Ma, Xiaoyi. 2006. Champollion: A robust parallel text sentence aligner. In Proceedings of 5th International Conference on Language Resources and Evaluation (LREC-5), 489–492. Citeseer.

  • Miller, James Edward, and Jim Miller. 2011. A critical introduction to syntax. New York: Continuum International Publishing Group.

  • Miyao, Yusuke, and Jun’ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics 34(1):35–80.

  • Neubig, Graham, Taro Watanabe, and Shinsuke Mori. 2012. Inducing a discriminative parser to optimize machine translation reordering. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 843–853. Association for Computational Linguistics.

  • Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, 160–167. Association for Computational Linguistics.

  • Och, Franz Josef, and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

  • Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–318. Association for Computational Linguistics.

  • Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 433–440. Association for Computational Linguistics.

  • Pollard, Carl Jesse, and Ivan Andrew Sag. 1994. Head-driven phrase structure grammar. Chicago and Stanford, CA: The University of Chicago Press and CSLI Publications.

  • Ramanathan, Ananthakrishnan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, 800–808. Association for Computational Linguistics.

  • Rottmann, Kay, and Stephan Vogel. 2007. Word reordering in statistical machine translation with a POS-based distortion model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), 171–180.

  • Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas (AMTA), 223–231. The Association for Machine Translation in the Americas.

  • Sudoh, Katsuhito, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2011. Post-ordering in statistical machine translation. In Proceedings of the 13th Machine Translation Summit, 316–323. The International Association for Machine Translation (IAMT).

  • Tillmann, Christoph, Stephan Vogel, Hermann Ney, Alex Zubiaga, and Hassan Sawaf. 1997. Accelerated DP based search for statistical translation. In Proceedings of the 5th European Conference on Speech Communication and Technology, 2667–2670.

  • Tromble, Roy, and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 2, pp. 1007–1016. Association for Computational Linguistics.

  • Tsunakawa, Takashi, Naoaki Okazaki, Xiao Liu, and Jun’ichi Tsujii. 2009. A Chinese-Japanese lexical machine translation through a pivot language. ACM Transactions on Asian Language Information Processing 8(2):9:1–9:21.

  • Visweswariah, Karthik, Jiri Navratil, Jeffrey Sorensen, Vijil Chenthamarakshan, and Nanda Kambhatla. 2010. Syntax based reordering with automatically derived rules for improved statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 1119–1127. Association for Computational Linguistics.

  • Visweswariah, Karthik, Rajakrishnan Rajkumar, Ankur Gandhe, Ananthakrishnan Ramanathan, and Jiri Navratil. 2011. A word reordering model for improved machine translation. In Proceedings of Empirical Methods in Natural Language Processing, 486–496. Association for Computational Linguistics.

  • Wang, Chao, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 737–745. Association for Computational Linguistics.

  • Wu, Hua, and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Machine Translation 21(3):165–181.

  • Wu, Xianchao, Katsuhito Sudoh, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2011. Extracting pre-ordering rules from predicate-argument structures. In Proceedings of 5th International Joint Conference on Natural Language Processing (IJCNLP), November 2011, 29–37. Chiang Mai: Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/I111004.

  • Xia, Fei. 2000. The part-of-speech tagging guidelines for the Penn Chinese Treebank 3.0. Technical Report IRCS0007 (October 2000). Institute of Research and Cognitive Science (IRCS). Pennsylvania: University of Pennsylvania. http://repository.upenn.edu/ircs_reports/38/.

  • Xia, Fei, and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), 508–514. Association for Computational Linguistics.

  • Xu, Peng, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 245–253. Association for Computational Linguistics.

  • Yu, Kun, Yusuke Miyao, Takuya Matsuzaki, Xiangli Wang, and Junichi Tsujii. 2011. Analysis of the difficulties in Chinese deep parsing. In Proceedings of the 12th International Conference on Parsing Technologies, 48–57. Association for Computational Linguistics.

  • Zhao, Hong-Mei, Ya-Juan Lv, Guo-Sheng Ben, Yun Huang, and Qun Liu. 2011. Evaluation report for the 7th China workshop on machine translation (CWMT2011). In The 7th China Workshop on Machine Translation (CWMT2011). http://mt.xmu.edu.cn/cwmt2011/document/papers/e00.pdf.

Author information

Correspondence to Dan Han.

Appendix: Summary of Part-of-Speech Tag Set in Penn Chinese Treebank

See Table 6.

Table 6 POS tags defined in Penn Chinese Treebank v3.0 (Xia 2000)

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Han, D., Martínez-Gómez, P., Miyao, Y. (2016). Syntax-Based Pre-reordering for Chinese-to-Japanese Statistical Machine Translation. In: Costa-jussà, M., Rapp, R., Lambert, P., Eberle, K., Banchs, R., Babych, B. (eds) Hybrid Approaches to Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-21311-8_4

  • DOI: https://doi.org/10.1007/978-3-319-21311-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21310-1

  • Online ISBN: 978-3-319-21311-8

  • eBook Packages: Computer Science, Computer Science (R0)
