Syntax-Based Pre-reordering for Chinese-to-Japanese Statistical Machine Translation

Han, Dan; Martínez-Gómez, Pascual; Miyao, Yusuke

doi:10.1007/978-3-319-21311-8_4

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Translating between languages that have different word orders poses additional difficulties. In this chapter, we introduce some of these difficulties and describe two syntax-based approaches to addressing them. First, we describe an approach that exploits regularities in the differences of phrase head locations between Chinese and Japanese and formalizes rules that reorder branches of constituency trees. Second, we propose an approach that compensates for the differences in the typical locations of the Subject (S), the Verb (V), and the Object (O) between Chinese (SVO) and Japanese (SOV), and devise rules that reorder word blocks from dependency trees. Both approaches are implemented as pre-reordering methods, and we evaluate their impact on the translation quality of a phrase-based machine translation system in the news and patent domains. Because these approaches rely on syntactic structures that are automatically extracted by parsers, they are sensitive to parse errors. We analyze the effect of these parse errors and obtain upper bounds on the translation performance that can be achieved with these syntax-based pre-reordering methods.
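To make the first idea concrete, the following is a minimal sketch of head-final reordering on a toy constituency tree. It is an illustration under stated assumptions, not the rule set developed in the chapter: the `Node` class, the `head` index on each node, and the example sentence are all hypothetical, and the sketch moves every head to the end of its phrase without any of the exceptions a practical rule set would need.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical constituency-tree node (not the chapter's data structure)."""
    label: str                      # phrase label or POS tag
    word: Optional[str] = None      # surface form, set for leaves only
    children: List["Node"] = field(default_factory=list)
    head: int = 0                   # index of the head child, assumed to come from the parser


def head_finalize(node: Node) -> Node:
    """Return a copy of the tree in which each head child is moved to the end."""
    if not node.children:
        return Node(node.label, node.word)
    kids = [head_finalize(child) for child in node.children]
    head_child = kids.pop(node.head)
    kids.append(head_child)
    return Node(node.label, None, kids, len(kids) - 1)


def words(node: Node) -> List[str]:
    """Read off the leaf words from left to right."""
    if not node.children:
        return [node.word]
    return [w for child in node.children for w in words(child)]


# Toy SVO input: wo3 (I) chi1 (eat) ping2guo3 (apple).
tree = Node("IP", head=1, children=[
    Node("NP", children=[Node("PN", "wo3")]),
    Node("VP", head=0, children=[
        Node("VV", "chi1"),
        Node("NP", children=[Node("NN", "ping2guo3")]),
    ]),
])

reordered = words(head_finalize(tree))
print(" ".join(reordered))
```

In this toy case the verb moves behind its object, so the source words come out in an SOV-like order before they would be passed to a phrase-based translation system.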

Notes

1. They produce target phrases that correspond to source phrases at a very different relative position.
2. http://www.nactem.ac.uk/enju.
3. In the text, we represent Chinese characters in Pinyin together with a tone number and its English translation in parentheses, e.g., 我(wo3, I). In total, there are 5 tones (i.e., 0, 1, 2, 3, and 4) in Chinese.
4. http://triplet.cc/software/corbit.
5. We follow the POS tag guideline of the Penn Chinese Treebank v3.0 (Xia 2000). Table 6 in Appendix lists all POS tag definitions.
6. However, it is still open for debate whether Chinese is a head-initial or a head-final language due to its flexible word order (Gao 2008). Nevertheless, the written form of Chinese behaves primarily as a head-initial language.
7. http://mt.xmu.edu.cn/cwmt2011/document/papers/e00.pdf.
8. http://champollion.sourceforge.net/.
9. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.
10. http://nlp.stanford.edu/software/segmenter.shtml.
11. http://nlp.cs.berkeley.edu/Software.shtml.
12. http://www.statmt.org/moses.
13. http://www.kyloo.net/software/doku.php/mgiza:overview.
14. http://www.kecl.ntt.co.jp/icl/lirg/ribes.
15. http://www.cs.umd.edu/~snover/tercom.
16. http://www.cis.upenn.edu/~chinese/.

References

Badr, Ibrahim, Rabih Zbib, and James Glass. 2009. Syntactic phrase reordering for English-to-Arabic statistical machine translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 86–93. Association for Computational Linguistics.
Brown, Peter F, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. 1990. A statistical approach to machine translation. Computational Linguistics 16(2):79–85.
Chang, Pi-Chuan, Michel Galley, and Christopher D Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, 224–232. Association for Computational Linguistics.
Chang, Pi-Chuan, Huihsin Tseng, Dan Jurafsky, and Christopher D Manning. 2009. Discriminative reordering with Chinese grammatical relations features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation, 51–59. Association for Computational Linguistics.
Collins, Michael, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 531–540. Association for Computational Linguistics.
Costa-Jussà, Marta Ruiz, and José Adrián Rodríguez Fonollosa. 2006. Statistical machine reordering. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), 70–76. Association for Computational Linguistics.
Fukui, Naoki. 1992. Theory of projection in syntax. Stanford, CA/Tokyo: CSLI Publisher/Kuroshio Publisher.
Gao, Qian. 2008. Word order in mandarin: Reading and speaking. In Proceedings of the 20th North American Conference on Chinese Linguistics (NACCL-20), vol. 2, pp. 611–626.
Gao, Qin, and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 49–57. Association for Computational Linguistics.
Genzel, Dmitriy. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 376–384. Association for Computational Linguistics.
Goto, Isao, Masao Utiyama, and Eiichiro Sumita. 2012. Post-ordering by parsing for Japanese-English statistical machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 311–316. Association for Computational Linguistics.
Han, Dan, Katsuhito Sudoh, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2012. Head finalization reordering for Chinese-to-Japanese machine translation. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-6), 57–66. Association for Computational Linguistics.
Han, Dan, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013a. Effects of parsing errors on pre-reordering performance for Chinese-to-Japanese SMT. In Proceedings of the 27th Pacific Asia Conference on Language Information and Computing (PACLIC). The PACLIC Steering Committee.
Han, Dan, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013b. Using unlabeled dependency parsing for pre-reordering for Chinese-to-Japanese statistical machine translation. In Proceedings of the 2nd Workshop on Hybrid Approaches to Translation (HyTra), 25–33. Association for Computational Linguistics.
Hatori, Jun, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2011. Incremental joint POS tagging and dependency parsing in Chinese. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), 1216–1224. Asian Federation of Natural Language Processing.
Isozaki, Hideki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010a. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 944–952. Association for Computational Linguistics.
Isozaki, Hideki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010b. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, 244–251. Association for Computational Linguistics.
Isozaki, Hideki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2012. HPSG-based preprocessing for English-to-Japanese translation. ACM Transactions on Asian Language Information Processing (TALIP) 11(3):8:1–8:16.
Kendall, Maurice G. 1938. A new measure of rank correlation. Biometrika 30(1/2):81–93.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, and Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions, 177–180. Association for Computational Linguistics.
Kudo, Taku, and Yuji Matsumoto. 2000. Japanese dependency structure analysis based on support vector machines. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, 18–25. Association for Computational Linguistics.
Lee, Young-Suk, Bing Zhao, and Xiaoqiang Luo. 2010. Constituent reordering and syntax models for English-to-Japanese statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 626–634. Association for Computational Linguistics.
Li, Charles N., and Sandra Annear Thompson. 1989. Mandarin Chinese: A functional reference grammar. Linguistics-Asian studies. Berkeley, CA: University of California Press.
Li, Chi-Ho, Minghui Li, Dongdong Zhang, Mu Li, Ming Zhou, and Yi Guan. 2007. A probabilistic approach to syntax-based reordering for statistical machine translation. In Proceedings of the 45th Annual Meeting on Association for Computational Linguistics (ACL), vol. 45(1), pp. 720–727. Association for Computational Linguistics.
Ma, Xiaoyi. 2006. Champollion: A robust parallel text sentence aligner. In Proceedings of 5th International Conference on Language Resources and Evaluation (LREC-5), 489–492. Citeseer.
Miller, James Edward, and Jim Miller. 2011. A critical introduction to syntax. New York: Continuum International Publishing Group.
Miyao, Yusuke, and Jun’ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics 34(1):35–80.
Neubig, Graham, Taro Watanabe, and Shinsuke Mori. 2012. Inducing a discriminative parser to optimize machine translation reordering. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 843–853. Association for Computational Linguistics.
Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, 160–167. Association for Computational Linguistics.
Och, Franz Josef, and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–318. Association for Computational Linguistics.
Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 433–440. Association for Computational Linguistics.
Pollard, Carl Jesse, and Ivan Andrew Sag. 1994. Head-driven phrase structure grammar. Chicago and Stanford, CA: The University of Chicago Press and CSLI Publications.
Ramanathan, Ananthakrishnan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, 800–808. Association for Computational Linguistics.
Rottmann, Kay, and Stephan Vogel. 2007. Word reordering in statistical machine translation with a pos-based distortion model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), 171–180.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas (AMTA), 223–231. The Association for Machine Translation in the Americas.
Sudoh, Katsuhito, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2011. Post-ordering in statistical machine translation. In Proceedings of the 13th Machine Translation Summit, 316–323. The International Association for Machine Translation (IAMT).
Tillmann, Christoph, Stephan Vogel, Hermann Ney, Alex Zubiaga, and Hassan Sawaf. 1997. Accelerated dp based search for statistical translation. In Proceedings of the 5th European Conference on Speech Communication and Technology, 2667–2670.
Tromble, Roy, and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 2, pp. 1007–1016. Association for Computational Linguistics.
Tsunakawa, Takashi, Naoaki Okazaki, Xiao Liu, and Jun’ichi Tsujii. 2009. A Chinese-Japanese lexical machine translation through a pivot language. ACM Transactions on Asian Language Information Processing 8(2):9:1–9:21.
Visweswariah, Karthik, Jiri Navratil, Jeffrey Sorensen, Vijil Chenthamarakshan, and Nanda Kambhatla. 2010. Syntax based reordering with automatically derived rules for improved statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 1119–1127. Association for Computational Linguistics.
Visweswariah, Karthik, Rajakrishnan Rajkumar, Ankur Gandhe, Ananthakrishnan Ramanathan, and Jiri Navratil. 2011. A word reordering model for improved machine translation. In Proceedings of Empirical Methods in Natural Language Processing, 486–496. Association for Computational Linguistics.
Wang, Chao, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 737–745. Association for Computational Linguistics.
Wu, Hua, and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Machine Translation 21(3):165–181.
Wu, Xianchao, Katsuhito Sudoh, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2011. Extracting pre-ordering rules from predicate-argument structures. In Proceedings of 5th International Joint Conference on Natural Language Processing (IJCNLP), November 2011, 29–37. Chiang Mai: Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/I111004.
Xia, Fei. 2000. The part-of-speech tagging guidelines for the Penn Chinese Treebank 3.0. Technical Report IRCS0007 (October 2000). Institute of Research and Cognitive Science (IRCS). Pennsylvania: University of Pennsylvania. http://repository.upenn.edu/ircs_reports/38/.
Xia, Fei, and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), 508–514. Association for Computational Linguistics.
Xu, Peng, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 245–253. Association for Computational Linguistics.
Yu, Kun, Yusuke Miyao, Takuya Matsuzaki, Xiangli Wang, and Junichi Tsujii. 2011. Analysis of the difficulties in Chinese deep parsing. In Proceedings of the 12th International Conference on Parsing Technologies, 48–57. Association for Computational Linguistics.
Zhao, Hong-Mei, Ya-Juan Lv, Guo-Sheng Ben, Yun Huang, and Qun Liu. 2011. Evaluation report for the 7th China workshop on machine translation (CWMT2011). In The 7th China Workshop on Machine Translation (CWMT2011). http://mt.xmu.edu.cn/cwmt2011/document/papers/e00.pdf.

Author information

Authors and Affiliations

National Institute of Advanced Industrial Science, Tokyo, Japan
Dan Han & Pascual Martínez-Gómez
The Graduate University for Advanced Studies, Hayama, Japan
Yusuke Miyao
National Institute of Informatics, Tokyo, Japan
Yusuke Miyao

Corresponding author

Correspondence to Dan Han.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Marta R. Costa-jussà
University of Aix-Marseille and University of Mainz, Marseille, France
Reinhard Rapp
Pompeu Fabra University, Barcelona, Barcelona, Spain
Patrik Lambert
Lingenio GmbH, Heidelberg, Baden-Württemberg, Germany
Kurt Eberle
Institute for Infocomm Research, Singapore, Singapur, Singapore
Rafael E. Banchs
Centre for Translation Studies, University of Leeds School of Modern Languages & Cultures, Leeds, United Kingdom
Bogdan Babych

Appendix: Summary of Part-of-Speech Tag Set in Penn Chinese Treebank

See Table 6.

Table 6 POS tags defined in Penn Chinese Treebank v3.0 (Xia 2000)

About this chapter

Cite this chapter

Han, D., Martínez-Gómez, P., Miyao, Y. (2016). Syntax-Based Pre-reordering for Chinese-to-Japanese Statistical Machine Translation. In: Costa-jussà, M., Rapp, R., Lambert, P., Eberle, K., Banchs, R., Babych, B. (eds) Hybrid Approaches to Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-21311-8_4

DOI: https://doi.org/10.1007/978-3-319-21311-8_4
Published: 13 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21310-1
Online ISBN: 978-3-319-21311-8
eBook Packages: Computer Science, Computer Science (R0)
