Language Resources and Evaluation, Volume 51, Issue 2, pp 249–282

Large aligned treebanks for syntax-based machine translation

  • Gideon Kotzé
  • Vincent Vandeghinste
  • Scott Martens
  • Jörg Tiedemann
Original Paper


We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the non-terminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare the scores with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.


Keywords: Parallel treebank; Parallel corpus; Machine translation; Syntax-based machine translation; Constituent alignment; Tree alignment; Resource development



Much of the work described in this paper was carried out within the PaCo-MT project, part of the STEVIN programme, which was funded by the Dutch Language Union. We also thank the University of Groningen, which made it possible for Gideon Kotzé to finish his thesis and produce some of the work described here; the University of South Africa, which supported his contribution to this paper; the University of Leuven, which supported the contributions of Vincent Vandeghinste and Scott Martens; and the University of Groningen and Uppsala University, which supported Jörg Tiedemann’s contribution. The work in this paper is continued in the SCATE project, funded by the Flemish IWT (IWT-SBO 130041).



Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Gideon Kotzé (1)
  • Vincent Vandeghinste (2)
  • Scott Martens (2)
  • Jörg Tiedemann (3)

  1. University of South Africa, Pretoria, South Africa
  2. University of Leuven, Leuven, Belgium
  3. University of Helsinki, Helsinki, Finland
