Large aligned treebanks for syntax-based machine translation

  • Original Paper
Language Resources and Evaluation

Abstract

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the non-terminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare the scores with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

Notes

  1. http://www.statmt.org/moses/.

  2. http://langtech.jrc.it/DGT-TM.html.

  3. Open Parallel corpUS.

  4. http://opus.lingfil.uu.se.

  5. http://oneliner.be/en/.

  6. In Dutch, verbs that are combined with prepositions, such as opstaan (stand up), are written as one word. Under some circumstances, however, they are split in two (staan and op), and the split parts often occur in different parts of the sentence, which creates problems for constituent alignment. These are the cases that we merged.

  7. v. 1.6.1.

  8. Technically, tree alignment also includes word alignment, as words are associated with the leaves of the tree. However, in this work we make a clear distinction between word alignment and constituent alignment, as we apply different approaches to perform those tasks. This is similar to related work in which the discussions of phrase-structure tree-to-tree alignment often put the focus on the alignment of non-terminals.

  9. 1.45 for the leaf ratio similarity and 1.4 for the linked leaf ratio similarity.

  10. The edge labels have been omitted from these examples, but were used in the actual rule induction.

  11. http://www.natcorp.ox.ac.uk/.

  12. 366,161 sentence pairs, as opposed to 1,180,706 sentence pairs in Europarl, 731,673 in OPUS and 478,972 in DGT.

  13. Found here: http://projectile.sv.cmu.edu/research/public/tools/bootStrap/generateLog-v11.pl, although at the time of submitting the paper, the page was down. A copy was found here: https://www.cs.cmu.edu/afs/cs/project/cmt-55/lti/Courses/731/homework/HW9/score/sig-test/generateLog-v11.pl.

  14. Found at http://projectile.sv.cmu.edu/research/public/tools/bootStrap/tutorial.htm—however, at the time of submitting this paper, the page was down.

  15. Comment by Kevin Knight on the question of why syntax-based MT does not consistently perform better or worse than phrase-based SMT, made at the 2012 workshop “More Structure for Better Statistical Machine Translation?” held in Amsterdam.

  16. For the languages in question, there is much more data available. For example, we used Europarl 3, whereas the current version is 7, released in 2012. There are also more recent versions of OPUS and DGT containing more data.

Acknowledgments

Much of the work described in this paper was carried out within the PaCo-MT project, which was carried out within the STEVIN programme funded by the Dutch Language Union (http://over.taalunie.org/organisatie/netwerk/stevin). We also thank the University of Groningen, which made it possible for Gideon Kotzé to finish his thesis and produce some of the work described here; the University of South Africa, which supported his contribution to this paper; the University of Leuven, which supported the contributions of Vincent Vandeghinste and Scott Martens; and the University of Groningen and Uppsala University, which supported Jörg Tiedemann’s contribution. The work in this paper is continued in the SCATE project, funded by the Flemish IWT (IWT-SBO 130041).

Author information

Corresponding author

Correspondence to Gideon Kotzé.

About this article

Cite this article

Kotzé, G., Vandeghinste, V., Martens, S. et al. Large aligned treebanks for syntax-based machine translation. Lang Resources & Evaluation 51, 249–282 (2017). https://doi.org/10.1007/s10579-016-9369-0

  • DOI: https://doi.org/10.1007/s10579-016-9369-0
