Language Resources and Evaluation, Volume 48, Issue 4, pp 709–739

Capturing divergence in dependency trees to improve syntactic projection

  • Ryan Georgi
  • Fei Xia
  • William D. Lewis
Project Notes


Obtaining syntactic parses is an important step in many NLP pipelines. However, most of the world’s languages lack the large amounts of syntactically annotated data needed to build parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora that pair a resource-poor language with a resource-rich one: a parser for the resource-rich language and word alignments between the two languages are used to project parses onto the resource-poor data. These projection methods can suffer, however, when the syntactic structures of a sentence pair differ substantially between the two languages. In this paper, we investigate the use of small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. We then use these detected patterns to improve projection algorithms and dependency parsers, allowing for better-performing NLP tools for resource-poor languages, particularly those that lack the large amounts of annotated data necessary for traditional, fully supervised methods. While this detection process is not exhaustive, we demonstrate that common patterns of divergence can be identified automatically without prior knowledge of a given language pair, and that these patterns can be used to improve the performance of syntactic projection and parsing.
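To make the idea of projection concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in the paper. It assumes one-to-one word alignments and represents each dependency tree as a mapping from a token index to its head index. The function names `project_heads` and `edge_match_rate`, and the toy sentence pair, are hypothetical and chosen only for illustration. The second function gives a rough sense of how divergence might show up: it measures how many aligned source edges also exist in a gold target tree.

```python
from typing import Dict, Optional

Heads = Dict[int, Optional[int]]  # token index -> head index (None marks the root)


def project_heads(src_heads: Heads, alignment: Dict[int, int]) -> Heads:
    """Copy source dependency edges onto target tokens via a 1-to-1 alignment.

    alignment maps a source token index to a target token index.
    Edges with an unaligned child or head are skipped (an assumption of
    this sketch; real systems need extra handling for such cases).
    """
    tgt_heads: Heads = {}
    for child, head in src_heads.items():
        if child not in alignment:
            continue  # unaligned source word: nothing to project
        tgt_child = alignment[child]
        if head is None:
            tgt_heads[tgt_child] = None  # the root stays the root
        elif head in alignment:
            tgt_heads[tgt_child] = alignment[head]
    return tgt_heads


def edge_match_rate(src_heads: Heads, gold_tgt_heads: Heads,
                    alignment: Dict[int, int]) -> float:
    """Fraction of aligned source edges that also appear in the gold target tree."""
    total = matched = 0
    for child, head in src_heads.items():
        if head is None or child not in alignment or head not in alignment:
            continue
        total += 1
        if gold_tgt_heads.get(alignment[child]) == alignment[head]:
            matched += 1
    return matched / total if total else 0.0


# Toy example: source "John likes Mary" (verb-medial), target in verb-final order.
src = {0: 1, 1: None, 2: 1}           # John <- likes (root) -> Mary
align = {0: 0, 1: 2, 2: 1}            # John->0, likes->2, Mary->1
projected = project_heads(src, align)  # heads for target tokens 0, 2, 1

# A gold target tree with the same structure matches every projected edge,
# while a structurally different one lowers the match rate.
same_structure = edge_match_rate(src, {0: 2, 1: 2, 2: None}, align)       # 1.0
different_structure = edge_match_rate(src, {0: 1, 1: 2, 2: None}, align)  # 0.5
```

A low match rate on a small annotated parallel corpus is one simple signal that the two languages structure a construction differently; the patterns of divergence discussed in the abstract are richer than this single number.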


Keywords: Multilingualism, Translation divergence, Syntactic projection



Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Department of Linguistics, University of Washington, Seattle, USA
  2. Microsoft Research, Bldg 99, Redmond, USA
