Language Resources and Evaluation, Volume 50, Issue 2, pp 321–349

Enriching a massively multilingual database of interlinear glossed text

  • Fei Xia
  • William D. Lewis
  • Michael Wayne Goodman
  • Glenn Slayden
  • Ryan Georgi
  • Joshua Crowgey
  • Emily M. Bender
Original Paper

Abstract

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human readable form. In this paper, we describe the expansion of the ODIN resource—a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.

Keywords

Resource-poor languages · Interlinear glossed text · ODIN

Notes

Acknowledgments

This material is partly supported by the National Science Foundation under Grant Nos. BCS-1160274 and BCS-0748919, and by the Singapore Ministry of Education under Tier 2 Grant No. MOE2013-T2-1-016. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would like to thank Sebastian Nordhoff for discussion on the Xigt format and issues with the original IGT data, and anonymous reviewers for helpful comments. We would also like to thank the Linguist List (http://linguistlist.org/) for hosting the ODIN database.

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Fei Xia (1)
  • William D. Lewis (2)
  • Michael Wayne Goodman (1)
  • Glenn Slayden (1)
  • Ryan Georgi (1)
  • Joshua Crowgey (1)
  • Emily M. Bender (1)

  1. University of Washington, Seattle, USA
  2. Microsoft Research, Redmond, USA
