Enriching a massively multilingual database of interlinear glossed text
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) on which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human-readable form. In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.
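To make the idea of enriching IGT with word alignment concrete, the following is a minimal Python sketch. It is not INTENT's implementation and does not use the Xigt format: the `IGT` class, the `align_gloss_to_translation` function, and the German example sentence are all hypothetical and constructed for illustration, and the alignment is a naive string-matching heuristic. It assumes only that an IGT instance has a language line, a gloss line with one gloss token per source word, and a free translation.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IGT:
    """A hypothetical three-line IGT instance (not the Xigt data model)."""
    language: List[str]     # source-language words
    gloss: List[str]        # one gloss token per source word
    translation: List[str]  # words of the free translation


def align_gloss_to_translation(igt: IGT) -> List[Tuple[int, int]]:
    """Return (gloss_index, translation_index) pairs found by naive matching."""
    pairs = []
    for gi, gloss_token in enumerate(igt.gloss):
        # Split e.g. "bark-3SG" into pieces and drop all-caps grammatical
        # labels such as "3SG", keeping only lexical pieces like "bark".
        pieces = [
            p.lower()
            for p in re.split(r"[-.=]", gloss_token)
            if p and not p.isupper()
        ]
        for ti, word in enumerate(igt.translation):
            w = word.lower().strip(".,!?")
            # Match on equality or on a shared prefix ("barks" vs. "bark").
            if any(w == p or w.startswith(p) for p in pieces):
                pairs.append((gi, ti))
    return pairs


# Illustrative example constructed for this sketch, not taken from ODIN.
example = IGT(
    language="Der Hund bellt".split(),
    gloss="the dog bark-3SG".split(),
    translation="The dog barks".split(),
)
alignment = align_gloss_to_translation(example)
```

Because each gloss token corresponds to one source word, a gloss-to-translation pair also links the source word at the same index to the translation word. The prefix test is deliberately crude and will over-match on short gloss pieces, so it should be read only as an illustration of the kind of information that word alignment adds.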
Keywords: Resource-poor languages; Interlinear glossed text; ODIN
This material is partly supported by the National Science Foundation under Grant Nos. BCS-1160274 and BCS-0748919, and by the Singapore Ministry of Education under Tier 2 Grant No. MOE2013-T2-1-016. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would like to thank Sebastian Nordhoff for discussion on the Xigt format and issues with the original IGT data, and the anonymous reviewers for helpful comments. We would also like to thank the Linguist List (http://linguistlist.org/) for hosting the ODIN database.