Enriching a massively multilingual database of interlinear glossed text

Abstract

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human-readable form. In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.

Notes

  1. http://depts.washington.edu/uwcl/packages/.

  2. The first attested use of the word gram in the context of IGT that we are aware of is in Bybee and Dahl (1989). Bybee and Dahl used the term gram to refer to morphemes that have grammatical functions in a language. Here, and elsewhere, we use the term gram to refer to the annotation used by linguists to refer to these morphemes.

  3. http://odin.linguistlist.org.

  4. CF in the language name stands for French-lexified creole.

  5. Similarly, existing formats (e.g., the CoNLL format for representing dependency structure) aren’t designed to handle IGT, including the relationships between the lines.

  6. Goodman et al. chose to develop a new format after surveying many existing formats for encoding IGT and finding that none of them fully supported their requirements. The details of this process, including the desiderata they identified, are in Goodman et al. (2014).

  7. However, it is possible to have multiple tiers of the same type, such as a words tier for the tokenization of the language line and another for the tokenization of the translation. These represent two distinct vectors of annotations of the same type.

  8. Alignment expressions are a deviation from common practice for making references in XML documents. A more standard solution might have separate attributes for the referred ID and the start and end positions of substring selections, where necessary. Considering that there can be multiple IDs, each potentially using substring selections, in a single reference, and that there can be more than one kind of reference for a single item, such a solution could quickly become unwieldy and unnecessarily inflate the file size of a document. For these reasons, Goodman et al. found alignment expressions to be a more elegant solution.

  9. http://www.language-archives.org/OLAC/metadata.html.

  10. The xml:lang attribute is a standard part of the XML specification: http://www.w3.org/TR/xml/#sec-lang-tag. Its semantics state that it is inherited by descendant elements, so the default language may be specified at the corpus level and tier-specific languages may then override the default.

  11. Unacceptable characters are those illegal in XML documents, such as the form feed character (0x000C) and other Unicode control characters. If the original IGT data contain any unacceptable characters, we replace them with the Unicode replacement character (0xFFFD).

  12. The problem persists despite efforts to promote consistency, such as the Leipzig Glossing Rules (Bickel et al. 2004).

  13. Following the specification of the IDREFS attribute type: http://www.w3.org/TR/REC-xml/#idref. We choose not to use, e.g., a comma-separated alignment expression because the children of a node are more intuitively a list rather than a string concatenation. Also, we want to disallow sub-selections on children.

  14. It is not strictly necessary to first segment the translation line; the items on the phrase-structure tier could use alignment expressions to select the word spans directly from the item on the translations tier. However, we find it prudent to segment the words separately in case more than one tier annotates the same segments.

  15. We could have repurposed the alignment reference attribute for one of these, perhaps source, but we felt that defining two new reference attributes made their purpose clearer.

  16. http://nlp.stanford.edu/software/tagger.shtml.

  17. We use a classifier, not a sequence labeler, because the word order in the gloss line will be language-dependent, and the training and test data of our POS tagger can come from different languages.

  18. http://nlp.stanford.edu/software/lex-parser.shtml.

  19. Windows Presentation Foundation (WPF) is a graphical framework for Microsoft Windows. See http://msdn.microsoft.com/en-us/library/aa970268(v=vs.110).aspx.
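
The character replacement described in note 11 can be illustrated with a short sketch. This is not the authors' code: the function name `sanitize_for_xml` is hypothetical, and the set of "unacceptable" characters is assumed here to be everything outside the `Char` production of the XML 1.0 specification, which may be narrower or broader than the set the authors actually filtered.

```python
import re

# Assumption: "unacceptable" means any character outside the XML 1.0 Char
# production, i.e. anything other than tab, line feed, carriage return,
# U+0020-U+D7FF, U+E000-U+FFFD, and U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# The Unicode replacement character named in note 11.
REPLACEMENT_CHAR = "\uFFFD"


def sanitize_for_xml(text: str) -> str:
    """Replace each character that XML 1.0 forbids with U+FFFD."""
    return _ILLEGAL_XML_CHARS.sub(REPLACEMENT_CHAR, text)


if __name__ == "__main__":
    # A form feed (U+000C) embedded in a line of text is replaced;
    # ordinary whitespace and non-ASCII letters are left untouched.
    raw_line = "Täckström\x0c 2012"
    print(repr(sanitize_for_xml(raw_line)))
```

Replacing rather than deleting the offending characters keeps the character offsets of the line unchanged, which may matter when other annotations refer to substring positions.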

References

  • Bailyn, J. F. (2001). Inversion, dislocation and optionality in Russian. In G. Zybatow, U. Junghanns, G. Mehlhorn, & L. Szucsich (Eds.), Current issues in formal Slavic linguistics. Frankfurt: Peter Lang AG.

  • Bender, E. M., Goodman, M. W., Crowgey, J., & Xia, F. (2013). Towards creating precision grammars from interlinear glossed text: Inferring large-scale typological properties. In Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities, Sofia, Bulgaria (pp. 74–83).

  • Bickel, B., Comrie, B., & Haspelmath, M. (2004). The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses (revised version). Technical report, Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology and the Department of Linguistics of the University of Leipzig. http://www.eva.mpg.de/lingua/files/morpheme.html. 17 May 2006.

  • Brants, S., Dipper, S., Hansen, S., Lezius, W., & Smith, G. (2002). The TIGER treebank. In Proceedings of the workshop on treebanks and linguistic theories (pp. 24–41).

  • Bybee, J. L., & Dahl, Ö. (1989). The creation of tense and aspect systems in the languages of the world. Amsterdam: John Benjamins.

  • Das, D., & Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In HLT ’11: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, association for computational linguistics.

  • de Marneffe, M. C., MacCartney, B., & Manning, C. D. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of LREC 2006.

  • Dorr, B. J. (1994). Machine translation divergences: A formal description and proposed solution. Computational Linguistics, 20(4), 597–635.

  • Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the Semantic Web. GLOT International, 7(3), 97–100.

  • Feldman, A., Hana, J., & Brew, C. (2006). A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genoa, Italy.

  • Georgi, R., Xia, F., & Lewis, W. D. (2013). Enhanced and portable dependency projection algorithms using interlinear glossed text. In Proceedings of ACL 2013 (Volume 2: Short papers), Sofia, Bulgaria (pp. 306–311).

  • Georgi, R., Xia, F., & Lewis, W. D. (2014). Capturing divergence in dependency trees to improve syntactic projection. Language Resources and Evaluation, 48(4), 709–739.

  • Georgi, R., Xia, F., & Lewis, W. D. (2015). Enriching interlinear text using automatically constructed annotators. In Proceedings of the 9th workshop on language technology for cultural heritage, social sciences, and humanities (LaTeCH 2015), Beijing, China.

  • Goodman, M. W., Crowgey, J., Xia, F., & Bender, E. M. (2014). Xigt: Extensible interlinear glossed text for natural language processing. Language Resources and Evaluation. doi:10.1007/s10579-014-9276-1.

  • Hana, J., Feldman, A., Amaral, L., & Brew, C. (2006). Tagging Portuguese with a Spanish tagger using cognates. In Proceedings of the workshop on cross-language knowledge induction, in conjunction with the 11th conference of the European chapter of the association for computational linguistics (EACL-2006), Trento, Italy.

  • Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., & Kolak, O. (2005). Bootstrapping parsers via syntactic projection across parallel texts. Special Issue of the Journal of Natural Language Engineering on Parallel Texts, 11(3), 311–325.

  • Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st meeting of the association for computational linguistics (pp. 423–430).

  • Lewis, W. (2003). Mining and migrating interlinear text. In Proceedings of EMELD 2003 workshop on digitizing and annotating texts and field recordings, East Lansing, Michigan. http://www.emeld.net/workshop/2003/Lewis-paper.pdf.

  • Lewis, W., & Xia, F. (2010). Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Journal of Literary and Linguistic Computing (LLC), 25(3), 303–319.

  • Lewis, W. D., & Xia, F. (2008). Automatically identifying computationally relevant typological features. In Proceedings of the third international joint conference on natural language processing (IJCNLP-2008), Hyderabad, India.

  • Lewis, W. D., & Xia, F. (2008b). Automatically identifying computationally relevant typological features. In Proceedings of the third international joint conference on natural language processing, Hyderabad, India (pp. 685–690).

  • Lewis, W. D., Farrar, S., & Langendoen, D. T. (2001). Building a knowledge base of morphosyntactic terminology. In Proceedings of the IRCS workshop on linguistic databases, University of Pennsylvania (pp. 150–156). www.u.arizona.edu/~farrar/papers/LewFarLang01.pdf.

  • Ma, X., & Xia, F. (2014). Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of ACL-2014, Baltimore, MD.

  • Marcus, M., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

  • McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., et al. (2013). Universal dependency annotation for multilingual parsing. In Proceedings of ACL-2013.

  • Nivre, J., Hall, J., & Nilsson, J. (2006). MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC (Vol. 6, pp. 2216–2219).

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Täckström, O., McDonald, R., & Uszkoreit, J. (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of NAACL/HLT 2012.

  • Täckström, O., McDonald, R., & Nivre, J. (2013). Target language adaptation of discriminative transfer parsers. In Proceedings of NAACL 2013.

  • Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL (pp. 252–259).

  • Xia, F., & Lewis, W. D. (2007). Multilingual structural projection across interlinear text. In Proceedings of the conference on human language technologies (HLT/NAACL 2007), Rochester, New York (pp. 452–459).

  • Xia, F., & Lewis, W. D. (2008). Repurposing theoretical linguistic data for tool development and search. In Proceedings of the third international joint conference on natural language processing (IJCNLP-2008), Hyderabad, India.

  • Xia, F., Lewis, W. D., & Poon, H. (2009). Language ID in the context of harvesting language data off the web. In Proceedings of the 12th conference of the European chapter of the association of computational linguistics (EACL 2009), Athens, Greece.

  • Xia, F., Lewis, C., & Lewis, W. D. (2010). The problems of language identification within hugely multilingual data sets. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010), Valletta, Malta (pp. 2790–2797).

  • Xiao, M., & Guo, Y. (2015). Annotation projection-based representation learning for cross-lingual dependency parsing. CoNLL 2015.

  • Yarowsky, D., & Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the 2001 meeting of the North American chapter of the association for computational linguistics (NAACL-2001) (pp. 200–207).

Acknowledgments

This material is partly supported by the National Science Foundation under Grant Nos. BCS-1160274 and BCS-0748919, and by the Singapore Ministry of Education under Tier 2 Grant No. MOE2013-T2-1-016. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would like to thank Sebastian Nordhoff for discussion on the Xigt format and issues with the original IGT data, and anonymous reviewers for helpful comments. We would also like to thank the Linguist List (http://linguistlist.org/) for hosting the ODIN database.

Author information

Corresponding author

Correspondence to Fei Xia.

About this article

Cite this article

Xia, F., Lewis, W.D., Goodman, M.W. et al. Enriching a massively multilingual database of interlinear glossed text. Lang Resources & Evaluation 50, 321–349 (2016). https://doi.org/10.1007/s10579-015-9325-4

  • DOI: https://doi.org/10.1007/s10579-015-9325-4

Keywords

  • Resource-poor languages
  • Interlinear glossed text
  • ODIN