Xigt: extensible interlinear glossed text for natural language processing

Abstract

This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the use case of representing a large, noisy, heterogeneous set of IGT.

Notes

  1. Text in languages with non-Roman scripts is frequently transliterated, but sometimes IGT includes both the original orthography and the transliteration as separate tiers.

  2. It is in fact more typical to find three-line IGT in linguistics papers, containing either the source line or the morpheme-segmented line, but not both.

  3. http://depts.washington.edu/uwcl/aggregation/.

  4. http://odin.linguistlist.org.

  5. Available for download at: http://uakari.ling.washington.edu/corpus/odin/.

  6. In the case of the ODIN collection, the error could stem from data entry by the linguist who wrote the paper the item was harvested from, from noise introduced in the conversion from PDF to the text format, or from noise introduced by the automatic methods used in the enrichment process.

  7. If we had specialized element names for tiers, the XPath query or XSLT selector might need to be modified for any newly defined tiers, especially if there are non-tier elements at the same level.

  8. The indices 0–21 are the offsets of the characters in the sentence; they are not part of the signal.

  9. This example is offered for illustration; the AG formalism does not restrict the possible units of time referred to in the range of τ.

  10. Mapping from LAF/GrAF to Xigt would necessarily be limited to the kinds of annotations that Xigt is designed to handle.

  11. Also called SIL Standard Format.

  12. http://www-01.sil.org/computing/shoebox/.

  13. http://emeld.org.

  14. For more information on the project and to download code and resources, please refer to the project website at http://depts.washington.edu/uwcl/xigt.

  15. The public repository is available from the project website.

  16. The example in Fig. 2 is from Zaenen et al. (1985) via Bakker and Siewierska (2007) in ODIN.

  17. As we are mostly concerned with textual data, this paper only discusses the segmentation and alignment of character spans. Xigt is capable of representing annotations of audio data, but explicit support for such annotations is left to future work. Support for non-linear data (such as images) is beyond the current scope of the project.

  18. As the scope of alignment is a single IGT, tier and item identifiers do not need to be unique within a xigt-corpus.

  19. These, too, are available at the project’s public repository.

  20. http://depts.washington.edu/uwcl/aggregation/.

  21. The example in Fig. 12 is from Bratt (1996).

  22. The example in Fig. 13 is from Cagri (2005).

  23. Not a tier in the traditional sense, but rather a container of data related to the IGT. We could alternatively put these lines in a <metadata> element, but then getting alignments to work would require a more complicated extension, as Xigt is already set up to align tiers to one another, but not tiers to metadata. Thus it is a practical decision to put the information in a tier.

  24. Regarding stand-off annotation, we could have similarly pointed to character offsets in an external file, but as the textual data is extracted from PDFs and is not stable, we chose to copy the relevant lines into the XML file.

  25. Regarding the unusual line wrapping: the > character of each opening <item> tag on the ODIN text tiers is placed on a new line to help show the initial spaces on each line (or the absence thereof).

  26. The glosses line on its own is fairly clean, so we could alternatively segment it and align each gloss to the phrase in a floating alignment, but we do not do that here, for the sake of illustrating coarse-grained annotation.

  27. It is not a perfect representation, however, because even the basic corpus can be improved with better cleaning steps and more accurate extraction of complex IGTs, such as those with line wrapping or inline alternations.

  28. The number of languages is counted by assigned ISO 639-3 code.

  29. The extension may need some validation code to ensure there are no cycles.
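
The points in notes 7 and 29 can be illustrated together with a minimal sketch, under stated assumptions. The XML below is a made-up, simplified Xigt-like fragment, not an excerpt from the paper or the ODIN data; the attribute names (type, alignment, segmentation, content) and the helper names tier_graph and find_cycle are assumptions introduced for illustration. Note 29 does not say here which references the extension involves, so the sketch assumes tier-to-tier references. Because every tier shares one generic element name, a single selector reaches all tiers regardless of type, and the collected references can then be checked for cycles with a depth-first search.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified Xigt-like fragment (an assumption, not real corpus data).
SAMPLE = """
<xigt-corpus>
  <igt id="i1">
    <tier type="phrases" id="p">
      <item id="p1">el perro ladra</item>
    </tier>
    <tier type="words" id="w" segmentation="p">
      <item id="w1" segmentation="p1[0:2]"/>
    </tier>
    <tier type="glosses" id="g" alignment="w">
      <item id="g1" alignment="w1">the</item>
    </tier>
  </igt>
</xigt-corpus>
"""

# Attributes assumed to hold references from one tier to another.
REF_ATTRS = ("alignment", "segmentation", "content")


def tier_graph(igt):
    """Map each tier id to the tier ids it references, within one <igt>."""
    graph = {}
    # Generic element name: one selector covers every tier type,
    # including tier types defined later by extensions.
    for tier in igt.iter("tier"):
        graph[tier.get("id")] = [tier.get(a) for a in REF_ATTRS if tier.get(a)]
    return graph


def find_cycle(graph):
    """Return a list of ids forming a reference cycle, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GRAY
        path.append(node)
        for target in graph.get(node, []):
            state = color.get(target, BLACK)  # unknown targets are skipped
            if state == GRAY:
                return path[path.index(target):] + [target]
            if state == WHITE:
                found = visit(target, path)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


root = ET.fromstring(SAMPLE.strip())
graph = tier_graph(root.find("igt"))
print(graph)
print(find_cycle(graph))
```

The graph is built per IGT rather than per corpus, in line with note 18: identifiers only need to be unique within a single IGT, so references are resolved in that scope.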

References

  1. Bakker, D., & Siewierska, A. (2007). Another take on the notion subject. In Structural–functional studies in English grammar: In honour of Lachlan Mackenzie (Vol. 83, p. 141).

  2. Baldwin, T., Beavers, J., Bender, E. M., Flickinger, D., Kim, A., & Oepen, S. (2005). Beauty and the beast: What running a broad-coverage precision grammar over the BNC taught us about the grammar—And the corpus. In S. Kepser & M. Reis (Eds.), Linguistic evidence: Empirical, theoretical, and computational perspectives (pp. 49–69). Berlin: Mouton de Gruyter.

  3. Beermann, D., & Mihaylov, P. (2009). TypeCraft: Linguistic data and knowledge sharing, open access and linguistic methodology. In Paper presented at the workshop on small tools in cross-linguistic research. The Netherlands: University of Utrecht.

  4. Bender, E. M., Drellishak, S., Fokkens, A., Poulson, L., & Saleem, S. (2010). Grammar customization. Research on Language and Computation, 1–50. doi:10.1007/s11168-010-9070-1.

  5. Bender, E. M., Flickinger, D., & Oepen, S. (2002). The grammar matrix: An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In J. Carroll, N. Oostdijk, & R. Sutcliffe (Eds.), Proceedings of the workshop on grammar engineering and evaluation at the 19th international conference on computational linguistics (pp. 8–14), Taipei, Taiwan.

  6. Bender, E. M., Ghodke, S., Baldwin, T., & Dridan, R. (2012). From database to treebank: Enhancing hypertext grammars with grammar engineering and treebank search. In S. Nordhoff & K. L. G. Poggeman (Eds.), Electronic grammaticography (pp. 179–206). Honolulu: University of Hawaii Press.

  7. Bender, E. M., Goodman, M. W., Crowgey, J., & Xia, F. (2013). Towards creating precision grammars from interlinear glossed text: Inferring large-scale typological properties. In Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities (pp. 74–83). Sofia, Bulgaria: Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2710.

  8. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M. F., Kay, M., Robie, J., & Siméon, J. (2007). XML path language (XPath) 2.0. W3C recommendation 23.

  9. Bickel, B., Comrie, B., & Haspelmath, M. (2008). The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Max Planck Institute for Evolutionary Anthropology and Department of Linguistics, University of Leipzig. http://www.eva.mpg.de/lingua/resources/glossing-rules.php.

  10. Bird, S., Day, D., Garofolo, J., Henderson, J., Laprun, C., & Liberman, M. (2000). ATLAS: A flexible and extensible architecture for linguistic annotation. In Proceedings of the second international conference on language resources and evaluation. Paris: European Language Resources Association.

  11. Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1), 23–60.

  12. Brants, S., Dipper, S., Hansen, S., Lezius, W., & Smith, G. (2002). The TIGER treebank. In Proceedings of the workshop on treebanks and linguistic theories (pp. 24–41).

  13. Bratt, E. O. (1996). Argument composition and the lexicon: Lexical and periphrastic causatives in Korean. PhD thesis, Stanford University.

  14. Brugman, H., & Russel, A. (2004). Annotating multi-media/multi-modal resources with ELAN. In Proceedings of the fourth international conference on language resources and evaluation.

  15. Cagri, I. (2005). Minimality and Turkish relative clauses. PhD thesis, University of Maryland.

  16. Clark, J., & Murata, M. (2001). Relax NG specification. Technical report, The Organization for the Advancement of Structured Information Standards (OASIS).

  17. Georgi, R., Xia, F., & Lewis, W. (2012). Improving dependency parsing with interlinear glossed text and syntactic projection. In Proceedings of the COLING 2012: Posters (pp. 371–380), Mumbai, India.

  18. Hughes, B., Bird, S., & Bow, C. (2003). Encoding and presenting interlinear text using XML technologies. In Proceedings of the Australasian language technology workshop (pp. 61–69), Melbourne, Australia.

  19. Ide, N., Romary, L., & de la Clergerie, E. (2003). International standard for a linguistic annotation framework. In Proceedings of the HLT-NAACL workshop on software engineering and architecture of language technology systems (SEALTS).

  20. Ide, N., & Suderman, K. (2007). GrAF: A graph-based format for linguistic annotations. In Proceedings of the linguistic annotation workshop, Prague (pp. 1–8).

  21. Kay, M., et al. (2007). XSL transformations (XSLT) version 2.0. W3C recommendation 23.

  22. Lewis, W. D., & Xia, F. (2008). Automatically identifying computationally relevant typological features. In Proceedings of the third international joint conference on natural language processing (pp. 685–690), Hyderabad, India.

  23. Lewis, W., & Xia, F. (2010). Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Journal of Literary and Linguistic Computing (LLC), 25(3), 303–319.

  24. Maeda, K., & Bird, S. (2000). A formal framework for interlinear text. In Proceedings of the workshop on web-based language documentation and description. http://www.ldc.upenn.edu/exploration/expl2000/papers/.

  25. Mengel, A., & Lezius, W. (2000). An XML-based representation format for syntactically annotated corpora. In LREC.

  26. Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2004). LinGO redwoods. A rich and dynamic treebank for HPSG. Journal of Research on Language and Computation, 2(4), 575–596.

  27. Palmer, A., & Erk, K. (2007). IGT-XML: An XML format for interlinearized glossed text. In Proceedings of the linguistic annotation workshop (pp. 176–183). Prague, Czech Republic: Association for Computational Linguistics. http://www.aclweb.org/anthology/W/W07/W07-1528.

  28. Toews, C. (2009). The expression of tense and aspect in Shona. In Selected proceedings of the 39th annual conference on African linguistics (pp. 32–41).

  29. Xia, F., & Lewis, W. (2009). Applying NLP technologies to the collection and enrichment of language data on the web to aid linguistic research. In Proceedings of the EACL 2009 workshop on language technology and resources for cultural heritage, social sciences, humanities, and education (LaTeCH-SHELT&R 2009) (pp. 51–59). Athens, Greece: Association for Computational Linguistics. http://www.aclweb.org/anthology/W09-0307.

  30. Xia, F., Lewis, W., Goodman, M. W., Crowgey, J., & Bender, E. M. (2013). Enriching ODIN. In Proceedings of the LREC 2014, Reykjavik, Iceland (to appear).

  31. Zaenen, A., Maling, J., & Thráinsson, H. (1985). Case and grammatical functions: The Icelandic passive. Natural Language & Linguistic Theory, 3(4), 441–483.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. BCS-1160274 and BCS-0748919. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We would like to thank Glenn Slayden and Ryan Georgi for general discussion; Luís Morgado da Costa for the example Portuguese IGT; and Francis Bond, František Kratochvíl, and anonymous reviewers for helpful feedback.

Corresponding author

Correspondence to Michael Wayne Goodman.

Cite this article

Goodman, M.W., Crowgey, J., Xia, F. et al. Xigt: extensible interlinear glossed text for natural language processing. Lang Resources & Evaluation 49, 455–485 (2015). https://doi.org/10.1007/s10579-014-9276-1

Keywords

  • Interlinear glossed text (IGT)
  • Annotation
  • Storage format