Xigt: extensible interlinear glossed text for natural language processing


This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the use case of representing a large, noisy, heterogeneous set of IGT.

  1.

    Languages that have non-Roman scripts will frequently be transliterated, but sometimes IGT will include both the original orthography and the transliteration as separate tiers.

  2.

    It is in fact more typical to find three-line IGT in linguistics papers, with only one of either the source or morpheme-segmented lines.

  3.


  4.


  5.

    Available for download at: http://uakari.ling.washington.edu/corpus/odin/.

  6.

    In the case of the ODIN collection, the error could stem from the data entry on the part of the linguist who wrote the paper the item was harvested from, from noise introduced in the process of conversion from PDF to the text format, or from noise introduced by the automatic methods used in the enrichment process.

  7.

    If we had specialized element names for tiers, the XPath query or XSLT selector might need to be modified for each newly defined tier, especially if there were non-tier elements at the same level.

  8.

    The indices 0–21 are the offsets of the characters in the sentence and they are not part of the signal.

  9.

    This example is offered for illustration; the AG formalism does not restrict the possible units of time referred to in the range of τ.

  10.

    Mapping from LAF/GrAF to Xigt would necessarily be limited to the kinds of annotations that Xigt is designed to handle.

  11.

    Also called SIL Standard Format.

  12.


  13.


  14.

    For more information on the project and to download code and resources, please refer to the project website at http://depts.washington.edu/uwcl/xigt.

  15.

    The public repository is available from the project website.

  16.

    The example in Fig. 2 is from Zaenen et al. (1985) via Bakker and Siewierska (2007) in ODIN.

  17.

    As we are mostly concerned with textual data, this paper only discusses the segmentation and alignment of character spans. Xigt is capable of representing annotations of audio data, but explicit support for such annotations is relegated to future work. Support for non-linear data (such as images) is beyond the current scope of the project.

  18.

    As the scope of alignment is a single IGT, tier and item identifiers do not need to be unique within a xigt-corpus.

  19.

    These, too, are available at the project’s public repository.

  20.


  21.

    The example in Fig. 12 is from Bratt (1996).

  22.

    The example in Fig. 13 is from Cagri (2005).

  23.

    Not a tier in the traditional sense, but rather a container of data related to the IGT. We could alternatively put these lines in a <metadata> element, but then getting alignments to work would require a more complicated extension, as Xigt is already set up to align tiers to one another, but not tiers to metadata. Thus it is a practical decision to put the information in a tier.

  24.

    Regarding stand-off annotation, we could have similarly pointed to character offsets in an external file, but as the textual data is extracted from PDFs and not stable, we chose to copy the relevant lines into the XML file.

  25.

    Regarding the strange line wrapping: the > character of each opening <item> tag on the ODIN text tiers is placed on a new line to help show the initial spaces on each line (or the absence thereof).

  26.

    The glosses line on its own is fairly clean, so we could alternatively segment it and align each gloss to the phrase in a floating alignment, but we don’t do that here for the sake of illustrating coarse-grained annotation.

  27.

    But not a perfect representation, because even the basic corpus can be improved with better cleaning steps and more accurate extraction of complex IGTs, such as those with line wrapping or inline alternations.

  28.

    The number of languages is calculated from the assigned ISO 639-3 codes.

  29.

    The extension may need some validation code to ensure there are no cycles.
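
The cycle check mentioned in the last note could be sketched roughly as follows. This is only an illustration, not part of Xigt or its published code: it assumes the references introduced by an extension have already been collected into a plain mapping from each identifier to the identifiers it points to, and the function name `find_cycle` and that mapping are hypothetical. The check itself is a standard depth-first search that reports the first cycle it encounters.

```python
from typing import Dict, List, Optional


def find_cycle(refs: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one reference cycle as a list of ids, or None if there is none.

    `refs` is a hypothetical mapping from an id to the ids it refers to.
    The returned list starts and ends with the same id.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # ids on the current depth-first search path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = ACTIVE
        path.append(node)
        for target in refs.get(node, []):
            status = state.get(target, UNSEEN)
            if status == ACTIVE:
                # target is already on the current path, so we found a cycle
                return path[path.index(target):] + [target]
            if status == UNSEEN:
                cycle = visit(target)
                if cycle is not None:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in refs:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle is not None:
                return cycle
    return None
```

A validator built along these lines would reject a document whenever `find_cycle` returns a list, and could include the returned identifiers in its error message. References to identifiers that are absent from the mapping are treated here as having no outgoing references; whether they should instead be reported as errors is a separate design decision.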


