Language Resources and Evaluation, Volume 49, Issue 2, pp 455–485

Xigt: extensible interlinear glossed text for natural language processing

  • Michael Wayne Goodman
  • Joshua Crowgey
  • Fei Xia
  • Emily M. Bender
Project Notes

Abstract

This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the use case of representing a large, noisy, heterogeneous set of IGT.

Keywords

Interlinear glossed text (IGT), Annotation, Storage format

Notes

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. BCS-1160274 and BCS-0748919. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We would like to thank Glenn Slayden and Ryan Georgi for general discussion; Luís Morgado da Costa for the example Portuguese IGT; and Francis Bond, František Kratochvíl, and anonymous reviewers for helpful feedback.

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  • Michael Wayne Goodman (1)
  • Joshua Crowgey (1)
  • Fei Xia (1)
  • Emily M. Bender (1)
  1. Department of Linguistics, University of Washington, Seattle, USA
