Enriching, Editing, and Representing Interlinear Glossed Text

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9041)


The majority of the world’s languages have few or no NLP resources or tools. This is due to a lack of training data (“resources”) on which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swathe of the world’s languages. In many cases, this involves bootstrapping the learning process with enriched or partially enriched resources. One promising line of research involves the use of Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines, since it remains “trapped” in linguistic scholarly documents and in human-readable form. In this paper, we introduce several tools that make IGT more accessible and consumable by NLP researchers.


Syntactic Structure, Computational Linguistics, Word Alignment, Translation Line, Human Readable Form
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Linguistics Department, University of Washington, Seattle, USA
  2. Microsoft Research, Redmond, USA
