Language Resources and Evaluation, Volume 51, Issue 3, pp. 581–612

The GUM corpus: creating multilayer resources in the classroom

Amir Zeldes

Original Paper


This paper presents the methodology, design principles and detailed evaluation of a new freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After briefly discussing corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part-of-speech tagging, constituent and dependency parsing, information-structural and coreference annotation, and Rhetorical Structure Theory analysis. Layers are inspected for annotation quality, and together they coalesce to form a richly annotated corpus that can be used to study the interactions between different levels of linguistic description. The evaluation gives an indication of the expected quality of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood is also presented, which illustrates some applications of the corpus. The results of this project show that high-quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.
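The central idea of a multilayer corpus is that several independent annotation layers (POS tags, syntax, coreference, discourse structure) index the same underlying token stream, so that queries can cut across layers. The following minimal sketch illustrates this design with invented toy data (the tokens, tags, and entity ids are illustrative, not actual GUM annotations): a coreference layer of mention spans is joined with a POS layer to separate lexical NP mentions from pronominal ones, the kind of cross-layer query underlying a coreference-likelihood study.

```python
# Toy illustration of a multilayer design: several annotation layers
# index the same token positions independently. All data here is
# invented for illustration, not drawn from the GUM corpus itself.

tokens = ["The", "committee", "met", "today", ";", "it", "adjourned", "."]

# POS layer: one tag per token position.
pos_layer = ["DT", "NN", "VBD", "NN", ":", "PRP", "VBD", "."]

# Hypothetical coreference layer: mention spans (start, end) -> entity id.
coref_layer = {(0, 2): "e1", (5, 6): "e1", (3, 4): "e2"}


def mentions_with_pos(coref, pos):
    """Join the coreference layer with the POS layer: for each mention
    span, pair its entity id with the POS tags the span covers."""
    return {
        span: (entity, pos[span[0]:span[1]])
        for span, entity in coref.items()
    }


joined = mentions_with_pos(coref_layer, pos_layer)

# A simple cross-layer query: mentions headed by a common noun
# (here crudely approximated as "last token tagged NN") are lexical NPs,
# as opposed to pronominal mentions.
lexical = {s for s, (e, tags) in joined.items() if tags[-1] == "NN"}
```

The same join pattern extends to any number of layers, since each layer only needs to agree on token indices, which is what allows layers produced by different tools or annotators to be combined after the fact.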


Keywords: Multilayer corpora · Classroom annotation · Coreference · Information structure · Treebank · Parsing



I would like to thank the participants, past, present and future, of the course LING-367 ‘Computational Corpus Linguistics’ for their contributions to the corpus described in this paper. Special thanks are due to Dan Simonson for his help in preparing the data. For a current list of contributors and a link to the course syllabus, please see the corpus website. I am also grateful for very helpful suggestions from Aurelie Herbelot, Anke Lüdeling, Mark Sicoli, Manfred Stede, the editors, and three anonymous reviewers; the usual disclaimers apply.



Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

Georgetown University, Washington, DC, USA