
Case Study: The Manually Annotated Sub-Corpus

Abstract

This case study describes the creation process for the Manually Annotated Sub-Corpus (MASC), a 500,000-word subset of the Open American National Corpus (OANC). The corpus includes primary data from a balanced selection of 19 written and spoken genres, all of which is annotated for almost 20 varieties of linguistic phenomena at all linguistic levels. All annotations are either hand-validated or manually produced. MASC is unique in that it is fully open and free for any use, including commercial use.


Notes

  1. The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.

  2. http://liberalarts.iupui.edu/icic/research/corpus_of_philanthropic_fundraising_discourse.

  3. http://nsv.uncc.edu/nsv/narratives.

  4. http://creativecommons.org/.

  5. http://www.biomedcentral.com/.

  6. http://www.plos.org.

  7. http://www.anc.org/contribute/texts/.

  8. To date, we have collected over five million words of college essays and fiction contributed by college students.

  9. http://www.openoffice.org.

  10. For this reason, we were unable to include a million words of contributed data from the ACL Anthology in the ANC.

  11. http://cleaneval.sigwac.org.uk/.

  12. http://www1.icsi.berkeley.edu/Speech/mr/.

  13. Defined in ISO/IEC 10646.

  14. The ANC maintains a GATE plugin repository, which includes import and export modules for annotated documents in GrAF (see Sect. 2.4), at http://www.anc.org/tools/gate/gate-update-site.xml.

  15. http://gate.ac.uk/sale/tao/splitch8.html.

  16. Some of these modules were developed or improved by students at Vassar College, who did the analysis and JAPE rule-writing as a term project for an advanced undergraduate course on Computational Linguistics.

  17. General Architecture for Text Engineering; http://gate.ac.uk.

  18. The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/.

  19. http://linguistics.okfn.org/resources/llod/.

  20. Available at http://www.anc.org/data/oanc/contributed-annotations/.

  21. http://www.anc.org/data/oanc/ngram/.

  22. NSF CRI 0708952.

  23. See http://www.anc.org/MASC/About_files/NSF_report-final.pdf.

  24. MASC includes about 5 K of the 10 K LU corpus, eliminating non-English and translated texts as well as texts that are not free of usage and redistribution restrictions. See https://catalog.ldc.upenn.edu/LDC2009T10.

  25. The list does not include WordNet sense annotations because they are not applied to full texts.

  26. http://gate.ac.uk/sale/tao/splitch6.html#x9-1260006.

  27. Primarily, the students were Cognitive Science majors with a Linguistics emphasis. Over the four years of the project, sixteen different students worked on validation.

  28. All of the MASC project’s annotation guidelines are accessible from http://www.anc.org/wiki/#AnnotationValidation.

  29. http://gate.ac.uk/sale/tao/splitch10.html.

  30. Sense and frame element annotations were handled separately; see chapter “Semantic Annotation of MASC”, in this volume.

  31. We created a post-processing JAPE script that modifies the default ANNIE tokenization slightly.

  32. Several years ago, because of difficulties with cases such as “New York-based” encountered in the Unified Linguistic Annotation project (see https://catalog.ldc.upenn.edu/LDC2009T07), the PTB project changed its tokenization, which originally did not break hyphenated words. However, breaking hyphenated words disallowed tagging the hyphenated word as an adjective; tagging the whole word as an adjective was deemed preferable, despite the need to manually correct tokenizations such as New+York-based.

  33. https://catalog.ldc.upenn.edu/LDC99T42.

  34. http://anc-projects.appspot.com/ptbpennposcompare.

  35. Because of the unexpected difficulty of correcting the ANNIE tags by this method, the first release of the full MASC (version 3.0.0) did not contain the tags corrected from the PTB data; instead, the ANNIE output was post-processed with JAPE scripts to correct systematic errors.

  36. http://lcl.uniroma1.it/MASC-NEWS/.

  37. http://babelnet.org/.

  38. http://dx.doi.org/10.7916/D80V89XH.

  39. For a comprehensive overview of GrAF and its headers, see [17].

  40. See https://catalog.ldc.upenn.edu/LDC2013T12.

  41. http://www.clarin.eu.

  42. Available from https://pypi.python.org/pypi/graf-python/0.3.0.

  43. https://poio-api.readthedocs.org/en/latest/.

  44. http://www.graphviz.org/.

  45. http://www.sfb632.uni-potsdam.de/annis/.

  46. The ANNIS implementation for accessing MASC annotations is available from http://www.anc.org/software/annis.

  47. http://nltk.org.

  48. http://ifarm.nl/signll/conll/.

  49. Note that GrAF is a “true” standoff format, as opposed to the hybrid standoff formats described in chapter “Designing Annotation Schemes: From Model to Representation” in this volume.
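The “true” standoff design mentioned in note 49 means the primary text is never modified: annotations live in separate graph files whose regions point into the text by character offsets. The sketch below illustrates the idea in Python using a deliberately simplified, hypothetical GrAF-like XML fragment (the element names and the `resolve` helper are illustrative, not the actual ISO 24612 / GrAF schema or the graf-python API):

```python
# Illustration of standoff annotation: the primary text stays untouched,
# and a separate annotation graph references it by character offsets.
# The XML below is a simplified, hypothetical GrAF-like fragment.
import xml.etree.ElementTree as ET

PRIMARY_TEXT = "MASC is fully open."

STANDOFF_XML = """
<graph>
  <region id="r1" anchors="0 4"/>
  <region id="r2" anchors="14 18"/>
  <node id="n1" region="r1"><a label="tok" pos="NNP"/></node>
  <node id="n2" region="r2"><a label="tok" pos="JJ"/></node>
</graph>
"""

def resolve(text, xml_source):
    """Map each annotation node back to the text span its region covers."""
    root = ET.fromstring(xml_source)
    # Regions carry (start, end) character offsets into the primary text.
    regions = {r.get("id"): tuple(map(int, r.get("anchors").split()))
               for r in root.findall("region")}
    spans = []
    for node in root.findall("node"):
        start, end = regions[node.get("region")]
        pos = node.find("a").get("pos")
        spans.append((text[start:end], pos))
    return spans

print(resolve(PRIMARY_TEXT, STANDOFF_XML))
# [('MASC', 'NNP'), ('open', 'JJ')]
```

Because every annotation layer is a separate graph over the same immutable text, layers produced by different tools (or with conflicting tokenizations) can coexist without rewriting the source document.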

References

  1. Baker, C.F., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Proceedings of the Third Linguistic Annotation Workshop, pp. 125–129. Association for Computational Linguistics, Suntec, Singapore (2009). http://www.aclweb.org/anthology/W/W09/W09-3021

  2. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of the 17th International Conference on Computational Linguistics, vol. 1, pp. 86–90. Association for Computational Linguistics, Stroudsburg, PA, USA (1998)

  3. Blumtritt, J., Bouda, P., Rau, F.: Poio API and GraF-XML: a radical stand-off approach in language documentation and language typology. In: Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies, vol. 10, Montreal, Canada (2013). doi:10.4242/BalisageVol10.Bouda01

  4. Chiarcos, C., Hellmann, S., Nordhoff, S.: Linking linguistic resources: examples from the Open Linguistics Working Group. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 201–216. Springer, Heidelberg (2012)

  5. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely Tokens... Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)

  6. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of ACL’02 (2002)

  7. Dridan, R., Oepen, S.: Tokenization: returning to a long solved problem – a survey, contrastive experiment, recommendations, and toolkit. In: ACL (2), pp. 378–382. The Association for Computational Linguistics (2012)

  8. Fellbaum, C., Baker, C.: Aligning verbs in WordNet and FrameNet. Linguistics (to appear)

  9. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Lang. Eng. 10(3–4), 327–348 (2004). doi:10.1017/S1351324904003523

  10. Fillmore, C.J., Jurafsky, D., Ide, N., Macleod, C.: An American National Corpus: a proposal. In: Proceedings of the First Annual Conference on Language Resources and Evaluation, pp. 965–969. European Language Resources Association, Paris (1998)

  11. Fokkens, A., van Erp, M., Postma, M., Pedersen, T., Vossen, P., Freire, N.: Offspring from reproduction problems: what replication failure teaches us. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 1691–1701. Association for Computational Linguistics, Sofia, Bulgaria (2013)

  12. Ide, N.: An open linguistic infrastructure for annotated corpora. In: Gurevych, I., Kim, J. (eds.) The People’s Web Meets NLP: Collaboratively Constructed Language Resources, pp. 263–284. Springer, Heidelberg (2013)

  13. Ide, N., Romary, L.: International standard for a linguistic annotation framework. Natural Lang. Eng. 10(3–4), 211–225 (2004). doi:10.1017/S135132490400350X

  14. Ide, N., Romary, L.: Representing linguistic corpora and their annotations. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006) (2006)

  15. Ide, N., Suderman, K.: Integrating linguistic resources: the American National Corpus model. In: Proceedings of the Fifth Language Resources and Evaluation Conference (LREC). Genoa, Italy (2006)

  16. Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop, pp. 1–8. Association for Computational Linguistics, Prague, Czech Republic (2007). http://www.aclweb.org/anthology/W/W07/W07-1501

  17. Ide, N., Suderman, K.: The Linguistic Annotation Framework: a standard for annotation interchange and merging. Language Resources and Evaluation (2014)

  18. Ide, N., Bonhomme, P., Romary, L.: XCES: an XML-based encoding standard for linguistic corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference. European Language Resources Association, Paris (2000)

  19. Ide, N., Reppen, R., Suderman, K.: The American National Corpus: more than the web can provide. In: Proceedings of the Third Language Resources and Evaluation Conference, pp. 839–844. Las Palmas (2002)

  20. Ide, N., Suderman, K., Simms, B.: ANC2Go: a web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC). European Language Resources Association, Valletta, Malta (2010)

  21. ISO: Language Resource Management – Linguistic Annotation Framework. ISO 24612 (2012)

  22. Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1) (2007)

  23. Kremer, G., Erk, K., Padó, S., Thater, S.: What substitutes tell us – analysis of an “all-words” lexical substitution corpus. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden (2014)

  24. Macleod, C., Grishman, R., Meyers, A., Barrett, L., Reeves, R.: NOMLEX: a lexicon of nominalizations. Proc. Euralex 98, 187–193 (1998)

  25. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: description and construction of text structures. In: Kempen, G. (ed.) Natural Language Generation: New Results in Artificial Intelligence, Psychology, and Linguistics, pp. 85–95. Nijhoff, Dordrecht (1987)

  26. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)

  27. Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland (2014)

  28. Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012)

  29. Neumann, A., Ide, N., Stede, M.: Importing MASC into the ANNIS linguistic database: a case study of mapping GrAF. In: Proceedings of the Seventh Linguistic Annotation Workshop (LAW), pp. 98–102. Sofia, Bulgaria (2013)

  30. Pradhan, S.S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the International Conference on Semantic Computing, pp. 517–526. IEEE Computer Society, Washington, DC, USA (2007). http://dx.doi.org/10.1109/ICSC.2007.67


Author information

Corresponding author

Correspondence to Nancy Ide.


Copyright information

© 2017 Springer Science+Business Media Dordrecht

Cite this chapter

Ide, N. (2017). Case Study: The Manually Annotated Sub-Corpus. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_19

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_19

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2
