
Case Study: The Manually Annotated Sub-Corpus

Abstract

This case study describes the creation process for the Manually Annotated Sub-Corpus (MASC), a 500,000-word subset of the Open American National Corpus (OANC). The corpus includes primary data from a balanced selection of 19 written and spoken genres, all of which is annotated for almost 20 varieties of linguistic phenomena at all linguistic levels. All annotations are either hand-validated or manually produced. MASC is unique in that it is fully open and free for any use, including commercial use.


Notes

  1. The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.

  2. http://liberalarts.iupui.edu/icic/research/corpus_of_philanthropic_fundraising_discourse.

  3. http://nsv.uncc.edu/nsv/narratives.

  4. http://creativecommons.org/.

  5. http://www.biomedcentral.com/.

  6. http://www.plos.org.

  7. http://www.anc.org/contribute/texts/.

  8. To date, we have collected over five million words of college essays and fiction contributed by college students.

  9. http://www.openoffice.org.

  10. For this reason, we were unable to include a million words of contributed data from the ACL Anthology in the ANC.

  11. http://cleaneval.sigwac.org.uk/.

  12. http://www1.icsi.berkeley.edu/Speech/mr/.

  13. Defined in ISO/IEC 10646.

  14. The ANC maintains a GATE plugin repository, which includes import and export modules for annotated documents in GrAF (see Sect. 2.4), at http://www.anc.org/tools/gate/gate-update-site.xml.

  15. http://gate.ac.uk/sale/tao/splitch8.html.

  16. Some of these modules were developed or improved by students at Vassar College, who did the analysis and JAPE rule-writing as a term project for an advanced undergraduate course on Computational Linguistics.

  17. General Architecture for Text Engineering; http://gate.ac.uk.

  18. The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/.

  19. http://linguistics.okfn.org/resources/llod/.

  20. Available at http://www.anc.org/data/oanc/contributed-annotations/.

  21. http://www.anc.org/data/oanc/ngram/.

  22. NSF CRI 0708952.

  23. See http://www.anc.org/MASC/About_files/NSF_report-final.pdf.

  24. MASC includes about 5 K of the 10 K LU corpus, eliminating non-English and translated texts as well as texts that are not free of usage and redistribution restrictions. See https://catalog.ldc.upenn.edu/LDC2009T10.

  25. The list does not include WordNet sense annotations because they are not applied to full texts.

  26. http://gate.ac.uk/sale/tao/splitch6.html#x9-1260006.

  27. Primarily, the students were Cognitive Science majors with a Linguistics emphasis. Over the four years of the project, sixteen different students worked on validation.

  28. All of the MASC project’s annotation guidelines are accessible from http://www.anc.org/wiki/#AnnotationValidation.

  29. http://gate.ac.uk/sale/tao/splitch10.html.

  30. Sense and frame element annotations were handled separately; see chapter “Semantic Annotation of MASC”, in this volume.

  31. We created a post-processing JAPE script that modifies the default ANNIE tokenization slightly.

  32. Several years ago, because of difficulties with cases such as “New York-based” encountered in the Unified Linguistic Annotation project (see https://catalog.ldc.upenn.edu/LDC2009T07), the PTB project changed its tokenization, which originally did not break hyphenated words. However, breaking hyphenated words disallowed tagging the hyphenated word as an adjective; tagging the whole word as an adjective was deemed preferable, despite the need to manually correct tokenizations such as New+York-based.

  33. https://catalog.ldc.upenn.edu/LDC99T42.

  34. http://anc-projects.appspot.com/ptbpennposcompare.

  35. Because of the unexpected difficulty of correcting the ANNIE tags by this method, the first release of the full MASC (version 3.0.0) did not contain the tags corrected from the PTB data; instead, the ANNIE output was post-processed with JAPE scripts to correct systematic errors.

  36. http://lcl.uniroma1.it/MASC-NEWS/.

  37. http://babelnet.org/.

  38. http://dx.doi.org/10.7916/D80V89XH.

  39. For a comprehensive overview of GrAF and its headers, see [17].

  40. See https://catalog.ldc.upenn.edu/LDC2013T12.

  41. http://www.clarin.eu.

  42. Available from https://pypi.python.org/pypi/graf-python/0.3.0.

  43. https://poio-api.readthedocs.org/en/latest/.

  44. http://www.graphviz.org/.

  45. http://www.sfb632.uni-potsdam.de/annis/.

  46. The ANNIS implementation for accessing MASC annotations is available from http://www.anc.org/software/annis.

  47. http://nltk.org.

  48. http://ifarm.nl/signll/conll/.

  49. Note that GrAF is a “true” standoff format, as opposed to the hybrid standoff formats described in chapter “Designing Annotation Schemes: From Model to Representation” in this volume.
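The “true” standoff design mentioned in note 49 means the primary text is never modified: annotations live in separate graph files whose regions point into the text by character offsets. The sketch below illustrates the idea in Python using a deliberately simplified, hypothetical GrAF-like XML fragment (the element names and the `resolve` helper are illustrative, not the actual ISO 24612 / GrAF schema or the graf-python API):

```python
# Illustration of standoff annotation: the primary text stays untouched,
# and a separate annotation graph references it by character offsets.
# The XML below is a simplified, hypothetical GrAF-like fragment.
import xml.etree.ElementTree as ET

PRIMARY_TEXT = "MASC is fully open."

STANDOFF_XML = """
<graph>
  <region id="r1" anchors="0 4"/>
  <region id="r2" anchors="14 18"/>
  <node id="n1" region="r1"><a label="tok" pos="NNP"/></node>
  <node id="n2" region="r2"><a label="tok" pos="JJ"/></node>
</graph>
"""

def resolve(text, xml_source):
    """Map each annotation node back to the text span its region covers."""
    root = ET.fromstring(xml_source)
    # Regions carry (start, end) character offsets into the primary text.
    regions = {r.get("id"): tuple(map(int, r.get("anchors").split()))
               for r in root.findall("region")}
    spans = []
    for node in root.findall("node"):
        start, end = regions[node.get("region")]
        pos = node.find("a").get("pos")
        spans.append((text[start:end], pos))
    return spans

print(resolve(PRIMARY_TEXT, STANDOFF_XML))
# [('MASC', 'NNP'), ('open', 'JJ')]
```

Because every annotation layer is a separate graph over the same immutable text, layers produced by different tools (or with conflicting tokenizations) can coexist without rewriting the source document.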

References

  1. Baker, C.F., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Proceedings of the Third Linguistic Annotation Workshop, pp. 125–129. Association for Computational Linguistics, Suntec, Singapore (2009). http://www.aclweb.org/anthology/W/W09/W09-3021

  2. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of the 17th International Conference on Computational Linguistics, vol. 1, pp. 86–90. Association for Computational Linguistics, Stroudsburg, PA, USA (1998)

  3. Blumtritt, J., Bouda, P., Rau, F.: Poio API and GraF-XML: a radical stand-off approach in language documentation and language typology. In: Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies, vol. 10, Montreal, Canada (2013). doi:10.4242/BalisageVol10.Bouda01

  4. Chiarcos, C., Hellmann, S., Nordhoff, S.: Linking linguistic resources: examples from the Open Linguistics Working Group. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 201–216. Springer, Heidelberg (2012)

  5. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely Tokens... Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)

  6. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of ACL’02 (2002)

  7. Dridan, R., Oepen, S.: Tokenization: returning to a long solved problem – a survey, contrastive experiment, recommendations, and toolkit. In: ACL (2), pp. 378–382. The Association for Computational Linguistics (2012)

  8. Fellbaum, C., Baker, C.: Aligning verbs in WordNet and FrameNet. Linguistics (to appear)

  9. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Lang. Eng. 10(3–4), 327–348 (2004). doi:10.1017/S1351324904003523

  10. Fillmore, C.J., Jurafsky, D., Ide, N., Macleod, C.: An American National Corpus: a proposal. In: Proceedings of the First Annual Conference on Language Resources and Evaluation, pp. 965–969. European Language Resources Association, Paris (1998)

  11. Fokkens, A., van Erp, M., Postma, M., Pedersen, T., Vossen, P., Freire, N.: Offspring from reproduction problems: what replication failure teaches us. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 1691–1701. Association for Computational Linguistics, Sofia, Bulgaria (2013)

  12. Ide, N.: An open linguistic infrastructure for annotated corpora. In: Gurevych, I., Kim, J. (eds.) The People’s Web Meets NLP: Collaboratively Constructed Language Resources, pp. 263–284. Springer, Heidelberg (2013)

  13. Ide, N., Romary, L.: International standard for a linguistic annotation framework. Natural Lang. Eng. 10(3–4), 211–225 (2004). doi:10.1017/S135132490400350X

  14. Ide, N., Romary, L.: Representing linguistic corpora and their annotations. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006) (2006)

  15. Ide, N., Suderman, K.: Integrating linguistic resources: the American National Corpus model. In: Proceedings of the Fifth Language Resources and Evaluation Conference (LREC). Genoa, Italy (2006)

  16. Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop, pp. 1–8. Association for Computational Linguistics, Prague, Czech Republic (2007). http://www.aclweb.org/anthology/W/W07/W07-1501

  17. Ide, N., Suderman, K.: The Linguistic Annotation Framework: a standard for annotation interchange and merging. Language Resources and Evaluation (2014)

  18. Ide, N., Bonhomme, P., Romary, L.: XCES: an XML-based encoding standard for linguistic corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference. European Language Resources Association, Paris (2000)

  19. Ide, N., Reppen, R., Suderman, K.: The American National Corpus: more than the web can provide. In: Proceedings of the Third Language Resources and Evaluation Conference, pp. 839–844. Las Palmas (2002)

  20. Ide, N., Suderman, K., Simms, B.: ANC2Go: a web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC). European Language Resources Association, Valletta, Malta (2010)

  21. ISO: Language Resource Management – Linguistic Annotation Framework. ISO 24612 (2012)

  22. Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1) (2007)

  23. Kremer, G., Erk, K., Padó, S., Thater, S.: What substitutes tell us – analysis of an “all-words” lexical substitution corpus. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden (2014)

  24. Macleod, C., Grishman, R., Meyers, A., Barrett, L., Reeves, R.: NOMLEX: a lexicon of nominalizations. Proc. Euralex 98, 187–193 (1998)

  25. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: description and construction of text structures. In: Kempen, G. (ed.) Natural Language Generation: New Results in Artificial Intelligence, Psychology, and Linguistics, pp. 85–95. Nijhoff, Dordrecht (1987)

  26. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)

  27. Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland (2014)

  28. Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012)

  29. Neumann, A., Ide, N., Stede, M.: Importing MASC into the ANNIS linguistic database: a case study of mapping GrAF. In: Proceedings of the Seventh Linguistic Annotation Workshop (LAW), pp. 98–102. Sofia, Bulgaria (2013)

  30. Pradhan, S.S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the International Conference on Semantic Computing, pp. 517–526. IEEE Computer Society, Washington, DC, USA (2007). http://dx.doi.org/10.1109/ICSC.2007.67


Author information

Corresponding author

Correspondence to Nancy Ide.


Copyright information

© 2017 Springer Science+Business Media Dordrecht

Cite this chapter

Ide, N. (2017). Case Study: The Manually Annotated Sub-Corpus. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_19

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_19

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2
