A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality

Zhang, Ziqi; Chapman, Sam; Ciravegna, Fabio

doi:10.1007/978-3-642-16438-5_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6317)

Included in the following conference series:

International Conference on Knowledge Engineering and Knowledge Management

Abstract

Manual document annotation is an essential technique for knowledge acquisition and capture. Creating high-quality annotations is difficult because of inter-annotator discrepancy: annotators can never agree completely on what to annotate and exactly how to annotate it. To address this, traditional document annotation involves multiple domain experts working on the same annotation task in an iterative and collaborative manner to identify and resolve discrepancies progressively. However, this detailed process is often ineffective despite taking significant time and effort, and discrepancies remain high in many cases. This paper proposes an alternative approach to document annotation. The approach first studies annotators' suitability based on the types of information to be annotated; it then identifies and isolates the most inconsistent annotators, who tend to cause the majority of discrepancies in a task; finally, it distributes the annotation workload among the most suitable annotators. In a named entity annotation task in the domain of archaeology, we show that, compared to the traditional approach to document annotation, it produces larger amounts of better-quality annotations that result in higher machine learning accuracy while requiring significantly less time and effort.
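
The three steps named in the abstract can be illustrated with a small sketch. This is not the authors' procedure, which the abstract does not specify. It assumes that agreement is measured as a pairwise F-measure over exact-match annotation spans, that an annotator's suitability for an entity type is their mean agreement with all other annotators on that type, and that the workload for each type goes to the top-scoring annotators. The function names, the `keep` parameter, the entity types `Place` and `Period`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_f1(a, b):
    """Symmetric F-measure between two sets of annotated spans."""
    if not a and not b:
        return 1.0  # both annotators marked nothing: treated as full agreement
    return 2 * len(a & b) / (len(a) + len(b))


def mean_agreement(annotations):
    """Mean pairwise F-measure of each annotator, per entity type.

    annotations: {annotator: {entity_type: set of (doc_id, start, end)}}
    returns:     {entity_type: {annotator: mean F-measure against the others}}
    """
    types = sorted({t for by_type in annotations.values() for t in by_type})
    scores = {t: defaultdict(list) for t in types}
    for x, y in combinations(sorted(annotations), 2):
        for t in types:
            f = pairwise_f1(annotations[x].get(t, set()),
                            annotations[y].get(t, set()))
            scores[t][x].append(f)
            scores[t][y].append(f)
    return {t: {a: sum(v) / len(v) for a, v in by_ann.items()}
            for t, by_ann in scores.items()}


def select_annotators(agreement, keep=2):
    """Keep the `keep` most consistent annotators for each entity type."""
    return {t: sorted(by_ann, key=lambda a: (-by_ann[a], a))[:keep]
            for t, by_ann in agreement.items()}


# Hypothetical pilot data: three annotators mark the same document.
pilot = {
    "A": {"Place": {(1, 0, 5), (1, 10, 15)}, "Period": {(1, 50, 55)}},
    "B": {"Place": {(1, 0, 5), (1, 10, 15)},
          "Period": {(1, 20, 25), (1, 30, 35)}},
    "C": {"Place": {(1, 0, 5)},
          "Period": {(1, 20, 25), (1, 30, 35)}},
}

agreement = mean_agreement(pilot)
selected = select_annotators(agreement, keep=2)
print(selected)
```

In this toy data, annotator A agrees with nobody on `Period` and is left out of that type, while C is the least consistent on `Place`. Scoring per entity type rather than per task is what lets an annotator be isolated for one type and still contribute to another.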

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, UK
Ziqi Zhang & Fabio Ciravegna
K-Now, UK
Sam Chapman & Fabio Ciravegna

Editor information

Editors and Affiliations

Cognitive Interaction Technology Excellence Center (CITEC), Universität Bielefeld, Universitätsstraße 21-23, 33615, Bielefeld, Germany
Philipp Cimiano
IST/DEI, INESC-ID, Rua Alves Redol 9, 1000-029, Lisboa, Portugal
H. Sofia Pinto

About this paper

Cite this paper

Zhang, Z., Chapman, S., Ciravegna, F. (2010). A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality. In: Cimiano, P., Pinto, H.S. (eds) Knowledge Engineering and Management by the Masses. EKAW 2010. Lecture Notes in Computer Science, vol 6317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16438-5_21

DOI: https://doi.org/10.1007/978-3-642-16438-5_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16437-8
Online ISBN: 978-3-642-16438-5
eBook Packages: Computer Science, Computer Science (R0)
