
A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality

  • Conference paper
Knowledge Engineering and Management by the Masses (EKAW 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6317)

Abstract

Manual document annotation is an essential technique for knowledge acquisition and capture. Creating high-quality annotations is difficult because of inter-annotator discrepancy: annotators can never agree completely on what to annotate and exactly how to annotate it. To address this, traditional document annotation involves multiple domain experts working on the same annotation task in an iterative and collaborative manner to identify and resolve discrepancies progressively. However, such a detailed process is often ineffective despite taking significant time and effort, and discrepancies remain high in many cases. This paper proposes an alternative approach to document annotation. The approach tackles the problem by first studying annotators’ suitability for the types of information to be annotated, then identifying and isolating the most inconsistent annotators, who tend to cause the majority of discrepancies in a task, and finally distributing the annotation workload among the most suitable annotators. Tested on a named entity annotation task in the domain of archaeology, the approach produces larger amounts of better-quality annotations than the traditional approach, resulting in higher machine learning accuracy while requiring significantly less time and effort.
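The abstract does not say how annotator consistency is measured, so the following is only a minimal sketch of the idea of isolating inconsistent annotators per entity type. It assumes exact-match spans and average pairwise F1 as the agreement score; the function names, the toy data, and the entity type labels are all hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pairwise_f1(a, b):
    """F1 overlap between two annotators' sets of (start, end, type) spans."""
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def by_type(annotations, entity_type):
    """Restrict every annotator's spans to a single entity type."""
    return {
        name: {span for span in spans if span[2] == entity_type}
        for name, spans in annotations.items()
    }


def mean_agreement(annotations):
    """Average pairwise F1 of each annotator against all others (needs >= 2 annotators)."""
    scores = {name: [] for name in annotations}
    for x, y in combinations(annotations, 2):
        f1 = pairwise_f1(annotations[x], annotations[y])
        scores[x].append(f1)
        scores[y].append(f1)
    return {name: mean(values) for name, values in scores.items()}


def select_annotators(annotations, keep):
    """Return the `keep` annotators with the highest mean agreement."""
    ranked = sorted(
        mean_agreement(annotations).items(), key=lambda kv: kv[1], reverse=True
    )
    return [name for name, _ in ranked[:keep]]


# Hypothetical toy data: spans as (start, end, entity_type) on one shared document.
annotations = {
    "ann_a": {(0, 5, "SITE"), (10, 14, "PERIOD"), (20, 26, "ARTEFACT")},
    "ann_b": {(0, 5, "SITE"), (10, 14, "PERIOD"), (30, 33, "SITE")},
    "ann_c": {(0, 5, "SITE"), (10, 14, "PERIOD"), (20, 26, "ARTEFACT")},
    "ann_d": {(1, 5, "SITE"), (40, 44, "PERIOD")},
}

# Overall ranking, then a ranking restricted to one entity type.
overall = select_annotators(annotations, keep=2)
site_only = select_annotators(by_type(annotations, "SITE"), keep=3)
```

In this sketch the annotator with the lowest mean agreement is the one left out, and running the selection per entity type stands in for matching annotators to the types of information they annotate most consistently. A chance-corrected measure such as the kappa statistic could be swapped in for F1 without changing the structure.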




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Z., Chapman, S., Ciravegna, F. (2010). A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality. In: Cimiano, P., Pinto, H.S. (eds) Knowledge Engineering and Management by the Masses. EKAW 2010. Lecture Notes in Computer Science, vol 6317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16438-5_21


  • DOI: https://doi.org/10.1007/978-3-642-16438-5_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16437-8

  • Online ISBN: 978-3-642-16438-5

  • eBook Packages: Computer Science, Computer Science (R0)
