Empirical Evaluation of Semi-automated XML Annotation of Text Documents with the GoldenGATE Editor

  • Guido Sautter
  • Klemens Böhm
  • Frank Padberg
  • Walter Tichy
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4675)

Abstract

Digitized scientific documents should be marked up according to domain-specific XML schemas to make maximum use of their content. Such markup allows for advanced, semantics-based access to the document collection. Many NLP applications have been developed to support automated annotation. However, NLP results are often not accurate enough, so manual corrections are indispensable. We have therefore developed the GoldenGATE editor, a tool that integrates NLP applications and assistance features for manual XML editing. Plain XML editors do not offer such tight integration: users have to create the markup manually or move documents back and forth between the editor and (mostly command-line) NLP tools. This paper presents the first empirical evaluation of how users benefit from such tight integration when creating semantically rich digital libraries. We conducted experiments in which human participants performed markup tasks on a document collection from a generic domain. The results clearly show that markup editing assistance in tight combination with NLP functionality significantly reduces the user effort in annotating documents.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Guido Sautter (1)
  • Klemens Böhm (1)
  • Frank Padberg (1)
  • Walter Tichy (1)
  1. Department of Computer Science, Universität Karlsruhe (TH), Am Fasanengarten 5, 76128 Karlsruhe, Germany