Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain

  • Guido Sautter
  • Klemens Böhm
  • Donat Agosti
  • Christiana Klingenberg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5554)

Abstract

Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world described. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand, and manual cleaning and correction on the other, should be tightly interleaved, and that tools supporting this integration yield a significant improvement.

Keywords

Semantic XML Markup, Digital Resources, RDF Generation

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Guido Sautter (1)
  • Klemens Böhm (1)
  • Donat Agosti (2)
  • Christiana Klingenberg (3)

  1. Universität Karlsruhe (TH), Karlsruhe
  2. Am. Mus. of Nat. Hist., New York
  3. Staatliches Museum für Naturkunde, Karlsruhe, Germany
