Parsing Biomedical Literature

  • Matthew Lease
  • Eugene Charniak
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3651)

Abstract

We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources (part-of-speech tags, dictionary collocations, and named entities) may be leveraged to augment PTB training. Using a state-of-the-art statistical parser [3] as our baseline, our lexically adapted parser achieves a 14.2% reduction in error. With oracle knowledge of named entities, this error reduction improves to 21.2%.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Matthew Lease (1)
  • Eugene Charniak (1)
  1. Brown Laboratory for Linguistic Information Processing (BLLIP), Brown University, Providence, USA
