Parsing Biomedical Literature

  • Matthew Lease
  • Eugene Charniak
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3651)


We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB training: part-of-speech tags, dictionary collocations, and named entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically adapted parser achieves a 14.2% reduction in error. With oracle knowledge of named entities, this error reduction improves to 21.2%.
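
The error reductions quoted above are relative figures: the share of the baseline parser's error that the adapted parser removes. A minimal sketch of that arithmetic follows, assuming the underlying score is an accuracy-style measure in [0, 1] (for example a bracketing F-measure). The function name and the input scores are hypothetical, chosen only for illustration; they are not the figures reported in the paper.

```python
def relative_error_reduction(baseline_score: float, adapted_score: float) -> float:
    """Return the fraction of the baseline's error removed by the adapted system.

    Both arguments are accuracy-style scores in [0, 1]; error is 1 - score.
    """
    baseline_error = 1.0 - baseline_score
    adapted_error = 1.0 - adapted_score
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - adapted_error) / baseline_error


# Hypothetical scores used only to illustrate the calculation.
reduction = relative_error_reduction(0.80, 0.85)
print(f"{reduction:.1%}")
```

Under this definition, a small absolute gain in score can correspond to a much larger relative reduction in error when the baseline is already accurate.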




  1. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics (Supplement: Eleventh International Conference on Intelligent Systems for Molecular Biology) 19, i180–i182 (2003)
  2. Tateisi, Y., Ohta, T., Kim, J.D., Hong, H., Jian, S., Tsujii, J.: The GENIA corpus: MEDLINE abstracts annotated with linguistic information. In: Third meeting of SIG on Text Mining, Intelligent Systems for Molecular Biology, ISMB (2003)
  3. Charniak, E.: A maximum-entropy-inspired parser. In: Proc. NAACL, pp. 132–139 (2000)
  4. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 313–330 (1993)
  5. Collins, M.: Discriminative reranking for natural language parsing. In: Proc. ICML, pp. 175–182 (2000)
  6. Ratnaparkhi, A.: Learning to parse natural language with maximum entropy models. Machine Learning 34, 151–175 (1999)
  7. Gildea, D.: Corpus variation and parser performance. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 167–202 (2001)
  8. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel domains. In: Proceedings of HLT-NAACL, pp. 205–212 (2003)
  9. Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., Hockenmaier, J., Ruhlen, P., Baker, S., Crim, J.: Example selection for bootstrapping statistical parsers. In: Proceedings of HLT-NAACL, pp. 331–338 (2003)
  10. de Bruijn, B., Martin, J.: Literature mining in molecular biology. In: Proceedings of the European Federation for Medical Informatics (EFMI) Workshop on Natural Language Processing in Biomedical Applications (2002)
  11. Hirschman, L., Park, J., Tsujii, J., Wong, L., Wu, C.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18, 1553–1561 (2002)
  12. Yakushiji, A., Tateisi, Y., Miyao, Y., Tsujii, J.: Event extraction from biomedical papers using a full parser. In: Pacific Symposium on Biocomputing, pp. 408–419 (2001)
  13. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., Mazo, I.: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20, 604–611 (2004)
  14. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)
  15. Hwa, R.: Learning Probabilistic Lexicalized Grammars for Natural Language Processing. PhD thesis, Harvard University (2001)
  16. Bies, A., Ferguson, M., Katz, K., MacIntyre, R.: Bracketing Guidelines for Treebank II Style Penn Treebank Project. Linguistic Data Consortium (1995)
  17. Buckley, C.: Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University (1985)
  18. Goodman, J.: Parsing inside-out. PhD thesis, Harvard University (1998)
  19. McCray, A.T., Srinivasan, S., Browne, A.C.: Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC), pp. 235–239 (1994)
  20. Grover, C., Lapata, M., Lascarides, A.: A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering (2002)
  21. Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P.: Using predicate-argument structures for information extraction. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pp. 8–15 (2003)
  22. Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In: Proc. of IJCNLP-2004, pp. 684–693 (2004)
  23. Zhou, G., Su, J.: Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA-2004 (2004)
  24. Charniak, E.: Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park (1997)
  25. Park, J.C.: Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems 16, 62–67 (2001)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Matthew Lease
    • 1
  • Eugene Charniak
    • 1
  1. Brown Laboratory for Linguistic Information Processing (BLLIP), Brown University, Providence, USA
