Treebanks, pp 5-22

The Penn Treebank: An Overview

  • Ann Taylor
  • Mitchell Marcus
  • Beatrice Santorini
Part of the Text, Speech and Language Technology book series (TLTB, volume 20)

Abstract

The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).

Keywords

English, Annotated Corpus, Part-of-speech Tagging, Treebank, Syntactic Bracketing, Parsing, Disfluencies

Copyright information

© Springer Science+Business Media Dordrecht 2003

Authors and Affiliations

  • Ann Taylor (1)
  • Mitchell Marcus (2)
  • Beatrice Santorini (2)
  1. University of York, Heslington, York, UK
  2. University of Pennsylvania, Philadelphia, USA
