Abstract
The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Taylor, A., Marcus, M., Santorini, B. (2003). The Penn Treebank: An Overview. In: Abeillé, A. (ed.) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_1
DOI: https://doi.org/10.1007/978-94-010-0201-1_1
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1335-5
Online ISBN: 978-94-010-0201-1
eBook Packages: Springer Book Archive