Abstract
Robust natural language analysis systems must be able to handle words that are not in the lexicon. This paper describes a statistical model that predicts the most likely parts of speech for previously unseen words. The method uses a loglinear model to combine a number of orthographic and morphological features, and returns a probability distribution over the open word classes. The model is combined with a stochastic part-of-speech tagger, which supplies a model of context. Empirical evaluation shows that the combined approach yields significant gains in part-of-speech prediction accuracy over simpler methods.
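To make the idea of a loglinear combination of word-form features concrete, the following is a minimal sketch and not the chapter's actual model. It assumes a conditional (softmax) form of loglinear model, which may differ from the formulation used in the chapter. The feature set, the class list `OPEN_CLASSES`, the function names, and every value in `WEIGHTS` are hypothetical and chosen only for illustration; in practice the weights would be estimated from a tagged corpus.

```python
import math

# Assumed set of open word classes (illustrative, not taken from the chapter).
OPEN_CLASSES = ["noun", "verb", "adjective", "adverb"]


def extract_features(word):
    """Return the active binary orthographic/morphological features of a word."""
    feats = {"bias"}
    if word[:1].isupper():
        feats.add("capitalized")
    if "-" in word:
        feats.add("hyphenated")
    if any(ch.isdigit() for ch in word):
        feats.add("has_digit")
    for suffix in ("ing", "ed", "ly", "ness", "able"):
        if word.lower().endswith(suffix):
            feats.add("suffix=" + suffix)
    return feats


# Hypothetical (feature, class) weights; a real system would fit these to data.
WEIGHTS = {
    ("bias", "noun"): 0.5,
    ("capitalized", "noun"): 1.5,
    ("hyphenated", "adjective"): 0.8,
    ("suffix=ing", "verb"): 1.2,
    ("suffix=ed", "verb"): 1.5,
    ("suffix=ly", "adverb"): 2.0,
    ("suffix=ness", "noun"): 2.0,
    ("suffix=able", "adjective"): 1.8,
}


def predict(word):
    """Return a probability distribution over OPEN_CLASSES for an unseen word."""
    feats = extract_features(word)
    # Loglinear score: sum of the weights of the active features for each class.
    scores = {c: sum(WEIGHTS.get((f, c), 0.0) for f in feats) for c in OPEN_CLASSES}
    # Normalize with a numerically stable softmax.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


if __name__ == "__main__":
    for w in ("quickly", "Frobness", "well-known"):
        dist = predict(w)
        print(w, {c: round(p, 3) for c, p in dist.items()})
```

In a combined system of the kind the abstract describes, a distribution like the one returned by `predict` would stand in for the lexical probabilities of an unknown word, with the stochastic tagger contributing the contextual evidence.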
Copyright information
© 1996 Springer-Verlag New York, Inc.
About this chapter
Cite this chapter
Franz, A. (1996). A Model for Part-of-Speech Prediction. In: Fisher, D., Lenz, HJ. (eds) Learning from Data. Lecture Notes in Statistics, vol 112. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2404-4_40
DOI: https://doi.org/10.1007/978-1-4612-2404-4_40
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-94736-5
Online ISBN: 978-1-4612-2404-4
eBook Packages: Springer Book Archive