Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 6608)

Abstract

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or from better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained, that is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. It may not be possible to adequately capture the status of some words by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
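The gap between the token-level and sentence-level figures quoted above follows from how the two metrics are defined: a sentence counts as correct only if every one of its tokens is tagged correctly. The following is a minimal sketch of both metrics, not the paper's evaluation code; the function name `tagging_accuracy` and the toy tag sequences are assumptions made for illustration.

```python
from typing import List, Tuple


def tagging_accuracy(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float]:
    """Return (token accuracy, sentence accuracy) for parallel tag sequences.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")

    total_tokens = 0
    correct_tokens = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence must have one predicted tag per token")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        # A sentence is correct only when all of its tokens are correct.
        if matches == len(gold_tags):
            correct_sentences += 1

    if total_tokens == 0:
        raise ValueError("cannot compute accuracy on empty input")
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Toy data invented for this example (Penn Treebank style tags).
gold = [["DT", "NN", "VBZ"], ["PRP", "VBD", "DT", "NN"], ["NNS", "VBP"]]
pred = [["DT", "NN", "VBZ"], ["PRP", "VBN", "DT", "NN"], ["NNS", "VBP"]]

token_acc, sent_acc = tagging_accuracy(gold, pred)
print(f"token accuracy: {token_acc:.3f}, sentence accuracy: {sent_acc:.3f}")
```

As a rough sanity check on the quoted figures, under the simplifying assumption that errors are independent across tokens (which real tagger errors are not), a 97.3% per-token accuracy gives about 0.973^21 ≈ 0.56 for a 21-token sentence, so a single metric in the high nineties can coexist with barely half of all sentences being fully correct.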

Keywords

  • Noun Phrase
  • Proper Noun
  • Unknown Word
  • Past Participle
  • Participle Clause

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Manning, C.D. (2011). Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?. In: Gelbukh, A.F. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19400-9_14

  • DOI: https://doi.org/10.1007/978-3-642-19400-9_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19399-6

  • Online ISBN: 978-3-642-19400-9

  • eBook Packages: Computer Science, Computer Science (R0)