Principles and Rules of Part-of-Speech Annotation

Language Corpora Annotation and Processing

Abstract

In this chapter, we describe the principles and rules that we apply during part-of-speech (POS) annotation of a written text corpus. For practical as well as theoretical reasons, these principles and rules help annotators overcome many of the hurdles that arise in POS annotation. We also describe the strategies that we apply when formulating algorithms for the automatic assignment of grammatical values to words in a text. Although we address theoretical and practical issues of POS annotation, we do not discuss the technical and computational issues that become important when a computer system is engaged for this kind of work. Since the target readers of this chapter are students of linguistics with limited or no exposure to computation, we leave these technical and computational issues outside its scope and focus primarily on the linguistic and theoretical aspects of the problem. We keep the discussion relatively simple, as an area of general inquiry, so that non-experts and general readers can understand how words in a text are annotated at the POS level and how the principles and rules are to be used during this work. The chapter has two parts: in Part I, we propose some principles to be followed during POS annotation; in Part II, we define some rules to be adopted during the actual work of POS annotation. Supporting examples are taken from English and Bengali text corpora. The chapter offers the new generation of scholars guidance on the nature and characteristics of POS annotation, and shows how proper reference to these principles and rules can simplify the subsequent work of text processing.



Author information

Corresponding author

Correspondence to Niladri Sekhar Dash.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S. (2021). Principles and Rules of Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_2

  • DOI: https://doi.org/10.1007/978-981-16-2960-0_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2959-4

  • Online ISBN: 978-981-16-2960-0

  • eBook Packages: Education (R0)
