Principles and Rules of Part-of-Speech Annotation

Dash, Niladri Sekhar

doi:10.1007/978-981-16-2960-0_2

Niladri Sekhar Dash

Abstract

In this chapter, we describe the principles and rules that we apply when annotating a written text corpus at the part-of-speech (POS) level. For both practical and theoretical reasons, these principles and rules help to overcome many of the hurdles that arise in POS annotation. We also describe the strategies that we apply when formulating algorithms for the automatic assignment of grammatical values to words in a text. Although we address theoretical and practical issues of POS annotation, we do not discuss the technical and computational issues that become important when a computer system is used for this kind of work. Since the target readers of this chapter are students of linguistics with limited or no exposure to computation, these technical and computational issues are kept outside the scope of this chapter. We focus primarily on the linguistic and theoretical aspects of the problem, and we keep the discussion relatively simple, as an area of general inquiry, so that non-experts and general readers can understand how words in a text are annotated at the POS level and how the principles and rules are to be used during this work. The chapter has two parts: in Part I, we propose some principles to be followed during POS annotation; in Part II, we define some rules to be adopted during the actual work of POS annotation. Supporting examples are taken from English and Bengali text corpora. The chapter gives new scholars the necessary guidance on the nature and characteristics of POS annotation and shows how proper reference to these principles and rules can simplify subsequent text-processing work.

Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Dr. Niladri Sekhar Dash

Corresponding author

Correspondence to Niladri Sekhar Dash.

About this chapter

Cite this chapter

Dash, N.S. (2021). Principles and Rules of Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_2

DOI: https://doi.org/10.1007/978-981-16-2960-0_2
Published: 08 July 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)