Abstract
In this chapter, we describe the principles and rules that we apply during part-of-speech (POS) annotation of a written text corpus. For various practical and theoretical reasons, these principles and rules help annotators overcome many of the hurdles that arise in POS annotation. We also describe the strategies that we apply when we formulate algorithms for the automatic assignment of grammatical values to words in a text. Although we address theoretical and practical issues of POS annotation, we do not discuss the technical and computational issues that become important when a computer system is engaged for this kind of work. Since the target readers of this chapter are students of linguistics who have limited or no exposure to computation, we keep these technical and computational issues outside the scope of this chapter and focus primarily on the linguistic and theoretical aspects of the problem. We keep the discussion relatively simple, as an area of general inquiry, so that non-experts and general readers can understand how words in a text are annotated at the POS level and how the principles and rules are to be used during this work. This chapter has two parts: in Part I, we propose some principles that are to be followed during POS annotation; in Part II, we define some rules that are to be adopted during the actual work of POS annotation. Supporting examples are taken from English and Bengali text corpora. The chapter offers guidance to new scholars on the nature and characteristics of POS annotation and shows how proper reference to these principles and rules can simplify subsequent text processing.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S. (2021). Principles and Rules of Part-of-Speech Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_2
DOI: https://doi.org/10.1007/978-981-16-2960-0_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)