Lemmatization of Inflected Nouns

Language Corpora Annotation and Processing

Abstract

In this chapter, we describe the lemmatization of inflected nouns in Bengali as a part of lexical processing. Inflected nouns occur with very high frequency in Bengali texts. We first collect a large number of inflected nouns from a Bengali corpus and compile a noun database. We then apply lemmatization to separate inflections from nominal bases. Lemmatization proceeds through several intermediate stages, each governed by grammatical mapping rules (GMRs) that isolate inflections from nominal bases. The GMRs are first designed manually, after analyzing a large set of inflected nouns to collect the necessary data and information. At subsequent stages, the GMRs are converted into a machine-readable format so that the lemmatizer can separate inflections from inflected nouns with minimal human intervention. This strategy has proved largely successful, in the sense that most of the inflected Bengali nouns stored in the noun database are correctly lemmatized. This multilayered process also generates an exhaustive list of nominal inflections and a large list of lemmatized nouns. At a later stage, the nouns are semantically classified for use in translation, dictionary compilation, lexical decomposition, and language teaching. We have also applied this method to lemmatize inflected pronouns and adjectives, which follow a similar pattern of inflection and affixation in Bengali.
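
The suffix-separation step described in the abstract can be pictured with a minimal Python sketch. It is not the author's lemmatizer and does not reproduce the GMR format: the flat suffix list (read off the forms in this chapter's appendix), the one-word lexicon, and the function name `lemmatize` are all assumptions made for illustration. The sketch strips one suffix at a time from the right and accepts an analysis only if it ends in a known base.

```python
# Hypothetical suffix inventory, read off the appendix of this chapter.
# A real rule set would also constrain which suffixes may follow which.
SUFFIX_RULES = [
    "khānā", "khāni", "ṭuku", "ṭi", "ṭā",   # singular markers
    "guli", "gulo", "gulā",                  # plural markers
    "ete", "er", "ke", "te", "e", "r", "y",  # case markers
    "i", "o",                                # emphatic particles
]

# Hypothetical lexicon of known nominal bases (a single entry for the demo).
LEXICON = {"ghar"}


def lemmatize(word, lexicon=LEXICON, rules=SUFFIX_RULES):
    """Return (base, suffixes) if the word reduces to a known base, else None."""
    if word in lexicon:
        return word, []
    # Try longer suffixes first so that "ete" is preferred over "e" or "te".
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            result = lemmatize(word[: -len(suffix)], lexicon, rules)
            if result is not None:
                base, suffixes = result
                return base, suffixes + [suffix]
    return None


print(lemmatize("gharṭikei"))   # ('ghar', ['ṭi', 'ke', 'i'])
print(lemmatize("ghargulāyo"))  # ('ghar', ['gulā', 'y', 'o'])
```

Checking every candidate base against a lexicon is one simple way to avoid over-stripping: a word whose remainder is not a known base returns `None` and can be set aside for manual inspection.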

Appendix

Bengali nouns formed with or without formative elements.

Type 1: Noun Without Word-Formative Element (WFE)

  (a) Noun + Ø marker (1 form)

    ghar  [ghar-Ø]  “house”

Type 2: Noun with One Word-Formative Element (WFE)

  (b) Noun + Emphatic Particle (2 forms)

    ghari  [ghar-i]  “house indeed”
    gharo  [ghar-o]  “house also”

  (c) Noun + Singular Marker (5 forms)

    gharṭi  [ghar-ṭi]  “the house”
    gharṭā  [ghar-ṭā]  “the house”
    gharkhānā  [ghar-khānā]  “the house”
    gharkhāni  [ghar-khāni]  “the house”
    gharṭuku  [ghar-ṭuku]  “the house”

  (d) Noun + Plural Marker (3 forms)

    gharguli  [ghar-guli]  “houses”
    ghargulo  [ghar-gulo]  “houses”
    ghargulā  [ghar-gulā]  “houses”

  (e) Noun + Case Marker (4 forms)

    ghare  [ghar-e]  “in house”
    gharer  [ghar-er]  “of house”
    gharke  [ghar-ke]  “to house”
    gharete  [ghar-ete]  “in house”

Type 3: Noun with Two Word-Formative Elements (WFEs)

  (a) Noun + Singular + Particle (10 forms)

    gharṭii  [ghar-ṭi-i]  “the house indeed”
    gharṭio  [ghar-ṭi-o]  “the house also”
    gharṭāi  [ghar-ṭā-i]  “the house indeed”
    gharṭāo  [ghar-ṭā-o]  “the house also”
    gharkhānāi  [ghar-khānā-i]  “the house indeed”
    gharkhānāo  [ghar-khānā-o]  “the house also”
    gharkhānii  [ghar-khāni-i]  “the house indeed”
    gharkhānio  [ghar-khāni-o]  “the house also”
    gharṭukui  [ghar-ṭuku-i]  “the house indeed”
    gharṭukuo  [ghar-ṭuku-o]  “the house also”

  (b) Noun + Plural + Particle (6 forms)

    ghargulii  [ghar-guli-i]  “houses indeed”
    ghargulio  [ghar-guli-o]  “houses also”
    gharguloi  [ghar-gulo-i]  “houses indeed”
    gharguloo  [ghar-gulo-o]  “houses also”
    ghargulāi  [ghar-gulā-i]  “houses indeed”
    ghargulāo  [ghar-gulā-o]  “houses also”

  (c) Noun + Singular + Case (13 forms)

    gharṭike  [ghar-ṭi-ke]  “to the house”
    gharṭite  [ghar-ṭi-te]  “in the house”
    gharṭir  [ghar-ṭi-r]  “of the house”
    gharṭāke  [ghar-ṭā-ke]  “to the house”
    gharṭāte  [ghar-ṭā-te]  “in the house”
    gharṭār  [ghar-ṭā-r]  “of the house”
    gharṭāy  [ghar-ṭā-y]  “in the house”
    gharkhānāke  [ghar-khānā-ke]  “to the house”
    gharkhānāte  [ghar-khānā-te]  “in the house”
    gharkhānike  [ghar-khāni-ke]  “to the house”
    gharkhānite  [ghar-khāni-te]  “in the house”
    gharṭukuke  [ghar-ṭuku-ke]  “to the house”
    gharṭukute  [ghar-ṭuku-te]  “in the house”

  (d) Noun + Plural + Case (11 forms)

    ghargulike  [ghar-guli-ke]  “to the houses”
    ghargulite  [ghar-guli-te]  “in the houses”
    ghargulir  [ghar-guli-r]  “of the houses”
    gharguloke  [ghar-gulo-ke]  “to the houses”
    ghargulote  [ghar-gulo-te]  “in the houses”
    ghargulor  [ghar-gulo-r]  “of the houses”
    gharguloy  [ghar-gulo-y]  “in the houses”
    ghargulāke  [ghar-gulā-ke]  “to the houses”
    ghargulāte  [ghar-gulā-te]  “in the houses”
    ghargulār  [ghar-gulā-r]  “of the houses”
    ghargulāy  [ghar-gulā-y]  “in the houses”

  (e) Noun + Case + Particle (6 forms)

    gharei  [ghar-e-i]  “in the house indeed”
    ghareo  [ghar-e-o]  “in the house also”
    gharkei  [ghar-ke-i]  “to the house indeed”
    gharkeo  [ghar-ke-o]  “to the house also”
    gharetei  [ghar-ete-i]  “in the house indeed”
    ghareteo  [ghar-ete-o]  “in the house also”
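
The bracketed analyses listed so far keep the formative elements in a fixed order: number marker, then case marker, then emphatic particle. A minimal Python sketch of that slot order follows. It assumes the base is already known, and the names `NUMBER`, `CASE`, `PARTICLE`, `slot`, and `segment` are hypothetical helpers introduced here, not part of the chapter's method.

```python
import re

# Marker inventories read off the listings in this appendix (illustrative).
NUMBER = ["khānā", "khāni", "ṭuku", "guli", "gulo", "gulā", "ṭi", "ṭā"]
CASE = ["ete", "er", "ke", "te", "e", "r", "y"]
PARTICLE = ["i", "o"]


def slot(name, options):
    """Build an optional named group that matches one of the given markers."""
    return "(?P<%s>%s)?" % (name, "|".join(re.escape(o) for o in options))


def segment(word, base):
    """Render a word as [base-number-case-particle], or None if it does not fit."""
    pattern = (re.escape(base) + slot("number", NUMBER)
               + slot("case", CASE) + slot("particle", PARTICLE))
    match = re.fullmatch(pattern, word)
    if match is None:
        return None
    parts = [base]
    for name in ("number", "case", "particle"):
        if match.group(name):
            parts.append(match.group(name))
    if len(parts) == 1:
        parts.append("Ø")  # bare noun, written [ghar-Ø] in the appendix
    return "[" + "-".join(parts) + "]"


print(segment("gharṭike", "ghar"))   # [ghar-ṭi-ke]
print(segment("gharei", "ghar"))     # [ghar-e-i]
print(segment("ghargulio", "ghar"))  # [ghar-guli-o]
```

Because the slots are ordered, a string such as `gharike` (particle before case) is rejected and `segment` returns `None`. The template is deliberately permissive about which case marker follows which number marker, so it accepts some combinations that the listings do not attest.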

Type 4: Noun with Three Word-Formative Elements (WFEs)

  (a) Noun + Singular + Case + Particle (28 forms)

    gharṭikei  [ghar-ṭi-ke-i]  “to the house indeed”
    gharṭikeo  [ghar-ṭi-ke-o]  “to the house also”
    gharṭitei  [ghar-ṭi-te-i]  “in the house indeed”
    gharṭiteo  [ghar-ṭi-te-o]  “in the house also”
    gharṭākei  [ghar-ṭā-ke-i]  “to the house indeed”
    gharṭākeo  [ghar-ṭā-ke-o]  “to the house also”
    gharṭātei  [ghar-ṭā-te-i]  “in the house indeed”
    gharṭāteo  [ghar-ṭā-te-o]  “in the house also”
    gharṭiri  [ghar-ṭi-r-i]  “of the house indeed”
    gharṭiro  [ghar-ṭi-r-o]  “of the house also”
    gharṭāri  [ghar-ṭā-r-i]  “of the house indeed”
    gharṭāro  [ghar-ṭā-r-o]  “of the house also”
    gharṭāyi  [ghar-ṭā-y-i]  “in the house indeed”
    gharṭāyo  [ghar-ṭā-y-o]  “in the house also”
    gharkhānākei  [ghar-khānā-ke-i]  “to the house indeed”
    gharkhānākeo  [ghar-khānā-ke-o]  “to the house also”
    gharkhānātei  [ghar-khānā-te-i]  “in the house indeed”
    gharkhānāteo  [ghar-khānā-te-o]  “in the house also”
    gharkhānāyi  [ghar-khānā-y-i]  “in the house indeed”
    gharkhānāyo  [ghar-khānā-y-o]  “in the house also”
    gharkhānikei  [ghar-khāni-ke-i]  “to the house indeed”
    gharkhānikeo  [ghar-khāni-ke-o]  “to the house also”
    gharkhānitei  [ghar-khāni-te-i]  “in the house indeed”
    gharkhāniteo  [ghar-khāni-te-o]  “in the house also”
    gharṭukukei  [ghar-ṭuku-ke-i]  “to the house indeed”
    gharṭukukeo  [ghar-ṭuku-ke-o]  “to the house also”
    gharṭukutei  [ghar-ṭuku-te-i]  “in the house indeed”
    gharṭukuteo  [ghar-ṭuku-te-o]  “in the house also”

  (b) Noun + Plural + Case + Particle (22 forms)

    ghargulikei  [ghar-guli-ke-i]  “to the houses indeed”
    ghargulikeo  [ghar-guli-ke-o]  “to the houses also”
    ghargulitei  [ghar-guli-te-i]  “in the houses indeed”
    gharguliteo  [ghar-guli-te-o]  “in the houses also”
    ghargulokei  [ghar-gulo-ke-i]  “to the houses indeed”
    ghargulokeo  [ghar-gulo-ke-o]  “to the houses also”
    ghargulotei  [ghar-gulo-te-i]  “in the houses indeed”
    gharguloteo  [ghar-gulo-te-o]  “in the houses also”
    ghargulākei  [ghar-gulā-ke-i]  “to the houses indeed”
    ghargulākeo  [ghar-gulā-ke-o]  “to the houses also”
    ghargulātei  [ghar-gulā-te-i]  “in the houses indeed”
    ghargulāteo  [ghar-gulā-te-o]  “in the houses also”
    gharguliri  [ghar-guli-r-i]  “of the houses indeed”
    gharguliro  [ghar-guli-r-o]  “of the houses also”
    ghargulori  [ghar-gulo-r-i]  “of the houses indeed”
    gharguloro  [ghar-gulo-r-o]  “of the houses also”
    ghargulāri  [ghar-gulā-r-i]  “of the houses indeed”
    ghargulāro  [ghar-gulā-r-o]  “of the houses also”
    gharguloyi  [ghar-gulo-y-i]  “in the houses indeed”
    gharguloyo  [ghar-gulo-y-o]  “in the houses also”
    ghargulāyi  [ghar-gulā-y-i]  “in the houses indeed”
    ghargulāyo  [ghar-gulā-y-o]  “in the houses also”
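
The counts of 28 and 22 given for the two Type 4 patterns can be reproduced by enumeration. The sketch below is illustrative only: the table `CASES_AFTER` records which case markers appear after each number marker in the Type 4 listing above, and it is an assumption read off that listing rather than a stated rule of Bengali grammar. The function name `type4_forms` is likewise hypothetical.

```python
# Case markers that follow each number marker in the Type 4 listing.
CASES_AFTER = {
    "ṭi": ["ke", "te", "r"],
    "ṭā": ["ke", "te", "r", "y"],
    "khānā": ["ke", "te", "y"],
    "khāni": ["ke", "te"],
    "ṭuku": ["ke", "te"],
    "guli": ["ke", "te", "r"],
    "gulo": ["ke", "te", "r", "y"],
    "gulā": ["ke", "te", "r", "y"],
}
SINGULAR = ["ṭi", "ṭā", "khānā", "khāni", "ṭuku"]
PLURAL = ["guli", "gulo", "gulā"]
PARTICLES = ["i", "o"]


def type4_forms(base, markers):
    """Build base + number marker + case marker + particle for each allowed triple."""
    forms = []
    for marker in markers:
        for case in CASES_AFTER[marker]:
            for particle in PARTICLES:
                forms.append(base + marker + case + particle)
    return forms


print(len(type4_forms("ghar", SINGULAR)))  # 28
print(len(type4_forms("ghar", PLURAL)))    # 22
```

Keeping the compatibility table explicit, instead of taking a full product of all markers, is what makes the totals match: a full product of five singular markers, four case allomorphs, and two particles would give 40 forms, not 28.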

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S. (2021). Lemmatization of Inflected Nouns. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_8

  • DOI: https://doi.org/10.1007/978-981-16-2960-0_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2959-4

  • Online ISBN: 978-981-16-2960-0

  • eBook Packages: Education (R0)
