Abstract
In this chapter, we describe the lemmatization of inflected nouns in Bengali as a part of lexical processing. Inflected nouns occur with very high frequency in Bengali texts. We first collect a large number of inflected nouns from a Bengali corpus and compile a noun database. We then apply lemmatization to separate inflections from nominal bases. Lemmatization proceeds through several intermediate stages, each governed by grammatical mapping rules (GMRs) that isolate inflections from nominal bases. The GMRs are first designed manually, after analyzing a large set of inflected nouns to collect the necessary data and information. At subsequent stages, the GMRs are converted into a machine-readable format so that the lemmatizer can separate inflections from inflected nouns with minimal human intervention. This strategy has proved largely successful: most of the inflected Bengali nouns stored in the noun database are correctly lemmatized. The multilayered process also generates an exhaustive list of nominal inflections and a large list of lemmatized nouns. At a subsequent stage, the nouns are semantically classified for use in translation, dictionary compilation, lexical decomposition, and language teaching. We have also applied this method to lemmatize inflected pronouns and adjectives, which follow a similar pattern of inflection and affixation in Bengali.
Appendix
Bengali nouns formed with or without formative elements.
Type 1: Noun without Word-Formative Element (WFE)
(a) Noun + Ø marker (1)
ghar [ghar-Ø] “house”
Type 2: Noun with One Word-Formative Element (WFE)
(b) Noun + Emphatic Particle (2)
ghari [ghar-i] “house indeed”
gharo [ghar-o] “house also”
(c) Noun + Singular Marker (5)
gharṭi [ghar-ṭi] “the house”
gharṭā [ghar-ṭā] “the house”
gharkhānā [ghar-khānā] “the house”
gharkhāni [ghar-khāni] “the house”
gharṭuku [ghar-ṭuku] “the house”
(d) Noun + Plural Marker (3)
gharguli [ghar-guli] “houses”
ghargulo [ghar-gulo] “houses”
ghargulā [ghar-gulā] “houses”
(e) Noun + Case Marker (4)
ghare [ghar-e] “in house”
gharer [ghar-er] “of house”
gharke [ghar-ke] “to house”
gharete [ghar-ete] “in house”
Type 3: Noun with Two Word-Formative Elements (WFEs)
(a) Noun + Singular + Particle (10)
gharṭii [ghar-ṭi-i] “the house indeed”
gharṭio [ghar-ṭi-o] “the house also”
gharṭāi [ghar-ṭā-i] “the house indeed”
gharṭāo [ghar-ṭā-o] “the house also”
gharkhānāi [ghar-khānā-i] “the house indeed”
gharkhānāo [ghar-khānā-o] “the house also”
gharkhānii [ghar-khāni-i] “the house indeed”
gharkhānio [ghar-khāni-o] “the house also”
gharṭukui [ghar-ṭuku-i] “the house indeed”
gharṭukuo [ghar-ṭuku-o] “the house also”
(b) Noun + Plural + Particle (6)
ghargulii [ghar-guli-i] “houses indeed”
ghargulio [ghar-guli-o] “houses also”
gharguloi [ghar-gulo-i] “houses indeed”
gharguloo [ghar-gulo-o] “houses also”
ghargulāi [ghar-gulā-i] “houses indeed”
ghargulāo [ghar-gulā-o] “houses also”
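The counts given for the two groups above follow from combining each number marker with each emphatic particle (5 × 2 = 10 singular forms, 3 × 2 = 6 plural forms). The following is a minimal sketch of that enumeration, assuming plain concatenation of the romanized markers; the names `SINGULAR`, `PLURAL`, `PARTICLES`, and `combine` are hypothetical and are not taken from the chapter.

```python
from itertools import product

# Marker inventories read off the romanized appendix forms (illustrative only).
SINGULAR = ("ṭi", "ṭā", "khānā", "khāni", "ṭuku")
PLURAL = ("guli", "gulo", "gulā")
PARTICLES = ("i", "o")  # emphatic "indeed" and inclusive "also"

def combine(base, markers, particles):
    """Return (surface form, segmentation) pairs for base + marker + particle."""
    return [(base + m + p, f"{base}-{m}-{p}") for m, p in product(markers, particles)]

singular_forms = combine("ghar", SINGULAR, PARTICLES)
plural_forms = combine("ghar", PLURAL, PARTICLES)

print(len(singular_forms), len(plural_forms))  # 10 6
print(singular_forms[0])                       # ('gharṭii', 'ghar-ṭi-i')
```

The sketch ignores any phonological adjustment at morpheme boundaries, which the romanized forms listed here do not appear to require.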
(c) Noun + Singular + Case (13)
gharṭike [ghar-ṭi-ke] “to the house”
gharṭite [ghar-ṭi-te] “in the house”
gharṭir [ghar-ṭi-r] “of the house”
gharṭāke [ghar-ṭā-ke] “to the house”
gharṭāte [ghar-ṭā-te] “in the house”
gharṭār [ghar-ṭā-r] “of the house”
gharṭāy [ghar-ṭā-y] “in the house”
gharkhānāke [ghar-khānā-ke] “to the house”
gharkhānāte [ghar-khānā-te] “in the house”
gharkhānike [ghar-khāni-ke] “to the house”
gharkhānite [ghar-khāni-te] “in the house”
gharṭukuke [ghar-ṭuku-ke] “to the house”
gharṭukute [ghar-ṭuku-te] “in the house”
(d) Noun + Plural + Case (11)
ghargulike [ghar-guli-ke] “to the houses”
ghargulite [ghar-guli-te] “in the houses”
ghargulir [ghar-guli-r] “of the houses”
gharguloke [ghar-gulo-ke] “to the houses”
ghargulote [ghar-gulo-te] “in the houses”
ghargulor [ghar-gulo-r] “of the houses”
gharguloy [ghar-gulo-y] “in the houses”
ghargulāke [ghar-gulā-ke] “to the houses”
ghargulāte [ghar-gulā-te] “in the houses”
ghargulār [ghar-gulā-r] “of the houses”
ghargulāy [ghar-gulā-y] “in the houses”
(e) Noun + Case + Particle (6)
gharei [ghar-e-i] “in the house indeed”
ghareo [ghar-e-o] “in the house also”
gharkei [ghar-ke-i] “to the house indeed”
gharkeo [ghar-ke-o] “to the house also”
gharetei [ghar-ete-i] “in the house indeed”
ghareteo [ghar-ete-o] “in the house also”
Type 4: Noun with Three Word-Formative Elements (WFEs)
(a) Noun + Singular + Case + Particle (28)
gharṭikei [ghar-ṭi-ke-i] “to the house indeed”
gharṭikeo [ghar-ṭi-ke-o] “to the house also”
gharṭitei [ghar-ṭi-te-i] “in the house indeed”
gharṭiteo [ghar-ṭi-te-o] “in the house also”
gharṭākei [ghar-ṭā-ke-i] “to the house indeed”
gharṭākeo [ghar-ṭā-ke-o] “to the house also”
gharṭātei [ghar-ṭā-te-i] “in the house indeed”
gharṭāteo [ghar-ṭā-te-o] “in the house also”
gharṭiri [ghar-ṭi-r-i] “of the house indeed”
gharṭiro [ghar-ṭi-r-o] “of the house also”
gharṭāri [ghar-ṭā-r-i] “of the house indeed”
gharṭāro [ghar-ṭā-r-o] “of the house also”
gharṭāyi [ghar-ṭā-y-i] “in the house indeed”
gharṭāyo [ghar-ṭā-y-o] “in the house also”
gharkhānākei [ghar-khānā-ke-i] “to the house indeed”
gharkhānākeo [ghar-khānā-ke-o] “to the house also”
gharkhānātei [ghar-khānā-te-i] “in the house indeed”
gharkhānāteo [ghar-khānā-te-o] “in the house also”
gharkhānāyi [ghar-khānā-y-i] “in the house indeed”
gharkhānāyo [ghar-khānā-y-o] “in the house also”
gharkhānikei [ghar-khāni-ke-i] “to the house indeed”
gharkhānikeo [ghar-khāni-ke-o] “to the house also”
gharkhānitei [ghar-khāni-te-i] “in the house indeed”
gharkhāniteo [ghar-khāni-te-o] “in the house also”
gharṭukukei [ghar-ṭuku-ke-i] “to the house indeed”
gharṭukukeo [ghar-ṭuku-ke-o] “to the house also”
gharṭukutei [ghar-ṭuku-te-i] “in the house indeed”
gharṭukuteo [ghar-ṭuku-te-o] “in the house also”
(b) Noun + Plural + Case + Particle (22)
ghargulikei [ghar-guli-ke-i] “to the houses indeed”
ghargulikeo [ghar-guli-ke-o] “to the houses also”
ghargulitei [ghar-guli-te-i] “in the houses indeed”
gharguliteo [ghar-guli-te-o] “in the houses also”
ghargulokei [ghar-gulo-ke-i] “to the houses indeed”
ghargulokeo [ghar-gulo-ke-o] “to the houses also”
ghargulotei [ghar-gulo-te-i] “in the houses indeed”
gharguloteo [ghar-gulo-te-o] “in the houses also”
ghargulākei [ghar-gulā-ke-i] “to the houses indeed”
ghargulākeo [ghar-gulā-ke-o] “to the houses also”
ghargulātei [ghar-gulā-te-i] “in the houses indeed”
ghargulāteo [ghar-gulā-te-o] “in the houses also”
gharguliri [ghar-guli-r-i] “of the houses indeed”
gharguliro [ghar-guli-r-o] “of the houses also”
ghargulori [ghar-gulo-r-i] “of the houses indeed”
gharguloro [ghar-gulo-r-o] “of the houses also”
ghargulāri [ghar-gulā-r-i] “of the houses indeed”
ghargulāro [ghar-gulā-r-o] “of the houses also”
gharguloyi [ghar-gulo-y-i] “in the houses indeed”
gharguloyo [ghar-gulo-y-o] “in the houses also”
ghargulāyi [ghar-gulā-y-i] “in the houses indeed”
ghargulāyo [ghar-gulā-y-o] “in the houses also”
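The layered structure of the forms in this appendix (base, then an optional number marker, then an optional case marker, then an optional particle) suggests how a rule-driven lemmatizer might recover the nominal base. The sketch below is an illustration under stated assumptions and is not the chapter's GMR implementation: the suffix inventories are read off the romanized forms listed here, the `lexicon` argument stands in for the noun database, and the function names `parse_suffixes` and `lemmatize` are hypothetical.

```python
import unicodedata
from itertools import product

def nfc(text):
    """Normalize to NFC so that precomposed and combining diacritics compare equal."""
    return unicodedata.normalize("NFC", text)

# Suffix inventories read off the romanized appendix forms (illustrative only).
NUMBER = tuple(map(nfc, ("ṭi", "ṭā", "khānā", "khāni", "ṭuku", "guli", "gulo", "gulā")))
CASE = ("e", "er", "ke", "ete", "te", "r", "y")
PARTICLE = ("i", "o")

def parse_suffixes(rest):
    """Split `rest` into [number][case][particle]; return None if it does not fit."""
    for number, case, particle in product(NUMBER + ("",), CASE + ("",), PARTICLE + ("",)):
        if number + case + particle == rest:
            return [marker for marker in (number, case, particle) if marker]
    return None

def lemmatize(word, lexicon):
    """Return (base, suffix list) for a known base, or None if no analysis is found."""
    word = nfc(word)
    # Try longer bases first so that a short base does not shadow a longer one.
    for base in sorted(map(nfc, lexicon), key=len, reverse=True):
        if word.startswith(base):
            suffixes = parse_suffixes(word[len(base):])
            if suffixes is not None:
                return base, suffixes
    return None

print(lemmatize("gharṭikei", {"ghar"}))   # ('ghar', ['ṭi', 'ke', 'i'])
print(lemmatize("gharguloyo", {"ghar"}))  # ('ghar', ['gulo', 'y', 'o'])
```

This sketch over-generates: it accepts any number, case, and particle combination, including ones that do not appear in the lists above (for example, -er after a vowel-final marker). A fuller rule set would restrict which case allomorph may follow which marker.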
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S. (2021). Lemmatization of Inflected Nouns. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_8
DOI: https://doi.org/10.1007/978-981-16-2960-0_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education (R0)