Lemmatization of Inflected Nouns

Dash, Niladri Sekhar

doi:10.1007/978-981-16-2960-0_8


Abstract

In this chapter, we describe a process for lemmatizing inflected nouns in Bengali as a part of lexical processing. Inflected nouns occur at a very high frequency in Bengali texts. We first collect a large number of inflected nouns from a Bengali corpus and compile a noun database. We then apply lemmatization to separate inflections from nominal bases. Lemmatization proceeds through several intermediate stages, each governed by grammatical mapping rules (GMRs) that isolate inflections from nominal bases. The GMRs are first designed manually, after analyzing a large set of inflected nouns to collect the necessary data and information. At subsequent stages, the GMRs are converted into a machine-readable format so that the lemmatizer can separate inflections from inflected nouns with minimal human intervention. This strategy has proved largely successful: most of the inflected Bengali nouns stored in the noun database are correctly lemmatized. The multilayered process also generates an exhaustive list of nominal inflections and a large list of lemmatized nouns. At a later stage, the nouns are semantically classified for use in translation, dictionary compilation, lexical decomposition, and language teaching. We have also applied this method to lemmatize inflected pronouns and adjectives, which follow a similar pattern of inflection and affixation in Bengali.
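The idea of separating inflections from a nominal base can be illustrated with a minimal sketch. It is not the chapter's GMR implementation: the marker lists are taken only from the forms of "ghar" in the Appendix, the slot order (number marker, then case marker, then particle) is read off those forms, and the function name `lemmatize` and the lexicon lookup are assumptions introduced for illustration.

```python
import unicodedata
from itertools import product

# Marker inventories copied from the Appendix examples (not exhaustive).
# The empty string stands for an unfilled slot.
NUMBER_MARKERS = ["", "ṭi", "ṭā", "khānā", "khāni", "ṭuku", "guli", "gulo", "gulā"]
CASE_MARKERS = ["", "e", "er", "ke", "ete", "te", "r", "y"]
PARTICLES = ["", "i", "o"]


def lemmatize(word, lexicon):
    """Return every (base, number, case, particle) analysis of `word`.

    An analysis is accepted only if the remaining base is listed in
    `lexicon` (a hypothetical set of known nominal bases). This guards
    against over-stripping, e.g. reading the final "r" of "ghar" as a
    case marker.
    """
    word = unicodedata.normalize("NFC", word)
    analyses = []
    for number, case, particle in product(NUMBER_MARKERS, CASE_MARKERS, PARTICLES):
        suffix = unicodedata.normalize("NFC", number + case + particle)
        if not word.endswith(suffix):
            continue
        base = word[: len(word) - len(suffix)]
        if base in lexicon:
            analyses.append((base, number, case, particle))
    return analyses


if __name__ == "__main__":
    known_bases = {"ghar"}  # assumed one-word lexicon for the demo
    for form in ["ghar", "gharei", "gharṭikei", "ghargulāro"]:
        print(form, lemmatize(form, known_bases))
```

With the one-word lexicon above, `lemmatize("gharṭikei", {"ghar"})` returns `[("ghar", "ṭi", "ke", "i")]`. Enumerating all slot combinations and filtering by lexicon keeps the sketch short; a production lemmatizer would also need rules for ambiguous forms and for bases that are not yet in the lexicon.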



Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Dr. Niladri Sekhar Dash


Appendix

Bengali nouns formed with or without formative elements.

Type 1: Noun Without Word-Formative Element (WFE)

(a)

Noun + Ø marker: (1)

ghar	[ghar-Ø]	“house”

Type 2: Noun with One Word-Formative Element (WFE)

(b)

Noun + Emphatic Particle: (2)

ghari	[ghar-i]	“house indeed”
gharo	[ghar-o]	“house also”

(c)

Noun + Singular Marker: (5)

gharṭi	[ghar-ṭi]	“the house”
gharṭā	[ghar-ṭā]	“the house”
gharkhānā	[ghar-khānā]	“the house”
gharkhāni	[ghar-khāni]	“the house”
gharṭuku	[ghar-ṭuku]	“the house”

(d)

Noun + Plural Marker: (3)

gharguli	[ghar-guli]	“houses”
ghargulo	[ghar-gulo]	“houses”
ghargulā	[ghar-gulā]	“houses”

(e)

Noun + Case Marker: (4)

ghare	[ghar-e]	“in house”
gharer	[ghar-er]	“of house”
gharke	[ghar-ke]	“to house”
gharete	[ghar-ete]	“in house”

Type 3: Noun with Two Word-Formative Elements (WFEs)

(a)

Noun + Singular + Particle: (10)

gharṭii	[ghar-ṭi-i]	“the house indeed”
gharṭio	[ghar-ṭi-o]	“the house also”
gharṭāi	[ghar-ṭā-i]	“the house indeed”
gharṭāo	[ghar-ṭā-o]	“the house also”
gharkhānāi	[ghar-khānā-i]	“the house indeed”
gharkhānāo	[ghar-khānā-o]	“the house also”
gharkhānii	[ghar-khāni-i]	“the house indeed”
gharkhānio	[ghar-khāni-o]	“the house also”
gharṭukui	[ghar-ṭuku-i]	“the house indeed”
gharṭukuo	[ghar-ṭuku-o]	“the house also”

(b)

Noun + Plural + Particle: (6)

ghargulii	[ghar-guli-i]	“houses indeed”
ghargulio	[ghar-guli-o]	“houses also”
gharguloi	[ghar-gulo-i]	“houses indeed”
gharguloo	[ghar-gulo-o]	“houses also”
ghargulāi	[ghar-gulā-i]	“houses indeed”
ghargulāo	[ghar-gulā-o]	“houses also”

(c)

Noun + Singular + Case: (13)

gharṭike	[ghar-ṭi-ke]	“to the house”
gharṭite	[ghar-ṭi-te]	“in the house”
gharṭir	[ghar-ṭi-r]	“of the house”
gharṭāke	[ghar-ṭā-ke]	“to the house”
gharṭāte	[ghar-ṭā-te]	“in the house”
gharṭār	[ghar-ṭā-r]	“of the house”
gharṭāy	[ghar-ṭā-y]	“in the house”
gharkhānāke	[ghar-khānā-ke]	“to the house”
gharkhānāte	[ghar-khānā-te]	“in the house”
gharkhānike	[ghar-khāni-ke]	“to the house”
gharkhānite	[ghar-khāni-te]	“in the house”
gharṭukuke	[ghar-ṭuku-ke]	“to the house”
gharṭukute	[ghar-ṭuku-te]	“in the house”

(d)

Noun + Plural + Case: (11)

ghargulike	[ghar-guli-ke]	“to the houses”
ghargulite	[ghar-guli-te]	“in the houses”
ghargulir	[ghar-guli-r]	“of the houses”
gharguloke	[ghar-gulo-ke]	“to the houses”
ghargulote	[ghar-gulo-te]	“in the houses”
ghargulor	[ghar-gulo-r]	“of the houses”
gharguloy	[ghar-gulo-y]	“in the houses”
ghargulāke	[ghar-gulā-ke]	“to the houses”
ghargulāte	[ghar-gulā-te]	“in the houses”
ghargulār	[ghar-gulā-r]	“of the houses”
ghargulāy	[ghar-gulā-y]	“in the houses”

(e)

Noun + Case + Particle: (6)

gharei	[ghar-e-i]	“in the house indeed”
ghareo	[ghar-e-o]	“in the house also”
gharkei	[ghar-ke-i]	“to the house indeed”
gharkeo	[ghar-ke-o]	“to the house also”
gharetei	[ghar-ete-i]	“in the house indeed”
ghareteo	[ghar-ete-o]	“in the house also”

Type 4: Noun with Three Word-Formative Elements (WFEs)

(a)

Noun + Singular + Case + Particle: (28)

gharṭikei	[ghar-ṭi-ke-i]	“to the house indeed”
gharṭikeo	[ghar-ṭi-ke-o]	“to the house also”
gharṭitei	[ghar-ṭi-te-i]	“in the house indeed”
gharṭiteo	[ghar-ṭi-te-o]	“in the house also”
gharṭākei	[ghar-ṭā-ke-i]	“to the house indeed”
gharṭākeo	[ghar-ṭā-ke-o]	“to the house also”
gharṭātei	[ghar-ṭā-te-i]	“in the house indeed”
gharṭāteo	[ghar-ṭā-te-o]	“in the house also”
gharṭiri	[ghar-ṭi-r-i]	“of the house indeed”
gharṭiro	[ghar-ṭi-r-o]	“of the house also”
gharṭāri	[ghar-ṭā-r-i]	“of the house indeed”
gharṭāro	[ghar-ṭā-r-o]	“of the house also”
gharṭāyi	[ghar-ṭā-y-i]	“in the house indeed”
gharṭāyo	[ghar-ṭā-y-o]	“in the house also”
gharkhānākei	[ghar-khānā-ke-i]	“to the house indeed”
gharkhānākeo	[ghar-khānā-ke-o]	“to the house also”
gharkhānātei	[ghar-khānā-te-i]	“in the house indeed”
gharkhānāteo	[ghar-khānā-te-o]	“in the house also”
gharkhānāyi	[ghar-khānā-y-i]	“in the house indeed”
gharkhānāyo	[ghar-khānā-y-o]	“in the house also”
gharkhānikei	[ghar-khāni-ke-i]	“to the house indeed”
gharkhānikeo	[ghar-khāni-ke-o]	“to the house also”
gharkhānitei	[ghar-khāni-te-i]	“in the house indeed”
gharkhāniteo	[ghar-khāni-te-o]	“in the house also”
gharṭukukei	[ghar-ṭuku-ke-i]	“to the house indeed”
gharṭukukeo	[ghar-ṭuku-ke-o]	“to the house also”
gharṭukutei	[ghar-ṭuku-te-i]	“in the house indeed”
gharṭukuteo	[ghar-ṭuku-te-o]	“in the house also”

(b)

Noun + Plural + Case + Particle: (22)

ghargulikei	[ghar-guli-ke-i]	“to the houses indeed”
ghargulikeo	[ghar-guli-ke-o]	“to the houses also”
ghargulitei	[ghar-guli-te-i]	“in the houses indeed”
gharguliteo	[ghar-guli-te-o]	“in the houses also”
ghargulokei	[ghar-gulo-ke-i]	“to the houses indeed”
ghargulokeo	[ghar-gulo-ke-o]	“to the houses also”
ghargulotei	[ghar-gulo-te-i]	“in the houses indeed”
gharguloteo	[ghar-gulo-te-o]	“in the houses also”
ghargulākei	[ghar-gulā-ke-i]	“to the houses indeed”
ghargulākeo	[ghar-gulā-ke-o]	“to the houses also”
ghargulātei	[ghar-gulā-te-i]	“in the houses indeed”
ghargulāteo	[ghar-gulā-te-o]	“in the houses also”
gharguliri	[ghar-guli-r-i]	“of the houses indeed”
gharguliro	[ghar-guli-r-o]	“of the houses also”
ghargulori	[ghar-gulo-r-i]	“of the houses indeed”
gharguloro	[ghar-gulo-r-o]	“of the houses also”
ghargulāri	[ghar-gulā-r-i]	“of the houses indeed”
ghargulāro	[ghar-gulā-r-o]	“of the houses also”
gharguloyi	[ghar-gulo-y-i]	“in the houses indeed”
gharguloyo	[ghar-gulo-y-o]	“in the houses also”
ghargulāyi	[ghar-gulā-y-i]	“in the houses indeed”
ghargulāyo	[ghar-gulā-y-o]	“in the houses also”
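Each row in the Appendix pairs a surface form with a hyphenated segmentation in square brackets, so the two columns can be cross-checked mechanically: joining the segments (and dropping the Ø marker) should reproduce the surface form. The following is a small sketch of such a check; the function names `join_segments` and `is_consistent` are hypothetical helpers introduced here, not part of the chapter's tooling.

```python
import unicodedata


def join_segments(segmentation):
    """Rebuild a surface form from a bracketed segmentation like "[ghar-ṭi-ke-i]"."""
    inner = segmentation.strip().strip("[]")
    # The zero marker Ø contributes no letters to the surface form.
    parts = [part for part in inner.split("-") if part != "Ø"]
    return unicodedata.normalize("NFC", "".join(parts))


def is_consistent(surface, segmentation):
    """True when the surface form equals the joined segments."""
    return unicodedata.normalize("NFC", surface) == join_segments(segmentation)


if __name__ == "__main__":
    rows = [
        ("ghar", "[ghar-Ø]"),
        ("gharṭikei", "[ghar-ṭi-ke-i]"),
        ("gharei", "[ghar-ṭi-ke]"),  # deliberately mismatched row for the demo
    ]
    for surface, segmentation in rows:
        print(surface, segmentation, is_consistent(surface, segmentation))
```

A check of this kind only confirms that the two columns agree with each other; it says nothing about whether the gloss or the segmentation itself is linguistically correct.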


About this chapter

Cite this chapter

Dash, N.S. (2021). Lemmatization of Inflected Nouns. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_8


DOI: https://doi.org/10.1007/978-981-16-2960-0_8
Published: 08 July 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2959-4
Online ISBN: 978-981-16-2960-0
eBook Packages: Education, Education (R0)
