Abstract
Computational processing of natural language supports effective communication by retrieving the correct sense of each word. A word may be monosemous or polysemous, and using polysemous words in an appropriate context plays a critical role in communication. Over the last two decades, a significant amount of research has addressed the automatic identification of the correct sense of a polysemous word, a task known as word sense disambiguation. A word sense disambiguation algorithm identifies the proper sense of a polysemous word by analysing its contextual data. Nevertheless, the contemporary literature lacks datasets for Asian languages, especially Bengali. Therefore, in this work, we present a dataset comprising one hundred Bengali polysemous words. Each word in this dataset has three or four disjoint senses, and each sense is illustrated by ten paragraphs. Each paragraph describes one sense of a particular polysemous word. We have performed a statistical analysis of the dataset on the basis of seven relevant characteristics. A general framework for training and testing is also presented, with possible guidelines for performance analysis. A baseline strategy based on four feature sets has been introduced. Finally, a set of experiments has been performed to analyse the system performance.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig12_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig13_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig14_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig15_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig16_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig17_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig18_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig19_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs12652-022-04471-y/MediaObjects/12652_2022_4471_Fig20_HTML.png)
Availability of Data and Materials
The authors declare that the data supporting the findings of this study are available within its supplementary information files. The description of the dataset for \(|S(W)| = 3\) is available in Appendix-I, and the description of the dataset for \(|S(W)| = 4\) is available in Appendix-II. The dataset is freely accessible at https://www.kaggle.com/datasets/debapratimdasdawn/bengali-wsd-dataset under the DOI https://www.doi.org/10.34740/kaggle/dsv/3985193.
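The layout described in the abstract (one hundred words, \(|S(W)|\) equal to 3 or 4 senses per word, ten paragraphs per sense) can be sketched as a small data structure. This is only an illustrative sketch: the class name, field names, and the `n_three` parameter are hypothetical, and the actual split of words between the two sense counts is not stated here, so it is left as an input rather than assumed.

```python
from dataclasses import dataclass

PARAGRAPHS_PER_SENSE = 10  # stated in the abstract
TOTAL_WORDS = 100          # stated in the abstract


@dataclass(frozen=True)
class PolysemousWord:
    """Hypothetical record for one target word; field names are illustrative."""
    word_id: int
    num_senses: int  # |S(W)|, either 3 or 4

    def __post_init__(self):
        # The abstract allows only three or four disjoint senses per word.
        if self.num_senses not in (3, 4):
            raise ValueError("|S(W)| must be 3 or 4")

    @property
    def num_paragraphs(self) -> int:
        # Ten paragraphs illustrate each sense.
        return self.num_senses * PARAGRAPHS_PER_SENSE


def total_paragraphs(n_three: int, total_words: int = TOTAL_WORDS) -> int:
    """Count paragraphs when n_three words have 3 senses and the rest have 4.

    n_three is an assumption supplied by the caller, not a figure from the paper.
    """
    if not 0 <= n_three <= total_words:
        raise ValueError("n_three must lie between 0 and total_words")
    words = [
        PolysemousWord(word_id=i, num_senses=3 if i < n_three else 4)
        for i in range(total_words)
    ]
    return sum(w.num_paragraphs for w in words)
```

Under these assumptions the dataset holds between 3,000 and 4,000 paragraphs, depending on how the hundred words divide between the \(|S(W)| = 3\) and \(|S(W)| = 4\) groups.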
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dataset for evaluating Bengali word sense disambiguation techniques. J Ambient Intell Human Comput 14, 4057–4086 (2023). https://doi.org/10.1007/s12652-022-04471-y
DOI: https://doi.org/10.1007/s12652-022-04471-y