A dataset for evaluating Bengali word sense disambiguation techniques

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Natural language processing relies on retrieving the correct sense of each word so that meaning is conveyed accurately. A word may be monosemous or polysemous, and using a polysemous word in an appropriate context plays a critical role in communication. Over the last two decades, a significant amount of research has addressed the automatic resolution of the correct sense of a polysemous word, a task known as word sense disambiguation. A word sense disambiguation algorithm identifies the proper sense of a polysemous word by analysing its context. Nevertheless, there is a gap in the contemporary literature regarding the availability of datasets in Asian languages, especially Bengali. In this work, we therefore present a dataset comprising one hundred Bengali polysemous words. Each word in the dataset has three or four disjoint senses, and each sense is represented by ten paragraphs, each of which illustrates that sense of the word. We perform a statistical analysis of the dataset based on seven relevant characteristics. We also present a general framework for training and testing, together with guidelines for performance analysis, and introduce a baseline strategy based on four feature sets. Finally, a set of experiments is performed to analyse system performance.
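
The dataset shape described in the abstract (words, each with three or four senses, each sense with ten paragraphs) can be illustrated with a minimal Python sketch. This is not the authors' framework: the `Example` record, the placeholder sense labels, the 20% hold-out fraction, and the per-sense split are all assumptions made for illustration, and the actual file layout of the published dataset is not specified here. If every sense has exactly ten paragraphs, one hundred words would yield between 3,000 and 4,000 paragraphs in total.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    word: str       # the polysemous target word
    sense: str      # sense label (hypothetical naming scheme)
    paragraph: str  # context paragraph illustrating that sense


def build_toy_dataset(sense_counts, paragraphs_per_sense=10):
    """Create placeholder examples with the shape described in the abstract."""
    examples = []
    for word, n_senses in sense_counts.items():
        for s in range(n_senses):
            for p in range(paragraphs_per_sense):
                label = f"{word}_sense{s + 1}"
                examples.append(Example(word, label, f"<paragraph {p + 1}>"))
    return examples


def split_per_sense(examples, test_fraction=0.2, seed=0):
    """Hold out the same fraction of paragraphs from every (word, sense) group."""
    rng = random.Random(seed)
    groups = {}
    for ex in examples:
        groups.setdefault((ex.word, ex.sense), []).append(ex)
    train, test = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


def accuracy(gold, predicted):
    """Fraction of items whose predicted sense matches the gold sense."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Two hypothetical words: one with three senses, one with four.
toy = build_toy_dataset({"word_a": 3, "word_b": 4})
train, test = split_per_sense(toy)
```

Because each sense contributes the same number of paragraphs in this sketch, a uniform random guess would be correct about one time in three for a three-sense word and one time in four for a four-sense word, which gives a chance level against which any baseline can be compared.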

Availability of Data and Materials

The authors declare that the data supporting the findings of this study are available within its supplementary information files. A description of the dataset for words with three senses, \(|S(W)| = 3\), is available in Appendix-I, and a description for words with four senses, \(|S(W)| = 4\), is available in Appendix-II. The dataset is freely accessible at https://www.kaggle.com/datasets/debapratimdasdawn/bengali-wsd-dataset (DOI: https://www.doi.org/10.34740/kaggle/dsv/3985193).

Author information

Corresponding author

Correspondence to Debapratim Das Dawn.

About this article

Cite this article

Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dataset for evaluating Bengali word sense disambiguation techniques. J Ambient Intell Human Comput 14, 4057–4086 (2023). https://doi.org/10.1007/s12652-022-04471-y

  • DOI: https://doi.org/10.1007/s12652-022-04471-y
