A dataset for evaluating Bengali word sense disambiguation techniques

Das Dawn, Debapratim; Khan, Abhinandan; Shaikh, Soharab Hossain; Pal, Rajat Kumar

doi:10.1007/s12652-022-04471-y

Original Research
Published: 24 December 2022

Volume 14, pages 4057–4086, (2023)

Journal of Ambient Intelligence and Humanized Computing

Abstract

Computational processing of natural language supports effective communication by retrieving the correct sense of each word. A word may be monosemous or polysemous, and the use of polysemous words in an appropriate context plays a critical role in communication. Over the last two decades, a significant amount of research has addressed the automatic identification of the correct sense of a polysemous word, a task known as word sense disambiguation. A word sense disambiguation algorithm identifies the proper sense of a polysemous word by analysing its context. Nevertheless, there is a gap in the contemporary literature regarding the availability of datasets in Asian languages, especially Bengali. Therefore, in this work, we present a dataset comprising one hundred Bengali polysemous words. Each word in this dataset has three or four disjoint senses, and each sense is represented by ten paragraphs, each of which illustrates that sense of the word. We have performed a statistical analysis of the dataset on the basis of seven relevant characteristics. A general framework for training and testing is also presented, with guidelines for performance analysis. A baseline strategy based on four feature sets is introduced. Finally, a set of experiments has been performed to analyse system performance.
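
The structure described in the abstract (words, three or four senses per word, ten paragraphs per sense) can be illustrated with a minimal sketch. The in-memory layout, the function names, and the 7/3 split per sense are assumptions made for illustration only; they are not the file format of the published dataset and not the training and testing framework of the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout (an assumption): word -> sense label -> paragraphs.
Dataset = Dict[str, Dict[str, List[str]]]
Sample = Tuple[str, str, str]  # (word, sense label, paragraph)

PARAGRAPHS_PER_SENSE = 10      # stated in the abstract
ALLOWED_SENSE_COUNTS = (3, 4)  # stated in the abstract


def validate(dataset: Dataset) -> None:
    """Raise ValueError if an entry violates the structure given in the abstract."""
    for word, senses in dataset.items():
        if len(senses) not in ALLOWED_SENSE_COUNTS:
            raise ValueError(f"{word!r} has {len(senses)} senses, expected 3 or 4")
        for sense, paragraphs in senses.items():
            if len(paragraphs) != PARAGRAPHS_PER_SENSE:
                raise ValueError(
                    f"{word!r}/{sense!r} has {len(paragraphs)} paragraphs, expected 10"
                )


def split_per_sense(
    dataset: Dataset, n_train: int = 7
) -> Tuple[List[Sample], List[Sample]]:
    """Split each sense separately so every sense occurs in both parts.

    The default of 7 training paragraphs per sense is an illustrative choice.
    """
    train: List[Sample] = []
    test: List[Sample] = []
    for word, senses in dataset.items():
        for sense, paragraphs in senses.items():
            for i, paragraph in enumerate(paragraphs):
                target = train if i < n_train else test
                target.append((word, sense, paragraph))
    return train, test


def size_bounds(n_words: int = 100) -> Tuple[int, int]:
    """Lower and upper bound on the total number of paragraphs."""
    low = n_words * min(ALLOWED_SENSE_COUNTS) * PARAGRAPHS_PER_SENSE
    high = n_words * max(ALLOWED_SENSE_COUNTS) * PARAGRAPHS_PER_SENSE
    return low, high


# Toy entry with placeholder text standing in for Bengali paragraphs.
toy: Dataset = {
    "word_a": {
        f"sense_{s}": [f"paragraph {s}.{p}" for p in range(10)] for s in range(3)
    }
}
validate(toy)
toy_train, toy_test = split_per_sense(toy)
print(len(toy_train), len(toy_test))  # 21 9
print(size_bounds())                  # (3000, 4000)
```

Because every word has three or four senses, one hundred words give between 3000 and 4000 paragraphs in total; the exact figure depends on how many words have four senses, which the abstract does not state.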

Availability of Data and Materials

The authors declare that the data supporting the findings of this study are available within its supplementary information files. The description of the dataset for \(|S(W)| = 3\) is available in Appendix-I, and the description of the dataset for \(|S(W)| = 4\) is available in Appendix-II. The dataset is freely accessible at https://www.kaggle.com/datasets/debapratimdasdawn/bengali-wsd-dataset (DOI: https://www.doi.org/10.34740/kaggle/dsv/3985193).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Calcutta, Calcutta, India
Debapratim Das Dawn & Rajat Kumar Pal
Product Development and Diversification, ARP Engineering, Calcutta, India
Abhinandan Khan
Department of Computer Science and Engineering, BML Munjal University, Kapriwas, India
Soharab Hossain Shaikh

Corresponding author

Correspondence to Debapratim Das Dawn.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 116 KB)

Supplementary file 2 (pdf 115 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dataset for evaluating Bengali word sense disambiguation techniques. J Ambient Intell Human Comput 14, 4057–4086 (2023). https://doi.org/10.1007/s12652-022-04471-y

Received: 15 April 2021
Accepted: 24 October 2022
Published: 24 December 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s12652-022-04471-y
