A word sense disambiguation corpus for Urdu

  • Original Paper
Language Resources and Evaluation

Abstract

The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities, and these are often hard to resolve automatically. Consequently, WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, evaluate and compare WSD methods. A range of initiatives have led to the development of benchmark WSD corpora for many languages from various language families. However, benchmark WSD corpora are lacking for South Asian languages, including Urdu, despite there being over 300 million Urdu speakers and a large amount of Urdu digital text available online. To address this gap, this study describes a novel benchmark corpus for the Urdu Lexical Sample WSD task. The corpus contains 50 target words (30 nouns, 11 adjectives, and 9 verbs). A standard, manually crafted dictionary called Urdu Lughat is used as the sense inventory. Four baseline WSD approaches were applied to the corpus. The results show that the best performance was obtained using a simple Bag of Words approach. To encourage NLP research on the Urdu language, the corpus is freely available to the research community.
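
To make the Lexical Sample setting and a Bag of Words baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each training instance is a tokenised context labelled with one sense of the target word, and that a test context is assigned the sense whose summed word counts are closest under cosine similarity. The function names, the English target word "bank", and the toy sentences are all hypothetical and stand in for the Urdu data.

```python
from collections import Counter
import math


def bag_of_words(tokens, target):
    """Count the context tokens, leaving out the target word itself."""
    return Counter(t for t in tokens if t != target)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(instances, target):
    """Sum the bags of all training contexts that share a sense label."""
    profiles = {}
    for tokens, sense in instances:
        profiles.setdefault(sense, Counter()).update(bag_of_words(tokens, target))
    return profiles


def disambiguate(tokens, target, profiles):
    """Return the sense whose profile is most similar to the context."""
    bag = bag_of_words(tokens, target)
    return max(profiles, key=lambda sense: cosine(bag, profiles[sense]))


# Hypothetical toy data: one target word with two senses.
TARGET = "bank"
TRAINING = [
    ("she deposited money in the bank".split(), "finance"),
    ("the bank approved the loan".split(), "finance"),
    ("they fished from the river bank".split(), "river"),
    ("the bank of the stream was muddy".split(), "river"),
]

profiles = train(TRAINING, TARGET)
guess_1 = disambiguate("he opened an account at the bank to save money".split(), TARGET, profiles)
guess_2 = disambiguate("the muddy bank of the river".split(), TARGET, profiles)
print(guess_1, guess_2)
```

A real baseline would typically add stop-word removal, a fixed context window, and a held-out evaluation split; those choices are left out here because the abstract does not specify them.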

Notes

  1. https://www.ethnologue.com/language/urd Last visited: 23-October-2018.

  2. https://github.com/google-research-datasets/word_sense_disambigation_corpora Last visited: 23-October-2018.

  3. http://opus.nlpl.eu/MultiUN.php Last visited: 23-October-2018.

  4. https://catalog.ldc.upenn.edu/LDC2000T43 Last visited: 23-October-2018.

  5. http://www.ldoceonline.com/ Last visited: 23-October-2018.

  6. http://www.cfilt.iitb.ac.in/wordnet/webhwn/ Last visited: 23-October-2018.

  7. http://www.jagran.com/ Last visited: 23-October-2018.

  8. https://en.wikipedia.org/wiki/Hindi Last visited: 22-October-2018.

  9. http://www.hindi.webdunia.com/ Last visited: 23-October-2018.

  10. http://www.cfilt.iitb.ac.in/indowordnet/ Last visited: 23-October-2018.

  11. http://www.cle.org.pk/clestore/urduwordnet.htm Last visited: 23-October-2018.

  12. http://www.urdulughat.info/ Last visited: 23-October-2018.

  13. The complete list of 50 selected words along with their senses can be downloaded from http://www.comsatsnlpgroup.wordpress.com Last visited: 23-October-2018.

  14. https://github.com/alisaeed007/Urdu-Annotation-Tool-UAT- Last visited: 23-October-2018.

  15. http://www.comsatsnlpgroup.wordpress.com Last visited: 23-October-2018.

  16. A Java-based library for implementing word embedding (WE) models: https://deeplearning4j.org/ Last visited: 23-October-2018.

Author information

Corresponding author

Correspondence to Ali Saeed.

About this article

Cite this article

Saeed, A., Nawab, R.M.A., Stevenson, M. et al. A word sense disambiguation corpus for Urdu. Lang Resources & Evaluation 53, 397–418 (2019). https://doi.org/10.1007/s10579-018-9438-7

  • DOI: https://doi.org/10.1007/s10579-018-9438-7
