
Parts-of-speech tagging of Nepali texts with Bidirectional LSTM, Conditional Random Fields and HMM

Multimedia Tools and Applications

Abstract

Parts-of-Speech (POS) tagging is a fundamental pre-processing step for Natural Language Processing (NLP) tasks such as text summarization, named entity recognition, dependency parsing (and parsing in general), classification, sentiment analysis, machine translation, and information extraction. Various state-of-the-art models have been implemented for POS tagging of many natural languages. However, our literature survey shows that the problem has not been addressed rigorously for the Nepali language and that no comprehensive comparative study has been presented. Nepali is an under-resourced, highly inflectional language whose word forms encode information such as gender, person, number, mood, and aspect. Precise disambiguation of these inflected words is critical in Nepali text analysis. In this paper, POS tagging using the Hidden Markov Model (HMM), Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM) is presented for the language, together with a comprehensive comparative study of the three models. Experiments show that the CRF-based technique outperforms the HMM, and that the deep neural network based LSTM in turn outperforms the CRF, reaching an accuracy of 99.6%. This study demonstrates that deep learning based models are exceptional at disambiguating the rich morphological information encoded in Nepali words.
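As a rough illustration of the simplest of the three model families named in the abstract, the sketch below trains a bigram HMM tagger from counts and decodes with the Viterbi algorithm. It is not the authors' implementation: the function names (`train_hmm`, `log_prob`, `viterbi`), the add-one smoothing, and the tiny English toy corpus are all assumptions made for readability, and the preview does not describe the Nepali corpus or tagset used in the paper.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from [(word, tag), ...] sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][tag] = (log score of best path ending in tag at position i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0] + log_prob(transitions[prev], tag, n_tags) + emit, prev)
                for prev in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Invented toy corpus, for illustration only.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
transitions, emissions, vocab = train_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], transitions, emissions, vocab))
```

A CRF replaces these generative counts with discriminatively trained feature weights, and a bidirectional LSTM learns context representations from both directions of the sentence; the ordering HMM < CRF < LSTM reported in the abstract reflects that increase in modelling capacity.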



Acknowledgements

We gratefully acknowledge the Department of Science and Technology, Government of India, for sponsoring the project entitled “Study and develop a natural language parser for Nepali language” (reference no. SR/CSRI/- 28/2015(G)) under the Cognitive Science Research Initiative (CSRI). We also acknowledge the TMA Pai University (Sikkim Manipal University) research grant for supporting this work.

Author information

Corresponding author

Correspondence to Ashish Pradhan.

Ethics declarations

Conflicts of interest

The authors claim no conflict or competing interest.


About this article

Cite this article

Pradhan, A., Yajnik, A. Parts-of-speech tagging of Nepali texts with Bidirectional LSTM, Conditional Random Fields and HMM. Multimed Tools Appl 83, 9893–9909 (2024). https://doi.org/10.1007/s11042-023-15679-1
