Abstract
Parts-of-speech (POS) tagging is a fundamental pre-processing step for Natural Language Processing (NLP) tasks such as text summarization, named entity recognition, dependency parsing (and parsing in general), classification, sentiment analysis, machine translation, and information extraction. Various state-of-the-art models have been implemented for POS tagging of many natural languages. However, our literature survey indicates that the problem has not been addressed rigorously for the Nepali language, and no comprehensive comparative study has been presented. Nepali is an under-resourced and highly inflectional language; it therefore encodes information such as gender, person, number, mood, and aspect within its word forms. Precise disambiguation of these inflected words is critical in Nepali text analysis. In this paper, POS tagging using the Hidden Markov Model (HMM), Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM) is presented for the language, together with a comprehensive comparative study of the three models. Experiments show that the CRF-based technique outperforms the HMM, and that the deep neural network based LSTM in turn outperforms the CRF, achieving an accuracy of \(99.6\%\). This study demonstrates that deep learning based models are highly effective at disambiguating the rich morphological information encoded in Nepali words.
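To make the HMM baseline concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the authors' implementation: the three romanized toy sentences, the `PRON`/`NOUN`/`VERB` tagset, the add-one smoothing, and the function names `train_hmm` and `viterbi` are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (romanized, invented for illustration only;
# not drawn from the corpus or tagset used in the paper).
TOY_CORPUS = [
    [("ma", "PRON"), ("ghar", "NOUN"), ("janchu", "VERB")],
    [("ma", "PRON"), ("bhat", "NOUN"), ("khanchu", "VERB")],
    [("ram", "NOUN"), ("ghar", "NOUN"), ("janchha", "VERB")],
]


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][t] = (log score of best path ending in tag t at i, previous tag)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans[p], t, n_tags))
                 for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + log_prob(emit[t], words[i], n_words), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    trans, emit, vocab = train_hmm(TOY_CORPUS)
    print(viterbi(["ma", "bhat", "janchu"], trans, emit, vocab))
```

The sketch scores each tag sequence from transition and emission counts alone, which is why an HMM struggles with heavily inflected word forms it has not seen: an unseen form receives only the smoothed emission mass, whereas feature-based (CRF) and neural (LSTM) taggers can draw on richer context.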
Acknowledgements
We gratefully acknowledge the Department of Science and Technology, Government of India, for sponsoring the project entitled “Study and develop a natural language parser for Nepali language” (reference no. SR/CSRI/- 28/2015(G)) under the Cognitive Science Research Initiative (CSRI), which made this work possible. We also acknowledge the TMA Pai University (Sikkim Manipal University) research grant for supporting this work.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest or competing interests.
About this article
Cite this article
Pradhan, A., Yajnik, A. Parts-of-speech tagging of Nepali texts with Bidirectional LSTM, Conditional Random Fields and HMM. Multimed Tools Appl 83, 9893–9909 (2024). https://doi.org/10.1007/s11042-023-15679-1
DOI: https://doi.org/10.1007/s11042-023-15679-1