A Trigram HMM-Based POS Tagger for Indian Languages

  • Kamal Sarkar
  • Vivekananda Gayen
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 199)

Abstract

We present a trigram HMM-based (Hidden Markov Model) part-of-speech (POS) tagger for Indian languages, which accepts raw text in an Indian language (typed in the corresponding language font) and produces POS-tagged output. We implement the trigram POS tagger from scratch, based on the second-order Hidden Markov Model (HMM). To handle unknown words, we introduce a prefix analysis method and a word-type analysis method, which are combined with the well-known suffix analysis method to predict the probable tags. Although the system has been tested on data for four Indian languages, namely Bengali, Hindi, Marathi and Telugu, it can be easily ported to a new language simply by replacing the training file with POS-tagged data for that language. The trigram POS tagger is compared against a bigram POS tagger, which serves as the baseline.
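As an illustration of how a second-order HMM tagger can smooth its tag-transition probabilities, the following is a minimal Python sketch of deleted interpolation in the style popularized by the TnT tagger. It is not the authors' implementation: the function names (`count_ngrams`, `deleted_interpolation`, `transition_prob`), the tie-breaking rule and the toy tag set used below are assumptions made for the example, and the abstract does not specify how the weights are actually estimated.

```python
from collections import Counter


def count_ngrams(tag_sequences):
    """Count tag unigrams, bigrams and trigrams over tagged sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            uni[tag] += 1
            if i >= 1:
                bi[(tags[i - 1], tag)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], tag)] += 1
    return uni, bi, tri


def _ratio(numerator, denominator):
    """Safe division: a zero (or negative) denominator yields 0."""
    return numerator / denominator if denominator > 0 else 0.0


def deleted_interpolation(uni, bi, tri):
    """Estimate (lambda1, lambda2, lambda3) for unigram/bigram/trigram models.

    Each observed trigram votes, with its count, for the model order whose
    leave-one-out relative frequency is largest (ties go to the lower order,
    which is an assumption of this sketch).
    """
    n = sum(uni.values())
    lambdas = [0.0, 0.0, 0.0]
    for (t1, t2, t3), freq in tri.items():
        cases = [
            _ratio(uni[t3] - 1, n - 1),             # unigram estimate
            _ratio(bi[(t2, t3)] - 1, uni[t2] - 1),  # bigram estimate
            _ratio(freq - 1, bi[(t1, t2)] - 1),     # trigram estimate
        ]
        best = max(range(3), key=lambda k: cases[k])
        lambdas[best] += freq
    total = sum(lambdas)
    if total == 0:
        return (1 / 3, 1 / 3, 1 / 3)  # no trigrams seen: fall back to uniform
    return tuple(value / total for value in lambdas)


def transition_prob(t1, t2, t3, uni, bi, tri, lambdas):
    """Smoothed P(t3 | t1, t2) as a weighted sum of the three estimates."""
    n = sum(uni.values())
    l1, l2, l3 = lambdas
    p_uni = _ratio(uni[t3], n)
    p_bi = _ratio(bi[(t2, t3)], uni[t2])
    p_tri = _ratio(tri[(t1, t2, t3)], bi[(t1, t2)])
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

With a toy corpus such as `[["N", "V", "N"], ["D", "N", "V"]]`, one would call `count_ngrams`, pass the three counters to `deleted_interpolation`, and then query `transition_prob` inside a Viterbi decoder. The weights always sum to one, so the smoothed value stays within [0, 1] even for tag trigrams that never occur in the training data.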

Keywords

Part-of-speech tagging; Second order Hidden Markov Model; Deleted interpolation; Indian Languages

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Computer Science & Engineering Department, Jadavpur University, Kolkata, India