Pattern Based Bootstrapping Technique for Tamil POS Tagging

  • Jayabal Ganesh
  • Ranjani Parthasarathi
  • T. V. Geetha
  • J. Balaji
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8891)

Abstract

Part-of-speech (POS) tagging is a basic preprocessing step for any text-processing NLP application. It is a difficult task for morphologically rich languages with partially free word order. This paper describes a POS tagger for one such language, Tamil. The main difficulty in POS tagging is ambiguity: words belonging to different POS categories can carry the same inflections and must be disambiguated using context. This paper presents a pattern-based bootstrapping approach that starts from only a small set of POS-labeled suffix context patterns. Each pattern consists of a stem and a sequence of suffixes, obtained by segmenting words with a suffix list. The bootstrapping technique generates new patterns by iteratively masking suffixes that have a low probability of occurrence in the suffix context and replacing them with other co-occurring suffixes. We tested the system on a corpus of 20,000 Tamil documents containing 2,71,933 unique words. The system achieves a precision of 87.74%.
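
The segmentation and masking steps summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the transliterated suffix list, the greedy right-to-left matching, the relative-frequency estimate, the threshold, and the function names `segment`, `mask_rare_suffixes`, and `expand` are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical transliterated suffix list; the paper's real list is not shown here.
SUFFIXES = ["kal", "ai", "il", "ukku"]


def segment(word, suffixes=SUFFIXES):
    """Strip suffixes from the right, longest match first.

    Returns (stem, [suffixes in surface order]).
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    found = []
    stripped = True
    while stripped:
        stripped = False
        for s in ordered:
            # Only strip if a non-empty stem would remain.
            if word.endswith(s) and len(word) > len(s):
                found.insert(0, s)
                word = word[: -len(s)]
                stripped = True
                break
    return word, found


def mask_rare_suffixes(pattern, counts, threshold):
    """Replace suffixes whose relative frequency is below threshold with '*'.

    Relative frequency is an assumed stand-in for the paper's probability.
    """
    total = sum(counts.values()) or 1  # avoid division by zero on empty counts
    return [s if counts[s] / total >= threshold else "*" for s in pattern]


def expand(masked, candidates):
    """Fill each masked slot, one at a time, with every candidate suffix."""
    new_patterns = []
    for i, s in enumerate(masked):
        if s == "*":
            for c in candidates:
                new_patterns.append(masked[:i] + [c] + masked[i + 1:])
    return new_patterns


stem, sufs = segment("marangkalai")
counts = Counter({"kal": 8, "ai": 1, "il": 1})  # made-up context counts
masked = mask_rare_suffixes(sufs, counts, threshold=0.2)
generated = expand(masked, ["il", "ukku"])
```

In a bootstrapping loop, patterns produced by `expand` would be matched against unlabeled text and the accepted ones added to the seed set for the next iteration; how the paper scores and accepts new patterns is not specified in the abstract.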

Keywords

POS Tagging, Bootstrapping, Semi-supervised Tagging, Tamil Language

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jayabal Ganesh (1)
  • Ranjani Parthasarathi (1)
  • T. V. Geetha (2)
  • J. Balaji (2)
  1. Department of Information Science and Technology, College of Engineering, Anna University, Chennai, India
  2. Department of Computer Science and Engineering, College of Engineering, Anna University, Chennai, India
