Kannada Stemmer and Its Effect on Kannada Documents Classification

Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 33)

Abstract

Stemming is the process of reducing a word to its root or stem form. Kannada is a morphologically rich language, and its words are inflected into different forms based on person, number, gender and tense. Stemming is an important pre-processing step in any Natural Language Processing application. In this paper, stemming is performed on Kannada words using an unsupervised method based on suffix arrays, which achieves an accuracy of 0.58 %. The performance of the stemmer is further improved by combining the unsupervised method with a stem-list dictionary; a list of 18,804 stem words in the Kannada language was created manually as part of this work, and a 10 % improvement in performance is observed. The effect of the proposed stemmer on the text classification of Kannada documents is then evaluated by comparing Naïve Bayes and Maximum Entropy classifiers, and it is shown that stemming improves classification performance.
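As a rough illustration of the hybrid approach described in the abstract, the minimal sketch below combines a stem-list dictionary lookup with longest-match suffix stripping. The suffix list, the stem-dictionary entries and the Romanized example words are hypothetical placeholders; the paper's actual suffix-array construction and its 18,804-entry Kannada stem list are not reproduced here.

# Illustrative sketch only: dictionary lookup first, then longest-match suffix
# stripping. Suffixes, stems and example words are hypothetical placeholders.

# Hypothetical inflectional suffixes (Romanized for readability), longest first.
SUFFIXES = sorted(["galu", "annu", "inda", "alli", "gala", "ige", "ya", "ru"],
                  key=len, reverse=True)

# Hypothetical entries standing in for the paper's manually built stem list.
STEM_DICT = {"mane", "pustaka", "huduga"}

def stem(word: str) -> str:
    """Return the stem of `word` using dictionary lookup, then suffix stripping."""
    # 1. If the word is already a known stem, keep it unchanged.
    if word in STEM_DICT:
        return word
    # 2. Otherwise strip the longest matching suffix whose remainder is either
    #    a known stem or long enough to be a plausible root.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in STEM_DICT or len(candidate) >= 3:
                return candidate
    # 3. No suffix matched: return the word as its own stem.
    return word

if __name__ == "__main__":
    for w in ["manegalu", "pustaka", "hudugaru"]:
        print(w, "->", stem(w))

Similarly, the comparison of Naïve Bayes and Maximum Entropy classifiers on stemmed documents could be sketched as follows. The toy corpus and the use of scikit-learn are assumptions for illustration only; the paper itself works with real Kannada documents and the MALLET toolkit cited in reference 20. Multinomial logistic regression stands in here for the maximum-entropy classifier.

# Illustrative sketch only: Naive Bayes vs. a maximum-entropy (multinomial
# logistic regression) classifier on a toy corpus of placeholder documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Hypothetical stemmed documents (Romanized placeholders) and class labels.
docs = ["mane huduga pustaka", "pustaka shaale huduga",
        "chitra sangeetha nataka", "sangeetha nataka kale"]
labels = ["education", "education", "arts", "arts"]

# Bag-of-words features over the (already stemmed) tokens.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Maximum Entropy", LogisticRegression(max_iter=1000))]:
    clf.fit(X, labels)
    test = vectorizer.transform(["huduga pustaka"])
    print(name, "->", clf.predict(test)[0])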

Keywords

Kannada stemmer · Text classification · Unsupervised stemming · Naïve Bayes · Maximum entropy · Natural language processing

References

  1. Frakes, W.B., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs (1992)
  2. Lovins, J.: Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 11, 22–23 (1968)
  3. Paice, C., Husk, G.: Another stemmer. ACM SIGIR Forum 24(3), 566 (1990)
  4. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
  5. Ramanathan, A., Rao, D.D.: A lightweight stemmer for Hindi. In: Proceedings of EACL, ACL (2003)
  6. Islam, Z., Uddin, N., Khan, M.: A light weight stemmer for Bengali and its use in spelling checker. In: Proceedings of the 1st International Conference on Digital Communications and Computer Applications (DCCA 2007), Irbid, Jordan, pp. 87–93 (2007)
  7. Majumder, P., Mitra, M., Parui, S.K., Kole, G., Mitra, P., Datta, K.: YASS: yet another suffix stripper. ACM Trans. Inf. Syst. 25(4), 18 (2007)
  8. Pandey, A.K., Siddiqui, T.J.: An unsupervised Hindi stemmer with heuristic improvements. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, AND 2008, Singapore, pp. 99–105 (2008)
  9. Dasgupta, S., Ng, V.: Unsupervised morphological parsing of Bengali. Lang. Resour. Eval. 40, 311–330 (2006)
  10. Keshava, S., Pitler, E.: A simpler, intuitive approach to morpheme induction. In: Proceedings of the 2nd PASCAL Challenges Workshop, pp. 31–35 (2006)
  11. Majgaonker, M.M., Siddiqui, T.J.: Discovering suffixes: a case study for Marathi language. Int. J. Comput. Sci. Eng. 04, 2716–2720 (2010)
  12. Suba, K., Jiandani, D., Bhattacharyya, P.: Hybrid inflectional stemmer and rule-based derivational stemmer for Gujarati. In: 2nd Workshop on South and Southeast Asian Natural Language Processing, Chiang Mai, Thailand (2011)
  13. Gupta, V., Lehal, G.S.: Punjabi language stemmer for nouns and proper names. In: Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011, Chiang Mai, Thailand, pp. 35–39 (2011)
  14. Kumar, D., Rana, P.: Design and development of a stemmer for Punjabi. Int. J. Comput. Appl. 11(12), 0975–8887 (2010)
  15. Padma, M.C., Prathibha, R.J.: Development of morphological stemmer, analyzer and generator for Kannada nouns. In: Proceedings of International Conference, ICERECT 2012, pp. 713–723 (2014)
  16. Bhat, S.: Statistical stemming for Kannada. In: Proceedings of the 4th Workshop on South and Southeast Asian NLP (WSSANLP), International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 25–33, 14–18 Oct 2013
  17.
  18. EMILLE corpus: http://www.emille.lancs.ac.uk (2003)
  19. Nigam, K., Lafferty, J., McCallum, A.: Using maximum entropy for text classification. In: IJCAI 1999 Workshop on Machine Learning for Information Filtering, pp. 61–67 (1999)
  20. McCallum, A.K.: MALLET: a machine learning for language toolkit (2002)

Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of Computer Science and Engineering, R.V. College of Engineering, Bangalore, India
  2. Department of Information Science and Engineering, R.V. College of Engineering, Bangalore, India
