Optimal Stem Identification in Presence of Suffix List

  • N. Vasudevan
  • Pushpak Bhattacharyya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7181)


Stemming is considered crucial in many NLP and IR applications. Without any linguistic information, stemming is a challenging task; stemming with the suffixes of a language as linguistic information is, by comparison, an easier problem. In this work we treat stemming as the process of obtaining a minimum-size lexicon from an unannotated corpus by using a suffix set. We prove that the exact lexicon reduction problem is NP-hard and give a polynomial-time approximation. We also propose a probabilistic model for stemming that minimizes the stem distributional entropy. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
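To make the lexicon reduction view concrete, the sketch below assigns each word a stem by stripping at most one suffix from a given suffix set, while trying to keep the number of distinct stems small. It is only an illustration and not the approximation algorithm of the paper: it uses a plain greedy set-cover heuristic, and the function names, the English toy words, and the toy suffixes are all hypothetical. The last function computes the entropy of the resulting stem distribution, the quantity that the probabilistic model is described as minimizing.

```python
from collections import Counter
from math import log2


def candidate_stems(word, suffixes):
    """Return every stem obtainable by stripping one listed suffix, or none."""
    stems = {word}  # empty suffix: the word may serve as its own stem
    for suffix in suffixes:
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            stems.add(word[: -len(suffix)])
    return stems


def greedy_stem_lexicon(words, suffixes):
    """Greedy set-cover heuristic (an assumption, not the paper's method):
    repeatedly pick the stem that covers the most still-unassigned words."""
    candidates = {w: candidate_stems(w, suffixes) for w in set(words)}
    unassigned = set(candidates)
    assignment = {}
    while unassigned:
        coverage = Counter(s for w in unassigned for s in candidates[w])
        # Deterministic tie-break: most words covered, then longest stem, then alphabetical.
        best = min(coverage, key=lambda s: (-coverage[s], -len(s), s))
        for w in [w for w in unassigned if best in candidates[w]]:
            assignment[w] = best
            unassigned.discard(w)
    return assignment


def stem_entropy(words, assignment):
    """Shannon entropy (in bits) of the stem distribution over the corpus tokens."""
    counts = Counter(assignment[w] for w in words)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


# Hypothetical toy corpus and suffix set, for illustration only.
toy_words = ["walks", "walked", "walking", "walk", "talks", "talked"]
toy_suffixes = {"s", "ed", "ing"}
toy_assignment = greedy_stem_lexicon(toy_words, toy_suffixes)
toy_lexicon = set(toy_assignment.values())
toy_entropy = stem_entropy(toy_words, toy_assignment)
```

On this toy input the greedy pass maps the six word forms to the two stems `walk` and `talk`. A greedy heuristic of this kind runs in polynomial time but carries no optimality guarantee, which is consistent with the exact problem being NP-hard.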


Keywords: Cover Problem, Vertex Cover, Valid Mapping, Vertex Cover Problem, Minimum Vertex Cover





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • N. Vasudevan (1)
  • Pushpak Bhattacharyya (1)
  1. Computer Science and Engg Department, IIT Bombay, Mumbai, India
