Abstract
Stemming is considered crucial in many NLP and IR applications. Without any linguistic information, stemming is a challenging task; stemming with a language's suffix list as linguistic information is, in comparison, an easier problem. In this work we treat stemming as the process of obtaining a minimum-size lexicon from an unannotated corpus using a suffix set. We prove that the exact lexicon reduction problem is NP-hard and give a polynomial-time approximation. We also propose a probabilistic model for stemming that minimizes the entropy of the stem distribution. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
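To make the lexicon reduction formulation concrete, the following is a minimal sketch, not the approximation algorithm of the paper. It assumes the problem can be read as a set-cover instance (each candidate stem covers the words that reduce to it by stripping one listed suffix) and applies the standard greedy set-cover heuristic. The function names, the tie-breaking rule, and the English toy vocabulary are all hypothetical illustrations; the entropy helper only measures the stem distribution and does not implement the probabilistic model.

```python
from collections import Counter
from math import log2


def candidate_stems(word, suffixes):
    """Return every stem obtained by stripping one listed suffix, plus the word itself."""
    stems = {word}  # empty suffix: the word can always serve as its own stem
    for suffix in suffixes:
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            stems.add(word[: -len(suffix)])
    return stems


def greedy_stem_lexicon(words, suffixes):
    """Greedy set-cover heuristic (an assumption, not the paper's method).

    Repeatedly picks the candidate stem that covers the most still-uncovered
    words and assigns those words to it. Returns a word -> stem mapping.
    """
    vocabulary = set(words)
    covers = {}  # stem -> set of words that can reduce to it
    for word in vocabulary:
        for stem in candidate_stems(word, suffixes):
            covers.setdefault(stem, set()).add(word)

    assignment = {}
    uncovered = set(vocabulary)
    while uncovered:
        # Ties are broken by longer stem, then alphabetically, for determinism.
        best = max(covers, key=lambda s: (len(covers[s] & uncovered), len(s), s))
        for word in covers[best] & uncovered:
            assignment[word] = best
        uncovered -= covers[best]
    return assignment


def stem_entropy(words, assignment):
    """Shannon entropy (in bits) of the stem distribution over the given word tokens."""
    counts = Counter(assignment[w] for w in words)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


# Hypothetical toy data; the paper itself evaluates on Malayalam.
toy_words = ["walk", "walks", "walked", "walking", "talk", "talks"]
toy_suffixes = {"s", "ed", "ing"}
toy_assignment = greedy_stem_lexicon(toy_words, toy_suffixes)
toy_lexicon = set(toy_assignment.values())
```

On this toy input the six word forms collapse onto two stems. The greedy rule gives the usual logarithmic set-cover guarantee in general, which is a property of the heuristic and not a claim about the bound obtained in the paper.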
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vasudevan, N., Bhattacharyya, P. (2012). Optimal Stem Identification in Presence of Suffix List. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_8
DOI: https://doi.org/10.1007/978-3-642-28604-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)