Abstract
Many approaches have been proposed for matching inflected word forms in information retrieval, and stemming is a common one. Among stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. However, infixes are common in irregular and informal inflections in morphologically complex texts, so a stemmer for such texts must also find infixes. In this paper we propose a method for finding affixes in any position of a word. The method finds statistical inflectional rules based on the minimum edit distance table of word pairs and the likelihoods of the rules in a language. These rules are used to stem words statistically and can be applied in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
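As an illustration of the idea of reading edits off a minimum edit distance table, the following is a minimal sketch and not the authors' implementation. It builds a standard Levenshtein table for a word pair and backtraces one optimal path to list the edits, including edits in the middle of a word. The function names `edit_table` and `edit_ops` and the tuple format of the returned edits are assumptions made for this example; the abstract does not specify how rules are represented or how their likelihoods are estimated.

```python
def edit_table(source, target):
    """Return the Levenshtein table: table[i][j] is the minimum number of
    single-character edits turning source[:i] into target[:j]."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        table[0][j] = j  # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table


def edit_ops(source, target):
    """Backtrace one optimal path through the table and return the edits as
    (operation, position in source, characters) tuples (hypothetical format)."""
    table = edit_table(source, target)
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and table[i][j] == table[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, no edit recorded
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            ops.append(("sub", i - 1, source[i - 1] + ">" + target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            ops.append(("del", i - 1, source[i - 1]))
            i -= 1
        else:
            ops.append(("ins", i, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    print(edit_ops("sang", "sing"))
    print(edit_ops("walk", "walked"))
```

Because the backtrace records the position of each edit, a change inside the word (as in the pair "sang" and "sing") is captured in the same way as a suffix change (as in "walk" and "walked"). Counting how often each edit recurs across many word pairs would be one possible way to approximate rule likelihoods, but that step is an assumption here and is not shown.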
Acknowledgement
The author would like to thank Razieh Rahimi and the anonymous reviewers for their helpful comments and feedback.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Dadashkarimi, J., Nasr Esfahani, H., Faili, H., Shakery, A. (2016). SS4MCT: A Statistical Stemmer for Morphologically Complex Texts. In: Fuhr, N., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_16
DOI: https://doi.org/10.1007/978-3-319-44564-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44563-2
Online ISBN: 978-3-319-44564-9
eBook Packages: Computer Science, Computer Science (R0)