SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

Dadashkarimi, Javid; Nasr Esfahani, Hossein; Faili, Heshaam; Shakery, Azadeh

doi:10.1007/978-3-319-44564-9_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9822)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Information retrieval has seen many attempts to resolve inflection-matching problems, and stemming is a common approach to this end. Among stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. However, infixes are common in irregular and informal inflections in morphologically complex texts, so a stemmer must also be able to find infixes. In this paper we propose a method for finding affixes in different positions of a word: it derives statistical inflectional rules from the minimum edit distance table of word pairs and from the likelihoods of the rules in a language. These rules are used to stem words statistically and can be applied in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
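To illustrate how a minimum edit distance table of a word pair can expose affixes in any position, the following is a minimal Python sketch, not the authors' implementation. It builds the standard Levenshtein table, backtraces it to list the edit operations that separate two word forms, and counts those operations over a set of word pairs. The function names, the `(operation, characters)` rule format, and the relative-frequency estimate are assumptions made for illustration; the abstract does not specify how SS4MCT represents rules or estimates their likelihoods.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def edit_distance_table(src: str, tgt: str) -> List[List[int]]:
    """Levenshtein table: table[i][j] is the distance between src[:i] and tgt[:j]."""
    rows, cols = len(src) + 1, len(tgt) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete every character of src[:i]
    for j in range(cols):
        table[0][j] = j  # insert every character of tgt[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table


def extract_edits(src: str, tgt: str) -> List[Tuple[str, int, str]]:
    """Backtrace the table; return non-match edits as (op, position in src, chars)."""
    table = edit_distance_table(src, tgt)
    i, j = len(src), len(tgt)
    edits: List[Tuple[str, int, str]] = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] \
                and table[i][j] == table[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            edits.append(("sub", i - 1, src[i - 1] + ">" + tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            edits.append(("del", i - 1, src[i - 1]))
            i -= 1
        else:
            edits.append(("ins", i, tgt[j - 1]))
            j -= 1
    edits.reverse()
    return edits


def rule_relative_frequencies(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Hypothetical likelihood estimate: relative frequency of each (op, chars) rule."""
    counts: Counter = Counter()
    for src, tgt in pairs:
        for op, _pos, chars in extract_edits(src, tgt):
            counts[(op, chars)] += 1
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()} if total else {}
```

Because the backtrace records the position of every edit, the same procedure captures suffix changes (`walked` to `walk` yields two deletions at the end) and infix changes (`sang` to `sing` yields a substitution at position 1), which prefix and suffix matching alone would miss.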

Acknowledgement

The authors would like to thank Razieh Rahimi and the anonymous reviewers for their helpful comments and feedback.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Javid Dadashkarimi, Hossein Nasr Esfahani, Heshaam Faili & Azadeh Shakery

Corresponding author

Correspondence to Azadeh Shakery.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Universidade de Évora, Évora, Portugal
Paulo Quaresma
University of Évora, Évora, Portugal
Teresa Gonçalves
Aalborg University Copenhagen, Copenhagen, Denmark
Birger Larsen
University of Stavanger, Stavanger, Norway
Krisztian Balog
University of Glasgow, Glasgow, United Kingdom
Craig Macdonald
University of Padua, Padua, Italy
Linda Cappellato
University of Padua, Padua, Italy
Nicola Ferro

About this paper

Cite this paper

Dadashkarimi, J., Nasr Esfahani, H., Faili, H., Shakery, A. (2016). SS4MCT: A Statistical Stemmer for Morphologically Complex Texts. In: Fuhr, N., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_16

DOI: https://doi.org/10.1007/978-3-319-44564-9_16
Published: 23 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44563-2
Online ISBN: 978-3-319-44564-9
eBook Packages: Computer Science (R0)
