SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9822)

Abstract

There have been multiple attempts to resolve inflection matching problems in information retrieval, and stemming is a common approach to this end. Among the many stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. However, infixes are common in irregular and informal inflections in morphologically complex texts, so a stemmer for such texts must also find infixes. In this paper we propose a method for finding affixes in different positions of a word. The method derives statistical inflectional rules from the minimum edit distance table of word pairs and estimates the likelihoods of these rules in a language. The rules are then used to stem words statistically and can be applied in different text mining tasks. Experimental results on the CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP.
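
The idea of reading inflectional rules off a minimum edit distance table can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`edit_table`, `edit_ops`, `rule_likelihoods`), the English word pairs, and the use of plain relative frequency as the rule likelihood are all assumptions made for illustration. The sketch fills a standard Levenshtein table, backtraces one optimal path to recover the edit operations (which may fall in any position of the word, including the middle), and counts how often each operation occurs across word pairs.

```python
from collections import Counter


def edit_table(src, tgt):
    """Fill the standard Levenshtein dynamic-programming table."""
    rows, cols = len(src) + 1, len(tgt) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete every character of src[:i]
    for j in range(cols):
        table[0][j] = j  # insert every character of tgt[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table


def edit_ops(src, tgt):
    """Backtrace one optimal path; return (op, position_in_src, chars) tuples."""
    table = edit_table(src, tgt)
    i, j = len(src), len(tgt)
    ops = []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
                and table[i][j] == table[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, no rule emitted
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            ops.append(("sub", i - 1, src[i - 1] + ">" + tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            ops.append(("del", i - 1, src[i - 1]))
            i -= 1
        else:
            ops.append(("ins", i, tgt[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def rule_likelihoods(pairs):
    """Relative frequency of each (op, chars) rule over all word pairs.

    Assumption: plain relative frequency stands in for the paper's
    likelihood estimate, which the abstract does not specify.
    """
    counts = Counter()
    for src, tgt in pairs:
        for op, _pos, chars in edit_ops(src, tgt):
            counts[(op, chars)] += 1
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()} if total else {}


# Hypothetical word pairs chosen only to show an infix change and a suffix.
pairs = [("sing", "sang"), ("ring", "rang"), ("walk", "walks")]
probs = rule_likelihoods(pairs)
```

Here `edit_ops("sing", "sang")` yields a single substitution in the middle of the word, which prefix and suffix matching alone would not isolate. A real stemmer would also need a way to select candidate word pairs and to decide which rules are reliable enough to apply; the sketch leaves both open.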

Notes

  1. https://translate.google.com/#en/fa/

  2. http://ece.ut.ac.ir/dbrg/bijankhan/

  3. http://www.faraazin.ir

Acknowledgement

The author would like to thank Razieh Rahimi and the anonymous reviewers for their helpful comments and feedback.

Author information

Corresponding author

Correspondence to Azadeh Shakery.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Dadashkarimi, J., Nasr Esfahani, H., Faili, H., Shakery, A. (2016). SS4MCT: A Statistical Stemmer for Morphologically Complex Texts. In: Fuhr, N., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_16

  • DOI: https://doi.org/10.1007/978-3-319-44564-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44563-2

  • Online ISBN: 978-3-319-44564-9

  • eBook Packages: Computer Science, Computer Science (R0)