Light Stemming for Arabic Information Retrieval

Chapter in the book Arabic Computational Morphology

Part of the book series: Text, Speech and Language Technology (TLTB, volume 38)

Abstract

Computational morphology is an urgent problem for Arabic natural language processing, because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for Arabic, and assessed their effectiveness for information retrieval using standard TREC data. We have also compared light stemming with several stemmers based on morphological analysis. The light stemmer, light10, outperformed the other approaches. It has been included in the Lemur toolkit, and is becoming widely used in Arabic information retrieval.
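
The idea of light stemming can be illustrated with a minimal sketch: strip a small set of frequent prefixes and suffixes so that surface variants of a word map to the same index term, without attempting a correct morphological analysis. The affix lists, the minimum-length threshold, and the function name `light_stem` below are assumptions chosen for illustration; they are not given in the abstract and should not be read as the exact light10 rules.

```python
# Illustrative light stemmer: affix stripping only, no root or pattern analysis.
# ASSUMPTION: these affix lists and the length threshold are examples for this
# sketch, not the rule set of the light10 stemmer described in the chapter.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]  # article + conjunction/preposition forms
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 2  # never strip an affix if fewer characters than this would remain


def light_stem(word: str) -> str:
    """Remove at most one prefix and at most one suffix from an Arabic word."""
    # Try longer prefixes first so that a three-letter prefix wins over "ال".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    # Same policy for suffixes: longest match first, strip once.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["الكتاب", "والكتاب", "كتاب", "المعلمون"]:
        print(w, "->", light_stem(w))
```

In a retrieval setting, the same function would be applied to both document terms and query terms at indexing and search time, so that, for example, a query word with the definite article matches documents containing the bare form.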

Copyright information

© 2007 Springer

About this chapter

Cite this chapter

Larkey, L.S., Ballesteros, L., Connell, M.E. (2007). Light Stemming for Arabic Information Retrieval. In: Soudi, A., van den Bosch, A., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_12
