Light Stemming for Arabic Information Retrieval

Larkey, Leah S.; Ballesteros, Lisa; Connell, Margaret E.

doi:10.1007/978-1-4020-6046-5_12

Part of the book series: Text, Speech and Language Technology (TLTB, volume 38)

Abstract

Computational morphology is an urgent problem for Arabic natural language processing because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for Arabic and assessed their effectiveness for information retrieval using standard TREC data. We also compared light stemming with several stemmers based on morphological analysis. The light stemmer light10 outperformed the other approaches. It has been included in the Lemur toolkit and is becoming widely used in Arabic information retrieval.
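As a rough illustration of the general idea of light stemming (stripping a small set of frequent prefixes and suffixes with no attempt at a correct morphological analysis), here is a minimal Python sketch. The affix lists, the minimum-length threshold, and the function name `light_stem` are assumptions chosen for illustration. They are not taken from the abstract and should not be read as the light10 implementation shipped with Lemur.

```python
# Illustrative affix lists (assumed for this sketch, not from the abstract).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix, then suffixes, keeping >= min_len characters."""
    # Try longer prefixes first so "وال" wins over "ال".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break

    # Remove suffixes repeatedly until none applies or the word gets too short.
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


if __name__ == "__main__":
    for w in ["الكتاب", "والكتاب", "كتابات"]:
        print(w, light_stem(w))
```

Under these assumed lists, surface forms such as الكتاب and كتابات are conflated to the same index term, which is all a retrieval system needs. Nothing in the sketch identifies the root or the pattern of the word.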

Author information

Authors and Affiliations

Chiliad Publishing, 44 Belchertown Rd, Amherst, MA 01002
Leah S. Larkey
Computer Science Dept, Mt. Holyoke College, South Hadley, MA 01075
Lisa Ballesteros
Dept. of Computer Science, Univ. of Massachusetts, Amherst, MA 01003
Margaret E. Connell

Editor information

Editors and Affiliations

Ecole Nationale de l’Industrie Minérale, Rabat, Morocco
Abdelhadi Soudi
Tilburg University, The Netherlands
Antal van den Bosch
Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
Günter Neumann

About this chapter

Cite this chapter

Larkey, L.S., Ballesteros, L., Connell, M.E. (2007). Light Stemming for Arabic Information Retrieval. In: Soudi, A., Bosch, A.v., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_12

DOI: https://doi.org/10.1007/978-1-4020-6046-5_12
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-6045-8
Online ISBN: 978-1-4020-6046-5
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
