Adapting Morphology for Arabic Information Retrieval*

  • Kareem Darwish
  • Douglas W. Oard
Part of the Text, Speech and Language Technology book series (TLTB, volume 38)

Abstract

This chapter presents an adaptation of existing techniques in Arabic morphology that uses corpus statistics to make them suitable for Information Retrieval (IR). The adaptation resulted in the development of Sebawai, a shallow Arabic morphological analyzer, and Al-Stem, an Arabic light stemmer. Both were used to produce Arabic index terms for Arabic IR experimentation. Sebawai generates the possible roots and stems of a given Arabic word, along with probability estimates of deriving the word from each of the possible roots. The probability estimates were used as a guide to determine which prefixes and suffixes should be used to build the light stemmer Al-Stem. The roots and stems generated by Sebawai and the stems from Al-Stem are evaluated as index terms in an information retrieval application, and the results are compared.
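
To make the idea of light stemming concrete, the following is a minimal sketch in Python of affix stripping in general. It is not the Al-Stem implementation: the prefix and suffix lists, the minimum stem length, and the function name light_stem are all assumptions chosen for illustration, and the abstract does not state which affixes Al-Stem actually removes.

```python
# Illustrative affix lists. These are assumptions for this sketch,
# NOT the lists that Al-Stem uses.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة", "ي"]
MIN_STEM_LEN = 2  # hypothetical guard against over-stripping short words


def light_stem(word, prefixes=PREFIXES, suffixes=SUFFIXES, min_len=MIN_STEM_LEN):
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المعلمون", "في"]:
        print(w, "->", light_stem(w))
```

In a scheme like the one the abstract describes, the contents of the two lists would be chosen with the help of the analyzer's probability estimates rather than fixed by hand as they are here.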

Keywords

Index Term; Arabic Text; Arabic Word; Common Prefix; Linguistic Data Consortium
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer 2007

Authors and Affiliations

  • Kareem Darwish (1)
  • Douglas W. Oard (2)
  1. IBM Technology Development Center, El-Ahram, Giza, Egypt
  2. College of Information Studies & UMIACS, University of Maryland, College Park
